Normalization rescales all values so that they lie between zero and one. It is useful when the distribution of the data is unknown or is not Gaussian (i.e., does not follow a bell curve). Normalization helps in models such as k-nearest neighbors and artificial neural networks, or anywhere the features have varying scales or precision (this will become clearer in the example below). For example, consider a dataset with two features: salary and years worked. The values of these features differ greatly; salary might be a thousand times larger than years worked. Normalization brings all of the data into the same range.
Equation:

x_norm = (x - x_min) / (x_max - x_min)
Not every dataset needs normalization in machine learning. It is only required when features have varying ranges.
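As a concrete illustration of the equation above, here is a minimal sketch that applies min-max normalization by hand with NumPy. The salary and years-worked values are made up for illustration:

import numpy as np

# Hypothetical data: each row is (salary, years worked)
data = np.array([
    [30000.0, 2.0],
    [60000.0, 5.0],
    [90000.0, 10.0],
])

# Min-max normalization per column: (x - min) / (max - min)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
normalized = (data - col_min) / (col_max - col_min)

print(normalized)  # every value now lies in [0, 1]

Note how the salary column's much larger magnitude no longer dominates: after normalization, both features share the same zero-to-one range.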
Standardizing a vector means rescaling it so that its values have a mean of zero and a standard deviation of one. Standardization is important when data variability is in question: many variables are measured in different units and on different scales. For instance, a variable ranging from 0 to 100 will outweigh the effect of a variable with values between 0 and 1. This can bias a model, which is why it is important to transform data to comparable scales. Standardization assumes that the data is Gaussian (follows a bell curve) in nature, and it is beneficial when your algorithm assumes a Gaussian distribution, such as linear regression, logistic regression, or linear discriminant analysis.
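For intuition, here is a minimal sketch that standardizes each column by hand with NumPy using z = (x - mean) / standard deviation; the toy values are made up for illustration:

import numpy as np

# Toy data: two features on very different scales
data = np.array([
    [0.2, 10.0],
    [0.5, 50.0],
    [0.8, 90.0],
])

# Standardization per column: z = (x - mean) / std
mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation, as scikit-learn uses
standardized = (data - mean) / std

print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # 1 for each column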
The StandardScaler utility class from scikit-learn can be used to remove the mean and scale the data to unit variance:
from sklearn.preprocessing import StandardScaler

data = [[-1, -2], [0.5, 2], [0, 5], [12, 18]]

# Learn each column's mean and standard deviation
scaler = StandardScaler()
scaler.fit(data)

# Rescale the data to zero mean and unit variance
print(scaler.transform(data))
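As a usage note, the two steps can also be combined in a single fit_transform call, and the statistics learned during fitting are exposed on the scaler afterwards through scaler.mean_ (per-column means) and scaler.var_ (per-column variances).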
The MinMaxScaler class from scikit-learn can be applied to a dataset, as shown below:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, -2], [0.5, 2], [0, 5], [12, 18]]

# Learn each column's minimum and maximum
scaler = MinMaxScaler()
scaler.fit(data)

# Rescale each column to the [0, 1] range
print(scaler.transform(data))
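By default, MinMaxScaler maps each column to the [0, 1] range; a different target range can be requested through its feature_range parameter, e.g., MinMaxScaler(feature_range=(-1, 1)).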