What is feature engineering?

What do you think a data scientist spends most of their time on? Training the model? Collecting data? Performing predictions? Refining algorithms?

No. A data scientist actually spends most of their time cleaning and organizing data. Feature engineering is the process of using domain knowledge and statistics to extract useful features from raw data, and data cleaning is a large part of it.

Null values

Data can be cleaned in many ways. A common problem in raw data is null values, which occur when the data for a particular field or column is missing. If the column is numeric, one solution is to fill the empty fields with the median of the values that are present.
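The median fill described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive implementation: the function name `fill_missing_with_median` and the sample `ages` column are made up for this example, and missing values are assumed to be represented as `None`.

```python
from statistics import median

def fill_missing_with_median(values):
    """Return a copy of values with None entries replaced by the median of the rest."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing to compute a median from, so leave the column unchanged.
        return list(values)
    fill = median(present)
    return [fill if v is None else v for v in values]

# Hypothetical numeric column with two missing entries.
ages = [25, None, 31, 40, None, 28]
filled_ages = fill_missing_with_median(ages)
print(filled_ages)
```

The median is computed only from the values that are present, so the missing entries do not influence the fill value.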

Outliers

Data can also contain outliers. For example, suppose that in data about “heights of people (m)”, you see the value 19.7. Since it is very unlikely that anyone is 19.7 m tall, we can conclude that this value is inaccurate.

Outliers matter in the data cleaning process because they can reduce the accuracy of the data, skew the results, and distort visualizations. One of the key responsibilities of a data scientist is to remove these outliers. A few methods can be used to remove them:

  1. Domain Knowledge: Sometimes, you can look at the collected data and, from your own knowledge, conclude whether a specific value is correct. In our example above, domain knowledge and common sense tell us that 19.7 m is an inaccurate height.

  2. Percentile: A percentile is a measure used in statistics that indicates the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which 20% of the observations may be found. Equivalently, 80% of the observations are found above the 20th percentile.

Calculating a percentile of the data values can help with cleaning. For example, one can clean data using the 95th percentile. Continuing the example of heights within a population, suppose you calculate the 95th percentile of heights and get 9.2 m. This tells us that 95% of the people have a height of less than 9.2 m. Here, we can use our domain knowledge to see that it is highly unlikely that anyone would have a height greater than 9.2 m, so you can drop any height above 9.2 m as an outlier.
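Both methods in the list above can be sketched in Python with the standard library. This is an illustration under stated assumptions rather than a definitive implementation: the 3.0 m bound, the function names, and the sample heights are all invented for the example, and the percentile uses the simple nearest-rank definition (other definitions interpolate between values).

```python
import math

# Assumed upper bound for a plausible human height in metres (illustrative only).
MAX_PLAUSIBLE_HEIGHT_M = 3.0

def drop_implausible(values, upper=MAX_PLAUSIBLE_HEIGHT_M):
    """Method 1: keep only positive values at or below a domain-knowledge bound."""
    return [v for v in values if 0 < v <= upper]

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

def drop_above_percentile(values, p=95):
    """Method 2: drop every value above the p-th percentile."""
    cutoff = percentile(values, p)
    return [v for v in values if v <= cutoff]

# Hypothetical sample: 19 plausible heights and one bad entry (19.7 m).
heights_m = [1.52, 1.60, 1.65, 1.68, 1.70, 1.72, 1.75, 1.78, 1.81, 1.85,
             1.58, 1.62, 1.66, 1.69, 1.71, 1.73, 1.76, 1.79, 1.83, 19.7]

by_domain = drop_implausible(heights_m)
cutoff = percentile(heights_m, 95)
by_percentile = drop_above_percentile(heights_m, 95)
print(cutoff)
print(19.7 in by_domain, 19.7 in by_percentile)
```

On this sample, both methods remove the 19.7 m entry. A percentile cut always discards roughly the top 5% of values, whether or not they are wrong, so it is worth checking the cutoff against domain knowledge before dropping anything.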

Copyright ©2025 Educative, Inc. All rights reserved