Data anomalies are unusual or unexpected values in a dataset that don’t fit the normal pattern or behavior.
Key takeaways:
Data should be accurate, valid, complete, consistent, and uniform for reliable analysis.
Missing values can be filled using methods like mean, median, or advanced techniques like KNNImputer in Python.
Outliers are extreme values that distort results and can be identified using scatterplots or boxplots.
Duplicates in the dataset should be removed to avoid unnecessary repetition.
Irrelevant data can be dropped to focus on relevant information for the analysis.
Non-uniform data must be standardized to prevent inconsistencies during analysis.
Cleaning data is essential for ensuring accurate and meaningful results.
Python libraries like pandas and scikit-learn make data cleaning and anomaly detection easier.
Large amounts of data are being generated and collected every day. Whether it be data from smart devices, social media, or the internet, we are surrounded by data. But not all data is perfect. Errors can occur during the generation or collection of data.
The illustration below shows some prominent sources of data:
When analyzing data, we need to validate its reliability. We can break down the analysis of the quality of data into five categories:
Accuracy: Recorded data is within the acceptable range.
Validity: Data meets the required standards and is fit for its intended use.
Completeness: Data is complete, and portions of it are not missing.
Consistency: Data within a single dataset is internally consistent and does not contradict itself.
Uniformity: Measurement metrics for generating data are uniform and consistent.
Any data that does not meet one or more of these categories may be considered an anomaly.
An anomaly in data is something that looks unusual or doesn’t fit with the rest of the data. It could be a mistake or something unexpected.
Anomalies and their detection are widely researched, with applications ranging from identifying health irregularities in the medical field to spotting unusual patterns in financial transactions or cybersecurity.
We’ll now discuss some of the anomalies that might exist within a dataset:
Probably the easiest anomaly to detect, missing values are values that are absent from the dataset. They may never have been generated, or something may have gone wrong while collecting and recording them. Either way, some information is unavailable and must be estimated from related fields. In Python, missing values are represented by NA, meaning "Not Available," or NaN, meaning "Not a Number." NaN is used when a numerical field holds a non-numerical placeholder.
The illustration below gives an example of missing data:
Missing values can be filled by taking the mean, median, or mode of the data (depending on what suits best). They can also be filled based on data from other columns.
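As a minimal sketch (the column names and values below are made up), pandas can flag missing entries and fill them with a summary statistic:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing weight (NaN)
df = pd.DataFrame({
    "student": ["Bob", "Alice", "Jim", "Jill", "Jack"],
    "weight": [60, 55, np.nan, 44, 50],
})

print(df.isna().sum())  # count missing values per column

# Fill the gap with the column mean; median() or mode()[0] work similarly
df["weight"] = df["weight"].fillna(df["weight"].mean())
print(df)
```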
The scikit-learn library in Python provides a class called KNNImputer, which estimates missing values from the rows that are most similar on the other columns. For more details, have a look at "What is KNNImputer in scikit-learn?"
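The sketch below (with hypothetical column names) shows how KNNImputer might be used; each missing value is estimated from the nearest rows based on the remaining columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric dataset with missing entries
df = pd.DataFrame({
    "height": [170, 165, 180, np.nan, 175],
    "weight": [60, 55, np.nan, 44, 50],
})

# Each missing value is filled using the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```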
Outliers are data values that are not within the acceptable range. They can affect the mean of the dataset drastically. Imagine the following scenario:
The weights of five students are recorded manually. The table below shows the actual weights of students:
| Student | Weight (kg) |
| ------- | ----------- |
| Bob     | 60          |
| Alice   | 55          |
| Jim     | 62          |
| Jill    | 44          |
| Jack    | 50          |
The average weight of all students is 54.2 kg.
However, while entering the data into the records, the gym instructor accidentally entered Jack's weight as 500 kg. This is an outlier, since a student is unlikely to weigh that much. The average weight of all students also jumps to 144.2 kg.
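To make the effect concrete, here is a quick check of both averages using the values from the table above:

```python
weights = [60, 55, 62, 44, 50]
print(sum(weights) / len(weights))  # 54.2

weights_with_typo = [60, 55, 62, 44, 500]  # Jack's weight mistyped as 500
print(sum(weights_with_typo) / len(weights_with_typo))  # 144.2
```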
Outliers can be observed using a scatterplot or boxplot. Illustrations below show both these plots:
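As a rough sketch (assuming matplotlib is installed), both plots can be drawn like this:

```python
import matplotlib.pyplot as plt

weights = [60, 55, 62, 44, 500]  # includes the mistyped 500 kg entry

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(range(len(weights)), weights)  # the 500 kg point sits far above the rest
ax1.set_title("Scatterplot")
ax2.boxplot(weights)                       # the outlier appears beyond the whiskers
ax2.set_title("Boxplot")
plt.show()
```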
Outliers are difficult to adjust. The simplest fix is to remove the entire row that contains them. However, if the row is important, an outlier can instead be treated as a missing value and imputed accordingly, as sketched below.
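One possible approach, sketched here with hypothetical bounds on plausible student weights, is to flag the outliers and then either drop the affected rows or convert the values to NaN so they can be handled like missing data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "student": ["Bob", "Alice", "Jim", "Jill", "Jack"],
    "weight": [60, 55, 62, 44, 500],  # 500 kg is the mistyped value
})

# Flag weights outside a plausible range for students (hypothetical bounds)
is_outlier = ~df["weight"].between(30, 150)

# Option 1: drop the rows that contain outliers
dropped = df[~is_outlier]

# Option 2: treat the outliers as missing values and impute them later
treated = df.assign(weight=df["weight"].mask(is_outlier, np.nan))

print(dropped)
print(treated)
```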
Duplicates are records that appear more than once in the dataset. They might occur when the same data has been entered multiple times. Often, we also need to merge data from multiple sources before it can be processed, and merges are another point where duplicates can arise. Duplicates can be removed by dropping the extra rows.
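As a minimal sketch with made-up values, pandas can spot and drop duplicate rows:

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Bob", "Alice", "Bob"],
    "weight": [60, 55, 60],
})

print(df.duplicated())     # marks the repeated row as True
df = df.drop_duplicates()  # keeps the first occurrence, removes the extra row
print(df)
```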
Sometimes, the data we are analyzing contains columns we do not need for our particular research question. This is common in secondary research, where we analyze data that has already been collected and may therefore include information irrelevant to our work. We can get rid of such information by dropping the columns we do not require.
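For example (column names are hypothetical), an irrelevant column can be dropped with pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Bob", "Alice"],
    "weight": [60, 55],
    "favourite_colour": ["blue", "green"],  # not needed for a weight analysis
})

df = df.drop(columns=["favourite_colour"])
print(df.columns.tolist())
```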
Data entry through drop-downs and checklists offers a safer way to input data. However, it might not cover all possible values. Inconsistencies can occur when users are allowed to enter values manually. This can introduce errors such as variable spellings of the same word, different ways of representing the same data, or qualitative information that cannot be scaled.
The illustration below shows examples of non-uniform data:
First, non-uniform data needs to be identified. This can be done by scanning the dataset manually or using relevant functions.
The pandas library in Python provides the value_counts function, which lists the unique values in a particular column of the dataset along with the number of times each occurs.
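For example (the values below are made up), value_counts quickly reveals variant spellings that need standardizing:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Lahore", "lahore", "LHR", "Karachi", "karachi"]})

# Shows each unique value and how often it occurs
print(df["city"].value_counts())
```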
Non-uniform data can then be standardized by mapping the variants to a single canonical value or encoding them as values that carry quantifiable information.
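One way to do this, sketched below with an illustrative mapping, is to replace each variant with a canonical value and, if needed, encode the cleaned categories as numeric codes:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Lahore", "lahore", "LHR", "Karachi", "karachi"]})

# Map every variant spelling to one canonical value
canonical = {"lahore": "Lahore", "lhr": "Lahore", "karachi": "Karachi"}
df["city"] = df["city"].str.lower().map(canonical)

# Optionally encode the cleaned categories as integer codes
df["city_code"] = df["city"].astype("category").cat.codes
print(df)
```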
Data generation and collection are prone to errors. These errors need to be identified and fixed before the data can be used further, and they may range from missing values and outliers to duplicates, irrelevant columns, and non-uniform data.