Data mining is the process of extracting useful information from raw data so that we can act on the trends the data reveals.
Statistics plays a significant role here by providing the fundamentals for understanding the key features of large datasets. In this Answer, we will cover the various stages of data mining and explore the crucial role of statistics in each stage, with code illustrations to understand the different concepts.
Here are the stages of data mining:
Data exploration
Data preprocessing
Sampling techniques
Hypothesis testing
Model evaluation
Feature selection
In the data exploration phase, we analyze the dataset's trends, outliers, and overall characteristics. Here, statistics helps us understand the correlations and dependencies among the features of the dataset, which we can visualize with scatter plots, histograms, box plots, and heat maps.
import numpy as np
import matplotlib.pyplot as plt

# Generating the random data
data = np.random.normal(loc=0, scale=1, size=1000)

# Summary statistics
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)

print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)

# Plotting the histogram
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Random Data')
plt.savefig("./output/Plot.png")
plt.show()
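The histogram above summarizes a single variable; the correlations and heat maps mentioned earlier can be sketched as follows. This is a minimal illustration on a synthetic dataset, and the feature names and data are assumptions made up for this example.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic dataset with three features (hypothetical data for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    'feature_a': x,
    'feature_b': x * 0.8 + rng.normal(scale=0.5, size=200),  # correlated with feature_a
    'feature_c': rng.normal(size=200)                        # independent noise
})

# Pairwise Pearson correlations between the features
corr = df.corr()
print(corr)

# Visualizing the correlation matrix as a heat map
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='Correlation')
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.title('Correlation Heat Map')
plt.show()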
Data preprocessing involves handling missing values in the dataset, scaling and normalizing the data, encoding categorical variables, and reducing dimensionality while preserving the essential features of the dataset. All of these statistical methods help prepare the data for mining.
import pandas as pd
import numpy as np

# A dataset with missing values in it
data = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                     'B': [6, np.nan, 8, 9, 10]})

# Dropping the rows with missing values
data_cleaned = data.dropna()

# Filling missing values with the mean of each column
data_imputed = data.fillna(data.mean())

print("Cleaned Data:")
print(data_cleaned)
print("Imputed Data:")
print(data_imputed)
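The example above covers missing values; scaling and categorical encoding, also mentioned above, can be sketched along the following lines. The DataFrame and its columns are assumptions made up for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset with one numeric and one categorical column
data = pd.DataFrame({'age': [25, 32, 47, 51],
                     'group': ['A', 'B', 'A', 'C']})

# Scaling the numeric column to zero mean and unit variance
scaler = StandardScaler()
data['age_scaled'] = scaler.fit_transform(data[['age']])

# One-hot encoding the categorical column
data_encoded = pd.get_dummies(data, columns=['group'])

print(data_encoded)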
Sampling techniques and sampling distributions are an integral part of statistics. In a sampling distribution, different samples are drawn from the same population, and the sample mean and standard deviation are calculated for each. This technique helps us estimate population parameters such as the mean and standard deviation, and it also helps us quantify the uncertainty associated with the sample data.
import random

# Original dataset, split into two strata (e.g., two subgroups of the population)
dataset = list(range(100))
stratum_1 = dataset[:50]
stratum_2 = dataset[50:]

# Simple random sampling: 10 items drawn uniformly from the whole dataset
sample_random = random.sample(dataset, 10)

# Stratified sampling: draw proportionally from each stratum (5 from each)
sample_stratified = random.sample(stratum_1, 5) + random.sample(stratum_2, 5)

print("Random Sample:", sample_random)
print("Stratified Sample:", sample_stratified)
In the hypothesis testing stage of data mining, we use statistical methods to make inferences about a population with the help of a sample dataset. This involves collecting data, formulating null and alternative hypotheses, and performing statistical tests to evaluate the evidence against the null hypothesis.
from scipy.stats import ttest_ind
import numpy as np

# Sample data from two groups
group1 = np.random.normal(loc=0, scale=1, size=100)
group2 = np.random.normal(loc=1, scale=1, size=100)

# Independent t-test
t_stat, p_value = ttest_ind(group1, group2)

print("T-statistic:", t_stat)
print("P-value:", p_value)
Model evaluation involves assessing a model on unseen data, or measuring the performance of different models and deciding which one is best for a specific dataset. We can assess a model's performance with metrics such as accuracy, the F1-score, and the mean squared error. Using these statistical metrics, we can also detect underfitting or overfitting in the model.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# Generate sample data
X = np.random.normal(loc=0, scale=1, size=(100, 2))
y = np.random.choice([0, 1], size=100)

# Logistic regression model
model = LogisticRegression()

# Cross-validation
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = np.mean(scores)

print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", mean_accuracy)
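Cross-validation above reports accuracy; the F1-score mentioned in the text can be computed in a similar spirit. This is a minimal sketch, and the train/test split and the synthetic labels below are assumptions made for illustration.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import numpy as np

# Synthetic data (hypothetical, for illustration only)
X = np.random.normal(loc=0, scale=1, size=(200, 2))
y = np.random.choice([0, 1], size=200)

# Hold out a test set to evaluate the model on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# F1-score: the harmonic mean of precision and recall
print("F1-score:", f1_score(y_test, y_pred))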
In feature selection, we keep the features that carry the most information. One way to do this is to compute the correlation between each feature and the target: features that have a strong correlation with the target are selected. Another method is principal component analysis (PCA), which transforms the original features into a smaller set of uncorrelated components.
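Here is a minimal sketch of both approaches on a synthetic dataset; the feature names, the data, and the correlation threshold of 0.5 are assumptions made for illustration.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic dataset: two informative features and one noise feature (hypothetical)
rng = np.random.default_rng(42)
n = 200
target = rng.normal(size=n)
df = pd.DataFrame({
    'f1': target + rng.normal(scale=0.3, size=n),   # strongly related to the target
    'f2': target + rng.normal(scale=0.8, size=n),   # weakly related to the target
    'f3': rng.normal(size=n)                        # pure noise
})

# Correlation-based selection: keep features whose absolute correlation
# with the target exceeds an illustrative threshold of 0.5
correlations = df.apply(lambda col: np.corrcoef(col, target)[0, 1])
selected = correlations[correlations.abs() > 0.5].index.tolist()
print("Correlation with target:")
print(correlations)
print("Selected features:", selected)

# PCA: project the original features onto two uncorrelated components
pca = PCA(n_components=2)
components = pca.fit_transform(df)
print("Explained variance ratio:", pca.explained_variance_ratio_)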
In this Answer, we have seen the role that statistics plays at every stage of data mining. Now, solve the following quiz to test your understanding of the relationship between statistics and data mining.
What is the role of statistics in data mining?
It helps in data preprocessing and visualizing data trends.
It helps in selecting the suitable features from the large dataset.
It helps in identifying the relationship among the features of the dataset.
All of the above.