Comment spam detection with machine learning

Comments let users give feedback on social media platforms. However, comments don’t always provide effective feedback: users often exploit these platforms to promote their businesses or to redirect traffic to their own websites or social media pages. Therefore, these platforms need to filter out spam comments. Detecting spam comments is a text classification task.

In this Answer, we’ll analyze how platforms use text classification algorithms to detect and filter out spam comments. The dataset used to train the machine learning model consists of example comments, each labeled as spam or not spam.

Defining the dataset

The dataset used to classify comments contains a mix of spam and non-spam comments. For this Answer, we’ll use the dataset stored as a CSV file titled Spam_comment.csv. Once we’ve trained our model on this dataset, the text classification process will be able to identify the nature of a comment. If a comment contains links or mentions of other platforms, it will be marked as spam. If it contains plain text with no redirection links or promotional content, it represents legitimate content by the user and will not be marked as spam.
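
Before training, it helps to take a quick look at the data. The sketch below is only illustrative; it assumes that Spam_comment.csv contains a Comment_content column with the comment text and an isSpam column with 0/1 labels, which are the columns used later in this Answer.

import pandas as pd

# Quick sanity check of the dataset (assumes the Comment_content and
# isSpam columns used later in this Answer).
data = pd.read_csv("Spam_comment.csv")
print(data.head())                    # preview a few comments
print(data["isSpam"].value_counts())  # balance of spam vs. non-spam labels
Inspecting the dataset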

Spam detection process

The detection process contains a series of steps: installing libraries and dependencies, refining the dataset, defining the model, training the model, and using the trained model to determine the nature of a comment. Here is the step-by-step spam comment detection process:

  1. Installing the dependencies:

To perform the spam detection process, certain dependencies are required. In Python, we use pip3 to install them. For this specific process, we require numpy, pandas, and scikit-learn. To install the dependencies, we use the following commands:

pip3 install numpy
pip3 install pandas
pip3 install scikit-learn
Installing the dependencies
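
Alternatively, all three packages can be installed with a single command. Note that the package is published on PyPI under the name scikit-learn:

pip3 install numpy pandas scikit-learn
Installing all dependencies at once
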
  2. Importing:

The next step is to import the installed dependencies and models for the spam detection process. To import them, we add the following statements to our code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

In the code above:

  • Line 1: Import the pandas library to read the dataset from the CSV file that is used for training the model.

  • Line 2: Import the numpy library to use arrays in code and run functions on the arrays.

  • Line 3: Import the train_test_split function from sklearn to split the dataset into two sets: a training set and a testing set.

  • Line 4: Import CountVectorizer from sklearn’s text feature extraction module to divide the content into tokens and convert them into a format suitable for training.

  • Line 5: Import the Bernoulli Naive Bayes model to perform classification because the labels follow a binary distribution (a distribution with only two possible values, such as "yes" or "no"). A minimal illustration of this model follows.
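
The sketch below is a minimal, self-contained illustration (with made-up token counts, not values from the dataset) of how BernoulliNB treats its input: by default it binarizes the features (binarize=0.0), so it only models whether each token is present or absent in a comment.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical token counts for three comments over the tokens
# ["http", "subscribe", "amazing"]. BernoulliNB binarizes these counts,
# so any count greater than zero is treated as "token present."
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
y = np.array(["Spam", "Spam", "Not Spam"])

clf = BernoulliNB()
clf.fit(X, y)
print(clf.predict([[0, 3, 0]]))  # a comment mentioning "subscribe" three times
A minimal Bernoulli Naive Bayes example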

  3. Dataset usage:

To use the Bernoulli model, we first need to read the dataset and divide it into training and testing sets. Before splitting, we must decide which column serves as the output label, y, and which column provides the input used for classification, x. After splitting, the testing set is used to determine how accurately the model predicts the value of the isSpam metric of the dataset.

data = pd.read_csv("Spam_comment.csv")
spam_valid = {
    0: "Not Spam",
    1: "Spam"
}
data["isSpam"] = data["isSpam"].map(spam_valid)

x = np.array(data[["Comment_content"]])
y = np.array(data["isSpam"])

In the code above:

  • Line 1: Read the dataset from the CSV file Spam_comment.csv.

  • Lines 2–6: Map the integer values of the isSpam column to readable labels: 0 maps to Not Spam and 1 maps to Spam.

  • Line 8: Define the input array, x, using the Comment_content column of the dataset as the feature.

  • Line 9: Create the output variable, y, with the data from the isSpam column.

  4. Dataset transformation and model training:

The dataset stores the Comment_content values in the form of nested arrays, so we need to transform them into an array of strings; we use flattening to create a single-dimensional array. The Bernoulli Naive Bayes algorithm is then used to train the model on the dataset to predict the values of the isSpam variable, and we calculate the accuracy of the trained model.

x_flat = np.concatenate(x)

if x_flat.dtype.kind != 'U':
    x_flat = x_flat.astype(str)

cv = CountVectorizer(lowercase=False, tokenizer=lambda x: x)
x_transformed = cv.fit_transform(x_flat)
xtrain, xtest, ytrain, ytest = train_test_split(x_transformed, y, test_size=0.3, random_state=45)

model = BernoulliNB()
model.fit(xtrain, ytrain)
print(model.score(xtest, ytest))

In the code above:

  • Line 1: Concatenate the content of the input array and convert it into a one-dimensional array.

  • Lines 3–4: Check whether the flattened array, x_flat, contains strings and convert its values to strings if it doesn’t.

  • Lines 6–7: Transform the dataset into a token matrix (a matrix that records which tokens from the dataset appear in each comment) to train the Bernoulli model. A short inspection sketch follows this list.

  • Line 8: Use the train_test_split function to divide the dataset, placing 30% in the testing set to test the model and the remaining 70% in the training set to train the model.

  • Line 10: Define the BernoulliNB model.

  • Line 11: Use the dataset to train the defined model.

  • Line 12: Print the accuracy of the trained model on the testing set.
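
To see what this token matrix contains, we can optionally inspect the fitted vectorizer and the transformed data. The sketch below reuses the cv and x_transformed objects defined above; get_feature_names_out requires a recent version of scikit-learn.

# Optional inspection of the token matrix built above.
print(x_transformed.shape)              # (number of comments, number of tokens)
print(cv.get_feature_names_out()[:10])  # a few of the learned tokens
print(x_transformed[0].toarray())       # token counts for the first comment
Inspecting the token matrix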

The trained model is now ready to classify comments as spam or legitimate. For each comment, the model predicts the value of the output variable, y, which indicates whether the comment is spam.

  5. Testing example:

To test the trained model on an example, we store the comment as a string in a variable called test_comment and then run the trained model on it to predict its type.

test_comment = "Check out this SQL guide: https://www.w3schools.com/sql/"
data = cv.transform([test_comment]).toarray()
result = model.predict(data)
print("The comment is: ",result[0])
test_comment = "This music video is amazing"
data = cv.transform([test_comment]).toarray()
result = model.predict(data)
print("The comment is: ", result[0])

In the code above:

  • Line 1: Define a string, test_comment, containing a comment that represents spam content.

  • Line 2: Transform the defined comment from string to array.

  • Line 3: Use the model to predict the nature of the comment and store it in the variable result.

  • Line 5: Print the output for the first comment example.

  • Line 7: Define a string, test_comment, containing a comment that represents legitimate content.

  • Line 8: Transform the defined comment from string to array.

  • Line 9: Use the model to predict the nature of the comment and store it in the variable result.

  • Line 11: Print the output for the second comment example.

After running the example, the model labels each comment as either spam or legitimate. This is how we can use text classification in machine learning to predict the nature of a comment.

The individual steps above can also be combined and run as a single script, shown below, which you can adapt to test your own comments.

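Below is a minimal, consolidated sketch of the workflow. It assumes Spam_comment.csv is available with the Comment_content and isSpam columns described earlier, and it simplifies two details from the walkthrough: it reads the comment column as a one-dimensional array (so no flattening is needed) and relies on the vectorizer’s default word-level tokenization instead of the custom tokenizer shown above. The exact accuracy will depend on the dataset.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Load the dataset and map the integer labels to readable names.
data = pd.read_csv("Spam_comment.csv")
data["isSpam"] = data["isSpam"].map({0: "Not Spam", 1: "Spam"})

# Prepare the input text and the output labels.
x = data["Comment_content"].astype(str)
y = data["isSpam"]

# Build the token matrix and split it into training and testing sets.
cv = CountVectorizer()
x_transformed = cv.fit_transform(x)
xtrain, xtest, ytrain, ytest = train_test_split(x_transformed, y, test_size=0.3, random_state=45)

# Train the Bernoulli Naive Bayes classifier and report its accuracy.
model = BernoulliNB()
model.fit(xtrain, ytrain)
print("Accuracy:", model.score(xtest, ytest))

# Classify new comments.
for comment in ["Check out this SQL guide: https://www.w3schools.com/sql/",
                "This music video is amazing"]:
    print(comment, "->", model.predict(cv.transform([comment]))[0])
Spam comment detection in Python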

In this Answer, we have analyzed how platforms use text classification algorithms to detect and filter out spam comments. The spam detection process involves several steps, including installing necessary libraries and dependencies, refining the dataset, defining the model, training it, and then using the trained model to detect spam comments. By following these steps, we can ensure that the comments on social media platforms remain relevant and constructive.
