How to build an odd word picker using Python and NLP

In this shot, we will build an NLP engine that picks the odd word out of a set of words. For example, given the words “Apple”, “Mango”, “Party”, and “Juice”, it is clear that “Party” is the odd word out.

For this, we are going to use Gensim’s word2vec model. Gensim provides an optimized implementation of word2vec’s CBOW (Continuous Bag of Words) and Skip-Gram models.

Similarity between two words

Before moving on, you need to download the word2vec vectors (https://huggingface.co/LoganKilpatrick/GoogleNews-vectors-negative300/blob/main/GoogleNews-vectors-negative300.bin.gz). Remember that the file size is ~1.65GB. Because the file is so large, we suggest you work in Google Colab.

Open Google Colab and run the command below to get the word vectors.

!wget -P /root/input/ -c "https://huggingface.co/LoganKilpatrick/GoogleNews-vectors-negative300/resolve/main/GoogleNews-vectors-negative300.bin.gz"

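This command downloads the file directly to the Colab machine on Google's servers, which saves a lot of time compared with downloading it to your own computer and uploading it.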

Now, let’s install the packages we require, as shown below.

pip install gensim
pip install scikit-learn

You can run the commands above in Google Colab or on your local machine (if you’re using that).

Now, let’s move on to the coding part by first importing the packages in the following way:

We imported two packages that will be used in the following way:

The gensim package will be used to load the word vectors that we downloaded.
KeyedVectors essentially contains the mapping between words and their embeddings. After training, it can be used to query those embeddings directly in various ways.
We will use scikit-learn's cosine similarity to measure how similar two word vectors are. The score is at most 1.0, and a higher score means the words are closer in meaning. This metric is commonly used and provides good results for various problems.
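To make the similarity score concrete, here is a small sketch that applies scikit-learn's cosine_similarity to hand-made 2-D vectors. The vectors and the helper name similarity are made up for illustration. The real GoogleNews vectors have 300 dimensions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def similarity(u, v):
    """Cosine similarity of two 1-D vectors, returned as a plain float."""
    # cosine_similarity works on 2-D arrays (one row per sample),
    # so each vector is wrapped in a list and the single entry is unpacked.
    return float(cosine_similarity([u], [v])[0][0])


# Toy vectors, not real embeddings.
east = np.array([1.0, 0.0])
also_east = np.array([3.0, 0.0])  # same direction, different length
north = np.array([0.0, 1.0])      # perpendicular direction

same_direction = similarity(east, also_east)
perpendicular = similarity(east, north)
print(same_direction, perpendicular)
```

Vectors pointing the same way score 1 regardless of their length, while perpendicular vectors score 0. That is why the score depends on direction and not on magnitude.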

Create the function

Next, we are going to create a function that takes a list of words (strings) and returns the word that is least similar to the rest. This function can be created as shown below.
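A sketch of such a function, assuming the approach described in this section: average the vectors, then pick the word with the lowest cosine similarity to that average. Passing the vectors in as a second argument, word_vectors, is an assumption made here so the function also works with any dictionary-like mapping from words to vectors.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def odd_one_out(words, word_vectors):
    """Return the word whose vector is least similar to the group average."""
    # Look up the embedding of every input word.
    all_word_vectors = [word_vectors[w] for w in words]
    # Average vector of the whole group.
    avg_vector = np.mean(all_word_vectors, axis=0)

    odd_word = None
    min_similarity = 1.0  # cosine similarity cannot exceed 1.0

    for w in words:
        # cosine_similarity expects 2-D inputs and returns a 1x1 array here.
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_similarity:
            min_similarity = sim
            odd_word = w

    return odd_word
```

Because the comparison is strict, the function returns None in the corner case where every word is perfectly aligned with the average.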

The function works as follows:

First, we used a list comprehension to get the word vectors of all the words passed as input, and saved them in the all_word_vectors list.
Next, we calculated the mean of all the vectors. We did this because we will compare each word vector with this average vector. The word whose vector is least similar to the average will be our result.
Then, we set the initial minimum similarity to 1.0, which is the maximum value cosine similarity can take. Any smaller value found later replaces it, and the smallest value gives the desired result.
In the loop, we iterated over each word vector and found the one with the minimum similarity to the average vector, that is, the one farthest from it.
Finally, we returned our result.

Test the function

Now that we have our function ready, let’s use it with some inputs.
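The idea can be tried without the large download by using stand-in vectors. The sketch below uses a made-up dictionary, TOY_VECTORS, in place of the loaded model. It also uses a compact, vectorized variant of the picker under the hypothetical name pick_odd_word, so it is an illustration under those assumptions, not the exact function from the previous section.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up 2-D vectors standing in for real embeddings:
# the first axis loosely means "food", the second "event".
TOY_VECTORS = {
    "apple": np.array([1.0, 0.1]),
    "mango": np.array([0.9, 0.2]),
    "juice": np.array([1.0, 0.0]),
    "party": np.array([0.0, 1.0]),
}


def pick_odd_word(words, word_vectors):
    """Return the word least similar to the average vector of the group."""
    vectors = np.array([word_vectors[w] for w in words])
    avg_vector = vectors.mean(axis=0, keepdims=True)  # shape (1, dim)
    # One call scores every word against the average: shape (n_words, 1).
    sims = cosine_similarity(vectors, avg_vector).ravel()
    return words[int(np.argmin(sims))]


result = pick_odd_word(["apple", "mango", "party", "juice"], TOY_VECTORS)
print(result)
```

With the real model, pass the loaded KeyedVectors object as the second argument. A word that is missing from the model's vocabulary raises a KeyError, so check the spelling and capitalization of the inputs.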


License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)