How to build an odd word picker using Python and NLP

In this shot, we will build an NLP engine that picks the odd word out of a set of words. For example, if we have a list of words like “Apple”, “Mango”, “Party”, and “Juice”, it is clear that “Party” is the odd word out.

For this, we are going to use Gensim’s word2vec model. Gensim provides an optimized implementation of word2vec’s CBOW (Continuous Bag of Words) model and Skip-Gram model.

Similarity between two words

Before moving on, you need to download the word2vec vectors (https://huggingface.co/LoganKilpatrick/GoogleNews-vectors-negative300/blob/main/GoogleNews-vectors-negative300.bin.gz). Remember that the file size is ~1.65GB. We suggest you work on Google Colab for this, as the file is very large.

Open your Google Colab and run the command below to get your word vectors.

!wget -P /root/input/ -c "https://huggingface.co/LoganKilpatrick/GoogleNews-vectors-negative300/resolve/main/GoogleNews-vectors-negative300.bin.gz"

This command downloads the file directly to Google's servers, which saves a lot of time compared with downloading it to your own machine first.
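Because the file is large, it can be worth confirming that what was downloaded is really a gzip archive and not, say, an HTML page. The sketch below checks the first two bytes for the gzip magic number. The helper name looks_like_gzip is ours, and the path is assumed from the wget command above.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # every gzip file starts with these two bytes


def looks_like_gzip(path):
    """Return True if `path` is an existing file that starts with the gzip magic number."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(2) == GZIP_MAGIC


# Path assumed from the wget command above
print(looks_like_gzip("/root/input/GoogleNews-vectors-negative300.bin.gz"))
```

If this prints False, the download is missing or is not the archive itself, and loading it later will fail.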

Now, let’s install the packages we require, as shown below.

pip install gensim
pip install scikit-learn

You can run the commands above both in Google Colab and on your local machine (if you’re using that).

Now, let’s move on to the coding part by first importing the packages in the following way:

import numpy as np
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
print('Imported Successfully!')

We imported three packages that will be used in the following way:

  • The gensim package will be used to load the word vectors that we downloaded.
  • KeyedVectors essentially contains the mapping between words and their embeddings. After training, it can be used to query those embeddings directly in various ways.
  • We will use scikit-learn's cosine_similarity to measure how similar two word vectors are. This metric is commonly used and provides good results for various problems.
  • NumPy (imported as np) will be used to compute the average of the word vectors.
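To make cosine similarity concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (the real word2vec vectors have 300 dimensions), and the cosine helper is ours, not part of gensim or scikit-learn.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, for illustration only
apple = [0.9, 0.8, 0.1]
mango = [0.8, 0.9, 0.2]
party = [0.1, 0.2, 0.9]

print(cosine(apple, mango))  # close to 1: the vectors point in nearly the same direction
print(cosine(apple, party))  # noticeably lower: the vectors point in different directions
```

A value near 1 means the two vectors point in almost the same direction, so the words are used in similar contexts. Lower values mean the words are less related.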

Create the function

Next, we are going to create a function that takes a list of words and returns the one that is least similar to the rest. This function can be created as shown below.

# Accepts a list of words and the word2vec vectors
def odd_one_out(words, word_vectors):
    # Look up the vector for every input word
    all_word_vectors = [word_vectors[w] for w in words]
    # Average the vectors to get a "center" for the group
    avg_vector = np.mean(all_word_vectors, axis=0)
    odd_word = None
    min_sim = 1.0
    for w in words:
        # cosine_similarity returns a 1x1 array, so take the scalar out of it
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_sim:
            min_sim = sim
            odd_word = w
    return odd_word

print("Function Created Successfully!")

The explanation for the code above is:

  • First, we used a list comprehension to get the word vectors of all the words that are passed as input, and saved them in the all_word_vectors list.
  • Next, we calculated the mean of all the vectors with np.mean. We did this because we will compare each word vector with the average vector. The word whose vector is least similar to the average will be our result.
  • We initialized min_sim to 1.0, which is the maximum value cosine similarity can take. Any lower similarity we find replaces it, so the smallest value gives the desired result.
  • In the for loop, we iterated over each word and found the one whose vector has the minimum similarity to the average vector.
  • Finally, we returned our result.
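The steps above can be tried without the 1.65GB download. The sketch below re-implements the same idea in plain Python, with a dictionary of made-up two-dimensional vectors standing in for the word2vec model. The names toy_vectors and odd_one_out_toy are hypothetical, and min() with a key replaces the explicit loop.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def odd_one_out_toy(words, vectors):
    """Return the word whose vector is least similar to the average vector."""
    dims = len(vectors[words[0]])
    # Component-wise mean of all the word vectors
    avg = [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]
    # The odd word is the one with the lowest similarity to that mean
    return min(words, key=lambda w: cosine(vectors[w], avg))


# Made-up vectors: the three food words point roughly the same way, "party" does not
toy_vectors = {
    "apple": [0.9, 0.1],
    "mango": [0.8, 0.2],
    "juice": [0.7, 0.3],
    "party": [0.1, 0.9],
}

print(odd_one_out_toy(["apple", "mango", "party", "juice"], toy_vectors))  # prints: party
```

With real embeddings the geometry is the same, only in 300 dimensions instead of two.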

Test the function

Now that we have our function ready, let’s use it with some inputs.

word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
list_of_words = ["apple","mango","party","juice","orange"]
print(odd_one_out(list_of_words,word_vectors))

What do you think the output is going to be? Choose an option and let’s see whether you can find the correct output or not.

Q: What will be the output of the code above?

A) apple
B) mango
C) party
D) juice
E) orange

The correct answer is C) party, because all of the other words are food items and somewhat related to each other. Thus, our system was able to correctly find the odd word out from the given set of words.
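One practical caveat: looking up a word that is not in the model's vocabulary raises a KeyError. A small guard like the one below, a sketch with a hypothetical helper name, drops unknown words first. It relies only on the in operator, which KeyedVectors supports, so a plain dictionary stands in for the model here.

```python
def known_words(words, word_vectors):
    """Keep only the words that have a vector in `word_vectors`."""
    return [w for w in words if w in word_vectors]


# A plain dict stands in for the loaded KeyedVectors object
stand_in = {"apple": [0.9, 0.1], "mango": [0.8, 0.2]}

print(known_words(["apple", "mango", "qwertyzz"], stand_in))  # ['apple', 'mango']
```

With the real model, the call would look like odd_one_out(known_words(list_of_words, word_vectors), word_vectors).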
