In this shot, we will build an NLP engine that picks the odd word out from a set of words. For example, given the words “Apple”, “Mango”, “Party”, and “Juice”, it is clear that “Party” is the odd word out.
For this, we are going to use Gensim’s word2vec model. Gensim provides an optimized implementation of word2vec.
Before moving on, you need to download the pretrained word vectors.
Open Google Colab and run the command below to get your word vectors.
!wget -P /root/input/ -c "https://huggingface.co/LoganKilpatrick/GoogleNews-vectors-negative300/resolve/main/GoogleNews-vectors-negative300.bin.gz"
This command downloads the file directly onto Google's servers, which saves a lot of time compared with downloading it to your own machine and uploading it.
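A mistyped or redirected URL can silently save a web page instead of the archive, and the loading step would then fail later on. As a quick sanity check, the sketch below tests whether the downloaded file starts with the two bytes that every gzip file begins with. The helper name looks_like_gzip is hypothetical, and the path is assumed from the -P /root/input/ option in the download command above.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # every gzip file starts with these two bytes


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Path assumed from the wget command (-P /root/input/); adjust if yours differs.
vectors_path = "/root/input/GoogleNews-vectors-negative300.bin.gz"

if os.path.exists(vectors_path):
    size_gb = os.path.getsize(vectors_path) / 1e9
    print(f"gzip archive: {looks_like_gzip(vectors_path)}, size: {size_gb:.2f} GB")
else:
    print("File not found; run the download command first.")
```

If the check prints False, the saved file is probably not the real archive, and it is worth re-checking the URL before moving on.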
Now, let’s install the packages we require, as shown below.
pip install gensim
pip install scikit-learn
You can run the commands above either in Google Colab or on your local machine.
Now, let’s move on to the coding part by first importing the packages in the following way:
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
print('Imported Successfully!')
We imported two packages that will be used in the following way:
The gensim package will be used to load the word vectors that we downloaded. KeyedVectors essentially contains the mapping between words and their embeddings. After training, it can be used to query those embeddings directly in various ways.
scikit-learn's cosine_similarity will be used to measure how close two word vectors are. This metric is commonly used and provides good results for a variety of problems.
Next, we are going to create a function that takes a list of words and returns the word that is most different from the rest. This function can be created as shown below.
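import numpy as np  # needed for np.mean below

# Accepts a list of words and the word2vec vectors
def odd_one_out(words, word_vectors):
    # Look up the vector of every word and average them
    all_word_vectors = [word_vectors[w] for w in words]
    avg_vector = np.mean(all_word_vectors, axis=0)

    odd_word = None
    min_sim = 1.0  # maximum possible cosine similarity
    for w in words:
        # cosine_similarity returns a 1x1 array, so pull out the scalar
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_sim:
            min_sim = sim
            odd_word = w
    return odd_word

print("Function Created Successfully!")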
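The explanation for the code above is:
We first look up the vector of every word and store these vectors in the all_word_vectors list.
We then average these vectors to get avg_vector, which represents the group as a whole.
We initialize min_sim to 1.0, which is the maximum value we can get. We then compare each word's cosine similarity with the average vector against this value and check whether we get any value smaller than 1.0.
The word with the smallest similarity gives the desired result.
Now that we have our function ready, let’s use it with some inputs.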
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
list_of_words = ["apple", "mango", "party", "juice", "orange"]
print(odd_one_out(list_of_words, word_vectors))
What do you think the output is going to be? Choose an option and let’s see whether you can find the correct output or not.
What will be the output of the code above?
apple
mango
party
juice
orange
The correct answer is party, because all of the other words are food items and are somewhat related to each other. Thus, our system is able to correctly find the odd word out from the given set of words.
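To see the idea at work without the large download, here is a minimal sketch that applies the same average-then-compare logic to made-up 3-dimensional vectors. The toy_vectors values and the odd_one_out_toy and cosine names are assumptions for illustration only; they are not real word2vec embeddings. The sketch also uses plain Python instead of NumPy and scikit-learn, and picks the minimum with min() rather than tracking min_sim by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out_toy(words, vectors):
    """Return the word whose vector is least similar to the group average."""
    dims = len(vectors[words[0]])
    # Component-wise mean of all the word vectors
    avg = [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]
    # The odd word is the one with the smallest similarity to the average
    return min(words, key=lambda w: cosine(vectors[w], avg))


# Hypothetical toy embeddings: the first axis loosely stands for "food-ness",
# the last axis for "event-ness". Real embeddings have 300 dimensions.
toy_vectors = {
    "apple": [0.9, 0.1, 0.0],
    "mango": [0.8, 0.2, 0.1],
    "juice": [0.7, 0.3, 0.0],
    "party": [0.1, 0.1, 0.9],
}

print(odd_one_out_toy(list(toy_vectors), toy_vectors))  # prints "party"
```

Gensim's KeyedVectors also provides a built-in doesnt_match method for the same task, so word_vectors.doesnt_match(list_of_words) is worth trying as a comparison. Its internals are not identical to the function in this shot, so the two may not always agree.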