Text embeddings transform how AI models understand and respond to human language, bridging the gap between AI and human communication. Before diving into generating text embeddings, it’s important to understand what they are.
Text embeddings are numerical representations of text. They represent words or phrases as vectors in a high-dimensional space that captures the underlying meaning of the text. These embeddings allow large language models (LLMs) to understand and process relationships between words in sentences. For example, the embedding vector of “felines say” will be more similar to the embedding vector of “meow” than to that of “roar.”
Why do we convert text to embeddings? As computers interpret data as numbers, by turning text into embeddings, we provide them with a way to interpret and analyze complex human language. This capability enables them to handle complex language tasks such as clustering, classification, topic identification, etc., more effectively.
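To build intuition for what “more similar” means, here is a toy sketch with made-up three-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and the values below are invented purely for illustration:

import numpy as np

# Invented 3-D "embeddings" -- real models produce much longer vectors
felines_say = np.array([0.8, 0.1, 0.3])
meow        = np.array([0.7, 0.2, 0.4])
roar        = np.array([0.1, 0.9, 0.2])

def cosine(a, b):
    # Cosine similarity: close to 1 means similar direction (similar meaning)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(felines_say, meow))  # higher score: semantically closer
print(cosine(felines_say, roar))  # lower score: semantically farther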
OpenAI offers a range of embedding models for different performance and cost needs. Let’s explore them one by one:
text-embedding-ada-002: This is an earlier model with an embedding size of 1536 dimensions. It performs well on standard tasks but offers lower multilingual accuracy (scoring 31.4% on the MIRACL benchmark) than the newer models.
text-embedding-3-small: This advanced embedding model, released in 2024, offers significant improvements on multilingual tasks (scoring 44% on the MIRACL benchmark). It is a highly efficient and cost-effective model that produces 1536-dimensional embeddings by default; its output can also be shortened, for example to 512 dimensions, through the dimensions parameter, as shown in the sketch after this list.
text-embedding-3-large: This is the most powerful embedding model, achieving the highest performance on complex tasks (54.9% on the MIRACL benchmark). It produces embeddings of up to 3072 dimensions, which support more detailed representations but come at a higher cost.
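Both text-embedding-3 models accept an optional dimensions parameter that shortens the returned vector, trading a little accuracy for lower storage and compute cost. A minimal sketch, assuming an API key is available in the OPENAI_KEY environment variable:

from openai import OpenAI
import os

client = OpenAI(api_key = os.environ["OPENAI_KEY"])

# Request a shortened 512-dimensional embedding instead of the default 1536
response = client.embeddings.create(
    input = "Educative answers section is helpful",
    model = "text-embedding-3-small",
    dimensions = 512
)

print(len(response.data[0].embedding))  # 512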
We’ll use text-embedding-3-small in this Answer due to its balance of performance, efficiency, and cost-effectiveness for multilingual applications. Now, let’s look at how to generate text embeddings using OpenAI’s API in Python.
Before we begin, we need to install the OpenAI Python library on our system. We can install it using the command:
pip install openai
After that, we need an OpenAI API key to use the embedding models. Now, we are all set to use them for our tasks.
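The examples in this Answer read the key from an environment variable named OPENAI_KEY; the name is our own choice, and any variable works as long as the code reads the same one. A quick sanity check before making API calls:

import os

# Fail early with a clear message if the key is missing
if "OPENAI_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_KEY environment variable before running these examples.")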
Let’s begin by generating the embeddings for the sentence “Educative answers section is helpful.”
from openai import OpenAI
import os

# Initializing OpenAI object by providing an OpenAI key
client = OpenAI(
    api_key = os.environ["OPENAI_KEY"]
)

response = client.embeddings.create(
    input = "Educative answers section is helpful",
    model = "text-embedding-3-small"
)

print(response)
Lines 1–3: We import the OpenAI class from the openai library to access the embedding models and os for accessing environment variables.
Lines 5–7: We initialize the OpenAI object as a client using the OpenAI API key.
Line 9: We use the embeddings.create() function to generate embeddings.
Line 10: In the input parameter, we provide the sentence “Educative answers section is helpful,” for which we want to generate embeddings.
Line 11: We provide the model name in the model parameter.
Line 14: We print the embeddings generated by the model.
After executing the code, we can see that the text-embedding-3-small model successfully generated embeddings that capture the meaning of the provided sentence.
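The response object contains more than the vector itself, such as the model name and token usage. To work with the embedding directly, extract it from response.data. A short sketch, reusing the response object from the code above:

# The actual embedding is a plain list of floats inside the response object
embedding = response.data[0].embedding

print(len(embedding))  # 1536 dimensions for text-embedding-3-small
print(embedding[:5])   # the first few values of the vector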
Embedding models are also used to find the semantic similarity between texts.
In this example, we’ll use the dot product to find the similarity between the phrases “feline friends say” and “meow.”
from openai import OpenAI
import numpy as np
import os

# Initializing OpenAI object by providing an OpenAI key
client = OpenAI(
    api_key = os.environ["OPENAI_KEY"]
)

# Generating embeddings for text
response = client.embeddings.create(
    input = ["feline friends say", "meow"],
    model = "text-embedding-3-small"
)

# Extracting the embedding of each text
embedding_a = response.data[0].embedding
embedding_b = response.data[1].embedding

# Finding similarity between embeddings using the dot product
similarity_score = np.dot(embedding_a, embedding_b)

print(similarity_score)
Lines 1–3: We import the required libraries.
Lines 6–8: We initialize the OpenAI object as a client using the OpenAI API key.
Lines 11–14: We use the embeddings.create() function to generate embeddings for the two phrases “feline friends say” and “meow” using the text-embedding-3-small model. The function returns a response that includes the embeddings for both phrases.
Lines 17–18: We extract the embedding of each phrase from the response object into the embedding_a and embedding_b variables.
Line 21: We calculate the similarity score of the two phrases by taking the dot product of embedding_a and embedding_b. (See the note after this walkthrough on why the dot product is a valid similarity measure here.)
Line 23: We print the similarity score, which tells us how close the two phrases are semantically.
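A note on the metric: OpenAI embeddings are normalized to unit length, so the dot product of two embeddings equals their cosine similarity. For vectors that are not normalized, we would also divide by the norms. A sketch of the explicit computation, reusing embedding_a and embedding_b from the code above:

import numpy as np

# For unit-length vectors, the dot product and cosine similarity coincide
dot = np.dot(embedding_a, embedding_b)
cos = dot / (np.linalg.norm(embedding_a) * np.linalg.norm(embedding_b))

print(dot, cos)  # nearly identical values for OpenAI embeddings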
Let’s implement a simple semantic search system that compares embedding vectors to find the top-n most similar items in a dataset.
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
import os

# Initializing OpenAI object by providing an OpenAI key
client = OpenAI(
    api_key = os.environ["OPENAI_KEY"]
)

# Defining the get_embedding function, which returns the embedding of the given text
def get_embedding(text, model):
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Example dataset
dataset = ["sparrow", "carrot", "lion", "peas", "parrot"]

# Generating the embedding of each element in the dataset and storing it in a dictionary
# in the following pattern: "key: sparrow and value: embedding for sparrow"
dataset_embeddings = {word: get_embedding(word, model='text-embedding-3-small') for word in dataset}

# Defining the search function, which finds the top-n matches for the search query
def search(dataset_embeddings, query, n=3):
    # Generating the embedding of the query
    query_embedding = get_embedding(query, model='text-embedding-3-small')

    # Finding the similarity of the query with every element of the dataset
    similarity = {word: cosine_similarity([embedding], [query_embedding])[0][0] for word, embedding in dataset_embeddings.items()}

    # Sorting the dictionary in descending order to get the top-n results
    result = sorted(similarity.items(), key=lambda item: item[1], reverse=True)[:n]
    return result

results = search(dataset_embeddings, "cat", n=3)

# Printing the top-n search results for the query
for word, similarity in results:
    print(f"{word}: {similarity}")
Lines 1–3: We import the required libraries.
Lines 6–8: We initialize the OpenAI object as a client using the OpenAI API key.
Lines 11–12: We define a get_embedding() function that returns the embedding of the given text using the specified model.
Lines 15–19: We define a sample dataset and create embeddings for each item. Note that this makes one API call per item; a batched alternative is sketched after this walkthrough.
Lines 22–31: We define the search() function, which first finds the embedding of the query and then uses cosine similarity to measure the similarity of the query with each dataset item. Finally, we sort the results in descending order and extract the top-n most similar items.
Line 33: We call the search() function to find the top-3 most similar items for the query “cat”.
Lines 36–37: We print the top 3 results for the query.
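One practical note: the dictionary comprehension on line 19 makes one API call per dataset item. The embeddings endpoint also accepts a list of strings, so the whole dataset can be embedded in a single request, which is faster and cheaper. A sketch using the same client and dataset as above:

# Embed the entire dataset in one API call instead of one call per word
response = client.embeddings.create(
    input = dataset,
    model = "text-embedding-3-small"
)

# Results come back in the same order as the inputs
dataset_embeddings = {word: item.embedding for word, item in zip(dataset, response.data)}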
Text embeddings offer an incredible range of capabilities in NLP, capturing the meaning of text with just a few lines of Python code. Whether we aim to build a recommendation system, categorize documents, or visualize conceptual relationships, OpenAI’s embeddings are an invaluable resource in our toolkit.