Best Open Source Sentence Embedding Models in August 2024

Unlike us, AI systems and machine learning models do not understand raw text, images, or audio. They require numerical input, so the raw data is converted into vectors called embeddings that the models can work with. These vectors have anywhere from hundreds to thousands of dimensions, depending on the model. Embeddings map input data into a high-dimensional space where similar data points sit close to each other. In text embeddings, for instance, words with similar meanings are represented by vectors that are close together. This preserves the contextual information and relationships within the data, which lets models efficiently perform tasks like classification, clustering, and similarity search.
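
To make "close to each other" concrete, here is a minimal sketch of cosine similarity, a common way to measure how close two embeddings are. The four-dimensional vectors are made up for illustration; real embeddings come from a trained model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", hand-picked so that cat and kitten point the same way
toy_embeddings = {
    "cat":    [0.9, 0.8, 0.1, 0.0],
    "kitten": [0.8, 0.9, 0.2, 0.1],
    "car":    [0.1, 0.0, 0.9, 0.8],
}

cat_vs_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
cat_vs_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# The related pair scores much higher than the unrelated pair
print(f"cat vs kitten: {cat_vs_kitten:.3f}")
print(f"cat vs car:    {cat_vs_car:.3f}")
```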

What are the different types of embeddings?

Embeddings are created with different techniques depending on the type of input data. Text embedding models convert words, phrases, or entire texts into vectors that capture semantic meaning, allowing a model to understand relationships between words. Convolutional Neural Networks (CNNs), on the other hand, convert images into vectors by capturing features such as edges, textures, and patterns. Audio signals are converted into vectors by capturing features like pitch, tone, and rhythm. Our focus today, however, is text embedding and its different types.

I am going to briefly discuss the different kinds of text embedding models, grouped by their purpose and characteristics.

Word-level Embedding Models

Word-level embedding models focus on generating vector representations for individual words. These models capture the semantic relationships between words based on how they co-occur in the training data. They are simple and efficient, but each word gets a single vector no matter which sentence it appears in, so they cannot account for context. Examples include Word2Vec, GloVe, and FastText.
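
Here is a small sketch of why that matters. The lookup table below is hypothetical (the numbers are made up and it is not an actual Word2Vec or GloVe model), but it mimics how a static word embedding behaves: the vector for "bank" is the same whether the sentence is about rivers or money.

```python
# Hypothetical static word vectors; a real model would learn these from a corpus
static_vectors = {
    "river": [0.9, 0.1, 0.0],
    "cash":  [0.0, 0.2, 0.9],
    "bank":  [0.4, 0.5, 0.4],
}

def embed_tokens(sentence, table):
    """Look up each known word on its own; the rest of the sentence is ignored."""
    return {word: table[word] for word in sentence.lower().split() if word in table}

river_sense = embed_tokens("she sat on the river bank", static_vectors)["bank"]
money_sense = embed_tokens("he deposited cash at the bank", static_vectors)["bank"]

# Both senses of "bank" collapse into one vector
print(river_sense == money_sense)  # prints True
```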

Contextual Word Embedding Models

These models are based on transformer architectures, which use mechanisms like self-attention to weigh the importance of each word in relation to its context. This approach enables the models to capture nuanced meanings and dependencies. BERT, for example, uses bidirectional training to consider the context from both directions, while RoBERTa keeps the same bidirectional approach but improves on it with more training data and an optimized training procedure.
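
To make the self-attention idea concrete, here is a stripped-down sketch in plain Python. Treat it as a heavy simplification: real transformers such as BERT use learned query, key, and value projections, multiple attention heads, and position information, none of which appear here. The three-dimensional input vectors are made up. What it does show is that the output for "bank" changes with its neighbours.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (a simplification)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score the current token against every token in the sequence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors
        outputs.append([sum(w * vec[j] for w, vec in zip(weights, vectors)) for j in range(dim)])
    return outputs

# Hypothetical input vectors for three words
river, cash, bank = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]

bank_by_river = self_attention([river, bank])[1]
bank_by_cash = self_attention([cash, bank])[1]

# Same word, different context, different contextual vector
print(bank_by_river)
print(bank_by_cash)
```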

Sentence-level Embedding Models

Sentence-level embedding models create vector representations for entire sentences, encapsulating their overall semantic meaning. These models are particularly useful for tasks that involve understanding sentence-level semantics, such as semantic similarity and text summarization. Notable models in this category are Sentence-BERT and Universal Sentence Encoder.
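
One common way to get a single sentence vector out of per-token vectors is mean pooling: average the token vectors while skipping padding. The sketch below shows the idea in plain Python with made-up two-dimensional token vectors; it is an illustration of the pooling step only, not the full pipeline of any particular model.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1) and skip padding (mask == 0)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]

# Hypothetical token vectors for a two-token sentence padded to length three
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]  # the last position is padding

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # prints [2.0, 3.0]
```

The padding vector is deliberately non-zero here to show that the mask, not the values, decides what gets averaged.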

How effective each type of embedding model is depends on the specific NLP task at hand. For example, contextual word embedding models like BERT and RoBERTa perform better for tasks like named entity recognition and part-of-speech tagging. Conversely, sentence-based models like Sentence-BERT and Universal Sentence Encoder are a great choice for semantic similarity, paraphrase detection, and summarization tasks. I will be focusing on the sentence-based embedding models today. The good news is that all these models are open source and available for anyone to use.

Best sentence embedding models 

If you are looking for the best open-source sentence embedding models, the Hugging Face MTEB leaderboard is where you should look. With the highest overall average score across its benchmarks, bge-en-icl stands out as the best model for sentence embeddings. It has 7.11B parameters and performs well across most categories, which makes it highly reliable and versatile. However, if memory usage and model size are an issue, stella_en_1.5B_v5 is a great lighter alternative. It has significantly lower memory requirements and closely matches the performance of bge-en-icl, even surpassing it on some tasks like STS and summarization.

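If you want to try a model out, the code below shows one way to serve embeddings over HTTP with FastAPI. As written, it loads avsolatorio/NoInstruct-small-Embedding-v0, a lightweight model covered in the next section, rather than bge-en-icl. Make sure to install the required dependencies first (fastapi, uvicorn, pydantic, torch, transformers, and python-dotenv).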

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Union, List
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
import uvicorn
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

app = FastAPI()

model_name = "avsolatorio/NoInstruct-small-Embedding-v0"
# Optional: a token is only needed for gated or private models; None works for public ones
token = os.getenv("HUGGINGFACEHUB_API_TOKEN") or None

model = AutoModel.from_pretrained(model_name, token=token)
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)

def get_embedding(text: Union[str, list[str]], mode: str = "sentence"):
    model.eval()

    assert mode in ("query", "sentence"), f"mode={mode} was passed but only `query` and `sentence` are the supported modes."

    if isinstance(text, str):
        text = [text]

    inp = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        output = model(**inp)

    # The model is optimized to use the mean pooling for queries,
    # while the sentence / document embedding uses the [CLS] representation.

    if mode == "query":
        vectors = output.last_hidden_state * inp["attention_mask"].unsqueeze(2)
        vectors = vectors.sum(dim=1) / inp["attention_mask"].sum(dim=-1).view(-1, 1)
    else:
        vectors = output.last_hidden_state[:, 0, :]

    return vectors

class TextData(BaseModel):
    texts: List[str]

@app.post("/embeddings")
async def create_embeddings(text_data: TextData):
    try:
        embeddings = get_embedding(text_data.texts)
        return {"embeddings": embeddings.tolist()}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=3000)
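
Once the server is running, you can call it from any HTTP client. Here is a minimal client sketch using only the Python standard library. It assumes the server above is listening on localhost port 3000; the helper names build_embedding_request and fetch_embeddings are my own and not part of any library.

```python
import json
from urllib import request

DEFAULT_URL = "http://localhost:3000/embeddings"  # assumes the server above runs locally

def build_embedding_request(texts, url=DEFAULT_URL):
    """Build a POST request whose JSON body matches the TextData schema ({"texts": [...]})."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_embeddings(texts, url=DEFAULT_URL):
    """Send the texts to the server and return one list of floats per text."""
    with request.urlopen(build_embedding_request(texts, url)) as response:
        return json.loads(response.read())["embeddings"]

# Usage, with the server running:
# vectors = fetch_embeddings(["hello world", "sentence embeddings are handy"])
# print(len(vectors), len(vectors[0]))
```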

Our pick for best lightweight sentence embedding models

Sometimes your use case just needs a lightweight model that performs reasonably well. Luckily, plenty of models use a fraction of the memory of the big ones and deliver comparable results. On some tasks they are essentially as good as the best: NoInstruct-small-Embedding-v0, for example, scores 30.6 on summarization, whereas bge-en-icl scores 30.77. There are quite a few in this category that I suggest you check out, and almost all of them work great. Here is plug-and-use code for all of these models except the third one.

  1. Alibaba-NLP/gte-base-en-v1.5
  2. avsolatorio/GIST-Embedding-v0
  3. avsolatorio/NoInstruct-small-Embedding-v0
  4. avsolatorio/GIST-small-Embedding-v0
  5. infgrad/stella-base-en-v2

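The code below shows how to generate embeddings with BAAI/bge-en-icl, the top overall model from the previous section, using in-context (few-shot) examples. If you want to try NoInstruct-small-Embedding-v0, which is the best-performing model in the lightest category, use the FastAPI example from the previous section, which already loads it. Again, don't forget to install the required dependencies.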


import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'<instruct>{task_description}\n<query>{query}'

def get_detailed_example(task_description: str, query: str, response: str) -> str:
    return f'<instruct>{task_description}\n<query>{query}\n<response>{response}'

def get_new_queries(queries, query_max_len, examples_prefix, tokenizer):
    inputs = tokenizer(
        queries,
        max_length=query_max_len - len(tokenizer('<s>', add_special_tokens=False)['input_ids']) - len(
            tokenizer('\n<response></s>', add_special_tokens=False)['input_ids']),
        return_token_type_ids=False,
        truncation=True,
        return_tensors=None,
        add_special_tokens=False
    )
    prefix_ids = tokenizer(examples_prefix, add_special_tokens=False)['input_ids']
    suffix_ids = tokenizer('\n<response>', add_special_tokens=False)['input_ids']
    new_max_length = (len(prefix_ids) + len(suffix_ids) + query_max_len + 8) // 8 * 8 + 8
    new_queries = tokenizer.batch_decode(inputs['input_ids'])
    for i in range(len(new_queries)):
        new_queries[i] = examples_prefix + new_queries[i] + '\n<response>'
    return new_max_length, new_queries

task = 'Given a web search query, retrieve relevant passages that answer the query.'
examples = [
  {'instruct': 'Given a web search query, retrieve relevant passages that answer the query.',
   'query': 'what is a virtual interface',
   'response': "A virtual interface is a software-defined abstraction that mimics the behavior and characteristics of a physical network interface. It allows multiple logical network connections to share the same physical network interface, enabling efficient utilization of network resources. Virtual interfaces are commonly used in virtualization technologies such as virtual machines and containers to provide network connectivity without requiring dedicated hardware. They facilitate flexible network configurations and help in isolating network traffic for security and management purposes."},
  {'instruct': 'Given a web search query, retrieve relevant passages that answer the query.',
   'query': 'causes of back pain in female for a week',
   'response': "Back pain in females lasting a week can stem from various factors. Common causes include muscle strain due to lifting heavy objects or improper posture, spinal issues like herniated discs or osteoporosis, menstrual cramps causing referred pain, urinary tract infections, or pelvic inflammatory disease. Pregnancy-related changes can also contribute. Stress and lack of physical activity may exacerbate symptoms. Proper diagnosis by a healthcare professional is crucial for effective treatment and management."}
]
examples = [get_detailed_example(e['instruct'], e['query'], e['response']) for e in examples]
examples_prefix = '\n\n'.join(examples) + '\n\n' # if there are no examples, just set examples_prefix = ''
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
query_max_len, doc_max_len = 512, 512

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-en-icl')
model = AutoModel.from_pretrained('BAAI/bge-en-icl')
model.eval()

new_query_max_len, new_queries = get_new_queries(queries, query_max_len, examples_prefix, tokenizer)

query_batch_dict = tokenizer(new_queries, max_length=new_query_max_len, padding=True, truncation=True, return_tensors='pt')
doc_batch_dict = tokenizer(documents, max_length=doc_max_len, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    query_outputs = model(**query_batch_dict)
    query_embeddings = last_token_pool(query_outputs.last_hidden_state, query_batch_dict['attention_mask'])
    doc_outputs = model(**doc_batch_dict)
    doc_embeddings = last_token_pool(doc_outputs.last_hidden_state, doc_batch_dict['attention_mask'])
    
# normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)

# Output the embeddings
print("Query Embeddings:", query_embeddings)
print("Document Embeddings:", doc_embeddings)
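
Printing raw embeddings is not very informative on its own; the usual next step is to score each document against each query. Because the embeddings above are L2-normalized, a plain dot product equals cosine similarity, so with the tensors above something like query_embeddings @ doc_embeddings.T would give a query-by-document score matrix. The sketch below shows the same ranking step in plain Python on made-up, already-normalized two-dimensional vectors; rank_documents is a hypothetical helper name.

```python
def rank_documents(query_vector, document_vectors):
    """Return (document index, score) pairs sorted from most to least similar.

    Assumes all vectors are L2-normalized, so the dot product is the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vector, doc)) for doc in document_vectors]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

# Hypothetical unit-length vectors standing in for real model output
toy_query = [1.0, 0.0]
toy_documents = [[0.6, 0.8], [0.8, 0.6], [0.0, 1.0]]

for doc_index, score in rank_documents(toy_query, toy_documents):
    print(f"document {doc_index}: {score:.2f}")
```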

Sentence embedding models with multilingual support

If you need models for semantic search that support more than one language, here are some good ones that give you similar embeddings for the same text in different languages. You are not required to specify the input language, and these models can find semantically similar sentences within the same language or across languages. The multilingual model distiluse-base-multilingual-cased-v1 works well for clustering and semantic search and supports 15 languages. Another model, paraphrase-multilingual-MiniLM-L12-v2, handles the same tasks, supports more than 50 languages, and is even lighter than the first one. You can simply plug either model into the repository provided earlier and use it.
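
As a rough sketch of what cross-lingual matching looks like, the code below uses the sentence-transformers library, which both of these models are published for. The example sentences and the helper best_matches are my own, and I am assuming the library is installed and the model can be downloaded. The model is loaded inside demo() so that the helper works on its own.

```python
def best_matches(similarity_rows):
    """For each row of similarity scores, return the index of the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity_rows]

def demo():
    # Imported here so best_matches stays usable without the library installed
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

    english = ["The weather is lovely today.", "I am reading a book."]
    german = ["Ich lese ein Buch.", "Das Wetter ist heute schön."]

    # No language flag is needed; normalized vectors make dot product = cosine similarity
    english_vectors = model.encode(english, normalize_embeddings=True)
    german_vectors = model.encode(german, normalize_embeddings=True)

    similarity = (english_vectors @ german_vectors.T).tolist()
    for sentence, match_index in zip(english, best_matches(similarity)):
        print(sentence, "->", german[match_index])

# Call demo() to run the full example (downloads the model on first use).
```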

Sentence and image embedding model

These models can embed images and text into a joint vector space. They are useful for text-to-image search, image-to-image search, image clustering, and zero-shot image classification. The clip-ViT-B-32-multilingual-v1 model is a multilingual adaptation of the OpenAI CLIP-ViT-B32 model. It maps text (in 50+ languages) and images into a common dense vector space, so that matching texts and images end up close together. It is pretty lightweight and will work if plugged into the provided code.

Have fun experimenting with different models.