Text Embeddings, Classification, and Semantic Search | by Shaw Talebi – Towards Data Science

Imports

We start by importing our dependencies and loading the synthetic dataset.
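A minimal version of that setup might look like the following. The file name and column names ("resume" and "role") are assumptions for illustration; the actual dataset in the post may be structured differently.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# load the synthetic resume dataset (file name and columns are assumptions)
df = pd.read_csv("resumes.csv")  # expects "resume" text and a "role" label per row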

Next, we'll generate the text embeddings. Instead of using the OpenAI API, we will use an open-source model from the Sentence Transformers Python library. This model was specifically fine-tuned for semantic search.
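A sketch of that step, assuming the all-MiniLM-L6-v2 checkpoint (the exact model used in the post may differ):

```python
# load an open-source embedding model (model name is an assumption)
model = SentenceTransformer("all-MiniLM-L6-v2")

# encode every resume into a fixed-length vector
embeddings = model.encode(df["resume"].tolist(), show_progress_bar=True)
print(embeddings.shape)  # (num_resumes, embedding_dim)
```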

To see the different resumes in the dataset and their relative locations in concept space, we can use PCA to reduce the dimensionality of the embedding vectors and visualize the data on a 2D plot (code is on GitHub).
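The original plotting code lives on GitHub; a rough sketch of the idea looks like this:

```python
# project the embeddings onto their first two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings)

# color each point by its role label
for role in df["role"].unique():
    m = (df["role"] == role).to_numpy()
    plt.scatter(coords[m, 0], coords[m, 1], label=role, alpha=0.6)
plt.legend()
plt.title("Resumes in 2D embedding space (PCA)")
plt.show()
```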

From this view we see the resumes for a given role tend to clump together.

Now, to do a semantic search over these resumes, we can take a user query, translate it into a text embedding, and then return the nearest resumes in the embedding space. Here's what that looks like in code.
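The code isn't reproduced in this excerpt, but the core logic is roughly the following; the function name, query string, and return format are my own, not necessarily what the post uses.

```python
def semantic_search(query: str, k: int = 10) -> pd.DataFrame:
    # embed the query with the same model used for the resumes
    q = model.encode([query])

    # cosine similarity between the query and every resume embedding
    sims = (embeddings @ q.T).squeeze() / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q)
    )

    # return the k most similar resumes, most similar first
    top_idx = np.argsort(-sims)[:k]
    return df.iloc[top_idx].assign(similarity=sims[top_idx])

# example query (illustrative)
results = semantic_search("Data Engineer with experience building data pipelines")
print(results["role"].tolist())
```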

Printing the roles of the top 10 results, we see almost all are data engineers, which is a good sign.

Let's look at the resume of the top search result.

Although this is a made-up resume, the candidate likely has all the necessary skills and experience to fulfill the user's needs.

Another way to look at the search results is via the 2D plot from before. Here's what that looks like for a few queries (see plot titles).

While this simple search example does a good job of matching particular candidates to a given query, it is not perfect. One shortcoming arises when the user query includes a specific skill. For example, for the query "Data Engineer with Apache Airflow experience," only one of the top five results has Airflow experience.

This highlights that semantic search is not better than keyword-based search in all situations. Each has its strengths and weaknesses.

Thus, a robust search system will employ so-called hybrid search, which combines the best of both techniques. While there are many ways to design such a system, a simple approach is applying keyword-based search to filter down results, followed by semantic search.
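A minimal sketch of that two-stage approach, reusing the semantic_search idea from above (the keyword match here is a simple substring check; production systems typically use something like BM25):

```python
def hybrid_search(query: str, keyword: str, k: int = 5) -> pd.DataFrame:
    # stage 1: keyword-based filter (case-insensitive substring match)
    mask = df["resume"].str.contains(keyword, case=False)
    subset = df[mask]
    subset_emb = embeddings[mask.to_numpy()]  # assumes default integer index

    # stage 2: semantic search over the filtered subset
    q = model.encode([query])
    sims = (subset_emb @ q.T).squeeze() / (
        np.linalg.norm(subset_emb, axis=1) * np.linalg.norm(q)
    )
    return subset.iloc[np.argsort(-sims)[:k]]

# example usage (query and keyword are illustrative)
hybrid_search("Data Engineer for cloud data pipelines", keyword="Airflow")
```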

Two additional strategies for improving search are using a Reranker and fine-tuning text embeddings.

A Reranker is a model that directly compares two pieces of text. In other words, instead of computing the similarity between pieces of text via a distance metric in the embedding space, a Reranker computes such a similarity score directly.

Rerankers are commonly used to refine search results. For example, one can return the top 25 results using semantic search and then refine to the top 5 with a Reranker.
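For example, using the CrossEncoder class from Sentence Transformers (the specific cross-encoder checkpoint here is an assumption, not necessarily what the post uses):

```python
from sentence_transformers import CrossEncoder

# a cross-encoder scores (query, document) pairs directly
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Data Engineer with Apache Airflow experience"
candidates = semantic_search(query, k=25)  # coarse retrieval with embeddings

# score each (query, resume) pair and keep the 5 best
scores = reranker.predict([(query, r) for r in candidates["resume"]])
top5 = candidates.iloc[np.argsort(-scores)[:5]]
```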

Fine-tuning text embeddings involves adapting an embedding model for a particular domain. This is a powerful approach because most embedding models are based on a broad collection of text and knowledge. Thus, they may not optimally organize concepts for a specific industry, e.g. data science and AI.
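As a rough illustration, the classic Sentence Transformers training API lets you fit the base model on domain-specific pairs, e.g. query/resume pairs labeled by relevance. The training examples and hyperparameters below are placeholders, not data from the post.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

# placeholder training pairs: (query, resume excerpt) with a relevance label in [0, 1]
train_examples = [
    InputExample(texts=["data engineer with Airflow", "Built ETL pipelines in Airflow..."], label=1.0),
    InputExample(texts=["data engineer with Airflow", "Designed marketing campaigns..."], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# fine-tune the embedding model on the domain-specific pairs
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```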

Although everyone seems focused on the potential for AI agents and assistants, recent innovations in text-embedding models have unlocked countless opportunities for simple yet high-value ML use cases.

Here, we reviewed two widely applicable use cases: text classification and semantic search. Text embeddings enable simpler and cheaper alternatives to LLM-based methods while still capturing much of the value.
