
Sentence Transformers v5.4 Adds Multimodal Embedding and Reranking


The popular Sentence Transformers Python library has just released its v5.4 update, bringing powerful new capabilities for multimodal embedding and reranking. This significant upgrade empowers developers to work with text, images, audio, and video using a single, familiar API, opening up a world of possibilities for advanced AI applications.

What's New in Sentence Transformers v5.4?

At its core, Sentence Transformers v5.4 introduces the ability to encode and compare inputs from various modalities – not just text. This means you can now transform images, audio clips, and even videos into fixed-size numerical vectors (embeddings), placing them in a shared embedding space alongside text embeddings. This unified representation is key to understanding and comparing diverse data types.

The update features two main advancements: multimodal embedding models and multimodal reranker models. Embedding models map inputs from different modalities into this shared space, allowing you to compute similarity between, for example, a text query and an image. Multimodal reranker models, on the other hand, can score the relevance of mixed-modality pairs, providing a more nuanced understanding of how different data types relate to each other. You can dive deeper into these features in the Multimodal Sentence Transformers Blog Post.
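The shared-space idea can be illustrated with a small sketch. The vectors below are toy stand-ins for real model outputs (no model is loaded here); the point is that once text, image, and audio inputs live in one space, a single cosine-similarity function compares any pair of them:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings produced by one multimodal model.
# In a shared space, a text query and a matching image land close together.
text_query_emb = [0.9, 0.1, 0.2]     # e.g. from encoding "a photo of a cat"
image_emb      = [0.85, 0.15, 0.25]  # e.g. from encoding the cat photo itself
audio_emb      = [0.1, 0.9, 0.3]     # e.g. from encoding an unrelated audio clip

print(cosine_similarity(text_query_emb, image_emb))  # high: related pair
print(cosine_similarity(text_query_emb, audio_emb))  # lower: unrelated pair
```

With real models, `model.encode()` produces these vectors, and `model.similarity()` computes a matrix of such scores in one call.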

Why Multimodality Matters for AI Workflows

This upgrade dramatically expands the scope of applications you can build with Sentence Transformers. Imagine being able to search a database of product images using a natural language query, or building a retrieval-augmented generation (RAG) pipeline that can pull context from both text documents and relevant video clips. These are now easily achievable.

Key use cases enabled by v5.4 include visual document retrieval, cross-modal search, and sophisticated multimodal RAG pipelines. Developers can now create AI systems that are more aware of the world, processing information from various sources just like humans do. For instance, you could use models like BAAI/BGE-VL-MLLM-S1 to build robust systems that handle complex queries involving both text and images.
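A cross-modal search loop reduces to ranking precomputed embeddings against a query embedding. This is a minimal sketch with toy vectors standing in for `model.encode()` outputs; the item names and dimensions are illustrative, not real data:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cross_modal_search(query_emb, corpus, top_k=2):
    """Rank corpus items (any modality) by similarity to the query embedding."""
    scored = [(item_id, cosine(query_emb, emb)) for item_id, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings standing in for encoded images and video frames.
corpus = {
    "red_sneakers.jpg":   [0.9, 0.1, 0.1],
    "blue_jacket.jpg":    [0.1, 0.9, 0.1],
    "unboxing_video.mp4": [0.1, 0.1, 0.9],
}
query = [0.85, 0.2, 0.05]  # e.g. from encoding "red running shoes"

print(cross_modal_search(query, corpus, top_k=1))
```

In a real pipeline, the corpus embeddings would be computed once and stored in a vector index, and a multimodal reranker could then rescore the top candidates for higher precision.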

Getting Started and Key Requirements

Adopting these new multimodal features is straightforward. Installation requires modality-specific extras; for image support, for example, you'd run:

pip install -U "sentence-transformers[image]"

Loading a multimodal model is familiar:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

Once loaded, the model.encode() method is versatile, accepting images as URLs, local file paths, or PIL Image objects. For models like Qwen/Qwen3-VL-Embedding-2B, encoded image embeddings will have a dimension of 2048.
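To make the accepted input types concrete, here is a small illustrative helper (hypothetical, written for this post; Sentence Transformers performs equivalent dispatching internally) that categorizes an image argument the way `model.encode()` accepts it:

```python
from pathlib import Path

def classify_image_input(image):
    """Hypothetical helper: categorize image inputs as model.encode() accepts them."""
    if isinstance(image, str) and image.startswith(("http://", "https://")):
        return "url"          # fetched over the network before encoding
    if isinstance(image, (str, Path)):
        return "local_path"   # opened from disk before encoding
    # Anything else is assumed to be an already-loaded PIL.Image.Image object.
    return "pil_image"

print(classify_image_input("https://example.com/cat.png"))  # url
print(classify_image_input("photos/cat.png"))               # local_path
```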

It's important to note the hardware requirements for VLM (Vision-Language Model)-based models. Qwen3-VL-2B models typically require around 8 GB of VRAM, while larger 8B variants will need approximately 20 GB. CPU inference for these models is significantly slower than using a GPU, so a capable GPU is highly recommended for practical use.
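Those VRAM figures track roughly with parameter count. As a back-of-envelope estimate, assuming fp16/bf16 weights at 2 bytes per parameter:

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone (fp16/bf16 = 2 bytes/param)."""
    return num_params * bytes_per_param / 1e9

print(f"2B model weights: ~{weight_memory_gb(2e9):.0f} GB")  # ~4 GB
print(f"8B model weights: ~{weight_memory_gb(8e9):.0f} GB")  # ~16 GB
```

Activations, the vision tower, and runtime overhead account for the gap between these weight-only numbers and the ~8 GB and ~20 GB totals quoted above.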

Read more: Multimodal Sentence Transformers Blog Post and start integrating multimodal capabilities into your projects today!