Multimodal embedding models transform unstructured data from multiple modalities into a shared vector space. Voyage multimodal embedding models accept text, image, and video inputs, such as figures, photos, slide decks, document screenshots, and video clips, which removes the need for text extraction or ETL pipelines.
Unlike CLIP-style multimodal models, which process each modality separately, Voyage multimodal embedding models vectorize inputs that interleave text, images, and video. CLIP's architecture makes it unreliable for mixed-modality search, because text, image, and video vectors tend to align with items of the same modality, even when those items are irrelevant. Voyage multimodal embedding models reduce this bias by processing all inputs through a single backbone.
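As a minimal sketch of how an interleaved input is embedded, the following assumes the voyageai Python client and its multimodal_embed method; the image path and printed dimensionality are illustrative placeholders.

```python
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads the VOYAGE_API_KEY environment variable

# One document that interleaves text with an image, such as a slide screenshot.
document = [
    "Q3 revenue summary:",
    Image.open("q3_revenue_slide.png"),  # placeholder path
    "Figure 1 shows revenue by region.",
]

# All pieces pass through a single backbone and come back as one vector.
result = vo.multimodal_embed(
    inputs=[document],
    model="voyage-multimodal-3",
    input_type="document",
)
print(len(result.embeddings[0]))  # embedding dimensionality, e.g. 1024
```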
Available Models
| Model | Context Length | Dimensions | Description |
|---|---|---|---|
| | 32,000 tokens | 1024 (default), 256, 512, 2048 | Rich multimodal embedding model that can vectorize interleaved text and visual data, such as screenshots of PDFs, slides, tables, figures, videos, and more. To learn more, see the blog post. |
| | 32,000 tokens | 1024 | Processes text and images into unified embeddings. Supports images from 50,000 to 2 million pixels. To learn more, see the blog post. |
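To illustrate mixed-modality search over the shared vector space, here is a rough sketch, again assuming the voyageai Python client; the corpus, query, and file name are hypothetical. Because Voyage embeddings are normalized to length 1, the dot product equals the cosine similarity.

```python
import numpy as np
import voyageai
from PIL import Image

vo = voyageai.Client()

# A small corpus: one text-only document and one document screenshot.
docs = [
    ["Installation guide for the command-line tool."],
    [Image.open("pricing_table.png")],  # placeholder screenshot
]
doc_embeddings = vo.multimodal_embed(
    docs, model="voyage-multimodal-3", input_type="document"
).embeddings

# A text query can match either document because all vectors share one space.
query_embedding = vo.multimodal_embed(
    [["monthly subscription cost"]], model="voyage-multimodal-3", input_type="query"
).embeddings[0]

# Embeddings are length-normalized, so the dot product is the cosine similarity.
scores = np.array(doc_embeddings) @ np.array(query_embedding)
print(int(scores.argmax()))  # index of the best-matching document
```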
Tutorial
For a tutorial on using multimodal embeddings, see Semantic Search with Voyage AI Embeddings.