Multimodal Embeddings

Multimodal embedding models transform unstructured data from multiple modalities into a shared vector space. Voyage multimodal embedding models support text as well as visual inputs such as figures, photos, slide decks, document screenshots, and video clips, removing the need for text extraction or ETL pipelines.

Unlike multimodal models such as CLIP, which process text, images, and video separately, Voyage multimodal embedding models vectorize inputs containing interleaved text, images, and video. CLIP's architecture makes it poorly suited to mixed-modality search, because text, image, and video vectors tend to cluster by modality and often align with irrelevant items of the same modality. Voyage multimodal embedding models reduce this bias by processing all inputs through a single backbone.
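
The following minimal sketch shows what embedding an interleaved text-and-image input can look like with the voyageai Python client. The multimodal_embed call, the input_type value, and the file path are assumptions for illustration; consult the client reference for the exact signature.

```python
# Hedged sketch: embed an interleaved text + image input into the shared
# vector space. The method name and argument shapes are assumptions.
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads the VOYAGE_API_KEY environment variable

# Each input is a list that may interleave strings and PIL images.
inputs = [
    ["A slide summarizing Q3 revenue:", Image.open("q3_slide.png")],  # hypothetical file
]

result = vo.multimodal_embed(
    inputs=inputs,
    model="voyage-multimodal-3.5",
    input_type="document",
)
print(len(result.embeddings[0]))  # 1024 dimensions by default
```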

Model: voyage-multimodal-3.5
Context Length: 32,000 tokens
Dimensions: 1024 (default), 256, 512, 2048
Description: Rich multimodal embedding model that can vectorize interleaved text and visual data, such as screenshots of PDFs, slides, tables, figures, videos, and more.

To learn more, see the blog post.

For a tutorial on using multimodal embeddings, see Semantic Search with Voyage AI Embeddings.
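
As a quick illustration of the mixed-modality search described above, the sketch below embeds a text query and a few image documents into the shared space and ranks the documents by similarity. The file paths and the multimodal_embed call are illustrative assumptions, not part of this page's reference material.

```python
# Hedged sketch: rank image documents against a text query in the shared
# embedding space. Paths and method names are assumptions.
import numpy as np
import voyageai
from PIL import Image

vo = voyageai.Client()

# Embed image-only documents (hypothetical files).
docs = [[Image.open(p)] for p in ["figure1.png", "slide2.png", "photo3.png"]]
doc_embs = vo.multimodal_embed(
    docs, model="voyage-multimodal-3.5", input_type="document"
).embeddings

# Embed a text-only query into the same space.
query_emb = vo.multimodal_embed(
    [["bar chart of monthly active users"]],
    model="voyage-multimodal-3.5",
    input_type="query",
).embeddings[0]

# Assuming the embeddings are L2-normalized, the dot product acts as
# cosine similarity.
scores = np.array(doc_embs) @ np.array(query_emb)
best = int(np.argmax(scores))
print(f"Best match: document {best} (score {scores[best]:.3f})")
```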
