Multimodal embedding models transform unstructured data from multiple modalities into a shared vector space. Voyage multimodal embedding models accept text, image, and video inputs, such as figures, photos, slide decks, document screenshots, and video clips, which removes the need for text extraction or ETL pipelines.
Unlike CLIP-style multimodal models, which process each modality separately, Voyage multimodal embedding models vectorize inputs that interleave text, images, and video. CLIP's architecture makes it unreliable for mixed-modality search, because text, image, and video vectors tend to align with items of the same modality, even when those items are irrelevant. Voyage multimodal embedding models reduce this bias by processing all inputs through a single backbone.
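As a minimal sketch of how an interleaved input is embedded, the following assumes the voyageai Python client and its multimodal_embed method; the image path and printed dimensionality are illustrative placeholders.

```python
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads the VOYAGE_API_KEY environment variable

# One document that interleaves text with an image, such as a slide screenshot.
document = [
    "Q3 revenue summary:",
    Image.open("q3_revenue_slide.png"),  # placeholder path
    "Figure 1 shows revenue by region.",
]

# All pieces pass through a single backbone and come back as one vector.
result = vo.multimodal_embed(
    inputs=[document],
    model="voyage-multimodal-3",
    input_type="document",
)
print(len(result.embeddings[0]))  # embedding dimensionality, e.g. 1024
```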
Available Models
| Model | Context Length | Dimensions | Description |
|---|---|---|---|
| | 32,000 tokens | 1024 (default), 256, 512, 2048 | Rich multimodal embedding model that can vectorize interleaved text and visual data, such as screenshots of PDFs, slides, tables, figures, videos, and more. To learn more, see the blog post. |
| | 32,000 tokens | 1024 | Processes text and images into unified embeddings. Supports images from 50,000 to 2 million pixels. To learn more, see the blog post. |
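To illustrate mixed-modality search over the shared vector space, here is a rough sketch, again assuming the voyageai Python client; the corpus, query, and file name are hypothetical. Because Voyage embeddings are normalized to length 1, the dot product equals the cosine similarity.

```python
import numpy as np
import voyageai
from PIL import Image

vo = voyageai.Client()

# A small corpus: one text-only document and one document screenshot.
docs = [
    ["Installation guide for the command-line tool."],
    [Image.open("pricing_table.png")],  # placeholder screenshot
]
doc_embeddings = vo.multimodal_embed(
    docs, model="voyage-multimodal-3", input_type="document"
).embeddings

# A text query can match either document because all vectors share one space.
query_embedding = vo.multimodal_embed(
    [["monthly subscription cost"]], model="voyage-multimodal-3", input_type="query"
).embeddings[0]

# Embeddings are length-normalized, so the dot product is the cosine similarity.
scores = np.array(doc_embeddings) @ np.array(query_embedding)
print(int(scores.argmax()))  # index of the best-matching document
```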
Tutorial
For a tutorial on using multimodal embeddings, see Semantic Search with Voyage AI Embeddings.