
Tokenization

The first step of the embedding and reranking process is to split the input into a list of tokens. Our servers perform this tokenization step automatically when you call the API. The Python client also includes methods that let you try the tokenizer before calling the API.

Use the tokenize method to tokenize a list of texts for a specific model.

Example

import voyageai
# Initialize client (uses VOYAGE_API_KEY environment variable)
vo = voyageai.Client()
texts = [
"The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
"Photosynthesis in plants converts light energy into glucose and produces essential oxygen."
]
# Tokenize the texts
tokenized = vo.tokenize(texts, model="voyage-4-large")
for i in range(len(texts)):
    print(tokenized[i].tokens)
['The', 'ĠMediterranean', 'Ġdiet', 'Ġemphasizes', 'Ġfish', ',', 'Ġolive', 'Ġoil', ',', 'Ġand', 'Ġvegetables', ',', 'Ġbelieved', 'Ġto', 'Ġreduce', 'Ġchronic', 'Ġdiseases', '.']
['Photos', 'ynthesis', 'Ġin', 'Ġplants', 'Ġconverts', 'Ġlight', 'Ġenergy', 'Ġinto', 'Ġglucose', 'Ġand', 'Ġproduces', 'Ġessential', 'Ġoxygen', '.']
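
The Ġ prefix in the output above appears to follow the byte-level BPE convention, where Ġ marks a token that begins with a space. As a minimal sketch under that assumption, the following maps token strings back to readable pieces. The helper name `readable` is hypothetical and is not part of the voyageai client, and it handles only the space marker, not other byte-level symbols.

```python
# Token strings copied from the start of the first example output above.
tokens = ['The', 'ĠMediterranean', 'Ġdiet', 'Ġemphasizes', 'Ġfish', ',']

def readable(tokens):
    """Hypothetical helper: replace the leading-space marker 'Ġ' with a real space."""
    return [t.replace('Ġ', ' ') for t in tokens]

pieces = readable(tokens)
print(pieces)
# Joining the pieces reconstructs the original text fragment.
print(''.join(pieces))
```

This can make tokenizer output easier to inspect when you are checking how a phrase is split.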

View the parameters for the tokenize method.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| texts | Array of Strings (List[str]) | Yes | A list of texts to be tokenized. |
| model | String | Yes | Name of the model to tokenize for. Valid values: voyage-4-large, voyage-4, voyage-4-lite, rerank-2.5, rerank-2.5-lite, voyage-multimodal-3.5, voyage-multimodal-3. |

View the response for the tokenize method.

This method returns a list of tokenizers.Encoding objects:

| Attribute | Type | Description |
| --- | --- | --- |
| tokens | List[tokenizers.Encoding] | A list of tokenizers.Encoding objects, each representing the tokenized results of an input text string. |

Use the count_tokens method to count the number of tokens in a list of texts for a specific model.

Example

import voyageai
# Initialize client (uses VOYAGE_API_KEY environment variable)
vo = voyageai.Client()
texts = [
"The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
"Photosynthesis in plants converts light energy into glucose and produces essential oxygen."
]
# Count total tokens
total_tokens = vo.count_tokens(texts, model="voyage-4-large")
print(total_tokens)
32

View the parameters for the count_tokens method.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| texts | Array of Strings (List[str]) | Yes | A list of texts to count the tokens for. |
| model | String | Yes | Name of the model to count tokens for. Valid values: voyage-4-large, voyage-4, voyage-4-lite, rerank-2.5, rerank-2.5-lite, voyage-multimodal-3.5, voyage-multimodal-3. |

View the response for the count_tokens method.

This method returns an integer:

| Attribute | Type | Description |
| --- | --- | --- |
| total_tokens | Integer | The total number of tokens in the input texts. |
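
In the examples on this page, the total reported by count_tokens (32) equals the combined length of the two token lists printed by tokenize. The sketch below reproduces that check offline, using token lists copied from the tokenize output rather than a live call. Whether the two methods agree for every model is an assumption and is not stated here.

```python
# Token lists copied from the tokenize example output on this page.
first = ['The', 'ĠMediterranean', 'Ġdiet', 'Ġemphasizes', 'Ġfish', ',',
         'Ġolive', 'Ġoil', ',', 'Ġand', 'Ġvegetables', ',', 'Ġbelieved',
         'Ġto', 'Ġreduce', 'Ġchronic', 'Ġdiseases', '.']
second = ['Photos', 'ynthesis', 'Ġin', 'Ġplants', 'Ġconverts', 'Ġlight',
          'Ġenergy', 'Ġinto', 'Ġglucose', 'Ġand', 'Ġproduces',
          'Ġessential', 'Ġoxygen', '.']

# Sum the per-text token counts to get the total across all texts.
per_text = [len(first), len(second)]
total = sum(per_text)
print(per_text, total)
```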

Use the count_usage method to count the number of tokens and pixels in a list of inputs for a specific model.

Note

Voyage embedding models have context length limits. If your text exceeds the limit, truncate the text before calling the API, or set the truncation argument to True.

Example

import voyageai
from PIL import Image
# Initialize client (uses VOYAGE_API_KEY environment variable)
vo = voyageai.Client()
# Create input with text and image
inputs = [
["This is a banana.", Image.open('banana.jpg')]
]
# Count tokens and pixels
usage = vo.count_usage(inputs, model="voyage-multimodal-3.5")
print(usage)
{'text_tokens': 5, 'image_pixels': 2000000, 'total_tokens': 3576}

View the parameters for the count_usage method.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| inputs | List of dictionaries or List of Lists (List[dict] or List[List[Union[str, PIL.Image.Image]]]) | Yes | A list of text, image, and video sequences for which to count text tokens, image pixels, video frames, and total tokens. The list elements follow the same format as the inputs parameter of voyageai.Client.multimodal_embed(), except that image URLs are not supported. To learn more, see Multimodal Embeddings. |
| model | String | Yes | Name of the model (which affects how inputs are counted). Supported models are voyage-multimodal-3.5 (recommended) and voyage-multimodal-3. For other models that support only text, use the voyageai.Client.count_tokens() function to calculate token counts. |

View the response for the count_usage method.

This method returns a dictionary containing the following attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| text_tokens | Integer | The total number of text tokens in the list of inputs. |
| image_pixels | Integer | The total number of image pixels in the list of inputs. |
| video_pixels | Integer | The total number of video pixels in the list of inputs. |
| total_tokens | Integer | The combined total of text, image, and video tokens. Every 560 image pixels count as one token, and every 1120 video pixels count as one token. |
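
The pixel-to-token rule can be checked against the example response above. The following is a rough sketch, not the client's implementation: the function name `estimate_total_tokens` is hypothetical, and rounding fractional tokens down is an assumption that happens to reproduce the example value.

```python
IMAGE_PIXELS_PER_TOKEN = 560    # from the total_tokens description
VIDEO_PIXELS_PER_TOKEN = 1120   # from the total_tokens description

def estimate_total_tokens(text_tokens, image_pixels=0, video_pixels=0):
    """Hypothetical helper: combine text tokens with pixel-derived tokens.

    Assumption: partial tokens are rounded down (integer division).
    """
    image_tokens = image_pixels // IMAGE_PIXELS_PER_TOKEN
    video_tokens = video_pixels // VIDEO_PIXELS_PER_TOKEN
    return text_tokens + image_tokens + video_tokens

# Values from the count_usage example: 5 text tokens and 2,000,000 image pixels.
estimate = estimate_total_tokens(5, image_pixels=2_000_000)
print(estimate)  # 3576, matching total_tokens in the example response
```

For exact counts, rely on the value that count_usage returns rather than this estimate.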

Consider the following when using the tokenizer:

  • Modern NLP models typically convert a text string into a list of tokens. Frequent words, such as "you" and "apple", are tokens by themselves. In contrast, rare or long words are broken into multiple tokens. For example, "uncharacteristically" is split into four tokens: "un", "character", "ist", and "ically". One word corresponds to roughly 1.2 to 1.5 tokens on average, depending on the complexity of the domain.

    The tokens produced by our tokenizer have an average of 5 characters, suggesting that you can roughly estimate the number of tokens by dividing the number of characters in the text string by 5. To determine the exact number of tokens, use the count_tokens() method.

  • Voyage's tokenizers are also available on Hugging Face. You can access the tokenizer associated with a particular model by using the following code:

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained('voyageai/voyage-4-large')

  • tiktoken is a popular open-source tokenizer. Voyage models use different tokenizers, so our tokenizer generates a different list of tokens for a given text than tiktoken does. On average, our tokenizer produces 1.1 to 1.2 times as many tokens as tiktoken. To determine the exact number of tokens, use the count_tokens() method.
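
The rules of thumb in the list above can be turned into quick offline estimates. This is a minimal sketch with hypothetical helper names. It only applies the stated averages (about 5 characters per token, 1.2 to 1.5 tokens per word, and 1.1 to 1.2 times the tiktoken count), so the results are approximations.

```python
def estimate_from_chars(text):
    """Estimate tokens as character count divided by 5."""
    return len(text) / 5

def estimate_from_words(text):
    """Return a (low, high) range using 1.2 to 1.5 tokens per word."""
    n_words = len(text.split())
    return (1.2 * n_words, 1.5 * n_words)

def estimate_from_tiktoken(tiktoken_count):
    """Return a (low, high) range using 1.1 to 1.2 times a tiktoken count."""
    return (1.1 * tiktoken_count, 1.2 * tiktoken_count)

text = ("Photosynthesis in plants converts light energy into glucose "
        "and produces essential oxygen.")
print(estimate_from_chars(text))
print(estimate_from_words(text))
print(estimate_from_tiktoken(100))  # 100 is an illustrative tiktoken count
```

These estimates are useful for rough budgeting. Use count_tokens() when you need the exact number.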