Tokenization

Given an input, the first step of the embedding and reranking process is to split it into a list of tokens. Our servers automatically perform this tokenization step when you call the API. The Python client includes methods that allow you to try the tokenizer before calling the API.

`tokenize` Method

Use the tokenize method to tokenize a list of texts for a specific model.

Example

import voyageai
# Initialize client (uses VOYAGE_API_KEY environment variable)
vo = voyageai.Client()
texts = [
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
    "Photosynthesis in plants converts light energy into glucose and produces essential oxygen."
]
# Tokenize the texts
tokenized = vo.tokenize(texts, model="voyage-4-large")
# Each element is the tokenized result for the text at the same position
for encoding in tokenized:
    print(encoding.tokens)

['The', 'ĠMediterranean', 'Ġdiet', 'Ġemphasizes', 'Ġfish', ',', 'Ġolive', 'Ġoil', ',', 'Ġand', 'Ġvegetables', ',', 'Ġbelieved', 'Ġto', 'Ġreduce', 'Ġchronic', 'Ġdiseases', '.']
['Photos', 'ynthesis', 'Ġin', 'Ġplants', 'Ġconverts', 'Ġlight', 'Ġenergy', 'Ġinto', 'Ġglucose', 'Ġand', 'Ġproduces', 'Ġessential', 'Ġoxygen', '.']
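
In the output above, tokens that follow a space begin with the `Ġ` character. This looks like the usual byte-level BPE convention for marking a leading space, though that reading is an assumption based on the sample output rather than a documented guarantee. Under that assumption, a minimal sketch for turning a token list back into readable text might look like the following. The `tokens_to_text` helper is hypothetical and is not part of the `voyageai` client.

```python
# Assumption: "Ġ" marks a leading space, as in common byte-level BPE tokenizers.
SPACE_MARKER = "Ġ"


def tokens_to_text(tokens):
    """Join token strings and replace the space marker with a real space.

    This only handles plain ASCII text; byte-level tokenizers remap other
    bytes as well, which this sketch does not attempt to undo.
    """
    return "".join(tokens).replace(SPACE_MARKER, " ")


# Token list copied from the first line of the example output above
tokens = ['The', 'ĠMediterranean', 'Ġdiet', 'Ġemphasizes', 'Ġfish', ',',
          'Ġolive', 'Ġoil', ',', 'Ġand', 'Ġvegetables', ',', 'Ġbelieved',
          'Ġto', 'Ġreduce', 'Ġchronic', 'Ġdiseases', '.']

text = tokens_to_text(tokens)
print(text)
```

For this token list, the result matches the first input text in the example.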

Parameters

Parameter	Type	Required	Description
`texts`	Array of Strings (`List[str]`)	Yes	A list of texts to be tokenized.
`model`	String	Yes	Name of the model to tokenize for. Valid values: `voyage-4-large`, `voyage-4`, `voyage-4-lite`, `rerank-2.5`, `rerank-2.5-lite`, `voyage-multimodal-3.5`, `voyage-multimodal-3`.

Response

This method returns a list of `tokenizers.Encoding` objects, one per input text. Each object represents the tokenized result of an input text string and exposes the following attribute:

Attribute	Type	Description
`tokens`	List of Strings (`List[str]`)	The tokens produced for the corresponding input text, as printed in the example above.

`count_tokens` Method

Use the count_tokens method to count the number of tokens in a list of texts for a specific model.

Example

import voyageai
# Initialize client (uses VOYAGE_API_KEY environment variable)
vo = voyageai.Client()
texts = [
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
    "Photosynthesis in plants converts light energy into glucose and produces essential oxygen."
]
# Count total tokens
total_tokens = vo.count_tokens(texts, model="voyage-4-large")
print(total_tokens)

Parameters

Parameter	Type	Required	Description
`texts`	Array of Strings (`List[str]`)	Yes	A list of texts to count the tokens for.
`model`	String	Yes	Name of the model to count tokens for. Valid values: `voyage-4-large`, `voyage-4`, `voyage-4-lite`, `rerank-2.5`, `rerank-2.5-lite`, `voyage-multimodal-3.5`, `voyage-multimodal-3`.

Response

This method returns an integer:

Attribute	Type	Description
`total_tokens`	Integer	The total number of tokens in the input texts.
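
One common use of token counts is grouping texts into batches that stay under a token budget. The sketch below illustrates the idea under stated assumptions: `batch_by_token_budget` and `rough_count` are hypothetical helpers, the budget value is arbitrary, and `rough_count` is a whitespace-based stand-in so the example runs offline. In real use you would pass a function that wraps `vo.count_tokens(texts, model=...)` instead.

```python
from typing import Callable, List


def batch_by_token_budget(
    texts: List[str],
    count_fn: Callable[[List[str]], int],
    max_tokens_per_batch: int,
) -> List[List[str]]:
    """Greedily group texts so each batch stays within the token budget.

    A single text that exceeds the budget on its own is still placed in a
    batch by itself; this sketch does not truncate or split texts.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for text in texts:
        n_tokens = count_fn([text])
        # Start a new batch if adding this text would exceed the budget
        if current and current_tokens + n_tokens > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches


def rough_count(texts: List[str]) -> int:
    """Offline stand-in for a real token counter: counts whitespace-separated words."""
    return sum(len(t.split()) for t in texts)


batches = batch_by_token_budget(["a b c", "d e", "f g h i"], rough_count, 5)
print(batches)
```

The counter is passed in as a callable so that the batching logic can be exercised without an API key and later swapped for the client method.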

`count_usage` Method

Use the count_usage method to count the number of tokens and pixels in a list of inputs for a specific model.

Note

Voyage embedding models have context length limits. If your text exceeds the limit, truncate the text before calling the API, or set the truncation argument to True.

Example

import voyageai
from PIL import Image
# Initialize client (uses VOYAGE_API_KEY environment variable)
vo = voyageai.Client()
# Create input with text and image
inputs = [
    ["This is a banana.", Image.open('banana.jpg')]
]
]
# Count tokens and pixels
usage = vo.count_usage(inputs, model="voyage-multimodal-3.5")
print(usage)

{'text_tokens': 5, 'image_pixels': 2000000, 'total_tokens': 3576}

Parameters

Parameter	Type	Required	Description
`inputs`	List of dictionaries or List of Lists (`List[dict]` or `List[List[Union[str, PIL.Image.Image]]]`)	Yes	A list of text, image, and video sequences for which to count text tokens, image pixels, video frames, and total tokens. The list elements follow the same format as the `inputs` parameter of `voyageai.Client.multimodal_embed()`, except that image URLs are not supported. To learn more, see Multimodal Embeddings.
`model`	String	Yes	Name of the model (which affects how inputs are counted). Supported models are `voyage-multimodal-3.5` (recommended) and `voyage-multimodal-3`. For other models that support only text, use the `voyageai.Client.count_tokens()` function to calculate token counts.

Response

This method returns a dictionary containing the following attributes:

Attribute	Type	Description
`text_tokens`	Integer	The total number of text tokens in the list of inputs.
`image_pixels`	Integer	The total number of image pixels in the list of inputs.
`video_pixels`	Integer	The total number of video pixels in the list of inputs.
`total_tokens`	Integer	The combined total of text, image, and video tokens. Every 560 image pixels counts as a token, while every 1120 video pixels counts as a token.
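
The conversion rates in the table can be checked against the example output earlier in this section. The sketch below is only an approximation of the arithmetic: `estimate_total_tokens` is a hypothetical helper, and rounding partial tokens down is an assumption chosen because it reproduces the sample output. It is not a documented rule, so use `count_usage` for authoritative numbers.

```python
IMAGE_PIXELS_PER_TOKEN = 560    # from the table: 560 image pixels per token
VIDEO_PIXELS_PER_TOKEN = 1120   # from the table: 1120 video pixels per token


def estimate_total_tokens(text_tokens, image_pixels=0, video_pixels=0):
    """Estimate total tokens from text tokens and pixel counts.

    Assumption: partial tokens are rounded down (integer division).
    """
    image_tokens = image_pixels // IMAGE_PIXELS_PER_TOKEN
    video_tokens = video_pixels // VIDEO_PIXELS_PER_TOKEN
    return text_tokens + image_tokens + video_tokens


# Values taken from the example output: 5 text tokens and 2,000,000 image pixels
estimated = estimate_total_tokens(5, image_pixels=2_000_000)
print(estimated)  # 3576, matching 'total_tokens' in the example output
```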

Considerations

Consider the following when using the tokenizer:

Modern NLP models typically convert a text string into a list of tokens. Frequent words, such as "you" and "apple," are tokens by themselves. In contrast, rare or long words are broken into multiple tokens. For example, "uncharacteristically" is split into four tokens: "un", "character", "ist", and "ically". On average, one word corresponds to roughly 1.2 to 1.5 tokens, depending on the complexity of the domain.

The tokens produced by our tokenizer average 5 characters, so you can roughly estimate the number of tokens by dividing the number of characters in the text string by 5. To determine the exact number of tokens, use the count_tokens() method.

Voyage's tokenizers are also available on Hugging Face. You can access the tokenizer associated with a particular model by using the following code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('voyageai/voyage-4-large')

tiktoken is a popular open-source tokenizer. Voyage models use different tokenizers, so our tokenizer generates a different list of tokens for a given text than tiktoken does. Statistically, the number of tokens produced by our tokenizer is on average 1.1 to 1.2 times that of tiktoken. To determine the exact number of tokens, use the count_tokens() method.
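
The rules of thumb above (about 5 characters per token, 1.2 to 1.5 tokens per word, and 1.1 to 1.2 times the tiktoken count) can be turned into quick offline estimators. The following is a rough sketch only: the function names are hypothetical, word splitting on whitespace is a simplifying assumption, and the results are approximations rather than exact counts.

```python
def estimate_tokens_from_chars(text, chars_per_token=5.0):
    """Estimate tokens by dividing the character count by ~5."""
    return round(len(text) / chars_per_token)


def estimate_tokens_from_words(text, low=1.2, high=1.5):
    """Return a (low, high) token estimate from a whitespace word count."""
    n_words = len(text.split())
    return n_words * low, n_words * high


def estimate_tokens_from_tiktoken(tiktoken_count, low=1.1, high=1.2):
    """Return a (low, high) estimate given a token count obtained from tiktoken."""
    return tiktoken_count * low, tiktoken_count * high


sample = "Photosynthesis in plants converts light energy into glucose and produces essential oxygen."
print(estimate_tokens_from_chars(sample))
print(estimate_tokens_from_words(sample))
print(estimate_tokens_from_tiktoken(100))
```

These estimates are useful for quick capacity planning; for billing or context-limit checks, the exact count from `count_tokens` is the safer choice.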
