Given an input, the first step of the embedding and reranking process is to split it into a list of tokens. Our servers automatically perform this tokenization step when you call the API. The Python client includes methods that allow you to try the tokenizer before calling the API.
tokenize Method
Use the tokenize method to tokenize a list of texts for a specific model.
Example
```python
import voyageai

# Initialize client (uses VOYAGE_API_KEY environment variable)
vo = voyageai.Client()

texts = [
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
    "Photosynthesis in plants converts light energy into glucose and produces essential oxygen."
]

# Tokenize the texts
tokenized = vo.tokenize(texts, model="voyage-4-large")
for encoding in tokenized:
    print(encoding.tokens)
```
```
['The', 'ĠMediterranean', 'Ġdiet', 'Ġemphasizes', 'Ġfish', ',', 'Ġolive', 'Ġoil', ',', 'Ġand', 'Ġvegetables', ',', 'Ġbelieved', 'Ġto', 'Ġreduce', 'Ġchronic', 'Ġdiseases', '.']
['Photos', 'ynthesis', 'Ġin', 'Ġplants', 'Ġconverts', 'Ġlight', 'Ġenergy', 'Ġinto', 'Ġglucose', 'Ġand', 'Ġproduces', 'Ġessential', 'Ġoxygen', '.']
```
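The `Ġ` prefix in the output is a byte-level BPE convention marking tokens that begin with a space. As a quick offline illustration (plain Python, not part of the client library), you can reconstruct the original string from the token list by replacing the marker:

```python
# Tokens as returned for the first example sentence; the "Ġ" prefix
# marks tokens that follow a space in the original text.
tokens = ['The', 'ĠMediterranean', 'Ġdiet', 'Ġemphasizes', 'Ġfish', ',',
          'Ġolive', 'Ġoil', ',', 'Ġand', 'Ġvegetables', ',', 'Ġbelieved',
          'Ġto', 'Ġreduce', 'Ġchronic', 'Ġdiseases', '.']

# Join the tokens and turn the space marker back into a literal space.
text = ''.join(tokens).replace('Ġ', ' ')
print(text)
# The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.
```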
| Parameter | Type | Required | Description |
|---|---|---|---|
| texts | Array of Strings | Yes | A list of texts to be tokenized. |
| model | String | Yes | Name of the model whose tokenizer is used. |
This method returns a list of `tokenizers.Encoding` objects, one per input text and in the same order. Each encoding exposes the token strings via its `tokens` attribute, as shown in the example above.
count_tokens Method
Use the count_tokens method to count the number of tokens in a list of texts for a specific model.
Example
```python
import voyageai

# Initialize client (uses VOYAGE_API_KEY environment variable)
vo = voyageai.Client()

texts = [
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
    "Photosynthesis in plants converts light energy into glucose and produces essential oxygen."
]

# Count total tokens
total_tokens = vo.count_tokens(texts, model="voyage-4-large")
print(total_tokens)
```
32
| Parameter | Type | Required | Description |
|---|---|---|---|
| texts | Array of Strings | Yes | A list of texts to count the tokens for. |
| model | String | Yes | Name of the model whose tokenizer is used for counting. |
count_usage Method
Use the count_usage method to count the number of tokens and pixels in a list of inputs for a specific model.
Note
Voyage embedding models have context length limits. If your text exceeds the limit, truncate the text before calling the API, or set the truncation parameter to True.
Example
```python
import voyageai
import PIL.Image

# Initialize client (uses VOYAGE_API_KEY environment variable)
vo = voyageai.Client()

# Create input with text and image
inputs = [
    ["This is a banana.", PIL.Image.open('banana.jpg')]
]

# Count tokens and pixels
usage = vo.count_usage(inputs, model="voyage-multimodal-3.5")
print(usage)
```
{'text_tokens': 5, 'image_pixels': 2000000, 'total_tokens': 3576}
| Parameter | Type | Required | Description |
|---|---|---|---|
| inputs | List of dictionaries or List of Lists | Yes | A list of text, image, and video sequences for which to count text tokens, image pixels, video frames, and total tokens. The list elements follow the same format as the inputs to the multimodal embedding methods. |
| model | String | Yes | Name of the model (which affects how inputs are counted). |
This method returns a dictionary containing the following attributes:

| Attribute | Type | Description |
|---|---|---|
| text_tokens | Integer | The total number of text tokens in the list of inputs. |
| image_pixels | Integer | The total number of image pixels in the list of inputs. |
| video_pixels | Integer | The total number of video pixels in the list of inputs. |
| total_tokens | Integer | The combined total of text, image, and video tokens. Every 560 image pixels counts as one token, and every 1120 video pixels counts as one token. |
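Given these rates, you can sanity-check a count_usage result offline. The sketch below is a plain-Python illustration, not a client method; it assumes fractional tokens round down, which matches the example output above (5 text tokens plus 2,000,000 image pixels yields 3576 total tokens):

```python
def estimate_total_tokens(text_tokens: int, image_pixels: int = 0,
                          video_pixels: int = 0) -> int:
    """Combine text, image, and video usage into total tokens, using the
    documented rates: 560 image pixels per token and 1120 video pixels
    per token (assuming fractional tokens round down)."""
    return text_tokens + image_pixels // 560 + video_pixels // 1120

# Matches the count_usage example output above:
print(estimate_total_tokens(text_tokens=5, image_pixels=2_000_000))  # 3576
```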
Considerations
Consider the following when using the tokenizer:
- Modern NLP models typically convert a text string into a list of tokens. Frequent words, such as "you" and "apple," are tokens by themselves. In contrast, rare or long words are broken into multiple tokens; for example, "uncharacteristically" is split into four tokens: "un", "character", "ist", and "ically". One word roughly corresponds to 1.2 to 1.5 tokens on average, depending on the complexity of the domain.
- The tokens produced by our tokenizer average about 5 characters each, so you can roughly estimate the number of tokens by dividing the number of characters in the text string by 5. To determine the exact number of tokens, use the count_tokens() method.
- Voyage's tokenizers are also available on Hugging Face. You can access the tokenizer associated with a particular model by using the following code:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('voyageai/voyage-4-large')
```

- tiktoken is a popular open-source tokenizer, but Voyage models use different tokenizers, so our tokenizer generates a different list of tokens for a given text than tiktoken does. Statistically, the number of tokens produced by our tokenizer is on average 1.1 to 1.2 times that of tiktoken. To determine the exact number of tokens, use the count_tokens() method.
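The divide-by-5 heuristic mentioned above can be sketched in plain Python. This is only a rough local estimate (the helper name is illustrative, not part of the client); use count_tokens() when you need the exact count:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~5-characters-per-token average."""
    return max(1, round(len(text) / 5))

text = ("The Mediterranean diet emphasizes fish, olive oil, and vegetables, "
        "believed to reduce chronic diseases.")
print(estimate_tokens(text))  # 21, versus the exact 18 tokens shown earlier
```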