
How to Index Fields for Autocompletion

You can use the MongoDB Search autocomplete type to index text values in string fields for autocompletion. You can query fields indexed as the autocomplete type using the autocomplete operator.
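For context, a query with the autocomplete operator might look like the following aggregation pipeline sketch. The collection, the title field, and the index name default are assumptions for illustration; adjust them to your own deployment:

```javascript
// Sketch of an autocomplete query. The field name "title" and the index
// name "default" are hypothetical; match them to your index definition.
const pipeline = [
  {
    $search: {
      index: "default",       // MongoDB Search index to query
      autocomplete: {
        query: "qui",         // partial input typed by the user
        path: "title"         // field indexed as the autocomplete type
      }
    }
  },
  { $limit: 5 }               // cap the number of suggestions returned
];
// In mongosh you would run: db.movies.aggregate(pipeline)
```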

You can also use the autocomplete type to index:

  • Fields whose value is an array of strings. To learn more, see How to Index the Elements of an Array.

  • String fields inside an array of documents indexed as the embeddedDocuments type.

Tip

If you have a large number of documents and a wide range of data against which you want to run MongoDB Search queries using the autocomplete operator, building this index can take some time. Alternatively, you can create a separate index with only the autocomplete type to reduce the impact on other indexes and queries while the index builds.

To learn more, see MongoDB Search Index Performance Considerations.

The autocomplete type is not included in dynamic mappings by default. To index fields as the autocomplete type, use static mappings or configure a typeSet to include autocomplete in dynamic mappings.
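As a sketch, a static mapping that indexes a single field as the autocomplete type might look like the following. The title field name is a hypothetical example, and the option values shown are the documented defaults:

```javascript
// Sketch of a static mapping for one autocomplete field. The "title"
// field is hypothetical; the option values shown are the defaults.
const indexDefinition = {
  mappings: {
    dynamic: false,             // autocomplete is not dynamically mapped
    fields: {
      title: {
        type: "autocomplete",
        analyzer: "lucene.standard",
        tokenization: "edgeGram",
        minGrams: 2,
        maxGrams: 15,
        foldDiacritics: true
      }
    }
  }
};
// In mongosh: db.movies.createSearchIndex("default", indexDefinition)
```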

The MongoDB Search autocomplete type takes the following options. Each option is listed with its type, necessity, description, and default value:

type
  Type: string
  Necessity: required
  Description: Human-readable label that identifies this field type. Value must be autocomplete.

analyzer
  Type: string
  Necessity: optional
  Description: Name of the analyzer to use with this autocomplete mapping. You can use any MongoDB Search analyzer except the lucene.kuromoji language analyzer and custom analyzers that use certain tokenizers and token filters.
  Default: lucene.standard

maxGrams
  Type: int
  Necessity: optional
  Description: Maximum number of characters per indexed sequence. The value limits the character length of indexed tokens. When you search for terms longer than the maxGrams value, MongoDB Search truncates the tokens to the maxGrams length.
  Note: We recommend setting the maxGrams value to 15 or less to optimize performance. A higher value increases the size of the index and can impact performance. If you require more than 15 characters for autocompletion, we recommend configuring a custom analyzer to avoid truncating queries.
  Default: 15

minGrams
  Type: int
  Necessity: optional
  Description: Minimum number of characters per indexed sequence. We recommend a minimum value of 4. A value less than 4 can impact performance because the size of the index can become very large. We recommend the default value of 2 for edgeGram tokenization only.
  Default: 2

tokenization
  Type: enum
  Necessity: optional
  Description: Tokenization strategy to use when indexing the field for autocompletion. Value can be one of the following:

  • edgeGram - create indexable tokens, referred to as grams, from variable-length character sequences starting at the left side of the words as delimited by the analyzer used with this autocomplete mapping.

  • rightEdgeGram - create indexable tokens, referred to as grams, from variable-length character sequences starting at the right side of the words as delimited by the analyzer used with this autocomplete mapping.

  • nGram - create indexable tokens, referred to as grams, by sliding a variable-length character window over a word. MongoDB Search creates more tokens for nGram than edgeGram or rightEdgeGram. Therefore, nGram takes more space and time to index the field. nGram is better suited for querying languages with long, compound words or languages that don't use spaces.

edgeGram, rightEdgeGram, and nGram are applied at the letter-level. For example, consider the following sentence:

The quick brown fox jumps over the lazy dog.

When tokenized with a minGrams value of 2 and a maxGrams value of 5, MongoDB Search indexes the following sequences of characters, depending on the tokenization value you choose:

edgeGram:
th
the
the{SPACE}
the q
qu
qui
quic
quick
...

rightEdgeGram:
og
dog
{SPACE}dog
y dog
zy
azy
lazy
{SPACE}lazy
he
the
{SPACE}the
r the
er
ver
over
{SPACE}over
...

nGram:
th
the
the{SPACE}
the q
he
he{SPACE}
he q
he qu
e{SPACE}
e q
e qu
e qui
{SPACE}q
{SPACE}qu
{SPACE}qui
{SPACE}quic
qu
qui
quic
quick
...
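Ignoring the cross-word shingling that MongoDB Search also performs, the three per-word strategies can be sketched as follows. This is an illustration of the concept, not MongoDB's implementation:

```javascript
// Illustrative per-word gram generation (not MongoDB's implementation);
// shingling across word boundaries is omitted for clarity.
function edgeGrams(word, minGrams, maxGrams) {
  // Grams grow from the left edge of the word.
  const grams = [];
  for (let n = minGrams; n <= Math.min(maxGrams, word.length); n++) {
    grams.push(word.slice(0, n));
  }
  return grams;
}

function rightEdgeGrams(word, minGrams, maxGrams) {
  // Grams grow from the right edge of the word.
  const grams = [];
  for (let n = minGrams; n <= Math.min(maxGrams, word.length); n++) {
    grams.push(word.slice(word.length - n));
  }
  return grams;
}

function nGrams(word, minGrams, maxGrams) {
  // A window of every length between minGrams and maxGrams slides over
  // the word, which is why nGram emits the most tokens.
  const grams = [];
  for (let n = minGrams; n <= Math.min(maxGrams, word.length); n++) {
    for (let i = 0; i + n <= word.length; i++) {
      grams.push(word.slice(i, i + n));
    }
  }
  return grams;
}

console.log(edgeGrams("quick", 2, 5));      // [ 'qu', 'qui', 'quic', 'quick' ]
console.log(rightEdgeGrams("quick", 2, 5)); // [ 'ck', 'ick', 'uick', 'quick' ]
console.log(nGrams("dog", 2, 5));           // [ 'do', 'og', 'dog' ]
```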

Indexing a field for autocomplete with an edgeGram, rightEdgeGram, or nGram tokenization strategy is more computationally expensive than indexing a string field. The index takes more space than an index with regular string fields.

For the specified tokenization strategy, MongoDB Search applies the following process to concatenate sequential tokens before emitting them. This process is sometimes referred to as "shingling". MongoDB Search emits tokens between minGrams and maxGrams characters in length:

  • Keeps tokens fewer than minGrams characters in length.

  • Joins tokens greater than minGrams but fewer than maxGrams characters in length to subsequent tokens, to create tokens up to the specified maximum number of characters in length.

  Default: edgeGram

foldDiacritics
  Type: boolean
  Necessity: optional
  Description: Flag that indicates whether to remove diacritics from the indexed and query text. Value can be one of the following:

  • true - ignore diacritic marks in the index and query text. For example, a search for cafè returns results with the characters cafè and cafe because MongoDB Search returns results with and without diacritics.

  • false - don't ignore diacritic marks in the index and query text. MongoDB Search returns only results that match the diacritics in the query. For example, a search for cafè returns results only with the characters cafè, and a search for cafe returns results only with the characters cafe.

  Default: true
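Diacritic folding of this kind can be approximated in application code with Unicode NFD normalization. The sketch below is an analogy for the behavior of foldDiacritics, not MongoDB's internal routine:

```javascript
// Approximation of diacritic folding via Unicode NFD normalization;
// this mirrors the idea behind foldDiacritics: true, not MongoDB's
// actual implementation.
function foldDiacritics(text) {
  // NFD decomposes "è" into "e" + a combining accent, which the
  // Unicode Diacritic property then matches and removes.
  return text.normalize("NFD").replace(/\p{Diacritic}/gu, "");
}

console.log(foldDiacritics("cafè")); // cafe
```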

similarity.type
  Type: string
  Necessity: optional
  Description: Name of the similarity algorithm to use with this autocomplete mapping when scoring with the autocomplete operator. Value can be one of the following: bm25, boolean, or stableTfl.
  To learn more about the available similarity algorithms, see Score Details.
  Default: bm25

To learn more about the autocomplete operator and see example queries, see autocomplete.

For examples that demonstrate how to run case-insensitive, prefix, starts with, and contains queries using regex expressions, see Use MongoDB Search Instead of Regex Queries.
