
How to Index Fields for Autocompletion

On this page

  • Define the Index for the autocomplete Type
  • Configure autocomplete Field Properties
  • Try an Example for the autocomplete Type

You can use the Atlas Search autocomplete type to index text values in string fields for autocompletion. You can query fields indexed as the autocomplete type using the autocomplete operator.

You can also use the autocomplete type to index:

  • Fields whose value is an array of strings. To learn more, see How to Index the Elements of an Array.

  • String fields inside an array of documents indexed as the embeddedDocuments type.

Tip

If you have a large number of documents and a wide range of data against which you want to run Atlas Search queries using the autocomplete operator, building this index can take some time. Alternatively, you can create a separate index with only the autocomplete type to reduce the impact on other indexes and queries while the index builds.

To learn more, see Atlas Search Index Performance Considerations.

Define the Index for the autocomplete Type

Atlas Search doesn't dynamically index fields of type autocomplete. You must use static mappings to index autocomplete fields. You can use the Visual Editor or the JSON Editor in the Atlas UI to index fields of type autocomplete.

To define the index for the autocomplete type, choose your preferred configuration method in the Atlas UI and then select the database and collection.

  1. Click Refine Your Index to configure your index.

  2. In the Field Mappings section, click Add Field to open the Add Field Mapping window.

  3. Click Customized Configuration.

  4. Select the field to index from the Field Name dropdown.

    Note

    You can't index fields that contain the dollar ($) sign at the start of the field name.

    For field names that contain the term email or url, the Atlas Search Visual Editor recommends using a custom analyzer with the uaxUrlEmail tokenizer for indexing email addresses or URL values. Click Create urlEmailAnalyzer to create the custom analyzer and apply it in the Autocomplete Properties for the field. A sketch of an index definition that uses such a custom analyzer appears after this procedure.

  5. Click the Data Type dropdown and select Autocomplete.

  6. (Optional) Expand and configure the Autocomplete Properties for the field. To learn more, see Configure autocomplete Field Properties.

  7. Click Add.
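For reference, the following is a rough, illustrative sketch of the shape an index definition takes when it defines a custom analyzer with the uaxUrlEmail tokenizer and applies it to an autocomplete field. The analyzer name matches the Visual Editor's button label above, but the field name and gram settings are assumptions, and the definition the Visual Editor actually generates may differ.

# Sketch only: a custom analyzer that uses the uaxUrlEmail tokenizer, applied
# to an autocomplete field. Field name and gram settings are illustrative.
import json

url_email_index_definition = {
    "analyzers": [
        {"name": "urlEmailAnalyzer", "tokenizer": {"type": "uaxUrlEmail"}}
    ],
    "mappings": {
        "dynamic": False,
        "fields": {
            "email": {
                "type": "autocomplete",
                "analyzer": "urlEmailAnalyzer",
                "tokenization": "edgeGram",
                "minGrams": 3,
                "maxGrams": 15,
                "foldDiacritics": False,
            }
        },
    },
}

print(json.dumps(url_email_index_definition, indent=2))  # paste the output into the JSON Editor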

The following is the JSON syntax for the autocomplete type. Replace the default index definition with this syntax. To learn more about the fields, see Field Properties.

{
  "mappings": {
    "dynamic": true|false,
    "fields": {
      "<field-name>": {
        "type": "autocomplete",
        "analyzer": "<lucene-analyzer>",
        "tokenization": "edgeGram|rightEdgeGram|nGram",
        "minGrams": <2>,
        "maxGrams": <15>,
        "foldDiacritics": true|false
      }
    }
  }
}
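If you manage indexes programmatically instead of through the Atlas UI, the sketch below shows one way to create an index with this syntax using PyMongo. It assumes a driver version that supports create_search_index (PyMongo 4.5 or later) and your own connection string; the field name, analyzer, and gram values are illustrative placeholders.

# Sketch: create an autocomplete index with PyMongo instead of the Atlas UI.
# Assumes PyMongo 4.5+ (for create_search_index); the URI and values below are placeholders.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["sample_mflix"]["movies"]

definition = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "title": {                      # <field-name> from the syntax above
                "type": "autocomplete",
                "analyzer": "lucene.standard",
                "tokenization": "edgeGram",
                "minGrams": 2,
                "maxGrams": 15,
                "foldDiacritics": True,
            }
        },
    }
}

collection.create_search_index(SearchIndexModel(definition=definition, name="default"))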

Configure autocomplete Field Properties

The Atlas Search autocomplete type takes the following options:

type (string, required)

Human-readable label that identifies this field type. Value must be autocomplete.

analyzer (string, optional; default: lucene.standard)

Name of the analyzer to use with this autocomplete mapping. You can use any Atlas Search analyzer except the lucene.kuromoji language analyzer and custom analyzers that use certain tokenizers and token filters, such as the nGram and edgeGram tokenizers and token filters.

maxGrams (int, optional; default: 15)

Maximum number of characters per indexed sequence. The value limits the character length of indexed tokens. When you search for terms longer than the maxGrams value, Atlas Search truncates the tokens to the maxGrams length.

minGrams (int, optional; default: 2)

Minimum number of characters per indexed sequence. We recommend 4 for the minimum value. A value less than 4 could impact performance because the size of the index can become very large. We recommend the default value of 2 for edgeGram only.

tokenization (enum, optional; default: edgeGram)

Tokenization strategy to use when indexing the field for autocompletion. Value can be one of the following:

  • edgeGram - create indexable tokens, referred to as grams, from variable-length character sequences that start at the left side of the words as delimited by the analyzer used with this autocomplete mapping.

  • rightEdgeGram - create indexable tokens, referred to as grams, from variable-length character sequences that start at the right side of the words as delimited by the analyzer used with this autocomplete mapping.

  • nGram - create indexable tokens, referred to as grams, by sliding a variable-length character window over a word. Atlas Search creates more tokens for nGram than for edgeGram or rightEdgeGram, so nGram takes more space and time to index the field. nGram is better suited for querying languages with long, compound words or languages that don't use spaces.

edgeGram, rightEdgeGram, and nGram are applied at the letter level. For example, consider the following sentence:

The quick brown fox jumps over the lazy dog.

When tokenized with a minGrams value of 2 and a maxGrams value of 5, Atlas Search indexes the following sequences of characters, depending on the tokenization value you choose (see the sketch after this list of options for an illustration of how grams are generated):

edgeGram: th, the, the{SPACE}, the q, qu, qui, quic, quick, ...

rightEdgeGram: og, dog, {SPACE}dog, y dog, zy, azy, lazy, {SPACE}lazy, he, the, {SPACE}the, r the, er, ver, over, {SPACE}over, ...

nGram: th, the, the{SPACE}, the q, he, he{SPACE}, he q, he qu, e{SPACE}, e q, e qu, e qui, {SPACE}q, {SPACE}qu, {SPACE}qui, {SPACE}quic, qu, qui, quic, quick, ...

Indexing a field for autocompletion with the edgeGram, rightEdgeGram, or nGram tokenization strategy is more computationally expensive than indexing a string field, and the index takes more space than an index with regular string fields.

For the specified tokenization strategy, Atlas Search applies the following process, sometimes referred to as "shingling", to concatenate sequential tokens before emitting them, so that it emits tokens between minGrams and maxGrams characters in length:

  • Keeps tokens that are shorter than minGrams characters.

  • Joins tokens that are longer than minGrams but shorter than maxGrams characters with subsequent tokens to create tokens up to the specified maximum number of characters in length.

foldDiacritics (boolean, optional; default: true)

Flag that indicates whether to include or remove diacritics from the indexed text. Value can be one of the following:

  • true - ignore diacritic marks in the index and in the query text. Atlas Search returns results with and without diacritic marks. For example, a search for cafè returns results that contain the characters cafè and cafe.

  • false - include diacritic marks in the index and in the query text. Atlas Search returns only results that match the diacritics in the query. For example, a search for cafè returns results only with the characters cafè, and a search for cafe returns results only with the characters cafe.
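If it helps to see gram generation concretely, the following is a minimal, illustrative Python sketch of how edgeGram, rightEdgeGram, and nGram tokenization carve a single analyzed word into grams. It is not Atlas Search's implementation, and it ignores shingling across word boundaries.

# Illustrative only: rough gram generation for one analyzed word.
# Not the Atlas Search implementation; shingling across words is omitted.

def edge_grams(word, min_grams=2, max_grams=15):
    """Grams anchored at the left edge of the word."""
    return [word[:n] for n in range(min_grams, min(max_grams, len(word)) + 1)]

def right_edge_grams(word, min_grams=2, max_grams=15):
    """Grams anchored at the right edge of the word."""
    return [word[-n:] for n in range(min_grams, min(max_grams, len(word)) + 1)]

def n_grams(word, min_grams=2, max_grams=15):
    """All substrings between min_grams and max_grams characters long."""
    grams = []
    for size in range(min_grams, max_grams + 1):
        for start in range(0, len(word) - size + 1):
            grams.append(word[start:start + size])
    return grams

print(edge_grams("quick", 2, 5))        # ['qu', 'qui', 'quic', 'quick']
print(right_edge_grams("quick", 2, 5))  # ['ck', 'ick', 'uick', 'quick']
print(n_grams("quick", 2, 5))           # ['qu', 'ui', 'ic', 'ck', 'qui', ...]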

Try an Example for the autocomplete Type

The following index definition example uses the sample_mflix.movies collection. If you have the sample data already loaded on your cluster, you can use the Visual Editor or JSON Editor in the Atlas UI to configure the index. After you select your preferred configuration method, select the database and collection, and refine your index to add field mappings.

The following index definition example indexes only the title field as the autocomplete type to support search-as-you-type queries against that field using the autocomplete operator. The index definition also specifies the following:

  • Use the standard analyzer to divide text values into terms based on word boundaries.

  • Use the edgeGram tokenization strategy to index characters starting at the left side of the words.

  • Index a minimum of 3 characters per indexed sequence.

  • Index a maximum of 5 characters per indexed sequence.

  • Include diacritic marks in the index and query text.

  1. In the Add Field Mapping window, select title from the Field Name dropdown.

  2. Click the Data Type dropdown and select Autocomplete.

  3. Make the following changes to the Autocomplete Properties:

    Max Grams

    Set value to 5.

    Min Grams

    Set value to 3.

    Tokenization

    Select edgeGram from dropdown.

    Fold Diacritics

    Select false from dropdown.

  4. Click Add.

Replace the default index definition with the following index definition.

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": {
        "type": "autocomplete",
        "analyzer": "lucene.standard",
        "tokenization": "edgeGram",
        "minGrams": 3,
        "maxGrams": 5,
        "foldDiacritics": false
      }
    }
  }
}
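Once the index builds, you might verify it with a search-as-you-type query like the following PyMongo sketch. The index name default, the connection string, and the partial search term are assumptions for illustration; the term is at least 3 characters because minGrams is 3.

# Sketch: query the title field indexed above with the autocomplete operator.
# Placeholder URI; index name and search term are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI

results = client["sample_mflix"]["movies"].aggregate([
    {
        "$search": {
            "index": "default",
            "autocomplete": {
                "query": "pri",   # partial input typed by a user (>= minGrams characters)
                "path": "title",
            },
        }
    },
    {"$limit": 5},
    {"$project": {"_id": 0, "title": 1}},
])
for doc in results:
    print(doc["title"])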

The following index definition example also uses the sample_mflix.movies collection. As with the previous example, you can use the Visual Editor or JSON Editor in the Atlas UI to configure the index: select your preferred configuration method, select the database and collection, and refine your index to add field mappings.

You can also index a field as multiple types by specifying the additional types in an array. For example, the following index definition indexes the title field as the following types:

  • autocomplete type to support autocompletion for queries using the autocomplete operator.

  • string type to support full-text search using operators such as text, phrase, and so on.

  1. In the Add Field Mapping window, select title from the Field Name dropdown.

  2. Click the Data Type dropdown and select Autocomplete.

  3. Make the following changes to the Autocomplete Properties:

    Max Grams

    Set value to 15.

    Min Grams

    Set value to 2.

    Tokenization

    Select edgeGram from dropdown.

    Fold Diacritics

    Select false from dropdown.

  4. Click Add.

  5. Click Add Field again to open the Add Field Mapping window and select title from the Field Name dropdown.

  6. Click the Data Type dropdown and select String.

  7. Accept the default String Properties settings and click Add.

Replace the default index definition with the following index definition.

{
  "mappings": {
    "dynamic": true|false,
    "fields": {
      "title": [
        {
          "type": "autocomplete",
          "analyzer": "lucene.standard",
          "tokenization": "edgeGram",
          "minGrams": 2,
          "maxGrams": 15,
          "foldDiacritics": false
        },
        {
          "type": "string"
        }
      ]
    }
  }
}
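Because title is now indexed as both the autocomplete and string types, you can combine search-as-you-type matching with full-text scoring in a single compound query. The following PyMongo sketch is illustrative; the index name, connection string, and query terms are assumptions.

# Sketch: combine autocomplete (search-as-you-type) and text (full-text) clauses
# against the title field. Placeholder URI; terms and index name are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI

pipeline = [
    {
        "$search": {
            "index": "default",
            "compound": {
                "should": [
                    {"autocomplete": {"query": "god", "path": "title"}},
                    {"text": {"query": "godfather", "path": "title"}},
                ],
                "minimumShouldMatch": 1,
            },
        }
    },
    {"$limit": 5},
    {"$project": {"_id": 0, "title": 1}},
]
for doc in client["sample_mflix"]["movies"].aggregate(pipeline):
    print(doc["title"])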

The following index definition example uses the sample_mflix.users collection. If you have the sample data already loaded on your cluster, you can use the Visual Editor or JSON Editor in the Atlas UI to configure the index. After you select your preferred configuration method, select the database and collection, and refine your index to add field mappings.

The following index definition example indexes only the email field as the autocomplete type to support search-as-you-type queries against that field using the autocomplete operator. The index definition specifies the following:

  • Use the keyword analyzer to accept a string or array of strings as a parameter and index them as a single term (token).

  • Use the nGram tokenization strategy to tokenize text into chunks, or "n-grams", of the given sizes.

  • Index a minimum of 3 characters per indexed sequence.

  • Index a maximum of 15 characters per indexed sequence.

  • Include diacritic marks in the index and query text.

You can also use the uaxUrlEmail tokenizer to tokenize URLs and email addresses. To learn more, see uaxUrlEmail.

  1. In the Add Field Mapping window, select email from the Field Name dropdown.

  2. Click the Data Type dropdown and select Autocomplete.

  3. Make the following changes to the Autocomplete Properties:

    Analyzer

    Select lucene.keyword from the dropdown.

    Max Grams

    Set value to 15.

    Min Grams

    Set value to 3.

    Tokenization

    Select nGram from the dropdown.

    Fold Diacritics

    Select false from dropdown.

  4. Click Add.

Replace the default index definition with the following index definition.

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "email": {
        "type": "autocomplete",
        "analyzer": "lucene.keyword",
        "tokenization": "nGram",
        "minGrams": 3,
        "maxGrams": 15,
        "foldDiacritics": false
      }
    }
  }
}
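A query against this index might look like the following PyMongo sketch. Because the keyword analyzer treats the whole address as one token and nGram generates grams throughout it, a fragment from the middle of an email address can match. The connection string, index name, and sample fragment are assumptions for illustration.

# Sketch: query the email field indexed above with the autocomplete operator.
# Placeholder URI; index name and email fragment are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI

results = client["sample_mflix"]["users"].aggregate([
    {
        "$search": {
            "index": "default",
            "autocomplete": {"query": "gameofthron", "path": "email"},
        }
    },
    {"$limit": 5},
    {"$project": {"_id": 0, "name": 1, "email": 1}},
])
for doc in results:
    print(doc["email"])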

Tip

See also: Additional Index Definition Examples
