Standard Analyzer

The standard analyzer is the default for all Atlas Search indexes and queries. It divides text into terms based on word boundaries, which makes it language-neutral for most use cases. It converts all terms to lowercase and removes punctuation. Its grammar-based tokenization recognizes email addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more.
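
As a rough illustration of the lowercasing and punctuation removal described above, the following JavaScript sketch splits text on runs of letters and digits. The function name approximateStandardTokens is hypothetical, and the regular expression is only a simplified stand-in: the real analyzer uses grammar-based tokenization, so its output can differ for input such as email addresses, acronyms, or CJK text.

```javascript
// Simplified approximation of lucene.standard tokenization.
// Assumption: this is NOT the analyzer's actual algorithm, only a sketch of
// lowercasing the text and dropping punctuation and whitespace.
function approximateStandardTokens(text) {
  // Match runs of Unicode letters and digits; everything else is treated
  // as a separator and discarded.
  const found = text.toLowerCase().match(/[\p{L}\p{N}]+/gu);
  return found === null ? [] : found;
}

console.log(approximateStandardTokens("Action Jackson: The Sequel!"));
```

For this input the sketch yields the four lowercase tokens action, jackson, the, and sequel.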

To preview the tokens that the standard analyzer creates for a built-in static string, click Refine Your Index in the Atlas UI Visual Editor. In the Index Configurations section, expand View text analysis of your selected index configuration to see the index and search tokens that the analyzer creates. This preview can help you select the analyzer to use in your index.

Important

Atlas Search doesn't index string fields whose analyzer tokens exceed 32766 bytes in size. With the keyword analyzer, which treats the entire field value as a single token, string fields that exceed 32766 bytes are not indexed.
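
To check ahead of time whether a value might hit this limit, you could measure its encoded size. The sketch below assumes the limit applies to the UTF-8 encoding of the token, and the helper names are hypothetical. It covers only the keyword analyzer case, where the whole field value is one token.

```javascript
// Token size limit stated above, in bytes.
const MAX_TOKEN_BYTES = 32766;

// Returns the size of a string in bytes when encoded as UTF-8
// (assumption: the limit is measured on the UTF-8 encoding).
function utf8ByteLength(value) {
  return new TextEncoder().encode(value).length;
}

// Hypothetical helper: with the keyword analyzer the entire field value
// is a single token, so the whole string must fit within the limit.
function fitsKeywordTokenLimit(value) {
  return utf8ByteLength(value) <= MAX_TOKEN_BYTES;
}

console.log(fitsKeywordTokenLimit("Action Jackson"));         // true (14 bytes)
console.log(fitsKeywordTokenLimit("\u00e9".repeat(20000)));   // false (40000 bytes)
```

Note that the character count and the byte count differ for non-ASCII text, which is why the sketch measures bytes rather than string length.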

The following example index definition specifies an index on the title field in the sample_mflix.movies collection using the standard analyzer. If you loaded the collection on your cluster, you can create the example index using the Atlas UI Visual Editor or the JSON Editor. After you select your preferred configuration method, select the database and collection.

To create the index with the Visual Editor:

  1. Click Refine Your Index to configure your index.

  2. In the Field Mappings section, click Add Field Mapping to open the Add Field Mapping window.

  3. Click Customized Configuration.

  4. Select title from the Field Name dropdown.

  5. Click the Data Type dropdown and select String if it isn't already selected.

  6. Expand String Properties and make the following changes:

    • Index Analyzer: Select lucene.standard from the dropdown if it isn't already selected.

    • Search Analyzer: Select lucene.standard from the dropdown if it isn't already selected.

    • Index Options: Use the default, offsets.

    • Store: Use the default, true.

    • Ignore Above: Keep the default setting.

    • Norms: Use the default, include.

  7. Click Add.

  8. Click Save Changes.

  9. Click Create Search Index.

To create the index with the JSON Editor:

  1. Replace the default index definition with the following index definition.

    {
      "mappings": {
        "fields": {
          "title": {
            "type": "string",
            "analyzer": "lucene.standard"
          }
        }
      }
    }

  2. Click Next.

  3. Click Create Search Index.
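
The two procedures above describe the same index. As a sketch of how they relate, the following JavaScript builds a definition that spells out the Visual Editor settings (search analyzer, index options, store, and norms) as explicit fields instead of relying on defaults. The function name buildStandardAnalyzerIndex is hypothetical, and the index name in the final comment is an assumption.

```javascript
// Builds an index definition for one string field analyzed with
// lucene.standard. The function name is hypothetical; the explicit values
// mirror the defaults listed in the Visual Editor steps.
function buildStandardAnalyzerIndex(fieldName) {
  return {
    mappings: {
      fields: {
        [fieldName]: {
          type: "string",
          analyzer: "lucene.standard",
          searchAnalyzer: "lucene.standard",
          indexOptions: "offsets",
          store: true,
          norms: "include",
        },
      },
    },
  };
}

const definition = buildStandardAnalyzerIndex("title");
console.log(JSON.stringify(definition, null, 2));

// In a mongosh session connected to the cluster, the definition could then
// be passed to the createSearchIndex helper (the index name "default" is
// an assumption):
// db.movies.createSearchIndex("default", definition);
```

Ignore Above is omitted so that it keeps its default setting, as in the Visual Editor steps.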

The following query searches the title field for the term action and limits the output to two results.

db.movies.aggregate([
  {
    "$search": {
      "text": {
        "query": "action",
        "path": "title"
      }
    }
  },
  {
    "$limit": 2
  },
  {
    "$project": {
      "_id": 0,
      "title": 1
    }
  }
])

The query returns the following results:

[
  {
    title: 'Action Jackson'
  },
  {
    title: 'Class Action'
  }
]

Atlas Search returned these documents because the query term action matches the token action in each document. Atlas Search created that token by applying the lucene.standard analyzer to the text in the title field, which does the following:

  • Convert the text to lowercase.

  • Split the text based on word boundaries and create separate tokens.

The following table shows the tokens (searchable terms) that Atlas Search creates using the Standard Analyzer and, by contrast, the tokens that Atlas Search creates for the Keyword Analyzer and Whitespace Analyzer for the documents in the results:

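Title          | Standard Analyzer Tokens | Keyword Analyzer Tokens | Whitespace Analyzer Tokens
---------------|--------------------------|-------------------------|---------------------------
Action Jackson | action, jackson          | Action Jackson          | Action, Jackson
Class Action   | class, action            | Class Action            | Class, Action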

If you instead index the title field using one of the other analyzers, the query term action doesn't match the documents in the results:

  • Keyword Analyzer: Atlas Search wouldn't match the documents because the keyword analyzer matches only documents in which the search term exactly matches the entire contents of the field (Action Jackson and Class Action).

  • Whitespace Analyzer: Atlas Search wouldn't match the documents because the whitespace analyzer tokenizes the title field value in its original case (Action), and the lowercase query term action doesn't match that token.
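
The comparison above can be sketched in JavaScript with simplified stand-ins for the three analyzers. These functions are assumptions for illustration, not the analyzers' actual implementations, and the sketch assumes the same analyzer is applied to both the indexed text and the query.

```javascript
// Simplified stand-ins for three analyzers (assumptions, not Lucene code).
const analyzers = {
  // Lowercase, then keep runs of letters and digits.
  standard: (text) => text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [],
  // The whole input is a single token, case preserved.
  keyword: (text) => [text],
  // Split on whitespace only, case preserved.
  whitespace: (text) => text.split(/\s+/).filter((token) => token.length > 0),
};

// A query matches when any analyzed query token equals an indexed token.
function matches(analyzerName, fieldValue, query) {
  const analyze = analyzers[analyzerName];
  const indexedTokens = new Set(analyze(fieldValue));
  return analyze(query).some((token) => indexedTokens.has(token));
}

for (const name of Object.keys(analyzers)) {
  for (const title of ["Action Jackson", "Class Action"]) {
    const tokens = JSON.stringify(analyzers[name](title));
    console.log(name, tokens, matches(name, title, "action"));
  }
}
```

Only the standard stand-in reports a match for the query action, because it is the only one that lowercases and splits the title into separate tokens.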
