Whitespace Analyzer
The whitespace analyzer divides text into searchable terms (tokens) wherever it finds a whitespace character. It leaves all text in its original letter case.
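The splitting rule can be approximated outside Atlas Search with a few lines of JavaScript. This is a minimal sketch, not the Lucene implementation: whitespaceTokens is a hypothetical helper, and JavaScript's \s character class may not cover exactly the same set of characters that the analyzer treats as whitespace.

```javascript
// Hypothetical helper that approximates whitespace tokenization:
// split on runs of whitespace and keep each piece exactly as written.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

// Logs four tokens: The, Lion's, Mouth, Opens (letter case is preserved).
console.log(whitespaceTokens("The Lion's Mouth Opens"));
```

Because nothing is lowercased or stripped, punctuation such as the apostrophe in Lion's stays inside the token.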
You can see the tokens that the whitespace analyzer creates for a built-in static string in the Atlas UI Visual Editor when you Refine Your Index. In the Index Configurations section, expand View text analysis for your selected index configuration to display the index and search tokens that the whitespace analyzer creates. Use this output to help you select the analyzer for your index.
Important
Atlas Search won't index string fields where analyzer tokens exceed 32766 bytes in size. If you use the keyword analyzer, which indexes the whole field value as a single token, string fields that exceed 32766 bytes will not be indexed.
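To find values at risk before you index them, you can measure each token's size. The following is a minimal sketch under two assumptions: that the limit applies to the UTF-8 encoded length of a token, and that splitting on whitespace approximates the analyzer. The names oversizedTokens and MAX_TOKEN_BYTES are hypothetical.

```javascript
// Limit quoted in the note above.
const MAX_TOKEN_BYTES = 32766;

// Hypothetical helper: returns the whitespace-delimited tokens whose
// UTF-8 encoded size exceeds the limit (assumption: the limit is in UTF-8 bytes).
function oversizedTokens(text, limit = MAX_TOKEN_BYTES) {
  const encoder = new TextEncoder(); // TextEncoder always encodes as UTF-8
  return text
    .split(/\s+/)
    .filter((token) => encoder.encode(token).length > limit);
}

const longToken = "a".repeat(40000);
console.log(oversizedTokens(`short ${longToken} words`).length); // 1
```

Measuring bytes rather than characters matters for non-ASCII text, where one character can occupy several bytes.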
Example
The following example index definition specifies an index on the title field in the sample_mflix.movies collection using the whitespace analyzer. If you loaded the collection on your cluster, you can create the example index using the Atlas UI Visual Editor or the JSON Editor. After you select your preferred configuration method, select the database and collection.
In the Visual Editor, click Refine Your Index to configure your index.
In the Field Mappings section, click Add Field to open the Add Field Mapping window.
Select title from the Field Name dropdown.
Click Customized Configuration.
Click the Data Type dropdown and select String if it isn't already selected.
Expand String Properties and make the following changes:
Property | Setting |
---|---|
Index Analyzer | Select lucene.whitespace from the dropdown. |
Search Analyzer | Select lucene.whitespace from the dropdown. |
Index Options | Use the default offsets. |
Store | Use the default true. |
Ignore Above | Keep the default setting. |
Norms | Use the default include. |
Click Add.
Click Save Changes.
Click Create Search Index.
In the JSON Editor, replace the default index definition with the following index definition.
{
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "lucene.whitespace",
        "searchAnalyzer": "lucene.whitespace"
      }
    }
  }
}
Click Next.
Click Create Search Index.
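If you prefer to script the index instead of using the Atlas UI, the same definition can be passed to the mongosh helper db.collection.createSearchIndex(). The following is a sketch under assumptions: it presumes a mongosh session connected to a cluster that supports this helper, the index name default is an illustrative choice, and whitespaceIndexDefinition is a hypothetical function, not part of any API.

```javascript
// Hypothetical helper: builds the example index definition for a given string field.
function whitespaceIndexDefinition(fieldName) {
  return {
    mappings: {
      fields: {
        [fieldName]: {
          type: "string",
          analyzer: "lucene.whitespace",
          searchAnalyzer: "lucene.whitespace",
        },
      },
    },
  };
}

const definition = whitespaceIndexDefinition("title");

// `db` exists only inside mongosh, so the call is skipped everywhere else.
// The index name "default" is an assumption made for this example.
if (typeof db !== "undefined") {
  db.getSiblingDB("sample_mflix").movies.createSearchIndex("default", definition);
}
```

Keeping the definition in a function makes it easy to reuse the same analyzer settings for another string field.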
The following query searches for the term Lion's in the title field.
db.movies.aggregate([
  {
    "$search": {
      "text": { "query": "Lion's", "path": "title" }
    }
  },
  {
    "$project": { "_id": 0, "title": 1 }
  }
])
[
  { title: "Lion's Den" },
  { title: "The Lion's Mouth Opens" }
]
Atlas Search returns these documents by doing the following for the text in the title field using the lucene.whitespace analyzer:
Retain the original letter case for the text.
Divide the text into tokens wherever it finds a whitespace character.
The following table shows the tokens (searchable terms) that Atlas Search creates using the Whitespace Analyzer and, by contrast, the Simple Analyzer and Keyword Analyzer for the documents in the results:
Title | Whitespace Analyzer Tokens | Simple Analyzer Tokens | Keyword Analyzer Tokens |
---|---|---|---|
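Lion's Den | Lion's, Den | lion, s, den | Lion's Den |
The Lion's Mouth Opens | The, Lion's, Mouth, Opens | the, lion, s, mouth, opens | The Lion's Mouth Opens |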
The index that uses the whitespace analyzer is case-sensitive. Therefore, Atlas Search is able to match the query term Lion's to the token Lion's created by the whitespace analyzer.
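The analyzer comparison and the case-sensitive match can be approximated in plain JavaScript. This is a simplified sketch, not Atlas Search code: the analyzers object and the matches function are hypothetical, the simple analyzer is modeled as lowercasing and then splitting on non-letter characters, the keyword analyzer as returning the whole string as one token, and a match as any query token being identical to an indexed token.

```javascript
// Hypothetical approximations of the three analyzers.
const analyzers = {
  // Split on whitespace and keep the original letter case.
  whitespace: (text) => text.split(/\s+/).filter(Boolean),
  // Lowercase, then split on any run of non-letter characters.
  simple: (text) => text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean),
  // Treat the entire string as a single token.
  keyword: (text) => [text],
};

// Simplified matching rule: the same analyzer is applied to the query and
// to the field value, and any identical token counts as a match.
function matches(analyzerName, query, fieldValue) {
  const analyze = analyzers[analyzerName];
  const indexed = new Set(analyze(fieldValue));
  return analyze(query).some((token) => indexed.has(token));
}

console.log(analyzers.whitespace("Lion's Den")); // tokens: Lion's, Den
console.log(analyzers.simple("Lion's Den"));     // tokens: lion, s, den
console.log(matches("whitespace", "Lion's", "The Lion's Mouth Opens")); // true
console.log(matches("whitespace", "lion's", "The Lion's Mouth Opens")); // false
```

The last two calls show the case sensitivity: the lowercase query lion's produces a token that is not identical to the indexed token Lion's, so it does not match.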