Character Filters
Character filters examine text one character at a time and perform filtering operations. Character filters require a type field, and some take additional options as well.
"charFilters": [ { "type": "<filter-type>", "<additional-option>": <value> } ]
Atlas Search supports four types of character filters: `htmlStrip`, `icuNormalize`, `mapping`, and `persian`.
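A character filter never appears on its own: it sits inside a custom analyzer definition in the index's `analyzers` array, ahead of a tokenizer and any token filters. The following skeleton is a minimal sketch of that shape (the name `myAnalyzer` and the `standard` tokenizer are placeholder choices); complete working definitions appear in the examples on this page.

```json
{
  "analyzers": [
    {
      "name": "myAnalyzer",
      "charFilters": [
        { "type": "<filter-type>" }
      ],
      "tokenizer": { "type": "standard" },
      "tokenFilters": []
    }
  ]
}
```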
The following sample index definitions and queries use the sample collection named `minutes`. If you add the `minutes` collection to a database in your Atlas cluster, you can create the following sample indexes from the Visual Editor or JSON Editor in the Atlas UI and run the sample queries against this collection. To create these indexes, after you select your preferred configuration method in the Atlas UI, select the database and collection, and refine your index as shown in the examples on this page to add custom analyzers that use character filters.
Note
When you add a custom analyzer using the Visual Editor in the Atlas UI, the Atlas UI displays the following details about the analyzer in the Custom Analyzers section.
UI Field | Description
---|---
Name | Label that identifies the custom analyzer.
Used In | Fields that use the custom analyzer. Value is None if the custom analyzer isn't used to analyze any fields.
Character Filters | Atlas Search character filters configured in the custom analyzer.
Tokenizer | Atlas Search tokenizer configured in the custom analyzer.
Token Filters | Atlas Search token filters configured in the custom analyzer.
Actions | Clickable icons that indicate the actions that you can perform on the custom analyzer.
htmlStrip

The `htmlStrip` character filter strips out HTML constructs.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this character filter type. Value must be `htmlStrip`.
`ignoredTags` | array of strings | yes | List that contains the HTML tags to exclude from filtering.
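For instance, a configured `htmlStrip` filter that excludes `<a>` tags from stripping is just a small object with these two attributes; this fragment is drawn from the full index definition shown in the example below.

```json
{ "type": "htmlStrip", "ignoredTags": ["a"] }
```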
Example
The following index definition example indexes the `text.en_US` field in the `minutes` collection using a custom analyzer named `htmlStrippingAnalyzer`. The custom analyzer specifies the following:

- Remove all HTML tags from the text except the `a` tag using the `htmlStrip` character filter.
- Generate tokens based on word break rules from the Unicode Text Segmentation algorithm using the standard tokenizer.
To create the index using the Visual Editor:

1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `htmlStrippingAnalyzer` in the Analyzer Name field.
4. Expand Character Filters and click Add character filter.
5. Select htmlStrip from the dropdown and type `a` in the ignoredTags field.
6. Click Add character filter.
7. Expand Tokenizer if it's collapsed and select standard from the dropdown.
8. Click Add to add the custom analyzer to your index.
9. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field.
10. Select text.en_US nested from the Field Name dropdown and String from the Data Type dropdown.
11. In the properties section for the data type, select `htmlStrippingAnalyzer` from the Index Analyzer and Search Analyzer dropdowns.
12. Click Add, then Save Changes.
To create the index using the JSON Editor instead, replace the default index definition with the following:
1 { 2 "mappings": { 3 "fields": { 4 "text": { 5 "type": "document", 6 "dynamic": true, 7 "fields": { 8 "en_US": { 9 "analyzer": "htmlStrippingAnalyzer", 10 "type": "string" 11 } 12 } 13 } 14 } 15 }, 16 "analyzers": [{ 17 "name": "htmlStrippingAnalyzer", 18 "charFilters": [{ 19 "type": "htmlStrip", 20 "ignoredTags": ["a"] 21 }], 22 "tokenizer": { 23 "type": "standard" 24 }, 25 "tokenFilters": [] 26 }] 27 }
The following query looks for occurrences of the string `head` in the `text.en_US` field of the `minutes` collection.
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "head",
        "path": "text.en_US"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.en_US": 1
    }
  }
])
```
```javascript
[
  { _id: 2, text: { en_US: "The head of the sales department spoke first." } },
  { _id: 3, text: { en_US: "<body>We'll head out to the conference room by noon.</body>" } }
]
```
Atlas Search doesn't return the document with `_id: 1` because the string `head` appears there only as part of the HTML tag `<head>`, which the character filter strips out. The document with `_id: 3` contains HTML tags, but the string `head` appears elsewhere, so the document is a match. The following table shows the tokens that Atlas Search generates for the `text.en_US` field values in the matching documents `_id: 2` and `_id: 3` in the `minutes` collection using the `htmlStrippingAnalyzer`:
Document ID | Output Tokens
---|---
`2` | `The`, `head`, `of`, `the`, `sales`, `department`, `spoke`, `first`
`3` | `We'll`, `head`, `out`, `to`, `the`, `conference`, `room`, `by`, `noon`
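Conversely, because the `<body>` and `</body>` tags themselves are stripped before tokenization, no `body` token is ever produced for document `_id: 3`. As a hypothetical check of that behavior, the following query should not match document `_id: 3`, even though its raw field value contains `<body>`:

```javascript
// Hypothetical check: "body" occurs only inside stripped HTML tags
// in document _id: 3, so that document should not match.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "body",
        "path": "text.en_US"
      }
    }
  },
  { "$project": { "_id": 1, "text.en_US": 1 } }
])
```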
icuNormalize
The `icuNormalize` character filter normalizes text with the ICU Normalizer. It is based on Lucene's ICUNormalizer2CharFilter.
Attributes
It has the following attribute:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this character filter type. Value must be `icuNormalize`.
Example
The following index definition example indexes the `message` field in the `minutes` collection using a custom analyzer named `normalizingAnalyzer`. The custom analyzer specifies the following:

- Normalize the text in the `message` field value using the `icuNormalize` character filter.
- Tokenize the words in the field based on occurrences of whitespace between words using the whitespace tokenizer.
To create the index using the Visual Editor:

1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `normalizingAnalyzer` in the Analyzer Name field.
4. Expand Character Filters and click Add character filter.
5. Select icuNormalize from the dropdown and click Add character filter.
6. Expand Tokenizer if it's collapsed and select whitespace from the dropdown.
7. Click Add to add the custom analyzer to your index.
8. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
9. Select message from the Field Name dropdown and String from the Data Type dropdown.
10. In the properties section for the data type, select `normalizingAnalyzer` from the Index Analyzer and Search Analyzer dropdowns.
11. Click Add, then Save Changes.
To create the index using the JSON Editor instead, replace the default index definition with the following:
1 { 2 "mappings": { 3 "fields": { 4 "message": { 5 "type": "string", 6 "analyzer": "normalizingAnalyzer" 7 } 8 } 9 }, 10 "analyzers": [ 11 { 12 "name": "normalizingAnalyzer", 13 "charFilters": [ 14 { 15 "type": "icuNormalize" 16 } 17 ], 18 "tokenizer": { 19 "type": "whitespace" 20 }, 21 "tokenFilters": [] 22 } 23 ] 24 }
The following query searches for occurrences of the string `no` (for number) in the `message` field of the `minutes` collection.
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "no",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1,
      "title": 1
    }
  }
])
```
```javascript
[
  {
    _id: 4,
    title: 'The daily huddle on tHe StandUpApp2',
    message: 'write down your signature or phone №'
  }
]
```
Atlas Search matched the document with `_id: 4` to the query term `no` because it normalized the numero symbol `№` in the field using the `icuNormalize` character filter and created the token `no` for that typographic abbreviation of the word "number". Atlas Search generates the following tokens for the `message` field value in document `_id: 4` using the `normalizingAnalyzer`:
Document ID | Output Tokens
---|---
`4` | `write`, `down`, `your`, `signature`, `or`, `phone`, `no`
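Because Atlas Search analyzes query terms for string fields with the same analyzer used at index time (or the `searchAnalyzer`, if specified), a search for the numero symbol itself should normalize to the same `no` token and also match. A hypothetical check:

```javascript
// Hypothetical check: the query term "№" is normalized to "no"
// by the same icuNormalize character filter, so document _id: 4
// should still match.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "№",
        "path": "message"
      }
    }
  },
  { "$project": { "_id": 1, "message": 1 } }
])
```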
mapping
The `mapping` character filter applies user-specified normalization mappings to characters. It is based on Lucene's MappingCharFilter.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this character filter type. Value must be `mapping`.
`mappings` | object | yes | Object that contains a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format `<original> : <replacement>`.
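As a small illustration of that format, a `mappings` entry whose replacement is the empty string removes the original character entirely; this fragment mirrors the full example below.

```json
{
  "type": "mapping",
  "mappings": { "-": "" }
}
```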
Example
The following index definition example indexes the `page_updated_by.phone` field in the `minutes` collection using a custom analyzer named `mappingAnalyzer`. The custom analyzer specifies the following:

- Remove instances of hyphen (`-`), dot (`.`), open parenthesis (`(`), close parenthesis (`)`), and space characters in the phone field using the `mapping` character filter.
- Tokenize the entire input as a single token using the keyword tokenizer.
To create the index using the Visual Editor:

1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `mappingAnalyzer` in the Analyzer Name field.
4. Expand Character Filters and click Add character filter.
5. Select mapping from the dropdown and click Add mapping.
6. Enter the following characters in the Original field, one at a time, and leave the corresponding Replacement field empty: `-`, `.`, `(`, `)`, and a space character.
7. Click Add character filter.
8. Expand Tokenizer if it's collapsed and select keyword from the dropdown.
9. Click Add to add the custom analyzer to your index.
10. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone (nested) field.
11. Select page_updated_by.phone (nested) from the Field Name dropdown and String from the Data Type dropdown.
12. In the properties section for the data type, select `mappingAnalyzer` from the Index Analyzer and Search Analyzer dropdowns.
13. Click Add, then Save Changes.
To create the index using the JSON Editor instead, replace the default index definition with the following:
1 { 2 "mappings": { 3 "fields": { 4 "page_updated_by": { 5 "fields": { 6 "phone": { 7 "analyzer": "mappingAnalyzer", 8 "type": "string" 9 } 10 }, 11 "type": "document" 12 } 13 } 14 }, 15 "analyzers": [ 16 { 17 "name": "mappingAnalyzer", 18 "charFilters": [ 19 { 20 "mappings": { 21 "-": "", 22 ".": "", 23 "(": "", 24 ")": "", 25 " ": "" 26 }, 27 "type": "mapping" 28 } 29 ], 30 "tokenizer": { 31 "type": "keyword" 32 } 33 } 34 ] 35 }
The following query searches the `page_updated_by.phone` field for the string `1234567890`.
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "1234567890",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1,
      "page_updated_by.last_name": 1
    }
  }
])
```
```javascript
[
  {
    _id: 1,
    page_updated_by: { last_name: 'AUERBACH', phone: '(123)-456-7890' }
  }
]
```
The Atlas Search results contain one document where the numbers in the `phone` string match the query string. Atlas Search matched the document to the query string even though the query doesn't include the parentheses around the area code or the hyphens between the numbers, because Atlas Search removed these characters using the `mapping` character filter and created a single token for the field value. Specifically, Atlas Search generated the following token for the `phone` field in the document with `_id: 1`:
Document ID | Output Tokens
---|---
`1` | `1234567890`
Atlas Search would also match the document with `_id: 1` for searches for `(123)-456-7890`, `123-456-7890`, `123.456.7890`, and so on, because for string fields (see How to Index String Fields), Atlas Search also analyzes search query terms using the index analyzer (or, if specified, the `searchAnalyzer`). The following table shows the tokens that Atlas Search creates by removing instances of hyphen (`-`), dot (`.`), open parenthesis (`(`), close parenthesis (`)`), and space characters in the query term:
Query Term | Output Tokens
---|---
`(123)-456-7890` | `1234567890`
`123-456-7890` | `1234567890`
`123.456.7890` | `1234567890`
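For example, running the query with the dotted format should return the same document, since the dots are stripped from the query term before matching; a sketch:

```javascript
// The query term "123.456.7890" is analyzed with mappingAnalyzer,
// which strips the dots and yields the token "1234567890",
// so document _id: 1 still matches.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "123.456.7890",
        "path": "page_updated_by.phone"
      }
    }
  },
  { "$project": { "_id": 1, "page_updated_by.phone": 1 } }
])
```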
Tip

See also: Additional Sample Index Definitions and Queries:

- shingle token filter
- regexCaptureGroup tokenizer
persian
The `persian` character filter replaces instances of the zero-width non-joiner with the space character. This character filter is based on Lucene's PersianCharFilter.
Attributes
It has the following attribute:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this character filter type. Value must be `persian`.
Example
The following index definition example indexes the `text.fa_IR` field in the `minutes` collection using a custom analyzer named `persianCharacterIndex`. The custom analyzer specifies the following:

- Apply the `persian` character filter to replace non-printing characters in the field value with the space character.
- Use the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
To create the index using the Visual Editor:

1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `persianCharacterIndex` in the Analyzer Name field.
4. Expand Character Filters and click Add character filter.
5. Select persian from the dropdown and click Add character filter.
6. Expand Tokenizer if it's collapsed and select whitespace from the dropdown.
7. Click Add to add the custom analyzer to your index.
8. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.fa_IR (nested) field.
9. Select text.fa_IR (nested) from the Field Name dropdown and String from the Data Type dropdown.
10. In the properties section for the data type, select `persianCharacterIndex` from the Index Analyzer and Search Analyzer dropdowns.
11. Click Add, then Save Changes.
To create the index using the JSON Editor instead, replace the default index definition with the following:
1 { 2 "analyzer": "lucene.standard", 3 "mappings": { 4 "fields": { 5 "text": { 6 "dynamic": true, 7 "fields": { 8 "fa_IR": { 9 "analyzer": "persianCharacterIndex", 10 "type": "string" 11 } 12 }, 13 "type": "document" 14 } 15 } 16 }, 17 "analyzers": [ 18 { 19 "name": "persianCharacterIndex", 20 "charFilters": [ 21 { 22 "type": "persian" 23 } 24 ], 25 "tokenizer": { 26 "type": "whitespace" 27 } 28 } 29 ] 30 }
The following query searches the `text.fa_IR` field for the term `صحبت`.
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "صحبت",
        "path": "text.fa_IR"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.fa_IR": 1,
      "page_updated_by.last_name": 1
    }
  }
])
```
```javascript
[
  {
    _id: 2,
    page_updated_by: { last_name: 'OHRBACH' },
    text: { fa_IR: 'ابتدا رئیس بخش فروش صحبت کرد' }
  }
]
```
Atlas Search returns the `_id: 2` document that contains the query term. Atlas Search matches the query term to the document by first replacing instances of zero-width non-joiners with the space character and then creating individual tokens for each word in the field value based on occurrences of whitespace between words. Specifically, Atlas Search generates the following tokens for the document with `_id: 2`:
Document ID | Output Tokens
---|---
`2` | `ابتدا`, `رئیس`, `بخش`, `فروش`, `صحبت`, `کرد`
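By the same tokenization, a search for any other whitespace-delimited word in the field value, for example `فروش`, should match the same document. A hypothetical check:

```javascript
// Hypothetical check: "فروش" is one of the whitespace-delimited
// tokens generated for document _id: 2, so it should match too.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "فروش",
        "path": "text.fa_IR"
      }
    }
  },
  { "$project": { "_id": 1, "text.fa_IR": 1 } }
])
```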