Character Filters
Character filters examine text one character at a time and perform filtering operations. Character filters require a type field, and some take additional options as well.
"charFilters": [ { "type": "<filter-type>", "<additional-option>": <value> } ]
Atlas Search supports four types of character filters: `htmlStrip`, `icuNormalize`, `mapping`, and `persian`.
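A character filter never appears on its own: it sits inside a custom analyzer definition in the index's `analyzers` array, ahead of a tokenizer and any token filters. The following skeleton is a minimal sketch of that shape (the name `myAnalyzer` and the `standard` tokenizer are placeholder choices); complete working definitions appear in the examples on this page.

```json
{
  "analyzers": [
    {
      "name": "myAnalyzer",
      "charFilters": [
        { "type": "<filter-type>" }
      ],
      "tokenizer": { "type": "standard" },
      "tokenFilters": []
    }
  ]
}
```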
The following sample index definitions and queries use the sample collection named `minutes`. If you add the `minutes` collection to a database in your Atlas cluster, you can create the following sample indexes from the Visual Editor or JSON Editor in the Atlas UI and run the sample queries against this collection. To create these indexes, after you select your preferred configuration method in the Atlas UI, select the database and collection, and refine your index as shown in the examples on this page to add custom analyzers that use character filters.
Note
When you add a custom analyzer using the Visual Editor in the Atlas UI, the Atlas UI displays the following details about the analyzer in the Custom Analyzers section.
UI Field | Description
---|---
Name | Label that identifies the custom analyzer.
Used In | Fields that use the custom analyzer. Value is None if the custom analyzer isn't used to analyze any fields.
Character Filters | Atlas Search character filters configured in the custom analyzer.
Tokenizer | Atlas Search tokenizer configured in the custom analyzer.
Token Filters | Atlas Search token filters configured in the custom analyzer.
Actions | Clickable icons that indicate the actions that you can perform on the custom analyzer.
htmlStrip

The `htmlStrip` character filter strips out HTML constructs.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this character filter type. Value must be `htmlStrip`.
`ignoredTags` | array of strings | yes | List that contains the HTML tags to exclude from filtering.
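For instance, a configured `htmlStrip` filter that excludes `<a>` tags from stripping is just a small object with these two attributes; this fragment is drawn from the full index definition shown in the example below.

```json
{ "type": "htmlStrip", "ignoredTags": ["a"] }
```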
Example
The following index definition example indexes the `text.en_US` field in the `minutes` collection using a custom analyzer named `htmlStrippingAnalyzer`. The custom analyzer specifies the following:

- Remove all HTML tags from the text except the `a` tag using the `htmlStrip` character filter.
- Generate tokens based on word break rules from the Unicode Text Segmentation algorithm using the standard tokenizer.
To create the index using the Visual Editor:

1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `htmlStrippingAnalyzer` in the Analyzer Name field.
4. Expand Character Filters and click Add character filter.
5. Select htmlStrip from the dropdown and type `a` in the ignoredTags field.
6. Click Add character filter.
7. Expand Tokenizer if it's collapsed and select standard from the dropdown.
8. Click Add to add the custom analyzer to your index.
9. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field.
10. Select text.en_US nested from the Field Name dropdown and String from the Data Type dropdown.
11. In the properties section for the data type, select `htmlStrippingAnalyzer` from the Index Analyzer and Search Analyzer dropdowns.
12. Click Add, then Save Changes.
To create the index using the JSON Editor instead, replace the default index definition with the following:
1 { 2 "mappings": { 3 "fields": { 4 "text": { 5 "type": "document", 6 "dynamic": true, 7 "fields": { 8 "en_US": { 9 "analyzer": "htmlStrippingAnalyzer", 10 "type": "string" 11 } 12 } 13 } 14 } 15 }, 16 "analyzers": [{ 17 "name": "htmlStrippingAnalyzer", 18 "charFilters": [{ 19 "type": "htmlStrip", 20 "ignoredTags": ["a"] 21 }], 22 "tokenizer": { 23 "type": "standard" 24 }, 25 "tokenFilters": [] 26 }] 27 }
The following query looks for occurrences of the string `head` in the `text.en_US` field of the `minutes` collection.
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "head",
        "path": "text.en_US"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.en_US": 1
    }
  }
])
```
```javascript
[
  { _id: 2, text: { en_US: "The head of the sales department spoke first." } },
  { _id: 3, text: { en_US: "<body>We'll head out to the conference room by noon.</body>" } }
]
```
Atlas Search doesn't return the document with `_id: 1` because the string `head` appears there only as part of the HTML tag `<head>`, which the character filter strips out. The document with `_id: 3` contains HTML tags, but the string `head` appears elsewhere, so the document is a match. The following table shows the tokens that Atlas Search generates for the `text.en_US` field values in the matching documents `_id: 2` and `_id: 3` in the `minutes` collection using the `htmlStrippingAnalyzer`:
Document ID | Output Tokens
---|---
`2` | `The`, `head`, `of`, `the`, `sales`, `department`, `spoke`, `first`
`3` | `We'll`, `head`, `out`, `to`, `the`, `conference`, `room`, `by`, `noon`
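Conversely, because the `<body>` and `</body>` tags themselves are stripped before tokenization, no `body` token is ever produced for document `_id: 3`. As a hypothetical check of that behavior, the following query should not match document `_id: 3`, even though its raw field value contains `<body>`:

```javascript
// Hypothetical check: "body" occurs only inside stripped HTML tags
// in document _id: 3, so that document should not match.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "body",
        "path": "text.en_US"
      }
    }
  },
  { "$project": { "_id": 1, "text.en_US": 1 } }
])
```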
icuNormalize
The `icuNormalize` character filter normalizes text with the ICU Normalizer. It is based on Lucene's ICUNormalizer2CharFilter.
Attributes
It has the following attribute:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this character filter type. Value must be `icuNormalize`.
Example
The following index definition example indexes the `message` field in the `minutes` collection using a custom analyzer named `normalizingAnalyzer`. The custom analyzer specifies the following:

- Normalize the text in the `message` field value using the `icuNormalize` character filter.
- Tokenize the words in the field based on occurrences of whitespace between words using the whitespace tokenizer.
To create the index using the Visual Editor:

1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `normalizingAnalyzer` in the Analyzer Name field.
4. Expand Character Filters and click Add character filter.
5. Select icuNormalize from the dropdown and click Add character filter.
6. Expand Tokenizer if it's collapsed and select whitespace from the dropdown.
7. Click Add to add the custom analyzer to your index.
8. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
9. Select message from the Field Name dropdown and String from the Data Type dropdown.
10. In the properties section for the data type, select `normalizingAnalyzer` from the Index Analyzer and Search Analyzer dropdowns.
11. Click Add, then Save Changes.
To create the index using the JSON Editor instead, replace the default index definition with the following:
1 { 2 "mappings": { 3 "fields": { 4 "message": { 5 "type": "string", 6 "analyzer": "normalizingAnalyzer" 7 } 8 } 9 }, 10 "analyzers": [ 11 { 12 "name": "normalizingAnalyzer", 13 "charFilters": [ 14 { 15 "type": "icuNormalize" 16 } 17 ], 18 "tokenizer": { 19 "type": "whitespace" 20 }, 21 "tokenFilters": [] 22 } 23 ] 24 }
The following query searches for occurrences of the string `no` (for number) in the `message` field of the `minutes` collection.
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "no",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1,
      "title": 1
    }
  }
])
```
```javascript
[
  {
    _id: 4,
    title: 'The daily huddle on tHe StandUpApp2',
    message: 'write down your signature or phone №'
  }
]
```
Atlas Search matched the document with `_id: 4` to the query term `no` because it normalized the numero symbol `№` in the field using the `icuNormalize` character filter and created the token `no` for that typographic abbreviation of the word "number". Atlas Search generates the following tokens for the `message` field value in document `_id: 4` using the `normalizingAnalyzer`:
Document ID | Output Tokens
---|---
`4` | `write`, `down`, `your`, `signature`, `or`, `phone`, `no`
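Because Atlas Search analyzes query terms for string fields with the same analyzer used at index time (or the `searchAnalyzer`, if specified), a search for the numero symbol itself should normalize to the same `no` token and also match. A hypothetical check:

```javascript
// Hypothetical check: the query term "№" is normalized to "no"
// by the same icuNormalize character filter, so document _id: 4
// should still match.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "№",
        "path": "message"
      }
    }
  },
  { "$project": { "_id": 1, "message": 1 } }
])
```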
mapping
The `mapping` character filter applies user-specified normalization mappings to characters. It is based on Lucene's MappingCharFilter.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this character filter type. Value must be `mapping`.
`mappings` | object | yes | Object that contains a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format `<original> : <replacement>`.
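As a small illustration of that format, a `mappings` entry whose replacement is the empty string removes the original character entirely; this fragment mirrors the full example below.

```json
{
  "type": "mapping",
  "mappings": { "-": "" }
}
```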
Example
The following index definition example indexes the `page_updated_by.phone` field in the `minutes` collection using a custom analyzer named `mappingAnalyzer`. The custom analyzer specifies the following:

- Remove instances of hyphen (`-`), dot (`.`), open parenthesis (`(`), close parenthesis (`)`), and space characters in the phone field using the `mapping` character filter.
- Tokenize the entire input as a single token using the keyword tokenizer.
To create the index using the Visual Editor:

1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `mappingAnalyzer` in the Analyzer Name field.
4. Expand Character Filters and click Add character filter.
5. Select mapping from the dropdown and click Add mapping.
6. Enter the following characters in the Original field, one at a time, and leave the corresponding Replacement field empty: `-`, `.`, `(`, `)`, and a space character.
7. Click Add character filter.
8. Expand Tokenizer if it's collapsed and select keyword from the dropdown.
9. Click Add to add the custom analyzer to your index.
10. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone (nested) field.
11. Select page_updated_by.phone (nested) from the Field Name dropdown and String from the Data Type dropdown.
12. In the properties section for the data type, select `mappingAnalyzer` from the Index Analyzer and Search Analyzer dropdowns.
13. Click Add, then Save Changes.
To create the index using the JSON Editor instead, replace the default index definition with the following:
1 { 2 "mappings": { 3 "fields": { 4 "page_updated_by": { 5 "fields": { 6 "phone": { 7 "analyzer": "mappingAnalyzer", 8 "type": "string" 9 } 10 }, 11 "type": "document" 12 } 13 } 14 }, 15 "analyzers": [ 16 { 17 "name": "mappingAnalyzer", 18 "charFilters": [ 19 { 20 "mappings": { 21 "-": "", 22 ".": "", 23 "(": "", 24 ")": "", 25 " ": "" 26 }, 27 "type": "mapping" 28 } 29 ], 30 "tokenizer": { 31 "type": "keyword" 32 } 33 } 34 ] 35 }
The following query searches the `page_updated_by.phone` field for the string `1234567890`.
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "1234567890",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1,
      "page_updated_by.last_name": 1
    }
  }
])
```
```javascript
[
  {
    _id: 1,
    page_updated_by: { last_name: 'AUERBACH', phone: '(123)-456-7890' }
  }
]
```
The Atlas Search results contain one document where the numbers in the `phone` string match the query string. Atlas Search matched the document to the query string even though the query doesn't include the parentheses around the area code or the hyphens between the numbers, because Atlas Search removed these characters using the `mapping` character filter and created a single token for the field value. Specifically, Atlas Search generated the following token for the `phone` field in the document with `_id: 1`:
Document ID | Output Tokens
---|---
`1` | `1234567890`
Atlas Search would also match the document with `_id: 1` for searches for `(123)-456-7890`, `123-456-7890`, `123.456.7890`, and so on, because for string fields (see How to Index String Fields), Atlas Search also analyzes search query terms using the index analyzer (or, if specified, the `searchAnalyzer`). The following table shows the tokens that Atlas Search creates by removing instances of hyphen (`-`), dot (`.`), open parenthesis (`(`), close parenthesis (`)`), and space characters in the query term:
Query Term | Output Tokens
---|---
`(123)-456-7890` | `1234567890`
`123-456-7890` | `1234567890`
`123.456.7890` | `1234567890`
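For example, running the query with the dotted format should return the same document, since the dots are stripped from the query term before matching; a sketch:

```javascript
// The query term "123.456.7890" is analyzed with mappingAnalyzer,
// which strips the dots and yields the token "1234567890",
// so document _id: 1 still matches.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "123.456.7890",
        "path": "page_updated_by.phone"
      }
    }
  },
  { "$project": { "_id": 1, "page_updated_by.phone": 1 } }
])
```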
Tip

See also: Additional Sample Index Definitions and Queries:

- shingle token filter
- regexCaptureGroup tokenizer
persian
The `persian` character filter replaces instances of the zero-width non-joiner with the space character. This character filter is based on Lucene's PersianCharFilter.
Attributes
It has the following attribute:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this character filter type. Value must be `persian`.
Example
The following index definition example indexes the `text.fa_IR` field in the `minutes` collection using a custom analyzer named `persianCharacterIndex`. The custom analyzer specifies the following:

- Apply the `persian` character filter to replace non-printing characters in the field value with the space character.
- Use the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
To create the index using the Visual Editor:

1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `persianCharacterIndex` in the Analyzer Name field.
4. Expand Character Filters and click Add character filter.
5. Select persian from the dropdown and click Add character filter.
6. Expand Tokenizer if it's collapsed and select whitespace from the dropdown.
7. Click Add to add the custom analyzer to your index.
8. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.fa_IR (nested) field.
9. Select text.fa_IR (nested) from the Field Name dropdown and String from the Data Type dropdown.
10. In the properties section for the data type, select `persianCharacterIndex` from the Index Analyzer and Search Analyzer dropdowns.
11. Click Add, then Save Changes.
To create the index using the JSON Editor instead, replace the default index definition with the following:
1 { 2 "analyzer": "lucene.standard", 3 "mappings": { 4 "fields": { 5 "text": { 6 "dynamic": true, 7 "fields": { 8 "fa_IR": { 9 "analyzer": "persianCharacterIndex", 10 "type": "string" 11 } 12 }, 13 "type": "document" 14 } 15 } 16 }, 17 "analyzers": [ 18 { 19 "name": "persianCharacterIndex", 20 "charFilters": [ 21 { 22 "type": "persian" 23 } 24 ], 25 "tokenizer": { 26 "type": "whitespace" 27 } 28 } 29 ] 30 }
The following query searches the `text.fa_IR` field for the term `صحبت`.
```javascript
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "صحبت",
        "path": "text.fa_IR"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.fa_IR": 1,
      "page_updated_by.last_name": 1
    }
  }
])
```
```javascript
[
  {
    _id: 2,
    page_updated_by: { last_name: 'OHRBACH' },
    text: { fa_IR: 'ابتدا رئیس بخش فروش صحبت کرد' }
  }
]
```
Atlas Search returns the `_id: 2` document that contains the query term. Atlas Search matches the query term to the document by first replacing instances of zero-width non-joiners with the space character and then creating individual tokens for each word in the field value based on occurrences of whitespace between words. Specifically, Atlas Search generates the following tokens for the document with `_id: 2`:
Document ID | Output Tokens
---|---
`2` | `ابتدا`, `رئیس`, `بخش`, `فروش`, `صحبت`, `کرد`
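By the same tokenization, a search for any other whitespace-delimited word in the field value, for example `فروش`, should match the same document. A hypothetical check:

```javascript
// Hypothetical check: "فروش" is one of the whitespace-delimited
// tokens generated for document _id: 2, so it should match too.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "فروش",
        "path": "text.fa_IR"
      }
    }
  },
  { "$project": { "_id": 1, "text.fa_IR": 1 } }
])
```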