Collation
New in version 3.4.
Collation allows users to specify language-specific rules for string comparison, such as rules for lettercase and accent marks.
You can specify collation for a collection or a view, an index, or specific operations that support collation.
To specify collation when you query documents in the MongoDB Atlas UI, see Specify Collation.
Collation Document
A collation document has the following fields:
{ locale: <string>, caseLevel: <boolean>, caseFirst: <string>, strength: <int>, numericOrdering: <boolean>, alternate: <string>, maxVariable: <string>, backwards: <boolean> }
When specifying collation, the locale field is mandatory; all other collation fields are optional. For descriptions of the fields, see Collation Document.
Default collation parameter values vary depending on which locale you specify. For a complete list of default collation parameters and the locales they are associated with, see Collation Default Parameters.
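As a minimal sketch of how a collation document might be used, the snippet below builds one that sets only locale and strength and embeds it in a create command document. The collection name myColl and the chosen field values are illustrative assumptions; the commented lines show how the same document could be passed in mongosh against a live deployment.

```javascript
// A collation document: `locale` is mandatory, every other field is optional.
const collation = { locale: "fr", strength: 2 };

// Command document for the `create` database command. Its `collation`
// field sets the default collation of the new collection.
// "myColl" is an illustrative collection name.
const createCommand = { create: "myColl", collation: collation };

// In mongosh, either form could be run against a live deployment:
//   db.runCommand(createCommand)
//   db.createCollection("myColl", { collation: collation })

console.log(JSON.stringify(createCommand, null, 2));
```

Fields left out of the collation document fall back to the defaults for the chosen locale.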
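Field | Type | Description |
---|---|---|
locale | string | The ICU locale. See Supported Languages and Locales for a list of supported locales. To specify simple binary comparison, specify a locale value of "simple". |
strength | integer | Optional. The level of comparison to perform. Corresponds to ICU Comparison Levels. Possible values are: 1 (primary level: compares base characters only, ignoring diacritics and case); 2 (secondary level: compares base characters and diacritics); 3 (tertiary level: also compares case and letter variants; this is the default); 4 (quaternary level: for use cases that must consider punctuation when levels 1-3 ignore it, or for processing Japanese text); 5 (identical level: for use as a tie breaker). See ICU Collation: Comparison Levels for details. |
caseLevel | boolean | Optional. Flag that determines whether to include case comparison at strength level 1 or 2. If true, include case comparison: with strength: 1, collation compares base characters and case; with strength: 2, collation compares base characters, diacritics (and possibly other secondary differences), and case. If false, do not include case comparison at level 1 or 2. The default is false. For more information, see ICU Collation: Case Level. |
caseFirst | string | Optional. A field that determines sort order of case differences during tertiary level comparisons. Possible values are: "upper" (uppercase sorts before lowercase); "lower" (lowercase sorts before uppercase); "off" (default value; similar to "lower" with slight differences). |
numericOrdering | boolean | Optional. Flag that determines whether to compare numeric strings as numbers or as strings. If true, compare as numbers; for example, "10" is greater than "2". If false, compare as strings; for example, "10" is less than "2". Default is false. |
alternate | string | Optional. Field that determines whether collation should consider whitespace and punctuation as base characters for purposes of comparison. Possible values are: "non-ignorable" (whitespace and punctuation are considered base characters); "shifted" (whitespace and punctuation are not considered base characters and are only distinguished at strength levels greater than 3). See ICU Collation: Comparison Levels for more information. Default is "non-ignorable". |
maxVariable | string | Optional. Field that determines up to which characters are considered ignorable when alternate is "shifted". Has no effect if alternate is "non-ignorable". Possible values are: "punct" (both whitespace and punctuation are ignorable and not considered base characters); "space" (whitespace is ignorable and not considered base characters). |
backwards | boolean | Optional. Flag that determines whether strings with diacritics sort from the back of the string, such as with some French dictionary ordering. If true, compare from back to front. If false, compare from front to back. The default value is false. |
normalization | boolean | Optional. Flag that determines whether to check if text requires normalization and to perform normalization. Generally, the majority of text does not require this normalization processing. If true, check if fully normalized and perform normalization to compare text. If false, do not check. The default value is false. See https://unicode-org.github.io/icu/userguide/collation/concepts.html#normalization for details. |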
Operations that Support Collation
You can specify collation for the following operations:
Note
You cannot specify multiple collations for an operation. For example, you cannot specify different collations per field, or if performing a find with a sort, you cannot use one collation for the find and another for the sort.
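Commands | mongosh Methods |
---|---|
create | db.createCollection(), db.createView() |
createIndexes [1] | db.collection.createIndex() [1] |
aggregate | db.collection.aggregate() |
distinct | db.collection.distinct() |
findAndModify | db.collection.findAndModify(), db.collection.findOneAndDelete(), db.collection.findOneAndReplace(), db.collection.findOneAndUpdate() |
find | cursor.collation() to specify collation for db.collection.find() |
mapReduce | db.collection.mapReduce() |
delete | db.collection.deleteOne(), db.collection.deleteMany(), db.collection.remove() |
update | db.collection.updateOne(), db.collection.updateMany(), db.collection.replaceOne() |
shardCollection | sh.shardCollection() |
count | db.collection.count() |
 | Individual update, replace, and delete operations in db.collection.bulkWrite(). |
[1] Some index types do not support collation. See Collation and Unsupported Index Types for details.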
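To illustrate the one-collation-per-operation rule, here is a hedged sketch of a find command document in which a single collation document governs both the filter and the sort. The collection name myColl, the field category, and the strength value are assumptions chosen for illustration.

```javascript
// One collation document applies to the whole operation, so the filter
// and the sort below are both evaluated with the same French rules.
const findCommand = {
  find: "myColl",                       // illustrative collection name
  filter: { category: "cafe" },
  sort: { category: 1 },
  collation: { locale: "fr", strength: 1 }
};

// mongosh equivalent against a live deployment:
//   db.myColl.find( { category: "cafe" } )
//            .sort( { category: 1 } )
//            .collation( { locale: "fr", strength: 1 } )

console.log(JSON.stringify(findCommand, null, 2));
```

There is no place in the command to attach a second collation to the sort alone, which is why the filter and the sort cannot use different rules.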
Behavior
Local Variants
Some collation locales have variants, which employ special language-specific rules. To specify a locale variant, use the following syntax:
{ "locale" : "<locale code>@collation=<variant>" }
For example, to use the unihan variant of the Chinese collation:
{ "locale" : "zh@collation=unihan" }
For a complete list of all collation locales and their variants, see Collation Locales.
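The variant syntax above is plain string concatenation, which the following sketch makes explicit. The helper localeWithVariant is hypothetical and not part of any MongoDB API; it only assembles the locale string.

```javascript
// Hypothetical helper (not a MongoDB API): builds a locale string in the
// "<locale code>@collation=<variant>" form described above.
function localeWithVariant(localeCode, variant) {
  return `${localeCode}@collation=${variant}`;
}

// Collation document for the unihan variant of the Chinese collation.
const unihanCollation = { locale: localeWithVariant("zh", "unihan") };

console.log(unihanCollation.locale); // zh@collation=unihan
```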
Collation and Views
You can specify a default collation for a view at creation time. If no collation is specified, the view's default collation is the "simple" binary comparison collator. That is, the view does not inherit the collection's default collation.
String comparisons on the view use the view's default collation. An operation that attempts to change or override a view's default collation will fail with an error.
If creating a view from another view, you cannot specify a collation that differs from the source view's collation.
If performing an aggregation that involves multiple views, such as with $lookup or $graphLookup, the views must have the same collation.
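As a sketch of setting a view's default collation at creation time, the command document below creates a view with an explicit collation. The names myView and myColl and the pipeline contents are illustrative assumptions.

```javascript
// Command document for creating a view with its own default collation.
// Without the `collation` field, the view would use the "simple" binary
// collator, even if the source collection has a different default.
const createViewCommand = {
  create: "myView",                        // illustrative view name
  viewOn: "myColl",                        // illustrative source collection
  pipeline: [ { $project: { category: 1 } } ],
  collation: { locale: "fr" }
};

// mongosh equivalent against a live deployment:
//   db.createView( "myView", "myColl",
//                  [ { $project: { category: 1 } } ],
//                  { collation: { locale: "fr" } } )

console.log(JSON.stringify(createViewCommand, null, 2));
```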
Collation and Index Use
To use an index for string comparisons, an operation must also specify the same collation. That is, an index with a collation cannot support an operation that performs string comparisons on the indexed fields if the operation specifies a different collation.
Warning
Because indexes that are configured with collation use ICU collation keys to achieve sort order, collation-aware index keys may be larger than index keys for indexes without collation.
For example, the collection myColl has an index on a string field category with the collation locale "fr".
db.myColl.createIndex( { category: 1 }, { collation: { locale: "fr" } } )
The following query operation, which specifies the same collation as the index, can use the index:
db.myColl.find( { category: "cafe" } ).collation( { locale: "fr" } )
However, the following query operation, which by default uses the "simple" binary collator, cannot use the index:
db.myColl.find( { category: "cafe" } )
For a compound index where the index prefix keys are not strings, arrays, or embedded documents, an operation that specifies a different collation can still use the index to support comparisons on the index prefix keys.
For example, the collection myColl has a compound index on the numeric fields score and price and the string field category; the index is created with the collation locale "fr" for string comparisons:
db.myColl.createIndex( { score: 1, price: 1, category: 1 }, { collation: { locale: "fr" } } )
The following operations, which use "simple" binary collation for string comparisons, can use the index:
db.myColl.find( { score: 5 } ).sort( { price: 1 } )
db.myColl.find( { score: 5, price: { $gt: NumberDecimal( "10" ) } } ).sort( { price: 1 } )
The following operation, which uses "simple" binary collation for string comparisons on the indexed category field, can use the index to fulfill only the score: 5 portion of the query:
db.myColl.find( { score: 5, category: "cafe" } )
Important
Matches against document keys, including embedded document keys, use simple binary comparison. This means that a query for a key like "foo.bár" will not match the key "foo.bar", regardless of the value you set for the strength parameter.
Collation and Unsupported Index Types
The following indexes only support simple binary comparison and do not support collation:
text indexes,
2d indexes, and
geoHaystack indexes.
Tip
To create a text, a 2d, or a geoHaystack index on a collection that has a non-simple collation, you must explicitly specify {collation: {locale: "simple"} } when creating the index.
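A sketch of what this tip might look like in practice: the createIndexes command document below requests a text index and explicitly overrides the collection's default collation with "simple". The collection name myColl, the field description, and the index name are assumptions for illustration.

```javascript
// Command document that creates a text index on a collection whose default
// collation is not "simple". The explicit `collation` entry overrides the
// collection default for this index only.
const createTextIndexCommand = {
  createIndexes: "myColl",                 // illustrative collection name
  indexes: [
    {
      key: { description: "text" },        // illustrative field name
      name: "description_text",
      collation: { locale: "simple" }
    }
  ]
};

// mongosh equivalent against a live deployment:
//   db.myColl.createIndex( { description: "text" },
//                          { collation: { locale: "simple" } } )

console.log(JSON.stringify(createTextIndexCommand, null, 2));
```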
Restrictions
numericOrdering
When you set numericOrdering to true, the following restrictions apply:
Only contiguous non-negative integer substrings of digits are considered in the comparisons.
numericOrdering does not support + signs, - signs, decimal separators (such as decimal points and decimal commas), or exponents.
Only Unicode code points in the Number or Decimal Digit (Nd) category are treated as digits.
If a digit length exceeds 254 characters, the excess characters are treated as a separate number.
Consider a collection with the following string number and decimal values:
db.c.insertMany( [ { "n" : "1" }, { "n" : "2" }, { "n" : "2.1" }, { "n" : "-2.1" }, { "n" : "2.2" }, { "n" : "2.10" }, { "n" : "2.20" }, { "n" : "-10" }, { "n" : "10" }, { "n" : "20" }, { "n" : "20.1" } ] )
The following find query uses a collation document containing the numericOrdering parameter:
db.c.find( { }, { _id: 0 } ).sort( { n: 1 } ).collation( { locale: 'en_US', numericOrdering: true } )
The operation returns the following results:
[ { n: '-2.1' }, { n: '-10' }, { n: '1' }, { n: '2' }, { n: '2.1' }, { n: '2.2' }, { n: '2.10' }, { n: '2.20' }, { n: '10' }, { n: '20' }, { n: '20.1' } ]
numericOrdering: true sorts the string values in ascending order as if they were numeric values.
The two negative values -2.1 and -10 are not sorted in the expected sort order because they contain unsupported - characters.
The value 2.2 is sorted before the value 2.10 because the numericOrdering parameter does not support decimal values. As a result, the digits after the decimal point are compared as separate integers (2 and 10), so 2.2 sorts before 2.10.
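The ordering above can be reproduced with a simplified model of the restrictions. This is not ICU's algorithm and not how MongoDB implements collation; it is only a sketch, under the assumptions that digits are ASCII 0-9 (the real rule covers all Unicode Nd code points), that non-digit runs compare by code unit, and that the 254-character limit is ignored. The function names are hypothetical.

```javascript
// Split a string into alternating runs of ASCII digits and non-digits.
function tokenize(s) {
  return s.match(/\d+|\D+/g) || [];
}

// Simplified comparison: digit runs compare as non-negative integers,
// everything else (including "-" and ".") compares as ordinary text.
function compareNumericOrdering(a, b) {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const n = Math.min(ta.length, tb.length);
  for (let i = 0; i < n; i++) {
    const x = ta[i];
    const y = tb[i];
    if (/^\d+$/.test(x) && /^\d+$/.test(y)) {
      const nx = BigInt(x);
      const ny = BigInt(y);
      if (nx !== ny) return nx < ny ? -1 : 1;
    } else if (x !== y) {
      return x < y ? -1 : 1;
    }
  }
  // All shared runs are equal: the string with fewer runs sorts first.
  return ta.length - tb.length;
}

const values = ["1", "2", "2.1", "-2.1", "2.2", "2.10", "2.20", "-10", "10", "20", "20.1"];
const sorted = [...values].sort(compareNumericOrdering);
console.log(sorted.join(", "));
// -2.1, -10, 1, 2, 2.1, 2.2, 2.10, 2.20, 10, 20, 20.1
```

In this model, "-2.1" precedes "-10" because the leading "-" is treated as text and the integers 2 and 10 are then compared, which mirrors the behavior described above.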