Atlas Search Multi-Language Data Modeling

Ethan Steininger, Harshad Dhavale • 2 min read • Published Sep 07, 2022 • Updated Sep 09, 2022

We live in an increasingly globalized economy. Users therefore expect our applications to understand their cultural context, including their language.

Luckily, most search engines, including Atlas Search, support multiple languages. This article walks through three options, each with its own query pattern, data model, and index definition, to support your multilingual application needs.

To illustrate the options, we will create a fictitious scenario. We manage a recipe search application that supports three languages: English, Japanese (analyzed with Kuromoji), and German. Our users are located around the globe and need to search for recipes in their native language.

1. Single field

We have one document for each language in the same collection, and thus each field is indexed separately as its own language. This simplifies the query patterns and UX at the expense of bloated index storage.

Document:

[
  {"name":"すし"},
  {"name":"Fish and Chips"},
  {"name":"Käsespätzle"}
]

Index:

{
  "name":"recipes",
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {
        "type": "string",
        "analyzer": "lucene.kuromoji"
      },
      "name": {
        "type": "string",
        "analyzer": "lucene.english"
      },
      "name": {
        "type": "string",
        "analyzer": "lucene.german"
      }
    }
  }
}
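
One thing worth checking in a definition like this is that all three mappings share the key "name". The sketch below uses only Python's standard json module to show how a generic JSON parser treats a repeated key: it keeps the last value and drops the earlier ones. Whether Atlas handles a repeated key the same way is an assumption here, not something this article states, so it may be worth confirming which analyzer actually ends up applied to "name" after the index is created.

```python
import json

# The index definition above as raw JSON text, with the key "name" repeated.
index_text = """
{
  "name": "recipes",
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {"type": "string", "analyzer": "lucene.kuromoji"},
      "name": {"type": "string", "analyzer": "lucene.english"},
      "name": {"type": "string", "analyzer": "lucene.german"}
    }
  }
}
"""

# Python's json module silently keeps the last value seen for a repeated key.
parsed = json.loads(index_text)
surviving_analyzer = parsed["mappings"]["fields"]["name"]["analyzer"]

# object_pairs_hook receives every key/value pair of each JSON object,
# which makes it possible to detect repeats before they are collapsed.
repeated = []

def record_repeats(pairs):
    keys = [key for key, _ in pairs]
    repeated.extend(sorted({key for key in keys if keys.count(key) > 1}))
    return dict(pairs)

json.loads(index_text, object_pairs_hook=record_repeats)

print(surviving_analyzer)
print(repeated)
```

If only one analyzer survives in practice, the multiple fields option later in this article gives each language its own key and avoids the collision.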

Query:

{
  "$search": {
    "index": "recipes",
    "text": {
      "query": "Fish and Chips",
      "path": "name"
    }
  }
}

Pros:

One single index definition.
Don’t need to specify index name or path based on user’s language.
Can support multiple languages in a single query.

Cons:

As more fields get added, the index definition needs to change.
Index definition payload is potentially quite large (static field mapping per language).
Indexing every field with analyzers for irrelevant languages makes the index larger than necessary.

2. Multiple collections

We have one collection and one index per language, which allows us to isolate the different recipe languages. This could be useful if we have more recipes in some languages than others, at the expense of maintaining many collections and indexes.

Documents:

recipes_jp:
[{"name":"すし"}]

recipes_en:
[{"name":"Fish and Chips"}]

recipes_de:
[{"name":"Käsespätzle"}]

Index:

{
  "name":"recipes_jp",
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {
        "type": "string",
        "analyzer": "lucene.kuromoji"
      }
    }
  }
}

{
  "name":"recipes_en",
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {
        "type": "string",
        "analyzer": "lucene.english"
      }
    }
  }
}

{
  "name":"recipes_de",
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {
        "type": "string",
        "analyzer": "lucene.german"
      }
    }
  }
}
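
Because the three definitions differ only in the index name and the analyzer, they can be generated from one template instead of being copied by hand. The following is a minimal sketch of that idea; the `ANALYZERS` table and the `build_index_definition` helper are hypothetical names introduced for illustration, and the code only builds the definitions as Python dictionaries without creating anything in Atlas.

```python
import json

# Hypothetical lookup table: collection suffix -> Lucene language analyzer,
# copied from the three index definitions above.
ANALYZERS = {
    "jp": "lucene.kuromoji",
    "en": "lucene.english",
    "de": "lucene.german",
}

def build_index_definition(lang: str) -> dict:
    """Return the index definition for the recipes_<lang> collection."""
    return {
        "name": f"recipes_{lang}",
        "mappings": {
            "dynamic": False,
            "fields": {
                "name": {"type": "string", "analyzer": ANALYZERS[lang]},
            },
        },
    }

# One definition per language, keyed by index name.
definitions = {f"recipes_{lang}": build_index_definition(lang) for lang in ANALYZERS}

print(json.dumps(definitions["recipes_de"], indent=2))
```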

Query:

{
  "$search": {
    "index": "recipes_jp",
    "text": {
      "query": "すし",
      "path": "name"
    }
  }
}
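
With this model, the application has to pick the collection and index from the user's language before it can run the query. A rough sketch of that routing step is shown below; `SUPPORTED_LANGUAGES` and `build_search` are hypothetical names, and the sketch assumes, as in the example above, that each collection and its index share the name recipes_<language>.

```python
# Language codes used as collection/index suffixes in the example above.
SUPPORTED_LANGUAGES = {"jp", "en", "de"}

def build_search(lang: str, query: str):
    """Return (collection_name, pipeline) for a user's language and query."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {lang!r}")
    target = f"recipes_{lang}"  # assumed to name both the collection and the index
    pipeline = [
        {
            "$search": {
                "index": target,
                "text": {"query": query, "path": "name"},
            }
        }
    ]
    return target, pipeline

collection_name, pipeline = build_search("jp", "すし")
print(collection_name, pipeline)
```

With a driver such as PyMongo, the result would typically be passed along as `db[collection_name].aggregate(pipeline)`, where `db` is a database handle the application already holds.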

Pros:

Can copy the same index definition for each collection (replacing the language).
Isolate different language documents.

Cons:

Developers have to know the user's language in advance to choose the right collection and index name.
Need to potentially copy documents between collections on update.
Each index maintains its own change stream cursor, so many indexes could be expensive to maintain.

3. Multiple fields

By embedding each language in a parent field, we can co-locate the translations of each recipe in each document.

Document:

{
  "name": {
    "en":"Fish and Chips",
    "jp":"すし",
    "de":"Käsespätzle"
  }
}

Index:

{
  "name":"multi_language_names",
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {
        "fields": {
          "de": {
            "analyzer": "lucene.german",
            "type": "string"
          },
          "en": {
            "analyzer": "lucene.english",
            "type": "string"
          },
          "jp": {
            "analyzer": "lucene.kuromoji",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  }
}
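
Since the static mapping grows by one entry per language, it can help to derive it from a single table rather than editing the JSON by hand. The sketch below shows one way to do that; `ANALYZERS` and `build_multi_field_index` are hypothetical names used only for illustration, and the output is a plain dictionary that mirrors the definition above.

```python
import json

# Hypothetical lookup table: subfield name -> Lucene language analyzer,
# copied from the index definition above.
ANALYZERS = {
    "de": "lucene.german",
    "en": "lucene.english",
    "jp": "lucene.kuromoji",
}

def build_multi_field_index(analyzers: dict) -> dict:
    """Build a document-type mapping with one string subfield per language."""
    language_fields = {
        lang: {"analyzer": analyzer, "type": "string"}
        for lang, analyzer in analyzers.items()
    }
    return {
        "name": "multi_language_names",
        "mappings": {
            "dynamic": False,
            "fields": {
                "name": {"fields": language_fields, "type": "document"},
            },
        },
    }

index_definition = build_multi_field_index(ANALYZERS)
print(json.dumps(index_definition, indent=2))
```

Supporting another language then becomes a one-line change to `ANALYZERS`, followed by an index update.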

Query:

{
  "$search": {
    "index": "multi_language_names",
    "text": {
      "query": "Fish and Chips",
      "path": "name.en"
    }
  }
}
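
The path in this query depends on the user's language, which is where the extra query complexity comes from. As a minimal sketch, the hypothetical helper below builds the `$search` stage for one language or for several at once; it relies on the text operator accepting either a single path or an array of paths, and it assumes the subfields are named en, jp, and de as in the document above.

```python
# Subfield names used under "name" in the example document above.
LANGUAGES = ("en", "jp", "de")

def build_search_stage(query: str, langs) -> dict:
    """Build a $search stage over name.<lang> for each requested language."""
    langs = list(langs)
    unknown = [lang for lang in langs if lang not in LANGUAGES]
    if not langs or unknown:
        raise ValueError(f"expected one or more of {LANGUAGES}, got {langs}")
    paths = [f"name.{lang}" for lang in langs]
    # A single language keeps the plain string form used in the query above.
    path = paths[0] if len(paths) == 1 else paths
    return {
        "$search": {
            "index": "multi_language_names",
            "text": {"query": query, "path": path},
        }
    }

single = build_search_stage("Fish and Chips", ["en"])
multi = build_search_stage("Käsespätzle", ["de", "en"])
print(single)
print(multi)
```

Searching several paths at once applies each subfield's own analyzer to the query text, so a single request can cover users whose language is not known in advance.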

Pros:

Easier to manage documents.
Index definition is sparse.

Cons:

Index definition payload is potentially quite large (static field mapping per language).
More complex query and UX.
