MongoDB Developer
Data Modeling and Schema Design for Atlas Search

Erik Hatcher23 min read • Published Sep 04, 2024 • Updated Sep 04, 2024
Data modeling in MongoDB is the art and science of organizing data within a database to represent entities, attributes, and relationships to optimally handle business objectives.
There are three types of data models: conceptual, logical, and physical. Conceptually, for search, we have content and it needs to be findable and rankable. Logically, content is generally segregated by types of data (e.g., books and library patron users are different types of entities in the system), each type having its own set of attributes and relationships.
The most primal level of data modeling is where the pedal meets the metal, almost literally — at the physical level. The physical model consists of a schema design, laying out how data is structured into databases, collections, shards, fields, field types, and indexes.
The structural and type flexibility of MongoDB’s document model leaves the exercise of schema design to the developer to create the best layout for the workloads given physical, hardware, and software constraints. Design patterns have emerged by distilling and abstracting successful techniques for dealing with various common needs. And, as with any software, there are system constraints within which search operates.
This document serves as a blueprint for designing schemas for these common usage patterns, which leverage the power of the underlying search system and work within the boundaries imposed by the hardware and software environments.

Use case workloads

Age-old data modeling adage: “The optimal grouping of objects into documents is determined by the workload.”
Our first stop in schema design for search is understanding the workloads that our application will demand of Atlas Search. Here are the common use cases, given a search request:
  • Paged search results: return the top (10 or so) matching documents, in sorted or relevancy ranked order, with or without facets
  • Process all matching documents: docs may go through successive pipeline stages that need the entire set of results to continue, such as $group and $sort
  • Return only facet counts, with no actual matching documents
  • Relevant as-you-type suggestions
  • Replace $match with $search to index and query on many more fields than can be indexed with database b-tree indexes.
  • “Find anything anywhere” - best practices; compared to $match and $elemMatch

Physical constraints

Wise modeling master: “We model to face and answer to constraints.”
The architecture of Atlas Search frames what’s possible. Given the way Atlas Search is built, here’s a list of the main physical parameters that factor into schema design decisions:
  • The $search stage queries a single Atlas Search index.
  • An Atlas Search index is tied to a single Atlas collection.
  • A mapping defines which fields are to be indexed in what manner, mapped by field name and data type.
  • Search criteria are used to score and return matching collection documents.
It’s worth calling out that these particular architectural decisions are subject to change as Atlas Search progresses. There are already several works in progress behind the scenes of the product that alter, and relax, some of these architectural decisions, opening the door to some interesting ways of doing things that aren’t possible currently. These patterns will change as the system’s physical capabilities improve. As they say, we’re working on it… always!
The preceding list of architecturally controlled physical operating parameters fit within the aggregation pipeline framework as a $search stage, which makes calls to a search service built on top of the powerful search engine, Lucene. Let’s look at both ends of the architecture, taking into consideration the typical workloads demanded of $search and the constraints it works within. First, let’s see how $search fits into the aggregation pipeline. Then, we’ll look at how the underlying index structure operates at the lowest level of Atlas Search.

Top down — how $search fits into an aggregation pipeline

The $search stage must be, because of pragmatic choices made during implementation, the first stage in the aggregation pipeline (or of a $unionWith sub-pipeline). There's no input set of documents to this stage. The output of this stage consists of an ordered list of documents and some additional search metadata. The output of the $search stage can be further aggregated using any of the available stages that take a list of documents as input, such as $limit, $skip, $project, $addFields, $facet, $match, $group, $lookup, and others, though post-search aggregating stages can have a drastic effect on performance, as explained later.
What about $searchMeta? This stage performs a search, but rather than return the matching documents, it returns search result metadata such as the total count of matching documents and facets, if they were requested. This stage is handy if all you need is this metadata, whereas $search returns all of this same metadata (as $$SEARCH_META) along with the matching documents. To use $search or $searchMeta: If you want matching documents, with or without facets, use $search. If you only need the total count and/or facets, use $searchMeta.
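The distinction can be sketched in mongosh. This is illustrative only; it assumes a movies collection with a "title" string field and a "genres" field mapped as stringFacet in an index named "default":

```javascript
// Counts and facets only — no documents are returned:
db.movies.aggregate([
  {
    $searchMeta: {
      index: "default",
      facet: {
        operator: { text: { query: "adventure", path: "title" } },
        facets: {
          genresFacet: { type: "string", path: "genres" }
        }
      }
    }
  }
]);

// Documents plus the same metadata, exposed via $$SEARCH_META:
db.movies.aggregate([
  { $search: { index: "default", text: { query: "adventure", path: "title" } } },
  { $limit: 10 },
  { $project: { title: 1, meta: "$$SEARCH_META" } }
]);
```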
When $search executes from the mongod (mongo “database”) node running the aggregation pipeline stages, it makes a remote call to a Java-based mongot (mongo “text”) process. The mongot process interfaces with the Lucene APIs for indexing, searching, ranking, highlighting, and faceting.
Important things to note as it relates to schema design: Matching collection documents are returned, in relevancy ranked order. While a search may result in matching thousands, millions, or even billions of documents, the underlying search engine is designed to match, rank, and return the documents quickly, providing the best matches first. The mongot service provides results in batches to the calling aggregation pipeline and will get invoked until the calling pipeline has exhausted the cursor and processed all of the matches or short-circuited from a stage like $limit.
Warning: Running aggregation pipelines in the Atlas UI or Compass will apply an implicit $limit to show only the first several resultant documents. Searches are quick. The same aggregation pipeline, if naively run from application code using a driver, may pull the entire cursor's worth of documents into an array data structure. Retrieving all search results into an array can be quite time- and memory-consuming. It's usually the desired pattern to $limit search results and paginate through them as needed. If all goes well, the best are at the top, so users won't need to see many. :)
These working conditions of the $search stage inform some design decisions: that it must be the first stage in the pipeline, that it returns documents in relevancy order, that it crosses process boundaries returning batches of matches, and unless otherwise specifically limited all matching search results will, eventually, be returned. That last point is an important one — the number of results fed through the pipeline after $search has a direct and increasingly dramatic impact on the performance. For the best result, design for limited results, leaning on relevancy tuning to get the best first. Pagination through search results is efficient and quick when using the searchAfter token mechanism.
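The searchAfter mechanism can be sketched as follows (a mongosh sketch, assuming a movies collection; the index name and fields are illustrative). The first page projects a searchSequenceToken per document, and the last token is handed back to resume exactly where the previous page left off:

```javascript
// Page 1: fetch the top 10 and capture each document's pagination token.
const page1 = db.movies.aggregate([
  {
    $search: {
      index: "default",
      text: { query: "adventure", path: "title" },
      sort: { score: { $meta: "searchScore" } }
    }
  },
  { $limit: 10 },
  { $project: { title: 1, token: { $meta: "searchSequenceToken" } } }
]).toArray();

const lastToken = page1[page1.length - 1].token;

// Page 2: resume immediately after the last document of page 1.
const page2 = db.movies.aggregate([
  {
    $search: {
      index: "default",
      text: { query: "adventure", path: "title" },
      sort: { score: { $meta: "searchScore" } },
      searchAfter: lastToken
    }
  },
  { $limit: 10 },
  { $project: { title: 1, token: { $meta: "searchSequenceToken" } } }
]).toArray();
```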
Scalability of the search process can be controlled by adding and increasing the use of Search Nodes. You can set specific nodes in your cluster to run only the Atlas Search mongot process. When the Atlas Search mongot process is run separately, availability improves and workloads are balanced. To learn more, see the Search Nodes Architecture documentation.

Bottom up: Lucene and mongot

At the core of Atlas Search lives Lucene, a mature, powerful, open source search engine library. Lucene's solid abstractions, as well as its limitations, shine through the calling layers.
“Index objects” are the individual units of data added to the Lucene index (Lucene also calls them “documents,” but they are referred to as “index objects” here to avoid terminology confusion, since they aren't necessarily one-to-one with collection documents). There is one index object per database collection document, plus one for each mapped embedded document. Embedded documents are covered in detail below. Because each mapped embedded document adds an index object, nested sub-documents increase the number of indexed Lucene objects, which must be considered for sizing and, depending on scale, could impact the physical model's sharding design. Lucene has a two-billion limit on index objects; going beyond 2B requires sharding.

Indexing

After an index configuration is created/updated, all documents in the collection are (re)indexed. All fields of each document are available for indexing, given the index configuration mapping rules. The physical storage space taken by the index structure depends entirely on the index configuration.

Searching

The mongot process powers the $search stage, handling successive requests to pull search results in batches. These remote calls for batches of results shape what's performantly possible. Lucene is optimized to serve and page through the most relevant or top-sorted documents quickly, satisfying the most common search use case, where users are expected to find what they are looking for in those top results. Queries often match vastly more documents than would be presented to a user at one time, and only the top ones may be of interest.
The successive batches of results requested by the $search stage allow for the actual documents returned to be cut off at a desired limit. Regardless of how many matching results are actually brought back into the aggregation pipeline, the exact count of matching documents can be provided, or a lower bound estimate for better performance, so the user can be presented with the magnitude of results available.
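The count behavior is controlled by the count option of $search. A mongosh sketch (index name and field are illustrative): "total" asks for an exact count, while "lowerBound" is cheaper and only guarantees accuracy up to the given threshold:

```javascript
db.movies.aggregate([
  {
    $search: {
      index: "default",
      text: { query: "adventure", path: "title" },
      // Cheaper than { type: "total" }; exact only up to 1000 matches.
      count: { type: "lowerBound", threshold: 1000 }
    }
  },
  { $limit: 10 },
  { $project: { title: 1, count: "$$SEARCH_META.count" } }
]);
```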

Summarizing the physical constraints

The key performance tips and system limitations are detailed on the Atlas Search Index Performance page. Here are the main physical constraint highlights as they pertain to designing a schema for search use cases:
  • Database and search server are separate processes, potentially on separate nodes for more scalability.
  • Batching a large search result set across processes and the network can be a major bottleneck.
  • 2B index objects per Atlas Search index; index objects consist of one for each document, and one for each explicitly mapped embedded document.

Schema considerations

Atlas Search maps document fields to indexes on Lucene index objects by field name and the data type. This direct connection to field name and type dictates what’s searchable and in what way, and is a crucial consideration in designing a schema for a search workload.
Consistent document structure matters. Lucene indexes fields based on the field names, types, and structure of each collection document. It’s important to standardize document structure in Atlas Search backed collections:
  • _id: If the collection is merged/aggregated specifically for search from data structured differently in other collections, stable identifiers matter for updates.
  • Do not mix field types. For a given field name, its type should remain the same in all documents that have that field. Example: If a price field is a floating-point number (price: 37.99) in one document, it should not be a string (price: "4.95") in another.
  • If you plan to search or facet on any aspect of nested documents, a consistent layout is crucial. Index mappings are based on field name and structural placement.
  • Field name is the key. Atlas Search architecture works best with a simple field_name=value structure, as index mapping configuration is first driven by the name of a field, and the internal Lucene index data structures are founded upon that field name.
Further, if field types are inconsistent, each will be indexed based on the field mapping setting for that particular type. Queries and faceting are also type-specific. In the above price example, a numeric range query would only be able to match the numeric price value, and string faceting would only show the string price value.

Field mappings

Whether and how a field is indexed depends on the index configuration mapping.
The logic of whether a field is to be indexed follows this flow:
  • An explicit field name mapping takes precedence.
  • If dynamic field mapping is enabled, and the field is not explicitly mapped, it will be mapped using the default rules for that field's data type.
Dynamic field mapping can be enabled at a document's top level downward, or any specific sub-document downward. Attributes on nested documents can be handled with a sub-document level mapping to encapsulate a section or all Atlas Search-able fields for a document.
Field mapping details are officially documented in the Define Field Mappings doc.
These are the field types supported by Atlas Search alongside the search operators available for those types (* dynamically mapped fields):
  • boolean*: equals/in
  • date*: equals/in/near/range
  • number*: equals/in/near/range
  • objectId*: equals/in
  • string*: moreLikeThis/phrase/regex/text/wildcard
  • document*: maps nested fields to parent doc
  • embeddedDocument: maps to separate child index object
  • token: equals/in/range
  • geo: geoShape/geoWithin/near
  • autocomplete
  • Faceting: stringFacet, dateFacet, numberFacet
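Putting the mapping rules together, here's a sketch of an index definition that mixes dynamic mapping with explicit, multi-type field mappings. The field names (title, genres, price) are illustrative, not from the article:

```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "title": [
        { "type": "string" },
        { "type": "autocomplete" }
      ],
      "genres": [
        { "type": "token" },
        { "type": "stringFacet" }
      ],
      "price": [
        { "type": "number" },
        { "type": "numberFacet" }
      ]
    }
  }
}
```

A field mapped to an array of types, as above, is indexed each way, so the same field can serve both exact matching and faceting, or both full-text and as-you-type querying.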

Basic data types

Indexing behavior is configurable within the options for the data type of that particular field instance. Numeric field types can be indexed for numeric equals, in, and range findability and/or range bucket faceting (counts per numeric range over a result set), and likewise for date field types.
Boolean values are indexed for equals and in findability.

Strings

String fields are special, as they are analyzed and broken down into terms that are indexed into an inverted index data structure designed for word/token level matching. Query ability with fast, relevancy ranked results on a pile of full text documents stems from wisely tokenizing and further analyzing and organizing string text into an inverted index.

Arrays

The data type of the values within an array determines how the field is mapped and indexed; arrays themselves aren't their own field type. Nested documents are, by default, mapped as document type fields, and any fields within them are flattened and indexed as if they were fields on the main document with dotted-name syntax, bringing all values of fields at the same relative path together into an array of flattened values. There are good examples in the How to Index the Elements of an Array documentation. Faceting on arrays of numbers and dates is currently not supported, but all search operators for those types are.
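The default flattening behavior can be modeled in plain JavaScript. This is a simplified sketch of the idea, not Atlas Search's actual implementation:

```javascript
// Collect every value reachable at each dotted path, mimicking how the
// default document-type mapping flattens nested structure for indexing.
function flattenForIndex(doc, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(doc)) {
    const path = prefix ? `${prefix}.${key}` : key;
    const values = Array.isArray(value) ? value : [value];
    for (const v of values) {
      if (v !== null && typeof v === "object" && !Array.isArray(v)) {
        flattenForIndex(v, path, out); // recurse into sub-documents
      } else {
        (out[path] ??= []).push(v);    // gather values at the same path
      }
    }
  }
  return out;
}

const mom = {
  name: "Mom",
  age: 37,
  children: [
    { name: "Kid1", age: 5 },
    { name: "Kid2", age: 11 }
  ]
};

console.log(flattenForIndex(mom));
// The children's names and ages each collapse into a single dotted-path
// array: "children.name" -> ["Kid1", "Kid2"], "children.age" -> [5, 11].
```

Note how the pairing between each child's name and age is lost after flattening; that loss is exactly the nuance the embeddedDocuments type (below) preserves.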

Nested documents

The MongoDB mantra “data that is accessed together is stored together” builds upon the strengths of the document database, driving effective designs by nesting related and commonly accessed data into single values or arrays of attributes and sub-documents. Atlas Search best practices adjust the mantra to “data that is searched together is indexed together.” By the nature of its architecture, an Atlas Search index indexes all the mapped fields of a document together in a single structure so that any number of the indexed fields can be used in a single query for matching and ranking. There’s one index object per collection document, at least. Nested document structures present an interesting search challenge — are these structures to be intricately and independently matchable, or perhaps more straightforwardly considered a flat set of attributes used to match the main document?
It’s important to remember that the results of a $search are top-level documents, not individual nested documents on their own. Top-level documents returned include the full document, including the nested structure; nested documents are not returned independently, or separately, from their top-level parents.
There are two ways to handle indexing and searching content within nested documents: document and embeddedDocuments.
  • Flattened/document type: This is the default with dynamic mapping. It flattens sub-document structures into arrays of values on the root document. Queries match against these flattened arrays and lose the nuance of matching within a single nested document.
  • embeddedDocuments: You must configure this; it doesn't come by default. It indexes each specified embedded document as a separate Lucene index object tied to the parent index object. It allows for embedded-document-specific queries, similar to the database's $elemMatch query operator.
Nested document structures can be several layers deep, and each level can be mapped in Atlas Search as either document or embeddedDocuments — making it possible to index and intricately query multiple nested levels deep. Just because you can map things crazily doesn’t mean you should. Simpler and flatter is generally better.
Let’s illustrate the pros and cons of the various ways to index nested structure with an example Parents collection with a single “Mom” document, which has two kids. Kid1 is 5 years old. Kid2 is 11 years old:
{
  "name": "Mom",
  "age": 37,
  "children": [
    {
      "name": "Kid1",
      "age": 5
    },
    {
      "name": "Kid2",
      "age": 11
    }
  ]
}

Flattened index objects

Use dynamic mapping, or an explicit document mapping, to flatten the example document into a single index object in Lucene.
This flat structure allows us to query for parents who have a child aged 5 and who also have a child named “Kid2”, matching this “Mom” document. However, we cannot query for parents who have a 5-year-old child who is also named “Kid2”.
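A sketch of such a cross-matching query (mongosh, assuming a dynamically mapped parents collection): Both clauses are satisfied, so “Mom” matches, even though no single child is a 5-year-old named “Kid2”:

```javascript
db.parents.aggregate([
  {
    $search: {
      compound: {
        must: [
          // Matches via Kid1's flattened age...
          { equals: { path: "children.age", value: 5 } },
          // ...and via Kid2's flattened name.
          { text: { query: "Kid2", path: "children.name" } }
        ]
      }
    }
  }
]);
```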
Nested structure as flattened index object

Embedded index objects

This same example document, with children mapped as an embeddedDocuments type, creates three index objects: one for each of the two mapped embedded documents, plus one for the main document. Here's the mapping definition:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "children": [
        {
          "type": "embeddedDocuments",
          "dynamic": true
        }
      ]
    }
  }
}
embeddedDocuments effect on indexed objects
In the index structure, these documents are co-located as a physical contiguous block. Any update to the document, in any piece of its structure, reindexes that document, which in turn reindexes the entire block of mapped embedded documents.
Operationally, an update to an index object in Lucene is atomically a delete and re-add. Reindexing a document with mapped embedded documents causes that entire block of index objects to be deleted and then re-added. Constantly updating documents with mapped embeddedDocuments thrashes the Lucene indexed objects structure; the more embedded documents, the heavier it is to add, update, or delete a collection document.
Searches that query embedded documents are able to query each independent embedded index object separately and factor its score into the top-level parent's computed relevancy ranking score.
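With the embeddedDocuments mapping above, the embeddedDocument operator constrains all clauses to a single child. A mongosh sketch, reusing the “Mom” example: This finds only parents with one child who is both 5 and named “Kid2” (so “Mom” would not match):

```javascript
db.parents.aggregate([
  {
    $search: {
      embeddedDocument: {
        path: "children",
        operator: {
          compound: {
            must: [
              // Both clauses must hold for the SAME child index object.
              { equals: { path: "children.age", value: 5 } },
              { text: { query: "Kid2", path: "children.name" } }
            ]
          }
        }
      }
    }
  }
]);
```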
To embed or not? It depends. Mapped embedded documents come at a cost and should only be used if the nuanced matching is needed and the explosion in the number of index objects is manageable. Even if that type of matching is needed, another deciding factor is scale: A single Lucene index can contain a maximum of 2B index objects. If you're pushing up to that scale, sharding is the next step. A collection of 200M database documents, each with 10 mapped embedded documents, explodes to 2.2B (!) index objects ((10+1) * 200M), blowing up beyond what may be feasible or affordable.
Does the use case require matching across exact sub-document structure? Perhaps those sub-documents deserve to be first class main documents in a search-based collection rather than contorting a nested structure to achieve something that fits best with an unwound flatter structure. When considering embedded document mapping for search use cases, also consider turning the document structure inside-out so that the entities to be returned from a search request are first-class documents.
The search use cases that Atlas Search is asked to support need to be evaluated within the identified physical constraints to produce a suitable schema design. Atlas Search indexes content based directly on each field's name and data type and returns the main matching documents from search requests. Modeling documents to use case needs is straightforward: What you index is what you get. Index “movies,” and $search returns movies; index products, and $search returns products. The documents in the collection are what is returned and thus conceptually represent the types and granularity of “things” the user is being provided in response to a query.
Let’s use a movies search engine example: Movies themselves make sense to be a first class search result item. What about cast members? Are they simply attributes and a means to find a movie? Or are they first class entities that warrant being an independent item in search results? First class search result items should be modeled as documents in a collection. How about genre? Would the users of a movie search system want to get to Documentaries as an entity unto itself, by typing only “doc” and it being the first suggestion?

Build with the end experience in mind

The search experience delivered by your application is a crucial part of the usefulness and satisfaction provided to your users; the lack of a quality search feature will deter them. Delightful search experiences provide facilities such as as-you-type suggestions, faceted navigation, and keyword highlighting, all while factoring in users' explicit and implicit preferences by filtering out or boosting items. The quality of your application's search experience consists of much more than just a list of documents for a search query. It could mean incorporating search into subtle areas such as “more like this” sections of existing pages, providing context-sensitive filtering and navigation of organized content, and other UX enhancements. The takeaway is this: Model and build to this experience.
Some questions to ask yourself as you begin the modeling process:
  • What types of queries do you expect?
  • How is your content best organized for navigation and searchability?
  • When search results are presented, how can the app best help users narrow or expand their search?

Useful search design patterns

Don’t look at patterns to find a problem to solve, but rather find the patterns that best fit the challenges you’re facing. Several of the tried and true MongoDB design patterns work well for search use cases.
  • Computed: The Computed Pattern fits when enriched data can be calculated in advance and leveraged for straightforward matching or relevancy weighting.
  • Entity: The domain Entity Pattern for search fits cross-collection or nested entity searching, where findability is key. This specialized collection and Atlas Search index structure address the “find anything anywhere” and “relevant as-you-type suggestions” use cases. This search-specific pattern is discussed in more detail in a later section.
  • Extended Reference: The Extended Reference Pattern fits when data needs to be pulled together to be searched together, while also living comfortably in other collections and structures for other important workloads.
  • Polymorphic: In the Polymorphic Pattern, documents represent typed entities, where the types form an object model hierarchy. Handily, polymorphic sub-documents can have independent mapping configurations.
  • Single Collection: If you're leveraging the Single-Collection Pattern, you're in luck, as this pattern most directly fits Atlas Search: one place to store and find it all. See also: the domain Entity Pattern for search.
  • Tree: When you've got hierarchical data within search use cases, the Tree Pattern provides some ideas. Collections of documents are flat, and a flat set of documents is returned from $search, but a field that represents a hierarchy as a simple “/path/string” can bring depth to an otherwise flat system. A “/path/string” can be searched by prefix (e.g., wildcard on “/path” to get documents from there and below).
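The Tree Pattern's prefix search can be sketched in mongosh. The collection, field name ("category_path"), and path values are illustrative assumptions:

```javascript
// Select a whole subtree by prefix-matching the materialized path.
// allowAnalyzedField is needed if the field uses a string mapping;
// alternatively, map it as "token" to match the path literally.
db.products.aggregate([
  {
    $search: {
      wildcard: {
        path: "category_path",
        query: "/electronics/cameras/*",
        allowAnalyzedField: true
      }
    }
  }
]);
```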
The Attribute Pattern represents key/value pairs as { k: "key", v: "value" } sub-documents, avoiding a potentially unlimited number of field names. This clashes with Atlas Search: Keys become values of a k field rather than field names, and since Atlas Search uses field names for mapping, a specific attribute key cannot be mapped. Even more insidious, the v values can be of different data types, and all of them get blended together in a single index, even if all values for a particular key are of the same type. The embeddedDocuments field type can help with conservative use of the Attribute Pattern for some capabilities, but that comes with other considerations (see the nested documents section above). The best course of action with attributes that need to be searched? Model them as straightforward field_name: value fields on the documents, where they can be handled in a type-sensitive manner. All values of a given field are then indexed together and separately from other fields.
The Entity Pattern models all key domain entities in a way that supports the “find anything anywhere” and “relevant as you type suggestions” use cases. Entities are represented as first class documents in a specialized collection and Atlas Search index configuration tuned for matching domain entities in many ways and allowing for customized ranking factors. All of your domain entities can be indexed and findable in a myriad of ways using this pattern. Of course, scale matters, so this isn’t a practical pattern for billions of things, though many hundreds of millions are generally a pretty tractable scale. How many unique entities do you have in your system? A quick ballpark estimate for a realistic product catalog system, representing products, customers, and orders could have, say, 10k products across 10 categories, serving around 1M customers who each could have a billing and shipping address within any of the 50 United States. Across around 200k cities, that would be 10k products + 1M customers + 2M addresses + 50 states + 200k cities = ~3.2M entities which is no big deal scale-wise.
If your situation consists of only a single collection, with only one or perhaps more entity types, and they are modeled as first class documents already, this pattern is effectively what you’ve already got — all domain entities represented as first class documents in a single collection. A collection of all domain entities blended together, typed, demonstrates the Single-Collection Pattern.
It's often the case that your main collection represents one type of entity as documents, and other domain entities as metadata fields or nested documents. For example, let’s take the sample movies data available within Atlas: As a user types, movie titles certainly should be suggested. But what about cast member names? Can I find movies starring Keanu Reeves by typing only "kea"? What about documentaries by only typing "doc"? Both cast (actor name) and genre fields (fields, not documents) are, for a movie search application, truly first class domain entity objects, and in a findability sense, deserve to be first class documents in a collection backed by Atlas Search.
It’s a simple model with the following basic schema:
  • _id: unique id for this entity, recommended in the form <type>-<natural id>
  • type: entity/object type, e.g., movie, brand, person, product, category, saved_search
  • name: the name or title of the entity, which would generally be unique per type
It’s important that entity documents have stable, unique identifiers, as the entities will be regularly refreshed from their source of truth collection. Assigning a type to each entity allows for filtering, grouping, boosting, or faceting by type.
Additionally, modeling entities directly as individual documents allows each to carry optional metadata fields to assist in ranking, displaying, filtering, or grouping them. In the movies example, cast member entity documents could carry a computed number of movies the actor has been in, and the average rating of all those movies. These types of computed values, aggregated during the entities collection sync process, exhibit the Computed Pattern.
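Illustrative entity documents following this schema might look like the following. The computed metadata fields (weight, num_movies, sum_ratings) are assumed names for the kinds of ranking values described above:

```json
[
  {
    "_id": "genre-Documentary",
    "type": "genre",
    "name": "Documentary",
    "weight": 1834
  },
  {
    "_id": "person-Keanu Reeves",
    "type": "person",
    "name": "Keanu Reeves",
    "num_movies": 47,
    "sum_ratings": 312.4
  },
  {
    "_id": "movie-The Matrix",
    "type": "movie",
    "name": "The Matrix"
  }
]
```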
At the heart of this solution, the straightforward document model feeds the name field through a sophisticated index configuration, which analyzes it in numerous ways, enabling incredible query and ranking power.
The Entity Pattern for search, which debuted in the Relevant As-You-Type Suggestions Search Solution (see its data modeling approach section), is specialized for the as-you-type/autocomplete/typeahead use case.

Entity collection uses

  • As-you-type suggest
  • Type-specific lookups (dropdown of Categories, States, Names)
  • General index of All (your) Things
This collection can be used to initiate lookups, limiting results to only a few (or even just one), making the lookup of the canonical, more detailed data straightforward and performant.
Basic search operations on an entities collection make these use cases straightforward and performant:
  • Filtering by type
    • Dropdown entry for cities (“type” equals “city”)
    • Saved searches: Log users’ “saved searches” as entity type “saved_search” with an additional “username” field on each of those types of documents. As a user types, their saved searches can be displayed. No additional collection would be needed — this could be the source of truth collection for “saved searches.” For this type of entity, it would also be wise to add a timestamp, so they could be sorted by recency.
  • Faceting by type
    • As a user types, entities are returned, and additionally, the types of entities found and their counts within the matched result set can be returned. These counts help users see the bigger context of the types of things being matched and facilitate filtered navigation by type from there.
    • With no filter criteria, a $searchMeta faceting by type will quickly give the full set of entity types available. These global counts can be useful, again, for navigation of your “domain entity space” and also for diagnostic purposes, to ensure the number of entities of each type matches the counts from the source-of-truth data.
  • Boosting by type
    • Perhaps certain entity types make sense to boost over other entity types in the results. As a user types in a movie database search, genre entity types and movie entity types could be boosted over cast member name types, so more commonly searched for entities surface ahead in a list of suggestions.
    • Similarly, a user's saved search entities could be prioritized over product or category types of entities to increase the value to your customers' experience.
  • Boosting by entity weights
    • In a movie database search, boost cast members by a factor related to the ratings of the movies they've been in. The average rating of all of an actor's movies is not the best weight to use, as it prioritizes an actor who's been in only one highly rated movie ahead of actors who have been in lots of movies, most of which have rated well. But there are ways to give seasoned actors a boost, such as using the sum of all their movie ratings rather than an average: The more good movies they've been in, the higher their boost. Boost factors based on the entity data itself provide a flexible way to add “smart” weightings to results based on your business goals.
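Several of these operations can combine in a single suggestion query. A mongosh sketch against the entities collection, assuming "name" is mapped as autocomplete and "type" as token (so equals can match it):

```javascript
db.entities.aggregate([
  {
    $search: {
      compound: {
        must: [
          // Match entities as the user types "doc"...
          { autocomplete: { query: "doc", path: "name" } }
        ],
        should: [
          // ...boosting some entity types ahead of others.
          { equals: { path: "type", value: "genre",
                      score: { boost: { value: 3 } } } },
          { equals: { path: "type", value: "movie",
                      score: { boost: { value: 2 } } } }
        ]
      }
    }
  },
  { $limit: 10 },
  { $project: { type: 1, name: 1, score: { $meta: "searchScore" } } }
]);
```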

Populating an entity collection from existing data

You’ve already got data, and it’s well modeled for other use cases, nested within sub-structures of related documents. However, that structure doesn’t lend itself to making the nested entities first-class findable results. Leave the data where it fits the other use cases best, keeping the source of truth right where it already is. The unique entities nested within documents in other collections can generally be extracted and grouped uniquely, computing any useful values along the way. Keeping the entities collection synchronized does require consideration and a process to implement.
A great example of an export/merge process comes from the movies collection data, where genre is an attribute of a movie yet deserves to be a first-class domain entity. From the movies collection, we unwind the genres field from its nested per-movie value, then group genres into unique values, counting the number of movies for each genre along the way. Then, we format the unique genres into the entities collection schema (_id, type, name, etc.), where the _id is unique per genre and made to fit uniquely within the _id space of a blended collection of various entity types. And finally, these genre entity documents are merged into the catch-all entities collection. The $merge stage will overwrite documents with the same _id, making it important to uniquely and stably identify entities so they can easily be updated.
The following code snippet can be seen in full context on GitHub.
movies.aggregate([
  {
    $unwind: "$genres",
  },
  {
    $group: {
      _id: "$genres",
      num_movies: { $count: {} },
    },
  },
  {
    $project: {
      _id: { $concat: ["genre", "-", "$_id"] },
      type: "genre",
      name: "$_id",
      weight: "$num_movies"
    }
  },
  { $merge: { into: "entities" } }
]);
Each entity type could have its own synchronization process and type-specific additional fields.
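For instance, a cast-member synchronization could weight each actor by the sum of their movies’ ratings, per the boosting-by-entity-weights idea above. The following is a hypothetical sketch, not part of the linked example: the `cast` and `imdb.rating` field names are assumptions following the sample_mflix movies collection.

```javascript
// Sketch: sync cast-member entities into the entities collection, weighting
// each actor by the *sum* (not average) of their movies' IMDb ratings, so
// prolific, well-rated actors rank higher.
// Assumes sample_mflix-style "cast" and "imdb.rating" fields.
const castEntityPipeline = [
  { $unwind: "$cast" },
  {
    $group: {
      _id: "$cast",
      weight: { $sum: { $ifNull: ["$imdb.rating", 0] } },
    },
  },
  {
    $project: {
      // stable, type-prefixed _id so $merge overwrites on each re-sync
      _id: { $concat: ["cast", "-", "$_id"] },
      type: "cast",
      name: "$_id",
      weight: 1,
    },
  },
  { $merge: { into: "entities" } },
];

// Usage (mongosh): db.movies.aggregate(castEntityPipeline);
```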

Entity index configuration

The entities collection’s index configuration specifies the mappings on the type and name fields, otherwise leaving it dynamically mapped to accommodate additional custom fields as needed.
View a full example configuration for the as-you-type use case.
A few specific pieces of this configuration are worth highlighting. The entities type field is indexed two different ways: as a token, for filtering or boosting by exact value, and as a stringFacet, for faceting-by-type needs.
1"type": [
2 {
3 "type": "token"
4 },
5 {
6 "type": "stringFacet"
7 }
8]
The heart of entity findability stems from the indexing strategies applied to the name field. Atlas Search provides a powerful “multi” indexing technique for string fields, where one string value is indexed numerous ways to facilitate any number of ways to match it. Strings are analyzed, a process that creates a series of indexable units called terms. Terms represent words, character fragments, phonetics, or other lexical extractions. In the example configuration mentioned above, the name field is indexed with just the left edge (“starts with”), as English text (“search”, “searches”, “searching”, and “searched” are all indexed the same), lowercased (for case-insensitive matching), phonetically (“sounds like”), shingled (pairs of adjacent words concatenated and indexed together, good for relevancy boosting of words commonly used together), and with the default language-agnostic, word-break-delimited analysis.
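Such a multi mapping on the name field could be sketched like this. The built-in lucene.keyword and lucene.english analyzers are real; the remaining analyzer names here (myLowercaser, myEdgeGrammer, myShingler, myPhonetic) are placeholders for custom analyzer definitions like those in the full example configuration linked above:

```json
"name": {
  "type": "string",
  "multi": {
    "exact": { "type": "string", "analyzer": "lucene.keyword" },
    "english": { "type": "string", "analyzer": "lucene.english" },
    "lowercased": { "type": "string", "analyzer": "myLowercaser" },
    "edge": { "type": "string", "analyzer": "myEdgeGrammer" },
    "shingled": { "type": "string", "analyzer": "myShingler" },
    "phonetic": { "type": "string", "analyzer": "myPhonetic" }
  }
}
```

Each multi name (exact, english, lowercased, edge, shingled, phonetic) becomes addressable from query operators via the `path: { "value": "name", "multi": … }` syntax used below.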
These various ways of analyzing the name field give the text and phrase operators potent ways of matching and ranking. Here’s an example as-you-type suggest query, matching liberally on very little query text (“matr”) yet boosting the best matches to the top.
In this Atlas Search Playground example, we weigh each analyzed variant of the name field differently, then multiply the result by any weight specified on the matched entity:
[
  {
    "$search": {
      "compound": {
        "should": [
          {
            "text": {
              "query": "matr",
              "path": { "value": "name", "multi": "exact" },
              "score": { "boost": { "value": 10 } }
            }
          },
          {
            "wildcard": {
              "query": "matr*",
              "path": { "value": "name", "multi": "lowercased" },
              "allowAnalyzedField": true,
              "score": { "boost": { "value": 5 } }
            }
          },
          {
            "text": {
              "query": "matr",
              "path": { "value": "name", "multi": "shingled" },
              "score": { "boost": { "value": 7 } }
            }
          },
          {
            "phrase": {
              "query": "matr",
              "path": { "value": "name", "multi": "edge" },
              "slop": 100
            }
          },
          {
            "text": {
              "query": "matr",
              "path": { "value": "name", "multi": "english" },
              "fuzzy": { "prefixLength": 1, "maxExpansions": 10 },
              "score": { "boost": { "value": 0.7 } }
            }
          },
          {
            "text": {
              "query": "matr",
              "path": { "value": "name", "multi": "phonetic" },
              "score": { "boost": { "value": 0.5 } }
            }
          }
        ],
        "score": { "boost": { "path": "weight", "undefined": 1 } }
      },
      "scoreDetails": true,
      "highlight": {
        "path": [
          "name",
          { "value": "name", "multi": "edge" },
          { "value": "name", "multi": "lowercased" },
          { "value": "name", "multi": "english" },
          { "value": "name", "multi": "shingled" }
        ]
      }
    }
  },
  { "$limit": 10 },
  {
    "$project": {
      "name": 1,
      "type": 1,
      "score": { "$meta": "searchScore" }
    }
  }
]
Searching across multiple indexed fields, performantly, in combinations with one another is a superpower of Atlas Search — and it’s the key to nuanced and controllable result relevancy.

Type-specific metadata

To avoid field name collisions among custom, entity-type-specific fields, those fields can be nested into type-specific sub-documents, like this book sub-structure:
{
  _id: 1,
  type: 'book',
  name: 'Lucene in Action',
  book: {
    pages: 472,
    ...
  }
}
The Atlas Search index configuration of the entities collection could then add specific document-type mappings to override the default dynamic mappings in place, perhaps making the number of pages of book entities facetable:
'book': {
  type: 'document',
  fields: {
    pages: {
      type: 'numberFacet'
    }
  }
}
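With that mapping in place, book entities could be faceted into page-count buckets. This is a minimal sketch: the boundary values and the `pageRanges` facet name are illustrative, and the equals operator here relies on the token indexing of the type field shown earlier:

```javascript
// Sketch: bucket book entities by page count via the numberFacet mapping
// on book.pages. Boundaries and facet name are illustrative.
const pagesFacetPipeline = [
  {
    $searchMeta: {
      facet: {
        // match only book entities via the token-indexed "type" field
        operator: { equals: { path: "type", value: "book" } },
        facets: {
          pageRanges: {
            type: "number",
            path: "book.pages",
            boundaries: [0, 250, 500, 1000],
            default: "other",
          },
        },
      },
    },
  },
];

// Usage (mongosh): db.entities.aggregate(pagesFacetPipeline);
```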

Summary of search design best practices

  1. Begin with the end result in mind
    • What’s the UI/UX that is being powered by search?
    • What types of objects/entities are returned and displayed?
  2. Model items to be returned from search queries as first-class documents in their own collection.
  3. Same-named fields should have the same data type across all documents.
  4. “Data that is searched together, is indexed together.”
Join us in the Atlas Search community forum to further discuss data modeling and schema design for Atlas Search.