Improve Your App's Search Results with Auto-Tuning

Isa Torres, Ethan Steininger • 5 min read • Published Oct 20, 2021 • Updated Aug 14, 2024

Atlas Search


Historically, the only way to improve your app’s search relevance has been manual intervention. For example, you can introduce score boosting, which multiplies the base relevance score when a query matches particular fields. This ensures that matches in some fields weigh more than matches in others. The approach is, however, fixed by nature: the results are dynamic, but the logic itself doesn’t change.
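As a rough illustration of score boosting, the pipeline below is a minimal sketch of an Atlas Search compound query that multiplies the score of matches in one field by 3. The index name `default`, the field names `title` and `description`, and the boost value are assumptions made for this example, not part of the project.

```javascript
// Hypothetical example: index name, field names, and boost value are assumptions.
const buildBoostedSearch = (query) => [
  {
    $search: {
      index: 'default',
      compound: {
        should: [
          // matches in "title" have their score multiplied by 3
          { text: { query, path: 'title', score: { boost: { value: 3 } } } },
          // matches in "description" keep their base score
          { text: { query, path: 'description' } }
        ]
      }
    }
  }
];

console.log(JSON.stringify(buildBoostedSearch('romanian food'), null, 2));
```

However the weights are chosen, they stay the same until someone edits the query by hand.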

The following project will showcase how to leverage synonyms to create a feedback loop that is self-tuning, in order to deliver incrementally more relevant search results to your users—all without complex machine learning models!

Example

We have a food search application where a user searches for “Romanian Food.” Assuming that we’re logging every user’s clickstream data (their step-by-step interaction with our application), we can take a look at this “sequence” and compare it to other sequences that ended in a strong CTA (call-to-action): a successful checkout.

Another user searched for “German Cuisine,” and that session had a very similar clickstream sequence. We can build a script that analyzes these users’ (and other users’) clickstreams and identifies similarities. When it finds a match, the script appends the search terms to a synonyms document that contains “German,” “Romanian,” and other more common cuisines, like “Hungarian.”

Here’s a workflow of what we’re looking to accomplish:

Tutorial

Step 1: Log user’s clickstream activity

In our app tier, as events are fired, we log them to a clickstreams collection, like:

[
  {
    "session_id": "1",
    "event_id": "search_query",
    "metadata": {
      "search_value": "romanian food"
    },
    "timestamp": "1"
  },
  {
    "session_id": "1",
    "event_id": "add_to_cart",
    "product_category": "eastern european cuisine",
    "timestamp": "2"
  },
  {
    "session_id": "1",
    "event_id": "checkout",
    "timestamp": "3"
  },
  {
    "session_id": "1",
    "event_id": "payment_success",
    "timestamp": "4"
  },
  {
    "session_id": "2",
    "event_id": "search_query",
    "metadata": {
      "search_value": "hungarian food"
    },
    "timestamp": "1"
  },
  {
    "session_id": "2",
    "event_id": "add_to_cart",
    "product_category": "eastern european cuisine",
    "timestamp": "2"
  }
]

In this simplified list of events, we can conclude that {"session_id":"1"} searched for “romanian food” and went on to convert (payment_success), whereas {"session_id":"2"} searched for “hungarian food” and stalled after the add_to_cart event. You can import this data yourself using sample_data.json.
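The same conclusion can be reached programmatically. The following is a minimal sketch, assuming the events are already loaded into memory; the helper name `summarizeSessions` is made up for this illustration, and timestamps are compared as numbers.

```javascript
// The sample events from Step 1, trimmed to the fields used here.
const events = [
  { session_id: '1', event_id: 'search_query', timestamp: '1' },
  { session_id: '1', event_id: 'add_to_cart', timestamp: '2' },
  { session_id: '1', event_id: 'checkout', timestamp: '3' },
  { session_id: '1', event_id: 'payment_success', timestamp: '4' },
  { session_id: '2', event_id: 'search_query', timestamp: '1' },
  { session_id: '2', event_id: 'add_to_cart', timestamp: '2' }
];

// Hypothetical helper: reports, per session, the last event seen and
// whether the session reached the success event.
const summarizeSessions = (eventList, successEvent = 'payment_success') => {
  const summary = {};
  // copy before sorting so the caller's array is left untouched
  const ordered = [...eventList].sort((a, b) => Number(a.timestamp) - Number(b.timestamp));
  for (const event of ordered) {
    const entry = summary[event.session_id] || { lastEvent: null, converted: false };
    entry.lastEvent = event.event_id;
    entry.converted = entry.converted || event.event_id === successEvent;
    summary[event.session_id] = entry;
  }
  return summary;
};

console.log(summarizeSessions(events));
```

Session 1 ends on payment_success and is flagged as converted, while session 2 ends on add_to_cart.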

Let’s prepare the data for our search_tuner script.

Step 2: Create a view that groups by session_id, then filters on the presence of searches

By the way, it’s no problem that only some documents have a metadata field. The $cond inside $group counts only the search_query events, and the $events.metadata.search_value path in the final $set stage simply skips events that lack the field.

[
    # first we sort by timestamp to get everything in the correct sequence of events,
    # as that is what we'll be using to draw logical correlations
    {
        '$sort': {
            'timestamp': 1
        }
    },
    # next, we'll group by a unique session_id, include all the corresponding events, and
    # count the search_query events so we can filter on them in the next stage
    {
        '$group': {
            '_id': '$session_id',
            'events': {
                '$push': '$$ROOT'
            },
            'isSearchQueryPresent': {
                '$sum': {
                    '$cond': [
                        {
                            '$eq': [
                                '$event_id', 'search_query'
                            ]
                        }, 1, 0
                    ]
                }
            }
        }
    },
    # we hide session_ids where there is no search query
    {
        '$match': {
            'isSearchQueryPresent': {
                '$gte': 1
            }
        }
    },
    # the counter has done its job, so we drop it
    {
        '$unset': 'isSearchQueryPresent'
    },
    # then create a new field, an array called searchQuery, which we'll use to parse
    {
        '$set': {
            'searchQuery': '$events.metadata.search_value'
        }
    }
]
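To see what these stages do without a database, here is an in-memory approximation in plain JavaScript. It is a sketch rather than a replacement for the pipeline: the function name `groupBySessionWithSearch` is made up, and timestamps are compared numerically, which is an assumption of this example.

```javascript
// In-memory approximation of the $sort, $group, $match, and $set stages.
const groupBySessionWithSearch = (events) => {
  // $sort: order events by timestamp (numeric comparison is an assumption here)
  const ordered = [...events].sort((a, b) => Number(a.timestamp) - Number(b.timestamp));

  // $group: collect the events of each session_id
  const sessions = new Map();
  for (const event of ordered) {
    if (!sessions.has(event.session_id)) {
      sessions.set(event.session_id, { _id: event.session_id, events: [] });
    }
    sessions.get(event.session_id).events.push(event);
  }

  return [...sessions.values()]
    // $match: keep sessions with at least one search_query event
    .filter((session) => session.events.some((e) => e.event_id === 'search_query'))
    // $set: gather search_value from the events that carry a metadata field
    .map((session) => ({
      ...session,
      searchQuery: session.events
        .filter((e) => e.metadata && e.metadata.search_value !== undefined)
        .map((e) => e.metadata.search_value)
    }));
};

const sample = [
  { session_id: '1', event_id: 'add_to_cart', timestamp: '2' },
  { session_id: '1', event_id: 'search_query', metadata: { search_value: 'romanian food' }, timestamp: '1' },
  { session_id: '9', event_id: 'add_to_cart', timestamp: '1' }
];

console.log(JSON.stringify(groupBySessionWithSearch(sample), null, 2));
```

Session 9 never searched, so it is dropped. Session 1 keeps both events in timestamp order, and the event without a metadata field contributes nothing to searchQuery, which comes out as an array.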

Let’s create the view by building the query, then going into Compass and saving it as a view called group_by_session_id_and_search_query.

Here’s a simplified version of what the grouped sessions look like, in the shape that the script in Step 3 reads:

[
  {
    "session_id": "1",
    "events": [
      {
        "event_id": "search_query",
        "search_value": "romanian food"
      },
      {
        "event_id": "add_to_cart",
        "context": {
          "cuisine": "eastern european cuisine"
        }
      },
      {
        "event_id": "checkout"
      },
      {
        "event_id": "payment_success"
      }
    ],
    "searchQuery": "romanian food"
  },
  {
    "session_id": "2",
    "events": [
      {
        "event_id": "search_query",
        "search_value": "hungarian food"
      },
      {
        "event_id": "add_to_cart",
        "context": {
          "cuisine": "eastern european cuisine"
        }
      },
      {
        "event_id": "checkout"
      }
    ],
    "searchQuery": "hungarian food"
  },
  {
    "session_id": "3",
    "events": [
      {
        "event_id": "search_query",
        "search_value": "italian food"
      },
      {
        "event_id": "add_to_cart",
        "context": {
          "cuisine": "western european cuisine"
        }
      }
    ],
    "searchQuery": "italian food"
  }
]

Step 3: Build a scheduled job that compares similar clickstreams and pushes the resulting synonyms to the synonyms collection

const fs = require('fs')

// Provide a success indicator to determine which session we want to
// compare any incomplete sessions with
const successIndicator = "payment_success"

// what percentage similarity between two sets of click/event streams
// we'd accept to be determined as similar enough to produce a synonym
// relationship (declared for tuning; the comparison below does not use it yet)
const acceptedConfidence = .9

// boost the confidence score when the following values are present
// in the eventstream (declared for tuning; not used yet)
const eventBoosts = {
  // computed key, so the entry is keyed by the value "payment_success"
  [successIndicator]: .1
}

/**
 * Enrich sessions with a flattened event list to make comparison easier.
 * Determine if the session is to be considered successful based on the success indicator.
 * @param {*} eventList List of events in a session.
 * @returns {any} Calculated values used to determine if an incomplete session is considered to
 * be related to a successful session.
 */
const enrichEvents = (eventList) => {
  return {
    eventSequence: eventList.map(event => { return event.event_id }).join(';'),
    isSuccessful: eventList.some(event => { return event.event_id === successIndicator })
  }
}

/**
 * De-duplicate common tokens in two strings
 * @param {*} str1
 * @param {*} str2
 * @returns Returns an array with the provided strings with the common tokens removed
 */
const dedupTokens = (str1, str2) => {
  const splitToken = ' '
  const tokens1 = str1.split(splitToken)
  const tokens2 = str2.split(splitToken)
  const dupedTokens = tokens1.filter(token => { return tokens2.includes(token) })
  const dedupedStr1 = tokens1.filter(token => { return !dupedTokens.includes(token) })
  const dedupedStr2 = tokens2.filter(token => { return !dupedTokens.includes(token) })

  return [ dedupedStr1.join(splitToken), dedupedStr2.join(splitToken) ]
}

/**
 * Find the first existing result that already contains one of the synonyms.
 * @param {*} synonyms
 * @param {*} results
 * @returns Index of the matching result, or -1 when there is none
 */
const findMatchingIndex = (synonyms, results) => {
  return results.findIndex(result => {
    return synonyms.some(synonym => { return result.synonyms.includes(synonym) })
  })
}

/**
 * Inspect the context of two matching sessions.
 * @param {*} successfulSession
 * @param {*} incompleteSession
 * @param {*} results Synonym documents collected so far; updated in place
 */
const processMatch = (successfulSession, incompleteSession, results) => {
  console.log(`=====\nINSPECTING POTENTIAL MATCH: ${successfulSession.searchQuery} = ${incompleteSession.searchQuery}`)
  let contextMatch = true

  // At this point we can assume that the incomplete session followed the same
  // leading sequence of events, so we can use the same index when comparing events
  for(let i = 0; i < incompleteSession.events.length; i++) {
    // if we have a context, let's compare the kv pairs in the context of
    // the incomplete session with the successful session
    if(incompleteSession.events[i].context){
      const eventWithContext = incompleteSession.events[i]
      const contextKeys = Object.keys(eventWithContext.context)

      try {
        for(const key of contextKeys) {
          if(successfulSession.events[i].context[key] !== eventWithContext.context[key]){
            // context is not the same, not a match, let's get out of here
            contextMatch = false
            break
          }
        }
      } catch (error) {
        contextMatch = false
        console.log(`Something happened, probably successful session didn't have a context for an event.`)
      }
    }
  }

  // Update results
  if(contextMatch){
    console.log(`VALIDATED`)
    const synonyms = dedupTokens(successfulSession.searchQuery, incompleteSession.searchQuery)
    const existingMatchingResultIndex = findMatchingIndex(synonyms, results)
    if(existingMatchingResultIndex >= 0){
      // merge into the existing synonym document, without duplicates
      const synonymSet = new Set([...synonyms, ...results[existingMatchingResultIndex].synonyms])
      results[existingMatchingResultIndex].synonyms = Array.from(synonymSet)
    }
    else{
      const result = {
        "mappingType": "equivalent",
        "synonyms": synonyms
      }
      results.push(result)
    }
  }
  else{
    console.log(`NOT A MATCH`)
  }

  return results
}

/**
 * Compare the event sequence of incomplete and successful sessions
 * @param {*} successfulSessions
 * @param {*} incompleteSessions
 * @returns Synonym documents derived from the matching sessions
 */
const compareLists = (successfulSessions, incompleteSessions) => {
  let results = []
  for(const successfulSession of successfulSessions) {
    for(const incompleteSession of incompleteSessions) {
      // if the incomplete session followed the same leading steps as the successful
      // one, let's inspect these sessions to validate that they are a match
      // (startsWith keeps the event indexes aligned for processMatch)
      if(successfulSession.enrichments.eventSequence.startsWith(incompleteSession.enrichments.eventSequence)){
        processMatch(successfulSession, incompleteSession, results)
      }
    }
  }
  return results
}

const processSessions = (sessions) => {
  // console.log(`Processing the following list:`, JSON.stringify(sessions, null, 2));
  // enrich sessions for processing
  const enrichedSessions = sessions.map(session => {
    return { ...session, enrichments: enrichEvents(session.events) }
  })
  // separate successful and incomplete sessions
  const successfulEvents = enrichedSessions.filter(session => { return session.enrichments.isSuccessful })
  const incompleteEvents = enrichedSessions.filter(session => { return !session.enrichments.isSuccessful })

  return compareLists(successfulEvents, incompleteEvents)
}

/**
 * Main Entry Point
 */
const main = () => {
  // Assumption: the sessions from the Step 2 view have been exported to a JSON
  // file whose path is passed as the first command-line argument
  const eventsBySession = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'))
  const results = processSessions(eventsBySession)
  console.log(`Results:`, results)
}

// only run main when the script is executed directly, not when it is required
if (require.main === module) {
  main()
}

module.exports = processSessions
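To make the two core checks concrete, here is a minimal, self-contained sketch (not the full script) that applies them to the “romanian food” and “hungarian food” sessions from the example. The helper names `isLeadingSequence` and `removeSharedTokens` are made up for this illustration.

```javascript
const successful = {
  searchQuery: 'romanian food',
  events: ['search_query', 'add_to_cart', 'checkout', 'payment_success']
};
const incomplete = {
  searchQuery: 'hungarian food',
  events: ['search_query', 'add_to_cart', 'checkout']
};

// True when the shorter list repeats the first steps of the longer list, in order.
const isLeadingSequence = (shorter, longer) =>
  shorter.length <= longer.length && shorter.every((eventId, i) => eventId === longer[i]);

// Drop the words both queries share so that only the differing terms remain.
const removeSharedTokens = (str1, str2) => {
  const tokens1 = str1.split(' ');
  const tokens2 = str2.split(' ');
  return [
    tokens1.filter((token) => !tokens2.includes(token)).join(' '),
    tokens2.filter((token) => !tokens1.includes(token)).join(' ')
  ];
};

// Build a synonym document only when the event sequences line up.
const synonymDocument = isLeadingSequence(incomplete.events, successful.events)
  ? {
      mappingType: 'equivalent',
      synonyms: removeSharedTokens(successful.searchQuery, incomplete.searchQuery)
    }
  : null;

// synonymDocument.synonyms is ['romanian', 'hungarian']
console.log(synonymDocument);
```

The shared word “food” is dropped, so only the differing terms are proposed as equivalent synonyms.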

Run the script yourself.
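The script above only prints its results. One way to push them to the synonyms collection, sketched here under assumptions, is to turn each result into an upsert for the Node.js driver’s `bulkWrite`. The helper name `toBulkOperations` and the merge-by-overlap filter are assumptions of this sketch, and the database call itself is left as a comment with placeholder names.

```javascript
// Hypothetical helper: converts the script's results into bulkWrite operations.
const toBulkOperations = (results) =>
  results.map(({ mappingType, synonyms }) => ({
    updateOne: {
      // look for an existing synonym document that shares at least one term
      filter: { mappingType, synonyms: { $in: synonyms } },
      // add the terms to that document without creating duplicates
      update: { $addToSet: { synonyms: { $each: synonyms } } },
      // create a new document when no overlapping one exists
      upsert: true
    }
  }));

const operations = toBulkOperations([
  { mappingType: 'equivalent', synonyms: ['romanian', 'hungarian'] }
]);
console.log(JSON.stringify(operations, null, 2));

// With a connected MongoClient (assumed, not shown), the operations could be sent as:
// await client.db('<database>').collection('<synonyms collection>').bulkWrite(operations);
```

Running this on a schedule, after the comparison script, would keep the synonyms collection growing as new matching sessions appear.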

Step 4: Enhance our search query with the newly appended synonyms

[
    {
        '$search': {
            'index': 'synonym-search',
            'text': {
                'query': 'hungarian',
                'path': 'cuisine-type',
                # synonyms is an option of the text operator and names a
                # synonym mapping defined in the search index
                'synonyms': 'similarCuisines'
            }
        }
    }
]

See the synonyms tutorial.
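For the query above to resolve `similarCuisines`, the `synonym-search` index has to declare a synonym mapping with that name. The definition below is a minimal sketch; the source collection name `synonyms`, the `lucene.standard` analyzer, and the dynamic field mapping are assumptions for this example.

```javascript
// Sketch of an Atlas Search index definition with a synonym mapping.
// Collection name, analyzer, and dynamic mappings are assumptions.
const synonymSearchIndexDefinition = {
  mappings: { dynamic: true },
  synonyms: [
    {
      // must match the name passed in the query's synonyms option
      name: 'similarCuisines',
      analyzer: 'lucene.standard',
      // collection holding the { mappingType, synonyms } documents
      source: { collection: 'synonyms' }
    }
  ]
};

console.log(JSON.stringify(synonymSearchIndexDefinition, null, 2));
```

Because the mapping reads from a collection, new synonym documents written by the scheduled job become available to searches without editing the query.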

Next Steps

There you have it, folks. We’ve taken raw data recorded from our application server and put it to use by building a feedback loop that encourages positive user behavior.

By measuring this feedback loop against your KPIs, you can build a simple A/B test against certain synonyms and user patterns to optimize your application!
