Improve Your App's Search Results with Auto-Tuning

Isa Torres, Ethan Steininger • 5 min read • Published Oct 20, 2021 • Updated Aug 14, 2024
Historically, the only way to improve your app’s search relevance has been manual intervention. For example, you can introduce score boosting to multiply a base relevance score when a match occurs in particular fields. This ensures that matches in some fields weigh more than matches in others. This approach is, however, fixed by nature: the results are dynamic, but the logic itself doesn’t change.
The following project showcases how to leverage synonyms to create a self-tuning feedback loop that delivers incrementally more relevant search results to your users, all without complex machine learning models!
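As a rough illustration of this kind of manual tuning, the sketch below builds a $search stage that boosts matches in one field over another. The index name default, the field names name and description, and the boost value of 3 are assumptions made for this example, not values taken from the project.

```javascript
// Build a $search stage where matches on `name` count three times as much
// as matches on `description`. Index and field names are hypothetical.
const buildBoostedSearch = (query) => ({
  $search: {
    index: 'default',
    compound: {
      should: [
        {
          text: {
            query,
            path: 'name',
            // constant multiplier applied to the base relevance score
            score: { boost: { value: 3 } }
          }
        },
        { text: { query, path: 'description' } }
      ]
    }
  }
});

console.log(JSON.stringify(buildBoostedSearch('romanian food'), null, 2));
```

The multiplier stays at 3 no matter how users behave, which is the limitation the rest of this tutorial works around.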

Example

We have a food search application where a user searches for “Romanian Food.” Assuming that we’re logging every user's clickstream data (their step-by-step interaction with our application), we can take a look at this “sequence” and compare it to other results that have yielded a strong CTA (call-to-action): a successful checkout.
Another user searched for “German Cuisine” and had a very similar clickstream sequence. We can build a script that analyzes these users’ (and other users’) clickstreams, identifies similarities, and appends the matching search terms to a synonyms document that contains “German,” “Romanian,” and other more common cuisines, like “Hungarian.”
Here’s a workflow of what we’re looking to accomplish:
diagram of what we're looking to accomplish

Tutorial

Step 1: Log user’s clickstream activity

In our app tier, as events are fired, we log them to a clickstreams collection, like:
[
  {
    "session_id": "1",
    "event_id": "search_query",
    "metadata": {
      "search_value": "romanian food"
    },
    "timestamp": "1"
  },
  {
    "session_id": "1",
    "event_id": "add_to_cart",
    "product_category": "eastern european cuisine",
    "timestamp": "2"
  },
  {
    "session_id": "1",
    "event_id": "checkout",
    "timestamp": "3"
  },
  {
    "session_id": "1",
    "event_id": "payment_success",
    "timestamp": "4"
  },
  {
    "session_id": "2",
    "event_id": "search_query",
    "metadata": {
      "search_value": "hungarian food"
    },
    "timestamp": "1"
  },
  {
    "session_id": "2",
    "event_id": "add_to_cart",
    "product_category": "eastern european cuisine",
    "timestamp": "2"
  }
]
In this simplified list of events, we can conclude that {"session_id":"1"} searched for “romanian food” and went all the way to a successful conversion (payment_success), whereas {"session_id":"2"} searched for “hungarian food” and stalled after the add_to_cart event. You can import this data yourself using sample_data.json.
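To make that conclusion concrete, here is a minimal sketch that finds the last event each session reached. The helper name lastEventBySession is made up for this example, and it assumes timestamps can be compared as numbers.

```javascript
// Sample events in the same shape as the clickstreams documents above.
const events = [
  { session_id: '1', event_id: 'search_query', metadata: { search_value: 'romanian food' }, timestamp: '1' },
  { session_id: '1', event_id: 'add_to_cart', timestamp: '2' },
  { session_id: '1', event_id: 'checkout', timestamp: '3' },
  { session_id: '1', event_id: 'payment_success', timestamp: '4' },
  { session_id: '2', event_id: 'search_query', metadata: { search_value: 'hungarian food' }, timestamp: '1' },
  { session_id: '2', event_id: 'add_to_cart', timestamp: '2' }
];

// Return, for each session, the event_id of the event with the highest timestamp.
const lastEventBySession = (eventList) => {
  const last = {};
  for (const event of eventList) {
    const seen = last[event.session_id];
    if (!seen || Number(event.timestamp) > Number(seen.timestamp)) {
      last[event.session_id] = event;
    }
  }
  return Object.fromEntries(
    Object.entries(last).map(([sessionId, event]) => [sessionId, event.event_id])
  );
};

console.log(lastEventBySession(events));
// → { '1': 'payment_success', '2': 'add_to_cart' }
```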
Let’s prepare the data for our search_tuner script.

Step 2: Create a view that groups by session_id, then filters on the presence of searches

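By the way, it’s no problem that only some documents have a metadata field. The $cond expression inside our $group stage counts only the search_query events, and the '$events.metadata.search_value' path in the final $set stage collects values only from the events that actually carry one.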
[
    # first we sort by timestamp to get everything in the correct sequence of events,
    # as that is what we'll be using to draw logical correlations
    {
        '$sort': {
            'timestamp': 1
        }
    },
    # next, we'll group by a unique session_id, include all the corresponding events, and begin
    # the filter for determining if a search_query exists
    {
        '$group': {
            '_id': '$session_id',
            'events': {
                '$push': '$$ROOT'
            },
            'isSearchQueryPresent': {
                '$sum': {
                    '$cond': [
                        {
                            '$eq': [
                                '$event_id', 'search_query'
                            ]
                        }, 1, 0
                    ]
                }
            }
        }
    },
    # we hide session_ids where there is no search query
    # then create a new field, an array called searchQuery, which we'll use to parse
    {
        '$match': {
            'isSearchQueryPresent': {
                '$gte': 1
            }
        }
    },
    {
        '$unset': 'isSearchQueryPresent'
    },
    {
        '$set': {
            'searchQuery': '$events.metadata.search_value'
        }
    }
]
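If it helps to see the same logic outside the aggregation framework, the sketch below reproduces the stages in plain JavaScript. It is only an illustration of what the pipeline computes, not how the view is built. The function name groupBySession and the sample events are made up for this example, and timestamps are compared as numbers.

```javascript
// Plain JavaScript equivalent of the $sort, $group, $match, $unset and $set stages.
const groupBySession = (events) => {
  // $sort: order events by timestamp
  const sorted = [...events].sort((a, b) => Number(a.timestamp) - Number(b.timestamp));

  // $group: collect the events of each session_id under one document
  const sessions = new Map();
  for (const event of sorted) {
    if (!sessions.has(event.session_id)) {
      sessions.set(event.session_id, { _id: event.session_id, events: [] });
    }
    sessions.get(event.session_id).events.push(event);
  }

  return [...sessions.values()]
    // $match: keep only sessions with at least one search_query event
    .filter(session => session.events.some(e => e.event_id === 'search_query'))
    // $set: gather search_value from the events that carry metadata
    .map(session => ({
      ...session,
      searchQuery: session.events
        .filter(e => e.metadata && e.metadata.search_value !== undefined)
        .map(e => e.metadata.search_value)
    }));
};

const sample = [
  { session_id: '1', event_id: 'add_to_cart', timestamp: '2' },
  { session_id: '1', event_id: 'search_query', metadata: { search_value: 'romanian food' }, timestamp: '1' },
  { session_id: '9', event_id: 'add_to_cart', timestamp: '1' }
];

console.log(JSON.stringify(groupBySession(sample), null, 2));
```

Session 9 is dropped because it never searched, and searchQuery comes out as an array with one entry per search event in the session.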
Let’s create the view: build the pipeline, then go into Compass and save it as a view called group_by_session_id_and_search_query:
screenshot of creating a view in compass
screenshot of the view in compass
Here’s a simplified look at what the view will return:
[
  {
    "session_id": "1",
    "events": [
      {
        "event_id": "search_query",
        "search_value": "romanian food"
      },
      {
        "event_id": "add_to_cart",
        "context": {
          "cuisine": "eastern european cuisine"
        }
      },
      {
        "event_id": "checkout"
      },
      {
        "event_id": "payment_success"
      }
    ],
    "searchQuery": "romanian food"
  },
  {
    "session_id": "2",
    "events": [
      {
        "event_id": "search_query",
        "search_value": "hungarian food"
      },
      {
        "event_id": "add_to_cart",
        "context": {
          "cuisine": "eastern european cuisine"
        }
      },
      {
        "event_id": "checkout"
      }
    ],
    "searchQuery": "hungarian food"
  },
  {
    "session_id": "3",
    "events": [
      {
        "event_id": "search_query",
        "search_value": "italian food"
      },
      {
        "event_id": "add_to_cart",
        "context": {
          "cuisine": "western european cuisine"
        }
      }
    ],
    "searchQuery": "italian food"
  }
]

Step 3: Build a scheduled job that compares similar clickstreams and pushes the resulting synonyms to the synonyms collection

// Assumption: the documents returned by the view from Step 2 have been exported
// to a local JSON file; the file name used here is hypothetical.
const eventsBySession = require('./eventsBySession.json')

// Provide a success indicator to determine which session we want to
// compare any incomplete sessions with
const successIndicator = "payment_success"

// what percentage similarity between two sets of click/event streams
// we'd accept to be determined as similar enough to produce a synonym
// relationship
const acceptedConfidence = .9

// boost the confidence score when the following values are present
// in the eventstream (computed key, so the entry is keyed by "payment_success")
const eventBoosts = {
  [successIndicator]: .1
}

/**
 * Enrich sessions with a flattened event list to make comparison easier.
 * Determine if the session is to be considered successful based on the success indicator.
 * @param {*} eventList List of events in a session.
 * @returns {any} Calculated values used to determine if an incomplete session is considered to
 * be related to a successful session.
 */
const enrichEvents = (eventList) => {
  return {
    eventSequence: eventList.map(event => { return event.event_id }).join(';'),
    isSuccessful: eventList.some(event => { return event.event_id === successIndicator })
  }
}

/**
 * De-duplicate common tokens in two strings
 * @param {*} str1
 * @param {*} str2
 * @returns Returns an array with the provided strings with the common tokens removed
 */
const dedupTokens = (str1, str2) => {
  const splitToken = ' '
  const tokens1 = str1.split(splitToken)
  const tokens2 = str2.split(splitToken)
  const dupedTokens = tokens1.filter(token => { return tokens2.includes(token)});
  const dedupedStr1 = tokens1.filter(token => { return !dupedTokens.includes(token)});
  const dedupedStr2 = tokens2.filter(token => { return !dupedTokens.includes(token)});

  return [ dedupedStr1.join(splitToken), dedupedStr2.join(splitToken) ]
}

/**
 * Find the index of an existing result that already contains one of the synonyms.
 * Returns the last matching index, or -1 when there is none.
 */
const findMatchingIndex = (synonyms, results) => {
  let matchIndex = -1
  for(let i = 0; i < results.length; i++) {
    for(const synonym of synonyms) {
      if(results[i].synonyms.includes(synonym)){
        matchIndex = i;
        break;
      }
    }
  }
  return matchIndex;
}

/**
 * Inspect the context of two matching sessions.
 * @param {*} successfulSession
 * @param {*} incompleteSession
 * @param {*} results Synonym mappings collected so far (updated in place)
 */
const processMatch = (successfulSession, incompleteSession, results) => {
  console.log(`=====\nINSPECTING POTENTIAL MATCH: ${ successfulSession.searchQuery} = ${incompleteSession.searchQuery}`);
  let contextMatch = true;

  // At this point we can assume that the sequence of events is the same, so we can
  // use the same index when comparing events
  for(let i = 0; i < incompleteSession.events.length; i++) {
    // if we have a context, let's compare the kv pairs in the context of
    // the incomplete session with the successful session
    if(incompleteSession.events[i].context){
      const eventWithContext = incompleteSession.events[i]
      const contextKeys = Object.keys(eventWithContext.context)

      try {
        for(const key of contextKeys) {
          if(successfulSession.events[i].context[key] !== eventWithContext.context[key]){
            // context is not the same, not a match, let's get out of here
            contextMatch = false
            break;
          }
        }
      } catch (error) {
        contextMatch = false;
        console.log(`Something happened, probably successful session didn't have a context for an event.`);
      }
    }
  }

  // Update results
  if(contextMatch){
    console.log(`VALIDATED`);
    const synonyms = dedupTokens(successfulSession.searchQuery, incompleteSession.searchQuery)
    const existingMatchingResultIndex = findMatchingIndex(synonyms, results)
    if(existingMatchingResultIndex >= 0){
      const synonymSet = new Set([...synonyms, ...results[existingMatchingResultIndex].synonyms])
      results[existingMatchingResultIndex].synonyms = Array.from(synonymSet)
    }
    else{
      const result = {
        "mappingType": "equivalent",
        "synonyms": synonyms
      }
      results.push(result)
    }
  }
  else{
    console.log(`NOT A MATCH`);
  }

  return results;
}

/**
 * Compare the event sequence of incomplete and successful sessions
 * @param {*} successfulSessions
 * @param {*} incompleteSessions
 * @returns The list of synonym mappings found
 */
const compareLists = (successfulSessions, incompleteSessions) => {
  let results = []
  for(const successfulSession of successfulSessions) {
    for(const incompleteSession of incompleteSessions) {
      // if the incomplete event sequence is contained in the successful one,
      // let's inspect these sessions to validate that they are a match
      if(successfulSession.enrichments.eventSequence.includes(incompleteSession.enrichments.eventSequence)){
        processMatch(successfulSession, incompleteSession, results)
      }
    }
  }
  return results
}

const processSessions = (sessions) => {
  // console.log(`Processing the following list:`, JSON.stringify(sessions, null, 2));
  // enrich sessions for processing
  const enrichedSessions = sessions.map(session => {
    return { ...session, enrichments: enrichEvents(session.events)}
  })
  // separate successful and incomplete sessions
  const successfulEvents = enrichedSessions.filter(session => { return session.enrichments.isSuccessful})
  const incompleteEvents = enrichedSessions.filter(session => { return !session.enrichments.isSuccessful})

  return compareLists(successfulEvents, incompleteEvents);
}

/**
 * Main Entry Point
 */
const main = () => {
  const results = processSessions(eventsBySession);
  console.log(`Results:`, results);
}

main();

module.exports = processSessions;
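The script declares acceptedConfidence and eventBoosts, but its comparison is a plain substring check on the event sequence. One way a scored comparison could use those constants is sketched below. The scoring rule (share of the incomplete session's events that match the successful session at the same position, plus any boosts, capped at 1) is an assumption made for illustration and not the authors' algorithm. The names confidence and isSimilar are hypothetical.

```javascript
// Values mirroring the constants declared in the script.
const acceptedConfidence = 0.9;
const eventBoosts = { payment_success: 0.1 };

// Score how closely an incomplete session follows a successful one.
// Both arguments are arrays of event_id strings in chronological order.
const confidence = (successfulIds, incompleteIds) => {
  if (incompleteIds.length === 0) return 0;
  // share of incomplete-session events found at the same position
  const matches = incompleteIds.filter((id, i) => successfulIds[i] === id).length;
  const base = matches / incompleteIds.length;
  // add the boost of every boosted event present in the successful session
  const boost = successfulIds.reduce((sum, id) => sum + (eventBoosts[id] || 0), 0);
  return Math.min(1, base + boost);
};

const isSimilar = (successfulIds, incompleteIds) =>
  confidence(successfulIds, incompleteIds) >= acceptedConfidence;

const successful = ['search_query', 'add_to_cart', 'checkout', 'payment_success'];
console.log(isSimilar(successful, ['search_query', 'add_to_cart', 'checkout']));
console.log(isSimilar(successful, ['search_query', 'view_item', 'checkout']));
```

The first call reports a match because the incomplete session is a prefix of the successful one. The second falls below the threshold because one of its three events differs.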
Run the script yourself.
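The script logs its results rather than writing them anywhere. A minimal sketch of the "push to the synonyms collection" part of this step might look like the following. It assumes collection behaves like a MongoDB Node.js driver Collection (only insertMany is used). The helper name pushSynonyms and the in-memory stand-in are made up so the example runs without a database.

```javascript
// Insert the synonym mappings produced by processSessions into a collection.
// `collection` is assumed to expose insertMany like the Node.js driver does.
const pushSynonyms = async (collection, results) => {
  if (results.length === 0) return 0; // insertMany rejects an empty array
  const { insertedCount } = await collection.insertMany(results);
  return insertedCount;
};

// In-memory stand-in so the sketch runs without a database connection.
const fakeCollection = {
  docs: [],
  async insertMany(docs) {
    this.docs.push(...docs);
    return { insertedCount: docs.length };
  }
};

pushSynonyms(fakeCollection, [
  { mappingType: 'equivalent', synonyms: ['romanian', 'hungarian'] }
]).then(count => console.log(`inserted ${count} synonym document(s)`));
```

A real scheduled job would also need to reconcile new mappings with documents already in the collection, so that repeated runs do not pile up duplicates.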

Step 4: Enhance our search query with the newly appended synonyms

[
    {
        '$search': {
            'index': 'synonym-search',
            'text': {
                'query': 'hungarian',
                'path': 'cuisine-type',
                # the synonyms option belongs to the text operator and names
                # the synonym mapping defined on the search index
                'synonyms': 'similarCuisines'
            }
        }
    }
]
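The query above refers to a synonym mapping named similarCuisines, which has to be declared on the search index itself. The sketch below shows roughly what such an index definition could contain. The source collection name synonyms, the lucene.standard analyzer, and the static field mapping are assumptions for this example. Only the index name, the mapping name, and the cuisine-type path come from the query.

```javascript
// Sketch of an Atlas Search index definition that exposes a synonym mapping.
// Collection name and analyzer are assumptions; adjust them to your setup.
const synonymSearchIndex = {
  name: 'synonym-search',
  definition: {
    mappings: {
      dynamic: false,
      fields: {
        'cuisine-type': { type: 'string', analyzer: 'lucene.standard' }
      }
    },
    synonyms: [
      {
        name: 'similarCuisines',        // referenced by the query's synonyms option
        analyzer: 'lucene.standard',    // should match the analyzer of the queried field
        source: { collection: 'synonyms' } // where the scheduled job writes its mappings
      }
    ]
  }
};

console.log(JSON.stringify(synonymSearchIndex, null, 2));
```

Because the mapping reads from a collection, new documents written by the scheduled job feed into search results without changing the query.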

Next Steps

There you have it, folks. We’ve taken raw data recorded from our application server and put it to use by building a feedback loop that encourages positive user behavior.
By measuring this feedback loop against your KPIs, you can build a simple A/B test against certain synonyms and user patterns to optimize your application!
