Improve Your App's Search Results with Auto-Tuning
Historically, the only way to improve your app’s search query relevance has been manual intervention. For example, you can introduce score boosting to multiply a base relevance score when the query matches particular fields. This ensures that matches in some fields weigh more heavily than matches in others. This is, however, fixed by nature. The results are dynamic, but the logic itself doesn’t change.
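As a rough illustration of that manual approach, a boosted query might look like the sketch below. The index name `default` and the field names `name` and `description` are assumptions made up for this example; only the `compound`, `text`, and `score.boost` options come from Atlas Search itself.

```javascript
// Build a $search stage that weights matches in one field above another.
// The index and field names passed in below are hypothetical placeholders.
const buildBoostedSearch = (query, boostedPath, otherPath, boostValue) => ({
  $search: {
    index: 'default',
    compound: {
      should: [
        // matches in the boosted field have their score multiplied
        { text: { query, path: boostedPath, score: { boost: { value: boostValue } } } },
        // matches in the other field keep their base score
        { text: { query, path: otherPath } }
      ]
    }
  }
});

const stage = buildBoostedSearch('romanian', 'name', 'description', 3);
console.log(JSON.stringify(stage, null, 2));
```

The multiplier is hard-coded by whoever writes the query, which is exactly the "fixed by nature" limitation described above.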
The following project showcases how to leverage synonyms to create a self-tuning feedback loop that delivers incrementally more relevant search results to your users, all without complex machine learning models!
Suppose we have a food search application where a user searches for “Romanian Food.” Assuming that we’re logging every user’s clickstream data (their step-by-step interaction with our application), we can take a look at this “sequence” and compare it to other sequences that have yielded a strong CTA (call-to-action): a successful checkout.
Another user searched for “German Cuisine” and had a very similar clickstream sequence. We can build a script that analyzes these users’ (and other users’) clickstreams and identifies similarities. When it finds a match, the script appends the search terms to a synonyms document that contains “German,” “Romanian,” and other more common cuisines, like “Hungarian.”
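The append step could be sketched roughly as follows. The document shape (`mappingType` plus a `synonyms` array) mirrors what the tuner script later in this tutorial produces; the helper name `addSynonym` is hypothetical and only exists for this example.

```javascript
// Hypothetical helper: add a term to a synonym document without duplicating it.
const addSynonym = (synonymDoc, term) => {
  // a Set drops the term if it is already present
  const merged = new Set([...synonymDoc.synonyms, term.toLowerCase()]);
  return { ...synonymDoc, synonyms: Array.from(merged) };
};

const cuisines = { mappingType: 'equivalent', synonyms: ['german', 'hungarian'] };
const updated = addSynonym(cuisines, 'Romanian');
console.log(updated.synonyms);
```

Returning a new object rather than mutating the input keeps the sketch easy to test; a real script might instead update the stored document in place.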
Here’s a workflow of what we’re looking to accomplish:
In our app tier, as events are fired, we log them to a clickstreams collection, like:
[
  {
    "session_id": "1",
    "event_id": "search_query",
    "metadata": {
      "search_value": "romanian food"
    },
    "timestamp": "1"
  },
  {
    "session_id": "1",
    "event_id": "add_to_cart",
    "product_category": "eastern european cuisine",
    "timestamp": "2"
  },
  {
    "session_id": "1",
    "event_id": "checkout",
    "timestamp": "3"
  },
  {
    "session_id": "1",
    "event_id": "payment_success",
    "timestamp": "4"
  },
  {
    "session_id": "2",
    "event_id": "search_query",
    "metadata": {
      "search_value": "hungarian food"
    },
    "timestamp": "1"
  },
  {
    "session_id": "2",
    "event_id": "add_to_cart",
    "product_category": "eastern european cuisine",
    "timestamp": "2"
  }
]
In this simplified list of events, we can conclude that {"session_id":"1"} searched for “romanian food” and went on to convert (payment_success), whereas {"session_id":"2"} searched for “hungarian food” and stalled after the add_to_cart event.
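That conclusion can be reached mechanically. The following is a minimal sketch, assuming the funnel order is search_query, add_to_cart, checkout, payment_success (an ordering inferred from the sample events, not something the data itself declares). The function name `furthestStageBySession` is made up for this example.

```javascript
// Assumed funnel order, inferred from the event_id values in the sample data.
const funnel = ['search_query', 'add_to_cart', 'checkout', 'payment_success'];

// Return the furthest funnel stage reached by each session.
const furthestStageBySession = (events) => {
  const furthest = {};
  for (const { session_id, event_id } of events) {
    const stage = funnel.indexOf(event_id); // -1 for events outside the funnel
    if (stage > (furthest[session_id] ?? -1)) {
      furthest[session_id] = stage;
    }
  }
  // translate stage indexes back into event names
  return Object.fromEntries(
    Object.entries(furthest).map(([id, stage]) => [id, funnel[stage]])
  );
};

const events = [
  { session_id: '1', event_id: 'search_query' },
  { session_id: '1', event_id: 'add_to_cart' },
  { session_id: '1', event_id: 'checkout' },
  { session_id: '1', event_id: 'payment_success' },
  { session_id: '2', event_id: 'search_query' },
  { session_id: '2', event_id: 'add_to_cart' }
];

console.log(furthestStageBySession(events));
```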
You can import this data yourself using sample_data.json.
Let’s prepare the data for our search_tuner script.
By the way, it’s no problem that only some documents have a metadata field. The $group stage pushes each event as-is, and the $events.metadata.search_value path in the final $set stage simply skips events that don’t have one.
[
    # first we sort by timestamp to get everything in the correct sequence of events,
    # as that is what we'll be using to draw logical correlations
    {
        '$sort': {
            'timestamp': 1
        }
    },
    # next, we'll group by a unique session_id, include all the corresponding events, and begin
    # the filter for determining if a search_query exists
    {
        '$group': {
            '_id': '$session_id',
            'events': {
                '$push': '$$ROOT'
            },
            'isSearchQueryPresent': {
                '$sum': {
                    '$cond': [
                        {
                            '$eq': [
                                '$event_id', 'search_query'
                            ]
                        }, 1, 0
                    ]
                }
            }
        }
    },
    # we hide session_ids where there is no search query
    # then create a new field, an array called searchQuery, which we'll use to parse
    {
        '$match': {
            'isSearchQueryPresent': {
                '$gte': 1
            }
        }
    },
    {
        '$unset': 'isSearchQueryPresent'
    },
    {
        '$set': {
            'searchQuery': '$events.metadata.search_value'
        }
    }
]
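To make the effect of each stage easier to follow, here is an in-memory approximation of the pipeline in plain JavaScript. It is only a sketch for reasoning about the output shape, not how the server executes the stages, and the function name `groupBySession` is hypothetical.

```javascript
// In-memory approximation of the pipeline above, for illustration only.
const groupBySession = (events) => {
  // $sort: order events by timestamp. The server compares the stored values
  // as-is; numeric comparison is used here, which agrees with that for the
  // single-digit string timestamps in the sample data.
  const sorted = [...events].sort((a, b) => Number(a.timestamp) - Number(b.timestamp));

  // $group: collect each session's events under its session_id
  const sessions = new Map();
  for (const event of sorted) {
    if (!sessions.has(event.session_id)) sessions.set(event.session_id, []);
    sessions.get(event.session_id).push(event);
  }

  return Array.from(sessions.entries())
    // $match: keep only sessions containing at least one search_query event
    .filter(([, sessionEvents]) => sessionEvents.some(e => e.event_id === 'search_query'))
    .map(([id, sessionEvents]) => ({
      _id: id,
      events: sessionEvents,
      // $set: the path resolves to an array holding the value from every
      // event that has metadata.search_value; other events are skipped
      searchQuery: sessionEvents
        .filter(e => e.metadata && e.metadata.search_value !== undefined)
        .map(e => e.metadata.search_value)
    }));
};

const sample = [
  { session_id: '1', event_id: 'add_to_cart', timestamp: '2' },
  { session_id: '1', event_id: 'search_query', metadata: { search_value: 'romanian food' }, timestamp: '1' },
  { session_id: '9', event_id: 'add_to_cart', timestamp: '1' }
];

const grouped = groupBySession(sample);
console.log(JSON.stringify(grouped, null, 2));
```

Note that `searchQuery` comes out as an array here, so code that expects a single string would need to pick an element, for example the first.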
Let’s create the view by building the query in Compass and saving it as a new view called group_by_session_id_and_search_query.
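If you would rather create the view from code than from Compass, a sketch along these lines may work. It assumes `db` is a `Db` handle from the official MongoDB Node.js driver, whose `createCollection` accepts `viewOn` and `pipeline` options; the helper name `createSessionView` and the stand-in `fakeDb` object are hypothetical and exist only so the example runs without a server.

```javascript
// Sketch: create the view programmatically instead of through Compass.
const createSessionView = async (db, pipeline) => {
  return db.createCollection('group_by_session_id_and_search_query', {
    viewOn: 'clickstreams', // source collection named in this tutorial
    pipeline                // the aggregation stages shown above
  });
};

// Stand-in for a real Db handle; it only records what it was asked to do.
const calls = [];
const fakeDb = {
  createCollection: async (name, options) => {
    calls.push({ name, options });
    return { collectionName: name };
  }
};

createSessionView(fakeDb, [{ $sort: { timestamp: 1 } }])
  .then(view => console.log('created', view.collectionName));
```

With a real connection, you would pass the connected database handle and the full pipeline in place of the stand-ins.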
Here’s what it will look like:
[
  {
    "session_id": "1",
    "events": [
      {
        "event_id": "search_query",
        "search_value": "romanian food"
      },
      {
        "event_id": "add_to_cart",
        "context": {
          "cuisine": "eastern european cuisine"
        }
      },
      {
        "event_id": "checkout"
      },
      {
        "event_id": "payment_success"
      }
    ],
    "searchQuery": "romanian food"
  },
  {
    "session_id": "2",
    "events": [
      {
        "event_id": "search_query",
        "search_value": "hungarian food"
      },
      {
        "event_id": "add_to_cart",
        "context": {
          "cuisine": "eastern european cuisine"
        }
      },
      {
        "event_id": "checkout"
      }
    ],
    "searchQuery": "hungarian food"
  },
  {
    "session_id": "3",
    "events": [
      {
        "event_id": "search_query",
        "search_value": "italian food"
      },
      {
        "event_id": "add_to_cart",
        "context": {
          "cuisine": "western european cuisine"
        }
      }
    ],
    "searchQuery": "italian food"
  }
]
// Provide a success indicator to determine which session we want to
// compare any incomplete sessions with
const successIndicator = "payment_success"

// what percentage similarity between two sets of click/event streams
// we'd accept to be determined as similar enough to produce a synonym
// relationship
const acceptedConfidence = .9

// boost the confidence score when the following values are present
// in the eventstream (computed key, so the boost is keyed by the
// indicator's value rather than the literal name "successIndicator")
const eventBoosts = {
  [successIndicator]: .1
}

/**
 * Enrich sessions with a flattened event list to make comparison easier.
 * Determine if the session is to be considered successful based on the success indicator.
 * @param {*} eventList List of events in a session.
 * @returns {any} Calculated values used to determine if an incomplete session is considered to
 * be related to a successful session.
 */
const enrichEvents = (eventList) => {
  return {
    eventSequence: eventList.map(event => { return event.event_id }).join(';'),
    isSuccessful: eventList.some(event => { return event.event_id === successIndicator })
  }
}

/**
 * De-duplicate common tokens in two strings
 * @param {*} str1
 * @param {*} str2
 * @returns Returns an array with the provided strings with the common tokens removed
 */
const dedupTokens = (str1, str2) => {
  const splitToken = ' '
  const tokens1 = str1.split(splitToken)
  const tokens2 = str2.split(splitToken)
  const dupedTokens = tokens1.filter(token => { return tokens2.includes(token)});
  const dedupedStr1 = tokens1.filter(token => { return !dupedTokens.includes(token)});
  const dedupedStr2 = tokens2.filter(token => { return !dupedTokens.includes(token)});

  return [ dedupedStr1.join(splitToken), dedupedStr2.join(splitToken) ]
}

/**
 * Find the first existing result that already contains one of the synonyms.
 * @param {*} synonyms
 * @param {*} results
 * @returns Index of the matching result, or -1 if there is none
 */
const findMatchingIndex = (synonyms, results) => {
  for(let i = 0; i < results.length; i++) {
    for(const synonym of synonyms) {
      if(results[i].synonyms.includes(synonym)){
        // stop at the first result that shares a synonym
        return i;
      }
    }
  }
  return -1;
}

/**
 * Inspect the context of two matching sessions.
 * @param {*} successfulSession
 * @param {*} incompleteSession
 * @param {*} results Synonym results collected so far (updated in place)
 */
const processMatch = (successfulSession, incompleteSession, results) => {
  console.log(`=====\nINSPECTING POTENTIAL MATCH: ${ successfulSession.searchQuery} = ${incompleteSession.searchQuery}`);
  let contextMatch = true;

  // At this point we can assume that the sequence of events is the same, so we can
  // use the same index when comparing events
  for(let i = 0; i < incompleteSession.events.length; i++) {
    // if we have a context, let's compare the kv pairs in the context of
    // the incomplete session with the successful session
    if(incompleteSession.events[i].context){
      const eventWithContext = incompleteSession.events[i]
      const contextKeys = Object.keys(eventWithContext.context)

      try {
        for(const key of contextKeys) {
          if(successfulSession.events[i].context[key] !== eventWithContext.context[key]){
            // context is not the same, not a match, let's get out of here
            contextMatch = false
            break;
          }
        }
      } catch (error) {
        contextMatch = false;
        console.log(`Something happened, probably successful session didn't have a context for an event.`);
      }
    }
  }

  // Update results
  if(contextMatch){
    console.log(`VALIDATED`);
    const synonyms = dedupTokens(successfulSession.searchQuery, incompleteSession.searchQuery)
    const existingMatchingResultIndex = findMatchingIndex(synonyms, results)
    if(existingMatchingResultIndex >= 0){
      const synonymSet = new Set([...synonyms, ...results[existingMatchingResultIndex].synonyms])
      results[existingMatchingResultIndex].synonyms = Array.from(synonymSet)
    }
    else{
      const result = {
        "mappingType": "equivalent",
        "synonyms": synonyms
      }
      results.push(result)
    }

  }
  else{
    console.log(`NOT A MATCH`);
  }

  return results;
}

/**
 * Compare the event sequence of incomplete and successful sessions
 * @param {*} successfulSessions
 * @param {*} incompleteSessions
 * @returns List of synonym results
 */
const compareLists = (successfulSessions, incompleteSessions) => {
  let results = []
  for(const successfulSession of successfulSessions) {
    for(const incompleteSession of incompleteSessions) {
      // if the event sequence is the same, let's inspect these sessions
      // to validate that they are a match
      if(successfulSession.enrichments.eventSequence.includes(incompleteSession.enrichments.eventSequence)){
        processMatch(successfulSession, incompleteSession, results)
      }
    }
  }
  return results
}

const processSessions = (sessions) => {
  // console.log(`Processing the following list:`, JSON.stringify(sessions, null, 2));
  // enrich sessions for processing
  const enrichedSessions = sessions.map(session => {
    return { ...session, enrichments: enrichEvents(session.events)}
  })
  // separate successful and incomplete sessions
  const successfulEvents = enrichedSessions.filter(session => { return session.enrichments.isSuccessful})
  const incompleteEvents = enrichedSessions.filter(session => { return !session.enrichments.isSuccessful})

  return compareLists(successfulEvents, incompleteEvents);
}

/**
 * Main Entry Point
 */
const main = () => {
  // eventsBySession is the list of documents from the
  // group_by_session_id_and_search_query view. The file name below is an
  // assumption; point it at wherever you exported that data.
  const eventsBySession = require('./eventsBySession.json');
  const results = processSessions(eventsBySession);
  console.log(`Results:`, results);
}

// only run main when executed directly, so requiring this module has no side effects
if (require.main === module) {
  main();
}

module.exports = processSessions;
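To make the two core checks in the script concrete, here is a small standalone sketch that re-declares the token de-duplication and applies it, together with the event-sequence test, to the sample queries. It is a condensed copy for illustration, not a replacement for the script.

```javascript
// Standalone copy of the script's token de-duplication, for illustration.
const dedupTokens = (str1, str2) => {
  const tokens1 = str1.split(' ');
  const tokens2 = str2.split(' ');
  const shared = tokens1.filter(token => tokens2.includes(token));
  return [
    tokens1.filter(token => !shared.includes(token)).join(' '),
    tokens2.filter(token => !shared.includes(token)).join(' ')
  ];
};

// The shared token "food" is dropped, leaving the candidate synonyms.
const synonyms = dedupTokens('romanian food', 'hungarian food');
console.log(synonyms); // [ 'romanian', 'hungarian' ]

// An incomplete session is a candidate when its joined event sequence
// appears inside the successful session's joined sequence.
const successful = 'search_query;add_to_cart;checkout;payment_success';
const incomplete = 'search_query;add_to_cart';
const isCandidate = successful.includes(incomplete);
console.log(isCandidate); // true
```

Because the sequence check is a substring test, an incomplete session only needs to match some contiguous part of the successful one; the context comparison in processMatch then decides whether the pair is accepted.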
[
    {
        '$search': {
            'index': 'synonym-search',
            'text': {
                'query': 'hungarian',
                'path': 'cuisine-type',
                'synonyms': 'similarCuisines'
            }
        }
    }
]
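For the synonyms option in this query to resolve, the search index needs a synonym mapping named similarCuisines. Below is a sketch of what such an index definition might look like, written as a JavaScript object. The source collection name `synonyms` and the choice of the `lucene.standard` analyzer are assumptions for this example, so adjust them to match where the tuner output is stored and how the cuisine-type field is indexed.

```javascript
// Sketch of an Atlas Search index definition with a synonym mapping.
// The source collection name "synonyms" is an assumption for this example.
const indexDefinition = {
  mappings: {
    dynamic: false,
    fields: {
      'cuisine-type': { type: 'string', analyzer: 'lucene.standard' }
    }
  },
  synonyms: [
    {
      name: 'similarCuisines',     // referenced by the query's synonyms option
      analyzer: 'lucene.standard', // should match the analyzer of the searched field
      source: { collection: 'synonyms' }
    }
  ]
};

console.log(JSON.stringify(indexDefinition, null, 2));
```

Each document in the source collection would then have the shape the tuner script emits: a mappingType of "equivalent" and a synonyms array.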
There you have it, folks. We’ve taken raw data recorded from our application server and put it to use by building a feedback loop that encourages positive user behavior.
By measuring this feedback loop against your KPIs, you can build a simple A/B test against certain synonyms and user patterns to optimize your application!