Vector search: retrieving the relevant matched keywords

I am using vector search:

{
        $vectorSearch: {
          queryVector: embedding,
          path: 'plot_embedding',
          numCandidates: 10000,
          limit: 10,
          index: 'vector_index',
        },
      },
      {
        $group: {
          _id: null,
          docs: { $push: '$$ROOT' },
        },
      },
      {
        $unwind: {
          path: '$docs',
          includeArrayIndex: 'rank',
        },
      },
      // {
      //   $addFields: {
      //     vs_score: { // add prres 3
      //       $divide: [1.0, { $add: ['$rank', vector_penalty, 1] }],
      //     },
      //   },
      // },
      {
        $addFields: {
          vs_score: {
            $round: [{
              $divide: [1.0, { $add: ['$rank', vector_penalty, 1] }]
            }, 3]
          }
        }
      },
      {
        $project: {
          vs_score: 1,
          _id: '$docs._id',
          name: '$docs.name',
          summary: '$docs.summary',
          website_url: '$docs.website_url',
        },
      },

I want to get, for each retrieved record, the keyword that led the vector search to decide the record is related to the actual searchKeyword. In my MongoDB Atlas Vector Search index, I have created embeddings for the summary, technologies, name, and clients fields only. For example, if I search for “human health”, I get one record that has the phrase “human health” in its summary field, and another record whose summary contains “human heart” related info (so vector search is returning results based on related keywords or meanings). Hence I want those keywords from each record: [“human health”, “human heart”].

MongoDB’s vector search does not match keywords at all: it compares the embedding of the query with the embedding stored for each document, so there is no list of matched keywords for it to return. To approximate which terms made a result relevant, you have to analyze the text of the returned fields yourself, either with additional aggregation stages or with text processing in your application.

Approach

  1. Perform Vector Search: You retrieve the relevant documents based on their vector proximity to the search query.
  2. Extract Relevant Terms: Post-process the results to extract terms (like “human health” and “human heart”) that are relevant to the query. You can do this by matching the query terms against the text in the relevant fields.

Here’s how you might implement this in your MongoDB aggregation pipeline:

[
  {
    $vectorSearch: {
      queryVector: embedding,
      path: 'plot_embedding',
      numCandidates: 10000,
      limit: 10,
      index: 'vector_index',
    },
  },
  {
    $group: {
      _id: null,
      docs: { $push: '$$ROOT' },
    },
  },
  {
    $unwind: {
      path: '$docs',
      includeArrayIndex: 'rank',
    },
  },
  {
    $addFields: {
      vs_score: {
        $round: [{
          $divide: [1.0, { $add: ['$rank', vector_penalty, 1] }]
        }, 3]
      },
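      matched_keywords: {
        $filter: {
          input: { $split: ['$docs.summary', ' '] }, // Split summary into words
          as: 'word',
          // '$$word' refers to the $filter variable. The regex matches any single
          // query term, because a split word can never contain a multi-word phrase.
          // Escape regex metacharacters in searchKeyword first if it is user input.
          cond: { $regexMatch: { input: '$$word', regex: searchKeyword.trim().split(/\s+/).join('|'), options: 'i' } }
        }
      }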
    }
  },
  {
    $project: {
      vs_score: 1,
      _id: '$docs._id',
      name: '$docs.name',
      summary: '$docs.summary',
      website_url: '$docs.website_url',
      matched_keywords: 1,
    },
  }
]

Explanation:

  1. $vectorSearch: This stage performs the vector search and retrieves the most relevant documents.
  2. $group and $unwind: $group pushes all results into a single array, and $unwind with includeArrayIndex writes each document’s position in that array to rank (starting at 0), so each document can be scored individually.
  3. $addFields with $filter: This stage splits the summary field into words and keeps those that match the search terms, case-insensitively. This is a lexical match: it surfaces words that literally overlap with the query, not the semantic reason the document was retrieved. To cover name, technologies, or clients, apply the same filter to those fields.
  4. $project: This stage projects the relevant fields, including the matched keywords, along with the vector search score.
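
As a quick check of the scoring arithmetic, here is a minimal sketch of the vs_score formula in plain JavaScript. The value vector_penalty = 1 is an assumption for illustration only, since the question does not show it. Note that MongoDB's $round rounds exact halves to the nearest even value, while Math.round rounds them up, so results could differ on exact ties.

```javascript
// Mirrors the $addFields stage: round(1 / (rank + vector_penalty + 1), 3).
// rank is the 0-based position produced by $unwind's includeArrayIndex.
function vsScore(rank, vectorPenalty) {
  const raw = 1.0 / (rank + vectorPenalty + 1);
  return Math.round(raw * 1000) / 1000; // keep 3 decimal places
}

// Assumed penalty of 1 (hypothetical value, not taken from the question).
const scores = [0, 1, 2].map((rank) => vsScore(rank, 1));
console.log(scores); // [ 0.5, 0.333, 0.25 ]
```

The score depends only on the rank and the penalty, not on the similarity value returned by the vector search itself.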

Important Notes:

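  • Regex Matching: Because the summary is split on spaces, a multi-word phrase such as “human health” can never match a single word. Match the individual query terms, or run the regex against the whole field if you need to capture phrases.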
  • Keyword Extraction: The current example only extracts keywords from the summary field. You can expand this to other fields like name, technologies, and clients by modifying the $filter logic.
  • Advanced Keyword Extraction: For more sophisticated keyword extraction, you may want to use natural language processing (NLP) techniques outside of MongoDB, such as using a separate service or application logic. This can help identify semantically similar terms or phrases.
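
The first two notes can also be handled in application code once the pipeline has returned its documents. The sketch below is one possible way to do that: the helper names escapeRegex and matchedTerms are hypothetical, and the field list is taken from the question. It assumes the pipeline also projects technologies and clients, which the $project stage shown above does not do yet.

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return the query terms that literally occur in any of the given fields.
// Array-valued fields (e.g. technologies) are joined into one string.
function matchedTerms(doc, searchKeyword,
                      fields = ['summary', 'technologies', 'name', 'clients']) {
  const terms = searchKeyword.trim().split(/\s+/).filter(Boolean).map(escapeRegex);
  if (terms.length === 0) return [];
  const pattern = new RegExp(terms.join('|'), 'gi');
  const found = new Set();
  for (const field of fields) {
    const value = doc[field];
    const text = Array.isArray(value) ? value.join(' ') : String(value ?? '');
    for (const match of text.matchAll(pattern)) {
      found.add(match[0].toLowerCase());
    }
  }
  return [...found];
}
```

For a document whose summary is "Research on Human Health outcomes", calling matchedTerms(doc, 'human health') would return the terms human and health. This is still purely lexical: a summary about "human heart" would only yield human.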

This pipeline gives you a starting point for surfacing the terms in each result that overlap with the query. It approximates, rather than reveals, why the vector search ranked a document, so further tuning may be needed based on your data and search requirements.
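
For the semantic case in the question (getting “human heart” back for a “human health” query), one option outside MongoDB is to embed short candidate phrases from each returned document and rank them by similarity to the query vector. The following is a minimal sketch under stated assumptions: embed is a hypothetical async function (text in, number array out) that you would have to implement with the same model used to build the stored embeddings, and the helper names are illustrative only.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Candidate phrases: every run of 1..maxWords consecutive words, lowercased.
function candidatePhrases(text, maxWords = 2) {
  const words = (text || '').toLowerCase().match(/[a-z0-9]+/g) || [];
  const phrases = new Set();
  for (let i = 0; i < words.length; i++) {
    for (let n = 1; n <= maxWords && i + n <= words.length; n++) {
      phrases.add(words.slice(i, i + n).join(' '));
    }
  }
  return [...phrases];
}

// Rank a document's phrases by similarity to the query vector.
// `embed` is a hypothetical function supplied by the caller (assumption).
async function topPhrases(text, queryVector, embed, limit = 3) {
  const scored = [];
  for (const phrase of candidatePhrases(text)) {
    const vector = await embed(phrase);
    scored.push({ phrase, score: cosine(queryVector, vector) });
  }
  return scored.sort((a, b) => b.score - a.score).slice(0, limit);
}
```

You could call topPhrases(doc.summary, embedding, embed) for each document returned by the pipeline. Embedding every phrase costs one model call per phrase, so in practice you would probably batch the calls or cap the number of candidates.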