Smart Filtering: A Guide to Generating Pre-filters for Semantic Search
Vipul Bhardwaj, Fabian Valle • 20 min read • Published Sep 03, 2024 • Updated Oct 02, 2024
Ever searched for "old black and white comedies" only to be bombarded with a mix of modern action flicks? Frustrating, right? That’s the challenge with traditional search engines — they often struggle to understand the nuances of our queries, leaving us wading through irrelevant results.
This is where smart filtering comes in. It's a game-changer that uses metadata and vector search to deliver search results that truly match your intent. Imagine finding exactly the classic comedies you crave, without the hassle.
In this blog, we'll dive into what smart filtering is, how it works, and why it's essential for building better search experiences. Let's uncover the magic behind this technology and explore how it can revolutionize the way you search.
Vector search is a powerful tool that helps computers understand the meaning behind data, not just the words themselves. Instead of matching keywords, it focuses on the underlying concepts and relationships. Imagine searching for "dog" and getting results that include "puppy," "canine," and even images of dogs. That's the magic of vector search!
How does it work? Well, it transforms data into mathematical representations called vectors. These vectors are like coordinates on a map, and similar data points are closer together in this vector space. When you search for something, the system finds the vectors closest to your query, giving you results that are semantically similar.
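To make "closer together in vector space" concrete, here is a minimal sketch that ranks words by cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings (which have hundreds or thousands of dimensions), and `cosine_similarity` is a helper written for this illustration, not part of any library used later.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; the values are invented for illustration.
vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

# Rank every other word by how similar it is to the query word "dog".
query_vector = vectors["dog"]
ranked = sorted(
    (name for name in vectors if name != "dog"),
    key=lambda name: cosine_similarity(query_vector, vectors[name]),
    reverse=True,
)
print(ranked)
```

With these toy values, "puppy" ranks ahead of "car" because its vector points in nearly the same direction as the one for "dog".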
While vector search is fantastic at understanding context, it sometimes falls short when it comes to simple filtering tasks. For instance, finding all movies released before 2000 requires precise filtering, not just semantic understanding. This is where smart filtering comes in to complement vector search.
While vector search brings us closer to understanding the true meaning of queries, there's still a gap between what users want and what search engines deliver. Complex search queries like "earliest comedy movies before 2000" can still be a challenge. Semantic search might understand the concepts of "comedy" and "movies," but it might struggle with the specifics of "earliest" and "before 2000."
This is where the results start to get messy. We might get a mix of old and new comedies, or even dramas that were mistakenly included. To truly satisfy users, we need a way to refine these search results and make them more precise. That's where pre-filters come into play.
![Flowchart showing the process of generating pre-filters for a user query. The flow begins with metadata filtering, where the user query and metadata are passed to a query constructor that creates a pre-filter query. This is then translated into a MongoDB query. The flow continues with a time range query constructor, implemented as an LLM agent, which uses a QueryExecutorMongoDB tool to query data from a vector database. Finally, the metadata filter and time-based query filter are merged to create the final pre-filter.][1]
Smart filtering is the solution to this challenge. It's a technique that uses a dataset's metadata to create specific filters, refining search results and making them more accurate and efficient. By analyzing the information about your data, like its structure, content, and attributes, smart filtering can identify relevant criteria to filter your search.
Imagine searching for "comedy movies released before 2000." Smart filtering would use metadata like genre, release date, and potentially even plot keywords to create a filter that only includes movies matching those criteria. This way, you get a list of exactly what you want, without the irrelevant noise.
Smart filtering is a multi-step process that involves extracting information from your data, analyzing it, and creating specific filters based on your needs. Let's break it down:
- Metadata extraction: The first step is to gather relevant information about your data. This includes details like:
  - Data structure: How is the data organized (e.g., tables, documents)?
  - Attributes: What kind of information is included (e.g., title, description, release date)?
  - Data types: What format is the data in (e.g., text, numbers, dates)?
- Pre-filter generation: Once you have the metadata, you can start creating pre-filters. These are specific conditions that data must meet to be included in the search results. For example, if you're searching for comedy movies released before 2000, you might create pre-filters for:
  - Genre: comedy
  - Release date: before 2000
- Integration with vector search: The final step is to combine these pre-filters with your vector search. This ensures that the vector search only considers data points that match your pre-defined criteria.
By following these steps, smart filtering significantly improves the accuracy and efficiency of your search results.
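The steps above can be sketched in a few lines of plain Python: apply the pre-filter to the metadata first, then rank only the surviving candidates by vector similarity. Everything here is a simplified assumption for illustration: the records, their two-dimensional "embeddings", and the `passes_pre_filter` and `search` helpers are hypothetical, and a real system would delegate both steps to the vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical records: metadata plus a toy 2-d embedding.
movies = [
    {"title": "A", "genre": ["comedy"], "release_date": "1994-06-01", "embedding": [0.9, 0.1]},
    {"title": "B", "genre": ["comedy"], "release_date": "2012-03-10", "embedding": [1.0, 0.0]},
    {"title": "C", "genre": ["drama"], "release_date": "1988-11-20", "embedding": [0.8, 0.2]},
    {"title": "D", "genre": ["comedy", "romance"], "release_date": "1999-02-14", "embedding": [0.2, 0.8]},
]


def passes_pre_filter(movie):
    """Pre-filter: comedy genre AND released before 2000."""
    return "comedy" in movie["genre"] and movie["release_date"] < "2000-01-01"


def search(query_embedding, k=2):
    """Rank only the pre-filtered candidates by similarity to the query."""
    candidates = [m for m in movies if passes_pre_filter(m)]
    candidates.sort(key=lambda m: cosine(query_embedding, m["embedding"]), reverse=True)
    return [m["title"] for m in candidates[:k]]


print(search([1.0, 0.0]))
```

Note that "B" has the embedding closest to the query, yet it never appears in the results because the pre-filter removes it before the similarity ranking runs.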
To be successful with this tutorial, you will need:
- The IDE of your choosing. This tutorial uses a Jupyter notebook. Please feel free to run your commands directly from a notebook.
- An OpenAI API key. We will use OpenAI models to embed our data and generate filters. You will need access to the `text-embedding-ada-002` embedding model and the `gpt-4o` model for text generation.
- Python <4.0, >=3.8.1.
The following instructions are for running in a notebook but can be adapted to run in your IDE. There will just be some differences.
Install the required dependencies.
```python
!pip install pymongo==4.7.2 langchain-core==0.2.6 langchain-openai==0.1.7 langchain==0.2.1 langchain-community==0.2.4 lark==1.1.9
```
For the purpose of this tutorial, we will use some sample movie data. We will define a list of LangChain documents. Each document represents one movie and has a `page_content` with the movie description and some metadata. In the sample data, the metadata has `release_date`, `rating`, and `genre`.

```python
from langchain_core.documents import Document

# Sample movies data. The metadata has release_date, rating, and genre
# (some entries have no rating).
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"release_date": "1994-04-15", "rating": 7.7, "genre": ["action", "scifi", "adventure"]},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"release_date": "2010-07-16", "rating": 8.2, "genre": ["action", "thriller"]},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"release_date": "2006-11-25", "rating": 8.6, "genre": ["anime", "thriller", "scifi"]},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"release_date": "2019-12-25", "rating": 8.3, "genre": ["romance", "drama", "comedy"]},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    ),
    Document(
        page_content="The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.",
        metadata={"release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
    ),
    Document(
        page_content="The toys face an uncertain future as they are accidentally donated to a daycare center, leading to a thrilling escape plan.",
        metadata={"release_date": "2010-06-18", "genre": ["animation", "adventure", "comedy"]},
    ),
]
```
```python
import getpass
from pymongo import MongoClient

# set up your MongoDB connection
connection_string = getpass.getpass(prompt="Enter connection string WITH USER + PASS here")

client = MongoClient(connection_string)

# name your database and collection anything you want since it will be created when you enter your data
database_name = "smart_filtering"
collection_name = "movies"
collection = client[database_name][collection_name]
```
For this tutorial, we are using OpenAI's `text-embedding-ada-002` embedding model.

```python
import json

from langchain_openai import OpenAIEmbeddings

# openAI API credentials
openai_api_base = getpass.getpass(prompt="Put in OpenAI URL here")
openai_api_key = getpass.getpass(prompt="Put in OpenAI API Key here")

# default_headers is optional
default_headers = getpass.getpass(prompt="Put in OpenAI API headers if applicable here")
default_headers = json.loads(default_headers) if default_headers else None

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key,
                              openai_api_base=openai_api_base,
                              default_headers=default_headers,
                              model="text-embedding-ada-002")
```
We will use the MongoDBAtlasVectorSearch vector store from LangChain to embed and ingest our data. It accepts a list of docs, the embedding object we created earlier, and the MongoDB collection. This step will initialize our MongoDB collection with the movies data and their embeddings.

```python
# In LangChain 0.2.x, the vector store is provided by the langchain-community package
from langchain_community.vectorstores import MongoDBAtlasVectorSearch

# This step will generate the embeddings and insert the documents
vectorStore = MongoDBAtlasVectorSearch.from_documents(docs, embeddings, collection=collection)
```
After initialization, the inserted data will look like the following:
![Flowchart showing the process of generating pre-filters for a user query. The flow begins with metadata filtering, where the user query and metadata are passed to a query constructor that creates a pre-filter query. This is then translated into a MongoDB query. The flow continues with a time range query constructor, implemented as an LLM agent, which uses a QueryExecutorMongoDB tool to query data from a vector database. Finally, the metadata filter and time-based query filter are merged to create the final pre-filter.][2]
We have the `page_content` in the `text` field, which is the default for the MongoDBAtlasVectorSearch vector store. The embedding is an array of floats saved under `embedding`. The metadata fields are present as `release_date`, `rating`, and `genre`.

Before we can perform a search on our data, we need to create a search index. Follow these steps to create a search index.
- In the Atlas UI, go to your collection `smart_filtering.movies`.
- Click on `Search Indexes`.
- Click on `Create Index`.
- Under `Atlas Search`, select `JSON Editor`.
- Name your index `default` and copy/paste the below index definition:
```json
{
  "analyzer": "lucene.standard",
  "searchAnalyzer": "lucene.standard",
  "mappings": {
    "fields": {
      "embedding": {
        "type": "knnVector",
        "dimensions": 1536,
        "similarity": "cosine"
      },
      "rating": {
        "type": "number"
      },
      "release_date": {
        "type": "token"
      },
      "genre": {
        "type": "token"
      }
    }
  }
}
```
The index will take a couple of seconds to build. After the index builds successfully, we will be ready to query our data.
Let’s say a user wants to find documents for the latest movie released before some date in the animation genre with this query: `I want to watch a movie released before year 2000 in the animation genre with the latest release date`.

We will try semantic search for this query and see what results we get.
```python
vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

docs = vectorStore.similarity_search(query)
for doc in docs:
    print(doc.page_content)
```
Output: We received four movies in the output, three of which are not relevant to the user’s query. This is the problem with semantic search that we will solve using smart filtering.
```
A bunch of scientists bring back dinosaurs and mayhem breaks loose
A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea
Toys come alive and have a blast doing so
Leo DiCaprio gets lost in a dream within a dream within a dream within a ...
```
Let us analyze the filtering requirements from the user query: `I want to watch a movie released before year 2000 in the animation genre with the latest release date`.

There are two types of filter requirements in the user query, and we will solve them in two stages.
We are performing the filtering in two stages because these are two different tasks and we should perform one task at a time with LLMs to get better results.
Stage 1 — metadata filter: A pre-filter can be generated based on the query and metadata.
- Release before year 2000 can be a potential filter.
- Animation genre can be a potential filter.
Stage 2 — time-based filter: We will also need to account for the latest release date. We will need to query the data to find the latest release date movie.
We will be using LangChain’s load_query_constructor_runnable to generate our filter query and then we will be using MongoDBAtlasTranslator to convert the query to a valid MongoDB query. We will need the below to pass to load_query_constructor_runnable:
- A description of the content of our data that will be passed in document_content
- Metadata attributes of our data passed in attribute_info
- Prompt with some examples that the LLM will use to generate the query
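To illustrate the two-step idea (a structured filter expression first, a MongoDB query second), here is a toy sketch. It is not LangChain's implementation: the nested-tuple representation and the `to_mongo` helper are hypothetical stand-ins for the structured query and for what MongoDBAtlasTranslator does.

```python
# Hypothetical nested-tuple form of the filter expression
#   and(lt("release_date", "2000-01-01"), in("genre", ["animation"]))
structured_filter = (
    "and",
    ("lt", "release_date", "2000-01-01"),
    ("in", "genre", ["animation"]),
)

LOGICAL_OPERATORS = {"and", "or"}


def to_mongo(node):
    """Recursively convert a nested-tuple filter into a MongoDB-style filter dict."""
    operator = node[0]
    if operator in LOGICAL_OPERATORS:
        # Logical operators wrap a list of translated sub-statements.
        return {f"${operator}": [to_mongo(child) for child in node[1:]]}
    # Comparison statements have the shape (comparator, attribute, value).
    _, attribute, value = node
    return {attribute: {f"${operator}": value}}


print(to_mongo(structured_filter))
```

The printed dictionary has the same shape as a MongoDB filter: logical operators become `$and`/`$or` lists, and each comparison becomes `{attribute: {"$comparator": value}}`.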
For the purpose of this tutorial, we will define the metadata that we will be using for filtering. We will need to provide a `document_content_description` for the content, and the `name`, `description`, and `type` of each field. This requires some basic understanding of the data.

Description of the content of our data:
```python
document_content_description = "Brief summary of a movie"
```
Metadata attributes of our data:
```python
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="Keywords for filtering: ['animation', 'action', 'comedy', 'romance', 'thriller']",
        type="[string]",
    ),
    AttributeInfo(
        name="release_date",
        description="The date the movie was released on",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
```
Our goal is to extract meaningful information from the user query that we can use for the metadata filtering. We will pass the metadata of our data in the context such that the LLM gets an idea of the information that can be used as a filter. We will use a few-shot prompting technique to generate our results.
Few-shot prompting is a technique used with large language models, where the model is given a few examples of a task within the prompt to help guide it to produce the desired output.
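As a rough illustration of the idea, the sketch below assembles a few-shot prompt by hand. The example queries, the instruction line, and the `build_few_shot_prompt` helper are all hypothetical; in this tutorial, LangChain's query constructor builds the real prompt for us.

```python
# Hypothetical examples: each pairs a user query with the filter we expect back.
examples = [
    {
        "query": "action movies before 2010",
        "filter": 'and(in("genre", ["action"]), lt("release_date", "2010-01-01"))',
    },
    {"query": "recommend any movie", "filter": "NO_FILTER"},
]


def build_few_shot_prompt(examples, user_query):
    """Place worked examples before the real question so the model can imitate them."""
    parts = ["Extract a filter from the user query."]
    for i, example in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nUser query: {example['query']}\nFilter: {example['filter']}"
        )
    # The prompt ends with an unanswered "Filter:" for the model to complete.
    parts.append(f"User query: {user_query}\nFilter:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(examples, "animation movies before 2000")
print(prompt)
```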
Note: Please update the prompt as per your use case.
The below prompt will be passed as `schema_prompt` to LangChain’s load_query_constructor_runnable and will be used to generate the query.

The prompt instructs the LLM on how to generate the query. We start from the prompt defined in LangChain’s query constructor and change it as per our use case.
Let’s break down the prompt and understand it:
- In the beginning, we instruct the LLM to output the result in JSON format with the rewritten query and filter as keys.
- We instruct the LLM to not include any information in the new query that is already accounted for in the filter.
- The variables that will be filled in later are wrapped in `{}` in the prompt.
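One detail worth noting in the prompt below is the quadruple braces (`{{{{` and `}}}}`) around the JSON schema. In Python-style format strings, a doubled brace is an escaped literal brace, so each formatting pass halves the number of braces. The quadruple braces suggest the template is formatted twice before it reaches the LLM; that is our reading of the template rather than something stated in this tutorial. A minimal sketch with plain `str.format` (the `allowed` variable is a made-up placeholder):

```python
# A template with quadruple braces around literal JSON and one real variable.
template = '{{{{ "filter": "{allowed}" }}}}'

# First pass fills the variable and turns {{{{ into {{.
first_pass = template.format(allowed="eq | lt")

# Second pass turns {{ into a literal {.
second_pass = first_pass.format()

print(first_pass)
print(second_pass)
```

After the first pass the braces are still escaped, and only after the second pass do they become the literal braces of a JSON object.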
````python
from langchain_core.prompts import PromptTemplate


DEFAULT_SCHEMA = """\
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{{{{
    "query": string \\ rewritten user's query after removing the information handled by the filter
    "filter": string \\ logical condition statement for filtering documents
}}}}
```

The query string should be re-written. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` ({allowed_comparators}): comparator
- `attr` (string): name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` ({allowed_operators}): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure you understand the user's intent while generating a date filter. Use a range comparators such as gt | gte | lt | lte for partial dates.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.\
"""
DEFAULT_SCHEMA_PROMPT = PromptTemplate.from_template(DEFAULT_SCHEMA)
````
Now, we will define some examples that we will use in our prompt. The examples will help the LLM to generate better results. We will need to define metadata for our example, a user query, and an expected answer.
Let’s define three data sources:
- A songs data source that has a content description and attributes definition.
````python
SONG_DATA_SOURCE = """\
```json
{{
    "content": "Lyrics of a song",
    "attributes": {{
        "artist": {{
            "type": "string",
            "description": "Name of the song artist"
        }},
        "length": {{
            "type": "integer",
            "description": "Length of the song in seconds"
        }},
        "genre": {{
            "type": "[string]",
            "description": "The song genre, one or many of [\"pop\", \"rock\" or \"rap\"]"
        }},
        "release_dt": {{
            "type": "string",
            "description": "Release date of the song."
        }}
    }}
}}
```\
"""
````
- A movies data source. This is similar to our sample data that we are trying to solve. Adding it in the few-shot examples can improve our results.
````python
MOVIES_DATA_SOURCE = """\
```json
{{
    "content": "Brief summary of a movie",
    "attributes": {{
        "release_date": {{
            "type": "string",
            "description": "The release date of the movie"
        }},
        "genre": {{
            "type": "[string]",
            "description": "Keywords for filtering: ['anime', 'action', 'comedy', 'romance', 'thriller']"
        }}
    }}
}}
```\
"""
````
- A generic keyword data source. The LLM was struggling with generating correct query format for the keywords/array filtering so we added this to improve our results.
````python
# The closing fence inside the string uses three backticks, matching the opening fence.
KEYWORDS_DATA_SOURCE = """\
```json
{{
    "content": "Documents store",
    "attributes": {{
        "tags": {{
            "type": "[string]",
            "description": "Keywords for filtering: ['rag', 'genai', 'gpt', 'langchain', 'llamaindex']"
        }}
    }}
}}
```\
"""
````
Note: Please add some examples as per your use case to enhance the results.
Now, let’s define some example user queries and expected answers:
````python
KEYWORDS_DATA_SOURCE_ANSWER = """\
```json
{{
    "query": "Give me updates",
    "filter": "in(\\"tags\\", [\\"rag\\", \\"langchain\\"])"
}}
```\
"""

KEYWORDS_DATE_DATA_SOURCE_ANSWER = """\
```json
{{
    "query": "Tell me updates on connectors based on the latest documentation",
    "filter": "in(\\"tags\\", [\\"langchain\\", \\"llamaindex\\"])"
}}
```\
"""

FULL_ANSWER = """\
```json
{{
    "query": "songs about teenage romance",
    "filter": "and(or(eq(\\"artist\\", \\"Taylor Swift\\"), eq(\\"artist\\", \\"Katy Perry\\")), lt(\\"length\\", 180), in(\\"genre\\", [\\"pop\\"]), and(gt(\\"release_dt\\", \\"2010-12-31\\"), lt(\\"release_dt\\", \\"2020-01-01\\")))"
}}
```\
"""

# Note: the closing parenthesis of the outer and(...) is required for a balanced expression.
DATE_ANSWER = """\
```json
{{
    "query": "Recommend a movie with latest release date",
    "filter": "and(lt(\\"release_date\\", \\"2010-01-01\\"), in(\\"genre\\", [\\"action\\", \\"thriller\\"]))"
}}
```\
"""

NO_FILTER_ANSWER = """\
```json
{{
    "query": "",
    "filter": "NO_FILTER"
}}
```\
"""
````
Putting the above together, we can define our examples with the data source definition, user query, and expected answer:
```python
DEFAULT_EXAMPLES = [
    {
        "i": 1,
        "data_source": MOVIES_DATA_SOURCE,
        "user_query": "Recommend an action or thriller genre movie release before 2010 and latest release date",
        "structured_request": DATE_ANSWER,
    },
    {
        "i": 2,
        "data_source": MOVIES_DATA_SOURCE,
        "user_query": "Recommend a latest movie",
        "structured_request": NO_FILTER_ANSWER
    },
    {
        "i": 3,
        "data_source": SONG_DATA_SOURCE,
        "user_query": "What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre released before 1 January 2020 and after 31 December, 2010",
        "structured_request": FULL_ANSWER,
    },
    {
        "i": 4,
        "data_source": SONG_DATA_SOURCE,
        "user_query": "What are songs that were not published on Spotify",
        "structured_request": NO_FILTER_ANSWER,
    },
    {
        "i": 5,
        "data_source": KEYWORDS_DATA_SOURCE,
        "user_query": "Give me updates on rag with langchain",
        "structured_request": KEYWORDS_DATA_SOURCE_ANSWER
    },
    {
        "i": 6,
        "data_source": KEYWORDS_DATA_SOURCE,
        "user_query": "Tell me updates on langchain and llamaindex connectors based on the latest documentation",
        "structured_request": KEYWORDS_DATE_DATA_SOURCE_ANSWER
    }
]
```
Let us define a utility function that we will use to process our filters before returning.
```python
def enforce_constraints(input_json):
    def process_value(value):
        # Accept plain scalars; float is included so that numeric filters on
        # the float-typed `rating` field are not rejected.
        if isinstance(value, (str, int, float)):
            return value
        elif isinstance(value, list) and all(isinstance(item, str) for item in value):
            return value
        elif isinstance(value, dict) and 'date' in value and isinstance(value['date'], str):
            # Unwrap date objects of the form {"date": "YYYY-MM-DD", ...}
            return value['date']
        else:
            raise ValueError("Invalid value type")

    def process_dict(d):
        if not isinstance(d, dict):
            return d
        processed_dict = {}
        for k, v in d.items():
            if k.startswith("$") and isinstance(v, list):
                # Handling $and and $or conditions
                processed_dict[k] = [process_dict(item) for item in v]
            elif k.startswith("$"):
                processed_dict[k] = process_value(v)
            else:
                processed_dict[k] = process_dict(v)
        return processed_dict

    return process_dict(input_json)
```
Now, let's go ahead and define our `generate_metadata_filter` function that we will be using to generate our metadata filters.

```python
from langchain.chains.query_constructor.base import load_query_constructor_runnable
from langchain_community.query_constructors.mongodb_atlas import MongoDBAtlasTranslator

translator = MongoDBAtlasTranslator()
allowed_operators = translator.allowed_operators
allowed_comparators = translator.allowed_comparators

def generate_metadata_filter(query, llm, document_content_description, metadata_field_info, schema_prompt, examples):
    """
    This method will use the query constructor to generate the pre-filter for the user query.
    :param query: query submitted by the user
    :param llm: llm instance
    :param document_content_description: Data description
    :param metadata_field_info: metadata fields information
    :param schema_prompt: prompt instructions that will be passed to the llm
    :param examples: list of examples
    :return (tuple): Returns the pre-filter (dict) and the rewritten query (str).
    """
    query_constructor = load_query_constructor_runnable(
        llm=llm,
        document_contents=document_content_description,
        attribute_info=metadata_field_info,
        schema_prompt=schema_prompt,
        examples=examples,
        allowed_comparators=allowed_comparators,
        allowed_operators=allowed_operators
    )
    query = f"""Answer the below question:\n
    Question: {query}
    """
    structured_query = query_constructor.invoke(query)
    new_query, new_kwargs = translator.visit_structured_query(structured_query)
    pre_filter = enforce_constraints(new_kwargs)
    # Fall back to an empty filter when the translator produced none
    pre_filter = pre_filter.get("pre_filter", {}) if pre_filter else {}
    return pre_filter, new_query
```
Let us define our LLM object for text generation. We will use LangChain’s ChatOpenAI for this purpose. We will use the `gpt-4o` model for our filter generation.

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(openai_api_key=openai_api_key,
                 openai_api_base=openai_api_base,
                 default_headers=default_headers,
                 model="gpt-4o")
```
Now that we have everything we need to generate the metadata filter, let’s give it a try.
```python
query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

pre_filter, new_query = generate_metadata_filter(
    query=query,
    llm=llm,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info,
    schema_prompt=DEFAULT_SCHEMA_PROMPT,
    examples=DEFAULT_EXAMPLES)

print(pre_filter)
print(new_query)

vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

docs = vectorStore.similarity_search(query=query, pre_filter=pre_filter)
for doc in docs:
    print(doc.page_content)
```
Output:
Generated filter and new query:
```
{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}
I want to watch a movie with the latest release date
```
As you can see in the generated filter, we were able to extract the information from the user’s query and the metadata, such as “release before year 2000” and “animation genre,” that can be used to pre-filter the data before running the semantic search.
Note that we are also returning a new query after removing the filters that we generated. This will be helpful in the next stage of filter generation.
Output documents:

```
Toys come alive and have a blast doing so
The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.
```
We have received two documents in the output which is a better result than before. But we are still not able to get the “latest release date” movie.
And, to find the latest movie, we will need to query our data so we will move to Stage 2, filter generation.
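To see why the Stage 1 filter leaves exactly two candidates, and what is still missing, the sketch below evaluates the generated filter against a few of the sample documents in plain Python. The `matches` helper is a hypothetical, deliberately tiny evaluator that supports only `$and`, `$or`, `$lt`, and `$in`; it is not how MongoDB evaluates filters. It relies on the fact that dates in `YYYY-MM-DD` form compare correctly as plain strings.

```python
# Metadata copied from four of the sample documents defined earlier.
movies = [
    {"text": "Toys come alive and have a blast doing so",
     "release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    {"text": "The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.",
     "release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
    {"text": "The toys face an uncertain future as they are accidentally donated to a daycare center, leading to a thrilling escape plan.",
     "release_date": "2010-06-18", "genre": ["animation", "adventure", "comedy"]},
    {"text": "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
     "release_date": "1994-04-15", "genre": ["action", "scifi", "adventure"]},
]

# The Stage 1 filter printed above.
pre_filter = {"$and": [{"release_date": {"$lt": "2000-01-01"}}, {"genre": {"$in": ["animation"]}}]}


def matches(doc, flt):
    """Evaluate a small subset of MongoDB filter syntax against a dict."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(doc, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(doc, sub) for sub in condition):
                return False
        else:
            value = doc.get(key)
            for op, operand in condition.items():
                if op == "$lt":
                    # ISO dates (YYYY-MM-DD) order correctly as strings.
                    if value is None or not value < operand:
                        return False
                elif op == "$in":
                    # For array fields, any element may match.
                    values = value if isinstance(value, list) else [value]
                    if not any(v in operand for v in values):
                        return False
                else:
                    raise ValueError(f"unsupported operator: {op}")
    return True


remaining = [m for m in movies if matches(m, pre_filter)]
latest = max(remaining, key=lambda m: m["release_date"])

print(len(remaining))
print(latest["release_date"])
```

Two documents survive the filter, and picking the one with the greatest `release_date` is exactly the extra lookup that Stage 2 automates.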
The purpose of this stage is to generate the filters that can be used to find the `movies with the latest release date`.

Note: We will use the filter generated in Stage 1 in this stage because we want to find the `movie with the latest release date` in `movies released before 2000 and in the animation category`.

We will need to query our MongoDB collection via the LLM. For this purpose, we will be defining some tools that the LLM can use to query our data.
Let us define the tools to allow LLM to query our MongoDB collection.
```python
import json
import logging
import os
import traceback
from typing import Dict, Optional, Type, Union, List

from pymongo import MongoClient
from langchain_core.callbacks import CallbackManagerForToolRun
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.tools import BaseTool

class MongoDBClient:
    """Data helper for querying MongoDB Vector Indexes."""

    def __init__(self, collection):
        self.collection = collection

    def run_aggregate_pipeline(self, pipeline: List[Dict]) -> List[Dict]:
        documents = list(self.collection.aggregate(pipeline))
        return documents

class BaseMongoDBTool(BaseModel):
    """Base tool for interacting with MongoDB."""

    client: MongoDBClient = Field(exclude=True)
    match_filter: dict = Field(exclude=True)

    class Config(BaseTool.Config):
        pass

class _QueryExecutorMongoDBToolInput(BaseModel):
    pipeline: str = Field(..., description="A valid MongoDB pipeline in JSON string format")

class QueryExecutorMongoDBTool(BaseMongoDBTool, BaseTool):
    name: str = "mongo_db_executor"
    description: str = """
    Input to this tool is a mongodb pipeline, output is a list of documents.
    If the pipeline is not correct, an error message will be returned.
    If an error is returned, report back to the user the issue and stop.
    """
    args_schema: Type[BaseModel] = _QueryExecutorMongoDBToolInput

    def _run(
        self,
        pipeline: str,
        run_manager: Optional[CallbackManagerForToolRun] = None,
    ) -> Union[List[Dict], str]:
        """Get the result for the mongodb pipeline."""
        try:
            pipeline = json.loads(pipeline)
            # Restrict the LLM-generated pipeline to documents that pass the Stage 1 filter
            if self.match_filter:
                pipeline = [{"$match": self.match_filter}] + pipeline
            print(f"Updated pipeline: {pipeline}/")
            documents = self.client.run_aggregate_pipeline(pipeline)
            return documents
        except Exception as e:
            # Format the error message
            return f"Error: {e}\n{traceback.format_exc()}"
```
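The tool above parses the pipeline string it receives and prepends the Stage 1 filter as a `$match` stage. The sketch below walks through that step without a database. The pipeline string, the `$gte` date filter, and the `merge_filters` helper are assumptions made for illustration, not output captured from the agent, and the list of release dates stands in for the aggregation result.

```python
import json

stage1_filter = {"$and": [{"release_date": {"$lt": "2000-01-01"}}, {"genre": {"$in": ["animation"]}}]}

# The kind of pipeline string an LLM might pass to the tool (an assumption).
llm_pipeline = json.dumps([
    {"$sort": {"release_date": -1}},
    {"$limit": 1},
    {"$project": {"_id": 0, "release_date": 1}},
])

# Mirror the tool: parse the string, then prepend the Stage 1 filter as $match.
pipeline = [{"$match": stage1_filter}] + json.loads(llm_pipeline)

# Stand-in for the aggregation result: release dates of the documents that
# passed Stage 1 in the sample data. Sort descending and keep the first one.
stage1_release_dates = ["1995-11-22", "1999-11-24"]
latest = sorted(stage1_release_dates, reverse=True)[0]

# Hypothetical time-based filter derived from the lookup result.
time_filter = {"release_date": {"$gte": latest}}


def merge_filters(*filters):
    """Combine non-empty filters with $and; return {} if there are none."""
    non_empty = [f for f in filters if f]
    if not non_empty:
        return {}
    if len(non_empty) == 1:
        return non_empty[0]
    return {"$and": non_empty}


final_filter = merge_filters(stage1_filter, time_filter)
print(pipeline)
print(final_filter)
```

Merging with `$and` keeps the Stage 1 conditions intact, so the final pre-filter still excludes anything outside the animation genre or released in 2000 or later.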
In the second stage, we are only accounting for use cases where the user wants "latest," "recent," "first," or "last" type of queries. We will instruct our LLM to only generate an aggregation pipeline to generate filters for these types of queries.
1 SYSTEM_PROMPT_TEMPLATE = """ 2 Your goal is to structure the user's query to match the request schema provided below. 3 4 << Structured Request Schema >> 5 When responding use a markdown code snippet with a JSON object formatted in the following schema: 6 7 ```json 8 {{{{ 9 "query": string \\ rewritten user's query after removing the information handled by the filter 10 "filter": string \\ logical condition statement for filtering documents 11 }}}}
The query string should be re-written. Any conditions in the filter should not be mentioned in the query as well.
A logical condition statement is composed of one or more comparison and logical operation statements.
A comparison statement takes the form:
comp(attr, val)
:comp
('eq | ne | gt | gte | lt | lte | in | nin'): comparatorattr
(string): name of attribute to apply the comparison toval
(string): is the comparison value
A logical operation statement takes the form
op(statement1, statement2, ...)
:op
('and | or'): logical operatorstatement1
,statement2
, ... (comparison statements or logical operation statements): one or more statements to apply the operation to
First step is to think about whether the user question mentions anything about date or time related that require a lookup in the MongoDB database. Words like "latest", "recent", "earliest", "first", "last" etc. in the query means a look up could be required.
If no lookup is required, return "NO_FILTER" for the filter value.
If required, create a syntactically correct MongoDB aggregation pipeline using '$sort' and '$limit' operator to run.
Use projection to only fetch the relevant date columns.
Then look at the results of the aggregation pipeline and generate a date range query that can be used to filter relevant documents from the collection.
Make sure to only generate date-based filters.
Make sure to only generate the query if a user asks about a time based question such as latest, most recent and not mention a specific date time.
Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to date/time attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format
YYYY-MM-DD
when handling date data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.
Make sure the column names in the filter query are in double quotes.

<< Data Source >>
{{{{
    "content": {content_description},
    "attributes": {attribute_info}
}}}}
"""
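The lookup that this prompt asks the agent to perform can be sketched without an LLM. The following is a minimal illustration, not the tutorial's implementation: the helper names `build_date_lookup_pipeline` and `date_filter_from_lookup` are hypothetical, and the sample result simply mirrors the `release_date` value that appears in the output later in this article. It builds a `$match`/`$sort`/`$limit`/`$project` pipeline and then turns the returned date into a filter.

```python
from typing import Any, Dict, List


def build_date_lookup_pipeline(
    pre_filter: Dict[str, Any], date_field: str, latest: bool = True
) -> List[Dict[str, Any]]:
    """Build a pipeline that finds the newest (or oldest) value of date_field.

    Hypothetical helper for illustration only.
    """
    pipeline: List[Dict[str, Any]] = []
    if pre_filter:
        # Restrict the lookup to documents that already pass the metadata filter.
        pipeline.append({"$match": pre_filter})
    # Descending order surfaces the latest date, ascending the earliest.
    pipeline.append({"$sort": {date_field: -1 if latest else 1}})
    pipeline.append({"$limit": 1})
    # Project only the date column, as the prompt instructs.
    pipeline.append({"$project": {"_id": 0, date_field: 1}})
    return pipeline


def date_filter_from_lookup(
    results: List[Dict[str, Any]], date_field: str
) -> Dict[str, Any]:
    """Turn the lookup result into a date filter, or {} if nothing matched."""
    if not results:
        return {}
    return {date_field: {"$eq": results[0][date_field]}}


pipeline = build_date_lookup_pipeline(
    pre_filter={"genre": {"$in": ["animation"]}},
    date_field="release_date",
)
print(pipeline)

# With a live collection, the results would come from
# list(collection.aggregate(pipeline)); here we use sample data instead.
sample_results = [{"release_date": "1999-11-24"}]
print(date_filter_from_lookup(sample_results, "release_date"))
```

Sorting works here because dates stored as `YYYY-MM-DD` strings order the same way lexicographically and chronologically. In the tutorial itself, the agent writes and runs the pipeline through a tool rather than through fixed helpers like these.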
Now, let's go ahead and define the `generate_time_based_filter` function that we will use to generate our time-based filters.

It also handles the cases where only one stage of filter generation is required, or none at all.

```python
from typing import Dict, List, Tuple

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.chains.query_constructor.base import (
    AttributeInfo,
    StructuredQueryOutputParser,
    _format_attribute_info,
)
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
    PromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_openai import ChatOpenAI  # assumed source of ChatOpenAI
from pymongo.collection import Collection

# SYSTEM_PROMPT_TEMPLATE, MongoDBClient, QueryExecutorMongoDBTool, translator, and
# enforce_constraints are assumed to be defined earlier in this tutorial.


def generate_time_based_filter(
    llm: ChatOpenAI,
    collection: Collection,
    pre_filter: Dict,
    query: str,
    document_content_description: str,
    metadata_field_info: List[AttributeInfo],
) -> Tuple[Dict, str]:
    """
    This function is responsible for generating filter query for "most recent", "latest", "earliest" type of user
    questions.
    :param llm: (ChatOpenAI) llm instance
    :param collection: pymongo collection instance
    :param pre_filter: (Dict) metadata pre-filter query
    :param query: (str) user query
    :param document_content_description: (str) description of data
    :param metadata_field_info: (List[AttributeInfo]) list of metadata attributes information
    :return: (Tuple[Dict, str]) time-based filter query and re-written user query
    """
    # Give the agent a tool that runs aggregation pipelines restricted by the stage 1 filter
    client = MongoDBClient(collection=collection)
    executor_tool = QueryExecutorMongoDBTool(client=client, match_filter=pre_filter)
    tools = [executor_tool]
    attribute_str = _format_attribute_info(metadata_field_info)
    system_prompt_template = SYSTEM_PROMPT_TEMPLATE.format(attribute_info=attribute_str,
                                                           content_description=document_content_description)

    prompt = ChatPromptTemplate(input_variables=["agent_scratchpad", "input"],
                                messages=[SystemMessagePromptTemplate(
                                    prompt=PromptTemplate(input_variables=[], template=system_prompt_template)),
                                    MessagesPlaceholder(variable_name="chat_history", optional=True),
                                    HumanMessagePromptTemplate(
                                        prompt=PromptTemplate(input_variables=["input"],
                                                              template="{input}")),
                                    MessagesPlaceholder(variable_name="agent_scratchpad")])

    agent = create_tool_calling_agent(llm, tools, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
    structured_query = agent_executor.invoke({"input": query})

    # Only attributes declared in the metadata are allowed in the generated filter
    allowed_attributes = []
    for ainfo in metadata_field_info:
        allowed_attributes.append(
            ainfo.name if isinstance(ainfo, AttributeInfo) else ainfo["name"]
        )

    output_parser = StructuredQueryOutputParser.from_components(
        allowed_comparators=translator.allowed_comparators,
        allowed_operators=translator.allowed_operators,
        allowed_attributes=allowed_attributes
    )
    # Parse the agent output and translate it into a MongoDB filter
    structured_query = output_parser.parse(structured_query["output"])
    new_query, new_kwargs = translator.visit_structured_query(structured_query)
    time_based_pre_filter = enforce_constraints(new_kwargs)
    time_based_pre_filter = time_based_pre_filter.get('pre_filter', {}) if time_based_pre_filter else {}
    return time_based_pre_filter, new_query
```

#### Metadata and time-based filter generation

Let’s now run both filter stages and check the result.

```python
query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

pre_filter, new_query = generate_metadata_filter(
    query=query,
    llm=llm,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info,
    schema_prompt=DEFAULT_SCHEMA_PROMPT,
    examples=DEFAULT_EXAMPLES)

print(pre_filter)
print(new_query)

time_based_pre_filter, final_query = generate_time_based_filter(
    llm=llm,
    collection=collection,
    pre_filter=pre_filter,
    query=new_query,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info
)
print(time_based_pre_filter)
print(final_query)
```

Output:

```python
{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}
I want to watch a movie with the latest release date

{'release_date': {'$eq': '1999-11-24'}}
I want to watch a movie
```

Now that we have generated the filters for both stages, we can combine them using the `$and` operator to generate our final filter.

```python
# initialize final filter with stage 1 filter
final_pre_filter = pre_filter
if time_based_pre_filter:
    # add time_based_filter if applicable
    final_pre_filter = {"$and": [pre_filter, time_based_pre_filter]}

print(final_pre_filter)
```

Output:

```python
{'$and': [{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}, {'release_date': {'$eq': '1999-11-24'}}]}
```

Now, let’s run a semantic search using the `final_pre_filter`.

```python
vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

docs = vectorStore.similarity_search(query=query, pre_filter=final_pre_filter)
for doc in docs:
    print(doc.page_content)
```

Output:

```bash
The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.
```

With smart filtering, we used both the metadata and time-based filtering stages to generate a filter that pre-filters the data before the semantic search runs. In the end, the search returned only the documents we needed.

## Benefits of smart filtering

Smart filtering brings a host of advantages to the table, making it a valuable tool for enhancing search experiences:

* **Improved search accuracy:** By precisely targeting the data that matches your query, smart filtering increases the likelihood of finding relevant results. No more wading through irrelevant information.
* **Faster search results:** Since smart filtering narrows down the search scope, the system can process information more efficiently, leading to quicker results.
* **Enhanced user experience:** When users find what they're looking for quickly and easily, it leads to higher satisfaction and a better overall experience.
* **Versatility:** Smart filtering can be applied to various domains, from e-commerce product searches to content recommendations, making it a versatile tool.

By leveraging metadata and creating targeted pre-filters, smart filtering empowers you to deliver search results that truly meet user expectations.

## Conclusion

Smart filtering is a powerful tool that transforms search experiences by bridging the gap between user intent and search results. By harnessing the power of metadata and vector search, it delivers more accurate, relevant, and efficient search outcomes.

Whether you're building an e-commerce platform, a content recommendation system, or any application that relies on effective search, incorporating smart filtering can significantly enhance user satisfaction and drive better results.

By understanding the fundamentals of smart filtering, you're equipped to explore its potential and implement it in your projects. So why wait? Start leveraging the power of smart filtering today and revolutionize your search game!

View the [full source code](https://github.com/bhardwaj-vipul/SmartFilteringRAG) for smart filtering using MongoDB Atlas.

Check out additional resources: [Unlock the Power of Semantic Search With MongoDB Atlas Vector Search](https://mongodb.prakticum-team.ru/basics/semantic-search) and [Interactive RAG With MongoDB Atlas + Function Calling API](https://mongodb.prakticum-team.ru/developer/products/atlas/interactive-rag-mongodb-atlas-function-calling-api/). If you have any questions or want to show us what you are building, join us in the [MongoDB Community Forums](https://mongodb.prakticum-team.ru/community/forums/).


[1]: https://images.contentstack.io/v3/assets/blt39790b633ee0d5a7/blt07d47760e8c04d15/66d6dd2326adee3c5af681ed/image2.png
[2]: https://images.contentstack.io/v3/assets/blt39790b633ee0d5a7/blt6ce1fa4c1376e507/66d6dd233bf41e1189a41469/image1.png