Filtering documents in MongoDBAtlasVectorSearch for targeted RetrievalQA in Langchain

work_state · March 17, 2024, 4:50pm

Hello everyone,

I am currently developing a question-answering system using Langchain’s RetrievalQA, with MongoDBAtlasVectorSearch for fetching documents from a MongoDB Atlas collection. I need to refine the document retrieval process to select specific documents that should be used as the basis for answering questions.

Here is the current setup of my retrieval function:

from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings

def create_vector_search(db_name, collection_name):
    vector_search = MongoDBAtlasVectorSearch.from_connection_string(
        "mongodb_connection_string",
        f"{db_name}.{collection_name}",
        OpenAIEmbeddings(),
        index_name="vector_index"
    )

    return vector_search

def perform_question_answering(query):
    vector_search = create_vector_search("langchain_db", "test")

    qa_retriever = vector_search.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 100, "post_filter_pipeline": [{"$limit": 1}]}
    )

    prompt_template = """
    Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

    {context}

    Question: {question}
    """
    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )

    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=qa_retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": PROMPT}
    )

    docs = qa({"query": query})

    return docs["result"], docs['source_documents']

I attempted to use pre_filter in search_kwargs to limit the documents based on a specific condition (e.g., page equals 555), but it didn’t work as expected.

qa_retriever = vector_search.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 100,
        "pre_filter": {"page": {"$eq": 555}},
        "post_filter_pipeline": [
            {"$limit": 1}
        ]
    }
)

So my question is: how can I effectively use pre_filter in MongoDBAtlasVectorSearch to limit retrieval to documents that match specific criteria?

I’m looking for insights, best practices, or examples that could help refine the document selection process within this system.

Thank you for your time and help!

Apoorva_Joshi · March 19, 2024, 3:02pm

Hi! Can you elaborate on what you mean by “didn’t work as expected”? One thing to check is that the fields you are filtering on are indexed in your vector search index definition. See more in the docs here:

Prakul_Agarwal · March 19, 2024, 8:49pm

@work_state

Did you set up a corresponding vector index?

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "page"
    }
  ]
}

Followed by the query:

qa_retriever = vector_search.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 100,
        "pre_filter": {"page": {"$eq": 555}},
        "post_filter_pipeline": [
            {"$limit": 1}
        ]
    }
)
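
A quick way to sanity-check the pairing of index and query is to compare the fields used in pre_filter against the paths declared with type "filter" in the index definition. The sketch below is plain Python and does not talk to Atlas; the helper names referenced_fields and missing_filter_fields are hypothetical, and the nested-operator handling is an assumption that covers only simple cases such as "$and" and "$or".

```python
def referenced_fields(pre_filter):
    """Collect the field names used in an MQL-style filter document."""
    fields = set()
    for key, value in pre_filter.items():
        if key.startswith("$"):
            # Logical operators ("$and", "$or", ...) wrap one or more sub-filters.
            sub_filters = value if isinstance(value, list) else [value]
            for sub in sub_filters:
                fields |= referenced_fields(sub)
        else:
            fields.add(key)
    return fields


def missing_filter_fields(index_definition, pre_filter):
    """Return the filter fields that the index definition does not declare."""
    declared = {
        field["path"]
        for field in index_definition.get("fields", [])
        if field.get("type") == "filter"
    }
    return sorted(referenced_fields(pre_filter) - declared)


index_definition = {
    "fields": [
        {"type": "vector", "path": "embedding",
         "numDimensions": 1536, "similarity": "cosine"},
        {"type": "filter", "path": "page"},
    ]
}

print(missing_filter_fields(index_definition, {"page": {"$eq": 555}}))
```

An empty list means every field in the filter is declared in the index definition; any name it returns would need its own "filter" entry before the pre_filter can use it.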

work_state · March 20, 2024, 11:30am

What I meant by “didn’t work as expected” is that I received answers from the system even when there were no documents in my database that were related to the query. In other words, I expected no results in such cases, but the system still returned answers.
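
One possible explanation for this behavior: a vector search returns the nearest neighbours it can find, related or not, and a "stuff"-style chain will still call the LLM even when the retrieved context is empty or irrelevant. A minimal sketch of a guard that skips the LLM when nothing was retrieved is shown below. It is plain Python; answer_with_guard, retrieve, and generate are hypothetical names, where retrieve stands in for the retriever call and generate for the LLM call.

```python
NO_ANSWER = "I don't know."


def answer_with_guard(query, retrieve, generate):
    """Only call the LLM when the retriever returned at least one document."""
    docs = retrieve(query)
    if not docs:
        # Nothing matched the filter, so there is no context to answer from.
        return NO_ANSWER, []
    return generate(query, docs), docs


# Stand-ins for the real retriever and LLM, for illustration only.
def empty_retriever(query):
    return []


def fake_llm(query, docs):
    return f"Answer based on {len(docs)} document(s)"


print(answer_with_guard("What is on page 555?", empty_retriever, fake_llm))
```

In a real setup, retrieve could wrap the retriever built with pre_filter, so that a filter matching no documents produces the fallback message instead of an answer from the model's general knowledge.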

work_state · March 20, 2024, 11:33am

Yes, I have set up the vector index correctly with both the vector and filter fields as suggested. However, the problem persists even after switching to the similarity_score_threshold search type with a score threshold of 0.5:

qa_retriever = vector_search.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.5,
    }
)

I am still getting answers for queries that should not return any results, as my database does not contain related data. It’s perplexing why I’m receiving these unrelated answers despite the configurations and filters applied.
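
A threshold of 0.5 may be too permissive here. As far as I know, Atlas Vector Search normalizes cosine scores to the range 0 to 1 as (1 + cosine) / 2, so even orthogonal vectors score 0.5, and unrelated texts embedded with the same model often score well above that. The sketch below illustrates the arithmetic and filters (document, score) pairs by hand. It is plain Python; keep_above_threshold is a hypothetical helper, the sample scores are invented for illustration, and the threshold of 0.9 is an assumption that would need tuning against real data. Pairs of this shape can be obtained from the vector store's similarity_search_with_score method.

```python
def atlas_cosine_score(cosine_similarity):
    """Map a cosine similarity in [-1, 1] to a normalized score in [0, 1]."""
    return (1 + cosine_similarity) / 2


def keep_above_threshold(scored_docs, threshold):
    """Keep only the (document, score) pairs whose score meets the threshold."""
    return [(doc, score) for doc, score in scored_docs if score >= threshold]


# Invented scores, for illustration only.
scored_docs = [("related chunk", 0.93), ("unrelated chunk", 0.62)]

print(atlas_cosine_score(0.0))
print(keep_above_threshold(scored_docs, 0.5))
print(keep_above_threshold(scored_docs, 0.9))
```

With the sample scores above, a threshold of 0.5 keeps both chunks while 0.9 keeps only the first, which is why inspecting the actual scores for a few known-unrelated queries is a useful first step before picking a cutoff.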