Handling irrelevant search results in MongoDBAtlasVectorSearch for specific queries

Hello,

I’m using MongoDBAtlasVectorSearch with LangChain’s RetrievalQA to fetch documents from my MongoDB Atlas collection, which contains content about MongoDB Atlas services. While retrieval generally works as expected, I’ve encountered some peculiar behavior.

Here’s the context: When querying topics directly related to the data in my collection, such as MongoDB Atlas features, the system performs well. However, when I query about unrelated topics, for example, “What is Google?”, it still returns documents, despite these topics being unrelated to the content of my database.

Moreover, I’ve noticed an unusual scenario where, after deleting all data from my collection, the RetrievalQA system still provided a relevant answer to the query “What is Google?”. This is perplexing because, with no data in the database, I expected no results or a response like “I don’t know.”

Here’s the code snippet used for the retrieval process:

from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings

def vector_search_from_connection_string(db_name, collection_name):
    vector_search = MongoDBAtlasVectorSearch.from_connection_string(
        "mongodb_connection_string",
        f"{db_name}.{collection_name}",
        OpenAIEmbeddings(),
        index_name="vector_index",
    )

    return vector_search

def perform_question_answering(query):
    vector_search = vector_search_from_connection_string("langchain_db", "test")

    qa_retriever = vector_search.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 100, "post_filter_pipeline": [{"$limit": 1}]}
    )

    prompt_template = """
    Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

    {context}

    Question: {question}
    """
    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )

    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=qa_retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": PROMPT}
    )

    docs = qa({"query": query})

    return docs["result"], docs["source_documents"]

output:

Are there best practices or additional configurations in MongoDB Atlas or LangChain that could help improve the relevance of the search results, especially for unrelated queries?

Any insights, experiences, or recommendations on managing such search behavior would be greatly appreciated. I aim to refine the search results to ensure they are contextually relevant to the queries.

Thank you for your help and guidance!

I’ve noticed an unusual scenario where, after deleting all data from my collection, the RetrievalQA system still provided a relevant answer to the query “What is Google?”. This is perplexing because, with no data in the database, I expected no results or a response like “I don’t know.”

You should try prompting your LLM with a specific instruction, e.g. “Answer ONLY on the basis of the context provided below and don’t use any other knowledge source”, so that it restricts itself to the retrieved documents. Different LLMs have different propensities for instruction following.
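For example, the prompt template from the snippet above could be tightened along these lines (a sketch; the exact wording is up to you, and `{context}`/`{question}` are the placeholders the chain fills in at run time):

```python
# A stricter prompt that explicitly forbids outside knowledge.
strict_template = (
    "Answer ONLY on the basis of the context provided below and don't use "
    "any other knowledge source. If the context does not contain the answer, "
    'just say "I don\'t know."\n\n'
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

# Preview how the final prompt looks when retrieval comes back empty.
preview = strict_template.format(
    context="(no documents retrieved)",
    question="What is Google?",
)
```

The same string can then be passed to `PromptTemplate(template=strict_template, input_variables=["context", "question"])` in place of the original prompt.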

To your question about why documents are returned even when they are irrelevant: the “similarity” search_type retrieves the top-k documents regardless of how low their scores are.

If you want to filter out low scores, you might consider using the “similarity_score_threshold” search type instead, where you can specify a score threshold.

Alternatively, you could also use a post_filter to filter by score.
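Conceptually, a score threshold just drops anything below the cutoff before the documents reach the LLM. A minimal pure-Python sketch of that filtering step (the document texts and scores below are fabricated for illustration):

```python
def filter_by_score(results, score_threshold=0.75):
    """Keep only (document, score) pairs at or above the threshold."""
    return [(doc, score) for doc, score in results if score >= score_threshold]

# Fabricated similarity results as (text, score) pairs; higher = more similar.
results = [
    ("MongoDB Atlas is a managed cloud database service.", 0.91),
    ("Atlas Vector Search indexes embeddings in a collection.", 0.83),
    ("Google is a search engine company.", 0.41),  # unrelated, scores low
]

relevant = filter_by_score(results, score_threshold=0.75)
# Only the two Atlas-related documents remain; the unrelated one is dropped.
```

In LangChain, `vector_search.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.75})` applies this cutoff for you.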

Thank you for the advice. I’ll instruct the LLM to rely solely on the provided context and will open an issue on LangChain to discuss this further. Appreciate your help!

I’ve used the similarity_score_threshold search type with a score threshold to filter the results, but I’m still receiving answers even when the data doesn’t seem to have any documents closely related to the query. Thanks a lot for the help; I appreciate it!

There is a way to set a score threshold on the retriever as well:

retriever = vector_search.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 10,
        "score_threshold": 0.75,
        "pre_filter": {"page": {"$eq": 17}},
    },
)

Thank you for the suggestion. The score_threshold combined with pre_filter is indeed working as expected, and I appreciate your help. However, I’m still facing an issue: even though the source_documents array is empty, which aligns with the absence of related data in my database, I am still getting a detailed answer in the result:

query: "What's mongodb atlas?"
---
answer: {
    "result": "MongoDB Atlas is a cloud-based database service offered by MongoDB for managing and deploying MongoDB databases on the cloud. It provides a secure and scalable platform for storing and accessing data, and offers various features such as automated backups, monitoring, and disaster recovery options.",
    "source_documents": []
}

This is because the model is answering from its parametric knowledge. You might try adding a more explicit instruction to the prompt so the model responds only from the retrieved context, for example: “Answer the following question based only on the context provided.”
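A belt-and-braces option is to refuse in application code whenever nothing was retrieved, rather than relying on the prompt alone. A sketch (`guarded_answer` is a hypothetical helper; the dict shape matches the `result`/`source_documents` output shown above):

```python
def guarded_answer(chain_output, fallback="I don't know."):
    """Return the chain's answer only when it is backed by retrieved documents."""
    if not chain_output.get("source_documents"):
        # Nothing was retrieved: refuse rather than trust the model's
        # parametric knowledge.
        return fallback
    return chain_output["result"]

# No sources: the detailed answer is discarded in favour of the fallback.
empty = {"result": "MongoDB Atlas is a cloud-based database service...",
         "source_documents": []}

# At least one source: the answer is passed through unchanged.
backed = {"result": "Atlas is MongoDB's managed cloud database service.",
          "source_documents": ["<retrieved doc>"]}
```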

@Apoorva_Joshi Yes, exactly. I came up with some changes, and for now it’s working for me.

Here is the current implementation that works for me:

from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI

def vector_search_from_connection_string(db_name, collection_name):
    vector_search = MongoDBAtlasVectorSearch.from_connection_string(
        "mongodb_connection_string",
        f"{db_name}.{collection_name}",
        OpenAIEmbeddings(),
        index_name="vector_index",
    )

    return vector_search

def perform_question_answering(query):
    vector_search = vector_search_from_connection_string("langchain_db", "test")

    qa_retriever = vector_search.as_retriever(
        search_type="similarity",
        search_kwargs={
            "k": 10,
            "score_threshold": 0.75,
        },
    )

    prompt_template = """If you encounter a question for which you don't know the answer based on the predefined points,
    please respond with 'I'm sorry, I can't provide that information\nI don't have any knowledge about {context}\n\nIs there something else you'd like to ask?' and refrain from making up an answer.
    However, if the answer is present in the predefined points, provide comprehensive information related to the user's query.
    Remember, your goal is to assist the user in the best way possible. If the question is unclear or ambiguous, you can ask for clarification.

    Maintain the same language as the follow up input message.

    Chat History:
    {chat_history}

    User's Question: {question} 
    AI Answer:"""
    
    PROMPT = PromptTemplate.from_template(prompt_template)
    
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        input_key="question",
        return_messages=True
    )

    qa = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(),
        verbose=True,
        memory=memory,
        retriever=qa_retriever,
        condense_question_prompt=PROMPT,
        response_if_no_docs_found="""I'm sorry, I can't provide that information.
        I don't have any knowledge about it.

        Is there something else you'd like to ask?"""
    )

    docs = qa({"question": query})

    return docs["answer"]

Thank you all very much for the helpful insights.
