Handling irrelevant search results in MongoDBAtlasVectorSearch for specific queries

Hello,

I’m using MongoDBAtlasVectorSearch with LangChain’s RetrievalQA to fetch documents from my MongoDB Atlas collection, which contains content about MongoDB Atlas services. While retrieval generally works as expected, I’ve encountered some peculiar behavior.

Here’s the context: When querying topics directly related to the data in my collection, such as MongoDB Atlas features, the system performs well. However, when I query about unrelated topics, for example, “What is Google?”, it still returns documents, despite these topics being unrelated to the content of my database.

Moreover, I’ve noticed an unusual scenario where, after deleting all data from my collection, the RetrievalQA system still provided a relevant answer to the query “What is Google?”. This is perplexing because, with no data in the database, I expected no results or a response like “I don’t know.”

Here’s the code snippet used for the retrieval process:

from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings

def vector_search_from_connection_string(db_name, collection_name):
    vector_search = MongoDBAtlasVectorSearch.from_connection_string(
        "mongodb_connection_string",
        f"{db_name}.{collection_name}",
        OpenAIEmbeddings(),
        index_name="vector_index",
    )

    return vector_search

def perform_question_answering(query):
    vector_search = vector_search_from_connection_string("langchain_db", "test")

    qa_retriever = vector_search.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 100, "post_filter_pipeline": [{"$limit": 1}]}
    )

    prompt_template = """
    Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

    {context}

    Question: {question}
    """
    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )

    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=qa_retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": PROMPT}
    )

    docs = qa({"query": query})

    return docs["result"], docs["source_documents"]

output:

Are there best practices or additional configurations in MongoDB Atlas or LangChain that could help improve the relevance of the search results, especially for unrelated queries?

Any insights, experiences, or recommendations on managing such search behavior would be greatly appreciated. I aim to refine the search results to ensure they are contextually relevant to the queries.

Thank you for your help and guidance!

I’ve noticed an unusual scenario where, after deleting all data from my collection, the RetrievalQA system still provided a relevant answer to the query “What is Google?”. This is perplexing because, with no data in the database, I expected no results or a response like “I don’t know.”

You should try prompting your LLM with a specific instruction, e.g. “Answer ONLY on the basis of the context provided below and don’t use any other knowledge source”, so that it restricts itself to the retrieved documents. Different LLMs have different propensities for instruction following.
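For example, the prompt template from the snippet above could be tightened along these lines (a sketch; the exact wording is up to you, and `{context}`/`{question}` are the placeholders the chain fills in at run time):

```python
# A stricter prompt that explicitly forbids outside knowledge.
strict_template = (
    "Answer ONLY on the basis of the context provided below and don't use "
    "any other knowledge source. If the context does not contain the answer, "
    'just say "I don\'t know."\n\n'
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

# Preview how the final prompt looks when retrieval comes back empty.
preview = strict_template.format(
    context="(no documents retrieved)",
    question="What is Google?",
)
```

The same string can then be passed to `PromptTemplate(template=strict_template, input_variables=["context", "question"])` in place of the original prompt.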

To your question about why documents are returned even when they are irrelevant: the “similarity” search_type retrieves the top-k documents regardless of how low their scores are.

If you want to filter out low scores, you might consider using the “similarity_score_threshold” search type instead, where you can specify a score threshold.

Alternatively, you could also use a post_filter to filter by score.
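Conceptually, a score threshold just drops anything below the cutoff before the documents reach the LLM. A minimal pure-Python sketch of that filtering step (the document texts and scores below are fabricated for illustration):

```python
def filter_by_score(results, score_threshold=0.75):
    """Keep only (document, score) pairs at or above the threshold."""
    return [(doc, score) for doc, score in results if score >= score_threshold]

# Fabricated similarity results as (text, score) pairs; higher = more similar.
results = [
    ("MongoDB Atlas is a managed cloud database service.", 0.91),
    ("Atlas Vector Search indexes embeddings in a collection.", 0.83),
    ("Google is a search engine company.", 0.41),  # unrelated, scores low
]

relevant = filter_by_score(results, score_threshold=0.75)
# Only the two Atlas-related documents remain; the unrelated one is dropped.
```

In LangChain, `vector_search.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.75})` applies this cutoff for you.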

Thank you for the advice. I’ll instruct the LLM to rely solely on the provided context and will open an issue on LangChain to discuss this further. Appreciate your help!

I’ve used the similarity_score_threshold search type with a score threshold to filter the results, but I’m still receiving answers even when the data doesn’t seem to have any documents closely related to the query. Thanks a lot for the help; I appreciate it!

There is a way to set a score threshold on the retriever as well:

retriever = vector_search.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 10,
        "score_threshold": 0.75,
        "pre_filter": {"page": {"$eq": 17}},
    },
)

Thank you for the suggestion. The score_threshold combined with pre_filter is indeed working as expected, and I appreciate your help. However, I’m still facing an issue: even though the source_documents array is empty, which aligns with the absence of related data in my database, I am still getting a detailed answer in the result:

query: "What's mongodb atlas?"
---
answer: {
    "result": "MongoDB Atlas is a cloud-based database service offered by MongoDB for managing and deploying MongoDB databases on the cloud. It provides a secure and scalable platform for storing and accessing data, and offers various features such as automated backups, monitoring, and disaster recovery options.",
    "source_documents": []
}

This is because the model is answering from its parametric knowledge. You might try adding a more explicit instruction to the prompt so the model responds only from the retrieved context, for example: “Answer the following question based only on the context provided.”
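A belt-and-braces option is to refuse in application code whenever nothing was retrieved, rather than relying on the prompt alone. A sketch (`guarded_answer` is a hypothetical helper; the dict shape matches the `result`/`source_documents` output shown above):

```python
def guarded_answer(chain_output, fallback="I don't know."):
    """Return the chain's answer only when it is backed by retrieved documents."""
    if not chain_output.get("source_documents"):
        # Nothing was retrieved: refuse rather than trust the model's
        # parametric knowledge.
        return fallback
    return chain_output["result"]

# No sources: the detailed answer is discarded in favour of the fallback.
empty = {"result": "MongoDB Atlas is a cloud-based database service...",
         "source_documents": []}

# At least one source: the answer is passed through unchanged.
backed = {"result": "Atlas is MongoDB's managed cloud database service.",
          "source_documents": ["<retrieved doc>"]}
```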

@Apoorva_Joshi Yes, exactly. I came up with some changes, and for now it’s working for me.

Here is the current implementation that works for me:

from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI

def vector_search_from_connection_string(db_name, collection_name):
    vector_search = MongoDBAtlasVectorSearch.from_connection_string(
        "mongodb_connection_string",
        f"{db_name}.{collection_name}",
        OpenAIEmbeddings(),
        index_name="vector_index",
    )

    return vector_search

def perform_question_answering(query):
    vector_search = vector_search_from_connection_string("langchain_db", "test")

    qa_retriever = vector_search.as_retriever(
        search_type="similarity",
        search_kwargs={
            "k": 10,
            "score_threshold": 0.75,
        },
    )

    prompt_template = """If you encounter a question for which you don't know the answer based on the predefined points,
    please respond with 'I'm sorry, I can't provide that information\nI don't have any knowledge about {context}\n\nIs there something else you'd like to ask?' and refrain from making up an answer.
    However, if the answer is present in the predefined points, provide comprehensive information related to the user's query.
    Remember, your goal is to assist the user in the best way possible. If the question is unclear or ambiguous, you can ask for clarification.

    Maintain the same language as the follow up input message.

    Chat History:
    {chat_history}

    User's Question: {question} 
    AI Answer:"""
    
    PROMPT = PromptTemplate.from_template(prompt_template)
    
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        input_key="question",
        return_messages=True
    )

    qa = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(),
        verbose=True,
        memory=memory,
        retriever=qa_retriever,
        condense_question_prompt=PROMPT,
        response_if_no_docs_found="""I'm sorry, I can't provide that information.
        I don't have any knowledge about it.

        Is there something else you'd like to ask?"""
    )

    docs = qa({"question": query})

    return docs["answer"]

Thank you all very much for the helpful insights.
