Filtering the results of Vector Search with LangChain

Jacky_Casas · August 22, 2023, 8:39am

Hello,
I created a Vector Search index in my Atlas cluster, on the “embedding” field of an “embeddings” collection. It works well.

Now I want to filter the results to only retrieve entries for a specific “project”. I use LangChain, and the MongoDBAtlasVectorSearch as a retriever. In the documentation it says I can add the filter, as explained here.

My code:

from langchain.vectorstores import MongoDBAtlasVectorSearch

vectorstore = MongoDBAtlasVectorSearch(
    collection=db.embeddings,
    embedding=get_embedding("azureopenai"),
    index_name="embedding_index")
retriever = vectorstore.as_retriever(
    search_kwargs={
        'k': 5,
        'filter': { 'project': 'heroes' }
    }
)

I then use the retriever in a LangChain chain. I get results (5, as expected), but the filter does not work: the results come from all projects, not only the ‘heroes’ project.

Other info for context:

Here is the index (I also added an index on the ‘project’ field, but it does not change the results):

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      },
      "project": {
        "type": "string"
      }
    }
  }
}

And here is an example of a document stored in the ‘embeddings’ collection:

{
  "_id": {
    "$oid": "64e379206cfcf8a7866bce8c"
  },
  "text": "Spider-Man, créé par Stan Lee et Steve Ditko, est un super-héros de Marvel Comics. Peter\nParker, un étudiant doué mais timide, est mordu par une araignée radioactive ...",
  "embedding": [
    0.0013639614901196446,
    -0.02883271683320636,
    0.014490925689774099,
    -0.012036416665376559,
    ....
  ],
  "source": "uploads/heroes/spiderman-short.pdf",
  "file": "spiderman-short.pdf",
  "project": "heroes"
}

Any hints or solutions?
Thanks a lot

Henry_Weller · August 22, 2023, 7:03pm

Hi Jacky,

Thanks for the question! Our integration in LangChain treats filters slightly differently from other vector stores. You would actually need to specify this as a ‘pre_filter’, not a ‘filter’, in order for this to work. The syntax will also look slightly different, as you will need to specify the path (‘project’), operator (‘equals’) and value (‘heroes’) for the filter. The example below should make this clearer.

I’d also recommend increasing the value of k to a larger number, and adding an additional post_filter_pipeline search_kwarg that limits the results to k. This will boost the accuracy of your results considerably.

Your search kwargs with both of these changes should look like this:

k = 5
search_kwargs={
        'k': k * 10, # overrequest k during search
        'pre_filter': { 'path': 'project', 'equals': 'heroes' },
        'post_filter_pipeline': [{'$limit' : k}] # limit results to top k
}

Let me know if you run into any other issues!

Jacky_Casas · August 23, 2023, 12:59pm

Hey,
Thanks a lot for your answer!

I tested this intensively.
The pre-filter you suggested leads to the error: "knnBeta.filter.equals" must be a document. It seems the ‘equals’ operator cannot match a string value.

But I think it works with the ‘text’ operator, like this:

search_kwargs={
    'k': k * 10,
    'pre_filter': {
        'text': {
            'path': 'project',
            'query': 'heroes'
        }
    },
    'post_filter_pipeline': [ { '$limit': k } ]
}

Does it make sense?

Then, my second problem is that I need to filter on two fields (not only ‘project’, but also on ‘username’). How would you do that? I tested with the compound operator, but didn’t manage to make it work correctly.

Henry_Weller · August 23, 2023, 3:53pm

Nice catch on the pre_filter - yes you should use the ‘text’ filter in this situation, not ‘equals’.
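
For reference, here is a minimal sketch of the search kwargs settled on above, wrapped in a small helper. The function name `build_search_kwargs` and its `overrequest` parameter are hypothetical and only for illustration; the dictionary keys (`k`, `pre_filter`, `post_filter_pipeline`) are the ones already used in this thread.

```python
def build_search_kwargs(path, query, k=5, overrequest=10):
    """Hypothetical helper: build retriever search kwargs with a 'text' pre-filter."""
    return {
        # Ask the vector search for more candidates than needed...
        'k': k * overrequest,
        # ...restricted to documents whose `path` field matches `query`...
        'pre_filter': {'text': {'path': path, 'query': query}},
        # ...then trim the filtered results back down to the top k.
        'post_filter_pipeline': [{'$limit': k}],
    }


search_kwargs = build_search_kwargs('project', 'heroes', k=5)
# Usage, assuming `vectorstore` is the MongoDBAtlasVectorSearch from the first post:
# retriever = vectorstore.as_retriever(search_kwargs=search_kwargs)
```

Note that the ‘text’ operator does analyzed full-text matching rather than a strict equality test, so a value that tokenizes into several words may match more documents than expected.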

You should be able to use a compound filter here like you would with regular search. This post has an example of what this could look like. If you’re still running into issues would you mind sharing the syntax you are using?
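
As a rough sketch of what such a compound filter might look like for the two fields mentioned (‘project’ and ‘username’): the helper names `text_clause` and `build_compound_pre_filter` are hypothetical, the username value is a placeholder, and this has not been verified against a live index. It assumes the `compound` operator with `must` clauses from regular Atlas Search is accepted in ‘pre_filter’.

```python
def text_clause(path, query):
    """Hypothetical helper: one Atlas Search 'text' clause on a single field."""
    return {'text': {'path': path, 'query': query}}


def build_compound_pre_filter(**field_values):
    """Hypothetical helper: require every field to match (logical AND via 'must')."""
    return {
        'compound': {
            'must': [text_clause(path, query) for path, query in field_values.items()]
        }
    }


k = 5
search_kwargs = {
    'k': k * 10,  # overrequest k during search
    # 'jacky' is a placeholder username, not a value from the thread
    'pre_filter': build_compound_pre_filter(project='heroes', username='jacky'),
    'post_filter_pipeline': [{'$limit': k}],  # limit results to top k
}
```

Both fields would need to be covered by the search index (the index shown earlier uses dynamic mappings, which may already be enough for a string ‘username’ field).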

YASH_SHARMA6 · August 24, 2023, 11:47am

Hello, I am trying to create a vector store to store documents and their embeddings:

from pymongo import MongoClient
from langchain.vectorstores import MongoDBAtlasVectorSearch
MONGODB_ATLAS_CLUSTER_URI =""
# initialize MongoDB python client
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)

db_name = "langchain_db"
collection_name = "langchain_col"
collection = client[db_name][collection_name]
index_name = "langchain_demo"

# insert the documents in MongoDB Atlas with their embedding
docsearch = MongoDBAtlasVectorSearch.from_documents(
    docs, model_NEW, collection=collection, index_name=index_name
)

I am getting the error: SSL handshake failed

T_A · August 25, 2023, 10:34pm

Hey,
I think you may need to install certifi, then pass its CA bundle to your MongoDB client.
It should be something like this:

import certifi
client = MongoClient(mongodb_url, tlsCAFile=certifi.where())

Adrien_Le_Clair · November 23, 2023, 10:25am

Have you succeeded in filtering on multiple fields?

Prakul_Agarwal · December 3, 2023, 5:06am

Mahimai_Raja_J1 · January 2, 2024, 6:21am

Is there a way to filter the records between the selected dates?

Vishnu_Satis · January 12, 2024, 6:39am

Multiple field filter works using $and
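
One way this could look, assuming the post refers to the MQL-style filter syntax (`$and`, `$eq`) accepted by the `$vectorSearch` aggregation stage, rather than the Atlas Search operators (`text`, `compound`) used earlier in the thread. Whether ‘pre_filter’ is passed through in this form depends on the installed version of the LangChain integration. The helper name `and_equals` is hypothetical and the username value is a placeholder.

```python
def and_equals(**field_values):
    """Hypothetical helper: AND together one equality condition per field."""
    return {'$and': [{field: {'$eq': value}} for field, value in field_values.items()]}


k = 5
search_kwargs = {
    'k': k,
    # 'jacky' is a placeholder username, not a value from the thread
    'pre_filter': and_equals(project='heroes', username='jacky'),
}
```

With this style of filter, the filtered fields generally need to be declared in the vector search index definition as filter fields, which differs from the `knnVector` mapping shown in the first post.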

pranjal_gahankari · August 23, 2024, 11:38am

Hello,
Will this filtering work for MongoDB vCore as well?