Filtering the results of Vector Search with LangChain

Hello,
I created a Vector Search index in my Atlas cluster, on the “embedding” field of an “embeddings” collection. It works well.

Now I want to filter the results to only retrieve entries for a specific “project”. I use LangChain, and the MongoDBAtlasVectorSearch as a retriever. In the documentation it says I can add the filter, as explained here.

My code:

from langchain.vectorstores import MongoDBAtlasVectorSearch

vectorstore = MongoDBAtlasVectorSearch(
    collection=db.embeddings,
    embedding=get_embedding("azureopenai"),
    index_name="embedding_index")
retriever = vectorstore.as_retriever(
    search_kwargs={
        'k': 5,
        'filter': { 'project': 'heroes' }
    }
)

I then use the retriever in a LangChain chain. I get results (5, as expected), but the filter does not work: the results come from all projects, not only the ‘heroes’ project.

Other info for context:

Here is the index (I also added an index on the ‘project’ field, but it does not change the results):

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      },
      "project": {
        "type": "string"
      }
    }
  }
}

And here is an example of a document stored in the ‘embeddings’ collection:

{
  "_id": {
    "$oid": "64e379206cfcf8a7866bce8c"
  },
  "text": "Spider-Man, créé par Stan Lee et Steve Ditko, est un super-héros de Marvel Comics. Peter\nParker, un étudiant doué mais timide, est mordu par une araignée radioactive ...",
  "embedding": [
    0.0013639614901196446,
    -0.02883271683320636,
    0.014490925689774099,
    -0.012036416665376559,
    ....
  ],
  "source": "uploads/heroes/spiderman-short.pdf",
  "file": "spiderman-short.pdf",
  "project": "heroes"
}

Any hints or solutions?
Thanks a lot

Hi Jacky,

Thanks for the question! Our integration in LangChain treats filters slightly differently from other vector stores. You need to specify this as a ‘pre_filter’, not a ‘filter’, for it to work. The syntax also looks slightly different, as you need to specify the path (‘project’), operator (‘equals’) and value (‘heroes’) for the filter. The example below should make this clearer.

I’d also recommend increasing the value of k to a larger number, and adding an additional post_filter_pipeline search_kwarg that limits the results to k. This will boost the accuracy of your results considerably.

Your search kwargs with both of these changes should look like this:

k = 5
search_kwargs = {
    'k': k * 10,  # over-request results during search
    'pre_filter': {'path': 'project', 'equals': 'heroes'},
    'post_filter_pipeline': [{'$limit': k}],  # limit results to top k
}

Let me know if you run into any other issues!

Hey,
Thanks a lot for your answer!

I tested this intensively.
The pre-filter you suggested leads to an error: "knnBeta.filter.equals" must be a document. It seems the ‘equals’ operator cannot match a string value.

But I think it works like this, with the ‘text’ operator:

search_kwargs={
    'k': k * 10,
    'pre_filter': {
        'text': {
            'path': 'project',
            'query': 'heroes'
        }
    },
    'post_filter_pipeline': [ { '$limit': k } ]
}

Does it make sense?

Then, my second problem is that I need to filter on two fields (not only ‘project’, but also on ‘username’). How would you do that? I tested with the compound operator, but didn’t manage to make it work correctly.

Nice catch on the pre_filter. Yes, you should use the ‘text’ operator in this situation, not ‘equals’.

You should be able to use a compound filter here like you would with regular search. This post has an example of what this could look like. If you’re still running into issues would you mind sharing the syntax you are using?
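
A minimal sketch of what such a compound pre-filter could look like for the two fields mentioned above. It assumes the Atlas Search `compound` operator is accepted inside `pre_filter` the same way the single `text` operator is, and that both fields are covered by the search index. The helper name `build_search_kwargs` and the username value `jacky` are hypothetical.

```python
def build_search_kwargs(project, username, k=5, overrequest=10):
    """Build search_kwargs that pre-filter on both project and username."""
    return {
        # Over-request during search, then trim to k afterwards.
        "k": k * overrequest,
        "pre_filter": {
            "compound": {
                # Every clause under "filter" must match.
                "filter": [
                    {"text": {"path": "project", "query": project}},
                    {"text": {"path": "username", "query": username}},
                ]
            }
        },
        # Keep only the top k documents after filtering.
        "post_filter_pipeline": [{"$limit": k}],
    }


search_kwargs = build_search_kwargs("heroes", "jacky")
```

The resulting dictionary would be passed as `vectorstore.as_retriever(search_kwargs=search_kwargs)`. Keep in mind that `text` matches analyzed tokens rather than exact strings, so values that share tokens may also match.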

Hello, I am trying to create a vector store to store documents and their embeddings:

from pymongo import MongoClient
from langchain.vectorstores import MongoDBAtlasVectorSearch
MONGODB_ATLAS_CLUSTER_URI = ""
# initialize MongoDB python client
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)

db_name = "langchain_db"
collection_name = "langchain_col"
collection = client[db_name][collection_name]
index_name = "langchain_demo"

# insert the documents in MongoDB Atlas with their embedding
docsearch = MongoDBAtlasVectorSearch.from_documents(
    docs, model_NEW, collection=collection, index_name=index_name
)

I am getting the error "SSL handshake failed".

Hey,
I think you may need to install certifi, then pass its CA bundle to your MongoDB client.
It should look something like this:

import certifi

# Point the client at certifi's CA bundle for TLS verification
client = MongoClient(mongodb_url, tlsCAFile=certifi.where())

Have you succeeded in filtering on multiple fields?

Is there a way to filter the records between the selected dates?

Filtering on multiple fields works using $and.
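
As a rough sketch of the `$and` approach, and of one way a date range might be expressed alongside it: this assumes the newer `vectorSearch` index type, with each filtered field declared as a `filter` field in the index, and a LangChain version that forwards `pre_filter` to the `$vectorSearch` stage in MQL syntax. The field name `created_at`, the username value, and the helper name are hypothetical.

```python
from datetime import datetime, timezone


def build_mql_pre_filter(project, username, start, end):
    """Combine equality checks and a date range under a single $and."""
    return {
        "$and": [
            {"project": {"$eq": project}},
            {"username": {"$eq": username}},
            # Hypothetical date field; both bounds are inclusive.
            {"created_at": {"$gte": start}},
            {"created_at": {"$lte": end}},
        ]
    }


pre_filter = build_mql_pre_filter(
    "heroes",
    "jacky",
    datetime(2023, 1, 1, tzinfo=timezone.utc),
    datetime(2023, 12, 31, tzinfo=timezone.utc),
)
search_kwargs = {"k": 5, "pre_filter": pre_filter}
```

With the older `knnBeta`-style index shown at the top of this thread, this MQL syntax would not apply; the `text` and `compound` operators discussed earlier are the relevant ones there.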

Hello,
Will this filtering work for MongoDB vCore as well?