Building a Semantic Search Service With Spring AI and MongoDB Atlas

Tim Kelly • 9 min read • Published Sep 03, 2024 • Updated Oct 24, 2024

What is the song that goes, "Duh da, duh da, DUH da duh"? We've all been plagued by this before. We remember a snippet of the chorus, we know it has something to do with a hotel in Chelsea, but what is that song? I can't remember the title, so how do you search by vibe?! Well, with the power of AI, we can search our databases not just by matching words, but by the semantic meaning of the text. And with Spring AI, you can incorporate this AI-powered search into your Spring application. With just the vague memory of a famous woman who prefers handsome men, we can locate our Leonard Cohen classic.

Spring AI is an application framework from Spring that allows you to combine various AI services and plugins with your applications. With support for many chat, text-to-image, and embedding models, you can get your AI-powered Java application set up for a variety of AI use cases.

Spring AI supports MongoDB Atlas as a vector database, with Atlas Vector Search powering your semantic search and your retrieval-augmented generation (RAG) applications. To learn more about RAG and other key concepts in AI, check out the MongoDB AI integration docs.

In this tutorial, we’ll go through what you need to get started with Spring AI and MongoDB, adding documents to your database with the vectorised content (embeddings), and searching this content with semantic search. The full code for this tutorial is available in the GitHub repository.

Prerequisites

Before starting this tutorial, you'll need to have the following:

A MongoDB Atlas account and an M10+ cluster running MongoDB version 6.0.11, 7.0.2, or later
- An M10+ cluster is necessary to create the index programmatically (by Spring AI).
An OpenAI API key with a paid OpenAI account and available credits
Java 21 and an IDE such as IntelliJ IDEA or Eclipse
Maven 3.9.6+ configured for your project

Spring Initializr

Navigate to the Spring Initializr and configure your project with the following settings:

Project: Maven
Language: Java
Spring Boot: Default version
Java: 21

Add the following dependencies:

MongoDB Atlas Vector Database
Spring Web
OpenAI (other embedding models are available, but we use this for the tutorial)

Generate and download your project, then open it in your IDE.

Setting up your project

Open the application in the IDE of your choosing. The first thing we will do is inspect our pom.xml. In order to use the latest version of Spring AI, change the spring-ai.version property for the Spring AI BOM to 1.0.0-SNAPSHOT. At the time of writing, it is 1.0.0-M1 by default.

Application configuration

Configure your Spring application to set up the vector store and other necessary beans.

In our application properties, we are going to configure our MongoDB database, as well as everything we need for semantically searching our data. We'll also add in information such as our OpenAI embedding model and API key.

spring.application.name=lyric-semantic-search
spring.ai.openai.api-key=<Your-API-key>
spring.ai.openai.embedding.options.model=text-embedding-ada-002

spring.data.mongodb.uri=<Your-MongoDB-connection-string>
spring.data.mongodb.database=lyrics
spring.ai.vectorstore.mongodb.indexName=vector_index
spring.ai.vectorstore.mongodb.collection-name=vector_store
spring.ai.vectorstore.mongodb.initialize-schema=true

You'll see that the last property sets initialize-schema to true. This means our application will set up our search index (if it doesn't exist) so we can semantically search our data. If you already have a search index set up with this name and configuration, you can set this to false.

In your IDE, open up your project. Create a Config.java file in a config package. Here, we are going to set up our OpenAI embedding model. Spring AI makes this a very simple process.

package com.mongodb.lyric_semantic_search.config;

import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.openai.OpenAiEmbeddingModel;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.SpringBootConfiguration;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@SpringBootConfiguration
@EnableAutoConfiguration
@Configuration
public class Config {

    @Value("${spring.ai.openai.api-key}")
    private String openAiKey;

    @Bean
    public EmbeddingModel embeddingModel() {
        return new OpenAiEmbeddingModel(new OpenAiApi(openAiKey));
    }
}

Now, we are able to send away our data to be vectorized, and receive the vectorized results.

Model classes

Create a package called model, for our DocumentRequest class to go in. This is what we are going to be storing in our MongoDB database. The content will be what we are embedding — so lyrics, in our case. The metadata will be anything we want to store alongside it, so artists, albums, or genres. This metadata will be returned alongside our content and can also be used to filter our results.

package com.mongodb.lyric_semantic_search.model;

import java.util.Map;

public class DocumentRequest {
    private String content;
    private Map<String, Object> metadata;

    public DocumentRequest() {
    }

    public DocumentRequest(String content, Map<String, Object> metadata) {
        this.content = content;
        this.metadata = metadata;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Map<String, Object> getMetadata() {
        return metadata;
    }

    public void setMetadata(Map<String, Object> metadata) {
        this.metadata = metadata;
    }

}

Repository interface

Create a repository package and add a LyricSearchRepository interface. Here, we'll define some of the methods we'll implement later.

package com.mongodb.lyric_semantic_search.repository;

import java.util.List;
import java.util.Optional;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;

public interface LyricSearchRepository {

    void addDocuments(List<Document> docs);

    Optional<Boolean> deleteDocuments(List<String> ids);

    List<Document> semanticSearchByLyrics(SearchRequest searchRequest);
}

Repository implementation

Create a LyricSearchRepositoryImpl class to implement the repository interface.

package com.mongodb.lyric_semantic_search.repository;

import java.util.List;
import java.util.Optional;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Repository;

@Repository
public class LyricSearchRepositoryImpl implements LyricSearchRepository {

    private final VectorStore vectorStore;

    @Autowired
    public LyricSearchRepositoryImpl(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @Override
    public void addDocuments(List<Document> docs) {
        vectorStore.add(docs);
    }

    @Override
    public Optional<Boolean> deleteDocuments(List<String> ids) {
        return vectorStore.delete(ids);
    }

    @Override
    public List<Document> semanticSearchByLyrics(SearchRequest searchRequest) {
        return vectorStore.similaritySearch(searchRequest);
    }
}

We are using the methods add, delete, and similaritySearch, all already defined and implemented in Spring AI. These will allow us to embed our data when adding them to our MongoDB database, and we can search these embeddings with vector search.

Service

Create a service package and inside, a LyricSearchService class to handle business logic for our lyrical search application. We will implement these methods later in the tutorial:

package com.mongodb.lyric_semantic_search.service;

import org.springframework.stereotype.Service;

@Service
public class LyricSearchService {

}

Controller

Create a controller package and a LyricSearchController class to handle HTTP requests. We are going to add a call to add our data, a call to delete any documents we no longer need, and a search call, to semantically search our data.

These will call back to the methods we defined earlier. We’ll implement them in the next steps:

package com.mongodb.lyric_semantic_search.controller;

import org.springframework.web.bind.annotation.RestController;

@RestController
public class LyricSearchController {

}

Adding documents

In our LyricSearchService class, let's add some logic to take in our documents and add them to our MongoDB database.

    private static final int MAX_TOKENS = (int) (8192 * 0.80); // OpenAI model's maximum content length + BUFFER for when one word > 1 token

    @Autowired
    LyricSearchRepository lyricSearchRepository;

    public List<Document> addDocuments(List<DocumentRequest> documents) {
        if (documents == null || documents.isEmpty()) {
            return Collections.emptyList();
        }

        List<Document> docs = documents.stream()
            .filter(doc -> doc != null && doc.getContent() != null && !doc.getContent()
                .trim()
                .isEmpty())
            .map(doc -> new Document(doc.getContent(), doc.getMetadata()))
            .filter(doc -> {
                int wordCount = doc.getContent()
                    .split("\\s+").length;
                return wordCount <= MAX_TOKENS;
            })
            .collect(Collectors.toList());

        if (!docs.isEmpty()) {
            lyricSearchRepository.addDocuments(docs);
        }

        return docs;
    }

This function takes a single parameter, documents, which is a list of DocumentRequest objects. These represent the documents that need to be processed and added to the repository.

The function first checks if the documents list is null or empty.

The documents list is converted into a stream to facilitate functional-style operations.

The filter is a bit of pre-processing to help clean up our data. It removes any DocumentRequest objects that are null, have null content, or have empty (or whitespace-only) content. This ensures that only valid documents are processed further.

Know your limits! The filter removes any Document objects whose content exceeds the maximum token limit (MAX_TOKENS) for the OpenAI API. The token limit is estimated based on word count, assuming one word is slightly more than one token (not far off the truth). This estimation works for the demo, but in production, we would likely want to implement a form of chunking, where large bodies of text are separated into smaller, more digestible pieces.

Each DocumentRequest object is transformed into a Document object. The Document constructor is called with the content and metadata from the DocumentRequest.

The filtered and transformed Document objects are collected into a list and these documents are added to our MongoDB vector store, along with an embedding of the lyrics.
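The chunking mentioned above is not part of this tutorial's code, but a minimal sketch of the idea might look like the following. It uses only the JDK and reuses the same word-count approximation as the filter; the class name LyricChunker and the method chunkByWords are hypothetical names chosen for illustration, and a production system would more likely split on a real tokenizer's token counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the tutorial repository.
public class LyricChunker {

    // Splits text into consecutive chunks of at most maxWords whitespace-separated words.
    public static List<String> chunkByWords(String text, int maxWords) {
        if (maxWords <= 0) {
            throw new IllegalArgumentException("maxWords must be positive");
        }
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isBlank()) {
            return chunks; // nothing to chunk
        }
        String[] words = text.trim().split("\\s+");
        for (int start = 0; start < words.length; start += maxWords) {
            int end = Math.min(start + maxWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String rhyme = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall.";
        // Print each chunk on its own line, using a deliberately tiny limit.
        chunkByWords(rhyme, 5).forEach(System.out::println);
    }
}
```

Each chunk could then be wrapped in its own DocumentRequest with the same metadata, so that every piece stays under the limit instead of the whole document being dropped.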

We'll also add our function to delete documents while we're here:

public List<String> deleteDocuments(List<String> ids) {
    if (ids == null || ids.isEmpty()) {
        return Collections.emptyList(); // Nothing to delete
    }

    Optional<Boolean> result = lyricSearchRepository.deleteDocuments(ids);
    if (result.isPresent() && result.get()) {
        return ids; // Return the list of successfully deleted IDs
    } else {
        return Collections.emptyList(); // Return empty list if deletion was unsuccessful
    }
}

And the appropriate imports:

import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

import org.springframework.ai.document.Document;
import org.springframework.beans.factory.annotation.Autowired;

import com.mongodb.lyric_semantic_search.model.DocumentRequest;
import com.mongodb.lyric_semantic_search.repository.LyricSearchRepository;

Now that we have the logic, let’s add the endpoints to our LyricSearchController.

    @Autowired
    private LyricSearchService lyricSearchService;

    @PostMapping("/addDocuments")
    public List<Map<String, Object>> addDocuments(@RequestBody List<DocumentRequest> documents) {
        return lyricSearchService.addDocuments(documents).stream()
            .map(doc -> Map.of("content", doc.getContent(), "metadata", doc.getMetadata()))
            .collect(Collectors.toList());
    }

    @DeleteMapping("/delete")
    public List<String> deleteDocuments(@RequestBody List<String> ids) {
        return lyricSearchService.deleteDocuments(ids);
    }

And our imports:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.DeleteMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;

import com.mongodb.lyric_semantic_search.model.DocumentRequest;
import com.mongodb.lyric_semantic_search.service.LyricSearchService;

To test our embedding, let's keep it simple with a few nursery rhymes for now.

Build and run your application. Use the following CURL command to add sample documents:

curl -X POST "http://localhost:8080/addDocuments" \
     -H "Content-Type: application/json" \
     -d '[
           {"content": "Twinkle, twinkle, little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky.", "metadata": {"title": "Twinkle Twinkle Little Star", "artist": "Jane Taylor", "year": "1806"}},
           {"content": "The itsy bitsy spider climbed up the waterspout. Down came the rain and washed the spider out. Out came the sun and dried up all the rain and the itsy bitsy spider climbed up the spout again.", "metadata": {"title": "Itsy Bitsy Spider", "artist": "Traditional", "year": "1910"}},
           {"content": "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall. All the kings horses and all the kings men couldnt put Humpty together again.", "metadata": {"title": "Humpty Dumpty", "artist": "Mother Goose", "year": "1797"}}
         ]'

Searching semantically

Let's define our searching method in our LyricSearchService. This is how we will semantically search our documents in our database.

    public List<Map<String, Object>> searchDocuments(String query, int topK, double similarityThreshold) {
        SearchRequest searchRequest = SearchRequest.query(query)
            .withTopK(topK)
            .withSimilarityThreshold(similarityThreshold);

        List<Document> results = lyricSearchRepository.semanticSearchByLyrics(searchRequest);

        return results.stream()
            .map(doc -> Map.of("content", doc.getContent(), "metadata", doc.getMetadata()))
            .collect(Collectors.toList());
    }

This method takes in:
- query: A String representing the search query or the text for which you want to find semantically similar lyrics
- topK: An int specifying the number of top results to retrieve (e.g., the top 10)
- similarityThreshold: A double indicating the minimum similarity score a result must have to be included in the results

This returns a list of Map<String, Object> objects. Each map contains the content and metadata of a document that matches the search criteria.
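To make the roles of topK and similarityThreshold more concrete, here is a rough, brute-force sketch of the idea in plain Java. It is only an illustration: the class SimilaritySketch, its methods, and the toy two-dimensional vectors are all made up for this example, whereas the real search runs inside Atlas Vector Search over the stored embeddings, and the score the vector store reports may be scaled differently from the raw cosine value used here.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration, not how Atlas Vector Search is implemented.
public class SimilaritySketch {

    // Cosine similarity between two vectors of equal length.
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // treat a zero vector as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scores every stored vector, drops those below the threshold,
    // sorts by score (highest first), and keeps at most topK ids.
    public static List<String> search(double[] query, Map<String, double[]> store,
                                      int topK, double similarityThreshold) {
        return store.entrySet().stream()
            .map(e -> Map.entry(e.getKey(), cosine(query, e.getValue())))
            .filter(e -> e.getValue() >= similarityThreshold)
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(topK)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy 2-dimensional "embeddings"; real ones have many more dimensions.
        Map<String, double[]> store = Map.of(
            "star", new double[] {1.0, 0.0},
            "sun", new double[] {0.8, 0.6},
            "spider", new double[] {0.0, 1.0});
        double[] query = {1.0, 0.1};
        System.out.println(search(query, store, 5, 0.8));
    }
}
```

Raising similarityThreshold trims weaker matches, while lowering topK caps how many of the surviving matches are returned.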

And the imports to our service:

import java.util.Map;
import org.springframework.ai.vectorstore.SearchRequest;

Let's add an endpoint to our controller, and build and run our application.

    @GetMapping("/search")
    public List<Map<String, Object>> searchDocuments(@RequestParam String query, @RequestParam int topK, @RequestParam double similarityThreshold) {
        return lyricSearchService.searchDocuments(query, topK, similarityThreshold);
    }

And the imports:

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;

Use the following CURL command to search your database for lyrics about small celestial bodies:

curl -X GET "http://localhost:8080/search?query=small%20celestial%20bodies&topK=5&similarityThreshold=0.8"

And voila! We have our twinkly little star at the top of our list.

[{
    "metadata":{
        "title":"Twinkle Twinkle Little Star",
        "artist":"Jane Taylor",
        "year":"1806"
    },
    "content":"Twinkle, twinkle, little star,..."
},
...
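If you would rather call the search endpoint from Java than from the command line, a sketch along these lines could build the same request with the JDK's built-in HTTP classes. The class SearchUriSketch and its method names are hypothetical, and the base URL assumes the application is running locally on port 8080 as in the command above. The main point is that the query text must be URL-encoded before it is placed in the query string.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side helper, not part of the tutorial repository.
public class SearchUriSketch {

    // Builds the /search URI, percent-encoding the free-text query.
    public static URI searchUri(String baseUrl, String query, int topK, double similarityThreshold) {
        // URLEncoder encodes spaces as '+'; swap to %20 to mirror the curl example.
        String encodedQuery = URLEncoder.encode(query, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(baseUrl + "/search?query=" + encodedQuery
            + "&topK=" + topK
            + "&similarityThreshold=" + similarityThreshold);
    }

    // Wraps the URI in a GET request that an HttpClient could send.
    public static HttpRequest searchRequest(String baseUrl, String query, int topK, double similarityThreshold) {
        return HttpRequest.newBuilder(searchUri(baseUrl, query, topK, similarityThreshold))
            .GET()
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = searchRequest("http://localhost:8080", "small celestial bodies", 5, 0.8);
        // Only prints the request line; sending it requires the application to be running.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

With the application running, the request could be sent using HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).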

Filter by metadata

In order to filter our data, we need to head over to our index in MongoDB. You can do this through the Atlas UI by selecting the collection where your data is stored and going to the search indexes. You can edit this index by selecting the three dots on the right of the index name and we will add our filter for the artist.

{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "metadata.artist",
      "type": "filter"
    }
  ]
}

Let's head back to our LyricSearchService and add a method with an artist parameter so we can filter our results.

    public List<Map<String, Object>> searchDocumentsWithFilter(String query, int topK, double similarityThreshold, String artist) {
        FilterExpressionBuilder filterBuilder = new FilterExpressionBuilder();
        Expression filterExpression = filterBuilder.eq("artist", artist)
            .build();

        SearchRequest searchRequest = SearchRequest.query(query)
            .withTopK(topK)
            .withSimilarityThreshold(similarityThreshold)
            .withFilterExpression(filterExpression);

        List<Document> results = lyricSearchRepository.semanticSearchByLyrics(searchRequest);

        return results.stream()
            .map(doc -> Map.of("content", doc.getContent(), "metadata", doc.getMetadata()))
            .collect(Collectors.toList());
    }

And the imports we'll need:

import org.springframework.ai.vectorstore.filter.Filter.Expression;
import org.springframework.ai.vectorstore.filter.FilterExpressionBuilder;

And lastly, an endpoint in our controller:

    @GetMapping("/searchWithFilter")
    public List<Map<String, Object>> searchDocumentsWithFilter(@RequestParam String query, @RequestParam int topK, @RequestParam double similarityThreshold, @RequestParam String artist) {
        return lyricSearchService.searchDocumentsWithFilter(query, topK, similarityThreshold, artist);
    }

Now we can not only search as before, but also restrict the results to a specific artist.
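Conceptually, the metadata filter narrows the set of candidate documents to those whose artist matches exactly, and the semantic ranking is applied to that narrowed set. A plain-Java sketch of that narrowing step, using the same content/metadata map shape our endpoints return, might look like this. The class MetadataFilterSketch and the method filterByArtist are hypothetical, and in the real application this work is done by the vector index, not in application code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration of an equality filter on metadata.
public class MetadataFilterSketch {

    // Keeps only documents whose metadata "artist" value equals the given artist.
    public static List<Map<String, Object>> filterByArtist(List<Map<String, Object>> docs, String artist) {
        return docs.stream()
            .filter(doc -> {
                Object metadata = doc.get("metadata");
                if (!(metadata instanceof Map)) {
                    return false; // no metadata means no match
                }
                return artist.equals(((Map<?, ?>) metadata).get("artist"));
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Object> twinkle = Map.of(
            "content", "Twinkle, twinkle, little star,...",
            "metadata", Map.of("artist", "Jane Taylor"));
        Map<String, Object> humpty = Map.of(
            "content", "Humpty Dumpty sat on a wall,...",
            "metadata", Map.of("artist", "Mother Goose"));
        System.out.println(filterByArtist(List.of(twinkle, humpty), "Jane Taylor"));
    }
}
```

Note that the match is exact: "Jane Taylor" and "jane taylor" would be treated as different artists in this sketch.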

Use the following CURL command to try a semantic search with metadata filtering:

curl -X GET "http://localhost:8080/searchWithFilter?query=little%20star&topK=5&similarityThreshold=0.8&artist=Jane%20Taylor"

Unlike before, even though we ask for the top five results, only one document is returned, because we have only one document by the artist Jane Taylor. Hooray!

[{
    "metadata":{
        "title":"Twinkle Twinkle Little Star",
        "artist":"Jane Taylor",
        "year":"1806"
    },
    "content":"Twinkle, twinkle, little star,..."
}]

Conclusion

You now have a Spring application that allows you to search through your data by performing semantic searches. This is an important step when you are looking to implement your RAG applications, or just an AI-enhanced search feature in your applications.

If you want to learn more about the MongoDB Spring AI integration, follow along with the quick-start Get Started With the Spring AI Integration, and if you have any questions or want to show us what you are building, join us in the MongoDB Community Forums.
