Building a Semantic Search Service With Spring AI and MongoDB Atlas
What is the song that goes, "Duh da, duh da, DUH da duh"? We've all been plagued by this before. We remember a snippet of the chorus, we know it has something to do with a hotel in Chelsea, but what is that song? I can't remember the title — how do you search by vibe?! Well, with the power of AI, we are able to search our databases, not just by matching words, but searching the semantic meaning of the text. And with Spring AI, you can incorporate the AI-powered search into your Spring application. With just the vague memory of a famous woman who prefers handsome men, we can locate our Leonard Cohen classic.
Spring AI is an application framework from Spring that allows you to combine various AI services and plugins with your applications. With support for many chat, text-to-image, and embedding models, you can get your AI-powered Java application set up for a variety of AI use cases.
Spring AI supports MongoDB Atlas as a vector database, using Atlas Vector Search to power your semantic search and your RAG applications. To learn more about RAG and other key concepts in AI, check out the MongoDB AI integration docs.
In this tutorial, we’ll go through what you need to get started with Spring AI and MongoDB, adding documents to your database with the vectorised content (embeddings), and searching this content with semantic search. The full code for this tutorial is available in the GitHub repository.
Before starting this tutorial, you'll need to have the following:
- A MongoDB Atlas account and an M10+ cluster running MongoDB version 6.0.11, 7.0.2, or later
- An M10+ cluster is necessary to create the index programmatically (by Spring AI).
- An OpenAI API key with a paid OpenAI account and available credits
- Java 21 and an IDE such as IntelliJ IDEA or Eclipse
- Maven 3.9.6+ configured for your project
Use Spring Initializr to create a new project with the following configuration:
- Project: Maven
- Language: Java
- Spring Boot: Default version
- Java: 21
Add the following dependencies:
- MongoDB Atlas Vector Database
- Spring Web
- OpenAI (other embedding models are available, but we use this for the tutorial)
Generate and download your project, then open it in your IDE.
Open the application in the IDE of your choosing. The first thing we will do is inspect our pom.xml. To use the latest version of Spring AI, change the spring-ai.version property for the Spring AI BOM to 1.0.0-SNAPSHOT. As of writing this article, it will be 1.0.0-M1 by default.
Configure your Spring application to set up the vector store and other necessary beans.
In our application properties, we are going to configure our MongoDB database, as well as everything we need for semantically searching our data. We'll also add in information such as our OpenAI embedding model and API key.
spring.application.name=lyric-semantic-search
spring.ai.openai.api-key=<Your-API-key>
spring.ai.openai.embedding.options.model=text-embedding-ada-002

spring.data.mongodb.uri=<Your-MongoDB-connection-string>
spring.data.mongodb.database=lyrics
spring.ai.vectorstore.mongodb.indexName=vector_index
spring.ai.vectorstore.mongodb.collection-name=vector_store
spring.ai.vectorstore.mongodb.initialize-schema=true
You'll see that, at the end, we set the initialize-schema property to true. This means our application will set up our search index (if it doesn't exist) so we can semantically search our data. If you already have a search index set up with this name and configuration, you can set this to false.
In your IDE, open up your project. Create a Config.java file in a config package. Here, we are going to set up our OpenAI embedding model. Spring AI makes this a very simple process.

package com.mongodb.lyric_semantic_search.config;

import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.openai.OpenAiEmbeddingModel;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.SpringBootConfiguration;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@SpringBootConfiguration
@EnableAutoConfiguration
public class Config {

    // Injected from spring.ai.openai.api-key in the application properties
    @Value("${spring.ai.openai.api-key}")
    private String openAiKey;

    @Bean
    public EmbeddingModel embeddingModel() {
        return new OpenAiEmbeddingModel(new OpenAiApi(openAiKey));
    }
}
Now, we are able to send away our data to be vectorized, and receive the vectorized results.
Create a package called model for our DocumentRequest class to go in. This is what we are going to be storing in our MongoDB database. The content will be what we are embedding, so lyrics in our case. The metadata will be anything we want to store alongside it, such as artists, albums, or genres. This metadata will be returned alongside our content and can also be used to filter our results.

package com.mongodb.lyric_semantic_search.model;

import java.util.Map;

public class DocumentRequest {
    private String content;
    private Map<String, Object> metadata;

    public DocumentRequest() {
    }

    public DocumentRequest(String content, Map<String, Object> metadata) {
        this.content = content;
        this.metadata = metadata;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Map<String, Object> getMetadata() {
        return metadata;
    }

    public void setMetadata(Map<String, Object> metadata) {
        this.metadata = metadata;
    }

}
Create a repository package and add a LyricSearchRepository interface. Here, we'll define some of the methods we'll implement later.

package com.mongodb.lyric_semantic_search.repository;

import java.util.List;
import java.util.Optional;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;

public interface LyricSearchRepository {

    void addDocuments(List<Document> docs);

    Optional<Boolean> deleteDocuments(List<String> ids);

    List<Document> semanticSearchByLyrics(SearchRequest searchRequest);
}
Create a LyricSearchRepositoryImpl class to implement the repository interface.

package com.mongodb.lyric_semantic_search.repository;

import java.util.List;
import java.util.Optional;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Repository;

@Repository
public class LyricSearchRepositoryImpl implements LyricSearchRepository {

    private final VectorStore vectorStore;

    @Autowired
    public LyricSearchRepositoryImpl(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @Override
    public void addDocuments(List<Document> docs) {
        vectorStore.add(docs);
    }

    @Override
    public Optional<Boolean> deleteDocuments(List<String> ids) {
        return vectorStore.delete(ids);
    }

    @Override
    public List<Document> semanticSearchByLyrics(SearchRequest searchRequest) {
        return vectorStore.similaritySearch(searchRequest);
    }
}
We are using the methods add, delete, and similaritySearch, all already defined and implemented in Spring AI. These will allow us to embed our data when adding it to our MongoDB database, and we can search these embeddings with vector search.
Create a service package and, inside it, a LyricSearchService class to handle business logic for our lyrical search application. We will implement these methods later in the tutorial:

package com.mongodb.lyric_semantic_search.service;

import org.springframework.stereotype.Service;

@Service
public class LyricSearchService {

}
Create a controller package and a LyricSearchController class to handle HTTP requests. We are going to add a call to add our data, a call to delete any documents we no longer need, and a search call to semantically search our data. These will call back to the methods we defined earlier. We'll implement them in the next steps:

package com.mongodb.lyric_semantic_search.controller;

import org.springframework.web.bind.annotation.RestController;

@RestController
public class LyricSearchController {

}
In our LyricSearchService class, let's add some logic to take in our documents and add them to our MongoDB database.

// OpenAI model's maximum content length + BUFFER for when one word > 1 token
private static final int MAX_TOKENS = (int) (8192 * 0.80);

@Autowired
LyricSearchRepository lyricSearchRepository;

public List<Document> addDocuments(List<DocumentRequest> documents) {
    if (documents == null || documents.isEmpty()) {
        return Collections.emptyList();
    }

    List<Document> docs = documents.stream()
        // Drop null requests and requests with missing or blank content
        .filter(doc -> doc != null && doc.getContent() != null && !doc.getContent()
            .trim()
            .isEmpty())
        .map(doc -> new Document(doc.getContent(), doc.getMetadata()))
        // Drop documents whose word count exceeds the estimated token budget
        .filter(doc -> {
            int wordCount = doc.getContent()
                .split("\\s+").length;
            return wordCount <= MAX_TOKENS;
        })
        .collect(Collectors.toList());

    if (!docs.isEmpty()) {
        lyricSearchRepository.addDocuments(docs);
    }

    return docs;
}
This function takes a single parameter, documents, which is a list of DocumentRequest objects. These represent the documents that need to be processed and added to the repository.
- The function first checks if the documents list is null or empty.
- The documents list is converted into a stream to facilitate functional-style operations.
- The first filter is a bit of pre-processing to help clean up our data. It removes any DocumentRequest objects that are null, have null content, or have empty (or whitespace-only) content. This ensures that only valid documents are processed further.
- Each DocumentRequest object is transformed into a Document object. The Document constructor is called with the content and metadata from the DocumentRequest.
- Know your limits! The second filter removes any Document objects whose content exceeds the maximum token limit (MAX_TOKENS) for the OpenAI API. The token limit is estimated based on word count, assuming one word is slightly more than one token (not far off the truth). This estimation works for the demo, but in production, we would likely want to implement a form of chunking, where large bodies of text are separated into smaller, more digestible pieces.
- The filtered and transformed Document objects are collected into a list and these documents are added to our MongoDB vector store, along with an embedding of the lyrics.
We'll also add our function to delete documents while we're here:
public List<String> deleteDocuments(List<String> ids) {
    if (ids == null || ids.isEmpty()) {
        return Collections.emptyList(); // Nothing to delete
    }

    Optional<Boolean> result = lyricSearchRepository.deleteDocuments(ids);
    if (result.isPresent() && result.get()) {
        return ids; // Return the list of successfully deleted IDs
    } else {
        return Collections.emptyList(); // Return empty list if deletion was unsuccessful
    }
}
And the appropriate imports:
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

import org.springframework.ai.document.Document;
import org.springframework.beans.factory.annotation.Autowired;

import com.mongodb.lyric_semantic_search.model.DocumentRequest;
import com.mongodb.lyric_semantic_search.repository.LyricSearchRepository;
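The walkthrough of addDocuments mentions chunking as the likely production alternative to discarding over-long lyrics. As a rough sketch of that idea, assuming a simple fixed-size split on whitespace, it might look like the following. The class LyricChunker and the method chunkByWords are hypothetical names for illustration, not part of Spring AI or of this tutorial's repository, and a real splitter would probably count tokens rather than words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spring AI): splits text into word-based chunks.
final class LyricChunker {

    // Returns chunks of at most maxWords whitespace-separated words, in original order.
    static List<String> chunkByWords(String text, int maxWords) {
        if (maxWords <= 0) {
            throw new IllegalArgumentException("maxWords must be positive");
        }
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isBlank()) {
            return chunks; // Nothing to split
        }
        String[] words = text.trim().split("\\s+");
        for (int start = 0; start < words.length; start += maxWords) {
            int end = Math.min(start + maxWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Splits a ten-word lyric into chunks of at most four words each
        chunkByWords("Twinkle, twinkle, little star, How I wonder what you are!", 4)
            .forEach(System.out::println);
    }
}
```

Each chunk could then be wrapped in its own Document with the same metadata as the original request, so that every piece stays under the limit and remains traceable to its song.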
Now that we have the logic, let's add the endpoints to our LyricSearchController.

@Autowired
private LyricSearchService lyricSearchService;

@PostMapping("/addDocuments")
public List<Map<String, Object>> addDocuments(@RequestBody List<DocumentRequest> documents) {
    return lyricSearchService.addDocuments(documents).stream()
        .map(doc -> Map.of("content", doc.getContent(), "metadata", doc.getMetadata()))
        .collect(Collectors.toList());
}

// The "/delete" path is assumed for this example
@DeleteMapping("/delete")
public List<String> deleteDocuments(@RequestBody List<String> ids) {
    return lyricSearchService.deleteDocuments(ids);
}
And our imports:
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.DeleteMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;

import com.mongodb.lyric_semantic_search.model.DocumentRequest;
import com.mongodb.lyric_semantic_search.service.LyricSearchService;
To test our embedding, let's keep it simple with a few nursery rhymes for now.
Build and run your application. Use the following curl command to add sample documents:

curl -X POST "http://localhost:8080/addDocuments" \
  -H "Content-Type: application/json" \
  -d '[
    {"content": "Twinkle, twinkle, little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky.", "metadata": {"title": "Twinkle Twinkle Little Star", "artist": "Jane Taylor", "year": "1806"}},
    {"content": "The itsy bitsy spider climbed up the waterspout. Down came the rain and washed the spider out. Out came the sun and dried up all the rain and the itsy bitsy spider climbed up the spout again.", "metadata": {"title": "Itsy Bitsy Spider", "artist": "Traditional", "year": "1910"}},
    {"content": "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall. All the kings horses and all the kings men couldnt put Humpty together again.", "metadata": {"title": "Humpty Dumpty", "artist": "Mother Goose", "year": "1797"}}
  ]'
Let's define our searching method in our LyricSearchService. This is how we will semantically search our documents in our database.

public List<Map<String, Object>> searchDocuments(String query, int topK, double similarityThreshold) {
    SearchRequest searchRequest = SearchRequest.query(query)
        .withTopK(topK)
        .withSimilarityThreshold(similarityThreshold);

    List<Document> results = lyricSearchRepository.semanticSearchByLyrics(searchRequest);

    return results.stream()
        .map(doc -> Map.of("content", doc.getContent(), "metadata", doc.getMetadata()))
        .collect(Collectors.toList());
}
This method takes in:
- query: A String representing the search query or the text for which you want to find semantically similar lyrics
- topK: An int specifying the number of top results to retrieve (e.g., the top 10)
- similarityThreshold: A double indicating the minimum similarity score a result must have to be included in the results
This returns a list of Map<String, Object> objects. Each map contains the content and metadata of a document that matches the search criteria.
And the imports to our service:

import java.util.Map;
import org.springframework.ai.vectorstore.SearchRequest;
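To make the roles of topK and similarityThreshold more concrete, here is a minimal in-memory sketch of the same idea: score every stored vector against the query, drop anything below the threshold, and keep the best topK. It assumes raw cosine similarity and a brute-force scan. SimilaritySketch, Scored, cosine, and search are hypothetical names, and the scores a real vector store reports may be scaled differently or computed with an approximate index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: a brute-force stand-in for what a vector search does conceptually.
final class SimilaritySketch {

    record Scored(String id, double score) {}

    // Cosine similarity between two vectors of equal length
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // Treat a zero vector as having no similarity
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Keeps results scoring at least threshold, best first, limited to topK entries
    static List<Scored> search(double[] query, List<String> ids, List<double[]> vectors,
                               int topK, double threshold) {
        List<Scored> scored = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            double score = cosine(query, vectors.get(i));
            if (score >= threshold) {
                scored.add(new Scored(ids.get(i), score));
            }
        }
        Comparator<Scored> byScore = Comparator.comparingDouble(Scored::score);
        scored.sort(byScore.reversed());
        return List.copyOf(scored.subList(0, Math.min(topK, scored.size())));
    }

    public static void main(String[] args) {
        List<String> ids = List.of("star", "spider", "egg");
        List<double[]> vectors = List.of(
            new double[] {1, 0}, new double[] {0, 1}, new double[] {1, 1});
        search(new double[] {1, 0}, ids, vectors, 2, 0.5).forEach(System.out::println);
    }
}
```

Raising the threshold trims weaker matches even when topK would allow more results, which is why a search can return fewer documents than the topK you ask for.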
Let's add an endpoint to our controller, and build and run our application.
@GetMapping("/search")
public List<Map<String, Object>> searchDocuments(@RequestParam String query, @RequestParam int topK, @RequestParam double similarityThreshold) {
    return lyricSearchService.searchDocuments(query, topK, similarityThreshold);
}
And the imports:
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
Use the following curl command to search your database for lyrics about small celestial bodies:

curl -X GET "http://localhost:8080/search?query=small%20celestial%20bodies&topK=5&similarityThreshold=0.8"
And voila! We have our twinkly little star at the top of our list.
[{
    "metadata":{
        "title":"Twinkle Twinkle Little Star",
        "artist":"Jane Taylor",
        "year":"1806"
    },
    "content":"Twinkle, twinkle, little star,..."
},
...
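If you would rather build the search URL from Java than type the percent-encoded query by hand, the following sketch shows one way to do it with the standard library. SearchUrlBuilder, encode, and searchUrl are hypothetical names for illustration. The base URL and parameter names simply mirror the curl command above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that assembles the /search URL used in the curl example.
final class SearchUrlBuilder {

    // URLEncoder produces form encoding, which writes spaces as '+'.
    // Literal plus signs are already escaped as %2B, so swapping '+' for %20 is safe.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    static String searchUrl(String baseUrl, String query, int topK, double similarityThreshold) {
        return baseUrl + "/search?query=" + encode(query)
            + "&topK=" + topK
            + "&similarityThreshold=" + similarityThreshold;
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("http://localhost:8080", "small celestial bodies", 5, 0.8));
    }
}
```

The resulting string can be passed to curl or to any HTTP client of your choice.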
In order to filter our data, we need to head over to our index in MongoDB. You can do this through the Atlas UI by selecting the collection where your data is stored and going to the search indexes. You can edit this index by selecting the three dots on the right of the index name and we will add our filter for the artist.
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "metadata.artist",
      "type": "filter"
    }
  ]
}
Let's head back to our LyricSearchService and add a method with an artist parameter so we can filter our results.

public List<Map<String, Object>> searchDocumentsWithFilter(String query, int topK, double similarityThreshold, String artist) {
    FilterExpressionBuilder filterBuilder = new FilterExpressionBuilder();
    Expression filterExpression = filterBuilder.eq("artist", artist)
        .build();

    SearchRequest searchRequest = SearchRequest.query(query)
        .withTopK(topK)
        .withSimilarityThreshold(similarityThreshold)
        .withFilterExpression(filterExpression);

    List<Document> results = lyricSearchRepository.semanticSearchByLyrics(searchRequest);

    return results.stream()
        .map(doc -> Map.of("content", doc.getContent(), "metadata", doc.getMetadata()))
        .collect(Collectors.toList());
}
And the imports we'll need:
import org.springframework.ai.vectorstore.filter.Filter.Expression;
import org.springframework.ai.vectorstore.filter.FilterExpressionBuilder;
And lastly, an endpoint in our controller:
@GetMapping("/searchWithFilter")
public List<Map<String, Object>> searchDocumentsWithFilter(@RequestParam String query, @RequestParam int topK, @RequestParam double similarityThreshold, @RequestParam String artist) {
    return lyricSearchService.searchDocumentsWithFilter(query, topK, similarityThreshold, artist);
}
Now, we are able to not only search as before, but we can say we want to restrict it to only specific artists.
Use the following curl command to try a semantic search with metadata filtering:

curl -X GET "http://localhost:8080/searchWithFilter?query=little%20star&topK=5&similarityThreshold=0.8&artist=Jane%20Taylor"
Unlike before, even though we ask for the top five results, only one document is returned, because we only have one document from the artist Jane Taylor. Hooray!
[{
    "metadata":{
        "title":"Twinkle Twinkle Little Star",
        "artist":"Jane Taylor",
        "year":"1806"
    },
    "content":"Twinkle, twinkle, little star,..."
}]
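Conceptually, the artist filter narrows the set of candidate documents by an exact match on a metadata field, and only those candidates can appear in the ranked results. The sketch below imitates that narrowing step on plain Java objects, under the assumption of a simple equality check. MetadataFilterSketch, Lyric, and filterByArtist are hypothetical names, and the real filtering happens inside Atlas Vector Search rather than in application code.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Illustrative only: an in-memory imitation of an equality filter on metadata.artist.
final class MetadataFilterSketch {

    record Lyric(String content, Map<String, Object> metadata) {}

    // Keeps only the lyrics whose "artist" metadata equals the requested artist
    static List<Lyric> filterByArtist(List<Lyric> lyrics, String artist) {
        return lyrics.stream()
            .filter(lyric -> Objects.equals(lyric.metadata().get("artist"), artist))
            .toList();
    }

    public static void main(String[] args) {
        List<Lyric> lyrics = List.of(
            new Lyric("Twinkle, twinkle, little star", Map.of("artist", "Jane Taylor")),
            new Lyric("The itsy bitsy spider", Map.of("artist", "Traditional")),
            new Lyric("Humpty Dumpty sat on a wall", Map.of("artist", "Mother Goose")));
        filterByArtist(lyrics, "Jane Taylor").forEach(System.out::println);
    }
}
```

With the three nursery rhymes from earlier, only the Jane Taylor entry survives the filter, which matches the single result returned by the endpoint.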
You now have a Spring application that allows you to search through your data by performing semantic searches. This is an important step when you are looking to implement your RAG applications, or just an AI-enhanced search feature in your applications.
If you want to learn more about the MongoDB Spring AI integration, follow along with the quick-start Get Started With the Spring AI Integration, and if you have any questions or want to show us what you are building, join us in the MongoDB Community Forums.