Capturing and Storing Real-World Optics With MongoDB Atlas, OpenAI GPT-4o, and PyMongo

Pavel Duchovny7 min read • Published Sep 04, 2024 • Updated Sep 04, 2024
Every time OpenAI posts news about a new GPT model, I get excited. I have been building demos with the OpenAI GPT APIs since 2021, pretty much when they were first released. Reminiscing on my first article, I can’t believe what a huge milestone GPT-4o is with its “Omni” media abilities. It gives users real flexibility: text, image, and audio inputs all work through a single API endpoint, so a wide range of intelligent tasks can be performed.
MongoDB has been known for its ability to store flexible data streams and JSON structures for years, leveraged by millions of users. So, it's not surprising to me that mixing MongoDB Atlas and Atlas Vector Search with GPT-4o on texts and images, captured by a simple web app, is so powerful and amazing.
In this article, we explore an innovative way to capture and store real-world data using MongoDB, GPT-4o, and the PyMongo driver within a Streamlit app. We’ll walk through the development of an application that transforms captured images into searchable JSON documents, making use of OpenAI’s powerful GPT-4o for OCR capabilities. This project is an excellent demonstration of how to integrate various technologies to solve practical problems in a streamlined and efficient manner.

Introduction

Real-world objects such as recipes, documents, animals, and vehicles often contain valuable information that can be digitized for easier access and analysis. By combining the capabilities of MongoDB, Streamlit, and OpenAI, we can build an application that captures images, extracts text, and stores the information in a MongoDB database. This approach allows for efficient storage, retrieval, and searching of the digitized data.
Demo of the OCR

Key technologies

MongoDB Atlas: A flexible, scalable, and document-oriented database that is perfect for storing JSON-like documents
PyMongo: MongoDB’s robust Python driver, serving as the access point to operational and vector queries
OpenAI GPT-4o: A state-of-the-art language model capable of understanding multiple input modalities (text, images, and audio) and generating human-like text or images, which we will use here for optical character recognition (OCR)

Application workflow

User authentication: Ensures that only authorized users can access the application
Image capture: Uses the device’s camera to capture images of real-world objects
Text extraction: Utilizes OpenAI’s GPT-4o to transcribe the captured images into structured JSON data
Data storage: Stores the extracted JSON data in MongoDB for efficient retrieval and searching
Data retrieval: Allows users to search and view the stored documents and their corresponding images
Pipeline AI tasks on captured documents: Uses retrieval-augmented generation (RAG) to take a prompt from the user and operate on existing content to create newly generated content (e.g., translate a captured post into four other languages, or create a LinkedIn post from a product announcement summary)

Building the application

If you haven’t done so already, register for MongoDB Atlas and create a cluster.
Once you have your cluster created with IP access added to your host, get your connection string and copy it for use later.
Atlas allows you to create full-text search indexes alongside vector search indexes to allow robust and rich searching abilities on your stored documents.
Full-text search allows you to leverage aggregations and dynamically search your documents based on keywords and fuzzy logic on any set of attributes at any level.
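To sketch what such a keyword search with fuzzy matching looks like, here is a minimal example of constructing a $search stage using the Atlas Search text operator's fuzzy option. The pipeline is only built here, not executed, and the helper name build_fuzzy_search_stage is illustrative rather than part of the app's code:

```python
def build_fuzzy_search_stage(query):
    # Atlas Search $search stage: match `query` against every indexed
    # field (wildcard path), tolerating up to one typo via `fuzzy`.
    return {
        '$search': {
            'index': 'search',
            'text': {
                'query': query,
                'path': {'wildcard': '*'},
                'fuzzy': {'maxEdits': 1}
            }
        }
    }

# "recipie" contains a typo, but with maxEdits: 1 it still matches "recipe".
stage = build_fuzzy_search_stage("recipie")
```

You would place a stage like this first in a collection.aggregate() pipeline, as the full search function later in this article does.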
Follow our tutorial to create an index with the needed syntax for this application. Apply it to a collection in the database ocr_db and the collection ocr_documents, with the index name “search”:
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "api_key": {
        "type": "string"
      },
      "ocr": {
        "dynamic": true,
        "type": "document"
      }
    }
  }
}
Vectors are float-based arrays created by AI providers (OpenAI, in this case) that encode an input as a numerical vector. The vector index enables semantic similarity search: an encoded query, string, or media item is compared against the stored vectors that represent the encoded content of the database documents.
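As a quick illustration of the cosine similarity metric the index below is configured with, here is a self-contained sketch using toy 3-dimensional vectors (real OpenAI embeddings have 1,536 dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|) — the metric set by
    # "similarity": "cosine" in the vector index definition.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # identical -> 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal -> 0.0
```

Documents whose stored embedding points in nearly the same direction as the query embedding score close to 1.0 and rank first.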
To create it, use the following index, with the needed syntax for this application on the database ocr_db and the collection ocr_documents, index name “vector_index”:
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "api_key",
      "type": "filter"
    }
  ]
}
Once you have the collection and indexes ready, you can build the application artifacts.

Setting up the environment

Let's perform the basic steps to get our application running.

1. Clone the repository and install the required packages

git clone https://github.com/Pash10g/allcr-ai.git
cd allcr-app
pip install -r requirements.txt

2. Set up your environment variables in the terminal

OPENAI_API_KEY=your_openai_api_key
MONGODB_ATLAS_URI=your_mongodb_atlas_uri
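Before launching, it can help to fail fast if either variable is missing. This small helper (missing_env is a hypothetical name, not part of the app's code) checks for the two variables the application reads:

```python
import os

REQUIRED = ["OPENAI_API_KEY", "MONGODB_ATLAS_URI"]

def missing_env(env=None):
    # Return the names of required variables that are absent or empty.
    if env is None:
        env = os.environ
    return [name for name in REQUIRED if not env.get(name)]

missing = missing_env()
if missing:
    print(f"Set these environment variables before running: {missing}")
```

A check like this at the top of app.py gives a clear error instead of a failed MongoClient or OpenAI call later on.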

Running the application

To run the Streamlit app, use the following command:
streamlit run app.py
Open your web browser and go to http://localhost:8501 to access the application.

3. MongoDB connection

The application initializes a global collection instance to use:
import os
from pymongo import MongoClient

# MongoDB connection
client = MongoClient(os.environ.get("MONGODB_ATLAS_URI"))
db = client['ocr_db']
collection = db['ocr_documents']

User authentication

The application begins by prompting the user to enter an API code. This ensures that only authorized users can access the app’s functionalities. The API code is checked against the permitted keys stored in the database.
def auth_form():
    st.write("Please enter the API code to access the application.")
    api_code = st.text_input("API Code", type="password")
    if st.button("Submit"):
        st.toast("Authenticating...", icon="⚠️")
        db_api_key = auth_collection.find_one({"api_key": api_code})
        if db_api_key:
            st.session_state.authenticated = True
            st.session_state.api_code = api_code
            st.success("Authentication successful.")
            st.rerun()  # Re-run the script to remove the auth form
        else:
            st.error("Authentication failed. Please try again.")

4. Capturing images

Once authenticated, users can capture images using their device's camera. The app supports images of various real-world objects, such as recipes, documents, animals, vehicles, and products.
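Streamlit's st.camera_input returns the capture as a file-like buffer. A helper along these lines (encode_capture is a hypothetical name, not from the app's code) turns that buffer into the base64 data URL that GPT-4o's image_url input expects:

```python
import base64
import io

def encode_capture(buf, mime="image/jpeg"):
    # `buf` is any file-like object, such as the UploadedFile returned
    # by st.camera_input("Take a picture") in a Streamlit app.
    raw = buf.read()
    encoded = base64.b64encode(raw).decode("utf-8")
    # Data URL format expected by GPT-4o's "image_url" content part.
    return f"data:{mime};base64,{encoded}"

# Example with an in-memory stand-in for a captured image:
url = encode_capture(io.BytesIO(b"\xff\xd8\xff\xe0fake-jpeg-bytes"))
```

The app's own code (shown next) does the equivalent encoding via PIL's Image.save into a BytesIO buffer.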

5. Extracting text from images

Captured images are sent to OpenAI’s GPT-4o for OCR. The model processes the images and extracts the relevant text, which is then structured into JSON. The resulting document includes fields like 'name' and 'type', ensuring that the data is well-organized and ready for storage.
# Function to transform image to text using OpenAI
def transform_image_to_text(image, format):
    img_byte_arr = io.BytesIO()
    image.save(img_byte_arr, format=format)
    img_byte_arr = img_byte_arr.getvalue()
    encoded_image = base64.b64encode(img_byte_arr).decode('utf-8')

    # `transcribed_object` is the object type selected elsewhere in the app's UI
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are an OCR-to-JSON expert looking to transcribe an image. If the type is 'other' please specify the type of object and classify it as you see fit."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Please transcribe this {transcribed_object} into a JSON-only output for MongoDB storage, capturing all data as a single document. Always have a 'name', 'summary' (for embedding) and 'type' top field (type is a subdocument with 'user' and 'ai_classified') as well as other fields as you see fit."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encoded_image}"
                    }
                }
            ]
        }]
    )
    extracted_text = response.choices[0].message.content
    return extracted_text

6. Storing data in MongoDB

The structured JSON data is stored in a MongoDB database. MongoDB’s document-oriented nature makes it an excellent choice for this kind of application, allowing for flexible and efficient storage and retrieval of data. The app uses OpenAI embeddings on the 'name' and 'summary' fields to enable semantic search.
def clean_document(document):
    # Remove the ```json and ``` wrappers around the model output
    cleaned_document = document.strip().strip("```json").strip("```").strip()
    return json.loads(cleaned_document)

# Function to save image and text to MongoDB
def save_image_to_mongodb(image, description):
    img_byte_arr = io.BytesIO()
    image.save(img_byte_arr, format=image.format)
    img_byte_arr = img_byte_arr.getvalue()
    encoded_image = base64.b64encode(img_byte_arr).decode('utf-8')

    # Parse the cleaned JSON string into a Python dictionary
    document = clean_document(description)

    response = openai.embeddings.create(
        input=json.dumps({
            'name': document['name'],
            'summary': document['summary']
        }),
        model="text-embedding-3-small"
    )

    gen_embeddings = response.data[0].embedding

    collection.insert_one({
        'image': encoded_image,
        'api_key': st.session_state.api_code,
        'embedding': gen_embeddings,
        'ocr': document,
        'ai_tasks': []
    })

7. Searching and displaying documents

Users can search for stored documents using keywords. The app retrieves matching documents from MongoDB and displays them, along with their corresponding images. This makes it easy to browse through and find specific information.
def search_aggregation(search_query):
    docs = list(collection.aggregate([
        {
            '$search': {
                'index': 'search',
                'compound': {
                    'should': [
                        {
                            'text': {
                                'query': search_query,
                                'path': {
                                    'wildcard': '*'
                                }
                            }
                        }
                    ],
                    'filter': [
                        {
                            'queryString': {
                                'defaultPath': 'api_key',
                                'query': st.session_state.api_code
                            }
                        }
                    ]
                }
            }
        }
    ]))
    return docs

def vector_search_aggregation(search_query, limit):
    query_resp = openai.embeddings.create(
        input=search_query,
        model="text-embedding-3-small"
    )
    query_vec = query_resp.data[0].embedding
    docs = list(collection.aggregate([
        {
            '$vectorSearch': {
                'index': 'vector_index',
                'queryVector': query_vec,
                'path': 'embedding',
                'numCandidates': 20,
                'limit': limit,
                'filter': {
                    'api_key': st.session_state.api_code
                }
            }
        },
        {'$project': {'embedding': 0}}
    ]))
    return docs
Additionally, a UI toggle lets us switch between the semantic vector search and the free-text contextual search.
In both searches, the code applies an extra filter so that only documents tagged with the user's API key are returned.

8. Applying AI tasks on captured documents

The application also supports adding additional AI tasks to each document. Here’s how you can extend the functionality:

AI task pipeline

You can create and save AI tasks on each document using the following functions. These functions allow you to define tasks for the AI to perform on stored JSON documents and save the results back to MongoDB. The flexibility of MongoDB allows us to add the content and present it for record and future reuse.
def get_ai_task(ocr, prompt):
    # Use the existing document as context and perform another GPT task
    ocr_text = json.dumps(ocr)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a task assistant looking to create a task for the AI to perform on the JSON object. Please return plain output which is only copy-paste with no explanation."
        },
        {
            "role": "user",
            "content": f"Please perform the following task {prompt} on the following JSON object {ocr_text}. Make sure that the output is straightforward to copy-paste."
        }]
    )

    return response.choices[0].message.content

def save_ai_task(task_id, task_result, prompt):
    collection.update_one(
        {"_id": ObjectId(task_id)},
        {"$push": {"ai_tasks": {'prompt': prompt, 'result': task_result}}}
    )

    return "Task saved successfully."

Putting it all together

To illustrate the described workflows, I captured the following pictures, where I scan a gin recipe from a book. The content is captured as a JSON document, and I can now search it via vector or text search, as well as run a task like “generate a non-alcoholic beverage” similar to the original recipe.
Initial OCR
AI task on document code

Try it yourself

The project code can be found in the following GitHub repo, which you can deploy yourself by following the README.md file.

Conclusion

This application demonstrates the power and flexibility of integrating MongoDB Atlas, Streamlit, and OpenAI’s GPT-4o to capture, process, and store real-world data. By leveraging these technologies, we can build robust solutions that transform physical information into digital, searchable documents, enhancing accessibility and usability.
The combination of MongoDB Atlas's scalable storage, the PyMongo driver, Streamlit's user-friendly interface, and OpenAI's advanced OCR capabilities offers a comprehensive solution for managing and utilizing real-world data effectively.
If you have any questions or suggestions, feel free to reach out or contribute to the project. Try MongoDB Atlas today and join our forums for further engagement. Happy coding!