Capturing and Storing Real-World Optics With MongoDB Atlas, OpenAI GPT-4o, and PyMongo
Pavel Duchovny7 min read • Published Sep 04, 2024 • Updated Sep 04, 2024
Every time OpenAI announces a new GPT model, I get excited. I have been building demos with the OpenAI GPT APIs since 2021, pretty much when they were first released. Reminiscing on my first article, I can’t believe what a huge milestone GPT-4o is with its “omni” media abilities. Text, image, and audio inputs work together through a single API endpoint, giving users the flexibility and freedom to perform almost any intelligent task.
MongoDB has been known for its ability to store flexible data streams and JSON structures for years, leveraged by millions of users. So, it's not surprising to me that mixing MongoDB Atlas and Atlas Vector Search with GPT-4o on texts and images, captured by a simple web app, is so powerful and amazing.
In this article, we explore an innovative way to capture and store real-world data using MongoDB, GPT-4o, and the PyMongo driver within a Streamlit app. We’ll walk through the development of an application that transforms captured images into searchable JSON documents, making use of OpenAI’s powerful GPT-4o for OCR capabilities. This project is an excellent demonstration of how to integrate various technologies to solve practical problems in a streamlined and efficient manner.
Real-world objects such as recipes, documents, animals, and vehicles often contain valuable information that can be digitized for easier access and analysis. By combining the capabilities of MongoDB, Streamlit, and OpenAI, we can build an application that captures images, extracts text, and stores the information in a MongoDB database. This approach allows for efficient storage, retrieval, and searching of the digitized data.
- **MongoDB Atlas:** A flexible, scalable, document-oriented database that is perfect for storing JSON-like documents
- **PyMongo:** MongoDB’s robust Python driver, serving as the access point for operational and vector queries
- **OpenAI GPT-4o:** A state-of-the-art language model capable of understanding multiple input media (text, images, and audio) and generating human-like text or images, which we will use here for optical character recognition (OCR)
- **User authentication:** Ensures that only authorized users can access the application
- **Image capture:** Uses the device’s camera to capture images of real-world objects
- **Text extraction:** Utilizes OpenAI’s GPT-4o to transcribe the captured images into structured JSON data
- **Data storage:** Stores the extracted JSON data in MongoDB for efficient retrieval and searching
- **Data retrieval:** Allows users to search and view the stored documents and their corresponding images
- **AI task pipeline on captured documents:** Uses retrieval-augmented generation (RAG) to take a prompt from the user and operate on existing content to create new generated content (e.g., translate a captured post into four other languages, or create a LinkedIn post from a product summary announcement)
Once you have your cluster created with IP access added to your host, get your connection string and copy it for use later.
Atlas allows you to create full-text search indexes alongside vector search indexes to allow robust and rich searching abilities on your stored documents.
Full-text search allows you to leverage aggregation pipelines and dynamically search your documents with keywords and fuzzy matching on any set of attributes, at any level of nesting.
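To make this concrete, here is a minimal sketch of how such a fuzzy keyword stage could be built as a plain pipeline dict. The `build_fuzzy_search_stage` helper and the `maxEdits` value are illustrative, not part of the app's code; the index name matches the one created below.

```python
def build_fuzzy_search_stage(query: str) -> dict:
    """Build an Atlas Search $search stage that matches `query` against all
    fields, tolerating small typos via the text operator's fuzzy option."""
    return {
        "$search": {
            "index": "search",              # the full-text index created for this app
            "text": {
                "query": query,
                "path": {"wildcard": "*"},  # search every indexed field
                "fuzzy": {"maxEdits": 1},   # allow one character edit per term
            },
        }
    }

# The stage would be the first element of a collection.aggregate([...]) pipeline.
stage = build_fuzzy_search_stage("gin recipe")
```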
Follow our tutorial to create an index with the syntax needed for this application. Apply it to the ocr_documents collection in the ocr_db database, with the index name “search”:

```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "api_key": {
        "type": "string"
      },
      "ocr": {
        "dynamic": true,
        "type": "document"
      }
    }
  }
}
```
Vectors are float-based arrays, created by AI providers (OpenAI in this case), that encode an input as a numerical representation. The vector index enables semantic similarity search between an encoded query (a string or other media) and the stored vectors that represent the encoded content of the database documents.
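To build intuition for what the vector index computes, here is a toy illustration of cosine similarity, the similarity function configured below, on tiny hand-made vectors. Real OpenAI embeddings have 1,536 dimensions; this helper is purely illustrative.

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|); 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented values, for illustration only)
recipe = [0.9, 0.1, 0.2]
drink  = [0.8, 0.2, 0.3]
car    = [0.1, 0.9, 0.1]

print(cosine_similarity(recipe, drink))  # close to 1.0: semantically similar
print(cosine_similarity(recipe, car))    # much lower: semantically distant
```

Atlas Vector Search performs this kind of comparison at scale, returning the stored documents whose embeddings are closest to the query embedding.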
To create it, use the following index definition, with the syntax needed for this application, on the ocr_documents collection in the ocr_db database, with the index name “vector_index”:

```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "api_key",
      "type": "filter"
    }
  ]
}
```
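If you prefer defining the indexes from code rather than the Atlas UI, the same definitions can be kept as plain Python dicts and, against a live Atlas cluster with a recent PyMongo version, passed to `create_search_index`. This is a sketch; the commented calls assume the `collection` handle created later in this article.

```python
# Full-text index definition, mirroring the "search" index JSON above
text_search_definition = {
    "mappings": {
        "dynamic": True,
        "fields": {
            "api_key": {"type": "string"},
            "ocr": {"dynamic": True, "type": "document"},
        },
    }
}

# Vector index definition, mirroring the "vector_index" JSON above
vector_search_definition = {
    "fields": [
        {
            "numDimensions": 1536,  # must match text-embedding-3-small's output size
            "path": "embedding",
            "similarity": "cosine",
            "type": "vector",
        },
        {"path": "api_key", "type": "filter"},  # enables per-user pre-filtering
    ]
}

# With an Atlas connection, the indexes can be created programmatically:
# from pymongo.operations import SearchIndexModel
# collection.create_search_index(
#     SearchIndexModel(definition=text_search_definition, name="search"))
# collection.create_search_index(
#     SearchIndexModel(definition=vector_search_definition,
#                      name="vector_index", type="vectorSearch"))
```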
Once you have the collection and indexes ready, you can build the application artifacts.
Let’s perform the basic steps to get our application running.
```shell
git clone https://github.com/Pash10g/allcr-ai.git
cd allcr-ai
pip install -r requirements.txt
```
Set the following environment variables (for example, in a `.env` file):

```shell
OPENAI_API_KEY=your_openai_api_key
MONGODB_ATLAS_URI=your_mongodb_atlas_uri
```
To run the Streamlit app, use the following command:
```shell
streamlit run app.py
```
The application initializes a global collection instance to use:
```python
# MongoDB connection
client = MongoClient(os.environ.get("MONGODB_ATLAS_URI"))
db = client['ocr_db']
collection = db['ocr_documents']
```
The application begins by prompting the user to enter an API code. This ensures that only authorized users can access the app’s functionalities. The API code is checked against the permitted keys stored in the database.
```python
def auth_form():
    st.write("Please enter the API code to access the application.")
    api_code = st.text_input("API Code", type="password")
    if st.button("Submit"):
        st.toast("Authenticating...", icon="⚠️")
        # auth_collection holds the permitted API keys
        db_api_key = auth_collection.find_one({"api_key": api_code})
        if db_api_key:
            st.session_state.authenticated = True
            st.session_state.api_code = api_code
            st.success("Authentication successful.")
            st.rerun()  # Re-run the script to remove the auth form
        else:
            st.error("Authentication failed. Please try again.")
```
Once authenticated, users can capture images using their device's camera. The app supports images of various real-world objects, such as recipes, documents, animals, vehicles, and products.
Captured images are sent to OpenAI’s GPT-4o for OCR. The model processes each image and extracts the relevant text, which is then structured into JSON. The resulting document includes fields like 'name' and 'type', ensuring that the data is well-organized and ready for storage.
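For a photographed recipe, a cleaned-up response might look like the following. This is an illustrative example, not actual model output; any fields beyond 'name', 'summary', and 'type' vary per image.

```json
{
  "name": "Classic Gin Fizz",
  "summary": "A gin cocktail combining gin, lemon juice, sugar, and soda water.",
  "type": { "user": "recipe", "ai_classified": "cocktail recipe" },
  "ingredients": ["60 ml gin", "30 ml lemon juice", "10 g sugar", "soda water"],
  "steps": [
    "Shake gin, lemon juice, and sugar with ice.",
    "Strain into a glass and top with soda water."
  ]
}
```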
```python
# Function to transform image to text using OpenAI
def transform_image_to_text(image, format):
    img_byte_arr = io.BytesIO()
    image.save(img_byte_arr, format=format)
    img_byte_arr = img_byte_arr.getvalue()
    encoded_image = base64.b64encode(img_byte_arr).decode('utf-8')

    # transcribed_object holds the object type the user selected in the UI
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are an OCR-to-JSON expert looking to transcribe an image. If the type is 'other', please specify the type of object and classify it as you see fit."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Please transcribe this {transcribed_object} into a JSON-only output for MongoDB storage, capturing all data as a single document. Always have a 'name', a 'summary' (for embedding), and a 'type' top-level field (type is a subdocument with 'user' and 'ai_classified'), as well as other fields as you see fit."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encoded_image}"
                    }
                }
            ]
        }]
    )
    extracted_text = response.choices[0].message.content
    return extracted_text
```
The structured JSON data is stored in a MongoDB database. MongoDB’s document-oriented nature makes it an excellent choice for this kind of application, allowing for flexible and efficient storage and retrieval of data. The app uses an OpenAI embedding of the 'name' and 'summary' fields to power semantic search.
```python
def clean_document(document):
    # Remove the ```json and ``` wrappers the model may add around its output
    cleaned_document = document.strip().strip("```json").strip("```").strip()
    return json.loads(cleaned_document)

# Function to save image and text to MongoDB
def save_image_to_mongodb(image, description):
    img_byte_arr = io.BytesIO()
    image.save(img_byte_arr, format=image.format)
    img_byte_arr = img_byte_arr.getvalue()
    encoded_image = base64.b64encode(img_byte_arr).decode('utf-8')

    # Parse the cleaned JSON string into a Python dictionary
    document = clean_document(description)

    response = openai.embeddings.create(
        input=json.dumps({
            'name': document['name'],
            'summary': document['summary']
        }),
        model="text-embedding-3-small"
    )

    gen_embeddings = response.data[0].embedding

    collection.insert_one({
        'image': encoded_image,
        'api_key': st.session_state.api_code,
        'embedding': gen_embeddings,
        'ocr': document,
        'ai_tasks': []
    })
```
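The strip-based cleanup above can fail if the model wraps the JSON in extra prose around the code fence. Here is a slightly more defensive, regex-based alternative; this is a sketch, and the `extract_json` name is ours rather than part of the app.

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Pull a JSON object out of a model reply that may be wrapped in
    ```json fences and/or surrounded by extra prose."""
    # Prefer the fenced block if present; otherwise fall back to the raw string
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", reply, re.DOTALL)
    payload = match.group(1) if match else reply.strip()
    return json.loads(payload)

print(extract_json('Here you go:\n```json\n{"name": "Gin Fizz", "type": "recipe"}\n```'))
# → {'name': 'Gin Fizz', 'type': 'recipe'}
```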
Users can search for stored documents using keywords. The app retrieves matching documents from MongoDB and displays them, along with their corresponding images. This makes it easy to browse through and find specific information.
```python
def search_aggregation(search_query):
    docs = list(collection.aggregate([
        {
            '$search': {
                'index': 'search',
                'compound': {
                    'should': [
                        {
                            'text': {
                                'query': search_query,
                                'path': {
                                    'wildcard': '*'
                                }
                            }
                        }
                    ],
                    'filter': [
                        {
                            'queryString': {
                                'defaultPath': 'api_key',
                                'query': st.session_state.api_code
                            }
                        }
                    ]
                }
            }
        }
    ]))
    return docs

def vector_search_aggregation(search_query, limit):
    query_resp = openai.embeddings.create(
        input=search_query,
        model="text-embedding-3-small"
    )
    query_vec = query_resp.data[0].embedding
    docs = list(collection.aggregate([
        {
            '$vectorSearch': {
                'index': 'vector_index',
                'queryVector': query_vec,
                'path': 'embedding',
                'numCandidates': 20,
                'limit': limit,
                'filter': {
                    'api_key': st.session_state.api_code
                }
            }
        },
        {'$project': {'embedding': 0}}
    ]))
    return docs
```
Additionally, a UI toggle lets us switch between semantic vector search and free-text contextual search.
In both searches, the code applies an extra filter so that only documents tagged with the user's API key are returned.
The application also supports adding additional AI tasks to each document. Here’s how you can extend the functionality:
You can create and save AI tasks on each document using the following functions. These functions allow you to define tasks for the AI to perform on stored JSON documents and save the results back to MongoDB. The flexibility of MongoDB allows us to add the content and present it for record and future reuse.
```python
def get_ai_task(ocr, prompt):
    ## Use the existing document as context and perform another GPT task
    ocr_text = json.dumps(ocr)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a task assistant looking to create a task for the AI to perform on the JSON object. Please return plain output which is only copy-paste, with no explanation."
        },
        {
            "role": "user",
            "content": f"Please perform the following task {prompt} on the following JSON object {ocr_text}. Make sure that the output is straightforward to copy and paste."
        }]
    )

    return response.choices[0].message.content

def save_ai_task(task_id, task_result, prompt):
    collection.update_one(
        {"_id": ObjectId(task_id)},
        {"$push": {"ai_tasks": {'prompt': prompt, 'result': task_result}}}
    )

    return "Task saved successfully."
```
To illustrate the described workflows, I produced the following pictures, in which I scan a gin recipe from a book. The content is captured as a JSON document, and I can now search it via vector or text search, as well as run a task like “generate a non-alcoholic beverage” similar to the original recipe.
This project code can be found in the following GitHub repo which you can deploy yourself by following the README.md file.
This application demonstrates the power and flexibility of integrating MongoDB Atlas, Streamlit, and OpenAI’s GPT-4o to capture, process, and store real-world data. By leveraging these technologies, we can build robust solutions that transform physical information into digital, searchable documents, enhancing accessibility and usability.
The combination of MongoDB Atlas's scalable storage, the PyMongo driver, Streamlit's user-friendly interface, and OpenAI's advanced OCR capabilities offers a comprehensive solution for managing and utilizing real-world data effectively.
If you have any questions or suggestions, feel free to reach out or contribute to the project. Try MongoDB Atlas today and join our forums for further engagement. Happy coding!