Capturing and Storing Real-World Optics With MongoDB Atlas, OpenAI GPT-4o, and PyMongo
Pavel Duchovny7 min read • Published Sep 04, 2024 • Updated Sep 04, 2024
Every time OpenAI announces a new GPT model, I get excited. I have been building demos with the OpenAI GPT APIs since 2021, pretty much when they were first released. Reminiscing on my first article, I can’t believe what a huge milestone GPT-4o is with its “omni” media abilities. Text, image, and audio inputs work together through a single API endpoint, giving users the flexibility and freedom to perform almost any intelligent task.
MongoDB has been known for its ability to store flexible data streams and JSON structures for years, leveraged by millions of users. So, it's not surprising to me that mixing MongoDB Atlas and Atlas Vector Search with GPT-4o on texts and images, captured by a simple web app, is so powerful and amazing.
In this article, we explore an innovative way to capture and store real-world data using MongoDB, GPT-4o, and the PyMongo driver within a Streamlit app. We’ll walk through the development of an application that transforms captured images into searchable JSON documents, making use of OpenAI’s powerful GPT-4o for OCR capabilities. This project is an excellent demonstration of how to integrate various technologies to solve practical problems in a streamlined and efficient manner.
Real-world objects such as recipes, documents, animals, and vehicles often contain valuable information that can be digitized for easier access and analysis. By combining the capabilities of MongoDB, Streamlit, and OpenAI, we can build an application that captures images, extracts text, and stores the information in a MongoDB database. This approach allows for efficient storage, retrieval, and searching of the digitized data.
- **MongoDB Atlas:** A flexible, scalable, document-oriented database that is perfect for storing JSON-like documents
- **PyMongo:** MongoDB’s robust Python driver, serving as the access point for operational and vector queries
- **OpenAI GPT-4o:** A state-of-the-art language model capable of understanding multiple input media (text, images, and audio) and generating human-like text or images, which we will use here for optical character recognition (OCR)
- **User authentication:** Ensures that only authorized users can access the application
- **Image capture:** Uses the device’s camera to capture images of real-world objects
- **Text extraction:** Utilizes OpenAI’s GPT-4o to transcribe the captured images into structured JSON data
- **Data storage:** Stores the extracted JSON data in MongoDB for efficient retrieval and searching
- **Data retrieval:** Allows users to search and view the stored documents and their corresponding images
- **AI task pipeline on captured documents:** Uses retrieval-augmented generation (RAG) to take a prompt from the user and operate on existing content to create new generated content (e.g., translate a captured post into four other languages, or create a LinkedIn post from a product summary announcement)
Once you have your cluster created with IP access added to your host, get your connection string and copy it for use later.
Atlas allows you to create full-text search indexes alongside vector search indexes to allow robust and rich searching abilities on your stored documents.
Full-text search allows you to leverage aggregation pipelines and dynamically search your documents with keywords and fuzzy matching on any set of attributes, at any level of nesting.
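To make this concrete, here is a minimal sketch of how such a fuzzy keyword stage could be built as a plain pipeline dict. The `build_fuzzy_search_stage` helper and the `maxEdits` value are illustrative, not part of the app's code; the index name matches the one created below.

```python
def build_fuzzy_search_stage(query: str) -> dict:
    """Build an Atlas Search $search stage that matches `query` against all
    fields, tolerating small typos via the text operator's fuzzy option."""
    return {
        "$search": {
            "index": "search",              # the full-text index created for this app
            "text": {
                "query": query,
                "path": {"wildcard": "*"},  # search every indexed field
                "fuzzy": {"maxEdits": 1},   # allow one character edit per term
            },
        }
    }

# The stage would be the first element of a collection.aggregate([...]) pipeline.
stage = build_fuzzy_search_stage("gin recipe")
```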
Follow our tutorial to create an index with the syntax needed for this application. Apply it to the ocr_documents collection in the ocr_db database, with the index name “search”:

```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "api_key": {
        "type": "string"
      },
      "ocr": {
        "dynamic": true,
        "type": "document"
      }
    }
  }
}
```
Vectors are float-based arrays, created by AI providers (OpenAI in this case), that encode an input as a numerical representation. The vector index enables semantic similarity search between an encoded query (a string or other media) and the stored vectors that represent the encoded content of the database documents.
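To build intuition for what the vector index computes, here is a toy illustration of cosine similarity, the similarity function configured below, on tiny hand-made vectors. Real OpenAI embeddings have 1,536 dimensions; this helper is purely illustrative.

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|); 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented values, for illustration only)
recipe = [0.9, 0.1, 0.2]
drink  = [0.8, 0.2, 0.3]
car    = [0.1, 0.9, 0.1]

print(cosine_similarity(recipe, drink))  # close to 1.0: semantically similar
print(cosine_similarity(recipe, car))    # much lower: semantically distant
```

Atlas Vector Search performs this kind of comparison at scale, returning the stored documents whose embeddings are closest to the query embedding.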
To create it, use the following index definition, with the syntax needed for this application, on the ocr_documents collection in the ocr_db database, with the index name “vector_index”:

```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "api_key",
      "type": "filter"
    }
  ]
}
```
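If you prefer defining the indexes from code rather than the Atlas UI, the same definitions can be kept as plain Python dicts and, against a live Atlas cluster with a recent PyMongo version, passed to `create_search_index`. This is a sketch; the commented calls assume the `collection` handle created later in this article.

```python
# Full-text index definition, mirroring the "search" index JSON above
text_search_definition = {
    "mappings": {
        "dynamic": True,
        "fields": {
            "api_key": {"type": "string"},
            "ocr": {"dynamic": True, "type": "document"},
        },
    }
}

# Vector index definition, mirroring the "vector_index" JSON above
vector_search_definition = {
    "fields": [
        {
            "numDimensions": 1536,  # must match text-embedding-3-small's output size
            "path": "embedding",
            "similarity": "cosine",
            "type": "vector",
        },
        {"path": "api_key", "type": "filter"},  # enables per-user pre-filtering
    ]
}

# With an Atlas connection, the indexes can be created programmatically:
# from pymongo.operations import SearchIndexModel
# collection.create_search_index(
#     SearchIndexModel(definition=text_search_definition, name="search"))
# collection.create_search_index(
#     SearchIndexModel(definition=vector_search_definition,
#                      name="vector_index", type="vectorSearch"))
```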
Once you have the collection and indexes ready, you can build the application artifacts.
Let’s perform the basic steps to get our application running.
```shell
git clone https://github.com/Pash10g/allcr-ai.git
cd allcr-ai
pip install -r requirements.txt
```
Set the following environment variables (for example, in a `.env` file):

```shell
OPENAI_API_KEY=your_openai_api_key
MONGODB_ATLAS_URI=your_mongodb_atlas_uri
```
To run the Streamlit app, use the following command:
```shell
streamlit run app.py
```
The application initializes a global collection instance to use:
```python
# MongoDB connection
client = MongoClient(os.environ.get("MONGODB_ATLAS_URI"))
db = client['ocr_db']
collection = db['ocr_documents']
```
The application begins by prompting the user to enter an API code. This ensures that only authorized users can access the app’s functionalities. The API code is checked against the permitted keys stored in the database.
```python
def auth_form():
    st.write("Please enter the API code to access the application.")
    api_code = st.text_input("API Code", type="password")
    if st.button("Submit"):
        st.toast("Authenticating...", icon="⚠️")
        # auth_collection holds the permitted API keys
        db_api_key = auth_collection.find_one({"api_key": api_code})
        if db_api_key:
            st.session_state.authenticated = True
            st.session_state.api_code = api_code
            st.success("Authentication successful.")
            st.rerun()  # Re-run the script to remove the auth form
        else:
            st.error("Authentication failed. Please try again.")
```
Once authenticated, users can capture images using their device's camera. The app supports images of various real-world objects, such as recipes, documents, animals, vehicles, and products.
Captured images are sent to OpenAI’s GPT-4o for OCR. The model processes each image and extracts the relevant text, which is then structured into JSON. The resulting document includes fields like 'name' and 'type', ensuring that the data is well-organized and ready for storage.
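For a photographed recipe, a cleaned-up response might look like the following. This is an illustrative example, not actual model output; any fields beyond 'name', 'summary', and 'type' vary per image.

```json
{
  "name": "Classic Gin Fizz",
  "summary": "A gin cocktail combining gin, lemon juice, sugar, and soda water.",
  "type": { "user": "recipe", "ai_classified": "cocktail recipe" },
  "ingredients": ["60 ml gin", "30 ml lemon juice", "10 g sugar", "soda water"],
  "steps": [
    "Shake gin, lemon juice, and sugar with ice.",
    "Strain into a glass and top with soda water."
  ]
}
```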
```python
# Function to transform image to text using OpenAI
def transform_image_to_text(image, format):
    img_byte_arr = io.BytesIO()
    image.save(img_byte_arr, format=format)
    img_byte_arr = img_byte_arr.getvalue()
    encoded_image = base64.b64encode(img_byte_arr).decode('utf-8')

    # transcribed_object holds the object type the user selected in the UI
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are an OCR-to-JSON expert looking to transcribe an image. If the type is 'other', please specify the type of object and classify it as you see fit."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Please transcribe this {transcribed_object} into a JSON-only output for MongoDB storage, capturing all data as a single document. Always have a 'name', a 'summary' (for embedding), and a 'type' top-level field (type is a subdocument with 'user' and 'ai_classified'), as well as other fields as you see fit."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encoded_image}"
                    }
                }
            ]
        }]
    )
    extracted_text = response.choices[0].message.content
    return extracted_text
```
The structured JSON data is stored in a MongoDB database. MongoDB’s document-oriented nature makes it an excellent choice for this kind of application, allowing for flexible and efficient storage and retrieval of data. The app uses an OpenAI embedding of the 'name' and 'summary' fields to power semantic search.
```python
def clean_document(document):
    # Remove the ```json and ``` wrappers the model may add around its output
    cleaned_document = document.strip().strip("```json").strip("```").strip()
    return json.loads(cleaned_document)

# Function to save image and text to MongoDB
def save_image_to_mongodb(image, description):
    img_byte_arr = io.BytesIO()
    image.save(img_byte_arr, format=image.format)
    img_byte_arr = img_byte_arr.getvalue()
    encoded_image = base64.b64encode(img_byte_arr).decode('utf-8')

    # Parse the cleaned JSON string into a Python dictionary
    document = clean_document(description)

    response = openai.embeddings.create(
        input=json.dumps({
            'name': document['name'],
            'summary': document['summary']
        }),
        model="text-embedding-3-small"
    )

    gen_embeddings = response.data[0].embedding

    collection.insert_one({
        'image': encoded_image,
        'api_key': st.session_state.api_code,
        'embedding': gen_embeddings,
        'ocr': document,
        'ai_tasks': []
    })
```
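The strip-based cleanup above can fail if the model wraps the JSON in extra prose around the code fence. Here is a slightly more defensive, regex-based alternative; this is a sketch, and the `extract_json` name is ours rather than part of the app.

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Pull a JSON object out of a model reply that may be wrapped in
    ```json fences and/or surrounded by extra prose."""
    # Prefer the fenced block if present; otherwise fall back to the raw string
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", reply, re.DOTALL)
    payload = match.group(1) if match else reply.strip()
    return json.loads(payload)

print(extract_json('Here you go:\n```json\n{"name": "Gin Fizz", "type": "recipe"}\n```'))
# → {'name': 'Gin Fizz', 'type': 'recipe'}
```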
Users can search for stored documents using keywords. The app retrieves matching documents from MongoDB and displays them, along with their corresponding images. This makes it easy to browse through and find specific information.
```python
def search_aggregation(search_query):
    docs = list(collection.aggregate([
        {
            '$search': {
                'index': 'search',
                'compound': {
                    'should': [
                        {
                            'text': {
                                'query': search_query,
                                'path': {
                                    'wildcard': '*'
                                }
                            }
                        }
                    ],
                    'filter': [
                        {
                            'queryString': {
                                'defaultPath': 'api_key',
                                'query': st.session_state.api_code
                            }
                        }
                    ]
                }
            }
        }
    ]))
    return docs

def vector_search_aggregation(search_query, limit):
    query_resp = openai.embeddings.create(
        input=search_query,
        model="text-embedding-3-small"
    )
    query_vec = query_resp.data[0].embedding
    docs = list(collection.aggregate([
        {
            '$vectorSearch': {
                'index': 'vector_index',
                'queryVector': query_vec,
                'path': 'embedding',
                'numCandidates': 20,
                'limit': limit,
                'filter': {
                    'api_key': st.session_state.api_code
                }
            }
        },
        {'$project': {'embedding': 0}}
    ]))
    return docs
```
Additionally, a UI toggle lets us switch between semantic vector search and free-text contextual search.
In both searches, the code applies an extra filter so that only documents tagged with the user's API key are returned.
The application also supports adding additional AI tasks to each document. Here’s how you can extend the functionality:
You can create and save AI tasks on each document using the following functions. These functions allow you to define tasks for the AI to perform on stored JSON documents and save the results back to MongoDB. The flexibility of MongoDB allows us to add the content and present it for record and future reuse.
```python
def get_ai_task(ocr, prompt):
    ## Use the existing document as context and perform another GPT task
    ocr_text = json.dumps(ocr)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a task assistant looking to create a task for the AI to perform on the JSON object. Please return plain output which is only copy-paste, with no explanation."
        },
        {
            "role": "user",
            "content": f"Please perform the following task {prompt} on the following JSON object {ocr_text}. Make sure that the output is straightforward to copy and paste."
        }]
    )

    return response.choices[0].message.content

def save_ai_task(task_id, task_result, prompt):
    collection.update_one(
        {"_id": ObjectId(task_id)},
        {"$push": {"ai_tasks": {'prompt': prompt, 'result': task_result}}}
    )

    return "Task saved successfully."
```
To illustrate the described workflows, I produced the following pictures, in which I scan a gin recipe from a book. The content is captured as a JSON document, and I can now search it via vector or text search, as well as run a task like “generate a non-alcoholic beverage” similar to the original recipe.
This project code can be found in the following GitHub repo which you can deploy yourself by following the README.md file.
This application demonstrates the power and flexibility of integrating MongoDB Atlas, Streamlit, and OpenAI’s GPT-4o to capture, process, and store real-world data. By leveraging these technologies, we can build robust solutions that transform physical information into digital, searchable documents, enhancing accessibility and usability.
The combination of MongoDB Atlas's scalable storage, the PyMongo driver, Streamlit's user-friendly interface, and OpenAI's advanced OCR capabilities offers a comprehensive solution for managing and utilizing real-world data effectively.
If you have any questions or suggestions, feel free to reach out or contribute to the project. Try MongoDB Atlas today and join our forums for further engagement. Happy coding!