BlogAtlas Vector Search voted most loved vector database in 2024 Retool State of AI reportLearn more >>
MongoDB Developer
Atlas
plus
Sign in to follow topics
MongoDB Developer Centerchevron-right
Developer Topicschevron-right
Productschevron-right
Atlaschevron-right

Capturing and Storing Real-World Optics With MongoDB Atlas, OpenAI GPT-4o, and PyMongo

Pavel Duchovny7 min read • Published Jun 03, 2024 • Updated Jun 03, 2024
AIPythonAtlas
FULL APPLICATION
Facebook Icontwitter iconlinkedin icon
Rate this article
star-empty
star-empty
star-empty
star-empty
star-empty
Every time OpenAI posts news about a new AI GPT model, I get excited. I was building demos with OpenAI GPT APIs from 2021 (pretty much when they were first released). Reminiscing on my first article, I can’t believe what a huge milestone GPT-4o is with its “Omni” media abilities. It allows users to be flexible and free, with text, image, and audio inputs working together with one API endpoint so any intelligent task can be performed.
MongoDB has been known for its ability to store flexible data streams and JSON structures for years, leveraged by millions of users. So, it's not surprising to me that mixing MongoDB Atlas and Atlas Vector Search with GPT-4o on texts and images, captured by a simple web app, is so powerful and amazing.
In this article, we explore an innovative way to capture and store real-world data using MongoDB, GPT-4o, and the PyMongo driver within a Streamlit app. We’ll walk through the development of an application that transforms captured images into searchable JSON documents, making use of OpenAI’s powerful GPT-4o for OCR capabilities. This project is an excellent demonstration of how to integrate various technologies to solve practical problems in a streamlined and efficient manner.

Introduction

Real-world objects such as recipes, documents, animals, and vehicles often contain valuable information that can be digitized for easier access and analysis. By combining the capabilities of MongoDB, Streamlit, and OpenAI, we can build an application that captures images, extracts text, and stores the information in a MongoDB database. This approach allows for efficient storage, retrieval, and searching of the digitized data.
Demo of the OCR

Key technologies

MongoDB Atlas: A flexible, scalable, and document-oriented database that is perfect for storing JSON-like documents
PyMongo: MongoDB’s robust Python driver, serving as the access point to operational and vector queries
OpenAI GPT-4o: A state-of-the-art new language model capable of understanding multiple media channels inputs (text, images, and audio) and generating human-like text or images, which we will use here for Optical Character Recognition (OCR)

Application workflow

User authentication: Ensures that only authorized users can access the application
Image capture: Uses the device’s camera to capture images of real-world objects
Text extraction: Utilizes OpenAI’s GPT-4o to transcribe the captured images into structured JSON data
Data storage: Stores the extracted JSON data in MongoDB for efficient retrieval and searching
Data retrieval: Allows users to search and view the stored documents and their corresponding images
Pipeline AI task on captured documents: Uses retrieval-augmented generation (RAG) to get a prompt from a user, and allows operating on existing content to create new generated content (e.g., translate a captured post to four other languages, create a LinkedIn post from a product summary announcement)

Building the application

If you haven’t done so already, register for MongoDB Atlas and create a cluster.
Once you have your cluster created with IP access added to your host, get your connection string and copy it for use later.
Atlas allows you to create full-text search indexes alongside vector search indexes to allow robust and rich searching abilities on your stored documents.
Full-text search allows you to leverage aggregations and dynamically search your documents based on keywords and fuzzy logic on any set of attributes at any level.
Follow our tutorial to create an index with the needed syntax for this application. Apply it to a collection in the database ocr_db and the collection ocr_documents, with the index name “search”:
Vectors are float based arrays created by AI providers, like OpenAI in this case, that represent the encoded inputs as a numerical vector. The vector index helps it to search semantic-based similarity of an encoded query/string/media, with the stored vectors representing the encoded content on the database documents.
To create it, use the following index, with the needed syntax for this application on the database ocr_db and the collection ocr_documents, index name “vector_index”:
Once you have the collection and indexes ready, you can build the application artifacts.

Setting up the Environment

Let perform the basics steps to get our application running.

1. Clone the repository and install the required packages

2. Set up your environment variables in the terminal

Running the application

To run the Streamlit app, use the following command:
Open your web browser and go to http://localhost:8501 to access the application.

3. MongoDB connection

The application initializes a global collection instance to use:

User authentication

The application begins by prompting the user to enter an API code. This ensures that only authorized users can access the app’s functionalities. The API code is checked against the database''s stored permitted keys.

4. Capturing Images

Once authenticated, users can capture images using their device's camera. The app supports images of various real-world objects, such as recipes, documents, animals, vehicles, and products.

5. Extracting text from images

Captured images are sent to OpenAI’s GPT-4o for OCR. The model processes the images and extracts relevant text which is then structured into a JSON format. This JSON document includes fields like 'name' and 'type', ensuring that the data is well-organized and ready for storage.

6. Storing data in MongoDB

The structured JSON data is stored in a MongoDB database. MongoDB’s document-oriented nature makes it an excellent choice for this kind of application, allowing for flexible and efficient storage and retrieval of data. It uses OpenAI embedding to embed summarized fields and names for semantic search.

7. Searching and displaying documents

Users can search for stored documents using keywords. The app retrieves matching documents from MongoDB and displays them, along with their corresponding images. This makes it easy to browse through and find specific information.
Additionally, with the use of a UI toggle, we can switch to the semantic vector search or the free text contextual search.
In both searches, the code performs extra filtering to return only documents with the user's tagged APIkey.

8. Applying AI tasks on captured documents

The application also supports adding additional AI tasks to each document. Here’s how you can extend the functionality:

AI task pipeline

You can create and save AI tasks on each document using the following functions. These functions allow you to define tasks for the AI to perform on stored JSON documents and save the results back to MongoDB. The flexibility of MongoDB allows us to add the content and present it for record and future reuse.

Putting it all together

To illustrate the abilities of the described workflows, I have produced the following pictures where I scan a gin recipe from a book. The content is being captured as a JSON document and now I can search it via vector or text searches, as well as produce a task like “generating a non-alcoholic beverage" similar to the original recipe.
Initial OCR
AI task on document code

Try it yourself

This project code can be found in the following GitHub repo which you can deploy yourself by following the README.md file.

Conclusion

This application demonstrates the power and flexibility of integrating MongoDB Atlas, Streamlit, and OpenAI’s GPT-4o to capture, process, and store real-world data. By leveraging these technologies, we can build robust solutions that transform physical information into digital, searchable documents, enhancing accessibility and usability.
The combination of MongoDB Atlas's scalable storage, the PyMongo driver, Streamlit's user-friendly interface, and OpenAI's advanced OCR capabilities offer a comprehensive solution for managing and utilizing real-world data effectively.
If you have any questions or suggestions, feel free to reach out or contribute to the project. Try [MongoDB Atlas today]((https://www.mongodb.com/try) and join our forums for further engagement. Happy coding!

Facebook Icontwitter iconlinkedin icon
Rate this article
star-empty
star-empty
star-empty
star-empty
star-empty
Related
Article

A Free REST API for Johns Hopkins University COVID-19 dataset


Nov 16, 2023 | 5 min read
Tutorial

How to Query from Multiple MongoDB Databases Using MongoDB Atlas Data Federation


Jan 23, 2024 | 7 min read
Tutorial

Unlocking Semantic Search: Building a Java-Powered Movie Search Engine with Atlas Vector Search and Spring Boot


Jul 01, 2024 | 10 min read
Video

The Atlas Search 'cene: Season 1


Dec 15, 2023 | 2 min
Table of Contents