Building AI with MongoDB: How Devnagri Brings the Internet to 1.3 Billion People with Machine Translations

Mat Keep

It was on a trip to Japan that Himanshu Sharma, who would later become CEO of Devnagri, made an observation that drew parallels with his native India. Although most of Japan’s population does not speak English, people there were still well served by an internet that was largely written in English. The key was translation, and specifically automated machine translation, which was then in its early days. And so the idea to found Devnagri, India’s first AI-powered translation platform, was born.

“In India, 90% of the population are not fluent in English. That is close to 1.3 billion people. We wanted to bridge this gap to make it easy for non-English speakers to access the internet in their native languages. There are more than 22 Indian languages in use, but they represent just 0.1% of data on the internet,” says Sharma.

“We want to give people the same access to knowledge and education in their native languages so that they can be part of the digital ecosystem. We wanted to help businesses and the government reach real people who were not online because of the language barrier.”

Check out our AI resource page to learn more about building AI-powered apps with MongoDB.

Figure 1: Devnagri’s real-time translation engine helps over 100 Indian brands connect with their customers over digital channels for the first time.

Building India’s first machine translation platform

Sharma and his team at Devnagri have developed an AI-powered translation platform that accepts multiple file formats from different industry domains. Conceptually it is similar to Google Translate, but rather than being a general consumer tool, it focuses on the four industries that together have the largest impact on the everyday lives of Indian citizens: e-learning, banking, e-commerce, and media publishing. Devnagri provides API access to its platform, along with a plug-and-play solution for dynamically translating applications and websites.

As Sharma explains, “Our platform is built on our own custom transformer model based on the MarianNMT neural machine translation framework. We train on corpuses of content in documents, chunking them into sentences and storing them in MongoDB Atlas. We use in-context learning for training, which is further augmented with reinforcement learning from human feedback (RLHF) to further tune for precise accuracy.”
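The chunk-and-store step Sharma describes can be sketched roughly as follows. This is a minimal illustration, not Devnagri’s actual pipeline: the regex-based splitter, the `chunk_into_sentences` and `to_records` helpers, and the field names `doc_id`, `position`, and `source_en` are all assumptions made for the example.

```python
import re

# Naive splitter: break on whitespace that follows '.', '!' or '?'.
# Hypothetical helper for illustration only; a real chunker would need
# to handle abbreviations, decimals, and similar edge cases.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def chunk_into_sentences(text: str) -> list[str]:
    """Split raw document text into trimmed, non-empty sentences."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def to_records(doc_id: str, text: str) -> list[dict]:
    """Build one record per sentence, keeping its order in the document."""
    return [
        {"doc_id": doc_id, "position": i, "source_en": sentence}
        for i, sentence in enumerate(chunk_into_sentences(text))
    ]


# Illustrative sample input, not real Devnagri data.
records = to_records("faq-001", "Open the app. Tap Pay! Is the payment done?")
for record in records:
    print(record)
```

With PyMongo, a list like `records` could then be written to a collection in a single call to `collection.insert_many(records)`.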

Sharma goes on to say, “We run on Google Vertex AI, which handles our MLOps pipeline across both model training and inferencing. We use Google Tensor Processing Units (TPUs) to host our models so we can translate content (such as web pages, PDFs, documentation, web and mobile apps, images, and more) for users on the fly in real time.”

While the custom transformer-based models have served the company well, recent advances in off-the-shelf models are leading Devnagri’s engineers to consider a switch. They are evaluating a move to OpenAI GPT-4 and the Llama-2-7b foundation models, fine-tuned with the past four years of machine translation data captured by Devnagri.

Why MongoDB? Flexibility and performance

MongoDB is used as the database platform for Devnagri’s machine translation models. For each sentence chunk, MongoDB stores the English-language source sentence, the machine translation, and, if applicable, the human-verified translation.
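One way to picture such a record is as a document in which the human-verified translation is simply absent until a reviewer supplies it. The sketch below is an assumption about shape only: the field names (`source_en`, `target_lang`, `machine_translation`, `human_verified_translation`) and both helper functions are hypothetical, and the Hindi strings are illustrative sample data rather than Devnagri output.

```python
from typing import Optional


def make_sentence_doc(
    source_en: str,
    target_lang: str,
    machine_translation: str,
    human_verified: Optional[str] = None,
) -> dict:
    """Build a sentence document; the verified field is added only if present."""
    doc = {
        "source_en": source_en,
        "target_lang": target_lang,
        "machine_translation": machine_translation,
    }
    if human_verified is not None:
        # Optional field: documents in one collection need not share a schema.
        doc["human_verified_translation"] = human_verified
    return doc


def best_translation(doc: dict) -> str:
    """Prefer the human-verified translation, else fall back to the machine one."""
    return doc.get("human_verified_translation", doc["machine_translation"])


machine_only = make_sentence_doc("Welcome", "hi", "स्वागत")
reviewed = make_sentence_doc("Welcome", "hi", "स्वागत", human_verified="स्वागत है")

print(best_translation(machine_only))
print(best_translation(reviewed))
```

Leaving the optional field out, rather than storing an empty placeholder, keeps unreviewed sentences easy to find later, for example with a query on `{"human_verified_translation": {"$exists": False}}`.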

As Sharma explains, “We use the sentences stored in MongoDB to train our models and support real-time inference. The flexibility of its document data model made MongoDB an ideal fit to store the diversity of structured and unstructured content and features our ML models translate.”

“We also exploit MongoDB’s scalable distributed architecture. This allows our models to parallelize read and write requests across multiple nodes in the cloud, dramatically improving training and inference throughput. We get faster time to market with higher quality results by using MongoDB.”

Himanshu Sharma, Devnagri co-founder and CEO

What's next?

Today, Devnagri serves over 100 brands and several government agencies in India. The company has also joined MongoDB’s AI Innovators Program, which gives Devnagri’s data science team free Atlas credits to support further machine translation experiments and development, along with technical guidance and best practices.

If you are building AI-powered apps, the best way to get started is to sign up for an account on MongoDB Atlas. From there, you can create a free MongoDB instance with the Atlas database and Atlas Vector Search, load your own data or our sample data sets, and explore what’s possible within the platform.