Building AI with MongoDB: How Devnagri Brings the Internet to 1.3 Billion People with Machine Translations
It was while on a trip to Japan that Himanshu Sharma, later to become CEO of Devnagri, made an observation that drew parallels with his native India. Although the majority of Japan’s population does not speak English, people there were still well served by an internet that was largely based on the English language. The key was translation, and specifically the early forms of automated machine translation. And so the idea to found Devnagri, India’s first AI-powered translation platform, was born.
“In India, 90% of the population are not fluent in English. That is close to 1.3 billion people. We wanted to bridge this gap to make it easy for non-English speakers to access the internet in their native languages. There are more than 22 Indian languages in use, but they represent just 0.1% of data on the internet,” says Sharma.
“We want to give people the same access to knowledge and education in their native languages so that they can be part of the digital ecosystem. We wanted to help businesses and the government reach real people who were not online because of the language barrier.”
Check out our AI resource page to learn more about building AI-powered apps with MongoDB.
Building India’s first machine translation platform
Sharma and his team at Devnagri have developed an AI-powered translation platform that can accept multiple file formats from different industry domains. Conceptually, it is similar to Google Translate, but rather than being a general consumer tool, it focuses on the four key industries that together have the largest impact on the everyday lives of Indian citizens: e-learning, banking, e-commerce, and media publishing. Devnagri provides API access to its platform and a plug-and-play solution for dynamically translating applications and websites.
As Sharma explains, “Our platform is built on our own custom transformer model based on the MarianNMT neural machine translation framework. We train on corpuses of content in documents, chunking them into sentences and storing them in MongoDB Atlas. We use in-context learning for training, which is further augmented with reinforcement learning from human feedback (RLHF) to further tune for precise accuracy.”
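To make the chunking step concrete, here is a minimal Python sketch of splitting a document into sentences before storage. It is not Devnagri's pipeline: the function name `chunk_into_sentences` and the punctuation-based splitting rule are assumptions chosen for illustration, and a production system would likely use a language-aware sentence segmenter.

```python
import re
from typing import List

# Assumption for illustration: a sentence ends at ".", "!" or "?"
# followed by whitespace. Real text (abbreviations, decimals, other
# scripts) needs a proper sentence segmenter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def chunk_into_sentences(text: str) -> List[str]:
    """Split raw document text into trimmed, non-empty sentence chunks."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


if __name__ == "__main__":
    sample = "Open a savings account. Is there a fee? No!"
    for sentence in chunk_into_sentences(sample):
        print(sentence)
```

Each chunk produced this way becomes one unit of training data, which is why sentence-level storage is a natural fit for the approach Sharma describes.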
Sharma goes on to say, “We run on Google Vertex AI, which handles our MLops pipeline across both model training as well as inferencing. We use Google Tensor Processing Units (TPUs) to host our models so we can translate content — such as web pages, PDFs, documentation, web and mobile apps, images, and more — for users on the fly in real-time.”
While the custom transformer-based models have served the company well, recent advancements in off-the-shelf models are leading Devnagri’s engineers to consider a switch. They are evaluating a move to OpenAI GPT-4 and the Llama-2-7b foundation models, fine-tuned with the past four years of machine translation data captured by Devnagri.
Why MongoDB? Flexibility and performance
MongoDB is used as the database platform for Devnagri’s machine translation models. For each sentence chunk, MongoDB stores the source English language version, the machine translation, and if applicable, the human-verified sentence translation.
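As a rough illustration of what one such sentence record might look like, the sketch below builds a document as a plain Python dictionary. The field names (`source_en`, `machine_translation`, `human_verified_translation`) and the `target_lang` field are hypothetical and not taken from Devnagri's schema; the point is only that the optional human-verified translation can be omitted when it does not exist, which a flexible document model allows.

```python
from typing import Any, Dict, Optional


def build_sentence_record(
    source_en: str,
    machine_translation: str,
    target_lang: str,
    human_verified_translation: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one sentence-chunk document (all field names are hypothetical)."""
    record: Dict[str, Any] = {
        "source_en": source_en,                      # source English sentence
        "target_lang": target_lang,                  # assumed field, e.g. "hi" for Hindi
        "machine_translation": machine_translation,  # model output
    }
    # Only store the human-verified version when a reviewer has supplied one.
    if human_verified_translation is not None:
        record["human_verified_translation"] = human_verified_translation
    return record


if __name__ == "__main__":
    unverified = build_sentence_record("Hello", "नमस्ते", "hi")
    verified = build_sentence_record("Hello", "नमस्ते", "hi", "नमस्ते")
    print(unverified)
    print(verified)
```

With PyMongo, dictionaries shaped like these could be passed to `Collection.insert_one` or `Collection.insert_many`; the connection details and collection name would depend on the deployment.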
As Sharma explains, “We use the sentences stored in MongoDB to train our models and support real-time inference. The flexibility of its document data model made MongoDB an ideal fit to store the diversity of structured and unstructured content and features our ML models translate.”
“We also exploit MongoDB’s scalable distributed architecture. This allows our models to parallelize read and write requests across multiple nodes in the cloud, dramatically improving training and inference throughput. We get faster time to market with higher quality results by using MongoDB.”
Himanshu Sharma, Devnagri co-founder and CEO
What's next?
Today Devnagri serves over 100 brands and several government agencies in India. The company has also joined MongoDB’s AI Innovators Program. The program provides its data science team with access to free Atlas credits to support further machine translation experiments and development, along with access to technical guidance and best practices.
Head over to our quick-start guide to get started with Atlas Vector Search today.