
Machine Learning Research Group

Research in machine learning focused on information retrieval, application development and modernization, and database performance.
Who We Are
The Machine Learning team focuses on representation learning, language models, and learning on semi-structured data. It provides MongoDB with deep technical knowledge of the latest developments in machine learning.
Research Areas

Representation learning

MongoDB conducts research on running embedding models on resource-constrained infrastructure and on automated fine-tuning approaches.


Learning on semi-structured data

Models that natively accept semi-structured data as input are useful for tasks such as supervised learning, database index recommendation, and cardinality estimation.


Language models for code generation

MongoDB aims to improve the capabilities of language models in the domain of code generation to support application modernization.

Our Team

Robin Vujanic

Staff Research Engineer

Robin's research focuses on embedding models, particularly techniques for improving computational efficiency.

Research Papers

LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

Robin Vujanic, Thomas Rückstieß

arXiv, 2025

We introduce a knowledge distillation technique that produces embedding models retaining up to 97% of their teacher's performance, while being 5x-15x smaller, 7x-24x faster, and fully compatible with their teacher for flexible deployment.
 
 
Read the paper | Models | Blog
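
For a rough sense of what aligning a student with its teacher could look like, the sketch below computes a mean-squared-error loss between student and teacher embeddings of the same texts. It is an illustration under stated assumptions, not the LEAF training objective as published: the function name `alignment_loss` and the toy vectors are hypothetical, and both models are assumed to produce vectors of the same dimension.

```python
from typing import Sequence


def alignment_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between student and teacher embeddings.

    Hypothetical helper: assumes both models embed the same texts into
    vectors of the same dimension, so a low loss means the student's
    outputs sit close to the teacher's in the teacher's embedding space.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must embed the same number of texts")
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("embedding dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no embedding values to compare")
    return total / count


# Toy embeddings for two texts (made-up numbers, for illustration only).
teacher_emb = [[1.0, 0.0], [0.0, 1.0]]
student_emb = [[0.5, 0.0], [0.0, 1.0]]
loss = alignment_loss(student_emb, teacher_emb)
```

In a setup like this, a student whose embeddings stay close to the teacher's could in principle be compared directly against teacher-produced vectors, which is one way to read "compatible with their teacher".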

ORIGAMI: A generative transformer architecture for predictions from semi-structured data

Thomas Rückstieß, Alana Huang, Robin Vujanic

arXiv, 2024

ORIGAMI enables efficient end-to-end learning on semi-structured JSON data.
 
 
Read the paper | GitHub Repository
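
As a small illustration of what "semi-structured JSON data" looks like as model input, the sketch below flattens a nested JSON document into a sequence of key-path and value pairs. This is a generic preprocessing idea shown under assumptions, not ORIGAMI's actual tokenization; the function name `flatten` and the sample document are hypothetical.

```python
import json
from typing import Any, Iterator, Tuple


def flatten(value: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (key path, leaf value) pairs from a parsed JSON value."""
    if isinstance(value, dict):
        # Descend into objects, extending the path with the field name.
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        # Descend into arrays, extending the path with the element index.
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        # Leaf: string, number, boolean, or null.
        yield path, value


# Hypothetical sample document, for illustration only.
doc = json.loads('{"name": "Ada", "tags": ["a", "b"], "address": {"city": "SF"}}')
pairs = list(flatten(doc))
```

Each pair keeps the structural context of a value (which field, which array position), which is the kind of information a flat table of columns would lose for nested or irregular documents.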