Machine Learning Research Group

Research in machine learning focused on information retrieval, application development and modernization, and database performance.

Who We Are

The Machine Learning team focuses on representation learning, language models, and learning on semi-structured data. It provides MongoDB with deep technical knowledge of the latest developments in machine learning.

Research Areas

Representation learning

MongoDB conducts research on two fronts: enabling embedding models to run on resource-constrained infrastructure, and automating the fine-tuning of those models.

Learning on semi-structured data

Models that natively accept semi-structured data as input are useful for tasks such as supervised learning, database index recommendation, and cardinality estimation.
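To make "semi-structured data as input" concrete, here is a minimal sketch of one common preprocessing idea: turning a nested JSON-like document into a flat sequence of (key path, value) pairs that a model could consume. This is an illustration only. The `flatten` helper and the example document are hypothetical and are not taken from MongoDB's models or papers.

```python
from typing import Any, Iterator


def flatten(doc: Any, prefix: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (key_path, leaf_value) pairs from a nested dict/list document.

    Hypothetical helper for illustration; empty dicts and lists yield nothing.
    """
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(value, path)
    elif isinstance(doc, list):
        for index, value in enumerate(doc):
            path = f"{prefix}.{index}" if prefix else str(index)
            yield from flatten(value, path)
    else:
        # Leaf value (string, number, boolean, or None).
        yield prefix, doc


# Made-up example document with nested and array-valued fields.
example = {"user": {"name": "Ada", "tags": ["a", "b"]}, "age": 36}
pairs = list(flatten(example))
```

Documents with different shapes produce sequences of different lengths, which is one reason fixed-width tabular models fit this kind of data poorly.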

Language models for code generation

MongoDB aims to improve the capabilities of language models in the domain of code generation to support application modernization.

Our Team

Robin Vujanic

Staff Research Engineer

Robin's research focuses on embedding models, particularly techniques for improving computational efficiency.

Research Papers

LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

Robin Vujanic, Thomas Rückstieß

arXiv, 2025

We introduce a knowledge distillation technique that produces embedding models retaining up to 97% of their teacher's performance, while being 5x-15x smaller, 7x-24x faster, and fully compatible with their teacher for flexible deployment.

Read the paper | Models | Blog
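As a rough illustration of what distilling an embedding model can look like, the sketch below trains a toy student to reproduce a frozen teacher's embeddings by minimizing the squared distance between them. It is a generic alignment objective under stated assumptions, not the LEAF training recipe, which is described in the paper. The random data, the linear teacher and student, and names such as `alignment_loss` are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): 256 inputs with 16 features each, and a frozen
# "teacher" that is just a fixed linear map to 8-dimensional embeddings.
inputs = rng.normal(size=(256, 16))
teacher_weights = rng.normal(size=(16, 8))
teacher_embeddings = inputs @ teacher_weights

# The student maps into the same 8-dimensional space as the teacher.
# In practice it would be a smaller neural network, not a linear map.
student_weights = np.zeros((16, 8))


def alignment_loss(weights: np.ndarray) -> float:
    """Mean squared L2 distance between student and teacher embeddings."""
    diff = inputs @ weights - teacher_embeddings
    return float(np.mean(np.sum(diff ** 2, axis=1)))


initial_loss = alignment_loss(student_weights)

learning_rate = 0.1
for _ in range(200):
    diff = inputs @ student_weights - teacher_embeddings
    # Gradient of the loss above with respect to the student weights.
    grad = 2.0 * inputs.T @ diff / inputs.shape[0]
    student_weights -= learning_rate * grad

final_loss = alignment_loss(student_weights)
```

Because the student is trained to land in the teacher's embedding space, its outputs can be compared directly against the teacher's, which is one plausible reading of the compatibility described above.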

ORIGAMI: A generative transformer architecture for predictions from semi-structured data

Thomas Rückstieß, Alana Huang, Robin Vujanic

arXiv, 2024

ORIGAMI enables efficient end-to-end learning on semi-structured JSON data.

Read the paper | GitHub Repository