ORiGAMi: A Machine Learning Architecture for the Document Model
The document model has proven to be an excellent fit for modern application data. At MongoDB, we've long understood that semi-structured data formats like JSON offer superior expressiveness compared to traditional tabular and relational representations. Their flexible schema accommodates dynamic and nested data structures, naturally representing complex relationships between data entities.
However, the machine learning (ML) community has faced persistent challenges when working with semi-structured formats. Traditional ML algorithms, as implemented in popular libraries like scikit-learn and pandas, operate on the assumption of fixed-dimensional tabular data consisting of rows and columns. This fundamental mismatch forces data scientists to manually convert JSON documents into tabular form, a time-consuming process that requires significant domain expertise.
Recent advances in natural language processing (NLP) demonstrate the power of Transformers in learning from unstructured data, but their application to semi-structured data has been understudied. To bridge this gap, MongoDB's ML research group has developed a novel Transformer-based architecture designed for supervised learning on semi-structured data (e.g., JSON data in a document model database).
We call this new architecture ORiGAMi (Object Representation through Generative, Autoregressive Modelling), and we're excited to make it available to the community at github.com/mongodb-labs/origami. It includes components that make training a Transformer model feasible on datasets with as few as 200 labeled samples. By combining this data efficiency with the flexibility of Transformers, ORiGAMi enables prediction directly from semi-structured documents, without the cumbersome flattening and manual feature extraction required for tabular data representations. You can read more about our model on arXiv.
Technical innovation
The key insight behind ORiGAMi lies in its tokenization strategy: documents are transformed into sequences of key-value pairs and special structural tokens that encode nested types like arrays and subdocuments.
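For illustration, here is a minimal Python sketch of the idea. The structural token names (DOC_START, ARRAY_START, and so on) are ours, chosen for readability; ORiGAMi's actual vocabulary is defined in the paper and repository.

def tokenize(value, key=None):
    # Recursively flatten a JSON value into a token sequence.
    # Structural tokens mark the boundaries of nested types.
    tokens = ["KEY:" + key] if key is not None else []
    if isinstance(value, dict):
        tokens.append("DOC_START")
        for k, v in value.items():
            tokens.extend(tokenize(v, key=k))
        tokens.append("DOC_END")
    elif isinstance(value, list):
        tokens.append("ARRAY_START")
        for item in value:
            tokens.extend(tokenize(item))
        tokens.append("ARRAY_END")
    else:
        tokens.append("VALUE:" + str(value))
    return tokens

tokenize({"plan": "pro", "features_used": ["analytics", "api_access"]})
# ['DOC_START', 'KEY:plan', 'VALUE:pro', 'KEY:features_used',
#  'ARRAY_START', 'VALUE:analytics', 'VALUE:api_access', 'ARRAY_END', 'DOC_END']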
These token sequences serve as input to the Transformer model, which is trained to predict the next token given a portion of the document, similar to how large language models (LLMs) are trained on text tokens.
What’s more, our modifications to the standard Transformer architecture include guardrails to ensure that the model only generates valid, well-formed documents, and a novel position encoding strategy that respects the order invariance of key-value pairs in JSON. These modifications also allow for much smaller models compared to LLMs, which can thus be trained on consumer hardware in minutes to hours depending on dataset size and complexity, versus days to weeks for LLMs.
By reformulating classification as a next-token prediction task, ORiGAMi can predict any field within a document, including complex types like arrays and nested subdocuments. This unified approach eliminates the need for separate models or preprocessing pipelines for different prediction tasks.
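To make that concrete, here is a sketch of the idea, reusing the tokenize function above. Both predict_field and model.generate are hypothetical stand-ins, not ORiGAMi's actual API: serialize the document without the target field, append the target key, and let the model decode the value.

def predict_field(model, document, target_field):
    # Prompt with everything except the target, then append the target key
    # so that the model's next tokens are the predicted value.
    prompt_doc = {k: v for k, v in document.items() if k != target_field}
    prompt = tokenize(prompt_doc)[:-1]      # drop the trailing DOC_END
    prompt.append("KEY:" + target_field)    # cue the model for the value
    return model.generate(prompt)           # autoregressive decoding (hypothetical)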
Example use case
Our initial focus has been supervised learning: training models from labeled data to make predictions on unseen documents. Let's explore a practical example of user segmentation.
Consider a collection where each document represents a user profile, containing both simple fields and complex nested structures:
{
  "_id": "user_7842",
  "email": "sarah.chen@example.com",
  "signup_date": "2024-01-15",
  "device_history": [
    {
      "device": "mobile_ios",
      "first_seen": "2024-01-15",
      "last_seen": "2024-02-11"
    },
    {
      "device": "desktop_chrome",
      "first_seen": "2024-01-16",
      "last_seen": "2024-02-10"
    }
  ],
  "subscription": {
    "plan": "pro",
    "billing_cycle": "annual",
    "features_used": ["analytics", "api_access", "team_sharing"],
    "usage_metrics": {
      "storage_gb": 45.2,
      "api_calls_per_day": 1250,
      "active_projects": 8
    }
  },
  "user_segment": "enterprise_power_user"  // <-- target field
}
Suppose you want to automatically classify users into segments like "enterprise_power_user", "smb_growth", or "early_stage_startup" based on their behavior and characteristics. Some documents in your collection already have correct labels, perhaps assigned through manual analysis or customer interviews.
Traditional ML approaches would require flattening this rich document structure, leading to very sparse tables and potentially losing important hierarchical relationships. With ORiGAMi, you can:
Train directly on the raw documents with existing labels (see the sketch after this list)
Preserve the full context of nested structures and arrays
Make predictions for the "user_segment" field on new users immediately after signup
Update predictions as user behavior evolves without rebuilding feature pipelines
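As a rough sketch of that workflow, it could look something like the following. The pymongo calls are real; the origami training and prediction calls are hypothetical placeholders for the actual Python API, which the repository documents.

from pymongo import MongoClient
import origami  # hypothetical package name; see the repo for the real interface

# Pull labeled user profiles straight from MongoDB; no flattening step.
client = MongoClient("<mongo-uri>")
labeled = list(client["app"]["users"].find({"user_segment": {"$exists": True}}))

# Hypothetical training and prediction calls, for illustration only.
model = origami.train(labeled)
new_signup = client["analytics"]["signups"].find_one()
segment = origami.predict(model, new_signup, target="user_segment")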
Getting started with ORiGAMi
We're excited to be open-sourcing ORiGAMi (github.com/mongodb-labs/origami), and you can read more about our model on arXiv. We've also included a command-line interface that lets users make predictions without writing any code.
Training a model is as simple as pointing ORiGAMi to your MongoDB collection (here, the users collection in the app database):
origami train <mongo-uri> -d app -c users
Once trained, you can generate predictions and seamlessly integrate them back into your MongoDB workflow. For example, to predict user segments for new signups (from the analytics.signups collection) and write the resulting predictions back into an analytics.predicted collection:

origami predict <mongo-uri> -d analytics -c signups --target user_segment --json | mongoimport -d analytics -c predicted
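To sanity-check the round trip, you can query the target collection with pymongo (assuming the imported documents carry the predicted user_segment field):

from pymongo import MongoClient

client = MongoClient("<mongo-uri>")
for doc in client["analytics"]["predicted"].find().limit(3):
    print(doc.get("user_segment"))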
For those looking to dive deeper, we've also included several Jupyter notebooks in the repository that demonstrate advanced features and customization options. Model performance can often be improved further by tuning the hyperparameters.
We're just scratching the surface of what's possible with document-native machine learning, and have many more use cases in mind. We invite you to explore the repository, contribute to the project, and share how you use ORiGAMi to solve real-world problems.
Head over to the ORiGAMi GitHub repo, play around with it, and tell us about new ways of applying it and problems it’s well-suited to solving.
March 11, 2025