Announcing Atlas Data Federation and Atlas Data Lake

Benjamin Flast
June 7, 2022 | Updated: September 9, 2024

>> Announcement: Some features mentioned below will be deprecated on Sep. 30, 2025.

Two years ago, we released the first iteration of Atlas Data Lake. Since then, we’ve helped customers combine data from various storage layers to feed downstream systems. But after years spent studying our customers’ experiences, we realized we hadn’t gone far enough. To truly unleash the genius in all our developers, we needed to add an economical cloud object storage solution with a rich MQL query experience to the world of Atlas. Today, we’re thrilled to announce that our new Atlas Data Federation and Atlas Data Lake offerings do just that.

We now offer two complementary services: Atlas Data Federation (our existing query service, formerly known as Atlas Data Lake) and our new and improved Atlas Data Lake (a fully managed, analytics-oriented storage service). Together, these services (both in preview) provide flexible and versatile options for querying and transforming data across storage services, as well as a MongoDB-native analytic storage solution. With these tools, you can query across multiple clusters, move data into self-managed cloud object storage for consumption by downstream services, query a workload-isolated, inexpensive copy of cluster data, compare your cluster data across different points in time, and much, much more.

In hearing from our customers about their experiences with Atlas Data Lake, we learned where they have struggled, as well as which features they’ve been looking for us to provide. With this in mind, we decided to rename our current query federation service to Atlas Data Federation, to better align with how customers see the service and get value from it. We’ve seen many customers benefit from the flexibility of a federated query engine service, including querying data across multiple clusters, databases, and collections, as well as exporting data to third-party systems.

We also saw where our customers were struggling with data lakes. We heard them ask for a fully managed storage solution so they could achieve all of their analytic goals within Atlas. Specifically, customers wanted scalable storage that would provide high query performance at a low cost. Our new Data Lake provides a high-performance analytic object storage solution, allowing customers to query historical data with no additional formatting or maintenance work needed on their end.

How it works

Atlas Data Federation encompasses our existing Data Lake functionality with several new affordances. It continues to deliver the same power that it always has, with increased performance and efficiency. The new Atlas Data Lake lets you create Data Lake pipelines (based on your Atlas cluster backup schedules) and select the fields on which you want queries to be optimized. The service takes the following steps:

1. On the selected schedule, a copy of your collection is extracted from your Atlas backup with no impact on your cluster.
2. During extraction, we build partition indexes based on the contents of your documents and the fields you’ve selected for optimization. These indexes capture the minimums and maximums (and other stats) of the records in each partition, so a query can quickly find the relevant data and skip the rest.
3. Finally, the underlying data lands in an analytics-oriented format inside cloud object storage. This minimizes the data scanned when you execute a query.
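To make the partition-index idea concrete, here is a minimal sketch in Python of how per-partition minimums and maximums let a query skip irrelevant partitions. It is an illustration of the general technique only, not the service's actual implementation. The `PartitionStats` structure, the `partitions_to_scan` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PartitionStats:
    """Min/max summary for one partition (hypothetical structure)."""
    name: str
    min_value: int
    max_value: int


def partitions_to_scan(partitions, lower, upper):
    """Return the partitions whose [min, max] range overlaps [lower, upper]."""
    return [
        p for p in partitions
        if p.max_value >= lower and p.min_value <= upper
    ]


# Three partitions of an optimized field, each summarized by its min and max.
partitions = [
    PartitionStats("part-0", 0, 99),
    PartitionStats("part-1", 100, 199),
    PartitionStats("part-2", 200, 299),
]

# A range query for values between 120 and 210 can skip part-0 entirely.
selected = partitions_to_scan(partitions, 120, 210)
print([p.name for p in selected])
```

Only the partitions whose value range overlaps the query range are read, which is why choosing good optimization fields reduces the amount of data scanned.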

Once a pipeline has run and a Data Lake dataset has been created, you can select it as a data source in our new Data Federation query experience. You can either set it as the source for a specific virtual collection in a Federated Database Instance or you can have your Federated Database Instance generate a collection name for each dataset that your pipeline has created.
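As a rough sketch of the first option, the Python snippet below builds a storage configuration that maps each dataset to its own virtual collection. The field names (`stores`, `databases`, `collections`, `dataSources`, `storeName`, `datasetName`) and the `dls:aws` provider value are recalled from the Atlas Data Federation storage configuration format and should be checked against the current documentation. The helper name, the store, database, and collection names, and the dataset names are all placeholders.

```python
import json


def build_storage_config(store_name, database, dataset_names):
    """Map each Data Lake dataset to one virtual collection (illustrative)."""
    collections = [
        {
            # Hypothetical naming scheme for the virtual collections.
            "name": f"snapshot_{index}",
            "dataSources": [
                {"storeName": store_name, "datasetName": dataset}
            ],
        }
        for index, dataset in enumerate(dataset_names)
    ]
    return {
        # Assumed provider value for a Data Lake dataset store.
        "stores": [{"name": store_name, "provider": "dls:aws"}],
        "databases": [{"name": database, "collections": collections}],
    }


config = build_storage_config(
    store_name="dataLakeStore",               # placeholder
    database="analytics",                     # placeholder
    dataset_names=["dataset-a", "dataset-b"],  # placeholder dataset names
)
print(json.dumps(config, indent=2))
```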

Amazingly, no part of this process consumes compute resources from your cluster: neither the export nor the querying of datasets does. These datasets provide workload isolation and consistency for long-running analytic queries, serve as a target for ETL jobs using the powerful $out to S3, and make it easy to compare the state of your data over time.
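For readers who have not used $out to S3, the sketch below shows one way such an ETL pipeline might be assembled in Python. The `$out` stage shape (`s3.bucket`, `s3.region`, `s3.filename`, `s3.format.name`) follows the Data Federation `$out` syntax as we understand it. The helper name, bucket, region, filename prefix, and filter are hypothetical.

```python
def build_export_pipeline(bucket, region, filename, match=None):
    """Build an aggregation pipeline that ends in a $out-to-S3 stage."""
    stages = []
    if match:
        # Optional filter so that only the matching documents are exported.
        stages.append({"$match": match})
    stages.append({
        "$out": {
            "s3": {
                "bucket": bucket,
                "region": region,
                "filename": filename,
                "format": {"name": "parquet"},
            }
        }
    })
    return stages


pipeline = build_export_pipeline(
    bucket="my-analytics-bucket",   # hypothetical bucket name
    region="us-east-1",             # hypothetical region
    filename="exports/orders/",     # hypothetical filename prefix
    match={"status": "shipped"},    # hypothetical filter
)
print(pipeline)
```

With PyMongo, a pipeline like this would typically be passed to `collection.aggregate(pipeline)` on a connection to a Federated Database Instance. The stage is not expected to work against a regular cluster connection.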

Advanced though this is, it’s only the beginning of the story. We’re committed to evolving the service, improving performance, adding more sources of data, and building new features. All of this will be based on the feedback that you, the user, give us. We can’t wait to see how you’ll use this powerful new tool and to hear what you’d like to see next.
