EventJoin us at AWS re:Invent 2024! Learn how to use MongoDB for AI use cases. Learn more >>

What is a Distributed Database?

Since 2020, the amount of data created, captured, copied, and consumed worldwide has almost doubled — that's an increase of 55.8 zettabytes in just three years!

The volume of data/information created, captured, copied, and consumed worldwide in zettabytes.

(Source: Statista.com, 2023)


As a result, organizations are actively seeking flexible, scalable, high-performance data storage solutions that can accommodate various types of data (e.g., structured, unstructured) while maintaining a high degree of data integrity. Further, given the prevalence of users accessing that data from a variety of mobile devices, organizations need solutions that support multiple users simultaneously while also ensuring strong data security and availability. These are just a few of the reasons that distributed databases are continuing to grow in popularity and are the solution of choice for many global organizations.

In this distributed database guide, we'll explore the different types of distributed databases, how they work, distributed database architecture, as well as the benefits and challenges that distributed databases offer.


Table of contents

What is a distributed database?

In the most basic terms, a distributed database is a database that stores data in multiple locations instead of one location. This means that rather than putting all data on one server or on one computer, data is placed on multiple servers or in a cluster of computers consisting of individual nodes. These nodes are oftentimes geographically separate and may be physical computers or virtual machines within a cloud database.


Illustration of cluster and nodes


Illustration of cluster and nodes


The MongoDB cluster example above is one of many configuration possibilities available when creating a distributed database. However, unlike traditional centralized databases, all distributed databases share the common characteristic of spreading data across multiple locations (physical and/or virtual) which improves data resiliency and availability. Sharding the data across multiple locations allows for horizontal scaling as well.


Distributed database types

There are two distinct types of distributed databases: homogeneous databases and heterogeneous databases.


Homogeneous distributed databases

In a homogeneous distributed database, the machines, nodes, servers, or sites store the same data, use the same data model, work with the same operating system, and share the same distributed database management system (DDBMS) or occasionally multiple types of DDBMS from the same vendor.

Within homogenous distributed databases, there are two subsets: autonomous and non-autonomous.

  • Autonomous distributed databases: In an autonomous distributed database, nodes work on their own with their own complete set of data, only requiring an application to facilitate universal updates across all nodes or messaging between nodes.
  • Non-autonomous distributed databases: In non-autonomous distributed databases, nodes rely on a centralized database management system (DBMS) to coordinate data distribution, communications, and all updates.

As a rule, homogeneous distributed databases offer significant data protection through redundancy and simplified management due to the similarity of all nodes.


Heterogeneous distributed databases

In a heterogeneous distributed database, different machines or sites may house different data sets, use different operating systems, contain different data schemas, and require software to facilitate communication between machines. Further, different sites may not even be aware of the existence of other sites.

Within heterogeneous distributed databases, there are two subsets: federated and unfederated.

  • Federated distributed databases: In a federated distributed database, multiple nodes — which are able to function completely on their own and may contain different data — can work together and function as one entity. This means that when a query occurs, the system determines which node is best equipped to respond and passes the query appropriately. This process is sometimes referred to as data virtualization.
  • Unfederated distributed databases: In an unfederated distributed database, each node operates individually and there is a central application that manages the access to each database in each node.

While more complex to manage, heterogeneous distributed databases offer more flexibility in terms of data models, schema choices, and the types of data that can be stored than homogeneous distributed databases.

How do distributed databases work?

As previously discussed, nodes are individual servers or computers that reside within a distributed database system (e.g., computers, virtual machines, servers that share no physical components). Each node stores a set of data and runs on distributed database management system software (DDBMS). To determine which data will be stored amongst which nodes, the concept of data distribution must be considered.


Data distribution

Proper data distribution is critical to the efficiency, security, and optimal user access in a distributed database. This process, sometimes referred to as data partitioning, can be accomplished using two different methods.

  • Horizontal partitioning: Horizontal partitioning involves splitting data tables into rows across multiple nodes.
  • Vertical partitioning: Vertical partitioning splits tables into columns across multiple nodes.

Illustration of the difference between horizontal and vertical partitioning.

(Source: Hazelcast.com, 2023)


The resulting data sets from horizontal or vertical partitioning of the original table are sometimes referred to as shards.

Distributed database system communication

While nodes are able to fully function on their own, it is necessary for them to communicate with other nodes as well since, unlike centralized databases, they do not share the same physical components or even the same data sets. There are three types of distributed database communication:

  • Broadcast communication: One message is sent to all other nodes within the distributed database system.
  • Multicast communication: One message is sent to some but not all other nodes within the distributed database system.
  • Unicast communication: A message is sent from an individual node to one other individual node.

Transaction management

Distributed databases must often support distributed transactions, where one transaction can involve more than one node. This support methodology is highlighted in the ACID properties (atomicity, consistency, isolation, durability) of transactions across distributed database systems. Key elements of ACID properties include:

A description of the ACID properties. (Source: Dev.to, 2020)


  • Atomicity means that a transaction is treated as a single unit. This also means that either a complete transaction is available for storage or it's rejected as an error which ensures data integrity.

  • Consistency is maintained in distributed database systems by enforcing predefined rules and data constraints. If the state, nature, or content of a transaction violates these rules, the transaction will not be ingested and stored in the distributed system.

  • Isolation involves the separation of each transaction from the other transactions to prevent data conflicts and maintain data integrity. In addition, this benefits operations when managing multiple distributed data records that may exist across local data stores, virtual machines via cloud computing, and multiple database nodes which may be located across multiple sites.

  • Durability ensures that stored data is preserved in the event of a system failure. There are a variety of ways that a transactional distributed database management system accomplishes this task, including:


Fault tolerance

Because distributed database systems are more likely to experience failures or operations interruptions than centralized databases (e.g., due to multiple sites or a suboptimal file system), strong fault tolerance processes are essential to maintain access reliability and effective database operations. With that said, the number of individual components that distributed systems are able to preserve removes the risk of a single point of failure.

Some common fault tolerance processes include data replication, backup protocols, continuous failure detection, data checksums, load balancing, and query optimization.


Data replication

Data replication is the process by which multiple copies of data are maintained across different nodes, servers, or sites. There are different types of database replication schema to choose from, including:

  • Full replication: In full replication, a complete, functional copy of the entire database is sent to all sites within the distributed database system. Database copy updates are provided on a routine schedule. There are two subtypes of full replication, as well.

    • Transactional replication: In transactional replication, a full and complete database copy is provided to each node, and then data changes are updated to that copy as transaction processing occurs, often in real-time.
    • Snapshot replication: Using snapshot replication, a copy of the database at a specific point in time is captured. This snapshot is then distributed across nodes and the user base as needed but does not consistently monitor for data changes. For this reason, snapshot database replication is only recommended for infrequently changing content.
  • Partial replication: In some cases, certain nodes only require specific portions of the database, so a defined portion of the database is replicated to a select group. In this type of data replication, any number of nodes or sites can receive the replication.

  • Merge replication: As its name indicates, merge database replication is the merging of two databases into one. This is the most complex of the database replication types.


Backup protocols

Through a consistent program of automated data backups, data integrity and database systems availability can be maintained without overburdening organizational employees. Some of the most common solutions in the marketplace include backup software from Veeam, Druva, and Commvault.

Three of the most used types of backup for distributed databases include:

  • Full backup: The entire database is copied and stored every time a database backup is executed.

  • Differential backup: Only the changes made since the last full backup are copied and stored.

  • Incremental backup: Incremental backups do not require a previous full backup — they can save changes since the previous differential or incremental backup.


Continuous failure detection

As with any system, it is critical for distributed database systems to be continuously monitored for system failures — whether they be technical issues, natural disasters, or cyberattacks. Just a few of the ways this monitoring is accomplished include:

  • Heartbeating: In heartbeating, each node sends out a signal (heartbeat) to other nodes to verify it's operational. If that signal isn't received, a failure message is created and further investigation of that node's operations by system administration is undertaken.

  • Watchdog timers: Individual nodes will have watchdog timers that are focused on a specific activity or process. If the timer expires without the activity or process being completed, a failure message is generated indicating further investigation is required.

  • Data checksums: In order to identify data tampering or other issues with data transmission, when a data transmission is sent, it is assigned a certain value (or checksum). When that transmission is received, it is also assigned a checksum. By using software to verify that both the sender and receiver have equivalent checksums for that transmission, issues with data transmission integrity can be quickly identified.


Load Balancing

Load balancing techniques distribute user requests and queries evenly across database nodes. This not only improves performance but also ensures that the failure of one node does not cause an overload on others.

Usually, load balancing software is deployed as the intermediary between the applications or database users. When a query is received, the load balancer will evaluate the request and determine which node(s) are best equipped to respond. During this evaluation, such factors as proximity, current load, and other predetermined system rules will be considered. This evaluation and assignment helps the system avoid system overload and system inefficiency which can result in long wait times for users.


Query optimization

Distributed databases use query optimization techniques to distribute queries efficiently across nodes while minimizing data transfer traffic between nodes. One of the ways this is accomplished is through cost-based query optimization. This form of query optimization considers the most efficient execution for the query, with such factors as query complexity, available data, and the location of the site containing that data.

Benefits and challenges distributed databases offer


As with any type of database solution, there are both benefits and challenges. Here is a brief summary to consider when researching distributed databases for your organization.


Distributed database benefits

  • Flexibility: Flexibility of data structures and schemas used within a distributed database (e.g., heterogeneous) are a significant benefit for organizations with a variety of data asset types and processing requirements.

  • Resiliency: Because distributed databases locate data across multiple nodes in the distributed system, the risk of a single point of failure is significantly reduced.

  • Scalability: Distributed databases can easily scale up (or down) by simply adjusting the number of nodes in the database, making them ideal for growing organizations.

  • Improved performance: Distributed databases are able to use load balancing and query optimization to improve overall database performance while reducing user wait times.

  • High availability: Fault tolerance (e.g., data replication, continuous failure detection) provide high system availability for users.


Distributed database challenges

  • Complexity: Because there are more moving parts to distributed databases vs. centralized databases, they can be more complex to both design and manage. The Atlas developer data platform simplifies this dramatically by providing a single UI/API to control and manage secure MongoDB distributed systems at scale.

  • Latency: If not managed properly, latency can occur when users query data from multiple nodes.

  • Data consistency: Since distributed databases are able to employ multiple data schemas and structures, maintaining data consistency requires more effort than traditional databases. In addition, if there is a hardware or network failure, data restoration can be more complex.

  • Cost: Distributed databases can be more expensive due to the added complexity that their greater flexibility brings. In addition, there may be additional networking costs since they tend to have more sites and hardware than traditional databases.


Additional resources to discover

Interested in learning more about distributed databases? Below is a list of resources that delve into specific elements of distributed databases.

FAQs

What is a distributed database?

A distributed database is a database system that stores data in multiple locations instead of one location. This means that rather than putting all data on one server or on one computer, data is placed on multiple servers or in a cluster of computers consisting of individual nodes. These nodes are oftentimes geographically separate and may be physical computers or virtual machines within a cloud database.

What are the two types of distributed databases?

The two types of distributed databases are homogeneous distributed databases and heterogeneous distributed databases. In homogeneous distributed databases, all machines across multiple sites in the distributed database share the same exact data, data model, operating system, and the same distributed database management system. Meanwhile, in a heterogeneous distributed database, different machines may house different data sets, use different operating systems, contain different data schemas, and require software to facilitate communication between machines.

What are the ACID properties?

The ACID properties consist of four elements:

  1. Atomicity: Atomicity means that a transaction is treated as a single unit.

  2. Consistency: Distributed database systems maintain data consistency by enforcing predefined rules and data constraints.

  3. Isolation: Each transaction is isolated from the other transactions to prevent data conflicts and maintain data integrity.

  4. Durability: Durability ensures that stored data is preserved in the event of a system failure.

What are some types of database replication?

Types of database replication include full replication, transactional replication, snapshot replication, partial replication, and merge replication.

What are different types of continuous failure monitoring in distributed database systems?

Different types of continuous failure monitoring include:

  • Heartbeating.
  • Watchdog timers.
  • Data checksums.