Since 2020, the amount of data created, captured, copied, and consumed worldwide has almost doubled — that's an increase of 55.8 zettabytes in just three years!
(Source: Statista.com, 2023)
As a result, organizations are actively seeking flexible, scalable, high-performance data storage solutions that can accommodate various types of data (e.g., structured, unstructured) while maintaining a high degree of data integrity. Further, given the prevalence of users accessing that data from a variety of mobile devices, organizations need solutions that support multiple users simultaneously while also ensuring strong data security and availability. These are just a few of the reasons that distributed databases are continuing to grow in popularity and are the solution of choice for many global organizations.
In this distributed database guide, we'll explore the different types of distributed databases, how they work, distributed database architecture, as well as the benefits and challenges that distributed databases offer.
Table of contents
In the most basic terms, a distributed database is a database that stores data in multiple locations instead of one location. This means that rather than putting all data on one server or on one computer, data is placed on multiple servers or in a cluster of computers consisting of individual nodes. These nodes are oftentimes geographically separate and may be physical computers or virtual machines within a cloud database.
The MongoDB cluster example above is one of many configuration possibilities available when creating a distributed database. However, unlike traditional centralized databases, all distributed databases share the common characteristic of spreading data across multiple locations (physical and/or virtual) which improves data resiliency and availability. Sharding the data across multiple locations allows for horizontal scaling as well.
There are two distinct types of distributed databases: homogeneous databases and heterogeneous databases.
In a homogeneous distributed database, the machines, nodes, servers, or sites store the same data, use the same data model, work with the same operating system, and share the same distributed database management system (DDBMS) or occasionally multiple types of DDBMS from the same vendor.
Within homogenous distributed databases, there are two subsets: autonomous and non-autonomous.
As a rule, homogeneous distributed databases offer significant data protection through redundancy and simplified management due to the similarity of all nodes.
In a heterogeneous distributed database, different machines or sites may house different data sets, use different operating systems, contain different data schemas, and require software to facilitate communication between machines. Further, different sites may not even be aware of the existence of other sites.
Within heterogeneous distributed databases, there are two subsets: federated and unfederated.
While more complex to manage, heterogeneous distributed databases offer more flexibility in terms of data models, schema choices, and the types of data that can be stored than homogeneous distributed databases.
As previously discussed, nodes are individual servers or computers that reside within a distributed database system (e.g., computers, virtual machines, servers that share no physical components). Each node stores a set of data and runs on distributed database management system software (DDBMS). To determine which data will be stored amongst which nodes, the concept of data distribution must be considered.
Proper data distribution is critical to the efficiency, security, and optimal user access in a distributed database. This process, sometimes referred to as data partitioning, can be accomplished using two different methods.
(Source: Hazelcast.com, 2023)
The resulting data sets from horizontal or vertical partitioning of the original table are sometimes referred to as shards.
While nodes are able to fully function on their own, it is necessary for them to communicate with other nodes as well since, unlike centralized databases, they do not share the same physical components or even the same data sets. There are three types of distributed database communication:
Distributed databases must often support distributed transactions, where one transaction can involve more than one node. This support methodology is highlighted in the ACID properties (atomicity, consistency, isolation, durability) of transactions across distributed database systems. Key elements of ACID properties include:
(Source: Dev.to, 2020)
Atomicity means that a transaction is treated as a single unit. This also means that either a complete transaction is available for storage or it's rejected as an error which ensures data integrity.
Consistency is maintained in distributed database systems by enforcing predefined rules and data constraints. If the state, nature, or content of a transaction violates these rules, the transaction will not be ingested and stored in the distributed system.
Isolation involves the separation of each transaction from the other transactions to prevent data conflicts and maintain data integrity. In addition, this benefits operations when managing multiple distributed data records that may exist across local data stores, virtual machines via cloud computing, and multiple database nodes which may be located across multiple sites.
Durability ensures that stored data is preserved in the event of a system failure. There are a variety of ways that a transactional distributed database management system accomplishes this task, including:
Because distributed database systems are more likely to experience failures or operations interruptions than centralized databases (e.g., due to multiple sites or a suboptimal file system), strong fault tolerance processes are essential to maintain access reliability and effective database operations. With that said, the number of individual components that distributed systems are able to preserve removes the risk of a single point of failure.
Some common fault tolerance processes include data replication, backup protocols, continuous failure detection, data checksums, load balancing, and query optimization.
Data replication is the process by which multiple copies of data are maintained across different nodes, servers, or sites. There are different types of database replication schema to choose from, including:
Full replication: In full replication, a complete, functional copy of the entire database is sent to all sites within the distributed database system. Database copy updates are provided on a routine schedule. There are two subtypes of full replication, as well.
Partial replication: In some cases, certain nodes only require specific portions of the database, so a defined portion of the database is replicated to a select group. In this type of data replication, any number of nodes or sites can receive the replication.
Merge replication: As its name indicates, merge database replication is the merging of two databases into one. This is the most complex of the database replication types.
Through a consistent program of automated data backups, data integrity and database systems availability can be maintained without overburdening organizational employees. Some of the most common solutions in the marketplace include backup software from Veeam, Druva, and Commvault.
Three of the most used types of backup for distributed databases include:
Full backup: The entire database is copied and stored every time a database backup is executed.
Differential backup: Only the changes made since the last full backup are copied and stored.
Incremental backup: Incremental backups do not require a previous full backup — they can save changes since the previous differential or incremental backup.
As with any system, it is critical for distributed database systems to be continuously monitored for system failures — whether they be technical issues, natural disasters, or cyberattacks. Just a few of the ways this monitoring is accomplished include:
Heartbeating: In heartbeating, each node sends out a signal (heartbeat) to other nodes to verify it's operational. If that signal isn't received, a failure message is created and further investigation of that node's operations by system administration is undertaken.
Watchdog timers: Individual nodes will have watchdog timers that are focused on a specific activity or process. If the timer expires without the activity or process being completed, a failure message is generated indicating further investigation is required.
Data checksums: In order to identify data tampering or other issues with data transmission, when a data transmission is sent, it is assigned a certain value (or checksum). When that transmission is received, it is also assigned a checksum. By using software to verify that both the sender and receiver have equivalent checksums for that transmission, issues with data transmission integrity can be quickly identified.
Load balancing techniques distribute user requests and queries evenly across database nodes. This not only improves performance but also ensures that the failure of one node does not cause an overload on others.
Usually, load balancing software is deployed as the intermediary between the applications or database users. When a query is received, the load balancer will evaluate the request and determine which node(s) are best equipped to respond. During this evaluation, such factors as proximity, current load, and other predetermined system rules will be considered. This evaluation and assignment helps the system avoid system overload and system inefficiency which can result in long wait times for users.
Distributed databases use query optimization techniques to distribute queries efficiently across nodes while minimizing data transfer traffic between nodes. One of the ways this is accomplished is through cost-based query optimization. This form of query optimization considers the most efficient execution for the query, with such factors as query complexity, available data, and the location of the site containing that data.
As with any type of database solution, there are both benefits and challenges. Here is a brief summary to consider when researching distributed databases for your organization.
Flexibility: Flexibility of data structures and schemas used within a distributed database (e.g., heterogeneous) are a significant benefit for organizations with a variety of data asset types and processing requirements.
Resiliency: Because distributed databases locate data across multiple nodes in the distributed system, the risk of a single point of failure is significantly reduced.
Scalability: Distributed databases can easily scale up (or down) by simply adjusting the number of nodes in the database, making them ideal for growing organizations.
Improved performance: Distributed databases are able to use load balancing and query optimization to improve overall database performance while reducing user wait times.
High availability: Fault tolerance (e.g., data replication, continuous failure detection) provide high system availability for users.
Complexity: Because there are more moving parts to distributed databases vs. centralized databases, they can be more complex to both design and manage. The Atlas developer data platform simplifies this dramatically by providing a single UI/API to control and manage secure MongoDB distributed systems at scale.
Latency: If not managed properly, latency can occur when users query data from multiple nodes.
Data consistency: Since distributed databases are able to employ multiple data schemas and structures, maintaining data consistency requires more effort than traditional databases. In addition, if there is a hardware or network failure, data restoration can be more complex.
Cost: Distributed databases can be more expensive due to the added complexity that their greater flexibility brings. In addition, there may be additional networking costs since they tend to have more sites and hardware than traditional databases.
Interested in learning more about distributed databases? Below is a list of resources that delve into specific elements of distributed databases.
A distributed database is a database system that stores data in multiple locations instead of one location. This means that rather than putting all data on one server or on one computer, data is placed on multiple servers or in a cluster of computers consisting of individual nodes. These nodes are oftentimes geographically separate and may be physical computers or virtual machines within a cloud database.
The two types of distributed databases are homogeneous distributed databases and heterogeneous distributed databases. In homogeneous distributed databases, all machines across multiple sites in the distributed database share the same exact data, data model, operating system, and the same distributed database management system. Meanwhile, in a heterogeneous distributed database, different machines may house different data sets, use different operating systems, contain different data schemas, and require software to facilitate communication between machines.
The ACID properties consist of four elements:
Atomicity: Atomicity means that a transaction is treated as a single unit.
Consistency: Distributed database systems maintain data consistency by enforcing predefined rules and data constraints.
Isolation: Each transaction is isolated from the other transactions to prevent data conflicts and maintain data integrity.
Durability: Durability ensures that stored data is preserved in the event of a system failure.
Different types of continuous failure monitoring include: