Sunday, June 14, 2015

Introduction to MongoDB Sharding and Replica Sets

The following diagram breaks down the individual components of a MongoDB sharded system.

We have the config servers, which are responsible for maintaining the metadata for your cluster. We have the MongoS query routers, which consult the config servers' metadata in order to route queries to the appropriate shards and return the results to the client application.

Think of MongoS as the load balancer, the shards as the data store, and the config servers as the metadata store. There are always exactly three config servers with the same metadata, and they use a two-phase commit protocol to keep the metadata in sync with each other. As long as one config server is up, the cluster is alive, but you lose the ability to do splits and migrates if one or more go down. You'll learn more about these components as you continue on.

Word of advice: unless you're at the point where you absolutely have to partition MongoDB (billions upon billions of rows), avoid sharding. Run MongoDB as a replica set from the get go, and only think about sharding when you really need the extra capacity[1].

  • Sharding makes your system more complex, requiring you to keep track of more nodes and more processes.
  • Premature optimization will cost you time and money, especially when you haven't identified the real issue. Is your application slow because it's CPU-bound or I/O-bound? Do you have too many or too few indexes? Are they even being hit by your queries?
  • You have to select a shard key. Make sure you have a clear understanding of your data model and system prior to doing this.

Sharding, or partitioning, is what allows MongoDB to scale out and distribute different ranges of data across multiple databases. It's meant to support the demands of data growth, horizontally, when a single machine is not sufficient. Each shard contains a given set of documents, and each document can only exist within one shard at any point in time, though it's possible to move documents around for rebalancing purposes. Within a shard, however, we can have replication: a replica set whose members contain duplicates of the same documents. So in summary, in a replica set, different members have the same documents, while in a sharded cluster, different shards have different documents. We mainly use replication for high availability, data safety, automatic failover, eliminating single points of failure, and disaster recovery. Use sharding for scaling out.

How Exactly is Data Distributed Across Shards?

In MongoDB, distribution is based on a shard key within a given collection. Each shard holds particular ranges of keys for the data being stored. So Shard 1 could hold K0 - K19, Shard 2 could hold K20 - K39, and so on. Another thing to note is that documents whose shard keys are close in sort order tend to live on the same shard because of range-based partitioning, similar to Google Bigtable.

  {
    id: 0,
    name: "David",
    role: "Developer"
  }

So if your data model for a given document looks like the object above, and you're sharding on the id key, your ranges (a.k.a. chunks) will look like the following within the metadata for your cluster:

  id_low   id_high   shard
  0        19        S2
  20       39        S0
  40       59        S1

Why Range-Based Partitioning?

The reasoning behind range-based partitioning is to allow for some level of efficiency. Think about it. If you already know the id of the data you're fetching, the query router can be smart about which shard to access directly, by performing queries that target that specific id, or an id that is greater than or less than some bound, etc. Queries that don't leverage the shard key are referred to as scatter-gather queries and are very expensive, since every shard must be consulted. So take advantage of the shard key in order to limit the data you have to pull from.
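The targeted-versus-scatter-gather distinction can be sketched in a few lines of Python (a toy model, not MongoDB internals), reusing the example chunk table: a range predicate on the shard key only touches the shards whose chunks overlap the interval, while a predicate on any other field gives the router no bounds, so every shard must be queried:

```python
def shards_for_range(chunks, low, high):
    """Shards whose chunk ranges overlap the query interval [low, high]."""
    return sorted({shard for lo, hi, shard in chunks
                   if not (hi < low or lo > high)})

chunks = [(0, 19, "S2"), (20, 39, "S0"), (40, 59, "S1")]

# Targeted: "id between 5 and 25" only touches the two overlapping chunks.
print(shards_for_range(chunks, 5, 25))   # ['S0', 'S2']

# Scatter-gather: a filter on a non-shard-key field implies no key bounds,
# which is equivalent to scanning the full key range -- every shard.
print(shards_for_range(chunks, 0, 59))   # ['S0', 'S1', 'S2']
```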

Important Sharding Operations to Be Aware Of

Ranges / chunk sizes are allowed to be within 1-1024 megabytes. Be mindful of performance when you allocate more or less than the default of 64 megabytes.

In order to maintain an even distribution of data across all shards, MongoDB runs split and migrate operations automatically. As a chunk nears the size limit, the system will split it in half at a newly determined midpoint. Splits are inexpensive. Migrates are responsible for maintaining balance based on the cluster average and are the process that actually redistributes data to different shards. Migrates are more expensive because the system is actually transferring data, inserting and deleting documents. Because of the cost, there will only be one chunk migration happening at any given time.
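The two operations can be illustrated with a small Python sketch (the threshold and balance rule here are simplified assumptions, not MongoDB's real balancer policy): a split is a cheap metadata-only change that turns one range into two on the same shard, while the balancer only kicks off a migration when shards drift too far from each other:

```python
from collections import Counter

def split_chunk(chunk, midpoint):
    """Splits are metadata-only: one range becomes two, on the same shard."""
    low, high, shard = chunk
    assert low <= midpoint < high
    return (low, midpoint, shard), (midpoint + 1, high, shard)

def needs_migration(chunks, threshold=2):
    """Toy balancer check: migrate when the busiest and least busy shards
    differ by more than `threshold` chunks (real thresholds differ)."""
    counts = Counter(shard for _, _, shard in chunks)
    return max(counts.values()) - min(counts.values()) > threshold

left, right = split_chunk((0, 39, "S2"), 19)
print(left, right)  # (0, 19, 'S2') (20, 39, 'S2')
```

Note that `split_chunk` never moves a document, which is why splits are cheap; only a migration, triggered by an imbalance like the one `needs_migration` detects, actually transfers data between shards.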

The last thing to note about these operations is that there is no locking mechanism: you'll still be able to read and write your data while they're running.

Replica Sets

Replica sets can be used as an alternative to sharding or in conjunction with it. A proper configuration provides your databases with high availability and failover. This is a must for production environments.

Replica sets live within each shard and are very similar to nodes in a master-slave configuration, where you have a primary member serving as the base for updating all of the secondary members. However, the added benefit of a replica set is automatic failover if the primary member goes down. So yes, a secondary can become a primary if this scenario arises.

The arbiter serves only to act as a tie-breaker when a secondary member needs to take the primary's place. It's important to have an odd number of voting members: if you have an even number of nodes (for example, a primary with a single secondary), an arbiter is required.
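The arithmetic behind the odd-number rule is simple majority voting; here's a one-function sketch (my illustration, not MongoDB's election protocol) showing why two data-bearing members alone can't survive a failure, while adding an arbiter fixes it:

```python
def can_elect_primary(voting_members, members_up):
    """A primary needs a strict majority of the *configured* voting members,
    not just of the members currently reachable."""
    return members_up > voting_members // 2

# Two voters, one down: the survivor is 1 of 2 -- not a majority,
# so no primary can be elected and the set becomes read-only.
print(can_elect_primary(2, 1))  # False

# With an arbiter there are 3 voters: 2 of 3 is a majority,
# so the surviving secondary can be elected primary.
print(can_elect_primary(3, 2))  # True
```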

Replica sets are allowed one primary member at a time because replication involves copying the primary's operations log (oplog) and applying the changes to every secondary's dataset. If there were multiple primaries performing writes, we'd end up with data inconsistencies. As for membership, you're allowed up to 12 members (raised to 50 in MongoDB 3.0), with up to 7 voting in an election.
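The oplog mechanism can be sketched as a simple replay loop (a toy model with made-up operation shapes, not MongoDB's actual oplog format): because every secondary applies the same ordered log, all members converge on the same dataset:

```python
# Toy oplog: the primary appends operations; each secondary replays them
# in order against its own copy of the data.
oplog = [
    ("insert", {"_id": 1, "name": "David"}),
    ("update", {"_id": 1, "role": "Developer"}),
    ("delete", {"_id": 2}),
]

def apply_oplog(dataset, ops):
    """Replay `ops` against `dataset` (a dict keyed by _id) in order."""
    for op, doc in ops:
        if op == "insert":
            dataset[doc["_id"]] = dict(doc)
        elif op == "update":
            dataset.setdefault(doc["_id"], {"_id": doc["_id"]}).update(doc)
        elif op == "delete":
            dataset.pop(doc["_id"], None)
    return dataset

primary = apply_oplog({}, oplog)
secondary = apply_oplog({}, oplog)  # same log, same order -> same data
print(primary == secondary)  # True
```

A single primary keeps the log totally ordered; two primaries appending concurrently would produce logs that replay to different datasets, which is exactly the inconsistency described above.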
