Distributed and Consistent Hashing – Part 3

Windows Azure Cache (WA Cache) is an distributed in-memory cache. WA Cache provides a simple <Key, Value> based API. Like, Cache.Put (key, value), Cache.Get (key). You can correlate WA Cache API to that of a Hash. This blog post series tries to take the user from traditional one-node Hash to a distributed Hash. This is third post of the series (and assumes you have read first and second posts).

Distributed Systems are gaining momentum because of its promise to economies of scale.  The economy is possible due to the use of ‘commodity’ hardware.  The commodity hardware is cheap, but they pose higher degree of failure when compared to reliable (and costly) hardware.  As a result, distributed systems have been designed to work with different types of failures. 

Hardware Failures: Electrical Power Supply outages (#1),  Power Socket Failures, Network Switch Failures, Network Router Failures, Network Card Failures, Hard Disk Failures, Memory Chip Failures.

Hardware Failures in Windows Azure are discussed in Inside Windows Azure – Build 2011 Talk

Software Failures: Crashes due to bugs

Failures due to misconfiguration : Simple network misconfigurations can put node/rack/datacenter network communication in trouble.

#1 Electrical Power Supply Outages:

While Uninterrupted Power Supply (UPS) helps in case of power failures, it can only help as an alternative power source for short time (~1 hour).  Any long power outage or when UPS is not available, this problem needs to be dealt at upper layers. 

At hardware level: Redundancy is the mantra for many of the problems – redundancy in power supply, redundancy in communication paths, etc. 

At software level: Replication of data, replication of responsibility, and right integration with hardware redundancy.

Solutions must be well integrated across layers to get best results (or sometimes even the desired results).  The importance of integration becomes evident with an example.

Let us say a web server is deployed on two nodes.  So in case one node faults there exists another node to serve the requests.  If both these nodes are connected to same power socket, any fault in power socket would result both the nodes go down at the same time.  This essentially means that in spite of having two web server nodes, we landed in trouble.  Imagine if we have a way to make sure that two nodes are always connected to distinct set of resources, it would be much better model. 

The system of categorizing resources into group/set is thus necessary.  These groups are called Fault Domains (FD).  No two FDs share same power source, network switch, etc.  With these FDs in picture, at software level any redundancy system just have to make sure to place the redundancy across FDs. 

We have discussed redundancy as a solution at software layer to deal with faults.  In case of stateless software programs, just having another node would be sufficient.  Where as in case of stateful software programs, there is much more to be done.  For example, database systems.  Traditionally in scale-up world, RAID systems were used to make sure to protect against bad sectors (checksums), hard disk crashes (multiple copies on different disks), etc.  RAID storage is costly so can’t be the choice for distributed systems.  The other scale-up world technique has been data replication.  Replication is typically the place where FD knowledge is required to place copies of data in different FDs.

The moment replication comes into discussion it is important to call out the terminology:

Primary – The node is responsible to ‘own’ replication and interface with client.  Primary is usually only one.

Secondary – The node is responsible to ‘cooperate’ replication.  There can be multiple secondary nodes. 

Replication is a vast subject, but I will keep it short.  Depending on how Primary and Secondary agree to replication there are multiple methods.

Asynchronous Replication – In this model, changes in primary are committed and client is acknowledged without waiting for the secondary nodes to be updated.  Replication with secondary is either triggered or a background task takes up the responsibility of bringing secondary nodes up-to-date.  In short, replication happens asynchronously.  In this mode, if primary node fails, there could be data loss as secondary nodes are not up-to-date.  If the data is not super critical and ok to lose, this model is apt.

Synchronous Replication – In this model, for every change in primary it is synchronously replicated to secondary. Till secondary responds, changes on primary are not ‘committed’ (and so, client is not acknowledged). In short, replication happens synchronously. If secondary node is down, writes are blocked till the node is brought back up.

The no. of copies to be maintained is referred as ‘Replication Factor’.

If data is important, admin would opt for synchronous replication.  The higher the replication factor, the better fault tolerance.  But keeping client on hold till all secondary copies are updated makes an admin to chose lower replication factor.  It can be argued that admin is forced to think deeply about the replication factor with this model.  There are places where, one might need higher replication factor but not at the risk of increased operation times.  There comes the need for mixture of both synchronous and asynchronous replication models, namely Hybrid Replication.  In this model, one can chose no. of secondary nodes to be in synchronous mode and no. secondary nodes to be in asynchronous mode.  Here again, there are two choices where one can designate ‘fixed set’  of secondary nodes for synchronous replication or live with ‘any N’ secondary nodes to acknowledge.

It is also important to note that, a node can act as both primary and secondary for different replication units.  In case of database systems, each database is a replication unit.  So, a node can act as primary for database 1 and as a secondary for database 2. 

In case of Routing Client, client would typically know the primary to reach out for each replication unit.  Some systems allow clients to read from secondary.  Depending on which secondary nodes a client is allowed to read from, results in different levels of consistency.  Werner Vogels wrote great blog about different consistency models (Blog Post, Wikipedia Link).

Primary and Secondary communicate as part of replication and there are multiple models.

Pipeline model – The replication data flow is like a chain – Primary sends the data to first secondary node, first secondary nodes then sends the data to next secondary and so on so forth.  Windows Azure Storage, Hadoop Distributed File System use this model

Multi-Unicast Model – Primary sends replication data to all ‘synchronous secondary’ nodes separately and waits for acknowledgement from everyone

Multicast Model – Incase of hybrid replication with ‘any N’ acknowledgements – Primary sends replication data to all ‘secondary’ nodes (both synchronous and asynchronous secondary nodes) and wait for any N secondary nodes to acknowledge.  The set of N nodes that acknowledge vary for every data block or chunk or packet.

One major advantage of Windows Azure Cache over MemCacheD is it supports High Availability (the public name for replication).  Windows Azure Cache supports synchronous replication model (and no. of secondary nodes is fixed to 1 – and communication model is multi-unicast) and each partition is a replication unit. Cache is in an in-memory system and so the replication is limited to in-memory replication.   And, that is the catch!.  In any stateful system, a node reboot does not lose the state as the state is persisted on disk (either locally or remotely).  However, in case of in-memory system like Windows Azure Cache – a node reboot results in state loss.  Synchronous replication and node-reboot-leads-to-state-loss made us (Windows Azure Cache) to let clients commit when all secondary nodes are down as they don’t have any data that can be said would go out-of-date by allowing writes.  Windows Azure Cache (as on 1.8 SDK release) does not allow client to read from secondary nodes.

Many a times cache is used for data that need to be processed.  Processing involves code to be run.  It is a well known and established fact that keeping the code and data near makes systems complete tasks fast. 

In case of database systems, stored procedures help bring the code (or business logic) near to data. 

In Windows Azure Cache, we have Append, Prepend, Increment, Decrement API to help process value.  It would have been lovely if we had ‘stored procedure’ model instead of these individual API.  That way, any processing can be pushed to these ‘stored procedures’ and we could have simply shipped Append, Prepend, Increment, Decrement as ‘Microsoft owned’ stored procedures.

This is the last post of the series “Distributed and Consistent Hashing”.  Here are the important lessons/techniques to remember (deliberately explained in generic terms so as to carry forward them to use in other discussions).

– Data Nodes – The cache server nodes that actually store data

– Control Nodes or Metadata Nodes – Partition Manager is one such control node we have discussed that helps manage the data nodes and their partition ownership. 

– Routing Clients – Make client intelligent to talk to data nodes directly without the need to go thru control node

– Read from Secondary – Load balance, allowing different levels of consistency,

– Code and Data Distance – If code and data are near, tasks can be completed faster 


That is all for now, will come back with another series on taking a system from scale-up world to scale-out (or distributed) world.



Laxmi Narsimha Rao Oruganti (alias: LaxmiNRO, alias: OLNRao)