Distributed and Consistent Hashing – Part 2

Windows Azure Cache (WA Cache) is an distributed in-memory cache. WA Cache provides a simple <Key, Value> based API. Like, Cache.Put (key, value), Cache.Get (key). You can correlate WA Cache API to that of a Hash. This blog post series tries to take the user from traditional one-node Hash to a distributed Hash.  This is second post of the series (and assumes you have read first post).

We will find variety of ways to distribute if clients and servers do not have common understanding.  If we setup a way to learn key to node mapping, then the decoupling does not result in any problem.

In all the discussion we had, the key range was dynamic in size and changes as the cluster shrinks or expands.  Let us divide the whole key range into equal sized finite portions.  For example, if the key range is 1 – 20, we could define range size as 5 and we would then have four ranges namely: 1-5, 6-10, 11-15, 16-20.  In reality, the key range is much larger based on data type for Hash Code.  That is, it can be ulong.min to ulong.max (or uint.min to uint.max). Let us call each key range as ‘Partition’.  Let us also place some important restriction/assumption on the Partition that “A partition is restricted to a single node/server of cluster (i.e. a partition can’t span two nodes)”.  With this restriction in place, we don’t want to end-up blowing up a machine in case where one particular partition is heavily populated.  One simplest way to reduce this probability is to keep the key-range very small.  That means, no. of key ranges (or Partitions) are high.  

Partition Map:

As discussed each partition would be restricted to one node.  The node on which a partition resides is called ‘Partition Owner’.   As there are many partitions and fewer nodes, each node can own more than one partition.  The information of which node owns what partitions is referred to as ‘Partition Map’.

How this partition map is created, updated, maintained is a big topic, but we will have short discussion to get clarity.  Let us say we designate a server node for this responsibility.  This designation is called ‘Partition Manager (PM)’.  PM keeps the partition map up-to-date, responds to partition map queries, updates partition map in response to node up/downs.  When nodes come-up/go-down, cluster expands/shrinks, PM moves the partitions around (also called as, Load Balancing).  As PM node itself can go down, there should also be a mechanism to any other node to become PM (after making sure the actual PM has died).  Note that, we never want to have two nodes acting as PM, as that would mean there are two decision makers on who owns which partitions, etc.  PM’s partition map is called as Global Partition Map (GPM).  A copy of partition map is available at every server node and is called as Local Partition Map (LPM).  PM communicates appropriately the GPM updates so that server nodes can update LPM accordingly.

Simple Client – Server Gateway:

In this model, client are unaware of all this partition intelligence.  Clients would simply talk to some server (either by round robin or any other simple unrelated mechanism).  Server in turn would use LPM to lookup the actual owner of the Partition to which the key belongs to and work as a mediator between client and partition owner. As there is a mediation in this model, there would be performance cost due to two-hop model.  This model usually works or preferred when parts of the system are legacy or not owned.  For example, MemCacheD clients are not partition aware.  Hence, Windows Azure cache supports this Gateway model when MemCacheD clients want to use MemCacheD protocol against Windows Azure Cache.  We refer to this mechanism of deployment as ‘Server Shim’ (or Gateway Shim). 

Routing Client – Simple Server:

In this model, clients also keep a local copy of GPM (namely, Local Partition Map LPM).  Clients could query PM node (or any other server node) to get a copy of GPM.  Client takes the input key (from user), derives Hash Code, maps the Hash Code to a range/Partition, then uses the LPM to find the owner server node.  Client would then interact directly with server to get/put on that key.  As client is doing all the routing-style logic of finding the right server node, it is called ‘Routing Client’.  As discussed before, PM would be updating GPM as and when required.  However, routing clients have a copy that essentially needs to be invalidated.  So, every time client uses LPM and talks to a server node for any cache operation – it would also validate the LPM version w.r.t. GPM version.  If client has an old copy, it would initiate a LPM refresh.  If a server found using its LPM was already dead or not the primary any more, that is also an indication that its’ LPM is outdated.

Initially there is only one server-node cluster and data is 6,12,18.  Key range size is 5 and total range is limited to 1-20 (all for example purposes only).



Figure 1: Routing Client – One Server

Now, let us insert more data into cache, namely: 3, 9, 15.


Figure 2: Routing Client – One Server – More data

Let us say admin saw the cache usage is approaching the capacity and adds one more server.  At this time, before the partitions are moved to balance the load on nodes (by Load Balancer), here is how the cluster looks:


Figure 3: Routing Client – Two Servers – Prior to load balance activity

Note that, there is no data movement required to be done to make the client queries work.  Client keeps talking to Server S1 for all its cache requests as its LPM (or GPM) says so.  What we have just described is nothing but ‘Consistent Hashing’.  That is, your data distribution is not altered because of node up/downs, cluster expansions/contractions.

Though server S2 is part of cache cluster, it is not used at all as all partitions are owned by S1.  Not just for ‘get’, but ‘put’ is served by server S1.   While we have solved ‘consistent hashing’ with partition maps (routing tables), we are not effectively using resources (i.e. server S2).  As we discussed, PM also takes the ownership of balancing load.  Eventually, PM would detect the S2 is under-utilized and initiates the partition movement.  After PM finishes load balancing, it updates the GPM and keeps every server informed of the latest GPM.  However, clients still would have old version partition map.  Here is how the cluster looks:


Figure 4: Routing Client – Two Servers – Post load balance activity

Note that, GPM is updated, version bumped on server. Client still has old version. 

If client tries to operate on partitions P1 (1-5), P2 (6-10) those operations would succeed.  However, when it tries to operate on partitions P3 (11-15), P4 (16-20) – it would try to talk to S1 (as per its own LPM).  S1 would then react back ‘I am not Primary’ per its own GPM.  Client takes that as a hint and initiates the LPM update.  Here is how the cluster looks like post the LPM update.



Figure 5: Routing Client – Two Servers – Post LPM update

Once the LPM is updated, client would then talk to S2 for operations on P3 and P4 – which would be successful.  The advantage of this model is LB can be done at any appropriate time than immediately.   Partition Movement can be made efficient to reduce the ‘unavailability’ of partition, which is a separate topic of its own.

Windows Azure Cache clients are ‘routing’ clients by default.  If MemCacheD clients are used with Windows Azure Cache server, there is also another mechanism called ‘Client Shim’.  The client shim is nothing but a sibling module on partition unaware MemCacheD clients to bring partition intelligence.



Laxmi Narsimha Rao Oruganti (alias: LaxmiNRO, alias: OLNRao)



2 thoughts on “Distributed and Consistent Hashing – Part 2

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s