Consistency != Correctness

We had a healthy lunch-table discussion on Consistency vs. Correctness in the context of data stores (e.g., relational databases and NoSQL stores).  People often assume that if a data store guarantees Consistency, the solution built on it is correct.  Coming from a database engine development background, I can clearly distinguish the two.  But I learnt over this lunch-table discussion how differently a non-database-engine developer can interpret an article in this domain.  Thanks to Mustafa Hussain, who helped me understand the other way of looking at things.

Coming to the main topic, there is a good article on this subject: CAP Twelve Years Later: How the “Rules” Have Changed.

Why do banks have their own Correctness system when Relational Databases provide Consistency?

Banking solution correctness involves many non-database parts, and indeed many non-software parts.  A simple ATM transaction is not the same as a database transaction.  A database transaction only ensures that the ACID properties hold for data within the database, whereas an ATM transaction involves databases, the teller machine, physical currency notes, the dispenser, pulling back unpicked money, currency loading, physical book-keeping, and so on.  As you can see, merely using a database that guarantees Consistency does not make a banking solution correct.  Much more needs to be done, and hence Consistency != Correctness.

Well, you may then ask:

Why should Financial Banks use costly Relational Databases over cheap NoSQL stores?

I am talking about a case where brand-new financial banking software is to be developed: why should it use Relational Databases over NoSQL stores?

Note that I am talking about NoSQL stores that chose to support Availability and Partition Tolerance and forfeited Consistency, per the famous CAP theorem that the above-cited article revisits.

While Availability is important in the ATM scenario the author chose to describe in the above-cited article, he did not mean to forfeit ‘C’ completely.  We still need consistency within a database at a particular site (e.g., an ATM room), across rows, tables, etc.  For example, a money transfer between two accounts at an ATM requires a consistency guarantee from the data store.  NoSQL stores of today do not provide it.  Hence the author noted that “Another aspect of CAP confusion is the hidden cost of forfeiting consistency”.

So, NoSQL stores have their own strong scenarios, but they do not necessarily fit all scenarios.  Banking is one such scenario.

As the cited article’s author discussed, once we have NoSQL stores that support both C and A, we can think of moving banking-like workloads over to such stores.



Laxmi Narsimha Rao Oruganti


Distributed and Consistent Hashing – Part 2

Windows Azure Cache (WA Cache) is a distributed in-memory cache. WA Cache provides a simple <Key, Value> based API, e.g., Cache.Put (key, value) and Cache.Get (key). You can correlate the WA Cache API to that of a Hash. This blog post series tries to take the reader from a traditional one-node Hash to a distributed Hash.  This is the second post of the series (and assumes you have read the first post).

As noted at the end of the first post, we can find other ways to distribute data if clients and servers do not need a common understanding.  If we set up a way for clients to learn the key-to-node mapping, that decoupling does not cause any problem.

In all the discussion so far, the key ranges were dynamic in size and changed as the cluster shrank or expanded.  Let us instead divide the whole key range into equal-sized, finite portions.  For example, if the key range is 1-20, we could define the range size as 5, giving four ranges: 1-5, 6-10, 11-15, 16-20.  In reality, the key range is much larger, based on the data type of the Hash Code; it can be ulong.min to ulong.max (or uint.min to uint.max). Let us call each key range a ‘Partition’.  Let us also place an important restriction/assumption on a Partition: “A partition is restricted to a single node/server of the cluster (i.e., a partition can’t span two nodes)”.  With this restriction in place, we don’t want to end up blowing up a machine when one particular partition is heavily populated.  The simplest way to reduce this probability is to keep the key range very small, which means the number of key ranges (or Partitions) is high.
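The mapping from hash code to partition can be sketched in a few lines of Python (the range size and bounds mirror the 1-20 example above; all names are illustrative):

```python
# Map a hash code to a fixed-size partition (key range).
# Range size and bounds mirror the 1-20 example; names are illustrative.

RANGE_SIZE = 5          # each partition covers 5 consecutive hash codes
MIN_CODE, MAX_CODE = 1, 20

def partition_id(hash_code: int) -> int:
    """Return the (0-based) partition that owns this hash code."""
    if not (MIN_CODE <= hash_code <= MAX_CODE):
        raise ValueError("hash code out of range")
    return (hash_code - MIN_CODE) // RANGE_SIZE

# Hash codes 1-5 fall in partition 0, 6-10 in partition 1, and so on.
print(partition_id(6))   # prints 1
```

Note that the partition id depends only on the hash code and the fixed range size, never on the number of nodes; that independence is what the rest of this post builds on.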

Partition Map:

As discussed, each partition is restricted to one node.  The node on which a partition resides is called the ‘Partition Owner’.  As there are many partitions and fewer nodes, each node can own more than one partition.  The information about which node owns which partitions is referred to as the ‘Partition Map’.

How this partition map is created, updated, and maintained is a big topic, but a short discussion will give us clarity.  Let us say we designate a server node for this responsibility; this designated node is called the ‘Partition Manager (PM)’.  The PM keeps the partition map up to date, responds to partition map queries, and updates the partition map in response to nodes going up or down.  When nodes come up or go down and the cluster expands or shrinks, the PM moves partitions around (also called Load Balancing).  As the PM node itself can go down, there should also be a mechanism for any other node to become the PM (after making sure the actual PM has died).  Note that we never want two nodes acting as PM, as that would mean two decision makers on who owns which partitions.  The PM’s partition map is called the Global Partition Map (GPM).  A copy of the partition map is available at every server node and is called the Local Partition Map (LPM).  The PM communicates GPM updates appropriately so that server nodes can update their LPMs accordingly.
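As a rough sketch (not the actual WA Cache implementation), a partition map is just a mapping from partition id to owner node, plus a version number that the PM bumps on every change:

```python
# A toy Global Partition Map: partition id -> owner node.
# Node names and the version counter are illustrative assumptions.

class PartitionMap:
    def __init__(self, owners: dict[int, str]):
        self.owners = dict(owners)   # partition id -> node id
        self.version = 1             # bumped on every update by the PM

    def owner_of(self, partition: int) -> str:
        return self.owners[partition]

    def move(self, partition: int, new_owner: str) -> None:
        """PM moves a partition (load balancing) and bumps the version."""
        self.owners[partition] = new_owner
        self.version += 1

gpm = PartitionMap({0: "S1", 1: "S1", 2: "S1", 3: "S1"})
gpm.move(2, "S2")
gpm.move(3, "S2")
print(gpm.owner_of(2), gpm.version)   # prints: S2 3
```

The version number is what lets servers (and, as we will see below, routing clients) detect that their local copy has gone stale.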

Simple Client – Server Gateway:

In this model, clients are unaware of all this partition intelligence.  Clients simply talk to some server (chosen by round robin or any other simple, unrelated mechanism).  The server in turn uses its LPM to look up the actual owner of the partition to which the key belongs, and acts as a mediator between the client and the partition owner. As there is mediation in this model, there is a performance cost due to the two hops.  This model is usually preferred when parts of the system are legacy or not owned.  For example, MemCacheD clients are not partition aware.  Hence, Windows Azure Cache supports this Gateway model when MemCacheD clients want to use the MemCacheD protocol against Windows Azure Cache.  We refer to this deployment mechanism as the ‘Server Shim’ (or Gateway Shim).

Routing Client – Simple Server:

In this model, clients also keep a local copy of the GPM (again called a Local Partition Map, LPM).  Clients can query the PM node (or any other server node) to get a copy of the GPM.  The client takes the input key (from the user), derives its Hash Code, maps the Hash Code to a range/Partition, and then uses the LPM to find the owner server node.  The client then interacts directly with that server to get/put the key.  As the client does all the routing-style logic of finding the right server node, it is called a ‘Routing Client’.  As discussed before, the PM updates the GPM as and when required.  However, routing clients hold a copy that essentially needs to be invalidated.  So, every time a client uses its LPM and talks to a server node for a cache operation, it also validates its LPM version against the GPM version.  If the client has an old copy, it initiates an LPM refresh.  If a server found using the LPM is already dead or is no longer the primary, that is also an indication that the LPM is outdated.
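A minimal sketch of the routing-client lookup, assuming (as in our examples) that the hash code equals the key and that partitions are the fixed 1-20 ranges from earlier; all names are illustrative:

```python
# Sketch of a routing client: key -> hash code -> partition -> owner node.
# Hash codes equal keys, as in the examples; all names are illustrative.

RANGE_SIZE = 5

def partition_of(key: int) -> int:
    return (key - 1) // RANGE_SIZE      # keys 1-20, 0-based partitions

class RoutingClient:
    def __init__(self, lpm: dict[int, str], lpm_version: int):
        self.lpm = lpm                   # local copy of the partition map
        self.lpm_version = lpm_version   # compared against the GPM version

    def route(self, key: int) -> str:
        """Return the node this client would contact for `key`."""
        return self.lpm[partition_of(key)]

client = RoutingClient({0: "S1", 1: "S1", 2: "S1", 3: "S1"}, lpm_version=1)
print(client.route(12))   # prints: S1 (per this possibly stale local map)
```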

Initially there is a one-server cluster and the data is 6, 12, 18.  The key range size is 5 and the total range is limited to 1-20 (all for example purposes only).



Figure 1: Routing Client – One Server

Now, let us insert more data into cache, namely: 3, 9, 15.


Figure 2: Routing Client – One Server – More data

Let us say the admin sees that cache usage is approaching capacity and adds one more server.  At this point, before partitions are moved to balance the load across nodes (by the Load Balancer), here is how the cluster looks:


Figure 3: Routing Client – Two Servers – Prior to load balance activity

Note that no data movement is required to make the client queries work.  The client keeps talking to server S1 for all its cache requests, as its LPM (or the GPM) says so.  What we have just described is nothing but ‘Consistent Hashing’: data distribution is not altered by nodes going up/down or by cluster expansions/contractions.

Though server S2 is part of the cache cluster, it is not used at all, as all partitions are owned by S1.  Not just ‘get’ but also ‘put’ is served by server S1.  While we have solved consistent hashing with partition maps (routing tables), we are not using resources (i.e., server S2) effectively.  As we discussed, the PM also takes ownership of balancing load.  Eventually, the PM detects that S2 is under-utilized and initiates partition movement.  After the PM finishes load balancing, it updates the GPM and keeps every server informed of the latest GPM.  However, clients still have the old version of the partition map.  Here is how the cluster looks:


Figure 4: Routing Client – Two Servers – Post load balance activity

Note that the GPM is updated and its version bumped on the servers, while the client still has the old version.

If the client tries to operate on partitions P1 (1-5) or P2 (6-10), those operations succeed.  However, when it tries to operate on partitions P3 (11-15) or P4 (16-20), it talks to S1 (as per its own LPM).  S1 then responds ‘I am not the Primary’, per its own copy of the partition map.  The client takes that as a hint and initiates an LPM update.  Here is how the cluster looks after the LPM update:



Figure 5: Routing Client – Two Servers – Post LPM update

Once the LPM is updated, the client talks to S2 for operations on P3 and P4, which succeed.  The advantage of this model is that load balancing can be done at any appropriate time rather than immediately.  Partition movement can be made efficient to reduce the ‘unavailability’ of a partition, which is a separate topic of its own.
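The refresh-and-retry behavior just described can be sketched as follows; the ‘NOT_PRIMARY’ reply and the refresh-from-GPM step are illustrative stand-ins for the real protocol:

```python
# Sketch of the stale-map protocol: the client retries after refreshing
# its local partition map when a server answers "not primary".
# Server behavior and all names here are illustrative assumptions.

RANGE_SIZE = 5
def partition_of(key): return (key - 1) // RANGE_SIZE

GPM = {0: "S1", 1: "S1", 2: "S2", 3: "S2"}   # post-load-balance truth

def server_get(node, key):
    """A server only serves partitions it owns per the latest map."""
    if GPM[partition_of(key)] != node:
        return "NOT_PRIMARY"
    return f"value-of-{key}"

def client_get(key, lpm):
    node = lpm[partition_of(key)]
    reply = server_get(node, key)
    if reply == "NOT_PRIMARY":
        lpm.update(GPM)                # refresh LPM from the PM
        reply = server_get(lpm[partition_of(key)], key)
    return reply

stale_lpm = {0: "S1", 1: "S1", 2: "S1", 3: "S1"}  # client's old view
print(client_get(12, stale_lpm))   # succeeds after one refresh
```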

Windows Azure Cache clients are ‘routing’ clients by default.  If MemCacheD clients are used with the Windows Azure Cache server, there is also another mechanism called the ‘Client Shim’.  The client shim is a sibling module that sits alongside partition-unaware MemCacheD clients to bring in partition intelligence.



Laxmi Narsimha Rao Oruganti (alias: LaxmiNRO, alias: OLNRao)


Of Home and Shelves

For a change, I want to talk about my home. Well, not really; it is my attempt to explain the memory hierarchy (volatile and persistent) in a computer.

Have you ever observed a kitchen (or dressing room) in your home?  Let us take a look at it to understand and articulate.

1) Small shelf (< 10 jars) very near the stove – We usually keep tea powder, sugar, salt, popu (పోపు), chilli powder, and oil on this shelf

2) Medium shelf (10 to 50 jars) within easy reach of our hands, but a little away from the stove – We usually keep all other raw materials, such as red gram dal, black gram dal, Idli Rawa, Bombay Rawa, Wheat Rawa, vermicelli, etc., and utensils such as plates and glasses, on this shelf

3) Big shelf (storage for cartons, gunny bags, drums, etc.) that is beyond the reach of our hands and requires a ladder – We usually keep storage items such as rice bags, oil cartons, and unused or rarely used items (dinner sets) on this shelf

4) Apart from the shelves, we also have a work area near the stove, where items are brought for a temporary purpose and placed back in their respective places when we are done with them.

I hope you now have a rough idea of how a kitchen is organized.  Here is the top-level reasoning behind the organization of items.

– The time taken to bring items to the stove increases in the order: small shelf, medium shelf, big shelf

– The more frequently you use an item, the nearer it is kept to the stove – salt, chilli powder, etc.

– The bigger the item, the more likely we split it into parts, keeping one part handy and the rest farther away – rice bags are split into a small portion (10 KG) and the rest; the small portion is kept within reach and the rest can be kept far away

Let us imagine a kitchen where things are not organized by the above rules/guidelines. What would be the side effect?  We would end up taking more time to prepare food than in the current model of organization.

Now let us analyze the stove.  Let us say I have a stove ‘A’ that can cook a curry in 30 minutes if all items are supplied without any delay (the ideal case).  Because there will be delays in bringing the required ingredients and mixing them into the curry, we need to count that; we might be lowering the stove flame to account for ingredient transport.  Let us say we spend 30 minutes getting the ingredients ready for use.  So, we can prepare one curry with stove ‘A’ in one hour.  Now suppose I become rich and can afford a better stove ‘B’ that can prepare a curry in just 20 minutes (ideal case); the actual preparation time would be 50 minutes (with a transport time of 30 minutes).  Richer still, a better stove ‘C’ with a 10-minute preparation time (ideal) would take 40 minutes in total.

Ouch, we are severely inefficient at bringing the ingredients to the stove.  So let us get an assistant to help.  With the assistant’s help, we reduce the ingredient transport time to 20 minutes (from 30 minutes).  So, with stove ‘C’ and the assistant’s help, we can finish a curry in 30 minutes.  Wow, yummy curry in 30 minutes!

Well, we become smarter at doing things with practice.  We learn which ingredients are required at what time, so why wait until that time?  That is, we can predict the ingredient-to-time map and keep ingredients ready at the right moment, making sure we don’t unnecessarily occupy the space near the stove.  With this prediction, let us say we reduce ingredient transport to 10 minutes.  That means yummy curry preparation with stove ‘C’ and prediction takes just 20 minutes!

With that gyan of home, kitchen, shelves, and stoves, let us compare this whole system with a computer, especially the memory hierarchy.

1) Small shelf – On-chip processor cache

2) Medium shelf – RAM

3) Big shelf – Hard disk

4) Work area – Registers (AX, BX, CX, DX, SP, IP, etc.) + processor cache

5) Stove – CPU

6) Prediction – Instruction pipelining, branch prediction

What shall we keep in the processor cache? The most frequently accessed data.  With ‘prediction’ smartness, we also keep the data that is required next.

What shall we keep in RAM? What we use periodically – not super frequently, but moderately – such as the currently running program’s code pages, the usual program runtime data, etc.

What shall we keep on disk? Big files, videos, binaries, etc.  If we need a file, we don’t load the whole file into RAM but fetch parts of it (much like we take 6-10 KG of rice from a rice bag).

As we found in the discussion, the more time we spend on ingredient transport, the more we need to run the stove on a lower flame.  Similarly, the more time it takes to get data from RAM/disk, the less efficiently the CPU is used.  Performance experts refer to this as the ‘memory wall’.

Just as we become smart at predicting which ingredients are required at what time, so as to avoid running the stove on a lower flame, modern processors use techniques like caching, prefetching, and branch prediction to reduce the memory wall problem for CPU processing.
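A toy cache simulator makes the point concrete: sequential access reuses each fetched line, while line-sized strides miss every time. The cache geometry here is made up purely for illustration:

```python
# Toy direct-mapped cache simulator: counts hits for different access
# patterns. Cache geometry is an illustrative assumption.

CACHE_LINES = 4        # the cache holds 4 lines
WORDS_PER_LINE = 8     # each line holds 8 consecutive words

def hit_count(addresses):
    cache = [None] * CACHE_LINES       # tag stored per slot
    hits = 0
    for addr in addresses:
        line = addr // WORDS_PER_LINE  # which memory line this word is in
        slot = line % CACHE_LINES      # direct-mapped slot for that line
        if cache[slot] == line:
            hits += 1
        else:
            cache[slot] = line         # miss: fetch the whole line
    return hits

sequential = list(range(64))                       # walk memory in order
strided = [i * WORDS_PER_LINE for i in range(64)]  # jump a full line each time

print(hit_count(sequential), hit_count(strided))   # prints: 56 0
```

Sequential access pays for each line once and then gets seven free hits, like keeping salt right next to the stove; the strided walk fetches a fresh line every time, like climbing the ladder for every pinch of an ingredient.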

If we have 200 KG of rice (4 rice bags of 50 KG each), we might keep 3 full rice bags on the big shelf and have another shelf – farther away than the medium shelf but not requiring a ladder – for the ‘currently used’ rice bag.  Sometimes we may have a kitchen without a big shelf at all, in which case we don’t have much storage space and won’t keep that many bags of rice.

Just as we can add more types of shelves, the same applies to computers. We now have Solid State Drives, which sit between RAM and the hard disk.  We can also have a computer without a hard disk; it just means we don’t have much storage space.

With all this kitchen discussion, I am feeling hungry.  I will be back with another article on technology in simple English.  Till then…

Laxmi Narsimha Rao Oruganti

Distributed and Consistent Hashing – Part 1

Windows Azure Cache (WA Cache) is a distributed in-memory cache. WA Cache provides a simple <Key, Value> based API, e.g., Cache.Put (key, value) and Cache.Get (key). You can correlate the WA Cache API to that of a Hash. This blog post series tries to take the reader from a traditional one-node Hash to a distributed Hash.  This is the first post of the series.


A Hash is an in-memory data structure indexed by key. A Hash is generally implemented as an array of collections. The array is referred to as the Hash Directory, or simply the directory. Each collection is referred to as a Hash Bucket, or simply a Bucket. The array index of a bucket is called the Bucket Id.

Since a key can be of any data type, the key is first converted to a number, namely the HashCode. There exist a variety of schemes to generate a HashCode for a given key. The HashCode is then converted to a bucket id by another scheme. The simplest scheme is: BucketIndex = HashCode % BucketCount.

In all the examples of this article, we will use numbers as data. Plus, HashCode = Key.
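Here is a minimal single-level hash in Python following these definitions (a sketch for illustration, not production code):

```python
# Minimal single-level hash: an array (directory) of lists (buckets).
# Per the article's convention, HashCode == Key; names are illustrative.

BUCKET_COUNT = 4

class SimpleHash:
    def __init__(self):
        self.directory = [[] for _ in range(BUCKET_COUNT)]

    def put(self, key, value):
        # BucketIndex = HashCode % BucketCount
        bucket = self.directory[key % BUCKET_COUNT]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.directory[key % BUCKET_COUNT]:
            if k == key:
                return v
        return None

h = SimpleHash()
for n in range(11):          # insert 0 to 10, as in Figure 1
    h.put(n, f"v{n}")
print(h.get(7))              # prints: v7
```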

Blue boxes are Hash directories and Green boxes are Hash buckets. The Hash looks like the diagram below if we insert 0 to 10.


Figure 1 : Normal Hash

What I have described above is single-level hashing. Let us now delve into multi-level hashing. In multi-level hashing, the leaf level has hash buckets and all other levels are hash directories.

An example of two-level hash is given below. Data is 0 to 10.


Figure 2 : Multi Level Hash

Thus far, we have discussed in-memory hashing (of different levels) where all the memory is within a single node. If we were to hash data whose size is in gigabytes (or terabytes), the only way to use this hashing is to have a machine with that much memory/RAM, which you know is super costly. The idea that immediately strikes us is: why not use memory from multiple nodes? Namely, distributed hashing. The capacity of a distributed hash is the sum of the capacities of the node hashes.

Now, with multiple nodes coming into the picture, hashing has to have a provision to identify the ‘node’ on which to place/retrieve data. Hence, we need (and assume) a unique identifier for each node – the Node Id.

The Distributed Hashing API is no different from the regular Hashing API. That means we need a way to find the node on which a key would be stored. Once we identify the node, we then use regular hashing within that node. As you can see, we have communication over a network in a typical client-server system. Each node runs the server side and the actual user application runs the client side.

Instead of jumping into perfect solution, let us go step-by-step with improvements.

Simple Client – Simple Server:

In this scheme, the client identifies the node by applying the modulo operator with the node count.

Node Id = Hash Code % Node Count

The client talks to the identified server, and the server then uses its own hashing scheme to get/put the user-supplied key.


Figure 3 : Distributed Hash – Simple Client and Simple Server
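In code, the whole scheme is one modulo operation (HashCode == Key per our convention; names are illustrative):

```python
# Sketch of the simple scheme: the client picks a server by modulo.
# HashCode == Key per the article's convention; names are illustrative.

def owner_node(key: int, node_count: int) -> int:
    return key % node_count        # Node Id = Hash Code % Node Count

# With 2 servers, key 3 lives on node 1:
print(owner_node(3, 2))   # prints 1

# After adding a third server, the same key maps somewhere else:
print(owner_node(3, 3))   # prints 0
```

The second call already hints at the trouble discussed next: the mapping silently changes the moment the node count changes.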

Simple, isn’t it? Let us find the problems in this model.

The application has been inserting data into the hash, and the distributed hash has run out of capacity (how one detects that the hash ran out of capacity is altogether a separate topic and I will not digress into it; let us assume there is a mechanism to detect this and inform the user). To increase capacity, one more server has been added. Just after adding the server, the system looks like this:


Figure 4 : Distributed Hash – Simple Client and Simple Server – Cluster Expansion

Now the client wants to get key 3. Node Id = 3 (Hash Code) % 3 (Node Count) = 0. The client then talks to server 0; server 0 can’t find the key ‘3’. The client is now confused by the ‘key does not exist’ answer, as it had put the key ‘3’ into the hash. What went wrong? The client’s identification of the node is tied to the server count.

Remedy 1: Reapply the hashing scheme to existing data whenever there is a change in the server/node count, and redo the placement. Let us call this procedure ‘data movement’. Here is how the distributed hash looks after data movement:


Figure 5 : Distributed Hash – Simple Client and Simple Server – Data Movement

Now the client’s get of key ‘3’ succeeds, as server 0 has the key and returns it. Let us look at the problems with this.

‘Red’ colored data represents moved data. As you can observe, around 70%+ of the data participated in the data movement. That is a huge amount and is inefficient. The problem gets aggravated if nodes go down and come up frequently.
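We can quantify the movement: count the keys whose owner changes when the node count goes from 2 to 3 under the modulo scheme (an illustrative calculation over a uniform key set):

```python
# Fraction of keys whose owner changes when the cluster grows
# under the modulo placement scheme (HashCode == Key).

def moved_fraction(key_count: int, old_nodes: int, new_nodes: int) -> float:
    moved = sum(1 for k in range(key_count)
                if k % old_nodes != k % new_nodes)
    return moved / key_count

print(moved_fraction(1000, 2, 3))   # roughly 2/3 of keys move
```

For a uniform key set, roughly two-thirds of the keys change owners going from 2 to 3 nodes, which is in line with the large ‘red’ fraction in the figure above.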

Taking a step back and rethinking of placement of data in nodes may yield better results.

Scheme: Divide the hash code space into ‘Node Count’ contiguous ranges; each range is owned by one node. The client performs the same division when doing a get/put to identify the node.

A Hash Code is a number; if we consider it a 32-bit integer, the hash space is MIN_INT32 to MAX_INT32. For simplicity of our discussion, let us take the Hash Code as a 4-bit unsigned integer, which makes the hash code space [0, 15].

When there is only one node, whole range is owned by that one node (Server 0).


Figure 6 : Distributed Hash – Range Based – One Node Cluster

Let us add one more node. As the number of ranges is equal to the number of nodes, we will have to divide the hash code space into two ranges. As a result, server 0 now owns [0-7] and server 1 owns [8-15].


Figure 7 : Distributed Hash – Range Based – Two Node Cluster

We retain the practice of coloring ‘red’ the data items that participated in data movement. The data moved now is less. Let us add one more node; as a result, we need to divide the whole range into three sub-ranges and move the data.


Figure 8 : Distributed Hash – Range Based – Three Node Cluster – Data Movement

We have reduced the data movement volume, but the data distribution is skewed. Let me introduce another way of distributing data, which may or may not solve the distribution skew.

Instead of assigning contiguous Node Ids to servers, let us assign identifiers with gaps, but within the hash code space. In our example, the HashCode space is [0, 15]; let us give the servers ids 4, 10, and 15. The owner of a key is the server whose id is nearest to the key’s HashCode. The hash with the first two nodes, ids 4 and 10, can be depicted as:


Figure 9 : Distributed Hash – Sparse Node Id and Nearest Node Id

Let us introduce another node, with id 15. We treat the HashCode space as circular, and hence ‘0’ is nearer to server 15 (distance 1) than to server 4 (distance 4). With this scheme of distribution, the hash is:


Figure 10 : Distributed Hash – Sparse Node Id and Nearest Node Id – Data Movement
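This nearest-id scheme on a circular space can be sketched as follows (the tie-breaking rule, when a key is equidistant from two nodes, is my assumption):

```python
# Sketch of the sparse-node-id scheme on the circular hash code space [0, 15].
# Node ids 4, 10, 15 follow the article's example; tie-breaking is an assumption.

SPACE = 16   # 4-bit hash codes

def ring_distance(a: int, b: int) -> int:
    d = abs(a - b)
    return min(d, SPACE - d)          # distances wrap around the circle

def owner(code: int, node_ids: list[int]) -> int:
    """Owner is the node whose id is nearest to the key's hash code."""
    return min(node_ids, key=lambda n: ring_distance(code, n))

# With two nodes (ids 4 and 10), key 0 belongs to node 4:
print(owner(0, [4, 10]))       # prints 4
# After adding node 15, key 0 moves: its distance to 15 is only 1:
print(owner(0, [4, 10, 15]))   # prints 15
```

Note that adding node 15 only relocates keys near its position on the circle; keys closest to 4 or 10 stay put, which is exactly the limited data movement the figure shows.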

The beauty of the data distribution schemes discussed so far is that the client’s way of identifying the server where data should be placed/found is the same as the server’s way of distributing data. In other words, client and server have a common, agreed way of distributing data. If we get rid of this common understanding, we will find other ways of distributing data. That decoupling, and how clients learn the server’s distribution model, is the topic of the next blog post.