Consistency != Correctness

We had a healthy lunch-table discussion on Consistency vs. Correctness in the context of data stores (e.g., relational databases and NoSQL stores).  Many times people assume that if a data store guarantees Consistency, the solution is correct.  Coming from a database engine development background, I can clearly distinguish the two.  But I learnt over that lunch-table discussion how differently a non-database-engine developer can interpret an article in this domain.  Thanks to Mustafa Hussain, who made me understand the other way of looking at things.

Coming to the main topic, there was a good article on CAP Twelve Years Later: How the “Rules” Have Changed.

Why do banks have their own Correctness system when Relational Databases provide Consistency?

Banking solution correctness involves many non-database parts, and not just non-database parts but non-software parts too.  A simple ATM transaction is not the same as a database transaction.  A database transaction ensures that the ACID properties hold for the data within the database, whereas an ATM transaction involves databases, the teller machine, physical currency notes, the dispenser, pull-back of unpicked money, currency loading, physical book-keeping, etc.  As you can see, just using a database that guarantees Consistency does not make the banking solution correct.  Much more needs to be done, and hence Consistency != Correctness.

Well, you may then ask:

Why should Financial Banks use costly Relational Databases over cheap NoSQL stores?

I am talking about the case where brand-new financial banking software is to be developed: why should it use Relational Databases over NoSQL stores?

Note that I am talking about NoSQL stores that chose to support Availability and Partition Tolerance and forfeited Consistency, per the famous CAP theorem that the above-cited article revisits.

While Availability is important in the ATM scenario the author chose to describe in the above-cited article, he did not mean to forfeit ‘C’ completely.  We still need consistency within a database at a particular site (e.g., an ATM room) across rows, tables, etc.  For example, a money transfer between two accounts at an ATM requires the consistency guarantee from the data store.  NoSQL stores of today do not provide it.  Hence, the author noted that “Another aspect of CAP confusion is the hidden cost of forfeiting consistency”.
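The transfer example can be made concrete with a small sketch.  This uses SQLite purely for illustration (the table, column names, and balances are made up): debiting one account and crediting another must happen atomically, which is exactly the multi-row consistency guarantee discussed above.

```python
import sqlite3

# Illustrative schema: two account rows whose balances must move together.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 500), (2, 100)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, 1, 2, 200)   # succeeds: balances become 300 and 300
transfer(conn, 1, 2, 1000)  # fails: the debit is rolled back, nothing changes
```

Without the transaction, a crash between the two UPDATEs would leave money debited but never credited; a store that forfeits this guarantee pushes that burden onto the application.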

So, NoSQL stores have their own strong scenarios, but they do not necessarily fit all scenarios.  Banking is one such scenario.

As the cited article's author discussed, once we have NoSQL stores that support both C and A, we can then think of moving banking-like workloads over to such stores.



Laxmi Narsimha Rao Oruganti


Distributed File System – Part 1 (Retake)

This blog post series tries to take the reader from a traditional one-node file system to a distributed file system. This is the first post of the series.

File System (of a scale-up world or in a single node):

The basic responsibilities of a file system are:

  1. Support user objects and operations (outwardly)
    1. Directories – Create, Delete, Rename
    2. Files – Create, Rename, Read, Write, Expand, Shrink, Truncate, Delete
  2. Manage Disk Space (inwardly)
    1. Structure and organization of data on disk
    2. Allocation and de-allocation of disk space

* There are many more responsibilities that file systems own, such as file permissions, symbolic links, etc.  They are all deliberately excluded to keep things short (need I say sweet?).

File System metadata typically contains:

  • Which data blocks constitute a file, and their order
  • Which directories contain which files, and the hierarchy

Let us take an example to get clarity.  In the case of Unix-like OS file systems:

  1. The very first block on the disk is usually the Master Boot Record (MBR)
    1. LILO and GRUB are popular boot loaders
  2. The MBR contains the details of the disk partitions (namely, drives)
  3. Disk Layout: MBR | Partition 1 | Partition 2 | …
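To make the MBR layout concrete, here is a small sketch that parses the four 16-byte primary-partition entries of a classic MBR sector (the sample partition values below are made up for illustration):

```python
import struct

def parse_mbr(sector0: bytes):
    """Parse the four primary-partition entries from a 512-byte MBR sector.
    Classic MBR layout: boot code, then 4 x 16-byte entries at offset 446,
    then the 0x55AA signature at offset 510."""
    assert sector0[510:512] == b"\x55\xaa", "missing MBR signature"
    partitions = []
    for i in range(4):
        entry = sector0[446 + 16 * i : 446 + 16 * (i + 1)]
        # status(1), CHS start(3, skipped), type(1), CHS end(3, skipped),
        # starting LBA(4, little-endian), sector count(4, little-endian)
        status, ptype, lba_start, sectors = struct.unpack("<B3xB3xII", entry)
        if ptype != 0:  # type 0 means the slot is unused
            partitions.append({"bootable": status == 0x80,
                               "type": ptype,
                               "lba_start": lba_start,
                               "sectors": sectors})
    return partitions

# Build a fake MBR with one bootable Linux (type 0x83) partition at LBA 2048
mbr = bytearray(512)
mbr[446:462] = struct.pack("<B3xB3xII", 0x80, 0x83, 2048, 409600)
mbr[510:512] = b"\x55\xaa"
parts = parse_mbr(bytes(mbr))
```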


Figure 1: Disk Layout

On top of the disks, the VFS (virtual file system) layer operates.  It mounts each drive based on the file system that the drive was set up with.

The file system divides the whole drive space into blocks. The block size is usually configurable and is part of the OS installation configuration. The typical block size has been 4 KB for a long time; recently it has moved up to 8 KB. Of these blocks, some are used for the file system's own metadata and the rest are left for use by applications (i.e., data).


Figure 2: ext2 File System – Disk Partition (or Drive) Layout


Figure 3: ext2 File System – Block Group Layout

  1. Drive Layout: Super Block | Block Group 1 | Block Group 2 | …
    1. Super Block: contains the file system info and drive info as a whole (like what type of file system it is, how many blocks are in use, how many blocks are free)
  2. Block Group Layout: Group Descriptors | Block Bitmap | iNode Bitmap | iNode Table | Data Blocks
    1. Group Descriptors: the number of free and used blocks in this block group, the number of free and used iNodes, and the block numbers of the Block Bitmap and iNode Bitmap
    2. Block Bitmap: each bit represents whether a particular block is free or used
    3. iNode Bitmap: each bit represents whether a particular iNode is free or used
    4. Data Blocks: the remaining blocks, holding directory entries and file contents
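Given this layout, locating an iNode is simple arithmetic once the superblock tells us how many iNodes each block group holds.  A sketch (the 128-byte iNode size and 4 KB block size are typical ext2 defaults, not universal):

```python
def locate_inode(inode_no, inodes_per_group, inode_size=128, block_size=4096):
    """ext2 addresses iNodes starting at 1.  Given the per-group iNode count
    (from the superblock), find which block group holds an iNode, its index
    within that group's iNode table, and which block of the table it sits in."""
    index_in_fs = inode_no - 1                  # iNode numbering starts at 1
    group = index_in_fs // inodes_per_group     # which block group
    index_in_group = index_in_fs % inodes_per_group
    block_in_table = (index_in_group * inode_size) // block_size
    return group, index_in_group, block_in_table
```

For example, with 8192 iNodes per group, iNode 2 (the root directory in ext2) lives in block group 0, while iNode 8193 is the first iNode of block group 1.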

  1. Each file system object (directory, file, symbolic link) is represented by a metadata structure named an iNode.
  2. Internally, all iNodes are addressed by number (namely, the iNode Number), starting with 1
  3. The iNode structure typically contains:
    1. Block Pointers – 12 at Level 0, 1 at Level 1, 1 at Level 2, 1 at Level 3
      1. Level 0 – 12 Pointers to Data Blocks
      2. Level 1 – Pointer to Block of Pointers to Data Blocks
      3. Level 2 – Pointer to Block of Pointers to Blocks of Pointers to Data Blocks
      4. Level 3 – Pointer to Block of Pointers to Blocks of Pointers to Blocks of Pointers to Data Blocks
    2. In the case of a directory, the Data Blocks contain details of the immediate sub-directories and files; for each entry, there is an iNode number
    3. In the case of a file, the Data Blocks contain the actual user data
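These pointer levels determine the maximum file size.  A quick sketch of the arithmetic, assuming 4 KB blocks and 4-byte block pointers (the real ext2 limit is lower because of 32-bit sector counters, so treat this as the addressing limit only):

```python
def max_file_size(block_size=4096, pointer_size=4, direct=12):
    """How much data the 12 direct + 1 single + 1 double + 1 triple
    indirect pointers of an iNode can address."""
    ptrs = block_size // pointer_size     # pointers per indirect block (1024 here)
    blocks = direct + ptrs + ptrs**2 + ptrs**3
    return blocks * block_size

# Direct blocks alone cover 12 * 4 KB = 48 KB; the single indirect adds 4 MB,
# the double indirect adds 4 GB, and the triple indirect adds 4 TB.
```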

The first directory created on the file system is the root “/”, so it comes at the very beginning (in ext2, its iNode Number is 2).

The workflow to open the file “/usr/laxminro/olnrao.txt”:

  • Get the iNode for “/”
  • Find the Data Block details from this iNode
  • From “/” Data Block, find the iNode Number of sub-directory “usr”
  • Get the iNode for “usr” (with iNode Number found above)
  • Find the Data Block details from this iNode
  • From “usr” Data Block, find the iNode Number of sub-directory “laxminro”
  • Get the iNode for “laxminro” (with iNode Number found above)
  • Find the Data Block details from this iNode
  • From “laxminro” Data Block, find the iNode Number of file “olnrao.txt”
  • Get the iNode for “olnrao.txt” (with iNode Number found above)
  • Find the Data Block details from this iNode
  • These data blocks contain the actual data (i.e., the contents of the “olnrao.txt” file)
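The walk above can be simulated with a toy in-memory iNode table (root is iNode 2 as in ext2; the other iNode numbers are made up for illustration):

```python
# Toy iNode table: directories map names to iNode numbers; files hold data.
inodes = {
    2:  {"type": "dir",  "entries": {"usr": 11}},
    11: {"type": "dir",  "entries": {"laxminro": 12}},
    12: {"type": "dir",  "entries": {"olnrao.txt": 13}},
    13: {"type": "file", "data": b"hello"},
}

def resolve(path, root_inode=2):
    """Walk the tree one component at a time, exactly as the steps above:
    read an iNode, consult its data blocks for the next iNode number, repeat."""
    inode_no = root_inode
    for part in path.strip("/").split("/"):
        inode = inodes[inode_no]
        assert inode["type"] == "dir", f"{part}: parent is not a directory"
        inode_no = inode["entries"][part]   # directory data block lookup
    return inode_no

resolve("/usr/laxminro/olnrao.txt")  # → 13, whose data blocks hold the file contents
```

Note that opening a file three directories deep costs four iNode reads plus their data-block reads, which is why real file systems cache directory entries aggressively.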

Now that we have talked at a very high level about what a typical file system does and how it is implemented, let us talk about file system resources and their limitations (esp. hard disks), usage patterns, and the design choices made accordingly.

  • Hard disks are mechanical devices
    • Disk space was a scarce resource
      • Disk fragmentation was a serious concern
    • Disk bandwidth was and is a scarce resource
    • Seek time is in order of milliseconds
      • Random Reads and Random writes incur seek
    • Small writes and small reads were the norm

Applications tried to make sure they used very little disk space and bandwidth.  So, the amount of data read from and written to disk was small.  Note that each read or write could potentially incur a seek.  As a result, file systems introduced buffering: read buffering to serve the next possible read from memory, and write buffering to accumulate enough data before writing to disk.  Buffering also helped order the writes so that the seeks move in one direction rather than to and fro.  Database systems (esp. RDBMSs) are among the heaviest users of file systems, and if one delves into the design choices made in an RDBMS, one can easily see how much file system design choices and hardware limitations had to be dealt with.  I will take a short detour to give a sense of this:
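The write-buffering idea can be sketched as follows (a toy model, not a real file system cache; `device` stands in for the disk, and the threshold is a made-up stand-in for the block size):

```python
import io

class WriteBuffer:
    """Accumulate small application writes in memory and flush them to the
    device in one large sequential write, so many writes cost one seek."""
    def __init__(self, device, threshold=4096):
        self.device = device        # anything with a .write(bytes) method
        self.threshold = threshold
        self.buffer = bytearray()
        self.flushes = 0            # counts device writes (i.e., seeks paid)

    def write(self, data: bytes):
        self.buffer += data
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        if self.buffer:
            self.device.write(bytes(self.buffer))
            self.buffer.clear()
            self.flushes += 1

disk = io.BytesIO()
w = WriteBuffer(disk, threshold=4096)
for _ in range(100):
    w.write(b"x" * 64)   # 100 small application writes
w.flush()                # flush the remainder
```

Here 100 small writes reach the "disk" as only 2 device writes; without the buffer, each 64-byte write could have cost its own seek.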

  • B+ Trees were the common form of on-disk storage of tables and indexes
    • For the record, heap based tables and hash based indexes do exist
  • Within a DB file, space is allocated in terms of extents.  An extent is about 32/64/128 KB.
    • The extent system assumes that an extent's disk space is together (ex: on the same platter)
    • Each extent is usually exclusive to one B+ Tree, so that reads and writes of that B+ Tree's data are collocated; otherwise one page of a B+ Tree could be on one platter and another page on a different platter
  • A typical transaction adds/updates one or two rows in several different tables
    • Ex: an order made by a customer would result in one Orders table row, a few Order Details table rows, a Payments table row, etc.
    • Even with the extent model, this typical usage pattern demands that we touch multiple B+ Trees, which means small writes in different locations
    • Solution: Write-ahead logging – log files serve multiple purposes
      • Bringing ‘append’ and ‘chunk’ semantics to file system usage
      • An atomic way of committing or aborting a transaction
        • Either Commit Transaction or Abort Transaction
        • Atomicity is also possible with other techniques such as ‘Shadow Paging’, but they were not successful because of the limitations of hard disks
      • The actual on-disk B+ Tree structures are updated as part of a Checkpoint
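The write-ahead-logging idea in this detour can be sketched as a toy model (real logs hold undo/redo records, log sequence numbers, recovery logic, and so on; the keys below are made up):

```python
class MiniWAL:
    """Toy write-ahead log: all of a transaction's changes, spanning many
    B+ Trees, become one sequential append; the data structures themselves
    are rewritten later, in bulk, at checkpoint time."""
    def __init__(self):
        self.log = []          # the append-only log (sequential writes only)
        self.tables = {}       # "on-disk" state, updated only at checkpoint
        self.pending = {}      # committed changes not yet checkpointed

    def commit(self, txn_id, changes):
        # One append makes the whole transaction durable, atomically:
        # either this record is in the log or it is not.
        self.log.append(("commit", txn_id, changes))
        self.pending.update(changes)

    def checkpoint(self):
        # Pay the random-write cost later, in the background, in bulk.
        self.tables.update(self.pending)
        self.pending.clear()

wal = MiniWAL()
wal.commit("t1", {"orders:42": "new", "payments:42": "paid"})  # one append
wal.commit("t2", {"orders:43": "new"})
wal.checkpoint()   # scattered B+ Tree updates happen here, not per commit
```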

Pardon me for the detour; coming back to disks.  We discussed mostly optimizations for ‘write’.  At scale, and for reads, the ‘cache tier’ came into play.

Apart from heavy users like RDBMSs, many applications use the file system in different ways, and most of the time write buffering and read buffering helped contain the problem.  It is also important to note that general file systems served a wide variety of applications and workloads, and hence their design choices were constrained.  That may be the reason there are many file systems that have been designed for tailored workloads.

Newer persistent storage technologies such as ‘Solid State Drives’ (SSDs), electronic devices, did not make things any better.  Why, you may ask?  While random reads are not a concern, random writes are.  Random small writes incur the ‘erase block’ semantics of an SSD.  An erase is costly because it requires a charge pump.  There are other problems such as wear leveling, etc.  This whole story is famously known as write amplification.

Interested readers may read the following to get a pulse of hot trends in the persistent memory world: Phase-change memory, Memristor.

File systems continue to be an area of research, with newer storage devices coming up.

Hopefully this has given you the gist of a file system in the one-node world.  That's all for now; I shall come back soon with more on the same topic.  Thanks for reading.



Laxmi Narsimha Rao Oruganti (alias: LaxmiNRO, alias: OLNRao)

Distributed and Consistent Hashing – Part 3

Windows Azure Cache (WA Cache) is a distributed in-memory cache. WA Cache provides a simple <Key, Value> based API, like Cache.Put(key, value) and Cache.Get(key). You can correlate the WA Cache API to that of a hash table. This blog post series tries to take the reader from a traditional one-node hash table to a distributed hash. This is the third post of the series (and assumes you have read the first and second posts).

Distributed systems are gaining momentum because of their promise of economies of scale.  The economy is possible due to the use of ‘commodity’ hardware.  Commodity hardware is cheap, but it fails at a much higher rate than reliable (and costly) hardware.  As a result, distributed systems have been designed to work through different types of failures.

Hardware Failures: Electrical Power Supply outages (#1),  Power Socket Failures, Network Switch Failures, Network Router Failures, Network Card Failures, Hard Disk Failures, Memory Chip Failures.

Hardware Failures in Windows Azure are discussed in Inside Windows Azure – Build 2011 Talk

Software Failures: Crashes due to bugs

Failures due to misconfiguration: simple network misconfigurations can put node/rack/datacenter network communication in trouble.

#1 Electrical Power Supply Outages:

While an Uninterruptible Power Supply (UPS) helps in case of power failures, it can only serve as an alternative power source for a short time (~1 hour).  Any long power outage, or the case where a UPS is not available, needs to be dealt with at the upper layers.

At hardware level: Redundancy is the mantra for many of the problems – redundancy in power supply, redundancy in communication paths, etc. 

At software level: Replication of data, replication of responsibility, and right integration with hardware redundancy.

Solutions must be well integrated across layers to get best results (or sometimes even the desired results).  The importance of integration becomes evident with an example.

Let us say a web server is deployed on two nodes, so that in case one node faults there exists another node to serve requests.  If both of these nodes are connected to the same power socket, any fault in the power socket would result in both nodes going down at the same time.  This essentially means that in spite of having two web server nodes, we landed in trouble.  If we had a way to make sure the two nodes are always connected to distinct sets of resources, it would be a much better model.

A system of categorizing resources into groups/sets is thus necessary.  These groups are called Fault Domains (FDs).  No two FDs share the same power source, network switch, etc.  With FDs in the picture, at the software level any redundancy system just has to make sure to place the redundant copies across FDs.
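The FD-aware placement just described can be sketched as follows (node names and FD ids are hypothetical; a real placement engine would also balance load, capacity, etc.):

```python
def place_replicas(nodes, replication_factor):
    """Pick replica nodes so that no two copies share a fault domain.
    `nodes` maps node name -> fault domain id, in preference order."""
    chosen, used_fds = [], set()
    for node, fd in nodes.items():
        if fd not in used_fds:      # never reuse a fault domain
            chosen.append(node)
            used_fds.add(fd)
        if len(chosen) == replication_factor:
            return chosen
    raise RuntimeError("not enough fault domains for the requested copies")

# n1 and n2 share FD 0 (same power source/switch), so a 2-copy placement
# must skip n2 and land the second copy in a different FD.
placement = place_replicas({"n1": 0, "n2": 0, "n3": 1, "n4": 2}, 2)
```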

We have discussed redundancy as a solution at the software layer to deal with faults.  In the case of stateless software programs, just having another node is sufficient, whereas in the case of stateful software programs (for example, database systems) there is much more to be done.  Traditionally, in the scale-up world, RAID systems were used to protect against bad sectors (checksums), hard disk crashes (multiple copies on different disks), etc.  RAID storage is costly, so it can't be the choice for distributed systems.  The other scale-up world technique has been data replication.  Replication is typically the place where FD knowledge is required, to place copies of data in different FDs.

The moment replication comes into discussion it is important to call out the terminology:

Primary – The node responsible for ‘owning’ replication and interfacing with the client.  There is usually only one primary.

Secondary – A node responsible for ‘cooperating’ in replication.  There can be multiple secondary nodes.

Replication is a vast subject, but I will keep it short.  Depending on how the primary and secondaries agree on replication, there are multiple methods.

Asynchronous Replication – In this model, changes on the primary are committed and the client is acknowledged without waiting for the secondary nodes to be updated.  Replication to the secondaries is either triggered or a background task takes up the responsibility of bringing the secondary nodes up to date.  In short, replication happens asynchronously.  In this model, if the primary node fails, there could be data loss, as the secondary nodes are not up to date.  If the data is not super critical and is OK to lose, this model is apt.

Synchronous Replication – In this model, every change on the primary is synchronously replicated to the secondaries. Until the secondaries respond, changes on the primary are not ‘committed’ (and so the client is not acknowledged). In short, replication happens synchronously. If a secondary node is down, writes are blocked until the node is brought back up.

The number of copies to be maintained is referred to as the ‘Replication Factor’.

If the data is important, an admin would opt for synchronous replication.  The higher the replication factor, the better the fault tolerance.  But keeping the client on hold until all secondary copies are updated pushes an admin to choose a lower replication factor.  It can be argued that the admin is forced to think deeply about the replication factor with this model.  There are places where one might need a higher replication factor, but not at the cost of increased operation times.  There comes the need for a mixture of both synchronous and asynchronous replication models, namely Hybrid Replication.  In this model, one can choose the number of secondary nodes in synchronous mode and the number in asynchronous mode.  Here again, there are two choices: one can designate a ‘fixed set’ of secondary nodes for synchronous replication, or live with ‘any N’ secondary nodes acknowledging.
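The ‘any N’ flavor of hybrid replication can be sketched as follows (a toy model where each secondary is just a callable returning True for an ack; real systems send in parallel, time out, and retry):

```python
def replicate(change, secondaries, sync_acks_needed):
    """Send the change to every secondary, but acknowledge the client as
    soon as any `sync_acks_needed` of them confirm; the rest catch up
    asynchronously.  sync_acks_needed = len(secondaries) degenerates to
    fully synchronous replication; 0 degenerates to fully asynchronous."""
    acks = 0
    for secondary in secondaries:
        if secondary(change):
            acks += 1
        if acks >= sync_acks_needed:
            return True   # commit: enough up-to-date copies exist
    return False          # quorum of acks not reached: block or abort

# Three secondaries, the middle one down (never acks):
secondaries = [lambda c: True, lambda c: False, lambda c: True]
```

With `sync_acks_needed=2` the write commits despite the dead node; with `sync_acks_needed=3` (fully synchronous) the same write blocks, which is exactly the trade-off discussed above.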

It is also important to note that a node can act as both primary and secondary for different replication units.  In the case of database systems, each database is a replication unit.  So, a node can act as primary for database 1 and as secondary for database 2.

In the case of a Routing Client, the client typically knows the primary to reach out to for each replication unit.  Some systems allow clients to read from secondaries.  Depending on which secondary nodes a client is allowed to read from, different levels of consistency result.  Werner Vogels wrote a great blog about the different consistency models (Blog Post, Wikipedia Link).

The primary and secondaries communicate as part of replication, and there are multiple models for this communication.

Pipeline Model – The replication data flow is like a chain: the primary sends the data to the first secondary node, the first secondary then sends the data to the next secondary, and so on.  Windows Azure Storage and the Hadoop Distributed File System use this model.

Multi-Unicast Model – The primary sends the replication data to all ‘synchronous secondary’ nodes separately and waits for an acknowledgement from every one of them.

Multicast Model – In the case of hybrid replication with ‘any N’ acknowledgements, the primary sends the replication data to all secondary nodes (both synchronous and asynchronous) and waits for any N secondary nodes to acknowledge.  The set of N nodes that acknowledge varies for every data block or chunk or packet.
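The pipeline model can be sketched as a chain of forwards with the acknowledgement travelling back up the chain (a toy, in-process model; each "node" is just a dict standing in for a replica's storage):

```python
def pipeline_replicate(data, chain):
    """Pipeline model sketch: the primary sends to the first secondary,
    which forwards to the next, and so on; the ack only comes back once
    the end of the chain is reached.  `chain` is ordered primary-first."""
    def forward(index):
        if index == len(chain):
            return True                  # end of chain: acks flow back up
        chain[index]["data"] = data      # this node persists the bytes...
        return forward(index + 1)        # ...then forwards downstream
    return forward(0)

chain = [{}, {}, {}]                     # primary + two secondaries
ok = pipeline_replicate(b"block-7", chain)
```

The chain shape means each node sends the data exactly once, which keeps the primary's outbound bandwidth flat as the replication factor grows; the multi-unicast model instead spends primary bandwidth proportional to the number of secondaries.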

One major advantage of Windows Azure Cache over MemCacheD is that it supports High Availability (the public name for replication).  Windows Azure Cache supports the synchronous replication model (the number of secondary nodes is fixed at 1, and the communication model is multi-unicast), and each partition is a replication unit.  The cache is an in-memory system, so the replication is limited to in-memory replication.  And that is the catch!  In most stateful systems, a node reboot does not lose the state, as the state is persisted on disk (either locally or remotely).  However, in an in-memory system like Windows Azure Cache, a node reboot results in state loss.  Synchronous replication plus node-reboot-leads-to-state-loss made us (Windows Azure Cache) let clients commit when all secondary nodes are down, since those nodes hold no data that could go out of date by allowing writes.  Windows Azure Cache (as of the 1.8 SDK release) does not allow clients to read from secondary nodes.

Many a time, a cache is used for data that needs to be processed.  Processing involves code to be run.  It is a well-known and established fact that keeping the code and the data near each other helps systems complete tasks fast.

In the case of database systems, stored procedures help bring the code (or business logic) near the data.

In Windows Azure Cache, we have Append, Prepend, Increment, and Decrement APIs to help process values.  It would have been lovely if we had a ‘stored procedure’ model instead of these individual APIs.  That way, any processing could be pushed into such ‘stored procedures’, and we could have simply shipped Append, Prepend, Increment, and Decrement as ‘Microsoft-owned’ stored procedures.
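The ‘stored procedure for a cache’ idea can be sketched like this (a hypothetical mini-cache, not the Windows Azure Cache API): instead of Get + compute + Put (two network hops plus a race window), the client ships an operation name and the server applies it next to the value.

```python
class MiniCache:
    """Toy cache where named 'procedures' run server-side, next to the data.
    Append/Increment-style operations then become just entries in a table."""
    procedures = {
        "increment": lambda value, arg: value + arg,   # for numbers
        "append":    lambda value, arg: value + arg,   # for strings/lists
    }

    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def execute(self, key, proc_name, arg):
        # The 'stored procedure' runs where the data lives; the client
        # sends one small request instead of round-tripping the value.
        self.store[key] = self.procedures[proc_name](self.store[key], arg)
        return self.store[key]

cache = MiniCache()
cache.put("hits", 10)
cache.execute("hits", "increment", 5)   # one hop, no read-modify-write race
```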

This is the last post of the series “Distributed and Consistent Hashing”.  Here are the important lessons/techniques to remember (deliberately explained in generic terms so they can carry forward into other discussions).

– Data Nodes – The cache server nodes that actually store data

– Control Nodes or Metadata Nodes – Partition Manager is one such control node we have discussed that helps manage the data nodes and their partition ownership. 

– Routing Clients – Make the client intelligent enough to talk to data nodes directly, without the need to go through a control node

– Read from Secondary – Load balancing, and allowing different levels of consistency

– Code and Data Distance – If code and data are near, tasks can be completed faster 


That is all for now; I will come back with another series on taking a system from the scale-up world to the scale-out (or distributed) world.



Laxmi Narsimha Rao Oruganti (alias: LaxmiNRO, alias: OLNRao)