HBase Capacity Planning – Part 1

HBase is a distributed, fault-tolerant, persistent, NoSQL store in the Hadoop stack.  Hadoop has been in the market for a long time now, but I could not find any simple guide to HBase capacity planning for our usage.  I am sure there are a great many articles on the internet that discuss Hadoop capacity planning (including HBase), but somehow they did not cover the details I was looking for.  Here is my version of it.

This is going to be a lengthy blog post, as I have tried to explain what's happening inside HBase with each knob and the side effects of changing it.  I would like to first convey my sincere thanks to all those who have blogged about different parameters and their experiences playing with them for their requirements in their deployments.  Without them, this blog post is nothing.  Where possible, I have given a reference URL.  If I have missed any reference, please pardon me.

HBase is not just a storage node, but also a compute node.  Hence, HBase capacity planning has to consider not just storage requirements but compute requirements too.  That means all the standard resources: disk, CPU cores, memory, and network.  In fact, as you go through this blog you will realize that it is memory that demands more capacity than disk (at least in our case).  But Java's G1 GC has come to the rescue!

I am assuming you know how HBase works (esp. Regions, Region Servers, Column Family, HFiles, MemStore, HMaster, etc.).

Nevertheless, here are some basic details you should know.

– Each Column Family's data is stored in separate HFiles (also called store files).  Each Region has its own HFiles (at least one) and one MemStore per Column Family.

– Within a Region, data gets reorganized from time to time by a process called 'Compaction' (similar to checkpoints in the relational database world).  Minor Compaction merges multiple small HFiles into one bigger HFile.  Major Compaction merges all HFiles of a Region into one big HFile.  (A toy sketch of the idea follows the reference below.)

Read this if you can: http://hbase.apache.org/book.html#_compaction
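
To make the distinction concrete, here is a toy Python sketch of the two compaction flavors.  It only illustrates the idea (merge a few small files vs. merge everything); the file names, sizes, and the "pick the smallest files" rule are made up and are not HBase's actual compaction selection algorithm.

```python
# Toy model: an HFile is just a dict with a name and a size.

def minor_compact(hfiles, max_files=10):
    """Merge up to max_files of the smallest HFiles into one bigger HFile."""
    picked = sorted(hfiles, key=lambda f: f["size_mb"])[:max_files]
    remaining = [f for f in hfiles if f not in picked]
    merged = {"name": "minor-merged", "size_mb": sum(f["size_mb"] for f in picked)}
    return remaining + [merged]

def major_compact(hfiles):
    """Merge all HFiles of the region into one big HFile."""
    return [{"name": "major-merged", "size_mb": sum(f["size_mb"] for f in hfiles)}]

region_hfiles = [{"name": f"hfile-{i}", "size_mb": s}
                 for i, s in enumerate([10, 12, 9, 300, 15])]
print(len(minor_compact(region_hfiles, max_files=3)))  # 3: two untouched + one merged
print(len(major_compact(region_hfiles)))               # 1
```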

I will first discuss the different HBase parameters and the design considerations that go into setting a proper value for each.

Each entry below gives the config ID, the configuration name and property, the design considerations behind it, and the value we settled on.

C1. HBase: Max Region Size (hbase.hregion.max.filesize)

If we use a very large value for Max Region Size (and so Max HFile Size), a major compaction can result in a huge disk write, as large as the Max Region Size itself.

Optimal Value: 15 GB

C2. HBase: Heap Percentage (hbase.regionserver.global.memstore.lowerLimit)
(see also: hbase.regionserver.global.memstore.upperLimit, hbase.hstore.blockingStoreFiles)

This knob (the lower limit) caps the fraction of heap used by MemStores.  The flush thread starts flushing MemStores when memory usage rises above the lower limit and keeps flushing until usage falls back below it.  There is a catch, though: the flusher thread also has to respect the number of HFiles already on disk (hbase.hstore.blockingStoreFiles).

The upper limit controls when updates are blocked until flushing catches up.

Optimal Value: 0.4

C3. HBase: MemStore Cache Size before flush, in effect the Max MemStore Size (hbase.hregion.memstore.flush.size)

As the number of Regions grows, the number of MemStores grows in equal proportion.  That means the memory requirement grows too, driven by the number of MemStores times the Max MemStore Size.  If you keep this value low with memory in mind, many small HFiles get created, resulting in many compactions.  Not only that, the amount of HLog (WAL) to keep also grows with the Max MemStore Size.

Optimal Value: 128 MB

C4. HBase: Compact Threshold HFiles (hbase.hstore.compactionThreshold)

This knob controls when a Minor Compaction is run (which indirectly controls Major Compactions).

Each HFile (or store file) consumes resources such as file handles, heap for holding metadata, bloom filters, etc.

Not just that: the number of bloom-filter lookups increases (even though they are in memory), so reads become slower.

Optimal Value: 3

C5. HBase: Max HFiles on disk for a region (hbase.hstore.blockingStoreFiles)

If a region already has this many store files on disk, the flusher thread is blocked.  If we block the flusher thread, the MemStore for the region eventually fills up.  If the MemStore fills up, writes to that region are blocked.

Optimal Value: 15

C6. HBase: Minor Compaction Max HFiles (hbase.hstore.compaction.max)

Maximum number of files in a single minor compaction; it can pick fewer, but never more.

Optimal Value: 10

C7. HBase: Major Compaction Interval (hbase.hregion.majorcompaction)

Time between automatic major compactions of a region.  A major compaction always rewrites all HFiles of the region into one, so there is no file-count cap to tune here.

Combination Check

The total WAL (HLog) capacity should be able to hold everything the MemStores may be holding, i.e.:

hbase.regionserver.global.memstore.lowerLimit * region server heap size <= (hbase.regionserver.hlog.blocksize * hbase.regionserver.logroll.multiplier * hbase.regionserver.maxlogs)

Combination Check

hbase.hstore.blockingStoreFiles >= (2 * hbase.hstore.compactionThreshold)

(Both combination checks are verified in the sketch that follows this list.)

C8. HBase: Block Cache Percentage (hfile.block.cache.size)

The block cache is an LRU cache of blocks used by reads.  If your reads need more than the block cache can hold, the LRU policy kicks in and old blocks are evicted.  A high value reduces cache evictions and mostly improves read performance, but increases GC time.  A low value increases cache evictions and degrades read performance, but reduces GC time.  (Similar to the page cache in RDBMS systems.)

Optimal Value: 0.25

C9. HBase: Client-side Write Buffer Size (hbase.client.write.buffer)

HBase client-side buffer size for write caching.  Until this much data has accumulated (or a flush is called), the HBase client does not send the writes to the HBase server.  A high value increases the risk of data loss but reduces the load on the HBase cluster; a low value reduces the risk of data loss but increases the load on the cluster.  This size also affects the HBase Region Server, because that much data arrives from a client in a single go and has to be processed by the server.  Rough server-side memory requirement = hbase.client.write.buffer * hbase.regionserver.handler.count.

Optimal Value: 2 MB

HBase: Column Family Names, Qualifier Names

Keep them very short (1 to 3 characters).  People often point to row compression (FAST_DIFF) as a savior that frees you from worrying about the length of these names.  That is definitely true for the on-disk representation (i.e. HFiles).  However, while the data sits in memory (MemStores, etc.), rows are stored in full form without any compression.  That means lengthy names waste a lot of memory, so fewer rows fit.

C10. HDFS: Replication Factor

Optimal Value: 3
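
The following Python sketch is just a convenient record of the values above plus the two combination checks; it is not an official API.  The region server heap size, the handler count, and the WAL settings (hlog.blocksize, logroll.multiplier, maxlogs) used here are assumed example values, not recommendations.

```python
MB = 1024 ** 2
GB = 1024 ** 3

conf = {
    "hbase.hregion.max.filesize": 15 * GB,                  # C1
    "hbase.regionserver.global.memstore.lowerLimit": 0.4,   # C2 (fraction of heap)
    "hbase.hregion.memstore.flush.size": 128 * MB,          # C3
    "hbase.hstore.compactionThreshold": 3,                  # C4
    "hbase.hstore.blockingStoreFiles": 15,                  # C5
    "hbase.hstore.compaction.max": 10,                      # C6
    "hfile.block.cache.size": 0.25,                         # C8
    "hbase.client.write.buffer": 2 * MB,                    # C9
    # Assumed WAL-related values, needed for the first check:
    "hbase.regionserver.hlog.blocksize": 128 * MB,
    "hbase.regionserver.logroll.multiplier": 0.95,
    "hbase.regionserver.maxlogs": 32,
}

heap_bytes = 8 * GB     # assumed region server heap, for illustration only
handler_count = 30      # assumed hbase.regionserver.handler.count

# Check 1: total WAL capacity should cover the global MemStore lower limit.
wal_capacity = (conf["hbase.regionserver.hlog.blocksize"]
                * conf["hbase.regionserver.logroll.multiplier"]
                * conf["hbase.regionserver.maxlogs"])
assert conf["hbase.regionserver.global.memstore.lowerLimit"] * heap_bytes <= wal_capacity

# Check 2: blocking store files should be at least twice the compaction threshold.
assert conf["hbase.hstore.blockingStoreFiles"] >= 2 * conf["hbase.hstore.compactionThreshold"]

# C9: rough server-side memory tied up by in-flight client write buffers.
print(conf["hbase.client.write.buffer"] * handler_count / MB, "MB")   # 60.0 MB
```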

 

Now that we have discussed the different configuration parameters, let's get into data points.  Data point values depend completely on your business scenario, HBase table schema, etc.

Each data point below has an ID, a name, a description, and the value we used.

D1. Seed Days: No. of days of data HBase is pre-populated with offline (seed data) before we bring it online.  Value: 365 days

D2. Daily Row Count: No. of HBase rows inserted into the table daily.  Value: 100 Million

D3. Row Size

Rough size of a row, measured from your own trials (it depends heavily on your schema, data value sizes, etc.).  Insert about 10K records into the table, issue flush and major_compact from the HBase shell, then get the HFile size from HDFS.  Enable all the options you plan to use in production (row compression, bloom filters, prefix encoding, Snappy compression of data, etc.) for this exercise of measuring the row size.  (See the sketch after this list.)

Value: 1024 Bytes

D4. Column Family Count: No. of column families in your schema definition.  Value: 2

D5. Data Write Throughput per Core

No. of bytes of row data that can be inserted into the HBase table per core per second.  Our observation is given as the value.

There is also another reference point: ~500 KB per core (93.5 MB/sec with 16 Nodes * 2 Processors / Node * 6 Cores / Processor), from http://hypertable.com/why_hypertable/hypertable_vs_hbase_2/

As you can see, this is highly dependent on your schema.  For us, we could observe only about 250 KB per core.

Value: 250 KB/sec per core
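
As a side note, the D3 measurement boils down to a simple division.  A minimal sketch, assuming a hypothetical 10 MB HFile reported by HDFS after loading 10K rows:

```python
# D3 in practice: per-row size ~= HFile size / rows loaded.
rows_loaded = 10_000
hfile_bytes = 10 * 1024 * 1024    # assumed example HFile size on HDFS
row_size_bytes = hfile_bytes / rows_loaded
print(round(row_size_bytes))      # ~1049 bytes, close to the 1024 bytes used for D3
```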

 

Let's define the hardware specification per node (example: Sun X42L Server).

 

Each resource below lists the actual specification, the usable percentage, and the effective available amount (= Actual Resource * Usable Percentage / 100).

Disk: No. of Disks * Disk Size = 10 * 24 TB = 240 TB; Usable: 60%; Effective Available: 131481 GB (approx.)

CPU Cores: No. of Processors * No. of Cores per Processor = 2 * 4 = 8 cores; Usable: 60%; Effective Available: 4 cores

Physical Memory (RAM): 240 GB; Usable: 60%; Effective Available: 148 GB

JVM Memory: The Hadoop stack is a Java code base, so all services run as JVMs, and a JVM is limited by its max heap size.  How large a heap can usefully be depends heavily on how well garbage collection works.

Up to Java 6, the most reliable GC was the Concurrent Mark Sweep (CMS) GC, which worked well only up to about a 24 GB heap.

With Java 7, the G1 GC was introduced; G1 has been shown to work well with heaps of around 100 GB.

http://blog.cloudera.com/blog/2014/12/tuning-java-garbage-collection-for-hbase/

Usable: 80%; Effective Available: 19.2 GB (CMS) or 80 GB (G1)  (see the sketch below)
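
The JVM-memory row is just the GC-friendly heap ceiling multiplied by the usable percentage; a quick sketch:

```python
# Effective Available = Actual Resource * Usable Percentage / 100
def effective(actual, usable_pct):
    return actual * usable_pct / 100.0

print(effective(24, 80))    # CMS GC heap ceiling  -> 19.2 GB
print(effective(100, 80))   # G1 GC heap ceiling   -> 80.0 GB
```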

 

Let’s calculate the resource requirements

Disk:

If we have to plan for 1 year ahead, Total No. of Days = Seed Days (D1) + 365 Upcoming Days = 730 Days

No. of Rows = No. of Days *  Daily Row Count (D2) = 730 * 100 Million = 73000 Million

Disk Size w/o Replication = No. of Rows * Row Size (D3) = 73000 Million * 1 KB = 73000 GB

Disk Size w/ Replication = Replication Factor (C10) * Disk Size w/o Replication = 3 * 73000 GB = 219000 GB
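
A small Python sketch of the disk arithmetic above (it follows the decimal KB-to-GB conversion used here):

```python
# Disk sizing from D1-D3 and C10.
seed_days, upcoming_days = 365, 365       # D1 plus one year ahead
daily_rows = 100_000_000                  # D2
row_size_kb = 1                           # D3 (~1024 bytes)
replication_factor = 3                    # C10

total_rows = (seed_days + upcoming_days) * daily_rows        # 73,000 million rows
disk_gb_no_repl = total_rows * row_size_kb / 1_000_000       # 73,000 GB
disk_gb_with_repl = disk_gb_no_repl * replication_factor     # 219,000 GB
print(disk_gb_no_repl, disk_gb_with_repl)
```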

Memory:

No. of Regions = Disk Size w/o Replication / Max Region Size (C1) = 73000 GB / 15 GB = 4867

No. of MemStores (one MemStore per Region per Column Family) = No. of Regions * No. of Column Families (D4) = 4867 * 2 = 9734

Memory Size of MemStores = No. of MemStores * Max MemStore Size (C3) = 9734 * 128 MB = 1217 GB

Memory Size = Memory Size of MemStores / Heap Percentage (C2) = 1217 GB / 0.4 = 3043 GB
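
The same memory arithmetic in Python, rounding up at each step as in the figures above:

```python
# MemStore-driven memory sizing from C1, C2, C3 and D4.
import math

disk_gb_no_repl = 73000
max_region_gb = 15                  # C1
column_families = 2                 # D4
memstore_flush_mb = 128             # C3
memstore_heap_fraction = 0.4        # C2

regions = math.ceil(disk_gb_no_repl / max_region_gb)              # 4867
memstores = regions * column_families                             # 9734
memstore_gb = math.ceil(memstores * memstore_flush_mb / 1024)     # 1217 GB
total_heap_gb = math.ceil(memstore_gb / memstore_heap_fraction)   # 3043 GB
print(regions, memstores, memstore_gb, total_heap_gb)
```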

CPU:

Row Count per Second = Daily Row Count (D2) / No. of Seconds in a Day = 100 Million / (24 * 60 * 60) = ~1200 Rows

Incoming Volume per Second = Row Count per Second * Row Size (D3) = 1200 * 1 KB = 1200 KB

CPU Core Count = Incoming Volume per Second / Data Write Throughput per Core (D5) = 1200 KB / 250 KB = ~5 Cores (approx.)
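
And the CPU arithmetic, using the rounded ~1200 rows/sec figure from above:

```python
# Write-throughput-driven CPU sizing from D2, D3 and D5.
import math

daily_rows = 100_000_000                 # D2
row_size_kb = 1                          # D3
write_kb_per_sec_per_core = 250          # D5

rows_per_sec = daily_rows / (24 * 60 * 60)        # ~1157, rounded above to ~1200
incoming_kb_per_sec = 1200 * row_size_kb          # 1200 KB/sec
cores_needed = math.ceil(incoming_kb_per_sec / write_kb_per_sec_per_core)   # 5
print(round(rows_per_sec), incoming_kb_per_sec, cores_needed)
```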

 

Let’s find no. of nodes …

Node Count w.r.t. Disk = Total Disk Size (Required) / Usable Disk Size per Node = 219000 GB / 131481 GB = 2 (Ceiled)

Node Count w.r.t.  CPU = Total CPU Cores (Required) / Usable Core Count per Node = 5 Cores / 4 Cores = 2 (Ceiled)

IMPORTANT: The node count w.r.t. memory is limited by JVM memory.  Please note that we cannot run multiple HBase Region Servers on a single node; that is, one Region Server JVM per node.  That means the memory usable on a node is the minimum of the JVM memory and the physical memory.

Node Count. w.r.t. Memory (CMS GC) = Total Memory Size (Required) / Min (JVM Memory, Physical Memory) = 3043 GB / Min(19.2 GB, 148 GB) = 159(Ceiled)

Node Count. w.r.t. Memory (G1 GC) = Total Memory Size (Required) / Min (JVM Memory, Physical Memory) = 3043 GB / Min(80 GB, 148 GB) = 39 (Ceiled)

Node Count w.r.t. all resources = Max (Node Count w/ each Resource) = 39 (G1 GC), 159 (CMS GC)
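
Putting it together, the node count is driven by whichever resource needs the most nodes; a sketch using the numbers above:

```python
# Node count: size for the scarcest resource (disk, CPU cores, memory).
import math

def nodes_needed(required, usable_per_node):
    return math.ceil(required / usable_per_node)

disk_nodes = nodes_needed(219000, 131481)             # 2
cpu_nodes = nodes_needed(5, 4)                        # 2
mem_nodes_cms = nodes_needed(3043, min(19.2, 148))    # 159
mem_nodes_g1 = nodes_needed(3043, min(80, 148))       # 39

print(max(disk_nodes, cpu_nodes, mem_nodes_cms))      # 159 nodes with CMS GC
print(max(disk_nodes, cpu_nodes, mem_nodes_g1))       # 39 nodes with G1 GC
```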

As you can see, it is memory, not disk, that demands more nodes.  Note also that, had G1 GC not made a big JVM heap practical, a lot of memory on each node would have been lying idle.

 

Does virtualization (VMs) help?

HBase uses Log Structured Merge (LSM) Trees for its indexes in HFiles.  LSM demands that the disk is not shared with any other process.  As a result, Hadoop experts do not recommend VMs, because the disk gets shared between VMs and performance degrades very badly.  But wait: multiple disks on the same node can help.  By multiple disks, I am not talking about virtual disks, but physical spindles.  If your hardware (bare-metal node) has multiple disk spindles, and you can create the VMs in such a way that no two VMs share the same physical spindle/disk, then by all means one can go for virtualization.

 

Further work

This article does not discuss reads.  Reads are affected by the C8 and C9 configurations, but in our case the read workload did not change our capacity planning much.

 

Thanks,

Laxmi Narsimha Rao Oruganti

 

Update-1: The Memory Size calculation was wrong.  It should consider only Disk Size w/o Replication, but it was using Disk Size w/ Replication.  Fixed now.
