HBase is a distributed, fault-tolerant, persistent, NoSQL store in the Hadoop stack. Hadoop has been in the market for a long time now, but I could not find any simple guide to HBase Capacity Planning for our usage. I am sure there exist a great many articles on the internet that discuss Hadoop Capacity Planning (including HBase), but somehow they did not cover the details I was looking for. Here is my version of it.
This is going to be a lengthy blog post, as I have tried to explain what’s happening inside HBase with each knob and the side effects of changing it. I would like to first convey my sincere thanks to all those who have blogged about different parameters and their experiences playing with them for their requirements in their deployments. Without them, this blog post is nothing. Where possible, I have given a reference URL. If I have missed any reference, please pardon me.
HBase is not just a storage node, but also a compute node. Hence, HBase Capacity Planning has to consider not just storage requirements but compute requirements too. That means all the standard resources: disk, CPU cores, memory, and network. In fact, as you go through this blog you will realize that it is memory that demands more capacity than disk size (at least, in our case). But Java’s G1 GC has come to the rescue!
I am assuming you know how HBase works (esp. Regions, Region Servers, Column Family, HFiles, MemStore, HMaster, etc.).
Nevertheless, here are some basic details you should know.
– Each Column Family’s data is stored in a separate HFile (also called a store file). Each Region will have its own HFiles (at least one) and one MemStore per Column Family.
– Within a Region, data gets reorganized from time to time by a process called ‘Compaction’ (like checkpoints in the relational database world). Minor Compaction is a process of merging multiple small HFiles into one bigger HFile. Major Compaction merges all HFiles of a Region into one big HFile.
Read this if you can: http://hbase.apache.org/book.html#_compaction
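Conceptually, a compaction is just an N-way merge of sorted files. Here is a toy sketch of my own (not HBase code) with plain Python lists standing in for HFiles:

```python
import heapq

def compact(hfiles):
    """Merge several sorted 'HFiles' into one bigger sorted 'HFile'.
    With a subset of a region's files this models a minor compaction;
    with all of them, a major compaction."""
    return list(heapq.merge(*hfiles))

# Each toy HFile is a sorted list of (row_key, value) pairs.
hfile1 = [("row1", "a"), ("row4", "d")]
hfile2 = [("row2", "b"), ("row5", "e")]
hfile3 = [("row3", "c"), ("row6", "f")]

merged = compact([hfile1, hfile2, hfile3])
# merged is a single sorted run covering row1 .. row6
```

Real compactions also drop deleted and expired cells while merging, which this sketch ignores.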
I will first discuss the different parameters of HBase and the design considerations for setting a proper value for each.
|Config ID||Configuration Name (Property Name)||Description and Design Considerations||Optimal Value||Reference URL|
|C1||HBase: Max Region Size (hbase.hregion.max.filesize)||
If we set a very large value for Max Region Size (and so Max HFile Size), a major compaction can result in a huge disk write, which could be as large as the Max Region Size.
|15 GB||Click this|
|C2||HBase: Heap Percentage (hbase.regionserver.global.memstore.lowerLimit)
(see also: hbase.regionserver.global.memstore.upperLimit, hbase.hstore.blockingStoreFiles)||
This knob (lower limit) limits the memory to be used by MemStores. The flush thread starts flushing MemStores when memory usage goes above the lower limit, and keeps flushing until usage falls back below it. However, there is a catch: the flusher thread also has to consider the no. of HFiles on disk (hbase.hstore.blockingStoreFiles).
The upper limit controls when to block writes to a region.
|0.4||Click this and this|
|C3||HBase: MemStore Cache Size before flush (in a way, Max MemStore Size) (hbase.hregion.memstore.flush.size)||
As the no. of regions grows, the no. of MemStores grows in equal proportion. That means the memory requirement grows too, driven by the no. of MemStores times the Max MemStore Size. If you keep this value low with memory in mind, many HFiles get created, resulting in many compactions. Not only that, the amount of HLog to keep also increases with Max MemStore Size.
|128 MB|
|C4||HBase: Compact Threshold HFiles (hbase.hstore.compactionThreshold)||
This knob controls when to run Minor Compaction (which indirectly controls Major Compactions).
Each HFile (or store file) consumes resources like file handles, heap for holding metadata, bloom filters, etc.
Not just that, the no. of bloom filter lookups increases (though they are in memory), and so reads become slow.
|C5||HBase: Max HFiles on disk for a region (hbase.hstore.blockingStoreFiles)||
If a region already has this many store files on disk, the flusher thread is blocked. If we block the flusher thread, eventually the MemStore for the region fills up. If the MemStore fills up, writes to that region are blocked.
|C6||HBase: Minor Compaction Max HFiles||
Max no. of HFiles that a minor compaction will pick up; it can pick fewer, but not more.
|C7||HBase: Major Compaction Max HFiles||
Max no. of HFiles that a major compaction will pick up; it can pick fewer, but not more.
hbase.regionserver.global.memstore.lowerLimit * HBASE_HEAPSIZE <= (hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs)
hbase.hstore.blockingStoreFiles >= (2 * hbase.hstore.compactionThreshold)
|C8||HBase: Block Cache Percentage (hfile.block.cache.size)||
The block cache is an LRU cache of blocks and is used by reads. If your reads require more than the block cache holds, the LRU policy is applied and old cache blocks are evicted. A high value for Block Cache reduces cache evictions and mostly improves read performance, but increases GC time. A low value increases cache evictions and degrades read performance, but reduces GC time. (Similar to the page cache in RDBMS systems.)
|C9||HBase: Client side Write Buffer Size (hbase.client.write.buffer)||
HBase client-side buffer size for write caching. Until this memory size is reached (or a flush is called), the HBase client does not send writes to the HBase server. A high value increases the risk of data loss, but reduces load on the HBase cluster. A low value reduces the risk of data loss, but increases load on the HBase cluster. This size also affects the HBase Region Server, as that much data is sent by a client in a single go and has to be processed by the server. Worst-case extra memory requirement on the HBase server = hbase.client.write.buffer * hbase.regionserver.handler.count
|2 MB||Click this|
|NA||HBase: Column Family Names, Qualifier Names||
Keep them very short (1 to 3 characters). People often refer to Row Compression (FAST_DIFF) as a savior that lets you not worry about the length of the names. That is definitely true for the on-disk representation (i.e. HFiles). However, when the data resides in memory (MemStores, etc.), rows are stored in full form without any compression. That means lengthy names waste a lot of memory, resulting in fewer rows held in memory.
|C10||HDFS: Replication Factor||None||3|
Now that we have discussed the different configuration parameters, let’s get into data points. Data point values depend completely on your business scenario, HBase table schema, etc.
|Data Point ID||Data Point Name||Data Point Description||Data Point Value|
|D1||Seed Days||No. of Days HBase pre-populated offline (seed data) before we bring it online||365 days|
|D2||Daily Row Count||No. of HBase Rows that are inserted daily into the table||100 Million|
|D3||Row Size||Rough size of a row measured from your trials (heavily depends on your schema, data value sizes, etc.). Insert about 10K records in the table, issue flush and major_compact from the HBase Shell, then get the HFile size from HDFS. Enable all the options like Row Compression, Bloom Filters, Prefix Encoding, Snappy Compression of data, etc. for this exercise of measuring the row size.||1 KB|
|D4||Column Family Count||No. of Column Families in your schema definition||2|
|D5||Data Write Throughput per Core||
No. of rows that can be inserted into the HBase table per core per second. Our observation is given in the value column.
There is also another reference: ~500 KB/sec per core (93.5 MB/sec with 16 Nodes * 2 Processors/Node * 6 Cores/Processor).
As you can see, this is highly dependent on your schema. For us, we could observe only 250 KB/sec per core.||~250 KB/sec|
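The D3 measurement described above boils down to simple arithmetic once you have the compacted HFile size from HDFS (e.g. via `hdfs dfs -du`). The numbers below are illustrative placeholders, not our measurements:

```python
# Illustrative: size of the compacted HFile (from HDFS) after inserting
# ~10K rows with compression/encoding options enabled.
hfile_bytes = 10_300_000
rows_inserted = 10_000

row_size_kb = hfile_bytes / rows_inserted / 1024
# roughly 1 KB per row, the shape of the D3 value used in this post
```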
Let’s define the hardware specification (per node) (Example: Sun X42L Server):
|Resource Name||Actual Resource Specification||Usable Resource Percentage||Effective Available (= Actual Resource * Usable Resource / 100)|
|Disk||No. of Disks * Disk Size = 10 * 24 TB = 240 TB||60 %||131481 GB (approx.)|
|CPU Cores||No. of Processors * No. of Cores Per Processor = 2 * 4 = 8 Cores||60 %||4 Cores|
|Physical Memory (RAM)||240 GB||60 %||144 GB|
The Hadoop stack is a Java code base, and so all services are JVMs. A JVM is limited by its Max Heap Size, and the practical Max Heap Size depends heavily on how well garbage collection works.
With Java 7, the G1 GC was introduced. G1 GC has been shown to work well even with a 100 GB max heap.
|JVM Max Heap||24 GB (CMS), 100 GB (G1)||80%||19.2 GB (CMS), 80 GB (G1)|
Let’s calculate the resource requirements
If we have to plan for 1 year ahead, Total No. of Days = Seed Days (D1) + 365 Upcoming Days = 730 Days
No. of Rows = No. of Days * Daily Row Count (D2) = 730 * 100 Million = 73000 Million
Disk Size w/o Replication = No. of Rows * Row Size (D3) = 73000 Million * 1 KB = 73000 GB
Disk Size w/ Replication = Replication Factor (C10) * Disk Size w/o Replication = 3 * 73000 GB = 219000 GB
No. of Regions = Disk Size w/o Replication / Max Region Size (C1) = 73000 GB / 15 GB = 4867
No. of MemStores (one MemStore per Region per Column Family) = No. of Regions * No. of Column Families (D4) = 4867 * 2 = 9734
Memory Size of MemStores = No. of MemStores * Max MemStore Size (C3) = 9734 * 128 MB = 1217 GB
Memory Size = Memory Size of MemStores / Heap Percentage (C2) = 1217 GB / 0.4 = 3043 GB
Row Count per Second = Daily Row Count (D2) / No. of Seconds in a Day = 100 Million / (24 * 60 * 60) = ~1200 Rows/sec
Incoming Volume per Second = Row Count per Second * Row Size (D3) = 1200 * 1 KB = 1200 KB
CPU Core Count = Incoming Volume per Second / Data Write Throughput per Core (D5) = 1200 KB / 250 KB = ~5 Cores (approx.)
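The calculations above can be collected into a small script. All inputs come from the config (C1, C2, C3, C10) and data point (D1–D5) tables; note that carrying full precision instead of rounding at each step makes the memory figure differ from the hand calculation by a GB or two.

```python
import math

# Data points and config values from the tables above.
seed_days = 365               # D1
daily_rows = 100_000_000      # D2
row_size_kb = 1               # D3
column_families = 2           # D4
core_kb_per_sec = 250         # D5 (KB/sec per core)
max_region_gb = 15            # C1
heap_fraction = 0.4           # C2 (lowerLimit)
memstore_flush_mb = 128       # C3
replication = 3               # C10

total_days = seed_days + 365                    # plan 1 year ahead
rows = total_days * daily_rows                  # 73,000 Million
disk_gb = rows * row_size_kb / 1_000_000        # w/o replication
disk_gb_repl = disk_gb * replication            # w/ replication
regions = math.ceil(disk_gb / max_region_gb)
memstores = regions * column_families           # one per region per CF
memstore_gb = memstores * memstore_flush_mb / 1024
memory_gb = memstore_gb / heap_fraction
rows_per_sec = daily_rows / (24 * 60 * 60)
cores = math.ceil(rows_per_sec * row_size_kb / core_kb_per_sec)
```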
Let’s find no. of nodes …
Node Count w.r.t. Disk = Total Disk Size (Required) / Usable Disk Size per Node = 219000 GB / 131481 GB = 2 (Ceiled)
Node Count w.r.t. CPU = Total CPU Cores (Required) / Usable Core Count per Node = 5 Cores / 4 Cores = 2 (Ceiled)
IMPORTANT: Node count w.r.t. memory is limited by JVM memory. Please note that we cannot run multiple HBase Region Servers on a single node; that is, one Region Server JVM per node. That means the memory usable on a node is the minimum of JVM memory and physical memory.
Node Count w.r.t. Memory (CMS GC) = Total Memory Size (Required) / Min (JVM Memory, Physical Memory) = 3043 GB / Min(19.2 GB, 144 GB) = 159 (Ceiled)
Node Count w.r.t. Memory (G1 GC) = Total Memory Size (Required) / Min (JVM Memory, Physical Memory) = 3043 GB / Min(80 GB, 144 GB) = 39 (Ceiled)
Node Count w.r.t. all resources = Max (Node Count w/ each Resource) = 39 (G1 GC), 159 (CMS GC)
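The node-count step, as a sketch (per-node usable figures from the hardware table; heap figures from the JVM discussion above):

```python
import math

# Cluster-wide requirements computed in the previous section.
req = {"disk_gb": 219_000, "cores": 5, "memory_gb": 3043}

usable_disk_gb = 131_481           # per node, from the hardware table
usable_cores = 4
physical_mem_gb = 240 * 0.60       # usable physical memory per node
heap_gb = {"CMS": 19.2, "G1": 80}  # usable JVM heap per GC choice

nodes_disk = math.ceil(req["disk_gb"] / usable_disk_gb)
nodes_cpu = math.ceil(req["cores"] / usable_cores)

# One Region Server JVM per node: usable memory is min(heap, physical).
nodes_mem = {
    gc: math.ceil(req["memory_gb"] / min(h, physical_mem_gb))
    for gc, h in heap_gb.items()
}
nodes_total = {gc: max(nodes_disk, nodes_cpu, n) for gc, n in nodes_mem.items()}
```

Memory dominates: disk and CPU each need only 2 nodes, while memory needs 159 (CMS) or 39 (G1).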
As you can see, it is memory that demands more nodes than disk does. Note also that, had G1 GC not increased our ability to run a big JVM heap, a lot of memory on each node would have been lying idle.
Does virtualization (VMs) help?
HBase uses Log-Structured Merge (LSM) Trees for its indexes in HFiles. LSM demands that the disk is not shared with any other process. As a result, Hadoop experts do not recommend VMs, as the disk gets shared between VMs and performance degrades very badly. But wait: having multiple disks on the same node can help. By multiple disks, I am not talking about virtual disks, but physical spindles. If your hardware (bare-metal node) has multiple disk spindles, and you can create VMs in such a way that no two VMs share the same physical spindle/disk, then by all means one can go for virtualization.
This article does not discuss reads. Reads are affected by the C8 and C9 configurations. But in our case, the read workload did not change much of our capacity planning.
Laxmi Narsimha Rao Oruganti
Update-1: Memory Size calculation was wrong. Should consider only Disk Size w/o Replication, but was considering Disk Size w/ Replication. Fixed it now.