Distributed File System – Part 1 (Retake)

This blog post series tries to take the user from traditional one-node file system to a distributed file system. This is first post of the series.

File System (of a scale-up world or in a single node):

The basic responsibilities of file system are:

  1. Support user objects and operations (outwardly)
    1. Directories – Create, Delete, Rename
    2. Files – Create, Rename, Read, Write, Expand, Shrink, Truncate, Delete
  2. Manage Disk Space (inwardly)
    1. Structure and organization of data on disk
    2. Allocation and de-allocation of disk space

* There are many more responsibilities that file systems own.  Such as: file permissions, symbolic links, etc.  They are all deliberately excluded for keeping things short (Need I say sweet Smile).

File System metadata typically contains:

  • Which data blocks constitute a file and their order
  • Which directories contains which files and the hierarchy

Let us take an example to get clarity.  In case of, Unix-like OS file systems:

  1. The very first block on the disk is usually Master Boot Record (MBR)
    1. LILO and GRUB are the popular boot loaders
  2. MBR contains the details of Disk Partitions (namely: drives)
  3. Disk Layout: MBR | Partition 1 | Partition 2 | …

image

Figure 1: Disk Layout

At disk level, VFS (virtual file system) operates.  It then mounts the drives (based on the file system to which the drive is setup for).

File System divides the whole drive space into blocks. The block is usually configurable and is part of OS installation configuration. The typical block size is 4 KB (for long time), recently it has moved up to 8 KB. Of these blocks, some blocks are used for file systems own metadata and the rest can be left for use by applications (or data).

image

Figure 2: ext2 File System – Disk Partition (or Drive) Layout

image

Figure 3: ext2 File System – Block Group Layout

  1. Drive Layout: Super Block | Block Group 1 | Block Group 2 | …
    1. SuperBlock: Contains the file system info, and drive info as a whole (like what type of file system it is, how many blocks are in use, how many blocks are free)
  2. Block Group Layout: Group Descriptors | Block Bitmap | iNode Bitmap | iNode Table | Data Blocks
    1. Group Descriptors: number of free and used blocks in this block group, number of free and used iNode count, block number of Block Bitmap, iNode Bitmap
    2. Block Bitmap: Each bit represents that particular block is free/used
    3. iNode Bitmap: Each bit represents that particular iNode is free/used
    4. Data Blocks

  1. Each file system object (directory, file, symbolic link) are represented in a metadata structure named iNode.
  2. Internally all iNodes are addressed by numbers (namely, iNode Number) – starting with 1
  3. iNode structure typically contains:
    1. Block Pointers – 12 Level 0, 1 Level 1, 1 Level 2, 1 Level 3
      1. Level 0 – 12 Pointers to Data Blocks
      2. Level 1 – Pointer to Block of Pointers to Data Blocks
      3. Level 2 – Pointer to Block of Pointers to Blocks of Pointers to Data Blocks
      4. Level 3 – Pointer to Block of Pointers to Blocks of Pointers to Blocks of Pointers to Data Blocks
    2. In case of Directory, Data Blocks contain details of immediate sub-directories and files.  For each item, there is an iNode number
    3. In case of Files, Data Blocks contain actual user data

Since, the first directory to create in the system is root “/”.  It usually comes in the very beginning (I think, iNode Number is 2).

The work flow to open a file “/usr/laxminro/olnrao.txt”

  • Get the iNode for “/”
  • Find the Data Block details from this iNode
  • From “/” Data Block, find the iNode Number of sub-directory “usr”
  • Get the iNode for “usr” (with iNode Number found above)
  • Find the Data Block details from this iNode
  • From “usr” Data Block, find the iNode Number of sub-directory “laxminro”
  • Get the iNode for “laxminro” (with iNode Number found above)
  • Find the Data Block details from this iNode
  • From “laxminro” Data Block, find the iNode Number of file “olnrao.txt”
  • Get the iNode for “olnrao.txt” (with iNode Number found above)
  • Find the Data Block details from this iNode
  • These data blocks contains the actual data (i.e. contents of “olnrao.txt” file)

Now that we have talked at very high level what a typical File System does and how it has implemented, let us talk about file system resources and their limitations (esp. Hard disks), usage patterns, and design choices made accordingly.

  • Hard disks are mechanical devices
    • Disk space was a scarce resource
      • Disk fragmentation was a serious concern
    • Disk bandwidth was and is a scarce resource
    • Seek time is in order of milliseconds
      • Random Reads and Random writes incur seek
    • Small writes and small reads were the norm

Applications tried to make sure they use very less disk space and bandwidth.  So, amount of data read from and written to disk was small.  Note that, each read or write could potentially incur a seek.  As a result, file systems introduced buffering.  Read buffering to serve the next possible read from buffers.  Write buffering to make sure to accumulate enough data before writing to disk.  Buffering also helped order the writes in such a way that seek is in one direction than to and fro.  Database Systems (esp. RDBMS) are one of the heavy users of file systems and if one delves into design choices made in RDBMS, one can easily how much a file system design choice and hardware limitations need to be dealt.  I will take a short detour to get a sense:

  • B+ Trees were the common form of on-disk storage of tables and indexes
    • For the record, heap based tables and hash based indexes do exist
  • Within DB file, space is allocated in terms of extents.  An extent is about 32/64/128 KB.
    • Extent system assumes that disk space is together (ex: on same platter)
    • Each extent is usually exclusive to a B+ Tree so that
    • Read and Write of a B+ Tree data is collocated as extent is on same platter, otherwise one page could be on one platter and another page of B+ Tree could be on different platter
  • A typical transaction adds/updates one/two rows in different tables
    • Ex: An order made by a customer, would result in one Order Table row, few Order Details Table rows, Payments table row, etc.
    • Even with extent model this above typical usage pattern demands that we need to touch multiple B+ Trees, which means small writes in different locations
    • Solution: Write ahead logging – Log files serve multiple purposes
      • Bringing ‘append’ and ‘chunk’ semantics of file system usage
      • Atomic way of committing or aborting a transaction
        • Either Commit Transaction or Abort Transaction
        • Atomicity is still possible with other techniques such as ‘Shadow Paging’ – but they are not successful because of limitations with Hard disks
      • Actual B+ Tree structures on-disk are updated as part of Checkpoint

Pardon me for the detour, coming back to disks.  We discussed mostly about optimizations on ‘write’.  At scale and for reads, the introduction of ‘cache tier’ came into play.

Apart from the heavy user like RDBMS, many applications use file system in different ways and most of the time write buffering and read buffering helped contain the problem.  It is also important to note that general file systems served wide variety of applications, work loads, and hence their design choices were limited.  That may be the reason, there are many file systems that have been designed for the tailor workloads.

The newer persistent memory technologies such as ‘Solid State Drives’, an electronic device, did not make things any better.  Why? you may ask.  While random reads are not a concern, random writes are.  Random small writes incur ‘erase block’ semantics of SSD.  Erase block is costly because it requires charge pump.  There are other problems such as Wear Leveling, etc.  This whole story is famously known as Write amplification.

Interested readers may read the following to get a pulse of hot trends in persistent memory world: Phase-change memory, Memristor,

File Systems continues to be an area of research with newer storage devices coming up.

Hopefully that has given the gist of a File System in one-node world.  That’s all for now, shall come back soon with more on the same topic.  Thanks for reading.

 

Thanks,

Laxmi Narsimha Rao Oruganti (alias: LaxmiNRO, alias: OLNRao)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s