As mentioned in previous articles, Hadoop is an application ecosystem that allows the implementation of distributed, functional and scalable Big Data platforms.
The Hadoop ecosystem is designed to process large volumes of data distributed through the MapReduce programming model. The two main components in its core are: HDFS (Hadoop Distributed File System) and MapReduce (distributed programming model).
In this article, we will review the key concepts of HDFS and why it is important to choose a good data storage format.
Hadoop Distributed File System (HDFS) is a distributed file system designed for large-scale data processing where scalability, flexibility and performance are critical. Hadoop works in a master / slave architecture to store data in HDFS and is based on the principle of storing few very large files.
In HDFS two services are executed: Namenode and Datanode. The Namenode manages the namespace of the file system, in addition to maintaining the file system tree and metadata for all files and directories. This information is permanently stored on the local disk in the form of two files: the namespace image and the edition log. The Namenode also knows the Datanodes where the blocks of a file are located.
Like other file systems, HDFS also has the concept of block as the minimum amount of data that can be read and written. While in other file systems the blocks are usually a few kilobytes, the default size of an HDFS block is 128MB. The HDFS blocks are larger because they aim at minimizing the cost of searches, since if a block is large enough, the time to transfer data from the disk can be longer than the time needed to search from the beginning of the block. The blocks fit well with replication to provide fault tolerance and availability. Each block is replicated in several small separate machines.
Hadoop, like any standard file system, allows you to store information in any format, whether structured, semi-structured or unstructured data. In addition, it also provides support for optimized formats for storage and processing in HDFS.
Why should you choose a good format?
A file format is the definition of how information is stored in HDFS. Hadoop does not have a default file format and the choice of a format depends on its use.
The big problem in the performance of applications that use HDFS such as MapReduce or Spark is the information search time and the writing time. Managing the processing and storage of large volumes of information is complex, in addition to other difficulties such as the evolution of storage schemes or restrictions.
The choice of an appropriate file format can produce the following benefits:
- Optimum writing time
- Optimum reading time
- File divisibility
- Adaptive scheme and compression support
Formats for Hadoop
Below are some of the most common formats of the Hadoop ecosystem:
A plain text file or CSV is the most common format both outside and within the Hadoop ecosystem. The great disadvantage in the use of this format is that it does not support block compression, so the compression of a CSV file in Hadoop can have a high cost in reading.
The SequenceFile format stores the data in binary format. This format accepts compression; however, it does not store metadata and the only option in the evolution of its scheme is to add new fields at the end. This is usually used to store intermediate data in the input and output of MapReduce processes.
Avro is a row-based storage format. This format includes in each file, the definition of the scheme of your data in JSON format, improving interoperability and allowing the evolution of the scheme. Avro also allows block compression in addition to its divisibility, making it a good choice for most cases when using Hadoop.
Parquet is a column-based (column-based) binary storage format that can store nested data structures. This format is very efficient in terms of disk input / output operations when the necessary columns to be used are specified. This format is very optimized for use with Cloudera Impala.
RCFile (Record Columnar File)
RCFile is a columnar format that divides data into groups of rows, and inside it, data is stored in columns. This format does not support the evaluation of the scheme and if you want to add a new column it is necessary to rewrite the file, slowing down the process.
ORC (Optimized Row Columnar)
ORC is considered an evolution of the RCFile format and has all its benefits alongside with some improvements such as better compression, allowing faster queries. This format also does not support the evolution of the scheme.
Each format has advantages and disadvantages, and each stage of data processing will need a different format to be more efficient. The objective is to choose a format that maximizes advantages and minimizes inconveniences. Choosing an appropriate HDFS file format to the type of work that will be done with it, can ensure that resources will be used efficiently.
Analyzing the different uses and the characteristics of the different file formats used in Hadoop, you will obtain the following recommendations:
- The plain text format or CSV would only be recommended in case of extractions of data from Hadoop or a massive data load from a file.
- The SequenceFile format is recommended in case of storing intermediate data in MapReduce jobs.
- Avro is a good choice in case the data scheme can evolve over time.
- Parquet and ORC are recommended when query performance is important.