HDFS erasure coding: Half the cost, with higher speed

16:20–17:00, August 6, 2016
Hadoop internals and development
Location: Grand Hall B (紫金大厅B)




Ever since its creation, HDFS has relied on data replication to protect against most failure scenarios. However, as data volumes grow, replication is becoming expensive: the default 3x replication scheme incurs a 200% overhead in storage space and puts pressure on other resources (e.g., network bandwidth when writing the data). Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication.
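The 200% and ~50% figures above can be checked with simple arithmetic. The sketch below is illustrative only: the abstract does not name a specific coding scheme, so the Reed-Solomon layout with 6 data cells and 3 parity cells is an assumed "typical configuration," and the helper name `storage_overhead` is hypothetical.

```python
def storage_overhead(data_units: int, redundancy_units: int) -> float:
    """Extra storage as a fraction of the user data size."""
    return redundancy_units / data_units


# 3x replication: one original copy plus two extra copies.
replication = storage_overhead(1, 2)

# Assumed RS(6,3) layout: 6 data cells plus 3 parity cells per stripe.
rs_6_3 = storage_overhead(6, 3)

# Total bytes stored per byte of user data under each scheme.
total_replication = 1 + replication
total_ec = 1 + rs_6_3

# Relative reduction in stored bytes when moving from 3x replication to EC.
savings = 1 - total_ec / total_replication

print(f"3x replication overhead: {replication:.0%}")
print(f"RS(6,3) overhead:        {rs_6_3:.0%}")
print(f"Storage cost reduction:  {savings:.0%}")
```

Under these assumptions, each byte of user data occupies 3 bytes with replication and 1.5 bytes with EC, which is where the roughly 50% reduction comes from.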

Zhe Zhang and Kai Zheng present the first-ever performance study of the new HDFS erasure coding feature, drawn from 160 performance tests covering four hardware configurations, three erasure coders (including a native coder based on Intel’s ISA-L library), four I/O concurrency levels, and more than ten benchmarking workloads (including TPC-H running with Hive-on-Spark). The test results verify that the superior performance of the ISA-L coder translates to up to 50x higher I/O throughput. Among other insights, EC showed a significant performance gain over replication under sequential I/O workloads. The tests also revealed a number of previously hidden design issues to be optimized in the future.

Zhe Zhang


Zhe Zhang is an engineering manager at LinkedIn, where he leads an engineering team that provides big data services (HDFS, YARN, Spark, TensorFlow, and beyond) to power LinkedIn’s business intelligence and relevance applications. Zhe is an Apache Hadoop PMC member; he led the design and development of HDFS Erasure Coding (HDFS-EC).

Kai Zheng

Kai Zheng is a big data engineering manager at Intel's Asia-Pacific R&D Center, where he explores broad enablement and optimization on the company’s IA platform. He previously spent a number of years there as a senior engineer, working on development and optimization across the security, storage, and computing domains. His team has made significant contributions to many projects in the Hadoop and streaming space. An enthusiastic open source contributor, Kai is an Apache Hadoop committer, an Apache Directory PMC member, a key initiator of Apache Kerby, and a major contributor to HDFS erasure coding.



