Presented by O'Reilly and Cloudera
Make Data Work
August 3–4, 2016: Training
August 4–6, 2016: Conference
Beijing, China

HDFS erasure coding: Half the cost, faster speed

16:20–17:00, August 6, 2016
Hadoop internals & development
Location: Grand Hall B (紫金大厅B)

Prerequisite knowledge

A basic understanding of HDFS and how it works, particularly features such as data replication and data locality.

Description

Ever since its creation, HDFS has relied on data replication to shield against most failure scenarios. However, with the explosive growth in data volume, replication is getting quite expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication.
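To make the storage arithmetic concrete, here is a minimal back-of-the-envelope sketch in Java (illustrative only, not code from the talk) comparing the raw-to-logical storage ratio of 3x replication with a Reed-Solomon (6, 3) layout, a typical EC configuration:

// Back-of-the-envelope storage math: 3x replication vs. Reed-Solomon EC.
// Illustrative sketch only; the (6, 3) layout matches the RS-6-3 policy
// commonly used with HDFS-EC, but any (data, parity) pair can be plugged in.
public class StorageOverhead {

    // Raw bytes stored per logical byte under n-way replication.
    static double replicationRatio(int replicas) {
        return replicas;                         // e.g., 3x replication -> 3.0
    }

    // Raw bytes stored per logical byte under Reed-Solomon (data, parity).
    static double erasureCodingRatio(int dataUnits, int parityUnits) {
        return (double) (dataUnits + parityUnits) / dataUnits;
    }

    public static void main(String[] args) {
        double rep = replicationRatio(3);        // 3.0
        double ec  = erasureCodingRatio(6, 3);   // 9/6 = 1.5
        System.out.printf("3x replication: %.1fx raw storage (%.0f%% overhead)%n",
                rep, (rep - 1) * 100);           // 200% overhead
        System.out.printf("RS(6,3) EC:     %.1fx raw storage (%.0f%% overhead)%n",
                ec, (ec - 1) * 100);             // 50% overhead
        System.out.printf("Storage saved by EC vs. 3x: %.0f%%%n",
                (1 - ec / rep) * 100);           // 50% cheaper
    }
}

Running it prints a 3.0x raw-storage ratio (200% overhead) for replication versus 1.5x (50% overhead) for RS(6,3), which is the ~50% cost reduction cited above.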

Zhe Zhang and Kai Zheng present the first-ever performance study of the new HDFS erasure coding feature, drawn from 160 performance tests covering four different hardware configurations, three different erasure coders (including a native coder based on Intel's ISA-L library), four I/O concurrency levels, and over ten benchmarking workloads (including TPC-H running with Hive-on-Spark). The results verify that the superior performance of the ISA-L coder translates to up to 50x higher I/O throughput. Among other insights, EC showed a significant performance gain over the replication mode under sequential I/O workloads, and the tests revealed a number of previously hidden design issues to address in future optimization work.
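For readers who want to experiment with the feature, the sketch below shows how a directory might be switched to an EC policy through the Java API that shipped with HDFS-EC in Hadoop 3.x. The namenode URI, the directory path, and the assumption that the built-in RS-6-3-1024k policy is already enabled on the cluster are placeholders rather than details from the talk:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: applying a built-in erasure coding policy to a directory via the
// Hadoop 3.x API. Assumes the RS-6-3-1024k policy has been enabled on the
// cluster (an admin step) and that hdfs://namenode:8020 is a placeholder URI.
public class EnableEcPolicy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem)
                new Path("hdfs://namenode:8020/").getFileSystem(conf);

        Path dir = new Path("/warehouse/cold-data");   // hypothetical directory
        dfs.mkdirs(dir);

        // Files written under this directory from now on are striped and
        // encoded with Reed-Solomon (6 data + 3 parity, 1 MB cells) instead
        // of being replicated 3x.
        dfs.setErasureCodingPolicy(dir, "RS-6-3-1024k");

        System.out.println("Policy on " + dir + ": "
                + dfs.getErasureCodingPolicy(dir).getName());
        dfs.close();
    }
}

Note that a policy only affects files written after it is set; existing replicated files stay replicated unless they are rewritten.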


Zhe Zhang

LinkedIn

Zhe Zhang is an engineering manager at LinkedIn, where he leads an engineering team providing big data services (HDFS, YARN, Spark, TensorFlow, and beyond) that power LinkedIn's business intelligence and relevance applications. Zhe is an Apache Hadoop PMC member; he led the design and development of HDFS erasure coding (HDFS-EC).


Kai Zheng (郑锴)

Intel

Kai Zheng is a big data engineering manager at Intel's Asia-Pacific R&D Center, where he explores broad enablement and optimization on the company's IA platform. He has worked in big data for a number of years across the security, storage, and computing domains, and his team participates in and contributes to many projects in the Hadoop and streaming ecosystems. An avid open source contributor, Kai is an Apache Hadoop committer, an Apache Directory PMC member, a key initiator of Apache Kerby, and a major contributor to HDFS erasure coding.
