Presented by O'Reilly and Cloudera
Make Data Work

Demystifying erasure coding in HDFS

This will be presented in English.

Andrew Wang (Cloudera), Kai Zheng (Intel)
16:20–17:00 Saturday, 2017-07-15
Hadoop internals & development, Presented in English
Location: Function Room 2    Level: Intermediate

Description


With the massive data growth produced by internet applications and IoT devices, efficiently and reliably persisting this data into HDFS is a serious challenge for Hadoop customers. Erasure coding, arguably the most important feature in Hadoop 3.0, was created to address this challenge. Compared with the expensive 3x replication scheme, which incurs a 200% overhead in storage space and other resources, erasure coding uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, erasure coding reduces the storage cost by ~50%.
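The overhead arithmetic behind these figures can be sketched as follows. The RS(6,3) layout used here (six data blocks plus three parity blocks, the common Reed-Solomon configuration in Hadoop 3.0) is assumed for illustration:

```python
# Sketch: raw storage cost per logical byte under 3x replication
# vs. Reed-Solomon erasure coding. The RS(6,3) parameters below are
# an illustrative assumption, not a prescription for any cluster.

def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte under n-way replication."""
    return float(replicas)

def ec_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte under RS(data, parity)."""
    return (data_units + parity_units) / data_units

rep3 = replication_overhead(3)   # 3.0 -> 200% overhead beyond the data itself
rs63 = ec_overhead(6, 3)         # 1.5 -> 50% overhead
savings = 1 - rs63 / rep3        # 0.5 -> erasure coding halves raw storage

print(f"3x replication: {rep3:.1f}x raw storage")
print(f"RS(6,3):        {rs63:.1f}x raw storage")
print(f"Storage saved:  {savings:.0%}")
```

Both schemes tolerate the loss of any three storage locations for RS(6,3) (any block is reconstructible from six of the nine), which is how erasure coding keeps replication-level durability at half the cost.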

Kai Zheng and Andrew Wang start by describing the overall design and architecture of HDFS erasure coding before diving into recent work on stabilizing and testing erasure coding for the Apache Hadoop 3.0 GA release, including results from large-scale testing and benchmarking with analytic big data workloads like Spark and Hive. Kai and Andrew end by discussing operational concerns and recommendations for running HDFS erasure coding in a production environment.
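As a pointer to what such a deployment involves, Hadoop 3.0 administers erasure coding per directory through the `hdfs ec` subcommand; the directory `/warehouse/cold` below is a hypothetical path used only for illustration:

```shell
# List the erasure coding policies known to the cluster
hdfs ec -listPolicies

# Enable the common Reed-Solomon 6+3 policy
hdfs ec -enablePolicy -policy RS-6-3-1024k

# Apply it to a directory; files written there afterwards are erasure-coded
# (/warehouse/cold is a hypothetical example path)
hdfs ec -setPolicy -path /warehouse/cold -policy RS-6-3-1024k

# Confirm which policy is in effect on the directory
hdfs ec -getPolicy -path /warehouse/cold
```

Note that setting a policy does not convert existing replicated files in place; data must be rewritten (for example, copied with distcp) to become erasure-coded.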


Andrew Wang


Andrew Wang is a software engineer at Cloudera on the HDFS team, an Apache Hadoop committer and PMC member, and the release manager for Hadoop 3.0. Previously, he was a PhD student in the AMPLab at UC Berkeley, where he worked on problems related to distributed systems and warehouse-scale computing. He holds a master’s and a bachelor’s degree in computer science from UC Berkeley and UVA, respectively.

Kai Zheng

Kai Zheng is a big data engineering manager at Intel, where he explores broad enablement and optimization on the company's IA platform. He has worked in the big data space for a number of years across the security, storage, and computing domains. Kai is also an Apache Hadoop committer, an Apache Directory PMC member, a key initiator of Apache Kerby, and a major contributor to HDFS erasure coding.


