Hosted by O'Reilly and Cloudera
Make Data Work
July 12-13, 2017: Training
July 13-15, 2017: Conference
Beijing, China

Demystifying erasure coding in HDFS

This will be presented in English.

Andrew Wang (Cloudera), Kai Zheng (Intel)
16:20–17:00 Saturday, 2017-07-15
Hadoop internals & development, Presented in English
Location: Function Room 2. Level: Intermediate

Prerequisite Knowledge

A basic knowledge of HDFS and an understanding of how it works, along with some familiarity with erasure coding

What you'll learn

You'll gain a comprehensive understanding of the principles, performance, and usage of HDFS erasure coding, along with a useful reference for Hadoop cluster administrators deciding whether and how to deploy the technology.

Description

With the massive data growth produced by internet applications and IoT devices, efficiently and reliably persisting this data into HDFS is a serious challenge for Hadoop customers. Erasure coding, arguably the most important feature in Hadoop 3.0, was created to address this challenge. Compared with the expensive 3x replication scheme, which incurs a 200% overhead in storage space and other resources, erasure coding uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, erasure coding reduces the storage cost by ~50%.
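To make the storage math concrete, here is a minimal sketch comparing the raw-storage multipliers of 3x replication and a Reed-Solomon (6,3) layout, the scheme behind Hadoop 3.0's built-in RS-6-3-1024k policy. The helper class below is purely illustrative and not part of any Hadoop API:

    // Illustrative only: raw storage required per byte of user data under
    // 3x replication versus Reed-Solomon striping with k data blocks and
    // m parity blocks.
    public class StorageOverhead {
        // Raw bytes stored per logical byte: (k + m) / k for erasure coding.
        static double erasureCodingMultiplier(int dataBlocks, int parityBlocks) {
            return (double) (dataBlocks + parityBlocks) / dataBlocks;
        }

        public static void main(String[] args) {
            double replication = 3.0; // 3x replication stores every byte three times
            double rs63 = erasureCodingMultiplier(6, 3); // RS(6,3): 9/6 = 1.5x

            // Overhead beyond the data itself: 200% for replication, 50% for RS(6,3).
            System.out.printf("3x replication: %.1fx raw storage (%.0f%% overhead)%n",
                    replication, (replication - 1) * 100);
            System.out.printf("RS(6,3):        %.1fx raw storage (%.0f%% overhead)%n",
                    rs63, (rs63 - 1) * 100);
            // RS(6,3) tolerates the loss of any 3 of its 9 blocks, at least
            // matching 3x replication (which survives 2 lost replicas) while
            // using half the raw space.
        }
    }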

Kai Zheng and Andrew Wang start by describing the overall design and architecture of HDFS erasure coding before diving into recent work on stabilizing and testing erasure coding for the Apache Hadoop 3.0 GA release, including results from large-scale testing and benchmarking with analytic big data workloads like Spark and Hive. Kai and Andrew end by discussing operational concerns and recommendations for running HDFS erasure coding in a production environment.
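For administrators who want to experiment ahead of the session, the following is a minimal sketch of applying an erasure coding policy to a directory through the Hadoop 3.0 HDFS client API. RS-6-3-1024k is the built-in Reed-Solomon policy in Hadoop 3.0; the NameNode URI and the /data/cold path are placeholders for illustration:

    // Minimal sketch: apply the built-in RS-6-3-1024k erasure coding policy
    // to a directory. The NameNode URI and target path are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    import java.net.URI;

    public class EnableErasureCoding {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; use your cluster's fs.defaultFS.
            try (DistributedFileSystem dfs = (DistributedFileSystem)
                    DistributedFileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
                Path coldData = new Path("/data/cold"); // hypothetical directory
                dfs.mkdirs(coldData);
                // Files written under this directory from now on are erasure coded.
                dfs.setErasureCodingPolicy(coldData, "RS-6-3-1024k");
                System.out.println("Policy on " + coldData + ": "
                        + dfs.getErasureCodingPolicy(coldData).getName());
            }
        }
    }

Note that a policy only affects files written after it is set; existing replicated files are not converted in place and must be rewritten (for example, copied into the erasure-coded directory) to realize the storage savings.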

Andrew Wang

Cloudera

Andrew Wang is a software engineer at Cloudera on the HDFS team, an Apache Hadoop committer and PMC member, and the release manager for Hadoop 3.0. Previously, he was a PhD student in the AMPLab at UC Berkeley, where he worked on problems related to distributed systems and warehouse-scale computing. He holds a master’s and a bachelor’s degree in computer science from UC Berkeley and UVA, respectively.

Kai Zheng

Intel

Kai Zheng is a big data engineering manager at Intel's Asia-Pacific R&D center, where he explores broad enablement and optimization on the company's IA platform. He has worked as a senior engineer in the security and big data domains for many years and now leads a team that participates in and contributes to numerous projects across the Hadoop and streaming ecosystems. An enthusiastic open source contributor, Kai is an Apache Hadoop committer, an Apache Directory PMC member, a key initiator of Apache Kerby, and a major contributor to HDFS erasure coding.

Contact OReillyData

Follow the OReillyData WeChat account for the latest conference news and cutting-edge articles about data.


Read the latest ideas on big data on the O'Reilly data site.