Presented by O'Reilly and Cloudera
Make Data Work
July 12-13, 2017: Training
July 13-15, 2017: Conference
Beijing, China

Building the bridge between Hadoop and Kafka at LinkedIn: A Hadoop team's perspective

This will be presented in Chinese

Fangshi Li (LinkedIn)
14:00–14:40 Saturday, 2017-07-15
Data engineering and architecture
Location: Function Room 6A+B
Level: Intermediate

Prerequisite Knowledge

A basic understanding of Hadoop and Kafka
A basic understanding of building data infrastructure
A basic understanding of data processing

What you'll learn

How to build a data infrastructure
How to combine online and offline data infrastructure

Description

Kafka originated at LinkedIn, which also open-sourced it. LinkedIn built its data ecosystem with Kafka and Hadoop at the heart of its real-time (online) and offline infrastructure, respectively. Today, LinkedIn runs more than 1,400 machines in its Kafka clusters, which ingest and process over 14 trillion messages per day. Its Hadoop clusters have more than 10,000 nodes and store 50 PB of data. Speaking from the perspective of a Hadoop team member, Fangshi Li explains how LinkedIn built a bridge between its Kafka and Hadoop clusters so that the two work better together.

Topics include:

  • A brief introduction to LinkedIn’s data ecosystem
  • The data flow between the Kafka and Hadoop clusters: Using user engagement datasets (such as page views, impressions, and clicks) as an example, Fangshi demonstrates how data moves from the user frontend into the Kafka cluster, then through the ETL ingestion framework (Gobblin, which LinkedIn open-sourced last year) into the Hadoop clusters, where it is finally exposed to data scientists through Pig, Hive, Presto, and Spark. Fangshi also explores the tools LinkedIn built so that Hadoop users can easily push data to Kafka from their Hadoop workflows.
  • An interesting use case (Hadoop + Flume + Kafka) within the Hadoop team: The team has system logs, including HDFS audit logs, RM scheduling logs, and job history logs, and wants to extract useful information from them in near real time for problem alerting, debugging, and ad hoc analysis. Fangshi explains how LinkedIn uses Flume within its Hadoop infrastructure as the bridge that collects Hadoop logs (80K messages per second on a single node) and publishes them to Kafka, and shares the real-time analysis applications the team built on top of Kafka.
  • Consuming real-time Kafka streams through OLAP-style SQL: LinkedIn’s Hadoop ecosystem exposes its data access API to users through SQL (Hive) tables and views, while the underlying data on HDFS comes from Kafka by way of ETL. In this Hadoop world, data analysts’ jobs depend on when data arrives in Hadoop/HDFS, which usually involves a delay of about an hour. Fangshi describes the team’s latest effort: changing the data access layer (previously Hive views) to consume Kafka streams directly, so that data analysts can query historical data on HDFS and the latest data from Kafka in the same way.
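The last topic, one access layer that serves historical partitions from HDFS and the newest partitions from Kafka, can be illustrated with a toy sketch. This is not LinkedIn's implementation: the `Event` type, the `unified_scan` and `pageviews_per_hour` functions, and the in-memory dictionary and list standing in for HDFS partitions and a Kafka topic are all hypothetical names chosen for illustration. The sketch assumes hourly partitions and assumes that any hour already landed on HDFS should be read from there, so the stream only contributes hours that ETL has not yet delivered.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Event:
    hour: int       # hourly partition the event belongs to (assumed layout)
    member_id: int
    page: str


def unified_scan(hdfs_partitions: Dict[int, List[Event]],
                 stream_events: Iterable[Event]) -> Iterator[Event]:
    """Yield landed partitions first, then stream events for hours not yet landed."""
    landed_hours = set(hdfs_partitions)
    for hour in sorted(landed_hours):
        yield from hdfs_partitions[hour]
    for event in stream_events:
        # Skip events whose hour is already on "HDFS" to avoid double counting.
        if event.hour not in landed_hours:
            yield event


def pageviews_per_hour(events: Iterable[Event]) -> Dict[int, int]:
    """A query that does not care where each event came from."""
    counts: Dict[int, int] = {}
    for event in events:
        counts[event.hour] = counts.get(event.hour, 0) + 1
    return counts


# Hypothetical data: hours 0 and 1 have landed; hour 2 exists only in the stream.
hdfs = {
    0: [Event(0, 1, "home"), Event(0, 2, "jobs")],
    1: [Event(1, 1, "feed")],
}
stream = [Event(1, 1, "feed"), Event(2, 3, "home"), Event(2, 1, "jobs")]

result = pageviews_per_hour(unified_scan(hdfs, stream))
print(result)  # {0: 2, 1: 1, 2: 2}
```

The point of the sketch is that the query function sees a single sequence of events; deciding which source serves which hour is left entirely to the access layer.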

Fangshi Li

LinkedIn

Fangshi Li is a senior software engineer on LinkedIn’s Hadoop team. Fangshi built and open-sourced Dr. Elephant. He is currently doing Hive- and Spark-related work. Fangshi holds a degree from Carnegie Mellon.
