Presented by O'Reilly and Cloudera
Make Data Work

Architecting data applications and data products

This will be presented in English.

Ted Malaska (Capital One)
13:30–17:00 Thursday, 2017-07-13
Hadoop internals & development, Presented in English
Location: Function Room 5B  Level: Intermediate
Average rating: *****
(5.00, 1 rating)

Prerequisite Knowledge

The tutorial will include a live demo of the full project on Cloudera's QuickStart VM. The code for the demo is available on GitHub. Download it here to follow along.

Hardware and/or installation requirements

This is not a hands-on tutorial, so no special preparation is necessary. The tutorial will include a live demo of the full project on Cloudera's QuickStart VM. The code for the demo will be available on GitHub so the audience can follow along.

What you'll learn

Learn how to build a fraud-detection app on Hadoop

Description

Implementing a scalable, low-latency architecture requires understanding a broad range of frameworks, such as Kafka, HBase, HDFS, Flume, Spark, Spark Streaming, and Impala, among many others. The good news is that there’s an abundance of resources—books, websites, conferences, etc.—for gaining a deep understanding of these related projects. The bad news is there’s still a scarcity of information on how to integrate these components to implement complete solutions.

Ted Malaska walks you through building a fraud-detection system, using an end-to-end case study as a concrete example of how to architect and implement real-time systems with Apache Hadoop components like Kafka, HBase, Impala, and Spark. Along the way, Ted covers best practices and considerations for architecting real-time applications. The session is aimed at developers, architects, and project leads who already know Hadoop or similar distributed data processing systems and want more insight into how these components can be leveraged to implement real-world applications.

Topics include:

  • Modeling data in Kafka, HBase, and Hadoop and selecting optimal formats for storing data
  • Integrating multiple data collection, processing, and storage systems
  • Collecting and analyzing event-based data such as logs and machine-generated data and storing the data in Hadoop
  • Querying and reporting on data
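
As a rough illustration of the kind of event-based check a fraud-detection system might run, here is a minimal Python sketch. It is not the tutorial's demo code. The function name, the `Profile` class, the thresholds, and the rule itself (flag an amount far above the account's running mean) are all assumptions made for illustration. The in-memory dictionary stands in for a profile store such as HBase, and the list of events stands in for a stream such as a Kafka topic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Profile:
    """Running per-account statistics (hypothetical stand-in for a stored profile row)."""
    count: int = 0
    total: float = 0.0

    @property
    def mean(self) -> float:
        # Average amount seen so far; 0.0 when the account has no history.
        return self.total / self.count if self.count else 0.0


def score_events(events, ratio_threshold=5.0, min_history=3):
    """Return events whose amount exceeds ratio_threshold times the account's running mean.

    `events` is an iterable of (account_id, amount) pairs processed in arrival order.
    Accounts with fewer than `min_history` prior events are never flagged.
    """
    profiles = defaultdict(Profile)  # stand-in for a low-latency profile store
    flagged = []
    for account, amount in events:
        profile = profiles[account]
        # Check the event against the profile built from earlier events only.
        if profile.count >= min_history and amount > ratio_threshold * profile.mean:
            flagged.append((account, amount))
        # Update the profile after scoring, so an event never scores against itself.
        profile.count += 1
        profile.total += amount
    return flagged


if __name__ == "__main__":
    sample = [("a", 10.0), ("a", 12.0), ("a", 11.0), ("a", 500.0), ("b", 500.0)]
    print(score_events(sample))
```

In a real deployment the scoring step would read and update profiles in an external store and consume events from a message queue; the sketch only shows the score-then-update ordering on a single process.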

Ted Malaska

Capital One

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at conferences, and a frequent blogger on data architectures.


