Presented by O'Reilly and Cloudera
Make Data Work
July 12-13, 2017: Training
July 13-15, 2017: Conference
Beijing, China

Strata Data Conference 2017 Speakers

New speakers are added regularly. Please check back often for the latest changes to the schedule.

Focuses on source-code research and enterprise practice with big data technologies such as Hadoop, Spark, Flink, Kafka, Elastic, HBase, Hive, and Kylin. Author of the book 《基于Apache Kylin构建大数据分析平台》.

Presentations

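Hyperledger’s integration with CDH's big data ecosystem and its real-world applications (Session)

Blockchain, the technology behind Bitcoin, is a decentralized distributed ledger technology. Hyperledger is an open source, cross-industry blockchain platform and a global collaborative project formed by industry leaders in finance, banking, the Internet of Things, supply chain, and manufacturing. We integrated Hyperledger with CDH to take advantage of CDH's service deployment, monitoring, and management capabilities. With this project, users can easily deploy Hyperledger clusters in a CDH-managed data center and use the CDH big data platform to analyze Hyperledger data and extract more business value. Projects in use inside Wanda include a digital rights platform and a shared business platform, the latter covering finance, supply chain, and other areas. We believe this project will be of great help to the Hyperledger open source community.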

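Ye Jieping (叶杰平) is vice president of the Didi Chuxing Research Institute, a DiDi Fellow, a tenured professor at the University of Michigan, and a member of the management committee of the University of Michigan's big data research center. He received his PhD in computer science from the University of Minnesota in 2005. His research areas are machine learning, data mining, and big data analytics. He has published more than 200 papers in top international conferences and journals on machine learning and data mining. He has received best paper awards at KDD and ICML as well as the NSF CAREER Award, and has chaired several top conferences in machine learning and data mining. He currently serves as an associate editor of the machine learning and data mining journals IEEE TPAMI, DMKD, and IEEE TKDE.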

Presentations

Big data at Didi Chuxing (Keynote)

Every day, Didi Chuxing's platform generates over 70 TB worth of data, processes more than 9 billion routing requests, and produces over 13 billion location points. Ye Jieping explains how Didi Chuxing applies AI technologies to analyze such big transportation data and improve the travel experience for millions of people in China.

Lukas Biewald is the founder and chief data scientist of CrowdFlower, a data enrichment platform that taps into an on-demand workforce to help companies collect training data and do human-in-the-loop machine learning. Previously, he led the Search Relevance team for Yahoo Japan and worked as a senior data scientist at Powerset. Lukas was recognized by Inc. magazine as a 30 under 30. Lukas holds a BS in mathematics and an MS in computer science from Stanford University. He is also an expert Go player.

Presentations

Keynote by Lukas Biewald (Keynote)

Details to come.

Active learning in the real world (Session)

Training data collection strategies are often the most important and overlooked part of deploying real-world machine-learning algorithms. Lukas Biewald explains why active learning is the best way to collect training data and can make the difference between a failed research project and a deployed production algorithm.

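Presales technical manager, industry consultant, and senior solutions architect at Cloudera, and a former core developer of the Intel Hadoop distribution. Joined Intel's compiler group in 2006 to develop server middleware, specializing in debugging and optimizing server software. Since 2010, has worked on Hadoop product development and solution consulting, with responsibilities spanning Hadoop productization, HBase performance tuning, and industry solution consulting, and has successfully deployed and supported multiple Hadoop clusters of more than a hundred nodes in industries such as transportation and telecommunications.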

Presentations

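HBase as a multiple-data-center solution and its future incremental backup function (Session)

For years, Hadoop technology has been unable to enter core business systems, and the lack of a mature, stable, geographically distributed multi-data-center solution is one of the main reasons. For disaster recovery and similar reasons, HBase clusters that store important data usually need to be backed up across data centers, and regulators of China's banking industry have made remote multi-data-center deployment a hard requirement. Yet most HBase deployments today use a single data center, and none of the approaches HBase currently offers (replica, snapshot copy, or export) meets regulatory and remote disaster recovery requirements. This session shares the problems HBase encounters under multi-data-center deployment requirements and how to solve them. HBase will add an incremental backup feature in the future. It avoids the full-table scans that existing techniques require, which greatly improves backup performance, while providing the consistency that replica lacks. The session also describes in detail why this feature matters for multi-data-center solutions, how to use it, and how it works internally.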

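Deep learning engineer who has worked on distributed storage projects such as HBase and Ceph and contributed to OpenStack and Docker community projects. Currently responsible for the architecture and implementation of Xiaomi's cloud deep learning platform, with a focus on the Kubernetes and TensorFlow communities.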

Presentations

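Architecture and practices of a cloud-based deep learning platform (Session)

This session introduces the cloud machine learning platform used internally at Xiaomi, analyzes the architecture design and implementation principles of a general-purpose deep learning platform, and shares practical experience supporting development environments, model training, and model serving within an enterprise.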

Haifeng Chen is a senior software architect at Intel’s Asia Pacific R&D Center. He has more than 12 years’ experience in software design and development, big data, and security, with a particular interest in image processing. Haifeng is the author of ColorStorm, a software package for image browsing, editing, and processing.

Presentations

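When Hadoop meets object storage: Implementation principles, pitfalls, and performance optimization (Session)

The Hadoop community has long supported object storage on public clouds, such as AWS S3 and Azure Storage. The recently released Apache Hadoop 3.0 (alpha) adds support for more cloud storage services, such as Azure Data Lake and Alibaba Cloud OSS. These services all provide a Hadoop-compatible filesystem, so users can treat them as another HDFS. However, object storage and HDFS differ in many ways in how they are implemented, so even though the two expose similar filesystem interfaces, many APIs behave completely differently. Starting from hands-on experience with Alibaba Cloud OSS, this session describes how the Alibaba Cloud OSS FileSystem implementation made its way into Apache Hadoop. It also covers how object storage differs from traditional filesystems in file upload, download, deletion, and move operations, and evaluates the strengths and weaknesses of HDFS and the OSS filesystem in terms of performance and cost. Finally, drawing on the characteristics of object storage, it offers optimizations that can improve the performance of open source engines such as Hive and Spark when they access object storage.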

Speed up big data encryption in Apache Hadoop and Spark (Session)

Although the processing capability of modern platforms is approaching memory speed, securing big data using encryption still hurts performance. Haifeng Chen shares proven ways to speed up data encryption in Hadoop and Spark, as well as the latest progress in open source, and demystifies using hardware acceleration technology to protect your data.

Cheng Feng is a data engineer at Grab, where he works on the big data platform, distributed computing, streaming processing, and data science. Previously, he was a data scientist at the Lazada Group, working on Lazada’s tracker, customer segmentation and recommendation systems, and fraud detection.

Presentations

Driving Southeast Asia forward with big data (Session)

In SEA, Grab is sitting at the junction of the digital and physical worlds. Its vision is to drive Southeast Asia forward and transform the way people travel and pay across the region. Feng Cheng and Edwin Law explain Grab's data architecture and offer a history of its data platform migration and stream-processing apps.

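Dr. 种骥科 is chief data scientist at Yirendai (NYSE: YRD), where he is using the "Pantheon" (“万神庙”) framework to build out Yirendai's data department and is responsible for antifraud risk control and data-driven operations and innovation. Previously, he worked at the US job search platform Simply Hired, where he founded the data science department, and was invited to advise the White House science and technology office on the design of big data technology products. He also worked at the US private equity firm Silver Lake as data science architect for its Kraftwerk fund, responsible for applying big data technology to risk control in private equity investment. He was a professor and PhD advisor at Carnegie Mellon University. He holds a PhD in electrical engineering and computer sciences from the University of California, Berkeley, master's and bachelor's degrees in electrical and computer engineering from Carnegie Mellon University, and 9 patents (5 granted, 3 pending).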

Presentations

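SDK + FinGraph + Go: Create business value with firsthand user behavior data and knowledge graph information (Session)

Now that the mobile internet traffic dividend has passed, how can we mine firsthand mobile data in depth, respond to user needs in real time, and create business value through user behavior and knowledge graph technology? Using concrete business cases, we share a technical framework built on SDK + FinGraph + Go. The framework embeds an SDK into an app with a single line of code and captures user intent through a real-time or near-real-time upload mechanism and real-time processing and analysis with Flume + Kafka. It mines graph information using Spark Streaming for stream processing, HBase for key-value query output, and a Neo4j cluster for association and storage. It then predicts user intent and guides user behavior using a base platform developed efficiently in Go, Python to connect to the automated reporting backend, scikit-learn for event recognition, and Cypher for mining graph relationships, creating business value from real-time data.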

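Data science essentials: Examples from internet finance - Quantifying credit and fraud risks online (Training)

Would you like to understand the quantitative analysis process behind internet finance? How is personal credit quantified through big data? In practice, which aspects of applying machine learning algorithms need attention? How can graph analysis fuse multidimensional data to help us distinguish legitimate users from fraudulent ones? This tutorial is based on "Quantitative Financial Credit and Risk Control Analysis" ("量化金融信用与风控分析"), a graduate course newly offered in spring 2017 at Tsinghua University's Institute for Interdisciplinary Information Sciences. It uses real lending data from LendingClub as a case study to explain how some specific models are implemented.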

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Saturday opening welcome (Keynote)

Program chairs Jason Dai, Ben Lorica, and Doug Cutting open the second day of keynotes.

Jason is currently a senior principal engineer and CTO of big data technologies at Intel, responsible for leading the global engineering teams (located in both Silicon Valley and Shanghai) on the development of advanced big data analytics (including distributed machine learning and deep learning), as well as collaborations with leading research labs (e.g., UC Berkeley AMPLab).

He is an internationally recognized expert on big data, cloud, and distributed machine learning. He is the program cochair of Strata Data Conference Beijing, a committer and PMC member of the Apache Spark project, and the creator of the BigDL project (https://github.com/intel-analytics/BigDL/), a distributed deep learning framework on Apache Spark.

Presentations

Saturday opening welcome (Keynote)

Program chairs Jason Dai, Ben Lorica, and Doug Cutting open the second day of keynotes.

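AWS solutions architect with 17 years of experience in IT, having held engineer and architect positions at companies including IBM, RIM, and Apple. Currently a solutions architect at AWS. Enjoys programming and programming languages of all kinds, especially Lisp. Loves new technologies and technical challenges, and is currently concentrating on learning machine learning algorithms for distributed computing environments as well as deep neural network frameworks.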

Presentations

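Distributed deep learning on AWS using MXNet (Tutorial)

Deep learning continues to drive state-of-the-art advances in areas such as computer vision, natural language processing, and recommendation engines. A key factor behind this progress is the emergence of many highly flexible, developer-friendly deep learning frameworks. In this tutorial, members of the Amazon machine learning team give a brief introduction to the background of deep learning, focusing on relevant application domains, and introduce MXNet, a powerful and scalable deep learning framework. At the end of the tutorial, you'll get hands-on experience with a variety of applications, including computer vision and recommendation engines, and see how preconfigured deep learning AMIs and CloudFormation templates can help speed up development.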

MXNet on AWS (Session)

Damon Deng provides a short background on deep learning, focusing on relevant application domains, and offers an introduction to using the powerful and scalable deep learning framework MXNet. Join in to learn how MXNet works and how you can spin up AWS GPU clusters to train at record speeds.

Mathieu Dumoulin is a data scientist in MapR Technologies’s Tokyo office, where he combines his passion for machine learning and big data with the Hadoop ecosystem. Mathieu started using Hadoop from the deep end, building a full unstructured data classification prototype for Fujitsu Canada’s Innovation Labs, a project that eventually earned him the 2013 Young Innovator award from the Natural Sciences and Engineering Research Council of Canada. Afterward, he moved to Tokyo with his family where he worked as a search engineer at a startup and a managing data scientist for a large Japanese HR company, before coming to MapR.

Presentations

Robot predictive maintenance in action: Real-time, scalable pipelines explained (Session)

Mathieu Dumoulin and Mateusz Dymczyk walk you step by step through building a scalable, real-time anomaly detection pipeline applied to an industrial robot. You'll learn how to gather data from a wireless movement sensor, process it with H2O on a MapR cluster, and let an operator visualize the output through an AR headset.

Mateusz is a Tokyo-based software engineer at H2O.ai, the maker behind H2O, the leading open source machine learning platform for smarter applications and data products. He works on distributed machine learning projects including the core H2O platform and Sparkling Water, which integrates H2O and Apache Spark. Previously, he worked at Fujitsu Laboratories on natural language processing and utilization of machine learning techniques for investments. After Fujitsu he moved to Infoscience to work on a highly distributed log data collection and analysis platform.

Mateusz loves all things distributed and machine learning, and hates buzzwords. In his spare time he participates in the IT community by organizing, attending, and speaking at conferences and meetups. Mateusz holds an MSc in computer science from AGH UST in Krakow.

Presentations

Robot predictive maintenance in action: Real-time, scalable pipelines explained (Session)

Mathieu Dumoulin and Mateusz Dymczyk walk you step by step through building a scalable, real-time anomaly detection pipeline applied to an industrial robot. You'll learn how to gather data from a wireless movement sensor, process it with H2O on a MapR cluster, and let an operator visualize the output through an AR headset.

Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Prior to Alluxio, Bin worked at Google building next-generation storage infrastructure, where he won Google’s Technical Infrastructure award. Bin holds a PhD in computer science from Carnegie Mellon University.

Presentations

Using Alluxio (formerly Tachyon) to speed up big data analytics (Tutorial)

In this three-hour tutorial, we teach participants the fundamentals of Alluxio, demonstrate how Alluxio works, and show how to use the system to help distributed computing engines (such as Spark or MapReduce) share data at memory speed.

Best practices for using Alluxio with Spark (Session)

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory for managing data across different storage. Many deployments use Alluxio with Spark. Gene Pang and Bin Fan explain how Alluxio helps Spark be more effective and share examples of production deployments of Alluxio and Spark working together.

Darren Fu (傅冬雷) is currently team leader and architect for data service and the feeds platform at eBay. Previously, he worked at EA and IBM.

Presentations

eBay's data service and processing platform for ads (Session)

eBay is one of the largest ecommerce companies in the world. Yi Liu and Darren Fu share eBay's data service and processing platform for ads, based on Hadoop, Spark, Kafka, and Cassandra, explore the challenges of high-performance SQL operations on large datasets, and explain how eBay handles them.

Maosong Fu is the technical lead for Heron and real-time analytics at Twitter. He is the author of a few publications in the distributed systems area. Maosong holds a master’s degree from Carnegie Mellon University and a bachelor’s from Huazhong University of Science and Technology.

Presentations

Modern streaming architectures (Tutorial)

The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the art for a real-time data stack? Sijie Guo and Maosong Fu explore the typical challenges in a modern real-time data stack and explain how the modern technology will impact streaming architecture and applications in the future.

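顾荣 received his PhD from the Department of Computer Science at Nanjing University, where he now works. He is a PMC member and maintainer of the open source big data storage project Alluxio and an Apache Spark contributor. As a well-known developer in the Alluxio community, he has done much work on stabilizing features and improving performance in Alluxio, including the performance testing framework Alluxio-Perf, the integration of Alluxio with multiple components of the Hadoop ecosystem, and the community's Chinese-language documentation. On the Spark side, he designed and implemented the support for storing RDDs in Alluxio that shipped in Spark 1.0. He has published or had accepted more than ten papers (10 as first author) and has contributed chapters to books including 《深入理解大数据—卷1: 大数据处理与编程实践》 and 《实战Hadoop:开启通向云计算的捷径》. Passionate about sharing technical knowledge, he organizes the Nanjing Big Data Technology Meetup (seven events held so far) and has spoken many times at well-known technical conferences in China, such as the Database Technology Conference China. He has also held big data systems R&D internships at Microsoft Research, Intel, Baidu, and Transwarp (星环科技).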

Presentations

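Optimizing Alluxio cache strategy and large-scale performance evaluation (Session)

Alluxio (formerly Tachyon) is an open source, memory-centric, unified distributed storage system that bridges upper-layer computation frameworks and underlying storage systems. Alluxio also provides a tiered storage mechanism that manages not only memory but also storage resources such as SSDs and HDDs in a unified way. To keep hot data on the faster storage tiers as much as possible, we designed and implemented a number of advanced cache replacement policies in Alluxio for a variety of big data application scenarios, including LIRS, ARC, and LRFU. These policies have been integrated into Alluxio and can easily be used to tune the performance of upper-layer applications. In addition, to evaluate and tune the performance of applications on top of Alluxio at larger scale, we designed and implemented Alluxio-Perf, a large-scale performance evaluation system for Alluxio. In this talk, I give a detailed introduction to the basic principles and usage of Alluxio's cache policies for big data and of the performance evaluation and tuning tool Alluxio-Perf.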
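To make the idea of a cache replacement policy concrete, here is a minimal sketch of LRFU, one of the policies named in the abstract, written as a toy in-memory cache. It follows the textbook LRFU scoring rule (each access contributes a weight that halves as it ages) and is not Alluxio's implementation. The class name `LRFUCache`, the `decay` parameter, and the use of a logical clock are assumptions made for illustration.

```python
class LRFUCache:
    """Toy LRFU cache (hypothetical; not Alluxio's evictor API).

    Each key carries a combined recency-frequency (CRF) score. An access
    made `age` ticks ago contributes 0.5 ** (decay * age) to the score, so
    decay=0 behaves like LFU and decay=1 behaves like LRU.
    """

    def __init__(self, capacity: int, decay: float = 0.5):
        self.capacity = capacity
        self.decay = decay      # assumed tuning knob in [0, 1]
        self.clock = 0          # logical time: one tick per access
        self.entries = {}       # key -> (crf at last access, last access tick)

    def _weight(self, age: int) -> float:
        # Weight of a single access that happened `age` ticks ago.
        return 0.5 ** (self.decay * age)

    def _current_crf(self, key) -> float:
        # Decay the stored score from the last access time to now.
        crf, last = self.entries[key]
        return crf * self._weight(self.clock - last)

    def access(self, key):
        """Record an access to `key`; return the evicted key, or None."""
        self.clock += 1
        evicted = None
        if key in self.entries:
            # Hit: the new access adds weight 1 to the decayed old score.
            crf = 1.0 + self._current_crf(key)
        else:
            if len(self.entries) >= self.capacity:
                # Miss on a full cache: evict the key with the lowest score.
                evicted = min(self.entries, key=self._current_crf)
                del self.entries[evicted]
            crf = 1.0
        self.entries[key] = (crf, self.clock)
        return evicted
```

With `decay=0.0`, a block accessed twice outlives a block accessed once, however recent the single access was. With `decay=1.0`, the least recently accessed block is evicted first. Values in between trade off the two behaviors.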

Sijie Guo is the tech lead of Twitter’s Messaging group. Sijie is the cocreator of Apache DistributedLog and the PMC chair of Apache BookKeeper.

Presentations

Transactional streaming with Apache DistributedLog (Session)

Sijie Guo explores the technical challenges of exactly-once delivery and transaction support in messaging and streaming storage systems and explains how Apache DistributedLog helps achieve transactional streaming.

Modern streaming architectures (Tutorial)

The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the art for a real-time data stack? Sijie Guo and Maosong Fu explore the typical challenges in a modern real-time data stack and explain how the modern technology will impact streaming architecture and applications in the future.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him.

Presentations

Deep learning with TensorFlow (Tutorial)

TensorFlow is a popular open source machine learning library that is especially well suited to deep learning. This tutorial introduces machine learning and deep learning through practical examples. We guide attendees through hands-on exercises using TensorFlow and TensorBoard.

On-device machine learning: TensorFlow on Android (Session)

Machine learning has traditionally been performed only on servers and high-performance machines, but on-device machine learning on mobile devices can be very valuable. Yufeng Guo uses TensorFlow to implement a deep learning model for image classification on an Android device, tailored to a custom dataset. You'll leave ready to get started on your own mobile deep learning solutions.

Hao Hao is a software engineer at Cloudera currently working on Apache Kudu and Apache Sentry and is a committer and PMC member of the Apache Sentry project. Previously, she worked on eBay’s Search Backend team, building search infrastructure for eBay’s online buying platform. Hao performed extensive research on smartphone security and web security while she was a PhD student at Syracuse University.

Presentations

Apache Kudu: 1.0 and beyond (Session)

Hao Hao offers an overview of Apache Kudu, a project that enables fast analytics on big data.

Ron-Chung Hu is a database system architect at Huawei Technologies, where he works on building a big data analytics platform based on Apache Spark. Previously, he worked at Teradata, Sybase, and MarkLogic, focusing on parallel database systems and search engines. Ron holds a PhD in computer science from the University of California, Los Angeles.

Presentations

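A cost-based optimizer framework for Spark SQL (Session)

We contributed a cost-based optimizer framework to the community release of Spark 2.2. In our framework, we calculate the cardinality and output size of each database operator. With reliable statistics and accurate estimates, we can make good decisions in areas such as choosing the correct build side of a hash join, choosing the right join algorithm (such as broadcast hash join versus shuffled hash join), and adjusting the join order. This cost-based optimizer framework substantially improves the performance of Spark SQL queries. In this talk, we present Spark SQL's new cost-based optimizer framework and its performance impact on TPC-DS queries.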
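As a rough illustration of the kind of decision this abstract describes, the sketch below picks a hash join's build side and join algorithm from estimated sizes. It is a simplified toy, not Spark's optimizer code. The `Estimate` class, the `plan_join` function, and the 10 MiB broadcast threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Estimated statistics for one join input (hypothetical structure)."""
    name: str
    row_count: int       # estimated cardinality
    avg_row_bytes: int   # estimated average row width

    @property
    def size_bytes(self) -> int:
        # Estimated output size = cardinality * average row width.
        return self.row_count * self.avg_row_bytes


# Illustrative threshold only; a real system would make this configurable.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def plan_join(left: Estimate, right: Estimate,
              threshold: int = BROADCAST_THRESHOLD_BYTES):
    """Return (algorithm, build_side, probe_side) for a two-way hash join."""
    # Build the hash table on the smaller input; probe with the larger one.
    if left.size_bytes <= right.size_bytes:
        build, probe = left, right
    else:
        build, probe = right, left
    # Broadcast the build side only if it is estimated to be small enough.
    if build.size_bytes <= threshold:
        algorithm = "broadcast_hash_join"
    else:
        algorithm = "shuffled_hash_join"
    return algorithm, build.name, probe.name


if __name__ == "__main__":
    orders = Estimate("orders", row_count=1_000_000, avg_row_bytes=100)
    regions = Estimate("regions", row_count=50, avg_row_bytes=40)
    print(plan_join(orders, regions))
```

The point of the sketch is that both choices depend entirely on the estimates. If the cardinality of `regions` were badly overestimated, the same logic would fall back to the more expensive shuffled join, which is why reliable statistics matter.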

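Andy M Huang (黄明) is a T4 expert in Tencent's data platform department and one of the early researchers and evangelists of Spark, with experience and research in distributed computing and machine learning. He is responsible for building large-scale parallel computing and intelligent learning platforms that help Tencent's data and machine learning businesses grow quickly.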

Presentations

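Angel: A machine-learning framework for high dimensionality (Session)

In machine learning and artificial intelligence, feature dimensionality often swells to tens or even hundreds of millions so that models perform better online. At that scale, traditional distributed computing frameworks struggle to deliver high performance. To address this, Tencent introduced the Angel machine learning framework, which supports high-performance machine learning for models with very high dimensionality. The framework supports developing its own high-performance machine learning algorithms and can also act as a PS (parameter server) engine that provides PS support to other frameworks (such as Spark), forming a healthy PS ecosystem overall.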

Shengsheng (Shane) Huang is a software architect at Intel and an Apache Spark committer and PMC member, leading the development of large-scale analytical applications and infrastructure on Spark at Intel. Her area of focus is big data and distributed machine learning, especially deep (convolutional) neural networks. Previously, at the National University of Singapore (NUS), her research interests were large-scale vision data analysis and statistical machine learning.

Presentations

Distributed deep learning at scale on Apache Spark with BigDL (Session)

Zhichao Li, Shengsheng Huang, and Yiheng Wang explore how data scientists have adopted BigDL for deep learning analysis on large amounts of data in a distributed fashion, allowing them to use their big data cluster as a unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.

Edwin Law was the third person and first engineer on the Data team at Grab (formerly MyTeksi and Grab Taxi), which encompasses data engineering, data science, and data analytics. Edwin leads the almost-15-member-strong Data Engineering and Database Operations teams as their engineering manager.

Presentations

Driving Southeast Asia forward with big data (Session)

In SEA, Grab is sitting at the junction of the digital and physical worlds. Its vision is to drive Southeast Asia forward and transform the way people travel and pay across the region. Feng Cheng and Edwin Law explain Grab's data architecture and offer a history of its data platform migration and stream-processing apps.

Tony Lee is the chief security officer at JD.

Presentations

Leveraging big data for security analytics at JD (Session)

JD.com is one of the largest B2C online retailers in the world. Its mission is to provide a safe and secure marketplace for its 226M active users and 120K third-party vendors. Jimmy Zhigang Su and Tony Lee discuss the transformations big data has enabled at JD, including threat intelligence, account security, and end-point security.

Fangshi Li is a senior software engineer on LinkedIn’s Hadoop team. Fangshi built and open-sourced Dr. Elephant. He is currently doing Hive- and Spark-related work. Fangshi holds a degree from Carnegie Mellon.

Presentations

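Building the bridge between Hadoop and Kafka at LinkedIn: A Hadoop team's perspective (Session)

Kafka and Hadoop are at the core of the online and offline parts of LinkedIn's data infrastructure. Kafka was created and open-sourced by LinkedIn. Its clusters now span more than a thousand machines and collect and process 14 trillion messages a day. LinkedIn's Hadoop clusters have more than 10,000 machines and 50 PB of data and process 200,000 jobs a day. In this session, I explain, from the perspective of a Hadoop team member, how LinkedIn builds the bridge between Hadoop and Kafka so that they work better together. Topics include: 1) LinkedIn's data architecture and the full ETL flow of a dataset from its creation, through Kafka, into Hadoop, and finally to its presentation to users (data analysts); 2) one of our use cases, which uses Apache Flume and Kafka to collect and analyze data from Hadoop clusters and build a real-time analytics application; 3) our latest work, which provides a unified SQL interface that lets users process Kafka data streams and HDFS data together.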

Haoyuan Li is founder and CEO of Alluxio (formerly Tachyon Nexus), a memory-speed virtual distributed storage system. Before founding the company, Haoyuan was working on his PhD at UC Berkeley’s AMPLab, where he cocreated Alluxio. He is also a founding committer of Apache Spark. Previously, he worked at Conviva and Google. Haoyuan holds an MS from Cornell University and a BS from Peking University.

Presentations

Using Alluxio (formerly Tachyon) to speed up big data analytics (Tutorial)

In this three-hour tutorial, we teach participants the fundamentals of Alluxio, demonstrate how Alluxio works, and show how to use the system to help distributed computing engines (such as Spark or MapReduce) share data at memory speed.

The architecture of decoupling compute and storage with open source Alluxio (Session)

Decoupling of storage and computation is becoming increasingly popular for big data analytics platforms. Haoyuan Li and Gene Pang share production best practices and solutions to best utilize CPUs, memory, and different tiers of disaggregated compute and storage systems to build out a multitenant high-performance platform that addresses real-world business demands.

Yu Li is a senior technical expert at Alibaba leading the Alibaba Search HBase team. An HBase committer, Yu has over seven years’ experience working with the Hadoop stack for enterprise solutions and has supported Alibaba for three Singles’ Days.

Presentations

Off-heap HBase read path in production: The Alibaba story (Session)

Yu Li explains how Alibaba met the challenge of tens of millions of requests per second to its Alibaba-Search HBase cluster on 2016 Singles' Day. With read-path off-heaping, Alibaba improved the throughput by 30% and achieved a predictable latency.

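利智超 (Zhichao Li) is a member of Intel's big data technology team, focusing on big data analytics, and a Spark contributor. He and his colleagues are dedicated to developing distributed machine learning algorithms on the Apache Spark platform to meet machine learning needs in the big data era. He also optimizes these distributed machine learning algorithms for Intel platforms and helps Intel's customers develop big data analytics applications for their businesses.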

Presentations

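Apache Spark advanced practice and principle analysis (Training)

In recent years, as big data analytics and machine learning have become ever more widely applied in industry, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark in order to take advantage of large volumes of raw data and scale-out architectures. How can you gain a deep understanding of key big data technologies and apply them better? Drawing on current waves and trends in big data technology, this course introduces advanced practices and the underlying principles of Apache Spark. It helps you better grasp the essence of Apache Spark's design and how it integrates closely with streaming analytics, machine learning, and deep learning, and it offers consistent, integrated advanced practices across data ingestion, analysis and processing, feature extraction, and machine learning.

Distributed deep learning at scale on Apache Spark with BigDL (Session)

Zhichao Li, Shengsheng Huang, and Yiheng Wang explore how data scientists have adopted BigDL for deep learning analysis on large amounts of data in a distributed fashion, allowing them to use their big data cluster as a unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.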

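林元庆 (Lin Yuanqing) is director of Baidu's Institute of Deep Learning (IDL). He holds a master's degree in optical engineering from Tsinghua University and a PhD in electrical engineering from the University of Pennsylvania.
Lin has many years of research experience and notable achievements in machine learning and computer vision. Before joining Baidu, he headed the media analytics department at NEC Laboratories America, where under his leadership the NEC research team achieved world-leading results in deep learning, computer vision, and autonomous driving. Since 2005 he has published more than 30 papers in top international conferences and journals and holds 11 US patents. He has served as an area chair for NIPS and as cochair of an international workshop on large-scale visual recognition and retrieval, among other roles.
Since joining Baidu, Lin has led the Institute of Deep Learning in developing dominant artificial intelligence technologies. His team has made major technical advances in multiple areas and applied them to many Baidu products, greatly improving product performance and user experience, and has taken first place worldwide on international benchmarks for several important computer vision technologies.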

Presentations

Keynote by Lin Yuanqing (Keynote)

Details to come.

Shaoshan Liu is the cofounder and president of PerceptIn, a company working on developing a next-generation robotics platform. Previously, he worked on autonomous driving and deep learning infrastructure at Baidu USA. Shaoshan holds a PhD in computer engineering from the University of California, Irvine.

Presentations

Powering robotics clouds with Alluxio (Session)

The rise of robotics applications demands new cloud architectures that deliver high throughput and low latency. Shaoshan Liu explains how PerceptIn designed and implemented a cloud architecture to support these emerging user requirements using Alluxio.

Yi Liu (刘轶) is the lead architect for Paid IM (internet marketing) at eBay, where he leads the architecture design for eBay’s ads, marketing data, and experimentation platform. Previously, he was an architect for big data infrastructure at Intel, where he also led Hadoop open source contributions and optimizations for projects in the Hadoop ecosystem and led large-scale machine-learning projects and optimizations based on Spark. Yi is a committer and PMC member of Apache Hadoop.

Presentations

eBay's data service and processing platform for ads (Session)

eBay is one of the largest ecommerce companies in the world. Yi Liu and Darren Fu share eBay's data service and processing platform for ads, based on Hadoop, Spark, Kafka, and Cassandra, explore the challenges of high-performance SQL operations on large datasets, and explain how eBay handles them.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Saturday opening welcome (Keynote)

Program chairs Jason Dai, Ben Lorica, and Doug Cutting open the second day of keynotes.

Zhenxiao Luo is a software engineer at Uber working on Presto and Parquet. Before joining Uber, he led the development and operations of Presto at Netflix. Zhenxiao has big data experience at Facebook, Cloudera, and Vertica on Hadoop-related projects. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.

Presentations

Columnar storage at Uber (Session)

As Uber continues to grow, its big data systems must also grow in scalability, reliability, and performance to help Uber make business decisions, give user recommendations, and analyze experiments across all data sources. Zhenxiao Luo shares his experience running columnar storage in production at Uber and discusses query optimization techniques in SQL engines.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera helping clients find success with the Hadoop ecosystem and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting data applications and data products (Tutorial)

Ted Malaska walks you through building a fraud-detection system, using an end-to-end case study to provide a concrete example of how to architect and implement real-time systems via Apache Hadoop components like Kafka, HBase, Impala, and Spark.

Big data modeling (Tutorial)

The recent advancement in distributed processing engines, from Spark to Impala to Spark Streaming and Storm, has proved exciting. Ted Malaska explains why, if your design only focuses on the processing layer to get speed and power, you may be missing half the story, leaving a significant amount of optimization untapped.

Mastering Spark unit testing (Session)

Ted Malaska explores examples of unit testing Spark Core, Spark MLlib, Spark GraphX, Spark SQL, and Spark Streaming, walking you through building and running the unit tests in real time and proving that debugging Spark is as easy as any other Java process.

Gene Pang is a software engineer at Alluxio. Previously, he worked at Google. Gene recently earned his PhD from the AMPLab at UC Berkeley, where his research focused on distributed database systems, and holds an MS from Stanford University and a BS from Cornell University.

Presentations

The architecture of decoupling compute and storage with open source Alluxio (Session)

Decoupling of storage and computation is becoming increasingly popular for big data analytics platforms. Haoyuan Li and Gene Pang share production best practices and solutions to best utilize CPUs, memory, and different tiers of disaggregated compute and storage systems to build out a multitenant high-performance platform that addresses real-world business demands.

Best practices for using Alluxio with Spark (Session)

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory for managing data across different storage. Many deployments use Alluxio with Spark. Gene Pang and Bin Fan explain how Alluxio helps Spark be more effective and share examples of production deployments of Alluxio and Spark working together.

Jiangjie Qin is on the Data Infrastructure team at LinkedIn. He works on Apache Kafka and is a Kafka Committer. Previously, he worked at IBM, where he managed IBM’s zSeries platform for banking clients. Jiangjie holds a master’s degree in information networking from Carnegie Mellon’s Information Networking Institute.

Presentations

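From simple to complex: A detailed explanation of Apache Kafka applications in practice (Tutorial)

As one of the most popular messaging systems of recent years, Apache Kafka has seen its use cases grow from the original centralized system message queue to a range of more complex scenarios, including stream processing, database replication, and CDC. Based on Kafka practice at LinkedIn, this talk describes Kafka's various use cases in detail.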

陈怡
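Big data R&D engineer at Intel, currently focused on open source contributions to the Apache Hadoop HDFS community, including development of the erasure coding feature and the smart storage management feature.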

Presentations

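Demystifying erasure coding in HDFS (Session)

Hadoop 3.0 introduces erasure coding. Under common configurations, erasure coding can reduce storage cost by 50% compared with traditional three-way replication while improving data reliability. In this talk, we first briefly introduce HDFS erasure coding, then take a deep look at the work we did before Hadoop 3.0 GA to ensure the stability of the erasure coding feature, and share how important members of the Hadoop ecosystem, such as Spark, Hive, Impala, and Kylin, perform on HDFS erasure coding. Finally, we offer some considerations and recommendations for deploying erasure coding in production.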
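The 50% figure in the abstract can be checked with simple arithmetic. The sketch below assumes the "common configuration" is a Reed-Solomon layout with 6 data blocks and 3 parity blocks per group. That layout is an assumption, since the abstract does not name a specific policy, and the function names are hypothetical and used only for illustration.

```python
from fractions import Fraction


def replication_footprint(replicas: int) -> Fraction:
    """Raw bytes stored per byte of user data under n-way replication."""
    return Fraction(replicas)


def erasure_coding_footprint(data_units: int, parity_units: int) -> Fraction:
    """Raw bytes stored per byte of user data under a (data, parity) layout."""
    return Fraction(data_units + parity_units, data_units)


def storage_savings(baseline: Fraction, candidate: Fraction) -> Fraction:
    """Fraction of raw storage saved by moving from baseline to candidate."""
    return 1 - candidate / baseline


if __name__ == "__main__":
    rep = replication_footprint(3)        # three-way replication
    ec = erasure_coding_footprint(6, 3)   # assumed 6 data + 3 parity blocks
    print("replication footprint:", rep)
    print("erasure coding footprint:", ec)
    print("savings:", storage_savings(rep, ec))
```

Under these assumptions the erasure-coded layout stores 1.5 bytes of raw data per byte of user data instead of 3, which is where the 50% saving comes from. It also tolerates the loss of any 3 blocks in a group, compared with 2 lost copies under three-way replication.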

Jimmy Su is the head of JD security research center in Silicon Valley, where he leads the security research projects in the areas of account security, APT detection, IoT security, mobile security, and email security.

Presentations

Leveraging big data for security analytics at JD (Session)

JD.com is one of the largest B2C online retailers in the world. Its mission is to provide a safe and secure marketplace for its 226M active users and 120K third-party vendors. Jimmy Zhigang Su and Tony Lee discuss the transformations big data has enabled at JD, including threat intelligence, account security, and end-point security.

Daniel Templeton has a long history in high-performance computing, open source communities, and technology evangelism. Today, Daniel works on the YARN development team at Cloudera, where he focuses on the resource manager, fair scheduler, and Docker support.

Presentations

Apache Hadoop 3.0 features and development update (Session)

Apache Hadoop 3.0 has made steady progress toward a planned release this year. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, and MapReduce task-level optimization, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Ramkrishna is a senior software engineer at Intel working with Apache HBase. He is also an Apache Phoenix PMC member. Recently, Ramkrishna has been actively working on performance-related features in HBase.

Presentations

生产环境里的堆外内存HBase读路径——阿里巴巴的故事 (Off-heap HBase read path in production: The Alibaba story) 议题 (Session)

Yu Li explains how Alibaba met the challenge of tens of millions of requests per second to its Alibaba-Search HBase cluster on 2016 Singles' Day. With read-path off-heaping, Alibaba improved throughput by 30% and achieved predictable latency.

Andrew Wang is a software engineer at Cloudera on the HDFS team, an Apache Hadoop committer and PMC member, and the release manager for Hadoop 3.0. Previously, he was a PhD student in the AMP Lab at UC Berkeley, where he worked on problems related to distributed systems and warehouse-scale computing. He holds a master’s and a bachelor’s degree in computer science from UC Berkeley and UVA respectively.

Presentations

Apache Hadoop 3.0的特性和开发进展的更新 (Apache Hadoop 3.0 features and development update) 议题 (Session)

Apache Hadoop 3.0 has made steady progress toward a planned release this year. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, and MapReduce task-level optimization, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

HDFS纠删码最新探秘 (Demystifying erasure coding in HDFS) 议题 (Session)

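Hadoop 3.0 introduces erasure coding. In common configurations, erasure coding can cut storage cost by 50% compared with traditional 3-way replication while also improving data reliability. In this talk, we first briefly introduce HDFS erasure coding, then take a closer look at the work we did before Hadoop 3.0 GA to ensure the stability of the erasure coding feature, and share how important members of the Hadoop ecosystem, such as Spark, Hive, Impala, and Kylin, perform on HDFS erasure coding. Finally, we offer some considerations and recommendations for deploying erasure coding in production.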

Carson Wang is a big data software engineer at Intel, focusing on developing and improving new big data technologies. He is an active open source contributor to the Spark and Alluxio projects. Prior to Intel, Carson was an engineer at Microsoft working on cloud computing technologies.

Presentations

Apache Spark高级实践和原理解析 (Advanced practice and principle analysis) 培训 (Training)

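In recent years, as big data analytics and machine learning have been applied ever more widely in industry, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark, in order to take advantage of large volumes of raw data and scale-out architectures. How can you gain a deep understanding of key big data technologies and apply them better? This course draws on current waves and trends in big data technology to present advanced practices and the underlying principles of Apache Spark. It helps you better grasp the essence of Apache Spark's design and how Spark combines closely with streaming analytics, machine learning, and deep learning, providing consistent and integrated advanced practices across data collection, analysis and processing, feature extraction, and machine learning.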

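王道远 is a senior software engineer at Intel Asia-Pacific Research and Development Ltd. He has worked on Spark SQL development since 2014 and is an active contributor to the Apache Spark open source community. Before working on Spark, he took part in developing the IDH release of Hive. He translated the book 《Spark快速大数据分析》.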

Presentations

Apache Spark高级实践和原理解析 (Advanced practice and principle analysis) 培训 (Training)

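In recent years, as big data analytics and machine learning have been applied ever more widely in industry, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark, in order to take advantage of large volumes of raw data and scale-out architectures. How can you gain a deep understanding of key big data technologies and apply them better? This course draws on current waves and trends in big data technology to present advanced practices and the underlying principles of Apache Spark. It helps you better grasp the essence of Apache Spark's design and how Spark combines closely with streaming analytics, machine learning, and deep learning, providing consistent and integrated advanced practices across data collection, analysis and processing, feature extraction, and machine learning.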

Spinach: 使用Spark SQL进行即席查询 (Spinach: Using Spark SQL for ad hoc queries) 议题 (Session)

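Spinach is an open source collaboration between Intel's big data team and Baidu's infrastructure team. It aims to optimize large-scale ad hoc queries on Spark SQL, meeting the need in Baidu's online business for second-level queries over massive search logs. Using techniques such as user-defined distributed indexes and automatic caching, Spinach greatly accelerates SQL queries in certain scenarios. Spinach supports multiple index types, letting users choose an appropriate index based on the characteristics of their data, which speeds up queries while adding little extra storage overhead. In Baidu's production environment, Spinach already serves as the platform's query acceleration solution, delivering roughly a 5x performance improvement for some real queries, greatly reducing query run time and broadening the range of Spark SQL use cases.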

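王海华 is an engineer at Didi Chuxing who focuses on distributed computing and the Hadoop ecosystem. An Alluxio contributor, 王海华 has extensive research and hands-on experience in Spark and distributed computing.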

Presentations

在滴滴出行的最佳实践 (Spark best practices at Didi) 议题 (Session)

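Spark is a cluster computing platform that originated at UC Berkeley's AMPLab and represents a major advance over MapReduce. Didi Chuxing currently runs more than 6,000 Spark applications per day, mainly for offline SQL, machine learning model training, map computation, and stream processing. This session covers Didi Chuxing's large-scale use of Spark, its Spark tuning experience, and its Spark diagnostics and tuning system.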

Yiheng Wang is a software development engineer on the Big Data Technology team at Intel who works in the area of big data analytics. He and his colleagues are developing and optimizing distributed machine-learning algorithms (e.g., neural network and logistic regression) on Apache Spark. He also helps Intel customers build and optimize their big data analytics applications.

Presentations

Apache Spark高级实践和原理解析 (Advanced practice and principle analysis) 培训 (Training)

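In recent years, as big data analytics and machine learning have been applied ever more widely in industry, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark, in order to take advantage of large volumes of raw data and scale-out architectures. How can you gain a deep understanding of key big data technologies and apply them better? This course draws on current waves and trends in big data technology to present advanced practices and the underlying principles of Apache Spark. It helps you better grasp the essence of Apache Spark's design and how Spark combines closely with streaming analytics, machine learning, and deep learning, providing consistent and integrated advanced practices across data collection, analysis and processing, feature extraction, and machine learning.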

使用Apache Spark和BigDL来构建深度学习驱动的大数据分析 (Building deep learning-powered big data analytics using Apache Spark and BigDL) 教学辅导课 (Tutorial)

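Deep learning has achieved state-of-the-art results in many fields (such as computer vision, natural language processing, and speech recognition) and has great potential value for industry. Deep learning and big data are closely linked. First, deep learning models need large amounts of data for training, which is why the field only began to flourish in the big data era. Second, most big data today is video, audio, and text, which is well suited to deep learning algorithms. To unlock the power of deep learning, we should apply it in big data environments.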

使用BigDL在Apache Spark上进行大规模分布式深度学习 (Distributed deep learning at scale on Apache Spark with BigDL) 议题 (Session)

Zhichao Li, Shengsheng Huang, and Yiheng Wang explore how data scientists have adopted BigDL for deep learning analysis on large amounts of data in a distributed fashion, allowing them to use their big data cluster as a unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.

Dennis Weng is vice president of engineering at JD Group. A 20-year IT veteran of cutting-edge technology companies, Dennis is an expert on storage systems and very large clustering systems. He holds nearly 20 US patents. Three years ago, he returned to China to lead the AI and big data group at JD. Dennis holds a master's degree from Lakehead University in Canada.

Presentations

An ecommerce future: AI and big data 主题演讲 (Keynote)

敬请期待更多细节 (Details to come)。

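Has four years of project experience with Hadoop and its ecosystem, focusing on the design, deployment, and implementation of big data solutions, with project experience across multiple industries such as telecommunications, insurance, manufacturing, and public safety.
Skilled at helping business units extract value from big data through efficient Hadoop architecture design and implementation combined with a range of big data tools.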

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

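China Life has accumulated a large amount of data over the years. The main goal of our data department is to mine the value of that data in depth and apply it to areas such as business development, risk management, and customer service. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge machine learning and deep learning techniques, and we share the architecture of our machine learning system, the application development process, and the lessons learned along the way.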

Mingxi Wu is the vice president of engineering at GraphSQL, a startup building a world-leading real-time graph data platform. Over his 15-year career, Mingxi has focused on database research and data management software in Microsoft’s SQL Server group and Oracle’s relational database optimizer group. He has won research awards at the most prestigious publication venues in databases and data mining (SIGMOD, KDD, and VLDB). Lately, he has been focusing on building an easy-to-use and highly expressive graph query language. Mingxi holds a PhD from the University of Florida, where he specialized in both databases and data mining.

Presentations

GraphSQL: 崭新的游戏规则一个完整的高效图数据和分析平台 (GraphSQL is a game changer: A complete high performance graph data and analytics platform) 议题 (Session)

Mingxi Wu and Yu Xu offer an overview of GraphSQL, a high-performance enterprise graph data platform for real-time graph analytics that enables businesses to transform structured, semistructured, and unstructured data and massive enterprise data silos into an intelligent interconnected data network, uncovering implicit patterns and critical insights to drive business growth.

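吴中 graduated from Tsinghua University, earning a PhD in computer science and technology under the supervision of Dr. Harry Shum (沈向洋), executive vice president at Microsoft. 吴中 is now a technical director at DataVisor, mainly responsible for DataVisor's China business, and has published a number of influential papers in top computer vision venues such as CVPR, ICCV, and PAMI, with several patent applications in big data search and big data security. Before joining DataVisor, 吴中 worked on image search in Microsoft's Bing division. That work included extracting and indexing large-scale text and image features, building high-performance systems, and designing efficient algorithms, improving the relevance of Bing image search results by raising the quality of an index of billions of images.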

Presentations

欺诈的潜伏性: 如何利用大数据进行反欺诈检测 (The latency of fraud: How to use big data to detect fraud) 议题 (Session)

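How many of your users are dormant fraudsters waiting to launch an attack? Every online user community faces the risk of hidden groups and fraud by sleeper accounts. Drawing on DataVisor's analysis of more than 1 billion users and 500 billion events across online services worldwide, this session details the threat posed by sleeper fraud accounts, explores how fraudsters use sophisticated attack techniques to evade detection, and covers the use of Spark for big data security analytics.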

Tony Xing is a senior product manager on the Shared Data team within Microsoft’s Application and Service group. Previously, he was a senior product manager on the Skype data team within Microsoft’s Application and Service group. Tony is a frequent speaker at Strata.

Presentations

微软的通用异常检测平台 (The common anomaly detection platform at Microsoft) 议题 (Session)

Tony Xing offers an overview of Microsoft's common anomaly detection platform, an API service built internally to provide product teams the flexibility to plug in any anomaly detection algorithms to fit their own signal types.

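Project manager for big data machine learning at China Life, focused on the research and application of big data analytics and machine learning.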

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

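China Life has accumulated a large amount of data over the years. The main goal of our data department is to mine the value of that data in depth and apply it to areas such as business development, risk management, and customer service. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge machine learning and deep learning techniques, and we share the architecture of our machine learning system, the application development process, and the lessons learned along the way.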

Yu Xu is the cofounder and CEO at GraphSQL. Previously, Yu held data engineering roles at Twitter, Teradata, and IBM. He is the author of 26 US patents (13 issued and 13 pending) in the areas of parallel computing, large-scale data analysis, information retrieval, and data management and has published 13 papers at top database conferences. Yu has also served on the program committees of top conferences in his field.

Presentations

GraphSQL: 崭新的游戏规则一个完整的高效图数据和分析平台 (GraphSQL is a game changer: A complete high performance graph data and analytics platform) 议题 (Session)

Mingxi Wu and Yu Xu offer an overview of GraphSQL, a high-performance enterprise graph data platform for real-time graph analytics that enables businesses to transform structured, semistructured, and unstructured data and massive enterprise data silos into an intelligent interconnected data network, uncovering implicit patterns and critical insights to drive business growth.

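姚舜扬 is a software engineer on Intel's big data team, where he works on distributed storage and computing and contributes code to open source communities. He graduated from Fudan University with a degree in electronic engineering. His main interests are Hadoop and Spark performance optimization and big data security.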

Presentations

ShadowMask: 脱敏你的敏感的大数据 (ShadowMask: Anonymize your sensitive big data) 议题 (Session)

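Data security is an essential feature of a big data platform, and leakage of sensitive user information is one of the biggest threats to data security. ShadowMask is an open source data masking project built on the Spark big data platform. It meets big data users' need to anonymize private user data and balances the risk of privacy leakage against data processing requirements. This talk covers the project's goals, architecture, challenges, use cases, and current status.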

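Senior technical manager at Intel's Big Data Technology Center. With more than ten years of experience in the server hardware and software industry, currently works on promoting software solutions for big data analytics.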

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

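China Life has accumulated a large amount of data over the years. The main goal of our data department is to mine the value of that data in depth and apply it to areas such as business development, risk management, and customer service. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge machine learning and deep learning techniques, and we share the architecture of our machine learning system, the application development process, and the lessons learned along the way.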

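张李晔 is a big data architect at 新氦科技, currently focused on building and developing container-based stream processing and real-time analytics platforms. 新氦科技 is a Shanghai-based big data infrastructure company and a subsidiary of 新智集团. Before joining 新氦科技, 张李晔 was a big data software engineer at Intel Asia-Pacific Research and Development Ltd., working on code development and performance tuning for Spark and Hive.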

Presentations

HAP:多流动态实时分析系统 (HAP: A multistream dynamic real-time analytics system) 议题 (Session)

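HAP is a real-time analytics system that supports streaming input and cross-matching of multiple streams, and it can dynamically change the underlying stream processing according to the query layer to meet different business needs. On Kubernetes, it scales horizontally and provides high availability, efficiency, and speed, delivering second-level data analysis and queries while guaranteeing exactly-once semantics.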

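Founder and CEO of GrowingIO, with thirteen years of data analytics experience in Silicon Valley. Personally built LinkedIn's hundred-person business analytics and data science team, which supported the rapid growth of all of LinkedIn's revenue-related businesses. Named one of the "world's top ten cutting-edge data scientists" by Data Science Central.
In May 2015, founded GrowingIO, a new generation of data analytics product based on user behavior. It collects complete, real-time user behavior data without manual event instrumentation, helping managers, product managers, marketing and operations staff, data analysts, and growth hackers raise conversion rates, optimize websites and apps, and achieve data-driven business and user growth.
GrowingIO was named one of the 50 most innovative companies in China for 2015 by Fast Company (《快公司》) and raised a US$20 million Series A round from Matrix Partners China (经纬中国), NEA, and Greylock.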

Presentations

数据驱动企业增长 (Data-driven business growth) 议题 (Session)

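As the traffic dividend fades, data-driven user and revenue growth has become the new core. Decisions should be driven by data, not gut feeling. What makes data analytics so compelling? How can it help companies create enormous business value, and how can everyone in a company learn to make data-driven decisions? What are the latest methodologies, tools, technologies, and product ideas from Silicon Valley?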

Xuefu Zhang is a software engineer at Uber, where he is the tech team lead for SQL on Hadoop. A veteran of the open source community, Xuefu spends most of his time on Apache Hive and Pig. Previously, he was the tech lead for Hive at Cloudera and led a global effort for the Hive on Spark project, worked on the Hadoop team at Yahoo, and spent his early career at Informatica gaining important experience in enterprise software development, especially in ETL and data warehousing. Xuefu is an Apache member and a PMC member for Hive, Sentry, and Pig.

Presentations

为Hadoop上的大数据准备的统一的SQL (Unified SQL for big data on Hadoop) 议题 (Session)

Xuefu Zhang offers an overview of U-SQL, which was developed internally by engineers at Uber and is envisioned as the future of SQL platforms. U-SQL enables automatic parsing, translation, optimization, and routing for user queries written in any supported query language and provides a unified SQL interface for SQL users who might not be familiar with the underlying SQL engines.

现任领英公司研发经理,领导核心大数据团队。该团队开发和应用HDFS,YARN,Spark,TensorFlow等开源技术,为领英公司的大数据平台提供核心的存储/计算引擎。

张喆同时还是Apache Hadoop项目的管理委员会(PMC)成员。也是Hadoop3的主要功能之一,HDFS纠删码(HDFS-EC)的作者。在加入领英之前,张喆就职于Cloudera和IBM沃森研究中心。2006年至今,在国际会议和期刊上发表论文20余篇,拥有5项美国专利。在IBM期间,获杰出技术成就奖(Outstanding Technology Achievement Award)。

Zhe Zhang is an engineering manager at LinkedIn, where he leads the Core Big Data Services team, which leverages open source technologies such as Hadoop, Spark, TensorFlow, and beyond to form the storage-compute engine of LinkedIn’s big data platform. Zhe is a PMC member of Apache Hadoop and author of HDFS erasure coding, a major feature for Hadoop 3.0. Previously, Zhe worked at Cloudera and IBM’s T. J. Watson Research Center. Zhe has over 20 research publications and 5 US patents. While at IBM, he received the Research Accomplishment Award and the Outstanding Technology Achievement Award.

Presentations

成长的烦恼--领英大数据平台500倍扩展中应对的挑战 (Growing pains: When your big data platform grows really big) 主题演讲 (Keynote)

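LinkedIn was one of the first companies in the world to adopt big data technologies. Over the past nine years, LinkedIn's big data platform has grown nearly 500-fold, from 20 nodes supporting 10 users running MapReduce to more than 10,000 nodes supporting several thousand engineers and scientists running all kinds of large-scale data analysis, from interactive Presto queries to TensorFlow deep learning. This talk shares how LinkedIn's big data platform team tackles the challenges that come with massive scale and rapid growth.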

领英大数据平台--超过1万节点,每天15万个作业,智能连接4.7亿职场用户 (LinkedIn's big data platform: 10,000+ nodes and 150,000+ daily jobs connecting 470 million members) 议题 (Session)

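LinkedIn was one of the first companies in the world to adopt big data technologies. Over the past nine years, LinkedIn's big data platform has grown nearly 500-fold, from 20 nodes supporting 10 users running MapReduce to more than 10,000 nodes supporting several thousand engineers and scientists running all kinds of large-scale data analysis, from interactive Presto queries to TensorFlow deep learning. This talk shares how LinkedIn's big data platform team tackles the challenges that come with massive scale and rapid growth.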

Xiaoyong Zhu is a program manager at Microsoft focusing on scalable machine learning and advanced analytics.

Presentations

使用R和Apache Spark处理大规模数据 (Scaling R faster and larger using Apache Spark) 议题 (Session)

R is a popular data science tool for data analysis. However, it has many drawbacks, such as its memory utilization and single-thread design, that limit its usage for big data analysis. Xiaoyong Zhu explains how to use R to analyze terabytes of data.

Jia Zou runs product engineering at Mobike, the biggest bike-sharing company in China, where he is responsible for the design and implementation of the software technical stack that powers Mobike’s app. Previously, Jia was an engineer at Google, where he worked on Google Wallet, as well as a founding member of Uber’s China Growth team, responsible for supplying the technologies needed to grow the driver base for Uber China. He holds a PhD from the University of California, Berkeley.

Presentations

主题演讲 (Keynote by Jia Zou) 主题演讲 (Keynote)

敬请期待更多细节。Details to come.

Blockchain R&D at Wanda Internet Technology (万达网络科技)

Presentations

Hyperledger与CDH大数据生态系统的融合以及应用实践 (Hyperledger’s integration with CDH's big data ecosystem and its real-world applications) 议题 (Session)

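Blockchain, the technology behind Bitcoin, is a decentralized distributed ledger technology. Hyperledger is an open source, cross-industry blockchain platform. It is a global collaborative project formed by industry leaders in finance, banking, the Internet of Things, supply chain, and manufacturing. We integrated Hyperledger with CDH to take advantage of CDH's service deployment, monitoring, and management capabilities. With this project, users can easily deploy Hyperledger clusters in CDH-managed data centers and use the CDH big data platform to analyze Hyperledger data and extract more business value. Projects in use inside Wanda include a digital rights platform and a shared commerce platform, the latter of which spans finance, supply chain, and other areas. We believe this project will be very helpful to the Hyperledger open source community.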

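Joined Taobao's technology department in March 2014, focusing on building Spark clusters and services within the group. Joined Alibaba Cloud in May 2015 to work on providing open source computing services on the public cloud. Focuses on distributed computing and contributes to the Apache Hadoop and Spark communities.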

Presentations

Hadoop遇到云上对象存储——实现原理、陷阱和性能优化 (When Hadoop meets object storage: Implementation principles, pitfalls, and performance optimization) 议题 (Session)

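The Hadoop community has long supported object storage on public clouds, such as AWS S3 and Azure Storage. The recently released Apache Hadoop 3.0 (alpha) adds support for more cloud storage services, such as Azure Data Lake and Alibaba Cloud OSS. These cloud stores all provide Hadoop-compatible file systems, so users can treat them as another HDFS. However, object storage and HDFS differ in many ways in how they are implemented, so even though the two expose similar file system interfaces, many APIs behave quite differently. Drawing on experience with Alibaba Cloud OSS, this session describes how the Alibaba Cloud OSS FileSystem implementation made its way into Apache Hadoop. It also covers how object storage differs from traditional file systems in file upload, download, deletion, and move operations, and evaluates the strengths and weaknesses of HDFS and the OSS file system in terms of performance and cost. Finally, it presents optimizations based on the characteristics of object storage that can improve the performance of open source engines such as Hive and Spark when they access object storage.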

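Big data architect at Intel and Spark open source contributor. Has 10 years of software development experience and is familiar with big data, stream computing, storage, and virtualization. Has helped several companies build Spark-based stream processing solutions.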

Presentations

Apache Spark高级实践和原理解析 (Advanced practice and principle analysis) 培训 (Training)

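In recent years, as big data analytics and machine learning have been applied ever more widely in industry, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark, in order to take advantage of large volumes of raw data and scale-out architectures. How can you gain a deep understanding of key big data technologies and apply them better? This course draws on current waves and trends in big data technology to present advanced practices and the underlying principles of Apache Spark. It helps you better grasp the essence of Apache Spark's design and how Spark combines closely with streaming analytics, machine learning, and deep learning, providing consistent and integrated advanced practices across data collection, analysis and processing, feature extraction, and machine learning.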

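A machine learning veteran, formerly a senior algorithm engineer at MediaV and an algorithm expert in Alibaba's Taobao technology department, and now a senior researcher at 万达网络研究院. Has considerable experience applying machine learning and optimization algorithms to computational advertising and recommender systems, and some exposure to big data-based anti-fraud and credit risk assessment.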

Presentations

从LR到DNN点击率预估系统的进化 (The evolution of CTR prediction systems, from LR to DNN) 议题 (Session)

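Ad click-through rate (CTR) prediction is a hot topic, and companies in computational advertising generally run their own CTR systems. The theme I want to share is how to improve a CTR prediction system in a stable and controlled way, and what to do about data, architecture, and algorithms at different points in time. The talk looks back at how a CTR prediction system evolved from its initial plain ETL + LR form into a mature online system with online model training, automatic baddit, and automatic large-scale feature exploration. It focuses on the reasoning behind choosing a particular technical direction at several key points in that evolution, given the circumstances at the time. Drawing on the ML and DL body of knowledge and the developments of the past two years, and using several well-known industry applications as a thread, it takes several key milestones (the launch and retirement of 千人千面 per-user personalization, the year-by-year evolution of Double 11) as examples to introduce large-scale machine learning and distributed optimization, offering attendees a set of historical cases to consult when choosing approaches to ML and DL problems in their own business.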

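Chief data scientist at TalkingData. Founded the Fregata open source project in 2016. Previously worked at IBM China Research Lab, Tencent's data platform department, and Huawei's Noah's Ark Lab. Has 10 years of in-depth research and hands-on experience in large-scale machine learning and data mining. Currently leads data science at TalkingData.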

Presentations

Fregata:在Spark上支持万亿维模型的机器学习算法库 (Fregata: Machine learning algorithm libraries for supporting trillion-dimensional models on Spark) 议题 (Session)

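Some of TalkingData's core business capabilities, such as Lookalike, depend heavily on large-scale machine learning, and we found that existing large-scale machine learning technologies did not meet our needs well. We need machine learning algorithms for large-scale data that are fast, stable, and free of parameter tuning, which current mainstream platforms and tools cannot provide. To that end, we did research on both algorithms and systems and achieved some results. Our open source Fregata machine learning library is built entirely on standard Spark interfaces, and its logistic regression and softmax algorithms require no parameter tuning, run fast, and support models with trillions of dimensions. Using roughly two to four servers, Fregata's logistic regression algorithm can finish training in 15 minutes on 510 million training examples with 1 trillion dimensions. In this talk, we describe some of Fregata's work on the algorithm and system sides.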

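张铭 is a professor and doctoral supervisor at the School of Electronics Engineering and Computer Science (信息科学技术学院), Peking University, the only member from China on the ACM Education Council, chair of the ACM China education committee, and a member of the ACM/IEEE IT2017 curriculum guidelines drafting group. After entering Peking University in 1984, 张铭 earned bachelor's, master's, and doctoral degrees there. Research interests include text mining, social network analysis, and educational big data. 张铭 currently leads ongoing projects funded by the National Natural Science Foundation of China and the Ministry of Education's doctoral program fund, has coauthored more than 100 research papers (in top-tier conferences and journals such as ICML, KDD, AAAI, IJCAI, ACL, WWW, and TKDE), and received the ICML 2014 best paper award. Other work includes teaching research papers at venues such as SIGCSE and L@S, one academic monograph, 6 software copyrights, and 3 invention patents. 张铭 is chief editor of several textbooks, two of which are national "Eleventh Five-Year Plan" textbooks; 《数据结构与算法》 (Data Structures and Algorithms) won the Beijing excellent textbook award and is supported as a national "Twelfth Five-Year Plan" textbook. The Data Structures and Algorithms course that 张铭 leads has been named a national-level and Beijing municipal-level excellent course and is also a Ministry of Education shared-resource excellent course.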

Presentations

基于深度学习的网络表示 (Network representations based on deep learning) 议题 (Session)

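Network structures are everywhere in the real world (airline networks, communication networks, paper citation networks, the World Wide Web, social networks, and so on). Large-scale network data and rich node information pose new challenges for research methods and have drawn wide attention from academia and industry. This talk gives a detailed introduction to neural network-based network representation methods, which can handle real-world networks with millions of nodes and billions of edges and which mainly consider network structure information and the nodes' own information (such as text and attributes). The learned low-dimensional network representations show good efficiency and effectiveness across application domains.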

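李元健 is a senior R&D engineer in Baidu's infrastructure department and an Apache Spark contributor. After joining Baidu in 2011, 李元健 took part in and led R&D on core services including Baidu's real-time computing platform DStream, the tracing platform Rig, the Spark platform, and the public cloud service BigSQL.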

Presentations

Spinach: 使用Spark SQL进行即席查询 (Spinach: Using Spark SQL for ad hoc queries) 议题 (Session)

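Spinach is an open source collaboration between Intel's big data team and Baidu's infrastructure team. It aims to optimize large-scale ad hoc queries on Spark SQL, meeting the need in Baidu's online business for second-level queries over massive search logs. Using techniques such as user-defined distributed indexes and automatic caching, Spinach greatly accelerates SQL queries in certain scenarios. Spinach supports multiple index types, letting users choose an appropriate index based on the characteristics of their data, which speeds up queries while adding little extra storage overhead. In Baidu's production environment, Spinach already serves as the platform's query acceleration solution, delivering roughly a 5x performance improvement for some real queries, greatly reducing query run time and broadening the range of Spark SQL use cases.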

Senior engineer at 今日头条 (JinRi TouTiao), responsible for the entire Spark platform.

Presentations

Spark在今日头条的实践 (Spark in JinRi TouTiao) 议题 (Session)

Describes how JinRi TouTiao uses Spark to process massive amounts of data, along with some improvements made in real-world use.

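李银辉 has extensive experience in distributed systems R&D and currently leads platform infrastructure R&D for the big data platform at 万达网络科技集团. Previously at China Telecom, 李银辉 used Hadoop ecosystem components to handle ETL for China Telecom Group's massive data.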

Presentations

ShadowMask: 脱敏你的敏感的大数据 (ShadowMask: Anonymize your sensitive big data) 议题 (Session)

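Data security is an essential feature of a big data platform, and leakage of sensitive user information is one of the biggest threats to data security. ShadowMask is an open source data masking project built on the Spark big data platform. It meets big data users' need to anonymize private user data and balances the risk of privacy leakage against data processing requirements. This talk covers the project's goals, architecture, challenges, use cases, and current status.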

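Currently responsible for building large-scale deep learning infrastructure on the large-scale algorithms team at Alibaba Cloud iDST, with a fairly deep understanding of the development, construction, and optimization of large-scale distributed machine learning and of its application in different business scenarios. Previously an architect in the advertising technology department at Qihoo 360 and technical lead for the performance advertising system at Yahoo's Beijing R&D center.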

Presentations

Pluto:一款分布式异构深度学习框架 (Pluto: A distributed heterogeneous deep learning framework) 议题 (Session)

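This talk introduces Pluto, a distributed deep learning framework developed by the Alibaba Cloud iDST PAI team. In Pluto, the PAI team deeply optimized and customized the distributed performance of two open source frameworks, Caffe and TensorFlow, achieving significant performance gains over the unoptimized versions, including a 10x improvement in convergence speedup in some scenarios. Pluto has been successfully applied to core business modeling scenarios within the group, including group security, financial risk modeling, ID document image recognition, customer service Q&A, and machine translation, significantly improving the efficiency of modeling iterations.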

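Holds a graduate degree from the University of Science and Technology of China and is currently a senior manager in Lenovo's big data product R&D department, responsible for big data product architecture and algorithm research. Previously worked on data mining R&D at Xerox, Alibaba, Huawei, Baidu, and Wanda E-commerce, with work spanning applications of machine learning and pattern recognition in image processing, e-commerce, search and recommendation, knowledge graphs, and retail.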

Presentations

多视图建模与半监督学习:应用于海量用户数据挖掘与行为分析 (Multiview modeling and semisupervised learning applied to massive user data mining and behavior analysis) 议题 (Session)

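When personal information cannot be collected directly, companies need to predict specific user attributes (such as gender, occupation, education, purchasing power, age, and other life-stage states) from user behavior data. (Goal) Some supervised machine learning algorithms are used for this purpose. However, with tens or even hundreds of millions of users and tens of billions or more behavior records, the volume of labels must reach a certain scale to ensure good results, and obtaining labeled data is extremely costly. (Challenge) In practice, we model users from multiple angles to construct different views of the user data, choose a suitable machine learning algorithm for each view, and apply the co-training semi-supervised learning algorithm. By training the algorithms on the different views collaboratively, we achieve good user attribute prediction with only a very small amount of labeled data. (Method)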
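The co-training idea described in this abstract can be sketched in a few lines. The example below is a toy illustration only, not the speakers' system: it assumes two one-dimensional views per user, a nearest-centroid classifier for each view, and at least two classes among the initially labeled examples. All names (`fit_centroids`, `predict`, `co_train`) and the sample data are hypothetical.

```python
from statistics import mean


def fit_centroids(values, labels):
    """Train a nearest-centroid model for one view: {label: mean feature}."""
    return {c: mean(v for v, y in zip(values, labels) if y == c)
            for c in set(labels)}


def predict(centroids, x):
    """Return (label, margin); a larger margin means a more confident guess."""
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin


def co_train(view_a, view_b, labels, rounds=10):
    """labels[i] is a class label, or None when example i is unlabeled."""
    labels = list(labels)
    for _ in range(rounds):
        for view in (view_a, view_b):
            unlabeled = [i for i, y in enumerate(labels) if y is None]
            if not unlabeled:
                return labels
            known = [i for i, y in enumerate(labels) if y is not None]
            model = fit_centroids([view[i] for i in known],
                                  [labels[i] for i in known])
            # This view labels the example it is most confident about; the
            # new label joins the shared pool used by the other view next.
            scored = [(predict(model, view[i]), i) for i in unlabeled]
            (label, _margin), idx = max(scored, key=lambda s: s[0][1])
            labels[idx] = label
    return labels


# Hypothetical data: two views of six users, only two of them labeled.
view_a = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]
view_b = [1.0, 1.1, 0.9, 9.0, 9.2, 8.8]
seed_labels = ["low", None, None, "high", None, None]
result = co_train(view_a, view_b, seed_labels)
print(result)
```

Each view's classifier takes turns labeling the unlabeled example it is most sure of, so the two views teach each other from a very small labeled seed. A production system would use stronger per-view models and a confidence threshold rather than always accepting the top candidate.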

王振华 (Zhenhua Wang) is a research engineer at Huawei Technologies, where he works on building a big data analytics platform based on Apache Spark. He holds a PhD in computer science from Zhejiang University. His research interests include information retrieval and web data mining.

Presentations

基于成本的Spark SQL优化器框架 (A cost-based optimizer framework for Spark SQL) 议题 (Session)

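We contributed a cost-based optimizer framework to community Spark 2.2. In our framework, we compute the cardinality and output size of every database operator. With reliable statistics and accurate estimates, we can make good decisions in areas such as choosing the right build side for a hash join, choosing the right join algorithm (for example, broadcast hash join versus shuffled hash join), and adjusting join order. The cost-based optimizer framework substantially improves Spark SQL query performance. In this talk, we present the new cost-based optimizer framework for Spark SQL and its performance impact on TPC-DS queries.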
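To make the build-side and join-algorithm decisions concrete, here is a minimal sketch of how size estimates could drive them. It is not Spark's implementation: the `Estimate` class, the `plan_join` function, the table names, and the 10 MB threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed cutoff below which the smaller side is cheap enough to broadcast.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class Estimate:
    """Estimated statistics for one join input (hypothetical structure)."""
    name: str
    rows: int           # estimated cardinality
    avg_row_bytes: int  # estimated average row width

    @property
    def size_bytes(self):
        return self.rows * self.avg_row_bytes


def plan_join(left, right, threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick the join algorithm and build side from the size estimates."""
    # Build the hash table on the smaller input to keep it small in memory.
    build = left if left.size_bytes <= right.size_bytes else right
    if build.size_bytes <= threshold:
        algorithm = "broadcast_hash_join"
    else:
        algorithm = "shuffled_hash_join"
    return algorithm, build.name


# Hypothetical inputs: a large fact table and a small dimension table.
orders = Estimate("orders", rows=100_000_000, avg_row_bytes=100)
dim_date = Estimate("dim_date", rows=3_650, avg_row_bytes=40)
print(plan_join(orders, dim_date))
```

The point of the sketch is that both decisions depend entirely on the estimates: if the statistics for `dim_date` were missing or badly wrong, the planner could pick the large table as the build side or shuffle both inputs unnecessarily.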

Head of big data projects at China Life, with extensive experience delivering data services projects.

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

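China Life has accumulated a large amount of data over the years. The main goal of our data department is to mine the value of that data in depth and apply it to areas such as business development, risk management, and customer service. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge machine learning and deep learning techniques, and we share the architecture of our machine learning system, the application development process, and the lessons learned along the way.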

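莫云 is a data engineer on the data team at Yirendai (宜人贷). He was responsible for selecting the technology for the company's real-time query engine and setting up its Impala framework, a project that now supports real-time interactive queries across the whole company. The knowledge graph he built has accumulated 170 million nodes and 1 billion relationships and has been successfully applied to anti-fraud. Before joining Yirendai, 莫云 worked at Sohu Changyou (搜狐畅游), where he was responsible for building Hadoop and the data warehouse for the advertising platform department.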

Presentations

SDK + FinGraph + Go:用一手行为数据和图谱信息创造商业价值 (SDK + FinGraph + Go: Create business value with firsthand user behavior data and knowledge graph information) 议题 (Session)

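Now that the mobile internet traffic dividend has passed, how do we mine firsthand mobile data in depth, respond to user needs in real time, and create business value through user behavior and knowledge graph technology? Using concrete business cases, we share a technical framework built on SDK + FinGraph + Go. The framework embeds the SDK in an app with a single line of code and captures user intent through real-time and near-real-time upload mechanisms and real-time processing and analysis with Flume + Kafka. It mines graph information using Spark Streaming for stream processing, HBase for key-value query output, and a Neo4j cluster for association and storage. It then uses Go as an efficient base development platform, Python to connect to the automated submission back end, scikit-learn for event recognition, and Cypher to mine graph relationships, in order to predict user intent and guide user behavior, creating business value from real-time data.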

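陈雨强 is a cofounder and chief research scientist at 4Paradigm (第四范式).

A world-class expert in deep learning and transfer learning.
At Baidu, he led the world's first commercial deep learning system; at 今日头条 (JinRi TouTiao), he led the design and implementation of a brand-new news feed recommendation and advertising system.

On the academic side, he has published papers at top conferences including NIPS, AAAI, ACL, and SIGKDD, won the APWeb 2010 Best Paper Award, and placed in the top three in KDD Cup 2011. His academic work was covered in 2010 by MIT Technology Review, a leading global technology magazine.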

Presentations

人工智能工业应用痛点及解决思路 (Pain points in AI industrial applications and solutions) 议题 (Session)

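The power of AI has caught the attention of every industry, and how a company applies AI will greatly affect its position in the market. Yet AI techniques that dominate in the lab often struggle once they are applied in practice. So what are the prerequisites for industrial AI applications? What are the pain points, and how can they be solved? How can breakthroughs be made at the system level, the model and feature level, the model dimensionality level, and the deployment level? And which cutting-edge techniques are addressing common difficulties in common scenarios? This talk shares the pain points and solutions the speaker has encountered in industrial AI applications in the internet, finance, telecommunications, and other sectors.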

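Data scientist at China Life, focused on big data analytics, with research centered on advanced data analysis methods and machine learning principles.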

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

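China Life has accumulated a large amount of data over the years. The main goal of our data department is to mine the value of that data in depth and apply it to areas such as business development, risk management, and customer service. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge machine learning and deep learning techniques, and we share the architecture of our machine learning system, the application development process, and the lessons learned along the way.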

马晓宇 is a tech lead at PingCAP, responsible for integrating TiDB with the big data ecosystem and for MPP engine development.

Presentations

Spark和TiDB (Spark on TiDB) 议题 (Session)

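SparkTI (Spark on TiDB) is TiDB's compute engine, built on Apache Spark and independent of the native system. It integrates Spark and TiDB deeply, using Spark to support a wider range of user scenarios and APIs beyond the existing MySQL workload. Beyond Spark SQL and the Catalyst engine, the project implements an extended SQL front end (parser, planner, and optimizer) customized for TiDB. It understands how TiDB organizes data and knows how to use TiDB's own compute capabilities to accelerate queries, rather than being just a connector. With SparkTI, TiDB will become part of the Hadoop ecosystem, bridging the gap between OLTP systems and offline analytics clusters.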

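马洪宾 is a technical partner and senior software architect at Kyligence and an Apache Kylin committer and PMC member. Focused on big data applications and underlying architecture, 马洪宾 currently leads development of the enterprise data warehouse Kyligence Analytics Platform while continuing to contribute to the Apache Kylin community. Previously a core contributor to Trinity, the graph engine from Microsoft Research Asia, 马洪宾 has published several papers at core database conferences such as VLDB.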

Presentations

Apache Kylin 2.0:从Hadoop上的OLAP 引擎到实时数据仓库 (Apache Kylin 2.0: From an OLAP engine on Hadoop to a real-time data warehouse) 议题 (Session)

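Apache Kylin v2.0 is about to be released! As a leading OLAP engine for big data, Apache Kylin is now more full-fledged: it supports snowflake schemas, more comprehensive SQL syntax, the newly introduced Spark cubing, better support for real-time streaming data ingestion, and more. Apache Kylin is gradually evolving from a traditional OLAP platform on Hadoop into a real-time data warehouse on Hadoop.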

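Holds a doctorate in engineering and is currently general manager of the data center at China Guangfa Bank (广发银行). An expert in banking IT development and risk management and a recognized senior financial professional in Guangzhou, he previously served as an IT expert at the Beijing data center of the Industrial and Commercial Bank of China. As a senior expert in banking technology, Mr. Huang has extensive experience in infrastructure and operations, IT governance and management, and big data research and planning.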

Presentations

大数据时代银行客户社交关系圈研究与应用 (Research on and the application of a social relation circle of bank customers in the big data era) 议题 (Session)

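To deepen its insight into bank customers and improve its customer acquisition and risk control capabilities, China Guangfa Bank built a social relationship model of bank customers on its Hadoop big data platform, processing data with Hive on Spark and graph computing and combining machine learning algorithms such as LFM community detection and boosted decision trees. The model uncovers customers' social relationship circles, which are applied in the bank's actual business. These circles give a comprehensive picture of individual customers' financial and social relationships, extending the bank's customer insight from individual points to the whole picture and from single customers to customer groups, and filling a gap in the research and application of social relationships among individual bank customers.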
