Presented by O'Reilly and Cloudera
Make Data Work
July 12-13, 2017: Training
July 13-15, 2017: Conference
Beijing, China

Strata Data Conference 2017 speakers

New speakers are added regularly. Please check back often for the latest changes to the schedule.

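Focuses on source-code research and enterprise practice with big data technologies such as Hadoop, Spark, Flink, Kafka, Elastic, HBase, Hive, and Kylin. Author of the book 《基于Apache Kylin构建大数据分析平台》 (Building a Big Data Analytics Platform with Apache Kylin).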

Presentations

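Hyperledger’s integration with CDH's big data ecosystem and its real-world applications (Session)

Blockchain, the technology behind Bitcoin, is a decentralized distributed ledger technology. Hyperledger is an open source, cross-industry blockchain platform and a global collaboration among industry leaders in finance, banking, the Internet of Things, supply chain, and manufacturing. We integrated Hyperledger with CDH to take advantage of CDH's service deployment, monitoring, and management capabilities. With this project, users can easily deploy Hyperledger clusters in CDH-managed data centers and use the CDH big data platform to analyze Hyperledger data and extract more business value. Projects using it inside Wanda include a digital rights and benefits platform and a shared business platform, the latter covering finance, supply chain, and other areas. We believe this project will be of great help to the Hyperledger open source community.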

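Ye Jieping is vice president of the DiDi Research Institute, a DiDi Fellow, a tenured professor at the University of Michigan, and a member of the management committee of the University of Michigan's big data research center. He received his PhD in computer science from the University of Minnesota in 2005. His research focuses on machine learning, data mining, and big data analytics. He has published more than 200 papers in top international machine learning and data mining conferences and journals. He has received best paper awards at KDD and ICML as well as the NSF CAREER Award, and has chaired several top conferences in machine learning and data mining. He currently serves as an associate editor of the machine learning and data mining journals IEEE TPAMI, DMKD, and IEEE TKDE.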

Presentations

Big data at DiDi Chuxing (Keynote)

Every day, Didi Chuxing's platform generates over 70 TB worth of data, processes more than 20 billion routing requests, and produces over 14 billion location points. Ye Jieping explains how Didi Chuxing applies AI technologies to analyze such big transportation data and improve the travel experience for people in China.

Ziya Ma is the general manager of the global Big Data Technologies organization in Intel’s Software and Services group (SSG) in the System Technologies and Optimization (STO) division. Her organization focuses on optimizing big data on Intel’s platform, leading open source efforts in the Apache community, and linking innovation in industry analytics to bring about the best and the most complete big data experiences. She works closely with Intel product teams, open source communities, partners from the industry, and academia to advise on implementing and optimizing the Intel platform for Hadoop or Spark ecosystems. Previously, Ziya held various management positions in Intel’s Technology Manufacturing group (TMG), where she was responsible for delivering embedded software for factory equipment, databases for manufacturing execution and process control, UI software, and more, and was product development software director of Intel IT, where she delivered software lifecycle management tools and infrastructure and analytics solutions to Intel software teams worldwide. She also worked at Motorola earlier in her career. Ziya holds a PhD and MS in computer science and engineering from Arizona State.

Presentations

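Accelerating the future for analytics and AI with Intel technologies - sponsored by Intel (Keynote)

This keynote highlights Intel's multifaceted efforts: democratizing big data technology and unifying the ecosystem through a broad product portfolio; advancing innovation by contributing new, highly optimized AI solutions; and unleashing intelligence to solve the world's biggest challenges while delivering the greatest business value to customers.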

Amr Awadallah is the cofounder and CTO at Cloudera. Previously, Amr was an entrepreneur in residence at Accel Partners, served as vice president of product intelligence engineering at Yahoo, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr’s first startup, VivaSmart, was acquired by Yahoo in July 2000. Amr holds bachelor’s and master’s degrees in electrical engineering from Cairo University, Egypt, and a PhD in electrical engineering from Stanford University.

Presentations

Enabling data science in the enterprise (Keynote)

Amr Awadallah explains how data science and machine learning methods are evolving to bring a more comprehensive, secure, and enterprise-grade data science experience to the enterprise.

Lukas Biewald is the founder and chief data scientist of CrowdFlower, a data enrichment platform that taps into an on-demand workforce to help companies collect training data and do human-in-the-loop machine learning. Previously, he led the Search Relevance team for Yahoo Japan and worked as a senior data scientist at Powerset. Lukas was recognized by Inc. magazine as a 30 under 30. Lukas holds a BS in mathematics and an MS in computer science from Stanford University. He is also an expert Go player.

Presentations

Meet the Expert with Lukas Biewald, CrowdFlower (Meet the Experts)

Best practices in training data collection and human-in-the-loop computing make it possible to deploy imperfect machine learning algorithms for mission-critical applications. Lukas Biewald explains how you can make the best possible use of training data and why it is essential to making your machine learning work well.

Active learning in the real world (Session)

Training data collection strategies are often the most important and overlooked part of deploying real-world machine learning algorithms. Lukas Biewald explains why active learning is the best way to collect training data and can make the difference between a failed research project and a deployed production algorithm.

Deep learning in the real world (Keynote)

As companies take machine learning out of R&D and into production, they face a whole new set of challenges. Lukas Biewald explains why human in the loop, active learning, and transfer learning are all essential design patterns for making deep learning real.

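Presales technical manager, industry consultant, and senior solutions architect at Cloudera, and formerly a core developer of the Intel Hadoop distribution. Joined Intel's compiler group in 2006 to develop server middleware software, specializing in debugging and optimizing server software. Since 2010, has worked on Hadoop product development and solution consulting, with responsibilities spanning Hadoop productization, HBase performance tuning, and industry solution consulting, and has successfully deployed and supported multiple Hadoop clusters of more than a hundred nodes in industries such as transportation and telecommunications.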

Presentations

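HBase as a multiple-data-center solution and its future incremental backup function (Session)

For years, Hadoop technology has been unable to enter core business systems, and the lack of a mature, stable solution spanning multiple geographically separate data centers is one of the main reasons. For disaster recovery and other reasons, HBase clusters that store important data are usually required to be backed up across data centers. Chinese banking regulators have gone further and made geographically separate, multicenter deployment a hard requirement. Yet most HBase deployments today sit in a single data center, and none of the mechanisms HBase currently offers (replica, snapshot copy, or export) can meet the regulatory and remote disaster recovery requirements. This session shares the problems HBase encounters under multicenter deployment requirements and how to solve them. HBase will add an incremental backup feature in the future. This feature avoids the full-table scans that existing techniques require, greatly improving backup performance, while providing the consistency that the replica mechanism lacks. The session also describes in detail why this feature matters for multiple-data-center solutions, how to use it, and how it works internally.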

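Deep learning engineer who has worked on distributed storage projects such as HBase and Ceph and has contributed to OpenStack and Docker community projects. Currently responsible for the architecture and implementation of Xiaomi's cloud deep learning platform, with a focus on the Kubernetes and TensorFlow communities.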

Presentations

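Architecture and practices of a cloud-based deep learning platform (Session)

This session introduces the cloud machine learning platform used inside Xiaomi, analyzes the architecture design and implementation principles of a general-purpose deep learning platform, and shares practical experience supporting development environments, model training, and model serving within an enterprise.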

Haifeng Chen is a senior software architect at Intel’s Asia Pacific R&D Center. He has more than 12 years’ experience in software design and development, big data, and security, with a particular interest in image processing. Haifeng is the author of ColorStorm, an image browsing, editing, and processing application.

Presentations

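When Hadoop meets object storage: Implementation principles, pitfalls, and performance optimization (Session)

The Hadoop community has long supported object storage on public clouds, such as AWS S3 and Azure Storage. The recently released Apache Hadoop 3.0 (alpha) adds support for more cloud storage services, such as Azure Data Lake and Alibaba Cloud OSS. These services all provide Hadoop-compatible file systems, so users can treat them as another HDFS. However, object storage and HDFS differ in many ways in how they are implemented, so even though the two expose similar file system interfaces, many APIs behave completely differently. Drawing on hands-on experience with Alibaba Cloud OSS, this session describes how the Alibaba Cloud OSS FileSystem implementation made its way into Apache Hadoop. It also covers how object storage differs from traditional file systems when uploading, downloading, deleting, and moving files, and compares HDFS and the OSS file system in terms of performance and cost. Finally, it presents optimizations based on the characteristics of object storage that can improve the performance of open source engines such as Hive and Spark when they access object storage.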

Speed up big data encryption in Apache Hadoop and Spark (Session)

Although the processing capability of modern platforms is approaching memory speed, securing big data using encryption still hurts performance. Haifeng Chen shares proven ways to speed up data encryption in Hadoop and Spark, as well as the latest progress in open source, and demystifies using hardware acceleration technology to protect your data.

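Cheng Xiaolong (程小龙) is chief technology officer and a partner at 上海音智达信息技术有限公司. He has more than 15 years of experience deploying, implementing, and architecting traditional data warehouses, and focuses mainly on building and promoting projects in smart manufacturing, customer insight, risk prediction, and related areas on distributed big data and cloud computing architectures. He has led multiple IT infrastructure projects for General Electric in China, Singapore, the United Kingdom, the United States, and India, has a keen sense for business, and has a deep understanding of complex information and analytics systems.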

Presentations

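Big data as a service: Blue Whale big data private cloud platform sharing - sponsored by Dell (Session)

An enterprise-grade computing platform should make it easy to try out existing or emerging big data technologies and then select the ones needed for deployment at scale. This talk shares how to transform existing IT infrastructure into an agile big data private cloud platform that lets enterprises of all sizes get more value from their data.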

Cheng Feng is a data engineer at Grab, where he works on the big data platform, distributed computing, stream processing, and data science. Previously, he was a data scientist at the Lazada Group, working on Lazada’s tracker, customer segmentation and recommendation systems, and fraud detection.

Presentations

Driving Southeast Asia forward with big data (Session)

Grab is sitting at the junction of the digital and physical worlds. Its vision is to drive Southeast Asia forward and transform the way people travel and pay across the region. Feng Cheng and Edwin Law explain Grab's data architecture and offer a history of its data platform migration and stream-processing apps.

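Dr. Jike Chong (种骥科) is chief data scientist at Yirendai (NYSE: YRD), where he is using the "Pantheon" framework to build and organize Yirendai's data department and is responsible for antifraud risk control and for data-driven operations and innovation. Previously, he worked at the US recruiting platform Simply Hired, where he established the data science department, and was invited to advise the White House science and technology office on the design of big data technology products. He also worked at the US private equity firm Silver Lake as data science architect for its Kraftwerk fund, responsible for applying big data technology to risk control in private equity investing. He has served as a professor and PhD advisor at Carnegie Mellon University. He holds a PhD in electrical engineering and computer sciences from the University of California, Berkeley, and master's and bachelor's degrees in electrical and computer engineering from Carnegie Mellon University, along with 9 patents (5 granted, 3 pending).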

Presentations

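SDK + FinGraph + Go: Create business value with firsthand user behavior data and knowledge graph information (Session)

Now that the traffic dividend of the mobile internet has passed, how can we mine firsthand mobile data in depth, respond to user needs in real time, and create business value through user behavior and knowledge graph technology? Using concrete business cases, we share an SDK + FinGraph + Go technical framework. The framework embeds the SDK into an app with a single line of code and captures user intent through a real-time or near-real-time upload mechanism and real-time processing and analysis with Flume + Kafka. It mines graph information using Spark Streaming for stream processing, HBase for key-value query output, and a Neo4j cluster for association and storage. It then predicts user intent and guides user behavior by using Go to build the base platform efficiently, Python to connect to the automated submission backend, scikit-learn for event recognition, and Cypher to mine graph relationships, creating business value from real-time data.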

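Data science essentials: Examples from internet finance - Quantifying credit and fraud risks online (Training)

Do you want to understand the quantitative analysis process behind internet finance? How is personal credit quantified through big data? What needs attention when applying machine learning algorithms in practice? How can graph analysis fuse multidimensional data to help us distinguish normal users from fraudulent ones? This tutorial is based on "Quantitative Financial Credit and Risk Control Analysis," a graduate course newly offered in spring 2017 by Tsinghua University's Institute for Interdisciplinary Information Sciences. It uses real LendingClub loan data as a case study to explain how specific models are implemented.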

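Deep learning engineer who has worked on the distributed database HBase and implemented the distributed transaction system Themis. Took part in and led development of Xiaomi's converged cloud platform. Currently responsible for R&D of Xiaomi's deep learning platform (Cloud-ML) and for an intelligent dialogue project.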

Presentations

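Architecture and practices of a cloud-based deep learning platform (Session)

This session introduces the cloud machine learning platform used inside Xiaomi, analyzes the architecture design and implementation principles of a general-purpose deep learning platform, and shares practical experience supporting development environments, model training, and model serving within an enterprise.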

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Friday opening welcome (Keynote)

Program chairs Ben Lorica, Jason Dai, and Doug Cutting open the first day of keynotes.

Saturday opening welcome (Keynote)

Program chairs Jason Dai, Doug Cutting, and Ben Lorica open the second day of keynotes.

Jason (Jinquan) Dai is a senior principal engineer and CTO of big data technologies at Intel, where he is responsible for leading the global engineering teams (located in both Silicon Valley and Shanghai) on the development of advanced big data analytics (including distributed machine and deep learning), as well as collaborations with leading research labs (e.g., UC Berkeley’s AMPLab). Jason is an internationally recognized expert on big data, the cloud, and distributed machine learning; he is the program cochair of O’Reilly’s Strata Data Conference in Beijing, a committer and PMC member of the Apache Spark project, and the creator of BigDL, a distributed deep learning framework on Apache Spark.

Presentations

Friday opening welcome (Keynote)

Program chairs Ben Lorica, Jason Dai, and Doug Cutting open the first day of keynotes.

Saturday opening welcome (Keynote)

Program chairs Jason Dai, Doug Cutting, and Ben Lorica open the second day of keynotes.

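AWS solutions architect with 17 years of experience in IT, having worked as an engineer and architect at companies including IBM, RIM, and Apple. Currently a solutions architect at AWS. Enjoys programming and programming languages of all kinds, especially Lisp. Enjoys new technologies and technical challenges of all kinds, and is currently concentrating on learning machine learning algorithms for distributed computing environments and deep neural network frameworks.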

Presentations

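Distributed deep learning on AWS using MXNet (Tutorial)

Deep learning continues to drive cutting-edge advances in areas such as computer vision, natural language processing, and recommendation engines. A key factor behind this progress is the emergence of many highly flexible, developer-friendly deep learning frameworks. In this tutorial, members of Amazon's machine learning team give a brief background on deep learning, focusing on relevant application domains, and introduce MXNet, a powerful and scalable deep learning framework. At the end of the tutorial, you'll get hands-on experience with a variety of applications, including computer vision and recommendation engines, and see how to use preconfigured deep learning AMIs and CloudFormation templates to speed up development.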

MXNet on AWS (Session)

Damon Deng provides a short background on deep learning, focusing on relevant application domains, and offers an introduction to using the powerful and scalable deep learning framework MXNet. Join in to learn how MXNet works and how you can spin up AWS GPU clusters to train at record speeds.

Mathieu Dumoulin is a data scientist in MapR Technologies’s Tokyo office, where he combines his passion for machine learning and big data with the Hadoop ecosystem. Mathieu started using Hadoop from the deep end, building a full unstructured data classification prototype for Fujitsu Canada’s Innovation Labs, a project that eventually earned him the 2013 Young Innovator award from the Natural Sciences and Engineering Research Council of Canada. Afterward, he moved to Tokyo with his family, where he worked as a search engineer at a startup and a managing data scientist for a large Japanese HR company, before coming to MapR.

Presentations

Robot predictive maintenance in action: Real-time, scalable pipelines explained (Session)

Mathieu Dumoulin and Mateusz Dymczyk walk you step by step through building a scalable, real-time anomaly detection pipeline applied to an industrial robot. You'll learn how to gather data from a wireless movement sensor, process it with H2O on a MapR cluster, and visualize the output for an operator through an AR headset.

Mateusz Dymczyk is a Tokyo-based software engineer at H2O.ai, the company behind H2O, the leading open source machine learning platform for smarter applications and data products. He works on distributed machine learning projects including the core H2O platform and Sparkling Water, which integrates H2O and Apache Spark. Previously, he worked at Fujitsu Laboratories on natural language processing and utilization of machine learning techniques for investments and at Infoscience on a highly distributed log data collection and analysis platform. Mateusz loves all things distributed and machine learning and hates buzzwords. In his spare time, he participates in the IT community by organizing, attending, and speaking at conferences and meetups. Mateusz holds an MSc in computer science from AGH UST in Krakow.

Presentations

Robot predictive maintenance in action: Real-time, scalable pipelines explained (Session)

Mathieu Dumoulin and Mateusz Dymczyk walk you step by step through building a scalable, real-time anomaly detection pipeline applied to an industrial robot. You'll learn how to gather data from a wireless movement sensor, process it with H2O on a MapR cluster, and visualize the output for an operator through an AR headset.

Maosong Fu is the technical lead for Heron and real-time analytics at Twitter and the author of a few publications on distributed systems. Maosong holds a master’s degree from Carnegie Mellon University and a bachelor’s degree from Huazhong University of Science and Technology.

Presentations

Meet the Experts with Sijie Guo (Streamlio) and Maosong Fu (Twitter) (Meet the Experts)

Learn about modern stream computing engines, today's real-time messaging and storage systems, and real-time computing at Twitter.

Modern streaming architectures (Tutorial)

The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the art for a real-time data stack? Sijie Guo and Maosong Fu explore the typical challenges in a modern real-time data stack and explain how the modern technology will impact streaming architecture and applications in the future.

Yupeng Fu is a software engineer at Alluxio and a PMC member of the Alluxio open source project. Previously, Yupeng worked at Palantir, where he led the efforts to build the company’s storage solution. Yupeng holds a BS and an MS from Tsinghua University and has completed coursework toward a PhD at UCSD.

Presentations

Using Alluxio (formerly Tachyon) to speed up big data analytics (Tutorial)

In this three-hour tutorial, we teach participants the basics of Alluxio, demonstrate how Alluxio works, and show how to use it to help distributed computing engines (such as Spark or MapReduce) share data at memory speed.

The architecture of decoupling compute and storage with open source Alluxio (Session)

Decoupling storage and computation is becoming increasingly popular for big data analytics platforms. Yupeng Fu shares production best practices and solutions to best utilize CPUs, memory, and different tiers of disaggregated compute and storage systems to build out a multitenant high-performance platform that addresses real-world business demands.

Best practices for using Alluxio with Spark (Session)

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory for managing data across different storage. Many deployments use Alluxio with Spark. Yupeng Fu explains how Alluxio helps Spark be more effective and shares examples of production deployments of Alluxio and Spark working together.

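Yupeng Fu (富羽鹏), Alluxio (Meet the Experts)

In this session, Alluxio's Yupeng Fu discusses Alluxio's features, latest use cases, and future direction. Attendees are also welcome to share their suggestions and ideas about Alluxio with the expert.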

Adam Gibson is the cofounder of Skymind, an enterprise deep learning and NLP firm, and creator of the distributed, open source frameworks Deeplearning4j and ND4J. Adam has taught machine learning at Zipfian Academy and is currently the deep learning specialist in residence at GalvanizeU. Adam has spoken at Hadoop Summit, OSCON, and Tech Planet in Seoul and is a coauthor of the forthcoming O’Reilly book Deep learning: A Practitioner’s Guide. Adam consults for hedge funds, Fortune 500 companies, and startups. He studied CS at Michigan Tech.

Presentations

Jumpy: The missing JVM interface for deep learning (Session)

Adam Gibson offers a high-level overview of Jumpy, a better Python interface for deep learning applications, and explains why Spark's Py4J interface is impractical for deep learning applications.

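Rong Gu (顾荣) received his PhD from the Department of Computer Science at Nanjing University, where he now works. He is a PMC member and maintainer of the open source big data storage project Alluxio and an Apache Spark contributor. As a well-known developer in the Alluxio community, he has done extensive work on Alluxio's stability and performance, including the performance testing framework Alluxio-Perf, the integration of Alluxio with multiple components of the Hadoop ecosystem, and the community's Chinese documentation. On the Spark side, he designed and implemented the support for storing RDDs in Alluxio that was released in Spark 1.0. He has published or had accepted more than ten papers (10 of them as first author) and contributed chapters to books including 《深入理解大数据—卷1: 大数据处理与编程实践》 and 《实战Hadoop:开启通向云计算的捷径》. He is keen on sharing technical knowledge: he organizes the Nanjing Big Data Technology Meetup (which has held seven events) and has spoken at well-known Chinese technical conferences such as the Database Technology Conference China. He has also held big data systems R&D internships at Microsoft Research, Intel, Baidu, and Transwarp (星环科技).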

Presentations

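Optimizing Alluxio cache strategy and large-scale performance evaluation (Session)

Alluxio (formerly Tachyon) is an open source, memory-centric, unified distributed storage system that bridges upper-layer computation frameworks and underlying storage systems. Alluxio also provides a tiered storage mechanism that manages not only memory but also storage resources such as SSDs and HDDs in a unified way. To keep hot data on the faster storage tiers as much as possible, we designed and implemented a number of advanced cache replacement policies in Alluxio for a variety of big data scenarios, including LIRS, ARC, and LRFU. These policies have been integrated into Alluxio and can easily be used to tune the performance of upper-layer applications. In addition, to evaluate and tune the performance of applications on top of Alluxio at larger scale, we designed and implemented Alluxio-Perf, a large-scale performance evaluation system for Alluxio. In this talk, I describe in detail the basic principles and usage of these cache policies and of the performance evaluation and tuning tool Alluxio-Perf.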

Using Alluxio (formerly Tachyon) to speed up big data analytics (Tutorial)

In this three-hour tutorial, we teach participants the basics of Alluxio, demonstrate how Alluxio works, and show how to use it to help distributed computing engines (such as Spark or MapReduce) share data at memory speed.

Sijie Guo is the cofounder of Streamlio, a company focused on building a next-generation real-time data stack. Previously, he was the tech lead for the messaging group at Twitter, where he cocreated Apache DistributedLog, and worked on push notification infrastructure at Yahoo. He is the PMC chair of Apache BookKeeper.

Presentations

Meet the Experts with Sijie Guo (Streamlio) and Maosong Fu (Twitter) (Meet the Experts)

Learn about modern stream computing engines, today's real-time messaging and storage systems, and real-time computing at Twitter.

Transactional streaming with Apache DistributedLog (Session)

Sijie Guo explores the technical challenges of exactly-once delivery and transaction support in messaging and streaming storage systems and explains how Apache DistributedLog helps achieve transactional streaming.

Modern streaming architectures (Tutorial)

The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the art for a real-time data stack? Sijie Guo and Maosong Fu explore the typical challenges in a modern real-time data stack and explain how the modern technology will impact streaming architecture and applications in the future.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him.

Presentations

Deep learning with TensorFlow (Tutorial)

TensorFlow is a popular open source machine learning library that is especially well-suited for deep learning. Yufeng Guo introduces machine learning and deep learning with concrete examples, walking you through hands-on exercises using TensorFlow and TensorBoard.

On-device machine learning: TensorFlow on Android (Session)

Machine learning has traditionally been performed only on servers and high-performance machines, but on-device machine learning on mobile devices can be very valuable. Yufeng Guo uses TensorFlow to implement a deep learning model for image classification on an Android device, tailored to a custom dataset. You'll leave ready to get started on your own mobile deep learning solutions.

Luke (Qing) Han is the cofounder and CEO of Kyligence, which provides a leading intelligent data platform powered by Apache Kylin to simplify big data analytics from on-premises to the cloud. Luke is the cocreator and PMC chair of Apache Kylin, where he contributes his passion to driving the project’s strategy, roadmap, and product design. For the past few years, Luke has been working on growing Apache Kylin’s community, building its ecosystem, and extending its adoption globally. Previously, he was big data product lead at eBay, where he managed Apache Kylin, engaged customers, and coordinated various teams from different geographical locations, and chief consultant at Actuate China.

Presentations

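Release productivity of big data - sponsored by Kyligence (Keynote)

Big data has become a core competitive advantage for enterprises. But because big data technologies and platforms are relatively complex and talent is scarce, big data productivity cannot be fully unleashed. Relying too heavily on people, especially specially trained engineers, makes it hard for enterprises to build big data platforms quickly and to respond quickly to business changes. This keynote looks at the problem from a new angle, explaining why it is so important to enable traditional DW/BI capabilities, theories, and methodologies on big data platforms and how doing so makes full use of existing talent and gives enterprises a way to unleash big data productivity.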

Hao Hao is a software engineer at Cloudera currently working on Apache Kudu and Apache Sentry and is a committer and PMC member of the Apache Sentry project. Previously, she worked on eBay’s Search Backend team, building search infrastructure for eBay’s online buying platform. Hao performed extensive research on smartphone security and web security while she was a PhD student at Syracuse University.

Presentations

Apache Kudu: 1.0 and beyond (Session)

Hao Hao offers an overview of Apache Kudu, a project that enables fast analytics on big data.

Franky Ho is an enterprise technologist at Dell, where he works with Greater China customers on big data and cloud related solutions and guides them by sharing marketing trends and motivation of industry and conducting as-is to-be analyses to meet CxOs’ goals. Franky also designs big data and cloud solutions by leveraging emerging technologies and a combination of well-known foreign and local big data solution providers to form an end-to-end solution to help customers gain insight into their data.

Presentations

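Big data as a service: Blue Whale big data private cloud platform sharing - sponsored by Dell (Session)

An enterprise-grade computing platform should make it easy to try out existing or emerging big data technologies and then select the ones needed for deployment at scale. This talk shares how to transform existing IT infrastructure into an agile big data private cloud platform that lets enterprises of all sizes get more value from their data.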

Mick Hollison is chief marketing officer at Cloudera, where he leads the company’s worldwide marketing efforts, including advertising, brand, communications, demand, partner, solutions, and web. Mick has had a successful 25-year career in enterprise and cloud software. Previously, he was CMO of sales acceleration at machine learning company InsideSales.com, where, under his leadership, InsideSales pioneered a shift to data-driven marketing and sales that has served as a model for organizations around the globe; was global vice president of marketing and strategy at Citrix, where he led the company’s push into the high-growth desktop virtualization market; managed executive marketing at Microsoft; and held numerous leadership positions at IBM Software. Mick is an advisory board member for InsideSales and a contributing author on Inc.com. He is also an accomplished public speaker who has shared his insightful messages about the business impact of technology with audiences around the world. Mick holds a bachelor of science in management from the Georgia Institute of Technology.

Presentations

Powering possibilities in financial services (Keynote)

Mick Hollison and Jien Zhou discuss how organizations are applying machine learning and advanced analytics to improve customer service and reduce the threat of fraud and cyberattack and explain how China UnionPay is using big data to deliver a better customer experience and manage risk.

Ron-Chung Hu is a database system architect at Huawei Technologies, where he works on building a big data analytics platform based on Apache Spark. Previously, he worked at Teradata, Sybase, and MarkLogic, focusing on parallel database systems and search engines. Ron holds a PhD in computer science from the University of California, Los Angeles.

Presentations

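A cost-based optimizer framework for Spark SQL (Session)

We contributed a cost-based optimizer framework to the community release of Spark 2.2. In our framework, we compute the cardinality and output size of each database operator. With reliable statistics and accurate estimates, we can make good decisions in areas such as choosing the correct build side of a hash join, choosing the right join algorithm (for example, broadcast hash join versus shuffled hash join), and adjusting the join order. The cost-based optimizer framework substantially improves the performance of Spark SQL queries. In this talk, we present Spark SQL's new cost-based optimizer framework and its performance impact on TPC-DS queries.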
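
To make the kind of decision described in this abstract concrete, here is a minimal Python sketch of picking a hash join's build side and join algorithm from estimated statistics. It is an illustration only, not Spark's implementation: the `Stats` class, the `plan_join` function, and the 10 MiB threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stats:
    """Estimated statistics for one join input (hypothetical structure)."""
    row_count: int      # estimated cardinality
    avg_row_bytes: int  # estimated average row width in bytes

    @property
    def size_bytes(self) -> int:
        # Estimated output size = cardinality x average row width.
        return self.row_count * self.avg_row_bytes


# Arbitrary threshold for this sketch (an assumption, not a Spark setting).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def plan_join(left: Stats, right: Stats,
              threshold: int = BROADCAST_THRESHOLD_BYTES) -> Tuple[str, str]:
    """Return (algorithm, build_side) based on estimated input sizes."""
    # Build the hash table on the smaller input.
    build_side = "left" if left.size_bytes <= right.size_bytes else "right"
    smaller = min(left.size_bytes, right.size_bytes)
    # Broadcast the build side only if it is estimated to be small enough.
    algorithm = ("broadcast_hash_join" if smaller <= threshold
                 else "shuffled_hash_join")
    return algorithm, build_side


if __name__ == "__main__":
    orders = Stats(row_count=10_000_000, avg_row_bytes=100)
    customers = Stats(row_count=50_000, avg_row_bytes=80)
    print(plan_join(orders, customers))
```

The sketch shows why accurate cardinality estimates matter: an overestimate for the small input would push the plan to a shuffled join, and an underestimate for a large input could broadcast a table that does not fit in memory.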

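Andy M Huang (黄明) is a T4 expert in Tencent's data platform department and one of the early researchers and evangelists of Spark, with experience and research in distributed computing and machine learning. He is responsible for building large-scale parallel computing and intelligent learning platforms that help Tencent's data and machine learning businesses grow quickly.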

Presentations

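Angel: A machine learning framework for high dimensionality (Session)

In machine learning and artificial intelligence, feature dimensionality often swells to the tens of millions or hundreds of millions so that models perform better online. At that scale, traditional distributed computing frameworks struggle to deliver high performance. To address this, Tencent launched the Angel machine learning framework, which supports high-performance machine learning for models with very high dimensionality. The framework both supports developing high-performance machine learning algorithms directly on it and can serve as a PS (parameter server) engine that provides PS support to other frameworks (for example, Spark), forming a healthy PS ecosystem overall.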

Shengsheng (Shane) Huang is a software architect at Intel and an Apache Spark committer and PMC member, leading the development of large-scale analytical applications and infrastructure on Spark at Intel. Her area of focus is big data and distributed machine learning, especially deep (convolutional) neural networks. Previously, at the National University of Singapore (NUS), her research interests were large-scale vision data analysis and statistical machine learning.

Presentations

Distributed deep learning at scale on Apache Spark with BigDL (Session)

Zhichao Li, Shengsheng Huang, and Yiheng Wang explore how data scientists have adopted BigDL for deep learning analysis on large amounts of data in a distributed fashion, allowing them to use their big data cluster as a unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.

Yifeng Jiang is a lead solution engineer at Hortonworks, where he helps design and architect big data solutions on HDP for Japanese, Chinese, and APAC customers in many industries. (His largest customer deployed an HDP cluster with over 1,000 nodes.) Previously, he was a solution architect at Amazon Web Services. Yifeng is the author of the HBase Administration Cookbook.

Presentations

HDF 3.0: An open source IoT platform for everyone - sponsored by Hortonworks (Session)

Yifeng Jiang offers an overview of HDF 3.0, the open source IoT platform that everyone can easily start using right now. HDF supports data collection from the edge, flow management to send data to the data center and the cloud, real-time processing, and visualization and analytics with open source technology and can be used with simple drag-and-drop operations.

Edwin Law was the third person and first engineer on the Data team at Grab (formerly MyTeksi and Grab Taxi), which encompasses data engineering, data science, and data analytics. Edwin leads the almost-15-member-strong Data Engineering and Database Operations teams as their engineering manager.

Presentations

Driving Southeast Asia forward with big data (Session)

Grab is sitting at the junction of the digital and physical worlds. Its vision is to drive Southeast Asia forward and transform the way people travel and pay across the region. Feng Cheng and Edwin Law explain Grab's data architecture and offer a history of its data platform migration and stream-processing apps.

Tony Lee is the chief security officer at JD.

Presentations

Leveraging big data for security analytics at JD (Session)

JD.com is one of the largest B2C online retailers in the world. Its mission is to provide a safe and secure marketplace for its 226M active users and 120K third-party vendors. Jimmy Zhigang Su and Tony Lee discuss the transformations big data has enabled at JD, including threat intelligence, account security, and end-point security.

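Technical partner and senior software architect at Kyligence Inc., Apache Kylin committer and PMC member, focused on big data technology R&D, and tech lead for KyBot. Graduated from the computer science department at Shanghai Jiao Tong University. Previously a senior engineer in eBay's global analytics infrastructure department and a software development engineer in Microsoft's cloud computing and enterprise products division. Was a core member of the Asia-Pacific team for Microsoft's commercial product Dynamics, helping develop a new generation of cloud-based ERP solutions.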

Presentations

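Apache Kylin 2.0: From an OLAP engine on Hadoop to a real-time data warehouse (Session)

Apache Kylin v2.0 is about to be released! As a leading OLAP analytics engine for big data, Apache Kylin is now more full-fledged: it supports snowflake schemas, more complete SQL syntax, the newly introduced Spark Cubing, better support for real-time streaming data ingestion, and more. Apache Kylin is gradually evolving from a traditional OLAP platform on Hadoop into a real-time data warehouse on Hadoop.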

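Li Dong (李栋), Kyligence (Meet the Experts)

Apache Kylin v2.0 has been released! As a leading OLAP analytics engine for big data, Apache Kylin is now more full-fledged, supporting snowflake schemas, Spark-based precomputation, more complete SQL syntax, real-time streaming data ingestion, and more. Apache Kylin is gradually transforming from a traditional OLAP engine into a real-time data warehouse. This session discusses the latest features in Apache Kylin v2.0 and real-world case studies.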

Fangshi Li is a senior software engineer on LinkedIn’s Hadoop team. Fangshi built and open-sourced Dr. Elephant. He is currently doing Hive- and Spark-related work. Fangshi holds a degree from Carnegie Mellon.

Presentations

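Building the bridge between Hadoop and Kafka at LinkedIn: A Hadoop team's perspective (Session)

Kafka and Hadoop are the core of the online and offline parts of LinkedIn's data infrastructure. Kafka was created and open-sourced by LinkedIn, whose clusters currently have more than a thousand machines and collect and process 14 trillion messages per day. LinkedIn's Hadoop clusters have more than 10,000 machines and 50 PB of data and process 200,000 jobs per day. In this session, I explain from the perspective of a Hadoop team member how LinkedIn builds the bridge between Hadoop and Kafka so that they work better together. Topics include: (1) LinkedIn's data architecture and the full ETL flow of a dataset from creation through Kafka to Hadoop and finally to users (data analysts); (2) one of our use cases, which uses Apache Flume and Kafka to collect and analyze data from Hadoop clusters and build a real-time analytics application; and (3) our latest work, which provides a unified SQL interface that lets users process both Kafka data streams and HDFS data.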

Tianhui Michael Li is the founder and CEO of the Data Incubator. Michael has worked as a data scientist lead at Foursquare, a quant at D.E. Shaw and JPMorgan, and a rocket scientist at NASA. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup that lets him focus on what he really loves. He did his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall scholar.

Presentations

Training a real-world credit model using open source artificial intelligence and machine learning tools (Session)

Michael Li demonstrates how to iteratively train and refine a simple yet robust credit model for loan-default prediction, based on real-world loan performance data using 100% open source machine learning and artificial intelligence tools. The data is based on US$26 billion in loans issued over 10 years.

Yu Li is a senior technical expert at Alibaba leading the Alibaba Search HBase team. An HBase committer, Yu has over seven years’ work experience in the Hadoop stack for enterprise solutions and has supported Alibaba for three Singles’ Days.

Presentations

生产环境里的堆外内存HBase读路径——阿里巴巴的故事 (Off-heap HBase read path in production: The Alibaba story) 议题 (Session)

Yu Li explains how Alibaba met the challenge of tens of millions of requests per second to its Alibaba-Search HBase cluster on 2016 Singles' Day. With read-path off-heaping, Alibaba improved the throughput by 30% and achieved a predictable latency.

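Zhichao Li (利智超) is on the Big Data Technology team at Intel, where he focuses on big data analytics. A Spark contributor, he and his colleagues develop distributed machine learning algorithms on Apache Spark to meet machine learning needs at big data scale. He also optimizes these algorithms for Intel platforms and helps Intel customers build big data analytics applications for their businesses.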

Presentations

Apache Spark高级实践和原理解析 (Apache Spark advanced practice and principles) 培训 (Training)

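As big data analytics and machine learning have become more widely used in industry in recent years, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark in order to take advantage of large volumes of raw data and scale-out architectures. How can you understand the key big data technologies in depth and apply them better? Following current trends in big data technology, this course covers advanced Apache Spark practices and internals, helping you understand the essential design ideas behind Apache Spark and how it combines with streaming analytics, machine learning, and deep learning to provide consistent, integrated practices across data ingestion, analysis and processing, feature extraction, and machine learning.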

使用BigDL在Apache Spark上进行大规模分布式深度学习 (Distributed deep learning at scale on Apache Spark with BigDL) 议题 (Session)

Zhichao Li, Shengsheng Huang, and Yiheng Wang explore how data scientists have adopted BigDL for deep learning analysis on large amounts of data in a distributed fashion, allowing them to use their big data cluster as a unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.

使用BigDL构建深度学习来驱动Apache Spark上的大数据分析,Intel赞助议题(Building deep learning power big data analytics on Apache Spark using BigDL—sponsored by Intel) 议题 (Session)

Intel’s BigDL is an open source distributed deep learning framework for Apache Spark. Yiheng Wang and Zhichao Li discuss the technologies Intel engineers developed and what they learned from building deep learning applications on Spark using BigDL, including image recognition and object detection (Faster R-CNN and SSD) and speech recognition with Deep Speech and acoustic transformers.

林元庆,现任百度深度学习实验室(IDL)主任,拥有清华大学光学工程硕士学位和宾夕法尼亚大学电气工程博士学位。

林元庆在机器学习和计算机视觉等研究领域拥有多年的研究经验和显著的成果。在加入百度前,曾任NEC美国实验室媒体分析部门主管。在他的带领下NEC研究团队在深度学习、计算机视觉和无人驾驶等领域取得世界领先水平。2005年至今在顶级国际会议和期刊发表论文30余篇,拥有11项美国专利,曾担任NIPS大会领域主席、大规模视觉识别和检索国际研讨会联合主席等。

加入百度后,林元庆致力于带领深度学习实验室研发具有统治级别的人工智能技术,其领导的团队在多个领域实现了技术上重大进展并且应用到百度的多项产品中去,极大地提升了产品的性能以及用户的体验,其带领的团队在多项重要计算机视觉技术在国际测试集上取得世界第一名的好成绩。

Yuanqing Lin is the head of Baidu Research, where he is responsible for managing the company’s four sublabs: IDL, Big Data Lab (BDL), Augmented Reality Lab (ARL), and Silicon Valley Artificial Intelligence Lab (SVAIL). Previously, Yuanqing was the head of the Media Analysis department at NEC Labs America, where he worked on large-scale fine-grained image recognition and 3D visual sensing for autonomous driving. Since 2005, Yuanqing has published more than 40 papers in top academic conferences and journals and holds 11 US patents. He served as area chair of NIPS 2015 and cochair of the International Workshop on Large-Scale Visual Recognition and Retrieval 2012. He holds a PhD in electrical engineering from the University of Pennsylvania.

Presentations

DuFace:大规模人脸识别 (DuFace: Large-scale face recognition) 主题演讲 (Keynote)

Yuanqing Lin explores Baidu’s progress in face recognition based on big data.

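Han Liu (刘晗) is the director of the Reinforcement Learning Center at Tencent AI Lab and a tenured professor in the departments of Computer Science, Statistics, and Industrial Engineering and Management Sciences at Northwestern University. He previously taught in the Department of Operations Research and Financial Engineering at Princeton University and in the departments of Biostatistics and Computer Science at Johns Hopkins University. He received a joint PhD in machine learning and statistics from the School of Computer Science at Carnegie Mellon University in 2010. His work focuses on artificial intelligence, machine learning, and big data analytics. He has published more than 100 papers in top international conferences and journals in statistics and machine learning and has won the ICML best paper award and a best paper award at the international conference on continuous optimization. Han is a recipient of the NSF CAREER Award, the Alfred P Sloan Research Fellowship in Mathematics, the IMS Tweedie New Researcher Award, the ASA Noether Young Scholar Award, and Princeton University's Howard B Wentz award. He serves as an area chair for the top machine learning conferences NIPS and ICML and as an associate editor for the top statistics journals JASA and EJS.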

Presentations

Shaoshan Liu is the cofounder and president of PerceptIn, a company working on developing a next-generation robotics platform. Previously, he worked on autonomous driving and deep learning infrastructure at Baidu USA. Shaoshan holds a PhD in computer engineering from the University of California, Irvine.

Presentations

使用Alluxio助力机器人云 (Powering robotics clouds with Alluxio) 议题 (Session)

The rise of robotics applications demands new cloud architectures that deliver high throughput and low latency. Shaoshan Liu explains how PerceptIn designed and implemented a cloud architecture to support these emerging user requirements using Alluxio.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

周五欢迎致辞 (Friday opening welcome) 主题演讲 (Keynote)

Program chairs Ben Lorica, Jason Dai, and Doug Cutting open the first day of keynotes.

周六欢迎致辞 (Saturday opening welcome) 主题演讲 (Keynote)

Program chairs Jason Dai, Doug Cutting, and Ben Lorica open the second day of keynotes.

机器学习时代(The Age of Machine Learning) 主题演讲 (Keynote)

Details to come.

Zhenxiao Luo is a software engineer at Uber working on Presto and Parquet. Before joining Uber, he led the development and operations of Presto at Netflix. Zhenxiao has big data experience at Facebook, Cloudera, and Vertica on Hadoop-related projects. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.

Presentations

列式存储在Uber (Columnar storage at Uber) 议题 (Session)

As Uber continues to grow, its big data systems must also grow in scalability, reliability, and performance to help Uber make business decisions, give user recommendations, and analyze experiments across all data sources. Zhenxiao Luo shares his experience running columnar storage in production at Uber and discusses query optimization techniques in SQL engines.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera helping clients find success with the Hadoop ecosystem and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

数据应用与数据产品架构 (Architecting data applications and data products) 教学辅导课 (Tutorial)

Ted Malaska walks you through building a fraud-detection system, using an end-to-end case study to provide a concrete example of how to architect and implement real-time systems via Apache Hadoop components like Kafka, HBase, Impala, and Spark.

大数据的数据模型 (Big data modeling) 教学辅导课 (Tutorial)

The recent advancement in distributed processing engines, from Spark to Impala to Spark Streaming and Storm, has proved exciting. Ted Malaska explains why, if your design only focuses on the processing layer to get speed and power, you may be missing half the story and leaving a significant amount of optimization untapped.

成为Apache Spark明星路上的技巧 (Tricks of the trade to be an Apache Spark rock star) 议题 (Session)

It's one thing to write an Apache Spark application that gets you to an answer. It’s another thing to know you used all the tricks in the book to make it run as fast as possible. Ted Malaska shares some of those tricks.

Jiangjie Qin is on the Data Infrastructure team at LinkedIn. He works on Apache Kafka and is a Kafka Committer. Previously, he worked at IBM, where he managed IBM’s zSeries platform for banking clients. Jiangjie holds a master’s degree in information networking from Carnegie Mellon’s Information Networking Institute.

Presentations

从简单到复杂:Apache Kafka应用实例详解 (From simple to complex: A detailed explanation of Apache Kafka applications in practice) 教学辅导课 (Tutorial)

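Apache Kafka is one of the most popular messaging systems of recent years. Its use has grown from a centralized system message queue to a range of more complex scenarios, including stream processing, database replication, and change data capture (CDC). Drawing on Kafka practice at LinkedIn, this tutorial walks through Kafka's application scenarios in detail.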

Jimmy Su is the head of JD security research center in Silicon Valley, where he leads the security research projects in the areas of account security, APT detection, IoT security, mobile security, and email security.

Presentations

在京东利用大数据进行安全分析 (Leveraging big data for security analytics at JD) 议题 (Session)

JD.com is one of the largest B2C online retailers in the world. Its mission is to provide a safe and secure marketplace for its 226M active users and 120K third-party vendors. Jimmy Zhigang Su and Tony Lee discuss the transformations big data has enabled at JD, including threat intelligence, account security, and end-point security.

Daniel Templeton has a long history in high-performance computing, open source communities, and technology evangelism. Today, Daniel works on the YARN development team at Cloudera, where he focuses on the resource manager, fair scheduler, and Docker support.

Presentations

Apache Hadoop 3.0的特性和开发进展的更新 (Apache Hadoop 3.0 features and development update) 议题 (Session)

Apache Hadoop 3.0 has made steady progress toward a planned release this year. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, and MapReduce task-level optimization, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Ramkrishna Vasudevan is a senior software engineer at Intel working with Apache HBase. He is also an Apache Phoenix PMC member. Recently, Ramkrishna has been actively working on performance-related features in HBase.

Presentations

生产环境里的堆外内存HBase读路径——阿里巴巴的故事 (Off-heap HBase read path in production: The Alibaba story) 议题 (Session)

Yu Li explains how Alibaba met the challenge of tens of millions of requests per second to its Alibaba-Search HBase cluster on 2016 Singles' Day. With read-path off-heaping, Alibaba improved the throughput by 30% and achieved a predictable latency.

Andrew Wang is a software engineer at Cloudera on the HDFS team, an Apache Hadoop committer and PMC member, and the release manager for Hadoop 3.0. Previously, he was a PhD student in the AMPLab at UC Berkeley, where he worked on problems related to distributed systems and warehouse-scale computing. He holds a master’s and a bachelor’s degree in computer science from UC Berkeley and UVA respectively.

Presentations

Apache Hadoop 3.0的特性和开发进展的更新 (Apache Hadoop 3.0 features and development update) 议题 (Session)

Apache Hadoop 3.0 has made steady progress toward a planned release this year. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, and MapReduce task-level optimization, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

HDFS纠删码最新探秘 (Demystifying erasure coding in HDFS) 议题 (Session)

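Hadoop 3.0 introduces erasure coding. In common configurations, erasure coding reduces storage cost by 50% compared with traditional three-way replication while also improving data reliability. In this session, we briefly introduce HDFS erasure coding, then look in depth at the work done to stabilize the feature ahead of Hadoop 3.0 GA, and share performance results on HDFS erasure coding for key members of the Hadoop ecosystem such as Spark, Hive, Impala, and Kylin. We close with considerations and recommendations for deploying erasure coding in production.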
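
The 50% figure in this abstract can be checked with simple arithmetic. The sketch below is illustrative only: it assumes a Reed-Solomon layout with 6 data units and 3 parity units (the abstract does not name the configuration), and the function names are hypothetical.

```python
def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data for a (data, parity) erasure code."""
    return (data_units + parity_units) / data_units


def savings_vs_replication(data_units: int, parity_units: int, replicas: int = 3) -> float:
    """Fraction of raw storage saved relative to keeping `replicas` full copies."""
    return 1.0 - storage_overhead(data_units, parity_units) / replicas


if __name__ == "__main__":
    # Assumed layout: 6 data units + 3 parity units per stripe.
    print(storage_overhead(6, 3))        # raw bytes per user byte
    print(savings_vs_replication(6, 3))  # fraction saved vs. 3 full copies
```

Under that assumed layout, each byte of user data costs 1.5 bytes of raw storage instead of 3, and a stripe survives the loss of any 3 units, whereas three-way replication survives the loss of 2 copies.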

Carson Wang is a big data software engineer at Intel, focusing on developing and improving new big data technologies. He is an active open source contributor to the Spark and Alluxio projects. Prior to Intel, Carson was an engineer at Microsoft working on cloud computing technologies.

Presentations

Apache Spark高级实践和原理解析 (Apache Spark advanced practice and principles) 培训 (Training)

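As big data analytics and machine learning have become more widely used in industry in recent years, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark in order to take advantage of large volumes of raw data and scale-out architectures. How can you understand the key big data technologies in depth and apply them better? Following current trends in big data technology, this course covers advanced Apache Spark practices and internals, helping you understand the essential design ideas behind Apache Spark and how it combines with streaming analytics, machine learning, and deep learning to provide consistent, integrated practices across data ingestion, analysis and processing, feature extraction, and machine learning.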

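Daoyuan Wang (王道远) is a senior software engineer at Intel Asia-Pacific Research and Development Ltd. He has worked on Spark SQL since 2014 and is an active contributor to the Apache Spark open source community. Before working on Spark, he worked on the IDH version of Hive. He is the translator of the book 《Spark快速大数据分析》.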

Presentations

Apache Spark高级实践和原理解析 (Apache Spark advanced practice and principles) 培训 (Training)

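As big data analytics and machine learning have become more widely used in industry in recent years, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark in order to take advantage of large volumes of raw data and scale-out architectures. How can you understand the key big data technologies in depth and apply them better? Following current trends in big data technology, this course covers advanced Apache Spark practices and internals, helping you understand the essential design ideas behind Apache Spark and how it combines with streaming analytics, machine learning, and deep learning to provide consistent, integrated practices across data ingestion, analysis and processing, feature extraction, and machine learning.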

OAP: 使用Spark SQL进行即席查询 (OAP: Using Spark SQL for ad hoc queries) 议题 (Session)

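OAP is an open source collaboration between Intel's big data team and Baidu's infrastructure team. It aims to optimize large-scale ad hoc queries on Spark SQL to meet the need for second-level queries over massive search logs in Baidu's online services. Using techniques such as user-defined distributed indexes and automatic caching, OAP greatly accelerates SQL queries in certain scenarios. OAP supports multiple index types, so users can choose an index suited to their data and speed up queries with little additional storage overhead. In Baidu's production environment, OAP is offered as the platform's query acceleration solution and has delivered roughly a 5x performance improvement for some real queries, substantially cutting query run time and broadening the range of Spark SQL use cases.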

Yiheng Wang is a software development engineer on the Big Data Technology team at Intel working in the area of big data analytics. Yiheng and his colleagues are developing and optimizing distributed machine learning algorithms (e.g., neural network and logistic regression) on Apache Spark. He also helps Intel customers build and optimize their big data analytics applications.

Presentations

Apache Spark高级实践和原理解析 (Apache Spark advanced practice and principles) 培训 (Training)

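As big data analytics and machine learning have become more widely used in industry in recent years, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark in order to take advantage of large volumes of raw data and scale-out architectures. How can you understand the key big data technologies in depth and apply them better? Following current trends in big data technology, this course covers advanced Apache Spark practices and internals, helping you understand the essential design ideas behind Apache Spark and how it combines with streaming analytics, machine learning, and deep learning to provide consistent, integrated practices across data ingestion, analysis and processing, feature extraction, and machine learning.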

使用Apache Spark和BigDL来构建深度学习驱动的大数据分析 (Building deep learning-powered big data analytics using Apache Spark and BigDL) 教学辅导课 (Tutorial)

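Deep learning has achieved state-of-the-art results in many fields (such as computer vision, natural language processing, and speech recognition) and has great potential value for industry. Deep learning and big data are closely linked. First, deep learning models need large amounts of data for training, which is why deep learning only took off in the big data era. Second, most big data today is video, audio, and text, which is well suited to deep learning algorithms. To unlock the power of deep learning, we should apply it in big data environments.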

使用BigDL在Apache Spark上进行大规模分布式深度学习 (Distributed deep learning at scale on Apache Spark with BigDL) 议题 (Session)

Zhichao Li, Shengsheng Huang, and Yiheng Wang explore how data scientists have adopted BigDL for deep learning analysis on large amounts of data in a distributed fashion, allowing them to use their big data cluster as a unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.

使用BigDL构建深度学习来驱动Apache Spark上的大数据分析,Intel赞助议题(Building deep learning power big data analytics on Apache Spark using BigDL—sponsored by Intel) 议题 (Session)

Intel’s BigDL is an open source distributed deep learning framework for Apache Spark. Yiheng Wang and Zhichao Li discuss the technologies Intel engineers developed and what they learned from building deep learning applications on Spark using BigDL, including image recognition and object detection (Faster R-CNN and SSD) and speech recognition with Deep Speech and acoustic transformers.

Dennis Weng is vice president of engineering at JD Group. A 20-year IT veteran in cutting-edge technology companies, Dennis is an expert on storage systems and very large clustering systems. He holds nearly 20 US patents. Three years ago, he returned to China to lead the AI and Big Data group at JD. Dennis holds a master’s degree from Lakehead University in Canada.

Presentations

电子商务的未来:AI和大数据(An ecommerce future: AI and big data) 主题演讲 (Keynote)

Online shopping accounts for over 15% of China's overall shopping market and has been growing more than 20% every year. Over the past 13 years, JD has successfully become a direct sale online retail giant. Dennis Weng explains how JD has used rich and high-value customer and business data to become one of the most important data companies in China.

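Has four years of project experience with Hadoop and its ecosystem, focusing on the design, deployment, and implementation of big data solutions across industries including telecommunications, insurance, manufacturing, and public safety.
Skilled at helping business units extract value from big data through efficient Hadoop architecture design and implementation combined with a range of big data tools.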

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

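China Life has accumulated a large amount of data over the years. The main goal of our data department is to mine that data in depth for use in business development, risk management, customer service, and other areas. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge machine learning and deep learning techniques, and we share our machine learning system architecture, our application-building process, and the lessons we learned.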

Mingxi Wu is the vice president of engineering at GraphSQL, a startup building a leading real-time graph data platform. Over his 15-year career, Mingxi has focused on database research and data management software building in Microsoft’s SQL Server group and Oracle’s relational database optimizer group. He has won research awards in the most prestigious publication venues in database and data mining (SIGMOD, KDD, and VLDB). Recently, he has been focusing on building an easy-to-use and highly expressive graph query language. Mingxi holds a PhD from the University of Florida, where he specialized in both database and data mining.

Presentations

GraphSQL: 崭新的游戏规则一个完整的高效图数据和分析平台 (GraphSQL is a game changer: A complete high performance graph data and analytics platform) 议题 (Session)

Mingxi Wu and Yu Xu offer an overview of GraphSQL, a high-performance enterprise graph data platform for real-time graph analytics that enables businesses to transform structured, semistructured, and unstructured data and massive enterprise data silos into an intelligent interconnected data network, uncovering implicit patterns and critical insights to drive business growth.

乌明希(Wu Mingxi), GraphSQL 专家见面会 (Meet the Experts)

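My background is in traditional databases, and I have also done in-depth R&D on data management for open source platforms. I am happy to discuss real-world production big data problems with attendees and, drawing on my own experience, to assess the advantages of a graph data platform for a specific problem. I am also happy to talk further in person with organizations interested in collaborating.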

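Zhong Wu (吴中) graduated from Tsinghua University, where he earned a PhD in computer science and technology under the supervision of Dr. Harry Shum (沈向洋), executive vice president at Microsoft. He is now a technical director at DataVisor, where he is mainly responsible for DataVisor's business in China. He has published a number of influential papers in top computer vision venues such as CVPR, ICCV, and PAMI and has filed several patents in big data search and big data security. Before joining DataVisor, Zhong worked on image search in Microsoft's Bing division, where his work included extracting and indexing large-scale text and image features, building high-performance systems, and designing efficient algorithms, improving the relevance of Bing image search results by raising the quality of an index of billions of images.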

Presentations

欺诈的潜伏性: 如何利用大数据进行反欺诈检测 (The latency of fraud: How to use big data to detect fraud) 议题 (Session)

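How many of your users are sleeper fraudsters waiting to attack? Every online user community faces the risk of hidden groups and dormant fraudulent accounts. Based on DataVisor's analysis of more than 1 billion users and 500 billion events across online services worldwide, this session details the threat posed by dormant fraudulent accounts, explores how fraudsters use sophisticated attack techniques to evade detection, and covers the use of Spark for big data security analytics.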

Tony Xing is a senior product manager on the Shared Data team within Microsoft’s Application and Service group. Previously, he was a senior product manager on the Skype data team within Microsoft’s Application and Service group. Tony is a frequent speaker at Strata.

Presentations

微软的通用异常检测平台 (The common anomaly detection platform at Microsoft) 议题 (Session)

Tony Xing offers an overview of Microsoft's common anomaly detection platform, an API service built internally to provide product teams the flexibility to plug in any anomaly detection algorithms to fit their own signal types.

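Project manager for big data machine learning at China Life, focusing on the research and application of big data analytics and machine learning.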

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

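China Life has accumulated a large amount of data over the years. The main goal of our data department is to mine that data in depth for use in business development, risk management, customer service, and other areas. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge machine learning and deep learning techniques, and we share our machine learning system architecture, our application-building process, and the lessons we learned.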

Yu Xu is the cofounder and CEO at GraphSQL. Previously, Yu held data engineering roles at Twitter, Teradata, and IBM. He is the author of 26 US patents (13 issued and 13 pending) in the areas of parallel computing, large-scale data analysis, information retrieval, and data management and has published 13 papers at top database conferences. Yu has also served on the program committees of top conferences in his field.

Presentations

GraphSQL: 崭新的游戏规则一个完整的高效图数据和分析平台 (GraphSQL is a game changer: A complete high performance graph data and analytics platform) 议题 (Session)

Mingxi Wu and Yu Xu offer an overview of GraphSQL, a high-performance enterprise graph data platform for real-time graph analytics that enables businesses to transform structured, semistructured, and unstructured data and massive enterprise data silos into an intelligent interconnected data network, uncovering implicit patterns and critical insights to drive business growth.

Ming Yang is the cofounder and vice president of software at Horizon Robotics. Previously, he was one of the founding members of the Facebook Artificial Intelligence Research (FAIR) team and a former senior researcher at NEC Labs America. Ming is a well-recognized researcher in computer vision and machine learning. His research interests include object tracking, face recognition, massive image retrieval, and multimedia content analysis. He holds 14 US patents and has over 20 publications in top conferences like CVPR and ICCV and 8 publications in top international journal T-PAMI, with more than 5,000 citations. During his tenure at Facebook, Ming led the deep learning research project Deep Face, which had a significant impact in the deep learning research community and was widely reported by media including Science magazine, MIT Tech Review, and Forbes. He has served as a member of the program committee for multiple top international conferences, including CVPR, ICCV, NIPS, and ACMMM. He has also been a reviewer for several top international journals, including T-PAMI, IJCV, and T-IP. As the leader of the NEC-UIUC team, Ming and his team achieved the best result in the TRECVid 2008 and 2009 Event Detection Evaluation. He was also a member of the NEC team that won first place in the ImageNet 2010 Large Scale Visual Recognition Challenge. He holds a BEng and MEng from the Department of Electrical Engineering at Tsinghua University and a PhD from the Department of Electrical Engineering and Computer Science at Northwestern University.

Presentations

用于深度学习的异构计算 (Heterogeneous computing for deep learning) 主题演讲 (Keynote)

Ming Yang offers a brief introduction to deep neural network computation as well as an overview and comparison of the competing heterogeneous computing options, such as DSP, GPU, TPU, FPGA, and ASIC.

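Benquan Yu (俞本权) is chief data technology architect at Ant Financial. He joined Ant Financial in 2016 to build stable, efficient, flexible, low-cost storage and real-time computing for massive data, upgrading Ant's technology platform from massive real-time transaction processing to massive real-time data-driven decision making. Previously, he was chief architect of the credit card business unit at PayPal for seven years and then a principal engineer at Google for six years, where he led the building of the YouTube ecommerce platform, the creation of youtube.com/movies, and the deep integration of Google Wallet with Chrome; designed the data and backend system architecture for Google Analytics; and led R&D on the next-generation Google Analytics backend.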

Presentations

GeaBase:蚂蚁金服大规模实时分布式图数据库(GeaBase: Ant Financial’s large-scale and real-time distributed graph database) 议题 (Session)

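An introduction to GeaBase (Graph Exploration and Analytics Database), a new-generation distributed real-time graph database developed in-house at Ant Financial. GeaBase supports massive data volumes, low-latency real-time responses under high concurrency, and large-scale iterative computation. This session covers GeaBase's architecture, engineering implementation, and real-world applications.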

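Senior technical manager at Intel's Big Data Technology Center, with more than 10 years of experience in the server software and hardware industry. Currently works on promoting software solutions for big data analytics.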

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

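China Life has accumulated a large amount of data over the years. The main goal of our data department is to mine that data in depth for use in business development, risk management, customer service, and other areas. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge machine learning and deep learning techniques, and we share our machine learning system architecture, our application-building process, and the lessons we learned.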

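Liye Zhang (张李晔) is a big data architect at 新氦科技, where he currently focuses on building and developing container-based stream processing and real-time analytics platforms. 新氦科技 is a Shanghai-based big data infrastructure company and a subsidiary of 新智集团. Before joining 新氦科技, Liye was a big data software engineer at Intel Asia-Pacific Research and Development Ltd., where he worked on code development and performance tuning for Spark and Hive.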

Presentations

HAP:多流动态实时分析系统 (HAP: A multistream, dynamic, real-time analytic system) 议题 (Session)

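HAP is a real-time analytics system that supports streaming input and the correlation (joining) of multiple streams, and it can dynamically change the underlying stream processing based on the query layer to meet different business needs. Running on Kubernetes, it offers horizontal scaling, high availability, efficiency, and speed, and delivers second-level data analysis and queries while guaranteeing exactly-once semantics.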

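Received a master's degree from Shanghai Jiao Tong University in 2008 and joined PayPal's Risk Data Science team in early 2012. In early 2013, began developing a machine learning framework based on Hadoop/YARN to meet PayPal's growing needs for big data in risk management, with primary responsibility for implementing distributed neural networks, logistic regression, gradient boosted trees, and other algorithms on Hadoop/YARN. Currently responsible at PayPal Risk for developing the distributed machine learning framework and building end-to-end systems for machine learning engineering.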

Presentations

大规模机器学习在PayPal风险控制部门的实践 (Large-scale machine learning in PayPal’s Risk Management department) 议题 (Session)

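PayPal's Risk Management department uses machine learning models built on big data to detect fraudulent transactions and fraudulent users. This session shares how the department implements distributed machine learning algorithms such as logistic regression, neural networks, and gradient boosted trees on Hadoop/YARN, how it does feature engineering for different algorithms, and how it builds end-to-end machine learning pipelines. It closes with how these algorithms are combined to improve model performance and stability.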

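Founder and CEO of GrowingIO, with 13 years of data analytics experience in Silicon Valley. Personally built LinkedIn's 100-person business analytics and data science team, which supported the rapid growth of all of LinkedIn's revenue-related businesses. Named one of the "world's top 10 frontier data scientists" by Data Science Central.
In May 2015, founded GrowingIO, a new-generation data analytics product based on user behavior. It collects complete, real-time user behavior data without manual event instrumentation, helping managers, product managers, marketing and operations staff, data analysts, and growth hackers raise conversion rates, optimize websites and apps, and drive business and user growth with data.
GrowingIO was named one of China's 50 most innovative companies of 2015 by Fast Company and raised US$20 million in Series A funding from Matrix Partners China, NEA, and Greylock.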

Presentations

数据驱动企业增长 (Data-driven business growth) 议题 (Session)

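As the traffic dividend fades, data-driven user and revenue growth becomes the new core: decisions should be driven by data rather than gut feeling. What makes data analytics so compelling? How can it help companies create substantial business value, and how can everyone in a company make data-driven decisions? What are Silicon Valley's most advanced methodologies, tools, technologies, and product ideas?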

Xuefu Zhang is a software engineer at Uber, where he is the tech team lead for SQL on Hadoop. A veteran of the open source community, Xuefu spends most of his time on Apache Hive and Pig. Previously, he was the tech lead for Hive at Cloudera and led a global effort for the Hive on Spark project, worked on the Hadoop team at Yahoo, and spent his early career at Informatica gaining important experience in enterprise software development, especially in ETL and data warehousing. Xuefu is an Apache member and a PMC member for Hive, Sentry, and Pig.

Presentations

为Hadoop上的大数据准备的统一的SQL (Unified SQL for big data on Hadoop) 议题 (Session)

Xuefu Zhang offers an overview of U-SQL, which was developed internally by engineers at Uber and is envisioned as the future of SQL platforms. U-SQL enables automatic parsing, translation, optimization, and routing for user queries written in any supported query language and provides a unified SQL interface for SQL users who might not be familiar with the underlying SQL engines.

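Has more than six years of software development and management experience. Previously a big data team leader in Sina's platform architecture department, responsible for Weibo's core data storage and big data computing solutions, and a development engineer at 久其 and 锐安科技, gaining extensive experience in software development and projects. Currently works at TalkingData DTU, focusing on big data, with in-depth experience maintaining and developing Hadoop, Spark, and HBase.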

Presentations

基于 Spark 的数据管理、探索、计算平台 (A Spark-based data management, exploration, and computing platform) 议题 (Session)

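TalkingData began adopting Spark at the end of 2013, and all data processing in its data center has now moved to the Spark computing platform. With rapid business growth and sharp increases in data sources and data volume, data asset management, data analysis, and data mining work kept growing, and a data management, exploration, and computing platform built on open source technologies such as Spark, Alluxio, and Jenkins gradually took shape. The speaker covers the platform's background and the evolution of its technical architecture, along with pitfalls encountered in use and future plans.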

现任领英公司研发经理,领导核心大数据团队。该团队开发和应用HDFS,YARN,Spark,TensorFlow等开源技术,为领英公司的大数据平台提供核心的存储/计算引擎。

张喆同时还是Apache Hadoop项目的管理委员会(PMC)成员。也是Hadoop3的主要功能之一,HDFS纠删码(HDFS-EC)的作者。在加入领英之前,张喆就职于Cloudera和IBM沃森研究中心。2006年至今,在国际会议和期刊上发表论文20余篇,拥有5项美国专利。在IBM期间,获杰出技术成就奖(Outstanding Technology Achievement Award)。

Zhe Zhang is an engineering manager at LinkedIn, where he leads the Core Big Data Services team, which leverages open source technologies such as Hadoop, Spark, TensorFlow, and beyond to form the storage-compute engine of LinkedIn’s big data platform. Zhe is a PMC member of Apache Hadoop and author of HDFS erasure coding, a major feature for Hadoop 3.0. Previously, Zhe worked at Cloudera and IBM’s T. J. Watson Research Center. Zhe has over 20 research publications and 5 US patents. While at IBM, he received the Research Accomplishment Award and the Outstanding Technology Achievement Award.

Presentations

成长的烦恼--领英大数据平台500倍扩展中应对的挑战 (Growing pains: When your big data platform grows really big) 主题演讲 (Keynote)

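LinkedIn was one of the first companies in the world to adopt big data technologies. Over the past nine years, LinkedIn's big data platform has grown nearly 500-fold, from 20 nodes supporting 10 users running MapReduce to more than 10,000 nodes supporting thousands of engineers and scientists running large-scale data analysis ranging from interactive Presto queries to TensorFlow deep learning. This talk shares how LinkedIn's big data platform team tackles the challenges that come with large scale and rapid growth.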

领英大数据平台--超过1万节点,每天15万个作业,智能连接5亿职场用户 (LinkedIn's big data platform: 10,000+ nodes and 150,000+ daily jobs connecting 500 million members) 议题 (Session)

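LinkedIn was one of the first companies in the world to adopt big data technologies. Over the past nine years, LinkedIn's big data platform has grown nearly 500-fold, from 20 nodes supporting 10 users running MapReduce to more than 10,000 nodes supporting thousands of engineers and scientists running large-scale data analysis ranging from interactive Presto queries to TensorFlow deep learning. This talk shares how LinkedIn's big data platform team tackles the challenges that come with large scale and rapid growth.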

Jien Zhou works at China UnionPay (CUP), where he is responsible for the planning and design of the CUP data warehouse and large data platform system, the CUP cloud platform, autonomous middleware, database research and development, and the open source technology roadmap and platform planning. He was also involved in the design of the first and second generations of the CUP transfer system. Jien holds a bachelor’s degree, master’s degree, and PhD, all from the University of Science and Technology of China.

周继恩博士,现任中国银联股份有限公司银联科技事业部副总经理。周博士参与中国银联一代和二代转接系统设计,负责中国银联数据仓库、大数据平台系统的规划设计,负责银联云平台,自主中间件、数据库研发,开源技术路线和平台规划。周继恩博士早年就读中国科学技术大学,本硕博。获得中国科学技术大学博士学位。

Presentations

驱动金融服务的可能性 (Powering possibilities in financial services) 主题演讲 (Keynote)

Mick Hollison and Jien Zhou discuss how organizations are applying machine learning and advanced analytics to improve customer service and reduce the threat of fraud and cyberattack and explain how China UnionPay is using big data to deliver a better customer experience and manage risk.

Xiaoyong Zhu is a program manager at Microsoft focusing on scalable machine learning and advanced analytics.

Presentations

使用R和Apache Spark处理大规模数据 (Scaling R faster and larger using Apache Spark) 议题 (Session)

R is a popular data science tool for data analysis. However, it has many drawbacks, such as its memory utilization and single-thread design, that limit its usage for big data analysis. Xiaoyong Zhu explains how to use R to analyze terabytes of data.

Blockchain R&D at Wanda Network Technology.

Presentations

Hyperledger与CDH大数据生态系统的融合以及应用实践 (Hyperledger’s integration with CDH's big data ecosystem and its real-world applications) 议题 (Session)

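Blockchain, the technology behind Bitcoin, is a decentralized distributed ledger technology. Hyperledger is an open source, cross-industry blockchain platform and a global collaboration among industry leaders in finance, banking, the Internet of Things, supply chain, and manufacturing. We integrated Hyperledger with CDH to take advantage of CDH's service deployment, monitoring, and management capabilities. With this project, users can easily deploy Hyperledger clusters in CDH-managed data centers and use the CDH big data platform to analyze Hyperledger data and extract more business value. Projects using it inside Wanda include a digital equity platform and a shared business platform, the latter spanning finance, supply chain, and other areas. We believe this project will be very helpful to the Hyperledger open source community.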

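Big data technology expert with many years of hands-on big data experience. Took part in the technical planning of 天云大数据's major projects and is the technical lead for the company's big data products.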

Presentations

Hadoop上的OLTP,BeagleData赞助议题(OLTP on Hadoop—sponsored by BeagleData) 议题 (Session)

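Industries such as telecom operators, banking, insurance, public security, the military, broadcasting, and government generate huge volumes of data every day. To extract value from that data promptly and accurately and to process it efficiently, we have done a great deal of practical research in big data alongside our project implementation work. Here we share how we achieve high-concurrency real-time transactions in big data, completing big data's last mile.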

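Joined Taobao's technology department in March 2014, focusing on building Spark clusters and services within the group. Joined Alibaba Cloud in May 2015 to work on providing open source computing services on the public cloud. Interested in distributed computing and a contributor to the Apache Hadoop and Spark communities.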

Presentations

Hadoop遇到云上对象存储——实现原理、陷阱和性能优化 (When Hadoop meets object storage: Implementation principles, pitfalls, and performance optimization) 议题 (Session)

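The Hadoop community has long supported object storage on public clouds, such as AWS S3 and Azure Storage. The recently released Apache Hadoop 3.0 (alpha) adds support for more cloud storage services, such as Azure Data Lake and Alibaba Cloud OSS. These services all provide Hadoop-compatible filesystems that users can treat as another HDFS. But object storage and HDFS differ in many ways in how they are implemented, so although the two have similar filesystem interfaces, many APIs behave completely differently. Starting from hands-on experience with Alibaba Cloud OSS, this session describes how the Alibaba Cloud OSS FileSystem implementation made its way into Apache Hadoop. It also explains how object storage differs from traditional filesystems in file upload, download, deletion, and move operations and compares HDFS and the OSS filesystem on performance and cost. Finally, it offers optimizations based on the characteristics of object storage that can improve the performance of open source engines such as Hive and Spark when accessing object storage.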
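
One commonly cited difference of this kind is renaming a directory. On a flat key-value object store there is typically no directory to relink, so a filesystem layer has to copy and delete each object under the prefix. The toy model below is a sketch under that assumption; `ToyObjectStore` and `rename_prefix` are hypothetical names, not the API of Hadoop or of any cloud SDK.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key-to-bytes object store that counts requests."""

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def list(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))


def rename_prefix(store, src, dst):
    """Emulate a directory rename: one listing, then a copy and a delete per object."""
    for key in store.list(src):
        store.copy(key, dst + key[len(src):])
        store.delete(key)


if __name__ == "__main__":
    store = ToyObjectStore()
    for i in range(3):
        store.put(f"tmp/job/part-{i}", b"data")
    store.requests = 0  # count only the requests issued by the rename
    rename_prefix(store, "tmp/job/", "out/")
    print(store.requests, sorted(store.objects))
```

In this model the request count grows with the number of objects, and a failure partway through leaves the data split between both prefixes. This is one reason job commit steps that rely on a cheap rename can behave differently on object storage than on HDFS.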

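Big data architect at Intel and Spark open source contributor, with 10 years of software development experience and familiarity with big data, stream computing, storage, and virtualization. Has helped several companies build Spark-based stream processing solutions.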

Presentations

Apache Spark高级实践和原理解析 (Apache Spark advanced practice and principles) 培训 (Training)

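As big data analytics and machine learning have become more widely used in industry in recent years, more and more people are choosing to build large-scale data processing, analytics, and machine learning on big data platforms such as Apache Spark in order to take advantage of large volumes of raw data and scale-out architectures. How can you understand the key big data technologies in depth and apply them better? Following current trends in big data technology, this course covers advanced Apache Spark practices and internals, helping you understand the essential design ideas behind Apache Spark and how it combines with streaming analytics, machine learning, and deep learning to provide consistent, integrated practices across data ingestion, analysis and processing, feature extraction, and machine learning.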

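千惠子 is a senior engineer at the big data center of Wanda Network Technology Group, currently working on R&D for a data masking platform. The platform's project, ShadowMask, aims to mask private data intelligently, reliably protecting sensitive information while preserving the data's analytical value.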

Presentations

ShadowMask: 脱敏你的敏感的大数据 (ShadowMask: Anonymize your sensitive big data) 议题 (Session)

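Data security is an essential feature of a big data platform, and leakage of sensitive user information is one of its biggest threats. ShadowMask is an open source data masking project built on the Spark big data platform. It meets big data users' needs for masking private user data and balances the risk of privacy leakage against data processing needs. This session covers the project's goals, architecture, challenges, use cases, and current status.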

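Xiaomeng Ye (叶小萌) leads the graph computing and storage technology team in Ant Financial's infrastructure technology department. He graduated from the Department of Computer Science at Fudan University in 1993 and received a master's degree in computer science from the University of Maine in 1998. After graduating, he worked at several internet and enterprise software companies. He joined Facebook in 2011, where he took part in and led the design and development of distributed systems including the search engine and the graph indexing engine. He joined Ant Financial in 2015 and led the development of GeaBase, a real-time graph database that delivers complex queries and updates over very large relationship networks in milliseconds. His team is now responsible for developing a new-generation graph computing engine and for planning and implementing Ant Financial's unified storage architecture.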

Presentations

GeaBase:蚂蚁金服大规模实时分布式图数据库(GeaBase: Ant Financial’s large-scale and real-time distributed graph database) 议题 (Session)

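An introduction to GeaBase (Graph Exploration and Analytics Database), a new-generation distributed real-time graph database developed in-house at Ant Financial. GeaBase supports massive data volumes, low-latency real-time responses under high concurrency, and large-scale iterative computation. This session covers GeaBase's architecture, engineering implementation, and real-world applications.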

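A machine learning veteran, formerly a senior algorithm engineer at MediaV and an algorithm expert in Alibaba's Taobao technology department, and now a senior researcher at Wanda Network Research Institute. Has considerable experience applying machine learning and optimization algorithms to computational advertising and recommender systems, and some exposure to big data-based antifraud and credit risk assessment.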

Presentations

从LR到DNN点击率预估系统的进化 (The evolution of CTR prediction systems, from LR to DNN) 议题 (Session)

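Ad click-through rate (CTR) prediction is a hot topic, and companies in computational advertising generally have their own CTR systems. This talk is about how to improve a CTR prediction system in a stable, controllable way, and what to do about data, architecture, and algorithms at different points in time. It looks back at how one CTR prediction system evolved from a simple ETL-plus-LR setup into a mature online system with online model training, automatic baddit, and automatic large-scale feature exploration, focusing on the reasoning behind each technical choice at several key points in that evolution. Combining the ML and DL body of knowledge with developments of the past two years, it uses several well-known industry applications as a thread and several key milestones (the launch and retirement of the personalized "千人千面" experience and the year-by-year evolution of Singles' Day) as examples to introduce large-scale machine learning and distributed optimization, giving attendees historical case studies to draw on when choosing ML and DL approaches for their own business problems.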

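More than 15 years of experience in big data and data warehousing, with in-depth research and practice in consulting and planning, architecture design, technology stacks, implementation methodologies, and mainstream solutions for big data and data warehouses. Previously served as senior director of financial industry solutions at Teradata, senior architect for big data solutions in HP's software division, and head of architecture at the big data R&D center of Evergrande Internet Group.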

Presentations

Apache Kylin金融行业大数据分析的应用与实践 (Apache Kylin big data analysis: Application and practice in the financial industry) 议题 (Session)

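For a long time, most financial firms have built their data analytics platforms on traditional DW/BI technology, but that technology struggles to cope with the challenges of the big data era: exploding data volumes, multiplying analytical demands, and an urgent need for business innovation. Through real-world cases from leading financial firms in insurance and securities, we describe how the Apache Kylin big data analytics platform has helped these companies break through the bottlenecks of traditional technology, achieve extremely fast analysis over massive data with high concurrency and many dimensions, drive business innovation, and unlock the value of big data.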

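Chief data scientist at TalkingData, who founded the Fregata open source project in 2016. Previously worked at IBM Research - China, Tencent's data platform department, and Huawei's Noah's Ark Lab. Has 10 years of in-depth research and practical experience in large-scale machine learning and data mining, and currently leads data science at TalkingData.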

Presentations

Fregata:在Spark上支持万亿维模型的机器学习算法库(Fregata: Machine learning algorithm libraries for supporting trillion-dimensional model on Spark) 议题 (Session)

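Some of TalkingData's core business capabilities, such as Lookalike, depend heavily on large-scale machine learning, and we found that existing large-scale machine learning technology could not meet our needs well. We need machine learning algorithms that are fast, stable, and free of parameter tuning on large-scale data, which the current mainstream platforms and tools cannot provide. We therefore did research on both algorithms and systems and obtained some results. Our open source Fregata machine learning library is built entirely on standard Spark interfaces, and its Logistic Regression and Softmax algorithms require no parameter tuning, run fast, and support models with trillions of dimensions. Using roughly 2-4 servers, Fregata Logistic Regression can finish training on 510 million samples with 1 trillion dimensions within 15 minutes. In this talk we present Fregata's work on both the algorithm and system sides.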

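张铭 is a professor and doctoral supervisor at the School of Electronics Engineering and Computer Science (信息科学技术学院), Peking University, the only member from China on the ACM Education Council, chair of the ACM China education committee (中国ACM教育专委会), and a member of the ACM/IEEE IT2017 curriculum guidelines drafting group. Entered Peking University in 1984 and earned bachelor's, master's, and doctoral degrees there. Research interests include text mining, social network analysis, and educational big data. Currently leads ongoing projects funded by the National Natural Science Foundation of China and the Ministry of Education's doctoral program fund, and has coauthored more than 100 research papers (at top-tier conferences and journals such as ICML, KDD, AAAI, IJCAI, ACL, WWW, and TKDE), winning the ICML 2014 Best Paper Award. Has also published teaching research papers at SIGCSE and L@S, written 1 academic monograph, and obtained 6 software copyrights and 3 invention patents. Chief editor of several textbooks, two of which are national "11th Five-Year Plan" textbooks; 《数据结构与算法》 (Data Structures and Algorithms) won the Beijing Excellent Textbook Award and is supported as a national "12th Five-Year Plan" textbook. The "Data Structures and Algorithms" course was selected as a quality course at both the national and Beijing municipal levels, and is also a Ministry of Education quality resource-sharing course.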

Presentations

基于深度学习的网络表示 (Network representations based on deep learning) 议题 (Session)

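Network structures are ubiquitous in the real world (for example airline networks, communication networks, paper citation networks, the World Wide Web, and social networks). Large-scale network data and rich node information pose new challenges for research methods and have drawn wide attention from academia and industry. This talk gives a detailed introduction to neural-network-based network representation methods. These methods can handle real-world networks with millions of nodes and billions of edges, and they mainly consider network structure information together with information on the nodes themselves (such as text and attributes). The low-dimensional network representations they learn have proved efficient and effective across different application domains.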

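iDST, Alibaba Cloud Big Data Business Unit

施兴 is a senior algorithm expert on the iDST team in Alibaba Cloud's big data business unit, responsible for the development and application of a distributed machine learning platform.

Previously worked at Baidu on Hadoop distributed storage and computing, mainly developing Hadoop-based computation for a large-scale distributed index library, with support for dynamically added inputs and bistream for binary data, which cut index-building machine costs by 30% and shortened computation time by 50%.

Joined Alibaba in 2010 to work on Hadoop and HBase development, then focused on scheduling and algorithm development for the machine learning algorithm platform, supporting MPI machine learning jobs for advertising and search. Currently leads a team that develops algorithm platform products and applies machine learning algorithms in several vertical domains.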

Presentations

机器学习平台赋能企业创新应用 (A machine learning platform that enables enterprise innovation applications) 议题 (Session)

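Over the past few years, the hottest area of artificial intelligence has been the application of deep learning across all kinds of scenarios. Yet deep learning remains largely the preserve of internet companies: traditional small and medium-sized enterprises find it hard to staff a dedicated team to study deep learning, and hard to obtain large-scale data and computing power. The demand for AI, and the range of fields using it, can be expected to grow much larger. Building on Alibaba Cloud's in-house distributed data storage and computing platform, we developed the AI platform PAI (Platform of AI), with the aim of bringing AI capabilities to every enterprise. For some general-purpose domains, such as image recognition of ID cards and vehicle licenses, we have also built higher-level services on top of PAI. We will introduce PAI's basic capabilities and our mature service offerings, along with the process of training a custom AI model on PAI and serving it.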

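李元健 is a senior R&D engineer in Baidu's Infrastructure department and an Apache Spark contributor. Joined Baidu in 2011 and has since worked on and led the development of core services including Baidu's real-time computing platform DStream, the tracing platform Rig, the Spark platform, and the public cloud service BigSQL.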

Presentations

OAP: 使用Spark SQL进行即席查询 (OAP: Using Spark SQL for ad hoc queries) 议题 (Session)

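OAP is an open source collaboration between Intel's big data team and Baidu's infrastructure team. It aims to optimize large-scale ad hoc queries on Spark SQL and to meet the need for second-level queries over massive search logs in Baidu's online services. Using techniques such as user-defined distributed indexes and automatic caching, OAP greatly accelerates SQL queries in certain scenarios. It supports multiple index types, so users can choose an index suited to the characteristics of their data and speed up queries while adding little extra storage overhead. In Baidu's production environment, OAP is already offered as the platform's query acceleration solution and delivers roughly a 5x performance improvement for some real queries, greatly reducing query run time and broadening the range of scenarios where Spark SQL can be applied.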

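李嘉璇 is the author of 《TensorFlow技术解析与实战》 and the founder of a TensorFlow community (tf.greatgeekgrace.com), is active in major Chinese technical communities, and answers programming questions on Zhihu. Has a keen interest in the architecture and source code of deep learning frameworks and in their applications across different fields. Has hands-on experience with image processing, sentiment analysis of social text data, and data mining, took part in a hackathon on a deep-learning-based 2D perception system for autonomous driving, and previously worked as an R&D engineer at Baidu.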

Presentations

TensorFlow与自然语言处理模型的应用 (TensorFlow applications for natural language processing models) 议题 (Session)

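It is often said that natural language processing is the crown jewel of artificial intelligence. From a linguistic standpoint, NLP research covers stemming, lemmatization, word segmentation, part-of-speech tagging, named entity recognition, word sense disambiguation, syntactic parsing, discourse analysis, and more. Built on these foundations, text processing applications include machine translation, text summarization, sentiment classification, question answering systems, and chatbots. The models in use are also evolving rapidly, from the original RNN to GRU, LSTM, CW-RNN, Seq2Seq, and the addition of attention mechanisms, and from static unrolling to today's dynamic unrolling and even seqGAN. What characterizes each NLP model? Beyond going bidirectional and deepening the network, what patterns does their evolution follow, and which technical difficulties was each step meant to solve? Which research directions might basic NLP models take next? What trends and reusable lessons are there in processing and representing sequential data? Let's talk through these topics together.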

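李成华 is chairman and CTO of 飔拓 (Stormor) and was previously chief scientist of JD.com's deep neural network (deep learning) lab. Holds a PhD in data mining and machine learning from Chonbuk National University in South Korea, and was a postdoctoral researcher at 圣西维尔大学 and York University in Canada and a visiting scientist at the MIT Media Lab. Previously served as a data mining expert at Hisense Group's State Key Laboratory, responsible for R&D in hardware intelligence innovation and data mining. Dr. Li has decades of research and work experience in machine learning, particularly neural networks, and in data mining, has published more than 30 papers in leading international journals such as Expert Systems with Applications, Information Processing and Management, and Neurocomputing, and holds dozens of patents.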

Presentations

智能聊天机器人技术和应用 (Technologies and applications of intelligent chat robots) 议题 (Session)

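As technology and the market surge ahead, AI is becoming an important gateway through which data, services, and products enter people's lives. Chatbots have gradually worked their way into daily life as they evolved, from virtual assistants on phones to live online customer service, and their path of development has not been short. A major pain point of traditional chatbots is a poor interactive experience and a low level of intelligence. Research into deep learning, natural language processing, short-text processing, and big data can improve a chatbot's answer accuracy and make consultations more efficient. This talk shares the core design ideas behind intelligent chatbots, how to implement them with deep learning, natural language processing, knowledge graphs, and user profiling, the main technical framework used to build chatbots with deep learning, and some of the unique problems encountered along with their solutions.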

Senior engineer at Jinri Toutiao, responsible for the entire Spark platform.

Presentations

Spark在今日头条的实践 (Spark in JinRi TouTiao) 议题 (Session)

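This talk describes how Jinri Toutiao uses Spark to process massive amounts of data, along with some improvements made in the course of real-world use.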

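李银辉 has extensive experience in distributed systems development and is currently responsible for platform infrastructure development for the big data platform at Wanda Network Technology Group. Previously worked at China Telecom, using components of the Hadoop ecosystem to handle ETL for China Telecom Group's massive data.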

Presentations

ShadowMask: 脱敏你的敏感的大数据 (ShadowMask: Anonymize your sensitive big data) 议题 (Session)

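Data security is an essential feature of any big data platform, and preventing the leakage of sensitive user information is one of its greatest challenges. ShadowMask is an open source data anonymization project built on the Spark big data platform. It meets big data users' need to anonymize private user data and balances the risk of privacy leakage against data processing requirements. This talk covers the project's goals, architecture, challenges, use cases, and current status.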

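Currently responsible for building infrastructure for large-scale deep learning algorithms on the large-scale algorithms team at Alibaba Cloud iDST, with a fairly deep understanding of the development, construction, and optimization of large-scale distributed machine learning and its application in different business scenarios. Previously an architect in the advertising technology department at Qihoo 360, and technical lead for the performance advertising system at Yahoo's Beijing R&D center.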

Presentations

Pluto:一款分布式异构深度学习框架 (Pluto: A distributed heterogeneous deep learning framework) 议题 (Session)

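This talk introduces Pluto, a distributed deep learning framework developed by the Alibaba Cloud iDST PAI team. In Pluto, the PAI team deeply optimized and customized the distributed performance of two open source frameworks, Caffe and TensorFlow, achieving significant performance gains over the unoptimized versions, including a 10x improvement in convergence speedup in some scenarios. Pluto has been applied successfully to core modeling scenarios across the group, including security, financial risk modeling, ID document image recognition, customer service Q&A, and machine translation, significantly improving the efficiency of modeling iterations.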

杨军 (Yang Jun), 阿里巴巴 专家见面会 (Meet the Experts)

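Topics for discussion: 1. Technical progress in large-scale machine learning. 2. Real-world applications of large-scale machine learning at different companies. 3. Applications of DL and RL beyond image and speech.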

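Holds a graduate degree from the University of Science and Technology of China and is a senior manager in Lenovo's big data product R&D department, responsible for big data product architecture and algorithm research. Previously worked on data mining R&D at Xerox, Alibaba, Huawei, Baidu, and Wanda E-commerce, applying machine learning and pattern recognition to image processing, e-commerce, search and recommendation, knowledge graphs, and retail.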

Presentations

多视图建模与半监督学习:应用于海量用户数据挖掘与行为分析 (Multiview modeling and semisupervised learning applied to massive user data mining and behavior analysis) 议题 (Session)

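When personal information cannot be collected directly, companies need to predict specific user attributes (such as gender, occupation, education, purchasing power, age, and other life-stage states) from user behavior data. (Goal) Supervised machine learning algorithms have been used for this, but with tens or even hundreds of millions of users and tens of billions or more behavior records, the amount of labeled data must reach a certain scale before machine learning works well, and obtaining labeled data is extremely costly. (Difficulty) In practice, we model users from multiple angles to construct different views of the user data, choose a suitable machine learning algorithm for each view, and apply co-training, a semi-supervised learning algorithm, in which the algorithms on the different views are trained collaboratively. With only a very small amount of labeled data, this achieves good results in predicting user attributes. (Method)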
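The co-training idea in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the speakers' system: the two views (`browsing`, `purchases`), the user ids, the nearest-centroid classifier, and the shared-pool variant of co-training are all hypothetical choices made to keep the example small.

```python
from statistics import mean

def fit_centroids(values, labels):
    """Nearest-centroid model on one view: the mean feature value per class."""
    return {c: mean(v for v, y in zip(values, labels) if y == c) for c in set(labels)}

def predict(centroids, value):
    """Return (label, margin); a larger margin means a more confident prediction."""
    ranked = sorted(centroids, key=lambda c: abs(value - centroids[c]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(value - centroids[runner_up]) - abs(value - centroids[best])
    return best, margin

def co_train(views, labeled, rounds):
    """Co-training with a shared label pool.

    views:   list of dicts, each mapping user id -> feature value in that view
    labeled: dict mapping user id -> known label (must contain both classes)
    In each round, the model of every view labels the unlabeled user it is
    most confident about, and that pseudo-label becomes training data for
    the models of all views in later steps.
    """
    labeled = dict(labeled)  # do not mutate the caller's seed labels
    for _ in range(rounds):
        for view in views:
            unlabeled = [u for u in view if u not in labeled]
            if not unlabeled:
                return labeled
            ids = list(labeled)
            model = fit_centroids([view[u] for u in ids], [labeled[u] for u in ids])
            most_confident = max(unlabeled, key=lambda u: predict(model, view[u])[1])
            labeled[most_confident] = predict(model, view[most_confident])[0]
    return labeled

# Hypothetical data: two views of six users, only two of whom are labeled.
browsing = {"u1": 0.1, "u2": 0.9, "u3": 0.2, "u4": 0.8, "u5": 0.15, "u6": 0.85}
purchases = {"u1": 1.0, "u2": 5.0, "u3": 1.2, "u4": 4.8, "u5": 1.1, "u6": 5.2}
seed_labels = {"u1": 0, "u2": 1}

result = co_train([browsing, purchases], seed_labels, rounds=2)
print(sorted(result.items()))
```

In a real setting each view would use whichever learner suits it, and pseudo-labels would normally be accepted only above a confidence threshold so that early mistakes do not propagate.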

王振华 (Zhenhua Wang) is a research engineer at Huawei Technologies, where he works on building a big data analytics platform based on Apache Spark. He holds a PhD in computer science from Zhejiang University. His research interests include information retrieval and web data mining.

Presentations

基于成本的Spark SQL优化器框架 (A cost-based optimizer framework for Spark SQL) 议题 (Session)

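We contributed a cost-based optimizer framework to the community release Spark 2.2. In our framework, we compute the cardinality and output size of each database operator. With reliable statistics and accurate estimates, we can make good decisions in areas such as choosing the right build side for a hash join, choosing the right join algorithm (for example broadcast hash join versus shuffled hash join), and adjusting join order. The cost-based optimizer framework significantly improves the performance of Spark SQL queries. In this talk, we present Spark SQL's new cost-based optimizer framework and its performance impact on TPC-DS queries.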
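A toy cost model shows how cardinality and size estimates can drive these decisions. The sketch below is not Spark's implementation: `PlanStats`, `estimate_filter`, `choose_join`, the table sizes, the 1% selectivity, and the 10 MB broadcast threshold are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlanStats:
    """Estimated statistics for the output of one operator."""
    row_count: int
    avg_row_bytes: int

    @property
    def size_in_bytes(self) -> int:
        return self.row_count * self.avg_row_bytes

def estimate_filter(stats: PlanStats, selectivity: float) -> PlanStats:
    """Propagate statistics through a filter with an assumed selectivity."""
    return PlanStats(max(1, round(stats.row_count * selectivity)), stats.avg_row_bytes)

def choose_join(left: PlanStats, right: PlanStats,
                broadcast_threshold_bytes: int = 10 * 1024 * 1024):
    """Pick a join algorithm and build side from the estimated sizes.

    The smaller input becomes the build side; if it also fits under the
    broadcast threshold, it is broadcast instead of shuffled.
    """
    build_side = "left" if left.size_in_bytes <= right.size_in_bytes else "right"
    smaller = min(left.size_in_bytes, right.size_in_bytes)
    if smaller <= broadcast_threshold_bytes:
        return "broadcast hash join", build_side
    return "shuffled hash join", build_side

orders = PlanStats(row_count=100_000_000, avg_row_bytes=100)   # about 10 GB
customers = PlanStats(row_count=1_000_000, avg_row_bytes=200)  # about 200 MB

# Without a filter, the smaller side is still too large to broadcast.
print(choose_join(orders, customers))

# A filter keeping an assumed 1% of customers shrinks the estimate to about 2 MB.
filtered_customers = estimate_filter(customers, selectivity=0.01)
print(choose_join(orders, filtered_customers))
```

The point of the example is that the plan changes only because the estimate after the filter changes, which is why the accuracy of the statistics matters so much to a cost-based optimizer.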

Head of big data projects at China Life, with extensive experience implementing data service projects.

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

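Over the years China Life has accumulated a large amount of data. Mining the value of that data in depth, for use in business development, risk management, customer service, and other areas, is the main goal of our data department. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge advanced machine learning and deep learning techniques, and we share the architecture of our machine learning system, our application-building process, and the lessons we learned.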

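莫云 is a data engineer on the data team at Yirendai, where he was responsible for technology selection for the real-time query engine and for setting up the Impala framework; that project now supports real-time interactive queries across the company. The knowledge graph he created has accumulated 170 million nodes and 1 billion relationships and has been applied successfully to anti-fraud. Before Yirendai, 莫云 worked at Sohu Changyou, where he was responsible for building Hadoop and the data warehouse for the advertising platform department.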

Presentations

SDK + FinGraph + Go:用一手行为数据和图谱信息创造商业价值 (SDK + FinGraph + Go: Create business value with firsthand user behavior data and knowledge graph information) 议题 (Session)

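Now that the traffic dividend of the mobile internet has passed, how can we mine firsthand mobile data in depth, respond to user needs in real time, and create business value through user behavior and knowledge graph technology? Using concrete business cases, we share an SDK + FinGraph + Go technical framework. With this framework, a single line of code embeds the SDK in an app. A real-time/near-real-time upload mechanism and real-time processing and analysis with Flume + Kafka capture user intent. Spark Streaming stream processing, HBase key-value query output, and association and storage on a Neo4j cluster mine graph information. Finally, a base platform developed efficiently in Go, a Python connection to the automated submission back end, event recognition with scikit-learn, and graph relationship mining with Cypher predict user intent and guide user behavior, creating business value from real-time data.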

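Works in the big data department at Intel's Asia-Pacific R&D center, with many years as a senior R&D engineer on development and optimization in security and big data. Currently an R&D manager whose team has been heavily involved in and contributed to many projects in the Hadoop and streaming space. A keen open source contributor, an Apache Hadoop committer, an Apache Directory PMC member, and a key initiator of Apache Kerby.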

Kai Zheng is a big data engineering manager at Intel, where he explores broad enablement and optimization on the company’s IA platform. He has worked in the big data space for a number of years across the security, storage, and computing domains. Kai is also an Apache Hadoop committer, a Kerby initiator, and a major contributor to HDFS erasure coding.

Presentations

HDFS纠删码最新探秘 (Demystifying erasure coding in HDFS) 议题 (Session)

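Hadoop 3.0 introduces erasure coding. In common configurations, erasure coding can reduce storage costs by 50% compared with traditional 3x replication while also improving data reliability. In this talk, we first briefly introduce HDFS erasure coding, then take a close look at the work we did before Hadoop 3.0 GA to ensure the stability of the erasure coding feature, and share how important members of the Hadoop ecosystem such as Spark, Hive, Impala, and Kylin perform on HDFS erasure coding. Finally, we offer some considerations and recommendations for deploying erasure coding in production.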
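The 50% figure can be checked with simple arithmetic. The abstract does not say which configuration it has in mind; the sketch below assumes a Reed-Solomon layout with 6 data units and 3 parity units, and the function names are made up for this example.

```python
from fractions import Fraction

def replication_overhead(replicas: int) -> Fraction:
    """Raw bytes stored per byte of user data under n-way replication."""
    return Fraction(replicas)

def erasure_coding_overhead(data_units: int, parity_units: int) -> Fraction:
    """Raw bytes stored per byte of user data under an RS(data, parity) layout."""
    return Fraction(data_units + parity_units, data_units)

def storage_saving(baseline: Fraction, candidate: Fraction) -> Fraction:
    """Fraction of raw storage saved by moving from baseline to candidate."""
    return 1 - candidate / baseline

three_way = replication_overhead(3)      # 3 bytes stored per user byte
rs_6_3 = erasure_coding_overhead(6, 3)   # 9 units stored per 6 data units
saving = storage_saving(three_way, rs_6_3)

# Losses each scheme can tolerate before data becomes unrecoverable.
replication_tolerated_losses = 3 - 1     # any 2 of the 3 copies
rs_6_3_tolerated_losses = 3              # any 3 of the 9 units

print(f"3x replication overhead: {float(three_way):.1f}x")
print(f"RS(6,3) overhead:        {float(rs_6_3):.1f}x")
print(f"storage saving:          {float(saving):.0%}")
```

Under this assumption the erasure-coded layout stores half as many raw bytes while tolerating one more lost unit than 3x replication tolerates lost copies, which is consistent with the cost and reliability claims in the abstract.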

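陈雨强 is cofounder and chief research scientist of 4Paradigm.

A world-class expert in deep learning and transfer learning.
At Baidu, he led the world's first commercial deep learning system; at Jinri Toutiao, he led the design and implementation of a brand-new feed recommendation and advertising system.

On the academic side, he has published papers at top conferences including NIPS, AAAI, ACL, and SIGKDD, won the APWeb2010 Best Paper Award, and placed in the top three of KDD Cup 2011. His academic work was covered in 2010 by MIT Technology Review, a leading international technology magazine.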

Presentations

人工智能工业应用痛点及解决思路 (Pain points in AI industrial applications and solutions) 议题 (Session)

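The power of AI has every industry taking notice, and how a company applies AI will greatly affect its position in the market. Yet AI technology that dominates in the lab often struggles to adapt once it is applied in practice. So what are the prerequisites for applying AI in industry? What are the pain points, and how can they be solved? How can breakthroughs be made at the system level, the model and feature level, the model dimensionality level, and the deployment level? For common difficulties in common scenarios, what cutting-edge techniques are at work? This talk shares the pain points the speaker has encountered in industrial AI applications in internet, finance, telecom, and other fields, and the approaches to solving them.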

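Data scientist at China Life, specializing in big data analytics, with a research focus on advanced data analysis methods and the principles of machine learning.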

Presentations

使用Spark/BigDL高级机器学习实现寿险业务再发现 (Reimplement life insurance services using Spark and BigDL advanced machine learning) 议题 (Session)

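Over the years China Life has accumulated a large amount of data. Mining the value of that data in depth, for use in business development, risk management, customer service, and other areas, is the main goal of our data department. We describe how China Life uses Spark and BigDL, a deep learning library on Spark, to build advanced analytics applications for insurance business scenarios. We tried a variety of cutting-edge advanced machine learning and deep learning techniques, and we share the architecture of our machine learning system, our application-building process, and the lessons we learned.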

马晓宇 is a tech lead at PingCAP, responsible for integrating TiDB with the big data ecosystem and for developing the MPP engine.

Presentations

Spark和TiDB (Spark on TiDB) 议题 (Session)

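SparkTI (Spark on TiDB) is a computing engine for TiDB that is built on Apache Spark and independent of the native system. It integrates Spark deeply with TiDB and, beyond the existing MySQL workload, uses Spark to support a wider variety of user scenarios and APIs. In addition to SparkSQL and the Catalyst engine, the project implements an extended SQL front end customized for TiDB (parser, planner, and optimizer): it understands how TiDB organizes data and knows how to use TiDB's own computing power to accelerate queries, rather than being merely a connector. With SparkTI, TiDB becomes part of the Hadoop ecosystem, bridging the gap between OLTP systems and offline analytics clusters.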

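Holds a doctorate in engineering and is general manager of the data center at China Guangfa Bank, an expert in banking IT development and risk management, and a recognized senior financial professional of Guangzhou; previously an IT expert at ICBC's Beijing data center. As a veteran expert in banking technology, Mr. Huang has extensive experience in infrastructure and operations, IT governance and management, and big data research and planning.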

Presentations

大数据时代银行客户社交关系圈研究与应用 (Research on and the application of a social relation circle of bank customers in the big data era) 议题 (Session)

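To deepen its insight into bank customers and improve its capabilities in marketing, customer acquisition, and risk control, China Guangfa Bank built a social relationship model of bank customers on its Hadoop big data platform. The data is processed with Hive on Spark and graph computing, and the model combines machine learning algorithms such as LFM community detection and boosted decision trees. The model uncovers customers' social relationship circles, which are applied in the bank's actual business. These circles give a comprehensive picture of individual customers' financial and social relationships, extend the bank's customer insight from single points to the whole and from individual customers to customer groups, and fill a gap in research on and application of the social relationships of individual bank customers.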

Contact OReillyData

Follow the OReillyData WeChat account for the latest conference information and cutting-edge data articles.
