Presented by O'Reilly and Cloudera
Make Data Work
July 12-13, 2017: Training
July 13-15, 2017: Conference
Beijing, China

The evolution of CTR prediction systems, from LR to DNN

This will be presented in Chinese

吴炜 (万达网络研究院)
16:20–17:00 Friday, 2017-07-14
Data science & advanced analytics
Location: Auditorium  Level: Intermediate

Prerequisite Knowledge

DNN, logistic regression, computational advertising

What you'll learn

Combining deep learning with traditional machine learning (to be completed)

Description

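Click-through rate (CTR) prediction for advertising is a hot topic. Many computational advertising companies run their own CTR prediction systems, and the techniques involved range from relatively simple logistic regression (LR) to the Wide & Deep Learning model recently proposed by Google. The theme of this talk is how to improve a CTR prediction system in a stable, controlled way, and what needs to be done about data, architecture, and algorithms at each point in time.

Drawing on work experience at MediaV (now the 360 Business Product Group) and Alimama, the talk reviews how a mature CTR prediction system evolves from an initial, plain ETL+LR setup into an online system with strong cash-flow returns that includes online model training, auto-baddit, and automated large-scale feature exploration. Beyond presenting the end result of that evolution, it also explains the reasoning behind the technical direction chosen at several key points, given the circumstances at the time. Combining the body of knowledge in machine learning and deep learning with the developments of the past two years, and using several well-known industry application scenarios as the thread and a few key milestones as examples (the launch and takedown of "A Thousand People, A Thousand Faces" (千人千面) and the year-by-year evolution of Singles' Day (双11)), the talk surveys traffic monetization systems built on advertising and recommendation and introduces relevant topics in large-scale machine learning and distributed optimization. The aim is to give attendees a set of historical case studies to consult when choosing techniques for machine learning and deep learning problems in their own business.

Finally, given the spreading adoption of IoT technology, the talk introduces current techniques for audience targeting based on mobile devices plus face recognition, and the work built on them to connect outdoor display advertising (curtain walls, lightboxes, in-mall screens) to a bidding-based advertising system and, together with apps, build an O2O marketing platform.

The talk mainly covers the following topics:

  • GBDT+FTRL fusion model
    This technique was first proposed by Facebook in the 2014 paper "Practical Lessons from Predicting Clicks on Ads at Facebook." The idea is roughly as follows: train a GBDT model on base features (usually statistical features). As each sample passes through each tree of the GBDT model, it lands in one leaf node, which yields an intermediate feature. All of these intermediate features are fed, together with other ID-type features, into an LR model for CTR prediction. When there are very many base features (tens of thousands), even second-order combinations alone offer hundreds of millions of choices, so picking combination dimensions from human prior knowledge is essentially impossible at Alibaba's current scale. GBDT supplies potentially meaningful combination dimensions as input to the downstream model.
  • Fighting sparsity: FM and FFM
    Using combination dimensions is very likely to run into sparsity. Training an FM model yields embeddings, and what the FM has learned is then fed to the DNN as its input for the combinations.
  • Wide and Deep Learning model
    This model combines a discrete LR with a deep neural network (DNN). Categorical features are fed to the DNN through embeddings, while other features are learned through LR. The LR part uses feature crosses to capture specific scenarios at a fine grain, the DNN part emphasizes generalization, and combining the two aims for better results.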
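
As a rough illustration of the GBDT+FTRL item above, the sketch below shows only the feature-transformation step: each tree maps a sample to one leaf, the leaf index is one-hot encoded, and the encodings are concatenated with an ID-type feature to form the sparse input of a linear model. The two hand-written trees, the feature names (`ctr_7d`, `impressions_7d`, `ad_id`), and the thresholds are hypothetical stand-ins for a trained GBDT and are not taken from the talk. In a real system the trees would be learned from data and the LR weights would be fit separately, for example online with FTRL.

```python
# Hypothetical stand-ins for trained GBDT trees: each function maps a sample's
# base (statistical) features to the index of the leaf it lands in.
def tree_0(x):
    # Splits on a made-up 7-day CTR statistic, then on impressions.
    if x["ctr_7d"] < 0.02:
        return 0
    return 1 if x["impressions_7d"] < 1000 else 2


def tree_1(x):
    # A single made-up split on impressions.
    return 0 if x["impressions_7d"] < 500 else 1


TREES = [(tree_0, 3), (tree_1, 2)]  # (tree, number of leaves)
AD_IDS = ["ad_a", "ad_b", "ad_c"]   # vocabulary of a hypothetical ID-type feature


def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec


def gbdt_lr_features(x):
    """Concatenate one-hot leaf indices from every tree with a one-hot ad ID."""
    features = []
    for tree, n_leaves in TREES:
        features += one_hot(tree(x), n_leaves)
    features += one_hot(AD_IDS.index(x["ad_id"]), len(AD_IDS))
    return features


sample = {"ctr_7d": 0.05, "impressions_7d": 800, "ad_id": "ad_b"}
features = gbdt_lr_features(sample)
print(features)
```

Each sample activates exactly one leaf per tree, so the resulting vector is sparse: its number of nonzero entries equals the number of trees plus the number of ID-type features.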
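
To make the FM item above more concrete, here is a minimal sketch of how a factorization machine scores second-order interactions. Each feature i has a k-dimensional embedding, and the weight of a pair (i, j) is the dot product of the two embeddings, so a pair that rarely or never co-occurs in training still receives a weight. The embedding values and function names are made up for illustration; a real model would learn the embeddings and would also include bias and first-order terms.

```python
def fm_pairwise(x, v):
    """Second-order FM term: sum over i<j of <v[i], v[j]> * x[i] * x[j].

    Uses the standard O(n*k) identity
    0.5 * sum_f ((sum_i v[i][f]*x[i])**2 - sum_i (v[i][f]*x[i])**2).
    """
    k = len(v[0])
    total = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += 0.5 * (s * s - s_sq)
    return total


def fm_pairwise_naive(x, v):
    """Same quantity computed pair by pair, used here only as a cross-check."""
    n = len(x)
    return sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )


# Hypothetical 2-dimensional embeddings for 4 one-hot features.
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.0]]
x = [1, 0, 1, 1]  # features 0, 2, and 3 are active

fast = fm_pairwise(x, V)
slow = fm_pairwise_naive(x, V)
print(fast, slow)
```

The rows of `V` are the embeddings that could then be passed on to a DNN as a dense representation of the sparse features.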


Prediction of CTR (click-through rate) is a hot topic. Many computational advertising companies have their own CTR prediction systems, which draw on techniques ranging from the relatively simple logistic regression (LR) to the Wide & Deep Learning model recently proposed by Google. Drawing on firsthand experience at MediaV (now 360 Business Product Group) and Alimama, 吴炜 explains how to improve CTR prediction systems in a stable, controlled way, covering how they have evolved from simple ETL+LR systems to very profitable online systems that include online model training, auto-baddit, and automated large-scale feature exploration.

Using a few well-known industry scenarios and several real-world examples, such as “A Thousand People, A Thousand Faces” and the yearly evolution of Alibaba's Singles’ Day, 吴炜 walks through traffic monetization based on advertising and recommender systems and discusses large-scale machine learning, distributed optimization, audience targeting based on mobile devices and face recognition, and how to integrate outdoor screen advertising (e.g., billboards, lightboxes, and screens inside malls) into bidding-based advertising systems and build an O2O marketing platform together with apps.

Topics include:

  • GBDT+FTRL: This technique was first introduced by Facebook in 2014 in the paper “Practical Lessons from Predicting Clicks on Ads at Facebook.” The basic idea is to train a GBDT model on base features (normally statistical features). As each sample passes through each tree of the GBDT model, it falls into one leaf node, which generates an intermediate feature. All intermediate features are combined with other ID-type features to form the input of an LR model, which then performs CTR prediction. GBDT thus provides potentially meaningful feature combinations as input to the downstream model.
  • Working against sparsity: FM and FFM. Feature combinations are very likely to introduce sparsity. One approach is to train an FM model to obtain embeddings and then feed what the FM has learned to the DNN as its input for feature combinations.
  • Wide and deep learning models: These models combine a discrete LR with a deep neural network (DNN). Categorical features are fed to the DNN through embeddings, and other features are learned via LR. The LR part captures fine-grained, scenario-specific patterns via feature crosses, while the DNN part emphasizes generalization. Combining the two aims to produce better results.
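
The wide and deep item above can be sketched as a single forward pass: a wide linear part over sparse features and their cross, a deep part that embeds categorical features and passes them through a hidden layer, and one sigmoid over the sum of the two logits. Every feature name and weight below is a made-up constant chosen for illustration, not a value from the talk; a real model would learn both parts jointly and would be far larger.

```python
import math

# All weights below are made-up constants for illustration only.
WIDE_WEIGHTS = {                      # weights for sparse and crossed features
    "gender=f": 0.1,
    "category=shoes": 0.2,
    "gender=f&category=shoes": 0.5,   # cross feature: memorizes one specific scenario
}
EMBEDDINGS = {                        # 2-dimensional embedding per categorical value
    "gender=f": [0.3, -0.1],
    "gender=m": [-0.2, 0.4],
    "category=shoes": [0.5, 0.2],
    "category=books": [0.0, -0.3],
}
HIDDEN_W = [[0.4, -0.2, 0.1, 0.3],    # one hidden layer: 4 inputs -> 2 units
            [-0.1, 0.5, 0.2, -0.4]]
HIDDEN_B = [0.0, 0.1]
OUT_W = [0.6, -0.3]
BIAS = -1.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def wide_logit(gender, category):
    """Linear model over the raw one-hot features plus their cross product."""
    active = [gender, category, gender + "&" + category]
    return sum(WIDE_WEIGHTS.get(name, 0.0) for name in active)


def deep_logit(gender, category):
    """Embed each categorical feature, concatenate, apply one ReLU layer."""
    inputs = EMBEDDINGS[gender] + EMBEDDINGS[category]
    hidden = [
        max(0.0, sum(w * a for w, a in zip(row, inputs)) + b)
        for row, b in zip(HIDDEN_W, HIDDEN_B)
    ]
    return sum(w * h for w, h in zip(OUT_W, hidden))


def predict_ctr(gender, category):
    """Add the wide and deep logits and the bias, then apply one sigmoid."""
    return sigmoid(wide_logit(gender, category) + deep_logit(gender, category) + BIAS)


p_seen = predict_ctr("gender=f", "category=shoes")    # the cross feature has a weight
p_unseen = predict_ctr("gender=m", "category=books")  # no wide weight fires here
print(p_seen, p_unseen)
```

For the second call no wide weight applies, so the score comes entirely from the embeddings and the bias, which is the generalization role the item assigns to the DNN part.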

吴炜

万达网络研究院

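A machine learning veteran, 吴炜 was previously a senior algorithm engineer at MediaV and an algorithm expert in the Taobao technology department at Alibaba, and is now a senior researcher at 万达网络研究院. 吴炜 has extensive experience applying machine learning and optimization algorithms to computational advertising and recommender systems, and some exposure to big-data-based anti-fraud and credit risk assessment.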
