Presented by O'Reilly and Cloudera
Make Data Work
July 12-13, 2017: Training
July 13-15, 2017: Tutorials & Conference
Beijing, China

In-Person Training
Data science essentials: Examples from internet finance - Quantifying credit and fraud risks online

Jike Chong (Tsinghua University | Acorns)
09:00–17:00 Wednesday, 2017-07-12
Data science & advanced analytics
Location: Function Room 6
Level: Intermediate
Average rating: 4.50 (2 ratings)

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Thursday.

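Do you want to understand the quantitative analysis process behind internet finance? How is personal credit quantified with big data? What should practitioners watch for when applying machine learning algorithms in practice? How can graph analysis fuse multidimensional data to help distinguish normal users from fraudulent ones? This training is based on "Quantitative Financial Credit and Risk Analysis," a graduate course newly offered in spring 2017 at Tsinghua University's Institute for Interdisciplinary Information Sciences. It uses real lending data from Lending Club as a case study to explain how specific models are implemented.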

What you'll learn, and how you can apply it

  • Explore financial creditworthiness assessments—development areas, real technical challenges, and practical data science solutions

This training is for you because...

  • You're a practitioner in data science who is assessing your customers' trustworthiness and financial value.
  • You're a practitioner in the financial field who is learning about rigorous data science workflows and considerations.
  • You're a data scientist or data mining engineer who faces specific modeling challenges, such as incremental data streams and imbalanced datasets, and you want to learn effective techniques and practices.

Prerequisites:

  • Experience in data science or data analysis

Hardware and/or installation requirements:

  • A laptop

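Day 1 (morning):
1. Overview of the consumer credit industry

  • What is credit?
  • Overview of the unsecured lending industry
  • Risks in unsecured lending
  • Financial product design

2. Data properties and assessment metrics

  • The current state of credit scoring in China and the US
  • Information sources: identity verification plus ability/willingness to repay, personal device information, and personal online/offline behavioral information
  • Risk-control terminology and assessment metrics
  • Challenges in acquiring data sources
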
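Day 1 (afternoon):
3. Data collection and feature extraction

  • Data source selection
    • Strength of credit and financial attributes, frequency of data generation, and how well the data reflects ability/willingness to repay
  • Feature mining
    • Feature mining and evaluation of effectiveness/stability
    • Feature combination
    • Transfer learning, active learning, and representation learning
  • Applications of knowledge graphs
    • Defining entities and relationships
    • Technical implementation with graph databases
    • Graph mining with Cypher
    • Examples of community detection algorithms
  • Device fingerprinting

4. Credit and fraud labeling

  • Challenges in obtaining labels
    • High cost, long cycles, and varying definitions
  • Credit labeling
    • Models for early-stage products and for mature products
  • Fraud labeling
    • Five levels of fraud labeling
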
Day 2 (morning):
5. Building credit and fraud models

  • Incremental learning
    • Static windowing approach
    • Updating approach
    • Forgetting genuine approach
  • Handling imbalanced data:
    • Random oversampling and undersampling
    • Informed undersampling
    • Synthetic sampling with data generation
    • Adaptive synthetic sampling
    • Sampling with data cleaning techniques
  • Modeling strategies
    • Linear regression
    • GBT
    • Deep learning
    • Ensembles
  • Result assessment
    • Confusion matrix
    • Ranking metrics
    • ROC curve
    • PR curve
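
As a rough illustration of the "Random oversampling and undersampling" item in the list above, here is a minimal Python sketch using only the standard library. It is not taken from the course materials: the function names, the toy data (8 good loans, 2 defaults), and the choice to balance classes to exactly equal counts are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until every class matches the minority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    n_min = counts[minority]
    keep = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Keep all minority rows; sample the same number from other classes.
        keep.extend(idx if cls == minority else rng.sample(idx, n_min))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate smaller-class rows until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    n_max = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Draw with replacement to fill the gap to the majority count.
        extra = [rng.choice(idx) for _ in range(n_max - n)]
        out_rows.extend(rows[i] for i in extra)
        out_labels.extend(cls for _ in extra)
    return out_rows, out_labels


# Hypothetical toy data: 8 repaid loans (label 0) and 2 defaults (label 1).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

_, under_y = random_undersample(rows, labels)
_, over_y = random_oversample(rows, labels)
print("undersampled class counts:", Counter(under_y))
print("oversampled class counts:", Counter(over_y))
```

Undersampling discards information from the majority class, while oversampling repeats minority rows and can encourage overfitting; the informed and synthetic methods listed above are usually motivated by those two weaknesses.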

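Day 2 (afternoon):
6. Business decisions and assessment:

  • Setting interest rates and credit limits
  • Profitability assessment

7. The underground fraud economy

  • An overview of the underground fraud value chain
  • The trade-off between security and user experience
  • Response strategies

8. Industry case studies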


What is the quantitative risk assessment framework behind the $564B consumer lending industry? How is an individual’s trustworthiness assessed online? What are the specific areas of concern in data and modeling that need to be addressed? How can we leverage a knowledge graph to integrate information from a variety of data sources to distinguish normal and risky users?

Based on a rigorous graduate-level course that he developed and taught in 2017 at Tsinghua University, Jike Chong offers an overview of the financial industry’s data properties and assessment metrics, data collection and feature extraction techniques, labeling of risky users, construction of a credit or fraud risk model, business decisioning based on risk assessments, and a snapshot from the fraud landscape. Jike uses real industry data (from Lending Club) to illustrate some of the technical concepts, along with hands-on examples and exercises.

Outline

Day 1

An industry overview

  • What is trust and trustworthiness?
  • Overview of the unsecured lending industry
  • Various categories of risks
  • Domains of concerns for a credit-driven financial product

Data properties and assessment metrics

  • US and Chinese credit scoring systems
  • Categories of data sources for identity verification and assessments of ability and willingness to repay
  • Risk terminology introduction and risk assessment metrics
  • Data acquisition challenges

Demo and exercise with Lending Club data to compute risk assessment metrics

Data collection and feature extraction

  • Data source selection
  • Feature mining techniques
    • Depth of mining
    • Feature effectiveness evaluation
    • Feature stability
    • Transfer learning versus active learning versus representation learning
  • The use of knowledge graphs and graph mining
    • Nodes, edges, and properties
    • Applicable graph database technologies
    • Cypher for graph mining
    • Samples of graph algorithms
  • Device fingerprinting
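
To make the knowledge-graph items above more concrete, the following is a small Python sketch of one simple graph-mining step: grouping users who are linked through shared attributes such as a device fingerprint or phone number. The course outline names Cypher and graph databases for this; the sketch instead uses a plain union-find structure as a stand-in, and the node names, edge list, and `connected_groups` helper are hypothetical.

```python
from collections import defaultdict


def connected_groups(edges):
    """Group node ids into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Largest groups first; ties broken by the sorted node names.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Hypothetical edges: each links a user node to an attribute node it was seen with.
edges = [
    ("user:A", "device:d1"),
    ("user:B", "device:d1"),
    ("user:B", "phone:p9"),
    ("user:C", "phone:p9"),
    ("user:D", "device:d2"),
]

groups = connected_groups(edges)
user_groups = [[n for n in g if n.startswith("user:")] for g in groups]
print(user_groups)
```

Here users A, B, and C end up in one group because A and B share a device and B and C share a phone number. In a fraud setting, an unusually large group of this kind might be a signal worth reviewing, though connected components are only the simplest of the graph algorithms the outline alludes to.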

Credit and fraud labeling

  • Challenges in obtaining the labels
  • Credit risk data label production
    • New and mature product risk label production
  • Fraud risk data label production
    • Five levels of fraud risk labeling

Day 2

Construction of a credit or fraud risk model

  • Sampling with incremental learning
    • Static windowing approach
    • Updating approach
    • Forgetting genuine approach
  • Class-imbalance challenge:
    • Random oversampling and undersampling
    • Informed undersampling
    • Synthetic sampling with data generation
    • Adaptive synthetic sampling
    • Sampling with data cleaning techniques
  • Modeling approach
    • Linear regression
    • GBT
    • Deep learning
    • Ensembles
  • Result assessments
    • Confusion matrix
    • Ranking metrics
    • ROC curve
    • PR curve
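
The result-assessment items above can be sketched in a few lines of Python. This is an illustrative example rather than the course's own exercise: the labels and scores are made-up toy values, the 0.5 threshold is arbitrary, and AUC is computed with the pairwise-ranking definition (the share of positive/negative pairs where the positive scores higher, ties counted as half).

```python
def confusion_matrix(y_true, scores, threshold):
    """Return (tp, fp, fn, tn), predicting the positive class when score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = default or fraud, 0 = good; scores are model risk scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.9]

tp, fp, fn, tn = confusion_matrix(y_true, scores, threshold=0.5)
precision = tp / (tp + fp)  # one point on the PR curve
recall = tp / (tp + fn)     # also the true positive rate on the ROC curve
print("confusion matrix (tp, fp, fn, tn):", (tp, fp, fn, tn))
print("precision:", precision, "recall:", recall)
print("ROC AUC:", roc_auc(y_true, scores))
```

Sweeping the threshold and recording (false positive rate, recall) pairs traces the ROC curve, while (recall, precision) pairs trace the PR curve. With heavily imbalanced fraud data, the PR curve tends to be the more informative of the two.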

Business decisions

  • Pricing and limits
  • Product profitability assessments

The dark side

  • The value chain of the fraud underworld
  • The tension between security and UX
  • Possible responses

Further examples from the industry

About your instructor

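Dr. Jike Chong is a visiting professor at Tsinghua University and chief data scientist at Yirendai (NYSE: YRD), where his data science team supports anti-fraud risk control and data-driven operations and innovation. Previously, he worked at the US job search platform Simply Hired, where he established the data science department, and was invited to advise the White House's science and technology office on the design of big data technology products. He also served as data science architect for the Kraftwerk fund at Silver Lake, a US private equity firm, where he was responsible for applying big data technology to risk control in private equity investing. Jike was formerly a professor and PhD advisor at Carnegie Mellon University. He holds a PhD in electrical engineering and computer sciences from the University of California, Berkeley, master's and bachelor's degrees in electrical and computer engineering from Carnegie Mellon University, and 9 patents (5 granted, 4 pending).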

Conference registration

Get the Platinum pass or the Training pass to add this course to your package.
