新闻中心

【金融风控系列】_[3]_贷款违约识别

2025-07-22
浏览次数:
返回列表
本文围绕Kaggle的Home Credit Default Risk赛题展开,利用客户申请表等7张表数据构建模型预测客户还款能力。通过数据清洗、特征工程,融合多表信息生成衍生特征,经LightGBM模型训练,最终线上评分为0.78277,为信用记录不足人群的贷款评估提供参考。

☞☞☞AI 智能聊天, 问答助手, AI 智能搜索, 免费无限量使用 DeepSeek R1 模型☜☜☜

【金融风控系列】_[3]_贷款违约识别 -

Home Credit Default Risk(家庭信用违约风险)

该赛题来自 KAGGLE,仅用作学习交流


由于信用记录不足或不存在,许多人往往被划分为低信用借贷人而难以获得贷款。 为了确保这些人群获得贷款,Home Credit公司利用替代数据(包括电信和交易信息)预测客户的还款能力。

Home Credit一共提供了7张表,一共218个字段,其中训练集样本约31万(逾期8%),测试集样本约5万。


信息表

application_train/test 客户申请表

包含了

  • 目标变量(客户是否违约-0/1变量)
  • 客户申请贷款信息(贷款类型, 贷款总额, 年金)
  • 客户基本信息(性别, 年龄, 家庭, 学历, 职业, 行业, 居住地情况)
  • 客户财务信息(年收入, 房/车情况)
  • 申请时提供的资料等.

bureau/bureau_balance 由其他金融机构提供给征信中心的客户信用记录历史(月数据)

包含了客户在征信中心的

  • 信用记录,
  • 违约金额,
  • 违约时间等.

以时间序列(按行)的形式进行记录.

POS_CASH_balance 客户在Home Credit数据库中POS(point of sales)和现金贷款历史(月数据)

包含了客户

美图云修 美图云修

商业级AI影像处理工具

美图云修 50 查看详情 美图云修
  • 已付款情况
  • 未付款情况

credit_card_balance 客户在Home Credit数据库中信用卡的snapshot历史(月数据)

包含了客户

  • 消费次数
  • 消费金额

等情况.

previous_application 客户先前的申请记录

包含了客户所有历史申请记录(申请信息, 申请结果等).

installments_payments 客户先前信用卡的还款记录

包含了客户的还款情况

  • 还款日期
  • 是否逾期
  • 还款金额
  • 是否欠款等

参考:

[1] https://zhuanlan.zhihu.com/p/43541825

[2] https://www.kaggle.com/xucheng/cv-7993-private-score-7996/

[3] https://zhuanlan.zhihu.com/p/40790434

[4] https://www.kaggle.com/tahmidnafi/cse499

[5] https://blog.csdn.net/zhangchen2449/article/details/83338978

主要字段表

Field Description
SK_ID_CURR 此次申请的ID
TARGET 申请人本次申请的还款风险:1-风险较高;0-风险较低
NAME_CONTRACT_TYPE 贷款类型:cash(现金)还是revolving(周转金,一次申请,多次循环提取)
CODE_GENDER 申请人性别
FLAG_OWN_CAR 申请人是否有车
FLAG_OWN_REALTY 申请人是否有房
CNT_CHILDREN 申请人子女个数
AMT_INCOME_TOTAL 申请人收入状况
AMT_CREDIT 此次申请的贷款金额
AMT_ANNUITY 贷款年金
AMT_GOODS_PRICE 如果是消费贷款,改字段表示商品的实际价格
NAME_TYPE_SUITE 申请人此次申请的陪同人员
NAME_INCOME_TYPE 申请人收入类型
NAME_EDUCATION_TYPE 申请人受教育程度
NAME_FAMILY_STATUS 申请人婚姻状况
NAME_HOUSING_TYPE 申请人居住状况(租房,已购房,和父母一起住等)
REGION_POPULATION_RELATIVE 申请人居住地人口密度,已标准化
DAYS_BIRTH 申请人出生日(距离申请当日的天数,负值)
DAYS_EMPLOYED 申请人当前工作的工作年限(距离申请当日的天数,负值)
DAYS_REGISTRATION 申请人最近一次修改注册信息的时间(距离申请当日的天数,负值)
DAYS_ID_PUBLISH 申请人最近一次修改申请贷款的身份证明文件的时间(距离申请当日的天数,负值)
FLAG_MOBIL 申请人是否提供个人电话(1-yes,0-no)
FLAG_EMP_PHONE 申请人是否提供家庭电话(1-yes,0-no)
FLAG_WORK_PHONE 申请人是否提供工作电话(1-yes,0-no)
FLAG_CONT_MOBILE 申请人个人电话是否能拨通(1-yes,0-no)
FLAG_EMAIL 申请人是否提供电子邮箱(1-yes,0-no)
OCCUPATION_TYPE 申请人职务
REGION_RATING_CLIENT ben公司对申请人居住区域的评分等级(1,2,3)
REGION_RATING_CLIENT_W_CITY 在考虑所在城市的情况下,ben公司对申请人居住区域的评分等级(1,2,3)
WEEKDAY_APPR_PROCESS_START 申请人发起申请日是星期几
HOUR_APPR_PROCESS_START 申请人发起申请的hour
REG_REGION_NOT_LIVE_REGION 申请人提供的的永久地址和联系地址是否匹配(1-不匹配,2-匹配,区域级别的)
REG_REGION_NOT_WORK_REGION 申请人提供的的永久地址和工作地址是否匹配(1-不匹配,2-匹配,区域级别的)
LIVE_REGION_NOT_WORK_REGION 申请人提供的的联系地址和工作地址是否匹配(1-不匹配,2-匹配,区域级别的)
REG_CITY_NOT_LIVE_CITY 申请人提供的的永久地址和联系地址是否匹配(1-不匹配,2-匹配,城市级别的)
REG_CITY_NOT_WORK_CITY 申请人提供的的永久地址和工作地址是否匹配(1-不匹配,2-匹配,城市级别的)
LIVE_CITY_NOT_WORK_CITY 申请人提供的的联系地址和工作地址是否匹配(1-不匹配,2-匹配,城市级别的)
ORGANIZATION_TYPE 申请人工作所属组织类型
EXT_SOURCE_1 外部数据源1的标准化评分
EXT_SOURCE_2 外部数据源2的标准化评分
EXT_SOURCE_3 外部数据源3的标准化评分
APARTMENTS_*G EMERGENCYSTATE_MODE 申请人居住环境各项指标的标准化评分
OBS_30_CNT_SOCIAL_CIRC LE DEF_60_CNT_SOCIAL_CIRCLE 这部分字段含义没看懂
DAYS_LAST_PHONE_CHANGE 申请人最近一次修改手机号码的时间(距离申请当日的天数,负值)
FLAG_DOCUMENT_2 FLAG_DOCUMENT_21 申请人是否额外提供了文件2,3,4. . .21
AMT_REQ_CREDIT_BUREAU_HOUR 申请人发起申请前1个小时以内,被查询征信的次数
AMT_REQ_CREDIT_BUREAU_DAY 申请人发起申请前一天以内,被查询征信的次数
AMT_REQ_CREDIT_BUREAU_WEEK 申请人发起申请前一周以内,被查询征信的次数
AMT_REQ_CREDIT_BUREAU_MONTH 申请人发起申请前一个月以内,被查询征信的次数
AMT_REQ_CREDIT_BUREAU_QRT 申请人发起申请前一个季度以内,被查询征信的次数
AMT_REQ_CREDIT_BUREAU_YEAR 申请人发起申请前一年以内,被查询征信的次数
In [20]
#!unzip -q -o data/data105246/home_credit_default_risk.zip -d /home/aistudio/data
       
unzip:  cannot find or open data/data104475/IEEE_CIS_Fraud_Detection.zip, data/data104475/IEEE_CIS_Fraud_Detection.zip.zip or data/data104475/IEEE_CIS_Fraud_Detection.zip.ZIP.
        In [22]
# 安装依赖包!pip install xgboost
!pip install lightgbm
       
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: xgboost in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.3.3)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.6.3)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.20.3)
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.24.2)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.20.3)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.6.3)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.36.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
        In [23]
import osimport gcimport numpy as npimport pandas as pdfrom scipy.stats import kurtosisfrom sklearn.metrics import roc_auc_scorefrom sklearn.preprocessing import MinMaxScalerfrom sklearn.impute import SimpleImputerfrom sklearn.linear_model import LogisticRegressionimport matplotlib.pyplot as pltimport seaborn as snsimport warningsfrom sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFoldimport xgboost as xgbfrom xgboost import XGBClassifier
warnings.simplefilter(action='ignore', category=FutureWarning)from lightgbm import LGBMClassifier
    In [24]
DATA_DIRECTORY = "./data"df_train = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_train.csv'))
df_test = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_test.csv'))
df = df_train.append(df_test)del df_train, df_test; gc.collect()
       
39
                In [25]
df = df[df['AMT_INCOME_TOTAL'] < 20000000]
df = df[df['CODE_GENDER'] != 'XNA']
df['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)
df['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True)
    In [26]
def get_age_group(days_birth):
    age_years = -days_birth / 365
    if age_years < 27: return 1
    elif age_years < 40: return 2
    elif age_years < 50: return 3
    elif age_years < 65: return 4
    elif age_years < 99: return 5
    else: return 0
    In [27]
docs = [f for f in df.columns if 'FLAG_DOC' in f]
df['DOCUMENT_COUNT'] = df[docs].sum(axis=1)
df['NEW_DOC_KURT'] = df[docs].kurtosis(axis=1)
df['AGE_RANGE'] = df['DAYS_BIRTH'].apply(lambda x: get_age_group(x))
    In [28]
df['EXT_SOURCES_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
df['EXT_SOURCES_WEIGHTED'] = df.EXT_SOURCE_1 * 2 + df.EXT_SOURCE_2 * 1 + df.EXT_SOURCE_3 * 3np.warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')for function_name in ['min', 'max', 'mean', 'nanmedian', 'var']:
    feature_name = 'EXT_SOURCES_{}'.format(function_name.upper())
    df[feature_name] = eval('np.{}'.format(function_name))(
        df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], axis=1)
    In [29]
df['CREDIT_TO_ANNUITY_RATIO'] = df['AMT_CREDIT'] / df['AMT_ANNUITY']
df['CREDIT_TO_GOODS_RATIO'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']
df['ANNUITY_TO_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
df['CREDIT_TO_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
df['INCOME_TO_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
df['INCOME_TO_BIRTH_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_BIRTH']    
df['EMPLOYED_TO_BIRTH_RATIO'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
df['ID_TO_BIRTH_RATIO'] = df['DAYS_ID_PUBLISH'] / df['DAYS_BIRTH']
df['CAR_TO_BIRTH_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_BIRTH']
df['CAR_TO_EMPLOYED_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_EMPLOYED']
df['PHONE_TO_BIRTH_RATIO'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_BIRTH']
    In [30]
def do_mean(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].mean().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')    del gp
    gc.collect()    return df
    In [31]
def do_median(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].median().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')    del gp
    gc.collect()    return df
    In [32]
def do_std(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].std().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')    del gp
    gc.collect()    return df
    In [33]
def do_sum(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].sum().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')    del gp
    gc.collect()    return df
    In [34]
group = ['ORGANIZATION_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_RANGE', 'CODE_GENDER']
df = do_median(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_MEDIAN')
df = do_std(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_STD')
df = do_mean(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_MEAN')
df = do_std(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_STD')
df = do_mean(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_MEAN')
df = do_std(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_STD')
df = do_mean(df, group, 'AMT_CREDIT', 'GROUP_CREDIT_MEAN')
df = do_mean(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_MEAN')
df = do_std(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_STD')
    In [35]
def label_encoder(df, categorical_columns=None):
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']    for col in categorical_columns:
        df[col], uniques = pd.factorize(df[col])    return df, categorical_columns
    In [36]
def drop_application_columns(df):
    drop_list = [        'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START',        'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'FLAG_EMAIL', 'FLAG_PHONE',        'FLAG_OWN_REALTY', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',        'REG_CITY_NOT_WORK_CITY', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',        'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR', 
        'COMMONAREA_MODE', 'NONLIVINGAREA_MODE', 'ELEVATORS_MODE', 'NONLIVINGAREA_*G',        'FLOORSMIN_MEDI', 'LANDAREA_MODE', 'NONLIVINGAREA_MEDI', 'LIVINGAPARTMENTS_MODE',        'FLOORSMIN_*G', 'LANDAREA_*G', 'FLOORSMIN_MODE', 'LANDAREA_MEDI',        'COMMONAREA_MEDI', 'YEARS_BUILD_*G', 'COMMONAREA_*G', 'BASEMENTAREA_*G',        'BASEMENTAREA_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 
        'LIVINGAPARTMENTS_*G', 'ELEVATORS_*G', 'YEARS_BUILD_MEDI', 'ENTRANCES_MODE',        'NONLIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'LIVINGAPARTMENTS_MEDI',        'YEARS_BUILD_MODE', 'YEARS_BEGINEXPLUATATION_*G', 'ELEVATORS_MEDI', 'LIVINGAREA_MEDI',        'YEARS_BEGINEXPLUATATION_MODE', 'NONLIVINGAPARTMENTS_*G', 'HOUSETYPE_MODE',        'FONDKAPREMONT_MODE', 'EMERGENCYSTATE_MODE'
    ]    for doc_num in [2,4,5,6,7,9,10,11,12,13,14,15,16,17,19,20,21]:
        drop_list.append('FLAG_DOCUMENT_{}'.format(doc_num))
    df.drop(drop_list, axis=1, inplace=True)    return df
    In [37]
df, le_encoded_cols = label_encoder(df, None)
df = drop_application_columns(df)
    In [38]
df = pd.get_dummies(df)
    In [39]
bureau = pd.read_csv(os.path.join(DATA_DIRECTORY, 'bureau.csv'))
    In [40]
bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_CREDIT_SUM_DEBT']
bureau['DEBT_CREDIT_DIFF'] = bureau['AMT_CREDIT_SUM'] - bureau['AMT_CREDIT_SUM_DEBT']
bureau['CREDIT_TO_ANNUITY_RATIO'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_ANNUITY']
    In [41]
def one_hot_encoder(df, categorical_columns=None, nan_as_category=True):
    original_columns = list(df.columns)    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    categorical_columns = [c for c in df.columns if c not in original_columns]    return df, categorical_columns
    In [42]
def group(df_to_agg, prefix, aggregations, aggregate_by= 'SK_ID_CURR'):
    agg_df = df_to_agg.groupby(aggregate_by).agg(aggregations)
    agg_df.columns = pd.Index(['{}{}_{}'.format(prefix, e[0], e[1].upper())                               for e in agg_df.columns.tolist()])    return agg_df.reset_index()
    In [43]
def group_and_merge(df_to_agg, df_to_merge, prefix, aggregations, aggregate_by= 'SK_ID_CURR'):
    agg_df = group(df_to_agg, prefix, aggregations, aggregate_by= aggregate_by)    return df_to_merge.merge(agg_df, how='left', on= aggregate_by)
    In [44]
def get_bureau_balance(path, num_rows= None):
    bb = pd.read_csv(os.path.join(path, 'bureau_balance.csv'))
    bb, categorical_cols = one_hot_encoder(bb, nan_as_category= False)    # Calculate rate for each category with decay
    bb_processed = bb.groupby('SK_ID_BUREAU')[categorical_cols].mean().reset_index()    # Min, Max, Count and mean duration of payments (months)
    agg = {'MONTHS_BALANCE': ['min', 'max', 'mean', 'size']}
    bb_processed = group_and_merge(bb, bb_processed, '', agg, 'SK_ID_BUREAU')    del bb; gc.collect()    return bb_processed
    In [45]
bureau, categorical_cols = one_hot_encoder(bureau, nan_as_category= False)
bureau = bureau.merge(get_bureau_balance(DATA_DIRECTORY), how='left', on='SK_ID_BUREAU')
bureau['STATUS_12345'] = 0for i in range(1,6):
    bureau['STATUS_12345'] += bureau['STATUS_{}'.format(i)]
    In [46]
features = ['AMT_CREDIT_MAX_OVERDUE', 'AMT_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM',    'AMT_CREDIT_SUM_DEBT', 'DEBT_PERCENTAGE', 'DEBT_CREDIT_DIFF', 'STATUS_0', 'STATUS_12345']
agg_length = bureau.groupby('MONTHS_BALANCE_SIZE')[features].mean().reset_index()
agg_length.rename({feat: 'LL_' + feat for feat in features}, axis=1, inplace=True)
bureau = bureau.merge(agg_length, how='left', on='MONTHS_BALANCE_SIZE')del agg_length; gc.collect()
       
39
                In [47]
BUREAU_AGG = {    'SK_ID_BUREAU': ['nunique'],    'DAYS_CREDIT': ['min', 'max', 'mean'],    'DAYS_CREDIT_ENDDATE': ['min', 'max'],    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],    'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean', 'sum'],    'AMT_ANNUITY': ['mean'],    'DEBT_CREDIT_DIFF': ['mean', 'sum'],    'MONTHS_BALANCE_MEAN': ['mean', 'var'],    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],    'STATUS_0': ['mean'],    'STATUS_1': ['mean'],    'STATUS_12345': ['mean'],    'STATUS_C': ['mean'],    'STATUS_X': ['mean'],    'CREDIT_ACTIVE_Active': ['mean'],    'CREDIT_ACTIVE_Closed': ['mean'],    'CREDIT_ACTIVE_Sold': ['mean'],    'CREDIT_TYPE_Consumer credit': ['mean'],    'CREDIT_TYPE_Credit card': ['mean'],    'CREDIT_TYPE_Car loan': ['mean'],    'CREDIT_TYPE_Mortgage': ['mean'],    'CREDIT_TYPE_Microloan': ['mean'],    'LL_AMT_CREDIT_SUM_OVERDUE': ['mean'],    'LL_DEBT_CREDIT_DIFF': ['mean'],    'LL_STATUS_12345': ['mean'],
}

BUREAU_ACTIVE_AGG = {    'DAYS_CREDIT': ['max', 'mean'],    'DAYS_CREDIT_ENDDATE': ['min', 'max'],    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],    'AMT_CREDIT_SUM': ['max', 'sum'],    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean'],    'DAYS_CREDIT_UPDATE': ['min', 'mean'],    'DEBT_PERCENTAGE': ['mean'],    'DEBT_CREDIT_DIFF': ['mean'],    'CREDIT_TO_ANNUITY_RATIO': ['mean'],    'MONTHS_BALANCE_MEAN': ['mean', 'var'],    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
}

BUREAU_CLOSED_AGG = {    'DAYS_CREDIT': ['max', 'var'],    'DAYS_CREDIT_ENDDATE': ['max'],    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],    'AMT_CREDIT_SUM_OVERDUE': ['mean'],    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],    'AMT_CREDIT_SUM_DEBT': ['max', 'sum'],    'DAYS_CREDIT_UPDATE': ['max'],    'ENDDATE_DIF': ['mean'],    'STATUS_12345': ['mean'],
}

BUREAU_LOAN_TYPE_AGG = {    'DAYS_CREDIT': ['mean', 'max'],    'AMT_CREDIT_MAX_OVERDUE': ['mean', 'max'],    'AMT_CREDIT_SUM': ['mean', 'max'],    'AMT_CREDIT_SUM_DEBT': ['mean', 'max'],    'DEBT_PERCENTAGE': ['mean'],    'DEBT_CREDIT_DIFF': ['mean'],    'DAYS_CREDIT_ENDDATE': ['max'],
}

BUREAU_TIME_AGG = {    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],    'AMT_CREDIT_SUM_OVERDUE': ['mean'],    'AMT_CREDIT_SUM': ['max', 'sum'],    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],    'DEBT_PERCENTAGE': ['mean'],    'DEBT_CREDIT_DIFF': ['mean'],    'STATUS_0': ['mean'],    'STATUS_12345': ['mean'],
}
    In [48]
agg_bureau = group(bureau, 'BUREAU_', BUREAU_AGG)
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
agg_bureau = group_and_merge(active,agg_bureau,'BUREAU_ACTIVE_',BUREAU_ACTIVE_AGG)
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
agg_bureau = group_and_merge(closed,agg_bureau,'BUREAU_CLOSED_',BUREAU_CLOSED_AGG)del active, closed; gc.collect()for credit_type in ['Consumer credit', 'Credit card', 'Mortgage', 'Car loan', 'Microloan']:
    type_df = bureau[bureau['CREDIT_TYPE_' + credit_type] == 1]
    prefix = 'BUREAU_' + credit_type.split(' ')[0].upper() + '_'
    agg_bureau = group_and_merge(type_df, agg_bureau, prefix, BUREAU_LOAN_TYPE_AGG)    del type_df; gc.collect()for time_frame in [6, 12]:
    prefix = "BUREAU_LAST{}M_".format(time_frame)
    time_frame_df = bureau[bureau['DAYS_CREDIT'] >= -30*time_frame]
    agg_bureau = group_and_merge(time_frame_df, agg_bureau, prefix, BUREAU_TIME_AGG)    del time_frame_df; gc.collect()
    In [49]
sort_bureau = bureau.sort_values(by=['DAYS_CREDIT'])
gr = sort_bureau.groupby('SK_ID_CURR')['AMT_CREDIT_MAX_OVERDUE'].last().reset_index()
gr.rename({'AMT_CREDIT_MAX_OVERDUE': 'BUREAU_LAST_LOAN_MAX_OVERDUE'}, inplace=True)
agg_bureau = agg_bureau.merge(gr, on='SK_ID_CURR', how='left')
agg_bureau['BUREAU_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_AMT_CREDIT_SUM_SUM']
agg_bureau['BUREAU_ACTIVE_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_SUM']
    In [50]
df = pd.merge(df, agg_bureau, on='SK_ID_CURR', how='left')del agg_bureau, bureau
gc.collect()
       
39
                In [51]
prev = pd.read_csv(os.path.join(DATA_DIRECTORY, 'previous_application.csv'))
pay = pd.read_csv(os.path.join(DATA_DIRECTORY, 'installments_payments.csv'))
    In [52]
PREVIOUS_AGG = {    'SK_ID_PREV': ['nunique'],    'AMT_ANNUITY': ['min', 'max', 'mean'],    'AMT_DOWN_PAYMENT': ['max', 'mean'],    'HOUR_APPR_PROCESS_START': ['min', 'max', 'mean'],    'RATE_DOWN_PAYMENT': ['max', 'mean'],    'DAYS_DECISION': ['min', 'max', 'mean'],    'CNT_PAYMENT': ['max', 'mean'],    'DAYS_TERMINATION': ['max'],    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean'],    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean', 'var'],    'DOWN_PAYMENT_TO_CREDIT': ['mean'],
}

PREVIOUS_ACTIVE_AGG = {    'SK_ID_PREV': ['nunique'],    'SIMPLE_INTERESTS': ['mean'],    'AMT_ANNUITY': ['max', 'sum'],    'AMT_APPLICATION': ['max', 'mean'],    'AMT_CREDIT': ['sum'],    'AMT_DOWN_PAYMENT': ['max', 'mean'],    'DAYS_DECISION': ['min', 'mean'],    'CNT_PAYMENT': ['mean', 'sum'],    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],    # Engineered features
    'AMT_PAYMENT': ['sum'],    'INSTALMENT_PAYMENT_DIFF': ['mean', 'max'],    'REMAINING_DEBT': ['max', 'mean', 'sum'],    'REPAYMENT_RATIO': ['mean'],
}
PREVIOUS_LATE_PAYMENTS_AGG = {    'DAYS_DECISION': ['min', 'max', 'mean'],    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

PREVIOUS_LOAN_TYPE_AGG = {    'AMT_CREDIT': ['sum'],    'AMT_ANNUITY': ['mean', 'max'],    'SIMPLE_INTERESTS': ['min', 'mean', 'max', 'var'],    'APPLICATION_CREDIT_DIFF': ['min', 'var'],    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],    'DAYS_DECISION': ['max'],    'DAYS_LAST_DUE_1ST_VERSION': ['max', 'mean'],    'CNT_PAYMENT': ['mean'],
}

PREVIOUS_TIME_AGG = {    'AMT_CREDIT': ['sum'],    'AMT_ANNUITY': ['mean', 'max'],    'SIMPLE_INTERESTS': ['mean', 'max'],    'DAYS_DECISION': ['min', 'mean'],    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

PREVIOUS_APPROVED_AGG = {    'SK_ID_PREV': ['nunique'],    'AMT_ANNUITY': ['min', 'max', 'mean'],    'AMT_CREDIT': ['min', 'max', 'mean'],    'AMT_DOWN_PAYMENT': ['max'],    'AMT_GOODS_PRICE': ['max'],    'HOUR_APPR_PROCESS_START': ['min', 'max'],    'DAYS_DECISION': ['min', 'mean'],    'CNT_PAYMENT': ['max', 'mean'],    'DAYS_TERMINATION': ['mean'],    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],    'APPLICATION_CREDIT_DIFF': ['max'],    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],    # The following features are only for approved applications
    'DAYS_FIRST_DRAWING': ['max', 'mean'],    'DAYS_FIRST_DUE': ['min', 'mean'],    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],    'DAYS_LAST_DUE': ['max', 'mean'],    'DAYS_LAST_DUE_DIFF': ['min', 'max', 'mean'],    'SIMPLE_INTERESTS': ['min', 'max', 'mean'],
}

PREVIOUS_REFUSED_AGG = {    'AMT_APPLICATION': ['max', 'mean'],    'AMT_CREDIT': ['min', 'max'],    'DAYS_DECISION': ['min', 'max', 'mean'],    'CNT_PAYMENT': ['max', 'mean'],    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean', 'var'],    'APPLICATION_CREDIT_RATIO': ['min', 'mean'],    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
    In [53]
ohe_columns = [    'NAME_CONTRACT_STATUS', 'NAME_CONTRACT_TYPE', 'CHANNEL_TYPE',    'NAME_TYPE_SUITE', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',    'NAME_PRODUCT_TYPE', 'NAME_CLIENT_TYPE']
prev, categorical_cols = one_hot_encoder(prev, ohe_columns, nan_as_category= False)
    In [54]
prev['APPLICATION_CREDIT_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_CREDIT']
prev['APPLICATION_CREDIT_RATIO'] = prev['AMT_APPLICATION'] / prev['AMT_CREDIT']
prev['CREDIT_TO_ANNUITY_RATIO'] = prev['AMT_CREDIT']/prev['AMT_ANNUITY']
prev['DOWN_PAYMENT_TO_CREDIT'] = prev['AMT_DOWN_PAYMENT'] / prev['AMT_CREDIT']
total_payment = prev['AMT_ANNUITY'] * prev['CNT_PAYMENT']
prev['SIMPLE_INTERESTS'] = (total_payment/prev['AMT_CREDIT'] - 1)/prev['CNT_PAYMENT']
    In [55]
approved = prev[prev['NAME_CONTRACT_STATUS_Approved'] == 1]
active_df = approved[approved['DAYS_LAST_DUE'] == 365243]
active_pay = pay[pay['SK_ID_PREV'].isin(active_df['SK_ID_PREV'])]
active_pay_agg = active_pay.groupby('SK_ID_PREV')[['AMT_INSTALMENT', 'AMT_PAYMENT']].sum()
active_pay_agg.reset_index(inplace= True)
active_pay_agg['INSTALMENT_PAYMENT_DIFF'] = active_pay_agg['AMT_INSTALMENT'] - active_pay_agg['AMT_PAYMENT']
active_df = active_df.merge(active_pay_agg, on= 'SK_ID_PREV', how= 'left')
active_df['REMAINING_DEBT'] = active_df['AMT_CREDIT'] - active_df['AMT_PAYMENT']
active_df['REPAYMENT_RATIO'] = active_df['AMT_PAYMENT'] / active_df['AMT_CREDIT']
active_agg_df = group(active_df, 'PREV_ACTIVE_', PREVIOUS_ACTIVE_AGG)
active_agg_df['TOTAL_REPAYMENT_RATIO'] = active_agg_df['PREV_ACTIVE_AMT_PAYMENT_SUM']/\
                                            active_agg_df['PREV_ACTIVE_AMT_CREDIT_SUM']del active_pay, active_pay_agg, active_df; gc.collect()
       
0
                In [56]
prev['DAYS_FIRST_DRAWING'].replace(365243, np.nan, inplace= True)
prev['DAYS_FIRST_DUE'].replace(365243, np.nan, inplace= True)
prev['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace= True)
prev['DAYS_LAST_DUE'].replace(365243, np.nan, inplace= True)
prev['DAYS_TERMINATION'].replace(365243, np.nan, inplace= True)
    In [57]
prev['DAYS_LAST_DUE_DIFF'] = prev['DAYS_LAST_DUE_1ST_VERSION'] - prev['DAYS_LAST_DUE']
approved['DAYS_LAST_DUE_DIFF'] = approved['DAYS_LAST_DUE_1ST_VERSION'] - approved['DAYS_LAST_DUE']
    In [58]
categorical_agg = {key: ['mean'] for key in categorical_cols}
    In [59]
agg_prev = group(prev, 'PREV_', {**PREVIOUS_AGG, **categorical_agg})
agg_prev = agg_prev.merge(active_agg_df, how='left', on='SK_ID_CURR')del active_agg_df; gc.collect()
       
0
                In [60]
agg_prev = group_and_merge(approved, agg_prev, 'APPROVED_', PREVIOUS_APPROVED_AGG)
refused = prev[prev['NAME_CONTRACT_STATUS_Refused'] == 1]
agg_prev = group_and_merge(refused, agg_prev, 'REFUSED_', PREVIOUS_REFUSED_AGG)del approved, refused; gc.collect()
       
0
                In [61]
for loan_type in ['Consumer loans', 'Cash loans']:
    type_df = prev[prev['NAME_CONTRACT_TYPE_{}'.format(loan_type)] == 1]
    prefix = 'PREV_' + loan_type.split(" ")[0] + '_'
    agg_prev = group_and_merge(type_df, agg_prev, prefix, PREVIOUS_LOAN_TYPE_AGG)    del type_df; gc.collect()
    In [62]
pay['LATE_PAYMENT'] = pay['DAYS_ENTRY_PAYMENT'] - pay['DAYS_INSTALMENT']
pay['LATE_PAYMENT'] = pay['LATE_PAYMENT'].apply(lambda x: 1 if x > 0 else 0)
dpd_id = pay[pay['LATE_PAYMENT'] > 0]['SK_ID_PREV'].unique()
    In [63]
agg_dpd = group_and_merge(prev[prev['SK_ID_PREV'].isin(dpd_id)], agg_prev,                                    'PREV_LATE_', PREVIOUS_LATE_PAYMENTS_AGG)del agg_dpd, dpd_id; gc.collect()
       
0
                In [64]
for time_frame in [12, 24]:
    time_frame_df = prev[prev['DAYS_DECISION'] >= -30*time_frame]
    prefix = 'PREV_LAST{}M_'.format(time_frame)
    agg_prev = group_and_merge(time_frame_df, agg_prev, prefix, PREVIOUS_TIME_AGG)    del time_frame_df; gc.collect()del prev; gc.collect()
       
0
                In [65]
df = pd.merge(df, agg_prev, on='SK_ID_CURR', how='left')
    In [66]
train = df[df['TARGET'].notnull()]
test = df[df['TARGET'].isnull()]del df
gc.collect()
       
98
                In [67]
labels = train['TARGET']
test_lebels=test['TARGET']
train = train.drop(columns=['TARGET'])
test = test.drop(columns=['TARGET'])
    In [68]
feature = list(train.columns)

train.replace([np.inf, -np.inf], np.nan, inplace=True)
test.replace([np.inf, -np.inf], np.nan, inplace=True)
test_df = test.copy()
train_df = train.copy()
train_df['TARGET'] = labels
test_df['TARGET'] = test_lebels
    In [69]
imputer = SimpleImputer(strategy = 'median')
imputer.fit(train)
imputer.fit(test)
train = imputer.transform(train)
test = imputer.transform(test)
    In [70]
scaler = MinMaxScaler(feature_range = (0, 1))
scaler.fit(train)
scaler.fit(test)
train = scaler.transform(train)
test = scaler.transform(test)
    In [71]
from lightgbm import LGBMClassifier

lgbmc = LGBMClassifier()
lgbmc.fit(train, labels)
       
LGBMClassifier()
                In [72]
lgbm_pred = lgbmc.predict_proba(test)[:, 1]
    In [74]
submit = test_df[['SK_ID_CURR']]
submit['TARGET'] = lgbm_pred
    In [75]
submit.to_csv('lgbm.csv', index = False)
   

总结

数据的提交结果如下:(提交需要科学上网)

数据集 Home Credit Default Risk
线上评分 0.78277

以上就是【金融风控系列】_[3]_贷款违约识别的详细内容,更多请关注其它相关文章!


# python  # ai  # seo吸引人眼睛  # 百度营销推广培训班长沙  # 考试院网站建设  # seo黑帽技术博客  # 官网  # 数据库中  # 线上  # 中文网  # 美图  # 包含了  # 当日  # 邮箱  # red  # 征信  # follow  # udio  # whee  # descript  # type  # 不匹配  # 一言  # 中国铁路网站建设  # 万家福营销推广方案设计  # 营销产品策略推广方法  # 网站优化口号  # 推广网站架构怎么做  # 榆次网站建设推广公司 


相关栏目: 【 行业资讯67740 】 【 技术百科0 】 【 网络运营39195


相关推荐: typescript和node学哪个  苹果16有哪些bug  linux下如何重定位命令  苹果电脑如何输入命令  电动车仪表盘上的power是什么意思  虚拟机如何用命令清除垃圾  折叠屏手机为什么这么小  单片机计时程序怎么写  oracle中datediff函数怎么用 Oracle中DATEDIFF函数详解  微信最多可以加多少好友  calm是什么意思  春运抢票准备什么东西  vi命令如何退出  美食音乐每日推荐怎么写  docs命令如何进入d  征信不好如何短期恢复  复制 命令如何撤销  苹果手机16系统有哪些  typescript在浏览器里怎么用  如何安装m.2固态硬盘  如何在一串数字前面去掉四位数的命令  65寸电视长宽多少厘米  linux如何跳回命令行界面  酷狗音乐pc版的每日推荐在哪 酷狗音乐PC版每日推荐查找指南  win7旗舰版wifi怎么打开  春运抢票可以抢几张  自己如何安装固态硬盘  shell如何执行sql脚本命令行  硬盘和固态硬盘如何区分  春运抢票技巧攻略  命令行ftp如何创建目录  typescript为什么现在才火  怎么更新typescript  充电器上的power是什么意思  33000日元等于多少人民币  j*a数组对象怎么取  typescript中文怎么读  linux如何调出命令行  市盈率ttm是什么意思  为什么用typescript  typescript怎么用  智能锁type-c接口是什么  显卡上面TYPE-C是什么接口  市盈率静是什么意思  bored是什么意思  typescript参数怎么用  夸克po什么意思  js怎么设置typescript  grub命令如何进dos  j*a怎么读取char数组 

搜索