新闻中心
【金融风控系列】_[3]_贷款违约识别
本文围绕Kaggle的Home Credit Default Risk赛题展开,利用客户申请表等7张表数据构建模型预测客户还款能力。通过数据清洗、特征工程,融合多表信息生成衍生特征,经LightGBM模型训练,最终线上评分为0.78277,为信用记录不足人群的贷款评估提供参考。
☞☞☞AI 智能聊天, 问答助手, AI 智能搜索, 免费无限量使用 DeepSeek R1 模型☜☜☜
![【金融风控系列】_[3]_贷款违约识别 -](https://img.php.cn/upload/article/202507/22/2025072211465245131.jpg)
Home Credit Default Risk(家庭信用违约风险)
该赛题来自 KAGGLE,仅用作学习交流
由于信用记录不足或不存在,许多人往往被划分为低信用借贷人而难以获得贷款。 为了确保这些人群获得贷款,Home Credit公司利用替代数据(包括电信和交易信息)预测客户的还款能力。
Home Credit一共提供了7张表,一共218个字段,其中训练集样本约31万(逾期8%),测试集样本约5万。
信息表
application_train/test 客户申请表
包含了
- 目标变量(客户是否违约-0/1变量)
- 客户申请贷款信息(贷款类型, 贷款总额, 年金)
- 客户基本信息(性别, 年龄, 家庭, 学历, 职业, 行业, 居住地情况)
- 客户财务信息(年收入, 房/车情况)
- 申请时提供的资料等.
bureau/bureau_balance 由其他金融机构提供给征信中心的客户信用记录历史(月数据)
包含了客户在征信中心的
- 信用记录,
- 违约金额,
- 违约时间等.
以时间序列(按行)的形式进行记录.
POS_CASH_balance 客户在Home Credit数据库中POS(point of sales)和现金贷款历史(月数据)
包含了客户
美图云修
商业级AI影像处理工具
50
查看详情
- 已付款情况
- 未付款情况
credit_card_balance 客户在Home Credit数据库中信用卡的snapshot历史(月数据)
包含了客户
- 消费次数
- 消费金额
等情况.
previous_application 客户先前的申请记录
包含了客户所有历史申请记录(申请信息, 申请结果等).
installments_payments 客户先前信用卡的还款记录
包含了客户的还款情况
- 还款日期
- 是否逾期
- 还款金额
- 是否欠款等
参考:
[1] https://zhuanlan.zhihu.com/p/43541825
[2] https://www.kaggle.com/xucheng/cv-7993-private-score-7996/
[3] https://zhuanlan.zhihu.com/p/40790434
[4] https://www.kaggle.com/tahmidnafi/cse499
[5] https://blog.csdn.net/zhangchen2449/article/details/83338978
主要字段表
| Field | Description |
|---|---|
| SK_ID_CURR | 此次申请的ID |
| TARGET | 申请人本次申请的还款风险:1-风险较高;0-风险较低 |
| NAME_CONTRACT_TYPE | 贷款类型:cash(现金)还是revolving(周转金,一次申请,多次循环提取) |
| CODE_GENDER | 申请人性别 |
| FLAG_OWN_CAR | 申请人是否有车 |
| FLAG_OWN_REALTY | 申请人是否有房 |
| CNT_CHILDREN | 申请人子女个数 |
| AMT_INCOME_TOTAL | 申请人收入状况 |
| AMT_CREDIT | 此次申请的贷款金额 |
| AMT_ANNUITY | 贷款年金 |
| AMT_GOODS_PRICE | 如果是消费贷款,改字段表示商品的实际价格 |
| NAME_TYPE_SUITE | 申请人此次申请的陪同人员 |
| NAME_INCOME_TYPE | 申请人收入类型 |
| NAME_EDUCATION_TYPE | 申请人受教育程度 |
| NAME_FAMILY_STATUS | 申请人婚姻状况 |
| NAME_HOUSING_TYPE | 申请人居住状况(租房,已购房,和父母一起住等) |
| REGION_POPULATION_RELATIVE | 申请人居住地人口密度,已标准化 |
| DAYS_BIRTH | 申请人出生日(距离申请当日的天数,负值) |
| DAYS_EMPLOYED | 申请人当前工作的工作年限(距离申请当日的天数,负值) |
| DAYS_REGISTRATION | 申请人最近一次修改注册信息的时间(距离申请当日的天数,负值) |
| DAYS_ID_PUBLISH | 申请人最近一次修改申请贷款的身份证明文件的时间(距离申请当日的天数,负值) |
| FLAG_MOBIL | 申请人是否提供个人电话(1-yes,0-no) |
| FLAG_EMP_PHONE | 申请人是否提供家庭电话(1-yes,0-no) |
| FLAG_WORK_PHONE | 申请人是否提供工作电话(1-yes,0-no) |
| FLAG_CONT_MOBILE | 申请人个人电话是否能拨通(1-yes,0-no) |
| FLAG_EMAIL | 申请人是否提供电子邮箱(1-yes,0-no) |
| OCCUPATION_TYPE | 申请人职务 |
| REGION_RATING_CLIENT | ben公司对申请人居住区域的评分等级(1,2,3) |
| REGION_RATING_CLIENT_W_CITY | 在考虑所在城市的情况下,ben公司对申请人居住区域的评分等级(1,2,3) |
| WEEKDAY_APPR_PROCESS_START | 申请人发起申请日是星期几 |
| HOUR_APPR_PROCESS_START | 申请人发起申请的hour |
| REG_REGION_NOT_LIVE_REGION | 申请人提供的的永久地址和联系地址是否匹配(1-不匹配,2-匹配,区域级别的) |
| REG_REGION_NOT_WORK_REGION | 申请人提供的的永久地址和工作地址是否匹配(1-不匹配,2-匹配,区域级别的) |
| LIVE_REGION_NOT_WORK_REGION | 申请人提供的的联系地址和工作地址是否匹配(1-不匹配,2-匹配,区域级别的) |
| REG_CITY_NOT_LIVE_CITY | 申请人提供的的永久地址和联系地址是否匹配(1-不匹配,2-匹配,城市级别的) |
| REG_CITY_NOT_WORK_CITY | 申请人提供的的永久地址和工作地址是否匹配(1-不匹配,2-匹配,城市级别的) |
| LIVE_CITY_NOT_WORK_CITY | 申请人提供的的联系地址和工作地址是否匹配(1-不匹配,2-匹配,城市级别的) |
| ORGANIZATION_TYPE | 申请人工作所属组织类型 |
| EXT_SOURCE_1 | 外部数据源1的标准化评分 |
| EXT_SOURCE_2 | 外部数据源2的标准化评分 |
| EXT_SOURCE_3 | 外部数据源3的标准化评分 |
| APARTMENTS_*G EMERGENCYSTATE_MODE | 申请人居住环境各项指标的标准化评分 |
| OBS_30_CNT_SOCIAL_CIRC LE DEF_60_CNT_SOCIAL_CIRCLE | 这部分字段含义没看懂 |
| DAYS_LAST_PHONE_CHANGE | 申请人最近一次修改手机号码的时间(距离申请当日的天数,负值) |
| FLAG_DOCUMENT_2 FLAG_DOCUMENT_21 | 申请人是否额外提供了文件2,3,4. . .21 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 申请人发起申请前1个小时以内,被查询征信的次数 |
| AMT_REQ_CREDIT_BUREAU_DAY | 申请人发起申请前一天以内,被查询征信的次数 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 申请人发起申请前一周以内,被查询征信的次数 |
| AMT_REQ_CREDIT_BUREAU_MONTH | 申请人发起申请前一个月以内,被查询征信的次数 |
| AMT_REQ_CREDIT_BUREAU_QRT | 申请人发起申请前一个季度以内,被查询征信的次数 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 申请人发起申请前一年以内,被查询征信的次数 |
#!unzip -q -o data/data105246/home_credit_default_risk.zip -d /home/aistudio/data
unzip: cannot find or open data/data104475/IEEE_CIS_Fraud_Detection.zip, data/data104475/IEEE_CIS_Fraud_Detection.zip.zip or data/data104475/IEEE_CIS_Fraud_Detection.zip.ZIP.In [22]
# 安装依赖包!pip install xgboost !pip install lightgbm
Looking in indexes: https://mirror.baidu.com/pypi/simple/ Requirement already satisfied: xgboost in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.3.3) Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.6.3) Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from xgboost) (1.20.3) Looking in indexes: https://mirror.baidu.com/pypi/simple/ Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1) Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.24.2) Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.20.3) Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.6.3) Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.36.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (2.1.0) Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)In [23]
import osimport gcimport numpy as npimport pandas as pdfrom scipy.stats import kurtosisfrom sklearn.metrics import roc_auc_scorefrom sklearn.preprocessing import MinMaxScalerfrom sklearn.impute import SimpleImputerfrom sklearn.linear_model import LogisticRegressionimport matplotlib.pyplot as pltimport seaborn as snsimport warningsfrom sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFoldimport xgboost as xgbfrom xgboost import XGBClassifier warnings.simplefilter(action='ignore', category=FutureWarning)from lightgbm import LGBMClassifierIn [24]
DATA_DIRECTORY = "./data"df_train = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_train.csv')) df_test = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_test.csv')) df = df_train.append(df_test)del df_train, df_test; gc.collect()
39In [25]
df = df[df['AMT_INCOME_TOTAL'] < 20000000] df = df[df['CODE_GENDER'] != 'XNA'] df['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True) df['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True)In [26]
def get_age_group(days_birth):
age_years = -days_birth / 365
if age_years < 27: return 1
elif age_years < 40: return 2
elif age_years < 50: return 3
elif age_years < 65: return 4
elif age_years < 99: return 5
else: return 0
In [27]
docs = [f for f in df.columns if 'FLAG_DOC' in f] df['DOCUMENT_COUNT'] = df[docs].sum(axis=1) df['NEW_DOC_KURT'] = df[docs].kurtosis(axis=1) df['AGE_RANGE'] = df['DAYS_BIRTH'].apply(lambda x: get_age_group(x))In [28]
df['EXT_SOURCES_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
df['EXT_SOURCES_WEIGHTED'] = df.EXT_SOURCE_1 * 2 + df.EXT_SOURCE_2 * 1 + df.EXT_SOURCE_3 * 3np.warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')for function_name in ['min', 'max', 'mean', 'nanmedian', 'var']:
feature_name = 'EXT_SOURCES_{}'.format(function_name.upper())
df[feature_name] = eval('np.{}'.format(function_name))(
df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], axis=1)
In [29]
df['CREDIT_TO_ANNUITY_RATIO'] = df['AMT_CREDIT'] / df['AMT_ANNUITY'] df['CREDIT_TO_GOODS_RATIO'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE'] df['ANNUITY_TO_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL'] df['CREDIT_TO_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL'] df['INCOME_TO_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED'] df['INCOME_TO_BIRTH_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_BIRTH'] df['EMPLOYED_TO_BIRTH_RATIO'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH'] df['ID_TO_BIRTH_RATIO'] = df['DAYS_ID_PUBLISH'] / df['DAYS_BIRTH'] df['CAR_TO_BIRTH_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_BIRTH'] df['CAR_TO_EMPLOYED_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_EMPLOYED'] df['PHONE_TO_BIRTH_RATIO'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_BIRTH']In [30]
def do_mean(df, group_cols, counted, agg_name):
gp = df[group_cols + [counted]].groupby(group_cols)[counted].mean().reset_index().rename(
columns={counted: agg_name})
df = df.merge(gp, on=group_cols, how='left') del gp
gc.collect() return df
In [31]
def do_median(df, group_cols, counted, agg_name):
gp = df[group_cols + [counted]].groupby(group_cols)[counted].median().reset_index().rename(
columns={counted: agg_name})
df = df.merge(gp, on=group_cols, how='left') del gp
gc.collect() return df
In [32]
def do_std(df, group_cols, counted, agg_name):
gp = df[group_cols + [counted]].groupby(group_cols)[counted].std().reset_index().rename(
columns={counted: agg_name})
df = df.merge(gp, on=group_cols, how='left') del gp
gc.collect() return df
In [33]
def do_sum(df, group_cols, counted, agg_name):
gp = df[group_cols + [counted]].groupby(group_cols)[counted].sum().reset_index().rename(
columns={counted: agg_name})
df = df.merge(gp, on=group_cols, how='left') del gp
gc.collect() return df
In [34]
group = ['ORGANIZATION_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_RANGE', 'CODE_GENDER'] df = do_median(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_MEDIAN') df = do_std(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_STD') df = do_mean(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_MEAN') df = do_std(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_STD') df = do_mean(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_MEAN') df = do_std(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_STD') df = do_mean(df, group, 'AMT_CREDIT', 'GROUP_CREDIT_MEAN') df = do_mean(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_MEAN') df = do_std(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_STD')In [35]
def label_encoder(df, categorical_columns=None):
if not categorical_columns:
categorical_columns = [col for col in df.columns if df[col].dtype == 'object'] for col in categorical_columns:
df[col], uniques = pd.factorize(df[col]) return df, categorical_columns
In [36]
def drop_application_columns(df):
drop_list = [ 'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START', 'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'FLAG_EMAIL', 'FLAG_PHONE', 'FLAG_OWN_REALTY', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_WORK_CITY', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR',
'COMMONAREA_MODE', 'NONLIVINGAREA_MODE', 'ELEVATORS_MODE', 'NONLIVINGAREA_*G', 'FLOORSMIN_MEDI', 'LANDAREA_MODE', 'NONLIVINGAREA_MEDI', 'LIVINGAPARTMENTS_MODE', 'FLOORSMIN_*G', 'LANDAREA_*G', 'FLOORSMIN_MODE', 'LANDAREA_MEDI', 'COMMONAREA_MEDI', 'YEARS_BUILD_*G', 'COMMONAREA_*G', 'BASEMENTAREA_*G', 'BASEMENTAREA_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
'LIVINGAPARTMENTS_*G', 'ELEVATORS_*G', 'YEARS_BUILD_MEDI', 'ENTRANCES_MODE', 'NONLIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'LIVINGAPARTMENTS_MEDI', 'YEARS_BUILD_MODE', 'YEARS_BEGINEXPLUATATION_*G', 'ELEVATORS_MEDI', 'LIVINGAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MODE', 'NONLIVINGAPARTMENTS_*G', 'HOUSETYPE_MODE', 'FONDKAPREMONT_MODE', 'EMERGENCYSTATE_MODE'
] for doc_num in [2,4,5,6,7,9,10,11,12,13,14,15,16,17,19,20,21]:
drop_list.append('FLAG_DOCUMENT_{}'.format(doc_num))
df.drop(drop_list, axis=1, inplace=True) return df
In [37]
df, le_encoded_cols = label_encoder(df, None) df = drop_application_columns(df)In [38]
df = pd.get_dummies(df)In [39]
bureau = pd.read_csv(os.path.join(DATA_DIRECTORY, 'bureau.csv'))In [40]
bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE'] bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT'] bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_CREDIT_SUM_DEBT'] bureau['DEBT_CREDIT_DIFF'] = bureau['AMT_CREDIT_SUM'] - bureau['AMT_CREDIT_SUM_DEBT'] bureau['CREDIT_TO_ANNUITY_RATIO'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_ANNUITY']In [41]
def one_hot_encoder(df, categorical_columns=None, nan_as_category=True):
original_columns = list(df.columns) if not categorical_columns:
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
categorical_columns = [c for c in df.columns if c not in original_columns] return df, categorical_columns
In [42]
def group(df_to_agg, prefix, aggregations, aggregate_by= 'SK_ID_CURR'):
agg_df = df_to_agg.groupby(aggregate_by).agg(aggregations)
agg_df.columns = pd.Index(['{}{}_{}'.format(prefix, e[0], e[1].upper()) for e in agg_df.columns.tolist()]) return agg_df.reset_index()
In [43]
def group_and_merge(df_to_agg, df_to_merge, prefix, aggregations, aggregate_by= 'SK_ID_CURR'):
agg_df = group(df_to_agg, prefix, aggregations, aggregate_by= aggregate_by) return df_to_merge.merge(agg_df, how='left', on= aggregate_by)
In [44]
def get_bureau_balance(path, num_rows= None):
bb = pd.read_csv(os.path.join(path, 'bureau_balance.csv'))
bb, categorical_cols = one_hot_encoder(bb, nan_as_category= False) # Calculate rate for each category with decay
bb_processed = bb.groupby('SK_ID_BUREAU')[categorical_cols].mean().reset_index() # Min, Max, Count and mean duration of payments (months)
agg = {'MONTHS_BALANCE': ['min', 'max', 'mean', 'size']}
bb_processed = group_and_merge(bb, bb_processed, '', agg, 'SK_ID_BUREAU') del bb; gc.collect() return bb_processed
In [45]
bureau, categorical_cols = one_hot_encoder(bureau, nan_as_category= False)
bureau = bureau.merge(get_bureau_balance(DATA_DIRECTORY), how='left', on='SK_ID_BUREAU')
bureau['STATUS_12345'] = 0for i in range(1,6):
bureau['STATUS_12345'] += bureau['STATUS_{}'.format(i)]
In [46]
features = ['AMT_CREDIT_MAX_OVERDUE', 'AMT_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'DEBT_PERCENTAGE', 'DEBT_CREDIT_DIFF', 'STATUS_0', 'STATUS_12345'] agg_length = bureau.groupby('MONTHS_BALANCE_SIZE')[features].mean().reset_index() agg_length.rename({feat: 'LL_' + feat for feat in features}, axis=1, inplace=True) bureau = bureau.merge(agg_length, how='left', on='MONTHS_BALANCE_SIZE')del agg_length; gc.collect()
39In [47]
BUREAU_AGG = { 'SK_ID_BUREAU': ['nunique'], 'DAYS_CREDIT': ['min', 'max', 'mean'], 'DAYS_CREDIT_ENDDATE': ['min', 'max'], 'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'], 'AMT_CREDIT_SUM': ['max', 'mean', 'sum'], 'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'], 'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean', 'sum'], 'AMT_ANNUITY': ['mean'], 'DEBT_CREDIT_DIFF': ['mean', 'sum'], 'MONTHS_BALANCE_MEAN': ['mean', 'var'], 'MONTHS_BALANCE_SIZE': ['mean', 'sum'], 'STATUS_0': ['mean'], 'STATUS_1': ['mean'], 'STATUS_12345': ['mean'], 'STATUS_C': ['mean'], 'STATUS_X': ['mean'], 'CREDIT_ACTIVE_Active': ['mean'], 'CREDIT_ACTIVE_Closed': ['mean'], 'CREDIT_ACTIVE_Sold': ['mean'], 'CREDIT_TYPE_Consumer credit': ['mean'], 'CREDIT_TYPE_Credit card': ['mean'], 'CREDIT_TYPE_Car loan': ['mean'], 'CREDIT_TYPE_Mortgage': ['mean'], 'CREDIT_TYPE_Microloan': ['mean'], 'LL_AMT_CREDIT_SUM_OVERDUE': ['mean'], 'LL_DEBT_CREDIT_DIFF': ['mean'], 'LL_STATUS_12345': ['mean'],
}
BUREAU_ACTIVE_AGG = { 'DAYS_CREDIT': ['max', 'mean'], 'DAYS_CREDIT_ENDDATE': ['min', 'max'], 'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'], 'AMT_CREDIT_SUM': ['max', 'sum'], 'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'], 'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean'], 'DAYS_CREDIT_UPDATE': ['min', 'mean'], 'DEBT_PERCENTAGE': ['mean'], 'DEBT_CREDIT_DIFF': ['mean'], 'CREDIT_TO_ANNUITY_RATIO': ['mean'], 'MONTHS_BALANCE_MEAN': ['mean', 'var'], 'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
}
BUREAU_CLOSED_AGG = { 'DAYS_CREDIT': ['max', 'var'], 'DAYS_CREDIT_ENDDATE': ['max'], 'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'], 'AMT_CREDIT_SUM_OVERDUE': ['mean'], 'AMT_CREDIT_SUM': ['max', 'mean', 'sum'], 'AMT_CREDIT_SUM_DEBT': ['max', 'sum'], 'DAYS_CREDIT_UPDATE': ['max'], 'ENDDATE_DIF': ['mean'], 'STATUS_12345': ['mean'],
}
BUREAU_LOAN_TYPE_AGG = { 'DAYS_CREDIT': ['mean', 'max'], 'AMT_CREDIT_MAX_OVERDUE': ['mean', 'max'], 'AMT_CREDIT_SUM': ['mean', 'max'], 'AMT_CREDIT_SUM_DEBT': ['mean', 'max'], 'DEBT_PERCENTAGE': ['mean'], 'DEBT_CREDIT_DIFF': ['mean'], 'DAYS_CREDIT_ENDDATE': ['max'],
}
BUREAU_TIME_AGG = { 'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'], 'AMT_CREDIT_SUM_OVERDUE': ['mean'], 'AMT_CREDIT_SUM': ['max', 'sum'], 'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'], 'DEBT_PERCENTAGE': ['mean'], 'DEBT_CREDIT_DIFF': ['mean'], 'STATUS_0': ['mean'], 'STATUS_12345': ['mean'],
}
In [48]
agg_bureau = group(bureau, 'BUREAU_', BUREAU_AGG)
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
agg_bureau = group_and_merge(active,agg_bureau,'BUREAU_ACTIVE_',BUREAU_ACTIVE_AGG)
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
agg_bureau = group_and_merge(closed,agg_bureau,'BUREAU_CLOSED_',BUREAU_CLOSED_AGG)del active, closed; gc.collect()for credit_type in ['Consumer credit', 'Credit card', 'Mortgage', 'Car loan', 'Microloan']:
type_df = bureau[bureau['CREDIT_TYPE_' + credit_type] == 1]
prefix = 'BUREAU_' + credit_type.split(' ')[0].upper() + '_'
agg_bureau = group_and_merge(type_df, agg_bureau, prefix, BUREAU_LOAN_TYPE_AGG) del type_df; gc.collect()for time_frame in [6, 12]:
prefix = "BUREAU_LAST{}M_".format(time_frame)
time_frame_df = bureau[bureau['DAYS_CREDIT'] >= -30*time_frame]
agg_bureau = group_and_merge(time_frame_df, agg_bureau, prefix, BUREAU_TIME_AGG) del time_frame_df; gc.collect()
In [49]
sort_bureau = bureau.sort_values(by=['DAYS_CREDIT'])
gr = sort_bureau.groupby('SK_ID_CURR')['AMT_CREDIT_MAX_OVERDUE'].last().reset_index()
gr.rename({'AMT_CREDIT_MAX_OVERDUE': 'BUREAU_LAST_LOAN_MAX_OVERDUE'}, inplace=True)
agg_bureau = agg_bureau.merge(gr, on='SK_ID_CURR', how='left')
agg_bureau['BUREAU_DEBT_OVER_CREDIT'] = \
agg_bureau['BUREAU_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_AMT_CREDIT_SUM_SUM']
agg_bureau['BUREAU_ACTIVE_DEBT_OVER_CREDIT'] = \
agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_DEBT_SUM']/agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_SUM']
In [50]
df = pd.merge(df, agg_bureau, on='SK_ID_CURR', how='left')del agg_bureau, bureau gc.collect()
39In [51]
prev = pd.read_csv(os.path.join(DATA_DIRECTORY, 'previous_application.csv')) pay = pd.read_csv(os.path.join(DATA_DIRECTORY, 'installments_payments.csv'))In [52]
PREVIOUS_AGG = { 'SK_ID_PREV': ['nunique'], 'AMT_ANNUITY': ['min', 'max', 'mean'], 'AMT_DOWN_PAYMENT': ['max', 'mean'], 'HOUR_APPR_PROCESS_START': ['min', 'max', 'mean'], 'RATE_DOWN_PAYMENT': ['max', 'mean'], 'DAYS_DECISION': ['min', 'max', 'mean'], 'CNT_PAYMENT': ['max', 'mean'], 'DAYS_TERMINATION': ['max'], # Engineered features
'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'], 'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean'], 'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean', 'var'], 'DOWN_PAYMENT_TO_CREDIT': ['mean'],
}
PREVIOUS_ACTIVE_AGG = { 'SK_ID_PREV': ['nunique'], 'SIMPLE_INTERESTS': ['mean'], 'AMT_ANNUITY': ['max', 'sum'], 'AMT_APPLICATION': ['max', 'mean'], 'AMT_CREDIT': ['sum'], 'AMT_DOWN_PAYMENT': ['max', 'mean'], 'DAYS_DECISION': ['min', 'mean'], 'CNT_PAYMENT': ['mean', 'sum'], 'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'], # Engineered features
'AMT_PAYMENT': ['sum'], 'INSTALMENT_PAYMENT_DIFF': ['mean', 'max'], 'REMAINING_DEBT': ['max', 'mean', 'sum'], 'REPAYMENT_RATIO': ['mean'],
}
PREVIOUS_LATE_PAYMENTS_AGG = { 'DAYS_DECISION': ['min', 'max', 'mean'], 'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'], # Engineered features
'APPLICATION_CREDIT_DIFF': ['min'], 'NAME_CONTRACT_TYPE_Consumer loans': ['mean'], 'NAME_CONTRACT_TYPE_Cash loans': ['mean'], 'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
PREVIOUS_LOAN_TYPE_AGG = { 'AMT_CREDIT': ['sum'], 'AMT_ANNUITY': ['mean', 'max'], 'SIMPLE_INTERESTS': ['min', 'mean', 'max', 'var'], 'APPLICATION_CREDIT_DIFF': ['min', 'var'], 'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'], 'DAYS_DECISION': ['max'], 'DAYS_LAST_DUE_1ST_VERSION': ['max', 'mean'], 'CNT_PAYMENT': ['mean'],
}
PREVIOUS_TIME_AGG = { 'AMT_CREDIT': ['sum'], 'AMT_ANNUITY': ['mean', 'max'], 'SIMPLE_INTERESTS': ['mean', 'max'], 'DAYS_DECISION': ['min', 'mean'], 'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'], # Engineered features
'APPLICATION_CREDIT_DIFF': ['min'], 'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'], 'NAME_CONTRACT_TYPE_Consumer loans': ['mean'], 'NAME_CONTRACT_TYPE_Cash loans': ['mean'], 'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
PREVIOUS_APPROVED_AGG = { 'SK_ID_PREV': ['nunique'], 'AMT_ANNUITY': ['min', 'max', 'mean'], 'AMT_CREDIT': ['min', 'max', 'mean'], 'AMT_DOWN_PAYMENT': ['max'], 'AMT_GOODS_PRICE': ['max'], 'HOUR_APPR_PROCESS_START': ['min', 'max'], 'DAYS_DECISION': ['min', 'mean'], 'CNT_PAYMENT': ['max', 'mean'], 'DAYS_TERMINATION': ['mean'], # Engineered features
'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'], 'APPLICATION_CREDIT_DIFF': ['max'], 'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'], # The following features are only for approved applications
'DAYS_FIRST_DRAWING': ['max', 'mean'], 'DAYS_FIRST_DUE': ['min', 'mean'], 'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'], 'DAYS_LAST_DUE': ['max', 'mean'], 'DAYS_LAST_DUE_DIFF': ['min', 'max', 'mean'], 'SIMPLE_INTERESTS': ['min', 'max', 'mean'],
}
PREVIOUS_REFUSED_AGG = { 'AMT_APPLICATION': ['max', 'mean'], 'AMT_CREDIT': ['min', 'max'], 'DAYS_DECISION': ['min', 'max', 'mean'], 'CNT_PAYMENT': ['max', 'mean'], # Engineered features
'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean', 'var'], 'APPLICATION_CREDIT_RATIO': ['min', 'mean'], 'NAME_CONTRACT_TYPE_Consumer loans': ['mean'], 'NAME_CONTRACT_TYPE_Cash loans': ['mean'], 'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
In [53]
ohe_columns = [ 'NAME_CONTRACT_STATUS', 'NAME_CONTRACT_TYPE', 'CHANNEL_TYPE', 'NAME_TYPE_SUITE', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION', 'NAME_PRODUCT_TYPE', 'NAME_CLIENT_TYPE'] prev, categorical_cols = one_hot_encoder(prev, ohe_columns, nan_as_category= False)In [54]
prev['APPLICATION_CREDIT_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_CREDIT'] prev['APPLICATION_CREDIT_RATIO'] = prev['AMT_APPLICATION'] / prev['AMT_CREDIT'] prev['CREDIT_TO_ANNUITY_RATIO'] = prev['AMT_CREDIT']/prev['AMT_ANNUITY'] prev['DOWN_PAYMENT_TO_CREDIT'] = prev['AMT_DOWN_PAYMENT'] / prev['AMT_CREDIT'] total_payment = prev['AMT_ANNUITY'] * prev['CNT_PAYMENT'] prev['SIMPLE_INTERESTS'] = (total_payment/prev['AMT_CREDIT'] - 1)/prev['CNT_PAYMENT']In [55]
approved = prev[prev['NAME_CONTRACT_STATUS_Approved'] == 1]
active_df = approved[approved['DAYS_LAST_DUE'] == 365243]
active_pay = pay[pay['SK_ID_PREV'].isin(active_df['SK_ID_PREV'])]
active_pay_agg = active_pay.groupby('SK_ID_PREV')[['AMT_INSTALMENT', 'AMT_PAYMENT']].sum()
active_pay_agg.reset_index(inplace= True)
active_pay_agg['INSTALMENT_PAYMENT_DIFF'] = active_pay_agg['AMT_INSTALMENT'] - active_pay_agg['AMT_PAYMENT']
active_df = active_df.merge(active_pay_agg, on= 'SK_ID_PREV', how= 'left')
active_df['REMAINING_DEBT'] = active_df['AMT_CREDIT'] - active_df['AMT_PAYMENT']
active_df['REPAYMENT_RATIO'] = active_df['AMT_PAYMENT'] / active_df['AMT_CREDIT']
active_agg_df = group(active_df, 'PREV_ACTIVE_', PREVIOUS_ACTIVE_AGG)
active_agg_df['TOTAL_REPAYMENT_RATIO'] = active_agg_df['PREV_ACTIVE_AMT_PAYMENT_SUM']/\
active_agg_df['PREV_ACTIVE_AMT_CREDIT_SUM']del active_pay, active_pay_agg, active_df; gc.collect()
0In [56]
prev['DAYS_FIRST_DRAWING'].replace(365243, np.nan, inplace= True) prev['DAYS_FIRST_DUE'].replace(365243, np.nan, inplace= True) prev['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace= True) prev['DAYS_LAST_DUE'].replace(365243, np.nan, inplace= True) prev['DAYS_TERMINATION'].replace(365243, np.nan, inplace= True)In [57]
prev['DAYS_LAST_DUE_DIFF'] = prev['DAYS_LAST_DUE_1ST_VERSION'] - prev['DAYS_LAST_DUE'] approved['DAYS_LAST_DUE_DIFF'] = approved['DAYS_LAST_DUE_1ST_VERSION'] - approved['DAYS_LAST_DUE']In [58]
categorical_agg = {key: ['mean'] for key in categorical_cols}
In [59]
agg_prev = group(prev, 'PREV_', {**PREVIOUS_AGG, **categorical_agg})
agg_prev = agg_prev.merge(active_agg_df, how='left', on='SK_ID_CURR')del active_agg_df; gc.collect()
0In [60]
agg_prev = group_and_merge(approved, agg_prev, 'APPROVED_', PREVIOUS_APPROVED_AGG) refused = prev[prev['NAME_CONTRACT_STATUS_Refused'] == 1] agg_prev = group_and_merge(refused, agg_prev, 'REFUSED_', PREVIOUS_REFUSED_AGG)del approved, refused; gc.collect()
0In [61]
for loan_type in ['Consumer loans', 'Cash loans']:
type_df = prev[prev['NAME_CONTRACT_TYPE_{}'.format(loan_type)] == 1]
prefix = 'PREV_' + loan_type.split(" ")[0] + '_'
agg_prev = group_and_merge(type_df, agg_prev, prefix, PREVIOUS_LOAN_TYPE_AGG) del type_df; gc.collect()
In [62]
pay['LATE_PAYMENT'] = pay['DAYS_ENTRY_PAYMENT'] - pay['DAYS_INSTALMENT'] pay['LATE_PAYMENT'] = pay['LATE_PAYMENT'].apply(lambda x: 1 if x > 0 else 0) dpd_id = pay[pay['LATE_PAYMENT'] > 0]['SK_ID_PREV'].unique()In [63]
agg_dpd = group_and_merge(prev[prev['SK_ID_PREV'].isin(dpd_id)], agg_prev, 'PREV_LATE_', PREVIOUS_LATE_PAYMENTS_AGG)del agg_dpd, dpd_id; gc.collect()
0In [64]
for time_frame in [12, 24]:
time_frame_df = prev[prev['DAYS_DECISION'] >= -30*time_frame]
prefix = 'PREV_LAST{}M_'.format(time_frame)
agg_prev = group_and_merge(time_frame_df, agg_prev, prefix, PREVIOUS_TIME_AGG) del time_frame_df; gc.collect()del prev; gc.collect()
0In [65]
df = pd.merge(df, agg_prev, on='SK_ID_CURR', how='left')In [66]
train = df[df['TARGET'].notnull()] test = df[df['TARGET'].isnull()]del df gc.collect()
98In [67]
labels = train['TARGET'] test_lebels=test['TARGET'] train = train.drop(columns=['TARGET']) test = test.drop(columns=['TARGET'])In [68]
feature = list(train.columns) train.replace([np.inf, -np.inf], np.nan, inplace=True) test.replace([np.inf, -np.inf], np.nan, inplace=True) test_df = test.copy() train_df = train.copy() train_df['TARGET'] = labels test_df['TARGET'] = test_lebelsIn [69]
imputer = SimpleImputer(strategy = 'median') imputer.fit(train) imputer.fit(test) train = imputer.transform(train) test = imputer.transform(test)In [70]
scaler = MinMaxScaler(feature_range = (0, 1)) scaler.fit(train) scaler.fit(test) train = scaler.transform(train) test = scaler.transform(test)In [71]
from lightgbm import LGBMClassifier lgbmc = LGBMClassifier() lgbmc.fit(train, labels)
LGBMClassifier()In [72]
lgbm_pred = lgbmc.predict_proba(test)[:, 1]In [74]
submit = test_df[['SK_ID_CURR']] submit['TARGET'] = lgbm_predIn [75]
submit.to_csv('lgbm.csv', index = False)
总结
数据的提交结果如下:(提交需要科学上网)
| 数据集 | Home Credit Default Risk |
|---|---|
| 线上评分 | 0.78277 |
以上就是【金融风控系列】_[3]_贷款违约识别的详细内容,更多请关注其它相关文章!
# python
# ai
# seo吸引人眼睛
# 百度营销推广培训班长沙
# 考试院网站建设
# seo黑帽技术博客
# 官网
# 数据库中
# 线上
# 中文网
# 美图
# 包含了
# 当日
# 邮箱
# red
# 征信
# follow
# udio
# whee
# descript
# type
# 不匹配
# 一言
# 中国铁路网站建设
# 万家福营销推广方案设计
# 营销产品策略推广方法
# 网站优化口号
# 推广网站架构怎么做
# 榆次网站建设推广公司
相关栏目:
【
行业资讯67740 】
【
技术百科0 】
【
网络运营39195 】
相关推荐:
typescript和node学哪个
苹果16有哪些bug
linux下如何重定位命令
苹果电脑如何输入命令
电动车仪表盘上的power是什么意思
虚拟机如何用命令清除垃圾
折叠屏手机为什么这么小
单片机计时程序怎么写
oracle中datediff函数怎么用 Oracle中DATEDIFF函数详解
微信最多可以加多少好友
calm是什么意思
春运抢票准备什么东西
vi命令如何退出
美食音乐每日推荐怎么写
docs命令如何进入d
征信不好如何短期恢复
复制 命令如何撤销
苹果手机16系统有哪些
typescript在浏览器里怎么用
如何安装m.2固态硬盘
如何在一串数字前面去掉四位数的命令
65寸电视长宽多少厘米
linux如何跳回命令行界面
酷狗音乐pc版的每日推荐在哪 酷狗音乐PC版每日推荐查找指南
win7旗舰版wifi怎么打开
春运抢票可以抢几张
自己如何安装固态硬盘
shell如何执行sql脚本命令行
硬盘和固态硬盘如何区分
春运抢票技巧攻略
命令行ftp如何创建目录
typescript为什么现在才火
怎么更新typescript
充电器上的power是什么意思
33000日元等于多少人民币
j*a数组对象怎么取
typescript中文怎么读
linux如何调出命令行
市盈率ttm是什么意思
为什么用typescript
typescript怎么用
智能锁type-c接口是什么
显卡上面TYPE-C是什么接口
市盈率静是什么意思
bored是什么意思
typescript参数怎么用
夸克po什么意思
js怎么设置typescript
grub命令如何进dos
j*a怎么读取char数组


2025-07-22
浏览次数:次
返回列表
_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'DEBT_PERCENTAGE', 'DEBT_CREDIT_DIFF', 'STATUS_0', 'STATUS_12345']
agg_length = bureau.groupby('MONTHS_BALANCE_SIZE')[features].mean().reset_index()
agg_length.rename({feat: 'LL_' + feat for feat in features}, axis=1, inplace=True)
bureau = bureau.merge(agg_length, how='left', on='MONTHS_BALANCE_SIZE')del agg_length; gc.collect()