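Top-1 Solution! 2021 iFLYTEK Car Loan Default Prediction Competition
1. Introduction
Hello, everyone. I'm wangli from the team “摸魚打比賽” (roughly, “slacking off to enter competitions”). A bit about myself first: I'm a self-taught algorithm engineer who switched into the field partway. The team name comes from the fact that back in May and June I was in the middle of a business handover and not that busy. It occurred to me that I had been out of school for two years. I had kept following the competition scene and would get itchy hands whenever I saw an interesting problem, but as an office worker I only wanted to lie flat after work, so I had never properly entered a competition since graduating. Some late nights I would even ask myself, “Lian Po is old; can he still eat his fill?” So I decided to use this quiet stretch to compete properly for once. I entered this competition, was lucky enough to take first place in this small car-loan contest, and also made some friends along the way, such as my teammate Chen, who is still in graduate school, and my friend Cui, who is busy with autumn recruiting. A rewarding experience all around.
Next, I'll walk through my approach to this competition and some of the lessons I took away.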
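2. Competition Background
Competition link: https://challenge.xfyun.cn/topic/info?type=car-loan
As the page shows, the task is car loan default prediction: participants need to build a risk-identification model that predicts which borrowers are likely to default. Compared with other competitions, this one is not that hard, for two reasons:
Task difficulty: it is a very traditional risk-control overdue-prediction task, a binary classification problem, and code from many other competitions can be reused with minor changes.
Level of competition: the prize money is modest, so not many participants entered.
Personally, I worked on this on the side early on while competing in the product recommendation competition (under the same ID, “摸魚打比賽”), and in the last few days I put real effort into mining some features. (Speaking of that user-profile-based product recommendation competition, I'm a bit embarrassed. Early on I felt I could hold my own and was in the Top 3 for a while, but from August onward work got too busy and I haven't submitted anything since the second round began. In the end my time management just wasn't good enough. I'll see whether I can find time during the National Day holiday to work on it again.)
Back to this competition:
The data volume is decent:
Training set: 150k samples; test set: 30k samples
52 feature fields, each with an explanation provided by the organizers
Evaluation metric: F1 Score
So a baseline can be written very quickly, which makes this a fairly friendly competition for newcomers to data work.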
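3. Solution Overview
In data-mining competitions like this one, the key is abstracting useful features from an understanding of the data. So from the start I didn't reach for fancy models; instead I built features by analyzing the data. If you'd rather not read the code further down, this section summarizes my whole solution:
Class distribution: the two classes are split roughly 82:18, which by risk-control standards is already fairly balanced. So I didn't specifically target class imbalance during the competition. I did try some oversampling, but it lowered the score rather than raising it.
Feature engineering:
Model selection:
Threshold selection: because the competition is scored by F1 Score, we have to choose a threshold ourselves to decide which samples are predicted positive and which negative. After trying different schemes, we settled on picking the threshold that performs best on the out-of-fold (oof) predictions; this gave the best leaderboard result (an improvement in the third decimal place).
Ensembling strategy: in the end I blended two models, LightGBM and XGBoost (haha, pretty old-school, right?). A weighted blend of the raw predicted probabilities worked only moderately well, while converting each model's predictions to rank percentiles first and then taking a weighted blend worked better. In terms of scores, the best single LightGBM model reached 0.5892 and XGBoost about 0.5872; the best probability-weighted blend reached 0.59011 and the best rank-weighted blend 0.59038.
That is the gist of the approach. It reads a bit dry, though, so if you're also interested in the code, read on. Talk is cheap :)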
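As a toy illustration of the two ideas above (rank-percentile blending, then picking the F1-maximizing threshold on out-of-fold scores), here is a minimal sketch. The scores, the equal weights, and the helper `f1_binary` are made up for illustration and are not taken from the competition code; the hand-written F1 is for the positive class only.

```python
import numpy as np
import pandas as pd

def f1_binary(y_true, y_pred):
    """F1 of the positive class computed from confusion counts (illustrative helper)."""
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical out-of-fold scores from two models that live on different scales.
y = np.array([0, 0, 1, 0, 1, 1])
model_a = pd.Series([0.10, 0.20, 0.35, 0.30, 0.80, 0.90])
model_b = pd.Series([0.02, 0.01, 0.03, 0.04, 0.05, 0.06])

# Rank percentiles put both models on the same (0, 1] scale before blending.
rank_a = model_a.rank(pct=True)
rank_b = model_b.rank(pct=True)
blend = 0.5 * rank_a + 0.5 * rank_b  # equal weights are an arbitrary choice here

# Scan candidate thresholds and keep the one with the highest F1.
candidates = np.arange(0.05, 1.0, 0.05)
scores = [f1_binary(y, (blend > t).astype(int).to_numpy()) for t in candidates]
best_threshold = candidates[int(np.argmax(scores))]
print(best_threshold, max(scores))
```

Because `model_b` outputs much smaller numbers than `model_a`, averaging raw probabilities would let `model_a` dominate; after the rank transform each model contributes only its ordering of the samples.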
4. Implementation & Code Walkthrough
4.1 Feature Engineering
Target encoding / mean encoding. Note that, to prevent overfitting, the encoding has to be computed fold by fold:
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Features to target-encode:
TARGET_ENCODING_FETAS = [
    'employment_type',
    'branch_id',
    'supplier_id',
    'manufacturer_id',
    'area_id',
    'employee_code_id',
    'asset_cost_bin'
]

# Implementation:
def gen_target_encoding_feats(train, test, encode_cols, target_col, n_fold=10):
    '''Generate target-encoding features.'''
    # Training set: encode each validation fold with means from the other folds
    tg_feats = np.zeros((train.shape[0], len(encode_cols)))
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    for train_index, val_index in kfold.split(train[encode_cols], train[target_col]):
        df_train, df_val = train.iloc[train_index], train.iloc[val_index]
        for idx, col in enumerate(encode_cols):
            target_mean_dict = df_train.groupby(col)[target_col].mean()
            # Write straight into the array rather than adding a column to the df_val slice
            tg_feats[val_index, idx] = df_val[col].map(target_mean_dict).values
    for idx, encode_col in enumerate(encode_cols):
        train[f'{encode_col}_mean_target'] = tg_feats[:, idx]
    # Test set: use means computed on the full training set
    for col in encode_cols:
        target_mean_dict = train.groupby(col)[target_col].mean()
        test[f'{col}_mean_target'] = test[col].map(target_mean_dict)
    return train, test
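One detail worth knowing about mapping per-category means: a category that never appears in the data used to compute the means maps to NaN. Gradient-boosted trees generally cope with NaN, so leaving it is a valid choice; the sketch below shows an optional fallback to the global mean. The toy frames and the names `fit_part`, `apply_part`, and `prior` are illustrative assumptions, not part of the competition code.

```python
import pandas as pd

# Toy data: branch 3 appears only in the part being encoded.
fit_part = pd.DataFrame({'branch_id': [1, 1, 2, 2, 2],
                         'loan_default': [1, 0, 0, 0, 1]})
apply_part = pd.DataFrame({'branch_id': [1, 2, 3]})

prior = fit_part['loan_default'].mean()                       # global default rate
means = fit_part.groupby('branch_id')['loan_default'].mean()  # per-category rate

encoded_raw = apply_part['branch_id'].map(means)  # unseen category becomes NaN
encoded_filled = encoded_raw.fillna(prior)        # optional fallback to the prior

print(encoded_raw.tolist())
print(encoded_filled.tolist())
```

Whether the fallback helps is an empirical question; it mainly matters for models that cannot handle missing values.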
Annual interest rate and binning features:
import pandas as pd

def gen_new_feats(train, test):
    '''Generate new features, such as annual interest rates and binned features.'''
    # Step 1: concatenate the training and test sets
    data = pd.concat([train, test])
    # Step 2: feature engineering
    # Annual interest rate of the sub-account: (total repayment - principal) / principal
    data['sub_Rate'] = (data['sub_account_monthly_payment'] * data['sub_account_tenure']
                        - data['sub_account_sanction_loan']) / data['sub_account_sanction_loan']
    # Annual interest rate of the main account
    data['main_Rate'] = (data['main_account_monthly_payment'] * data['main_account_tenure']
                         - data['main_account_sanction_loan']) / data['main_account_sanction_loan']
    # Bin some of the features
    # Equal-width binning
    loan_to_asset_ratio_labels = list(range(10))
    data['loan_to_asset_ratio_bin'] = pd.cut(data['loan_to_asset_ratio'], 10, labels=loan_to_asset_ratio_labels)
    # Equal-frequency binning
    data['asset_cost_bin'] = pd.qcut(data['asset_cost'], 10, labels=loan_to_asset_ratio_labels)
    # Custom bin edges
    amount_cols = [
        'total_monthly_payment',
        'main_account_sanction_loan',
        'main_account_disbursed_loan',
        'sub_account_sanction_loan',
        'sub_account_disbursed_loan',
        'main_account_monthly_payment',
        'sub_account_monthly_payment',
        'total_sanction_loan'
    ]
    amount_labels = list(range(10))
    for col in amount_cols:
        # The last edge is the column maximum, so this assumes every column's
        # maximum exceeds 3,000,000 (pd.cut needs strictly increasing edges)
        total_monthly_payment_bin = [-1, 5000, 10000, 30000, 50000, 100000, 300000, 500000, 1000000, 3000000, data[col].max()]
        data[col + '_bin'] = pd.cut(data[col], total_monthly_payment_bin, labels=amount_labels).astype(int)
    # Step 3: split back into training and test sets, now with the new features
    # (test rows are the ones whose loan_default is missing after the concat)
    return data[data['loan_default'].notnull()], data[data['loan_default'].isnull()]
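The function above uses both `pd.cut` (equal-width) and `pd.qcut` (equal-frequency). A small sketch on made-up, skewed values shows how differently the two behave; the numbers are purely illustrative.

```python
import pandas as pd

# Skewed toy values: seven small amounts and one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width: the range 1..100 is split into 4 intervals of equal length,
# so the outlier stretches the bins and almost everything lands in bin 0.
equal_width = pd.cut(values, 4, labels=False)

# Equal-frequency: bin edges are quantiles, so each bin holds about the same count.
equal_freq = pd.qcut(values, 4, labels=False)

print(equal_width.tolist())  # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_freq.tolist())   # [0, 0, 1, 1, 2, 2, 3, 3]
```

For heavy-tailed monetary columns, equal-width bins tend to waste most of their resolution on the tail, which is one plausible reason to prefer quantile or hand-picked edges there.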
Neighbor fraud feature (the default rate among the 10 neighboring IDs before and after each ID; one could search for the best neighborhood size, but I had limited energy, haha):
import os
from tqdm import tqdm

def gen_neighbor_feats(train, test):
    '''Generate the neighbor default-rate feature.'''
    if not os.path.exists('../user_data/neighbor_default_probs.pkl'):
        # This feature takes a long time to compute, so it is cached as a pkl file
        # (save_pkl / load_pkl are helper functions not defined in this article)
        train_ids = set(train.customer_id.values.tolist())  # set gives O(1) membership checks
        default_by_id = train.set_index('customer_id').loan_default
        neighbor_default_probs = []
        for i in tqdm(range(train.customer_id.max())):
            if i >= 10 and i < 199706:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, i + 10))
            elif i < 199706:
                customer_id_neighbors = list(range(0, i)) + list(range(i + 1, i + 10))
            else:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, 199706))
            # Keep only neighbors that exist in the training set (only those have labels)
            customer_id_neighbors = [customer_id_neighbor for customer_id_neighbor in customer_id_neighbors if
                                     customer_id_neighbor in train_ids]
            neighbor_default_prob = default_by_id.loc[customer_id_neighbors].mean()
            neighbor_default_probs.append(neighbor_default_prob)
        df_neighbor_default_prob = pd.DataFrame({'customer_id': range(0, train.customer_id.max()),
                                                 'neighbor_default_prob': neighbor_default_probs})
        save_pkl(df_neighbor_default_prob, '../user_data/neighbor_default_probs.pkl')
    else:
        df_neighbor_default_prob = load_pkl('../user_data/neighbor_default_probs.pkl')
    train = pd.merge(left=train, right=df_neighbor_default_prob, on='customer_id', how='left')
    test = pd.merge(left=test, right=df_neighbor_default_prob, on='customer_id', how='left')
    return train, test
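Since the loop above is slow, here is a hedged sketch of a vectorized alternative based on prefix sums. It is not the author's implementation: the function name `neighbor_default_rate` and its parameters are hypothetical, and it assumes customer IDs are non-negative integers. The defaults `before=10, after=9` mirror the ranges used in the loop (`range(i - 10, i)` and `range(i + 1, i + 10)`).

```python
import numpy as np

def neighbor_default_rate(train_ids, train_labels, query_ids, before=10, after=9):
    """Mean training label over IDs in [q - before, q - 1] and [q + 1, q + after]."""
    train_ids = np.asarray(train_ids, dtype=np.int64)
    train_labels = np.asarray(train_labels, dtype=float)
    query_ids = np.asarray(query_ids, dtype=np.int64)

    # Dense per-ID totals; the array is padded so window ends never run off it.
    size = int(max(train_ids.max(), query_ids.max())) + after + 2
    label_sum = np.zeros(size)
    count = np.zeros(size)
    np.add.at(label_sum, train_ids, train_labels)
    np.add.at(count, train_ids, 1.0)

    # Prefix sums with a leading zero: prefix[k] is the total over IDs < k.
    prefix_label = np.concatenate(([0.0], np.cumsum(label_sum)))
    prefix_count = np.concatenate(([0.0], np.cumsum(count)))

    lo = np.clip(query_ids - before, 0, None)
    hi = query_ids + after + 1  # exclusive upper end of the window
    # Window total minus the query ID's own contribution.
    win_label = prefix_label[hi] - prefix_label[lo] - label_sum[query_ids]
    win_count = prefix_count[hi] - prefix_count[lo] - count[query_ids]

    rate = np.full(len(query_ids), np.nan)  # NaN when no labeled neighbor exists
    has_neighbors = win_count > 0
    rate[has_neighbors] = win_label[has_neighbors] / win_count[has_neighbors]
    return rate

# Tiny check with a window of 2 IDs on each side.
rates = neighbor_default_rate([0, 1, 2, 3, 5], [1, 0, 1, 0, 1],
                              [2, 5, 0, 10], before=2, after=2)
print(rates)
```

As in the loop, only training labels enter the statistic and each ID's own label is excluded, so the same function can score both training and test IDs.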
In the end I kept only 47 features:
USED_FEATS = [
    'customer_id',
    'neighbor_default_prob',
    'disbursed_amount',
    'asset_cost',
    'branch_id',
    'supplier_id',
    'manufacturer_id',
    'area_id',
    'employee_code_id',
    'credit_score',
    'loan_to_asset_ratio',
    'year_of_birth',
    'age',
    'sub_Rate',
    'main_Rate',
    'loan_to_asset_ratio_bin',
    'asset_cost_bin',
    'employment_type_mean_target',
    'branch_id_mean_target',
    'supplier_id_mean_target',
    'manufacturer_id_mean_target',
    'area_id_mean_target',
    'employee_code_id_mean_target',
    'asset_cost_bin_mean_target',
    'credit_history',
    'average_age',
    'total_disbursed_loan',
    'main_account_disbursed_loan',
    'total_sanction_loan',
    'main_account_sanction_loan',
    'active_to_inactive_act_ratio',
    'total_outstanding_loan',
    'main_account_outstanding_loan',
    'Credit_level',
    'outstanding_disburse_ratio',
    'total_account_loan_no',
    'main_account_tenure',
    'main_account_loan_no',
    'main_account_monthly_payment',
    'total_monthly_payment',
    'main_account_active_loan_no',
    'main_account_inactive_loan_no',
    'sub_account_inactive_loan_no',
    'enquirie_no',
    'main_account_overdue_no',
    'total_overdue_no',
    'last_six_month_defaulted_no'
]
4.2 Model Training
LightGBM (10-fold works better)
import logging
import lightgbm as lgb

def train_lgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''Train LightGBM with a stratified k-fold split.'''
    gbms = []
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))
    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        X_tr, X_val, y_tr, y_val = X_train.iloc[train_index], X_train.iloc[val_index], y_train[train_index], y_train[val_index]
        dtrain = lgb.Dataset(X_tr, y_tr)
        dvalid = lgb.Dataset(X_val, y_val, reference=dtrain)
        params = {
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 64,
            'learning_rate': 0.02,
            'min_data_in_leaf': 150,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.7,
            'n_jobs': -1,
            'seed': 1024
        }
        gbm = lgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        valid_sets=[dtrain, dvalid],
                        # Callbacks replace the verbose_eval / early_stopping_rounds
                        # arguments, which newer LightGBM releases no longer accept
                        callbacks=[lgb.log_evaluation(period=50),
                                   lgb.early_stopping(stopping_rounds=20)])
        # Out-of-fold predictions for this fold's validation rows
        oof_preds[val_index] = gbm.predict(X_val, num_iteration=gbm.best_iteration)
        # Average the test predictions over all folds
        test_preds += gbm.predict(X_test, num_iteration=gbm.best_iteration) / kfold.n_splits
        gbms.append(gbm)
    return gbms, oof_preds, test_preds
XGBoost
import xgboost as xgb

def train_xgb_kfold(X_train, y_train, X_test, n_fold=10):
    '''Train XGBoost with a stratified k-fold split.'''
    gbms = []
    # The original hard-coded 10 folds and ignored n_fold; the default of 10 keeps that behavior
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))
    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        X_tr, X_val, y_tr, y_val = X_train.iloc[train_index], X_train.iloc[val_index], y_train[train_index], y_train[val_index]
        dtrain = xgb.DMatrix(X_tr, y_tr)
        dvalid = xgb.DMatrix(X_val, y_val)
        dtest = xgb.DMatrix(X_test)
        params = {
            'booster': 'gbtree',
            'objective': 'binary:logistic',
            'eval_metric': ['logloss', 'auc'],
            'max_depth': 8,
            'subsample': 0.9,
            'min_child_weight': 10,
            'colsample_bytree': 0.85,
            'lambda': 10,
            'eta': 0.02,
            'seed': 1024
        }
        watchlist = [(dtrain, 'train'), (dvalid, 'test')]
        gbm = xgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        evals=watchlist,
                        verbose_eval=50,
                        early_stopping_rounds=20)
        # best_iteration is a 0-based index, so + 1 makes the range include the best round
        oof_preds[val_index] = gbm.predict(dvalid, iteration_range=(0, gbm.best_iteration + 1))
        test_preds += gbm.predict(dtest, iteration_range=(0, gbm.best_iteration + 1)) / kfold.n_splits
        gbms.append(gbm)
    return gbms, oof_preds, test_preds
4.3 Model Ensembling and Threshold Selection
from sklearn.metrics import f1_score

def gen_submit_file(df_test, test_preds, thres, save_path):
    '''Binarize the test predictions with the threshold and write the submission file.'''
    df_test['test_preds_binary'] = np.where(test_preds > thres, 1, 0)
    df_test_submit = df_test[['customer_id', 'test_preds_binary']]
    df_test_submit.columns = ['customer_id', 'loan_default']
    print(f'saving result to: {save_path}')
    df_test_submit.to_csv(save_path, index=False)
    print('done!')
    return df_test_submit

def gen_thres_new(df_train, oof_preds):
    '''Pick the threshold that maximizes macro F1 on the out-of-fold predictions.'''
    df_train['oof_preds'] = oof_preds
    # Start from the score quantile that matches the observed default rate
    quantile_point = df_train['loan_default'].mean()
    thres = df_train['oof_preds'].quantile(1 - quantile_point)
    # Scan thresholds within +/- 0.2 of the starting point in steps of 0.01
    _thresh = []
    for thres_item in np.arange(thres - 0.2, thres + 0.2, 0.01):
        _thresh.append(
            [thres_item, f1_score(df_train['loan_default'], np.where(oof_preds > thres_item, 1, 0), average='macro')])
    _thresh = np.array(_thresh)
    best_id = _thresh[:, 1].argmax()
    best_thresh = _thresh[best_id][0]
    print("Threshold: {}\nF1 on the training set: {}".format(best_thresh, _thresh[best_id][1]))
    return best_thresh

# Results
df_oof_res = pd.DataFrame({'customer_id': train['customer_id'],
                           'oof_preds_xgb': oof_preds_xgb,
                           'oof_preds_lgb': oof_preds_lgb,
                           'loan_default': train['loan_default']
                           })
# Model ensembling: weighted blend of rank percentiles
df_oof_res['xgb_rank'] = df_oof_res['oof_preds_xgb'].rank(pct=True)
df_oof_res['lgb_rank'] = df_oof_res['oof_preds_lgb'].rank(pct=True)
df_oof_res['preds'] = 0.31 * df_oof_res['xgb_rank'] + 0.69 * df_oof_res['lgb_rank']
# Find the best threshold on the blended out-of-fold predictions
thres = gen_thres_new(df_oof_res, df_oof_res['preds'])
# Apply the same rank transform and weights to the test predictions
df_test_res = pd.DataFrame({'customer_id': test['customer_id'],
                            'test_preds_xgb': test_preds_xgb,
                            'test_preds_lgb': test_preds_lgb})
df_test_res['xgb_rank'] = df_test_res['test_preds_xgb'].rank(pct=True)
df_test_res['lgb_rank'] = df_test_res['test_preds_lgb'].rank(pct=True)
df_test_res['preds'] = 0.31 * df_test_res['xgb_rank'] + 0.69 * df_test_res['lgb_rank']
# Write the submission
df_submit = gen_submit_file(df_test_res, df_test_res['preds'], thres,
                            save_path='../prediction_result/result.csv')