
Top 1 Solution! The 2021 iFLYTEK Car Loan Default Prediction Competition

2021-10-26 09:24

Author: 望尼瑪, Zhejiang University, outstanding Datawhale competitor
Zhihu | https://www.zhihu.com/people/lin-a-bi-78/posts

1. Introduction

Hello everyone, I'm wangli from the team 摸魚打比賽 (roughly, "slacking off at work to enter competitions"). A quick introduction first: I'm a self-taught algorithm engineer who switched into the field midway. The team name comes from the fact that around May and June I was in the middle of a work handover and not particularly busy. It struck me that I had already been out of school for two years. I had kept following the competition scene, and my hands itched whenever I saw an interesting problem, but after-work inertia always won and I had never competed seriously since graduating. Late at night I would occasionally ask myself, like the aging general Lian Po, whether I still had it in me. So I decided to make good use of this quieter stretch and enter a competition properly. I ended up joining this one, was lucky enough to take first place in this small car-loan challenge, and made some good friends along the way, such as my teammate Chen, who is still in graduate school, and Cui, who was busy with autumn recruiting. All in all, a rewarding experience.

In what follows, I'll walk through my concrete approach to the problem and some takeaways from the competition.

2. Competition Background

Competition link: https://challenge.xfyun.cn/topic/info?type=car-loan

As you can see, the task is car-loan default prediction: based on the data provided, participants need to build a risk-identification model that predicts which borrowers are likely to default. Compared with other competitions, this one is not especially hard, for two reasons:

• Problem difficulty: it is a very traditional risk-control default-prediction task, i.e., a binary classification problem, so code from many other competitions can be reused with only minor changes;
• Level of competition: the prize money is modest, so there were not many participants.

I was mainly focused on the product recommendation competition at the time (under the same 摸魚打比賽 team ID) and worked on this one on the side, only digging seriously into features in the last few days. (Speaking of that user-profile-based recommendation competition, I'm a little embarrassed: early on I felt I could contend and was in the Top 3 for a while, but from August onward work got too busy and I never submitted once the semifinal started. It comes down to my poor time management; I'll see whether I can find time over the National Day holiday to pick it up again.)

Back to this competition:

• The data size is decent: 150k training samples and 30k test samples
• There are 52 feature fields, each with a description provided by the organizers
• Evaluation metric: F1 score

So a baseline can be put together very quickly, which makes this a fairly friendly competition for newcomers to data competitions. A minimal baseline sketch follows.
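To make that concrete, here is a minimal baseline sketch. The file names train.csv/test.csv are assumptions (adjust them to the actual competition files); the label column loan_default and the id column customer_id match the code later in this post, and only numeric columns are used so the snippet runs without any extra preprocessing.

# Minimal baseline sketch (file names are assumed, not the competition's official layout)
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

train = pd.read_csv('train.csv')   # assumed path
test = pd.read_csv('test.csv')     # assumed path

# Keep only numeric columns so the snippet runs without extra preprocessing
feats = [c for c in train.select_dtypes('number').columns
         if c not in ('customer_id', 'loan_default')]

X_tr, X_val, y_tr, y_val = train_test_split(
    train[feats], train['loan_default'], test_size=0.2,
    random_state=1024, stratify=train['loan_default'])

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], eval_metric='auc')

val_pred = (clf.predict_proba(X_val)[:, 1] > 0.5).astype(int)  # naive 0.5 threshold
print('validation F1:', f1_score(y_val, val_pred))

The rest of this post is essentially about pushing such a baseline up with feature engineering, threshold tuning, and blending.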

3. Solution Overview

The key to this kind of data-mining competition is distilling useful features out of an understanding of the data. So when I started, I didn't reach for fancy models; I built features through data analysis instead. If you'd rather not read the code further down, this section summarizes the whole approach:

• Class distribution: the class ratio is roughly 82:18, which by risk-control standards is already fairly balanced. So I didn't make a deliberate effort to handle class imbalance; I did try some oversampling, but it hurt rather than helped (a rough sketch of what such an attempt might look like appears right after this list).

• Feature engineering:

• First, I noticed early on that there are many ID-type features, so I built target-encoding features from them. These simple features plus a tree model already scored 0.583, which kept me in the Top 10 in the early stage;
  Next, from a business angle, I constructed features such as the annual interest rates of the main account and the sub account (a bank's rate usually reflects its own assessment of the customer's credit); from a data-distribution angle, I binned some of the amount-type features; and, looking at feature usefulness and redundancy, I dropped features that carry no information, such as the loan date. That brought us to about 0.587;
  Then, in one accidental training run, I fed the customer ID into the model by mistake and noticed that it actually seemed to improve performance a little. My interpretation: fraud tends to cluster. Organized fraud may concentrate at certain lending branches (where) or lending times (when); the former shows up through branch_id/supplier_id/manufacturer_id, while customer_id itself can reflect concentration in time. Based on this, I built nearest-neighbor fraud features, which brought us to about 0.589;
• Model selection:

• Early on I used LightGBM throughout, without any careful hyperparameter tuning (no hyperopt/optuna or similar tools, just casual manual tuning);
  Later I tried XGBoost/CatBoost/TabNet and others, but CatBoost and TabNet didn't perform well, so I didn't dig deeper (I still had a day job, so energy was limited; "slacking off to compete" was really more like "staying up late to compete").
• Threshold selection: since the competition is scored by F1, we have to pick a threshold ourselves to decide which samples are predicted positive and which negative. After trying different schemes, we settled on picking, from the out-of-fold (oof) predictions, the threshold that performs best on the oof set; this also gave the best leaderboard score (an improvement at the third decimal place).

• Ensembling: in the end we blended two models, LightGBM and XGBoost (yes, nothing fancy). Directly averaging the predicted probabilities with weights was mediocre; converting each model's predictions to percentile ranks first and then weighting them worked better. For reference, the best single LightGBM model scored 0.5892 and XGBoost about 0.5872; probability-weighted blending peaked at 0.59011, rank-weighted blending at 0.59038.
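As an aside on the oversampling attempt mentioned above: this write-up doesn't record exactly which scheme was tried, but a minimal sketch of one common approach, using imbalanced-learn's RandomOverSampler on the training folds only (never on validation data), might look like this:

# Hypothetical oversampling sketch, not the exact scheme used in the competition
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=1024)
# X_tr / y_tr stand for one training fold; the validation fold must stay untouched,
# otherwise the oof F1 estimate becomes optimistic.
X_tr_res, y_tr_res = ros.fit_resample(X_tr, y_tr)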

That is essentially the whole approach. Plain text always reads a little dry, so if you're interested in the code, keep reading. After all, talk is cheap. :)

4. Implementation & Code Walkthrough

4.1 Feature Engineering

• Target encoding / mean encoding. Note that to prevent overfitting, the encoding has to be computed fold by fold:
# Features used for target encoding:
TARGET_ENCODING_FETAS = [
    'employment_type',
    'branch_id',
    'supplier_id',
    'manufacturer_id',
    'area_id',
    'employee_code_id',
    'asset_cost_bin'  # created by gen_new_feats below, so run that step first
]

# Implementation:
def gen_target_encoding_feats(train, test, encode_cols, target_col, n_fold=10):
    '''Generate target-encoding features.'''
    # For the training set: out-of-fold encoding to avoid target leakage
    tg_feats = np.zeros((train.shape[0], len(encode_cols)))
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    for _, (train_index, val_index) in enumerate(kfold.split(train[encode_cols], train[target_col])):
        df_train, df_val = train.iloc[train_index], train.iloc[val_index]
        for idx, col in enumerate(encode_cols):
            # mean target per category, computed on the other folds only
            target_mean_dict = df_train.groupby(col)[target_col].mean()
            df_val[f'{col}_mean_target'] = df_val[col].map(target_mean_dict)
            tg_feats[val_index, idx] = df_val[f'{col}_mean_target'].values

    for idx, encode_col in enumerate(encode_cols):
        train[f'{encode_col}_mean_target'] = tg_feats[:, idx]

    # For the test set: encode with statistics from the full training set
    for col in encode_cols:
        target_mean_dict = train.groupby(col)[target_col].mean()
        test[f'{col}_mean_target'] = test[col].map(target_mean_dict)

    return train, test
• Annual-rate features, binning, and the like:
def gen_new_feats(train, test):
    '''Generate new features: annual rates, binned features, etc.'''
    # Step 1: concatenate the training and test sets
    data = pd.concat([train, test])

    # Step 2: feature engineering
    # Annual rate of the sub account
    data['sub_Rate'] = (data['sub_account_monthly_payment'] * data['sub_account_tenure'] - data[
        'sub_account_sanction_loan']) / data['sub_account_sanction_loan']

    # Annual rate of the main account
    data['main_Rate'] = (data['main_account_monthly_payment'] * data['main_account_tenure'] - data[
        'main_account_sanction_loan']) / data['main_account_sanction_loan']

    # Bin selected features
    # Equal-width binning
    loan_to_asset_ratio_labels = [i for i in range(10)]
    data['loan_to_asset_ratio_bin'] = pd.cut(data["loan_to_asset_ratio"], 10, labels=loan_to_asset_ratio_labels)
    # Equal-frequency binning
    data['asset_cost_bin'] = pd.qcut(data['asset_cost'], 10, labels=loan_to_asset_ratio_labels)
    # Custom binning for the amount-type features
    amount_cols = [
        'total_monthly_payment',
        'main_account_sanction_loan',
        'main_account_disbursed_loan',
        'sub_account_sanction_loan',
        'sub_account_disbursed_loan',
        'main_account_monthly_payment',
        'sub_account_monthly_payment',
        'total_sanction_loan'
    ]
    amount_labels = [i for i in range(10)]
    for col in amount_cols:
        total_monthly_payment_bin = [-1, 5000, 10000, 30000, 50000, 100000, 300000, 500000, 1000000, 3000000, data[col].max()]
        data[col + '_bin'] = pd.cut(data[col], total_monthly_payment_bin, labels=amount_labels).astype(int)

    # Step 3: return the training and test sets with the new features
    return data[data['loan_default'].notnull()], data[data['loan_default'].isnull()]
• Nearest-neighbor fraud features (the mean default rate of the 10 customer IDs on either side of a given ID; one could experiment with different neighbor counts to find the best window, but energy was limited):
def gen_neighbor_feats(train, test):
    '''Generate nearest-neighbor fraud features.'''
    if not os.path.exists('../user_data/neighbor_default_probs.pkl'):
        # This feature takes a while to compute, so it is cached as a pkl file
        neighbor_default_probs = []
        for i in tqdm(range(train.customer_id.max())):
            if i >= 10 and i <= 199706:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, i + 10))
            elif i < 199706:
                customer_id_neighbors = list(range(0, i)) + list(range(i + 1, i + 10))
            else:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, 199706))

            customer_id_neighbors = [customer_id_neighbor for customer_id_neighbor in customer_id_neighbors if
                                     customer_id_neighbor in train.customer_id.values.tolist()]
            neighbor_default_prob = train.set_index('customer_id').loc[customer_id_neighbors].loan_default.mean()
            neighbor_default_probs.append(neighbor_default_prob)

        df_neighbor_default_prob = pd.DataFrame({'customer_id': range(0, train.customer_id.max()),
                                                 'neighbor_default_prob': neighbor_default_probs})
        save_pkl(df_neighbor_default_prob, '../user_data/neighbor_default_probs.pkl')
    else:
        df_neighbor_default_prob = load_pkl('../user_data/neighbor_default_probs.pkl')
    train = pd.merge(left=train, right=df_neighbor_default_prob, on='customer_id', how='left')
    test = pd.merge(left=test, right=df_neighbor_default_prob, on='customer_id', how='left')

    return train, test

In the end I kept only 47 features:

USED_FEATS = [
    'customer_id',
    'neighbor_default_prob',
    'disbursed_amount',
    'asset_cost',
    'branch_id',
    'supplier_id',
    'manufacturer_id',
    'area_id',
    'employee_code_id',
    'credit_score',
    'loan_to_asset_ratio',
    'year_of_birth',
    'age',
    'sub_Rate',
    'main_Rate',
    'loan_to_asset_ratio_bin',
    'asset_cost_bin',
    'employment_type_mean_target',
    'branch_id_mean_target',
    'supplier_id_mean_target',
    'manufacturer_id_mean_target',
    'area_id_mean_target',
    'employee_code_id_mean_target',
    'asset_cost_bin_mean_target',
    'credit_history',
    'average_age',
    'total_disbursed_loan',
    'main_account_disbursed_loan',
    'total_sanction_loan',
    'main_account_sanction_loan',
    'active_to_inactive_act_ratio',
    'total_outstanding_loan',
    'main_account_outstanding_loan',
    'Credit_level',
    'outstanding_disburse_ratio',
    'total_account_loan_no',
    'main_account_tenure',
    'main_account_loan_no',
    'main_account_monthly_payment',
    'total_monthly_payment',
    'main_account_active_loan_no',
    'main_account_inactive_loan_no',
    'sub_account_inactive_loan_no',
    'enquirie_no',
    'main_account_overdue_no',
    'total_overdue_no',
    'last_six_month_defaulted_no'
]
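To show how these pieces are assumed to fit together before training, here is a rough driver sketch; the authoritative pipeline lives in the GitHub repo linked at the end. Note that gen_new_feats has to run before the target encoding, because it creates asset_cost_bin.

# Assumed wiring of the feature steps above (see the repo for the actual pipeline).
# 'train' and 'test' are the raw DataFrames, loaded as in the baseline sketch earlier.

# 1. Rates and binning (creates 'asset_cost_bin', which the target encoding needs)
train, test = gen_new_feats(train, test)
# 2. Out-of-fold target encoding of the ID-type columns
train, test = gen_target_encoding_feats(train, test, TARGET_ENCODING_FETAS,
                                        target_col='loan_default', n_fold=10)
# 3. Nearest-neighbor fraud feature
train, test = gen_neighbor_feats(train, test)

X_train, y_train = train[USED_FEATS], train['loan_default'].astype(int).values
X_test = test[USED_FEATS]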

4.2 Model Training

• LightGBM (10-fold worked better than 5-fold):
def train_lgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train lightgbm with k-fold split'''
    gbms = []
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))

    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        X_tr, X_val, y_tr, y_val = X_train.iloc[train_index], X_train.iloc[val_index], y_train[train_index], y_train[val_index]
        dtrain = lgb.Dataset(X_tr, y_tr)
        dvalid = lgb.Dataset(X_val, y_val, reference=dtrain)

        params = {
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 64,
            'learning_rate': 0.02,
            'min_data_in_leaf': 150,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.7,
            'n_jobs': -1,
            'seed': 1024
        }

        gbm = lgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        valid_sets=[dtrain, dvalid],
                        verbose_eval=50,
                        early_stopping_rounds=20)

        oof_preds[val_index] = gbm.predict(X_val, num_iteration=gbm.best_iteration)
        test_preds += gbm.predict(X_test, num_iteration=gbm.best_iteration) / kfold.n_splits
        gbms.append(gbm)

    return gbms, oof_preds, test_preds
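One compatibility note: the verbose_eval and early_stopping_rounds keyword arguments above were removed from lgb.train in LightGBM 4.x. On a recent version, an equivalent call uses callbacks:

# Equivalent lgb.train call for LightGBM >= 4.0 (callbacks instead of keyword arguments)
gbm = lgb.train(params,
                dtrain,
                num_boost_round=1000,
                valid_sets=[dtrain, dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=20),
                           lgb.log_evaluation(period=50)])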
• XGBoost:
def train_xgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train xgboost with k-fold split'''
    gbms = []
    # number of folds comes from the n_fold argument
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))

    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        X_tr, X_val, y_tr, y_val = X_train.iloc[train_index], X_train.iloc[val_index], y_train[train_index], y_train[val_index]
        dtrain = xgb.DMatrix(X_tr, y_tr)
        dvalid = xgb.DMatrix(X_val, y_val)
        dtest = xgb.DMatrix(X_test)

        params = {
            'booster': 'gbtree',
            'objective': 'binary:logistic',
            'eval_metric': ['logloss', 'auc'],
            'max_depth': 8,
            'subsample': 0.9,
            'min_child_weight': 10,
            'colsample_bytree': 0.85,
            'lambda': 10,
            'eta': 0.02,
            'seed': 1024
        }

        watchlist = [(dtrain, 'train'), (dvalid, 'test')]

        gbm = xgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        evals=watchlist,
                        verbose_eval=50,
                        early_stopping_rounds=20)

        # +1 so the best iteration itself is included in the prediction
        oof_preds[val_index] = gbm.predict(dvalid, iteration_range=(0, gbm.best_iteration + 1))
        test_preds += gbm.predict(dtest, iteration_range=(0, gbm.best_iteration + 1)) / kfold.n_splits
        gbms.append(gbm)

    return gbms, oof_preds, test_preds
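The two k-fold trainers are then assumed to be invoked roughly as below; this is where the oof_preds_* and test_preds_* variables used in the next section come from:

# Assumed invocation; the variable names match the blending code in section 4.3
gbms_lgb, oof_preds_lgb, test_preds_lgb = train_lgb_kfold(X_train, y_train, X_test, n_fold=10)
gbms_xgb, oof_preds_xgb, test_preds_xgb = train_xgb_kfold(X_train, y_train, X_test, n_fold=10)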

4.3 Model Blending and Threshold Selection

def gen_submit_file(df_test, test_preds, thres, save_path):
    '''Binarize the test predictions with the chosen threshold and write the submission file.'''
    df_test['test_preds_binary'] = np.where(test_preds > thres, 1, 0)
    df_test_submit = df_test[['customer_id', 'test_preds_binary']]
    df_test_submit.columns = ['customer_id', 'loan_default']
    print(f'saving result to: {save_path}')
    df_test_submit.to_csv(save_path, index=False)
    print('done!')
    return df_test_submit

def gen_thres_new(df_train, oof_preds):
    '''Pick the classification threshold that maximizes F1 on the out-of-fold predictions.'''
    df_train['oof_preds'] = oof_preds
    # Start from the threshold at which the predicted positive rate matches the training default rate
    quantile_point = df_train['loan_default'].mean()
    thres = df_train['oof_preds'].quantile(1 - quantile_point)

    # Then grid-search +/- 0.2 around it in steps of 0.01
    _thresh = []
    for thres_item in np.arange(thres - 0.2, thres + 0.2, 0.01):
        _thresh.append(
            [thres_item, f1_score(df_train['loan_default'], np.where(oof_preds > thres_item, 1, 0), average='macro')])

    _thresh = np.array(_thresh)
    best_id = _thresh[:, 1].argmax()
    best_thresh = _thresh[best_id][0]

    print("threshold: {}\ntrain F1: {}".format(best_thresh, _thresh[best_id][1]))
    return best_thresh

# Out-of-fold results
df_oof_res = pd.DataFrame({'customer_id': train['customer_id'],
                           'oof_preds_xgb': oof_preds_xgb,
                           'oof_preds_lgb': oof_preds_lgb,
                           'loan_default': train['loan_default']
                          })

# Model blending: convert each model's predictions to percentile ranks, then weight them
df_oof_res['xgb_rank'] = df_oof_res['oof_preds_xgb'].rank(pct=True)
df_oof_res['lgb_rank'] = df_oof_res['oof_preds_lgb'].rank(pct=True)
df_oof_res['preds'] = 0.31 * df_oof_res['xgb_rank'] + 0.69 * df_oof_res['lgb_rank']

# Find the best threshold on the blended oof predictions
thres = gen_thres_new(df_oof_res, df_oof_res['preds'])

df_test_res = pd.DataFrame({'customer_id': test['customer_id'],
                            'test_preds_xgb': test_preds_xgb,
                            'test_preds_lgb': test_preds_lgb})

df_test_res['xgb_rank'] = df_test_res['test_preds_xgb'].rank(pct=True)
df_test_res['lgb_rank'] = df_test_res['test_preds_lgb'].rank(pct=True)
df_test_res['preds'] = 0.31 * df_test_res['xgb_rank'] + 0.69 * df_test_res['lgb_rank']

# Write the submission file
df_submit = gen_submit_file(df_test_res, df_test_res['preds'], thres,
                            save_path='../prediction_result/result.csv')
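Why rank the predictions before weighting? A toy illustration with made-up numbers (not competition data): when two models produce scores on different scales, a direct weighted average of probabilities lets the wider-ranged model dominate, whereas percentile ranks put both models on [0, 1] first:

import pandas as pd

# Toy scores from two hypothetical models for the same four samples
s_lgb = pd.Series([0.02, 0.10, 0.60, 0.90])   # spread out
s_xgb = pd.Series([0.30, 0.35, 0.40, 0.45])   # compressed around 0.4

blend_prob = 0.5 * s_lgb + 0.5 * s_xgb                                # scale-sensitive
blend_rank = 0.5 * s_lgb.rank(pct=True) + 0.5 * s_xgb.rank(pct=True)  # scale-free
print(blend_prob.tolist(), blend_rank.tolist())

With two tree models trained on the same features the gain is small (0.59011 to 0.59038 here), but the rank transform makes the blend weights easier to reason about.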

Full Code

GitHub: https://github.com/WangliLin/xunfei2021_car_loan_top1
To reproduce the results, just run sh test.sh.