Author: 望尼瑪, Zhejiang University, Datawhale outstanding contestant

Zhihu | https://www.zhihu.com/people/lin-a-bi-78/posts

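1. Introduction

Hello, everyone. I'm wangli from team "摸魚打比賽" (roughly, "slacking off to compete"). A quick introduction: I'm a self-taught algorithm engineer who switched into the field partway through my career. The team name comes from the fact that back in May and June I was in the middle of a work handover and not very busy. It struck me that I had already been out of school for two years. I had kept an eye on the competition scene and my hands itched whenever I saw an interesting problem, but after work I mostly just wanted to lie flat, so I had never properly entered a competition since graduating. Late at night I would sometimes ask myself, "Lian Po is old; can he still eat his fill?" So I decided to use this quiet stretch to take one competition seriously. I entered this one, was lucky enough to take first place in this small car-loan competition, and made some friends along the way, such as my teammate Chen, who is still in grad school, and my friend Cui, who is busy with autumn recruiting. A very rewarding experience~

Below, I'll walk through my approach to this competition and some of what I learned from it.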

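2. Competition Background

Competition link: https://challenge.xfyun.cn/topic/info?type=car-loan

As you can see, the task is car-loan default prediction: participants need to build a risk-identification model that predicts which borrowers are likely to default. Compared with other competitions, this one is not especially hard, for two reasons:

Task difficulty: it is a very traditional credit-risk overdue prediction task, a binary classification problem, and code from many other competitions can be reused with small changes;
Level of competition: the prize money is modest, so there were not many participants.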

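Personally, I started this competition as a side project while working on the product recommendation competition (under the same ID, "摸魚打比賽"), and only dug into features seriously in the last few days. (Speaking of that user-profile-based product recommendation competition, I'm a little embarrassed. Early on I felt I could compete and was in the Top 3 for a while, but work got busy from August, and I haven't submitted anything since the semi-final round began. In the end, my time management just isn't good enough. We'll see whether I can find time over the National Day holiday.)

Back to this competition:

The data volume is decent: 150,000 rows in the training set and 30,000 in the test set
There are 52 feature fields, each with an explanation provided by the organizers
Evaluation metric: F1 Score

So you can write a baseline very quickly, which makes this a fairly friendly competition for newcomers to data science.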

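3. Approach

The key to this kind of data-mining competition is distilling useful features from an understanding of the data. So from the start I did not try to throw fancy models at the problem; I built features by analyzing the data. If you would rather not read the code further down, this section gives a short summary of the whole solution:

Class distribution: the two classes are split roughly 82:18, which by credit-risk standards is a fairly balanced dataset. I therefore did not put much effort into class imbalance. I did try some oversampling, but it lowered the score rather than raising it.
Feature engineering:

First, I noticed there were many ID-type features, so I built target encoding features on top of them. These simple features plus a tree model already scored 0.583, which kept me in the Top 10 early on;

Next, from a business perspective, I built features such as the annual interest rate of the main and secondary accounts (the rate a bank offers often reflects its assessment of the customer's credit). From a data-distribution perspective, I binned some of the amount features. From the perspective of feature usefulness and redundancy, I dropped features that carried no information, such as the loan date. At this point the score reached about 0.587;

Then, in one accidental training run, I put the customer ID into the model by mistake, and it seemed to improve performance. My explanation was that fraud must be somewhat clustered: fraud rings may concentrate on certain lenders (where) or at certain times (when). The first kind of clustering shows up through branch_id/supplier_id/manufacturer_id, and the customer_id itself can reflect clustering in time. Based on this, I built neighbor fraud features, which brought the score to 0.589;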

模型選取：

前期，我一直是用的LightGBM，然后也沒有很仔細的去調(diào)參（比如hyperopt/ optuna等工具，我都沒有用），就很隨意（平平無奇的手動調(diào)參小天才）

后期，我開始嘗試其他的XGBoost/CatBoost/TabNet等模型，但是發(fā)現(xiàn)CatBoost和TabNet效果都不是很好，就沒有深入往下去鉆了（主要白天還是要上班的，因此精力有限，說是摸魚打比賽，但更準確的說是熬夜打比賽）

閾值選?。?/strong>由于該題是用F1 Score作為評判標準的，因此，我們需要自己劃一個閾值，然后決定哪些樣本預測為正樣本，哪些樣本預測為負樣本。在嘗試了不同方案后，我們的方案基于oof的預測結(jié)果，選出一個在oof上表現(xiàn)最優(yōu)的閾值，此時在榜上的效果是最佳的（千分位的提升）
融合策略：最后選定了兩個模型來融合，一個是LightGBM，一個是XGBoost（哈哈哈，就很土有沒有），然后，直接按預測概率加權(quán)融合的話效果是比較一般的，而按照其ranking值分位點化之后再加權(quán)融合效果會更好。效果而言，單模LGB最優(yōu)是0.5892，XGB是在0.5872這邊，按照概率加權(quán)最優(yōu)是0.59011，按照排序加權(quán)最優(yōu)是0.59038
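
To make the difference between the two blending schemes concrete, here is a minimal sketch on toy numbers. The function names `blend_by_prob` and `blend_by_rank` and the sample predictions are illustrative assumptions, not taken from the competition code. The point it tries to show is that when one model outputs a much narrower range of probabilities, a probability average is dominated by the other model, whereas rank percentiles put both models on the same scale first.

```python
import numpy as np
import pandas as pd


def blend_by_prob(preds_a, preds_b, weight_a=0.5):
    """Weighted average of raw predicted probabilities."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return weight_a * preds_a + (1 - weight_a) * preds_b


def blend_by_rank(preds_a, preds_b, weight_a=0.5):
    """Weighted average of rank percentiles (each in (0, 1])."""
    rank_a = pd.Series(preds_a).rank(pct=True)
    rank_b = pd.Series(preds_b).rank(pct=True)
    return (weight_a * rank_a + (1 - weight_a) * rank_b).to_numpy()


# Toy predictions (hypothetical): model A is compressed into a narrow range,
# model B is spread out.
preds_a = [0.10, 0.11, 0.12, 0.13]
preds_b = [0.90, 0.10, 0.50, 0.30]

prob_blend = blend_by_prob(preds_a, preds_b)
rank_blend = blend_by_rank(preds_a, preds_b)

# The probability blend simply reproduces model B's ordering here,
# while the rank blend lets model A's ordering matter as well.
print(np.argsort(-prob_blend))
print(rank_blend)
```

Because rank percentiles always lie in (0, 1], a threshold chosen on blended OOF ranks can be applied to blended test ranks on the same scale, provided the two score distributions are similar.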

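That covers the main ideas and the overall solution. It does read a bit dry, though, so if you are interested in the code, read on. After all, talk is cheap :)

4. Implementation & Code Walkthrough

4.1 Feature Engineering

Target encoding / mean encoding. Note that to prevent overfitting, it has to be computed fold by fold.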

# Features to target-encode:
TARGET_ENCODING_FETAS = [
    'employment_type',
    'branch_id',
    'supplier_id',
    'manufacturer_id',
    'area_id',
    'employee_code_id',
    'asset_cost_bin'
]

# Implementation:
import numpy as np
from sklearn.model_selection import StratifiedKFold


def gen_target_encoding_feats(train, test, encode_cols, target_col, n_fold=10):
    '''Generate target encoding features.'''
    # Training set: encode each fold using target means from the other folds only
    tg_feats = np.zeros((train.shape[0], len(encode_cols)))
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    for train_index, val_index in kfold.split(train[encode_cols], train[target_col]):
        df_train, df_val = train.iloc[train_index], train.iloc[val_index]
        for idx, col in enumerate(encode_cols):
            target_mean_dict = df_train.groupby(col)[target_col].mean()
            # Categories that do not occur in the other folds map to NaN
            tg_feats[val_index, idx] = df_val[col].map(target_mean_dict).values

    for idx, encode_col in enumerate(encode_cols):
        train[f'{encode_col}_mean_target'] = tg_feats[:, idx]

    # Test set: use target means computed on the full training set
    for col in encode_cols:
        target_mean_dict = train.groupby(col)[target_col].mean()
        test[f'{col}_mean_target'] = test[col].map(target_mean_dict)

    return train, test
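
One detail the fold-wise scheme leaves open is what to do with a category value that never appears in the folds used for fitting: `map` returns NaN for it. Tree models such as LightGBM can consume NaN directly, so this is not necessarily a problem. As a hedged alternative, the sketch below falls back to the mean target of the fitting folds. The function name `oof_mean_encode`, the column names and the toy data are assumptions made for illustration and are not part of the original solution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_mean_encode(df, col, target, n_splits=5, seed=0):
    """Out-of-fold mean encoding of one column.

    Each row is encoded with the target mean of its category computed on the
    other folds; categories unseen there fall back to those folds' overall mean.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(df):
        fit_part = df.iloc[fit_idx]
        means = fit_part.groupby(col)[target].mean()
        prior = fit_part[target].mean()  # fallback for unseen categories
        values = df[col].iloc[enc_idx].map(means).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Toy data (hypothetical): category 'c' occurs once, so it is always unseen
# in the folds used to encode it and receives the fallback value.
toy = pd.DataFrame({
    'branch': list('aabbaabbab') + ['c'],
    'label': [1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1],
})
toy['branch_mean_target'] = oof_mean_encode(toy, 'branch', 'label', n_splits=3)
print(toy)
```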

Interest-rate features, binned features, and so on:

import pandas as pd


def gen_new_feats(train, test):
    '''Generate new features such as interest-rate and binned features.'''
    # Step 1: concatenate the training and test sets
    data = pd.concat([train, test])

    # Step 2: feature engineering
    # Interest rate of the secondary account: (total repayment - principal) / principal
    data['sub_Rate'] = (data['sub_account_monthly_payment'] * data['sub_account_tenure']
                        - data['sub_account_sanction_loan']) / data['sub_account_sanction_loan']

    # Interest rate of the main account, computed the same way
    data['main_Rate'] = (data['main_account_monthly_payment'] * data['main_account_tenure']
                         - data['main_account_sanction_loan']) / data['main_account_sanction_loan']

    # Bin some of the features
    # Equal-width binning
    bin_labels = list(range(10))
    data['loan_to_asset_ratio_bin'] = pd.cut(data['loan_to_asset_ratio'], 10, labels=bin_labels)
    # Equal-frequency binning
    data['asset_cost_bin'] = pd.qcut(data['asset_cost'], 10, labels=bin_labels)
    # Custom bins
    amount_cols = [
        'total_monthly_payment',
        'main_account_sanction_loan',
        'main_account_disbursed_loan',
        'sub_account_sanction_loan',
        'sub_account_disbursed_loan',
        'main_account_monthly_payment',
        'sub_account_monthly_payment',
        'total_sanction_loan'
    ]
    amount_labels = list(range(10))
    for col in amount_cols:
        # The last edge assumes data[col].max() exceeds 3,000,000;
        # pd.cut requires strictly increasing bin edges
        amount_bins = [-1, 5000, 10000, 30000, 50000, 100000, 300000, 500000, 1000000, 3000000, data[col].max()]
        data[col + '_bin'] = pd.cut(data[col], amount_bins, labels=amount_labels).astype(int)

    # Step 3: split back into training and test sets (test rows have no loan_default label)
    return data[data['loan_default'].notnull()], data[data['loan_default'].isnull()]
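
As a quick worked example of the rate formula used above, the sketch below computes (monthly payment × tenure − sanctioned amount) / sanctioned amount on two toy rows. Two assumptions are mine rather than the original author's: the helper name `total_interest_ratio`, and replacing a sanctioned amount of zero with NaN so that the division does not produce infinities. Note that, as written, the formula gives total interest over the whole term relative to the principal; whether that matches an annual rate depends on the unit of the tenure field, which the post does not state.

```python
import numpy as np
import pandas as pd


def total_interest_ratio(monthly_payment, tenure, sanctioned):
    """(monthly_payment * tenure - sanctioned) / sanctioned, NaN where sanctioned is 0."""
    payment = pd.Series(monthly_payment, dtype=float)
    tenure = pd.Series(tenure, dtype=float)
    # Assumption: treat a zero sanctioned amount as "no loan" and return NaN
    principal = pd.Series(sanctioned, dtype=float).replace(0, np.nan)
    return (payment * tenure - principal) / principal


# Toy rows (hypothetical): repaying 1,000 for 12 periods on a 10,000 loan
# means 2,000 of interest, i.e. a ratio of 0.2; the second row has no loan.
rates = total_interest_ratio([1000, 0], [12, 0], [10000, 0])
print(rates)
```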

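Neighbor fraud feature (the default rate among the roughly 10 nearest customer IDs before and after each ID; one could try other neighbor counts to find the best one, but my energy was limited, haha)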

import os

import pandas as pd
from tqdm import tqdm


def gen_neighbor_feats(train, test):
    '''Generate neighbor fraud (default-rate) features.'''
    # save_pkl / load_pkl are not defined in this post
    # (presumably pickle helpers from the full repository)
    if not os.path.exists('../user_data/neighbor_default_probs.pkl'):
        # This feature takes a long time to compute, so it is cached as a pkl file
        labels_by_id = train.set_index('customer_id')['loan_default']
        train_ids = set(train['customer_id'])  # set lookup instead of scanning a list per ID
        neighbor_default_probs = []
        for i in tqdm(range(train.customer_id.max())):
            if i >= 10 and i < 199706:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, i + 10))
            elif i < 199706:
                customer_id_neighbors = list(range(0, i)) + list(range(i + 1, i + 10))
            else:
                customer_id_neighbors = list(range(i - 10, i)) + list(range(i + 1, 199706))

            # Keep only neighbors that are in the training set, i.e. that have a label
            customer_id_neighbors = [n for n in customer_id_neighbors if n in train_ids]
            neighbor_default_probs.append(labels_by_id.loc[customer_id_neighbors].mean())

        df_neighbor_default_prob = pd.DataFrame({'customer_id': range(0, train.customer_id.max()),
                                                 'neighbor_default_prob': neighbor_default_probs})
        save_pkl(df_neighbor_default_prob, '../user_data/neighbor_default_probs.pkl')
    else:
        df_neighbor_default_prob = load_pkl('../user_data/neighbor_default_probs.pkl')
    train = pd.merge(left=train, right=df_neighbor_default_prob, on='customer_id', how='left')
    test = pd.merge(left=test, right=df_neighbor_default_prob, on='customer_id', how='left')

    return train, test
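
Since the loop above is slow enough to need caching, a vectorized version may be worth considering. The following is a minimal sketch, assuming customer IDs are non-negative integers. It uses prefix sums over an ID-indexed array to get, for every queried ID, the mean label of training customers within ±k IDs, excluding the ID itself. The function name `neighbor_default_rate` and the symmetric ±k window are my assumptions; the original loop looks 10 IDs back but only 9 forward, so the values will not match it exactly.

```python
import numpy as np


def neighbor_default_rate(train_ids, labels, query_ids, k=10):
    """Mean training label over IDs in [id - k, id + k], excluding the ID itself.

    Returns NaN where no labelled neighbor exists in the window.
    """
    train_ids = np.asarray(train_ids)
    labels = np.asarray(labels, dtype=float)
    query_ids = np.asarray(query_ids)
    size = int(max(train_ids.max(), query_ids.max())) + 1

    # Per-ID label sum and labelled-row count (zero for IDs not in the training set)
    label_sum = np.bincount(train_ids, weights=labels, minlength=size)
    count = np.bincount(train_ids, minlength=size).astype(float)

    # Prefix sums with a leading zero, so a window sum is a difference of two entries
    cs_label = np.concatenate([[0.0], np.cumsum(label_sum)])
    cs_count = np.concatenate([[0.0], np.cumsum(count)])

    lo = np.clip(query_ids - k, 0, size)
    hi = np.clip(query_ids + k + 1, 0, size)
    win_label = cs_label[hi] - cs_label[lo] - label_sum[query_ids]
    win_count = cs_count[hi] - cs_count[lo] - count[query_ids]

    with np.errstate(invalid='ignore', divide='ignore'):
        return np.where(win_count > 0, win_label / win_count, np.nan)


# Toy check (hypothetical IDs and labels) with a window of +-1
rates = neighbor_default_rate(train_ids=[0, 1, 2, 5], labels=[1, 0, 1, 0],
                              query_ids=[1, 5, 3], k=1)
print(rates)
```

The same function can be called once with the training IDs and once with the test IDs as `query_ids`; in both cases only training labels enter the average.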

In the end I kept only 47 features:

USED_FEATS = [
    'customer_id',
    'neighbor_default_prob',
    'disbursed_amount',
    'asset_cost',
    'branch_id',
    'supplier_id',
    'manufacturer_id',
    'area_id',
    'employee_code_id',
    'credit_score',
    'loan_to_asset_ratio',
    'year_of_birth',
    'age',
    'sub_Rate',
    'main_Rate',
    'loan_to_asset_ratio_bin',
    'asset_cost_bin',
    'employment_type_mean_target',
    'branch_id_mean_target',
    'supplier_id_mean_target',
    'manufacturer_id_mean_target',
    'area_id_mean_target',
    'employee_code_id_mean_target',
    'asset_cost_bin_mean_target',
    'credit_history',
    'average_age',
    'total_disbursed_loan',
    'main_account_disbursed_loan',
    'total_sanction_loan',
    'main_account_sanction_loan',
    'active_to_inactive_act_ratio',
    'total_outstanding_loan',
    'main_account_outstanding_loan',
    'Credit_level',
    'outstanding_disburse_ratio',
    'total_account_loan_no',
    'main_account_tenure',
    'main_account_loan_no',
    'main_account_monthly_payment',
    'total_monthly_payment',
    'main_account_active_loan_no',
    'main_account_inactive_loan_no',
    'sub_account_inactive_loan_no',
    'enquirie_no',
    'main_account_overdue_no',
    'total_overdue_no',
    'last_six_month_defaulted_no'
]

4.2 Model Training

LightGBM (10 folds works better)

import logging

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import StratifiedKFold


def train_lgb_kfold(X_train, y_train, X_test, n_fold=5):
    '''train lightgbm with k-fold split'''
    gbms = []
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))

    # Note: bagging_fraction only takes effect when bagging_freq > 0
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'num_leaves': 64,
        'learning_rate': 0.02,
        'min_data_in_leaf': 150,
        'feature_fraction': 0.8,
        'bagging_fraction': 0.7,
        'n_jobs': -1,
        'seed': 1024
    }

    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        # y_train is indexed by position: pass a NumPy array or a Series with a default RangeIndex
        X_tr, X_val = X_train.iloc[train_index], X_train.iloc[val_index]
        y_tr, y_val = y_train[train_index], y_train[val_index]
        dtrain = lgb.Dataset(X_tr, y_tr)
        dvalid = lgb.Dataset(X_val, y_val, reference=dtrain)

        # Recent LightGBM versions take early stopping and logging as callbacks;
        # older versions used the early_stopping_rounds / verbose_eval arguments
        gbm = lgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        valid_sets=[dtrain, dvalid],
                        callbacks=[lgb.early_stopping(20), lgb.log_evaluation(50)])

        oof_preds[val_index] = gbm.predict(X_val, num_iteration=gbm.best_iteration)
        test_preds += gbm.predict(X_test, num_iteration=gbm.best_iteration) / kfold.n_splits
        gbms.append(gbm)

    return gbms, oof_preds, test_preds

XGBoost

import logging

import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold


def train_xgb_kfold(X_train, y_train, X_test, n_fold=10):
    '''train xgboost with k-fold split'''
    # The original hard-coded 10 splits and ignored n_fold; the default is now 10 and n_fold is honored
    gbms = []
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))
    dtest = xgb.DMatrix(X_test)

    params = {
        'booster': 'gbtree',
        'objective': 'binary:logistic',
        'eval_metric': ['logloss', 'auc'],
        'max_depth': 8,
        'subsample': 0.9,
        'min_child_weight': 10,
        'colsample_bytree': 0.85,
        'lambda': 10,
        'eta': 0.02,
        'seed': 1024
    }

    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
        X_tr, X_val = X_train.iloc[train_index], X_train.iloc[val_index]
        y_tr, y_val = y_train[train_index], y_train[val_index]
        dtrain = xgb.DMatrix(X_tr, y_tr)
        dvalid = xgb.DMatrix(X_val, y_val)

        # Early stopping monitors the last metric ('auc') on the last entry of the watchlist
        watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

        gbm = xgb.train(params,
                        dtrain,
                        num_boost_round=1000,
                        evals=watchlist,
                        verbose_eval=50,
                        early_stopping_rounds=20)

        # best_iteration is zero-based and the range end is exclusive, hence the + 1
        best_range = (0, gbm.best_iteration + 1)
        oof_preds[val_index] = gbm.predict(dvalid, iteration_range=best_range)
        test_preds += gbm.predict(dtest, iteration_range=best_range) / kfold.n_splits
        gbms.append(gbm)

    return gbms, oof_preds, test_preds

4.3 Model Blending and Threshold Selection

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score


def gen_submit_file(df_test, test_preds, thres, save_path):
    '''Binarize the test predictions with the threshold and write the submission file.'''
    df_test['test_preds_binary'] = np.where(test_preds > thres, 1, 0)
    df_test_submit = df_test[['customer_id', 'test_preds_binary']]
    df_test_submit.columns = ['customer_id', 'loan_default']
    print(f'saving result to: {save_path}')
    df_test_submit.to_csv(save_path, index=False)
    print('done!')
    return df_test_submit


def gen_thres_new(df_train, oof_preds):
    '''Search for the threshold that maximizes macro F1 on the OOF predictions.'''
    df_train['oof_preds'] = oof_preds
    # Starting point: the quantile that predicts the same share of positives as the training set
    quantile_point = df_train['loan_default'].mean()
    thres = df_train['oof_preds'].quantile(1 - quantile_point)

    # Scan thresholds within +-0.2 of the starting point in steps of 0.01
    _thresh = []
    for thres_item in np.arange(thres - 0.2, thres + 0.2, 0.01):
        _thresh.append(
            [thres_item, f1_score(df_train['loan_default'], np.where(oof_preds > thres_item, 1, 0), average='macro')])

    _thresh = np.array(_thresh)
    best_id = _thresh[:, 1].argmax()
    best_thresh = _thresh[best_id][0]

    print("Threshold: {}\nOOF F1 on the training set: {}".format(best_thresh, _thresh[best_id][1]))
    return best_thresh


# Results (oof_preds_* and test_preds_* come from the training functions above)
df_oof_res = pd.DataFrame({'customer_id': train['customer_id'],
                           'oof_preds_xgb': oof_preds_xgb,
                           'oof_preds_lgb': oof_preds_lgb,
                           'loan_default': train['loan_default']
                           })

# Model blending: weighted average of rank percentiles
df_oof_res['xgb_rank'] = df_oof_res['oof_preds_xgb'].rank(pct=True)
df_oof_res['lgb_rank'] = df_oof_res['oof_preds_lgb'].rank(pct=True)
df_oof_res['preds'] = 0.31 * df_oof_res['xgb_rank'] + 0.69 * df_oof_res['lgb_rank']

# Find the best threshold
thres = gen_thres_new(df_oof_res, df_oof_res['preds'])

df_test_res = pd.DataFrame({'customer_id': test['customer_id'],
                            'test_preds_xgb': test_preds_xgb,
                            'test_preds_lgb': test_preds_lgb})

df_test_res['xgb_rank'] = df_test_res['test_preds_xgb'].rank(pct=True)
df_test_res['lgb_rank'] = df_test_res['test_preds_lgb'].rank(pct=True)
df_test_res['preds'] = 0.31 * df_test_res['xgb_rank'] + 0.69 * df_test_res['lgb_rank']

# Write the submission
df_submit = gen_submit_file(df_test_res, df_test_res['preds'], thres,
                            save_path='../prediction_result/result.csv')

Full Code

GitHub repository:

https://github.com/WangliLin/xunfei2021_car_loan_top1

To reproduce the result, simply run sh test.sh.

Top-1 Solution! 2021 iFLYTEK Vehicle Loan Default Prediction Competition
