[Machine Learning] Ensemble Learning Code Exercises (Random Forest, GBDT, XGBoost, LightGBM, etc.)
This post contains the exercise code for the "Ensemble Learning" chapter of the China University MOOC course "Machine Learning".
Course page:
https://www.icourse163.org/course/WZU-1464096179
Complete course code:
https://github.com/fengdu78/WZU-machine-learning-course
Code adapted and annotated by: Huang Haiguang, [email protected]
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from sklearn.model_selection import train_test_split
Generate data
Generate 12,000 rows of data and split them into training and test sets at a 3:1 ratio.
from sklearn.datasets import make_hastie_10_2
data, target = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=123)
X_train.shape, X_test.shape
((9000, 10), (3000, 10))
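Note that make_hastie_10_2 generates binary targets coded as -1 and +1, which matters for the error computations in the native-API examples below. A quick check of the label values (a small added sketch, not part of the original code):
import numpy as np
# The Hastie 10.2 dataset uses -1/+1 as class labels
print(np.unique(y_train, return_counts=True))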
Model comparison
Compare six models, all with their default parameters.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
import time
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = AdaBoostClassifier()
clf4 = GradientBoostingClassifier()
clf5 = XGBClassifier()
clf6 = LGBMClassifier()
for clf, label in zip([clf1, clf2, clf3, clf4, clf5, clf6], [
        'Logistic Regression', 'Random Forest', 'AdaBoost', 'GBDT', 'XGBoost',
        'LightGBM'
]):
    start = time.time()
    scores = cross_val_score(clf, X_train, y_train, scoring='accuracy', cv=5)
    end = time.time()
    running_time = end - start
    print("Accuracy: %0.8f (+/- %0.2f), time: %0.2f s. Model: [%s]" %
          (scores.mean(), scores.std(), running_time, label))
Accuracy: 0.47488889 (+/- 0.00), time: 0.04 s. Model: [Logistic Regression]
Accuracy: 0.88966667 (+/- 0.01), time: 16.34 s. Model: [Random Forest]
Accuracy: 0.88311111 (+/- 0.00), time: 3.39 s. Model: [AdaBoost]
Accuracy: 0.91388889 (+/- 0.01), time: 13.14 s. Model: [GBDT]
Accuracy: 0.92977778 (+/- 0.00), time: 3.60 s. Model: [XGBoost]
Accuracy: 0.93188889 (+/- 0.01), time: 0.58 s. Model: [LightGBM]
Comparing the six models: logistic regression is the fastest but has the lowest accuracy, while LightGBM is both fast and the most accurate. This is why LightGBM has become the default choice for most structured (tabular) data problems today.
Using XGBoost
1. The native XGBoost API
import xgboost as xgb
# Record the program running time
import time

start_time = time.time()
# Build the xgb DMatrix objects
xgb_train = xgb.DMatrix(X_train, y_train)
xgb_test = xgb.DMatrix(X_test, label=y_test)
## Parameters
params = {
    'booster': 'gbtree',
    # 'silent': 1,  # 1 suppresses run-time messages; 0 (print them) is usually preferable
    # 'nthread': 7,  # number of CPU threads; defaults to the maximum available
    'eta': 0.007,  # learning rate
    'min_child_weight': 3,
    # Defaults to 1: the minimum sum of instance weights (the Hessian h) required in a leaf.
    # For 0-1 classification with imbalanced classes, if h is around 0.01,
    # min_child_weight = 1 means a leaf must contain at least 100 samples.
    # This parameter strongly affects the result: it controls the minimum sum of
    # second-order gradients in a leaf, and smaller values overfit more easily.
    'max_depth': 6,  # maximum tree depth; larger values overfit more easily
    'gamma': 0.1,  # minimum loss reduction required to split a leaf further; larger is more conservative, typically 0.1 or 0.2
    'subsample': 0.7,  # row subsampling of the training instances
    'colsample_bytree': 0.7,  # column subsampling when building each tree
    'lambda': 2,  # L2 regularization on weights; larger values make the model less prone to overfitting
    # 'alpha': 0,  # L1 regularization term
    # 'scale_pos_weight': 1,  # values > 0 help convergence when classes are imbalanced
    # 'objective': 'multi:softmax',  # multi-class objective
    # 'num_class': 10,  # number of classes, used together with multi:softmax
    'seed': 1000,  # random seed
    # 'eval_metric': 'auc'
}
plst = list(params.items())
num_rounds = 500  # number of boosting rounds
watchlist = [(xgb_train, 'train'), (xgb_test, 'val')]
# Train the model and (optionally) save it
# When num_rounds is large, early_stopping_rounds stops training if the evaluation
# metric has not improved within the given number of rounds.
model = xgb.train(
    plst,
    xgb_train,
    num_rounds,
    watchlist,
    early_stopping_rounds=100,
)
# model.save_model('./model/xgb.model')  # save the trained model
print("best_ntree_limit", model.best_ntree_limit)
y_pred = model.predict(xgb_test, ntree_limit=model.best_ntree_limit)
print('error=%f' %
      (sum(1
           for i in range(len(y_pred)) if int(y_pred[i] > 0.5) != y_test[i]) /
       float(len(y_pred))))
# Print the running time
cost_time = time.time() - start_time
print("xgboost success!", '\n', "cost time:", cost_time, "(s)......")
[0] train-rmse:1.11000 val-rmse:1.10422
[1] train-rmse:1.10734 val-rmse:1.10182
[2] train-rmse:1.10465 val-rmse:1.09932
[3] train-rmse:1.10207 val-rmse:1.09694
……
[497] train-rmse:0.62135 val-rmse:0.68680
[498] train-rmse:0.62096 val-rmse:0.68650
[499] train-rmse:0.62056 val-rmse:0.68624
best_ntree_limit 500
error=0.826667
xgboost success!
cost time: 3.5742645263671875 (s)......
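Because no objective is set, XGBoost falls back to its squared-error regression default (hence the rmse metric in the log), while the targets here are -1/+1, so the int(y_pred[i] > 0.5) comparison above is not a meaningful error rate. A minimal sketch of one way to score these regression-style predictions as a classifier, thresholding at 0 (reusing y_pred and y_test from above; an added example, not part of the original notebook):
import numpy as np
# Map the continuous predictions back to the -1/+1 labels by thresholding at 0,
# then compute the plain classification accuracy.
y_pred_label = np.where(y_pred > 0, 1, -1)
print("accuracy (threshold at 0): %.4f" % (y_pred_label == y_test).mean())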
2. The scikit-learn interface
Parameter names that change:
eta -> learning_rate
lambda -> reg_lambda
alpha -> reg_alpha
from sklearn.model_selection import train_test_split
from sklearn import metrics
from xgboost import XGBClassifier

clf = XGBClassifier(
    # silent=0,  # 1 suppresses run-time messages; 0 (print them) is usually preferable
    # nthread=4,  # number of CPU threads; defaults to the maximum available
    learning_rate=0.3,  # learning rate
    min_child_weight=1,
    # Defaults to 1: the minimum sum of instance weights (the Hessian h) required in a leaf.
    # For 0-1 classification with imbalanced classes, if h is around 0.01,
    # min_child_weight = 1 means a leaf must contain at least 100 samples.
    # This parameter strongly affects the result: it controls the minimum sum of
    # second-order gradients in a leaf, and smaller values overfit more easily.
    max_depth=6,  # maximum tree depth; larger values overfit more easily
    gamma=0,  # minimum loss reduction required to split a leaf further; larger is more conservative, typically 0.1 or 0.2
    subsample=1,  # row subsampling ratio of the training instances
    max_delta_step=0,  # maximum delta step allowed for each tree's weight estimation
    colsample_bytree=1,  # column subsampling when building each tree
    reg_lambda=1,  # L2 regularization on weights; larger values make the model less prone to overfitting
    # reg_alpha=0,  # L1 regularization term
    # scale_pos_weight=1,  # values > 0 help convergence when classes are imbalanced; balances positive and negative weights
    # objective='multi:softmax',  # multi-class objective; specifies the learning task and objective
    # num_class=10,  # number of classes, used together with multi:softmax
    n_estimators=100,  # number of trees
    seed=1000  # random seed
    # eval_metric='auc'
)
clf.fit(X_train, y_train)
y_true, y_pred = y_test, clf.predict(X_test)
print("Accuracy : %.4g" % metrics.accuracy_score(y_true, y_pred))
Accuracy : 0.936
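As an additional check, a ranking metric can be computed from the predicted probabilities; a minimal sketch using sklearn's roc_auc_score with the fitted clf from above (an added example, not part of the original notebook):
from sklearn.metrics import roc_auc_score
# Probability of the positive class from the fitted classifier
y_score = clf.predict_proba(X_test)[:, 1]
print("AUC : %.4f" % roc_auc_score(y_test, y_score))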
Using LightGBM
1. Native API
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

# Load your own data
# print('Load data...')
# df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
# df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')
#
# y_train = df_train[0].values
# y_test = df_test[0].values
# X_train = df_train.drop(0, axis=1).values
# X_test = df_test.drop(0, axis=1).values

# Create LightGBM dataset objects
lgb_train = lgb.Dataset(X_train, y_train)  # saving the data as a LightGBM binary file makes loading faster
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)  # create the validation set

# Write the parameters as a dictionary
params = {
    'task': 'train',
    'boosting_type': 'gbdt',  # type of boosting
    'objective': 'regression',  # objective function
    'metric': {'l2', 'auc'},  # evaluation metrics
    'num_leaves': 31,  # number of leaves
    'learning_rate': 0.05,  # learning rate
    'feature_fraction': 0.9,  # fraction of features used to build each tree
    'bagging_fraction': 0.8,  # fraction of samples used to build each tree
    'bagging_freq': 5,  # k means bagging is performed every k iterations
    'verbose': 1  # <0: fatal only, =0: errors (and warnings), >0: info
}
print('Start training...')
# Train (cv and train)
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=500,
                valid_sets=lgb_eval,
                early_stopping_rounds=5)  # training takes a parameter dict and a Dataset
print('Save model...')
gbm.save_model('model.txt')  # save the trained model to a file
print('Start predicting...')
# Predict on the test set
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# If early stopping was enabled during training, best_iteration gives predictions from the best iteration.
# Evaluate the model
print('error=%f' %
      (sum(1
           for i in range(len(y_pred)) if int(y_pred[i] > 0.5) != y_test[i]) /
       float(len(y_pred))))
Start training...
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000448 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 10
[LightGBM] [Info] Start training from score 0.012000
[1] valid_0's auc: 0.814399 valid_0's l2: 0.965563
Training until validation scores don't improve for 5 rounds
[2] valid_0's auc: 0.84729 valid_0's l2: 0.934647
[3] valid_0's auc: 0.872805 valid_0's l2: 0.905265
[4] valid_0's auc: 0.884117 valid_0's l2: 0.877875
[5] valid_0's auc: 0.895115 valid_0's l2: 0.852189
……
[191] valid_0's auc: 0.982783 valid_0's l2: 0.319851
[192] valid_0's auc: 0.982751 valid_0's l2: 0.319971
[193] valid_0's auc: 0.982685 valid_0's l2: 0.320043
Early stopping, best iteration is:
[188] valid_0's auc: 0.982794 valid_0's l2: 0.319746
Save model...
Start predicting...
error=0.664000
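Here too the targets are -1/+1 while the objective is plain regression, so comparing the raw predictions against 0.5 does not give a meaningful error rate. A minimal sketch that scores the raw outputs with AUC and thresholds them at 0 to recover class labels (reusing y_pred and y_test from above; an added example, not part of the original notebook):
import numpy as np
from sklearn.metrics import roc_auc_score
# The raw regression outputs can be used directly as scores for AUC,
# and thresholded at 0 to recover the -1/+1 class labels.
print("AUC: %.4f" % roc_auc_score(y_test, y_pred))
y_pred_label = np.where(y_pred > 0, 1, -1)
print("accuracy (threshold at 0): %.4f" % (y_pred_label == y_test).mean())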
2. The scikit-learn interface
from sklearn import metrics
from lightgbm import LGBMClassifier

clf = LGBMClassifier(
    boosting_type='gbdt',  # type of boosting: gbdt, dart, goss, rf
    num_leaves=31,  # maximum number of leaves per tree; compare with 2^(max_depth) in XGBoost
    max_depth=-1,  # maximum tree depth (-1 means no limit)
    learning_rate=0.1,  # learning rate
    n_estimators=100,  # number of trees to fit, i.e. the number of boosting rounds
    subsample_for_bin=200000,
    objective=None,
    class_weight=None,
    min_split_gain=0.0,  # minimum gain required to make a split
    min_child_weight=0.001,  # minimum sum of instance weights in a child node
    min_child_samples=20,
    subsample=1.0,  # row subsampling ratio of the training instances
    subsample_freq=0,  # frequency of row subsampling
    colsample_bytree=1.0,  # column subsampling ratio when building each tree
    reg_alpha=0.0,  # L1 regularization coefficient
    reg_lambda=0.0,  # L2 regularization coefficient
    random_state=None,
    n_jobs=-1,
    silent=True,
)
clf.fit(X_train, y_train, eval_metric='auc')
# To monitor a validation set, pass eval_set to fit; verbose=False suppresses the progress output.
y_true, y_pred = y_test, clf.predict(X_test)
print("Accuracy : %.4g" % metrics.accuracy_score(y_true, y_pred))
Accuracy : 0.927
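The fitted sklearn-style model also exposes split-based feature importances, which give a quick view of which features the trees actually use; a minimal sketch with the fitted clf from above (an added example, not part of the original notebook):
import numpy as np
# By default LightGBM reports 'split' importance: how many times each feature is used in a split
importances = clf.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print("feature %d: importance %d" % (idx, importances[idx]))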
References
1.https://xgboost.readthedocs.io/
2.https://lightgbm.readthedocs.io/
3.https://blog.csdn.net/q383700092/article/details/53763328?locationNum=9&fps=1