Author: 東哥起飛

Source: Python數據科學

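Hi everyone, I'm 東哥 (Dong Ge).

As most of you know, Python and SAS are two widely used data mining tools. Python is open source and free, has a rich ecosystem of third-party libraries, and is widely used at internet companies. SAS is paid software with licensing costs that most internet companies cannot justify, so it is used mainly in banks and other traditional financial institutions. Over the past couple of years, though, Python has become so popular that many SAS users have started moving to Python as well.

As open source gains ground, more and more enthusiasts are building excellent Python libraries. The currently popular all-purpose models XGBoost and LightGBM, for example, appear regularly in top competition solutions. SAS moves more slowly and does not directly provide some of these newer methods, which is painful for people who work in SAS.

Over time, many readers have asked me the same question: is there a tool that can convert a Python model to SAS?

I use both tools myself and usually combine them at work, so I had rarely thought about converting between them. Recently, though, while browsing a technical forum, I came across a neat trick: with the third-party Python library m2cgen and a short Python script, you can convert a Python model to SAS.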

What is m2cgen?

m2cgen is a third-party Python library whose main job is to convert trained Python models into code in other languages, such as R and VBA. Unfortunately, m2cgen does not currently support SAS, but that does not stop us from ending up with SAS code.

We still use m2cgen, but as an intermediate step on the way to SAS. The plan is to convert the Python model to VBA code first and then rewrite that VBA code as a SAS script. It is an indirect route, but it works.

How do you use m2cgen?

I'll walk through an example.

For data we use the iris dataset that ships with sklearn, documented here:

https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

Below, I show how to convert a Python XGBoost model into SAS code.

First, import the required libraries and load the data.

# Import libraries
import pandas as pd
import numpy as np
import os 
import re
from sklearn import datasets
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import m2cgen as m2c
# Load the data
iris = datasets.load_iris()
X = iris.data
Y = iris.target

Next, split the dataset and feed it straight into XGBoost to build a baseline model.

# Split the data into training and test sets
seed = 2020
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# Train the model
model = XGBClassifier()
model.fit(X_train, y_train)

Then convert the XGBoost model to VBA. m2cgen's export_to_visual_basic function does this directly. Exporting to other languages works the same way and is just as simple.

code = m2c.export_to_visual_basic(model, function_name = 'pred')

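Now for the core trick!

m2cgen does not support SAS, but a few small changes turn the VBA code into valid SAS code. There is no need to make these changes by hand one at a time; a short Python script can convert the VBA script into a SAS script.

There are not many changes. The main ones are removing code that cannot be used in SAS, such as the Module xxx, Function yyy, and Dim var Z As Double lines in the generated code, and adding ; to the end of each statement. All of this is to follow SAS syntax rules.

The following Python script performs these conversions automatically.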

# 1. Remove code that SAS cannot use
code = re.sub(r'Dim var.* As Double', '', code)
code = re.sub(r'End If', '', code)
# The following steps rewrite the rest into valid SAS
# 2. Replace the header
code = re.sub(r'Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n',
              'DATA pred_result;\nSET dataset_name;', code)
# 3. Replace the footer
code = re.sub(r'End Function\nEnd Module\n', 'RUN;', code)
# 4. Append ';' to lines that end in a digit or a closing parenthesis
all_match_list = re.findall(r'[0-9]+\n', code)
for idx in range(len(all_match_list)):
    original_str = all_match_list[idx]
    new_str = all_match_list[idx][:-1]+';\n'
    code = code.replace(original_str, new_str)
all_match_list = re.findall(r'\)\n', code)
for idx in range(len(all_match_list)):
    original_str = all_match_list[idx]
    new_str = all_match_list[idx][:-1]+';\n'
    code = code.replace(original_str, new_str)
# Replace inputVector with the input variable names
dictionary = {'inputVector(0)':'sepal_length',
              'inputVector(1)':'sepal_width',
              'inputVector(2)':'petal_length',
              'inputVector(3)':'petal_width'}
for key in dictionary.keys():
    code = code.replace(key, dictionary[key])
# Rename the prediction labels
code = re.sub(r'Math\.Exp', 'Exp', code)
code = re.sub(r'pred = .*\n', '', code)
temp_var_list = re.findall(r"var[0-9]+\(\d\)", code)
for var_idx in range(len(temp_var_list)):
    # re.escape makes the parentheses in e.g. "var3(0)" match literally
    code = re.sub(re.escape(temp_var_list[var_idx]), iris.target_names[var_idx]+'_prob', code)

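Let's go through the script step by step.

1. Header, footer, and output name

The first three parts are straightforward. Regular expressions delete the unneeded lines, and the start of the script is replaced with DATA pred_result;\nSET dataset_name;.

Anyone who has used SAS will recognize this: pred_result is the name of the output table produced by running the SAS script, and dataset_name is the name of the input table we want to score.

Finally, the end of the script is replaced with RUN;.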

# Remove code that SAS cannot use
code = re.sub(r'Dim var.* As Double', '', code)
code = re.sub(r'End If', '', code)
# The following steps rewrite the rest into valid SAS
# Replace the header
code = re.sub(r'Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n',
              'DATA pred_result;\nSET dataset_name;', code)
# Replace the footer
code = re.sub(r'End Function\nEnd Module\n', 'RUN;', code)
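
To see what this step does without training a model, here is a minimal sketch that runs the same substitutions on a tiny hand-written snippet. The snippet only imitates the shape of the generated Visual Basic code; real m2cgen output is much longer and its exact layout may vary by version. The helper name convert_header_footer and its parameters are my own, made up for illustration.

```python
import re

# Hand-written stand-in for m2cgen's Visual Basic output (illustrative only).
sample_vba = (
    "Module Model\n"
    "Function pred(ByRef inputVector() As Double) As Double()\n"
    "    Dim var0 As Double\n"
    "    If inputVector(2) >= 2.45 Then\n"
    "        var0 = -0.07\n"
    "    Else\n"
    "        var0 = 0.14\n"
    "    End If\n"
    "End Function\n"
    "End Module\n"
)


def convert_header_footer(code, out_table="pred_result", in_table="dataset_name"):
    """Drop VBA-only lines and wrap the body in a SAS DATA step (hypothetical helper)."""
    code = re.sub(r"Dim var.* As Double", "", code)
    code = re.sub(r"End If", "", code)
    code = re.sub(
        r"Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n",
        f"DATA {out_table};\nSET {in_table};\n",
        code,
    )
    code = re.sub(r"End Function\nEnd Module\n", "RUN;", code)
    return code


sas_like = convert_header_footer(sample_vba)
print(sas_like)
```

The result is not valid SAS yet: the assignments still lack semicolons and the inputs are still called inputVector, which the next steps take care of.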

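2. Adding semicolons at the end of statements

To follow SAS syntax, each statement also needs a ; at the end. Again we use regular expressions: find the lines that end in a digit or a closing parenthesis, then loop over the matches and append ; to each.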

# Append ';' to lines that end in a digit
all_match_list = re.findall(r'[0-9]+\n', code)
for idx in range(len(all_match_list)):
    original_str = all_match_list[idx]
    new_str = all_match_list[idx][:-1]+';\n'
    code = code.replace(original_str, new_str)
# Append ';' to lines that end in a closing parenthesis
all_match_list = re.findall(r'\)\n', code)
for idx in range(len(all_match_list)):
    original_str = all_match_list[idx]
    new_str = all_match_list[idx][:-1]+';\n'
    code = code.replace(original_str, new_str)
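
The two loops can also be collapsed into one substitution. This is only a sketch of an alternative, not what the script above does line for line: it uses a capture group to put ; after any digit or closing parenthesis that is immediately followed by a newline. The function name add_semicolons and the sample text are made up for illustration.

```python
import re


def add_semicolons(code):
    """Append ';' to every line ending in a digit or ')' (hypothetical helper)."""
    return re.sub(r"([0-9)])\n", r"\1;\n", code)


# Hand-written lines in the shape left behind by the previous steps.
sample = (
    "If petal_length >= 2.45 Then\n"
    "    var0 = -0.07\n"
    "Else\n"
    "    var0 = Exp(var1)\n"
)
with_semicolons = add_semicolons(sample)
print(with_semicolons)
```

Lines ending in Then or Else are left alone, which is what we want here, because the statement they introduce ends on the following line.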

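3. Mapping variable names

A dictionary maps each inputVector entry to the corresponding variable name in the input dataset, so all inputVector references are replaced in one pass.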

# Replace inputVector with the input variable names
dictionary = {'inputVector(0)':'sepal_length',
              'inputVector(1)':'sepal_width',
              'inputVector(2)':'petal_length',
              'inputVector(3)':'petal_width'} 
for key in dictionary.keys():
    code = code.replace(key, dictionary[key])
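
For a model with more features, typing the dictionary by hand gets tedious. A minimal sketch, assuming you keep the SAS column names in a list in the same order as the columns of the training matrix (feature_names and rename_inputs are names I made up for illustration):

```python
# Column names of the SAS input table, in the same order as the columns of X.
# Assumption: these names are chosen by you and are valid SAS variable names.
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Build {'inputVector(0)': 'sepal_length', ...} instead of typing it out.
mapping = {f"inputVector({i})": name for i, name in enumerate(feature_names)}


def rename_inputs(code, mapping):
    """Replace every inputVector(i) placeholder with its column name."""
    for placeholder, column in mapping.items():
        code = code.replace(placeholder, column)
    return code


print(rename_inputs("If inputVector(2) >= 2.45 Then", mapping))
```

Because each key includes the closing parenthesis, inputVector(1) cannot accidentally match inside inputVector(10) when there are more than ten features.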

4. Renaming the prediction labels

The last step is to rename the prediction labels.

# Rename the prediction labels
code = re.sub(r'Math\.Exp', 'Exp', code)
code = re.sub(r'pred = .*\n', '', code)
temp_var_list = re.findall(r"var[0-9]+\(\d\)", code)
for var_idx in range(len(temp_var_list)):
    # re.escape makes the parentheses in e.g. "var3(0)" match literally
    code = re.sub(re.escape(temp_var_list[var_idx]), iris.target_names[var_idx]+'_prob', code)
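
Here is a small sketch of the same renaming on hand-written text. It assumes the generated code writes the class scores into array slots such as var3(0), var3(1), var3(2), in class order; the sample lines are made up and only imitate that shape. Duplicate matches are dropped first so that each slot is paired with exactly one class name.

```python
import re

# Same order as iris.target_names.
target_names = ["setosa", "versicolor", "virginica"]

# Hand-written stand-in for the last lines of the converted code (illustrative only).
sample = (
    "var3(0) = Exp(var0)\n"
    "var3(1) = Exp(var1)\n"
    "var3(2) = Exp(var2)\n"
)

# Find the output slots and drop duplicates while keeping first-seen order.
slots = list(dict.fromkeys(re.findall(r"var[0-9]+\(\d\)", sample)))

renamed = sample
for slot, name in zip(slots, target_names):
    # re.escape turns "var3(0)" into a pattern that matches it literally.
    renamed = re.sub(re.escape(slot), name + "_prob", renamed)

print(renamed)
```

Note that Exp(var0) is not touched: the pattern only matches var followed by digits and then a parenthesized index.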

Then save the SAS model file.

# Save the output
with open('vb1.sas', 'w') as f:
    f.write(code)

Finally, to verify that the SAS script is correct, we compare the SAS model's predictions (read here from pred_result.csv) with the Python predictions.

# Python predictions
python_pred = pd.DataFrame(model.predict_proba(X_test))
python_pred.columns = ['setosa_prob','versicolor_prob','virginica_prob']
python_pred
# SAS predictions
sas_pred = pd.read_csv('pred_result.csv')
sas_pred = sas_pred.iloc[:,-3:]
sas_pred
# Per column, count the predictions that differ by more than 1e-5
(abs(python_pred - sas_pred) > 0.00001).sum()
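
If you want a single number rather than per-column counts, the largest absolute difference is a handy summary. A minimal sketch in plain Python; the function name max_abs_diff is my own, and the probabilities below are made-up placeholders, not output from the model above.

```python
def max_abs_diff(rows_a, rows_b):
    """Largest element-wise absolute difference between two equally shaped tables."""
    return max(
        abs(a - b)
        for row_a, row_b in zip(rows_a, rows_b)
        for a, b in zip(row_a, row_b)
    )


# Made-up probabilities standing in for the Python and SAS predictions.
python_rows = [[0.98, 0.01, 0.01], [0.02, 0.95, 0.03]]
sas_rows = [[0.980000001, 0.01, 0.009999999], [0.02, 0.95, 0.03]]

worst = max_abs_diff(python_rows, sas_rows)
print(worst < 1e-5)
```

With the DataFrames above, the same idea would be applied to the values of python_pred and sas_pred.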

As you can see, the two sets of predictions are essentially identical, so we can now run the XGBoost model in SAS.