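[Time Series] Automating Time Series Forecasting with Auto-TS
Auto-TS is part of AutoML: it automates some components of the machine learning pipeline. Automation libraries like this help non-experts train basic machine learning models without much knowledge of the field. In this article, we will learn how to use the Auto-TS library to automate time series forecasting models.
What is Auto-TS?
It is an open-source Python library used mainly to automate time series forecasting. With a single line of code it trains multiple time series models, which helps us choose the best model for our problem statement.
In the open-source Python library Auto-TS, auto_timeseries() is the main function, and it is called with the training data. We can then choose the kind of models we want, such as stats, ml, or FB Prophet-based models. We can also adjust the parameters, and the best model is selected automatically according to the scoring parameter we specify. It returns the best model and a dictionary containing forecasts for the requested number of forecast periods (default = 2).
auto_timeseries is a complex model-building utility for time series data. Because it automates many of the tasks involved in this kind of work, it assumes many intelligent defaults, but we can change them. auto_timeseries quickly builds forecasting models based on Statsmodels ARIMA, Seasonal ARIMA, and Scikit-Learn ML, and automatically selects the model with the best value of the specified score.
auto_timeseries helps us build and select among multiple time series models using techniques such as ARIMA, SARIMAX, VAR, decomposable (trend + seasonality + residual) models, and ensemble machine learning models.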
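Features of the Auto-TS library
It finds the best time series forecasting model using genetic programming optimization.
It trains naive, statistical, machine learning, and deep learning models, with all possible hyperparameter configurations and cross-validation.
It performs data transformations to handle messy data by learning the best NaN imputation and outlier removal.
It lets you choose a combination of metrics for model selection.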
Installation
pip install auto-ts  # or
pip install git+https://github.com/AutoViML/Auto_TS.git
Dependencies (the following packages need to be installed beforehand):
dask
scikit-learn
FB Prophet
statsmodels
pmdarima
XGBoost
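Before importing auto_ts, it can help to confirm that each dependency is actually importable in the active environment. The sketch below uses only the standard library. The import names are assumptions based on common usage (`sklearn` for scikit-learn, `prophet` or the older `fbprophet` for FB Prophet), and `check_modules` is a hypothetical helper introduced here, not part of Auto-TS.

```python
from importlib.util import find_spec

# Import names for the dependencies listed above (assumed, not taken from Auto-TS).
DEPENDENCIES = ["dask", "sklearn", "prophet", "fbprophet",
                "statsmodels", "pmdarima", "xgboost"]

def check_modules(names):
    """Return {module_name: bool} telling whether each module can be located."""
    status = {}
    for name in names:
        try:
            status[name] = find_spec(name) is not None
        except (ImportError, ValueError):
            # A broken or half-installed package counts as unavailable.
            status[name] = False
    return status

if __name__ == "__main__":
    for name, ok in check_modules(DEPENDENCIES).items():
        print(f"{name:12s} {'ok' if ok else 'MISSING'}")
```

Only one of `prophet` and `fbprophet` is likely to be present, depending on which release of Prophet is installed.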
Import the library
from auto_ts import auto_timeseries
Pitfall warning
After following the installation steps above, you will very likely run into an error like this:
Running setup.py clean for fbprophet
Failed to build fbprophet
Installing collected packages: fbprophet
  Running setup.py install for fbprophet ... error
 ......
  from pystan import StanModel
ModuleNotFoundError: No module named 'pystan'
At this point you will probably install pystan with pip install pystan. Once that finishes, the same error still appears. If this happens to you, don't panic: Yunduo Jun has already stepped into this pit for you.
Reference solution (Mac/Anaconda):
1. Install Ephem:
conda install -c anaconda ephem

2. Install PyStan:
conda install -c conda-forge pystan

3. Install fbprophet (this step can take 4+ hours):
conda install -c conda-forge fbprophet
4. Finally, install:
pip install prophet
pip install fbprophet
5. Wait until you see:
Successfully installed cmdstanpy-0.9.5 fbprophet-0.7.1 holidays-0.13
If that still does not work, first try restarting Anaconda. If it still fails, install gcc first:
conda install gcc
Then go through the steps above again.
The whole process may take up to a day!
Finally, try the import again. Success!
from auto_ts import auto_timeseries
Imported auto_timeseries version:0.0.65. Call by using:
model = auto_timeseries(score_type='rmse',
time_interval='M',
non_seasonal_pdq=None,
seasonality=False,
seasonal_period=12,
model_type=['best'],
verbose=2,
dask_xgboost_flag=0)
model.fit(traindata,
ts_column,target)
model.predict(testdata, model='best')
Parameters available in auto_timeseries
model = auto_timeseries(
    score_type='rmse',
    time_interval='Month',
    non_seasonal_pdq=None,
    seasonality=False,
    seasonal_period=12,
    model_type=['Prophet'],
    verbose=2)
You can adjust these parameters and analyze how model performance changes. For more details on the parameters, see the auto-ts documentation [1].
Dataset used
This article uses the Amazon stock prices [2] dataset, downloaded from Kaggle, which covers January 2006 to January 2018. The library only trains time series forecasting models, so the dataset must have a column in time or date format.
First, load the time series dataset together with its time/date column:
import pandas as pd

df = pd.read_csv(
    "Amazon_Stock_Price.csv",
    usecols=['Date', 'Close'])
df['Date'] = pd.to_datetime(df['Date'])  # parse the dates
df = df.sort_values('Date')              # keep rows in time order
Now split the data into training and test sets:
train_df = df.iloc[:2800]
test_df = df.iloc[2800:]
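The split above is positional, so it respects time order only because the frame was sorted by Date first. Here is a minimal pure-Python sketch of the same idea, using a hypothetical helper `chronological_split` and made-up rows in place of the stock data:

```python
from datetime import date, timedelta

def chronological_split(rows, n_train):
    """Sort (date, value) rows by date, then cut at position n_train."""
    ordered = sorted(rows, key=lambda row: row[0])
    return ordered[:n_train], ordered[n_train:]

# Made-up example: 10 daily observations supplied in reverse order.
start = date(2018, 1, 1)
rows = [(start + timedelta(days=i), 100.0 + i) for i in range(10)][::-1]

train, test = chronological_split(rows, n_train=8)

# Every training date precedes every test date, so no future
# information leaks into training.
assert max(d for d, _ in train) < min(d for d, _ in test)
print(len(train), len(test))  # 8 2
```

A random shuffle before splitting would break this guarantee, which is why time series are split by position rather than sampled.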
Now visualize the train/test split:
train_df.Close.plot(
      figsize=(15,8),
      title='AMZN Stock Price', fontsize=14,
      label='Train', legend=True)
test_df.Close.plot(
      figsize=(15,8),
      title='AMZN Stock Price', fontsize=14,
      label='Test', legend=True)

Now initialize the Auto-TS model object and fit it on the training data:
model = auto_timeseries(
  forecast_period=219,
  score_type='rmse',
  time_interval='D',
  model_type='best')
model.fit(traindata=train_df,
    ts_column="Date",
    target="Close")
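During fitting, the log reports expanding window cross-validation. For Prophet it shows five folds, each with a 219-row test window and a training window that grows by 219 rows per fold. The sketch below reproduces those fold sizes under the assumption that the last fold's test window ends at the final training row. `expanding_window_folds` is a hypothetical helper written for illustration, not Auto-TS's internal code.

```python
def expanding_window_folds(n_rows, horizon, n_folds):
    """Yield (train_end, test_end) index pairs for expanding window CV.

    Each fold trains on rows [0, train_end) and tests on the next
    `horizon` rows, [train_end, test_end). The final fold's test
    window ends at the last row.
    """
    first_train_end = n_rows - n_folds * horizon
    if first_train_end <= 0:
        raise ValueError("not enough rows for the requested folds")
    for i in range(n_folds):
        train_end = first_train_end + i * horizon
        yield train_end, train_end + horizon

folds = list(expanding_window_folds(n_rows=2800, horizon=219, n_folds=5))
for number, (train_end, test_end) in enumerate(folds, start=1):
    print(f"Fold {number}: train rows = {train_end}, "
          f"test rows = {test_end - train_end}")
```

With 2800 training rows this gives training sizes of 1705, 1924, 2143, 2362 and 2581, the same "Train Shape" values that appear in the Prophet section of the log.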
Now compare the accuracy of the different models:
model.get_leaderboard()
model.plot_cv_scores()
This gives the following output:
Start of Fit.....
    Target variable given as = Close
Start of loading of data.....
    Inputs: ts_column = Date, sep = ,, target = ['Close']
    Using given input: pandas dataframe...
    Date column exists in given train data...
    train data shape = (2800, 1)
Alert: Could not detect strf_time_format of Date. Provide strf_time format during "setup" for better results.
Running Augmented Dickey-Fuller test with paramters:
    maxlag: 31 regression: c autolag: BIC
Data is stationary after one differencing
There is 1 differencing needed in this datasets for VAR model
No time series plot since verbose = 0. Continuing
Time Interval is given as D
    Correct Time interval given as a valid Pandas date-range frequency...
WARNING: Running best models will take time... Be Patient...
==================================================
Building Prophet Model
==================================================
Running Facebook Prophet Model...
  Starting Prophet Fit
      No seasonality assumed since seasonality flag is set to False
  Starting Prophet Cross Validation
Max. iterations using expanding window cross validation = 5
Fold Number: 1 --> Train Shape: 1705 Test Shape: 219
    RMSE = 30.01
    Std Deviation of actuals = 19.52
    Normalized RMSE (as pct of std dev) = 154%
Cross Validation window: 1 completed
Fold Number: 2 --> Train Shape: 1924 Test Shape: 219
    RMSE = 45.33
    Std Deviation of actuals = 34.21
    Normalized RMSE (as pct of std dev) = 132%
Cross Validation window: 2 completed
Fold Number: 3 --> Train Shape: 2143 Test Shape: 219
    RMSE = 65.61
    Std Deviation of actuals = 39.85
    Normalized RMSE (as pct of std dev) = 165%
Cross Validation window: 3 completed
Fold Number: 4 --> Train Shape: 2362 Test Shape: 219
    RMSE = 178.53
    Std Deviation of actuals = 75.28
    Normalized RMSE (as pct of std dev) = 237%
Cross Validation window: 4 completed
Fold Number: 5 --> Train Shape: 2581 Test Shape: 219
    RMSE = 148.18
    Std Deviation of actuals = 57.62
    Normalized RMSE (as pct of std dev) = 257%
Cross Validation window: 5 completed
-------------------------------------------
Model Cross Validation Results:
-------------------------------------------
    MAE (Mean Absolute Error = 85.20
    MSE (Mean Squared Error = 12218.34
    MAPE (Mean Absolute Percent Error) = 17%
    RMSE (Root Mean Squared Error) = 110.5366
    Normalized RMSE (MinMax) = 18%
    Normalized RMSE (as Std Dev of Actuals)= 60%
Time Taken = 13 seconds
  End of Prophet Fit
==================================================
Building Auto SARIMAX Model
==================================================
Running Auto SARIMAX Model...
    Using smaller parameters for larger dataset with greater than 1000 samples
    Using smaller parameters for larger dataset with greater than 1000 samples
    Using smaller parameters for larger dataset with greater than 1000 samples
    Using smaller parameters for larger dataset with greater than 1000 samples
    Using smaller parameters for larger dataset with greater than 1000 samples
SARIMAX RMSE (all folds): 73.9230
SARIMAX Norm RMSE (all folds): 35%
-------------------------------------------
Model Cross Validation Results:
-------------------------------------------
    MAE (Mean Absolute Error = 64.24
    MSE (Mean Squared Error = 7962.95
    MAPE (Mean Absolute Percent Error) = 12%
    RMSE (Root Mean Squared Error) = 89.2354
    Normalized RMSE (MinMax) = 14%
    Normalized RMSE (as Std Dev of Actuals)= 48%
    Using smaller parameters for larger dataset with greater than 1000 samples
Refitting data with previously found best parameters
    Best aic metric = 18805.2
                               SARIMAX Results
==============================================================================
Dep. Variable:                  Close   No. Observations:                 2800
Model:               SARIMAX(2, 2, 0)   Log Likelihood               -9397.587
Date:                Mon, 28 Feb 2022   AIC                          18805.174
Time:                        19:45:31   BIC                          18834.854
Sample:                             0   HQIC                         18815.888
                               - 2800
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     -0.0033      0.557     -0.006      0.995      -1.094       1.088
drift       3.618e-06      0.000      0.015      0.988      -0.000       0.000
ar.L1         -0.6405      0.008    -79.601      0.000      -0.656      -0.625
ar.L2         -0.2996      0.009    -32.618      0.000      -0.318      -0.282
sigma2        48.6323      0.456    106.589      0.000      47.738      49.527
===================================================================================
Ljung-Box (L1) (Q):                  14.84   Jarque-Bera (JB):             28231.48
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):              19.43   Skew:                             0.56
Prob(H) (two-sided):                  0.00   Kurtosis:                        18.53
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
===============================================
Skipping VAR Model since dataset is > 1000 rows and it will take too long
===============================================
==================================================
Building ML Model
==================================================
Creating 2 lagged variables for Machine Learning model...
    You have set lag = 3 in auto_timeseries setup to feed prior targets. You cannot set lags > 10 ...
### Be careful setting dask_xgboost_flag to True since dask is unstable and doesn't work sometime's ###
########### Single-Label Regression Model Tuning and Training Started ####
Fitting ML model
    11 variables used in training ML model = ['Close(t-1)', 'Date_hour', 'Date_minute', 'Date_dayofweek', 'Date_quarter', 'Date_month', 'Date_year', 'Date_dayofyear', 'Date_dayofmonth', 'Date_weekofyear', 'Date_weekend']
Running Cross Validation using XGBoost model..
    Max. iterations using expanding window cross validation = 2
train fold shape (2519, 11), test fold shape = (280, 11)
### Number of booster rounds = 250 for XGBoost which can be set during setup ####
    Hyper Param Tuning XGBoost with CPU parameters. This will take time. Please be patient...
Cross-validated Score = 31.896 in num rounds = 249
Time taken for Hyper Param tuning of XGBoost (in minutes) = 0.0
Top 10 features:
['Date_year', 'Close(t-1)', 'Date_quarter', 'Date_month', 'Date_weekofyear', 'Date_dayofyear', 'Date_dayofmonth', 'Date_dayofweek']
    Time taken for training XGBoost on entire train data (in minutes) = 0.0
Returning the following:
    Model =
    Scaler = Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Close(t-1)', 'Date_hour',
                                                   'Date_minute',
                                                   'Date_dayofweek',
                                                   'Date_quarter', 'Date_month',
                                                   'Date_year',
                                                   'Date_dayofyear',
                                                   'Date_dayofmonth',
                                                   'Date_weekofyear',
                                                   'Date_weekend'])])),
                ('maxabsscaler', MaxAbsScaler())])
    (3) sample predictions:[359.8374  356.59747 355.447  ]
XGBoost model tuning completed
Target = Close...CV results:
    RMSE = 246.63
    Std Deviation of actuals = 94.60
    Normalized RMSE (as pct of std dev) = 261%
Fitting model on entire train set. Please be patient...
    Time taken to train model (in seconds) = 0
Best Model is: auto_SARIMAX
    Best Model (Mean CV) Score: 73.92
--------------------------------------------------
Total time taken: 52 seconds.
--------------------------------------------------
Leaderboard with best model on top of list:
            name        rmse
1  auto_SARIMAX   73.922971
0       Prophet   93.532440
2            ML  246.630613
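The log reports RMSE alongside a "Normalized RMSE (as pct of std dev)". In Prophet's first fold, for instance, the RMSE is 30.01 and the standard deviation of the actuals is 19.52, and 30.01 / 19.52 ≈ 1.54, which is the 154% shown. Below is a minimal sketch of both metrics on made-up numbers. It assumes the population standard deviation, since the log does not say which variant Auto-TS uses.

```python
import math
import statistics

def rmse(actuals, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(actuals) != len(predictions):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def normalized_rmse(actuals, predictions):
    """RMSE divided by the (population) standard deviation of the actuals."""
    return rmse(actuals, predictions) / statistics.pstdev(actuals)

# Made-up values for illustration only.
actuals = [10.0, 12.0, 14.0, 16.0]
predictions = [11.0, 11.0, 15.0, 15.0]

print(f"RMSE = {rmse(actuals, predictions):.2f}")                  # RMSE = 1.00
print(f"Normalized RMSE = {normalized_rmse(actuals, predictions):.0%}")  # 45%
```

A normalized value above 100% means the forecast errors are larger than the spread of the data itself, which is consistent with the ML model (261%) landing at the bottom of the leaderboard.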






Now test the model on the test data:
future_predictions = model.predict(testdata=219)
# or
model.predict(
    testdata=test_df.Close)
The prediction uses forecast period = 219 as the input to the auto_SARIMAX model:
future_predictions

Let's see what future_predictions looks like:

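Finally, visualize the test values against the predictions:
import matplotlib.pyplot as plt

# Place actual and predicted values side by side
# (assumes both frames share the same index).
pred_df = pd.concat(
    [test_df, future_predictions],
    axis=1)
fig, ax = plt.subplots(figsize=(15, 8))  # figure and axes to draw on
ax.plot('Date', 'Close', 'b',
        data=pred_df,
        label='Test')
ax.plot('Date', 'yhat', 'r',
        data=pred_df,
        label='Predictions')
ax.legend()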

Parameters available in auto_timeseries:
model = auto_timeseries(
    score_type='rmse',
    time_interval='Month',
    non_seasonal_pdq=None,
    seasonality=False,
    seasonal_period=12,
    model_type=['Prophet'],
    verbose=2)
Parameters available in model.fit():
model.fit(traindata=train_data,
    ts_column=ts_column,
    target=target,
    cv=5, sep=",")
Parameters available in model.predict():
predictions = model.predict(
    testdata=test_df,  # a DataFrame, or an integer giving the number of forecast periods
    model='best')      # or any other string naming a trained model
You can experiment with all of these parameters, analyze how the model performs, and then choose the most suitable model for your problem statement. See the auto-ts documentation [3] for details on each parameter.
Final thoughts
This article showed how to automate time series modeling in a single line of Python code. Auto-TS preprocesses the data: it removes outliers and handles messy data by learning the best NaN imputation.
After you initialize an Auto-TS object and fit it on the training data, it automatically trains multiple time series models, such as ARIMA, SARIMAX, FB Prophet, and VAR, and returns the best-performing one. The results depend to some extent on the size of the dataset, so a larger dataset may improve them.
References
[1] auto-ts documentation: https://pypi.org/project/auto-ts/
[2] Amazon stock prices: https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231?select=AMZN_2006-01-01_to_2018-01-01.csv
[3] auto-ts documentation: https://pypi.org/project/auto-ts/