Gradient Descent Optimization With Nadam From Scratch

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to navigate the search space. Nadam is an extension of the Adam version of gradient descent that incorporates Nesterov momentum. In this tutorial, you will discover how to implement the Nadam optimization algorithm from scratch, apply it to an objective function, and evaluate the results.
This tutorial is divided into the following parts:
1. Gradient Descent
2. Nadam Optimization Algorithm
3. Gradient Descent With Nadam
   - Two-Dimensional Test Problem
   - Gradient Descent Optimization With Nadam
   - Visualization of Nadam Optimization
Gradient Descent

The objective function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the objective function for a given set of inputs. The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space. It then moves downhill by updating the point:

x(t) = x(t-1) - step_size * f'(x(t-1))

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

Step size (alpha): Hyperparameter that controls how far to move in the search space against the gradient on each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optimum.
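To make the update rule concrete, here is a minimal, illustrative sketch of plain gradient descent on a one-dimensional function f(x) = x^2; the function, starting point, and hyperparameter values below are illustrative choices and not part of the worked example later in this tutorial.

# minimal gradient descent sketch for f(x) = x^2 (illustrative only)
def f_prime(x):
    return 2.0 * x  # derivative of f(x) = x^2

x = 1.5          # arbitrary starting point
step_size = 0.1  # step size hyperparameter (alpha)
for t in range(20):
    # x(t) = x(t-1) - step_size * f'(x(t-1))
    x = x - step_size * f_prime(x)
print(x)  # moves toward the optimum at x = 0.0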
Now that we are familiar with the gradient descent optimization algorithm, let's take a look at the Nadam algorithm.
Nadam Optimization Algorithm
Nesterov-accelerated Adaptive Moment Estimation, or the Nadam algorithm, is an extension of the Adaptive Movement Estimation (Adam) optimization algorithm that adds Nesterov's Accelerated Gradient (NAG), or Nesterov momentum, an improved type of momentum. More broadly, the Nadam algorithm is an extension of the gradient descent optimization algorithm. The algorithm was described in the 2016 paper by Timothy Dozat titled "Incorporating Nesterov Momentum into Adam", although a version of the paper was written in 2015 as a Stanford project report of the same name.

Momentum adds an exponentially decaying moving average (the first moment) of the gradient to the gradient descent algorithm. This has the effect of smoothing out noisy objective functions and improving convergence. Adam is an extension of gradient descent that adds the first and second moments of the gradient and automatically adapts a learning rate for each parameter being optimized. NAG is an extension of momentum in which the momentum update is performed using the gradient at the projected update of the parameters, rather than at the actual current variable values. In some situations this has the effect of slowing down the search as the optimum is located, rather than overshooting it.
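To make the distinction between classical momentum and Nesterov momentum concrete, here is a small illustrative sketch (not taken from the paper) for a single parameter of a one-dimensional function with derivative f'(x) = 2x; the variable names and values are illustrative, and the exponential-moving-average form of the update is one common formulation that matches the description above.

# illustrative contrast of classical momentum and Nesterov momentum (NAG)
def f_prime(x):
    return 2.0 * x  # derivative of f(x) = x^2

alpha, beta = 0.1, 0.9  # step size and momentum decay factor (illustrative)

# classical momentum: the gradient is taken at the current point
x, v = 1.5, 0.0
for t in range(50):
    v = beta * v + (1.0 - beta) * f_prime(x)
    x = x - alpha * v

# Nesterov momentum: the gradient is taken at the projected (look-ahead) point
x, v = 1.5, 0.0
for t in range(50):
    g = f_prime(x - alpha * beta * v)  # gradient at the look-ahead position
    v = beta * v + (1.0 - beta) * g
    x = x - alpha * v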
Nadam is an extension of Adam that uses NAG momentum instead of classical momentum. Let's step through each element of the algorithm. Nadam uses a decaying step size (alpha) and first-moment (mu) hyperparameter schedule to improve performance; for simplicity, we will ignore this aspect for now and assume constant values. First, we must maintain the first and second moments of the gradient for each parameter being optimized as part of the search, referred to as m and n respectively. They are initialized to 0.0 at the start of the search.
m = 0
n = 0
The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from x(t-1) to x(t). The algorithm is perhaps easier to understand if we focus on updating one parameter, which generalizes to updating all parameters via vector operations. First, the gradient (partial derivative) is calculated for the current time step.

g(t) = f'(x(t-1))
Next, the first moment is updated using the gradient and the mu hyperparameter.

m(t) = mu * m(t-1) + (1 - mu) * g(t)

Then the second moment is updated using the nu hyperparameter.

n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
Next, the first moment is bias-corrected using Nesterov momentum.

mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))

The second moment is then bias-corrected using the nu hyperparameter.

nhat = nu * n(t) / (1 - nu)

Finally, the value of the parameter for this iteration is calculated.

x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
Where sqrt() is the square root function and eps (epsilon) is a small value such as 1e-8 added to avoid a divide-by-zero error. The algorithm has three hyperparameters:

alpha: Initial step size (learning rate); a typical value is 0.002.
mu: Decay factor for the first moment (beta1 in Adam); a typical value is 0.975.
nu: Decay factor for the second moment (beta2 in Adam); a typical value is 0.999.
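As a compact restatement of the equations above, a single-parameter Nadam update step might be sketched as follows; this is only an illustration that mirrors the formulas verbatim, and the full worked implementation appears later in the tutorial.

# single-parameter Nadam update step, following the equations above
from math import sqrt

def nadam_step(x, m, n, g, alpha=0.002, mu=0.975, nu=0.999, eps=1e-8):
    # m(t) = mu * m(t-1) + (1 - mu) * g(t)
    m = mu * m + (1.0 - mu) * g
    # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
    n = nu * n + (1.0 - nu) * g**2
    # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
    mhat = (mu * m / (1.0 - mu)) + ((1.0 - mu) * g / (1.0 - mu))
    # nhat = nu * n(t) / (1 - nu)
    nhat = nu * n / (1.0 - nu)
    # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
    x = x - alpha / (sqrt(nhat) + eps) * mhat
    return x, m, n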
Gradient Descent With Nadam

In this section, we will explore how to implement gradient descent optimization with Nadam on a test problem.

Two-Dimensional Test Problem

First, let's define an optimization function. We will use a simple two-dimensional function that squares the input of each dimension, with valid inputs ranging from -1.0 to 1.0. The objective() function below implements this.

# objective function
def objective(x, y):
    return x**2.0 + y**2.0
We can create a three-dimensional surface plot of the objective function to get a feeling for the curvature of the response surface. The complete example of plotting the objective function is listed below.

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
Running the example creates a three-dimensional surface plot of the objective function, showing the familiar bowl shape with the global minimum at f(0, 0) = 0. We can also create a two-dimensional contour plot of the function, which will be useful later when we want to plot the progress of the search. The complete example of creating the contour plot is listed below.
# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

Gradient Descent Optimization With Nadam

We can apply gradient descent with Nadam to the test problem. First, we need a function that calculates the derivative of the objective function. The derivative of x^2 is x * 2 in each dimension.

f(x) = x^2
f'(x) = x * 2

The derivative() function below implements this.

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
Next, we can implement gradient descent optimization with Nadam. First, we can select a random point within the bounds of the problem as a starting point for the search. This assumes we have an array that defines the bounds of the search, with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of that dimension.
...
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])

Next, we initialize the decaying moving averages of the first and second moments, with one value for each parameter being optimized.

...
# initialize decaying moving averages
m = [0.0 for _ in range(bounds.shape[0])]
n = [0.0 for _ in range(bounds.shape[0])]
We then run a fixed number of iterations of the algorithm, defined by the n_iter hyperparameter.
...
# run iterations of gradient descent
for t in range(n_iter):
    ...

The first step within each iteration is to calculate the gradient (partial derivatives) for the current point.

...
# calculate gradient g(t)
g = derivative(x[0], x[1])

Next, we build the solution one variable at a time, applying the Nadam update equations to each parameter in turn.

...
# build a solution one variable at a time
for i in range(x.shape[0]):
    # m(t) = mu * m(t-1) + (1 - mu) * g(t)
    m[i] = mu * m[i] + (1.0 - mu) * g[i]
    # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
    n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
    # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
    mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
    # nhat = nu * n(t) / (1 - nu)
    nhat = nu * n[i] / (1.0 - nu)
    # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
    x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat

Finally, at the end of each iteration, the new candidate point is evaluated and the progress of the search is reported.

...
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
We can tie all of this together into a function named nadam() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search along with its evaluation.

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]
We can then define the hyperparameters and call the nadam() function to optimize our test objective function. In this case, we will use 50 iterations of the algorithm with an alpha of 0.02, mu of 0.8 and nu of 0.999, found after a little trial and error.

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))
Tying all of this together, the complete example of gradient descent optimization with Nadam on the two-dimensional test function is listed below.

# gradient descent optimization with nadam for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))
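Running the example applies Nadam to the test problem and reports the performance of the search on each iteration. The truncated output below shows the last ten iterations of one run, with the search converging on a near-optimal solution close to f(0, 0) = 0.0; exact values may differ slightly depending on your library versions.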
>40 f([ 5.07445337e-05 -3.32910019e-03]) = 0.00001
>41 f([-1.84325171e-05 -3.00939427e-03]) = 0.00001
>42 f([-6.78814472e-05 -2.69839367e-03]) = 0.00001
>43 f([-9.88339249e-05 -2.40042096e-03]) = 0.00001
>44 f([-0.00011368 -0.00211861]) = 0.00000
>45 f([-0.00011547 -0.00185511]) = 0.00000
>46 f([-0.0001075 -0.00161122]) = 0.00000
>47 f([-9.29922627e-05 -1.38760991e-03]) = 0.00000
>48 f([-7.48258406e-05 -1.18436586e-03]) = 0.00000
>49 f([-5.54299505e-05 -1.00116899e-03]) = 0.00000
Done!
f([-5.54299505e-05 -1.00116899e-03]) = 0.000001
Visualization of Nadam Optimization

We can plot the progress of the Nadam search on a contour plot of the domain. First, we can update the nadam() function to maintain a list of all solutions found during the search, then return this list at the end of the search. The updated version of the function with these changes is listed below.

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # store solution
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions
We can then execute the search as before, this time retrieving the list of solutions instead of the best final solution, and plot the solutions over a contour plot of the objective function.

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
Tying this all together, the complete example of performing the Nadam optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the nadam search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from numpy.random import rand
from numpy.random import seed
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # store solution
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

Author: 沂水寒城, CSDN blog expert. Research interests: machine learning, deep learning, NLP, CV.
Blog: http://yishuihancheng.blog.csdn.net