
From an ODE to accelerated stochastic gradient descent: convergence rate and empirical results

[Moved Online] Hot Topics: Optimal transport and applications to machine learning and statistics (May 04, 2020 - May 08, 2020)

May 07, 2020 (11:00 AM PDT - 12:00 PM PDT)
Speaker(s): Adam Oberman
Location: SLMath: Online/Virtual
Tags/Keywords
  • stochastic gradient descent

  • convex optimization

  • accelerated algorithms

  • Nesterov's method


Abstract

Su, Boyd and Candès showed that Nesterov's method for accelerated gradient descent can be obtained as the discretization of an ordinary differential equation (ODE). By discretizing a modified ODE we show that: (i) we recover Nesterov's method for convex functions using a constant learning rate, and (ii) choosing a decreasing learning rate leads to an accelerated stochastic gradient descent algorithm. The learning rate for the stochastic algorithm decreases at the rate k^{-3/4}, which is novel. We also study a second ODE and obtain a corresponding algorithm for the strongly convex case.
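
For reference, the ODE of Su, Boyd and Candès and the form of Nesterov's method it models can be written as follows; this is the standard background result cited in the abstract, not the modified ODE introduced in the talk:

\[
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\bigl(X(t)\bigr) = 0, \qquad X(0) = x_0, \quad \dot{X}(0) = 0,
\]

which arises as the continuous-time limit of Nesterov's scheme for a convex, L-smooth f with constant step size s \le 1/L:

\[
x_k = y_{k-1} - s\,\nabla f(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1}).
\]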

Convergence rates for the algorithms are established using a Lyapunov function approach which unifies the continuous-time, deterministic gradient, and stochastic gradient cases. The stochastic algorithms are accelerated (in the convex case and the stochastic approximation setting) in the sense that the rate constant improves over existing results: existing results have a rate constant which depends on G^2, the bound on the norm of the stochastic gradient, whereas the rate constant of the accelerated algorithm depends on the smaller quantity \sigma^2, the variance of the stochastic gradient.
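
To illustrate the Lyapunov function approach in the simplest setting (the continuous-time, deterministic case of Su, Boyd and Candès; the Lyapunov functions for the discretized and stochastic cases in this work differ in their details), one can take

\[
\mathcal{E}(t) = t^2 \bigl( f(X(t)) - f^* \bigr) + 2 \, \bigl\| X(t) + \tfrac{t}{2} \dot{X}(t) - x^* \bigr\|^2,
\]

which is nonincreasing along solutions of the ODE above and immediately gives the accelerated rate f(X(t)) - f^* \le O(1/t^2).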

Numerical results, in both the stochastic approximation and the finite-sum settings, demonstrate that the accelerated algorithms can converge faster than SGD or momentum SGD. Moreover, the algorithms are easily tuned, in the sense that empirical results are robust to changes in the learning rate schedule and constants. This is joint work with Maxime Laborde.
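
As a rough sketch of the kind of method being compared, the following is a generic Nesterov-style SGD loop in Python with a decreasing learning-rate schedule; it is an illustrative assumption (the function accelerated_sgd, the schedule constant c, and the toy least-squares problem are all hypothetical), not the exact update rule, constants, or experiments of the talk.

import numpy as np

def accelerated_sgd(grad, x0, n_iter=1000, c=0.1, seed=0):
    """Illustrative Nesterov-style SGD with a decreasing learning rate
    h_k ~ c * k**(-3/4).  A generic sketch, not the algorithm analyzed
    in the talk."""
    rng = np.random.default_rng(seed)
    x = x_prev = np.asarray(x0, dtype=float)
    for k in range(1, n_iter + 1):
        h_k = c * k ** (-0.75)                    # decreasing learning rate
        y = x + (k - 1) / (k + 2) * (x - x_prev)  # momentum/extrapolation step
        g = grad(y, rng)                          # stochastic gradient at y
        x_prev, x = x, y - h_k * g                # gradient step from y
    return x

# Toy usage: least squares with additive gradient noise of variance sigma^2.
if __name__ == "__main__":
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    sigma = 0.1
    def grad(x, rng):
        return A @ x - b + sigma * rng.standard_normal(x.shape)
    x_hat = accelerated_sgd(grad, np.zeros(2), n_iter=20000)
    print("distance to minimizer:", np.linalg.norm(x_hat - np.linalg.solve(A, b)))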

Supplements
Notes (PDF, 4.64 MB)
Video/Audio Files

H.264 Video 928_28396_8330_From_an_ODE_to_Accelerated_Stochastic_Gradient_Descent-_Convergence_Rate_and_Empirical_Results.mp4