Hazan Lab @ Princeton University
One of the biggest challenges for the Transformer architecture, which powers modern large language models, is its computational inefficiency on long sequences. The computational complexity of inference in the attention module scales quadratically with the context length, which can be extremely long in practice, and is thus a limiting factor for the architecture.
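To see where the quadratic cost comes from, here is a minimal NumPy sketch of single-head scaled dot-product attention; the function name and dimensions are illustrative assumptions, not taken from any particular codebase. The score matrix has shape (n, n) for context length n, so both time and memory grow quadratically with n.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention (illustrative sketch).

    Q, K, V: arrays of shape (n, d) for context length n and head
    dimension d. The scores matrix below is n x n -- the source of
    the quadratic dependence on context length.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # shape (n, d)

# Doubling n quadruples the size of `scores`.
n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)
```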
The study of mathematical optimization is a hallmark of the application of the scientific method to almost all engineering fields. With the rise of machine learning and large-scale problems, attention shifted from interior point methods based on Newton’s method to first-order methods (FOM). The most popular and efficient FOMs are adaptive gradient methods (AGM); see chapter 6 of the lecture notes here and the explanations below.
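As a concrete instance of an adaptive gradient method, here is a minimal sketch of the diagonal AdaGrad update in NumPy; the step size and the toy objective are my own illustrative choices, not taken from the lecture notes. Each coordinate gets its own effective learning rate, scaled by the inverse square root of that coordinate's accumulated squared gradients.

```python
import numpy as np

def adagrad_step(x, grad, sum_sq, eta=0.1, eps=1e-8):
    """One diagonal-AdaGrad update: per-coordinate step sizes
    shrink with the accumulated squared gradients."""
    sum_sq += grad ** 2
    x -= eta * grad / (np.sqrt(sum_sq) + eps)
    return x, sum_sq

# Toy quadratic f(x) = 0.5 * ||A x||^2 with badly scaled coordinates.
A = np.diag([3.0, 0.5])
x = np.array([1.0, 1.0])
sum_sq = np.zeros_like(x)
for _ in range(200):
    grad = A.T @ A @ x
    x, sum_sq = adagrad_step(x, grad, sum_sq)
print(x)  # approaches the minimizer at the origin
```

Note how the per-coordinate scaling makes both coordinates progress at the same relative rate despite the factor-of-36 difference in curvature between them.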
David Blackwell was still roaming the corridors of UC Berkeley’s statistics department during my postdoc years, circa 2009. Jake Abernethy, Sasha Rakhlin, Peter Bartlett, and I were discussing his results, and his seminal contributions to prediction theory were already well known.
The following post is based on this paper and this paper.
In a famous 1906 competition at a local fair in Plymouth, England, participants were asked to guess the weight of an ox. Out of a crowd of hundreds, no one came close to the ox’s actual weight, but the average of all guesses was almost exactly correct. How is it that combining the opinions of laymen can arrive at highly reasoned decisions, despite the weak judgment of the individual members? This concept of harnessing wisdom from weak rules of thumb to form a highly accurate prediction rule is the basis of ensemble methods and boosting. Boosting is a theoretically sound methodology that has transformed machine learning across a variety of applications: classification and regression tasks, online learning, and many more.
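For concreteness, here is a minimal sketch of the boosting idea in the style of AdaBoost over one-dimensional threshold stumps; this toy implementation is my own illustration of the general scheme, not code from the paper.

```python
import numpy as np

def adaboost(X, y, T=20):
    """Minimal AdaBoost with 1-D threshold stumps.
    X: (n,) features; y: (n,) labels in {-1, +1}."""
    n = len(X)
    D = np.ones(n) / n                          # example weights
    stumps, alphas = [], []
    for _ in range(T):
        # Weak learner: the (threshold, sign) stump of least weighted error.
        best = None
        for thr in X:
            for s in (+1, -1):
                pred = s * np.sign(X - thr + 1e-12)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, thr, s)
        err, thr, s = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # the stump's vote
        pred = s * np.sign(X - thr + 1e-12)
        D *= np.exp(-alpha * y * pred)          # reweight: focus on mistakes
        D /= D.sum()
        stumps.append((thr, s)); alphas.append(alpha)

    def predict(x):
        return np.sign(sum(a * s * np.sign(x - thr + 1e-12)
                           for a, (thr, s) in zip(alphas, stumps)))
    return predict

# Toy data that no single stump classifies perfectly.
X = np.array([-2.0, -1.0, 0.5, 1.0, 2.0, 3.0])
y = np.array([-1, -1, +1, -1, +1, +1])
predict = adaboost(X, y)
print(predict(X))  # predictions of the boosted ensemble
```

The reweighting step is the “wisdom of crowds” mechanism: each round, the examples the current ensemble gets wrong gain weight, so the next weak rule of thumb is forced to attend to them.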
Spring semester is over, yay! To celebrate summer, I’ve compiled lecture notes from the graduate course COS 598D, a.k.a. “optimization for machine learning”.
The following post is based on this paper.
The picture to the right depicts Rudolf Kalman, a pioneer of modern control, receiving the Presidential Medal of Freedom from President Barack Obama.
This post is a long-overdue sequel to parts I and II of previous posts on unsupervised learning. The long time it took me to publish this one is not by chance: I am still ambivalent about the merit, or lack thereof, of a notion of worst-case PAC/statistical learnability (or, god forbid, regret minimization) without supervision.
I’ve always admired the beautiful theory of the analysis of boolean functions (see the excellent book by Ryan O’Donnell, as well as Gil Kalai’s lectures). Heck, it used to be my main area of investigation back in the early grad-school days, when we were studying hardness of approximation, the PCP theorem, and the “new” (at the time) technology of Fourier analysis of boolean functions. This technology also gave rise to seminal results on learnability with respect to the uniform distribution. Uniform-distribution learning has been widely criticized as unrealistic, and the focus of the theoretical machine learning community has shifted to (mostly) convex, real-valued, agnostic learning in recent years.
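As a small taste of this technology, here is a brute-force sketch (my own illustration, assuming the standard {-1,+1} conventions) of computing the Fourier coefficients f̂(S) = E_x[f(x)·χ_S(x)] of a boolean function, where χ_S(x) = ∏_{i∈S} x_i.

```python
import numpy as np
from itertools import product, combinations

def fourier_coefficient(f, S, n):
    """Brute-force f_hat(S) = E_x[f(x) * chi_S(x)] over x in {-1,+1}^n,
    where chi_S(x) = prod_{i in S} x_i."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        chi = np.prod([x[i] for i in S]) if S else 1.0
        total += f(x) * chi
    return total / 2 ** n

# Example: majority on 3 bits puts all its Fourier weight on odd-size sets.
maj = lambda x: 1 if sum(x) > 0 else -1
for k in range(4):
    for S in combinations(range(3), k):
        print(S, fourier_coefficient(maj, S, 3))
```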
In this continuation post I’d like to bring out a critical component of learning theory that is essentially missing in today’s approaches for unsupervised learning: improper learning by convex relaxations.
This nostalgic post was written after a tutorial at ICML 2016, as a recollection of a few memories with my friend Satyen Kale.
The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide how deep into convex analysis to dive, and to settle questions such as “should I define strong convexity?”, “discuss smoothness?”, “Nesterov acceleration?”, and so on.
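Whichever variant one teaches, the algorithm itself fits in a few lines. Here is a vanilla gradient-descent sketch with the classical step size 1/L for an L-smooth objective; the toy quadratic and the constants are my own illustrative choices.

```python
import numpy as np

def gradient_descent(grad, x0, L, T=100):
    """Vanilla gradient descent with the classical 1/L step size,
    the textbook choice for a convex, L-smooth objective."""
    x = x0
    for _ in range(T):
        x = x - (1.0 / L) * grad(x)
    return x

# Toy example: f(x) = 0.5 * x^T A x is L-smooth with L = lambda_max(A).
A = np.diag([2.0, 1.0])
x_final = gradient_descent(lambda x: A @ x, np.array([5.0, -3.0]), L=2.0)
print(x_final)  # converges toward the minimizer at the origin
```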
The standard curriculum in high school math includes elementary analysis of functions, including methods for finding the minima, maxima, and saddle points of a function of a single variable. When moving to high dimensions, this becomes beyond the reach of the typical high-school student: mathematical optimization theory spans a multitude of involved techniques from virtually all areas of computer science and mathematics.
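As a reminder of the single-variable recipe (a standard textbook example, not from the post): set the derivative to zero and classify the stationary points with the second derivative.

```latex
f(x) = x^3 - 3x, \qquad f'(x) = 3x^2 - 3 = 0 \;\Longrightarrow\; x = \pm 1,
```

and since f''(x) = 6x, the point x = 1 (where f'' > 0) is a local minimum while x = -1 (where f'' < 0) is a local maximum. In high dimensions the analogous conditions involve the gradient and the spectrum of the Hessian, which is where the difficulty begins.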
First-order methods such as gradient descent, AdaGrad, and SVRG dominate the landscape of optimization for machine learning due to their extremely low per-iteration computational cost. Second-order methods have largely been ignored in this context due to their prohibitively large time complexity; as a general rule, any super-linear-time operation is prohibitively expensive for large-scale data analysis. In our recent paper we attempt to bridge this divide by exhibiting an efficient linear-time second-order algorithm for typical (convex) optimization problems arising in machine learning.
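The key primitive behind a linear-time second-order step is the Hessian-vector product, which for typical machine-learning losses costs about as much as one extra gradient evaluation. Here is a minimal sketch (my own illustration of the Neumann-series idea, not the paper's code) of estimating a Newton direction H⁻¹g via the truncated series H⁻¹ = Σᵢ (I − H)ⁱ, valid when the spectrum of H lies in (0, 1].

```python
import numpy as np

def newton_direction(hvp, g, steps=50):
    """Estimate H^{-1} g via the truncated Neumann series
    H^{-1} = sum_i (I - H)^i, valid when 0 < H <= I spectrally.
    hvp(v) returns the Hessian-vector product H v."""
    u = g.copy()            # running estimate of H^{-1} g
    for _ in range(steps):
        u = g + u - hvp(u)  # u <- g + (I - H) u
    return u

# Sanity check against a direct solve on an explicit Hessian.
H = np.array([[0.9, 0.1], [0.1, 0.5]])
g = np.array([1.0, -2.0])
u = newton_direction(lambda v: H @ v, g)
print(np.allclose(u, np.linalg.solve(H, g)))  # True
```

Since each iteration touches the Hessian only through a product with a vector, the per-step cost stays linear in the problem dimension rather than quadratic or cubic.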
Classical statistical learning theory deals with finding a consistent hypothesis given sufficiently many examples. Such a hypothesis may be a rule for classifying emails as spam or not spam, and the examples are emails that have already been labeled.
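In symbols (standard textbook notation, not specific to this post): given m labeled examples (x_i, y_i), empirical risk minimization selects

```latex
\hat{h} \in \arg\min_{h \in \mathcal{H}} \; \frac{1}{m} \sum_{i=1}^{m} \ell\big(h(x_i), y_i\big),
```

and the guarantees of learning theory bound how far the population risk of the chosen hypothesis can be from the best achievable in the class, given sufficiently many examples.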
The duality theorem of linear programming and its extensions is a fundamental pillar of optimization. It lies at the heart of an elegant theory that allows us not only to reason about the structure of continuous optimization problems, but also to design combinatorial optimization and approximation algorithms.
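For reference, here is one common normal form of the primal-dual pair (conventions vary across textbooks):

```latex
\text{(Primal)}\;\; \min_{x}\; c^{\top} x \;\;\text{s.t.}\;\; A x \ge b,\; x \ge 0
\qquad
\text{(Dual)}\;\; \max_{y}\; b^{\top} y \;\;\text{s.t.}\;\; A^{\top} y \le c,\; y \ge 0.
```

Weak duality gives b⊤y ≤ c⊤x for any feasible pair, and the duality theorem says the two optimal values coincide whenever either program is feasible and bounded.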