


TLDR: A novel framework for Q-learning that models the maximal soft-values without needing to sample from a policy.

Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from Economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning (XQL) framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance in the D4RL benchmark, outperforming prior works by 10+ points on some tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks.

TLDR: The core of our approach is fitting a Gumbel distribution \(\mathcal{G}(\mu, \beta)\) to the data, which gives rise to Gumbel regression (or Extremal regression), a new technique that models the extreme values of a distribution.

For a temperature \(\beta\), Gumbel regression estimates the operator \(\beta \log \mathbb{E}[e^{X/\beta}]\), or the Log-Partition function, from samples of a random variable \(X\). This is a central quantity of interest in Statistics as well as the Physical Sciences, and its accurate calculation has important applications in Probabilistic Modeling, Bayesian Learning and Information Theory, such as in calculating marginal distributions. Nevertheless, it is very difficult to estimate in continuous spaces and is usually treated as an intractable quantity. This has led to a host of variational inference methods, such as VAEs, that use approximations to sidestep calculating it. Gumbel regression enables, for the first time, exact estimation of the Log-Partition function using simple gradient descent. By controlling the temperature \(\beta\), Gumbel regression interpolates between the max (\(\beta \to 0\)) and the mean (\(\beta \to \infty\)) of \(X\), and provides a robust estimator for the extremal values of a distribution. Finally, Gumbel regression admits tight PAC learning bounds and has bounded approximation error on a finite dataset (Section 3 of the paper).
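
To make the estimator concrete, here is a minimal sketch in plain Python, not the paper's implementation. It assumes the loss takes the form \(\mathbb{E}[e^{z} - z - 1]\) with \(z = (X - h)/\beta\); setting its derivative in \(h\) to zero gives \(\mathbb{E}[e^{z}] = 1\), that is, \(h = \beta \log \mathbb{E}[e^{X/\beta}]\). The function names, the initialization at the sample maximum, and the \(\beta^2\)-scaled step size are illustrative choices rather than details taken from the paper.

```python
import math
import random


def log_partition(xs, beta):
    """Closed-form beta * log(mean(exp(x / beta))), computed stably."""
    m = max(xs)
    mean_exp = sum(math.exp((x - m) / beta) for x in xs) / len(xs)
    return m + beta * math.log(mean_exp)


def gumbel_loss(h, xs, beta):
    """Assumed loss: mean of exp(z) - z - 1 with z = (x - h) / beta."""
    total = 0.0
    for x in xs:
        z = (x - h) / beta
        total += math.exp(z) - z - 1.0
    return total / len(xs)


def gumbel_regression(xs, beta, lr=0.5, steps=200):
    """Fit a single scalar h by gradient descent on gumbel_loss."""
    # Start at the sample maximum: it lies at or above the minimizer,
    # so every exp(z) is at most 1 and the first steps cannot overflow.
    h = max(xs)
    for _ in range(steps):
        # dL/dh = mean(1 - exp(z)) / beta
        grad = sum(1.0 - math.exp((x - h) / beta) for x in xs) / (len(xs) * beta)
        # Step size scaled by beta^2, since the curvature at the optimum is 1 / beta^2.
        h -= lr * beta ** 2 * grad
    return h


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(500)]
for beta in (0.1, 1.0, 10.0):
    fitted = gumbel_regression(samples, beta)
    exact = log_partition(samples, beta)
    print(f"beta={beta}: fitted={fitted:.4f}, closed form={exact:.4f}")
```

On a finite sample the fitted value should agree with the closed-form log-mean-exp, and moving \(\beta\) from small to large moves the estimate from near the sample maximum toward the sample mean. In practice \(h\) would be a function approximator trained on minibatches rather than a single scalar.
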
Our Gumbel regression loss function can be used to directly fit the LogSumExp of the Q-values, yielding the soft-optimal value function \(V^* = \mathrm{LogSumExp}(Q)\). Then, we can use Q-iteration even in high-dimensional continuous action spaces to find the optimal MaxEnt policy. This general algorithm works well in both online and offline settings.
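
The role of the soft value in Q-iteration can be illustrated on a toy problem. The sketch below is a tabular simplification, not XQL itself: it computes \(\beta \log \sum_a e^{Q(s,a)/\beta}\) exactly over a small discrete action set, whereas the framework described here estimates that quantity with Gumbel regression so that continuous actions never need to be enumerated. The two-state MDP, the function names, and the uniform weighting over actions are all hypothetical choices made for illustration.

```python
import math


def soft_value(q_row, beta):
    """beta * log(sum(exp(q / beta))) over one state's Q-values, computed stably."""
    m = max(q_row)
    return m + beta * math.log(sum(math.exp((q - m) / beta) for q in q_row))


def soft_q_iteration(rewards, next_state, gamma, beta, iters=500):
    """Repeat Q(s, a) <- r(s, a) + gamma * V(s') with V = soft_value(Q)."""
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        v = [soft_value(row, beta) for row in q]
        q = [
            [rewards[s][a] + gamma * v[next_state[s][a]] for a in range(n_actions)]
            for s in range(n_states)
        ]
    return q


def maxent_policy(q_row, beta):
    """Softmax policy pi(a | s) = exp((Q(s, a) - V(s)) / beta)."""
    v = soft_value(q_row, beta)
    return [math.exp((q - v) / beta) for q in q_row]


# Hypothetical deterministic MDP with two states and two actions.
rewards = [[0.0, 1.0], [2.0, 0.0]]   # rewards[s][a]
next_state = [[0, 1], [1, 0]]        # next_state[s][a]

q = soft_q_iteration(rewards, next_state, gamma=0.9, beta=0.5)
for s, row in enumerate(q):
    probs = [round(p, 3) for p in maxent_policy(row, beta=0.5)]
    print(f"state {s}: Q={[round(x, 3) for x in row]}, policy={probs}")
```

Because the policy is recovered from \(Q\) and \(V\) in closed form, the iteration itself never samples actions from a policy.
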
For online RL, it can be used to extend existing algorithms like SAC and TD3, with moderate performance gains. For offline RL, it outperforms existing approaches and obtains SOTA results on D4RL benchmarks. Below we provide a high-level overview:

(Above) XQL reaching state-of-the-art results on the Offline D4RL Benchmark



XQL on Franka Kitchen

IQL on Franka Kitchen


XTD3 shows moderate gains on DM Control Tasks compared to standard TD3.



XTD3 on Quadruped Run (Reward 437)

TD3 on Quadruped Run (Reward 293)



XTD3 on Hopper Hop (Reward 71)

TD3 on Hopper Hop (Reward 20)

@article{garg2022extreme,
  title={Extreme Q-Learning: MaxEnt Reinforcement Learning Without Entropy},
  url={https://arxiv.org/abs/2301.02328},
  author={Garg, Divyansh and Hejna, Joey and Geist, Matthieu and Ermon, Stefano},
  publisher={arXiv},
  year={2023},
}