BOpt-GMM

\[ \def\N{\mathbb{N}} \def\Z{\mathbb{Z}} \def\I{\mathbb{I}} \def\Q{\mathbb{Q}} \def\R{\mathbb{R}} \def\V{\mathbb{V}} %\def\C{\mathbb{C}} \def\A{\mathcal{A}} \def\D{\mathcal{D}} \def\Cset{\mathcal{C}} \def\E{\mathcal{E}} \def\S{\mathcal{S}} \def\p{\mathcal{p}} \def\P{\mathcal{P}} \def\rneg{\tilde{\neg}} \def\rle{\prec} \def\rge{\succ} \def\rand{\curlywedge} \def\ror{\curlyvee} \newcommand{\var}[1]{V\left({#1}\right)} \newcommand{\gvar}[1]{K\left({#1}\right)} \newcommand{\app}[2]{{#1}(#2)} \newcommand{\gnd}[1]{\underline{#1}} \newcommand{\appf}[2]{\gnd{#1(#2)}} \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\fd}[1]{\dot{#1}} \newcommand{\sd}[1]{\ddot{#1}} \newcommand{\td}[1]{\dddot{#1}} \newcommand{\fourthd}[1]{\ddddot{#1}} \newcommand{\diff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\tdiff}[2]{d {#1}_{#2}} \newcommand{\prob}[1]{p\left(#1\right)} \newcommand{\probc}[2]{\prob{#1 \;\middle\vert\; #2}} \newcommand{\probdist}[2]{p_{#1}\left(#2\right)} \newcommand{\probcdist}[3]{\probdist{#1}{#2 \;\middle\vert\; #3}} \newcommand{\KL}[2]{D_{KL}\left(#1 \;\delimsize\|\; #2 \right)} \newcommand{\vecM}[1]{\begin{bmatrix} #1 \end{bmatrix}} \newcommand{\set}[1]{\left\{#1\right\}} \newcommand{\fset}[2]{\set{#1 \;\middle\vert\; #2}} \newcommand{\noindex}{\hspace*{-0.8em}} \newcommand{\xmark}{\ding{55}} \newcommand{\ce}[2]{{#1}_{#2}} \newcommand{\lb}[1]{\ce{\iota}{#1}} \newcommand{\ub}[1]{\ce{\upsilon}{#1}} \newcommand{\rot}[1]{\rlap{\rotatebox{45}{#1}~}} \newcommand{\tf}[3]{\tensor[^{#1}]{\mat{#2}}{_{#3}}} \newcommand{\Csset}[1]{\Cset \left(#1 \right)} \DeclareMathOperator{\diag}{\text{diag}} \DeclareMathOperator{\acos}{\text{acos}} \DeclareMathOperator{\asin}{\text{asin}} \DeclareMathOperator{\sgn}{\text{sgn}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \]

Abstract

Sample efficient learning of manipulation skills poses a major challenge in robotics. While recent approaches demonstrate impressive advances in the type of task that can be addressed and the sensing modalities that can be incorporated, they still require large amounts of training data. Especially with regard to learning actions on robots in the real world, this poses a major problem due to the high costs associated with both demonstrations and real-world robot interactions. To address this challenge, we introduce BOpt-GMM, a hybrid approach that combines imitation learning with own experience collection. We first learn a skill model as a dynamical system encoded in a Gaussian Mixture Model from a few demonstrations. We then improve this model with Bayesian optimization building on a small number of autonomous skill executions in a sparse reward setting. We demonstrate the sample efficiency of our approach on multiple complex manipulation skills in both simulations and real-world experiments.

Technical Approach

Figure: Our approach consists of two parts: A Bayesian optimizer estimating the value $\probc{\Delta\theta}{D}$ and proposing potential new updates $\Delta\theta_i$. The second part is the evaluation function $h(\Delta\theta_i, j)$ which plays the update $\Delta\theta$ for $j$ steps and averages the returns. The results are used to inform the optimizer.

In this work, we consider a sparse reinforcement learning setting, in which a policy $\pi_\theta$ processing observations $s_t$, produces an action $a_t$, and receives a reward $r_t \in \R$ in return. The objective of the policy is to accumulate the maximum possible reward $R$ for an episode. We assume the reward to be sparse, only given out at the end of an episode for either success or failure, i.e. $R \in \set{0, 1}$. We assume the policy $\pi_\theta$ to be parameterized under a space $\Theta \subseteq \R^m$ and assume the existence of an update function $\oplus$ which can be used to derive an updated policy $\pi_{\theta,i} = \pi_\theta \oplus \Delta \theta_i$. We strive to find the optimal update $\Delta\theta^*$ which will yield the optimal policy $\pi^*=\pi_\theta \oplus \Delta \theta^*$. Performance is assessed by a non-deterministic evaluation function $h_{\theta}(\Delta\theta, j) \rightarrow \R$ which executes the policy yielded by the update for $j$ episodes and averages the rewards obtained by the executions. The overall objective is \[ \Delta\theta^* = \argmax_{\Delta\theta}\ h_{\theta}(\Delta\theta, j), \] where $j$ is constant. Each evaluation of $h_\theta$ yields a data point $(\Delta\theta_i, R_i)$, which is collected in a dataset $\D = \set{(\Delta\theta_1, R_1), \ldots, (\Delta\theta_I, R_I)}$. This dataset can be used in the search for $\Delta \theta^*$. Please refer to our paper for more details.

Publications

Adrian Röfer, Iman Nematollahi, Tim Welschehold, Wolfram Burgard, and Abhinav Valada
Bayesian Optimization for Sample-Efficient Policy Improvement in Robotic Manipulation
Under review, 2024.

(PDF) (arXiv) (BibTeX)