Abstract

Inverse Reinforcement Learning (IRL) has been studied for more than 15 years and is of fundamental importance in robotics. It allows learning a utility function "explaining" the behavior of an agent, and can be used for imitation or prediction of a given behavior by having solely access to demonstrated optimal solutions. In recent years, a lot of interest has been focused on using Deep Convolutional Neural Networks (CNNs) to encode the reward function [FLA16, WZWP16]. Using such powerful non-linear function approximators allows to learn from low level features directly, thus not requiring feature engineering, which can potentially lead to higher fidelity behaviors. The LEARning to searCH (LEARCH) framework [RSB09] has introduced functional gradients as a powerful technique underlying cost optimization; it has been shown to outperform subgradient methods under a linear reward assumption, and to efficiently optimize non-linear cost functions. In this work we extend LEARCH to train CNNs using functional manifold projections. Earlier work on functional gradient approaches [RSB09] built large but flat models that continually grow in size. Our technique maintains the convergence properties of functional gradient techniques (observed in linear spaces [MBVH09]) while generalizing to fixed parametric models (CNNs), by formally representing the function approximator as a non-linear submanifold of the space of all functions. We derive a simple step-project functional gradient update to walk across the manifold that is substantially more data efficient than the traditional update, consisting of a single back-propagation, commonly used in Deep-IRL. We present experimental results showing higher training rates on low-dimensional 2D synthetic data. These ideas have broad implications for structured prediction beyond IRL as well as deep learning in general.

We use a Convolutional Neural Network (CNN) as the function approximator to generate the cost function for a given scene. We have tested three specific architectures, all of which take an occupancy grid representation Occ(X) of a scene X as the input and generate the corresponding cost function c:

• Conv-Only: This network applies a sequence of six convolutions, followed by Batch Normalization (BN) [IS15] and PReLU [HZRS15] non-linearities. We keep the resolution constant throughout the network.

• Conv-Deconv: This network first applies a sequence of four convolutions to the input. Each convolution layer is followed by a max-pooling operation that reduces resolution by a factor of 2. At the bottleneck, we apply a 1x1 convolution followed by four deconvolution layers which interpolate the output back to the full resolution. Each layer except the final layer is followed by a Batch Normalization and PReLU non-linearity.

• Conv-Deconv-Linear: This network architecture is similar to the Conv-Deconv network, but we replace the Conv + max-pooling operations by strided convolutions. We only have 3 strided convolution and deconvolutional layers. Also, we replace the 1x1 convolution at the bottleneck with two fully connected layers that allow us to propagate information throughout the entire image. As before, all layers except the final layer are followed by Batch Normalization and PReLU non-linearities. We made use of this model in the results presented in Section 3.

We use the neural network package Torch [tor] to implement our models. We use the ADAM optimization algorithm [KB14] with default parameters, a step size of 1e-3 and a batch size of 32 for training all our networks.
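As a concrete illustration, the Conv-Deconv-Linear variant described above can be sketched as follows. The paper's implementation is in Lua Torch; this is only a PyTorch analog, and the channel counts, bottleneck width, and the 96x96 grid (chosen so three stride-2 stages divide evenly; the experiments use 100x100 grids) are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ConvDeconvLinear(nn.Module):
    """Sketch of the Conv-Deconv-Linear cost network: three strided
    convolutions, two fully connected layers at the bottleneck, and
    three deconvolutions back to full resolution. Every layer except
    the last is followed by BatchNorm and PReLU."""

    def __init__(self, grid=96, ch=16, hidden=256):
        super().__init__()
        def block(cin, cout, deconv=False):
            Conv = nn.ConvTranspose2d if deconv else nn.Conv2d
            return nn.Sequential(Conv(cin, cout, 4, stride=2, padding=1),
                                 nn.BatchNorm2d(cout), nn.PReLU())
        b = grid // 8  # spatial size after three stride-2 convolutions
        self.encoder = nn.Sequential(block(1, ch), block(ch, ch), block(ch, ch))
        self.bottleneck = nn.Sequential(  # FC layers spread information globally
            nn.Flatten(), nn.Linear(ch * b * b, hidden), nn.PReLU(),
            nn.Linear(hidden, ch * b * b), nn.PReLU(), nn.Unflatten(1, (ch, b, b)))
        self.decoder = nn.Sequential(block(ch, ch, deconv=True),
                                     block(ch, ch, deconv=True),
                                     nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1))

    def forward(self, occupancy):
        return self.decoder(self.bottleneck(self.encoder(occupancy)))

net = ConvDeconvLinear()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # ADAM, step size 1e-3
cost_map = net(torch.zeros(2, 1, 96, 96))  # batch of 2 occupancy grids
```

The stride-2 encoder and deconvolutional decoder mirror each other, so the output cost map has the same resolution as the input occupancy grid.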

Jim Mainprice, Arunkumar Byravan, Daniel Kappler, Dieter Fox, Stefan Schaal, Nathan Ratliff

5 Appendix: LEARCH algorithm

1. Initialize the data set to empty, D = ∅, and iterate the following across all examples i:

   (a) Solve the loss-augmented problem

       ξ*_i = argmin_{ξ ∈ Ξ_i} Σ_{x_t ∈ ξ} ( c(x_t) − l_i(x_t) ).   (2)

   (b) Add the functional gradient data from the loss-augmented problem

       D = D ∪ { (x*_t, c(x*_t) + η_t) | x*_t ∈ ξ*_i }.   (3)

       These points suggest where to increase the cost function by η_t.

   (c) Add the functional gradient data from the example

       D = D ∪ { (x_t, c(x_t) − η_t) | x_t ∈ ξ_i }.   (4)

       These points suggest where to decrease the cost function by η_t.

2. Solve the regression problem to improve the hypothesis:

   c_{t+1} = argmin_w Σ_{(x,y) ∈ D} (1/2) ( y − c(x; w) )² + (λ/2) ‖w − w_t‖².   (5)
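Steps 1(a)–(c) above can be sketched as a data-collection loop. In the sketch below a brute-force minimisation over a handful of candidate paths stands in for the planner, and the cell indices, paths, and `loss_aug` margin are made-up toy values:

```python
import numpy as np

def learch_iteration(cost, examples, candidate_paths, loss_aug, eta):
    """One LEARCH data-collection pass (steps 1a-1c) on a toy grid.
    `cost` maps cell index -> current cost c_t(x)."""
    D = []
    for example in examples:
        # (a) loss-augmented problem: argmin over paths of sum_x c(x) - l_i(x),
        #     where the margin l_i favours paths that stray from the example
        scores = [sum(cost[x] - loss_aug(x, example) for x in path)
                  for path in candidate_paths]
        xi_star = candidate_paths[int(np.argmin(scores))]
        # (b) raise the cost along the loss-augmented minimiser
        D += [(x, cost[x] + eta) for x in xi_star]
        # (c) lower the cost along the demonstrated example
        D += [(x, cost[x] - eta) for x in example]
    return D

cost = np.ones(6)                                   # six cells, uniform cost
examples = [[0, 1, 2]]                              # one demonstrated path
candidates = [[0, 1, 2], [3, 4, 5]]                 # toy planner search space
loss_aug = lambda x, example: 0.0 if x in example else 0.5
D = learch_iteration(cost, examples, candidates, loss_aug, eta=0.1)
# step 2 would now regress c_{t+1} onto the targets collected in D
```

With the margin above, the loss-augmented planner prefers the non-demonstrated path [3, 4, 5], so those cells receive raised targets while the demonstrated cells receive lowered ones.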

Max Planck Institute for Intelligent Systems – Autonomous Motion Department


Functional Manifold Projections in Deep-LEARCH


6 Proximal regularization

A quick derivation of a simple rule for including proximal regularization with the functional gradient projection implementation in IOC.

Let X_t ⊂ X be the set of cells x that we have deltas for at the current iteration (as a subset of the set of all cells X for this environment), and let η_x be the corresponding delta. Let λ be the regularizer. Then the projection term is

(1/2) Σ_{x ∈ X_t} ( c(x; w) − (c_t(x) + η_x) )²,   (6)

where η is the step size and c_t(x) is the previous cost at x. Likewise, the proximal regularization objective is

(λ/2) Σ_{x ∈ X} ( c(x; w) − c_t(x) )².   (7)

The sum of their gradients for a cell x in X_t is

( (c(x; w) − (c_t(x) + η_x)) + λ (c(x; w) − c_t(x)) ) ∇_w c(x; w)   (8)
= ( (1 + λ) c(x; w) − (1 + λ) c_t(x) − η_x ) ∇_w c(x; w)   (9)
= (1 + λ) ( c(x; w) − (c_t(x) + η_x / (1 + λ)) ) ∇_w c(x; w).   (10)

That gradient is equivalent to the gradient of the following objective:

((1 + λ)/2) ( c(x; w) − τ_x )²,   (11)

where the target is

τ_x = c_t(x) + η_x / (1 + λ).   (12)

Cells that aren't in X_t just use the proximal objective in Equation 7.
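The equivalence in this derivation — the summed gradient of the projection and proximal terms matching the gradient of the single-target objective with τ_x = c_t(x) + η_x/(1 + λ) — can be checked numerically. The feature vector and constants below are arbitrary, and a linear-in-features cost stands in for the CNN:

```python
import numpy as np

phi = np.array([0.3, -1.2, 0.7])   # made-up features of one cell x
w = np.array([0.5, 0.1, -0.4])     # current weights
c = lambda w: w @ phi              # linear stand-in for c(x; w)
c_t, eta_x, lam = 0.8, 0.25, 0.3   # previous cost, delta, regularizer

# gradient of the projection term plus the proximal term at this cell
g_sum = (c(w) - (c_t + eta_x)) * phi + lam * (c(w) - c_t) * phi

# gradient of the single-target objective with tau_x = c_t + eta_x / (1 + lam)
tau_x = c_t + eta_x / (1 + lam)
g_tau = (1 + lam) * (c(w) - tau_x) * phi

assert np.allclose(g_sum, g_tau)   # the two objectives share their gradient
```

Because the identity holds term by term in the algebra, the assertion holds for any choice of features, weights, and constants.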


Figure 1: The types of environments, from left to right: fully observable, lidar and object in motion (left). Validation loss on the "fully observable" data set with different inner loop training iteration steps of the CNN (right).

[Figure: example scenes showing, for each environment, the Occupancy Map, the Synthetic Cost, and the Learned Cost.]
1) Inverse Reinforcement Learning (IRL)

• Learn the underlying utility function of the optimal control problem from demonstrated behaviors.
• Features are usually hand coded properties of the environment, e.g. end-effector orientation, sum of accelerations, distance to a surface, ...
• Representing the utility function with a deep network allows learning from low level features: it removes the tedious design of feature functions and allows learning higher fidelity behaviors.

[Figure: optimal control loop — the learned Utility Function drives optimal planning and Control (u: control or action), acting through Actuation on the Environment; Sensing returns measurements (z) of the state (x).]

2) LEARning to searCH (LEARCH)

• Collection of algorithms for inverse reinforcement learning [Ratliff 09]:
  - Extends the Maximum Margin Planning (MMP) algorithm to non-linear utility functions
  - Solves an augmented optimal control problem
  - Uses exponentiated functional gradient updates

Writing ξ_i for the ith example trajectory and denoting a state along the trajectory by x_t ∈ ξ, the loss functional is

L[c] = Σ_{i=1}^N [ Σ_{x_t ∈ ξ_i} c(x_t) − min_{ξ ∈ Ξ_i} Σ_{x_t ∈ ξ} ( c(x_t) − l_i(x_t) ) ],   (1)

where c is the cost and l_i is the loss augmentation function defining the margin. For regularization, we often restrict the functions c to lie on the manifold of function approximators H, such as neural networks.

3) Deep-LEARCH Algorithm

• We devise a simple step-project update to walk across the manifold surface
• We define targets for the CNN training
• The projection step is performed by a Convolutional Neural Network (CNN)

[Flow: solve the augmented planning problem → evaluate the current loss → compute the functional gradient → manifold projection → solve the regression.]

The outline of the LEARCH algorithm that optimizes this loss functional based on its functional gradient is given in Section 5.

4) Functional Manifold Projections

Intuitively, the negative gradient portion of the loss L[c] dealing only with the outputs of the cost function y = c_t(x), i.e. −∂L/∂y, defines the quickest way to decrease the loss function. Thus the (negative) functional gradient data set is just the (negative) loss partials of the loss function:

δc_i = − ∂L(y)/∂y |_{y = c_t(x_i)}.

The functional gradient ∇_f L[c] defines a step off of the function manifold spanned by our hypothesis class H = {c = c(·; w) | w ∈ W} [RSB09]. At each Deep-LEARCH iteration, the functional gradient is projected back onto the manifold by computing the direction h* ∈ H that best correlates with the functional gradient:

h* = argmax_{h ∈ H} ⟨ h, ∇_f L[c] ⟩.

This results in training the CNN: the projection step is a supervised training step, in the inner loop, where we train the neural network based on the data we've collected. Other IRL algorithms can be recast using functional gradients; in Deep-LEARCH both the gradient step and the projection are handled through the neural network weight vectors.

The space of all squared integrable functions is big. Even if H is the space of all deep neural networks with 200 layers and 20 million weights (just a made up example), that space of function approximators can span a submanifold of at most 20 million dimensions, one for each weight. How do we know that? The set of all function approximators in that class is parameterized by its weight vector w. Changing the weight vector moves from point to point in the class of functions. The function approximators are differentiable w.r.t. w, so this movement between functions is smooth; therefore, it creates a smooth submanifold in the space of functions. The Jacobian of the function approximator w.r.t. w tells us, to first order, how the function changes when we change the weights.

7) Validation

• The validation loss is of the form [Ratliff 09]:

  L(µ, µ_i) = 1 − (1/|µ|) Σ_{s ∈ µ} exp( − min_{s_i ∈ µ_i} ‖s − s_i‖² / σ² ),

  a function of the Euclidean distance between states on the current path µ and the closest state on the example µ_i, with σ the corresponding scale parameter.
• The validation loss is computed on a hold-out test set of 100 environments.
• We study the evolution of the validation loss on the holdout set with an increased number of inner loop training iteration steps of the CNN.

8) Results

• We implemented Deep-LEARCH with three types of environments on 2D navigation tasks:
  - Fully observable occupancy map
  - Partially observable: only locally observable (lidar)
  - Partially observable: emulating a moving object
• 800 environments with 20 demonstrations each, on grids of 100x100 pixels.
• We make use of the Field-D* algorithm [Ferguson 06] to plan the navigation motions.
• Training the CNN with more stochastic gradient steps in the inner loop allows higher overall training rates, making functional gradient optimization with deep nets a promising research avenue.

7 Acknowledgment

This research was supported in part by National Science Foundation grants IIS-1205249, IIS-1017134, EECS-0926052, the Office of Naval Research, the Okawa Foundation, and the Max-Planck-Society. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.

References

[1] Nathan D. Ratliff, David Silver, and J. Andrew Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, June 2009.
[2] Arunkumar Byravan, Mathew Monfort, Brian Ziebart, Byron Boots, and Dieter Fox. Graph-based inverse optimal control for robot manipulation. In Proceedings of the International Joint Conference on Artificial Intelligence, 2015.
[3] Andreas Doerr, Nathan Ratliff, Jeannette Bohg, Marc Toussaint, and Stefan Schaal. Direct Loss Minimization Inverse Optimal Control. In Robotics: Science and Systems, July 2015.
[4] Jim Mainprice, Rafi Hayne, and Dmitry Berenson. Goal set inverse optimal control and iterative re-planning for predicting human reaching motions in shared workspaces. June 2016.
[5] Dave Ferguson and Anthony Stentz. Using interpolation to improve path planning: The Field D* algorithm. Journal of Field Robotics, 23(2):79–101, 2006.

(1) Autonomous Motion, Max-Planck Institute for Intelligent Systems, Tübingen, Germany  (2) University of Washington, Seattle, USA  (3) Lula Robotics Inc., Seattle, USA  (4) CLMC Lab, University of Southern California, Los Angeles, USA

projection step in the inner loop where we train the neural network based on the data we’ve collected. The outline of the LEARCH algorithm that optimizes this loss functional based on its functional gradient is outlined in Section 5. Intuitively, the negative gradient portion of the loss L[c] dealing only with the outputs of the cost function y = c (x), i.e. - L(y) defines the quickest way to decrease the loss function. Thus the

2

x 2⇡ p(y)

2

Abstract xt 2⇠i

We use a Convolutional Neural Network (CNN) as the function approximator to generate the cost a Convolutional Neural Network (CNN)architectures, as the function approximator tooccupancy generate the cost A quick derivation of a simple rule for including proximal regularization with the functionWe foruse a •given scene. We This have tested three specific all of which take an Conv-Only: network applies a sequence of six convolutions, followed by(b) Batch Add the functional gradient data from the loss-augmented problem function forOcc(X a given)scene. We have threeand specific architectures, all of which take an occupancy grid representation of a(BN) scene X astested the generate the corresponding costkeep function c: gradient projection implementation in IOC. Normalization [IS15] andinput PReLU [HZRS15] non-linearities. We the resolution functional grid representation Occ(X ) of a scene X as the input and generate the corresponding cost function c: ⇤ ⇤ ⇤ ⇤ constant throughout the network. D = D [ {(x , c(x ) + ⌘ ) | x 2 ⇠ (3) t t t t i }. • Conv-Only: This network applies a sequence of six convolutions, followed by Batch • Conv-Deconv: Thisnetwork network first[HZRS15] applies a sequence convolutions to the input. Each Conv-Only: This applies a sequence of of sixfour convolutions, followed by Batch Normalization (BN) [IS15] and PReLU non-linearities. We keep the resolution These points suggest where to increase the cost function by ⌘t . is followed by aPReLU max-pooling operation that reduces resolution a factor Normalization [IS15] and [HZRS15] non-linearities. We keep theby resolution constantconvolution throughoutlayer the(BN) network. of 2. At the bottleneck, apply a 1x1 convolution followed by four deconvolution(c) layers constant throughout thewe network. the functional gradient data from the example • Conv-Deconv: This network first applies atosequence of four convolutions toexcept the input. Eachlayer Add which interpolate the output back the full resolution. 
Each layer the final is • Conv-Deconv: This network first appliesoperation a sequence four convolutions the Eachset of cells x that we have deltas for at the current iteration (as a subset Let Xby Xinput. be the tto⇢ convolution layerby is afollowed by a max-pooling thatofreduces resolution a factor ⇤ ⇤ followed Batch Normalization and PReLU non-linearity. D = D [ {(x , c(x ⌘t ) | xt 2 ⇠i }. (4) convolution layer is followed by a max-pooling operation that reduces resolution by a factor t let t ) be of 2. At the bottleneck, we apply a 1x1 convolution followed by four deconvolution layers of the set of all cells for this environment X ), and that corresponding delta. Let x • Conv-Deconv-Linear: This network architecture is similar to the Conv-Deconv network, of 2. At the bottleneck, we apply a 1x1 convolution followed by four deconvolution layers which interpolate the output back to the full resolution. Each layer except the final layer is be the regularizer. Then the projection term is These points suggest where to decrease the cost function by ⌘t . but we replace the Conv + max-pooling operations by strided convolutions. We only have which interpolate the output back to the full resolution. Each layer except the final layer is followed by a Batch Normalization and PReLU non-linearity. 3followed strides convolution and deconvolutional layers. Also, we replace the 1x1 convolution by a Batch Normalization and PReLU non-linearity. ⇣ to improve the hypothesis: ⌘2 X 2. Solve the regression problem • Conv-Deconv-Linear: This network architecture is similar to the Conv-Deconv network, 1 at the (1) bottleneck with two fully connected layers that allowtousthe to Conv-Deconv propagate information (2)is (1,3) (2)c(x; w) (1,4), (3) (1) • Conv-Deconv-Linear: This network architecture similar network, c (x) + ⌘ t x but we throughout replace the Conv + max-pooling operations by strided convolutions. Weare only have by a the the entire image. 
As before, all layers except the final layer followed X 2 but we replace Conv + max-pooling operations by strided convolutions. We only have 1 2 1 2 2 3 strides convolution and deconvolutional layers. Also,We wemade replace the 1x1 model convolution x2X Batch Normalization and PReLU non-linearities. use of this in the results t ct+1 = argmin y c(x; w) + kw wt k + w. (5) 3 strides convolution and deconvolutional layers. Also, we replace the 1x1 convolution at the bottleneck with two fully connected layers that allow us to propagate information 2 2 2 w presented in Section 3. at the withAs twobefore, fully connected layers the thatfinal allow us to propagate information throughout thebottleneck entire image. all layers except layer are followed by a the step size and ct (x) (x,y)2D where ⌘ iss is the previous cost at x. Likewise, the proximal throughout the entire image. As before, all layers except the final layer are followed by a Batch and PReLU non-linearities. made use our of this model We in the We use Normalization the neural network package Torch [tor] toWe implement models. useresults the ADAM regularization objective is Batch Normalization and PReLU non-linearities. We made use of this model in the results presented in Section 3. optimization algorithm [KB14] with default parameters, a step size of 1e-3 and a batch size of 32 for 6 Proximal presented in Section 3. •training We devise a simple step-project update to walk across the • We define targets X for the ⇣ CNN training ⌘2 all our networks. • 3 We use manifold the neural network package Torch [tor] to implement our models. We use the ADAM surface c(x; w) ct (x) , (2) We use the neural network package Torch [tor] to implement our models. 
We use the ADAM optimization algorithm [KB14] with default parameters, a step size of 1e-3 and a batch size of 32 for 2 ⇣ ⌘2 ⇣ ⌘2 X X optimization algorithm [KB14] with default parameters, a step size of 1e-3 and a batch size of 32 for 1 • The Compute projection Loss step is performed by a Convolutional Neural x2X Appendix: training5all our networks. LEARCH algorithm c(x; w) c (x) + ⌘ + c(x; w) c (x) (6) t x t training all our networks. Augmented Solutions Network (CNN)

n (x µ)2 o 1 p(x | y) = p(X =p(x, x | Yy) = y) p(x, y)p(X p(x, y) p exp = p(x) = = p(x)p(y) p(x | y) = x | Y = y) 2 p(x | y) = p(x |2⇡y) = 2 p(y) p(y) p(x, y) 1 Derivation p(y | x)p(x) p(x | y) = y) p(x | y) = p(X =p(x, x|Y = y) p(x, y) p(y) p(x | y) = p(x, y) = p(x)p(y) p(x, y)== p(x)p(y) p(y) p(y) Z p(x, y) p(x, y) = p(x)p(y) p(x | y) p(y = | x)p(x) p(y | x)p(x) p(x)dx =y)y) 1= = p(x)p(y) p(x, p(y) p(x, p(x, y) = p(y) p(y) p(y | x)p(x) Jim Mainprice , Arunkumar Byravan , Daniel Kappler , Dieter Fox , Stefan Schaal , Nathan Ratliff p(x, y) = Z y) = p(x)p(y) p(x, Z | x)p(x) p(x | y) =p(y) p(x | y)p(y) = p(y p(y | x)p(x) = 1 p(x, y)1p(x)dx = Z p(x)dx = 1) Inverse Reinforcement Learning (IRL) 3) Deep-LEARCH Algorithm 5) Proximal regularization 8) Results p(y) p(y | x)p(x) p(x)dx = 1 = | x)p(x) Z y)p(y p(y | x)p(x) p(x, R | y)p(y)p(y) p(x | y) = =p(x p(x | y) = = p(y | x)p(x) 0 0 0 types of environments: ➢ Learn the underlying utility function of the optimal p(y) p(y |x )p(x )dx p(x)dx = 1 p(x | y) = p(x | y)p(y) = | x)p(x) Z y) = p(x control | y)p(y)problem = p(y |from x)p(x) demonstrated behaviors p(x)dx = 1 | x)p(x) p(y | x)p(x) xp(y 2 2 t R x2X x2X The sum of their gradients for a cell x in X is p(x|p(x | y)| = = = p(y | 0x)p(x) t p(y x)p(x) p(y | x)p(x) t 1. Initialize the data set to empty D = ;, and iterate the following across all examples i: y) = p(x | y)p(y) 0 )dx0 5 Appendix: LEARCH algorithm p(y| |y) x)p(x) p(y | x)p(x) R p(y) p(y |x )p(x p(x = = h⇣ ⌘ ⇣ ⌘i R 5 Appendix: LEARCH algorithm 0 0 0 = = p(y) z (a) Solve the loss-augmented problem t = p(x p(y |x )p(x 0 0 )dx p(x | y) |0y)p(y) = )dx p(y | x)p(x) p(y) p(y |x )p(x c(x; w) of cells ct (x) +of ⌘we +deltas c(x; w) ccurrent rw c(x; (as w) a subset of the(3) x have t (x) Let X ⇢ X be the set x that for at the iteration set X Actuation t • The gradient sum the general and proximal term: loss 1. 
InitializeSolve the data set to empty D = ;,⇤and iterate the following examples i: their gradients for a cell x in Xt is The sum of i acrossi all ) li (xt )across . h⇣ ⌘i 1. Initialize the data set to empty⇠iD==argmin ;, and iterate c(x the tfollowing allofexamples i:for(2) all cells this environment X ), and let be that corresponding delta. Let be the regularizer. x u augmented planning x p(y | x)p(x) p(y | x)p(x) t Optimal t nal manifold projections in Deep-LEARCH ⇠2⌅ h⇣ ⌘ ⇣ ⌘i w) (a) Solve the loss-augmented problem x 2⇠ = (1 + )c(x; w) (1 + )c (x) + ⌘ r c(x; (4) R Then the projection term is | y) =xt p(y | x)p(x) = p(y | x)p(x) t x w Control xp(x (a) Solve the loss-augmented problem Environment t X 0 )p(x0 )dx0 c(x; w) ⇣ ct (x) + ⌘ x + c(x; w) ct (x) rw c(x; w) (9) p(y) p(y |x ⇤ i i ⌘ X R z p(xSensing | y) = ⇠i gradient = argmin li (x (2) t= ⌘ (b) Add the functional thet )loss-augmented ⇤ data fromc(x i t) . iproblem 0 0 0 h⇣ ⌘iw). ⇠ = argmin c(x ) l (x ) . (2) i ⇠2⌅ p(y) p(y |x )p(x )dx i t t = (1 + ) c(x; w) c (x) + r c(x; (5) t x w x 2⇠ zt ⇠2⌅ ⇤ ⇤ ⇤ ⇤ ⇣ ⌘ Compute Functional Gradient 1 )c +t (x) + ⌘ x 2 rw c(x; w) X xc(x 2⇠t ) + ⌘t ) | xt 2 ⇠i }. D = D [ {(x , (3) = (1 + )c(x; w) (1 + (10) 1 t u : Control or Action t c(x; w) the ct (x) + ⌘ x gradient , (7) xt • Resulting ⇣ target for projecting functional (b) Add the functional gradient data from the loss-augmented problem ⌘ 2 gradient of the following gradient is equivalent to the term points suggestgradient where to increase theloss-augmented cost function byproblem ⌘That (b) These Add the functional data from the t. : State 1 2 1,3 2 ⌘ x x2X t t rice Arunkumar Byravan Daniel Kappler Dieter Fox update ⇤ ⇤ ⇤ ⇤

Max Planck Institute for Intelligent Systems – Autonomous Motion Department

i t

i

i t

i

i t

ii

eti

i

cC ost

Oc cu

i

pan cy Ma p

Functional Manifold Projections in Deep-LEARCH

(11)

ed

Co st

Sy

Le arn

pan cy Ma p

6) CNN Model

Oc cu

2 for imitation or prediction of a given behavior by having solely access to demonstrated

nth

= (1 + ) c(x;⇣w) ct (x) + r⌘w c(x; w). D = D [ {(x , c(x ) + ⌘ ) | x 2 ⇠ }. (3) x (c) Add the functional gradient data from the example t t t t i ⇤ ⇤ ⇤ ⇤ Utility Function zt : Measurement or Observation D = D [ {(xt , c(xt ) + ⌘t ) | xt 2 ⇠i }. (3) 2 1+ ⌘ 1 + ⇤ ⇤ c(x; w) ct (x) cost + at x. Likewise, (6) These points suggest where to D increase function = D [the {(xcost , c(x ) ⌘tby ) | ⌘xtt. 2 ⇠i }. (4)step size and c (x) x t t where ⌘ iss the is the previous the proximal regularization t These points suggest where to increase the cost function byThat ⌘t . gradient is equivalent to the 2 gradient 1 + term of the following (c) Add the functional gradient data from the example ⇣ ⌘ objective is 1,4 3 2 These points suggest where to decrease the cost function by ⌘ . (c) Add the functional gradient data from the example t 1 + Stefan Schaal Nathan Ratliff ⇣ ⌘ ⇤ ⇤ 2 = c(x; w) t , (7) 1 + ⌘ D = D [ {(x , c(x ) ⌘ ) | x 2 ⇠ }. (4) x t t i t t ⇤ ⇤ 2. Solve the regression problemDto=improve the hypothesis: 2 c(x; w) D [ {(xt , c(xt ) ⌘t ) | xt 2 ⇠i }. (4) ct (x) + x ⇤ 2⌘ 1+ ⇣ ⌘2 X the cost function These points suggest where to1 decrease X 2 by1⌘t . 2 2 Thesecpoints suggest where to decrease thew) cost + function ⌘tt.k + the 2 ⌘c2t (x) , where tx = ct (x) + 1+1 +x . ⇣ y c(x; kw by w w.target (5) t+1 = argmin ion c(x; w) (8) 2. Solve the regression problemwto improve the hypothesis: 2 2 2 2 the (x,y)2D 2. Solve the regression problem to improve the hypothesis: = c(x;objective w) tx in,Equation 2. Cells that aren’t in X will just use t x2X Solve Regression 2 1 X 2 1 2 2 X ement Learning (IRL) has been studied for more that 15 years and is of fundamental 1 y c(x; w) + kw ct+1 = argmin w. 2 (5) • Our network first applies a sequence of three strided 2 1wt k + ⌘ ct+1w = argmin y c(x; w)2 + kw w2t k2 + w. (5) 2 where the target t = c (x) + . botics. It allows learning a utility function “explaining” the behavior of an agent, and (x,y)2D 2 2 2 x t xinput. 
3 w 1+ convolutions to the 2 (x,y)2D 1 50

3

(12) (13)

Robot (2009) 27: 25–53 Figure 1: The typesAuton of environments, from left to right: fully obser

ap

ap

cup anc yM

anc yM

Co st

etic

cup

nth

ed

Co st

Oc

Sy

st

arn

Le

arn ed

Co

st

Sy

nth e

tic

Co

Le

Le ar

ned

Co st

Oc

Sy

nth e

tic Co st

• of The convergence of LEARCH with different training steps the just bottleneck, we applyintwo fully connected layers that Cells that aren’t in -XAt use the objective Equation 7. Table 1 Results experiments comparing learned to engineered prior Validation loss on the “fully observable" data set with different inner l t will optimal solutions. When the reward function is assumed to be a linear combination 6.2 Imitation learning for overhead data 2 3 of the CNN maps. Indicated costs are from Crusher’s onboard perception system allow us to propagate information throughout the entire (right). has strong convergence properties [AN04, KPRS13, BMZ+ 15, MHB15, MHB16]. 3 Images [Doerr 15]Neural image followed by three deconvolution layers which recent years, amotions, lotlearned ofsliding interest haspoint been focused on using Deep Convolutional ofFig. freedom. 4: Motion Pointing policies to favor fromspecific one degrees on the of table freedom. to another Pointing aremotions, visualized sliding as executed from one on point the actual on therobot. table to another are visualized as executed on the actual robot. Experiment Total net Avg. Total cost Max cost In this context, learning a costmap via LEARCH provides 7 Acknowledgment n On the the left in image the middle shoulder forearm motions motions are favored are. Inwhilst the right on the image, image a policy in the is middle learned forearm whichmotions maintains are.a In horizontal the righthand image, a policy is learned which maintains a horizontal hand interpolate the output back to the full resolution s)alignment. toimage encode the reward function [FLA16, WZWP16]. Using such powerful non2 distance (km) speed (m/s) incurred incurred numerous advantages. 
Only one set of human input is nec• Features are usually hand coded properties 4) Functional Manifold Projections approximators allows to learn from low level2features directly, thus not requiring - Eachessary layerper except ispaths followed bybeaused Batch test site;the the final same layer example can then projection step in the inner loop where we train the neural netwo End-effector orientation This research was supported in part by National Science Foundation grants IIS-1205249, IIS-1017134, ge, which can potentially lead to learn higher fidelity behaviors. The LEARning to Normalization and PReLU non-linearity Experiment 1 6.63 2.59 11108 23.6 to learn a costmap from any subset of the available features. The outline of the LEARCH algorithm that optimizes this loss • Functional gradients updates can be used to update EECS-0926052, the Office of Naval Research, the Okawa Foundation, and the Max-Planck-Society. ork [RSB09] has introduced functional gradients as a powerful technique underlying - Sum of acceleration Learned - TorchThe to overhead implement our point model. We use theintuitive ADAMvanvantage provides a very generic functionals gradient is outlined in Section 5. timization. This approach has shown to outperform subgradient methods when Any opinions, findings, and conclusions or recommendations expressed in this material are those of Experiment 1 6.49 2.38 14385 264.5 - Distance to a surface optimization with default parameters, a step size tage pointalgorithm for demonstrating behavior. A human operator r a linear reward assumption, and was shown to efficiently optimize non-linear cost the author(s) and do not necessarily reflect the views of the funding organizations. of 1e-3 and a batch size of 32 for training all our EngineeredIntuitively, the negative gradient portion of the loss L[c] dealin Compute can simply ‘draw’ an example path from start to goalnetwork on -… Functional Gradient Experiment 2 6.01 100.2 @ top (left). 
of a visualization of the available data (e.g. an aerial or Figure 1: The types of environments, from left to right: fully observable, lidar and object in motion function y = 2.32 ct (x), i.e.17942 - @y L(y) defines the quickest way to d • Representing the utility function with a deep network allows extend LEARCH to train CNNs using functional manifold projections, which we Validation loss on the “fully observable" data set with different inner loop training iteration steps of the Learned CNN image). In this way, it is also much simpler to prosatellite learning from low level features (negative) functional gradient data set is just the (negative) loss ARCH. Earlier work on functional gradient approaches [RSB09] built large but flat (right). Experiment 2 5.81 2.23 21220 517.9 vide examples Figure 1: The types of environments, from left to right: fully observable, lidar and object in motion (left). that imply different cost metrics (Fig. 16), as Removes theintedious design of feature functions and advantage of Validation loss on the “fully observable" data set with different inner loop training iteration steps of the CNN hat •continually grow size. Our technique maintains the convergence Engineered opposed to engineering multiple functions by hand. @ (right). allows learning higher fidelity behaviors 7) Validation ent techniques (observed in linear spaces [MBVH09]) while generalizing to fixed Since global planning on UPI is achieved using Field D*, Experiment 2 6.19 1.65 26693 224.9 ci = L(y) c (x ) . Evaluate current loss at t i step in the inner loop where we train the neural network based on the data we’ve collected. metric models (CNNs) by formally representing the function approximator as a non- projection @y many of the details discussed in Sect. 4.4 came into play the demonstrations No Prior thatloop optimizes functional its functional old of the space of all functions. 
We derive a simple step-project functional gradient The outline of the LEARCH projectionalgorithm step in the inner where wethis trainloss the neural network based based on the data we’ve collected. Deep-LEARCH with 3 types of •on We implemented when applying LEARCH. Specifically, Field D* is an inThe outline of the LEARCH algorithm that optimizes this loss functional based on its functional gradient is outlined in Section 5. The functional gradient r L[c] defines a step off of the function to walk across the manifold thatto is substantially more data efficient than traditional f 2) LEARning searCH (LEARCH) environments on 2D navigation tasks terpolated path planning algorithm, and so visitation counts gradient is outlined in Section 5. onsisting of a single back-propagation commonly used in Deep-IRL. We present Intuitively, the negative gradient portion of the loss L[c] dealing only with the outputs of the cost class Hto= {c =its c(·; w)|w to2theW} [RSB09]. At each Deepoverhead data, and compare performance handFig. 7: Effects of domain size and problem dimensionality on Fig.the 7: optimization Effects of domain speed size and and problem dimensionality on the optimization speed and must be computed with respect to actual distance traveled. Fully observable occupancy map Intuitively, the negative gradient portion of the loss L[c] dealing only with the outputs of the cost onFig. training 6: Cross-validation sets for two motion of the fitness of policies learned on training sets for two motion rimental results showing higher-training rates on low-dimensional 2D synthetic data. @ quality. The average number of necessary function evaluations quality. and The the resulting average number fitness is of necessary function evaluations and the resulting fitness is defines the • Collection of algorithms for inverse reinforcement gradient is projected back onto the manifold by computing the d @ quickest way to decrease the loss function. Thus function y = c (x), i.e. 
L(y) the tuned approach. One experiment of note compared the gent Additionally, a configuration space expansion is performed t types: 1) andpointing low (setto2)a obstacle cylindrical obstacle with high (set 1) and low (set 2) obstacle function y = c (x), i.e. L(y) defines the quickest way to decrease the loss function. Thus the @y t evaluated on four motion problems. training Three different initial evaluated configurations onas four (step motion size and problems. Three different initial configurations (step size and @y - Partially observable: only locally observable ideas have broad implications for structured beyond IRL well as deep g the avoidance. imitation For loss both andmotion another types,initial one policy minimizing the imitation loss and another learning parametrization) are shown for each task. initial parametrization) are shown for(negative) each task. functional gradient set isgradient just thedata (negative) loss(negative) partialsloss of partials the loss the functional gradient. This results in training the CNN, so eralizationwith of LEARCH with that of a hand-tuned costmap. (negative)data functional set is just the of function the loss function by the planner, averaging all costs within a set radius of the ssin minimizing have been learned, the combined summing imitation and elbow height loss have been learned, summing general. - Partially observable: emulating a moving object • Training the CNNimagery with more stochastic gradient steps - Extends theD. Maximum Margin Planning (MMP) up to 4 different policies (A,B,C,D). A cost map was trained off satellite for an approxicurrent cell. Furthermore, in an attempt to reduce noise in @ Effects of Optimizer Settings D. Effects of Optimizer Settings @ ⇤ c = L(y) . 2 i h = argmax < h, r L[c] > allows higher over all training rates, making functional c (x ) c = L(y) . algorithm to non-linear utility functions f mately 60 km size area at 60 cm resolution. 
An engineered the drawing of examples, corridor constraints were used as i @y ct (xi ) ndom policypolicy clearlybutoutperforms also the effects initial random policy but also @y The of the domain size on the optimizer’s The effects perforof the domain size on the optimizer’s perforh2H optimization with deep-nets a promising research avenue al manifold projection costmap had been previously produced for this same area • 800 environments with each 20 demonstrations described in Sect. 4.7, along with a normalized gradient as - Solves augmented optimal control problem petheofpolicy motion. which Thisthe learned is loss a different type of motion. This is The functional gradient r L[c] defines a step off of the function manifold spanned by our hypothesis f mance (e.g. in terms of convergence and mance best (e.g. fitness) in terms have of convergence and best fitness) have Figure 1: The types of environments, from left to right: fully observable, lidar and object in motion (left). The functional gradientclass rf L[c] a step function spanned by our hypothesis to support UPI. A subset of both maps is shown in Fig. 17. H =defines {c = c(·; w)|woff 2 of W}the [RSB09]. Atmanifold each Deep-LEARCH iteration, the functional described in Sect. 4.6. Experimental validation of these aphetrue pure for-imitation both the loss policy learned using the pure imitation loss •⇤ Grids of 100x100 pixels Exponentiated of the loss functional beengradient evaluated descent on the experiments as presented been evaluated above. on Thethe experiments as presented gradient above. isThe 2 back onto the by computing the direction h 2the H that best correlates W}Functional [RSB09]. 
Atmanifold each Deep-LEARCH iteration, functional he ith example denoting a statethealong the trajectory by xt 2 ⇠, the class H = {c = c(·; w)|w 2 projected The two maps were compared using a validation set of paths Validation loss on the “fully observable" data set with different inner loop training iteration steps of the CNN The space of all squared integrable L is big. Even if H is the sp proaches is presented in Fig. 15. The loss function used was edfunction using the (Acombined andtrajectory C) and range the and policy learned using combined of the domain has been varied for all range those of experiments the domain has been varied for all those experiments • We make use of the Field-D* algorithm to plan the navigation ⇤ with onto the functional gradient.by This results in training the CNN,hsolving a that problem of the form Functional gradient updates gradient is projected back the manifold computing the direction 2 H best correlates 09] loss functional is generated200 by alayers UPI team member not directly involved ina made up example), that Manifold Projection combined loss function from (B and a function of Euclidean distance between states on the curReferences and 20 million weights (just (right). [0,D). 1] and [0, 5] up to [0, 10]. The from initial[0,guess 1] and ✓0 [0, has 5] up to [0, 10]. The initial guess ✓0 has motions [Ferguson 06] h⇤ the = argmax < h, rf L[c] >. with the functional gradient. This results in training CNN, solving a problem of the form the development ofsubmanifold overhead costing. Themost average validarent path µ and the closest state on the example µi , with a arios: b) Concerning Generalization the to Similar Scenarios: Concerning the N been chosen randomly within the domain been and chosen the initial randomly step within the domain and the initial step h2H span a of at 20 million dimensions, one for e h i X X X • The validation loss σis2 of the form: Nathanwith D Ratliff, David Silver, and J Andrew Bagnell. 
Learning to search: Functional gradient ⇤ i been i i tion loss was[1]0.675 the engineered map, and 0.551 with scale parameter efitness corresponding of the learned test policies on the corresponding test size has set to = (max min)/5. size has been set to = (max min)/5. h = argmax < h, r L[c] > . L[c] = c(xt ) min li (xt ) , [Ratliff 09] (1) 0c(xt ) 0 f The techniques set of all approximators in27(1):25–53, that class parameterize for function imitation learning. Autonomous Robots, Juneis2009. 2 t

i

⇠2⌅ type of motion clearly The space of all squaredh2H integrable L is big. Even if H is the space of all deep neural networks with the LEARCH map. ype problems, of motion thei=1 clearly policyi learned for this The results of evaluating the performance The for results those experiof evaluating the performance for those experii [2] Arunkumar Byravan, Mathew Monfort, Brian Ziebart, Byron Boots, and Dieter Fox. Graph" # xt 2⇠i xt 2⇠i ! the weight vector moves from point to point in the ofclass of functi 200 layers and 20 million weights (just a made up example), that space of function approximators can 1 2 2 ndperforms the policies better learned than the ments randomare policy and the policies learned Additional validation was also achieved during offibased inverse optimal control for robot manipulation. In Proceedings the International shown in Figure 7. The optimalments policy areis shown computed in Figure 7. The optimal policy is computed L(µ, µi that? )= 1 − exp min [∥s − si ∥ ]/σ . [Ratliff 09] span a submanifold of at most 20 million dimensions, one for each weight. How do we know 2 The projection step can be viewed as a supervised training Joint Conference ontests Artificial Intelligence, 2015. differentiable w.r.t. w,consist so this movement si ∈µi cially the other case motion for the type. This istheespecially the for the space of all squared integrable L is big. Even if H is the space of all deep neural t for of all and lbased defining thecase margin. For regularization, wethe often• The •thetrajectories Where: cial UPI field tests. UPI field of Crusher au- between functions is sm onloss three different domain ranges. based Foron each thetask, three different domain ranges. For each task, the i some The set of all function approximators in that class is parameterized by its weightnetworks vector w. with Changing |µ| s∈µ [3] Andreas Doerr, Nathan Ratliff, Jeannette Bohg, Marc Toussaint, Stefan Schaall. 
Direct Loss step that finds the loss function that best correlates with the hof perform policies learned significantly for motion type 2 which perform significantly 200 layers and 20 million weights (just a made up example), that space of function approximators can submanifold in the space of functions. The Jacobian of the functi the weight vector moves from point to point in the class of functions. The function approximators are functions c to liefunction onbars therepresent manifoldfrom of function approximators H, as neural tonomously navigating a series of courses, with each course left to right the results barsofrepresent the such experiment from left to right the results of the experiment Minimization Inverse Optimal Control, Robotics Science and Systems, July 2015 - c : cost w, sodimensions, this movementone between is smooth. Therefore, createsLEARCH, a smooth as listed in Algorithm 5, was implemented functional gradient obetter theparameterization, performance on their ownoftestand set we compared toincreasing the performance ofranges. span a submanifold of differentiable atresults most are 20 w.r.t. million for functions each weight. How do we itknow that? Jim Rafi Hayne, Dmitry Berenson. Goal set inverse optimal control and iterative cular mayfor even restrict that class further by out putting bounds carried out domain carried The results for increasing are domain ranges. The defined asthe a[4] setfirst of Mainprice, widely spaced waypoints. Courses ranged order, how the function changes when we change the w submanifold in the space of functions. The Jacobian of the function approximator w.r.t. w tells us, to • The validation loss is computed on a hold-out test set of 100 l : loss augmentation function re-planning for predicting human reaching motions in shared workspaces, June, 2016 set of all algorithms function in that class is a parameterized by its weight vector w. Changing set the(cf. 
motion policies type B weight 1and policies on thisover setleast (cf. policies B andof these with 4 varieties of regressors: linear, neural networks, reneural network vectors. Intest Deep-LEARCH both are by the•4 The averages at 4 independent runs. averages Thehandled over error atbars least independent runs. Theapproximators error bars Other IRL can be recast using functional in length up to 20 km, withand waypoint spacing on interpolation the order to improve path planning: The field the first order, how the function changes when we change the weights. [5] Dave Ferguson Anthony Stentz. Using environments the weight vector moves from point are trees, and linear with a feature learning phase uson D their clearly own outperform training policies and1C on their own trainingNeither displayAthe standard deviation. display thethe number 1 standard of deviation. Neither the number of to point in the class of functions. The function approximators gression gradient rule update of 200 to 1000 d* m.algorithm. These tests took place at numerous loca-2006. Journal of Field Robotics, 23(2):79–101, Motion Planck Institute for Intelligent Systems ; AMD ; IS-MPI ;fitness Tübingen, differentiable w.r.t. w, so this movement between functions is smooth. Therefore, it creates aing smooth set 2). Department, Maxfunction evaluations (shown in green) nor function the resulting evaluations (shown in green) nor the resulting fitness regression trees (see Sect. 4.5). Overall, the latter ap3 theResults tions across continental U.S., each with highly varying 3 4 3 Results sity of Washington, Seattle, WA, USA, Lula Robotics Inc., Seattle, WA, USA, University of submanifold in the space of functions. The Jacobian of the function approximator w.r.t. 
w tells us, to @ superior all around perfortives: c) The Optimizing policies Additional (shown in blue) Objectives: is significantly The policies influenced (shown by the initial in blue) domain is significantly influenced by the initial domain proach was found to provide local terrain characteristics, and sizes ranging from tens to a ; USC; Los Angeles, USA t the first order, how the function changes when we change the weights. (1) Autonomous Motion, Max-Planck Institute for Intelligent Systems, Tübingen Germany (2) University of Washington, Seattle, USA (4) CLMC Lab, University of Southern California, Los Angeles, USA, (3) Lula Robotics Inc., Seattle, USA ctions learned canusing be directly the augmented size. CMA loss turns functions out tocan be robust be directly against slightly size. CMA inappropriate turns out to be robust against slightly inappropriate @y In order to validate our approach we have implemented Deep-LEARCH with three types of environmance. Specifically, the computational advantage of logam.is.tuebingen.mpg.de hundreds In of square kilometers. order to validate our approach we have implemented Deep-LE We then study the evolution of the validation loss in [RSB09] on a holdout set with increased learned compared using to the the pure ones which initialhave configurations. been learned The using automatic the pure adaptation initial configurations. of exploration The automatic adaptation of ments. exploration

projection step in the inner loop where we train the neural network based on the data we’ve collected. The outline of the LEARCH algorithm that optimizes this loss functional based on its functional gradient is outlined in Section 5. Intuitively, the negative gradient portion of the loss L[c] dealing only with the outputs of the cost function y = c (x), i.e. - L(y) defines the quickest way to decrease the loss function. Thus the