Optimization algorithm for feedback and feedforward policies towards robot control robust to sensing failures

Taisuke Kobayashi¹, Keiko Yoshizawa¹
¹ Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan

Abstract

Background and problem statement: Model-free or learning-based control, in particular reinforcement learning (RL), is expected to be applied to complex robotic tasks. Traditional RL requires the policy to be optimized to be state-dependent; that is, the policy is a kind of feedback (FB) controller. Because such an FB controller relies on correct state observation, it is sensitive to sensing failures. To alleviate this drawback of FB controllers, feedback error learning integrates an FB controller with a feedforward (FF) controller. RL could likewise benefit from handling FB/FF policies together, but to the best of our knowledge, no methodology for learning them in a unified manner has been developed.

Contribution: In this paper, we propose a new optimization problem that optimizes the FB/FF policies simultaneously. Inspired by control as inference, the proposed optimization problem minimizes/maximizes the divergences between trajectories: one predicted by the composed policy and a stochastic dynamics model, and the others inferred as optimal/non-optimal trajectories. By approximating the stochastic dynamics model with a variational method, we naturally derive a regularization between the FB/FF policies. In numerical simulations and a robot experiment, we verified that the proposed method can stably optimize the composed policy even though its learning law differs from that of traditional RL. In addition, we demonstrated that the FF policy is robust to sensing failures and can maintain the optimal motion.
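To make the idea of a composed FB/FF policy concrete, the following is a minimal, illustrative sketch in PyTorch (not the authors' implementation): a state-dependent FB policy and a history-dependent FF policy are each modeled as a Gaussian, their mixture serves as the composed policy, and a KL term regularizes one policy toward the other. All network sizes, the mixing weight w, and the coefficient beta are assumptions for illustration only.

# Minimal sketch (illustrative assumptions, not the authors' implementation).
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class ComposedPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        # FB policy: maps the observed state to a Gaussian over actions.
        self.fb = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, 2 * action_dim))
        # FF policy: recurrent, driven only by its own hidden state and the
        # previous action (no state observation required).
        self.ff_cell = nn.GRUCell(action_dim, hidden)
        self.ff_head = nn.Linear(hidden, 2 * action_dim)

    def forward(self, state, ff_hidden, prev_action):
        mu_fb, log_std_fb = self.fb(state).chunk(2, dim=-1)
        ff_hidden = self.ff_cell(prev_action, ff_hidden)
        mu_ff, log_std_ff = self.ff_head(ff_hidden).chunk(2, dim=-1)
        pi_fb = Normal(mu_fb, log_std_fb.exp())
        pi_ff = Normal(mu_ff, log_std_ff.exp())
        return pi_fb, pi_ff, ff_hidden

def composed_loss(pi_fb, pi_ff, optimal_action, w=0.5, beta=1e-2):
    # Log-likelihood of an "optimal" action under the FB/FF mixture,
    # plus a KL regularizer that keeps the two policies consistent.
    log_mix = torch.logsumexp(torch.stack([
        torch.log(torch.tensor(w)) + pi_fb.log_prob(optimal_action).sum(-1),
        torch.log(torch.tensor(1.0 - w)) + pi_ff.log_prob(optimal_action).sum(-1)]), dim=0)
    reg = kl_divergence(pi_ff, pi_fb).sum(-1).mean()
    return -log_mix.mean() + beta * reg

# Usage: roll the composed policy forward one step (batch of 1), then update.
policy = ComposedPolicy(state_dim=4, action_dim=2)
state = torch.zeros(1, 4)
prev_action = torch.zeros(1, 2)
ff_hidden = torch.zeros(1, 64)
pi_fb, pi_ff, ff_hidden = policy(state, ff_hidden, prev_action)
loss = composed_loss(pi_fb, pi_ff, optimal_action=torch.zeros(1, 2))
loss.backward()

In this sketch only the FB branch depends on the observed state; the FF branch keeps producing actions from its recurrent state, which is why such a composition can remain useful when sensing fails.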

Keywords

