Dual REPS: a generalization of relative entropy policy search exploiting bad experiences

Colomé Figueras, Adrià; Torras, Carme

doi:10.1109/TRO.2017.2679202


Document type: Article

Publication date: 2017-08-01

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Access conditions: Open access

Attribution-NonCommercial-NoDerivs 3.0 Spain

Except where otherwise noted, the content of this work is licensed under a Creative Commons license: Attribution-NonCommercial-NoDerivs 3.0 Spain

Abstract

Policy search (PS) algorithms are widely used for their simplicity and effectiveness in finding solutions for robotic problems. However, most current PS algorithms derive policies by statistically fitting the data from the best experiments only. This means that experiments yielding a poor performance are usually discarded or given too little influence on the policy update. In this paper, we propose a generalization of the relative entropy policy search (REPS) algorithm that takes bad experiences into consideration when computing a policy. The proposed approach, named dual REPS (DREPS) following the philosophical interpretation of the duality between good and bad, finds clusters of experimental data yielding a poor behavior and adds them to the optimization problem as a repulsive constraint. Thus, considering that there is a duality between good and bad data samples, both are taken into account in the stochastic search for a policy. Additionally, a cluster with the best samples may be included as an attractor to enforce faster convergence to a single optimal solution in multimodal problems. We first tested our proposed approach in a simulated reinforcement learning setting and found that DREPS considerably speeds up the learning process, especially during the early optimization steps and in cases where other approaches get trapped in between several alternative maxima. Further experiments in which a real robot had to learn a task with a multimodal reward function confirm the advantages of our proposed approach with respect to REPS.
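For context, the sample weighting used by baseline episodic REPS, which the abstract says DREPS generalizes, can be sketched as follows. This is a minimal illustration of the commonly used REPS dual, not the authors' implementation, and it does not include the repulsive or attractor cluster constraints that DREPS adds. The function names, the KL bound `epsilon`, the search bounds, and the sample returns are all assumptions made for the example.

```python
import numpy as np

def reps_dual(eta, returns, epsilon):
    """Episodic REPS dual: g(eta) = eta*epsilon + eta*log(mean(exp(R/eta)))."""
    r_max = np.max(returns)
    # Subtract the maximum return before exponentiating for numerical stability.
    return eta * epsilon + r_max + eta * np.log(np.mean(np.exp((returns - r_max) / eta)))

def reps_weights(returns, epsilon=0.5, lo=1e-6, hi=1e6, iters=200):
    """Minimize the dual over eta > 0 and return normalized sample weights."""
    returns = np.asarray(returns, dtype=float)
    # The dual is convex in eta, so a ternary search over log(eta) suffices here.
    a, b = np.log(lo), np.log(hi)
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if reps_dual(np.exp(m1), returns, epsilon) < reps_dual(np.exp(m2), returns, epsilon):
            b = m2
        else:
            a = m1
    eta = float(np.exp(0.5 * (a + b)))
    # Samples with higher return receive exponentially larger weight.
    w = np.exp((returns - returns.max()) / eta)
    return w / w.sum(), eta

# Hypothetical returns from four rollouts (illustrative values only).
returns = np.array([0.0, 1.0, 2.0, 3.0])
weights, eta = reps_weights(returns, epsilon=0.5)
print(weights, eta)
```

In this baseline, low-return samples simply receive weights close to zero, which is the limitation the abstract describes: poor experiments have little influence on the policy update.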

Description

© 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Citation: Colomé, A., Torras, C. Dual REPS: a generalization of relative entropy policy search exploiting bad experiences. "IEEE Transactions on Robotics", 1 August 2017, vol. 33, no. 4, pp. 978-985.

URI: http://hdl.handle.net/2117/110925

DOI: 10.1109/TRO.2017.2679202

ISSN: 1552-3098

Publisher version: http://ieeexplore.ieee.org/document/7889017/
