Lifelike Agility and Play in Quadrupedal Robots using Reinforcement Learning and Generative Pre-trained Models

Nature Machine Intelligence, Vol. 6, No. 7, 2024.
(Cover Article)

Lei Han*, Qingxu Zhu*, Jiapeng Sheng*, Chong Zhang*, Tingguang Li*, Yizheng Zhang*, He Zhang*, Yuzhen Liu, Cheng Zhou, Rui Zhao, Jie Li, Yufeng Zhang, Rui Wang, Wanchao Chi, Xiong Li, Yonghui Zhu, Lingzhu Xiang, Xiao Teng, Zhengyou Zhang
*Equal Contribution
Tencent Robotics X

Main Video for Lifelike Agility and Play in Quadrupedal Robots.

Abstract

This project brings the expressive power of recent pre-training methods from computer vision and language understanding into motor control. Latent representations are pre-trained on animal motions. The pre-trained representations are then used to train control policies that solve various challenging tasks through general reinforcement learning, demonstrating lifelike agility and versatile strategies. We consider many challenging tasks, including creeping, jumping over hurdles, and freerunning over scattered blocks. More interestingly, we design a multi-agent Chase Tag Game, in which two robots act as chaser and evader, respectively, and their roles can be switched by triggering a Flag in the game. Real-world experiments demonstrate that the MAX robots [1] deployed with the trained policies exhibit lifelike agility and strategies.

Pre-trained Representations by Imitating Animal Data on Flat Ground


To achieve lifelike agility and strategy, we first collect regular animal motion data on flat ground using a motion capture system. Then, a neural controller containing a specific information-bottleneck structure is trained to imitate the animal motions. The resulting pre-trained latent representations are sufficiently expressive to cover the collected animal motions. In this learning stage, the network takes only the robot's proprioceptive information as input and does not perceive any information from the external environment. For all real-world experiments in this project, we deploy the controllers trained in simulation on the MAX robot in a zero-shot manner. MAX is a quadrupedal robot developed in-house; it weighs 14 kg and each leg has 3 actuators. Below, we demonstrate some imitation results on the MAX robots.


Imitation of animal motions after training latent representations on the animal motion data.
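To make the bottleneck structure concrete, here is a minimal sketch, assuming a PyTorch encoder-decoder in which the encoder compresses a reference motion snippet into a low-dimensional stochastic latent and the decoder maps proprioception plus that latent to joint targets. All layer sizes, input dimensions, and module names are illustrative assumptions, not the paper's released architecture; only the 12 actuators (3 per leg, 4 legs) come from the text above.

# A minimal sketch (not the authors' released code) of the stage-1 controller:
# an encoder compresses a reference animal-motion snippet into a low-dimensional
# latent (the information bottleneck), and a decoder maps proprioception plus
# the latent to joint targets. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

PROPRIO_DIM = 60   # assumed proprioceptive state size (joints, IMU, history)
REF_DIM = 120      # assumed size of the reference animal-motion snippet
LATENT_DIM = 32    # assumed bottleneck width
ACTION_DIM = 12    # MAX has 3 actuators per leg x 4 legs (from the text)

class Encoder(nn.Module):
    """Compresses the reference motion into a stochastic latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PROPRIO_DIM + REF_DIM, 256), nn.ELU(),
            nn.Linear(256, 2 * LATENT_DIM),  # mean and log-std of the latent
        )

    def forward(self, proprio, ref_motion):
        mu, log_std = self.net(torch.cat([proprio, ref_motion], -1)).chunk(2, -1)
        # Reparameterized sample keeps the bottleneck stochastic during training.
        return mu + log_std.exp() * torch.randn_like(mu)

class Decoder(nn.Module):
    """Maps proprioception + latent to joint position targets."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PROPRIO_DIM + LATENT_DIM, 256), nn.ELU(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, proprio, latent):
        return self.net(torch.cat([proprio, latent], -1))

# In stage 1 the encoder and decoder are trained jointly to track the mocap
# clips; afterwards the decoder (and the latent space it accepts) is frozen
# and reused by the later stages.
enc, dec = Encoder(), Decoder()
proprio, ref = torch.randn(1, PROPRIO_DIM), torch.randn(1, REF_DIM)
action = dec(proprio, enc(proprio, ref))
print(action.shape)  # torch.Size([1, 12])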

Adapting to Environments


At a subsequent stage, we build additional neural network structures upon the pre-trained representations and let the network perceive exteroception from the environment. The aim at this stage is to align the latent representations, which compress natural animal motions, with the environment, enabling the robot to respond to its surroundings with animal-level behaviors. Once the robot is capable of adapting to various complex environments, the knowledge linking the animal motion representations with external perception is again stored in the neural network. Below, we demonstrate the learned performance on some challenging tasks that are not captured in the motion dataset.


Creeping

Ascending Stairs

Jumping over Hurdles

Freerunning over Blocks

Solving challenging tasks that are not captured in the motion dataset.
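As a rough illustration of this stage, the following sketch (under the same assumed dimensions as the previous snippet) adds an environment-level network that reads exteroception, such as a terrain height map, and outputs a latent for the frozen stage-1 decoder, so every action stays inside the space of animal-like behaviors learned in stage 1. The height-map size, layer widths, and names are assumptions, not the paper's specification.

# A minimal sketch of the stage-2 idea: a new "environmental" network reads
# exteroception plus proprioception and outputs a latent for the frozen
# stage-1 decoder. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

PROPRIO_DIM, LATENT_DIM, ACTION_DIM = 60, 32, 12
EXTERO_DIM = 187  # assumed flattened terrain height-map size

class EnvPolicy(nn.Module):
    """Stage-2 policy: exteroception + proprioception -> stage-1 latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EXTERO_DIM + PROPRIO_DIM, 512), nn.ELU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, extero, proprio):
        return self.net(torch.cat([extero, proprio], -1))

# Frozen stand-in for the pre-trained stage-1 decoder.
decoder = nn.Sequential(nn.Linear(PROPRIO_DIM + LATENT_DIM, 256), nn.ELU(),
                        nn.Linear(256, ACTION_DIM))
for p in decoder.parameters():
    p.requires_grad_(False)  # only the stage-2 policy is trained here

policy = EnvPolicy()
extero, proprio = torch.randn(1, EXTERO_DIM), torch.randn(1, PROPRIO_DIM)
latent = policy(extero, proprio)              # environment-aware latent
action = decoder(torch.cat([proprio, latent], -1))
print(action.shape)  # torch.Size([1, 12])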

Chase Tag Game

Finally, we use the pre-trained networks obtained from the above two stages to solve a downstream task, a designed Chase Tag Game, to acquire knowledge at the strategic level. The knowledge trained at each stage can be expanded and adjusted without re-training, allowing for continuous accumulation and learning. In this multi-agent Chase Tag Game, two MAX robots play against each other, alternately taking the roles of chaser and evader, where the roles are determined by a Flag placed in the playground. The game setting is similar to the World Chase Tag [2], an international championship for competitive human parkour, with simplifications and novel elements. The rules are introduced below.

The game takes place in a 4.5 m by 4.5 m square arena with obstacles. At the beginning of each game episode, the two MAX robots are placed randomly in the play area, and a Flag is placed randomly as well. The roles are initially assigned at random, with one MAX robot as the chaser and the other as the evader. When the game starts, the mission of the chaser is to chase the evader: as soon as the distance between the two MAX robots falls below 0.6 m, the game terminates with the current chaser winning. Meanwhile, the evader aims to avoid being caught, and it can switch the players' roles by reaching the Flag within a distance of 0.3 m; if this happens, the evader becomes the chaser and the previous chaser becomes the evader, while the Flag disappears and a new Flag re-spawns at a random location in the arena. The game proceeds until the distance between the two robots falls below 0.6 m. In all games, the average root forward velocity of the MAX robots is restricted to 0.5 m/s. Several games can be watched below, followed by a sketch of the rule logic.




Round 1

Round 2

Round 3

Round 4
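The rule logic described above can be summarized in a short sketch. The arena size and the 0.6 m and 0.3 m thresholds follow the text; the function names and the use of flat 2-D positions (ignoring obstacles and the robots' dynamics) are illustrative simplifications.

# A minimal sketch of the Chase Tag rule logic described above; positions are
# 2-D numpy arrays, and the robots' motion comes from the learned controllers
# outside this snippet.
import numpy as np

ARENA = 4.5       # square side length in meters
CATCH_DIST = 0.6  # chaser wins when within this distance of the evader
FLAG_DIST = 0.3   # evader triggers a role switch within this distance of the Flag

rng = np.random.default_rng(0)

def spawn():
    """Random position inside the square arena."""
    return rng.uniform(0.0, ARENA, size=2)

def step_rules(chaser_pos, evader_pos, flag_pos):
    """Apply the game rules for one tick: (winner, roles_swapped, flag_pos)."""
    if np.linalg.norm(chaser_pos - evader_pos) < CATCH_DIST:
        return "chaser", False, flag_pos   # game over, current chaser wins
    if np.linalg.norm(evader_pos - flag_pos) < FLAG_DIST:
        return None, True, spawn()         # roles swap, Flag re-spawns
    return None, False, flag_pos           # game continues

# Episode skeleton: a role swap exchanges which robot is treated as chaser.
chaser, evader, flag = spawn(), spawn(), spawn()
winner, swapped, flag = step_rules(chaser, evader, flag)
if swapped:
    chaser, evader = evader, chaser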

Interestingly, we observe that lifelike strategies emerge in the games. For example, the chaser gives up chasing if it estimates that it has no chance of catching the evader before the evader reaches the Flag; in this case, the chaser hesitates and wanders around, waiting for the new Flag to re-spawn. Moreover, at the very moment the chaser is about to catch the evader, it prefers to perform a pounce, like an animal pouncing on its prey; similar behaviors can be observed when the evader is reaching the Flag. For more details, please refer to the video at the top of this webpage.




Video 1


Video 2


Video 3


Video 4


Video 5


Video 6