Martin, Eden [UCL]
Ronsse, Renaud [UCL]
Macq, Benoît [UCL]
El Hafi, Lotfi [Ritsumeikan University]
Garcia Ricardez, Gustavo Alfonso [UCL]
Society 5.0, the society Japan aspires to realize, aims to create a cyber-physical system in which humans and robots collaborate. To this end, both should be able to work together on the same tasks. In conventional robotics, robots are trained and specialized to perform specific tasks. While they perform well on this pre-defined set of tasks, such models require extensive data gathering and a time-consuming training process. Moreover, when facing unknown environments, their performance degrades because they cannot adapt to unforeseen situations. Additionally, if humans and robots are part of the same working team, the robots must understand and interpret human intentions. However, most previously proposed intention recognition methods also lack flexibility and the ability to contextualize. To address these limitations, this thesis proposes 1) a dynamic task planning system capable of performing non-predefined tasks, and 2) a framework that combines automatic task planning with human multimodal intention communication, enhancing both task success and human well-being (e.g., trust, willingness to use the system again).

In this regard, there have been recent improvements in zero-shot learning in Human-Robot Collaboration using large pre-trained models. Because they were trained on large amounts of data, these models can apply their knowledge to tasks beyond their training data. Visual Language Models (VLMs) in particular have recently demonstrated their ability to understand and analyze images, and are therefore increasingly used as the robot's reasoning module. Accordingly, the system proposed in this thesis is divided into three modules: 1) automatic task planning computed using GPT-4V, 2) a GPT-4V-computed confidence level reflecting its comprehension of the task, and 3) a multimodal communication module that corrects the automatic task plan in case of failure. Automatic task planning is achieved by feeding the VLM an image of the task currently being performed and asking it to determine the next step. The confidence level is defined as a number between 0 and 10, reflecting the robot's comprehension of the task. Multimodal communication is achieved using deictic movements and speech.

The results show that: 1) GPT-4V is able to understand simple tabletop pick-and-place tasks and provide the next object to pick and the corresponding placement position, 2) GPT-4V is able to evaluate its comprehension for three of the four implemented tasks, and 3) in the tested task, integrating multimodal communication into the automatic system enhances both the success rate and human well-being.
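The abstract describes querying GPT-4V with an image of the ongoing task to obtain the next pick-and-place step together with a 0-10 confidence score. A minimal sketch of such a query is given below, assuming the OpenAI Python SDK (v1.x); the prompt wording, model identifier, helper name plan_next_step, and JSON parsing are illustrative assumptions, not the thesis's actual prompts or implementation.

```python
# Sketch of a single planning/confidence query to a vision-capable GPT-4 model.
# Prompt text, model name, and output format are assumptions for illustration.
import base64
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def plan_next_step(image_path: str, task_description: str) -> dict:
    """Send the current scene image and ask for the next pick-and-place step
    plus a 0-10 confidence score reflecting task comprehension."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        f"The robot is performing this task: {task_description}. "
        "From the image of the current scene, answer in JSON with keys "
        "'object_to_pick', 'placement_position', and 'confidence' "
        "(an integer from 0 to 10 reflecting how well you understand the task)."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # any GPT-4 model with vision input
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    # Illustrative parsing: assumes the model returns plain JSON as requested.
    return json.loads(response.choices[0].message.content)


# Example use: when the reported confidence is low, the system described in the
# abstract would fall back to multimodal correction (deictic movements + speech)
# instead of executing the proposed step.
# step = plan_next_step("scene.jpg", "sort the blocks by color into the trays")
# if step["confidence"] < 5:
#     ...  # trigger human correction of the plan
```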
Bibliographic reference: Martin, Eden. Task planning system using foundation models in multimodal human-robot collaboration. Ecole polytechnique de Louvain, Université catholique de Louvain, 2024. Prom.: Ronsse, Renaud; Macq, Benoît; El Hafi, Lotfi; Garcia Ricardez, Gustavo Alfonso.
Permanent URL: http://hdl.handle.net/2078.1/thesis:46101