SlotGNN: Unsupervised Discovery of Multi-Object Representations and Visual Dynamics
Learning multi-object dynamics from visual data using unsupervised techniques is challenging due to the need for robust object representations that can be learned through robot interactions.
This paper presents a novel framework with two new architectures: SlotTransport for discovering unsupervised object representations from RGB images and SlotGNN for learning unsupervised multi-object dynamics from RGB images and robot interactions.
The SlotTransport architecture builds on slot attention for unsupervised object discovery and uses a feature transport mechanism to keep object-centric representations temporally aligned. This enables the discovery of slots that consistently reflect the composition of multi-object scenes. These slots bind robustly to distinct objects, even under heavy occlusion or when objects are absent.
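As a rough illustration of the slot attention idea that SlotTransport builds on, the sketch below shows one simplified attention iteration in plain Python. It is not the SlotTransport architecture: the learned query/key/value projections, normalization layers, and recurrent slot update of standard slot attention are omitted, the feature transport mechanism is not modeled at all, and the function names (`slot_attention_step`, `run_slot_attention`) are hypothetical. The point it shows is the competition step, where the softmax is taken over slots so that slots compete to explain each image feature.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(slots, features, eps=1e-8):
    """One simplified slot-attention iteration (no learned parameters).

    slots:    K slot vectors, each of dimension D
    features: N feature vectors, each of dimension D
    Returns (new_slots, attn), where attn[n][k] is the share of
    feature n claimed by slot k.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over SLOTS for every feature: slots compete for each feature.
    attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
    new_slots = []
    for k in range(len(slots)):
        weights = [row[k] for row in attn]
        norm = sum(weights) + eps
        # Each slot becomes the attention-weighted mean of the features.
        new_slots.append([
            sum(w * f[d] for w, f in zip(weights, features)) / norm
            for d in range(dim)
        ])
    return new_slots, attn


def run_slot_attention(features, num_slots, iters=3, seed=0):
    """Iterate the step above, starting from randomly initialized slots."""
    rng = random.Random(seed)
    dim = len(features[0])
    slots = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(num_slots)]
    attn = []
    for _ in range(iters):
        slots, attn = slot_attention_step(slots, features)
    return slots, attn
```

In this toy version, each row of `attn` sums to one across slots, which is what makes each feature get divided among the slots rather than claimed in full by all of them.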
SlotGNN, a novel unsupervised graph-based dynamics model, predicts the future state of multi-object scenes. SlotGNN learns a graph representation of the scene using the discovered slots from SlotTransport and performs relational and spatial reasoning to predict the future appearance of each slot conditioned on robot actions.
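To make the graph-based prediction step more concrete, here is a minimal sketch of one round of action-conditioned message passing over slots. Everything in it is an assumption made for illustration: the graph is taken to be fully connected, the action is given the same dimension as a slot, and `edge_message` and `node_update` are fixed hand-written functions standing in for what would be learned networks. It is not the SlotGNN model itself.

```python
def edge_message(sender, receiver):
    """Stand-in for a learned edge network: a fixed function of the pair."""
    return [0.5 * (s - r) for s, r in zip(sender, receiver)]


def node_update(slot, aggregated, action):
    """Stand-in for a learned node network, conditioned on the action."""
    return [x + 0.1 * m + 0.1 * a
            for x, m, a in zip(slot, aggregated, action)]


def gnn_step(slots, action):
    """One round of message passing on a fully connected slot graph.

    slots:  list of slot vectors (the graph nodes)
    action: action vector, assumed here to share the slot dimension
    Returns the predicted next slot vectors, in the same order.
    """
    dim = len(slots[0])
    next_slots = []
    for i, receiver in enumerate(slots):
        aggregated = [0.0] * dim
        for j, sender in enumerate(slots):
            if i == j:
                continue  # no self-edges
            msg = edge_message(sender, receiver)
            # Sum aggregation keeps the result independent of slot order.
            aggregated = [a + m for a, m in zip(aggregated, msg)]
        next_slots.append(node_update(receiver, aggregated, action))
    return next_slots
```

Because messages are summed and the same functions are applied to every node and edge, permuting the input slots permutes the predictions in the same way. That property is one reason a graph model is a natural fit for slot representations, which carry no fixed ordering.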
We demonstrate the effectiveness of SlotTransport in learning object-centric features that accurately encode both visual and positional information. Further, we highlight the accuracy of SlotGNN in downstream robotic tasks, including challenging multi-object rearrangement and long-horizon prediction. Finally, our unsupervised approach proves effective in the real world. With only minimal additional data, our framework robustly predicts slots and their corresponding dynamics in real-world control tasks.
- Simulation Results
- Action Planning:
We demonstrate an action sequence optimized using our learned framework. The optimal action (green arrow) is computed via model predictive control using the future predictions of SlotGNN.
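The planning loop described here can be sketched as follows. This is a generic random-shooting form of model predictive control, chosen for brevity; the optimizer actually used with SlotGNN is not stated in this text. The names `toy_dynamics`, `cost`, `plan_action`, and `run_mpc` are hypothetical, and `toy_dynamics` is a trivial stand-in for a learned dynamics model.

```python
import random


def toy_dynamics(state, action):
    """Hypothetical stand-in for a learned model: the action shifts the state."""
    return [s + a for s, a in zip(state, action)]


def cost(state, goal):
    """Squared distance between a predicted state and the goal state."""
    return sum((s - g) ** 2 for s, g in zip(state, goal))


def plan_action(state, goal, dynamics, horizon=3, num_samples=256,
                max_step=0.2, seed=0):
    """Random-shooting MPC: return the first action of the cheapest sequence."""
    rng = random.Random(seed)
    dim = len(state)
    best_cost, best_first = float("inf"), None
    for _ in range(num_samples):
        seq = [[rng.uniform(-max_step, max_step) for _ in range(dim)]
               for _ in range(horizon)]
        pred, total = state, 0.0
        for action in seq:
            pred = dynamics(pred, action)  # roll the model forward
            total += cost(pred, goal)      # accumulate cost over the horizon
        if total < best_cost:
            best_cost, best_first = total, seq[0]
    return best_first


def run_mpc(state, goal, dynamics, steps=20):
    """Closed loop: replan at every step and execute only the first action."""
    for t in range(steps):
        action = plan_action(state, goal, dynamics, seed=t)
        state = dynamics(state, action)
    return state
```

A call such as `run_mpc([0.0, 0.0], [1.0, -0.5], toy_dynamics)` drives the toy state toward the goal. In a setting like the one described above, the state would be the set of slots, the goal would come from a goal image, and the cost would compare predicted slots with goal slots.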
- Long-Horizon Dynamics Rollout:
Given a series of robot pushing actions, we roll out the dynamics of a multi-object scene from only the initial image frame. Our graph-based dynamics model, SlotGNN, predicts stable and physically plausible long-horizon dynamics. In contrast, an MLP-based dynamics model, SlotMLP, struggles to accurately capture object interactions. Similarly, a keypoint-based dynamics model, KINet, proves unreliable due to inconsistencies in its keypoints.
- Real-Robot Results
- Unsupervised Object Discovery:
We transfer our framework to the real world after training it in simulation. After collecting 20 real demos (5% of the amount of simulated training data), our model discovers accurate object-centric representations in the real world. We show examples of slots discovered with SlotTransport in various scenes and under challenging conditions such as robot occlusion.
- Unsupervised Object Discovery - Missing Objects:
We show examples of SlotTransport's robust performance in scenes with missing objects. Although trained exclusively on scenes with all three objects, it learns temporally consistent slots, ensuring reliable object discovery even when objects are missing.
- Action Planning:
We showcase action planning in the real world. Given only a goal scene image, the optimal action (green arrow) is computed via model predictive control using the future predictions made by SlotGNN.
- Action Planning - External Disturbance:
We show action planning in dynamic scenarios where a human uses a grabber stick to displace objects from their target positions. In response, the robot plans actions to move the objects back to their intended positions.