Natural language and images are commonly used as goal representations in goal-conditioned imitation learning (IL). However, natural language can be ambiguous and images can be over-specified. In this work, we propose hand-drawn sketches as a modality for goal specification in visual imitation learning. Like language, sketches are easy for users to provide on the fly; like images, they help a downstream policy be spatially aware, and they even go beyond images by disambiguating task-relevant from task-irrelevant objects. We present RT-Sketch, a goal-conditioned policy for manipulation that takes a hand-drawn sketch of the desired scene as input and outputs actions. We train RT-Sketch on a dataset of paired trajectories and corresponding synthetically generated goal sketches, and evaluate it on six manipulation skills involving tabletop object rearrangements on an articulated countertop. Experimentally, we find that RT-Sketch performs comparably to image- or language-conditioned agents in straightforward settings, while achieving greater robustness when language goals are ambiguous or visual distractors are present. Additionally, we show that RT-Sketch can interpret and act upon sketches with varied levels of specificity, ranging from minimal line drawings to detailed, colored drawings.
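To make the goal-conditioning interface concrete, the snippet below is a minimal, hypothetical sketch of a policy that consumes a goal sketch alongside the current observation and outputs an action. It assumes a simple CNN encoder over channel-concatenated images and a continuous action head; the class and names (SketchConditionedPolicy, the 7-dimensional action) are illustrative only and do not reflect RT-Sketch's actual architecture.

```python
# Minimal conceptual sketch of a goal-conditioned visuomotor policy, assuming a
# simple CNN encoder over channel-concatenated (observation, goal sketch) images
# and a continuous action head. Names are illustrative, not the paper's API.
import torch
import torch.nn as nn

class SketchConditionedPolicy(nn.Module):
    def __init__(self, action_dim: int = 7):
        super().__init__()
        # Encoder sees 6 channels: 3 for the RGB observation, 3 for the goal sketch.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, action_dim),  # e.g., end-effector delta pose + gripper
        )

    def forward(self, observation: torch.Tensor, goal_sketch: torch.Tensor) -> torch.Tensor:
        # Condition on the goal by concatenating it with the observation along channels.
        x = torch.cat([observation, goal_sketch], dim=1)
        return self.head(self.encoder(x))

# Usage: one RGB observation paired with a rendered goal sketch of the desired scene.
policy = SketchConditionedPolicy()
obs = torch.rand(1, 3, 256, 256)
sketch = torch.rand(1, 3, 256, 256)
action = policy(obs, sketch)  # shape: (1, 7)
```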
For straightforward manipulation tasks such as those in the RT-1 benchmark, RT-Sketch performs on par with language-conditioned and image-conditioned agents for nearly all skills.
RT-Sketch further accommodates input sketches with varied levels of detail, ranging from free-hand line drawings to colorized sketches, without a performance drop relative to upper-bound representations like edge-detected goal images.
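For reference, an edge-detected goal image of the kind used as an upper-bound representation above can be approximated with standard edge detection. The snippet below is a minimal illustration using OpenCV's Canny detector and a placeholder file path; it is not the paper's sketch-generation pipeline.

```python
# Produce an edge-detected goal image from an RGB goal image using OpenCV Canny.
# The file paths and thresholds are placeholders for illustration.
import cv2

goal_rgb = cv2.imread("goal_image.png")                  # H x W x 3 BGR goal image
gray = cv2.cvtColor(goal_rgb, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)  # binary edge map
edge_goal = cv2.bitwise_not(edges)                       # white background, dark strokes
cv2.imwrite("goal_edges.png", edge_goal)
```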
Although RT-Sketch is trained only on distractor-free settings, we find that it handles visual distractors in the scene well, while goal-image-conditioned policies are easily thrown out of distribution and fail to make task progress. This is likely due to the minimal nature of sketches, which inherently helps the policy attend only to task-relevant objects.
While convenient, language instructions can often be underspecified, ambiguous, or may require lengthy descriptions to communicate task goals effectively. These issues do not arise with sketches, which offer a minimal yet expressive means of conveying goals. We find that RT-Sketch performs well in scenarios where language is ambiguous or too far out of distribution for policies like RT-1 to handle.
RT-Sketch's main failure modes are imprecision and moving the wrong object. The first typically occurs when RT-Sketch positions an object correctly but fails to reorient it (common in the upright task). The second is most apparent in the presence of visual distractor objects, where RT-Sketch mistakenly picks up the wrong object and places it in the appropriate location. We posit that both failures stem from RT-Sketch being trained on GAN-generated sketches, which occasionally do not preserve geometric details well, leading the policy to pay less attention to object identity and orientation.
Failure examples: Coke can moved to the correct location but not upright; Pepsi can moved to the correct location but not upright; apple moved instead of the Coke can; Coke can moved instead of the fruit.