RT-Sketch: Goal-Conditioned Imitation
Learning from Hand-Drawn Sketches

*Equal advising, alphabetical order
1Stanford University, 2Google DeepMind, 3Intrinsic


Natural language and images are commonly used as goal representations in goal-conditioned imitation learning (IL). However, natural language can be ambiguous and images can be over-specified. In this work, we propose hand-drawn sketches as a modality for goal specification in visual imitation learning. Sketches are easy for users to provide on the fly like language, but similar to images they can also help a downstream policy to be spatially-aware and even go beyond images to disambiguate task-relevant from task-irrelevant objects. We present RT-Sketch, a goal-conditioned policy for manipulation that takes a hand-drawn sketch of the desired scene as input, and outputs actions. We train RT-Sketch on a dataset of paired trajectories and corresponding synthetically generated goal sketches. We evaluate this approach on six manipulation skills involving tabletop object rearrangements on an articulated countertop. Experimentally we find that RT-Sketch is able to perform on a similar level to image or language-conditioned agents in straightforward settings, while achieving greater robustness when language goals are ambiguous or visual distractors are present. Additionally, we show that RT-Sketch has the capacity to interpret and act upon sketches with varied levels of specificity, ranging from minimal line drawings to detailed, colored drawings.

Dataset Generation & Training

[Figure: GAN output sketch, followed by thickness, color, and affine augmentations]

Training a sketch-conditioned IL policy requires a dataset of trajectories paired with sketches of the goal state. Since collecting such a dataset manually at scale is infeasible, we first train a GAN-based image-to-sketch translation network that automatically converts images into sketches. We additionally augment the generated sketches with various colorspace and affine transforms to simulate the variability of hand-drawn sketches. Using this network, we automatically convert hindsight-relabeled goal images from robot trajectories into goal sketches. The results are visualized above.
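The colorspace and affine augmentations can be sketched as follows. This is an illustrative stand-in, not the exact pipeline: the specific transform parameters are assumptions, sketches are taken to be HxWx3 uint8 arrays, and the thickness augmentation (e.g. stroke dilation) is omitted for brevity.

```python
import numpy as np

def augment_sketch(sketch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply random colorspace and affine perturbations to a goal sketch."""
    out = sketch.astype(np.float32)

    # Colorspace augmentation: random per-channel gain and brightness shift.
    gain = rng.uniform(0.8, 1.2, size=(1, 1, 3))
    bias = rng.uniform(-20, 20, size=(1, 1, 3))
    out = np.clip(out * gain + bias, 0, 255)

    # Affine augmentation: small random translation, padded with a white
    # background (sketches are dark strokes on white paper).
    h, w, _ = out.shape
    dy = rng.integers(-h // 20, h // 20 + 1)
    dx = rng.integers(-w // 20, w // 20 + 1)
    shifted = np.full_like(out, 255.0)
    dst_y = slice(max(dy, 0), h + min(dy, 0))
    dst_x = slice(max(dx, 0), w + min(dx, 0))
    src_y = slice(max(-dy, 0), h + min(-dy, 0))
    src_x = slice(max(-dx, 0), w + min(-dx, 0))
    shifted[dst_y, dst_x] = out[src_y, src_x]

    return shifted.astype(np.uint8)

rng = np.random.default_rng(0)
sketch = np.full((64, 64, 3), 255, dtype=np.uint8)  # blank white "sketch"
augmented = augment_sketch(sketch, rng)
```

Applying several such randomized transforms per generated sketch broadens the training distribution toward the kinds of imprecision real hand-drawn sketches exhibit.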

To train RT-Sketch, we sample an episode from a pre-recorded dataset of robot trajectories. Treating the last observation in the trajectory as a goal image, we convert it to either a GAN-generated goal sketch, a colorized sketch, or an edge-detected image. We concatenate this goal representation with the history of RGB observations in the trajectory, and the concatenated result serves as input to RT-Sketch, which outputs tokenized actions. Training on these varied input types encourages the policy to handle different levels of input sketch specificity.
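A minimal sketch of how one such training example might be assembled. Everything concrete here is an assumption: the history length, the image size, and the `edge_detect` function (a simple gradient-magnitude edge map standing in for a real edge detector, just as the GAN-generated and colorized sketch functions would plug into `goal_fns` in practice).

```python
import numpy as np

def edge_detect(img: np.ndarray) -> np.ndarray:
    """Gradient-magnitude edge map; a stand-in for a real edge detector."""
    gray = img.mean(axis=-1)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    edges = np.where(mag > mag.mean(), 255, 0)
    return np.repeat(edges[..., None], 3, axis=-1).astype(np.uint8)

def make_training_example(episode, rng, goal_fns, history=6):
    """Build one (goal representation + observation history) policy input."""
    goal_image = episode[-1]                         # hindsight-relabeled goal
    goal_fn = goal_fns[rng.integers(len(goal_fns))]  # sample a goal type
    obs_history = episode[-history:]                 # recent RGB observations
    return np.stack([goal_fn(goal_image), *obs_history], axis=0)

rng = np.random.default_rng(0)
episode = [np.zeros((64, 64, 3), np.uint8) for _ in range(10)]
# In practice: GAN-generated sketch, colorized sketch, and edge-detected image.
goal_fns = [edge_detect]
batch = make_training_example(episode, rng, goal_fns)
print(batch.shape)  # → (7, 64, 64, 3)
```

Sampling the goal representation per example, rather than training separate policies, is what lets a single RT-Sketch checkpoint accept sketches of varying specificity at test time.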


RT-Sketch is a sketch-to-action behavior-cloning agent which is

(1) On par with image-conditioned and language-conditioned agents for tabletop / countertop manipulation
(2) Compatible with sketches of varied detail
(3) Robust to visual distractors
(4) Effective where language goals are semantically ambiguous

Tabletop / Countertop Manipulation

For straightforward manipulation tasks such as those in the RT-1 benchmark, RT-Sketch performs on par with language-conditioned and image-conditioned agents for nearly all skills:

[Rollout videos: Move Near, Pick Drawer, Drawer Open, and Drawer Close skills]



Robustness to Sketch Detail

RT-Sketch also accommodates input sketches with varied levels of detail, ranging from free-hand line drawings to colorized sketches, without a performance drop compared to upper-bound representations like edge-detected images.

[Rollout videos: Move Near and Drawer Open skills]

Emergent Capabilities: Robustness to Visual Distractors

Although RT-Sketch is trained only in distractor-free settings, we find that it handles visual distractors in the scene well, while goal-image-conditioned policies are easily thrown out of distribution and fail to make task progress. This is likely due to the minimal nature of sketches, which inherently helps the policy attend only to task-relevant objects.

In terms of perceived semantic and spatial alignment on a 1-7 scale, RT-Sketch achieves 1.5X and 1.6X improvements, respectively, over a goal-image-conditioned policy.

[Trial visualizations: RT-Goal Image rollouts conditioned on goal images vs. RT-Sketch rollouts conditioned on goal sketches]

Semantic Ambiguity

While convenient, language instructions can be underspecified or ambiguous, or may require lengthy descriptions to communicate task goals effectively. These issues do not arise with sketches, which offer a minimal yet expressive means of conveying goals. We find that RT-Sketch remains performant in scenarios where language is ambiguous or too far out of distribution for language-conditioned policies like RT-1 to handle.

In terms of perceived semantic and spatial alignment on a 1-7 scale, RT-Sketch achieves 2.4X and 2.8X improvements, respectively, over RT-1.

[Trial visualizations: ambiguous language goal with RT-1 rollout vs. goal sketch with RT-Sketch rollout]

Quantitative Results

We measure the performance of RT-Sketch against language-conditioned (RT-1) and goal-image-conditioned (RT-Goal Image) policies using two quantitative metrics: (1) human-provided Likert ratings of goal alignment, and (2) pixelwise error between (manually labeled) achieved object positions and ground-truth object keypoints from reference goal images. RT-Sketch performs comparably to both on straightforward tasks (H1), handles different levels of sketch detail with minimal performance drop (H2), and is favorable when visual distractors are present (H3) or when language would be ambiguous (H4).
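One plausible instantiation of the pixelwise-error metric is the mean Euclidean distance, in pixels, between the labeled achieved keypoints and the corresponding ground-truth keypoints from the reference goal image. The keypoint values below are illustrative, not from the paper.

```python
import numpy as np

def pixelwise_error(achieved: np.ndarray, target: np.ndarray) -> float:
    """Mean Euclidean distance between (N, 2) arrays of (x, y) keypoints."""
    return float(np.linalg.norm(achieved - target, axis=1).mean())

# Two labeled object keypoints: one object misplaced by (3, 4) px, one exact.
achieved = np.array([[120.0, 80.0], [200.0, 150.0]])
target = np.array([[123.0, 84.0], [200.0, 150.0]])
print(pixelwise_error(achieved, target))  # → 2.5
```

Averaging over keypoints (rather than taking the maximum) rewards policies that place most objects accurately even if one object is off.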

[Charts: Likert ratings of goal alignment and pixelwise error, across methods and skills]

Failure Modes

RT-Sketch's main failure modes are imprecision and moving the wrong object. The first typically occurs when RT-Sketch positions an object correctly but fails to reorient it (common in the upright task). The second is most apparent when visual distractor objects are present, where RT-Sketch mistakenly picks up the wrong object and places it in the appropriate location. We posit that both failures stem from RT-Sketch being trained on GAN-generated sketches, which occasionally do not preserve geometric details well, leading the policy to pay insufficient attention to object identity or orientation.


Imprecision

Coke can moved to correct location, but not upright

Pepsi can moved to correct location, but not upright

Wrong Object

Apple moved instead of coke can

Coke can moved instead of fruit