RT-Sketch: Goal-Conditioned Imitation
Learning from Hand-Drawn Sketches

Additional Results


Towards Multimodal Goal Specification


[Figure: qualitative comparison of goal-conditioning modalities]

Task: "place the pepsi can upright"
  Language Alone: Wrong placement location
  Sketch Alone: Not upright
  Sketch + Language: Upright, correctly placed

Task: "place the orange on the counter"
  Language Alone: Wrong placement location
  Sketch Alone: Correct placement
  Sketch + Language: Correct placement


We are excited by the prospect of multimodal goal specification helping to resolve ambiguity that arises from any single modality, and we provide experiments demonstrating that sketch-and-language conditioning can be preferable to either modality alone. We train a sketch-and-language-conditioned model that uses FiLM layers together with EfficientNet to tokenize both the visual input and the language instruction. Here, we see that while language alone (i.e., "place the can upright") is ambiguous about spatial placement, and a sketch alone does not encourage reorientation, the jointly conditioned policy better addresses the limitations of either modality on its own. Similarly, for the Pick Drawer skill, the sketch-conditioned and sketch-and-language-conditioned policies place the orange on the counter more precisely than conditioning on language alone.
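To make the conditioning concrete, below is a minimal, heavily simplified sketch of FiLM-style tokenization, assuming PyTorch. It is not the actual RT-Sketch architecture: a small convolutional stem stands in for EfficientNet, the channel-wise concatenation of observation and goal sketch, all layer sizes, and the class name `SketchAndLanguageTokenizer` are illustrative assumptions.

```python
# Minimal illustration of FiLM-conditioned visual tokenization (not the RT-Sketch code).
# A small conv stack stands in for EfficientNet; a fixed-size vector stands in for a
# sentence embedding of the language instruction.
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, channels, lang_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # The language embedding predicts a per-channel scale (gamma) and shift (beta).
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * channels)

    def forward(self, x, lang_emb):
        gamma, beta = self.to_gamma_beta(lang_emb).chunk(2, dim=-1)
        x = torch.relu(self.conv(x))
        return gamma[..., None, None] * x + beta[..., None, None]

class SketchAndLanguageTokenizer(nn.Module):
    """Tokenizes (observation, goal sketch) images, modulated by a language embedding."""
    def __init__(self, lang_dim=512, channels=64):
        super().__init__()
        # Assumption for illustration: observation and goal sketch are concatenated
        # along the channel axis (3 RGB + 3 sketch channels).
        self.stem = nn.Conv2d(6, channels, kernel_size=7, stride=4, padding=3)
        self.film_blocks = nn.ModuleList(
            [FiLMBlock(channels, lang_dim) for _ in range(3)]
        )
        self.pool = nn.AdaptiveAvgPool2d(4)  # 4x4 spatial grid -> 16 visual tokens

    def forward(self, obs, sketch_goal, lang_emb):
        x = self.stem(torch.cat([obs, sketch_goal], dim=1))
        for block in self.film_blocks:
            x = block(x, lang_emb)
        return self.pool(x).flatten(2).transpose(1, 2)  # (B, tokens, channels)

# Example usage with random tensors standing in for real inputs.
tokenizer = SketchAndLanguageTokenizer()
obs = torch.randn(1, 3, 256, 320)
sketch = torch.randn(1, 3, 256, 320)
lang = torch.randn(1, 512)
print(tokenizer(obs, sketch, lang).shape)  # torch.Size([1, 16, 64])
```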


Robustness to Sketches Drawn by Different People

We evaluate whether RT-Sketch can generalize to sketches drawn by different individuals and handle stylistic variations, using Likert ratings provided by 22 human evaluators. Across 30 sketches drawn by 6 different individuals using line sketching (tracing), RT-Sketch achieves high spatial alignment, with no significant dropoff in performance across individuals or relative to the original sketches used in our evaluation. We provide the sketches drawn by the 6 individuals and the corresponding robot execution videos below.

[Sketches and corresponding robot execution videos for Individuals 1 through 6]
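As a concrete (hypothetical) illustration of this per-individual analysis, the snippet below aggregates Likert ratings grouped by the individual who drew each sketch and applies a Kruskal-Wallis test, one reasonable choice for ordinal ratings. The scores and the assumed 1-to-7 scale are placeholder values, not the actual study data.

```python
# Illustrative aggregation of Likert alignment ratings grouped by sketch author
# (hypothetical data layout and values; not the actual study data).
import numpy as np
from scipy.stats import kruskal

# ratings[i] holds the Likert scores (assumed 1-7 scale) for sketches by individual i.
ratings = {
    1: [6, 7, 5, 6, 7],
    2: [5, 6, 6, 7, 6],
    3: [7, 6, 6, 5, 7],
    4: [6, 6, 7, 6, 5],
    5: [5, 7, 6, 6, 6],
    6: [6, 5, 7, 7, 6],
}

for individual, scores in ratings.items():
    print(f"Individual {individual}: mean={np.mean(scores):.2f}, std={np.std(scores):.2f}")

# Kruskal-Wallis is appropriate for ordinal data; p > 0.05 suggests no significant
# difference in ratings across individuals.
stat, p_value = kruskal(*ratings.values())
print(f"Kruskal-Wallis H={stat:.2f}, p={p_value:.3f}")
```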


Comparison to Alternative Image-to-Sketch Techniques

In this section, we highlight the image-to-sketch generation techniques we experimented with before arriving at our final finetuned GAN-based approach.

Two recent works, CLIPasso and CLIPascene by Vinker et al., explore methods for automatically generating a sketch from an image. These works pose sketch generation as inferring the parameters of Bezier curves representing "strokes" such that the generated sketch has maximal CLIP similarity to a given input image. These methods perform a per-image optimization to generate a plausible sketch, rather than a batched operation across many images, which limits their scalability. They are also fundamentally more concerned with producing high-quality, aesthetically pleasing sketches that capture many extraneous details. CLIPasso and CLIPascene produce sketches with many overlapping curves, which capture details about object surface texture and appearance that are not relevant to performing the required robot tasks; during evaluation, these overlapping curves would also require more effort for a human operator to draw.
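To make the scalability point concrete, the snippet below sketches the per-image optimization structure of such methods under stand-in components: a soft Gaussian "stroke" rasterizer and an L2 image loss replace the differentiable Bezier rasterizer and CLIP feature similarity that CLIPasso actually uses, and all parameter values are illustrative. The key structural point is that every new image pays the full cost of hundreds of gradient steps.

```python
# Schematic of per-image stroke optimization in the spirit of CLIPasso/CLIPascene.
# For illustration only: a Gaussian-blob rasterizer and L2 loss stand in for the
# differentiable Bezier rasterizer and CLIP similarity used by the actual methods.
import torch

def rasterize(points, size=64, sigma=1.5):
    """Render a set of 2D stroke points as a soft grayscale image."""
    ys, xs = torch.meshgrid(
        torch.arange(size, dtype=torch.float32),
        torch.arange(size, dtype=torch.float32),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=-1)                       # (size, size, 2)
    d2 = ((grid[None] - points[:, None, None]) ** 2).sum(-1)   # (P, size, size)
    return torch.exp(-d2 / (2 * sigma ** 2)).sum(0).clamp(max=1.0)

def sketch_one_image(target, num_points=32, steps=500, lr=1.0):
    """Optimize stroke point locations so the rasterized sketch matches one image."""
    size = target.shape[-1]
    points = torch.nn.Parameter(torch.rand(num_points, 2) * size)
    opt = torch.optim.Adam([points], lr=lr)
    for _ in range(steps):
        loss = ((rasterize(points, size=size) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return points.detach()

# Per-image optimization: every new observation repeats the full inner loop.
target = (torch.rand(64, 64) > 0.97).float()  # stand-in "image"
stroke_points = sketch_one_image(target)
```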

We, on the other hand, care about producing a minimal but reasonable-quality sketch. The second technique we explore is applying the PhotoSketching GAN, pre-trained on internet data of paired images and sketches, directly to our observations. However, this model's output does not capture object details well, likely because it was not trained on robot observations, and it contains irrelevant sketch details. The vanilla PhotoSketching GAN produces sketches that do not preserve object outlines with high fidelity, which makes it hard to visually distinguish between different objects (the sketches of the green chip bag and the white bowl look very similar). Finally, by finetuning the PhotoSketching GAN on our own data, we obtain outputs that are much closer to real, hand-drawn human sketches, capturing salient object details with as little extraneous detail as possible.
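By contrast with per-image optimization, a feedforward generator amortizes the cost: one forward pass translates a whole batch of observations. The snippet below illustrates this batched inference pattern with a tiny stand-in network; the actual PhotoSketching generator architecture and its finetuned weights are not reproduced here, and the shapes are assumptions.

```python
# Batched feedforward image-to-sketch translation (as with the finetuned
# PhotoSketching GAN). The tiny encoder-decoder below is only a placeholder
# for the real generator and its weights.
import torch
import torch.nn as nn

stand_in_generator = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # 1-channel sketch
)

@torch.no_grad()
def images_to_sketches(generator, images, batch_size=64):
    """Batched inference over a dataset of observations; no per-image optimization."""
    sketches = []
    for start in range(0, images.shape[0], batch_size):
        sketches.append(generator(images[start:start + batch_size]))
    return torch.cat(sketches)

observations = torch.rand(256, 3, 256, 320)   # stand-in robot observations
goal_sketches = images_to_sketches(stand_in_generator, observations)
print(goal_sketches.shape)  # torch.Size([256, 1, 256, 320])
```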


Evaluation Metric Visualizations

In this work, we quantify the spatial precision of policies based on the pixelwise distance between achieved object positions and their desired placements in visual goals. In particular, we manually identify the most aligned frame from a rollout, label a keypoint for the target object in both the achieved and goal images, and compute the RMSE between these keypoints. To determine the most aligned frame, we use an annotation interface in which a human evaluator watches the rollout back and pauses at the timestamp at which the robot achieves alignment, as judged by the evaluator, or at the end frame if alignment is not achieved. On this frame, the evaluator provides a 2D click to specify the object centroid. While we could have used an object detector to obtain the keypoints, we annotate them manually to avoid conflating object detection errors with policy imprecision. Here, we visualize the manual keypoint annotations for 4 separate RT-Sketch trials along with the associated RMSE.
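As a concrete reading of this metric, the snippet below computes the RMSE between annotated keypoints in the achieved frame and the goal image. The pixel coordinates are hypothetical and `keypoint_rmse` is an illustrative helper, not the exact evaluation script.

```python
# Worked example of the spatial precision metric: RMSE (in pixels) between manually
# annotated object keypoints in the achieved frame and the goal image.
# The coordinates below are hypothetical, purely for illustration.
import numpy as np

def keypoint_rmse(achieved_xy, goal_xy):
    """Root-mean-square pixel error between paired 2D keypoint annotations."""
    achieved = np.asarray(achieved_xy, dtype=float)
    goal = np.asarray(goal_xy, dtype=float)
    return float(np.sqrt(np.mean(np.sum((achieved - goal) ** 2, axis=-1))))

# One keypoint per trial: (x, y) click in the most-aligned frame vs. the goal image.
achieved_keypoints = [(212, 148), (90, 201), (305, 177), (160, 96)]
goal_keypoints     = [(205, 150), (98, 195), (301, 180), (171, 102)]

for a, g in zip(achieved_keypoints, goal_keypoints):
    print(f"trial RMSE: {keypoint_rmse([a], [g]):.1f} px")  # per-trial Euclidean error
print(f"aggregate over trials: {keypoint_rmse(achieved_keypoints, goal_keypoints):.1f} px")
```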

Excessive Retrying: RT-Goal Image

One frequent failure mode we observe with RT-Goal Image is the tendency of this policy to exhibit excessive retrying behavior. We hypothesize that the policy over-attends to pixel-level differences and, as a result, repeatedly attempts to manipulate objects that are already reasonably aligned. This failure to terminate can risk toppling objects or misaligning the scene, as visualized below. RT-Sketch and RT-1 are far less susceptible, as they appear to learn notions of alignment and termination that are not hypersensitive to pixelwise differences. Below, we provide visualizations of RT-Goal Image evaluation rollouts that illustrate this excessive retrying behavior.

Why Sketches?

We highlight several motivating examples where sketches can either guide manipulation directly or complement other modalities:

Setting a Table

We argue that sketches allow for a loose but useful level of goal specification. For instance, imagine setting up a fancy dinner table. If we were to describe the desired setup on the left in language, we would end up with a short and underspecified description, such as: "The utensils go around the plate. The cups go next to each other. The forks go next to each other. The plate should have a knife on it." There are three sets of utensils here, making it difficult to tell how they should be arranged around the plate, and multiple instances of cups, forks, and plates, making the instructions unclear. To disambiguate effectively, a person would instead need much longer descriptions or incredibly specific instructions like "put the smaller fork 2 cm to the left of the larger one," which is not convenient to do for every single item on the table. We argue that sketches provide a modality that specifies "roughly" what the user cares about spatially in a much more effective way.


Arranging Furniture

For mobile robots operating in households, arranging furniture is a potential use case that is difficult to accomplish with language instructions alone. Here, a sketch captures the relative placements and orientations of the furniture in a way that would take much longer to describe with language. Still, there are opportunities to combine sketches and language, for instance through textual annotations on a sketch.


Folding / Assembly

Especially for long-horizon tasks such as multi-step folding or assembly, a sketch can help visually convey what to manipulate at each subgoal. Here, the sketch serves more as a schematic or diagram of the desired task than as a depiction of only the final desired scene.