We deploy RT-Sketch on a Franka Panda robot, demonstrating its compatibility with various robot embodiments.
RT-Sketch is compatible with any IL backbone, and we demonstrate its flexibility by implementing a version that uses Diffusion Policy instead of the original Transformer architecture.
We demonstrate 2 tasks on the Panda:
(1) setting the table, with different utensil/plate arrangements
(2) opening/closing cabinets
For the cabinet task, we demonstrate that the policy can learn to open/close based only on quick-to-draw arrows rather than sketching the entire scene.
For tasks where sketches or language may not provide sufficient context individually, we show that RT-Sketch can be extended to accommodate both sketches and language for improved performance.
For the below experiments, we use a Franka Panda robot with a Robotiq gripper. For each task, we collect on the order of 50-60 demonstrations consisting of delta actions (from a SpaceMouse) and observations from two RealSense cameras (wrist-mounted + table-mounted). We then manually sketch the goals (less than 15 minutes per task). We implement RT-Sketch with a goal-conditioned Diffusion Policy architecture, which uses separate ResNet encoders for the agent image, wrist image, wrist depth image, and goal sketch, and concatenates the resulting embeddings as input to the noise prediction network. Training takes ~3 hours per task on a single A5000 GPU, and we deploy the resulting policy onto the robot using DDIM as the denoiser and a Polymetis controller at 10 Hz.
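To make the architecture concrete, the following is a minimal sketch of the goal-conditioned observation encoder described above, assuming PyTorch, ResNet-18 backbones, and a 256-dimensional embedding per stream (all illustrative choices, not necessarily our exact configuration); the concatenated vector serves as the conditioning input to the noise-prediction network.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def make_resnet_encoder(out_dim: int = 256, in_channels: int = 3) -> nn.Module:
    """ResNet-18 backbone with a linear projection head (illustrative sizes)."""
    net = models.resnet18(weights=None)
    if in_channels != 3:  # e.g. a single-channel wrist depth image
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, out_dim)
    return net

class SketchConditionedObsEncoder(nn.Module):
    """Separately encode agent image, wrist image, wrist depth, and goal sketch,
    then concatenate the embeddings into a single conditioning vector."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.agent_enc = make_resnet_encoder(emb_dim)
        self.wrist_enc = make_resnet_encoder(emb_dim)
        self.depth_enc = make_resnet_encoder(emb_dim, in_channels=1)
        self.sketch_enc = make_resnet_encoder(emb_dim)

    def forward(self, agent_img, wrist_img, wrist_depth, goal_sketch):
        feats = [
            self.agent_enc(agent_img),
            self.wrist_enc(wrist_img),
            self.depth_enc(wrist_depth),
            self.sketch_enc(goal_sketch),
        ]
        # Conditioning input passed to the diffusion noise-prediction network.
        return torch.cat(feats, dim=-1)
```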
We train RT-Sketch to perform table setting from 60 demonstrations. Sketches can easily capture different desired arrangements of utensils, and the policy is capable of setting the table accordingly in 10/15 trials:
Sketches inherently help the policy attend to task-relevant objects, such that it can complete the task even amid dynamic disturbances (people moving through the scene, spills, etc.) or in the presence of distractor objects (napkins, etc.).
Occasionally, the policy struggles with imprecision, which can lead to failed grasps, but it typically still makes partial task progress:
We train RT-Sketch to perform drawer opening and closing from 50 demonstrations, but specifically we consider sketches which are arrows drawn over the current image rather than sketches of the entire scene. Here, arrows can represent which drawer (top or bottom) should be opened or closed.
We see that even from extremely minimal goals (less than 5 seconds to specify), the policy is able to interpret and act upon the intended goal.
Of course, the policy is not without failures; a typical failure mode is improperly grasping the cabinet handle:
We acknowledge that handling extremely minimal sketches remains a challenging problem, and we hope to explore this direction further in future work. Ongoing advances in image-to-sketch conversion and more aggressive data augmentation can likely help address this class of sketches. The above policies also generalize only within a relatively small region of the workspace, so we hope to improve the sample efficiency and generalization capabilities of these models in the future.
Figure: rollout outcome labels for "place the pepsi can upright" (wrong placement location; not upright; upright, correctly placed) and "place the orange on the counter" (wrong placement location; correct placement; correct placement) under different goal modalities.
We are excited by the prospect of multimodal goal specification to help resolve ambiguity that arises from a single modality alone, and we provide experiments demonstrating that sketch-and-language conditioning can outperform either modality on its own. We train a sketch-and-language-conditioned model which uses FiLM along with EfficientNet layers to tokenize both the visual input and the language instruction. While language alone (e.g., "place the can upright") is ambiguous about spatial placement, and a sketch alone does not encourage reorientation, the joint policy is better able to address the limitations of either modality. Similarly, for the Pick Drawer skill, the sketch-conditioned and sketch-and-language-conditioned policies are able to place the orange on the counter more precisely, as desired.
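For illustration, FiLM conditioning amounts to predicting per-channel scale and shift parameters from the language embedding and applying them to intermediate visual feature maps. The minimal PyTorch block below is a sketch of this idea; the layer sizes and its placement within the EfficientNet backbone are assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """Language-conditioned feature-wise affine modulation (FiLM) applied to a
    visual feature map, e.g. between EfficientNet blocks (illustrative placement)."""
    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from the language embedding.
        self.to_gamma = nn.Linear(lang_dim, num_channels)
        self.to_beta = nn.Linear(lang_dim, num_channels)

    def forward(self, visual_feats: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, C, H, W); lang_emb: (B, lang_dim)
        gamma = self.to_gamma(lang_emb).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(lang_emb).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * visual_feats + beta
```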
We evaluate whether RT-Sketch can generalize to sketches drawn by different individuals and handle stylistic variations, using Likert ratings provided by 22 human evaluators. Across 30 sketches drawn by 6 different individuals using line sketching (tracing), RT-Sketch achieves high spatial alignment without a significant drop-off in performance between individuals or relative to the sketches used in our original evaluation. We provide the sketches drawn by the 6 different individuals and the corresponding robot execution videos below.
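As one way to check such a between-individual comparison, the snippet below groups Likert ratings by sketcher and runs a non-parametric Kruskal-Wallis test; both the specific test and the ratings shown are illustrative assumptions rather than our exact analysis.

```python
from scipy.stats import kruskal

# Hypothetical Likert ratings (1-7) grouped by which individual drew the sketch.
ratings_by_sketcher = {
    "p1": [6, 7, 5, 6, 6],
    "p2": [5, 6, 6, 7, 5],
    "p3": [6, 6, 7, 5, 6],
    # ... one list per sketcher
}

# Non-parametric test for a performance difference across sketchers.
stat, p = kruskal(*ratings_by_sketcher.values())
print(f"Kruskal-Wallis H={stat:.2f}, p={p:.3f}")  # large p => no significant drop-off
```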
In this section, we highlight various image-to-sketch generation techniques we experimented with before pursuing our GAN-based approach.
Two recent works, CLIPasso and CLIPascene by Vinker et al., explore methods for automatically generating a sketch from an image. These works pose sketch generation as inferring the parameters of Bezier curves representing "strokes" in order to produce a generated sketch with maximal CLIP-similarity to a given input image. These methods perform a per-image optimization to generate a plausible sketch, rather than a global batched operation across many images, limiting their scalability. Additionally, they are fundamentally more concerned with producing high-quality, aesthetically pleasing sketches that capture many extraneous details. CLIPasso and CLIPascene produce sketches with many overlapping curves, which capture details about object surface texture and appearance that are not relevant to performing the required robot tasks. During evaluation, the many overlapping curves also require more effort for a human operator to draw.
We, on the other hand, care about producing a minimal but reasonable-quality sketch. The second technique we explore is the PhotoSketching GAN, pre-trained on internet data of paired images and sketches. However, its outputs do not capture object details well, likely because the model was not trained on robot observations, and they contain irrelevant sketch details. The vanilla PhotoSketching GAN produces sketches that do not preserve object outlines with high fidelity, which makes it hard to visually distinguish between different objects (the sketch of the green chip bag and the white bowl look very similar). Finally, by finetuning this PhotoSketching GAN on our own data, the outputs are much closer to real, hand-drawn human sketches that capture salient object details as minimally as possible.
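For reference, a single finetuning step for a conditional image-to-sketch GAN of this kind might look as follows, assuming a pix2pix-style objective (adversarial loss plus L1 reconstruction against the hand-drawn sketch). The function names and loss weighting are illustrative assumptions and may differ from the exact PhotoSketching training recipe.

```python
import torch
import torch.nn.functional as F

def finetune_step(generator, discriminator, g_opt, d_opt, image, sketch, l1_weight=100.0):
    """One pix2pix-style finetuning step on a (robot image, hand-drawn sketch) pair.
    `generator` / `discriminator` stand in for the pre-trained PhotoSketching networks."""
    # --- Discriminator: real (image, sketch) pairs vs. generated pairs ---
    fake_sketch = generator(image).detach()
    d_real = discriminator(torch.cat([image, sketch], dim=1))
    d_fake = discriminator(torch.cat([image, fake_sketch], dim=1))
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Generator: fool the discriminator + stay close to the hand-drawn sketch ---
    fake_sketch = generator(image)
    g_adv = discriminator(torch.cat([image, fake_sketch], dim=1))
    g_loss = (F.binary_cross_entropy_with_logits(g_adv, torch.ones_like(g_adv))
              + l1_weight * F.l1_loss(fake_sketch, sketch))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```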
In this work, we quantify the spatial precision of policies based on the pixelwise distance between achieved object positions and their desired placements in visual goals. In particular, we identify the most aligned frame from a rollout, manually label an object keypoint for the target object in both the achieved and goal images, and compute the RMSE between the two. To determine the most aligned frame, we use an annotation interface in which a human evaluator plays back the rollout and pauses at the timestamp at which the robot achieves alignment (as judged by the evaluator), or the final frame if alignment is never achieved. On this frame, the evaluator provides a 2D click to specify the object centroid. While we could have used an object detector to obtain the object keypoints, we annotate them manually to avoid conflating object detection errors with policy imprecision. Here, we visualize the manual keypoint annotations for 4 separate RT-Sketch trials along with the associated RMSE.
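For concreteness, a minimal version of this metric is shown below; treating the RMSE as taken over the (u, v) components of a single annotated keypoint per trial is our assumption about the aggregation, and the example coordinates are hypothetical.

```python
import numpy as np

def keypoint_rmse(achieved_px, goal_px):
    """Per-trial RMSE (in pixels) over the (u, v) components of a manually
    annotated object keypoint in the achieved vs. goal image."""
    diff = np.asarray(achieved_px, dtype=float) - np.asarray(goal_px, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical example: achieved keypoint vs. goal keypoint for one trial.
print(f"RMSE: {keypoint_rmse([312, 188], [320, 176]):.1f} px")
```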
One frequent failure mode we observe with RT-Goal-Image is its tendency to exhibit excessive retrying behavior. We hypothesize that the policy over-attends to pixel-level differences and, as a result, repeatedly attempts to manipulate objects that are already reasonably aligned. This failure to terminate introduces a risk of toppling objects or misaligning the scene, as visualized below. RT-Sketch and RT-1 are far less susceptible, as they appear to learn notions of alignment and termination that are not hypersensitive to pixelwise differences. Below, we provide visualizations of RT-Goal-Image's behavior during evaluation to illustrate this excessive retrying.
We highlight several motivating examples where sketches can either guide manipulation directly or complement other modalities:
We argue that sketches allow for a loose but useful level of goal specification. For instance, imagine setting up a fancy dinner table. If we were to describe the desired setup on the left in language, we would likely end up with a short and underspecified description, such as: “The utensils go around the plate. The cups go next to each other. The forks go next to each other. The plate should have a knife on it.” There are three sets of utensils here, making it difficult to tell how they should be arranged around the plate, and multiple instances of cups, forks, and plates, making the instructions unclear. To effectively disambiguate, a person would need much longer descriptions or incredibly specific instructions like "put the smaller fork 2 cm to the left of the larger one," which is inconvenient to do for every single item on the table. We argue that sketches provide a modality that far more effectively specifies “roughly” what the user cares about spatially.
For mobile robots operating in households, arranging furniture is a potential use case which is difficult to accomplish with language instructions alone. Here, a sketch captures the relative placements and orientations of the furniture in a way that would take much longer to describe with language. Still, there are opportunities to combine sketches and language such as through textual annotations, for instance.
Especially for long-horizon tasks such as multi-step folding or assemblies, a sketch can help visually convey what to manipulate in subgoals. Here, the sketch can serve as more of a schematic or diagram of the desired task, rather than just a depiction of the final desired scene.