
LeRobot

I took a break from robotics after a machine learning project on semantic navigation.

I was at a Christmas party and heard about an open-source initiative for 6 DoF robotic arms called LeRobot. There were even ready-to-print Bambu 3D printing files available, so I printed the SO-100 and got to assembling it as soon as my motors arrived.

Assembling LeRobot follower arm

There was an issue where one motor (specifically the 3rd joint from the bottom) would just blink and not work at all. I tried getting help, but eventually gave up and packed it away for a bit.

After a 3-month break and moving to a different apartment, I pulled it out, debugged the motor, and finally got it working.

Explaining solution to blinking motors

LeRobot teleop demo

The demo above was run with the following command:

python lerobot/scripts/control_robot.py \
  --robot.type=so100 \
  --control.type=teleoperate

This is teleoperation ("teleop") where I control a leader arm that the follower arm mimics.

Goals

Let me define the goal for this project:

Move T Demo

Move-T is my challenge. The goal is to place the red T on the yellow outline.

How is something like this possible?

How

The magic is in the ACT (Action Chunking with Transformers) model that we train. It takes human demonstration data from teleoperated robots and trains a transformer to predict sequences of future actions (called 'action chunks') rather than single actions, which reduces error accumulation and enables precise manipulation with relatively little training data. For example, at 30 fps, a chunk of, say, 100 actions covers roughly 3 seconds of planned motion per prediction. I have a picture below that somewhat explains it, but to learn more you can read about the ALOHA setup.

ACT Model Diagram

1st Attempt

Setup:

I ran the following command to get data:

python lerobot/scripts/control_robot.py \
  --robot.type=so100 \
  --control.type=record \
  --control.fps=30 \
  --control.single_task="Move T to where it belongs." \
  --control.repo_id=${HF_USER}/so100_move_t2 \
  --control.tags='["so100","tutorial"]' \
  --control.warmup_time_s=10 \
  --control.episode_time_s=15 \
  --control.reset_time_s=15 \
  --control.num_episodes=40 \
  --control.push_to_hub=true

Note: I pushed to Hugging Face in case my MacBook dies during training.
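
For completeness, training the ACT policy on this dataset looks roughly like the command below. I'm going from the LeRobot tutorial of this era, so flag names may differ slightly between versions, and the output directory and job name are just illustrative (on the MacBook the device is mps; on a GPU machine it would be cuda):

python lerobot/scripts/train.py \
  --dataset.repo_id=${HF_USER}/so100_move_t2 \
  --policy.type=act \
  --output_dir=outputs/train/act_so100_move_t \
  --job_name=act_so100_move_t \
  --device=mps \
  --wandb.enable=true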

Results:

1st Attempt

This took 26 HOURS to train. The results were terrible: the robot couldn't complete the task reliably and just kept slamming itself into the table.

Key learnings:

2nd Attempt

Problem Analysis: I began to think about the problem from first principles. The laptop camera was giving a poor perspective of the task, and I realized the camera angle was likely confusing the model.

Setup Changes:

Deleting data

Training Workflow:

Colab → Setup LeRobot → Download training data (HF) → Train → Upload model (HF)
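
In shell terms, each Colab run boiled down to something like this (a sketch rather than the exact cell contents; the repo id comes from the record command above and the output paths are illustrative):

# set up LeRobot in the fresh Colab runtime
git clone https://github.com/huggingface/lerobot.git
cd lerobot && pip install -e .

# log in so train.py can pull the dataset from the Hub
huggingface-cli login

# same training command as before, now on the Colab GPU
python lerobot/scripts/train.py \
  --dataset.repo_id=${HF_USER}/so100_move_t2 \
  --policy.type=act \
  --output_dir=outputs/train/act_so100_move_t \
  --job_name=act_so100_move_t \
  --device=cuda \
  --wandb.enable=true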

Results:

2nd Attempt

Wandb metrics:

Attempt 2 wandb

Key learnings:

3rd Attempt

Research Phase: I began researching what other successful robotic systems did for camera positioning. Looking at systems like ALOHA, I noticed they used multiple camera angles including wrist-mounted cameras.

Aloha setup

Key Insight: Based on the research, I took away the following:

Camera positioning research

Setup Changes:

Wrist camera printing

New setup data visualization
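
The visualization above is LeRobot's dataset viewer pointed at the new recordings; a command along these lines serves it in the browser (the repo id here is a placeholder for whatever the wrist-camera dataset was named):

python lerobot/scripts/visualize_dataset_html.py \
  --repo-id ${HF_USER}/so100_move_t_wrist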

Training Workflow: The workflow itself didn't change, but I had to run training twice because I didn't upload the model to Hugging Face the first time before the Colab kernel closed.
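
The fix for next time is simply to push the checkpoint to the Hub as soon as training finishes, before the runtime can vanish; something along these lines should do it (the model repo name is hypothetical, and the checkpoint path follows LeRobot's default output layout as far as I know):

huggingface-cli upload ${HF_USER}/act_so100_move_t \
  outputs/train/act_so100_move_t/checkpoints/last/pretrained_model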

Wandb results

Results:


Warning: the following is completely autonomous.

3rd Attempt take 1

3rd Attempt take 2

3rd Attempt take 3

3rd Attempt take 4 (the only failed one)
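
For reference, these autonomous rollouts use the same record script as data collection, just with a trained policy checkpoint supplied instead of the leader arm driving; roughly like this (the eval repo id and checkpoint path are illustrative, and the policy flag may be named differently in other LeRobot versions):

python lerobot/scripts/control_robot.py \
  --robot.type=so100 \
  --control.type=record \
  --control.fps=30 \
  --control.single_task="Move T to where it belongs." \
  --control.repo_id=${HF_USER}/eval_act_so100_move_t \
  --control.num_episodes=10 \
  --control.policy.path=outputs/train/act_so100_move_t/checkpoints/last/pretrained_model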

Recap

Wow. It was such an amazing feeling to see that arm work, and it truly opened my eyes to the breakthroughs that are coming for civilization.

Obviously, my arm isn't perfect. But the trajectory is heading towards a world where you verbally tell your robot to do anything and it can do it for you.

These are just a few supremely useful examples of tasks that, if they were taken care of for you, would make everyday living that much more enjoyable.

I discussed with a friend what the initial requirements for a consumer robot would be, and we came up with the following:

Q&A

Q: My brother asked a good question - why do we need both a leader and follower arm? Why can't you just guide a single arm directly?

A: We're training a vision-based policy (ACT) that learns from visual demonstrations. Having a human hand directly in the scene would interfere with the visual encoding, as the model would learn to expect human hands during execution. The leader-follower setup keeps the demonstration space clean while still capturing natural human movements.


Next Steps