June 30, 2026
For the past few months I have had a UFactory xArm sitting on my desk. Over time I've run through a variety of ideas to try and get the arm to play a successful full length game of chess against me, starting with stitched together classical algorithms and working up through fine-tuned foundation models. This is a collection of the ideas that I've tried from most brittle to most general, accompanied by a series of demo videos.
Some notes about my setup: I have a 5 DOF arm that I use to descend vertically for pick and place. I have an Intel D435i camera on the wrist and another mounted overhead, about 3 feet off the ground. Most of the inference I run locally on an Nvidia Jetson Orin but I offload some of the heaviest training and computation to Modal. For the first few sections I used a wooden board from my apartment, however I eventually switched to a generic black and white tournament set because I expect the models to be better at understanding the standard looking high contrast sets.
Just using classical manipulation algorithms, you can get pretty far. The wrist camera can calibrate the board and compute the camera gripper offset using classical computer vision.

Then we precompute the piece sizes and hard code an algorithm to move a given piece from square A to square B. Capturing pieces and taking them off the board is slightly more annoying but solvable. For this, we need to be able to take a piece off the board and place it on the side in an area that is flat with no other pieces. The camera has a depth feature that can be used to find a smooth spot and place the piece off to the side. Then, all we need is a method to interpret the user's move and to wire it up to Stockfish or another strong chess engine. For this, I passed the board state before and after to Gemini 3.0 Pro and asked it to detect the move made, which works reasonably well but adds some latency. You can see a video of this below.
This approach runs into a few problems. First, it is extremely brittle. If the board gets knocked even slightly, the entire calibration process must be redone. If a piece gets misplaced even slightly, on the second attempt to grab it the arm will smash into the piece on descent, oftentimes damaging it and shaking the table. If it is trying to take a piece that I moved, I must leave it in the exact perfect position otherwise the grasp will fail. Also, if the system is turned on in any position that is not the starting position, it will be totally lost and unable to operate because it essentially has no vision.
To improve the reliability of the system we need some sort of vision. The natural algorithm is YOLO, a state-of-the-art model for real-time object detection. It generates bounding boxes around all the objects in the field of view and can run on my local GPU.
The out-of-the-box YOLO models do not support the full range of chess pieces as classes and so I had to fine-tune it on a variety of board states. Even with hundreds of data points, this model still struggled on crowded boards (I was able to speed up the data collection algorithm by having the previous system play out games and tag objects with the depth camera).

As you can see with this example, the model struggles with two parts of this problem. First, the pieces are hard to distinguish from the overhead view. Second, there are a lot of pieces here and they blend into the board a bit and so the model gets confused. I experimented with some side angles but the model was just too unreliable to let the entire board state rely on just the vision pipeline. This testing inspired a perception benchmark I published a few months ago, which suggests that even frontier VLMs cannot solve this problem out of the box.
Still, I was able to use this model to do a small amount of last-mile improvement on the gripping algorithm to increase the probability of a successful grasp. A brief aside on grasping chess pieces. The arm that I am using is reasonably wide and boxy, which presents some issues. If it is trying to grab a pawn that is on a square next to the queen, it is likely to knock the queen on the descent. I never fully solved this problem, but one idea is to rotate the arm 45 degrees when grasping circular pieces surrounded by other tall pieces. Annoyingly, this works for everything except for knights. Knights are the trickiest piece to grab. Every other piece can be grasped from any rotational angle, but knights really only have one natural way to be picked up.
Over the past few years the robotics field has moved away from classical manipulation and toward much more general end-to-end models. These models have a more general imitation learning pipeline and are even beginning to implement some aspects of reinforcement learning. There is no consensus on the best architecture for these models (I wrote more about this here) but the general idea is to train on a series of tasks across different robot embodiments and build a strong prior about the world. The input is the robot position and two camera views, and the output is the trajectory. The most common version of this is called a VLA and it also has a strong language backbone so that it can understand a diverse array of text based tasks, even if they are not perfectly in the training set.
In theory, a general enough robotics model should be able to play chess out of the box. It may not be smart enough to make the optimal move (language models still struggle with chess), but the task is well defined and chess pieces/boards should be in any training set. However, this is certainly not the case with current models. The best existing open-source VLA is π0.5 from Physical Intelligence, but even this model struggles on robots and tasks not in its training data. By using a VLA, there is some hope of learning general chess movement and then stitching it together with a powerful engine and a detection system to play a full game.
The industry standard approach for complex manipulation is to fine-tune the model for the embodiment and the task using a series of example teleoperations. This means collecting teleoperation data for the task, and then running what would be equivalent to SFT for a language model. This teaches a more general kind of imitation, since the model hopefully brings some prior for robotic movement from pre-training. I used LoRA fine-tuning to avoid causing catastrophic forgetting issues, although I cannot say for certain how much more performance I could squeeze out with a higher quality post-training stack.
Collecting teleoperation data is not very hard but it is extremely boring. I tested this out by wiring up a PlayStation controller to my robot and collecting 50 examples before giving up. Luckily, I have another solution to this problem. I already have a brittle but workable pipeline to move pieces around the board from the previous system, and so I can collect fake teleoperation data that is just a scripted algorithm moving pieces while recording the camera views. Then this data can be used to teach the model to reason about grasping. It is possible this type of data loses some of the variance introduced by teleop. I am not sure how important this is, but I varied the square locations substantially to give the task some level of variety.
One note on my robot. As far as I know it is pretty non-standard. I only have 5 DOF, which is theoretically enough to play chess but below the much more common 6 and 7 DOF arms that have full range of motion. I restrict the movement even further, only modeling x, y, z coordinates, yaw, and gripper width. This keeps the arm always facing down but is enough to complete any chess task in my experiment. This may make the transfer from fine-tuning less robust than more common embodiments.
To get a feel for the strength of these models, I started out with the simplest possible problem, moving a white queen from one square to another. The task description is “move the white queen from g3 to b3” and the board is entirely empty except for a white queen on g3. I then recorded about 400 examples of these types of moves. The idea is that at the very least, this data should teach the model how to pick and place with the embodiment. If it is truly general, this should be enough information to extrapolate to a bunch of different shapes of tasks around the chess board and this can be scaled up.
I am aware that robotics as a field is moving extremely fast and that there are plenty of more advanced models inside the labs that are unreleased (including by Physical Intelligence). As I work through the performance of this model, keep in mind that this is a snapshot of the state of open-source models and that most frontier companies are working with something more advanced. If anyone would like to test a more advanced model, please reach out.
Working through this problem, I did not have too strong of a prior around how much the model should learn from my demonstrations. To get a baseline, I ran a pick and place task on the base model. It does not flail as badly as I expected, but it still moves the arm around randomly with no coherence.
The hope with the fine-tuning step is that at the very least, the model begins to understand how to operate the embodiment. 400 examples should be enough to get a feel for the gripper and how the camera views respond to x, y, and z coordinate updates. After running a few tests, I saw that the model was reliably able to pick up the white queen and move it to a square that is in roughly the correct area of the target.
However, to really benchmark how much the model understands about the world we must vary specific parts of the setup and run a series of additional tests.
To test the negative case, I put the queen on b7 but asked the model to move the queen from g2 to g7. In this case, the ways I could imagine this playing out are
The resulting video falls into category 4 (I tested this multiple times) suggesting that the model needs a piece somewhat in the vicinity to make a grab. However, it does not have any mechanism to immediately abort and so it ends up in some no-op oscillation pattern.
The next ablation is to test whether the model can manipulate a piece that isn't a white queen. The first test is to check if it can perform the same movements with a black queen. It seems to not struggle with this much.
This alone isn't enough, perhaps it just memorized the path between the squares and will move any object. To make this test more robust, we can set up both queens on the board and give a more vague task of “move the white/black queen to c6.”
Here, we see the model moves the white queen imprecisely in both tasks. I tested this multiple times and concluded that the model is overfit to the white queen and so probably does not have a robust concept of white queen vs. black queen.
If the pre-training had a strong world model prior, we would expect it could generalize to other pieces. The main issue with grabbing something like a pawn is that it is shorter and so the traditional queen grasp will fail. In fact, this is what we see with the pawn movement task.
To confirm this, I also tested a white king which is the same height as a queen. This movement works successfully as it seems to replicate the queen grasp.
A highly general model with real world knowledge would understand that the chessboard notation is defined by the markings on the edge, and that rotating the board 180 degrees would impact the pick and place. However, when I rotate the board, the model still searches for the piece in the old location of the board before eventually giving up.
This roughly matches the video above where we asked it to move a queen that was not there. The next task is to keep the orientation the same but shift the board a few inches to the side.
This one is more interesting, it sees enough of the queen on the screen to go for a grasp but it grasps in the old board location and not the new location. So the vision is probably not baked into the precision of the grasp.
Even though there is no fine-tuning data corresponding to just a pick up with no place, we can ask the model to pick up the white queen and it seems to understand what this means. There is almost certainly a lot of pickup task data in the pre-training and so I would have guessed that it can extrapolate in this way. However, this task fails to generate a grasp over multiple tries.
The last axis I was curious about is task composition. If a single task is normally a free form sentence, could we put two sentences together and get the model to do two things in a row? It doesn't go as poorly as the subtask, but it does not really perform the composition of tasks.
This type of movement and reliability is clearly not enough to enhance my previous chess setup and so it brings a temporary end to the experimentation. One obvious critique is that if I had captured a lot more varied data (more piece types, more board rotations, more crowded setups, etc) that this model would eventually begin to learn and generalize. I may attempt this in a future post although this was sufficiently brittle to make me skeptical that this will get to my desired level of performance without a more advanced base model.
The holy grail here would be a model where we can use limited teleoperation data for an embodiment and lean on the pre-training understanding of objects and concepts to generalize well outside the training distribution.
The main pain point here is that we are learning very shallow imitation and not deep concepts. In language modeling, the idea of verifiable rewards has a lot of momentum because it instills a deeper type of understanding into the model. If I could somehow run rollouts of chess moves and grade them, I imagine it would be easier to learn more complex concepts. Again, there are no publicly available RL training scaffolds for π0.5 although there are numerous ideas in the literature about how to do this correctly. I thought a bit about whether I should be including negative samples into the fine-tuning data somehow but I am not aware of the best practices around this.
In a world with a robust RL post-training stack, I could imagine a pipeline where I run a rollout of a chess move and then have a frontier VLM grade the rollout. This loop could run overnight and improve the performance of the system slowly.
Following the state of vision research, I am quite optimistic that there will be stronger models for me to build in the near future that are more sample efficient in post-training. If anyone reading this has a private foundation model they would be interested in letting me test, my email is bqbrady@gmail.com.