Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automated, leading to class tensions between the automated and non-automated. Instead, he predicts that nearly all jobs will be automated simultaneously, putting everyone "in the same boat." However, based on my experience spanning AI research (including first author papers at COLM / NeurIPS and attending MATS under Neel Nanda), robotics, and hands-on manufacturing (including machining prototype rocket engine parts for Blue Origin and Ursa Major), I see a different near-term future.
Since the GPT-4 release, I've evaluated frontier models on a basic manufacturing task, which tests both visual perception and physical reasoning. While Gemini 2.5 Pro recently showed progress on the visual front, all models tested continue to fail significantly on physical reasoning. They still perform terribly overall. Because of this, I think there will be an interim period in which a significant portion of white collar work is automated by AI while many physical world jobs remain largely unaffected.
(Estimated reading time: 7 minutes, 12 minutes with appendix)
The Evaluation
My evaluation is simple - I ask for a detailed plan to machine this part using a 3-axis CNC mill and a 2-axis CNC lathe. Although the part is not completely trivial, most machinists in a typical prototype or job shop setting would treat machining it as a routine task involving standard turning and milling techniques across multiple setups. It is certainly much simpler than the average component at both shops I worked at. For context, compare the brass part's simplicity to the complexity of aerospace hardware like these Blue Origin parts.
Although this part is simple, even frontier models like O1-Pro or Gemini 2.5 Pro consistently make major mistakes. These mistakes fall into two categories: errors in visual perception and errors in physical reasoning.
Visual Errors
Most Models Have Truly Horrible Visual Abilities: For two years, I've observed essentially zero improvement in visual capabilities among models from Anthropic and OpenAI. They consistently miss obvious features like the flats cut into the round surface or the holes, and even hallucinate nonexistent features such as holes drilled along the part’s length. I have never seen Claude 3.5, Claude 3.7 (thinking and non-thinking), GPT-4.5, GPT-4o, or O1-Pro produce a reasonable description of the part. Without accurate visual perception, creating a manufacturing plan is completely hopeless.
Interestingly, many of these models also score at or above the level of some human experts on visual reasoning benchmarks like MMMU. That which is easy to measure often doesn't correlate with real world usefulness.
Gemini 2.5 Pro Makes Significant Vision Progress: I was surprised when I saw Gemini 2.5 make major progress in vision capabilities. On roughly one out of four attempts, it identifies most of the major features without extra hallucinations, though it still always misses subtle details like the corner radius on the milled flats. Some details it captures are genuinely impressive. For example, I have to look closely to identify the two flats positioned exactly 180 degrees apart. However, this improved vision mostly serves to reveal deeper, unresolved issues.
Physical Reasoning Errors
Previously, it was hard to separate visual misunderstandings from deeper physical reasoning problems. Now, even when working from an accurate visual interpretation, Gemini 2.5 still produces machining plans filled with practical mistakes, such as:
Ignoring Rigidity and Chatter: The part is long and slender relative to its diameter. Attempting to machine it with standard techniques, as Gemini often suggests, would likely cause the part to deflect (bend slightly under tool pressure) or vibrate rapidly against the cutting tool (a phenomenon called 'chatter'). Both issues ruin surface finish and dimensional accuracy. Beginner machinists should instantly recognize that rigidity is critical for a long, slender part like this. When specifically asked about chatter, Gemini reaches for textbook fixes like a tailstock and applies them in ways that would worsen other problems, such as bowing, in this long, thin brass part.
Physically Impossible Workholding: Gemini usually proposes ways to clamp the part (workholding) and sequences of operations that are physically impossible. Its most common suggestion is to clamp the part in a fixture (specifically, a collet block), machine some features, then rotate the fixture to machine other features. However, this simply cannot work: the fixture itself blocks access to the new features. This is obvious if you mentally walk through the process or look at the setup.
Additional details, including a more detailed description of the mistakes AI models make and a reference gold standard plan, are in the Appendix.
My high-level impression when reading the responses is “someone who can parrot textbook knowledge but doesn’t know what they’re talking about”. The models are very eager to offer textbook knowledge about, e.g., recommended cutting speeds, but are completely incorrect on important practical details. This matches my conversations with friends and former colleagues in manufacturing and construction: current LLMs are seen as almost entirely useless for the core, hands-on aspects of their jobs.
This Evaluation Only Scratches the Surface
This task of generating a text plan represents one of the easiest parts of the job. Real machining demands managing many details behind every high-level step. Just selecting a cutting tool involves considering tip radius, toolholder collision clearance, cutting tool rigidity, coating, speeds/feeds, and more — often with direct trade-offs, like clearance versus rigidity. Many factors like ensuring tool clearances against the part and fixture are inherently spatial, which is impossible to evaluate fully via text. If models fail this badly on the describable aspects, their grasp of the underlying physical realities is likely far worse.
In fact, getting to the point of being actually useful requires overcoming a hierarchy of challenges, each significantly harder than the last:
1. Accurate Visual Perception: The foundational step is to correctly identify all geometric features and relationships from the input image. This requires almost no spatial reasoning ability, yet most models still perform terribly. Even Gemini 2.5 Pro is very bad at this, and falls apart if pushed at all.
2. Basic Physical Plausibility: Beyond just seeing the part, the model must propose operations and setups that are physically possible. This involves basic spatial reasoning to ensure that e.g. tool access isn't blocked by fixtures. This may require moving beyond a text chain of thought - when I mentally walk through a setup, I don't use language at all, and instead just visualize every step of the operation.
3. Incorporating Physical Knowledge: Successfully machining requires understanding real-world physics and tacit knowledge. This is often learned through hands-on experience and is poorly captured in existing datasets.
4. Process Optimization: Handling the detailed considerations within steps 1-3 is necessary just to produce one part correctly. As Elon Musk likes to say, manufacturing efficiently is 10-100x more challenging than making a prototype. This involves proposing and evaluating multiple feasible approaches, designing effective workholding, selecting appropriate machines, and balancing complex trade-offs between cost, time, simplicity, and quality. This is the part of the job that's actually challenging.
Steps 2-4 could be challenging to address via synthetic data generated in simulations. Almost all machinists I've talked to have (completely valid) complaints about engineers who understand textbook formulas and CAD but don't understand real-world manufacturing constraints. Simulation environments seem likely to create AIs with the same shortcomings.
Why do LLMs struggle with physical tasks?
The obvious reason why LLMs struggle here is a lack of data. Physical tasks like machining rely heavily on tacit knowledge and countless subtle details learned through experience. These nuances aren't typically documented anywhere.
This isn't because experts are deliberately withholding secrets - instead, documenting this granular, real-world knowledge is impractical and inefficient. Just as software engineers rarely document their entire reasoning for each line of code, machinists don't document every consideration for setting up a single part. It's far quicker and more effective to teach someone adaptable skills through hands-on experience with a mentor than through textbooks or memorized procedures.
This highlights a major difference from fields like software engineering or law. Although software engineers or lawyers may not explicitly record every reasoning step, they produce artifacts like code, version control history, and contracts which have very rich and detailed information. In physical tasks, equivalent detailed information certainly exists, but it's embedded in the 3D world - such as interactions between tools, materials, and physical forces - in formats which are very difficult to digitize effectively.
As a result, LLMs are great at regurgitating the textbook knowledge that is available, but that falls far short of what these tasks actually require.
Improving on physical tasks may be difficult
Empirically, frontier models are currently bad at these tasks. Is this just a temporary hurdle that will soon be overcome? I’m not sure, and I have speculative arguments for both why future progress might be difficult and why it might be easier than expected.
One obvious explanation is that LLMs aren’t good at physical tasks because no one has put in much effort yet. However, improving physical world understanding could be challenging. The recipe for improving coding ability has relied on massive amounts of training data and clear reward signals enabling RL and synthetic data. However, this breaks down for physical tasks.
Lack of Verifiable Rewards: Defining a reward signal for complex physical tasks is hard. Defects in a machined part might only show up as slightly increased failure rates years later, and incorrectly applied waterproofing as rot that appears long after the work is done. Feedback loops can be long and outcomes are difficult to measure automatically.
Slow, Expensive, and Dangerous Trial-and-Error: Learning through RL or generating synthetic data could be difficult. I've personally caused thousands of dollars in damage due to mistakes in my early years of manufacturing, which is not uncommon. Single mistakes can easily cost hundreds of thousands of dollars in damage. Unlike running buggy code, mistakes with heavy machinery or construction can have severe consequences. I also caused tens of thousands of dollars in lost revenue due to my lower productivity when learning - gaining experience usually requires the use of expensive and limited resources, not just a few GPU hours.
However, there are also reasons why this could be easier than expected:
The Automated AI Researcher: AI is making major progress in coding and AI research, and we may reach an automated AI researcher in the near future. Perhaps an automated AI researcher could solve these challenges by creating much more sample-efficient algorithms or large amounts of simulated data.
Synthetic Data: There are obvious approaches that haven’t been well explored. For example, we can create a lot of data using simulations, although there will be a gap between simulation and reality. For this specific manufacturing process (CNC machining), CAM software can accurately simulate most operations. However, there are a ton of diverse manufacturing processes, many of which don’t have good simulation solutions.
Potential Implications of Uneven Automation
If this trend holds, we might face a period where remote work sees significant automation while skilled physical jobs remain largely untouched by AI. This “automation gap window” could last for an unknown duration and carries potential implications:
Class Conflicts: There could very easily be major class conflict between the automated and non-automated professions, especially because there are other underlying differences between these groups. White collar workers are more likely to see displacement, and they typically earn more and have more liberal political beliefs. These differences could exacerbate tensions and lead to major economic pain for automated groups.
Popular Opposition to AI: This could result in popular opposition against further AI research. Groups such as blue collar workers now have evidence that automation can happen really quickly, and they may not want to be automated. This could stall further AI progress and lengthen this window.
Geopolitical Bottlenecks: If most knowledge work is automated, physical capabilities like manufacturing could become the bottleneck in technological progress or defense (e.g., during an AI arms race). Nations such as China, with a much stronger industrial base, could gain a significant strategic advantage.
There’s a lot of uncertainty and tension here. For example, if manufacturing becomes a strategic bottleneck, the government may be able to fend off popular opposition to AI.
Conclusion
While it’s unclear how long this uneven automation gap might persist, its existence seems likely. Surprisingly few in AI research discuss this - perhaps because many in the field aren’t very familiar with manufacturing or other physical-world domains. Anyone working in policy, planning their career, or concerned about social stability should start considering the implications of a partial automation scenario seriously.
Acknowledgements: I am grateful to Neel Nanda, Kevin Liu, and Tim Butler for valuable feedback on this post.
Appendix
Here are links to plans generated by Claude 3.7 Thinking, GPT-4.5, O1-Pro, GPT-4o, and Gemini 2.5 Pro. My plan for machining the part is here. Below I have detailed descriptions of various errors made by the models.
Visual Errors
General Poor Performance (Non-Gemini Models): For nearly two years, models from Anthropic and OpenAI showed essentially zero improvement in visual perception for this task. Reviewing the transcripts, the errors are consistently egregious – missing obvious features like the milled flats or cross-holes, sometimes even hallucinating features like non-existent axial holes. No model tested, apart from Gemini 2.5 Pro, produced anything resembling an accurate description of the part geometry, making any subsequent machining plan fundamentally flawed from the start.
Gemini 2.5 Pro - Vision Progress but Persistent Flaws: Gemini 2.5 Pro represents a significant step forward visually. On roughly one out of four attempts, it identifies most major features without major hallucinations. However, it still makes consistent, if more subtle, visual mistakes:
Missed Details: It consistently fails to identify the corner radius on the milled flats. Recognizing these radii is critical for selecting the correct end mill and becomes second nature to machinists, as it's a factor in nearly every tool selection.
Occasional Hallucinations/Misinterpretations: It sometimes hallucinates features like a slot on one end of the part (which isn't present) or logically inconsistent features, such as suggesting two sets of threaded cross-holes – impossible given the part's small diameter due to insufficient material for thread engagement.
Inconsistent Feature Identification: It occasionally misses one of the two flats, despite correctly identifying their 180-degree opposition in other attempts (a detail which, to be fair, requires close inspection). It will also sometimes miss some of the existing holes.
Physical Reasoning Errors
Gemini’s plan is the only one worth reviewing, as it doesn’t have the major confounder of poor vision abilities. Other models have many egregious errors in every generated plan, but it’s difficult to distinguish if this is due to a misunderstanding of the part or poor physical reasoning abilities. Even when working from a relatively accurate visual interpretation, Gemini 2.5 Pro's plans are filled with significant flaws in practical physical reasoning and tacit knowledge.
Poorly Applied Textbook Solutions: Ignoring Rigidity and Chatter
Failure to Identify Risk: Gemini 2.5 consistently fails to recognize that the part, with a length-to-diameter ratio clearly exceeding 10:1 (based on provided dimensions in the prompt), is highly susceptible to chatter and deflection. The common heuristic (L:D > 3:1 requires special attention) seems completely missed. This should be an obvious red flag, yet surprisingly it's ignored, undermining many of its proposed operations.
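To give a rough feel for why a high L:D ratio matters, here is a back-of-the-envelope sketch in Python using the standard cantilever deflection formula. The 3/16" diameter comes from the stock discussion below; the 2" stick-out, the 50 N cutting force, the brass modulus, and the cantilever model itself are all illustrative assumptions rather than values from my prompt.

```python
import math

# Back-of-the-envelope check of why a long, slender turned part is a rigidity problem.
# All numbers are illustrative assumptions (not the exact dimensions from the prompt):
# a ~3/16" diameter brass bar with ~2" of unsupported stick-out, modeled as a cantilever.

E_BRASS = 97e9                 # Young's modulus of brass, Pa (approximate)
diameter = (3 / 16) * 0.0254   # m, assumed finished diameter
length = 2.0 * 0.0254          # m, assumed unsupported stick-out from the collet
radial_force = 50.0            # N, assumed radial cutting force for a modest pass

# Area moment of inertia of a solid round bar: I = pi * d^4 / 64
I = math.pi * diameter**4 / 64

# Tip deflection of a cantilever with an end load: delta = F * L^3 / (3 * E * I)
deflection_mm = radial_force * length**3 / (3 * E_BRASS * I) * 1000

print(f"L:D ratio: {length / diameter:.1f}")                # ~10.7, far past the ~3:1 rule of thumb
print(f"Estimated tip deflection: {deflection_mm:.2f} mm")  # ~0.9 mm under these assumptions
```

Even with these mild assumptions, the predicted deflection is on the order of a millimeter - far beyond typical machining tolerances - which is exactly why the rule of thumb flags anything much past 3:1.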
When explicitly prompted about chatter concerns for a "long slender part," Gemini applies textbook solutions poorly. It suggests using a tailstock, common for long parts. However, for this specific small-diameter brass part, a tailstock is often not the best approach. There's minimal area for the tailstock center, and the required axial pressure can easily cause the slender part to bow, like a stiff spaghetti noodle buckling under axial pressure, especially in soft brass. Chatter and deflection are also still a concern. It will probably involve trial and error to get the process to work.
Gemini's lack of physical grounding is further exposed by its plan to “Turn Small Diameter 1 (Near Collet)”. This step implies cutting a 90-degree internal shoulder directly adjacent to the collet face. In practice, this is physically impossible with standard turning tools, which can’t reach into that corner due to tool geometry. Machinists typically address this with a grooving tool, which is specifically designed to cut square shoulders, or a back turning tool, which approaches the feature from behind to reach the back-facing surface. Both approaches introduce real-world trade-offs. Grooving tools can generate higher tool pressure, which risks deflection and chatter. Back turning tools, on the other hand, often require additional part stick-out to clear the spindle and may introduce blending issues.
Missing the Practical Solution: The standard, effective solution for a small, slender part like this often runs counter to basic textbook advice. Instead of multiple light passes, you might start with significantly oversized stock (e.g., 3/4" diameter for this 3/16" part) and machine the final diameter in a single, deep pass. This keeps the part supported by bulk material during the cut, maintaining a low effective L:D ratio and preventing chatter/bowing. It's a common technique taught by mentors that I have used several times. However, it directly contradicts typical guidelines, and Gemini fails to consider it.
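The reason this trick works comes down to a simple scaling law: the bending stiffness of a round bar goes with diameter to the fourth power, so the still-uncut stock ahead of the tool dominates rigidity. A minimal sketch, using the 3/4" stock and 3/16" finished diameters mentioned above:

```python
# Why a single deep pass from oversized stock helps: bending stiffness of a round bar
# scales as d^4, so the still-uncut 3/4" stock is vastly more rigid than the 3/16"
# finished diameter. Only the ratio matters here, not the absolute stiffness.
stock_dia, finished_dia = 0.75, 3 / 16   # inches (from the example in the text)
stiffness_ratio = (stock_dia / finished_dia) ** 4
print(f"Uncut stock is ~{stiffness_ratio:.0f}x stiffer in bending")  # ~256x
```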
Physically Impossible Workholding: The part has features needing milling at different locations and orientations: the flats/threaded holes near one end, and an orthogonal cross-hole near the opposite end. Gemini often suggests using a collet block in a mill vise. This may cause chatter problems due to the length of the unsupported part, but there are worse problems as well. Its typical sequence involves clamping one end, machining the flats/holes, then rotating the entire collet block 90 degrees (“Place the collet block back in the vise, but rotated 90 degrees...”) to machine the cross-hole. This simply doesn't work.
The collet block is still clamping the very end of the part where the cross-hole needs to be drilled. Rotating the fixture doesn't grant access; the fixture itself physically obstructs the tool path. This kind of spatial reasoning failure should be immediately obvious when mentally walking through the setup.
Underestimating Drilling Difficulty: When planning the orthogonal cross-hole, Gemini sometimes lists spot drilling as optional (“Spot Drill (Optional)”). For drilling such a small hole on a small-diameter, curved surface, a spot drill is arguably essential to prevent the drill from "walking" off-center and potentially breaking. Calling it optional is not a good idea.
Ignoring Collision Risks: The cross-hole is located very close to a larger diameter shoulder. Gemini often suggests standard spot drill sizes (e.g., ¼ or ½ inch). Given the tight clearance, a standard spot drill would likely collide with the shoulder before its tip properly centers the hole. This is a common "gotcha" – I've run into this exact issue myself. It requires careful tool selection and clearance checking.
While a collision here might only scrap the part (a small brass part is not very robust), this type of error (failing to check tool/holder clearance) can cause catastrophic damage with larger parts or harder materials. I’ve observed several bad crashes (and caused a couple myself) that required major repairs due to this exact error.
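To illustrate the kind of clearance check this requires, here is a deliberately simplified 2D sketch. The 1/2" spot drill comes from the example above; the hole-to-shoulder distance, shoulder height, and spotting depth are made-up illustrative numbers, and the model ignores the part's curvature entirely.

```python
# Simplified 2D clearance check: will a 90-degree spot drill hit the adjacent shoulder
# before its tip finishes spotting the cross-hole? All dimensions below are assumptions
# for illustration only, not the actual part's.

tool_dia = 0.5            # in, spot drill diameter suggested in the plan
hole_to_shoulder = 0.10   # in, assumed distance from the hole axis to the shoulder face
shoulder_height = 0.15    # in, assumed step height from the small diameter up to the shoulder
spot_depth = 0.06         # in, assumed plunge depth needed for a useful spot

# For a 90-degree (included angle) point, the tool's radius at axial height h above the tip
# is h, capped at the tool's full radius.
h = spot_depth + shoulder_height
tool_radius_at_shoulder_top = min(h, tool_dia / 2)

if tool_radius_at_shoulder_top > hole_to_shoulder:
    print("Collision: the spot drill's taper/body reaches the shoulder before spotting finishes")
else:
    print("Clearance OK with these numbers")
```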
Incorrect Work Reference (Z0 on Raw Stock): In plans where the part isn't fully cut off in Op 1, Gemini sometimes suggests setting the Z0 reference for Op 2 on the raw end of the bar stock still attached to the part. Referencing raw stock introduces unacceptable inaccuracy. Precise machining requires setting Z0 against a known, previously machined surface to ensure features are located correctly relative to each other.
Ignoring Delicate Threads: Conversely, when recommending a part-off in Op 1, Gemini fails to account for the delicate brass threads. Its plans imply letting the finished part drop freely after cutoff, which risks damaging the fine threads on impact. The standard, safer method for such parts involves leaving a small nub and breaking the part off manually, or using a parts catcher, to protect fragile features. Gemini will also sometimes recommend clamping on the threads with shim stock to protect them, but these brass threads are so fragile and easily damaged that this is not a good idea either.
I have read numerous papers suggesting that white-collar jobs, such as those of lawyers, will be easily replaced by AI before more concrete or physical jobs like those discussed by the author. However, I observe that even the most advanced models struggle with reliability in legal contexts, particularly outside of standardized multiple-choice questions and U.S. law, for which they have more training data. These models are good assistants with superhuman general knowledge and pretty good writing skills, but they show very uneven intelligence and reliability in specific cases, a tendency to say what the user/client wants to hear (even more than actual lawyers!), and a tendency to hallucinate judicial decisions.
While it is true that lawyers statistically win only about 50% of their cases and also make mistakes, my point is different. I observe a significant gap in AI's ability to handle legal tasks, and I question whether this gap will be bridged as quickly as some envision. It might be more comparable to the scenario of automated cars, where even a small gap presents a substantial barrier. 90% superhuman performance is great, 9% human-level is acceptable, but 1% stupid mistakes ruin it all.
Law is not coding. The interpretation and application of laws is not an exact science; it involves numerous cultural, psychological, media-related, political, and human considerations and biases. There is also a social dimension that cannot be overlooked. Many clients enjoy conversing with their lawyers, much like they do with their waitstaff or nurses. Similarly, managers appreciate discussing matters over meals at restaurants. The technical and social aspects are intertwined, much like the relationship between a professor and their students, or in other white-collar jobs.
While I do not doubt that this gap can eventually be bridged, I am not convinced that it will be filled for many white-collar jobs before more technical engineering tasks like those discussed here, or automated cars, are mastered. However, some white-collar jobs that involve abstract thinking and writing with minimal social interaction, such as those of theoretical researchers, mathematicians, computer scientists, and philosophers (am I describing the archetypal LessWronger?), may be automated sooner.
AI is very useful in legal matters and is clearly a promising sector for business. It is possible that some legal jobs (especially documentation and basic, non-personalized legal information jobs) are already being challenged by AI and are on the verge of being eliminated, with others to follow sooner or later. My comment was simply reacting to the idea that many white-collar jobs will be on the front line of this destruction. The job of a lawyer is often cited, and I think it's a rather poor example for the reasons I mentioned. Many white-collar jobs combine technical and social skills that can be quite challenging for AI.