All of Adam Karvonen's Comments + Replies

Model Vision of Pokémon Red is Bad. Really Bad.

Interesting that you found this to be the case! I recently wrote a post about evaluating LLMs on a basic manufacturing task, and I found the same thing. It's always a bit jarring for me to go from the text / code domain, where the LLMs feel so competent, to the vision domain, where I start to feel like Gary Marcus because the LLMs are so bad.

Relevant quote from my post:

"Most Models Have Truly Horrible Visual Abilities: For two years, I've observed essentially zero improvement in visual capabilities amo... (read more)

I don't think image understanding is the bottleneck. o3 and o4-mini-high seem like a meaningful improvement in vision, to the point where it's almost good enough for this part, but they still fail miserably at the physical reasoning aspects.

This person got o4-mini-high to generate a reasonably close depiction of the part.

https://x.com/tombielecki/status/1912913806541693253

I also tested o3, and it looks better than Gemini 2.5 on vision. Although it missed the second flat, it correctly identified that the ends had different diameters and picked up on some genuinely impressive details, like the grooved thread relief behind the larger thread.

However, it's still terrible at spatial reasoning, and I now feel more confident in the argument in my post. o3 proposes many egregious, physically impossible operations. For example, it recommends enclosing 2.2 inches of the part in the collet and then facing the part down to the finished leng... (read more)

1Jonathan Claybrough
Thanks for the followup!
2ReaderM
How about o4-mini-high? Supposedly, it's actually better than o3 at visual reasoning. I'm not expecting much better, just curious.

I do agree that it looks like there has been a lack of data to address this ability. That being said, I'm pretty surprised at how terrible models are, and there's a hierarchy of problems to be addressed here before models are actually useful in the physical world. Each step feels much more difficult than the step before, and all models are completely terrible at steps 2-4.

  1. First, simply look at a part and identify features / if a part is symmetric / etc. This requires basically no spatial reasoning ability, yet almost all models are completely terrible.

... (read more)

Hmm, I don't know. With the caveat that I'm not a legal expert, I do think there's a big difference between basically any job that can be done remotely most of the time and skilled physical labor jobs. I use LLMs for coding every day, and they still have tons of problems, but I do see significant progress happening. There is legitimate uncertainty over how long it will take for AIs to become reliable at tasks like coding.

Coding and ML research also require a lot of subjective taste, like writing easily understandable code with good abstractions or selecti... (read more)

1Raphael Roche
AI is very useful in legal matters and is clearly a promising sector for business. It is possible that some legal jobs (especially documentation and basic, non-personalized legal information jobs) are already being challenged by AI and are on the verge of being eliminated, with others to follow sooner or later. My comment was simply reacting to the idea that many white-collar jobs will be on the front line of this destruction. The job of a lawyer is often cited, and I think it's a rather poor example for the reasons I mentioned. Many white-collar jobs combine technical and social skills that can be quite challenging for AI.

Yeah, I agree. I currently feel like our ML approach is going to make very little real-world manufacturing progress, and that any progress will have to come from the automated AI researcher either brute forcing tons of synthetic data or coming up with new architectures and training procedures.

But, this is a low confidence take, and I wouldn't be shocked if a couple dumb tricks make a lot of progress.

This is an obvious step, but I'm a bit skeptical for a few reasons.

  • Current models are just so bad at vision tasks. Even Gemini 2.5 is pretty bad and falls apart when pushed to harder images. It really seems like identifying a feature on a part, or telling whether a part is symmetric, should be addressable by just scaling data, and these vision tasks are much easier than manufacturing details.

  • A lot of the work in manufacturing / construction would be in tactile details, which could be hard to capture with sensors. For example, a human finger can easily

... (read more)
2SorenJ
1. They're pretty bad, but they seem about GPT-2 level bad? So plausibly in a couple of years they will be GPT-4 level good, if things go the same way? 2. This does seem pretty difficult. The only idea I have is having humans wear special gloves with sensors on them, and maybe explain their thoughts aloud as they work, and then collecting all of this data. 3. Before you go to RL you need to train on prediction with a large amount of data first. We don't have this yet for blue collar work. Then once you have the prediction model, robots, and rudimentary agents, you try to get the robots to do simple tasks in isolated environments. If they succeed they get rewarded. This feels quite a bit more than 3 years away... In general, I think the idea is that you first get a superhuman coder, then you get a superhuman AI researcher, then you get an any-task superhuman researcher, and then you use this superhuman researcher to solve all of the problems we have been discussing in lightning fast time.

A $1 training run would be training 6 SAEs across 6 sparsities at 16K width on Gemma-2-2B for 200M tokens. This includes generating the activations, and it would be cheaper if the activations are precomputed. In practice this seems like large enough scale to validate ideas such as the Matryoshka SAE or the BatchTopK SAE.
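
For concreteness, a rough sketch of what that sweep could look like is below. The batch size, learning rate, layer, and specific sparsity values are illustrative guesses rather than the exact settings:

```python
# Illustrative sweep for a ~$1 run: six 16K-width SAEs on Gemma-2-2B activations,
# one per sparsity level. Batch size, learning rate, layer, and the sparsity
# values are assumptions for illustration, not the exact settings used.
sweep_configs = [
    dict(
        model_name="google/gemma-2-2b",
        layer=12,                  # residual stream layer to train on (assumed)
        activation_dim=2304,       # Gemma-2-2B hidden size
        dict_size=16_384,          # 16K width
        k=k,                       # target number of active latents per token
        num_tokens=200_000_000,    # 200M training tokens
        batch_size=4096,           # assumed
        lr=3e-4,                   # assumed
    )
    for k in (20, 40, 80, 160, 320, 640)  # six sparsity levels (assumed)
]
```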

5Neel Nanda
Yeah, if you're doing this, you should definitely precompute and save activations

SAEs are early enough that there's tons of low hanging fruit and ideas to try. They also require relatively little compute (often around $1 for a training run), so AI agents could afford to test many ideas. I wouldn't be surprised if SAE improvements were a good early target for automated AI research, especially if the feedback loop is just "Come up with idea, modify existing loss function, train, evaluate, get a quantitative result".

3Bogdan Ionut Cirstea
Ok, this seems surprisingly cheap. Can you say more about what such a $1 training run typically looks like (what the hyperparameters are)? I'd also be very interested in any analysis of how SAE (computational) training costs scale vs. base LLM pretraining costs. This sounds spiritually quite similar to what's already been done in Discovering Preference Optimization Algorithms with and for Large Language Models, and I'd expect something roughly like that to probably produce something interesting, especially if a training run only costs $1.

If you're looking for a hackable SAE training repo for experiments, I'd recommend our dictionary_learning repo. It's been around for a few months, but we've recently spent some time cleaning it up and adding additional trainer types.

It's designed to be simple and hackable - you can add a new SAE type in a single file (~350 lines). We have 8 tested implementations, including JumpReLU, TopK, BatchTopK, Matryoshka, Gated, and others, with BatchTopK recommended as a good default. Training is quick and cheap - training 6 16K width SAEs on Gemma-2-2B for 200M to... (read more)
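
To give a sense of what the BatchTopK activation rule does, here is a minimal PyTorch sketch (illustrative, not necessarily how the repo implements it): the top activations are selected jointly across the whole batch rather than per example.

```python
import torch
import torch.nn as nn

class BatchTopKSAE(nn.Module):
    """Minimal BatchTopK SAE sketch (illustrative, not the repo's implementation)."""

    def __init__(self, d_model: int, dict_size: int, k: int):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, dict_size) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(dict_size))
        self.W_dec = nn.Parameter(torch.randn(dict_size, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        pre_acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # BatchTopK: keep only the (k * batch_size) largest activations across
        # the entire batch, rather than the top k within each example.
        n_keep = self.k * x.shape[0]
        threshold = torch.topk(pre_acts.flatten(), n_keep).values.min()
        acts = torch.where(pre_acts >= threshold, pre_acts, torch.zeros_like(pre_acts))
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

# Usage: reconstruct a batch of residual-stream activations.
sae = BatchTopKSAE(d_model=2304, dict_size=16_384, k=64)
recon, acts = sae(torch.randn(32, 2304))
```

Selecting the top activations batch-wide lets the number of active latents vary per token while keeping the average sparsity fixed, which is part of why it's a nice default.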

The forward hook for our best performing approach is here. As Sam mentioned, this hasn’t been deployed to production. We left it as a case study because Benchify is currently prioritizing other parts of their stack unrelated to ML.

For this demonstration, we added a forward hook to a HuggingFace Transformers model for simplicity, rather than incorporating it into a production inference stack.
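
For readers unfamiliar with forward hooks, a generic sketch of steering a HuggingFace model this way is below. This is not the actual hook from the case study; the model, layer index, coefficient, and steering vector are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: an arbitrary small Llama-style model, layer, and coefficient.
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
LAYER = 12
COEFF = 4.0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
steering_vector = torch.randn(model.config.hidden_size)  # placeholder direction

def steering_hook(module, inputs, output):
    # Llama-style decoder layers return a tuple whose first element is the
    # residual stream; add the steering vector at every token position.
    hidden = output[0] + COEFF * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
prompt = tokenizer("def fizzbuzz(n):", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=30)
print(tokenizer.decode(out[0]))
handle.remove()  # restore normal behavior
```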

Rejection sampling is a strong baseline that we hadn’t considered, and it’s definitely worth trying out—I suspect it will perform well here. Currently, our focus is on identifying additional in-the-wild tasks, particularly from other companies, as many of Benchify’s challenges involve sensitive details about their internal tooling that they prefer to keep private. We’re especially interested in tasks where it’s not possible to automatically measure success or failure via string matching, as this is where techniques like model steering are most likely to be... (read more)

I agree. In particular, there's a huge advantage to the defender due to the scale of the model weights. All the defense has to do is raise the bar high enough that an adversary can't exfiltrate model weights during the lifetime of the model.

Even if the adversary gains access to the weak model, the filter still reduces the possible encoded information density, as I discuss here. I haven't done the experiments, but I would guess this reduction improves significantly if the adversary doesn't have access to the weak model.

Various ways of changing the weak model weights or retraining the weak model with new data could also increase the size of the defender's advantage.

I think it does stack if the perplexity filter is applied before the compression step, which relies on an underlying assumption that it's easier to secure the internet connection choke point rather than the entire data center.

I will try to continue the discussion here for simplicity.

The perplexity filter works best if the attacker doesn't have complete control of the entire data center, but instead limited access to some machines. An underlying assumption is that it's easier to secure the internet connection choke point, rather than everything going on inside the data center.

This can be stacked with compression by applying the perplexity filter before the compression step.

Assuming that the attacker has access to the perplexity filter model, it would still reduce the possible information density if they have to perform some sort of ari... (read more)

Another potential option (that could stack with text compression) is to filter outgoing data. In particular, if we only allow outgoing data to be language tokens, and we filter the outgoing data to have low perplexity as measured by a small, secret language model, then it could become very difficult to encode terabytes of model weights in a way that has low perplexity.

I discuss this idea more here.
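
To make the idea concrete, a minimal sketch of such a filter is below, with GPT-2 standing in for the small secret model; the threshold is an illustrative guess, not a tuned value.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

FILTER_MODEL = "gpt2"  # stand-in for the small, secret filter model
tokenizer = AutoTokenizer.from_pretrained(FILTER_MODEL)
model = AutoModelForCausalLM.from_pretrained(FILTER_MODEL).eval()

def perplexity(text: str) -> float:
    """Perplexity of outgoing text under the filter model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return torch.exp(loss).item()

def allow_outgoing(text: str, threshold: float = 40.0) -> bool:
    # Block anything the filter model finds too surprising. Model weights
    # encoded into text should look high-perplexity to a filter model the
    # attacker cannot query.
    return perplexity(text) < threshold
```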

2ryan_greenblatt
I don't think this stacks with compression - if you compress data then it is no longer low perplexity. Data which is compressed as well as you can should look random to you (rather than predictable). I think filtering like this is strictly worse than compression for reasons I discuss in my response here. (But I appreciate the idea!)

Thanks for this comment, by the way! I added a paragraph to the beginning to make this post more clear.

The purpose of this proposal is to limit anyone from transferring model weights out of a data center. If someone wants to steal the weights and give them to China or another adversary, the model weights have to leave physically (hard drive out of the front door) or through the internet connection. If the facility has good physical security, then the weights have to leave through the internet connection.

If we also take steps to secure the internet connection, such as treating all outgoing data as language tokens and using a perplexity filter, then the model... (read more)

2Adam Karvonen
Thanks for this comment, by the way! I added a paragraph to the beginning to make this post more clear.

I would guess that it would learn an exact algorithm rather than heuristics. The challenging part for OthelloGPT is that the naive algorithm to calculate board state from input tokens requires up to 60 sequential steps, and it only has 8 layers to calculate the board state and convert this to a probability distribution over legal moves.

I think it's pretty plausible that this is true, and that OthelloGPT is already doing something that's somewhat close to optimal within the constraints of its architecture. I have also spent time thinking about the optimal algorithm for next move prediction within the constraints of the OthelloGPT architecture, and "a bag of heuristics that promote / suppress information with attention to aggregate information across moves" seems like a very reasonable approach.

3gwern
Seems like the natural next step would be to try to investigate grokking, as this appears analogous: you have a model which has memorized or learned a grab bag of heuristics & regularities, but as far as you can tell, the algorithmic core is eluding the model despite what seems like ample parameterization & data, perhaps because it is a wide shallow model. So one could try to train a skinny net, and maybe aggressively subsample the training data down into a maximally diverse subset. If it groks, then one should be able to read off much more interpretable algorithmic sub-models.

In Othello, pieces must be played next to existing pieces, and the game is initialized with 4 pieces in the center. Thus, it's impossible for the top left corner to be played within the first 5 moves, and extremely unlikely in the early portion of a randomly generated game.

I had the following results:

Stockfish level 2 vs Stockfish level 0, 0.01 seconds per move, 5k games:

0 random moves: win rate 81.2%

20 random moves: win rate 81.2%

40 random moves: win rate 77.9%

The 95% confidence interval is about ±1%.

Stockfish level 15 vs level 9, 0.01 seconds per move, 5k games:

0 random moves: win rate 65.5%

20 random moves: win rate 72.8%

40 random moves: win rate 67.5%

Once again, the 95% confidence interval is about ±1%.

At 120 seconds per move, both of these level differences correspond to ~300 Elo: https://github.com/official-stockfish/Stockfish/commit/a08b8d4

This is 0.01 seconds... (read more)
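
For reference, matches like these can be scripted with python-chess along the lines of the sketch below. This is not the exact harness I used; the Stockfish path and the 100-game loop are placeholders.

```python
import random
import chess
import chess.engine

STOCKFISH_PATH = "/usr/bin/stockfish"  # assumption: local Stockfish binary
TIME_PER_MOVE = 0.01
N_RANDOM_MOVES = 20  # random opening moves before the engines take over

def play_game(strong_level: int, weak_level: int) -> float:
    """Score for the stronger engine (playing White): 1 win, 0.5 draw, 0 loss."""
    board = chess.Board()
    for _ in range(N_RANDOM_MOVES):
        if board.is_game_over():
            break
        board.push(random.choice(list(board.legal_moves)))

    with chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH) as white, \
         chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH) as black:
        white.configure({"Skill Level": strong_level})
        black.configure({"Skill Level": weak_level})
        while not board.is_game_over():
            engine = white if board.turn == chess.WHITE else black
            move = engine.play(board, chess.engine.Limit(time=TIME_PER_MOVE)).move
            board.push(move)

    winner = board.outcome().winner
    return 0.5 if winner is None else float(winner == chess.WHITE)

scores = [play_game(strong_level=2, weak_level=0) for _ in range(100)]
print(f"win rate: {sum(scores) / len(scores):.1%}")
```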

1kromem
Interesting results - definitely didn't expect the bump at random 20 for the higher skill case. But I think really useful to know that the performance decrease in Chess-GPT for initial random noise isn't a generalized phenomenon. Appreciate the follow-up!!

Both are great points, especially #1. I'll run some experiments and report back.

2Adam Karvonen
I had the following results: Stockfish level 2 vs Stockfish level 0, 0.01 seconds per move, 5k games: 0 random moves: win rate 81.2% 20 random moves: win rate 81.2% 40 random moves: 77.9% 95% confidence interval is about +- 1% Stockfish level 15 vs level 9, 0.01 seconds per move, 5k games: 0 random moves: 65.5% 20 random moves: 72.8% 40 random moves: 67.5% Once again, 95% confidence interval is about +- 1% At 120 seconds per move, both of these level differences correspond to ~300 Elo: https://github.com/official-stockfish/Stockfish/commit/a08b8d4 This is 0.01 seconds per move. It appears that less search time lowers the Elo difference for level 15 vs level 9. A 65% win rate corresponds to a ~100 Elo difference, while a 81% win rate corresponds to a 250-300 Elo difference. Honestly not too sure what to make of the results. One possible variable is that in every case, the higher level player is White. Starting in a game with a random position may favor the first to move. Level 2 vs level 0 seems most applicable to the Chess-GPT setting.
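
The win rate to Elo conversion in that comment follows from the standard logistic Elo expected-score formula; a quick sketch (treating the reported win rate as the expected score):

```python
import math

def elo_diff(win_rate: float) -> float:
    """Elo difference implied by an expected score under the logistic Elo model."""
    return -400 * math.log10(1 / win_rate - 1)

print(round(elo_diff(0.655)))  # ~111, in line with "a ~100 Elo difference"
print(round(elo_diff(0.812)))  # ~254, in line with "a 250-300 Elo difference"
```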

That's an interesting idea; I may test that out at some point. I'm assuming the softmax would be for kings / queens, where there is typically only one on the board, rather than for e.g. blank squares or pawns?

The model trained on the all-Stockfish dataset played at a level that was 100-200 Elo higher in my tests, with a couple of caveats. First, I benchmarked the LLMs against Stockfish, so an all-Stockfish dataset seems helpful for this benchmark. Second, the Stockfish-trained LLM would probably have an advantage in robustness, because I included a small percentage of Stockfish vs. random move generator games in the Stockfish dataset in the hopes that it would improve its ability.

Unfortunately, I haven't done an in-depth qualitative assessment of their abilities, so I can't give a more detailed answer.

Yes, in this recent OpenAI superalignment paper they said that GPT-4's training dataset included a dataset of chess games filtered for players with greater than 1800 Elo. Given gpt-3.5-turbo-instruct's ability, I'm guessing that its dataset included a similar collection.