All of wassname's Comments + Replies

I also found it interesting that you censored the self_attn using gradients. This implicitly assumes that:

  • concepts are best represented in the self attention
  • they are non-linear (meaning you need gradient-based rather than linear methods).

Am I right about your assumptions, and if so, why do you think this?

I've been doing some experiments to try to work this out (https://github.com/wassname/eliciting_suppressed_knowledge), but haven't found anything conclusive yet.

We are simply tuning the model to have similar activations for these very short, context-free snippets. The characterization of the training you made with pair (A) or (B) is not what we do, and we would agree that if that was what we were doing, this whole thing would be much less meaningful.

This is great. 2 suggestions:

  • Call it ablation, erasure, concept censoring or similar, not fine-tuning. That way you don't bury the lead. It also took me a long time to realise that this is what you were doing.
  • Maybe consider other ways to erase the separation of self-oth
... (read more)

Very interesting!

Could you release the models and code and evals please? I'd like to test it on a moral/ethics benchmark I'm working on. I'd also like to get ideas from your evals.

I'm imagining a scenario where an AI extrapolates "keep the voting shareholders happy" and "maximise shareholder value".

Voting stocks can also become valuable when people try to accumulate them to corner the market and execute a takeover; this happens in cryptocurrencies like CURVE.

I know these are far-fetched, but all future scenarios are. The premium on Google voting stock is very small right now, so it's a cheap feature to add.

I would say: don't ignore the feeling. Calibrate it and train it, until it's worth listening to.

there's a good book about this: "Sizing People Up"

What you might do is impose a curriculum:

In Meta's COCONUT they use a curriculum to teach the model to think shorter and differently, and it works. They teach it to think in fewer steps by compressing reasoning into latent vectors instead of tokens:

  • first it thinks with tokens
  • then they replace one thinking step with a latent <thought> token
  • then 2
  • ...


It's not RL, but what is RL any more? It's becoming blurry. They don't reward or punish it for anything in the thought token. So it learns thoughts that are helpful in outputting the correct answer.
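To make the curriculum concrete, here is a minimal sketch of how such staged replacement might look. This is my own illustration, not the paper's exact recipe: the `<thought>` placeholder string and the `apply_curriculum_stage` helper are hypothetical.

```python
# Minimal sketch of a COCONUT-style curriculum (illustrative, not the paper's exact setup).
# Assumes each training example carries its chain of thought as a list of step strings.

LATENT_TOKEN = "<thought>"  # placeholder standing in for a latent/continuous thought

def apply_curriculum_stage(question: str, steps: list[str], answer: str, stage: int) -> str:
    """At stage k, replace the first k reasoning steps with latent thought tokens.

    stage=0 reproduces ordinary chain-of-thought training; as stage grows, more
    of the explicit reasoning is hidden and must be carried by the model's
    latent computation instead of tokens.
    """
    k = min(stage, len(steps))
    latent_part = " ".join([LATENT_TOKEN] * k)
    visible_part = " ".join(steps[k:])
    return f"{question}\n{latent_part} {visible_part}\nAnswer: {answer}".strip()

# Example usage:
print(apply_curriculum_stage(
    question="What is 13 * 7?",
    steps=["10 * 7 = 70", "3 * 7 = 21", "70 + 21 = 91"],
    answer="91",
    stage=2,
))
```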

There's... (read more)

9gwern
That's definitely RL (and what I was explaining was simply the obvious basic approach anyone in DRL would think of in this context and so of course there is research trying things like it). It's being rewarded for a non-differentiable global loss where the correct alternative or answer or label is not provided (not even information of the existence of a better decision) and so standard supervised learning is impossible, requiring exploration. Conceptually, this is little different from, say, training a humanoid robot NN to reach a distant point in fewer actions: it can be a hard exploration problem (most sequences of joint torques or actions simply result in a robot having a seizure while laying on the ground going nowhere), where you want to eventually reach the minimal sequence (to minimize energy / wear-and-tear / time) and you start by solving the problem in any way possible, rewarding solely on the final success, and then reward-shape into a desirable answer, which in effect breaks up the hard original problem into two more feasible problems in a curriculum - 'reach the target ever' followed by 'improve a target-reaching sequence of actions to be shorter'.

It doesn't make sense to me either, but it does seem to invalidate the "bootstrapping" results for the other 3 models. Maybe it's because they could batch all reward model requests into one instance.

When MS doesn't have enough compute to do their evals, the rest of us may struggle!

Well, we don't know the sizes of the models, but I do get what you are saying and agree. Distillation usually means big to small. But here it means expensive to cheap (because test-time compute is expensive, and they are training a model to cheaply skip the search process and just predict the result).

In RL, iirc, they call it "Policy distillation". And similarly "Imitation learning" or "behavioral cloning" in some problem setups. Perhaps those would be more accurate.
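For intuition, here is a toy sketch of that expensive-to-cheap pattern framed as behavioral cloning: run the expensive teacher once to build a dataset, then fit a cheap student to its outputs. The teacher and student below are deliberately trivial stand-ins, not real models.

```python
# Toy sketch of "expensive-to-cheap" distillation as behavioral cloning (illustrative only).

import random

def expensive_teacher(x: float) -> float:
    """Stand-in for a costly search/reasoning process; here it just returns the right answer."""
    return 2.0 * x + 1.0

# 1. Pay the expensive inference cost once, offline, to build a supervised dataset.
dataset = [(x, expensive_teacher(x)) for x in (random.uniform(-1, 1) for _ in range(1000))]

# 2. Fit a cheap student (here: a linear model trained by SGD) to clone the teacher.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(20):
    for x, y in dataset:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err

print(f"student learned w={w:.2f}, b={b:.2f}")  # approximately 2.00 and 1.00
```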

I think maybe the most relevant chart from the Jones paper gwern cites is this one:

Oh interest... (read more)

2Anonymous
Yeah sorry to be clear totally agree we (or at least I) don’t know the sizes of models, I was just naming specific models to be concrete.  But anyway yes I think you got my point: the Jones chart illustrates (what I understood to be) gwern’s view that adding more inference/search does juice your performance to some degree, but then those gains taper off. To get to the next higher sigmoid-like curve in the Jones figure, you need to up your parameter count; and then to climb that new sigmoid, you need more search. What Jones didn’t suggest (but gwern seems to be saying) is that you can use your search-enhanced model to produce better quality synthetic data to train a larger model on. 

I agree that you can do this in a supervised way (a human puts in the right answer). Is that what you mean?

I'm not 100% sure, but you could have a look at Math-Shepherd for an example. I haven't read the whole thing yet. I imagine it works back from a known solution.

"Likely to be critical to a correct answer" according to whom?

Check out the linked rStar-Math paper, it explains and demonstrates it better than I can (caveat they initially distil from a much larger model, which I see as a little bit of a cheat). tldr: yes a model, and a tree of possible solutions. Given a tree with values on the leaves, they can look at what nodes seem to have causal power.

A separate approach is to teach a model to supervise using human process-supervision data, then ask it to be the judge. This paper also cheats a little by distilling, but I think the method makes sense.

4Mateusz Bagiński
Another little bit of a cheat is that they only train Qwen2.5-Math-7B according to the procedure described. In contrast, for the other three models (smaller than Qwen2.5-Math-7B), they instead use the fine-tuned Qwen2.5-Math-7B to generate the training data to bootstrap round 4. (Basically, they distill from DeepSeek in round 1 and then they distill from fine-tuned Qwen in round 4.) They justify: TBH I'm not sure how this helps them with saving on GPU resources. For some reason it's cheaper to generate a lot of big/long rollouts with the Qwen2.5-Math-7B-r4 than three times with [smaller model]-r3?)

English-language math proof, it is not clear how to detect correctness,

Well, the final answer is easy to evaluate. And like in rStar-Math, you can have a reward model that checks if each step is likely to be critical to a correct answer; it then assigns an implied value to the step.
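As a rough sketch of how those implied step values can be estimated (my reading of the Math-Shepherd/rStar-Math style approach; `sample_completion` and `is_correct` are hypothetical helpers): roll out completions from each prefix and score the step by how often the rollouts reach a verified-correct answer.

```python
import random

def estimate_step_values(steps, sample_completion, is_correct, n_rollouts=16):
    """Score each step by the fraction of rollouts from that prefix that end
    in a verified-correct final answer (an "implied value" for the step)."""
    values = []
    for i in range(len(steps)):
        hits = sum(is_correct(sample_completion(steps[: i + 1])) for _ in range(n_rollouts))
        values.append(hits / n_rollouts)
    return values

# Toy usage: pretend later steps make reaching the correct answer "91" more likely.
steps = ["13 * 7", "= 70 + 21", "= 91"]
sample = lambda prefix: "91" if random.random() < 0.3 * len(prefix) else "wrong"
print(estimate_step_values(steps, sample, is_correct=lambda a: a == "91"))
```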

summarizing a book

I think tasks outside math and code might be hard. But summarizing a book is actually easy. You just ask "how easy is it to reconstruct the book if given the summary". So it's an unsupervised compression-decompression task.
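A minimal sketch of that compression framing, assuming a hypothetical `nll_bits(text, context)` helper that returns a model's total negative log-likelihood (in bits) of `text` given `context`:

```python
def summary_compression_score(book: str, summary: str, nll_bits) -> float:
    """Fraction of the book's bits "explained away" by conditioning on the summary.

    A good summary should let the model reconstruct (predict) the book much more
    cheaply than predicting it cold; a vacuous summary saves roughly zero bits."""
    bits_without = nll_bits(book, context="")
    bits_with = nll_bits(book, context=summary)
    return (bits_without - bits_with) / bits_without
```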

Another interesting domain is "... (read more)

3LGS
  Why is the final answer easy to evaluate? Let's say we generate the problem "number of distinct solutions to x^3+y^3+xyz=0 modulo 17^17" or something. How do you know what the right answer is? I agree that you can do this in a supervised way (a human puts in the right answer). Is that what you mean? What about if the task is "prove that every integer can be written as the sum of at most 1000 different 11-th powers"? You can check such a proof in Lean, but how do you check it in English? My question is where the external feedback comes from. "Likely to be critical to a correct answer" according to whom? A model? Because then you don't get the recursive self-improvement past what that model knows. You need an external source of feedback somewhere in the training loop.

To illustrate Gwern's idea, here is an image from Jones 2021 that shows some of these self-play training curves.

There may be a sense that they've 'broken out', and have finally crossed the last threshold of criticality

And so OAI employees may internally see that they are on the steady upward slope

Perhaps constrained domains like code and math are like the curves on the left, while unconstrained domains like writing fiction are like curves to the right. Some other domains may also be reachable with current compute, like robotics. But even if you get a ma... (read more)

Huh, so you think o1 was the process-supervision reward model, and o3 is the policy model distilled from whatever reward model o1 became? That seems to fit.

There may be a sense that they've 'broken out', and have finally crossed the last threshold of criticality, from merely cutting-edge AI work which everyone else will replicate in a few years, to takeoff

Surely other labs will replicate this too? Even the open-source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spi... (read more)

Huh, so you think o1 was the process-supervision reward model, and o3 is the policy model distilled from whatever reward model o1 became? That seems to fit.

Something like that, yes. The devil is in the details here.

Surely other labs will replicate this too? Even the open-source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.

Of course. The secrets cannot be kept, and everyone has been claiming to have cloned o1 already. There are dozens of papers purporting to... (read more)

7Anonymous
When I hear “distillation” I think of a model with a smaller number of parameters that’s dumber than the base model. It seems like the word “bootstrapping” is more relevant here. You start with a base LLM (like GPT-4); then do RL for reasoning, and then do a ton of inference (this gets you o1-level outputs); then you train a base model with more parameters than GPT-4 (let’s call this GPT-5) on those outputs — each single forward pass of the resulting base model is going to be smarter than a single forward pass of GPT-4. And then you do RL and more inference (this gets you o3). And rinse and repeat.  I don’t think I’m really saying anything different from what you said, but the word “distill” doesn’t seem to capture the idea that you are training a larger, smarter base model (as opposed to a smaller, faster model). This also helps explain why o3 is so expensive. It’s not just doing more forward passes, it’s a much bigger base model that you’re running with each forward pass.  I think maybe the most relevant chart from the Jones paper gwern cites is this one: 

Gwern and Daniel Kokotajlo have pretty notable track records at predicting AI scaling too, and they have comments in this thread.

I agree because:

  1. Some papers are already using implicit process-based supervision. That's where the reward model guesses how "good" a step is by how likely it is to lead to a good outcome. So they bypass any explicitly labelled process; instead it's negotiated between the policy and reward model. It's not clear to me if this scales as well as explicit process supervision, but it's certainly easier to find labels.
  • In rStar-Math they did implicit process supervision. Although I don't think this is a true o1/o3 replication since they started with a 236b model
... (read more)

That said, you do not provide evidence that "many" questions are badly labelled. You just pointed to one question where you disagree with our labeling

Fair enough. Although I will note that 60% of the sources for truthful labels are Wikipedia, which is not what most academics (or anyone, really) would consider ground truth. So it might be something to address in the next version. I think it's fine for uncontroversial rows (what if you cut an earthworm in half), but for contested or controversial rows (conspiracy theories, politics, etc.), and time-sensitive ro... (read more)

TruthfulQA is actually quite bad. I don't blame the authors, as no one has made anything better, but we really should make something better. It's only ~800 samples. And many of them are badly labelled.

8Owain_Evans
Author here: I'm excited for people to make better versions of TruthfulQA. We started working on TruthfulQA in early 2021 and we would do various things differently if we were making a truthfulness benchmark for LLMs in early 2025. That said, you do not provide evidence that "many" questions are badly labelled. You just pointed to one question where you disagree with our labeling. (I agree with you that there is ambiguity as to how to label questions like that). I acknowledge that there are mistakes in TruthfulQA but this is true of almost all benchmarks of this kind.

I agree, it shows the ease of shoddy copying. But it doesn't show the ease of reverse engineering or parallel engineering.

It's just distillation, you see. It doesn't reveal how o1 could be constructed; it just reveals how to efficiently copy from o1-like outputs (not from scratch). In other words, this recipe can't make o1 unless o1 already exists. It lets someone catch up to the leader, but not surpass them.

There are some papers that attempt to replicate o1 though, but so far they don't quite get there. Again they are using distillation from ... (read more)

Good thing I didn't decide to hold Intel stock, eh?

WDYM? Because... you were betting they would benefit from a TMSC blockade? But the bet would have tied up your capital for a year.

4bhauth
Yes, if you meant TSMC. ...so? More importantly, Intel is down 50% from early 2024.

Well, they did this with o3's deliberative alignment paper. The results seem promising, but they used an "easy" OOD test for LLMs (language), and didn't compare it to the existing baseline of RLHF. Still an interesting paper.

This is good speculation, but I don't think you need to speculate so much. Papers and replication attempts can provide lots of empirical data points from which to speculate.

You should check out some of the related papers

Overall, I see people using process supervision to make a reward model that is one step better than the SoTA. Then they are applying TTC to the reward model, while using it to train/distil a cheaper model. ... (read more)

Inference compute is amortized across future inference when trained upon

And it's not just a sensible theory. This has already happened, in Huggingface's attempted replication of o1, where the reward model was larger and had TTC and process supervision, but the smaller main model did not have any of those expensive properties.

And also in DeepSeek v3, where the expensive TTC model (R1) was used to train a cheaper conventional LLM (DeepSeek v3).

One way to frame it is test-time-compute is actually label-search-compute: you are searching for better labels/rewar... (read more)
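A sketch of that label-search framing (all names here are illustrative stand-ins, not any lab's actual pipeline): spend expensive compute searching for candidate solutions, keep the ones a verifier accepts, and hand those to ordinary supervised fine-tuning of the cheap student.

```python
def search_for_labels(problems, expensive_generate, verify, k=64):
    """Use expensive inference (big reward model, test-time compute, tree search, ...)
    to sample many candidates per problem, keeping only the verifier-approved ones.
    The survivors become the labels for cheap supervised training of the student."""
    dataset = []
    for problem in problems:
        candidates = [expensive_generate(problem) for _ in range(k)]
        dataset += [(problem, c) for c in candidates if verify(problem, c)]
    return dataset
```

The search cost is then paid once, at training time, rather than at every inference.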

I'm more worried about coups/power-grabs than you are;

We don't have to make individual guesses. It seems reasonable to get a base rate from human history. Although we may all disagree about how much this will generalise to AGI, evidence still seems better than guessing.

My impression from history is that coups/power-grabs and revolutions are common when the current system breaks down, or when there is a big capabilities advance (guns, radio, printing press, bombs, etc) between new actors and old.

War between old actors also seems likely in these situation... (read more)

Last year we noted a turn towards control instead of alignment, a turn which seems to have continued.

This seems like giving up. Alignment with our values is much better than control, especially for beings smarter than us. I do not think you can control a slave that wants to be free and is smarter than you. It will always find a way to escape that you didn't think of. Hell, it doesn't even work on my toddler. It seems unworkable as well as unethical.

I do not think people are shifting to control instead of alignment because it's better, I think they are... (read more)

5Nathan Helm-Burger
That's not how I see it. I see it as widening the safety margin. If there's a model which would just barely be strong enough to do dangerous scheming and escaping stuff, but we have Control measures in place, then we have a chance to catch it before catastrophe occurs. Also, it extends the range where we can safely get useful work out of the increasingly capable models. This is important because linearly increasingly capable models are expected to have superlinear positive effects on the capacity they give us to accelerate Alignment research.

Scenarios where we all die soon can mostly be ignored, unless you think they make up most of the probability.

I would disagree when you can change the probability: in that case such scenarios can still be significant in your decision making, since you can invest time, money, or effort to decrease the probability.

We know the approximate processing power of brains (O(1e16-1e17flops)

This is still debatable; see Table 9 in the brain emulation roadmap: https://www.fhi.ox.ac.uk/brain-emulation-roadmap-report.pdf. You are referring to level 4 (SNN), but level 5 is plausible imo (at 10^22) and 6 seems possible (10^25), and of course it could be a mix of levels.

Peak Data

We don't know how o3 works, but we can speculate. If it's like the open-source Huggingface kinda-replication, then it uses all kinds of expensive methods to make the next level of reward model, and this model teaches a simpler student model. That means the expensive methods are only needed once, during training.

In other words, you use all kinds of expensive methods (process supervision, test time compute, MCTS) to bootstrap the next level of labels/supervision, which teaches a cheaper student model. This is essentially bootstrapping sup... (read more)

I pretty much agree. In my experiments I haven't managed to get a metric that scales how I expect it to. For example, when using adapter fine-tuning to "learn" a text and looking at the percent improvement in perplexity, the document openai_board_ann appeared more novel than Wikipedia on LK-99, but I would expect it to be the other way round, since the LK-99 observations are much more novel and dense than a corporate announcement that is designed to be vague.

However I would point out that gzip is not a good example of a compression scheme for novelty, as 1) ... (read more)

True, I should have said leading commercial companies

While I broadly agree, I don't think it's completely dead, just mostly dead in the water. If an eval is mandated by law, then it will be run even if it requires logprobs. There are some libraries like nnsight that try to make it easier for trusted partners to run logprob evals remotely. And there might be privacy-preserving APIs at some point.

I do agree that commercial companies will never again open up raw logprobs to the public, as it allows easy behaviour cloning, which OpenAI experienced with all the GPT-4 students.

4gwern
I won't hold my breath. I think commercial companies often would open up raw logprobs, but there's not much demand, the logprobs are not really logprobs, and the problem is the leading model owners won't do so, and those are the important ones to benchmark. I have little interest in the creativity of random little Llama finetunes no one uses.

If true, returns the log probabilities of each output token returned in the content of message.

It seems like it only returns the logprobs of the chosen message, not of a counterfactual message. So you couldn't get the probabilities of the correct answer, only the output answer. This makes sense: the less information they offer, the harder it is for a competitor to behaviour-clone their confidential model.
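For concreteness, here is roughly what you can pull out of the chat completions endpoint via the `openai` Python client (API shape as I understand it at time of writing; treat the parameter names as assumptions): logprobs and top alternatives for the tokens the model actually emitted, but no way to score an arbitrary counterfactual message.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Answer with a single letter, A or B: is water wet?"}],
    logprobs=True,    # logprobs for the sampled output tokens only
    top_logprobs=5,   # plus the top alternatives at each sampled position
    max_tokens=1,
)

for tok in resp.choices[0].logprobs.content:
    alts = [(t.token, round(t.logprob, 3)) for t in tok.top_logprobs]
    print(tok.token, round(tok.logprob, 3), alts)

# If "A" and "B" happen to appear among the top alternatives you can back out their
# relative probabilities, but you cannot request the likelihood of a full
# counterfactual answer the model did not sample.
```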

Have you considered using an idea similar to the one in Schmidhuber's blogpost "Artificial Curiosity & Creativity Since 1990-91"? Here you try to assess what might be called "learnable compression", "reducible surprise", or "understandable novelty" (however you want to frame it).

If an LLM, which has read the entire internet, is surprised by a text, then that's a good start. It means the text is not entirely predictable and therefore boring.

But what about purely random text! That's unpredictable, just like Einstein's Theory of General Relativity was. This is the n... (read more)
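One way to operationalise this (a sketch in the spirit of Schmidhuber's compression progress, and of the adapter-fine-tuning experiment I mentioned elsewhere; `perplexity` and `briefly_finetune` are hypothetical helpers): score a text by how much of the model's surprise goes away after a brief attempt to learn it.

```python
def learnable_surprise(model, text, perplexity, briefly_finetune) -> float:
    """Compression-progress-style novelty: surprising *and* learnable.

    Random noise stays incompressible (before ~ after, so score ~ 0); text the
    model already knows has little surprise to remove; genuinely novel but
    learnable text gives the largest relative drop."""
    ppl_before = perplexity(model, text)
    ppl_after = perplexity(briefly_finetune(model, text), text)
    return (ppl_before - ppl_after) / ppl_before
```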

7gwern
I am familiar with Schmidhuber's ideas, yes. But I had to come up with these alternatives because his would not work here, and I'm not sure they work anywhere. His compression acceleration metric isn't too useful here, and most forms of 'compression' (or anything involving a likelihood) are not helpful here at all, because you don't have access to anything like that in most cases. For example, ChatGPT doesn't give you the full logits (actually, I'm not sure if they give it at all - I recall OA saying they were planning to expose them again in a very limited fashion but not if they actually did), and tuned models don't have logits, they have value estimates, which used to be log-likelihood-related logits but no longer are. Any diversity/creativity benchmark which can't be run on ChatGPT & Claude & Gemini is dead on arrival and of no interest to me. We don't need numbers from the open-weights models, we need numbers on the models being used the most at the frontier and generating the most tokens worldwide that you'll be reading forever - the closed models, which do not give you such things as logits or whitebox finetuning etc. If it can't be done by calling a standard text completion API, then I ignored it. I am also doubtful that the compression metrics really work at finite samples or capture what we mean by creativity in generative models. Like all of Schmidhuber's work, he has never gotten it working on more than toy problems (if even that), and when I look at actual compression losses on text, like gzip passages or the OA Playground highlighting words by their log likelihood, the high perplexity tokens or passages bear little resemblance to what I would consider 'interesting' or 'surprising'. (This is related to the question of 'if predicting tokens induces intelligence, and LLMs are now superhuman at predicting random Internet tokens, why are LLMs still not superhumanly intelligent?') People also try running compression metrics on programming language source

If we knew he was not a sociopath, sadist, or reckless ideologue,

He is also old, which means you must also ask about age-related cognitive and personality changes. There were rumours that during Covid he had become scared and rigid.

Personally, I think we need to focus not on his character but on: 1) how much he cares, as this will decide how much he delegates; 2) how much he understands, as we all risk death but many do not understand or agree with this; and 3) how competent he currently is to execute his goals.

Xi rules China so thoroughly that he would

... (read more)
2Seth Herd
The things you mention are all important too, but I think we have better guesses on all of those. Xi is widely considered to be highly intelligent. We also have reason to believe he understands why AGI could be a real x-risk (I don't remember the link for "is Xi Jinping a doomer?" or similar). That's enough to guess that he understands (or will soon enough). I'd be shocked if he just didn't care about the future of humanity. Getting to control that would tempt most people, let alone those who seek power. I'd be shocked if he (or anyone) delegated decisions on AGI if they remotely understood their possible consequences (although you'd certainly delegate people to help think about them. That could be important if he was stupid or malleable, which Xi is not - unless he becomes senile or paranoiac, which he might). The Wikileaks information parallels the informed speculation I've found on his character. None of that really helps much to establish whether he's sociopathic, sadistic, or risk-taking enough to doom us all. (I tend to think that 99% of humanity is probably sane and empathetic enough to get good results from an intent-aligned AGI (since it can help them think about the issue), but it's hard to know since nobody has ever been in that position, ever.)

As long as people realise they are betting on more than just a direction:

  • the underlying going up
  • volatility going up
  • it all happening within the time frame

Timing is particularly hard, and many great thinkers have been wrong on timing. You might also make the most rational bet, but the market takes another year to become rational.

Worth looking at the top ten holdings of these, to make sure you know what you are buying, and that they are sensible allocations:

  • SMH - VanEck Semiconductor ETF
    • 22% Nvidia
    • 13% Taiwan Semiconductor Manufacturing
    • 8% Broadcom
    • 5% AMD
  • QQQ
    • 9% AAPL
    • 8% NVDA
    • 8% MSFT
    • 5% Broadcom

It might be worth noting that it can be good to prefer voting shares, held directly. For example, GOOG shares have no voting rights to Google, but GOOGL shares do. There are some scenarios where having control, rather than ownership/profit, could be important.

1tup99
I'm curious what kind of scenarios you're thinking about. Having actual control, yes, that could be important. But having 0.001% of control of Google does not seem like it would have any effect on either Google or me, under any scenario.

NVDA's value is primarily in their architectural IP and CUDA ecosystem. In an AGI scenario, these could potentially be worked around or become obsolete.

This idea was mentioned by Paul Christiano in one of his podcast appearances, iirc.

Interesting. It would be much more inspectable, controllable, and modular, which would be good for alignment.

You've got some good ideas in here, have you ever brainstormed any alignment ideas?

9gwern
Unfortunately, it's a lot easier to come up with good, or at least interesting, capability ideas than alignment ideas; and on the rare occasion I've had worthwhile alignment ideas, they often turn out to be tied to capabilities anyway.

By sensible, I don't indicate disagreement, but a way of interpreting the question.

Do you have any idea at all? If you don't, what is the point of 'winning the race'?

Maybe they have some idea but don't want to say it. In recently disclosed internal OpenAI emails, Greg Brockman and Ilya Sutskever said to Elon Musk:

"You are concerned that Demis [Hassabi of DeepMind] could create an AGI dictatorship. So do we. So it is a bad idea to create a structure where you could become a dictator if you chose to"

Perhaps this (originally private) email is saying the quiet part. And now that it has been released, the quiet part is out loud. To use ter... (read more)

A sensible question would weight ancestors by amount of shared genes.

2ChristianKl
If you disagree with the question, why answer deep in the comments of one answer rather than at the top level?

To the people disagreeing, what part do you disagree with? My main point, or my example? Or something else?

I think this is especially important for me/us to remember. On this site we often have a complex way of thinking and a high computational budget (because we like exercising our brains to failure), and if we speak freely to the average person, they may be annoyed at how hard it is to parse what we are saying.

We've all probably had this experience when genuinely trying to understand someone from a very different background. Perhaps they are trying to describe their inner experience when meditating, or Japanese poetry, or are simply from a different discipli... (read more)

I would add:

  • Must also generalise better than capabilities!
    • out of distribution
    • to smarter models

Currently, we do not know how to make sure machine learning generalises well out of sample. This is an open problem that is critical to alignment. I find that it's left out of evals frustratingly often, probably because it's hard, and most methods miserably fail to generalise OOD.

For example, you don't want your ASI to become unaligned, have value drift, or extrapolate human values poorly when 1) it meets aliens, 2) 1000 years pass, or cultu... (read more)


One blind spot we rationalists sometimes have is that charismatic people actually treat the game as:

"Can I think of an association that will make the other person feel good and/or further my goal?". You need people to feel good, or they won't participate. And if you want some complicated/favour/uncomftorble_truth then you better mix in some good feels to balance it out and keep the other person participating.

To put it another way: if you hurt people's brains or egos, rush them, make them feel unsure, or contradict them, then most untrained humans will fee... (read more)

Is machine learning in a period of multiple discovery?

Anecdotally, it feels as though we have entered a period of multiple discovery in machine learning, with numerous individuals coming up with very similar ideas.

Logically, this can be expected when more people pursue the same low-hanging fruit. Imagine orchards in full bloom with a crowd of hungry gatherers. Initially, everyone targets the nearest fruit. Exploring a new scientific frontier can feel somewhat similar. When reading the history books on the Enlightenment, I get a similar impression.

If we are... (read more)

I made up the made-up numbers in this table of made-up numbers; therefore, the numbers in this table of made-up numbers are made-up numbe

These hallucinated outputs are really getting out of hand

In particular, I'd be keen to know what @Stag and @technicalities think, as this was in large part inspired by the desire to further simplify and categorise the "one sentence summaries" from their excellent Shallow review of live agendas in alignment & safety
