All of Thane Ruthenis's Comments + Replies

Sooo, apparently OpenAI's mysterious breakthrough technique for generalizing RL to hard-to-verify domains that scored them IMO gold is just... "use the LLM as a judge"? Sources: the main one is paywalled, but this seems to capture the main data, and you can also search for various crumbs here and here.

The technical details of how exactly the universal verifier works aren’t yet clear. Essentially, it involves tasking an LLM with the job of checking and grading another model’s answers by using various sources to research them.

My understanding is that they ap... (read more)
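For concreteness, here's roughly the shape I imagine "use the LLM as a judge" taking as an RL reward signal. This is a hedged sketch under my own assumptions: every name and method below (judge_reward, policy.generate, etc.) is hypothetical, standing in for whatever OpenAI actually built, which isn't public.

```python
# Hypothetical sketch: an LLM "universal verifier" used as the reward signal for RL.
# All objects and methods here are placeholders, not a real API.

def judge_reward(question: str, candidate_answer: str, judge) -> float:
    """Ask a judge LLM to research and grade another model's answer; return a scalar reward."""
    rubric = (
        "Research the following question using whatever sources you can, then grade the "
        "candidate answer from 0 to 10 for correctness and rigor. Reply with only a number.\n"
        f"Question: {question}\nCandidate answer: {candidate_answer}\nGrade:"
    )
    grade_text = judge.generate(rubric)      # hypothetical judge-model call
    return float(grade_text.strip()) / 10.0  # a real setup would parse/validate more carefully

def rl_step(policy, judge, question: str) -> None:
    answer = policy.generate(question)       # hypothetical policy-model call
    reward = judge_reward(question, answer, judge)
    policy.update(question, answer, reward)  # e.g. some PPO/GRPO-style update
```

All the interesting questions are hidden inside the judge call: how much it gets to research, how it's kept from being gamed by the policy, and how noisy its grades are.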

3kaiwilliams
One point of information against the "journalists are completely misinterpreting the thing they're reporting on" view is that one of the co-authors is Rocket Drew, who previously worked as a Research Manager at MATS. But I'll definitely be interested to follow this space more.
7Vladimir_Nesov
The full text is on archive.today.

We also observe inscrutable reasoning traces whose uninterpretable sections seem (on a surface reading) less likely to contain any particular semantic meaning.

So, um, not to anthropomorphize too much, but is it, like, dissociating under stress in that screencap or what?

Truly, AGI achieved.

1Jon Garcia
Honestly, I empathize a lot with all those dots. It can be hard to focus and stay on task sometimes. If RL focused on just the final outputs, all sorts of stuff could happen during the covert reasoning phase and still lead to results that received positive reinforcement. That includes all the pathological behavior that happened to coincide with good responses. Sometimes it's useful internal reasoning "codewords"; sometimes it's actual junk. You can see with the "Stop. Focus."-type self-admonishment that it didn't learn how to avoid pathological distractions entirely, only how to detect them and tools to get back on task. I wonder whether it would be helpful to apply reinforcement on the contents of the reasoning phase (maybe scored by some measure of human-readable task-relevance), or whether this would cripple its ability to reason effectively.

That would be my interpretation if I were to steelman him. My actual expectation is that he's lumping Eliezer-style positions with Yampolskiy-style positions, barely differentiating between them. Eliezer has certainly said things along the general lines of "AGI can never be made aligned using the tools of the current paradigm", backing it up by what could be called "logical arguments" from evolution or first principles.

Like, Dario clearly disagrees with Eliezer's position as well, given who he is and what he is doing, so there must be some way he is dismis... (read more)

7yams
I agree that Dario disagrees with Eliezer somewhere. I don't know for sure that you've isolated the part that Dario disagrees with, and it seems plausible to me that Dario thinks we need some more MIRI-esque, principled thing, or an alternative architecture altogether, or for the LLMs to have solved the problem for us, once we cross some capabilities threshold. If he's said something public about this either way, I'd love to know. I also think that some interpretations of Dario's statement are compatible with some interpretations of the section of the IABIED book excerpt above, so we ought to just... all be extra careful not to be too generous to one side or the other, or too critical of one side or the other. I agree that my interpretation errs on the side of giving Dario too much credit here. I'm pretty confused about Dario and don't trust him, but I want to gesture toward some care in the intended targets of some of his stronger statements about 'doomers'. I think he's a pretty careful communicator, and still lean toward my interpretation over yours (although I also expect him to be wrong in his characterization of Eliezer's beliefs, I don't expect him to be quite as wrong as the above). I find the story you're telling here totally plausible, and just genuinely do not know. There's also a meta concern where if you decide that you're the target of some inaccurate statement that's certainly targeted at someone but might not be targeted at you, you've perhaps done more damage to yourself by adopting that mischaracterization of yourself in order to amend it, than by saying something like "Well, you must not be talking about me, because that's just not what I believe."

Recall how Putin has been "putting nuclear forces on high alert" over and over and over again since the start of the war, including during the initial events in February 2022. It never meant anything.

I expect this is the exact same thing. Trump is just joining in on the posturing fun, because he's Putin-like in this regard. I feel fairly confident that neither Putin nor Trump will ever actually nuke over this conflict in its current shape, and you should feel free to ignore all of their nonsense.

Some context here: I'm Russian and I pay some attention to Ru... (read more)

I second @Seth Herd's suggestion: I'm interested in your vision regarding what success would look like. Not just "here's a list of some initiatives and research programs that should be helpful" or "here's a possible optimistic scenario in which things go well, but which we don't actually believe in", but the sketch of an actual end-to-end plan around which you'd want people to coordinate. (Under the understanding that plans are worthless but planning is everything, of course.)

I think I have a thing very similar to John's here, and for me at least, it's mostly orthogonal to "how much you care about this person's well-being". Or, like, as relevant for that as whether that person has a likeable character trait.

The main impact is on the ability to coordinate with/trust/relax around that person. If they're well-modeled as an agent, you can, to wit, model them as a game-theoretic agent: as someone who is going to pay attention to the relevant parts of any given situation and continually make choices within it that are consistent with... (read more)

To expand on that...

In my mental ontology, there's a set of specific concepts and mental motions associated with accountability: viewing people as being responsible for their actions, being disappointed in or impressed by their choices, modeling the assignment of blame/credit as meaningful operations. Implicitly, this requires modeling other people as agents: types of systems which are usefully modeled as having control over their actions. To me, this is a prerequisite for being able to truly connect with someone.

When you apply the not-that-coherent-an-age... (read more)

"Not optimized to be convincing to AI researchers" ≠ "looks like fraud". "Optimized to be convincing to policymakers" might involve research that clearly demonstrates some property of AIs/ML models which is basic knowledge for capability researchers (and for which they already came up with rationalizations why it's totally fine) but isn't well-known outside specialist circles.

E. g., the basic example is the fact that ML models are black boxes trained by an autonomous process which we don't understand, instead of manually coded symbolic programs. This isn't... (read more)

3Guive
What kind of "research" would demonstrate that ML models are not the same as manually coded programs? Why not just link to the Wikipedia article for "machine learning"? 

That seems like a pretty good idea!

(There are projects that stress-test the assumptions behind AGI labs' plans, of course, but I don't think anyone is (1) deliberately picking at the plans AGI labs claim to have, in a basically adversarial manner, (2) optimizing experimental setups and results for legibility to policymakers, rather than for convincingness to other AI researchers. Explicitly setting those priorities might be useful.)

3Kabir Kumar
AI Plans does this
6Buck
People who do research like this are definitely optimizing for legibility to policymakers (always at least a bit, and usually a lot). One problem is that if AI researchers think your work is misleading/scientifically suspect, they get annoyed at you and tell people that your research sucks and you're a dishonest ideologue. This is IMO often a healthy immune response, though it's a bummer when you think that the researchers are wrong and your work is fine. So I think it's pretty costly to give up on convincingness to AI researchers.

@Caleb Biddulph's reply seems right to me. Another tack:

It's like the old "dragon in the garage" parable: the woman is too good at systematically denying the things which would actually help to not have a working model somewhere in there

I think you're still imagining too coherent an agent. Yes, perhaps there is a slice through her mind that contains a working model which, if that model were dropped into the mind of a more coherent agent, could be used to easily comprehend and fix the situation. But this slice doesn't necessarily have executive conscious co... (read more)

7Thane Ruthenis
To expand on that... In my mental ontology, there's a set of specific concepts and mental motions associated with accountability: viewing people as being responsible for their actions, being disappointed in or impressed by their choices, modeling the assignment of blame/credit as meaningful operations. Implicitly, this requires modeling other people as agents: types of systems which are usefully modeled as having control over their actions. To me, this is a prerequisite for being able to truly connect with someone. When you apply the not-that-coherent-an-agent lens, you do lose that. Because, like, which parts of that person's cognition should you interpret as the agent making choices, and which as parts of the malfunctioning exoskeleton the agent has no control over? You can make some decision about that, but this is usually pretty arbitrary. If someone is best modeled like this, they're not well-modeled as an agent, and holding them accountable is a category error. They're a type of system that does what it does. You can still invoke the social rituals of "blame" and "responsibility" if you expect that to change their behavior, but the mental experience of doing so is very different. It's more like calculating the nudges you need to make to prompt the desired mechanistic behavior, rather than as interfacing with a fellow person. In the latter case, you can sort of relax, communicate in a way focused on transferring information, instead of focusing on the form of communication, and trust them to make correct inferences. In the former case, you need to keep precise track of tone/wording/aesthetics/etc., and it's less "communication" and more "optimization". I really dislike thinking of people in this way, and I try to adopt the viewing-them-as-a-person frame whenever it's at all possible. But the other frame does unfortunately seem to be useful in many cases. Trying to do otherwise often feels like reaching out for someone's hand and finding nothing there. If t

When I try to empathize with that woman, what I feel toward her is disgust. If I were in her shoes, I would immediately jump to getting rid of the damn nail, it wouldn’t even occur to me to not fix it. 

You may not be doing enough of putting yourself into her shoes. Specifically, you seem to be putting yourself into her material circumstances, as if you switched minds (and got her memories et cetera), instead of, like... imagining yourself also having her world-model and set of crystallized-intelligence heuristics and cognitive-bandwidth limitatio... (read more)

From the perspective of someone with a nail stuck in their head, the world does not look like there's a nail stuck in their head which they could easily remove in order to improve their life in ways in which they want it to be improved. [...] They're best modeled not as agents who are being willfully obstinate, but as people helplessly trapped in the cognitive equivalents of malfunctioning motorized exoskeletons.

I think this is false. It's like the old "dragon in the garage" parable: the woman is too good at systematically denying the things which would... (read more)

Well, an aligned Singularity would probably be relatively pleasant, since the entities fueling it would consider causing this sort of vast distress a negative and try to avoid it. Indeed, if you trust them not to drown you, there would be no need for this sort of frantic grasping-at-straws.

An unaligned Singularity would probably also be more pleasant, since the entities fueling it would likely try to make it look aligned, with the span of time between the treacherous turn and everyone dying likely being short.

This scenario covers a sort of "neutral-alignme... (read more)

Yes, it's competently executed

Is it?

It certainly signals that the authors have a competent grasp of the AI industry and its mainstream models of what's happening. But is it actually competent AI-policy work, even under the e/acc agenda?

My impression is that no, it's not. It seems to live in an e/acc fanfic about a competent US racing to AGI, not in reality. It vaguely recommends doing a thousand things that would be nontrivial to execute if the Eye of Sauron were looking directly at them, and the Eye is very much not doing that. On the contrary, the wider ... (read more)

7Zvi
I believe that this is a competently executed plan from the perspective of those executing the plan, which is different from the entire policy of the White House being generally competent in ways that those in charge of the plan lacked the power to do anything about (e.g. immigration, attacks on solar power, trade and alliances in general...)
7habryka
Sorry, by executed I meant "competently written". It does seem to me that this piece of policy is more grounded in important bottlenecks and detailed knowledge for AI progress than previous similar things. I find it plausible that it might fail on implementation because it's modeling the Trump administration and America as too much of a thing that could actually execute this plan. I agree with you that it is not unlikely it will fall flat on those grounds, and that does give me some solace.

Yeah, I figured.

If the judge sees that you are a $61 billion market cap company hiring the greatest lawyers in the world, but you're not putting forth your best legal foot when you have lawyers from other companies writing briefs outlining their own defense arguments, the consequences for you and your lawyers will be severe

What would be the actual wrongdoing here, legally speaking?

Federal lawsuits must satisfy the case or controversy requirement of Article 3 of the Constitution. 

A failure to do so (if there is no genuine adversity between the parties in practice because they collude on the result) renders the lawsuit dead on the spot (because the federal court cannot constitutionally exercise jurisdiction over the parties, so there can be no decision on the merits) and exposes the lawyers and parties to punishment and repercussions in case they tried to conceal this from/directly lie to the judge, both because lying to a judici... (read more)

Clearly the heroic thing to do would be to go to trial and then deliberately mess it up very badly in a calculated fashion that sets an awful precedent for the other AGI companies. You might say, "but China!", but if the US cripples itself, then suddenly the USG would be much more interested in reaching some sort of international-AGI-ban deal with China, so it all works out.

(Only half-serious.)

Responding to the serious half only, sandbagging doesn't work in general in the legal system, and in particular it wouldn't work here. That's because you have so much outside attention on the case and (presumably) so many amici briefs describing all the most powerful arguments in the AI companies' favor. If the judge sees that you are a $61 billion market cap company hiring the greatest lawyers in the world, but you're not putting forth your best legal foot when you have lawyers from other companies writing briefs outlining their own defense arguments, the consequences for you and your lawyers will be severe and any notion of "precedent" will be poisoned for all of time.

Yeah, I guess the use-case I had in mind is generally people who don't want LLMs trained on (particular pieces of) their writing, rather than datasets specifically.

Hmm. This approach relies partly on the AGI labs being cooperative and wary of violating the law, and partly on creating minor inconveniences for accessing the data which inconvenience human users as well. In addition, any data shared this way would have to be shared via the download portal, impoverishing the web experience.

I wonder if it's possible to design some method of data protection that (1) would be deployable on arbitrary web pages, (2) would not burden human users, (3) would make AGI labs actively not want to scrape pages protected this way.

Here's... (read more)

2ProgramCrafter
You have to make that poison inactive in accessibility cases, or a person using a screen reader would hear all that. However, if a correctly configured screen reader skips the invisible data, then labs will just use it (assuming they can be bothered with cleaning the dataset at all). Also, training-time jailbreaks are likely quite different from inference-time jailbreaks. The latter will tend to hit Operator-style stuff harder.
1anaguma
This seems like a great idea! However, I think it might degrade the usefulness of the dataset, especially if it’s meant to later be used to evaluate LLMs since any jailbreaks etc. would apply in that setting as well. If you provide utilities to clean up the text before evaluation, these could be used for scraping as well.

Of course, the degree of transmission does depend on the distillation distribution.

Yes, that's what makes it not particularly enlightening here, I think? The theorem says that the student moves in a direction that is at-worst-orthogonal towards the teacher – meaning "orthogonal direction" is the lower bound, right? And it's a pretty weak lower bound. (Or, a statement which I think is approximately equivalent, the student's post-distillation loss on the teacher's loss function is at-worst-equal to its pre-distillation loss.)
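To spell out the first-order picture I have in mind (my own notation, simplified to a squared-error distillation loss and one small step on each side; the paper's exact statement may differ in details): write the teacher as $\theta_T = \theta_0 - \eta\,\nabla L_{\text{task}}(\theta_0) \equiv \theta_0 + \Delta$. A single small distillation step at the shared initialization $\theta_0$, on an input $x$ with loss $\tfrac{1}{2}\lVert f(x;\theta) - f(x;\theta_T)\rVert^2$, moves the student by

$$\delta = -\eta'\,J(x)^\top\big(f(x;\theta_0) - f(x;\theta_T)\big) \approx \eta'\,J(x)^\top J(x)\,\Delta, \qquad \langle\delta,\Delta\rangle \approx \eta'\,\lVert J(x)\,\Delta\rVert^2 \ge 0,$$

where $J(x)$ is the Jacobian of the network's outputs at $\theta_0$. So the guarantee is exactly "non-negative projection onto the teacher's displacement", with equality whenever the distillation input is blind to that displacement, which is why it reads to me as a weak bound.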

Another perspective: consider loo... (read more)

8cloud
I agree the theorem is fairly limited (particularly because it assumes the teacher and student are derived by single steps of GD), but I argue that it is, in fact, enlightening. Three reasons:

1. A priori, I don't think it would be crazy to think that training M to match a similarly parametrized M' on input distribution D could cause M to diverge from M' on some other distribution D'. This probably can happen if M' is behaviorally similar but parametrized differently. So, a justifiable intuition for the true fact would have to incorporate the dependence on the parametrization of M'. Even if this dependence feels obvious upon reflection ("well yeah, the models have to have similarly entangled representations for this to happen"), you'd first have to consider that this dependence existed in the first place. Why did this entanglement have to be path dependent? Could it not have been universal across models? To test the a priori plausibility of the claim, I tried asking o3 and Opus 4. You can see the responses below. (It's unclear to me how much evidence this is.)
2. In a complex system, being able to eliminate half of the outcome space suggests interesting structure. For example, if a theory of physics showed that a butterfly flapping its wings never decreases the probability of a hurricane, that would be a surprising insight into a fundamental property of chaotic systems -- even though it only "lower-bounds" change in hurricane probability at 0.
3. The proof of the theorem actually does quantify transmission. It is given by equation (2) in terms of inner products of teacher and student gradients on the distillation distribution. So, if you are willing to compute or make assumptions about these terms, there are more insights to be had.

That said, I'm with you when I say, armed only with the theorem, I would not have predicted our results!

Prompt: Consider the following machine learning experiment: start with a neural network M. Create a new network, M',

OpenAI has declared ChatGPT Agent as High in Biological and Chemical capabilities under their Preparedness Framework

Huh. They certainly say all the right things here, so this might be a minor positive update on OpenAI for me.

Of course, the way it sounds and the way it is are entirely different things, and it's not clear yet whether the development of all these serious-sounding safeguards was approached with making things actually secure in mind, as opposed to safety-washing. E. g., are they actually going to stop anyone moderately determined?

Hm, it's been ... (read more)

Fascinating. This is the sort of result that makes me curious about how LLMs work irrespective of their importance to any existential risks.

In the paper, we prove a theorem showing that a single, sufficiently small step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher

Hmm, that theorem didn't seem like a very satisfying explanation to me. Unless I'm missing something, it doesn't actually imply anything about the student's features that are seemingly unrelated to the training distribution being moved t... (read more)

The theorem says that the student will become more like the teacher, as measured by whatever loss was used to create the teacher. So if we create the teacher by supervised learning on the text "My favorite animal is the owl," the theorem says the student should have lower loss on this text[1]. This result does not depend on the distillation distribution. (Of course, the degree of transmission does depend on the distillation distribution. If you train the student to imitate the teacher on the input "My favorite animal is", you will get more transmission tha... (read more)
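As a toy numerical illustration of that parameter-space reading (a linear model standing in for both networks, one small gradient step each; this is my own hypothetical setup, not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w0 = rng.normal(size=d)  # shared initialization for teacher and student

# Teacher: one GD step on its own task (stand-in for "My favorite animal is the owl")
X_task, y_task = rng.normal(size=(50, d)), rng.normal(size=50)
def task_loss(w):
    return np.mean((X_task @ w - y_task) ** 2)
grad_T = 2 * X_task.T @ (X_task @ w0 - y_task) / len(y_task)
w_teacher = w0 - 0.01 * grad_T

# Student: one small GD step imitating the teacher on unrelated distillation inputs
X_distill = rng.normal(size=(50, d))
grad_S = 2 * X_distill.T @ (X_distill @ (w0 - w_teacher)) / len(X_distill)
w_student = w0 - 0.01 * grad_S

print("inner product with teacher displacement (>= 0):",
      (w_student - w0) @ (w_teacher - w0))
print("teacher-task loss at init:    ", task_loss(w0))
print("teacher-task loss of student: ", task_loss(w_student))  # at worst ~equal, typically lower
```

On runs like this the inner product comes out strictly positive and the student's task loss drops slightly, even though the distillation inputs have nothing to do with the task, which is the "transmission doesn't require task-relevant data" point in miniature.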

Also, it's funny that we laugh at xAI when they say stuff like "we anticipate Grok will uncover new physics and technology within 1-2 years", but when an OpenAI employee goes "I wouldn’t be surprised if by next year models will be deriving new theorems and contributing to original math research", that's somehow more credible. Insert the "know the work rules" meme here.

(FWIW, I consider both claims pretty unlikely but not laughably incredible.)

The ‘barely speak English’ part makes the solution worse in some ways but actually makes me give their claims to be doing something different more credence rather than less

I think people are overupdating on that. My impression is that gibberish like this is the default way RL makes models speak, and that they need to be separately fine-tuned to produce readable outputs. E. g., the DeepSeek-R1 paper repeatedly complained about "poor readability" with regards to DeepSeek-R1-Zero (their cold-start no-SFT training run).

Actually This Seems Like A Big Deal

If we

... (read more)

I think this is overall reasonable if you interpret "hard-to-verify" as "substantially harder to verify" and I think this is probably how many people would read this by default

Not sure about this. The kind of "hard-to-verify" I care about is e. g. agenty behavior in real-world conditions. I assume many other people are also watching out for that specifically, and that capability researchers are deliberately aiming for it.

And I don't think the proofs are any evidence for that. The issue is that there exists, in principle, a way to easily verify math proofs: by... (read more)

Singular Learning Theory and Simplex's work (e. g. this), maybe? Cartesian Frames and Finite Factored Sets might also work, but I'm less sure about those.

It's actually pretty hard to come up with agendas in the intersection of "seems like an alignment-relevant topic it'd be useful to popularize" and "has complicated math which would be insightful and useful to visualize/simulate".

  • Natural abstractions, ARC's ELK, Shard Theory, and general embedded-agency theories are currently better understood by starting from the concepts, not the math.
  • Infrabayesianism, O
... (read more)

what would it even mean to have 10^30 times more shrimp than atoms?

Oh, easy, it just implies you're engaging in acausal trade with a godlike entity residing in some universe dramatically bigger than this one. This interpretation introduces no additional questions or complications whatsoever.

I just really don't buy the whole "let's add up qualia" as any basis of moral calculation

Same, honestly. To me, many of these thought experiments seem decoupled from anything practically relevant. But it still seems to me that people often do argue from those abstracted-out frames I'd outlined, and these arguments are probably sometimes useful for establishing at least some agreement on ethics. (I'm not sure what a full-complexity godshatter-on-godshatter argument would even look like (a fistfight, maybe?), and am very skeptical it'd yield any useful results.)

Anyway, it sounds like we mostly figured out what the initial drastic disconnect between our views here was caused by?

2habryka
Yeah, I think so, though not sure. But I feel good stopping here.

I agree that this is a thing people often like to invoke, but it feels to me a lot like people talking about billionaires and not noticing the classical crazy arithmetic errors like

Isn't it the opposite? It's a defence against providing too-low numbers, it's specifically to ensure that even infinitesimally small preferences are elicited with certainty.

Bundling up all "this seems like a lot" numbers into the same mental bucket, and then failing to recognize when a real number is not actually as high as in your hypothetical, is certainly an error one could m... (read more)

4habryka
I agree I probably implied a bit too much contextualization. Like, I agree the post has a utilitarian bent, but man, I just really don't buy the whole "let's add up qualia" as any basis of moral calculation, to the point that I find attempts at trying to create a "pure qualia shrimp" about as confused and meaningless as trying to argue that 7 bees are more important than a human. "qualia" isn't a thing that exists. The only thing that exists are your values in all of their complexity and godshatteredness. You can't make a "pure qualia shrimp", it doesn't make any philosophical sense, pure qualia isn't real. And I agree that maybe the post was imagining some pure qualia juice, and I don't know, maybe in that case it makes sense to dismiss it by doing a reductio ad absurdum on qualia juice, but I don't currently buy it. I think that would both fail to engage with the good parts of the author's position, and also be kind of a bad step in the discourse (like, the previous step was understanding why it doesn't make sense for 7 bees to be more important than a human, for a lot of different reasons and very robustly; and within that discourse, it's actually quite important to understand why 10^100 shrimp might actually be more important than a human, under at least a lot of reasonable sets of assumptions).

I think there is also a real conversation going on here about whether maybe, even if you isolated each individual shrimp into a tiny pocket universe, and you had no way of ever seeing them or visiting the great shrimp rift (a natural wonder clearly greater than any natural wonder on earth), and all you knew for sure was that it existed somewhere outside of your sphere of causal influence, and the shrimp never did anything more interesting than current alive shrimp, whether it would still be worth it to kill a human

Yeah, that's more what I had in mind. Illu... (read more)

2habryka
I agree that this is a thing people often like to invoke, but it feels to me a lot like people talking about billionaires and not noticing the classical crazy arithmetic errors like:  Like, in those discussions people are almost always trying to invoke numbers like "$1 trillion" as "a number so big that the force of the conclusion must be inevitable", but like most of the time they just fail because the number isn't big enough.  If someone was like "man, are you really that confident that a shrimp does not have morally relevant experience that you wouldn't trade a human for a million shrimp?", my response is "nope, sorry, 1 million isn't big enough, that's just really not that big of a number". But if you give me a number a trillion trillion trillion trillion trillion trillion trillion trillion times bigger, IDK, yeah, that is a much bigger number. And correspondingly, for every thought experiment of this kind, I do think there is often a number that will just rip through your assumptions and your tradeoffs. There are just really very very very big numbers.  Like, sure, we all agree our abstractions break here, and I am not confident you can't find any hardening of the abstractions that makes the tradeoff come out in the direction of the size of the number really absolutely not mattering at all, but I think that would be a violation of the whole point of the exercise. Like, clearly we can agree that we assign a non-zero value to a marginal shrimp. We value that marginal shrimp for a lot of different reasons, but like, you probably value it for reasons that do include things like the richness of its internal experience, and the degree to which it differs from other shrimp, and the degree to which it contributes to an ecosystem, and the degree to which it's an interesting object of trade, and all kinds of reasons. Now, if we want to extrapolate that value to 10^100, those things still are there, we can't just start ignoring them.  Like, I would feel more sympathetic t

One can argue it's meaningless to talk about numbers this big, and while I would dispute that, it's definitely a much more sensible position than trying to take a confident stance to destroy or substantially alter a set of things so large that it vastly eclipses in complexity and volume and mass and energy all that has ever or will ever exist by a trillion-fold.

Okay, while I'm hastily backpedaling from the general claims I made, I am interested in your take on the first half of this post. I think there's a difference between talking about an actual situati... (read more)

4habryka
I agree there is something to this, but when actually thinking about tradeoffs that do actually have orders of magnitude of variance in them, which is ultimately where this kind of reasoning is most useful (not 100 orders of magnitude, but you know 30-50 are not unheard of), this kind of abstraction would mostly lead you astray, and so I don't think it's a good norm for how to take thought experiments like this. Like, I agree there are versions of the hypothetical that are too removed, but ultimately, I think a central lesson of scope sensitivity is that having a lot more of something often means drastic qualitative changes that come with that drastic change in quantity. Having 10 flop/s of computation is qualitatively different to having 10^10 flop/s. I can easily imagine someone before the onset of modern computing saying "look, how many numbers do you really need to add in everyday life? What is even the plausible purpose of having 10^10 flop/s available? For what purpose would you need to possibly perform 10 billion operations per second? This just seems completely absurd. Clearly the value of a marginal flop goes to zero long before that. That is more operations than all computers[1] in the world have ever ever done, in all of history, combined. What could possibly be the point of this?"  And of course, such a person would be sorely mistaken. And framing the thought experiment as "well, no, I think if you want to take this thought experiment seriously you should think about how much you would be willing to pay for the 10 billionth operation of the kind that you are currently doing, which is clearly zero. I don't want you to hypothesize some kind of new art forms or applications or computing infrastructure or human culture, which feel like they are not the point of this exercise, I want you to think about the marginal item in isolation" would be pointless. It would be emptying the exercise and tradeoff of any of its meaning. If we ever face a choice like this

No, being extremely overwhelmingly confident about morality such that even if you are given a choice to drastically alter 99.999999999999999999999% of the matter in the universe, you call the side of not destroying it "insane" for not wanting to give up a single human life, a thing we do routinely for much weaker considerations, is insane.

Hm. Okay, so my reasoning there went as follows:

  • Substitute shrimp for rocks. 10^100 rocks would also be an amount of matter bigger than exists in the observable universe, and we presumably should assign a nonzero
... (read more)
8habryka
You should be able to strike out the text manually and get the same-ish effect, or leave a retraction notice. The text being hard to read is intentional so that it really cannot be the case that someone screenshots it or skims it without noticing that it is retracted.

Edit: Nevermind, evidently I've not thought this through properly. I'm retracting the below.


The naïve formulations of utilitarianism assume that all possible experiences can be mapped to scalar utilities lying on the same, continuous spectrum, and that experiences' utility is additive. I think that's an error.

This is how we get the frankly insane conclusions like "you should save 10^100 shrimps instead of one human" or everyone's perennial favorite, "if you're choosing between one person getting tortured for 50 years or some amount of people ... (read more)

3quetzal_rainbow
We can easily ban speeds above 15 km/h for any vehicles except ambulances. Nobody starves to death in this scenario, it's just very inconvenient. We value the convenience lost in this scenario more than the lives lost in our reality, so we don't ban high-speed vehicles.  Ordinal preferences are bad and insane and they are to be avoided. What's really wrong with utilitarianism is that you can't, actually, sum utilities: it's a type error, because utilities are invariant up to affine transform, so what would their sum mean? The problem, I think, is that humans naturally conflate two types of altruism. The first type is caring about other entities' mental states. The second type is "game-theoretic" or "alignment-theoretic" altruism: a generalized notion of what it means to care about someone else's values. Roughly, I think that a good version of the second type of altruism requires you to do fair bargaining in the interests of the entity you are being altruistic towards.  Let's take the "World Z" thought experiment. The problem from the second-type-altruism perspective is that the total utilitarian gets very large utility from this world, while all inhabitants of this world, by premise, get very small utility per person, which is an unfair division of gains.  One may object: why not create entities who think that a very small share of gains is fair? My answer is that if an entity can be satisfied with an infinitesimal share of gains, it can also be satisfied with an infinitesimal share of anthropic measure, i.e., non-existence, and it's more altruistic to look for more demanding entities to fill the universe with. My general problem with animal welfare from the bargaining perspective is that most animals probably don't have sufficient agency to have any sort of representative in bargaining. We can imagine a CEV of shrimp which is negative utilitarian and wants to kill all shrimp, or positive utilitarian which thinks that even very painful existence is worth it, or a CEV that prefers shrimp swimming in heroin, or something human
4Garrett Baker
Ok, but if you don't drive to the store one day to get your chocolate, then that is not a major pain for you, yes? Why not just decide that next time you want chocolate at the store, you're not going to go out and get it because you may run over a pedestrian? Your decision there doesn't need to impact your other decisions. Then you ought to keep on making that choice until you are right on the edge of those choices adding up to a first-tier experience, but certainly below. This logic generalizes. You will always be pushing the lower tiers of experience as low as they can go before they enter the upper-tiers of experience. I think the fact that your paragraph above is clearly motivated reasoning here (instead of "how can I actually get the most bang for my buck within this moral theory" style reasoning) shows that you agree with me (and many others) that this is flawed.
6Nick_Tarleton
Besides uncertainty, there's the problem of needing to pick cutoffs between tiers in a ~continuous space of 'how much effect does this have on a person's life?', with things slightly on one side or the other of a cutoff being treated very differently. I agree with the intuition that this is important, but I think that points toward just rejecting utilitarianism (as in utility-as-a-function-purely-of-local-experiences, not consequentialism).
-2habryka
Huh, I expected better from you. No, it is absolutely not insane to save 10^100 shrimp instead of one human! I think the case for insanity for the opposite is much stronger! Please, actually think about how big 10^100 is. We are talking about more shrimp than atoms in the universe. Trillions upon trillions of shrimp more than atoms in the universe.  This is a completely different kind of statement than "you should trade off seven bees against a human".  No, being extremely overwhelmingly confident about morality such that even if you are given a choice to drastically alter 99.999999999999999999999% of the matter in the universe, you call the side of not destroying it "insane" for not wanting to give up a single human life, a thing we do routinely for much weaker considerations, is insane. The whole "tier" thing obviously fails. You always end up dominated by spurious effects on the highest tier. In a universe with any appreciable uncertainty you basically just ignore any lower tiers, because you can always tell some causal story of how your actions might infinitesimally affect something, and so you completely ignore it. You might as well just throw away all morality except the highest tier, it will never change any of your actions.
6ryan_greenblatt
It's worth noting that everything funges: some large number of experiences of eating a chocolate bar can be exchanged for avoiding extreme human suffering or death. So, if you lexicographically put higher weight on extreme human suffering or death, then you're willing to make extreme tradeoffs (e.g. 10^30 chocolate bar experiences) in terms of mundane utility for saving a single life. I think this easily leads to extremely unintuitive conclusions, e.g. you shouldn't ever be willing to drive to a nice place. See also Trading off Lives. I find your response to this sort of argument under "Relevance: There's reasoning that goes" in the footnote very uncompelling as it doesn't apply to marginal impacts.

Incidentally, your Intelligence as Privilege Escalation is pretty relevant to that picture. I had it in mind when writing that.

Not necessarily. If humans don't die or end up depowered in the first few weeks of it, it might instead be a continuous high-intensity stress state, because you'll need to be paying attention 24/7 to constant world-upturning developments, frantically figuring out what process/trend/entity you should be hitching your wagon to in order to not be drowned by the ever-rising tide, with the correct choice dynamically changing at an ever-increasing pace.

"Not being depowered" would actually make the Singularity experience massively worse in the short term, precise... (read more)

5S. Alex Bradt
This comment has been tumbling around in my head for a few days now. It seems to be both true and bad. Is there any hope at all that the Singularity could be a pleasant event to live through?

It does sound like it may be a new and in a narrow sense unexpected technical development

I buy that, sure. I even buy that they're as excited about it as they present, that they believe/hope it unlocks generalization to hard-to-verify domains. And yes, they may or may not be right. But I'm skeptical on priors/based on my model of ML, and their excitement isn't very credible evidence, so I've not moved far from said priors.

3Amalthea
Got it! I'm more inclined to generally expect that various half-decent ideas may unlock surprising advances (for no good reason in particular), so I'm less skeptical that this may be true.  Also, while math is of course easy to verify, assuming they haven't significantly used verification in the training process, it makes their claims more reasonable.    

Oh, yeah, he's not superintelligence-pilled or anything. I was implicitly comparing with a relatively low baseline, yes.

Honestly, that thread did initially sound kind of copium-y to me too, which I was surprised by, since his AI takes are usually pretty good[1] and level-headed. But it makes much more sense under the interpretation that this isn't him being in denial about AI performance, but him undermining OpenAI in response to them defecting against IMO. That's why he's pushing the "this isn't a fair human-AI comparison" line.

  1. ^

    Edit: For someone who doesn't "feel the ASI", I mean.

5Amalthea
I would not characterize Tao's usual takes on AI as particularly good (unless you compare with a relatively low baseline). He's been overall pretty conservative and mostly stuck to reasonable claims about current AI. So there's not much to criticize in particular, but it has come at the cost of him not appreciating the possible/likely trajectories of where things are going, which I think misses the forest for the trees. 

The claim I'm squinting at real hard is this one:

We developed new techniques that make LLMs a lot better at hard-to-verify tasks. 

Like, there's some murkiness with them apparently awarding gold to themselves instead of IMO organizers doing it, and with that other competitive-programming contest at which presumably-the-same model did well being OpenAI-funded. But whatever, I'm willing to buy that they have a model that legitimately achieved roughly this performance (even if a fairer set of IMO judges would've docked points to slightly below the unimpor... (read more)

6ryan_greenblatt
When Noam Brown says "hard-to-verify", I think he means that natural language IMO proofs are "substantially harder to verify": he says "proofs are pages long and take experts hours to grade". (Yes, there are also things which are much harder to verify like things that experts strongly disagree about after years of discussion. Also, for IMO problems, "hours to grade" is probably overstated?) Also, I interpreted this as mostly in contrast to cases where outputs are trivial to programmatically verify (or reliably verify with a dumb LLM) in the context of large scale RL. E.g., you can trivially verify the answers to purely numerical math problems (or competitive programming or other programming situations where you have test cases). Indeed, OpenAI LLMs have historically been much better at numerical math problems than proofs, though possibly this gap has now been closed (at least partially). I think this is overall reasonable if you interpret "hard-to-verify" as "substantially harder to verify" and I think this is probably how many people would read this by default. I don't have a strong view about whether this method will actually generalize to other cases where experts can verify things with high agreement in a few hours. (Noam Brown doesn't say anything about competitive programming, so I'm not sure why you mentioned that. Competitive programming is trivial to verify.)
2Amalthea
Sure, math is not an example of a hard-to-verify task, but I think you're getting unnecessarily hung up on these things. It does sound like it may be a new and in a narrow sense unexpected technical development, and it's unclear how significant it is. I wouldn't try to read into their communications much more.

Silently sponsoring FrontierMath and receiving access to the question sets, and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort

IIRC, that worse performance was due to using a worse/less adapted agency scaffold, rather than OpenAI making the numbers up or engaging in any other egregious tampering. Regarding ARC-AGI, the December-2024 o3 and the public o3 are indeed entirely different models, but I don't think it implies the December one was tailored for ARC-AGI.

I'm not saying ... (read more)

3lwreader132
Are you sure? I'm pretty sure that was cited as *one* of the possible reasons, but not confirmed anywhere. I don't know if some minor scaffolding differences could have that much of an effect on the results (-15%?) in a math benchmark, but if they did, that should have been accounted for in the first place. I don't think other models were tested with scaffolds specifically engineered for them getting a higher score. As per Arc Prize and what they said OpenAI told them, the December version ("o3-preview", as Arc Prize named it) had a compute tier above that of any publicly released model. Not only that, they say that the public version of o3 didn't undergo any RL for ARC-AGI, "not even on the train set". That seems suspicious to me, because once you train a model on something, you can't easily untrain it; as per OpenAI, the ARC-AGI train set was "just a tiny fraction of the o3 train set" and, once again, the model used for evaluations is "fully general". This means that either o3-preview was trained on the ARC-AGI train set somewhere close to the end of the training run and OpenAI was easily able to load an earlier checkpoint to undo that, then not train it on that again for unknown reasons, OR that the public version of o3 was retrained from scratch/a very early checkpoint, then again, not trained on the ARC-AGI data again for unknown reasons, OR that o3-preview was somehow specifically tailored towards ARC-AGI. The latter option seems the most likely to me, especially considering the custom compute tier used in the December evaluation.

I'd guess it has something to do with whatever they're using to automatically evaluate the performance in "hard-to-verify domains". My understanding is that, during training, those entire proofs would have been the final outputs which the reward function (or whatever) would have taken in and mapped to training signals. So their shape is precisely what the training loop optimized – and if so, this shape is downstream of some peculiarities on that end, the training loop preferring/enforcing this output format.

Pretty much everybody is looking into test-time compute and RLVR right now. How come (seemingly) nobody else has found out about this "new general-purpose method" before OpenAI?

Well, someone has to be the first, and they got to RLVR itself first last September.

OpenAI has been shown to not be particularly trustworthy when it comes to test and benchmark results

They have? How so?

1lwreader132
Silently sponsoring FrontierMath and receiving access to the question sets, and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort. Also whatever happened with their irreproducible ARC-AGI results and them later explicitly confirming that the model that Arc Prize got access to in December was different from the released versions, with different training and a special compute tier, despite OpenAI employees claiming that the version of o3 used in the evaluations was fully general and not tailored towards specific tasks. Sure, but I'm just quite skeptical that it's specifically the lab known for endless hype that does. Besides, a lot less people were looking into RLVR at the time o1-preview was released, so the situations aren't exactly comparable.

Eh. Scaffolds that involve agents privately iterating on ideas and then outputting a single result are a known approach, see e. g. this, or Deep Research, or possibly o1 pro/o3 pro. I expect it's something along the same lines, except with some trick that makes it work better than ever before... Oh, come to think of it, Noam Brown did have that interview I was meaning to watch, about "scaling test-time compute to multi-agent civilizations". That sounds relevant.

I mean, it can be scary, for sure; no way to be certain until we see the details.

Misunderstood the resolution terms. ARC-AGI-2 submissions that are eligible for prizes are constrained as follows:

Unlike the public leaderboard on arcprize.org, Kaggle rules restrict you from using internet APIs, and you only get ~$50 worth of compute per submission. In order to be eligible for prizes, contestants must open source and share their solution and work into the public domain at the end of the competition.

Grok 4 doesn't count, and whatever frontier model beats it won't count either. The relevant resolution criterion for frontier model performanc... (read more)

Well, that's mildly unpleasant.

gemini-2.5-pro (31.55%)

But not that unpleasant, I guess. I really wonder what people think when they see a benchmark on which LLMs get 30%, and then confidently say that 80% is "years away". Obviously if LLMs already get 30%, it proves they're fundamentally capable of solving that task[1], so the benchmark will be saturated once AI researchers do more of the same. Hell, Gemini 2.5 Pro apparently got 5/7 (71%) on one of the problems, so clearly outputting 5/7-tier answers to IMO problems was a solved problem, so an LLM model g... (read more)

4Rafael Harth
Agreed, I don't really get how this could be all that much of an update. I think the cynical explanation here is probably correct, which is that most pessimism is just vibes based (as well as most optimism).
3Aaron Staley
Note that the likely known SOTA was even higher than 30%. Google never released Gemini 2.5 Pro Deep Think, which they claimed scored 49% on the USAMO (vs. 34.5% for Gemini 2.5-05-06). A little hard to convert this to an implied IMO score (especially because matharena.ai has Gemini 2.5 oddly having a significantly lower USAMO score for the June model (24%), though a similar IMO score (~31.5%)), but my guess is Deep Think would get somewhere between 37% and 45% on the IMO. 81% remains a huge jump of course. Perhaps, perhaps not. Substantial weight was on the "no one bothers" case - no one was reporting such high scores on the USAMO (pretty similar difficulty to IMO) and the market started dropping rapidly after the USAMO date. Note that we were still at 50% odds of IMO gold a week ago -- but the lack of news of anyone trying drove it down to ~26%. Interestingly, I can find write-ups roughly predicting order of AI difficulty. Looking at gemini-2.5 pro's result so far, using alphageometry would have guaranteed problem 2, so assuming Pro Deep Think only boosted performance on the non-geometric problems, we'd be at a 58% using Deep Think + alphageometry, giving Bronze and close to Silver. I think it was reasonable to assume an extra 4+ months (2 months timeline, 2 months labs being ahead of release) + more compute would have given the 2 more points to get Silver. What is surprising is that a generalist LLM got better at combinatorics (problem 1) and learned to solve geometry problems well. I'm neither an AI nor math competition expert, so can't opine whether this is a qualitative gain or just an example of a company targeting these specific problems (lots of training on math + lots of inference).
3Gurkenglas
You sold, what changed your mind?

the bottom 60% or so grift and play status games, but probably weren’t going to contribute much anyway

I disagree with this reasoning. A well-designed system with correct incentives would co-opt these people's desire to grift and play status games for the purposes of extracting useful work from them. Indeed, setting up game-theoretic environments in which agents with random or harmful goals all end up pointed towards some desired optimization target is largely the purpose of having "systems" at all. (See: how capitalism, at its best, harnesses people's self... (read more)

4Cole Wyeth
It’s not a perfectly designed system, but it’s still possible to benefit from it if you want a few years to do research. 
8leogao
well, in academia, if you do quality work anyways and ignore incentives, you'll get a lot less funding to do that quality work, and possibly perish. unfortunately, academia is not a sufficiently well designed system to extract useful work out of grifters.

Tertiarily relevant annoyed rant on terminology:

I will persist in using "AGI" to describe the merely-quite-general AI of today, and use "ASI" for the really dangerous thing that can do almost anything better than humans can, unless you'd prefer to coordinate on some other terminology.

I don't really like referring to The Thing as "ASI" (although I do it too), because I foresee us needing to rename it from that to "AGSI" eventually, same way we had to move from AI to AGI.

Specifically: I expect that AGI labs might start training their models to be superhuman ... (read more)

My takes on those are:

The very first example shows that absolutely arbitrary things (e.g. arbitrary green lines) can be "natural latents". Does it mean that "natural latents" don't capture the intuitive idea of "natural abstractions"?

I think what's arbitrary here isn't the latent, but the objects we're abstracting over. They're unrelated to anything else, useless to reason about.

Imagine, instead, if Alice's green lines were copied not just by Bob, but by a whole lot of artists, founding an art movement whose members drew paintings containing this specific ... (read more)

Ask Omega to introduce you to one the next time it abducts you to a decision-theory experiment.

The most impressed person in early days was Pliny? [...] I don’t know what that means.

Best explanation for this I've seen

 

We’ve created the obsessive toxic AI companion from the famous series of news stories ‘increasing amounts of damage caused by obsessive toxic AI companies.’

This is real life

I know this isn't the class of update any of us ever expected to make, and I know it's uncomfortable to admit this, but I think we should stare at the truth unflinching: the genre of reality is not "hard science fiction", but "futuristic black comedy".

Ask any So... (read more)

3sanxiyn
Do you have any Solomonoff inductor you know? I don't, and I would like an introduction.

Counterarguments:

  • "I’ll assume 10,000 people believe chatbots are God based on the first article I shared" basically assumes the conclusion that it's unimportant. Perhaps instead all 2.25 million delusion-prone LLM users are having their delusions validated and exacerbated by LLMs? After all, their delusions are presumably pretty important to their lives, so there's a high chance they talked to an LLM about them at some point, and perhaps after that they all keep talking.
    • I mean, I also expect it's actually very few people (at least so far), but we don't rea
... (read more)
4RussellThor
Yes, it's too early to tell what the net effect will be. I am following the digital health/therapist product space and there are a lot of chatbots focused on CBT-style interventions. Preliminary indications say they are well received. I think a fair perspective on the current situation is to compare GenAI to previous AI. The Facebook-style algorithms have done pretty massive mental harm. GenAI LLMs at present are not close to that impact. In the future it depends a lot on how companies react - if mass LLM delusion is a thing then I expect LLMs can be trained to detect and stop that, if the will is there. Especially a different flavor of LLM, perhaps. It's clear to me that the majority of social media harm could have been prevented in a different competitive environment. In the future, I am more worried about LLMs being deliberately used to oppress people; NK could be internally invincible if everyone wore ankle-bracelet LLM listeners, etc. We also have yet to see what AI companions will do - that has the potential to cause massive disruption too, and you can't put in a simple check to claim it's failed. I am not so sure that calling LLMs not at all aligned because of this issue is fair. If they are not capable enough then they won't be able to prevent such harm and will appear misaligned. If they are capable of detecting such harm and stopping it, but companies don't bother to put in automatic checks, then yes they are misaligned.