James Payor — LessWrong

I think about AI alignment; send help.

I'm also on twitter. More links on my homepage payor.io.

I think about AI alignment; send help.

I'm also on twitter. More links on my homepage payor.io.

I really should have something short to say, that turns the whole argument on its head, given how clear-cut it seems to me. I don't have that yet, but I do have some rambly things to say.

I basically don't think overhangs are a good way to think about things, because the bridge that connects an "overhang" to an outcome like "bad AI" seems flimsy to me. I would like to see a fuller explication some time from OpenAI (or a suitable steelman!) that can be critiqued. But here are some of my thoughts.

The usual argument that leads from "overhang" to "we all die" has some imaginary other actor who is scaling up their methods with abandon at the end, killing us all because it's not hard to scale and they aren't cautious. This is then used to justify scaling up your own method with abandon, hoping that we're not about to collectively fall off a cliff.

For one thing, the hype and work being done now is making this problem a lot worse at all future timesteps. There was (and still is) a lot people need to figure out regarding effectively using lots of compute. (For instance, architectures that can be scaled up, training methods and hyperparameters, efficient compute kernels, putting together datacenters and interconnect, data, etc etc.) Every chipmaker these days has started working on things with a lot of memory right next to a lot compute with a tonne of bandwidth, tailored to these large models. These are barriers-to-entry that it would have been better to leave in place, if one was concerned with rapid capability gains. And just publishing fewer things and giving out fewer hints would have helped.

Another thing: I would take the whole argument as being more in good-faith if I saw attempts being made to scale up anything other than capabilities at high speed, or signs that made it seem at all likely that "alignment" might be on track. Examples:

A single alignment result that was supported by a lot of OpenAI staff. (Compare and contrast the support that the alignment team's projects get to what a main training run gets.)
Any focus on trying to claw cognition back out of the giant inscrutable floating-point numbers, into a domain easier to understand, rather than pouring more power into the systems that get much harder to inspect as you scale them. (Failure to do this suggests OpenAI and others are mostly just doing what they know how to do, rather than grappling with navigating us toward better AI foundations.)
Any success in understanding how shallow vs deep the thinking of the LLMs is, in the sense of "how long a chain of thoughts/inferences can it make as it composes dialogue", and how this changes with scale. (Since the whole "LLMs are safer" thing relies on their thinking being coupled to the text they output; otherwise you're back in giant inscrutable RL agent territory)
The delta between "intelligence embedded somewhere in the system" and "intelligence we can make use of" looking smaller than it does. (Since if our AI gets to use of more of its intelligence than us, and this gets worse as we scale, this looks pretty bad for the "use our AI to tame the AI before it's too late" plan.)

Also I can't make this point precisely, but I think there's something like capabilities progress just leaves more digital fissile material lying around the place, especially when published and hyped. And if you don't want "fast takeoff", you want less fissile material lying around, lest it get assembled into something dangerous.

Finally, to more directly talk about LLMs, my crux for whether they're "safer" than some hypothetical alternative is about how much of the LLM "thinking" is closely bound to the text being read/written. My current read is that they're more like doing free-form thinking inside, that tries to concentrate mass on right prediction. As we scale that up, I worry that any "strange competence" we see emerging is due to the LLM having something like a mind inside, and less due to it having accrued more patterns.

Isn't there a third way out? Name the circumstances under which your models break down.

e.g. "I'm 90% confident that if OpenAI built AGI that could coordinate AI research with 1/10th the efficiency of humans, we would then all die. My assessment is contingent on a number of points, like the organization displaying similar behaviour wrt scaling and risks, cheap inference costs allowing research to be scaled in parallel, and my model of how far artificial intelligence can bootstrap. You can ask me questions about how I think it would look if I were wrong about those."

I think it's good practice to name ways your models can breakdown that you think are plausible, and also ways that your conversational partners may think are plausible.

e.g. even if I didn't think it would be hard for AGI to bootstrap, if I'm talking to someone for whom that's a crux, it's worth laying out that I'm treating that as a reliable step. It's better yet if I clarify whether it's a crux for my model that bootstrapping is easy. (I can in fact imagine ways that everything takes off even if bootstrapping is hard for the kind of AGI we make, but these will rely more on the human operators continuing to make dangerous choices.)

Also, here's a proof that a bot is never exploited. It only cooperates when its partner $B$ provably cooperates.

First, note that $A \to □ A$ , i.e. if $A$ cooperates it provably cooperates. (Proof sketch: $A \leftrightarrow □ X \to □ □ X \leftrightarrow □ A$ .)

Now we show that $A \to □ B$ (i.e. if $A$ chooses to cooperate, its partner is provably cooperating):

We get $A \to (□ □ A \to □ B)$ by distributing.
We get $A \to □ □ A$ by applying internal necessitation to $A \to □ A$ .
By (1) and (2), $A \to □ B$ .

(PS: we can strengthen this to $A \leftrightarrow □ B$ , by noticing that $□ B \to □ (□ A \to B) \leftrightarrow A$ .)

What interesting things do y'all think are up with AI lab politics these days? Also why is everyone (or just many people in these circle) going to Anthropic now?

Any changes in how things seem for control plans based on vibes and awareness present in more recent models? (GPT-5 series may not count here; I'm mostly interested in visiblity on the next generation that are coming, of which I think Opus 4.5 is a preview but I'm fairly unsure.)

Anything generally striking about how things look in the landscape and models versus a year ago?

I would contest the frame here. In particular I think it won't hold up because things won't stay as capital-bound as they are now, and that seriously messes with the continuity required for today's investments to maintain their relative portion of the pie. What do you think of this part?

(EDIT: Okay I think you are referencing this sort of thing with "most people can't invest in assets that grow at the average rate", but I still take issue with some picture in which everything is apportioned to assets that grow or something like that.)

To expand: I think today's capital earns its gains mostly because it is a required input, which thereby gives it a lot of negotiating power. And I think this falls off pretty sharply in time, in the limit of technological development.

(I should say, so long as it remains the case that we need lots of coordinated work to build compute engines to run intelligence, "capital" seems to remain meaningful in the old ways. But a lot of things seem possible with a nanofactory and the right information about how to use it, and at a point like that eld-capital isn't a relevant bottleneck.)

And while we can imagine some way that neo-capital continues to project its force in a profit-grabbing way in the future, I think the mechanism is pretty different than today, and probably has to involve more literal force, and is unlikely to have solid continuity with today-capital.

I would like to thank you Anna for this write-up! I love the thoughtfulness and ideas on institutional design, relationship to donors and EA, and others. Much of this post was quite viscerally relaxing for me to read, in a "finally it feels possible to jointly understand X and Y and Z" kind of way, which I'm quite glad for.

On the overall note of the post:

I claim that most women have a “deep” preference for nonconsent in dating/mating. It’s not just a kink; from the first approach to a date to sex, women typically want to not have to consent to what’s happening.

...I would like to say that, in my capacity as just-some-guy, this has not been my lived experience interacting with women, and I disagree with a lot of the framing in this post. (I don't mean to contest what you've experienced; I would interpret these events pretty differently though.)

I'm writing this mostly because I'm finding it horrifying that it appears the LessWrong consensus reality is treating "nonconsent" stuff as a cornerstone of how it views women. This somehow seems to increasingly to be the case, seems wrong to me, and contributes to the erasure of some things I consider important.

Meta note: Thanks for your comment! I failed to reply to this for a number of days, since I was confused about how to do that in the context of this post. Still though I think it's relevant about probabilistic reasoning, and I've now offered my thoughts in the other replies.

Anyhow, regarding probability distributions, there's some philosophical difficulty in my opinion about "grounding". Specifically, what reason should I have to trust that the probability distribution is doing something sensible around my safety questions of interest? How did we construct things such that it was?

The best approach I'm aware of to building a computable (but not practical) distribution with some "grounding" results is logical induction / Garrabrant induction. They come with have a self-trust result of the form that logical inductors will, across time, converge to predicting their future selves' probabilities agree with their current probabilities. If I understand correctly, this includes limiting toward predicting a conditional probability for an event $X$ if we are given that the future inductor assigns probability $p$ .

...however, as I understand, there's still scope for any probability distributions we try to base on logical inductors to be "ungrounded", in that we only have a guarantee that ungrounded/adversarial perturbations must be "finite" across the limit to infinity.

Here is something more technical on the matter that I alas haven't made the personal effort to read through: https://www.lesswrong.com/posts/5bd75cc58225bf067037556d/logical-inductor-tiling-and-why-it-s-hard

In a more realistic and complicated setting, we may definitely want to be obtaining a high probability under some distribution we trust to be well-grounded, as our condition for a chain of trust. In terms of the technical difficulty I'm interested in working through, I think it should be possible to get satisfying results about proving that another proof system is correct, and whatnot, without needing to invoke probability distributions. To the extent that you can make things work with probabilistic reasoning, I think they can also be made to work in a logic setting, but we're currently missing some pieces.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments