I like the thrust of this paper, but I feel that it overstates how robust the safety properties will be, both by drawing an overly sharp distinction between agentic and non-agentic systems and by not really engaging with the strongest counterexamples.
To give some examples from the text:
A chess-playing AI, for instance, is goal-directed because it prefers winning to losing. A classifier trained with log likelihood is not goal-directed, as that learning objective is a natural consequence of making observations
But I could easily train an AI which simply classif...
Thank you for the very detailed comment! I’m pretty sympathetic to a lot of what you’re saying, and mostly agree with you about the three properties you describe. I also think we ought to do some more spelling-out of the relationship between gradual disempowerment and takeover risk, which isn’t very fleshed out in the paper. A decent part of why I’m interested in gradual disempowerment is that I think it increases takeover risk, in a similar but more general way to how race dynamics increase takeover risk.
I’m going to try to respond to the specific points you lay...
The writing here was definitely influenced by Lewis (we quote TAoM in footnote 6), although I think the Choice Transition is broader and less categorically negative.
For instance, in Lewis's criticism of the potential abolition, he writes things like:
...The old dealt with its pupils as grown birds deal with young birds when they teach them to fly; the new deals with them more as the poultry-keeper deals with young birds— making them thus or thus for purposes of which the birds know nothing. In a word, the old was a kind of propagation—men transmitting manh
Could you expand on what you mean by 'less automation'? I'm taking it to mean some combination of 'bounding the space of controller actions more', 'automating fewer levels of optimisation', 'more of the work done by humans', and maybe 'only automating easier tasks', but I can't quite tell which of these you're intending or how they fit together.
(Also, am I correctly reading an implicit assumption here that any attempts to do automated research would be classed as 'automated ai safety'?)
When I read this post I feel like I'm seeing four different strands bundled together:
1. Truth-of-beliefs as fuzzy or not
2. Models versus propositions
3. Bayesianism as not providing an account of how you generate new hypotheses/models
4. How people can (fail to) communicate with each other
I think you hit the nail on the head with (2) and am mostly sold on (4), but am sceptical of (1) - similar to what several others have said, it seems to me like these problems don't appear when your beliefs are about expected observations, and only appear when you start to ...
Strongly agree that active inference is underrated both in general and specifically for intuitions about agency.
I think the literature does suffer from ambiguity over where it's descriptive (ie an agent will probably approximate a free energy minimiser) vs prescriptive (ie the right way to build agents is free energy minimisation, and anything that isn't that isn't an agent). I am also not aware of good work on tying active inference to tool use - if you know of any, I'd be pretty curious.
I think the viability thing is maybe slightly fraught - I expect it'...
Interesting! I think one of the biggest things we gloss over in the piece is how perception fits into the picture, and this seems like a pretty relevant point. In general the space of 'things that give situational awareness' seems pretty broad and ripe for analysis.
I also wonder how much efficiency gets lost by decoupling observation and understanding - at least in humans, it seems like we have a kind of hierarchical perception where our subjective experience of 'looking at' something has already gone through a few layers of interpretation, giving us basically no unadulterated visual observation, presumably because this is more efficient (maybe in particular faster?).
This seems like a misunderstanding / not my intent. (Could you maybe quote the part that gave you this impression?)
I believe Dusan was trying to say that davidad's agenda limits the planner AI to only writing provable mathematical solutions. To expand, I believe that compared to what you briefly describe, the idea in davidad's agenda is that you don't try to build a planner that's definitely inner aligned, you simply have a formal verification system that ~guarantees what effects a plan will and won't have if implemented.
Oh interesting! I just had a go at testing it on screenshots from a parallel conversation and it seems like it incorrectly interprets those screenshots as also being of its own conversation.
So it seems like 'recognising things it has said' is doing very little of the heavy lifting and 'recognising its own name' is responsible for most of the effect.
I'll have a bit more of a play around and probably put a disclaimer at the top of the post some time soon.
The 'reward being chance of winning' stuff changes a bit about how the model generalises if it's playing a game with randomness and conditioned on the upper end: it biases the model towards 'expecting risk to pay off'. E.g. if the model plays a 1-step game where it either banks 1 point or gets a 1% chance of 10 points, then conditioning on it getting 10 points will cause it to take the lower-EV action (EV 0.1 rather than 1). But this isn't super relevant.
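To make that concrete, here's a minimal sketch of the 1-step game - the setup and numbers are the ones from the example above, but the simulation itself is just my own illustration, not anything from the paper:

```python
import random

# Sketch of the 1-step game: "safe" banks 1 point,
# "risky" gives 10 points with probability 0.01 and 0 otherwise.
random.seed(0)
N = 100_000
safe = [1 for _ in range(N)]
risky = [10 if random.random() < 0.01 else 0 for _ in range(N)]

# Unconditional expected values: safe ~ 1.0, risky ~ 0.1.
print(sum(safe) / N, sum(risky) / N)

# Return-conditioned policy: condition on achieving a return of 10.
# Only the risky action ever produces a return of 10, so conditioning on the
# upper end of the return distribution picks the lower-EV action every time.
hits_safe = sum(1 for r in safe if r == 10)
hits_risky = sum(1 for r in risky if r == 10)
print(hits_risky / (hits_safe + hits_risky))  # -> 1.0
```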
In general I am pretty confused about how models would generalise strategies out of distribution, and this seems like at least...
Re generalisation - decision transformers don't really have strategies per se; they pick actions moment to moment, and might be systematically miscalibrated about what they'll do in future timesteps. It is true that they'll have some chance at every timestep, which will add up over time, but if you were actually trying to implement this then you could do things like lowering the temperature, which shouldn't affect performance otherwise.
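As a rough illustration of the temperature point (a sketch under my own assumptions, not anything from the decision transformer setup itself): lowering the sampling temperature sharpens the per-timestep action distribution without changing which action the model ranks highest, so it cuts down the per-timestep chance of deviating while leaving the 'preferred' behaviour alone.

```python
import numpy as np

# Hypothetical per-timestep action logits; the names and numbers are mine.
def sample_action(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    scaled = logits / temperature          # smaller temperature -> sharper distribution
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])
for t in (1.0, 0.1):
    picks = [sample_action(logits, t, rng) for _ in range(1000)]
    print(t, np.bincount(picks, minlength=3) / 1000)
# At t=1.0 the off-argmax actions still get sampled fairly often;
# at t=0.1 the argmax action is chosen almost every time.
```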
Re higher conditioning - I think this shouldn't be true. For the sake of argument we can reframe it as a binary outcome, where the...
Thanks! Yeah this isn't in the paper, it's just a thing I'm fairly sure of which probably deserves a more thorough treatment elsewhere. In the meantime, some rough intuitions would be:
A slightly sideways argument for interpretability: it's a really good way to introduce the importance and tractability of alignment research.
In my experience it's very easy to explain to someone with no technical background that
My main takeaway from this post is that it's important to distinguish between sending signals and trying to send signals, because the latter often leads to goodharting.
It's tricky, though, because obviously you want to be paying attention to what signals you're giving off, and how they differ from the signals you'd like to be giving off, and sometimes you do just have to try to change them.
For instance, I make more of an effort now than I used to, to notice when I appreciate what people are doing, and tell them, so that they know I care. And I think ...
My main takeaway from this post is that it's important to distinguish between sending signals and trying to send signals, because the latter often leads to goodharting.
That is a wonderful summary.
...For instance, I make more of an effort now than I used to, to notice when I appreciate what people are doing, and tell them, so that they know I care. And I think this has basically been very good. This is very much not me dropping all effort to signal.
But I think what you're talking about is very applicable here, because if I were just trying to maximise th
if you think timelines are short for reasons unrelated to biological anchors, I don't think Bio Anchors provides an affirmative argument that you should change your mind.
...Eliezer: I wish I could say that it probably beats showing a single estimate, in terms of its impact on the reader. But in fact, writing a huge careful Very Serious Report like that and snowing the reader under with Alternative Calculations is probably going to cause them to give more authority to the whole thing. It's all very well to note the Ways I Could Be Wrong
The Bio Anchors report is intended as a tool for making debates about AI timelines more concrete, for those who find some bio-anchor-related bound helpful (e.g., some think we should lower bound P(AGI) at some reasonably high number for any year in which we expect to hit a particular kind of "biological anchor"). Ajeya's work lengthened my own timelines, because it helped me understand that some bio-anchor-inspired arguments for shorter timelines didn't have as much going for them as I'd thought; but I think it may have shortened some other folks'.
(The pre...
The belief that people can only be morally harmed by things that causally affect them is not universally accepted. Personally I intuitively would like my grave to not be desecrated, for instance.
I think we have lots of moral intuitions that have become less coherent as science has progressed. But if my identical twin started licensing his genetic code to make human burgers for people who wanted to see what cannibalism was like, I would feel wronged.
I'm using pretty charged examples here, but the point I'm trying to convey is that there are a lot of moral l...
You ask a number of good questions here, but the crucial point to me is that they are still questions. I agree it seems, based on my intuitions of the answers, like this isn't the best path. But 'how much would it cost' and 'what's the chance a clone works on something counterproductive' are, to me, not an argument against cloning, but rather arguments for working out how to answer those questions.
Also very ironic if we can't even align clones and that's what gets us.
I think there are extra considerations to do with the clone's relation to von Neumann. Plausibly, it might be wrong to clone him without his consent, which we can now no longer get. And the whole idea that you might have a right to your likeness, identity, image, and so on becomes much trickier as soon as you have actually been cloned.
Also there's a bit of a gulf between a parent deciding to raise a child they think might do good and a (presumably fairly large) organisation funding the creation of a child.
I don't have strongly held convictions on these points, but I do think that they're important and that you'd need to have good answers before you cloned somebody.
Ok, I think this is partly fair, but also clearly our moral standards are informed by our society, and in no small part those standards emerge from discussions about what we collectively would like them to be, not just from a genetically hardwired disloyalty sensor.
Put another way: yes, in pressured environments we act on instinct, but those instincts don't exist in a vacuum, and the societal project of working out what they ought to be is quite important and pretty hard, precisely because in the moment where you need to refer to it, you will be acting on System 1.
I'm not sure I'm entirely persuaded. Are you saying that the goal of ethics is to accurately predict what people's moral impulse will be in arbitrary situations?
I think moral impulses have changed with times, and it's notable that some people (Bentham, for example) managed to think hard about ethics and arrive at conclusions which massively preempted later shifts in moral values.
Like, Newton's theories give you a good way to predict what you'll see when you throw a ball in the air, but it feels incorrect to me to say that Newton's goal was to find order in...
Massive conflict of interest: I blog on ghost, know and like the people at ghost, work at a company that moved from substack to ghost, get paid to help people use ghost, and have a couple more COIs in this vein.
But if you're soliciting takes from somebody from wordpress, I think you might also appreciate the case for ghost, which I simply do think is better than substack for most bloggers above a certain size.
Re your cons, ghost:
1 - has a migration team and the ability to do custom routing, so you would be able to migrate your content
3 - supports total...
I'd like to throw out some more bad ideas, with fewer disclaimers about how terrible they are because I have less reputation to hedge against.
Inline Commenting
I very strongly endorse the point that it seems bad that someone can make bad claims in a post, which are then refuted in comments that only get read by people who get all the way to the bottom and read the comments. To me the obvious (wrong) solution is to let people make inline comments. If nothing else, having a good way within comments to point to what part of the post you want to address feels like...
Ah I should emphasise, I do think all of these things could help -- it definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.
The two things I think are (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the mitigations proposed look like they break down with superhuman capabilities. It's hard to tell how much of this is actual disagreement and how much is the paper trying to be concise and approachable, so I'll set that aside for n...