I was trying to argue that the most natural deontology-style preferences we'd aim for are relatively stable if we actually instill them.
Though it's trivial and irrelevant if true obedience is part of it, since that's magic that gets you anything you can describe.
if the way integrity is implemented is at all kinda similar to how humans implement it.
How do humans implement integrity?
...Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the sit
In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would behave pretty reasonably, or would at least note that they wouldn't robustly pursue the interests of the developer. That's not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.
Yes, agreed. The extra machinery and assumptions you describe seem sufficient to make sure nonconsequentialist preferences are passed to a successor.
...I think an a
(Overall I like these posts in most ways, and especially appreciate the effort you put into making a model diff with your understanding of Eliezer's arguments)
Eliezer and some others, by contrast, seem to expect ASIs to behave like a pure consequentialist, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, or his argument that ASI will behave like a utility maximizer.
It feels like you're rounding off Eliezer's words in a way...
I don't really know what you're referring to, maybe link a post or a quote?
whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Citation for this claim? Can you quote the specific passage which supports it?
If you read this post, starting at "The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.", and read the following 20 or so paragraphs, you'll get some idea of 2018!Eliezer's models about imitation agents.
I'll highlight
...If I were going to
Have you personally done the thing successfully with another person, with both of you actually picking up on the other person's hints?
Yes. But usually the escalation happens over weeks or months, over multiple conversations (at least in my relatively awkward nerd experience). So it'd be difficult to notice people doing this. Maybe twice I've been in situations where hints escalated within a day or two, but both were building from a non-zero level of suspected interest. But none of these would have been easy to notice from the outside, except maybe at a couple of moments.
Everyone agrees that sufficiently unbalanced games can allow a human to beat a god. This isn't a very useful fact, since it's difficult to intuit how unbalanced the game needs to be.
If you can win against a god with queen+knight odds, you'll have no trouble reliably beating Leela with the same odds. I'd bet you can't win more than 6 out of 10? $20?
Yeah, I didn't expect that either; I expected earlier losses (although in retrospect that wouldn't make sense, because Stockfish is capable of recovering from bad starting positions if it's up a queen).
Intuitively, over all the games I played, each loss felt different (except for the substantial fraction that were just silly blunders). I think if I learned to recognise blunders in the complex positions I would become a better player in general, rather than just better against LeelaQueenOdds.
Just tried hex, that's fun.
I don't think that'd help a lot. I just looked back at several computer analyses, and the (Stockfish) evaluations of the games all look like this:
This makes me think that Leela is pushing me into a complex position and then letting me blunder. I'd guess that looking at optimal moves in these complex positions would be good training, but probably wouldn't yield easy-to-learn patterns.
I haven't heard of any adversarial attacks, but I wouldn't be surprised if they existed and were learnable. I've tried a variety of strategies, just for fun, and haven't found anything that works except luck. I focused on various ways of forcing trades, and this often feels like it's working but almost never does. As you can see, my record isn't great.
I think I started playing it when I read simplegeometry's comment you linked in your shortform.
It seems to be gaining a lot of ground by exploiting my poor openings. Maybe one strategy would be to memor...
I highly recommend reading the sequences. I re-read some of them recently. Maybe Yudkowsky's Coming of Age is the most relevant to your shortform.
One notable difficulty with talking to ordinary people about this stuff is that often, you lay out the basic case and people go "That's neat. Hey, how about that weather?" There's a missing mood, a sense that the person listening didn't grok the implications of what they're hearing.
I kinda think that people are correct to do this, given the normal epistemic environment. My model is this: Everyone is pretty frequently bombarded with wild arguments and beliefs that have crazy implications. Like conspiracy theories, political claims, spiritual claims, get-ric...
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
I think this might be wrong when it comes to our disagreements, because I don't disagree with this shortform.[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?
As long as "downstream performance" doesn't inclu
If you have an alternate theory of the likely form of first takeover-capable AGI, I'd love to hear it!
I'm not claiming anything about the first takeover-capable AGI, and I'm not claiming it won't be LLM-based. I'm just saying that there's a specific reasoning step that you're using a lot (current tech has property X, therefore AGI has property almost-X) which I think is invalid (when X is entangled with properties of AGI that LLMs don't currently have).
Maybe a slightly insulting analogy (sorry): That type of reasoning looks a lot like bad scifi ideas about...
(A small rant, sorry) In general, it seems you're massively overanchored on current AI technology, to an extent that it's stopping you from clearly reasoning about future technology. One example is the jailbreaking section:
There has been no noticeable trend toward real jailbreak resistance as LLMs have progressed, so we should probably anticipate that LLM-based AGI will be at least somewhat vulnerable to jailbreaks.
You're talking about AGI here. An agent capable of autonomously doing research, playing games with clever adversaries, detecting and patching i...
Good point, I shouldn't have said dishonest. For some reason, while writing the comment, I was thinking of it as deliberately throwing vaguely related math at the viewer and trusting that they won't understand it. But yeah, likely it's just a misunderstanding.
The way we train AIs draws on fundamental principles of computation that suggest any intellectual task humans can do, a sufficiently large AI model should also be able to do. [Universal approximation theorem on screen]
IMO it's dishonest to show the universal approximation theorem. Lots of hypothesis spaces (e.g. polynomials, sinusoids) have the same property. It's not relevant to predictions about how well the learning algorithm generalises. And that's the vastly more important factor for general capabilities.
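To spell out the "same property" claim (informal statements from memory; exact hypotheses vary by version): the universal approximation theorem says that for any continuous $f$ on a compact $K \subset \mathbb{R}^n$ and any $\varepsilon > 0$ there is a one-hidden-layer network $g$ with $\sup_{x \in K} |f(x) - g(x)| < \varepsilon$; the Weierstrass approximation theorem says the same with "network" replaced by "polynomial" on $[a, b]$. Identical logical form, and neither says anything about what a learner trained on finite data will actually find.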
I agree it’s not a valid argument. I’m not sure about ‘dishonest’ though. They could just be genuinely confused about this. I was surprised how many people in machine learning seem to think the universal approximation theorem explains why deep learning works.
If we can clearly tie the argument for AGI x-risk to agency, I think it won't have the same problem
Yeah agreed, and it's really hard to get the implications right here without a long description. In my mind, "entities" didn't trigger any association with agents, but I can see how it would for others.
This thread helped inspire me to write the brief post Anthropomorphizing AI might be good, actually.
I broadly agree that many people would be better off anthropomorphising future AI systems more. I sometimes push for this in arguments, because in my mind man...
This seems rhetorically better, but I think it is implicitly relying on instrumental goals and it's hiding that under intuitions about smartness and human competition. This will work for people who have good intuitions about that stuff, but won't work for people who don't see the necessity of goals and instrumental goals. I like Veedrac's better in terms of exposing the underlying reasoning.
I think it's really important to avoid making arguments that are too strong and fuzzy, like yours. Imagine a person reads your argument and now believes that intuitively...
Nice, you've expressed the generalization argument for expecting goal-directedness really well. Most of the post seems to match my beliefs.
I’m moderately optimistic about blackbox control (maybe 50-70% risk reduction on high-stakes failures?).
I want you to clarify what this means, and try to get at some of the latent variables behind it.
One interpretation is that you mean any specific high-stakes attempt to subvert control measures is 50-70% likely to fail. But if we kept doing approximately the same set-up after this, then an attempt would soon succeed...
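As a rough illustration (my numbers, and assuming roughly independent attempts, which is already generous): if each attempt fails with probability $p$, then the chance that all of $n$ attempts fail is $p^n$; with $p = 0.7$ and $n = 10$ that's about $0.03$, so at least one success becomes roughly 97% likely.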
It's not about building less useful technology; that's not what Abram or Ryan are talking about (I assume). The field of alignment has always been about strongly superhuman agents. You can have tech that is useful and also safe to use; there's no direct contradiction here.
Maybe one weak-ish historical analogy is explosives? Some explosives are unstable, and will easily explode by accident. Some are extremely stable, and can only be set off by a detonator. Early in the industrial chemistry tech tree, you only have access to one or two ways to make explosive...
Can you link to where RP says that?
Do you not see how they could be used here?
This one. I'm confused about what the intuitive intended meaning of the symbol is. Sorry, I see why "type signature" was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe is a boolean fact that is edited? But if so I don't know which fact it is, and I'm confused by the way you described it.
...Because we're talking about priors and their influence, all of
I'm not sure what the type signature of is, or what it means to "not take into account 's simulation". When makes decisions about which actions to take, it doesn't have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to "not take it into account"?
So the way in which the agent "gets its beliefs" about the structure of the decision theory problem is via these logical-counterfactual-conditional operation
I think you've misunderstood me entirely. Usual...
Well my response to this was:
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
But I'll expand: An agent doing that kind of game-theory reasoning needs to model the situation it's in. And to do that modelling it needs a prior. Which might be malign.
Malign agents in the prior don't feel like malign agents in the prior, from the perspective of the agent with the prior. They're just beliefs about the way the world is. You need beliefs in order to choose acti...
Yeah I know that bound, I've seen a very similar one. The problem is that mesa-optimisers also get very good prediction error when averaged over all predictions. So they exist well below the bound. And they can time their deliberately-incorrect predictions carefully, if they want to survive for a long time.
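For concreteness, the version I've seen is roughly (stated loosely from memory): $$\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\Big[\big(M(1 \mid x_{<t}) - \mu(1 \mid x_{<t})\big)^{2}\Big] \;\le\; \frac{\ln 2}{2}\, K(\mu),$$ where $\mu$ is the true environment and $M$ is the mixture, i.e. the total expected squared prediction error is bounded by a constant depending on the environment's complexity. A hypothesis that predicts correctly everywhere except on a handful of strategically timed bits only adds a finite amount to that sum, so the bound never rules it out.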
How does this connect to malign prior problems?
But why would you ever be able to solve the problem with a different decision theory? If the beliefs are manipulating it, it doesn't matter what the decision theory is.
To respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor.
Because we have the prediction error bounds.
Not ones that can rule out any of those things. My understanding is that the bounds are asymptotic or average-case in a way that makes them useless for this purpose. So if a mesa-inductor with a better prior is found first, it'll stick with the mesa-inductor. And if it has goals, it ...
You also want one that generalises well, doesn't make performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is actually about finding more effective approximate models for things in different regimes.
When we compare theories, we don't consi...
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
One thing to keep in mind is that time cut-offs will usually rule out our own universe as a hypothesis. Our universe is insanely compute inefficient.
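(A rough way to see the size of the effect, using a soft Levin-style penalty as a stand-in for a hard cut-off, which is an assumption on my part: score each program $p$ by something like $Kt(p) = |p| + \log_2 t(p)$. A faithful low-level simulation of our universe has a running time $t(p)$ so astronomically large that the $\log_2 t(p)$ term swamps everything else, so it loses to a cheap approximate predictor with a much longer description; a hard time cut-off just excludes it outright.)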
So the "hypotheses" inside your inductor won't actually end up corresponding to what we mean by a scientific hypothesis. The only reason this inductor will work at all is that it's done a brute force search over a huge space of programs until it finds one that works. Plausibly it'll just find a better efficient induction algorithm, with a sane prior.
I'm not sure whether it implies that you should be able to make a task-based AGI.
Yeah I don't understand what you mean by virtues in this context, but I don't see why consequentialism-in-service-of-virtues would create different problems than the more general consequentialism-in-service-of-anything-else. If I understood why you think it's different then we might communicate better.
(Later you mention unboundedness too, which I think should be added to difficulty here)
By unbounded I just meant the kind of task where it's always possible to do better by using...
It could still be a competent agent that often chooses actions based on the outcomes they bring about. It's just that that happens as an inner loop in service of an outer loop which is trying to embody certain virtues.
I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us. Which I think is a good approach! But the open problems for making a task-based AGI sti...
But in practice, agents represent both of these in terms of the same underlying concepts. When those concepts change, both beliefs and goals change.
I like this reason to be unsatisfied with the EUM theory of agency.
One of the difficulties in theorising about agency is that all the theories are flexible enough to explain anything. Each theory is incomplete and vague in some way, so this makes the problem worse, but even when you make a detailed model of e.g. active inference, it ends up being pretty much formally equivalent to EUM.
I think the solution to th...
I think the scheme you're describing caps the agent at moderate problem-solving capabilities. Not being able to notice past mistakes is a heck of a disability.
It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.
But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research"), it feels like we're making shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.
Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.
I guess I shouldn't respond too much in public until you've published the doc, but:
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.
I think we kind of agree here. The cruxes remain: I think the metric for "behave well" won't be good enough for "real" large research acceleration. And "average case" means very little when it leaves room for occasional mistakes (deliberate or not) at moments when they can plausibly be gotten away with. [Edit: Or sabotage, escape, etc.]
Also, yo...
Yep this is the third crux I think. Perhaps the most important.
To me it looks like you're making a wild guess that "prosaic and relatively unenlightened ML research" is a very large fraction of the necessary work for solving alignment, without any justification that I know of?
For all the pathways to solving alignment that I am aware of, this is clearly false. I think if you know of a pathway that mostly involves "prosaic and relatively unenlightened ML research", you should write out this plan, explain why you expect it to work, and then ask OpenPhil to throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?
I'm not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the "core mistake" comment below, and the "faithful simulators" comment is another possibility.
Maybe another relevant thing that looks wrong to me: you will still get slop when you train an AI to look like it's updating its beliefs in an epistemically virtuous way. You'll get outputs that look very epistemically virtuous, but it takes time and expertise to rank them in a way that reflects their actual level of epistemic virtue, just like other kinds of slop...
these are also alignment failures we see in humans.
Many of them have close analogies in human behaviour. But you seem to be implying "and therefore those are non-issues"???
There are many humans (or groups of humans) that, if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.
How is this evidence in favour of your plan ultimately resulting in a solution to alignment???
...but these systems empirically often move in reasonable and socially-beneficial
to the extent developers succeed in creating faithful simulators
There's a crux I have with Ryan which is "whether future capabilities will allow data-efficient long-horizon RL fine-tuning that generalizes well". As of last time we talked about it, Ryan says we probably will, I say we probably won't.
If we have the kind of generalizing ML that we can use to make faithful simulations, then alignment is pretty much solved. We make exact human uploads, and that's pretty much it. This is one end of the spectrum on this question.
There are weaker versions, which I...
My guess is that your core mistake is here:
When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.
Obviously, all agents, having undergone training to look "not egregiously misaligned", will not look egregiousl...
(Some) acceleration doesn't require being fully competitive with humans while deference does.
Agreed. The invention of calculators was useful for research, and the invention of more tools will also be helpful.
I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting new ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.
Maybe some kinds of "safety work", but real alignment involves a human obtaining a deep understanding of intelligence and agency. The path ...
(vague memory from the in person discussions we had last year, might be inaccurate):
jeremy!2023: If you're expecting AI to be capable enough to "accelerate alignment research" significantly, it'll need to be a full-blown agent that learns stuff. And that'll be enough to create alignment problems because data-efficient long-horizon generalization is not something we can do.
joshc!2023: No way, all you need is AI with stereotyped skills. Imagine how fast we could do interp experiments if we had AIs that were good at writing code but dumb in other ways!
...
josh...
In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes?
We can't reduce the domain of the utility function without destroying some information. If we tried to change the domain variables from [g, h, shutdown] to [g, shutdown], we wouldn't get the desired behaviour. Maybe you have a particular translation method in mind?
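As a toy illustration of the kind of information loss I mean (a deliberately simplified stand-in, not the exact construction): suppose $u(g, h, \text{shutdown}) = g$ when shutdown is false and $u = h$ when shutdown is true. Any utility defined only on $(g, \text{shutdown})$ has to assign equal value to outcomes that differ only in $h$, so it can't express the "if shutdown, pursue $h$" half of the goal, and whatever behaviour depended on $h$ is lost in the translation.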
I don't mess up the medical test because true information is instrumentally useful to me, given my goals.
Yep that's what I meant. The goal u is constructed to make information abo...
With regards to the agent believing that it's impossible to influence the probability that its plan passes validation
This is a misinterpretation. The agent has entirely true beliefs. It knows it could manipulate the validation step; it just doesn't want to, because of the conditional shape of its goal. This is a common behaviour among humans: for example, you wouldn't mess up a medical test to make it come out negative, because you need to know the result in order to know what to do afterwards.
I agree that goals like this work well with self-modification and successors. I'd be surprised if Eliezer didn't. My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It's strawmanning. And it isn't supported by any of the links you cite. I think you must have some mistaken assumption about Eliezer's views that is lead...