Jeremy Gillen

I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI. Most of my writing before mid 2023 is not representative of my current views about alignment difficulty.

It's not about building less useful technology, that's not what Abram or Ryan are talking about (I assume). The field of alignment has always been about strongly superhuman agents. You can have tech that is useful and also safe to use, there's no direct contradiction here.

Maybe one weak-ish historical analogy is explosives? Some explosives are unstable, and will easily explode by accident. Some are extremely stable, and can only be set off by a detonator. Early in the industrial chemistry tech tree, you only have access to one or two ways to make explosives. If you're desperate, you use these whether or not they are stable, because the risk-usefulness tradeoff is worth it. A bunch of your soldiers will die, and your weapons caches will be easier to destroy, but that's a cost you might be willing to pay. As your industrial chemistry tech advances, you invent many different types of explosive, and among these choices you find ones that are both stable and effective, because obviously this is better in every way.

Maybe another is medications? As medications advanced, and we gained choice and specificity, we could choose ones that both had low side-effects and were effective. Before that, there was often a choice, and the correct choice was often not to use the medicine unless you were literally dying.

In both these examples, sometimes the safety-usefulness tradeoff was worth it, sometimes not. Presumably, in both cases, people often made the choice not to use unsafe explosives or unsafe medicine, because the risk wasn't worth it.

As it is with these technologies, so it is with AGI. There are a bunch of future paradigms of AGI building. The first one we stumble into doesn't look like one where we can precisely specify what the AGI wants. But if we were able to keep experimenting and understanding and iterating after the first AGI, and we gradually developed dozens of ways of building AGI, then I'm confident we could find one that is just as intelligent and also could have its goals precisely specified.

My two examples above don't quite answer your question, because "humanity" didn't steer away from using them, just individual people at particular times. For examples where all or large sections of humanity steered away from using an extremely useful tech whose risks purportedly outweighed benefits:  Project Plowshare, nuclear power in some countries, GMO food in some countries, viral bioweapons (as far as I know), eugenics, stem cell research, cloning. Also {CFCs, asbestos, leaded petrol, CO2 to some extent, radium, cocaine, heroin} after the negative externalities were well known.

I guess my point is that safety-usefulness tradeoffs are everywhere, and tech development choices that take into account risks are made all the time. To me, this makes your question utterly confused. Building technology that actually does what you want (which is be safe and useful) is just standard practice. This is what everyone does, all the time, because obviously safety is one of the design requirements of whatever you're building.

The main difference between the above technologies and AGI is that AGI is a trapdoor: the cost of messing it up is that you lose any chance to try again. AGI also shares an epistemic problem with some of the above technologies. For many of them it isn't clear in advance, to most people, how much risk there actually is, and therefore whether the tradeoff is worth it.


After writing this, it occurred to me that maybe by "competitive" you meant "earlier in the tech tree"? I interpreted it in my comment as a synonym of "useful" in a sense that excluded safe-to-use.

Can you link to where RP says that?

Do you not see how they could be used here?

This one. I'm confused about what the intuitive intended meaning of the symbol is. Sorry, I see why "type signature" was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe  is a boolean fact that is edited? But if so I don't know which fact it is, and I'm confused by the way you described it.
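
To make that picture concrete, here's a toy sketch (Python, with made-up facts and propagation rules, purely illustrative) of what I mean by editing one fact and propagating the consequences:

```python
# A toy "logical counterfactual": copy the model, change one boolean fact,
# then re-propagate the consequences. All facts and rules here are invented
# purely for illustration.

def propagate(facts, rules):
    """Forward-chain until no rule adds anything new."""
    facts = dict(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if all(facts.get(p, False) for p in premises) and not facts.get(conclusion, False):
                facts[conclusion] = True
                changed = True
    return facts

def counterfactual(base_facts, rules, edited_fact, new_value):
    """Return the model with one fact changed and its consequences re-derived."""
    edited = dict(base_facts)
    edited[edited_fact] = new_value
    return propagate(edited, rules)

rules = [(("rain",), "wet_ground"), (("wet_ground",), "slippery")]
world = propagate({"rain": True}, rules)
cf = counterfactual({"rain": True}, rules, "rain", False)
print(world)  # {'rain': True, 'wet_ground': True, 'slippery': True}
print(cf)     # {'rain': False} -- downstream facts are no longer derived
```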

Because we're talking about priors and their influence, all of this is happening inside the agent's brain. The agent is going about daily life, and thinks "hm, maybe there is an evil demon simulating me who will give me -10^10 utility if I don't do what they want for my next action". I don't see why this is obviously ill-defined without further specification of the training setup.

Can we replace this with: "The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions lead to -10^10 utility."? This is what it's like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing it's in a simulation.
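
To illustrate what that looks like from the inside, here's a toy sketch (all numbers and hypothesis names invented) of an expected-utility agent whose world model is a posterior mixture, one entry of which happens to predict huge negative utility for most actions:

```python
# Toy expected-utility agent with a mixture world model. The "malign" entry is
# just another hypothesis with some posterior weight; from the agent's point of
# view it is an ordinary belief, not an identifiable attacker.

ACTIONS = ["go_to_work", "make_breakfast", "follow_weird_instruction"]

# hypothesis -> (posterior weight, predicted utility of each action)
world_model = {
    "mundane": (0.99, {"go_to_work": 10.0, "make_breakfast": 5.0,
                       "follow_weird_instruction": -1.0}),
    "malign":  (0.01, {"go_to_work": -1e10, "make_breakfast": -1e10,
                       "follow_weird_instruction": 0.0}),
}

def expected_utility(action):
    return sum(w * preds[action] for w, preds in world_model.values())

best = max(ACTIONS, key=expected_utility)
print({a: expected_utility(a) for a in ACTIONS})
print("chosen:", best)
# Even at 1% posterior weight, the -1e10 predictions dominate the expectation,
# so the agent's ordinary decision procedure gets steered toward
# "follow_weird_instruction" without anything feeling like an attack.
```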

I'm not sure what the type signature of  is, or what it means to "not take into account 's simulation". When  makes decisions about which actions to take, it doesn't have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to "not take it into account"?

So the way in which the agent "gets its beliefs" about the structure of the decision theory problem is via these logical-counterfactual-conditional operations

I think you've misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it's in a particular situation (e.g. with Omega, knowing how Omega will react to different actions). But in reality, an agent has to learn which kind of world it's in using an inductor. That's all I meant by "get its beliefs".

Well my response to this was:

In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?

But I'll expand: An agent doing that kind of game-theory reasoning needs to model the situation it's in. And to do that modelling it needs a prior. Which might be malign.

Malign agents in the prior don't feel like malign agents in the prior, from the perspective of the agent with the prior. They're just beliefs about the way the world is. You need beliefs in order to choose actions. You can't just decide to act in a way that is independent of your beliefs, because you've decided your beliefs are out to get you. 

On top of this, how would you even decide that your beliefs are out to get you? Isn't this also a belief?

Yeah, I know that bound; I've seen a very similar one. The problem is that mesa-optimisers also get very low prediction error when averaged over all predictions. So they exist well below the bound. And they can time their deliberately-incorrect predictions carefully, if they want to survive for a long time.
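
For reference, the sort of bound I mean (stated from memory, in one standard form for the Solomonoff mixture M over a true computable environment μ; constants and conventions vary slightly between statements):

```latex
% Dominance of the universal mixture: for every computable \mu and every string x_{1:n},
%   M(x_{1:n}) \ge 2^{-K(\mu)} \, \mu(x_{1:n}).
% Taking logs and expectations, the cumulative expected excess log-loss of M is bounded:
\sum_{t=1}^{n} \mathbb{E}_{\mu}\!\left[\ln \frac{\mu(x_t \mid x_{<t})}{M(x_t \mid x_{<t})}\right]
\;=\; \mathbb{E}_{\mu}\!\left[\ln \frac{\mu(x_{1:n})}{M(x_{1:n})}\right]
\;\le\; K(\mu)\,\ln 2 \qquad \text{for all } n.
```

Note that this only bounds the total excess loss summed over all time. A hypothesis that copies μ everywhere except on a small, carefully chosen set of rounds pays only a bounded extra cost, so it sits comfortably under a bound of this shape, and the bound says nothing about when that cost gets spent.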

How does this connect to malign prior problems?

But why would you ever be able to solve the problem with a different decision theory? If the beliefs are manipulating it, it doesn't matter what the decision theory is.

To respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor.

Because we have the prediction error bounds.

Not ones that can rule out any of those things. My understanding is that the bounds are asymptotic or average-case in a way that makes them useless for this purpose. So if a mesa-inductor is found first that has a better prior, it'll stick with the mesa-inductor. And if it has goals, it can wait as long as it wants to make a false prediction that helps achieve its goals. (Or just make false predictions about counterfactuals that are unlikely to be chosen).
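
To spell out why waiting is free for such a hypothesis (standard Bayes arithmetic, schematic):

```latex
% Relative posterior weight of a hypothesis h versus the true environment \mu
% after n observations, with prior weights proportional to 2^{-K(\cdot)}:
\frac{w_n(h)}{w_n(\mu)}
\;=\; \frac{2^{-K(h)}}{2^{-K(\mu)}}\,\prod_{t=1}^{n} \frac{h(x_t \mid x_{<t})}{\mu(x_t \mid x_{<t})}.
% If h exactly mimics \mu on every round so far, the product equals 1, so the ratio is
% constant in n: mimicking costs nothing, and h keeps its full share of the posterior
% until the round on which it chooses to deviate.
```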

If I'm wrong then I'd be extremely interested in seeing your reasoning. I'd maybe pay $400 for a post explaining the reasoning behind why prediction error bounds rule out mesa-optimisers in the prior.

You also want one that generalises well, and doesn't make performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?

Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is actually about finding more effective approximate models for things in different regimes.

When we compare theories, we don't consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself. 

E.g. the theory of evolution isn't quite code for a costly simulation. But it can be viewed as a set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn't involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.
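
Schematically, the comparison I have in mind is the Solomonoff/MDL-style one (this is a cartoon, not a claim about how scientists actually compute anything):

```latex
% Theories are weighted by the complexity of the theory itself, not by the
% complexity of the approximation machinery we happen to use to extract its predictions:
w(H) \;\propto\; 2^{-K(H)},
\qquad\text{rather than}\qquad
w(H) \;\propto\; 2^{-\left(K(H) \,+\, K(\text{approximations used to compute } H\text{'s predictions})\right)}.
```

The approximations affect how much compute we spend and how accurate our extracted predictions are, not how plausible the theory is.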

