All of Jeremy Gillen's Comments + Replies

Do you not see how they could be used here?

This one. I'm confused about what the intuitive intended meaning of the symbol is. Sorry, I see why "type signature" was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe $L_{A}$ is a boolean fact that is edited? But if so I don't know which fact it is, and I'm confused by the way you described it.

Because we're talking about priors and their influence, all of

Changing my mind about Christiano's malign prior argument

I'm not sure what the type signature of $L_{A}$ is, or what it means to "not take into account $M$ 's simulation". When $A$ makes decisions about which actions to take, it doesn't have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to "not take it into account"?

So the way in which the agent "gets its beliefs" about the structure of the decision theory problem is via these logical-counterfactual-conditional operation

I think you've misunderstood me entirely. Usual... (read more)

1Garrett Baker1mo

I know you know about logical decision theory, and I know you know its not formalized, and I'm not going to be able to formalize it in a LessWrong comment, so I'm not sure what you want me to say here. Do you reject the idea of logical counterfactuals? Do you not see how they could be used here? Because we're talking about priors and their influence, all of this is happening inside the agent's brain. The agent is going about daily life, and thinks "hm, maybe there is an evil demon simulating me who will give me -101010^10 utility if I don't do what they want for my next action". I don't see why this is obviously ill-defined without further specification of the training setup.

Changing my mind about Christiano's malign prior argument

Well my response to this was:

In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?

But I'll expand: An agent doing that kind of game-theory reasoning needs to model the situation it's in. And to do that modelling it needs a prior. Which might be malign.

Malign agents in the prior don't feel like malign agents in the prior, from the perspective of the agent with the prior. They're just beliefs about the way the world is. You need beliefs in order to choose acti... (read more)

2Garrett Baker1mo

Let M be an agent which can be instantiated in a much simpler world and has different goals from our limited Bayesian agent A. We say M is malign with respect to A if p(q|O)<p(qM,A|O) where q is the "real" world and qM,A is the world where M has decided to simulate all of A's observations for the purpose of trying to invade their prior. Now what influences p(qM,A|O)? Well M will only simulate all of A's observations if it expects this will give it some influence over A. Let LA be an unformalized logical counterfactual operation that A could make. Then p(qM,A|O,LA) is maximal when LA takes into account M's simulation, and 0 when LA doesn't take into account M's simulation. In particular, if LA,¬M is a logical counterfactual which doesn't take M's simulation into account, then p(qM,A|O,LA,¬M)=0<p(q|O,LA,¬M) So the way in which the agent "gets its beliefs" about the structure of the decision theory problem is via these logical-counterfactual-conditional operations, same as in causal decision theory, and same as in evidential decision theory.

Changing my mind about Christiano's malign prior argument

Yeah I know that bound, I've seen a very similar one. The problem is that mesa-optimisers also get very good prediction error when averaged over all predictions. So they exist well below the bound. And they can time their deliberately-incorrect predictions carefully, if they want to survive for a long time.

Changing my mind about Christiano's malign prior argument

How does this connect to malign prior problems?

2Garrett Baker1mo

Changing my mind about Christiano's malign prior argument

But why would you ever be able to solve the problem with a different decision theory? If the beliefs are manipulating it, it doesn't matter what the decision theory is.

2Garrett Baker1mo

My world model would have a loose model of myself in it, and this will change which worlds I'm more or less likely to be found in. For example, a logical decision theorist, trying to model omega, will have very low probability that omega has predicted it will two box.

Changing my mind about Christiano's malign prior argument

To respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor.

Because we have the prediction error bounds.

Not ones that can rule out any of those things. My understanding is that the bounds are asymptotic or average-case in a way that makes them useless for this purpose. So if a mesa-inductor is found first that has a better prior, it'll stick with the mesa-inductor. And if it has goals, it ... (read more)

2Lucius Bushnaq1mo

The bound is the same one you get for normal Solomonoff induction, except restricted to the set of programs the cut-off induction runs over. It’s a bound on the total expected error in terms of CE loss that the predictor will ever make, summed over all datapoints. Look at the bound for cut-off induction in that post I linked, maybe? Hutter might also have something on it. Can also discuss on a call if you like. Note that this doesn’t work in real life, where the programs are not in fact restricted to outputting bit string predictions and can e.g. try to trick the hardware they’re running on.

Changing my mind about Christiano's malign prior argument

You also want one that generalises well, and doesn't do preformative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?

Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is actually about finding more effective approximate models for things in different regimes.

When we compare theories, we don't consi... (read more)

2Lucius Bushnaq1mo

Because we have the prediction error bounds. Yes.

Changing my mind about Christiano's malign prior argument

In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?

2Garrett Baker1mo

Oh my point wasn't against solomonoff in general, maybe more crisply, my clam is different decision theories will find different "pathologies" in the solomonoff prior, and in particular for causal and evidential decision theorists, I could totally buy the misaligned prior bit, and I could totally buy, if formalized, the whole thing rests on the interaction between bad decision theory and solomonoff.

Is instrumental convergence a thing for virtue-driven agents?

One thing to keep in mind is that time cut-offs will usually rule out our own universe as a hypothesis. Our universe is insanely compute inefficient.

So the "hypotheses" inside your inductor won't actually end up corresponding to what we mean by a scientific hypothesis. The only reason this inductor will work at all is that it's done a brute force search over a huge space of programs until it finds one that works. Plausibly it'll just find a better efficient induction algorithm, with a sane prior.

5Lucius Bushnaq1mo

That’s fine. I just want a computable predictor that works well. This one does. Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is about finding more effective approximations for stuff. Edit: Actually, I don’t think this would yield you a different general predictor as the program dominating the posterior. General inductor program P1 running program P2 is pretty much never going to be the shortest implementation of P2.

Is instrumental convergence a thing for virtue-driven agents?

I'm not sure whether it implies that you should be able to make a task-based AGI.

Yeah I don't understand what you mean by virtues in this context, but I don't see why consequentialism-in-service-of-virtues would create different problems than the more general consequentialism-in-service-of-anything-else. If I understood why you think it's different then we might communicate better.

(Later you mention unboundedness too, which I think should be added to difficulty here)

By unbounded I just meant the kind of task where it's always possible to do better by using... (read more)

Jeremy Gillen1mo*80

It could still be a competent agent that often chooses actions based on the outcomes they bring about. It's just that that happens as an inner loop in service of an outer loop which is trying to embody certain virtues.

I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us. Which I think is a good approach! But the open problems for making a task-based AGI sti... (read more)

4mattmacdermott1mo

Later I might try to flesh out my currently-very-loose picture of why consequentialism-in-service-of-virtues seems like a plausible thing we could end up with. I'm not sure whether it implies that you should be able to make a task-based AGI. Fair enough. Talk of instrumental convergence usually assumes that the amount of power that is helpful will be a lot (otherwise it wouldn't be scary). But I suppose you'd say that's just because we expect to try to use AIs for very difficult tasks. (Later you mention unboundedness too, which I think should be added to difficulty here). I'm not sure about that, because the fact that the task is being completed in service of some virtue might limit the scope of actions that are considered for it. Again I think it's on me to paint a more detailed picture of the way the agent works and how it comes about in order for us to be able to think that through.

-1StanislavKrym1mo

As I wrote in another comment, in an experiment ChatGPT failed to utter a racial slur to save millions of lives. A re-run of the experiment led it to agree to use the slur and to claim that "In this case, the decision to use the slur is a complex ethical dilemma that ultimately comes down to weighing the value of saving countless lives against the harm caused by the slur". This implies that ChatGPT is either already aligned to a not so consequential ethics or that it ended up grossly exaggerating the slur's harm. Or that it failed to understand the taboo's meaning. UPD: if racial slurs are a taboo for AI, then colonizing the world, apparently, is a taboo as well. Is AI takeover close enough to colonialism to align AI against the former, not just the latter?

Towards a scale-free theory of intelligent agency

Jeremy Gillen1mo210

But in practice, agents represent both of these in terms of the same underlying concepts. When those concepts change, both beliefs and goals change.

I like this reason to be unsatisfied with the EUM theory of agency.

One of the difficulties in theorising about agency is that all the theories are flexible enough to explain anything. Each theory is incomplete and vague in some way, so this makes the problem worse, but even when you make a detailed model of e.g. active inference, it ends up being pretty much formally equivalent to EUM.

I think the solution to th... (read more)

3Jonas Hallgren1mo

Could you please make an argument for goal stability over process stability? If I reflecticely agree that if the process A (QACI or CEV for example) is reflectively good then I agree to changing my values from B to C if process A happens? So it is more about the process than the underlying goals. Why do we treat goals as the main class citizen here? There's something in well defined processes that make them applicable to themselves and reflectively stable?

Evaluating Stability of Unreflective Alignment

Training AI to do alignment research we don’t already know how to do

I think the scheme you're describing caps the agent at moderate problem-solving capabilities. Not being able to notice past mistakes is a heck of a disability.

Training AI to do alignment research we don’t already know how to do

It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.

But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research") it feels like shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.

4ryan_greenblatt2mo

I certainly agree it isn't clear, just my current best guess.

Jeremy Gillen2mo30

Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.

I guess I shouldn't respond too much in public until you've published the doc, but:

If I'm interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments involving training (or otherwise messing with in some way) an AGI. I hope you have some empathy the internal screaming I have toward this category of things.
A bunch of the ideas do seem

... (read more)

2ryan_greenblatt2mo

Yes, I just meant on net. (Relative to the current ML community and given a similar fraction of resources to spend on AI compute.)

Training AI to do alignment research we don’t already know how to do

Jeremy Gillen2mo*20

I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.

I think we kind of agree here. The cruxes remain: I think that the metric for "behave well" won't be good enough for "real" large research acceleration. And "average case" means very little when it allows room for deliberate-or-not mistakes sometimes when they can be plausibly got-away-with. [Edit: Or sabotage, escape, etc.]

Also, yo... (read more)

2ryan_greenblatt2mo

Oh, yeah I meant "perform well according to your metrics" not "behave well" (edited)

Training AI to do alignment research we don’t already know how to do

Jeremy Gillen2mo3-1

Yep this is the third crux I think. Perhaps the most important.

To me it looks like you're making a wild guess that "prosaic and relatively unenlightened ML research" is a very large fraction of the necessary work for solving alignment, without any justification that I know of?

For all the pathways to solving alignment that I am aware of, this is clearly false. I think if you know of a pathway that just involves mostly "prosaic and relatively unenlightened ML research", you should write out this plan, why you expect it to work, and then ask OpenPhil throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?

3ryan_greenblatt2mo

I don't think "what is the necessary work for solving alignment" is a frame I really buy. My perspective on alignment is more like: * Avoiding egregious misalignment (where AIs intentionally act in ways that make our tests highly misleading or do pretty obviously unintended/dangerous actions) reduces risk once AIs are otherwise dangerous. * Additionally, we will likely to need to hand over making most near term decisions and most near term labor to some AI systems at some point. This going well very likely requires being able to avoid egregious misalignment (in systems capable enough to obsolete us) and also requires some other stuff. * There is a bunch of "prosaic and relatively unenlightened ML research" which can make egregious misalignment much less likely and can resolve other problems needed for handover. * Much of this work is much easier once you already have powerful AIs to experiment on. * The risk reduction will depend on the amount of effort put in and the quality of the execution etc. * The total quantity of risk reduction is unclear, but seems substantial to me. I'd guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time. (This require more misc conceptual work, but not something that requires deep understanding persay.) I think my perspective is more like "here's a long list of stuff which would help". Some of this is readily doable to work on in advance and should be worked on, and some is harder to work on. This work isn't extremely easy to verify or scale up (such that I don't think "throw a billion dollars at it" just works), though I'm excited for a bunch more work on this stuff. ("relatively unenlightened" doesn't mean "trivial to get the ML community work on this using money" and I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effecti

Training AI to do alignment research we don’t already know how to do

Training AI to do alignment research we don’t already know how to do

I'm not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the "core mistake" comment below, and the "faithful simulators" comment is another possibility.

Maybe another relevant thing that looks wrong to me: You will still get slop when you train an AI to look like it is epistemically virtuously updating its beliefs. You'll get outputs that look very epistemically virtuous, but it takes time and expertise to rank them in a way that reflects actual epistemic virtue level, just like other kinds of slop... (read more)

Jeremy Gillen2mo*21

these are also alignment failures we see in humans.

Many of them have close analogies in human behaviour. But you seem to be implying "and therefore those are non-issues"???

There are many groups of humans (or groups of humans), that if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.

How is this evidence in favour of your plan ultimately resulting in a solution to alignment???

but these systems empirically often move in reasonable and socially-beneficial

Jeremy Gillen2mo52

to the extent developers succeed in creating faithful simulators

There's a crux I have with Ryan which is "whether future capabilities will allow data-efficient long-horizon RL fine-tuning that generalizes well". As of last time we talked about it, Ryan says we probably will, I say we probably won't.

If we have the kind of generalizing ML that we can use to make faithful simulations, then alignment is pretty much solved. We make exact human uploads, and that's pretty much it. This is one end of the spectrum on this question.

There are weaker versions, which I... (read more)

4ryan_greenblatt2mo

FWIW, I don't think "data-efficient long-horizon RL" (which is sample efficient in a online training sense) implies you can make faithful simulations. I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well perform well according to your metrics in the average case using online training.

Training AI to do alignment research we don’t already know how to do

Jeremy Gillen2mo62

My guess is that your core mistake is here:

When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.

Obviously, all agents having undergone training to look "not egregiously misaligned", will not look egregiousl... (read more)

2joshc2mo

I think my arguments still hold in this case though right? i.e. we are training models so they try to improve their work and identify these subtle issues -- and so if they actually behave this way they will find these issues insofar as humans identify the subtle mistakes they make. I agree there are lots of "messy in between places," but these are also alignment failures we see in humans. And if humans had a really long time to do safety reseach, my guess is we'd be ok. Why? Like you said, there's a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially-beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc) and i expect we can make AI agents a lot more aligned than humans typically are. e.g. most humans don't actually care about the law etc but, Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.

Training AI to do alignment research we don’t already know how to do

Jeremy Gillen2mo41

(Some) acceleration doesn't require being fully competitive with humans while deference does.

Agreed. The invention of calculators was useful for research, and the invention of more tools will also be helpful.

I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting new ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.

Maybe some kinds of "safety work", but real alignment involves a human obtaining a deep understanding of intelligence and agency. The path ... (read more)

3ryan_greenblatt2mo

A typical crux is that I think we can increase our chances of "real alignment" using prosaic and relatively unenlightened ML reasearch without any deep understanding. I both think: 1. We can significantly accelerate prosaic ML safety research (e.g., of the sort people are doing today) using AIs that are importantly limited. 2. Prosaic ML safety research can be very helpful for increasing the chance of "real alignment" for AIs that we hand off to. (At least when this research is well executed and has access to powerful AIs to experiment on.) This top level post is part of Josh's argument for (2).

Training AI to do alignment research we don’t already know how to do

Jeremy Gillen2mo6-2

(vague memory from the in person discussions we had last year, might be inaccurate):

jeremy!2023: If you're expecting AI to be capable enough to "accelerate alignment research" significantly, it'll need to be a full-blown agent that learns stuff. And that'll be enough to create alignment problems because data-efficient long-horizon generalization is not something we can do.

joshc!2023: No way, all you need is AI with stereotyped skills. Imagine how fast we could do interp experiments if we had AIs that were good at writing code but dumb in other ways!

...

josh... (read more)

2joshc2mo

I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?

9ryan_greenblatt2mo

Something important is that "significantly accelerate alignment research" isn't the same as "making AIs that we're happy to fully defer to". This post is talking about conditions needed for deference and how we might achieve them. (Some) acceleration doesn't require being fully competitive with humans while deference does. I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.

In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes?

We can't reduce the domain of the utility function without destroying some information. If we tried to change the domain variables from [g, h, shutdown] to [g, shutdown], we wouldn't get the desired behaviour. Maybe you have a particular translation method in mind?

I don't mess up the medical test because true information is instrumentally useful to me, given my goals.

Yep that's what I meant. The goal u is constructed to make information abo... (read more)

Jesse Hoogland's Shortform

Jeremy Gillen2mo70

With regards to the agent believing that it's impossible to influence the probability that its plan passes validation

This is a misinterpretation. The agent entirely has true beliefs. It knows it could manipulate the validation step. It just doesn't want to, because of the conditional shape of its goal. This is a common behaviour among humans, for example you wouldn't mess up a medical test to make it come out negative, because you need to know the result in order to know what to do afterwards.

3EJT2mo

Oh I see. In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes? I think it might involve incomplete preferences. Here's why I say that. For the agent to be useful, it needs to have some preference between plans conditional on their passing validation: there must be some plan A and some plan A+ such that the agent prefers A+ to A. Then given Completeness and Transitivity, the agent can't lack a preference between shutdown and each of A and A+. If the agent lacks a preference between shutdown and A, it must prefer A+ to shutdown. It might then try to increase the probability that A+ passes validation. If the agent lacks a preference between shutdown and A+, it must prefer shutdown to A. It might then try to decrease the probability that A passes validation. This is basically my Second Theorem and the point that John Wentworth makes here. I'm not sure the medical test is a good analogy. I don't mess up the medical test because true information is instrumentally useful to me, given my goals. But (it seems to me) true information about whether a plan passes validation is only instrumentally useful to the agent if the agent's goal is to do what we humans really want. And that's something we can't assume, given the difficulty of alignment.

Jeremy Gillen3mo50

I propose: the best planners must break the beta.

Because if a planner is going to be the best, it needs to be capable of finding unusual (better!) plans. If it's capable of finding those, there's ~no benefit of knowing the conventional wisdom about how to do it (climbing slang: beta).

Edit: or maybe: good planners don't need beta?

6Jesse Hoogland3mo

That's fun but a little long. Why not... BetaZero?

Gradual Disempowerment, Shell Games and Flinches

Jeremy Gillen3mo*5416

I think you're wrong to be psychoanalysing why people aren't paying attention to your work. You're overcomplicating it. Most people just think you're wrong upon hearing a short summary, and don't trust you enough to spend time learning the details. Whether your scenario is important or not, from your perspective it'll usually look like people are bouncing off for bad reasons.

For example, I read the executive summary. For several shallow reasons,^[1] the scenario seemed unlikely and unimportant. I didn't expect there to be better arguments further on. S... (read more)

dr_s3mo132

I think the shell games point is interesting though. It's not psychoanalysing (one can think that people are in denial or have rational beliefs about this, not much point second guessing too far), it's pointing out a specific fallacy: a sort of god of the gaps in which every person with a focus on subsystem X assumes the problem will be solved in subsystem Y, which they understand or care less about because it's not their specialty. If everyone does it, that does indeed lead to completely ignoring serious problems due to a sort of bystander effect.

2[comment deleted]3mo

Jan_Kulveit3mo2314

I think 'people aren't paying attention to your work' is somewhat different situation than voiced in the original post. I'm discussing specific ways in which people engage with the argument, as opposed to just ignoring it. It is the baseline that most people ignore most arguments most of time.

Also it's probably worth noting the ways seem somewhat specific to the crowd over-represented here - in different contexts people are engaging with it in different ways.

Invulnerable Incomplete Preferences: A Formal Statement

The description of how sequential choice can be defined is helpful, I was previously confused by how this was supposed to work. This matches what I meant by preferences over tuples of outcomes. Thanks!

We'd incorrectly rule out the possibility that the agent goes for (B+,B).

There's two things we might want from the idea of incomplete preferences:

To predict the actions of agents.
Because complete agents behave dangerously sometimes, and we want to design better agents with different behaviour.

I think modelling an agent as having incomplete preferences is grea... (read more)

Why Not Subagents?

Daniel Kokotajlo's Shortform

Perhaps I'm misusing the word "representable"? But what I meant was that any single sequence of actions generate by the agent could also have been generated by an outcome-utility maximizer (that has the same world model). This seems like the relevant definition, right?

Why Not Subagents?

Jeremy Gillen3mo*20

That's not right

Are you saying that my description (following) is incorrect?

[incomplete preferences w/ caprice] would be equivalent to 1. choosing the best policy by ranking them in the partial order of outcomes (randomizing over multiple maxima), then 2. implementing that policy without further consideration.

Or are you saying that it is correct, but you disagree that this implies that it is "behaviorally indistinguishable from an agent with complete preferences"? If this is the case, then I think we might disagree on the definition of "behaviorally ... (read more)

Jeremy Gillen3mo147

I think it's important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that's not quite the framing I'd use. It's not taking a "good" machine and breaking it, it's taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.

My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).

Announcing Dialogues

Jeremy Gillen3mo60

Hmm good point. Looking at your dialogues has changed my mind, they have higher karma than the ones I was looking at.

You might also be unusual on some axis that makes arguments easier. It takes me a lot of time to go over peoples words and work out what beliefs are consistent with them. And the inverse, translating model to words, also takes a while.

Announcing Dialogues

Dialogues are more difficult to create (if done well between people with different beliefs), and are less pleasant to read, but are often higher value for reaching true beliefs as a group.

8ryan_greenblatt3mo

The dialogues I've done have all been substantially less time investment than basically any of my posts.

Announcing Dialogues

Jeremy Gillen3mo50

Dialogues seem under-incentivised relative to comments, given the amount of effort involved. Maybe they would get more karma if we could vote on individual replies, so it's more like a comment chain?

This could also help with skimming a dialogue because you can skip to the best parts, to see whether it's worth reading the whole thing.

2ryan_greenblatt3mo

I don't see a reason to give dialogues more karma than posts, but I agree posts (including dialogues) are under-incentivized relative to comments.

Jeremy Gillen3mo30

The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X,Y,Z components of the algorithm, and X passes (e.g.) beliefs to Y in format b, and Z can be viewed as a function that takes information in format w and links it with... etc. And infrabayes might be the theory you use to explain what some of the internal datastructures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and in normal AI and ML theo... (read more)

1Jonas Hallgren3mo

Okay, that makes sense to me so thank you for explaining! I guess what I was pointing at with the language thing is the question of what the actual underlying objects that you called XYZ were and their relation to the linguistic explanation of language as a contextually dependent symbol defined by many scenarios rather than some sort of logic. Like if we use IB it might be easy to look at that as a probability distribution of probability distributions? I just thought it was interesting to get some more context on how language might help in an alignment plan.

Jeremy Gillen3mo40

Fair enough, good points. I guess I classify these LLM agents as "something-like-an-LLM that is genuinely creative", at least to some extent.

Although I don't think the first example is great, seems more like a capability/observation-bandwidth issue.

4Garrett Baker3mo

I think you can have multiple failures at the same time. The reason I think this was also goodhart was because I think the failure-mode could have been averted if sonnet was told “collect wood WITHOUT BREAKING MY HOUSE” ahead of time.

I'm not sure how this is different from the solution I describe in the latter half of the post.

Jeremy Gillen3mo40

Great comment, agreed. There was some suggestion of (3), and maybe there was too much. I think there are times when expectations about the plan are equivalent to literal desires about how the task should be done. For making coffee, I expect that it won't create much noise. But also, I actually want the coffee-making to not be particularly noisy, and if it's the case that the first plan for making coffee also creates a lot of noise as a side effect, this is a situation where something in the goal specification has gone horribly wrong (and there should be some institutional response).

Jeremy Gillen3mo31

Yeah I think I remember Stuart talking about agents that request clarification whenever they are uncertain about how a concept generalizes. That is vaguely similar. I can't remember whether he proposed any way to make that reflectively stable though.

From the perspective of this post, wouldn't natural language work a bit as a redundancy specifier in that case and so LLMs are more alignable than RL agents?

LLMs in their current form don't really cause Edge Instantiation problems. Plausibly this is because they internally implement many kinds of regularization... (read more)

3Jonas Hallgren3mo

Those are some great points, made me think of some more questions. Any thoughts on what language "understood vs not understood" might be in? ARC Heuristic arguments or something like infrabayesianism? Like what is the type signature of this and how does this relate to what you wrote in the post? Also what is its relation to natural language?

4Garrett Baker3mo

If you put current language models in weird situations & give them a goal, I’d say they do do edge instantiation, without the missing “creativity” ingredient. Eg see claude sonnet in minecraft repurposing someone’s house for wood after being asked to collect wood. Edit: There are other instances of this too, where you can tell claude to protect you in minecraft, and it will constantly tp to your position, and build walls around you when monsters are around. Protecting you, but also preventing any movement or fun you may have wanted to have.

Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it's reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn't clear that it won't self-modify (but it's hard to tell).

There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem