All of Vivek Hebbar's Comments + Replies

Do you want to try playing this game together sometime?

2Daniel Kokotajlo
Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?

We're then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We're going to try to use the RL to train: "Act exactly like [a given alignment researcher] would act."

Why are we doing RL if we just want imitation?  Why not SFT on expert demonstrations?
Also, if 10 episodes suffices, why is so much post-training currently done on base models?

If the agent follows EDT, it seems like you are giving it epistemically unsound credences. In particular, the premise is that it's very confident it will go left, and the consequence is that it in fact goes right. This was the world model's fault, not EDT's fault. (It is notable though that EDT introduces this loopiness into the world model's job.)

2ryan_greenblatt
Thanks, I improved the wording.

Superadditivity seems rare in practice.  For instance, workers should have subadditive contributions after some point.  This is certainly true in the unemployment example in the post.

1lalaithion
Perhaps there is a different scheme for dividing gains from coöperation which satisfies some of the things we want but not superadditivity, but I’m unfamiliar with one. Please let me know if you find anything in that vein, I’d love to read about some alternatives to Shapley Value.
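
A minimal sketch of the subadditivity point and the Shapley split (my own illustration; the worker names and numbers are made up): two workers who each produce 10 alone but only 15 together have subadditive contributions, and the Shapley value still divides the total cleanly (7.5 each).

```python
from itertools import permutations
from math import factorial

def shapley_values(players, v):
    """Shapley value: each player's marginal contribution averaged over all join orders."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in phi.items()}

# Subadditive example: each worker produces 10 alone but only 15 together
# (they crowd each other), so v(S U T) < v(S) + v(T) for disjoint S, T.
v = {frozenset(): 0, frozenset({"alice"}): 10, frozenset({"bob"}): 10,
     frozenset({"alice", "bob"}): 15}.get
print(shapley_values(["alice", "bob"], v))  # {'alice': 7.5, 'bob': 7.5}
```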

The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through:
1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor
2. Either this labor is done by AIs in approved scaffolds, or it is done in "rogue deployments"
3. Hence the only easy-by-default disaster route is through a rogue deployment
4. Hence if we rule out rogue deployments and very impressive/difficult malicious labor in our scaffolds, we are safe

This seems true f... (read more)

5ryan_greenblatt
Hmm, I agree this division would be more useful if this argument went through, but I think it is quite useful even without this, and that is worth noting. (And indeed the post doesn't make this argument and discusses subtle manipulation.) I think subtle manipulation is a reasonably plausible threat model.

This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).

Paul [Christiano] called this “problems of the interior” somewhere

Since it's slightly hard to find: Paul references it here (ctrl+f for "interior") and links to this source (once again ctrl+f for "interior").  Paul also refers to it in this post.  The term is actually "position of the interior" and apparently comes from military strategist Carl von Clausewitz.

4DanielFilan
Thanks for finding this! Will link it in the transcript.
4ryan_greenblatt
Also some discussion in this thread.
3Vivek Hebbar
This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).

Can you clarify what figure 1 and figure 2 are showing?  

I took the text description before figure 1 to mean {score on column after finetuning on 200 from row then 10 from column} - {score on column after finetuning on 10 from column}.  But then the text right after says "Babbage fine-tuned on addition gets 27% accuracy on the multiplication dataset" which seems like a different thing.

1agg
Position i, j in figure 1 represents how well a model fine-tuned on 200 examples of dataset i performs on dataset j; Position i, j in figure 2 represents how well a model fine-tuned on 200 examples of dataset i, and then fine-tuned on 10 examples of dataset j, performs on dataset j.

Note: The survey took me 20 mins (but also note selection effects on leaving this comment)

1Cameron Berg
Definitely good to know that it might take a bit longer than we had estimated from earlier respondents (with the well-taken selection effect caveat).  Note that if it takes between 10-20 minutes to fill out, this still works out to donating $120-240/researcher-hour to high-impact alignment orgs (plus whatever the value is of the comparison of one's individual results to that of the community), which hopefully is worth the time investment :)

Here's a fun thing I noticed:

There are 16 boolean functions of two variables.  Now consider an embedding that maps each of the four pairs {(A=true, B=true), (A=true, B=false), ...} to a point in 2d space.  For any such embedding, at most 14 of the 16 functions will be representable with a linear decision boundary.

For the "default" embedding (x=A, y=B), xor and its complement are the two excluded functions.  If we rearrange the points such that xor is linearly represented, we always lose some other function (and its complement).  In fact... (read more)

Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn't precisely the right well-behavedness desideratum, but I retract the "directionally wrong".

The variance of the multivariate uniform distribution U([0,1]^2) is largest along the direction (1,1), which is exactly the direction which we would want to represent a AND b.

The variance is actually the same in all directions.  One can sanity-check by integration that the variance is 1/12 both along the axis and along the diagonal.

In fact, there's nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are ind... (read more)
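
The calculation behind the direction-independence claim, written out (standard algebra, my phrasing):

```latex
% For independent coordinates X_1, \dots, X_N with common variance \sigma^2
% and any unit vector u:
\operatorname{Var}(u \cdot X) \;=\; \sum_i u_i^2 \operatorname{Var}(X_i)
                              \;=\; \sigma^2 \sum_i u_i^2 \;=\; \sigma^2 .
% For X_i \sim U([0,1]) this gives 1/12 along the axes and along the diagonal alike.
```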

3Sam Marks
Thanks, you're totally right about the equal variance thing -- I had stupidly thought that the projection of U([0,1]^2) onto y = x would be uniform on [−1/√2, 1/√2] (obviously false!).

The case of a fully discrete distribution (supported in this case on four points) seems like a very special case of something more general, where a "more typical" special case would be something like:
* if a, b are both false, then sample from N(0, Σ)
* if a is true and b is false, then sample from N(μ_a, Σ)
* if a is false and b is true, then sample from N(μ_b, Σ)
* if a and b are true, then sample from N(μ_a + μ_b, Σ)
for some μ_a, μ_b ∈ R^n and covariance matrix Σ.

In general, I don't really expect the class-conditional distributions to be Gaussian, nor for the class-conditional covariances to be independent of the class. But I do expect something broadly like this, where the distributions are concentrated around their class-conditional means with probability falling off as you move further from the class-conditional mean (hence unimodality), and that the class-conditional variances are not too big relative to the distance between the clusters.

Given that longer explanation, does the unimodality thing still seem directionally wrong?
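
A tiny sampler for the generative story above (my own sketch; the means, covariance, and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_a, mu_b = np.array([4.0, 0.0]), np.array([0.0, 4.0])
cov = np.eye(2)

def sample(a: bool, b: bool, size: int = 500):
    """Class-conditional Gaussian: mean 0, mu_a, mu_b, or mu_a + mu_b."""
    mean = a * mu_a + b * mu_b
    return rng.multivariate_normal(mean, cov, size)

clusters = {(a, b): sample(a, b) for a in (False, True) for b in (False, True)}
# Each class is unimodal around its mean; "a AND b" sits out at mu_a + mu_b.
print({k: v.mean(axis=0).round(2) for k, v in clusters.items()})
```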

Maybe models track which features are basic and enforce that these features be more salient

Couldn't it just write derivative features more weakly, and therefore not need any tracking mechanism other than the magnitude itself?

2Sam Marks
Some features which are computed from other features should probably themselves be treated as basic and thus represented with large salience.

It's sad that agentfoundations.org links no longer work, leading to broken links in many decision theory posts (e.g. here and here)

2habryka
Oh, hmm, this seems like a bug on our side. I definitely set up a redirect a while ago that should make those links work. My guess is something broke in the last few months.
2Vladimir_Nesov
Thanks for the heads up. Example broken link (https://agentfoundations.org/item?id=32), currently redirects to broken https://www.alignmentforum.org/item?id=32, should redirect further to https://www.alignmentforum.org/posts/5bd75cc58225bf0670374e7d/exploiting-edt (Exploiting EDT[1]), archive.today snapshot. Edit 14 Oct: It works now, even for links to comments, thanks LW team!

[1] LW confusingly replaces the link to www.alignmentforum.org given in Markdown comment source text with a link to www.lesswrong.com when displaying the comment on LW.

This will initially boost  relative to  because it will suddenly be joined to a network which is correctly transmitting  but which does not understand  at all.

However, as these networks are trained to equilibrium the advantage will disappear as a steganographic protocol is agreed between the two models. Also, this can only be used once before the networks are in equilibrium.

Why would it be desirable to do this end-to-end training at all, rather than simply sticking the two networks together and doing no furthe... (read more)

I've been asked to clarify a point of fact, so I'll do so here:

My recollection is that he probed a little and was like "I'm not too worried about that" and didn't probe further.

This does ring a bell, and my brain is weakly telling me it did happen on a walk with Nate, but it's so fuzzy that I can't tell if it's a real memory or not.  A confounder here is that I've probably also had the conversational route "MIRI burnout is a thing, yikes" -> "I'm not too worried, I'm a robust and upbeat person" multiple times with people other than Nate.

In private ... (read more)

7TurnTrout
This is a slight positive update for me. I maintain my overall worry and critique: chats which are forgettable do not constitute sufficient warning.  Insofar as non-Nate MIRI personnel thoroughly warned Vivek, that is another slight positive update, since this warning should reliably be encountered by potential hires. If Vivek was independently warned via random social connections not possessed by everyone,[1] then that's a slight negative update.

[1] For example, Thomas Kwa learned about Nate's comm doc by randomly talking with a close friend of Nate's, and mentioning comm difficulties.
3leogao
I meant it as an analogy to https://en.m.wikipedia.org/wiki/Denormalization

In database design, sometimes you have a column in one table whose entries are pointers into another table - e.g. maybe I have a Users table, and each User has a primaryAddress field which is a pointer into an Address table. That keeps things relatively compact and often naturally represents things - e.g. if several Users in a family share a primary address, then they can all point to the same Address. The Address only needs to be represented once (so it's relatively compact), and it can also be changed once for everyone if that's a thing someone wants to ... (read more)
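
A toy version of the layout described above (my own illustration; the names are made up): each User holds a key into an Address table rather than its own copy, so the address is stored once and a single update is seen by everyone who points at it.

```python
addresses = {
    "addr1": {"street": "123 Main St", "city": "Springfield"},
}
users = {
    "alice": {"primary_address": "addr1"},
    "bob":   {"primary_address": "addr1"},  # same household: shared pointer, not a copy
}

# One update to the shared Address is visible through every User that references it.
addresses["addr1"]["street"] = "456 Oak Ave"
print(addresses[users["bob"]["primary_address"]])
# Denormalizing would instead copy the address fields into each User record.
```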

When you describe the "emailing protein sequences -> nanotech" route, are you imagining an AGI with computers on which it can run code (like simulations)?  Or do you claim that the AGI could design the protein sequences without writing simulations, by simply thinking about it "in its head"?

4Eliezer Yudkowsky
At the superintelligent level there's not a binary difference between those two clusters.  You just compute each thing you need to know efficiently.

Cool! It wrote and executed code to solve the problem, and it got it right.

Are you using chat-GPT-4?  I thought it can't run code?

1Jonathan Marcus
Interesting! Yes, I am using ChatGPT with GPT-4. It printed out the code, then *told me that it ran it*,  then printed out a correct answer. I didn't think to fact-check it; instead I assumed that OpenAI had been adding some impressive/scary new features.

Interesting, I find what you are saying here broadly plausible, and it is updating me (at least toward greater uncertainty/confusion).  I notice that I don't expect the 10x effect, or the Von Neumann effect, to be anywhere close to purely genetic.  Maybe some path-dependency in learning?  But my intuition (of unknown quality) is that there should be some software tweaks which make the high end of this more reliably achievable.

Anyway, to check that I understand your position, would this be a fair dialogue?:

Person: "The jump from chimps to hu

... (read more)
5jacob_cannell
Your model of my model sounds about right, but I also include a neoteny extension of perhaps 2x which is part of the scale up (spending longer on training the cortex, especially in higher brain regions). For Von Neumann in particular my understanding is he was some combination of 'regular' genius and a mentat (a person who can perform certain computer-like calculations quickly), which was very useful for many science tasks in an era lacking fast computers and software like mathematica, but would provide less of an effective edge today. It also inflated people's perception of his actual abilities.

In your view, who would contribute more to science -- 1000 Einsteins, or 10,000 average scientists?[1]

"IQ variation is due to continuous introduction of bad mutations" is an interesting hypothesis, and definitely helps save your theory.  But there are many other candidates, like "slow fixation of positive mutations" and "fitness tradeoffs[2]".

Do you have specific evidence for either:

  1. Deleterious mutations being the primary source of IQ variation
  2. Human intelligence "plateauing" around the level of top humans[3]

Or do you believe these things just because ... (read more)

6Alexander Gietelink Oldenziel
IIRC according to gwern the theory that IQ variation is mostly due to mutational load has been debunked by modern genomic studies [though mutational load definitely has a sizable effect on IQ]. IQ variation seems to be mostly similar to height in being the result of the additive effect of many individual common allele variations.

In your view, who would contribute more to science -- 1000 Einsteins, or 10,000 average scientists?

I vaguely agree with your 90%/60% split for physics vs chemistry. In my field of programming we have the 10x myth/meme, which I think is reasonably correct but it really depends on the task.

For the 10x programmers it's some combination of greater IQ/etc but also starting programming earlier with more focused attention for longer periods of time, which eventually compounds into the 10x difference.

But it really depends on the task distribution - there are s... (read more)

It would still be interesting to know whether you were surprised by GPT-4's capabilities (if you have played with it enough to have a good take)

7Steven Byrnes
When I started blogging about AI alignment in my free time, it happened that GPT-2 had just come out, and everyone on LW was talking about it. So I wrote a couple blog posts (e.g. 1,2) trying (not very successfully, in hindsight, but I was really just starting out, don’t judge) to think through what would happen if GPT-N could reach TAI / x-risk levels. I don’t recall feeling strongly that it would or wouldn’t reach those levels, it just seemed like worth thinking about from a safety perspective and not many other people were doing so at the time.

But in the meantime I was also gradually getting into thinking about brain algorithms, which involve RL much more centrally, and I came to believe that that RL was necessary to reach dangerous capability levels (recent discussion here; I think the first time I wrote it down was here). And I still believe that, and I think the jury’s out as to whether it’s true. (RLHF doesn’t count, it’s just a fine-tuning step, whereas in the brain it’s much more central.)

My updates since then have felt less like “Wow look at what GPT can do” and more like “Wow some of my LW friends think that GPT is rapidly approaching the singularity, and these are pretty reasonable people who have spent a lot more time with LLMs than I have”.

I haven’t personally gotten much useful work out of GPT-4. Especially not for my neuroscience work. I am currently using GPT-4 only for copyediting. (“[The following is a blog post draft. Please create a bullet point list with any typos or grammar errors.] …” “Was there any unexplained jargon in that essay?” Etc.) But maybe I’m bad at prompting, or trying the wrong things. I certainly haven’t tried very much, and find it more useful to see what other people online are saying about GPT-4 and doing with GPT-4, rather than my own very limited experience.

Anyway, I have various theory-driven beliefs about deficiencies of LLMs compared to other possible AI algorithms (the RL thing I mentioned above is just one of ma
3Alexander Gietelink Oldenziel
fwiw, I think I'm fairly close to Steven Byrnes' model. I was not surprised by gpt-4 (but like most people who weren't following LLMs closely was surprised by gpt-2 capabilities)

Human intelligence in terms of brain arch priors also plateaus

Why do you think this?

POV: I'm in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don't care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values). 

Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?

6TurnTrout
I think this highlights a good counterpoint. I think this alternate theory predicts "probably not", although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status -> reward; and it's high-status to sacrifice yourself for your kid). Or because keeping your kid safe -> high reward as another learned drive. Overall this feels like contortion but I think it's possible. Maybe overall this is a... 1-bit update against the "not selection for caring about reality" point?

Isn't going from an average human to Einstein a huge increase in science-productivity, without any flop increase? Then why can't there be software-driven foom, by going farther in whatever direction Einstein's brain is from the average human?

6jacob_cannell
Science/engineering is often a winner-take-all race. To him who has is given more - so for every Einstein there are many others less well known (Lorentz, Minkowski), and so on. Actual ability is filtered through something like a softmax to produce fame, so fame severely underestimates ability. Evolution proceeds by random exploration of parameter space, the more intelligent humans only reproduce a little more than average in aggregate, and there is drag due to mutations. So the subset of the most intelligent humans represents the upper potential of the brain, but it clearly asymptotes. Finally, intelligence results from the interaction of genetics and memetics, just like in ANNs. Digital minds can be copied easily (well at least current ones - future analog neuromorphic minds may be more difficult to copy), so it seems likely that they will not have the equivalent of the mutation load issue as much. On the other hand the great expense of training digital minds and the great cost of GPU RAM means they have much less diversity - many instances of a few minds. None of this by itself leaves much hope for foom.

Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling.  Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.

You could use all of world energy output to have a few billion human-speed AGIs, or millions that think 1000x faster, etc.

Isn't it insanely transformative to have millions of human-level AIs which think 1000x faster??  The difference between top scientists and average humans seems to be something like "software" (Einstein isn't using 2x the watts or neurons).  So then it should be totally possible for each of the "millions of human-level AIs" to be equivalent to Einstein.  Couldn't a million Einstein-level scientists running at 1000x speed ... (read more)

4jacob_cannell
Yes it will be transformative. GPT models already think 1000x to 10000x faster - but only for the learning stage (absorbing knowledge), not for inference (thinking new thoughts).
3Vivek Hebbar
Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling.  Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.

In your view, is it possible to make something which is superhuman (i.e. scaled beyond human level), if you are willing to spend a lot on energy, compute, engineering cost, etc?

1Nicholas / Heather Kross
Oops. Fixed!
2Teerth Aloke
QA sessions.

Any idea why "cheese Euclidean distance to top-right corner" is so important?  It's surprising to me because the convolutional layers should apply the same filter everywhere.

2TurnTrout
I'm also lightly surprised by the strength of the relationship, but not because of the convolutional layers. It seems like if "convolutional layers apply the same filter everywhere" makes me surprised by the cheese-distance influence, it should also make me be surprised by "the mouse behaves differently in a dead-end versus a long corridor" or "the mouse tends to go to the top-right."  (I have some sense of "maybe I'm not grappling with Vivek's reasons for being surprised", so feel free to tell me if so!)
3Vaniver
My naive guess is that the other relationships are nonlinear, and this is the best way to approximate those relationships out of just linear relationships of the variables the regressor had access to.

See Godel's incompleteness theorems.  For example, consider the statement "For all A, (ZFC proves A) implies A", encoded into a form judgeable by ZFC itself.  If you believe ZFC to be sound, then you believe that this statement is true, but due to Godel stuff you must also believe that ZFC cannot prove it.  The reasons for believing ZFC to be sound are reasons from "outside the system" like "it looks logically sound based on common sense", "it's never failed in practice", and "no-one's found a valid issue".  Godel's theorems let us conv... (read more)
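
For reference, the standard statements behind this (my compact phrasing, writing □ for "ZFC proves"):

```latex
% Soundness (reflection) schema -- true if ZFC is sound, but not provable in ZFC:
\forall A:\; \Box_{\mathrm{ZFC}} A \;\rightarrow\; A
% L\"ob's theorem: ZFC proves a reflection instance only for statements it already proves:
\mathrm{ZFC} \vdash (\Box A \rightarrow A) \;\Longrightarrow\; \mathrm{ZFC} \vdash A
% G\"odel's second incompleteness theorem is the special case A = \bot:
% ZFC cannot prove \neg\Box\bot, i.e. \mathrm{Con}(\mathrm{ZFC}).
```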

1Victor Novikov
I think we understand each other! Thank you for clarifying. The way I translate this: some logical statements are true (to you) but not provable (to you), because you are not living in a world of mathematical logic, you are living in a messy, probabilistic world. It is nevertheless true, by the rule of necessitation in provability logic, that if a logical statement is true within the system, then it is also provable within the system. P -> □P. Because the fact that the system is making the statement P is the proof. Within a logical system, there is an underlying assumption that the system only makes true statements. (ok, this is potentially misleading and not strictly correct) This is fascinating! So my takeaway is something like: our reasoning about logical statements and systems is not necessarily "logical" itself, but is often probabilistic and messy. Which is how it has to be, given... our bounded computational power, perhaps? This very much seems to be a logical uncertainty thing.

??? For math this is exactly backward, there can be true-but-unprovable statements

1Victor Novikov
Then how do you know they are true? If you do know then they are true, it is because you have proven it, no? But I think what you are saying is correct, and I'm curious to zoom in on this disagreement.

Agreed.  To give a concrete toy example:  Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}.  If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi.  The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).

This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
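
A quick numerical check of the toy example (my own sketch): each observed "A" is a 2:1 update toward Luigi, and the probability that a "B" has appeared climbs toward 50%.

```python
p_luigi, p_waluigi = 0.5, 0.5   # prior over characters
p_all_A = 1.0                   # P(every token so far was "A")

for t in range(1, 21):
    p_A_next = p_luigi * 1.0 + p_waluigi * 0.5  # Luigi always says "A", Waluigi half the time
    p_all_A *= p_A_next                         # P(first t tokens are all "A")
    # Bayes update on having seen another "A": a 2:1 update toward Luigi.
    p_luigi, p_waluigi = p_luigi / p_A_next, p_waluigi * 0.5 / p_A_next
    print(t, round(p_luigi, 4), round(1.0 - p_all_A, 4))
# p_luigi -> 1, while P(ever seeing a "B") asymptotes to 0.5, as claimed.
```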

8abramdemski
LLMs are high-order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence drops out of memory eventually, the probability of Waluigi only drops to a small value instead of dropping to zero. This makes an eventual Waluigi transition inevitable, as claimed in the post.

Context windows could make the claim from the post correct. Since the simulator can only consider a bounded amount of evidence at once, its P[Waluigi] has a lower bound. Meanwhile, it takes much less evidence than fits in the context window to bring its P[Luigi] down to effectively 0.

Imagine that, in your example, once Waluigi outputs B it will always continue outputting B (if he's already revealed to be Waluigi, there's no point in acting like Luigi). If there's a context window of 10, then the simulator's probability of Waluigi never goes below 1/1025, w... (read more)
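
A quick check of the 1/1025 figure, assuming a 50/50 prior and a window holding ten "A"s (my arithmetic, matching the setup above):

```latex
P[\text{Waluigi} \mid \text{ten A's in the window}]
  \;=\; \frac{\tfrac12 \cdot (\tfrac12)^{10}}
             {\tfrac12 \cdot (\tfrac12)^{10} + \tfrac12 \cdot 1}
  \;=\; \frac{1}{1 + 2^{10}}
  \;=\; \frac{1}{1025}.
```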

1Eschaton
The transform isn't symmetric though right? A character portraying "good" behaviour is, narratively speaking, more likely to have been deceitful the whole time or transform into a villain than for the antagonist to turn "good".
9Cleo Nardo
Yep I think you might be right about the maths actually. I'm thinking that waluigis with 50% A and 50% B have been eliminated by llm pretraining and definitely by rlhf. The only waluigis that remain are deceptive-at-initialisation. So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them. I'm not claiming basin of attraction is the entire space of interpolation between waluigis and luigis. Actually, maybe "attractor" is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What's the right dynamical-systemy term for that?

In section 3.7 of the paper, it seems like the descriptions ("6 in 5", etc) are inconsistent across the image, the caption, and the paragraph before them.  What are the correct labels?  (And maybe fix the paper if these are typos?)

1Kshitij Sachan
This has been fixed now. Thanks for pointing it out! I'm sorry it took me so long to get to this.

Does the easiest way to make you more intelligent also keep your values intact?

What exactly do you mean by "multi objective optimization"?

1DragonGod
Optimising multiple objective functions in a way that cannot be collapsed into a single utility function to e.g. the reals. I guess multi objective optimisation can be represented by a single utility function that maps to a vector space, but as far as I'm aware, utility functions usually have a field as their codomain.

It would help if you specified which subset of "the community" you're arguing against.  I had a similar reaction to your comment as Daniel did, since in my circles (AI safety researchers in Berkeley), governance tends to be well-respected, and I'd be shocked to encounter the sentiment that working for OpenAI is a "betrayal of allegiance to 'the community'".

To be clear, I do think most people who have historically worked on "alignment" at OpenAI have probably caused great harm! And I do think I am broadly in favor of stronger community norms against working at AI capability companies, even in so called "safety positions". So I do think there is something to the sentiment that Critch is describing.

In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.

I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the inductive bias.

8Rohin Shah
You don't need to guess; it's clearly true. Even a 1 trillion parameter network where each parameter is represented with 64 bits can still only represent at most 2^64,000,000,000,000 different functions, which is a tiny tiny fraction of the full space of 2^(2^8,000,000) possible functions. You're already getting at least 2^8,000,000 − 64,000,000,000,000 of the bits just by choosing the network architecture. (This does assume things like "the neural network can learn the correct function rather than a nearly-correct function" but similarly the argument in the OP assumes "the toddler does learn the correct function rather than a nearly-correct function".)
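
The counting behind those numbers, spelled out (my reconstruction, assuming boolean functions on an 8,000,000-bit input and 10^12 parameters of 64 bits each):

```latex
\#\{\text{functions the network can represent}\} \;\le\; 2^{64 \cdot 10^{12}},
\qquad
\#\{\text{all functions}\} \;=\; 2^{2^{8{,}000{,}000}},
% so specifying a function takes 2^{8,000,000} bits, of which the architecture
% choice already supplies at least
2^{8{,}000{,}000} \;-\; 64 \cdot 10^{12}.
```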

Beware, though; string theory may be what underlies QFT and GR, and it describes a world of stringy objects that actually do move through space

I think this contrast is wrong.[1]  IIRC, strings have the same status in string theory that particles do in QFT.  In QM, a wavefunction assigns a complex number to each point in configuration space, where state space has an axis for each property of each particle.[2]  So, for instance, a system with 4 particles with only position and momentum will have a 12-dimensional configuration space.[3]  I... (read more)

2Adam Scherlis
QFT doesn't actually work like that -- the "classical degrees of freedom" underlying its configuration space are classical fields over space, not properties of particles. Note that Quantum Field Theory is not the same as the theory taught in "Quantum Mechanics" courses, which is as you describe.

* "Quantum Mechanics" (in common parlance): quantum theory of (a fixed number of) particles, as you describe.
* "Quantum Field Theory": quantum theory of fields, which are ontologically similar to cellular automata.
* "String Theory": quantum theory of strings, and maybe branes, as you describe.*
* "Quantum Mechanics" (strictly speaking): any of the above; quantum theory of anything.

You can do a change of basis in QFT and get something that looks like properties of particles (Fock space), and people do this very often, but the actual laws of physics in a QFT (the Lagrangian) can't be expressed nicely in the particle ontology because of nonperturbative effects. This doesn't come up often in practice -- I spent most of grad school thinking QFT was agnostic about whether fields or particles are fundamental -- but it's an important thing to recognize in a discussion about whether modern physics privileges one ontology over the other.

(Note that even in the imperfect particle ontology / Fock space picture, you don't have a finite-dimensional classical configuration space. 12 dimensions for 4 particles works great until you end up with a superposition of states with different particle numbers!)

String theory is as you describe, AFAIK, which is why I contrasted it to QFT. But maybe a real string theorist would tell me that nobody believes those strings are the fundamental degrees of freedom, just like particles aren't the fundamental degrees of freedom in QFT.

*Note: People sometimes use "string theory" to refer to weirder things like M-theory, where nobody knows which degrees of freedom to use...

As I understand Vivek's framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. "If food nearby and hunger>15, then be more likely to go to food") and then also a sophisticated utility function (e.g. something like CEV). It's shards all the way down, and all the way up.[10] 

This read to me like you were saying "In Vivek's framework, value shards explain away .." and I was confused.  I now think you mean "My take on Vivek's is that value s... (read more)

2TurnTrout
Reworded, thanks.

Makes perfect sense, thanks!

"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.

How can a partition be a variable?  Should it be "part" instead?

3Ramana Kumar
Partitions (of some underlying set) can be thought of as variables like this:
* The number of values the variable can take on is the number of parts in the partition.
* Every element of the underlying set has some value for the variable, namely, the part that that element is in.

Another way of looking at it: say we're thinking of a variable v: S → D as a function from the underlying set S to v's domain D. Then we can equivalently think of v as the partition {{s ∈ S ∣ v(s) = d} ∣ d ∈ D} ∖ {∅} of S with (up to) |D| parts.

In what you quoted, we construct the underlying set by taking all possible combinations of values for the "original" variables. Then we take all partitions of that to produce all "possible" variables on that set, which will include the original ones and many more.
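
A small concrete version of this construction (my own sketch): take the underlying set to be all value-combinations of two binary variables, and view any function on that set as the partition it induces.

```python
from itertools import product
from collections import defaultdict

worlds = list(product([0, 1], repeat=2))  # all combinations of the original variables (a, b)

def as_partition(variable):
    """View a variable (a function on worlds) as the partition it induces."""
    parts = defaultdict(set)
    for w in worlds:
        parts[variable(w)].add(w)
    return list(parts.values())

print(as_partition(lambda w: w[0]))         # the original variable a: 2 parts
print(as_partition(lambda w: w[0] ^ w[1]))  # a "new" variable (xor): a different 2-part partition
print(as_partition(lambda w: w))            # the finest partition: 4 singleton parts
```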

ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation

Update: I started reading your paper "Corrigibility with Utility Preservation".[1]  My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6.  AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,[2] this is a mathematica... (read more)

3Koen.Holtman
Corrigibility with Utility Preservation is not the paper I would recommend you read first, see my comments included in the list I just posted. To comment on your quick thoughts:

* My later papers spell out the ML analog of the solution in `Corrigibility with' more clearly.
* On your question of Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing a full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky's arguments for pessimism.
* On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. P

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here?  No need to write anything, just links.

OK, Below I will provide links to few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.

This list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019 trying to find all mathematical papers of interest, but have not done so since.

I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).

Math-based w... (read more)

7Vivek Hebbar
ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation

Update: I started reading your paper "Corrigibility with Utility Preservation".[1]  My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6.  AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,[2] this is a mathematical solution to corrigibility in a toy problem, and not a solution to corrigibility in real systems.  Nonetheless, it's a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.[3]  Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists).  In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).

So looking at your claim that "the technical problem [is] mostly solved", this may or may not be true for the narrow sense (like "corrigibility as a theoretical outer-objective problem in formally-specified environments"), but seems false and misleading for the broader practical sense ("knowing how to make an AGI corrigible in real life").[4]

Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt[5]: "In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent's decision procedure]  to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this concl

  1. Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?  

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

  • The agent's allegiance is to some idealized utility function U_ideal (like CEV).  The agent's internal evaluator Eval is "trying" to approximate U_ideal by reasoning heuristically.  So now we ask Eval to evaluate the plan "do argmax w.r.t
... (read more)
3TurnTrout
Vivek -- I replied to your comment in appendix C of today's follow-up post, Alignment allows imperfect decision-influences and doesn't require robust grading. 
4adamShimi
The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:
* Do you think that you are internally trying to approximate your own U_ideal?
* Do you think that you have ever made the decision (either implicitly or explicitly) to not eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues for example)?
* Can you think of concrete instances where you improved your own Eval?
* Can you think of concrete instances where you thought you improved your own Eval but then regretted it later?
* Do you think that your own changes to your eval have been moving in the direction of your U_ideal?
5cfoster0
Yeah I think you're on the right track. A simple framework (that probably isn't strictly distinct from the one you mentioned) would be that the agent has a foresight evaluation method that estimates "How good do I think this plan is?" and a hindsight evaluation method that calculates "How good was it, really?". There can be plans that trick the foresight evaluation method relative to the hindsight one. For example, I can get tricked into thinking some outcome is more likely than it actually is ("The chances of losing my client's money with this investment strategy were way higher than I thought they were.") or thinking that some new state will be hindsight-evaluated better than it actually will be ("He convinced me that if I tried coffee, I would like it, but I just drank it and it tastes disgusting."), etc.
8Wei Dai
This is tempting, but the problem is that I don't know what my idealized utility function is (e.g., I don't have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realized that it's a bad idea, but how does that fit into the framework?

My own framework is something like this:
* The evaluation process is some combination of gut, intuition, explicit reasoning (e.g. cost-benefit analysis), doing philosophy, and cached answers.
* I think there are "adversarial inputs" because I've previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda / crazy memes, so there must be a risk of persuading myself with my own bad ideas.
* I can try to improve my evaluation process by doing things like
  1. look for patterns in my and other people's mistakes
  2. think about ethical dilemmas / try to resolve conflicts between my evaluative subprocesses
  3. do more philosophy (think/learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)
  4. talk (selectively) to other people
  5. try to improve how I do explicit reasoning or philosophy

Yeah, the right column should obviously be all 20s.  There must be a bug in my code[1] :/

I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

Take the following hypothesis :

If I add this into  with weight , then the middle column is still near... (read more)

2Slider
This maps the credence but I would imagine that the confidence would not be evenly spread around the boxes. With confidence literally 0 it does not make sense to express any credence to stand any taller than another as 1 and 0 would make equal sense. With a miniscule confidence the foggy hunch does point in some direction. Without h3 it is consistent to have middle square confidence 0. With positive plausibility of h3 the middle square is not "glossed over"; we have some confidence it might matter. But because h3 is totally useless for credences those come from the structures of h1 and h2. Thus effectively h1 and h2 are voting for zero despite not caring about it. Contrast what would happen with an even more trivial hypothesis of one square covering all with 100% or a 9x9 equiprobable hypothesis. You could also have a "micro detail hypothesis", (actually a 3x3) a 9x9 grid where each 3x3 is zeroes everywhere else than the bottom right corner and all the "small square locations" are in the same case among the other "big square" correspondents. The "big scale" hypotheses do not really mind the "small scale" dragging of the credence around. Thus the small bottom-right square is quite sensitive to the corresponding big square value and the other small squares are relatively insensitive. Mixing two 3x3 resolutions that are orthogonal results in a 9x9 resolution which is sparse (because it is separable). The John Vervaeke meme of "stereoscopic vision" seems to apply. The two 2x2 perspectives are not entirely orthogonal so the "sparsity" is not easy to catch.
2Scott Garrabrant
The point I was trying to make with the partial functions was something like "Yeah, there are 0s, yeah it is bad, but at least we can never assign low probability to any event that any of the hypotheses actually cares about." I guess I could have made that argument more clearly if instead, I just pointed out that any event in the sigma algebra of any of the hypotheses will have probability at least equal to the probability of that hypothesis times the probability of that event in that hypothesis. Thus the 0s (and the 10^−9s) are really coming from the fact that (almost) nobody cares about those events.
2Scott Garrabrant
I agree with all your intuition here. The thing about the partial functions is unsatisfactory, because it is discontinuous. It is trying to be #1, but a little more ambitious. I want the distribution on distributions to be a new type of epistemic state, and the geometric maximization to be the mechanism for converting the new epistemic state to a traditional probability distribution. I think that any decent notion of an embedded epistemic state needs to be closed under both mixing and coarsening, and this is trying to satisfy that as naturally as possible.

I think that the 0s are pretty bad, but I think they are the edge case of the only reasonable thing to do here. I think the reason it feels like the only reasonable thing to do for me is something like credit assignment/hypothesis autonomy. If a world gets probability mass, that should be because some hypothesis or collection of hypotheses insisted on putting probability mass there. You gave an edge case example where this didn't happen. Maybe everything is edge cases. I am not sure.

It might be that the 0s are not as bad as they seem. 0s seem bad because we have cached that "0 means you can't update" but maybe you aren't supposed to be updating in the output distribution anyway, you are supposed to do your updating in the more general epistemic state input object. 

I actually prefer a different proposal for the type of "epistemic state that is closed under coarsening and mixture" that is more general than the thing I gesture at in the post: A generalized epistemic state is a (quasi-?)convex function ΔW → R. A standard probability distribution is converted to an epistemic state through P ↦ (Q ↦ D_KL(P||Q)). A generalized epistemic state is converted to a (convex set of) probability distribution(s) by taking an argmin. Mixture is mixture as functions, and coarsening is the obvious thing (given a function W → V, we can convert a generalized epistemic state over V to a generalized epistemic state over W by precomposing wit

Now, let's consider the following modification: Each hypothesis is no longer a distribution on W, but instead a distribution on some coarser partition of W. Now  is still well defined

Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):

The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.

But philosophically, what d... (read more)

2Scott Garrabrant
I think your numbers are wrong, and the right column on the output should say 20% 20% 20%. The output actually agrees with each of the components on every event in that component's sigma algebra. The input distributions don't actually have any conflicting beliefs, and so of course the output chooses a distribution that doesn't disagree with either.

I agree that the 0s are a bit unfortunate. I think the best way to think of the type of the object you get out is not a probability distribution on W, but what I am calling a partial probability distribution on W. A partial probability distribution is a partial function from 2^W → [0,1] that can be completed to a full probability distribution on W (with some sigma algebra that is a superset of the domain of the partial probability distribution).

I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components. One nice thing about this definition is that it makes it so the argmax always takes on a unique value. (proof omitted.)

This doesn't really make it that much better, but the point here is that this framework admits that it doesn't really make much sense to ask about the probability of the middle column. You can ask about any of the events in the original pair of sigma algebras, and indeed, the two inputs don't disagree with the output at all on any of these sets.

most egregores/epistemic networks, which I'm completely reliant upon, are much smarter than me, so that can't be right

*Egregore smiles*

Another way of looking at this question:  Arithmetic rationality is shift invariant, so you don't have to know your total balance to calculate expected values of bets.  Whereas for geometric rationality, you need to know where the zero point is, since it's not shift invariant.
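
A toy illustration of the shift-(non)invariance point (my own numbers, nothing from the thread): with arithmetic expectation the preferred bet is independent of current wealth, while the geometric criterion flips depending on where zero is.

```python
import math

def arith(bet, wealth):   # expected wealth after the bet
    return sum(p * (wealth + x) for p, x in bet)

def geom(bet, wealth):    # geometric expectation: exp(E[log wealth])
    return math.exp(sum(p * math.log(wealth + x) for p, x in bet))

risky = [(0.5, +6), (0.5, -3)]   # expected payoff +1.5
safe  = [(1.0, +1)]              # expected payoff +1.0

for w in (4, 100):
    print(w,
          round(arith(risky, w) - arith(safe, w), 2),   # always +0.5: shift invariant
          round(geom(risky, w) - geom(safe, w), 2))     # negative at w=4, positive at w=100
```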
