All of Vivek Hebbar's Comments + Replies

I made a few edits to this post today, mostly in response to feedback from Ryan and Richard:

  • Added 2 sentences emphasizing the point that schemers probably won't be aware of their terminal goal in most contexts.  I thought this was clear from the post already, but apparently it wasn't.
  • Modified "What factors affect the likelihood of training-gaming?" to emphasize that "sum of proxies" and "reward-seeker" are points on a spectrum.  We might get an in-between model where context-dependent drives conflict with higher goals and sometimes "win" even out
... (read more)

In the fictional dialogue, Claude Shannon's first answer is more correct -- info theory is useful far outside the original domain of application, and its elegance is the best way to predict that.

2Sam Marks
I disagree. Consider the following two sources of evidence that information theory will be broadly useful: 1. Information theory is elegant. 2. There is some domain of application in which information theory is useful. I think that (2) is stronger evidence than (1). If some framework is elegant but has not been applied downstream in any domain after a reasonable amount of time, then I don't think its elegance is strong reason to nevertheless believe that the framework will later find a domain of application. I think there's some threshold number of downstream applications X such that once a framework has X downstream applications, discovering the (X+1)st application is weaker evidence of broad usefulness than elegance. But very likely, X ≥ 1. Consider e.g. that there are many very elegant mathematical structures that aren't useful for anything.

Slightly more spelled-out thoughts about bounded minds:

  1. We can't actually run the hypotheses of Solomonoff induction.  We can only make arguments about what they will output.
  2. In fact, almost all of the relevant uncertainty is logical uncertainty.  The "hypotheses" (programs) of Solomonoff induction are not the same as the "hypotheses" entertained by bounded Bayesian minds.  I don't know of any published formal account of what these bounded hypotheses even are and how they relate to Solomonoff induction.  But informally, all I'm talking ab
... (read more)

I suspect there is some merit to the Scientist's intuition (and the idea that constant returns are more "empirical") which nobody has managed to explain well.  I'll try to explain it here.[1]

The Epistemologist's notion of simplicity is about short programs with unbounded runtime which perfectly explain all evidence.  The [non-straw] empiricist notion of simplicity is about short programs with heavily-bounded runtime which approximately explain a subset of the evidence.  The Epistemologist is right that there is nothing of value in the empiri... (read more)

4Vivek Hebbar
Slightly more spelled-out thoughts about bounded minds: 1. We can't actually run the hypotheses of Solomonoff induction.  We can only make arguments about what they will output. 2. In fact, almost all of the relevant uncertainty is logical uncertainty.  The "hypotheses" (programs) of Solomonoff induction are not the same as the "hypotheses" entertained by bounded Bayesian minds.  I don't know of any published formal account of what these bounded hypotheses even are and how they relate to Solomonoff induction.  But informally, all I'm talking about are ordinary hypotheses like "the Ponzi guy only gets money from new investors". 3. In addition to "bounded hypotheses" (of unknown type), we also have "arguments".  An argument is a thing whose existence provides fallible evidence for a claim. 4. Arguments are made of pieces which can be combined "conjunctively" or "disjunctively".  The conjunction of two subarguments is weaker evidence for its claim than each subargument was for its subclaim.  This is the sense in which "big arguments" are worse.
Vivek HebbarΩ120

I'd be capable of helping aliens optimize their world, sure. I wouldn't be motivated to, but I'd be capable.

@So8res How many bits of complexity is the simplest modification to your brain that would make you in fact help them?  (asking for an order-of-magnitude wild guess)
(This could be by actually changing your values-upon-reflection, or by locally confusing you about what's in your interest, or by any other means.)

Sigmoid is usually what "straight line" should mean for a quantity bounded between 0 and 1.  It's a straight line in logit-space, the most natural space which complies with that range restriction.
(Just as exponentials are often the correct form of "straight line" for things that are required to be positive but have no ceiling in sight.)
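To make that concrete, here's a minimal sketch (assuming numpy; the slope and intercept below are arbitrary illustrative values) showing that a curve which looks sigmoidal in probability space is exactly a straight line after the logit transform:

```python
# Minimal sketch: a logistic trend is a straight line in logit space.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    return np.log(p / (1 - p))

t = np.linspace(-3, 3, 7)
p = sigmoid(2.0 * t + 0.5)    # sigmoid-shaped in probability space (values in (0, 1))
print(np.round(logit(p), 3))  # recovers 2.0 * t + 0.5 exactly: linear in t
```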

Vivek HebbarΩ330

Do you want to try playing this game together sometime?

2Daniel Kokotajlo
Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?
Vivek HebbarΩ342

We're then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We're going to try to use the RL to train: "Act exactly like [a given alignment researcher] would act."

Why are we doing RL if we just want imitation?  Why not SFT on expert demonstrations?
Also, if 10 episodes suffice, why is so much post-training currently done on base models?

Vivek HebbarΩ110

If the agent follows EDT, it seems like you are giving it epistemically unsound credences. In particular, the premise is that it's very confident it will go left, and the consequence is that it in fact goes right. This was the world model's fault, not EDT's fault. (It is notable though that EDT introduces this loopiness into the world model's job.)

2ryan_greenblatt
Thanks, I improved the wording.

Superadditivity seems rare in practice.  For instance, workers should have subadditive contributions after some point.  This is certainly true in the unemployment example in the post.
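To make "subadditive contributions" concrete, here is a minimal sketch (hypothetical numbers, not taken from the post) of a two-worker game with diminishing returns, along with its Shapley values:

```python
# Minimal sketch: a subadditive two-player game, v({1,2}) < v({1}) + v({2}),
# and the resulting Shapley values (average marginal contribution over orderings).
from itertools import permutations

players = [1, 2]
v = {frozenset(): 0.0,
     frozenset({1}): 1.0,
     frozenset({2}): 1.0,
     frozenset({1, 2}): 1.5}   # hypothetical diminishing returns

def shapley(player):
    orders = list(permutations(players))
    total = 0.0
    for order in orders:
        before = frozenset(order[:order.index(player)])
        total += v[before | {player}] - v[before]
    return total / len(orders)

print([shapley(p) for p in players])  # [0.75, 0.75]
```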

1lalaithion
Perhaps there is a different scheme for dividing gains from coöperation which satisfies some of the things we want but not superadditivity, but I’m unfamiliar with one. Please let me know if you find anything in that vein, I’d love to read about some alternatives to Shapley Value.
Vivek HebbarΩ5118

The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through:
1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor
2. Either this labor is done by AIs in approved scaffolds, or it is done in "rogue deployments"
3. Hence the only easy-by-default disaster route is through a rogue deployment
4. Hence if we rule out rogue deployments and very impressive/difficult malicious labor in our scaffolds, we are safe

This seems true f... (read more)

5ryan_greenblatt
Hmm, I agree this division would be more useful if this argument went through, but I think it is quite useful even without this, and thus worth noting. (And indeed the post doesn't make this argument and discusses subtle manipulation.) I think subtle manipulation is a reasonably plausible threat model.

This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).

Paul [Christiano] called this “problems of the interior” somewhere

Since it's slightly hard to find: Paul references it here (ctrl+f for "interior") and links to this source (once again ctrl+f for "interior").  Paul also refers to it in this post.  The term is actually "position of the interior" and apparently comes from military strategist Carl von Clausewitz.

4DanielFilan
Thanks for finding this! Will link it in the transcript.
4ryan_greenblatt
Also some discussion in this thread.
3Vivek Hebbar
This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).

Can you clarify what figure 1 and figure 2 are showing?  

I took the text description before figure 1 to mean {score on column after finetuning on 200 from row then 10 from column} - {score on column after finetuning on 10 from column}.  But then the text right after says "Babbage fine-tuned on addition gets 27% accuracy on the multiplication dataset" which seems like a different thing.

1agg
Position i, j in figure 1 represents how well a model fine-tuned on 200 examples of dataset i performs on dataset j; Position i, j in figure 2 represents how well a model fine-tuned on 200 examples of dataset i, and then fine-tuned on 10 examples of dataset j, performs on dataset j.

Note: The survey took me 20 mins (but also note selection effects on leaving this comment)

1Cameron Berg
Definitely good to know that it might take a bit longer than we had estimated from earlier respondents (with the well-taken selection effect caveat).  Note that if it takes between 10-20 minutes to fill out, this still works out to donating $120-240/researcher-hour to high-impact alignment orgs (plus whatever the value is of the comparison of one's individual results to that of community), which hopefully is worth the time investment :)

Here's a fun thing I noticed:

There are 16 boolean functions of two variables.  Now consider an embedding that maps each of the four pairs {(A=true, B=true), (A=true, B=false), ...} to a point in 2d space.  For any such embedding, at most 14 of the 16 functions will be representable with a linear decision boundary.

For the "default" embedding (x=A, y=B), xor and its complement are the two excluded functions.  If we rearrange the points such that xor is linearly represented, we always lose some other function (and its complement).  In fact... (read more)

Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn't precisely the right well-behavedness desideratum, but I retract the "directionally wrong".

The variance of the multivariate uniform distribution U([0,1]^2) is largest along the direction (1,1), which is exactly the direction which we would want to represent a AND b.

The variance is actually the same in all directions.  One can sanity-check by integration that the variance is 1/12 both along the axis and along the diagonal.

In fact, there's nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are ind... (read more)
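For what it's worth, a quick Monte Carlo check (assuming numpy) agrees with the integration: the projected variance is about 1/12 ≈ 0.083 both along an axis and along the diagonal:

```python
# Variance of U([0,1]^2) projected onto an axis vs. onto the diagonal (1,1)/sqrt(2).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0, 1, size=(1_000_000, 2))

axis = np.array([1.0, 0.0])
diag = np.array([1.0, 1.0]) / np.sqrt(2)

print(np.var(samples @ axis))  # ~0.0833 (= 1/12)
print(np.var(samples @ diag))  # ~0.0833 (= 1/12), same as along the axis
```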

3Sam Marks
Thanks, you're totally right about the equal variance thing -- I had stupidly thought that the projection of U([0,1]^2) onto y = x would be uniform on [−1/√2, 1/√2] (obviously false!). The case of a fully discrete distribution (supported in this case on four points) seems like a very special case of something more general, where a "more typical" special case would be something like: * if a, b are both false, then sample from N(0, Σ) * if a is true and b is false, then sample from N(μ_a, Σ) * if a is false and b is true then sample from N(μ_b, Σ) * if a and b are true, then sample from N(μ_a + μ_b, Σ) for some μ_a, μ_b ∈ R^n and covariance matrix Σ. In general, I don't really expect the class-conditional distributions to be Gaussian, nor for the class-conditional covariances to be independent of the class. But I do expect something broadly like this, where the distributions are concentrated around their class-conditional means with probability falling off as you move further from the class-conditional mean (hence unimodality), and that the class-conditional variances are not too big relative to the distance between the clusters. Given that longer explanation, does the unimodality thing still seem directionally wrong?
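A minimal sketch of the data-generating process described above (assuming numpy; the particular μ_a, μ_b, Σ below are arbitrary placeholders):

```python
# Each (a, b) combination is a Gaussian cluster whose mean is the sum of the
# feature directions that are "on": 0, mu_a, mu_b, or mu_a + mu_b.
import numpy as np

rng = np.random.default_rng(0)
dim = 4
mu_a = rng.normal(size=dim)
mu_b = rng.normal(size=dim)
Sigma = 0.05 * np.eye(dim)   # small shared covariance, so clusters stay separated

def sample(a: bool, b: bool):
    mean = a * mu_a + b * mu_b
    return rng.multivariate_normal(mean, Sigma)

print(sample(True, False))  # concentrated near mu_a
print(sample(True, True))   # concentrated near mu_a + mu_b
```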

Maybe models track which features are basic and enforce that these features be more salient

Couldn't it just write derivative features more weakly, and therefore not need any tracking mechanism other than the magnitude itself?

2Sam Marks
Some features which are computed from other features should probably themselves be treated as basic and thus represented with large salience.

It's sad that agentfoundations.org links no longer work, leading to broken links in many decision theory posts (e.g. here and here)

2habryka
Oh, hmm, this seems like a bug on our side. I definitely set up a redirect a while ago that should make those links work. My guess is something broke in the last few months.
2Vladimir_Nesov
Thanks for the heads up. Example broken link (https://agentfoundations.org/item?id=32), currently redirects to broken https://www.alignmentforum.org/item?id=32, should redirect further to https://www.alignmentforum.org/posts/5bd75cc58225bf0670374e7d/exploiting-edt (Exploiting EDT[1]), archive.today snapshot. Edit 14 Oct: It works now, even for links to comments, thanks LW team! ---------------------------------------- 1. LW confusingly replaces the link to www.alignmentforum.org given in Markdown comment source text with a link to www.lesswrong.com when displaying the comment on LW. ↩︎

This will initially boost  relative to  because it will suddenly be joined to a network which is correctly transmitting  but which does not understand  at all.

However, as these networks are trained to equilibrium the advantage will disappear as a steganographic protocol is agreed between the two models. Also, this can only be used once before the networks are in equilibrium.

Why would it be desirable to do this end-to-end training at all, rather than simply sticking the two networks together and doing no furthe... (read more)

I've been asked to clarify a point of fact, so I'll do so here:

My recollection is that he probed a little and was like "I'm not too worried about that" and didn't probe further.

This does ring a bell, and my brain is weakly telling me it did happen on a walk with Nate, but it's so fuzzy that I can't tell if it's a real memory or not.  A confounder here is that I've probably also had the conversational route "MIRI burnout is a thing, yikes" -> "I'm not too worried, I'm a robust and upbeat person" multiple times with people other than Nate.

In private ... (read more)

7TurnTrout
This is a slight positive update for me. I maintain my overall worry and critique: chats which are forgettable do not constitute sufficient warning.  Insofar as non-Nate MIRI personnel thoroughly warned Vivek, that is another slight positive update, since this warning should reliably be encountered by potential hires. If Vivek was independently warned via random social connections not possessed by everyone,[1] then that's a slight negative update.  1. ^ For example, Thomas Kwa learned about Nate's comm doc by randomly talking with a close friend of Nate's, and mentioning comm difficulties.
3leogao
I meant it as an analogy to https://en.m.wikipedia.org/wiki/Denormalization

In database design, sometimes you have a column in one table whose entries are pointers into another table - e.g. maybe I have a Users table, and each User has a primaryAddress field which is a pointer into an Address table. That keeps things relatively compact and often naturally represents things - e.g. if several Users in a family share a primary address, then they can all point to the same Address. The Address only needs to be represented once (so it's relatively compact), and it can also be changed once for everyone if that's a thing someone wants to ... (read more)
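A minimal sketch of that normalized layout in plain Python dicts (hypothetical field names, just to illustrate the pointer-into-another-table idea):

```python
# Two Users share one Address row via a pointer (foreign key), so the Address is
# stored once and a single update is seen by everyone who references it.
addresses = {101: {"street": "1 Main St", "city": "Springfield"}}
users = {
    "alice": {"primaryAddress": 101},
    "bob":   {"primaryAddress": 101},  # same family, same Address row
}

addresses[101]["street"] = "2 Main St"              # one change...
print(addresses[users["alice"]["primaryAddress"]])  # ...visible via alice's pointer
print(addresses[users["bob"]["primaryAddress"]])    # ...and via bob's pointer
```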

When you describe the "emailing protein sequences -> nanotech" route, are you imagining an AGI with computers on which it can run code (like simulations)?  Or do you claim that the AGI could design the protein sequences without writing simulations, by simply thinking about it "in its head"?

4Eliezer Yudkowsky
At the superintelligent level there's not a binary difference between those two clusters.  You just compute each thing you need to know efficiently.

Cool! It wrote and executed code to solve the problem, and it got it right.

Are you using chat-GPT-4?  I thought it can't run code?

1Jonathan Marcus
Interesting! Yes, I am using ChatGPT with GPT-4. It printed out the code, then *told me that it ran it*, then printed out a correct answer. I didn't think to fact-check it; instead I assumed that OpenAI had been adding some impressive/scary new features.

Interesting, I find what you are saying here broadly plausible, and it is updating me (at least toward greater uncertainty/confusion).  I notice that I don't expect the 10x effect, or the Von Neumann effect, to be anywhere close to purely genetic.  Maybe some path-dependency in learning?  But my intuition (of unknown quality) is that there should be some software tweaks which make the high end of this more reliably achievable.

Anyway, to check that I understand your position, would this be a fair dialogue?:

Person: "The jump from chimps to hu

... (read more)
5jacob_cannell
Your model of my model sounds about right, but I also include a neoteny extension of perhaps 2x which is part of the scale up (spending longer on training the cortex, especially in higher brain regions). For Von Neumann in particular my understanding is he was some combination of 'regular' genius and a mentat (a person who can perform certain computer-like calculations quickly), which was very useful for many science tasks in an era lacking fast computers and software like mathematica, but would provide less of an effective edge today. It also inflated people's perception of his actual abilities.

In your view, who would contribute more to science -- 1000 Einsteins, or 10,000 average scientists?[1]

"IQ variation is due to continuous introduction of bad mutations" is an interesting hypothesis, and definitely helps save your theory.  But there are many other candidates, like "slow fixation of positive mutations" and "fitness tradeoffs[2]".

Do you have specific evidence for either:

  1. Deleterious mutations being the primary source of IQ variation
  2. Human intelligence "plateauing" around the level of top humans[3]

Or do you believe these things just because ... (read more)

6Alexander Gietelink Oldenziel
IIRC according to gwern the theory that IQ variation is mostly due to mutational load has been debunked by modern genomic studies [though mutational load definitely has a sizable effect on IQ]. IQ variation seems to be mostly similar to height in being the result of the additive effect of many individual common allele variations.

In your view, who would contribute more to science -- 1000 Einsteins, or 10,000 average scientists?

I vaguely agree with your 90%/60% split for physics vs chemistry. In my field of programming we have the 10x myth/meme, which I think is reasonably correct but it really depends on the task.

For the 10x programmers it's some combination of greater IQ/etc but also starting programming earlier with more focused attention for longer periods of time, which eventually compounds into the 10x difference.

But it really depends on the task distribution - there are s... (read more)

It would still be interesting to know whether you were surprised by GPT-4's capabilities (if you have played with it enough to have a good take)

7Steven Byrnes
When I started blogging about AI alignment in my free time, it happened that GPT-2 had just come out, and everyone on LW was talking about it. So I wrote a couple blog posts (e.g. 1,2) trying (not very successfully, in hindsight, but I was really just starting out, don’t judge) to think through what would happen if GPT-N could reach TAI / x-risk levels. I don’t recall feeling strongly that it would or wouldn’t reach those levels, it just seemed worth thinking about from a safety perspective and not many other people were doing so at the time. But in the meantime I was also gradually getting into thinking about brain algorithms, which involve RL much more centrally, and I came to believe that RL was necessary to reach dangerous capability levels (recent discussion here; I think the first time I wrote it down was here). And I still believe that, and I think the jury’s out as to whether it’s true. (RLHF doesn’t count, it’s just a fine-tuning step, whereas in the brain it’s much more central.) My updates since then have felt less like “Wow look at what GPT can do” and more like “Wow some of my LW friends think that GPT is rapidly approaching the singularity, and these are pretty reasonable people who have spent a lot more time with LLMs than I have”. I haven’t personally gotten much useful work out of GPT-4. Especially not for my neuroscience work. I am currently using GPT-4 only for copyediting. (“[The following is a blog post draft. Please create a bullet point list with any typos or grammar errors.] …” “Was there any unexplained jargon in that essay?” Etc.) But maybe I’m bad at prompting, or trying the wrong things. I certainly haven’t tried very much, and find it more useful to see what other people online are saying about GPT-4 and doing with GPT-4, rather than my own very limited experience. Anyway, I have various theory-driven beliefs about deficiencies of LLMs compared to other possible AI algorithms (the RL thing I mentioned above is just one of ma
3Alexander Gietelink Oldenziel
fwiw, I think I'm fairly close to Steven Byrnes' model. I was not surprised by gpt-4 (but like most people who weren't following LLMs closely was surprised by gpt-2 capabilities)

Human intelligence in terms of brain arch priors also plateaus

Why do you think this?

POV: I'm in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don't care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).

Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?

6TurnTrout
I think this highlights a good counterpoint. I think this alternate theory predicts "probably not", although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status -> reward; and it's high-status to sacrifice yourself for your kid). Or because keeping your kid safe -> high reward as another learned drive. Overall this feels like contortion but I think it's possible. Maybe overall this is a... 1-bit update against the "not selection for caring about reality" point?

Isn't going from an average human to Einstein a huge increase in science-productivity, without any flop increase? Then why can't there be software-driven foom, by going farther in whatever direction Einstein's brain is from the average human?

6jacob_cannell
Science/engineering is often a winner-take all race. To him who has is given more - so for every Einstein there are many others less well known (Lorentz, Minkowski), and so on. Actual ability is filtered through something like a softmax to produce fame, so fame severely underestimates ability. Evolution proceeds by random exploration of parameter space, the more intelligent humans only reproduce a little more than average in aggregation, and there is drag due to mutations. So the subset of the most intelligent humans represents the upper potential of the brain, but it clearly asymptotes. Finally, intelligence results from the interaction of genetics and memetics, just like in ANNs. Digital minds can be copied easily (well at least current ones - future analog neuromorphic minds may be more difficult to copy), so it seems likely that they will not have the equivalent of the mutation load issue as much. On the other hand the great expense of training digital minds and the great cost of GPU RAM means they have much less diversity - many instances of a few minds. None of this by itself leaves much hope for foom.

Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling.  Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.

You could use all of world energy output to have a few billion human speed AGI, or a million that think 1000x faster, etc.

Isn't it insanely transformative to have millions of human-level AIs which think 1000x faster??  The difference between top scientists and average humans seems to be something like "software" (Einstein isn't using 2x the watts or neurons).  So then it should be totally possible for each of the "millions of human-level AIs" to be equivalent to Einstein.  Couldn't a million Einstein-level scientists running at 1000x speed ... (read more)

4jacob_cannell
Yes it will be transformative. GPT models already think 1000x to 10000x faster - but only for the learning stage (absorbing knowledge), not for inference (thinking new thoughts).
3Vivek Hebbar
Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling.  Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.

In your view, is it possible to make something which is superhuman (i.e. scaled beyond human level), if you are willing to spend a lot on energy, compute, engineering cost, etc?

It would be "QA", not "QE"

1Nicholas / Heather Kross
Oops. Fixed!
2Teerth Aloke
QA sessions.
Vivek HebbarΩ24-2

Any idea why "cheese Euclidean distance to top-right corner" is so important?  It's surprising to me because the convolutional layers should apply the same filter everywhere.

2TurnTrout
I'm also lightly surprised by the strength of the relationship, but not because of the convolutional layers. It seems like if "convolutional layers apply the same filter everywhere" makes me surprised by the cheese-distance influence, it should also make me be surprised by "the mouse behaves differently in a dead-end versus a long corridor" or "the mouse tends to go to the top-right."  (I have some sense of "maybe I'm not grappling with Vivek's reasons for being surprised", so feel free to tell me if so!)
3Vaniver
My naive guess is that the other relationships are nonlinear, and this is the best way to approximate those relationships out of just linear relationships of the variables the regressor had access to.

See Godel's incompleteness theorems.  For example, consider the statement "For all A, (ZFC proves A) implies A", encoded into a form judgeable by ZFC itself.  If you believe ZFC to be sound, then you believe that this statement is true, but due to Godel stuff you must also believe that ZFC cannot prove it.  The reasons for believing ZFC to be sound are reasons from "outside the system" like "it looks logically sound based on common sense", "it's never failed in practice", and "no-one's found a valid issue".  Godel's theorems let us conv... (read more)

1Victor Novikov
I think we understand each other! Thank you for clarifying. The way I translate this: some logical statements are true (to you) but not provable (to you), because you are not living in a world of mathematical logic, you are living in a messy, probabilistic world. It is nevertheless true, by the rule of necessitation in provability logic, that if a logical statement is true within the system, then it is also provable within the system. P -> □P. Because the fact that the system is making the statement P is the proof. Within a logical system, there is an underlying assumption that the system only makes true statements. (ok, this is potentially misleading and not strictly correct) This is fascinating! So my takeaway is something like: our reasoning about logical statements and systems is not necessarily "logical" itself, but is often probabilistic and messy. Which is how it has to be, given... our bounded computational power, perhaps? This very much seems to be a logical uncertainty thing.

??? For math this is exactly backward, there can be true-but-unprovable statements

1Victor Novikov
Then how do you know they are true? If you do know then they are true, it is because you have proven it, no? But I think what you are saying is correct, and I'm curious to zoom in on this disagreement.
Vivek HebbarΩ102617

Agreed.  To give a concrete toy example:  Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}.  If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi.  The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).

This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
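A minimal sketch of the toy example's dynamics (plain Python, no dependencies), tracking the posterior on Luigi after an all-"A" prefix and the cumulative probability of having seen a "B":

```python
# Luigi always outputs "A"; Waluigi outputs "A"/"B" with 50/50 odds; prior is 50/50.
p_luigi = 0.5
p_no_b_yet = 1.0  # probability that only "A"s have appeared so far

for n in range(1, 21):
    p_b_next = (1 - p_luigi) * 0.5          # P(next token is "B" | all-"A" so far)
    p_no_b_yet *= 1 - p_b_next
    p_luigi = p_luigi / (p_luigi + (1 - p_luigi) * 0.5)  # 2:1 update on seeing "A"
    print(n, round(p_luigi, 4), round(1 - p_no_b_yet, 4))

# P(Luigi | n "A"s) = 2^n / (2^n + 1) -> 1, while P(ever seeing a "B") -> 0.5.
```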

8abramdemski
LLMs are high order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence drops out of memory eventually, the probability of Waluigi drops very small instead of dropping to zero. This makes an eventual waluigi transition inevitable as claimed in the post.

Context windows could make the claim from the post correct. Since the simulator can only consider a bounded amount of evidence at once, its P[Waluigi] has a lower bound. Meanwhile, it takes much less evidence than fits in the context window to bring its P[Luigi] down to effectively 0.

Imagine that, in your example, once Waluigi outputs B it will always continue outputting B (if he's already revealed to be Waluigi, there's no point in acting like Luigi). If there's a context window of 10, then the simulator's probability of Waluigi never goes below 1/1025, w... (read more)
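For the concrete numbers (presumably the calculation behind the 1/1025 figure): with a 50/50 prior and a window containing ten "A"s, Bayes gives P(Waluigi) = 0.5·(1/2)^10 / (0.5·(1/2)^10 + 0.5·1) = 1/1025.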

1Eschaton
The transform isn't symmetric though right? A character portraying "good" behaviour is, narratively speaking, more likely to have been deceitful the whole time or transform into a villain than for the antagonist to turn "good".
9Cleo Nardo
Yep I think you might be right about the maths actually. I'm thinking that waluigis with 50% A and 50% B have been eliminated by llm pretraining and definitely by rlhf. The only waluigis that remain are deceptive-at-initialisation. So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them. I'm not claiming basin of attraction is the entire space of interpolation between waluigis and luigis. Actually, maybe "attractor" is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What's the right dynamical-systemy term for that?

In section 3.7 of the paper, it seems like the descriptions ("6 in 5", etc) are inconsistent across the image, the caption, and the paragraph before them.  What are the correct labels?  (And maybe fix the paper if these are typos?)

1Kshitij Sachan
This has been fixed now. Thanks for pointing it out! I'm sorry it took me so long to get to this.

Does the easiest way to make you more intelligent also keep your values intact?

What exactly do you mean by "multi objective optimization"?

1DragonGod
Optimising multiple objective functions in a way that cannot be collapsed into a single utility function to e.g. the reals. I guess multi objective optimisation can be represented by a single utility function that maps to a vector space, but as far as I'm aware, utility functions usually have a field as their codomain.
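A minimal sketch (with hypothetical objective scores) of the sense in which a vector-valued objective need not collapse to a single real-valued utility: Pareto dominance only gives a partial order, so some options are simply incomparable:

```python
# Pareto dominance between vector-valued "utilities": u dominates v iff it is at
# least as good on every objective and strictly better on at least one.
def pareto_dominates(u, v):
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

option_x = (3.0, 1.0)  # (objective 1, objective 2)
option_y = (1.0, 3.0)

print(pareto_dominates(option_x, option_y))  # False
print(pareto_dominates(option_y, option_x))  # False -> the two options are incomparable
```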

It would help if you specified which subset of "the community" you're arguing against.  I had a similar reaction to your comment as Daniel did, since in my circles (AI safety researchers in Berkeley), governance tends to be well-respected, and I'd be shocked to encounter the sentiment that working for OpenAI is a "betrayal of allegiance to 'the community'".

habryka1621

To be clear, I do think most people who have historically worked on "alignment" at OpenAI have probably caused great harm! And I do think I am broadly in favor of stronger community norms against working at AI capability companies, even in so called "safety positions". So I do think there is something to the sentiment that Critch is describing.

In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.

I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the inductive bias.

8Rohin Shah
You don't need to guess; it's clearly true. Even a 1 trillion parameter network where each parameter is represented with 64 bits can still only represent at most 2^64,000,000,000,000 different functions, which is a tiny tiny fraction of the full space of 2^(2^8,000,000) possible functions. You're already getting at least 2^8,000,000 − 64,000,000,000,000 of the bits just by choosing the network architecture. (This does assume things like "the neural network can learn the correct function rather than a nearly-correct function" but similarly the argument in the OP assumes "the toddler does learn the correct function rather than a nearly-correct function".)

Beware, though; string theory may be what underlies QFT and GR, and it describes a world of stringy objects that actually do move through space

I think this contrast is wrong.[1]  IIRC, strings have the same status in string theory that particles do in QFT.  In QM, a wavefunction assigns a complex number to each point in configuration space, where state space has an axis for each property of each particle.[2]  So, for instance, a system with 4 particles with only position and momentum will have a 12-dimensional configuration space.[3]  I... (read more)

2Adam Scherlis
QFT doesn't actually work like that -- the "classical degrees of freedom" underlying its configuration space are classical fields over space, not properties of particles. Note that Quantum Field Theory is not the same as the theory taught in "Quantum Mechanics" courses, which is as you describe. "Quantum Mechanics" (in common parlance): quantum theory of (a fixed number of) particles, as you describe. "Quantum Field Theory": quantum theory of fields, which are ontologically similar to cellular automata. "String Theory": quantum theory of strings, and maybe branes, as you describe.* "Quantum Mechanics" (strictly speaking): any of the above; quantum theory of anything. You can do a change of basis in QFT and get something that looks like properties of particles (Fock space), and people do this very often, but the actual laws of physics in a QFT (the Lagrangian) can't be expressed nicely in the particle ontology because of nonperturbative effects. This doesn't come up often in practice -- I spent most of grad school thinking QFT was agnostic about whether fields or particles are fundamental -- but it's an important thing to recognize in a discussion about whether modern physics privileges one ontology over the other. (Note that even in the imperfect particle ontology / Fock space picture, you don't have a finite-dimensional classical configuration space. 12 dimensions for 4 particles works great until you end up with a superposition of states with different particle numbers!) String theory is as you describe, AFAIK, which is why I contrasted it to QFT. But maybe a real string theorist would tell me that nobody believes those strings are the fundamental degrees of freedom, just like particles aren't the fundamental degrees of freedom in QFT. *Note: People sometimes use "string theory" to refer to weirder things like M-theory, where nobody knows which degrees of freedom to use...

As I understand Vivek's framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. "If food nearby and hunger>15, then be more likely to go to food") and then also a sophisticated utility function (e.g. something like CEV). It's shards all the way down, and all the way up.[10] 

This read to me like you were saying "In Vivek's framework, value shards explain away .." and I was confused.  I now think you mean "My take on Vivek's is that value s... (read more)

2TurnTrout
Reworded, thanks.

Makes perfect sense, thanks!

"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.

How can a partition be a variable?  Should it be "part" instead?

3Ramana Kumar
Partitions (of some underlying set) can be thought of as variables like this: * The number of values the variable can take on is the number of parts in the partition. * Every element of the underlying set has some value for the variable, namely, the part that that element is in. Another way of looking at it: say we're thinking of a variable v:S→D as a function from the underlying set S to v's domain D. Then we can equivalently think of v as the partition {{s∈S∣v(s)=d}∣d∈D}∖∅ of S with (up to) |D| parts. In what you quoted, we construct the underlying set by taking all possible combinations of values for the "original" variables. Then we take all partitions of that to produce all "possible" variables on that set, which will include the original ones and many more.
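A minimal sketch in code of the construction described above (hypothetical example with two binary original variables):

```python
# View a variable on a finite underlying set as the partition it induces:
# elements are grouped by the value the variable assigns them.
from itertools import product

underlying = list(product([0, 1], repeat=2))  # all joint values of (v1, v2)

def as_partition(variable):
    parts = {}
    for s in underlying:
        parts.setdefault(variable(s), set()).add(s)
    return [frozenset(p) for p in parts.values()]

print(as_partition(lambda s: s[0]))         # the original variable v1: 2 parts
print(as_partition(lambda s: s[0] ^ s[1]))  # a "new" variable (XOR of v1, v2): 2 parts
```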
Vivek Hebbar*Ω270

ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation

Update: I started reading your paper "Corrigibility with Utility Preservation".[1]  My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6.  AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,[2] this is a mathematica... (read more)

3Koen.Holtman
Corrigibility with Utility Preservation is not the paper I would recommend you read first, see my comments included in the list I just posted. To comment on your quick thoughts: * My later papers spell out the ML analog of the solution in `Corrigibility with' more clearly. * On your question of Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing a full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky's arguments for pessimism. * On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. P