All of Vivek Hebbar's Comments + Replies

Do you want to try playing this game together sometime?

2Daniel Kokotajlo
Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?

We're then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We're going to try to use the RL to train: "Act exactly like [a given alignment researcher] would act."

Why are we doing RL if we just want imitation?  Why not SFT on expert demonstrations?
Also, if 10 episodes suffices, why is so much post-training currently done on base models?

If the agent follows EDT, it seems like you are giving it epistemically unsound credences. In particular, the premise is that it's very confident it will go left, and the consequence is that it in fact goes right. This was the world model's fault, not EDT's fault. (It is notable though that EDT introduces this loopiness into the world model's job.)

2ryan_greenblatt
Thanks, I improved the wording.

Superadditivity seems rare in practice.  For instance, workers should have subadditive contributions after some point.  This is certainly true in the unemployment example in the post.

1lalaithion
Perhaps there is a different scheme for dividing gains from coöperation which satisfies some of the things we want but not superadditivity, but I’m unfamiliar with one. Please let me know if you find anything in that vein, I’d love to read about some alternatives to Shapley Value.
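
A minimal sketch of the subadditivity point and the Shapley split (my own illustration; the worker names and numbers are made up): two workers who each produce 10 alone but only 15 together have subadditive contributions, and the Shapley value still divides the total cleanly (7.5 each).

```python
from itertools import permutations
from math import factorial

def shapley_values(players, v):
    """Shapley value: each player's marginal contribution averaged over all join orders."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in phi.items()}

# Subadditive example: each worker produces 10 alone but only 15 together
# (they crowd each other), so v(S U T) < v(S) + v(T) for disjoint S, T.
v = {frozenset(): 0, frozenset({"alice"}): 10, frozenset({"bob"}): 10,
     frozenset({"alice", "bob"}): 15}.get
print(shapley_values(["alice", "bob"], v))  # {'alice': 7.5, 'bob': 7.5}
```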

The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through:
1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor
2. Either this labor is done by AIs in approved scaffolds, or it is done in "rogue deployments"
3. Hence the only easy-by-default disaster route is through a rogue deployment
4. Hence if we rule out rogue deployments and very impressive/difficult malicious labor in our scaffolds, we are safe

This seems true f... (read more)

5ryan_greenblatt
Hmm, I agree this division would be more useful if this argument went through, but I think it is quite useful even without this, and that is worth noting. (And indeed the post doesn't make this argument and discusses subtle manipulation.) I think subtle manipulation is a reasonably plausible threat model.

This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).

Paul [Christiano] called this “problems of the interior” somewhere

Since it's slightly hard to find: Paul references it here (ctrl+f for "interior") and links to this source (once again ctrl+f for "interior").  Paul also refers to it in this post.  The term is actually "position of the interior" and apparently comes from military strategist Carl von Clausewitz.

4DanielFilan
Thanks for finding this! Will link it in the transcript.
4ryan_greenblatt
Also some discussion in this thread.
3Vivek Hebbar
This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).

Can you clarify what figure 1 and figure 2 are showing?  

I took the text description before figure 1 to mean {score on column after finetuning on 200 from row then 10 from column} - {score on column after finetuning on 10 from column}.  But then the text right after says "Babbage fine-tuned on addition gets 27% accuracy on the multiplication dataset" which seems like a different thing.

1agg
Position i, j in figure 1 represents how well a model fine-tuned on 200 examples of dataset i performs on dataset j; Position i, j in figure 2 represents how well a model fine-tuned on 200 examples of dataset i, and then fine-tuned on 10 examples of dataset j, performs on dataset j.

Note: The survey took me 20 mins (but also note selection effects on leaving this comment)

1Cameron Berg
Definitely good to know that it might take a bit longer than we had estimated from earlier respondents (with the well-taken selection effect caveat).  Note that if it takes between 10-20 minutes to fill out, this still works out to donating $120-240/researcher-hour to high-impact alignment orgs (plus whatever the value is of the comparison of one's individual results to that of the community), which hopefully is worth the time investment :)

Here's a fun thing I noticed:

There are 16 boolean functions of two variables.  Now consider an embedding that maps each of the four pairs {(A=true, B=true), (A=true, B=false), ...} to a point in 2d space.  For any such embedding, at most 14 of the 16 functions will be representable with a linear decision boundary.

For the "default" embedding (x=A, y=B), xor and its complement are the two excluded functions.  If we rearrange the points such that xor is linearly represented, we always lose some other function (and its complement).  In fact... (read more)

Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn't precisely the right well-behavedness desideratum, but I retract the "directionally wrong".

The variance of the multivariate uniform distribution U([0,1]^2) is largest along the direction (1,1), which is exactly the direction which we would want to represent a AND b.

The variance is actually the same in all directions.  One can sanity-check by integration that the variance is 1/12 both along the axis and along the diagonal.

In fact, there's nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are ind... (read more)
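
The calculation behind the direction-independence claim, written out (standard algebra, my phrasing):

```latex
% For independent coordinates X_1, \dots, X_N with common variance \sigma^2
% and any unit vector u:
\operatorname{Var}(u \cdot X) \;=\; \sum_i u_i^2 \operatorname{Var}(X_i)
                              \;=\; \sigma^2 \sum_i u_i^2 \;=\; \sigma^2 .
% For X_i \sim U([0,1]) this gives 1/12 along the axes and along the diagonal alike.
```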

3Sam Marks
Thanks, you're totally right about the equal variance thing -- I had stupidly thought that the projection of U([0,1]^2) onto y = x would be uniform on [−1/√2, 1/√2] (obviously false!).

The case of a fully discrete distribution (supported in this case on four points) seems like a very special case of something more general, where a "more typical" special case would be something like:
* if a, b are both false, then sample from N(0, Σ)
* if a is true and b is false, then sample from N(μ_a, Σ)
* if a is false and b is true, then sample from N(μ_b, Σ)
* if a and b are true, then sample from N(μ_a + μ_b, Σ)
for some μ_a, μ_b ∈ R^n and covariance matrix Σ.

In general, I don't really expect the class-conditional distributions to be Gaussian, nor for the class-conditional covariances to be independent of the class. But I do expect something broadly like this, where the distributions are concentrated around their class-conditional means with probability falling off as you move further from the class-conditional mean (hence unimodality), and that the class-conditional variances are not too big relative to the distance between the clusters.

Given that longer explanation, does the unimodality thing still seem directionally wrong?
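
A tiny sampler for the generative story above (my own sketch; the means, covariance, and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_a, mu_b = np.array([4.0, 0.0]), np.array([0.0, 4.0])
cov = np.eye(2)

def sample(a: bool, b: bool, size: int = 500):
    """Class-conditional Gaussian: mean 0, mu_a, mu_b, or mu_a + mu_b."""
    mean = a * mu_a + b * mu_b
    return rng.multivariate_normal(mean, cov, size)

clusters = {(a, b): sample(a, b) for a in (False, True) for b in (False, True)}
# Each class is unimodal around its mean; "a AND b" sits out at mu_a + mu_b.
print({k: v.mean(axis=0).round(2) for k, v in clusters.items()})
```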

Maybe models track which features are basic and enforce that these features be more salient

Couldn't it just write derivative features more weakly, and therefore not need any tracking mechanism other than the magnitude itself?

2Sam Marks
Some features which are computed from other features should probably themselves be treated as basic and thus represented with large salience.

It's sad that agentfoundations.org links no longer work, leading to broken links in many decision theory posts (e.g. here and here)

2habryka
Oh, hmm, this seems like a bug on our side. I definitely set up a redirect a while ago that should make those links work. My guess is something broke in the last few months.
2Vladimir_Nesov
Thanks for the heads up. Example broken link (https://agentfoundations.org/item?id=32), currently redirects to broken https://www.alignmentforum.org/item?id=32, should redirect further to https://www.alignmentforum.org/posts/5bd75cc58225bf0670374e7d/exploiting-edt (Exploiting EDT[1]), archive.today snapshot. Edit 14 Oct: It works now, even for links to comments, thanks LW team!

[1] LW confusingly replaces the link to www.alignmentforum.org given in Markdown comment source text with a link to www.lesswrong.com when displaying the comment on LW.

This will initially boost  relative to  because it will suddenly be joined to a network which is correctly transmitting  but which does not understand  at all.

However, as these networks are trained to equilibrium the advantage will disappear as a steganographic protocol is agreed between the two models. Also, this can only be used once before the networks are in equilibrium.

Why would it be desirable to do this end-to-end training at all, rather than simply sticking the two networks together and doing no furthe... (read more)

I've been asked to clarify a point of fact, so I'll do so here:

My recollection is that he probed a little and was like "I'm not too worried about that" and didn't probe further.

This does ring a bell, and my brain is weakly telling me it did happen on a walk with Nate, but it's so fuzzy that I can't tell if it's a real memory or not.  A confounder here is that I've probably also had the conversational route "MIRI burnout is a thing, yikes" -> "I'm not too worried, I'm a robust and upbeat person" multiple times with people other than Nate.

In private ... (read more)

7TurnTrout
This is a slight positive update for me. I maintain my overall worry and critique: chats which are forgettable do not constitute sufficient warning.  Insofar as non-Nate MIRI personnel thoroughly warned Vivek, that is another slight positive update, since this warning should reliably be encountered by potential hires. If Vivek was independently warned via random social connections not possessed by everyone,[1] then that's a slight negative update.

[1] For example, Thomas Kwa learned about Nate's comm doc by randomly talking with a close friend of Nate's, and mentioning comm difficulties.
3leogao
I meant it as an analogy to https://en.m.wikipedia.org/wiki/Denormalization

In database design, sometimes you have a column in one table whose entries are pointers into another table - e.g. maybe I have a Users table, and each User has a primaryAddress field which is a pointer into an Address table. That keeps things relatively compact and often naturally represents things - e.g. if several Users in a family share a primary address, then they can all point to the same Address. The Address only needs to be represented once (so it's relatively compact), and it can also be changed once for everyone if that's a thing someone wants to ... (read more)
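
A toy version of the layout described above (my own illustration; the names are made up): each User holds a key into an Address table rather than its own copy, so the address is stored once and a single update is seen by everyone who points at it.

```python
addresses = {
    "addr1": {"street": "123 Main St", "city": "Springfield"},
}
users = {
    "alice": {"primary_address": "addr1"},
    "bob":   {"primary_address": "addr1"},  # same household: shared pointer, not a copy
}

# One update to the shared Address is visible through every User that references it.
addresses["addr1"]["street"] = "456 Oak Ave"
print(addresses[users["bob"]["primary_address"]])
# Denormalizing would instead copy the address fields into each User record.
```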

When you describe the "emailing protein sequences -> nanotech" route, are you imagining an AGI with computers on which it can run code (like simulations)?  Or do you claim that the AGI could design the protein sequences without writing simulations, by simply thinking about it "in its head"?

4Eliezer Yudkowsky
At the superintelligent level there's not a binary difference between those two clusters.  You just compute each thing you need to know efficiently.

Cool! It wrote and executed code to solve the problem, and it got it right.

Are you using chat-GPT-4?  I thought it can't run code?

1Jonathan Marcus
Interesting! Yes, I am using ChatGPT with GPT-4. It printed out the code, then *told me that it ran it*,  then printed out a correct answer. I didn't think to fact-check it; instead I assumed that OpenAI had been adding some impressive/scary new features.

Interesting, I find what you are saying here broadly plausible, and it is updating me (at least toward greater uncertainty/confusion).  I notice that I don't expect the 10x effect, or the Von Neumann effect, to be anywhere close to purely genetic.  Maybe some path-dependency in learning?  But my intuition (of unknown quality) is that there should be some software tweaks which make the high end of this more reliably achievable.

Anyway, to check that I understand your position, would this be a fair dialogue?:

Person: "The jump from chimps to hu

... (read more)
5jacob_cannell
Your model of my model sounds about right, but I also include a neoteny extension of perhaps 2x which is part of the scale up (spending longer on training the cortex, especially in higher brain regions). For Von Neumann in particular my understanding is he was some combination of 'regular' genius and a mentat (a person who can perform certain computer-like calculations quickly), which was very useful for many science tasks in an era lacking fast computers and software like mathematica, but would provide less of an effective edge today. It also inflated people's perception of his actual abilities.

In your view, who would contribute more to science -- 1000 Einsteins, or 10,000 average scientists?[1]

"IQ variation is due to continuous introduction of bad mutations" is an interesting hypothesis, and definitely helps save your theory.  But there are many other candidates, like "slow fixation of positive mutations" and "fitness tradeoffs[2]".

Do you have specific evidence for either:

  1. Deleterious mutations being the primary source of IQ variation
  2. Human intelligence "plateauing" around the level of top humans[3]

Or do you believe these things just because ... (read more)

6Alexander Gietelink Oldenziel
IIRC according to gwern the theory that IQ variation is mostly due to mutational load has been debunked by modern genomic studies [though mutational load definitely has a sizable effect on IQ]. IQ variation seems to be mostly similar to height in being the result of the additive effect of many individual common allele variations.

In your view, who would contribute more to science -- 1000 Einsteins, or 10,000 average scientists?

I vaguely agree with your 90%/60% split for physics vs chemistry. In my field of programming we have the 10x myth/meme, which I think is reasonably correct but it really depends on the task.

For the 10x programmers it's some combination of greater IQ/etc but also starting programming earlier with more focused attention for longer periods of time, which eventually compounds into the 10x difference.

But it really depends on the task distribution - there are s... (read more)

It would still be interesting to know whether you were surprised by GPT-4's capabilities (if you have played with it enough to have a good take)

7Steven Byrnes
When I started blogging about AI alignment in my free time, it happened that GPT-2 had just come out, and everyone on LW was talking about it. So I wrote a couple blog posts (e.g. 1,2) trying (not very successfully, in hindsight, but I was really just starting out, don’t judge) to think through what would happen if GPT-N could reach TAI / x-risk levels. I don’t recall feeling strongly that it would or wouldn’t reach those levels, it just seemed like worth thinking about from a safety perspective and not many other people were doing so at the time.

But in the meantime I was also gradually getting into thinking about brain algorithms, which involve RL much more centrally, and I came to believe that that RL was necessary to reach dangerous capability levels (recent discussion here; I think the first time I wrote it down was here). And I still believe that, and I think the jury’s out as to whether it’s true. (RLHF doesn’t count, it’s just a fine-tuning step, whereas in the brain it’s much more central.)

My updates since then have felt less like “Wow look at what GPT can do” and more like “Wow some of my LW friends think that GPT is rapidly approaching the singularity, and these are pretty reasonable people who have spent a lot more time with LLMs than I have”.

I haven’t personally gotten much useful work out of GPT-4. Especially not for my neuroscience work. I am currently using GPT-4 only for copyediting. (“[The following is a blog post draft. Please create a bullet point list with any typos or grammar errors.] …” “Was there any unexplained jargon in that essay?” Etc.) But maybe I’m bad at prompting, or trying the wrong things. I certainly haven’t tried very much, and find it more useful to see what other people online are saying about GPT-4 and doing with GPT-4, rather than my own very limited experience.

Anyway, I have various theory-driven beliefs about deficiencies of LLMs compared to other possible AI algorithms (the RL thing I mentioned above is just one of ma
3Alexander Gietelink Oldenziel
fwiw, I think I'm fairly close to Steven Byrnes' model. I was not surprised by gpt-4 (but like most people who weren't following LLMs closely was surprised by gpt-2 capabilities)

Human intelligence in terms of brain arch priors also plateaus

Why do you think this?

POV: I'm in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don't care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values). 

Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?

6TurnTrout
I think this highlights a good counterpoint. I think this alternate theory predicts "probably not", although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status -> reward; and it's high-status to sacrifice yourself for your kid). Or because keeping your kid safe -> high reward as another learned drive. Overall this feels like contortion but I think it's possible. Maybe overall this is a... 1-bit update against the "not selection for caring about reality" point?

Isn't going from an average human to Einstein a huge increase in science-productivity, without any flop increase? Then why can't there be software-driven foom, by going farther in whatever direction Einstein's brain is from the average human?

6jacob_cannell
Science/engineering is often a winner-take-all race. To him who has is given more - so for every Einstein there are many others less well known (Lorentz, Minkowski), and so on. Actual ability is filtered through something like a softmax to produce fame, so fame severely underestimates ability. Evolution proceeds by random exploration of parameter space, the more intelligent humans only reproduce a little more than average in aggregate, and there is drag due to mutations. So the subset of the most intelligent humans represents the upper potential of the brain, but it clearly asymptotes. Finally, intelligence results from the interaction of genetics and memetics, just like in ANNs. Digital minds can be copied easily (well at least current ones - future analog neuromorphic minds may be more difficult to copy), so it seems likely that they will not have the equivalent of the mutation load issue as much. On the other hand the great expense of training digital minds and the great cost of GPU RAM means they have much less diversity - many instances of a few minds. None of this by itself leaves much hope for foom.

Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling.  Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.

You could use all of world energy output to have a few billion human-speed AGIs, or millions that think 1000x faster, etc.

Isn't it insanely transformative to have millions of human-level AIs which think 1000x faster??  The difference between top scientists and average humans seems to be something like "software" (Einstein isn't using 2x the watts or neurons).  So then it should be totally possible for each of the "millions of human-level AIs" to be equivalent to Einstein.  Couldn't a million Einstein-level scientists running at 1000x speed ... (read more)

4jacob_cannell
Yes it will be transformative. GPT models already think 1000x to 10000x faster - but only for the learning stage (absorbing knowledge), not for inference (thinking new thoughts).
3Vivek Hebbar
Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling.  Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.

In your view, is it possible to make something which is superhuman (i.e. scaled beyond human level), if you are willing to spend a lot on energy, compute, engineering cost, etc?

1Nicholas / Heather Kross
Oops. Fixed!
2Teerth Aloke
QA sessions.

Any idea why "cheese Euclidean distance to top-right corner" is so important?  It's surprising to me because the convolutional layers should apply the same filter everywhere.

2TurnTrout
I'm also lightly surprised by the strength of the relationship, but not because of the convolutional layers. It seems like if "convolutional layers apply the same filter everywhere" makes me surprised by the cheese-distance influence, it should also make me be surprised by "the mouse behaves differently in a dead-end versus a long corridor" or "the mouse tends to go to the top-right."  (I have some sense of "maybe I'm not grappling with Vivek's reasons for being surprised", so feel free to tell me if so!)
3Vaniver
My naive guess is that the other relationships are nonlinear, and this is the best way to approximate those relationships out of just linear relationships of the variables the regressor had access to.

See Godel's incompleteness theorems.  For example, consider the statement "For all A, (ZFC proves A) implies A", encoded into a form judgeable by ZFC itself.  If you believe ZFC to be sound, then you believe that this statement is true, but due to Godel stuff you must also believe that ZFC cannot prove it.  The reasons for believing ZFC to be sound are reasons from "outside the system" like "it looks logically sound based on common sense", "it's never failed in practice", and "no-one's found a valid issue".  Godel's theorems let us conv... (read more)
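
For reference, the standard statements behind this (my compact phrasing, writing □ for "ZFC proves"):

```latex
% Soundness (reflection) schema -- true if ZFC is sound, but not provable in ZFC:
\forall A:\; \Box_{\mathrm{ZFC}} A \;\rightarrow\; A
% L\"ob's theorem: ZFC proves a reflection instance only for statements it already proves:
\mathrm{ZFC} \vdash (\Box A \rightarrow A) \;\Longrightarrow\; \mathrm{ZFC} \vdash A
% G\"odel's second incompleteness theorem is the special case A = \bot:
% ZFC cannot prove \neg\Box\bot, i.e. \mathrm{Con}(\mathrm{ZFC}).
```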

1Victor Novikov
I think we understand each other! Thank you for clarifying. The way I translate this: some logical statements are true (to you) but not provable (to you), because you are not living in a world of mathematical logic, you are living in a messy, probabilistic world. It is nevertheless true, by the rule of necessitation in provability logic, that if a logical statement is true within the system, then it is also provable within the system. P -> □P. Because the fact that the system is making the statement P is the proof. Within a logical system, there is an underlying assumption that the system only makes true statements. (ok, this is potentially misleading and not strictly correct) This is fascinating! So my takeaway is something like: our reasoning about logical statements and systems is not necessarily "logical" itself, but is often probabilistic and messy. Which is how it has to be, given... our bounded computational power, perhaps? This very much seems to be a logical uncertainty thing.

??? For math this is exactly backward, there can be true-but-unprovable statements

1Victor Novikov
Then how do you know they are true? If you do know then they are true, it is because you have proven it, no? But I think what you are saying is correct, and I'm curious to zoom in on this disagreement.

Agreed.  To give a concrete toy example:  Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}.  If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi.  The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).

This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
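
A quick numerical check of the toy example (my own sketch): each observed "A" is a 2:1 update toward Luigi, and the probability that a "B" has appeared climbs toward 50%.

```python
p_luigi, p_waluigi = 0.5, 0.5   # prior over characters
p_all_A = 1.0                   # P(every token so far was "A")

for t in range(1, 21):
    p_A_next = p_luigi * 1.0 + p_waluigi * 0.5  # Luigi always says "A", Waluigi half the time
    p_all_A *= p_A_next                         # P(first t tokens are all "A")
    # Bayes update on having seen another "A": a 2:1 update toward Luigi.
    p_luigi, p_waluigi = p_luigi / p_A_next, p_waluigi * 0.5 / p_A_next
    print(t, round(p_luigi, 4), round(1.0 - p_all_A, 4))
# p_luigi -> 1, while P(ever seeing a "B") asymptotes to 0.5, as claimed.
```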

8abramdemski
LLMs are high-order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence drops out of memory eventually, the probability of Waluigi only drops to a small value instead of dropping to zero. This makes an eventual Waluigi transition inevitable, as claimed in the post.

Context windows could make the claim from the post correct. Since the simulator can only consider a bounded amount of evidence at once, its P[Waluigi] has a lower bound. Meanwhile, it takes much less evidence than fits in the context window to bring its P[Luigi] down to effectively 0.

Imagine that, in your example, once Waluigi outputs B it will always continue outputting B (if he's already revealed to be Waluigi, there's no point in acting like Luigi). If there's a context window of 10, then the simulator's probability of Waluigi never goes below 1/1025, w... (read more)
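
A quick check of the 1/1025 figure, assuming a 50/50 prior and a window holding ten "A"s (my arithmetic, matching the setup above):

```latex
P[\text{Waluigi} \mid \text{ten A's in the window}]
  \;=\; \frac{\tfrac12 \cdot (\tfrac12)^{10}}
             {\tfrac12 \cdot (\tfrac12)^{10} + \tfrac12 \cdot 1}
  \;=\; \frac{1}{1 + 2^{10}}
  \;=\; \frac{1}{1025}.
```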

1Eschaton
The transform isn't symmetric though right? A character portraying "good" behaviour is, narratively speaking, more likely to have been deceitful the whole time or transform into a villain than for the antagonist to turn "good".
9Cleo Nardo
Yep I think you might be right about the maths actually. I'm thinking that waluigis with 50% A and 50% B have been eliminated by llm pretraining and definitely by rlhf. The only waluigis that remain are deceptive-at-initialisation. So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them. I'm not claiming basin of attraction is the entire space of interpolation between waluigis and luigis. Actually, maybe "attractor" is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What's the right dynamical-systemy term for that?

In section 3.7 of the paper, it seems like the descriptions ("6 in 5", etc) are inconsistent across the image, the caption, and the paragraph before them.  What are the correct labels?  (And maybe fix the paper if these are typos?)

1Kshitij Sachan
This has been fixed now. Thanks for pointing it out! I'm sorry it took me so long to get to this.

Does the easiest way to make you more intelligent also keep your values intact?

What exactly do you mean by "multi objective optimization"?

1DragonGod
Optimising multiple objective functions in a way that cannot be collapsed into a single utility function to e.g. the reals. I guess multi objective optimisation can be represented by a single utility function that maps to a vector space, but as far as I'm aware, utility functions usually have a field as their codomain.

It would help if you specified which subset of "the community" you're arguing against.  I had a similar reaction to your comment as Daniel did, since in my circles (AI safety researchers in Berkeley), governance tends to be well-respected, and I'd be shocked to encounter the sentiment that working for OpenAI is a "betrayal of allegiance to 'the community'".

To be clear, I do think most people who have historically worked on "alignment" at OpenAI have probably caused great harm! And I do think I am broadly in favor of stronger community norms against working at AI capability companies, even in so called "safety positions". So I do think there is something to the sentiment that Critch is describing.

In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.

I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the inductive bias.

8Rohin Shah
You don't need to guess; it's clearly true. Even a 1 trillion parameter network where each parameter is represented with 64 bits can still only represent at most 2^64,000,000,000,000 different functions, which is a tiny tiny fraction of the full space of 2^(2^8,000,000) possible functions. You're already getting at least 2^8,000,000 − 64,000,000,000,000 of the bits just by choosing the network architecture. (This does assume things like "the neural network can learn the correct function rather than a nearly-correct function" but similarly the argument in the OP assumes "the toddler does learn the correct function rather than a nearly-correct function".)
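
The counting behind those numbers, spelled out (my reconstruction, assuming boolean functions on an 8,000,000-bit input and 10^12 parameters of 64 bits each):

```latex
\#\{\text{functions the network can represent}\} \;\le\; 2^{64 \cdot 10^{12}},
\qquad
\#\{\text{all functions}\} \;=\; 2^{2^{8{,}000{,}000}},
% so specifying a function takes 2^{8,000,000} bits, of which the architecture
% choice already supplies at least
2^{8{,}000{,}000} \;-\; 64 \cdot 10^{12}.
```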

Beware, though; string theory may be what underlies QFT and GR, and it describes a world of stringy objects that actually do move through space

I think this contrast is wrong.[1]  IIRC, strings have the same status in string theory that particles do in QFT.  In QM, a wavefunction assigns a complex number to each point in configuration space, where state space has an axis for each property of each particle.[2]  So, for instance, a system with 4 particles with only position and momentum will have a 12-dimensional configuration space.[3]  I... (read more)

2Adam Scherlis
QFT doesn't actually work like that -- the "classical degrees of freedom" underlying its configuration space are classical fields over space, not properties of particles. Note that Quantum Field Theory is not the same as the theory taught in "Quantum Mechanics" courses, which is as you describe.

* "Quantum Mechanics" (in common parlance): quantum theory of (a fixed number of) particles, as you describe.
* "Quantum Field Theory": quantum theory of fields, which are ontologically similar to cellular automata.
* "String Theory": quantum theory of strings, and maybe branes, as you describe.*
* "Quantum Mechanics" (strictly speaking): any of the above; quantum theory of anything.

You can do a change of basis in QFT and get something that looks like properties of particles (Fock space), and people do this very often, but the actual laws of physics in a QFT (the Lagrangian) can't be expressed nicely in the particle ontology because of nonperturbative effects. This doesn't come up often in practice -- I spent most of grad school thinking QFT was agnostic about whether fields or particles are fundamental -- but it's an important thing to recognize in a discussion about whether modern physics privileges one ontology over the other.

(Note that even in the imperfect particle ontology / Fock space picture, you don't have a finite-dimensional classical configuration space. 12 dimensions for 4 particles works great until you end up with a superposition of states with different particle numbers!)

String theory is as you describe, AFAIK, which is why I contrasted it to QFT. But maybe a real string theorist would tell me that nobody believes those strings are the fundamental degrees of freedom, just like particles aren't the fundamental degrees of freedom in QFT.

*Note: People sometimes use "string theory" to refer to weirder things like M-theory, where nobody knows which degrees of freedom to use...

As I understand Vivek's framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. "If food nearby and hunger>15, then be more likely to go to food") and then also a sophisticated utility function (e.g. something like CEV). It's shards all the way down, and all the way up.[10] 

This read to me like you were saying "In Vivek's framework, value shards explain away .." and I was confused.  I now think you mean "My take on Vivek's is that value s... (read more)

2TurnTrout
Reworded, thanks.

Makes perfect sense, thanks!

"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.

How can a partition be a variable?  Should it be "part" instead?

3Ramana Kumar
Partitions (of some underlying set) can be thought of as variables like this:
* The number of values the variable can take on is the number of parts in the partition.
* Every element of the underlying set has some value for the variable, namely, the part that that element is in.

Another way of looking at it: say we're thinking of a variable v: S → D as a function from the underlying set S to v's domain D. Then we can equivalently think of v as the partition {{s ∈ S ∣ v(s) = d} ∣ d ∈ D} ∖ {∅} of S with (up to) |D| parts.

In what you quoted, we construct the underlying set by taking all possible combinations of values for the "original" variables. Then we take all partitions of that to produce all "possible" variables on that set, which will include the original ones and many more.
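
A small concrete version of this construction (my own sketch): take the underlying set to be all value-combinations of two binary variables, and view any function on that set as the partition it induces.

```python
from itertools import product
from collections import defaultdict

worlds = list(product([0, 1], repeat=2))  # all combinations of the original variables (a, b)

def as_partition(variable):
    """View a variable (a function on worlds) as the partition it induces."""
    parts = defaultdict(set)
    for w in worlds:
        parts[variable(w)].add(w)
    return list(parts.values())

print(as_partition(lambda w: w[0]))         # the original variable a: 2 parts
print(as_partition(lambda w: w[0] ^ w[1]))  # a "new" variable (xor): a different 2-part partition
print(as_partition(lambda w: w))            # the finest partition: 4 singleton parts
```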

ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation

Update: I started reading your paper "Corrigibility with Utility Preservation".[1]  My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6.  AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,[2] this is a mathematica... (read more)

3Koen.Holtman
Corrigibility with Utility Preservation is not the paper I would recommend you read first, see my comments included in the list I just posted. To comment on your quick thoughts:

* My later papers spell out the ML analog of the solution in `Corrigibility with' more clearly.
* On your question of Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing a full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky's arguments for pessimism.
* On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. P

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here?  No need to write anything, just links.

OK, Below I will provide links to few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.

This list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019 trying to find all mathematical papers of interest, but have not done so since.

I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).

Math-based w... (read more)

7Vivek Hebbar
ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation

Update: I started reading your paper "Corrigibility with Utility Preservation".[1]  My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6.  AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,[2] this is a mathematical solution to corrigibility in a toy problem, and not a solution to corrigibility in real systems.  Nonetheless, it's a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.[3]  Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists).  In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).

So looking at your claim that "the technical problem [is] mostly solved", this may or may not be true for the narrow sense (like "corrigibility as a theoretical outer-objective problem in formally-specified environments"), but seems false and misleading for the broader practical sense ("knowing how to make an AGI corrigible in real life").[4]

Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt[5]: "In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent's decision procedure]  to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this concl

  1. Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?  

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

  • The agent's allegiance is to some idealized utility function U_ideal (like CEV).  The agent's internal evaluator Eval is "trying" to approximate U_ideal by reasoning heuristically.  So now we ask Eval to evaluate the plan "do argmax w.r.t
... (read more)
3TurnTrout
Vivek -- I replied to your comment in appendix C of today's follow-up post, Alignment allows imperfect decision-influences and doesn't require robust grading. 
4adamShimi
The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:
* Do you think that you are internally trying to approximate your own U_ideal?
* Do you think that you have ever made the decision (either implicitly or explicitly) to not eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues for example)?
* Can you think of concrete instances where you improved your own Eval?
* Can you think of concrete instances where you thought you improved your own Eval but then regretted it later?
* Do you think that your own changes to your eval have been moving in the direction of your U_ideal?
5cfoster0
Yeah I think you're on the right track. A simple framework (that probably isn't strictly distinct from the one you mentioned) would be that the agent has a foresight evaluation method that estimates "How good do I think this plan is?" and a hindsight evaluation method that calculates "How good was it, really?". There can be plans that trick the foresight evaluation method relative to the hindsight one. For example, I can get tricked into thinking some outcome is more likely than it actually is ("The chances of losing my client's money with this investment strategy were way higher than I thought they were.") or thinking that some new state will be hindsight-evaluated better than it actually will be ("He convinced me that if I tried coffee, I would like it, but I just drank it and it tastes disgusting."), etc.
8Wei Dai
This is tempting, but the problem is that I don't know what my idealized utility function is (e.g., I don't have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realized that it's a bad idea, but how does that fit into the framework?

My own framework is something like this:
* The evaluation process is some combination of gut, intuition, explicit reasoning (e.g. cost-benefit analysis), doing philosophy, and cached answers.
* I think there are "adversarial inputs" because I've previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda / crazy memes, so there must be a risk of persuading myself with my own bad ideas.
* I can try to improve my evaluation process by doing things like
  1. look for patterns in my and other people's mistakes
  2. think about ethical dilemmas / try to resolve conflicts between my evaluative subprocesses
  3. do more philosophy (think/learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)
  4. talk (selectively) to other people
  5. try to improve how I do explicit reasoning or philosophy

Yeah, the right column should obviously be all 20s.  There must be a bug in my code[1] :/

I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

Take the following hypothesis :

If I add this into  with weight , then the middle column is still near... (read more)

2Slider
This maps the credence but I would imagine that the confidence would not be evenly spread around the boxes. With confidence literally 0 it does not make sense to express any credence to stand any taller than another as 1 and 0 would make equal sense. With a miniscule confidence the foggy hunch does point in some direction. Without h3 it is consistent to have middle square confidence 0. With positive plausibility of h3 the middle square is not "glossed over"; we have some confidence it might matter. But because h3 is totally useless for credences those come from the structures of h1 and h2. Thus effectively h1 and h2 are voting for zero despite not caring about it. Contrast what would happen with an even more trivial hypothesis of one square covering all with 100% or a 9x9 equiprobable hypothesis. You could also have a "micro detail hypothesis", (actually a 3x3) a 9x9 grid where each 3x3 is zeroes everywhere else than the bottom right corner and all the "small square locations" are in the same case among the other "big square" correspondents. The "big scale" hypotheses do not really mind the "small scale" dragging of the credence around. Thus the small bottom-right square is quite sensitive to the corresponding big square value and the other small squares are relatively insensitive. Mixing two 3x3 resolutions that are orthogonal results in a 9x9 resolution which is sparse (because it is separable). The John Vervaeke meme of "stereoscopic vision" seems to apply. The two 2x2 perspectives are not entirely orthogonal so the "sparsity" is not easy to catch.
2Scott Garrabrant
The point I was trying to make with the partial functions was something like "Yeah, there are 0s, yeah it is bad, but at least we can never assign low probability to any event that any of the hypotheses actually cares about." I guess I could have made that argument more clearly if instead, I just pointed out that any event in the sigma algebra of any of the hypotheses will have probability at least equal to the probability of that hypothesis times the probability of that event in that hypothesis. Thus the 0s (and the 10^−9s) are really coming from the fact that (almost) nobody cares about those events.
2Scott Garrabrant
I agree with all your intuition here. The thing about the partial functions is unsatisfactory, because it is discontinuous. It is trying to be #1, but a little more ambitious. I want the distribution on distributions to be a new type of epistemic state, and the geometric maximization to be the mechanism for converting the new epistemic state to a traditional probability distribution. I think that any decent notion of an embedded epistemic state needs to be closed under both mixing and coarsening, and this is trying to satisfy that as naturally as possible.

I think that the 0s are pretty bad, but I think they are the edge case of the only reasonable thing to do here. I think the reason it feels like the only reasonable thing to do for me is something like credit assignment/hypothesis autonomy. If a world gets probability mass, that should be because some hypothesis or collection of hypotheses insisted on putting probability mass there. You gave an edge case example where this didn't happen. Maybe everything is edge cases. I am not sure.

It might be that the 0s are not as bad as they seem. 0s seem bad because we have cached that "0 means you can't update" but maybe you aren't supposed to be updating in the output distribution anyway, you are supposed to do your updating in the more general epistemic state input object. 

I actually prefer a different proposal for the type of "epistemic state that is closed under coarsening and mixture" that is more general than the thing I gesture at in the post: A generalized epistemic state is a (quasi-?)convex function ΔW → R. A standard probability distribution is converted to an epistemic state through P ↦ (Q ↦ D_KL(P||Q)). A generalized epistemic state is converted to a (convex set of) probability distribution(s) by taking an argmin. Mixture is mixture as functions, and coarsening is the obvious thing (given a function W → V, we can convert a generalized epistemic state over V to a generalized epistemic state over W by precomposing wit

Now, let's consider the following modification: Each hypothesis is no longer a distribution on W, but instead a distribution on some coarser partition of W. Now  is still well defined

Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):

The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.

But philosophically, what d... (read more)

2Scott Garrabrant
I think your numbers are wrong, and the right column on the output should say 20% 20% 20%. The output actually agrees with each of the components on every event in that component's sigma algebra. The input distributions don't actually have any conflicting beliefs, and so of course the output chooses a distribution that doesn't disagree with either.

I agree that the 0s are a bit unfortunate. I think the best way to think of the type of the object you get out is not a probability distribution on W, but what I am calling a partial probability distribution on W. A partial probability distribution is a partial function from 2^W → [0,1] that can be completed to a full probability distribution on W (with some sigma algebra that is a superset of the domain of the partial probability distribution).

I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components. One nice thing about this definition is that it makes it so the argmax always takes on a unique value. (proof omitted.)

This doesn't really make it that much better, but the point here is that this framework admits that it doesn't really make much sense to ask about the probability of the middle column. You can ask about any of the events in the original pair of sigma algebras, and indeed, the two inputs don't disagree with the output at all on any of these sets.

most egregores/epistemic networks, which I'm completely reliant upon, are much smarter than me, so that can't be right

*Egregore smiles*

Another way of looking at this question:  Arithmetic rationality is shift invariant, so you don't have to know your total balance to calculate expected values of bets.  Whereas for geometric rationality, you need to know where the zero point is, since it's not shift invariant.
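
A toy illustration of the shift-(non)invariance point (my own numbers, nothing from the thread): with arithmetic expectation the preferred bet is independent of current wealth, while the geometric criterion flips depending on where zero is.

```python
import math

def arith(bet, wealth):   # expected wealth after the bet
    return sum(p * (wealth + x) for p, x in bet)

def geom(bet, wealth):    # geometric expectation: exp(E[log wealth])
    return math.exp(sum(p * math.log(wealth + x) for p, x in bet))

risky = [(0.5, +6), (0.5, -3)]   # expected payoff +1.5
safe  = [(1.0, +1)]              # expected payoff +1.0

for w in (4, 100):
    print(w,
          round(arith(risky, w) - arith(safe, w), 2),   # always +0.5: shift invariant
          round(geom(risky, w) - geom(safe, w), 2))     # negative at w=4, positive at w=100
```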
