In the fictional dialogue, Claude Shannon's first answer is more correct -- info theory is useful far outside the original domain of application, and its elegance is the best way to predict that.
Slightly more spelled-out thoughts about bounded minds:
I suspect there is some merit to the Scientist's intuition (and the idea that constant returns are more "empirical") which nobody has managed to explain well. I'll try to explain it here.[1]
The Epistemologist's notion of simplicity is about short programs with unbounded runtime which perfectly explain all evidence. The [non-straw] empiricist notion of simplicity is about short programs with heavily-bounded runtime which approximately explain a subset of the evidence. The Epistemologist is right that there is nothing of value in the empiri...
I'd be capable of helping aliens optimize their world, sure. I wouldn't be motivated to, but I'd be capable.
@So8res How many bits of complexity is the simplest modification to your brain that would make you in fact help them? (asking for an order-of-magnitude wild guess)
(This could be by actually changing your values-upon-reflection, or by locally confusing you about what's in your interest, or by any other means.)
Sigmoid is usually what "straight line" should mean for a quantity bounded at 0 and 1. It's a straight line in logit-space, the most natural space which complies with that range restriction.
(Just as exponentials are often the correct form of "straight line" for things that are required to be positive but have no ceiling in sight.)
Do you want to try playing this game together sometime?
We're then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We're going to try to use the RL to train: "Act exactly like [a given alignment researcher] would act."
Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations?
Also, if 10 episodes suffices, why is so much post-training currently done on base models?
If the agent follows EDT, it seems like you are giving it epistemically unsound credences. In particular, the premise is that it's very confident it will go left, and the consequence is that it in fact goes right. This was the world model's fault, not EDT's fault. (It is notable though that EDT introduces this loopiness into the world model's job.)
Superadditivity seems rare in practice. For instance, workers should have subadditive contributions after some point. This is certainly true in the unemployment example in the post.
The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through:
1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor
2. Either this labor is done by AIs in approved scaffolds, or it is done in "rogue deployments"
3. Hence the only easy-by-default disaster route is through a rogue deployment
4. Hence if we rule out rogue deployments and very impressive/difficult malicious labor in our scaffolds, we are safe
This seems true f...
This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).
Paul [Christiano] called this “problems of the interior” somewhere
Since it's slightly hard to find: Paul references it here (ctrl+f for "interior") and links to this source (once again ctrl+f for "interior"). Paul also refers to it in this post. The term is actually "position of the interior" and apparently comes from military strategist Carl von Clausewitz.
Can you clarify what figure 1 and figure 2 are showing?
I took the text description before figure 1 to mean {score on column after finetuning on 200 from row then 10 from column} - {score on column after finetuning on 10 from column}. But then the text right after says "Babbage fine-tuned on addition gets 27% accuracy on the multiplication dataset" which seems like a different thing.
Note: The survey took me 20 mins (but also note selection effects on leaving this comment)
Here's a fun thing I noticed:
There are 16 boolean functions of two variables. Now consider an embedding that maps each of the four pairs {(A=true, B=true), (A=true, B=false), ...} to a point in 2d space. For any such embedding, at most 14 of the 16 functions will be representable with a linear decision boundary.
For the "default" embedding (x=A, y=B), xor and its complement are the two excluded functions. If we rearrange the points such that xor is linearly represented, we always lose some other function (and its complement). In fact...
Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn't precisely the right well-behavedness desideratum, but I retract the "directionally wrong".
The variance of the multivariate uniform distribution is largest along the direction , which is exactly the direction which we would want to represent a AND b.
The variance is actually the same in all directions. One can sanity-check by integration that the variance is 1/12 both along the axis and along the diagonal.
In fact, there's nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are ind...
Maybe models track which features are basic and enforce that these features be more salient
Couldn't it just write derivative features more weakly, and therefore not need any tracking mechanism other than the magnitude itself?
This will initially boost relative to because it will suddenly be joined to a network with is correctly transmitting but which does not understand at all.
However, as these networks are trained to equilibrium the advantage will disappear as a steganographic protocol is agreed between the two models. Also, this can only be used once before the networks are in equilibrium.
Why would it be desirable to do this end-to-end training at all, rather than simply sticking the two networks together and doing no furthe...
I've been asked to clarify a point of fact, so I'll do so here:
My recollection is that he probed a little and was like "I'm not too worried about that" and didn't probe further.
This does ring a bell, and my brain is weakly telling me it did happen on a walk with Nate, but it's so fuzzy that I can't tell if it's a real memory or not. A confounder here is that I've probably also had the conversational route "MIRI burnout is a thing, yikes" -> "I'm not too worried, I'm a robust and upbeat person" multiple times with people other than Nate.
In private ...
What's "denormalization"?
In database design, sometimes you have a column in one table whose entries are pointers into another table - e.g. maybe I have a Users table, and each User has a primaryAddress field which is a pointer into an Address table. That keeps things relatively compact and often naturally represents things - e.g. if several Users in a family share a primary address, then they can all point to the same Address. The Address only needs to be represented once (so it's relatively compact), and it can also be changed once for everyone if that's a thing someone wants to ...
When you describe the "emailing protein sequences -> nanotech" route, are you imagining an AGI with computers on which it can run code (like simulations)? Or do you claim that the AGI could design the protein sequences without writing simulations, by simply thinking about it "in its head"?
Cool! It wrote and executed code to solve the problem, and it got it right.
Are you using chat-GPT-4? I thought it can't run code?
Interesting, I find what you are saying here broadly plausible, and it is updating me (at least toward greater uncertainity/confusion). I notice that I don't expect the 10x effect, or the Von Neumann effect, to be anywhere close to purely genetic. Maybe some path-dependency in learning? But my intuition (of unknown quality) is that there should be some software tweaks which make the high end of this more reliably achievable.
Anyway, to check that I understand your position, would this be a fair dialogue?:
...Person: "The jump from chimps to hu
In your view, who would contribute more to science -- 1000 Einsteins, or 10,000 average scientists?[1]
"IQ variation is due to continuous introduction of bad mutations" is an interesting hypothesis, and definitely helps save your theory. But there are many other candidates, like "slow fixation of positive mutations" and "fitness tradeoffs[2]".
Do you have specific evidence for either:
Or do you believe these things just because ...
In your view, who would contribute more to science -- 1000 Einsteins, or 10,000 average scientists?
I vaguely agree with your 90%/60% split for physics vs chemistry. In my field of programming we have the 10x myth/meme, which I think is reasonably correct but it really depends on the task.
For the 10x programmers it's some combination of greater IQ/etc but also starting programming earlier with more focused attention for longer periods of time, which eventually compounds into the 10x difference.
But it really depends on the task distribution - there are s...
It would still be interesting to know whether you were surprised by GPT-4's capabilities (if you have played with it enough to have a good take)
Human intelligence in terms of brain arch priors also plateaus
Why do you think this?
POV: I'm in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or being gaining power in the tribe. I don't care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
Isn't going from an average human to Einstein a huge increase in science-productivity, without any flop increase? Then why can't there be software-driven foom, by going farther in whatever direction Einstein's brain is from the average human?
Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling. Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.
You could use all of world energy output to have a few billion human speed AGI, or a millions that think 1000x faster, etc.
Isn't it insanely transformative to have millions of human-level AIs which think 1000x faster?? The difference between top scientists and average humans seems to be something like "software" (Einstein isn't using 2x the watts or neurons). So then it should be totally possible for each of the "millions of human-level AIs" to be equivalent to Einstein. Couldn't a million Einstein-level scientists running at 1000x speed ...
In your view, is it possible to make something which is superhuman (i.e. scaled beyond human level), if you are willing to spend a lot on energy, compute, engineering cost, etc?
It would be "QA", not "QE"
Any idea why "cheese Euclidean distance to top-right corner" is so important? It's surprising to me because the convolutional layers should apply the same filter everywhere.
See Godel's incompleteness theorems. For example, consider the statement "For all A, (ZFC proves A) implies A", encoded into a form judgeable by ZFC itself. If you believe ZFC to be sound, then you believe that this statement is true, but due to Godel stuff you must also believe that ZFC cannot prove it. The reasons for believing ZFC to be sound are reasons from "outside the system" like "it looks logically sound based on common sense", "it's never failed in practice", and "no-one's found a valid issue". Godel's theorems let us conv...
??? For math this is exactly backward, there can be true-but-unprovable statements
Agreed. To give a concrete toy example: Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}. If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi. The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).
This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
Context windows could make the claim from the post correct. Since the simulator can only consider a bounded amount of evidence at once, its P[Waluigi] has a lower bound. Meanwhile, it takes much less evidence than fits in the context window to bring its P[Luigi] down to effectively 0.
Imagine that, in your example, once Waluigi outputs B it will always continue outputting B (if he's already revealed to be Waluigi, there's no point in acting like Luigi). If there's a context window of 10, then the simulator's probability of Waluigi never goes below 1/1025, w...
In section 3.7 of the paper, it seems like the descriptions ("6 in 5", etc) are inconsistent across the image, the caption, and the paragraph before them. What are the correct labels? (And maybe fix the paper if these are typos?)
Does the easiest way to make you more intelligent also keep your values intact?
What exactly do you mean by "multi objective optimization"?
It would help if you specified which subset of "the community" you're arguing against. I had a similar reaction to your comment as Daniel did, since in my circles (AI safety researchers in Berkeley), governance tends to be well-respected, and I'd be shocked to encounter the sentiment that working for OpenAI is a "betrayal of allegiance to 'the community'".
To be clear, I do think most people who have historically worked on "alignment" at OpenAI have probably caused great harm! And I do think I am broadly in favor of stronger community norms against working at AI capability companies, even in so called "safety positions". So I do think there is something to the sentiment that Critch is describing.
In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.
I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the inductive bias.
Beware, though; string theory may be what underlies QFT and GR, and it describes a world of stringy objects that actually do move through space
I think this contrast is wrong.[1] IIRC, strings have the same status in string theory that particles do in QFT. In QM, a wavefunction assigns a complex number to each point in configuration space, where state space has an axis for each property of each particle.[2] So, for instance, a system with 4 particles with only position and momentum will have a 12-dimensional configuration space.[3] I...
As I understand Vivek's framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. "If
food nearby
andhunger>15
, then be more likely togo to food
") and then also a sophisticated utility function (e.g. something like CEV). It's shards all the way down, and all the way up.[10]
This read to me like you were saying "In Vivek's framework, value shards explain away .." and I was confused. I now think you mean "My take on Vivek's is that value s...
Makes perfect sense, thanks!
"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.
How can a partition be a variable? Should it be "part" instead?
ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation
Update: I started reading your paper "Corrigibility with Utility Preservation".[1] My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".
Quick thoughts after reading less than half:
AFAICT,[2] this is a mathematica...
I made a few edits to this post today, mostly in response to feedback from Ryan and Richard:
- Added 2 sentences emphasizing the point that schemers probably won't be aware of their terminal goal in most contexts. I thought this was clear from the post already, but apparently it wasn't.
- Modified "What factors affect the likelihood of training-gaming?" to emphasize that "sum of proxies" and "reward-seeker" are points on a spectrum. We might get an in-between model where context-dependent drives conflict with higher goals and sometimes "win" even out
... (read more)