As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: we wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much, or if those stories were expunged from the pretraining corpus.
I think this is conceivable if either
I appreciate you revisiting another couple of months later! This topic continues to evolve.
It was also an update for me that merely-RLHF-ish training on top of GPTs appears to have been enough (in latest-generation models at least) to get something mostly-consistently 'actually trying' in an apparently coherent way. I expected that to happen; I just didn't know when. The lede, which they somewhat buried in that paper, was also that:
...Furthermore, we find that this compliance gap persists in our synthetic document fine-tuned setup even when we entirely remove the
Organisms in general typically sense their environment and take different actions across a wide variety of environmental conditions, so as to cause there to be approximate copies of themselves in the future.[4] That's basic agency.[5]
I agree with this breakdown, except I start the analysis with moment-to-moment deliberation, and note that having there (continue to) be relevantly similar deliberators is a very widely-applicable intermediate objective, from where we get control ('basic agency') but also delegation and replication.
...The way the terms
This was a hasty and not exactly beautifully-written post. It didn't get much traction here on LW, but it had more engagement on its EA Forum crosspost (some interesting debate in the comments).
I still endorse the key messages, which are:
Not content with upbraiding CAIS, I also went after Scott Alexander later in the month for similar l...
There are a variety of different attitudes that can lead to successionism.
This is a really valuable contribution! (Both the term and laying out a partial extensional definition.)
I think there's a missing important class of instances which I've previously referred to as 'emotional dependence and misplaced concern' (though I think my choice of words isn't amazing). The closest is perhaps your 'AI parentism'. The basic point is that there is a growing 'AI rights/welfare' lobby because some people (for virtuous reasons!) are beginning to think that AI systems ...
Good old Coase! Thanks for this excellent explainer.
In contrast, if you think the relevant risks from AI look like people using their systems to do some small amounts of harm which are not particularly serious, you'll want to hold the individuals responsible for these harms liable and spare the companies.
Or (thanks to Coase), we could have two classes of harm, with 'big' arbitrarily defined as, I don't know, say $500m, which is a number I definitely just made up, and put liability for big harms on the big companies, while letting the classic societal appa...
I like this decomposition!
I think 'Situational Awareness' can quite sensibly be further divided up into 'Observation' and 'Understanding'.
The classic control loop of 'observe', 'understand', 'decide', 'act'[1] is consistent with this discussion: 'observe'+'understand' are combined here as 'situational awareness', and you're pulling out 'goals' and 'planning capacity' as separable aspects of 'decide'.
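As a purely illustrative sketch (hypothetical names, nobody's actual architecture) of how the four stages line up with that factoring:

```python
# Illustrative only: 'observe' + 'understand' together play the role of
# situational awareness, while 'goals' and 'planning capacity' are the
# separable ingredients of 'decide'. All names are hypothetical.

def control_loop(environment, world_model, planner, goal):
    while True:
        observation = environment.observe()          # observe
        situation = world_model.update(observation)  # understand (with observe: situational awareness)
        action = planner.decide(situation, goal)     # decide (goals + planning capacity)
        environment.act(action)                      # act
```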
Are there some difficulties with factoring?
Certain kinds of situational awareness are more or less fit for certain goals. And further, the important 're...
the original 'theorem' was wordcelled nonsense
Lol! I guess if there was a more precise theorem statement in the vicinity gestured at, it wasn't nonsense? But in any case, I agree the original presentation is dreadful. John's is much better.
I would be curious to hear a *precise* statement of why the result here follows from the Good Regulator Theorem.
A quick go at it, might have typos.
Suppose we have
and a predictor
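Roughly, in my own notation (a hedged sketch, so possibly imprecise): suppose the data are generated by an HMM with hidden states $S_t$ and emissions $X_t$, and the predictor's next-token distribution matches the true conditional. Then:

```latex
\[
P(X_{t+1} \mid X_{\le t})
  \;=\; \sum_{s'} P(X_{t+1} \mid S_{t+1}=s')\, P(S_{t+1}=s' \mid X_{\le t}),
\]
% so the next-emission distribution is a deterministic function of the belief state
% b_t(s) = P(S_t = s | X_{\le t}), which itself updates recursively (forward algorithm):
\[
b_{t+1}(s') \;\propto\; P(X_{t+1} \mid S_{t+1}=s') \sum_{s} P(S_{t+1}=s' \mid S_t=s)\, b_t(s).
\]
% Any predictor matching the true conditionals must compute this function of b_t;
% generically, that means tracking b_t itself (up to merging predictively-equivalent
% beliefs), which is the Good-Regulator-style "the regulator models the system" step.
```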
I guess my question would be 'how else did you think a well-generalising sequence model would achieve this?' Like, what is a sufficient world model but a posterior over HMM states in this case? This is what the GR theorem asks for. (Of course, a poorly-fit model might track extraneous detail or have a bad posterior.)
From your preamble and your experiment design, it looks like you correctly anticipated the result, so this should not have been a surprise (to you). In general I object to being sold something as surprising which isn't (it strikes me as a lesser-notice...
Nice explanation of MSP and good visuals.
This is surprising!
Were you in fact surprised? If so, why? (This is a straightforward consequence of the good regulator theorem[1].)
In general I'd encourage you to carefully track claims about transformers, HMM-predictors, and LLMs, and to distinguish between trained NNs and the training process. In this writeup, all of these are quite blended.
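For concreteness, a toy numerical sketch (transition/emission matrices invented by me) of the belief-state filter whose geometry the MSP visualises; the next-token distribution is a deterministic function of this belief vector:

```python
import numpy as np

# Toy HMM, purely illustrative: T[i, j] = P(next hidden state j | state i),
# E[i, k] = P(emit symbol k | hidden state i).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],
              [0.1, 0.9]])

def belief_update(b, obs):
    """One step of Bayesian filtering: propagate through T, then condition on obs."""
    b = b @ T              # predict the next hidden-state distribution
    b = b * E[:, obs]      # weight by the likelihood of the observed symbol
    return b / b.sum()     # renormalise

def next_token_dist(b):
    """The next-emission distribution is a function of the current belief alone."""
    return (b @ T) @ E

b = np.array([0.5, 0.5])         # uniform prior over hidden states
for obs in [0, 0, 1, 1, 1]:      # an arbitrary observation sequence
    b = belief_update(b, obs)
    print(b, next_token_dist(b))
```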
Incidentally I noticed Yudkowsky uses 'brainware' in a few places (e.g. in conversation with Paul Christiano). But it looks like that's referring to something more analogous to 'architecture and learning algorithms', which I'd put more in the 'software' camp when it comes to the taxonomy I'm pointing at (the 'outer designer' is writing it deliberately).
Unironically, I think it's worth anyone interested skimming that Verma & Pearl paper for the pictures :) especially fig 2
Mmm, I misinterpreted at first. It's only a v-structure if the two parents are not connected. So this is a property which needs to be maintained effectively 'at the boundary' of the fully-connected cluster which we're rewriting. I think that tallies with everything else, right?
ETA: both of our good proofs respect this rule; the first Reorder in my bad proof indeed violates it. I think this criterion is basically the generalised and corrected version of the fully-connected bookkeeping rule described in this post. I imagine if I/someone worked through it, this would...
That's concerning. It would appear to make both our proofs invalid.
But I think your earlier statement about incoming vs outgoing arrows makes sense. Maybe Verma & Pearl were asking for some other kind of equivalence? Grr, back to the semantics I suppose.
Aha. Preserving v-structures (colliders $X \rightarrow Z \leftarrow Y$ whose parents $X$ and $Y$ are non-adjacent) is necessary and sufficient for equivalence[1]. So when rearranging fully-connected subgraphs, certainly we can't do it (cost-free) if it introduces or removes any v-structures.
Plausibly if we're willing to weaken by adding in additional arrows, there might be other sound ways to reorder fully-connected subgraphs - but they'd be non-invertible. Haven't thought about that.
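If it helps to poke at this concretely, here's a small toy check of the Verma & Pearl criterion (my own code, not from the paper): two DAGs are Markov equivalent iff they share the same skeleton and the same v-structures.

```python
import networkx as nx
from itertools import combinations

def skeleton(g: nx.DiGraph):
    """Undirected edge set."""
    return {frozenset(e) for e in g.edges}

def v_structures(g: nx.DiGraph):
    """Colliders a -> c <- b whose parents a, b are not adjacent."""
    vs = set()
    for c in g.nodes:
        for a, b in combinations(g.predecessors(c), 2):
            if not (g.has_edge(a, b) or g.has_edge(b, a)):
                vs.add((frozenset((a, b)), c))
    return vs

def markov_equivalent(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    """Verma & Pearl (1990): same skeleton and same v-structures."""
    return (set(g1.nodes) == set(g2.nodes)
            and skeleton(g1) == skeleton(g2)
            and v_structures(g1) == v_structures(g2))

# X -> Z <- Y has a v-structure; X -> Z -> Y has the same skeleton but no collider.
g1 = nx.DiGraph([("X", "Z"), ("Y", "Z")])
g2 = nx.DiGraph([("X", "Z"), ("Z", "Y")])
print(markov_equivalent(g1, g2))  # False
```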
Verma & Pearl, Equivalence and Synthesis of Causal Models 1990 ↩︎
Mhm, OK I think I see. But appear to me to make a complete subgraph, and all I did was redirect the . I confess I am mildly confused by the 'reorder complete subgraph' bookkeeping rule. It should apply to the in , right? But then I'd be able to deduce which is strictly different. So it must mean something other than what I'm taking it to mean.
Maybe need to go back and stare at the semantics for a bit. (But this syntactic view with motifs and transformations is much nicer!)
Perhaps more importantly, I think with Node Introduction we really don't need after all?
With Node Introduction and some bookkeeping, we can get the and graphs topologically compatible, and Frankenstein them. We can't get as neat a merge as if we also had - in particular, we can't get rid of the arrow . But that's fine, we were about to draw that arrow in anyway for the next step!
Is something invalid here? Flagging confusion. This is a slightly more substantial claim than the original proof makes, since it assumes strictly less. Downstrea...
I'd probably be more specific and say 'gradient hacking' or 'update hacking' for deception of a training process which updates NN internals.
I see what you're saying with a deployment scenario being often implicitly a selection scenario (should we run the thing more/less or turn it off?) in practice. So deceptive alignment at deploy-time could be a means of training (selection) hacking.
More centrally, 'training hacking' might refer to a situation with denser oversight and explicit updating/gating.
Deceptive alignment during this period is just one way of tra...
Some people seem to argue that concrete evidence of deception is no evidence for deceptive alignment. I had a great discussion with @TurnTrout a few weeks ago about this, where we homed in on our agreement and disagreement here. Maybe we'll share some content from it at some point. In the meantime, my take after that is roughly
This is great, and thanks for pointing at this confusion, and raising the hypothesis that it could be a confusion of language! I also have this sense.
I'd strongly agree that separating out 'deception' per se is importantly different from more specific phenomena. Deception is just, yes, obviously this can and does happen.
I tend to use 'deceptive alignment' slightly more broadly - i.e. something could be deceptively aligned post-training, even if all updates after that point are 'in context' or whatever analogue is relevant at that time. Right? This would be...
I wasn't eager on this, but your justification updated me a bit. I think the most important distinction is indeed the 'grown/evolved/trained/found, not crafted', and 'brainware' didn't immediately evoke that for me. But you're right, brains are inherently grown, they're very diverse, we can probe them but don't always/ever grok them (yet), structure is somewhat visible, somewhat opaque, they fit into a larger computational chassis but adapt to their harness somewhat, properties and abilities can be elicited by unexpected inputs, they exhibit various kinds of learning on various timescales, ...
Yeah, thinking slightly aloud, I tentatively think Frankenstein needs an extra condition like the blanket stitch condition... something which enforces the choice of topo ordering to be within the right class of topo orderings? That's what the chain does - it means we can assign orderings or , but not e.g. , even though that order is consistent with both of the other original graphs.
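One way to make that 'right class of topo orderings' condition concrete (my framing, not anything from the post): two DAGs admit a common topological order exactly when the union of their edge sets is still acyclic, which would give a cheap side condition to check before Frankensteining.

```python
import networkx as nx

def share_topo_order(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    """Two DAGs have a common topological order iff their edge union is acyclic."""
    union = nx.compose(g1, g2)   # union of the two graphs' nodes and edges
    return nx.is_directed_acyclic_graph(union)

# X -> Y and Y -> X are each fine alone, but they force incompatible orderings.
print(share_topo_order(nx.DiGraph([("X", "Y")]), nx.DiGraph([("Y", "X")])))  # False
```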
If I get some time I'll return to this and think harder but I can't guarantee it.
ETA I did spend a bit more time, and the below mostly resol...
I had another look at this with a fresh brain and it was clearer what was happening.
TL;DR: It was both of 'I'm missing something', and a little bit 'Frankenstein is invalid' (it needs an extra condition which is sort of implicit in the post). As I guessed, with a little extra bookkeeping, we don't need Stitching for the end-to-end proof. I'm also fairly confident Frankenstein subsumes Stitching in the general case. A 'deductive system' lens makes this all clearer (for me).
The key invalid move I was making when I said
...But this same
But this same move can alternatively be done with the Frankenstein rule, right? (I might be missing something.) But Frankenstein has no such additional requirement, as stated. If I'm not missing something, I think Frankenstein might be invalid as stated (like maybe it needs an analogous extra condition). Haven't thought this through yet.
i.e. I think either
One thing that initially stood out to me on the fundamental theorem was: where did the arrow come from? It 'gets introduced' in the first bookkeeping step (we draw and then reorder the subgraph at each .
This seemed suspicious to me at first! It seemed like kind of a choice, so what if we just didn't add that arrow? Could we land at a conclusion of AND ? That's way too strong! But I played with it a bit, and there's no obvious way to do the second frankenstitch which brings everything together unless you draw in that extra arrow and ...
I might be missing something, but I don't see where is actually used in the worked example.
It seems that there's a consistent topo order between the and diagrams, so we Frankenstitch them. Then we draw an edge from to and reorder (bookkeep). Then we Frankenstein the diagrams and the resulting diagram again. Then we collect the together (bookkeep). Where's used?
Oh yeah, I don't know how common it is, but when manipulating graphs, if there's a topo order, I seem to strongly prefer visualising things with that order respected on the page (vertically or horizontally). So your images committed a few minor crimes according to that aesthetic. I can also imagine that some other aesthetics would strongly prefer writing things the way you did though, e.g. with . (My preference would put and slightly lower, as you did with the , graph.)
This is really great!
A few weeks ago I was playing with the Good Regulator and John's Gooder version and incidentally I also found myself pulling out some simple graphical manipulation rules. Your 'Markov re-rooting' came into play, and also various of the 'Bookkeeping' rules. You have various more exciting rules here too, thanks!
I also ended up noticing a kind of 'good regulator motif' as I tried expanding the setting with a few temporal steps and partial observability and so forth. Basically, doing some bookkeeping and coarse-graining, you can often find ...
@the gears to ascension, could you elaborate on what the ~25%
react on 'hardware' in
Would it be useful to have a term, analogous to 'hardware', ...
means? Is it responding to the whole sentence, 'Would it be useful to have...?' or some other proposition?
Separately, I'm not a fan of 'evolveware' or 'evoware' in particular, though I can't put my finger on exactly why. Possibly it's because of a connotation of ongoing evolution, which is sorta true in some cases but could be misleading as a signifier. Though the same criticism could be levelled against 'ML-ware', which I like more.
I hate to wheel this out again, but evolution-broadly-construed is actually a very close fit for gradient methods. Agreed there's a whole lot of specifics in biological natural selection, and a whole lot of specifics in gradient-methods-as-practiced, but they are quite akin really.
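One concrete point of contact, as a hedged aside (my go-to example, not something argued in the thread): the evolution-strategies estimator recovers a smoothed gradient purely from selection-weighted random variation.

```latex
% Evolution strategies as a gradient method: mutate, evaluate, and recombine
% weighted by fitness. For smoothing scale \sigma and fitness f:
\[
\nabla_\theta \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\bigl[ f(\theta + \sigma \epsilon) \bigr]
  \;=\; \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\bigl[ f(\theta + \sigma \epsilon)\, \epsilon \bigr],
\]
% i.e. averaging the random perturbations, weighted by how well they did, is an
% unbiased estimate of the gradient of the smoothed fitness landscape.
```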
noware? everyware? anyware? selfaware? please-beware?
(jokes, don't crucify me)
If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and therefore can be caught-out by other humans, the heuristic / value of "actually care about your friends", is competitive with "always be calculating your personal advantage."
I think there's a missing connection here. At least, it seemed a non sequitur to me on first read: I took it to be positing that scaling up given humans' computational capacity, ceteris paribus, makes...
And as the saying goes, "humans are the least general intelligence which can manage to take over the world at all" - otherwise we'd have taken over the world earlier.
A classic statement of this is by Bostrom, in Superintelligence.
Far from being the smartest possible biological species, we are probably better thought of as the stupidest possible biological species capable of starting a technological civilization - a niche we filled because we got there first, not because we are in any sense optimally adapted to it.
I’m hazier on the details of how this would play out (and a bit sceptical that it would enable a truly runaway feedback loop), but more sophisticated systems could help to gather the real-world data to make subsequent finetuning efforts more effective.
On the contrary, I think proactive gathering of data is very plausibly the bottleneck, and (smarts) -> (better data gathering) -> (more smarts) is high on my list of candidates for the critical feedback loop.
In a world where the 'big two' (R&D and executive capacity) are characterised by driving ...
In particular, the 'big two' are both characterised by driving beyond the frontier of the well-understood, which means that by necessity they're about deliberately and efficiently setting up informative/serendipitous scenarios to get novel, informative data. When you're navigating beyond the well-understood, you have to bottom out your plans with heuristic guesses about VOI, and you have to make plans which (at least sometimes) have good VOI. Those guesses have to ground out somewhere, and that's the 'research taste' at the system-1-ish level.
...I think there are two really important applications, which have the potential to radically reshape the world:
- Research
  - The ability to develop and test out new ideas, adding to the body of knowledge we have accumulated
  - Automating this would be a massive deal for the usual reasons about feeding back into growth rates, facilitating something like a singularity
  - In particular the automation of further AI development is likely to be important
  - There are many types of possible research, and automation may look quite different for e.g. empirical medical research vs fund
I swiftly edited that to read
we have not found it written in the universe
but your reply obviously beat me to it! I agree, there is plausibly some 'actual valence magnitude' which we 'should' normatively account for in aggregations.
In behavioural practice, it comes down to what cooperative/normative infrastructure is giving rise to the cooperative gains which push toward the Pareto frontier. e.g.
I think this post is mostly about how to do the reflection, consistentising, and so on.
But at the risk of oversimplifying, let's pretend for a moment we just have some utility functions.
Then you can for sure aggregate them into a mega utility function (at least in principle). This is very underspecified!! Predominantly because of the question of how to weight the individual utility functions in the aggregation. (Holden has a nice discussion of Harsanyi's aggregation theorem which goes into some more detail, but yes, we have not found it written in...
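As a minimal sketch of where the underspecification lives (standard Harsanyi-style form, my notation):

```latex
% Harsanyi-style aggregation: social utility as a weighted sum of individual utilities.
% The weights w_i, together with each u_i's arbitrary positive affine rescaling
% u_i -> a_i u_i + b_i, are exactly the underspecified part.
\[
U(x) \;=\; \sum_i w_i \, u_i(x), \qquad w_i \ge 0 .
\]
```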
Thanks for this! I hadn't seen those quotes, or at least hadn't remembered them.
I actually really appreciate Alex sticking his neck out a bit here and suggesting this LessWrong dialogue. We both have some contrary opinions, but his takes were probably a little more predictably unwelcome in this venue. (Maybe we should try this on a different crowd - we could try rendering this on Twitter too, lol.)
There's definitely value to being (rudely?) shaken out of lazy habits of thinking - though I might not personally accuse someone of fanfiction research! As discu...