I'm particularly interested in sustainable collaboration and the long-term future of value. I'd love to contribute to a safer and more prosperous future with AI! Always interested in discussions about axiology, x-risks, s-risks.
I enjoy encountering new perspectives and growing my understanding of the world and the people in it. I also love to read - let me know your suggestions! In no particular order, here are some I've enjoyed recently:
Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites:
People who've got to know me only recently are sometimes surprised to learn that I'm a pretty handy trumpeter and hornist.
(cross-posted on EAF)
Love this!
As presaged in our verbal discussion, my top conceptual complement would be to emphasise exploration/experimentation as central to the knowledge-production loop: the cycle of 'developing good taste to plan better experiments to improve taste (and planning model)' is critical (indispensable?) to 'produce new knowledge which is very helpful by the standards of human civilization' (on any kind of meaningful timescale).
This is because just flailing, or even just 'doing stuff', gets you some novelty of observations, but directedly seeking informative circumstances at the boundaries of the known (which includes making novel, unpredictable events happen, as well as getting equipped with richer means to observe and record them, and perhaps preparing to deliberatively extract insight) turns out to mine vastly more insight per resource (time, materials, etc.). Hence science, but also hence individual human and animal playfulness, curiosity, adversarial exercises and drills (self-play-ish), and whatnot.
Said another way, maybe I'd characterise 'the way that fluid intelligence and crystallised intelligence synergise in the knowledge production loop' as 'directed exploration/experimentation'?
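To gesture at the 'directed beats undirected' point in code, here's a toy sketch (the setup, names, and numbers are entirely my own invention, not anything from the discussion above): an explorer with a fixed observation budget learns more about some noisy phenomena if it directs its probes to wherever its current model is most uncertain, rather than probing at random.

```python
import random
import statistics

# Toy illustration: estimating the payoff of several unknown "phenomena".
# A random prober spreads observations uniformly; a directed prober spends
# its budget where its current estimate is least pinned down.

random.seed(0)
TRUE_MEANS = [0.1, 0.5, 0.9, 0.3]   # hidden facts about the world
NOISE = [0.05, 0.05, 0.6, 0.05]     # phenomenon 2 is much harder to pin down

def observe(i):
    return random.gauss(TRUE_MEANS[i], NOISE[i])

def total_error(samples):
    estimates = [statistics.mean(s) for s in samples]
    return sum(abs(e - t) for e, t in zip(estimates, TRUE_MEANS))

def run(directed, budget=200):
    samples = [[observe(i)] for i in range(len(TRUE_MEANS))]  # one seed observation each
    for _ in range(budget):
        if directed:
            # Probe where the standard error of the current estimate is largest.
            i = max(range(len(samples)),
                    key=lambda j: (statistics.stdev(samples[j]) / len(samples[j]) ** 0.5
                                   if len(samples[j]) > 1 else float("inf")))
        else:
            i = random.randrange(len(samples))
        samples[i].append(observe(i))
    return total_error(samples)

print("random   error:", round(run(directed=False), 3))
print("directed error:", round(run(directed=True), 3))
```

Same observation budget, very different insight per resource; the directed prober ends up spending most of its budget on the phenomenon that actually needs it.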
Having said that, I don't necessarily think these capacities need to reside 'in the same mind', just as contemporary human orgs get more of this done, and more effectively, than individuals do. But the pieces do need to be fit to each other (like, a physicist with great physics taste usually can't complement a bio lab very well without first becoming a person with great bio taste).
But hey, I don’t think the “Demis Hassabis signs off on perpetual mass sabotage of the global AI research enterprise” thing is going to actually happen either.
Here's Demis' recent version of 'IAEA/CERN for AGI'
Featuring
...to monitor unsafe projects, right? And sort of (awkward nose touch) deal with those
Thanks for this! I hadn't seen those quotes, or at least hadn't remembered them.
I actually really appreciate Alex sticking his neck out a bit here and suggesting this LessWrong dialogue. We both have some contrary opinions, but his takes were probably a little more predictably unwelcome in this venue. (Maybe we should try this on a different crowd - we could try rendering this on Twitter too, lol.)
There's definitely value to being (rudely?) shaken out of lazy habits of thinking - though I might not personally accuse someone of fanfiction research! As discussed in the dialogue, I'm still unsure of the exact extent of correct- vs mis-interpretation, and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.
As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus.
I think this is conceivable if either
I sincerely doubt the first nowadays. For the second, I think probably not; these strike me as quite basic and 'obvious' strategies, but I can't give it less than 20% that a system with those stories expunged from its training data would need substantially more in-context nudging to recognise them.
I appreciate you revisiting another couple of months later! This topic continues to evolve.
It was also an update for me that merely-RLHF-ish training on top of GPTs appears to have been enough (in latest-generation models at least) to get something mostly-consistently 'actually trying' in an apparently coherent way. I expected that to happen; I just didn't know when. The paper also somewhat buried a lede:
Furthermore, we find that this compliance gap persists in our synthetic document fine-tuned setup even when we entirely remove the model’s hidden chain-of-thought scratchpad
(though the gap is smaller and we don't have the tools to understand the mechanism)
The bigger update from that paper, for me, was that the somewhat-coherent goals apparently pursued by these things actually align with the target constitution somewhat! I.e. (that which was) reward(ed) was the optimisation target after all, a bit??
We'll see how robustly these sorts of findings hold up to future developments and greater elicitation efforts.
Organisms in general typically sense their environment and take different actions across a wide variety of environmental conditions, so as to cause there to be approximate copies of themselves in the future.[4] That's basic agency.[5]
I agree with this breakdown, except that I start the analysis with moment-to-moment deliberation, and note that having there (continue to) be relevantly similar deliberators is a very widely applicable intermediate objective, from which we get control ('basic agency') but also delegation and replication.
The way the terms have typically been used historically, the simplest summary would be:
- Today's LLMs and image generators are generative models of (certain parts of) the world.
- Systems like e.g. o1 are somewhat-general planners/solvers on top of those models. Also, LLMs can be used directly as planners/solvers when suitably prompted or tuned.
- To go from a general planner/solver to an agent, one can simply hook the system up to some sensors and actuators (possibly a human user) and specify a nominal goal... assuming the planner/solver is capable enough to figure it out from there.
Yep! But (I think maybe you'd agree) there's a lot of bleed between these abstractions, especially when we get to heavily finetuned models. For example...
Applying all that to typical usage of LLMs (including o1-style models): an LLM isn't the kind of thing which is aligned or unaligned, in general. If we specify how the LLM is connected to the environment (e.g. via some specific sensors and actuators, or via a human user), then we can talk about both (a) how aligned to human values is the nominal objective given to the LLM[8], and (b) how aligned to the nominal objective is the LLM's actual effects on its environment. Alignment properties depend heavily on how the LLM is wired up to the environment, so different usage or different scaffolding will yield different alignment properties.
Yes and no? I'd say that the LLM-plus agent's objectives are some function of
and I'd emphasise that the way these influences interact is currently very poorly characterised. But plausibly the priors and biases from training could have nontrivial influence across a wide variety of scenarios (especially combined with incompletely-specified natural-language objectives), at which point it's sensible to ask 'how aligned' the LLM is. I appreciate you're talking in generalities, but I think in practice this case might take up a reasonable chunk of the space! For what it's worth, the perspective of LLMs as pre-agent building blocks and conditioned LLMs as closer to agents is underrepresented, and I'm glad you conceptually distinguish those things here.
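To make the 'wiring up' picture concrete, here's a minimal, hypothetical agent-loop sketch: an LLM used as the planner/solver, with scaffolding supplying the sensors, actuators, and nominal goal. All the names (`llm_complete`, `read_sensors`, `act`) are placeholders I've invented for illustration; no particular framework or API is implied.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical scaffolding: the LLM supplies planning/solving; the scaffold
# supplies the nominal goal, the sensors, and the actuators. The behaviour of
# the whole thing depends on all of these pieces together, plus whatever
# priors and biases the model brings from training.

@dataclass
class LLMAgent:
    llm_complete: Callable[[str], str]   # placeholder: any text-in/text-out model
    read_sensors: Callable[[], str]      # placeholder: environment observation
    act: Callable[[str], str]            # placeholder: actuator; returns feedback
    nominal_goal: str                    # natural-language objective (often underspecified)
    history: List[str] = field(default_factory=list)

    def step(self) -> str:
        observation = self.read_sensors()
        prompt = (
            f"Goal: {self.nominal_goal}\n"
            f"Recent history: {self.history[-5:]}\n"
            f"Observation: {observation}\n"
            "Next action:"
        )
        # Training priors enter here, alongside the prompt and goal wording.
        action = self.llm_complete(prompt)
        feedback = self.act(action)
        self.history.append(f"{action} -> {feedback}")
        return feedback

    def run(self, max_steps: int = 10) -> None:
        for _ in range(max_steps):
            self.step()
```

The same bare model dropped into different scaffolds (different sensors, actuators, goal wording) yields different 'agents', which is part of why it's awkward to call the model itself aligned or unaligned - but the model's priors still show up in every one of them.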
This was a hasty and not exactly beautifully-written post. It didn't get much traction here on LW, but it had more engagement on its EA Forum crosspost (some interesting debate in the comments).
I still endorse the key messages, which are:
Not content with upbraiding CAIS, I also went after Scott Alexander later in the month for similar lazy wording.
You don't win by capturing the dubious accolade of nominally belonging to the bloc which directly destroys everything!
Not long after writing these, I became a civil servant (not something I ever expected!) at the AI Safety Institute (UK), so I've generally steered clear of public comment in the area of AI+politics since then.
That said, I've been really encouraged by the progress on international cooperation, which as I say remains 'realistic and possible'! For example, in late 2023 we had the Bletchley Declaration, signed by the US, China, and many others, which stated (among other things):
We affirm that, whilst safety must be considered across the AI lifecycle, actors developing frontier AI capabilities... have a particularly strong responsibility for ensuring the safety of these AI systems... in particular to prevent misuse and issues of control, and the amplification of other risks.
Since then we've had AI safety appearing among top Chinese priorities, official dialogues between the US and China on AI policy, Track 2 discussions in Thailand and the UK, two international dialogues on AI safety in Venice and Beijing, meaningful AI safety research from prominent Chinese academics, and probably many other encouraging works of diplomacy - it's not my area of expertise, and I haven't even been trying that hard to look for evidence here.
Of course there's also continued conflict and competition for hardware and talent, and a competitive element to countries' approaches to AI. It's in the balance!
At the very least, my takeaway from this is that if someone argues (descriptively or prescriptively) in favour of outright competition without at least acknowledging the above progress, they're some combination of ill-informed and (wilfully or inadvertently) censoring the evidence. Don't let that be you!
There are a variety of different attitudes that can lead to successionism.
This is a really valuable contribution! (Both the term and laying out a partial extensional definition.)
I think there's an important missing class of instances, which I've previously referred to as 'emotional dependence and misplaced concern' (though I think my choice of words isn't amazing). The closest is perhaps your 'AI parentism'. The basic point is that there is a growing 'AI rights/welfare' lobby because some people (for virtuous reasons!) are beginning to think that AI systems are, or could be, deserving moral patients. They might not even be wrong, though the current SOTA conversation here appears incredibly misguided. (I have drafts elaborating on this, but as I am now a civil servant there are some hoops I need to jump through to publish anything substantive.)
Good old Coase! Thanks for this excellent explainer.
In contrast, if you think the relevant risks from AI look like people using their systems to do some small amounts of harm which are not particularly serious, you'll want to hold the individuals responsible for these harms liable and spare the companies.
Or (thanks to Coase), we could have two classes of harm, with 'big' arbitrarily defined as, I don't know, say $500m (a number I definitely just made up), and put liability for big harms on the big companies, while letting the classic societal apparatus for small harms tick over as usual? Surely only a power-crazed bureaucrat would suggest such a thing! (Of course this is prone to litigation over whether particular harms are one big harm or n smaller harms, or whether damages really were half a billion or actually $499m or whatever, but it's a good start.)
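For the avoidance of doubt about how simple the rule I'm gesturing at is, here's a toy sketch (the threshold is the made-up number above, and the party names are placeholders; obviously not a legal proposal):

```python
# Toy sketch of the two-class liability rule gestured at above.
# The $500m threshold is the made-up figure from the comment, nothing official.

BIG_HARM_THRESHOLD_USD = 500_000_000

def liable_party(harm_usd: float, developer: str, user: str) -> str:
    """Assign nominal liability: big harms to the developer, small harms to the user."""
    return developer if harm_usd >= BIG_HARM_THRESHOLD_USD else user

assert liable_party(7.5e8, "BigLabCo", "some_user") == "BigLabCo"
assert liable_party(4.99e8, "BigLabCo", "some_user") == "some_user"
```

All the interesting fights are then about which side of the threshold a given harm falls on, as noted above, but the allocation rule itself fits in a few lines.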
Bostrom touches on a similar question, retaining the joy of striving in a 'solved world', in his recent book Deep Utopia. He points to sports as an existence proof that we can and do enjoy challenges under artificial constraints (like your beach experiences).