I think writing one of the best-selling books of your century is extraordinary evidence that you've understood something deep about human nature, which is more than most random rationalist bloggers can claim. But yes, it doesn't imply you have a coherent philosophy or a benevolent political program.
This cuts off some nuance. I would call it the projection of the collective intelligence agenda onto the AI safety frame of "eliminate the risk of very bad things happening", which I think is an incomplete way of looking at how to impact the future.
In particular, I tend to spend more time thinking about future worlds that are more like the current one: messy and confusing, with very terrible and very good things happening simultaneously. A lot of the impact of collective intelligence tech (for good or ill) will be in determining the parameters of that world.
Thanks, this is a really helpful broad survey of the field. Would be useful to see a one-screen-size summary, perhaps a table with the orthodox alignment problems as one axis?
I'll add that the collective intelligence work I'm doing is not really "technical AI safety", but it is directly targeted at orthodox problems 11 ("Someone else will deploy unsafe superintelligence first") and 13 ("Fair, sane pivotal processes"), and at all alignment-difficulty worlds, not just the optimistic one (in particular, I think human coordination becomes more, not less, important i...
I find that surprising, do you care to elaborate? I don't think his worldview is complete, but he cares deeply about a lot of things I value too, which modern society seems not to value. I would certainly be glad to have him in my moral parliament.
Feels connected to his distrust of "quick, bright, standardized, mental processes", and the obsession with language. It's like his mind is relentlessly orienting to the territory, refusing to accept anyone else's map. Which makes it harder to be a student but easier to discover something new. Reminds me of Geoff Hinton's advice to not read the literature before engaging with the problem yourself.
I like this a lot! A few scattered thoughts
Now I can make the question more precise - why do you think it's safe to have more access to your thoughts and feelings than your subconscious gave you? And how exactly do you plan to deal with all the hostile telepaths out there (possibly including parts of yourself)?
An answer I'd give is that for a lot of people, most of the hostile telepaths are ultimately not that dangerous if you're confident enough to be able to deal with them. As Valentine mentioned, often it's enough to notice that you are actually no longer in the kind of situation where the ...
I know this isn't the central point of your life reviews section but curious if your model has any lower bound on life review timing - if not minutes to hours, at least seconds? milliseconds? (1 ms being a rough lower bound on the time for a signal to travel between two adjacent neurons).
If it's at least milliseconds it opens the strange metaphysical possibility of certain deaths (e.g. from very intense explosions) being exempt from life reviews.
Really appreciated this exchange, Ben & Alex have rare conversational chemistry and ability to sense-make productively at the edge of their world models.
I mostly agree with Alex on the importance of interfacing with extant institutional religion, though less sure that one should side with pluralists over exclusivists. For example, exclusivist religious groups seem to be the only human groups currently able to reproduce themselves, probably because exclusivism confers protection against harmful memes and cultural practices.
I'm also pursuing the vision o...
Definitely agree there's some power-seeking equivocation going on, but I wanted to offer a less sinister explanation from my experiences in AI research contexts. It seems that a lot of the equivocation and blurring of boundaries comes from people trying to work on concrete problems and obtain empirical information. A thought process like:
by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn't be a 'warning shot', it'd just be a 'shot' or 'disaster'.)
Yours is the more direct definition but from context I at least understood 'warning shot' to mean 'disaster', on the scale of a successful terrorist attack, where the harm is large and undeniable and politicians feel compelled to Do Something Now. The 'warning' is not of harm but of existential harm if the warning is not heeded.
I do still expect such a warning shot, though as you say it could very w...
Agreed that coalitional agency is somehow more natural than squiggly-optimizer agency. Besides people, another class of examples are historical empires (like the Persian and then Roman) which were famously lenient [1] and respectful of local religious and cultural traditions; i.e. optimized coalition builders that offered goal-stability guarantees to their subagent communities, often stronger guarantees than those communities could expect by staying independent.
This extends my argument in Cooperators are more powerful than agents - in a world of ...
Correct, I was not offered such paperwork nor any incentives to sign it. Edited my post to include this.
I left Anthropic in June 2023 and am not under any such agreement.
EDIT: nor was any such agreement or incentive offered to me.
I left [...] and am not under any such agreement.
Neither is Daniel Kokotajlo. Context and wording strongly suggest that what you mean is that you weren't ever offered paperwork with such an agreement and incentives to sign it, but there remains a slight ambiguity on this crucial detail.
You're right the conclusion is quite underspecified - how exactly do we build such a cooperation machine?
I don't know yet, but my bet is more on engineering, product design, and infrastructure than on social science. More like building a better Reddit or Uber (or supporting infrastructure layers like WWW and the Internet) than like writing papers.
Would love to see this idea worked out a little more!
I like the "guardian" framing a lot! Besides the direct impact on human flourishing, I think a substantial fraction of x-risk comes from the deployment of superhumanly persuasive AI systems. It seems increasingly urgent that we deploy some kind of guardian technology that at least monitors, and ideally protects, against such superhuman persuaders.
Symbiosis is ubiquitous in the natural world, and is a good example of cooperation across what we normally would consider entity boundaries.
When I say the world selects for "cooperation" I mean it selects for entities that try to engage in positive-sum interactions with other entities, in contrast to entities that try to win zero-sum conflicts (power-seeking).
Agreed with the complicity point - as evo-sim experiments like Axelrod's showed us, selecting for cooperation requires entities that can punish defectors, a condition the world of "hammers" fails to satisfy.
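To make that concrete, here's a minimal sketch of the Axelrod-style dynamic (my own toy payoffs and strategies, not his actual tournament): defection out-earns cooperation in a world of unconditional cooperators, but not against a strategy that punishes defection.

```python
# Toy iterated prisoner's dilemma (illustrative payoffs, not Axelrod's exact setup).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def always_cooperate(opponent_history): return "C"
def always_defect(opponent_history):    return "D"
def tit_for_tat(opponent_history):      return opponent_history[-1] if opponent_history else "C"

def avg_payoff(strategy, opponent, rounds=100):
    """Average per-round payoff of `strategy` when paired against `opponent`."""
    my_hist, opp_hist, total = [], [], 0
    for _ in range(rounds):
        mine, theirs = strategy(opp_hist), opponent(my_hist)
        total += PAYOFF[(mine, theirs)][0]
        my_hist.append(mine)
        opp_hist.append(theirs)
    return total / rounds

# World of "hammers" (nobody punishes): defectors out-earn cooperators, so defection spreads.
print(avg_payoff(always_defect, always_cooperate),     # 5.0
      avg_payoff(always_cooperate, always_cooperate))  # 3.0
# World with punishers: defecting against tit-for-tat earns less than cooperating with it.
print(avg_payoff(always_defect, tit_for_tat),          # ~1.04
      avg_payoff(tit_for_tat, tit_for_tat))            # 3.0
```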
Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors are controlling 90% of AI-relevant compute then it seems plausible that they could defend against 10% of the compute being controlled by misaligned AGI or other bad actors - by denying them resources, by hardening core infrastructure, via MAD, etc.
I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:
You might find AI Safety Endgame Stories helpful - I wrote it last week to try to answer this exact question, covering a broad array of (mostly non-pivotal-act) success stories from technical and non-technical interventions.
Nate's "how various plans miss the hard bits of the alignment challenge" might also be helpful as it communicates the "dynamics of doom" that success stories have to fight against.
One thing I would love is to have a categorization of safety stories by claims about the world. E.g. what does successful intervention look like in worlds wher...
I don't mean to suggest "just supporting the companies" is a good strategy, but there are promising non-power-seeking strategies like "improve collaboration between the leading AI labs" that I think are worth biasing towards.
Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn't delay AI system deployment by at least a few months, and probably more like several years, before capitalist incentives in the form of shareholder lawsuits or new entrants that poach their key technical staff have a chance to materialize.
Interesting, I haven't seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive. An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.
Fabricated options are products of incoherent thinking; what is the incoherence you're pointing out with policies that aim to delay existential catastrophe or reduce transaction costs between existing power centers?
I've considered starting an org that would either aim at generating better alignment data or produce it as a side effect, so this is really helpful - this kind of negative information is nearly impossible to find.
Is there a market niche for providing more interactive forms of human feedback, where it's important to have humans tightly in the loop with an ML process, rather than "send a batch to raters and get labels back in a few hours"? One reason RLHF is so little used is the difficulty of setting up this kind of human-in-the-loop infrastructure. Safety...
I think a substantial fraction of ML researchers probably agree with Yann LeCun that AI safety will be solved "by default" in the course of making the AI systems useful. The crux is probably related to questions like how competent society's response will be, and maybe the likelihood of deceptive alignment.
Two points of disagreement though:
Thank you for putting numbers on it!
~60%: there will be an existential catastrophe due to deceptive alignment specifically.
Is this an unconditional prediction of 60% chance of existential catastrophe due to deceptive alignment alone, in contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century? Or do you mean that, conditional on there being an existential catastrophe due to AI, there's a 60% chance it will be caused by deceptive alignment, and 40% by other problems like misuse or outer alignment?
In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century
Amongst the LW crowd I'm relatively optimistic, but I'm not that optimistic. I would give maybe 20% total risk of misalignment this century. (I'm generally expecting singularity this century with >75% chance such that most alignment risk ever will be this century.)
The number is lower if you consider "how much alignment risk before AI systems are in the driver's seat," which I think is very often the more relevant question, but I'd still put it ...
Unconditional. I'm rather more pessimistic than an overall 10% chance. I usually give ~80% chance of existential risk from AI.
Agreed with the sentiment, though I would make a weaker claim, that AGI timelines are not uniquely strategically relevant, and the marginal hour of forecasting work at this point is better used on other questions.
My guess is that the timelines question has been investigated and discussed so heavily because for many people it is a crux for whether or not to work on AI safety at all - and there are many more such people than there are alignment researchers deciding what approach to prioritize. Most people in the world are not convinced that AGI safety is a pressing problem, and building very robust and legible models showing that AGI could happen soon is, empirically, a good way to convince them.
Mostly orthogonal:
Other relevant differences are
Agreed with Rohin that a key consideration is whether you are trying to form truer beliefs, or to contribute novel ideas, and this in turn depends on what role you are playing in the collective enterprise that is AI safety.
If you're the person in charge of humanity's AI safety strategy, or a journalist tasked with informing the public, or a policy person talking to governments, it makes a ton of sense to build a "good gears-level model of what their top 5 alignment researchers believe and why". If you're a researcher, tasked with generating novel ideas tha...
Would love to see your math! If L2 norm and Kolmogorov provide roughly equivalent selection pressure that's definitely a crux for me.
Agreed that the existence of general-purpose heuristic-generators like relaxation is a strong argument for why we should expect to select for inner optimizers that look something like A*, contrary to my "gradient descent doesn't select for inner search" post.
Recursive structure creates an even stronger bias toward things like A* but only in recurrent neural architectures (so notably not currently-popular transformer architectures, though it's plausible that recurrent architectures will come back).
I maintain that the compression / compactness argument from "R...
Yeah it's probably definitions. With the caveat that I don't mean the narrow "literally iterates over solutions", but roughly "behaves (especially off the training distribution) as if it's iterating over solutions", like Abram Demski's term selection.
I disagree that performing search is central to human capabilities relative to other species. The cultural intelligence hypothesis seems much more plausible: humans are successful because our language and ability to mimic allow us to accumulate knowledge and coordinate at massive scale across both space and time. Not because individual humans are particularly good at thinking or optimizing or performing search. (Not sure what the implications of this are for AI).
You're right though, I didn't say much about alternative algorithms other than pointing vaguely in...
I agree that A* and gradient descent are central examples of search; for realistic problems these algorithms typically evaluate the objective on millions of candidates before returning an answer.
In contrast, human problem solvers typically do very little state evaluation - perhaps evaluating a few dozen possibilities directly, and relying (as you said) on abstractions and analogies instead. I would call this type of reasoning "not very search-like".
On the far end we have algorithms like Gauss-Jordan elimination, which just compute the optimal solution dire...
See my answer to tailcalled:
a program is more "search-like" if it is enumerating possible actions and evaluating their consequences
I'm curious if you mean something different by search when you say that we're likely to find policies that look like an "explicit search process + simple objective(s)"
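To make the distinction concrete, here's a toy sketch (my own hypothetical example, not from the post): the same task handled by a "search-like" program that enumerates candidate actions and evaluates their predicted consequences, versus a thermostat-style program that maps observations directly to actions.

```python
def search_like_controller(temp, target=20.0):
    actions = ["heat", "cool", "idle"]
    effects = {"heat": +1.0, "cool": -1.0, "idle": 0.0}

    # Enumerate actions, predict the consequence of each, score it, pick the best.
    def score(action):
        predicted_temp = temp + effects[action]
        return -abs(predicted_temp - target)

    return max(actions, key=score)

def thermostat_controller(temp, target=20.0):
    # Direct mapping from observation to action; no enumeration or evaluation.
    if temp < target - 0.5:
        return "heat"
    if temp > target + 0.5:
        return "cool"
    return "idle"

print(search_like_controller(17.0), thermostat_controller(17.0))  # both say "heat"
```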
Agreed that "search" is not a binary but more like a continuum, where we might call a program more "search-like" if it is enumerating possible actions and evaluating their consequences, and less "search-like" if it is directly mapping representations of inputs to actions. The argument in this post is that gradient descent (unlike evolution, and unlike human programmers) doesn't select much for "search-like" programs. If we take depth-first search as a central example of search, and a thermostat as the paradigmatic non-search program, gradient descent will ...
Yeah I think you need some additional assumptions on the models and behaviors, which you're gesturing at with the "matching behaviors" and "inexact descriptions". Otherwise it's easy to find counterexamples: imagine the model is just a single N x N matrix of parameters, then in general there is no shorter description length of the behavior than the model itself.
Yes, there are non-invertible (you might say "simpler") behaviors which each occupy more parameter volume than any given invertible behavior, but random matrices are almost certainly invertible, so the actual optimization pressure towards low description length is infinitesimal.
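A quick numerical sanity check of the "random matrices are almost certainly invertible" claim (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
# Sample 1000 random N x N matrices and check whether each is full rank (invertible).
ranks = [np.linalg.matrix_rank(rng.normal(size=(N, N))) for _ in range(1000)]
print(all(r == N for r in ranks))  # True: every sampled matrix was invertible
```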
Ah I think that's the crux - I believe the overparametrized regime finds generalizing models because gradient descent finds functions that have low function norm, not low description length. I forget the paper that showed this for neural nets but here's a proof for logistic regression.
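To illustrate the flavor of that implicit bias, here's a toy sketch (my own example, in the simpler overparametrized linear-regression setting rather than the logistic-regression case the proof covers): gradient descent initialized at zero converges to the minimum-L2-norm solution among all interpolating ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # overparametrized: more parameters than data points
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Plain gradient descent on squared loss, starting from zero.
w = np.zeros(d)
lr = 0.01
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y  # the minimum-norm interpolating solution
print(np.abs(X @ w - y).max())          # ~0: both fit the data exactly
print(np.linalg.norm(w - w_min_norm))   # ~0: GD converged to the min-norm solution
```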
Agreed on "explicit search" being a misleading phrase, I'll replace it with just "search" when I'm referring to learned programs.
small descriptions give higher parameter space volume, and so the things we find are those with short descriptions
I don't think I understand this. GPT-3 is a thing we found, which has 175B parameters, what is the short description of it?
Thinking about this more, I think gradient descent (at least in the modern regime) probably doesn't select for inner search processes, because it's not actually biased towards low Kolmogorov complexity. More in my standalone post, and here's a John Maxwell comment making a similar point.
Agreed with John, with the caveat that I expect search processes + simple objectives to only emerge from massively multi-task training. If you're literally training an AI just on smiling, TurnTrout is right that "a spread of situationally-activated computations" is more likely since you're not getting any value from the generality of search.
The Deep Double Descent paper is a good reference for why gradient descent training in the overparametrized regime favors low complexity models, though I don't know of explicit evidence for the conjecture that "explicit...
I love the framing of outer alignment as a data quality problem!
As an illustrative data point, the way Google generates "alignment data" for its search evals is by employing thousands of professional raters and training them to follow a 200-page handbook (!) that operationalizes the concept of a "good search result".
Intuitively speaking, the underlying problem is that aligned goals need to generalize robustly enough to block AGIs from the power-seeking strategies recommended by instrumental reasoning, which will become much more difficult as their instrumental reasoning skills improve.
This is the clearest justification of "capabilities generalize further than alignment" I've seen, bravo!
My main disagreement with the post is with its claim that goal misgeneralization comes after situational awareness. Weak versions of goal misgeneralization are already happening all the time, fro...
Yes, definitely possible.
Saying the quiet part out loud: VC for both research and product startups runs on trust. To get funding you will most likely need someone trusted to vouch for you, and/or legible, hard-to-fake accomplishments in a related field that obviate the need for trust. (Writing up a high-quality AI alignment research agenda could be such an accomplishment!) If you DM me with more details about your situation, I might be able to help route you.
I don't think any factored cognition proponents would disagree with
Composing interpretable pieces does not necessarily yield an interpretable system.
They just believe that we could, contingently, choose to compose interpretable pieces into an interpretable system. Just like we do all the time with
...Figuring out how to turn interpretabi
Agreed on all points! One clarification is that large founder-led companies, including Facebook, are all moral mazes internally (i.e. from the perspective of the typical employee); but their founders often have so much legitimacy that their external actions are only weakly influenced by moral maze dynamics.
I guess that means that if AGI deployment is very incremental - a sequence of small changes to many different AI systems, that only in retrospect add up to AGI - moral maze dynamics will still be paramount, even in founder-led companies.
basically every company eventually becomes a moral maze
Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes.
Facebook's pivot to the "metaverse", for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was o...
In support of this, I remember Geoff Hinton saying at his Turing award lecture that he strongly advised new grad students not to read the literature before trying, for months, to solve the problem themselves.
Two interesting consequences of the "unique combination of facts" model of invention:
Great, thought-provoking post. The AI research community certainly felt much more cooperative before it got an injection of startup/monopoly/winner-take-all thinking. Google Brain publishing the Transformer paper being a great example.
I wonder how much this truly is narrative, as opposed to AI being genuinely more winner-take-all than fusion in the economic sense. Certainly the hardware layer has proven quite winner-take-all so far with NVDA taking a huge fraction of the profit; same with adtech, the most profitable application of (last-generation) AI, whe...