All of Ivan Vendrov's Comments + Replies

Great, thought-provoking post. The AI research community certainly felt much more cooperative before it got an injection of startup/monopoly/winner-take-all thinking. Google Brain publishing the Transformer paper being a great example.

I wonder how much this truly is narrative, as opposed to AI being genuinely more winner-take-all than fusion in the economic sense. Certainly the hardware layer has proven quite winner-take-all so far with NVDA taking a huge fraction of the profit; same with adtech, the most profitable application of (last-generation) AI, whe... (read more)

I think writing one of the best-selling books of your century is extraordinary evidence that you've understood something deep about human nature, which is more than most random rationalist bloggers can claim. But yes, it doesn't imply you have a coherent philosophy or a benevolent political program.

It cuts off some nuance; I would call this the projection of the collective intelligence agenda onto the AI safety frame of "eliminate the risk of very bad things happening", which I think is an incomplete way of looking at how to impact the future.

In particular, I tend to spend more time thinking about future worlds that are more like the current one: messy and confusing, with very terrible and very good things happening simultaneously. Much of the impact of collective intelligence tech (for good or ill) will be in determining the parameters of that world.

Thanks, this is a really helpful broad survey of the field. Would be useful to see a one-screen-size summary, perhaps a table with the orthodox alignment problems as one axis?

I'll add that the collective intelligence work I'm doing is not really "technical AI safety" but is directly targeted at orthodox problems 11 ("Someone else will deploy unsafe superintelligence first") and 13 ("Fair, sane pivotal processes"), and at all alignment difficulty worlds, not just the optimistic one (in particular, I think human coordination becomes more, not less, important i... (read more)

Stag
Would you agree that the entire agenda of collective intelligence is aimed at addressing problems 11 ("Someone else will deploy unsafe superintelligence first") and 13 ("Fair, sane pivotal processes"), or does that cut off nuance?

I find that surprising, do you care to elaborate? I don't think his worldview is complete, but he cares deeply about a lot of things I value too, which modern society seems not to value. I would certainly be glad to have him in my moral parliament.

Mateusz Bagiński
https://www.lesswrong.com/posts/u8GMcpEN9Z6aQiCvp/rule-thinkers-in-not-out

Feels connected to his distrust of "quick, bright, standardized, mental processes", and the obsession with language. It's like his mind is relentlessly orienting to the territory, refusing to accept anyone else's map. Which makes it harder to be a student but easier to discover something new. Reminds me of Geoff Hinton's advice to not read the literature before engaging with the problem yourself.

I like this a lot! A few scattered thoughts

  • This theory predicts and explains "therapy-resistant dissociation", or the common finding that none of the "woo" exercises like focusing, meditation, etc., actually work (cf. Scott's experience as described in https://www.astralcodexten.com/p/are-woo-non-responders-defective). If there's an active strategy of self-deception, you'd expect people to react negatively (or learn to not react via yet deeper levels of self-deception) to straightforward attempts to understand and untangle one's psychology.
  • It matches and
... (read more)

Now I can make the question more precise - why do you think it's safe to have more access to your thoughts and feelings than your subconscious gave you? And how exactly do you plan to deal with all the hostile telepaths out there (possibly including parts of yourself)?

An answer I'd give is that for a lot of people, most of the hostile telepaths are ultimately not that dangerous if you're confident enough to be able to deal with them. As Valentine mentioned, often it's enough to notice that you are actually no longer in the kind of situation where the ... (read more)

I know this isn't the central point of your life reviews section, but I'm curious whether your model has any lower bound on life review timing - if not minutes to hours, at least seconds? Milliseconds? (1 ms being a rough lower bound on the time for a signal to travel between two adjacent neurons).

If it's at least milliseconds it opens the strange metaphysical possibility of certain deaths (e.g. from very intense explosions) being exempt from life reviews.

Really appreciated this exchange, Ben & Alex have rare conversational chemistry and ability to sense-make productively at the edge of their world models.

I mostly agree with Alex on the importance of interfacing with extant institutional religion, though less sure that one should side with pluralists over exclusivists. For example, exclusivist religious groups seem to be the only human groups currently able to reproduce themselves, probably because exclusivism confers protection against harmful memes and cultural practices.

I'm also pursuing the vision o... (read more)

Definitely agree there's some power-seeking equivocation going on, but I wanted to offer a less sinister explanation from my experiences in AI research contexts. It seems that a lot of equivocation and blurring of boundaries comes from people trying to work on concrete problems and obtain empirical information - a thought process like:

  1. alignment seems maybe important?
  2. ok what experiment can I set up that lets me test some hypotheses
  3. can't really test the long-term harms directly, let me test an analogue in a toy environment or on a small model, publish results
  4. when t
... (read more)

by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn't be a 'warning shot', it'd just be a 'shot' or 'disaster'.)

Yours is the more direct definition, but from context I at least understood 'warning shot' to mean 'disaster', on the scale of a successful terrorist attack, where the harm is large and undeniable and politicians feel compelled to Do Something Now. The 'warning' is not of harm but of existential harm if the warning is not heeded.

I do still expect such a warning shot, though as you say it could very w... (read more)

Agreed that coalitional agency is somehow more natural than squiggly-optimizer agency. Besides people, another class of examples are historical empires (like the Persian and then Roman) which were famously lenient [1] and respectful of local religious and cultural traditions; i.e. optimized coalition builders that offered goal-stability guarantees to their subagent communities, often stronger guarantees than those communities could expect by staying independent.

This extends my argument in Cooperators are more powerful than agents - in a world of ... (read more)

Correct, I was not offered such paperwork nor any incentives to sign it. Edited my post to include this.

Ivan Vendrov

I left Anthropic in June 2023 and am not under any such agreement.

EDIT: nor was any such agreement or incentive offered to me.

I left [...] and am not under any such agreement.

Neither is Daniel Kokotajlo. Context and wording strongly suggest that what you mean is that you weren't ever offered paperwork with such an agreement and incentives to sign it, but there remains a slight ambiguity on this crucial detail.

  1. Agree trust and cooperation are dual-use, and I'm not sure how to think about this yet; perhaps the most important form of coordination is the one that prevents (directly or via substitution) harmful forms of coordination from arising.
  2. One reason I wouldn't call lack of altruism the root is that it's not clear how to intervene on it; it's like calling the laws of physics the root of all evil. I prefer to think about "how to reduce transaction costs to self-interested collaboration". I'm also less sure that a society of people with more altruistic motives will nec
... (read more)
FlorianH
Neither entirely convinced nor entirely against the idea of defining 'root cause' essentially with respect to 'where is intervention plausible'. Either way, to me that way of defining it would not have to exclude "altruism" as a candidate: (i) there could be scope to re-engineer ourselves to become more altruistic, and (ii) without doing that, gosh how infinitely difficult does it feel to improve the world truly systematically (as you rightly point out). That is strongly related to Unfit for the Future - The Need for Moral Enhancement (whose core story is spot on imho, even though I find quite some of the details in the book substandard)

You're right the conclusion is quite underspecified - how exactly do we build such a cooperation machine?

I don't know yet, but my bet is more on engineering, product design, and infrastructure than on social science. More like building a better Reddit or Uber (or supporting infrastructure layers like WWW and the Internet) than like writing papers.

Vaughn Papenhausen
Okay, I see better now where you're coming from and how you're thinking that social science could be hopeless and yet we can still build a cooperation machine. I still suspect you'll need some innovations in social science to implement such a machine. Even if we assume that we have a black box machine that does what you say, you still have to be sure that people will use the machine, so you'll need enough understanding of social science to either predict that they will, or somehow get them to. But even if you solve the problem of implementation, I suspect you'll need innovations in social science in order to even design such a machine. In order to understand what kind of technology or infrastructure would increase trust, asabiyah, etc, you need to understand people. And maybe you think the understanding we already have of people with our current social science is already enough to tell us what we'd need to build such a machine. But you sounded pretty pessimistic about our current social science. (I'm making no claim one way or the other about our current social science, just trying to draw out tensions between different parts of your piece.)

Would love to see this idea worked out a little more!

I like the "guardian" framing a lot! Besides the direct impact on human flourishing, I think a substantial fraction of x-risk comes from the deployment of superhumanly persuasive AI systems. It seems increasingly urgent that we deploy some kind of guardian technology that at least monitors, and ideally protects, against such superhuman persuaders.

Symbiosis is ubiquitous in the natural world, and is a good example of cooperation across what we normally would consider entity boundaries.

When I say the world selects for "cooperation" I mean it selects for entities that try to engage in positive-sum interactions with other entities, in contrast to entities that try to win zero-sum conflicts (power-seeking).

Agreed with the complicity point - as evo-sim experiments like Axelrod's showed us, selecting for cooperation requires entities that can punish defectors, a condition the world of "hammers" fails to satisfy.
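To make the "punish defectors" point concrete, here's a minimal iterated prisoner's dilemma sketch in the spirit of Axelrod's tournaments (illustrative payoffs and strategies of my own choosing, not his exact setup): a lone defector exploits a world of unconditional cooperators, but gains almost nothing against a strategy that can punish defection.

```python
# Minimal iterated prisoner's dilemma sketch (illustrative, not Axelrod's exact setup).
# Payoffs: both cooperate -> 3 each; both defect -> 1 each;
# defector against cooperator -> 5 vs 0.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def always_cooperate(history_self, history_other):
    return "C"

def always_defect(history_self, history_other):
    return "D"

def tit_for_tat(history_self, history_other):
    # Cooperate first, then copy the opponent's last move (punishes defection).
    return "C" if not history_other else history_other[-1]

def play(strat_a, strat_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = strat_a(hist_a, hist_b), strat_b(hist_b, hist_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(move_a); hist_b.append(move_b)
    return score_a, score_b

# A lone defector exploits unconditional cooperators...
print(play(always_defect, always_cooperate))   # (1000, 0)
# ...but cannot exploit a strategy that punishes defection.
print(play(always_defect, tit_for_tat))        # (204, 199)
print(play(tit_for_tat, tit_for_tat))          # (600, 600)
```

The point being that the stability of cooperation depends on the presence of strategies that can retaliate, not on cooperativeness alone.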

Roman Leventov
Power-seeking conflict might be zero- or negative-sum in terms of its immediate effect, yet the order which is established after the conflict is over (perhaps, temporarily) is not necessarily zero-sum. Dictatorship is not a zero-sum order, it could be even more productive in the short run than democracy.

Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors are controlling 90% of AI-relevant compute then it seems plausible that they could defend against 10% of the compute being controlled by misaligned AGI or other bad actors - by denying them resources, by hardening core infrastructure, via MAD, etc.

Krieger
It seems like the exact model which the AI will adopt is kinda confounding my picture when I'm trying to imagine what an "existentially secure" world looks like. I'm currently thinking there are two possible existentially secure worlds:

The obvious one is where all human dependence is removed from setting/modifying the AI's value system (like CEV, fully value-aligned)—this would look much more unipolar.

The alternate is for the well-intentioned-and-coordinated group to use a corrigible AI that is aligned with its human instructor. To me, whether this scenario looks existentially secure probably depends on "whether small differences in capability can magnify to great power differences"—if false, it would be much easier for capable groups to defect and make their own corrigible AI push agendas that may not be in favor of humanity's interest (hence not so existentially secure). If true, then the world would again be more unipolar—and its existential secureness would depend on how value-aligned the humans operating the corrigible AI are (I'm guessing this is your offense-defense balance example?)

So it seems to me that the ideal end game is for humanity to end up with a value-aligned AI, either by starting with it or somehow going through the "dangerous period" of multipolar corrigible AIs and transitioning to a value-aligned one. Possible pathways (non-exhaustive). I'm not sure whether this is a good framing at all (probably isn't), but simply counting the number of dependencies (without taking into consideration how plausible each dependency is) it just seems to me that humanity's chances would be better off with a unipolar takeover scenario—either using a value-aligned AI from the start or transitioning into one after a pivotal act.

I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:

  1. Both aim to eventually end up in a state of existential security, where nobody can ever build an unaligned AI that destroys the world. Both have to deal with the fact that power is currently broadly distributed in the world, so most plausible stories in which we end up with existential security will involve the actions of thousands if
... (read more)
Krieger
Is it even possible for a non-pivotal act to ever achieve existential security? Even if we maxed out AI lab communication and had awesome interpretability, that doesn't help in the long run, given that the minimum amount of resources required to build a misaligned AGI will probably keep dropping.
Answer by Ivan Vendrov

You might find AI Safety Endgame Stories helpful - I wrote it last week to try to answer this exact question, covering a broad array of (mostly non-pivotal-act) success stories from technical and non-technical interventions.

Nate's "how various plans miss the hard bits of the alignment challenge" might also be helpful as it communicates the "dynamics of doom" that success stories have to fight against.

One thing I would love is to have a categorization of safety stories by claims about the world. E.g. what does successful intervention look like in worlds wher... (read more)

Krieger
Thanks, I found your post very helpful and I think this community would benefit from more posts like it. I agree that we would need a clear categorization. Ideally, they would provide us a way to explicitly quantify/make legible the claims of various proposals, e.g. "my proposal, under these assumptions about the world, may give us X years of time, changes the world in these ways, and interacts with proposals A, B, C in these ways." The lack of such is perhaps one of the reasons why I feel the pivotal act framing is still necessary.

It seems to me that, while proposals closer to the "gradual steering" end of the spectrum (e.g. regulation, culture change, AI lab communication) usually are aimed at giving humanity a couple more months/years of extra time, they fail to make legible claims as above and yet (I might be wrong) proceed to implicitly claim "therefore, if we do a lot of these, we're safe—even without any pivotal acts!" (Of course pivotal acts aren't guilt-free and many of their details are hand-wavy, but their claims of impact & world-assumptions seem pretty straightforward. Are there non-pivotal-act proposals like that?)

I don't mean to suggest "just supporting the companies" is a good strategy, but there are promising non-power-seeking strategies like "improve collaboration between the leading AI labs" that I think are worth biasing towards.

Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn't delay AI system deployment by at least a few months, and probably more like several years, before capitalist incentives in the form of shareholder lawsuits or new entrants that poach their key technical staff have a chance to materialize.

Noosphere89
This is the crux, thank you for identifying it. Yeah, I'm fairly pessimistic about several years' time, since I don't think they're that special of a company in resisting capitalist nudges and incentives. And yeah, I'm laughing because unless the alignment/safety teams control what capabilities are added, I do not expect the capabilities teams to stop, because they won't get paid for that.

Interesting, I haven't seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive.  An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.

Paul Tiplady
One other thought after considering this a bit more - we could test this now using software submodules. It’s unlikely to perform better (since no hardware speedup) but it could shed light on the tradeoffs with the general approach. And as these submodules got more complex, it may eventually be beneficial to use this approach even in a pure-software (no hardware) paradigm, if it lets you skip retraining a bunch of common functionality. I.e. if you train a sub-network for one task, then incorporate that in two distinct top-layer networks trained on different high-level goals, do you get savings by not having to train two “visual cortexes”? This is in a similar vein to Google’s foundation models, where they train one jumbo model that then gets specialized for each usecase. Can that foundation model be modularized? (Maybe for relatively narrow usecases like “text comprehension” it’s actually reasonable to think of a foundation model as a single submodule, but I think they are quite broad right now. ) The big difference is I think all the weights are mutable in the “refine the foundation model” step? Perhaps another concrete proposal for a technological attractor would be to build a SOTA foundation model and make that so good that the community uses it instead of training their own, and then that would also give a slower-moving architecture/target to interpret.

Fabricated options are products of incoherent thinking; what is the incoherence you're pointing out with policies that aim to delay existential catastrophe or reduce transaction costs between existing power centers?

Noosphere89
I think the fabricated option here is just supporting the companies making AI, when my view is that by default, capitalist incentives kill us all due to boosting AI capabilities while doing approximately zero AI safety, in particular deceptive alignment would not be invested in despite this being the majority of the risk. One of the most important points for AGI safety is the leader in AGI needs a lot of breathing space and leadership ahead of their competitors, and I think this needs to be done semi-unilaterally by an organization not having capitalist incentives, because all the incentives point towards ever faster, not slowing down AGI capabilities. That's why I think your options are fabricated, because they assume unrealistically good incentives to do what you want.

I've considered starting an org that was either aimed at generating better alignment data or would do so as a side effect, so this is really helpful - this kind of negative information is nearly impossible to find.

Is there a market niche for providing more interactive forms of human feedback, where it's important to have humans tightly in the loop with an ML process, rather than "send a batch to raters and get labels back in a few hours"? One reason RLHF is so little used is the difficulty of setting up this kind of human-in-the-loop infrastructure. Safety... (read more)
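To gesture at what "tightly in the loop" could look like in practice, here is a hypothetical uncertainty-sampling sketch where a model requests one label at a time from a human rater instead of shipping large offline batches. The `ask_human` function, the seed data, and the budget are placeholder assumptions (and the seed labels are assumed to contain both classes), not a description of any existing product.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ask_human(example):
    # Placeholder for the real rater interface (CLI prompt, web UI, API call, ...).
    return int(input(f"Label for {example}? (0/1): "))

def human_in_the_loop(pool, seed_X, seed_y, budget=20):
    """Uncertainty sampling: repeatedly ask a human to label the example the
    current model is least sure about, then retrain on the enlarged dataset."""
    X, y, pool = list(seed_X), list(seed_y), list(pool)
    for _ in range(budget):
        model = LogisticRegression().fit(np.array(X), np.array(y))
        probs = model.predict_proba(np.array(pool))[:, 1]
        i = int(np.argmin(np.abs(probs - 0.5)))  # most uncertain unlabeled example
        X.append(pool[i])
        y.append(ask_human(pool[i]))
        pool.pop(i)
    return LogisticRegression().fit(np.array(X), np.array(y))
```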

Answer by Ivan Vendrov

I think a substantial fraction of ML researchers probably agree with Yann LeCun that AI safety will be solved "by default" in the course of making the AI systems useful. The crux is probably related to questions like how competent society's response will be, and maybe the likelihood of deceptive alignment.

Two points of disagreement though:

  • I don't think setting P(doom) = 10% indicates lack of engagement or imagination; Toby Ord in The Precipice also gives a 10% estimate for AI-derived x-risk this century, and I assume he's engaged pretty deeply with the ali
... (read more)
Aorou
Hey! Thanks for sharing the debate with LeCun, I found it very interesting and I'll do more research on his views. Thanks for pointing out that even a 1% existential risk is worth worrying about; I imagine it's true even in my moral system, if I just realize that a 1% probability that humanity is wiped out = 70 million expected deaths (1% of 7 billion), plus all the expected humans that wouldn't come to be. That's logically. Emotionally, I find it WAY harder to care about a 1% X-risk. Scope insensitivity. I want to think about where else in my thinking this is causing output errors.

Thank you for putting numbers on it!

~60%: there will be an existential catastrophe due to deceptive alignment specifically.

Is this an unconditional prediction of a 60% chance of existential catastrophe due to deceptive alignment alone? In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century. Or do you mean that, conditional on there being an existential catastrophe due to AI, there's a 60% chance it will be caused by deceptive alignment and a 40% chance by other problems like misuse or outer alignment?

In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century

Amongst the LW crowd I'm relatively optimistic, but I'm not that optimistic.  I would give maybe 20% total risk of misalignment this century. (I'm generally expecting singularity this century with >75% chance such that most alignment risk ever will be this century.)

The number is lower if you consider "how much alignment risk before AI systems are in the driver's seat," which I think is very often the more relevant question, but I'd still put it ... (read more)

evhub

Unconditional. I'm rather more pessimistic than an overall 10% chance. I usually give ~80% chance of existential risk from AI.

Agreed with the sentiment, though I would make a weaker claim, that AGI timelines are not uniquely strategically relevant, and the marginal hour of forecasting work at this point is better used on other questions.

My guess is that the timelines question has been investigated and discussed so heavily because for many people it is a crux for whether or not to work on AI safety at all - and there are many more such people than there are alignment researchers deciding what approach to prioritize. Most people in the world are not convinced that AGI safety is a pressing problem, and building very robust and legible models showing that AGI could happen soon is, empirically, a good way to convince them.

Ivan Vendrov

Mostly orthogonal:

  • Evan's post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
  • This post argues against (a version of) Evan's premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn't expect gradient descent to select for it.

Other relevant differences are

  1. gradient descent doesn't actually select for low time complexity / minimal circuits; it holds time &
... (read more)

Agreed with Rohin that a key consideration is whether you are trying to form truer beliefs, or to contribute novel ideas, and this in turn depends on what role you are playing in the collective enterprise that is AI safety.

If you're the person in charge of humanity's AI safety strategy, or a journalist tasked with informing the public, or a policy person talking to governments, it makes a ton of sense to build a "good gears-level model of what their top 5 alignment researchers believe and why". If you're a researcher, tasked with generating novel ideas tha... (read more)

Would love to see your math! If L2 norm and Kolmogorov provide roughly equivalent selection pressure that's definitely a crux for me.

Lucius Bushnaq
There should be a post with some of it out soon-ish. Short summary: You can show that at least for overparametrised neural networks, the eigenvalues of the Hessian of the loss function at optima, which determine the basin size within some approximation radius, are basically given by something like the number of independent, orthogonal features the network has, and how "big" these features are. The less independent, mutually orthogonal features the network has, and the smaller they are, the broader the optimum will be. Size and orthogonality are given by the Hilbert space scalar product for functions here.

That sure sounds an awful lot like a kind of complexity measure to me. Not sure it's Kolmogorov exactly, but it does seem like something related. And while I haven't formalised it yet, I think there's quite a lot to suggest that the less information you pass around in the network, the less independent features you'll tend to have. E.g., if you have 20 independent bits of input information, and you only pass on 10 of them to the deeper layers of the network, you'll be much more likely to get fewer unique features than if you'd passed them on. Because you're making the Hilbert space smaller.

So if you introduce a penalty on exchanging too much information between parts of the network, like, say, with L2 regularisation, you'd expect the optimiser to find solutions with less independent features ("description length"), and broader basins. Empirically, introducing "connection costs" does seem to lead to broader basins in our group's experiments, IIRC. Also, there's a bunch of bio papers on how connection costs lead to modularity, and our own experiments support the idea that modularity means broader basins. I'm not sure I've seen it implemented with L2 regularisation as the connection cost specifically, but my guess would be that it'd do the same thing.

(Our hope is actually that these orthogonalised features might prove to be a better fundamental unit of DL t

Agreed that the existence of general-purpose heuristic-generators like relaxation is a strong argument for why we should expect to select for inner optimizers that look something like A*, contrary to my gradient descent doesn't select for inner search post.

Recursive structure creates an even stronger bias toward things like A* but only in recurrent neural architectures (so notably not currently-popular transformer architectures, though it's plausible that recurrent architectures will come back).

I maintain that the compression / compactness argument from "R... (read more)

Lucius Bushnaq
I have some math that hints that those may be equivalent-ish statements. Why would we expect a 10x distillation factor? Half the directions of the basin being completely flat seems like a pretty big optimum to me. Also, I'm not sure if you can always manage to align the free directions in parameter space with individual parameters, such that you can discard p parameters if you had p free directions.

Yeah it's probably definitions. With the caveat that I don't mean the narrow "literally iterates over solutions", but roughly "behaves (especially off the training distribution) as if it's iterating over solutions", like Abram Demski's term selection.

I disagree that performing search is central to human capabilities relative to other species. The cultural intelligence hypothesis seems much more plausible: humans are successful because our language and ability to mimic allow us to accumulate knowledge and coordinate at massive scale across both space and time. Not because individual humans are particularly good at thinking or optimizing or performing search. (Not sure what the implications of this are for AI).

You're right though, I didn't say much about alternative algorithms other than point vaguely in... (read more)

Lauro Langosco
(Note that I'm not making a claim about how search is central to human capabilities relative to other species; I'm just saying search is useful in general. Plausibly also for other species, though it is more obvious for humans) From my POV, the "cultural intelligence hypothesis" is not a counterpoint to importance of search. It's obvious that culture is important for human capabilities, but it also seems obvious to me that search is important. Building printing presses or steam engines is not something that a bundle of heuristics can do, IMO, without gaining those heuristics via a long process of evolutionary trial-and-error. And it seems important that humans can build steam engines without generations of breeding better steam-engine-engineers. Re AlphaStar and AlphaZero: I've never played Starcraft, so I don't have good intuitions for what capabilities are needed. But on the definitions of search that I use, the AlphaZero policy network definitely performs search. In fact out of current systems it's probably the one that most clearly performs search! ...Now I'm wondering whether our disagreement just comes from having different definitions of search in mind. Skimming your other comments above, it seems like you take a more narrow view of search = literally iterating through solutions and picking a good one. This is fine by me definitionally, but I don't think the fact that models will not learn search(narrow) is very interesting for alignment, or has the implications that you list in the post? Though ofc I might still be misunderstanding you here.
Vladimir_Nesov
This suggests that the choice of decision theory that amplifies a decision making model (in the sense of IDA/HCH, or just the way MCTS is used in training AlphaZero) might influence robustness of its behavior far off-distribution, even if its behavior around the training distribution is not visibly sensitive to choice of decision theory used for amplification. Though perhaps this sense of "robustness" is not very appropriate, and a better one should be explicitly based on reflection/extrapolation from behavior in familiar situations, with the expectation that all models fail to be robust sufficiently far off-distribution (in the crash space), and new models must always be prepared in advance of going there.
Noosphere89
My thinking is that one of the biggest reasons humans managed to dominate is basically 3x more brainpower combined with ways to get rid of the heat necessary to support brainpower, which requires sweating all over the body. Essentially it's the scaling hypothesis applied to biological systems. And since intelligence can be used for any goal, it's not surprising that intelligence's main function was cultural.

I agree that A* and gradient descent are central examples of search; for realistic problems these algorithms typically evaluate the objective on millions of candidates before returning an answer.

In contrast, human problem solvers typically do very little state evaluation - perhaps evaluating a few dozen possibilities directly, and relying (as you said) on abstractions and analogies instead. I would call this type of reasoning "not very search-like".

On the far end we have algorithms like Gauss-Jordan elimination, which just compute the optimal solution dire... (read more)
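As a toy illustration of the two ends of that continuum (my own example, not part of the original exchange): the same one-dimensional problem solved by enumerate-and-evaluate versus by directly exploiting the problem's structure, in the spirit of Gauss-Jordan elimination.

```python
import numpy as np

# Toy objective: find x minimizing f(x) = (3x - 12)**2.
f = lambda x: (3 * x - 12) ** 2

# "Search-like" end of the continuum: enumerate candidates and evaluate
# the objective on each one, keeping the best.
candidates = np.linspace(-100, 100, 2_000_001)
x_search = candidates[np.argmin(f(candidates))]   # ~4.0, after ~2M evaluations

# "Not search-like" end: compute the answer directly from the problem's
# structure (the analogue of Gauss-Jordan elimination), zero evaluations.
x_direct = 12 / 3                                  # 4.0

print(x_search, x_direct)
```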

johnswentworth
My ontology indeed has search and a narrow notion of optimization as approximately synonyms; they differ only somewhat in type signature and are easily interchangeable. Conceptually, both take in an objective, and return something which scores highly on the objective. (This is narrower than e.g. Flint's notion of "optimization"; in that ontology it might be called a "general-purpose optimizer" instead.) Anyway, insofar as any of this is relevant to the arguments for mesa-optimization, it's the notion of search/optimization as general problem solving which applies there.

See my answer to tailcalled:

a program is more "search-like" if it is enumerating possible actions and evaluating their consequences

I'm curious if you mean something different by search when you say that we're likely to find policies that look like an "explicit search process + simple objective(s)"

johnswentworth
Yeah, that's definitely not what I mean by search (nor what I think others mean by search, in the context of AI and inner agents).

Roughly speaking, a general search process is something which takes in a specification of some problem or objective (from a broad class of possible problems/objectives), and returns a plan which solves the problem or scores well on the objective. For instance, a gradient descent algorithm takes in an objective, and returns a point which scores well on the objective, for a very broad class of possible objectives; gradient descent is therefore a search method.

Enumerating possible actions and evaluating their consequences is one way to do general search, but it's wildly inefficient; I would typically refer to that as "brute force search". Gradient descent does better by leveraging backprop and gradients; approximately none of the algorithmic work done by gradient descent comes from direct evaluation of the consequences of actions. And there are many other tricks one can use too - like memoization on subsearches, or A*-style heuristic search, or (one meta-level up from A*) relaxation-based methods to discover heuristics. The key point is that these tricks are all very general purpose: they work on a very wide variety of search problems, and therefore produce general-purpose search algorithms which are more efficient than brute force (at least on realistic problems).

More advanced general-purpose search methods seem to rely relatively little on enumerating possible actions and evaluating their consequences. By the time we get to human-level search capabilities, we see human problem-solvers spend most of their effort on nontrivial problems thinking about subproblems, abstractions and analogies rather than thinking directly about particular solutions.

Agreed that "search" is not a binary but more like a continuum, where we might call a program more "search-like" if it is enumerating possible actions and evaluating their consequences, and less "search-like" if it is directly mapping representations of inputs to actions. The argument in this post is that gradient descent (unlike evolution, and unlike human programmers) doesn't select much for "search-like" programs. If we take depth-first search as a central example of search, and a thermostat as the paradigmatic non-search program, gradient descent will ... (read more)

Yeah I think you need some additional assumptions on the models and behaviors, which you're gesturing at with the "matching behaviors" and "inexact descriptions". Otherwise it's easy to find counterexamples: imagine the model is just a single N x N matrix of parameters; then in general there is no shorter description length of the behavior than the model itself.

Yes there are non-invertible (you might say "simpler") behaviors which each occupy more parameter volume than any given invertible behavior, but random matrices are almost certainly invertible so the actual optimization pressure towards low description length is infinitesimal.
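A quick numeric check of the "random matrices are almost certainly invertible" point (a sketch I'm adding here; the matrix size and trial count are arbitrary):

```python
import numpy as np

# Quick check: random N x N matrices are essentially never singular,
# so the "simpler" non-invertible behaviors occupy negligible parameter volume.
rng = np.random.default_rng(0)
N, trials = 20, 10_000
num_singular = sum(
    np.linalg.matrix_rank(rng.normal(size=(N, N))) < N for _ in range(trials)
)
print(num_singular)  # expected: 0
```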

Ah I think that's the crux - I believe the overparametrized regime finds generalizing models because gradient descent finds functions that have low function norm, not low description length. I forget the paper that showed this for neural nets but here's a proof for logistic regression.
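Here's a minimal sketch of that point in the setting where it's easiest to verify - overparametrized linear regression rather than the logistic regression case the linked proof covers - where gradient descent initialized at zero converges to the minimum-L2-norm interpolant. The dimensions and step size below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # overparametrized: more parameters than data points
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

# Plain gradient descent on squared loss, initialized at zero.
w = np.zeros(d)
for _ in range(50_000):
    w -= 1e-3 * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y   # the minimum-L2-norm interpolating solution

print(np.abs(X @ w - y).max())           # ~0: both solutions fit the data exactly
print(np.linalg.norm(w - w_min_norm))    # ~0: gradient descent found the min-norm one
```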

Vladimir_Nesov
I'm thinking of a setting where shortest descriptions of behavior determine sets of models that exhibit matching behavior (possibly in a coarse-grained way, so distances in behavior space are relevant). This description-model relation could be arbitrarily hard to compute, so it's OK for shortest descriptions to be shortest programs or something ridiculous like that. This gives a partition of the model/parameter space according to the mapping from models to shortest descriptions of their behavior. I think shorter shortest descriptions (simpler behaviors) fill more volume in the parameter/model space, have more models whose behavior is given by those descriptions (this is probably the crux; e.g. it's false if behaviors are just models themselves and descriptions are exact). Gradient descent doesn't interact with descriptions or the description-model relation in any way, but since it selects models ~based on behavior, and starts its search from a random point in the model space, it tends to select behaviors from larger elements of the partition of the space of models that correspond to simpler behaviors with shorter shortest descriptions. This holds at every step of gradient descent, not just when it has already learned something relevant. The argument is that whatever behavior is selected, it is relatively simple, compared to other behaviors that could've been selected by the same selection process. Further training just increases the selection pressure.

Agreed on "explicit search" being a misleading phrase, I'll replace it with just "search" when I'm referring to learned programs.

small descriptions give higher parameter space volume, and so the things we find are those with short descriptions

I don't think I understand this. GPT-3 is a thing we found, and it has 175B parameters; what is the short description of it?

Vladimir_Nesov
I mean relatively short, as in the argument for why overparametrized models generalize. They still do get to ~memorize all training data, but anything else comes at a premium, reduces probability of getting selected for models whose behavior depends on those additional details. (This use of "short" as meaning "could be 500 gigabytes" was rather sloppy/misleading of me, in a comment about sloppy/misleading use of words...)

Thinking about this more, I think gradient descent (at least in the modern regime) probably doesn't select for inner search processes, because it's not actually biased towards low Kolmogorov complexity. More in my standalone post, and here's a John Maxwell comment making a similar point.

Agreed with John, with the caveat that I expect search processes + simple objectives to only emerge from massively multi-task training. If you're literally training an AI just on smiling, TurnTrout is right that "a spread of situationally-activated computations" is more likely since you're not getting any value from the generality of search.

The Deep Double Descent paper is a good reference for why gradient descent training in the overparametrized regime favors low complexity models, though I don't know of explicit evidence for the conjecture that "explicit... (read more)


I love the framing of outer alignment as a data quality problem!

As an illustrative data point, the way Google generates "alignment data" for its search evals is by employing thousands of professional raters and training them to follow a 200-page handbook (!) that operationalizes the concept of a "good search result".

Intuitively speaking, the underlying problem is that aligned goals need to generalize robustly enough to block AGIs from the power-seeking strategies recommended by instrumental reasoning, which will become much more difficult as their instrumental reasoning skills improve.

This is the clearest justification of "capabilities generalize further than alignment" I've seen, bravo!

My main disagreement with the post is with the claim that goal misgeneralization comes after situational awareness. Weak versions of goal misgeneralization are already happening all the time, fro... (read more)

Answer by Ivan Vendrov

Yes, definitely possible.

Saying the quiet part out loud: VC for both research and product startups runs on trust. To get funding you will mostly likely need someone trusted to vouch for you, and/or to have legible, hard-to-fake accomplishments in a related field, that obviates the need for trust. (Writing up a high quality AI alignment research agenda could be such an accomplishment!). If you DM me with more details about your situation, I might be able to help route you.

I don't think any factored cognition proponents would disagree with

Composing interpretable pieces does not necessarily yield an interpretable system.

They just believe that we could, contingently, choose to compose interpretable pieces into an interpretable system. Just like we do all the time with

  • massive factories with billions of components, e.g. semiconductor fabs
  • large software projects with tens of millions of lines of code, e.g. the Linux kernel
  • military operations involving millions of soldiers and support personnel

Figuring out how to turn interpretabi

... (read more)
Antoine de Scorraille
Do we really have such good interpretations for such examples? It seems to me that we have big problems in the real world because we don't. We do have very high-level interpretations, but not enough to have solid guarantees. After all, we have a very high-level trivial interpretation of our ML models: they learn! The challenge is not just to have clues, but clues that are relevant enough to address safety concerns in relation to impact scale (which is the unprecedented feature of the AI field).

Agreed on all points! One clarification is that large founder-led companies, including Facebook, are all moral mazes internally (i.e. from the perspective of the typical employee); but their founders often have so much legitimacy that their external actions are only weakly influenced by moral maze dynamics.

I guess that means that if AGI deployment is very incremental - a sequence of small changes to many different AI systems, that only in retrospect add up to AGI - moral maze dynamics will still be paramount, even in founder-led companies.

A Ray
I think that's right, but also the moral maze will be mediating the information and decision-making support that's available to the leadership, so they're not totally immune from the influences.

basically every company eventually becomes a moral maze

Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes.

Facebook's pivot to the "metaverse", for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was o... (read more)

A Ray
Agree that founders are a bit of an exception. Actually that's a bit in the longer version of this when I talk about it in person. Basically: "The only people who are at the very top of large tech companies are either founders or those who were able to climb to the tops of moral mazes." So my strategic corollary to this is that it's probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often. In the case of facebook, even in the face of all of their history of actions, I think on the margin I'd prefer the founder to the median replacement to be leading the company. (Edit: I don't think founders remaining at the head of a company is evidence that the company isn't a moral maze. Also I'm not certain I agree that facebook's pivot couldn't have been done by a moral maze.)

In support of this, I remember Geoff Hinton saying at his Turing award lecture that he strongly advised new grad students not to read the literature before trying, for months, to solve the problem themselves.

Two interesting consequences of the "unique combination of facts" model of invention:

  • You may want to engage in strategic ignorance: avoid learning about certain popular subfields or papers, in the hopes that this will let you generate a unique idea that is blocked for people who read all the latest papers and believe whatever the modern equivalent of "
... (read more)
Ben
I agree with these. Related, if you work in a team I think it is far more important to read papers no one else in your team has read than papers that everyone in the team has read. Put that way it is obvious, but many research groups welcome new members with a well-meaning folder of 30+ PDFs which they claim will be useful.