We think it works like this
Who is "we"? Is it:
Also, this definitely deserves to be made into a high-level post, if you end up having the time/energy/interest to make one.
randomly
As an aside (that's still rather relevant, IMO), it is a huge pet peeve of mine when people use the word "randomly" in technical or semi-technical contexts (like this one) to mean "uniformly at random" instead of just "according to some probability distribution." I think the former elevates and reifies a way-too-common confusion and draws attention away from the important upstream generator of disagreements, namely how exactly the constitution is sampled.
I wouldn't normally have said this, but given your obvious interest in math, it's worth pointing out that the answers to these questions you have raised naturally depend very heavily on what distribution we would be drawing from. If we are talking about, again, a uniform distribution from "the design space of minds-in-general" (so we are just summoning a "random" demon or shoggoth), then we might expect one answer. If, however, the search is inherently biased towards a particular submanifold of that space, because of the very nature of how these AIs are trained/fine-tuned/analyzed/etc., then you could expect a different answer.
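To make the dependence on the distribution concrete, here is a purely illustrative sketch (with entirely made-up numbers): model each "mind" as a single alignment-relevant trait in [0, 1], and compare sampling uniformly over the whole design space against sampling from a distribution tilted toward a particular submanifold. The Beta(8, 2) tilt is an arbitrary stand-in for training-induced bias, not a claim about actual training dynamics.

```python
import random

# Toy model: each "mind" is reduced to one alignment-relevant trait in [0, 1].
# All numbers here are invented purely for illustration.
random.seed(0)

def sample_uniform():
    # "Summon a random shoggoth": uniform over the entire design space.
    return random.uniform(0.0, 1.0)

def sample_training_biased():
    # The search is biased toward a submanifold of mind-space:
    # modeled here as a Beta(8, 2) tilt toward high-trait minds.
    return random.betavariate(8, 2)

N = 100_000
uniform_mean = sum(sample_uniform() for _ in range(N)) / N
biased_mean = sum(sample_training_biased() for _ in range(N)) / N

print(f"uniform mean trait: {uniform_mean:.2f}")  # ~0.50
print(f"biased mean trait:  {biased_mean:.2f}")   # ~0.80
```

The point is only that "what should we expect from a sampled AI?" is underdetermined until you specify the sampling distribution; the two schemes above give very different answers to the same question.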
One of the advantages of remaining agnostic comes from an argument users put forth in the comment sections on this very site way back in the age of the Sequences (in response to the Doublethink Sequence; I can look up the specific links if people really want me to) for why it's not necessarily instrumentally rational for limited beings like humans to actually believe in the Litany of Tarski. If you are in a precarious social situation, in which retaining status/support/friends/resources is contingent on successfully signaling to your in-group that you maintain faith in their core teachings, it simply doesn't suffice to say "acquire all the private truth through regular means and don't talk about/signal publicly the stuff that would be most dangerous to you," because you don't get complete control over what you signal.
If you learn that the in-group is wrong about some critical matter, and you understand that in-group members realizing you no longer agree with them will result in harm to you (directly, or through your resources being cut off), your only option is to act (to some extent) deceptively: to take on the role, QuirrellMort-style, of somebody who does not have access to the information you have actually stumbled upon, and to pretend to be just another happy & clueless member of the community.
This is capital-H Hard. Lying (or even something smaller-scale, like lesser deceptions), when done consistently and routinely, to people you consider(ed) your family/friends/acquaintances, is very hard for the vast majority of people. For straightforward evolutionary reasons, we have evolved to be really good at detecting when one of our own is not being fully forthcoming. You can bypass this obstacle if the number of interactions you have is small, or if, as is usually the case in modern life when people get away with lies, nobody actually cares about the lie and it's all just a game of make-believe where you merely have to "utter the magic words." But when it's not a game, when people do care about you honestly signaling your continued adherence to the group's beliefs and epistemology, you're in big trouble.
Indeed, by far the most efficient way of convincing others of your bullshit on a regular basis is to convince yourself first, and by putting yourself in a position where you must do the former, you are increasing the likelihood of the latter with every passing day. Quite the opposite of what you'd like to see happen, if you care about truth-seeking to any large extent.
(addendum: admittedly, this doesn't answer the question fully, since it doesn't deal with the critical distinction between agnosticism and explicit advocacy, but I think it does get at something reasonably important in the vicinity of it anyway)
There's an alignment-related problem, the problem of defining real objects. Relevant topics: environmental goals; task identification problem; "look where I'm pointing, not at my finger"; Eliciting Latent Knowledge.
Another highly relevant post: The Pointers Problem.
So, where are the Knuths of the modern era? Why is modern AI dominated by the Lorem Epsoms of the world? Where is the craftsmanship? Why are our AI tools optimized for seeming good, rather than being good?
[2] Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [king] vector, subtract the [male] vector, add the [female] vector, and get out something close to the [queen] vector? That was cool! Where's the stuff like that these days?
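For reference, the analogy arithmetic that footnote describes can be sketched with hand-made toy vectors (nothing here is learned; the embedding, the two semantic axes, and the word list are all invented for illustration, unlike the real word2vec embeddings):

```python
# Toy 2-D "embedding": axis 0 ~ gender, axis 1 ~ royalty. Hand-picked, not learned.
emb = {
    "male":   [1.0, 0.0],
    "female": [-1.0, 0.0],
    "king":   [1.0, 1.0],    # male + royal
    "queen":  [-1.0, 1.0],   # female + royal
    "apple":  [0.0, -1.0],   # distractor word
}

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]

def nearest(vec, exclude):
    # Nearest neighbour by Euclidean distance, skipping the query words.
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return min((w for w in emb if w not in exclude),
               key=lambda w: dist(emb[w], vec))

# king - male + female ≈ queen
target = add(sub(emb["king"], emb["male"]), emb["female"])
result = nearest(target, exclude={"king", "male", "female"})
print(result)  # -> queen
```

In the real learned embeddings the analogy only holds approximately and the nearest-neighbour step matters; here it holds exactly by construction, which is all the sketch is meant to show.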
I'm a bit confused by your confusion, and by the fact that your post does not contain what seems to me like the most straightforward explanation of these phenomena. An explanation that I am almost fully certain you are aware of, and which seems to be almost universally agreed upon by those interested (at any level) in interpretability in ML.
Namely: starting in the 2010s, it happened to be the case (for a ton of historically contingent reasons) that top AI companies (at first, followed by other ML hubs and researchers) realized the bitter lesson is basically correct: attempts to hard-code human knowledge or intuition into frontier models ultimately always harm their long-term performance compared to "literally just scale the model with more data and compute." This led experts and top engineers to focus on figuring out scaling laws, improving the quality and availability of data (perhaps through synthetic generation methods), building better end-user products through fine-tuning, RLHF, etc., instead of the older GOFAI-style work of trying to figure out at a deeper level what is going on inside the model.
Another way of saying this is that top researchers and companies ultimately stumbled on an AI paradigm which increased capabilities significantly more than had been achievable previously, but at the cost of strongly decoupling "capability improvements" and "interpretability improvements" as distinct things that researchers and engineers could focus on. It's not that capability and interpretability were necessarily tightly correlated in the past; that is not the claim I am making. Rather, I am saying that in the pre-(transformer + RL) era, the way you generated improvements in your models/AI was by figuring out specific issues and analyzing them deeply to find out how to get around them, whereas now, a far simpler, easier, less insight-intensive approach became available: literally just scaling up the model with more data and compute.
So the basic point is that you no longer see all this cool research on the internal representations that models generate of high-dimensional data like word embeddings (such as the word2vec stuff you are referring to in the second footnote) because you no longer have nearly as much of a need for these insights in order to improve the capabilities/performance of the AI tools currently in use. It's fundamentally an issue with demand, not with supply. And the demand from the interpretability-focused AI alignment community is just nowhere close to large enough to bridge the gap and cover the loss generated by the shift in paradigm focus and priorities among the capabilities/"normie" AI research community.
Indeed, the notion that nowadays, the reason you no longer have deep thinkers who try to figure out what is going on or are "motivated by reasons" in how they approach these issues, is somehow because "careful thinkers read LessWrong and decided against contributing to AI progress," seems... rather ridiculous to me? It's not like I enjoy responding to an important question that you are asking with derision in lieu of a substantive response, but... I mean, the literal authors of the word2vec paper you cited were AI (capabilities) researchers working at top companies, not AI alignment researchers! Sure, some people like Bengio and Hofstadter (less relevant in practical terms) who are obviously not "LARP-ing impostors" in Wentworth's terminology have made the shift from capabilities work to trying to raise public awareness of alignment/safety/control problems. But the vast majority (according to personal experience, general impressions, as well as the current state of the discourse on these topics) absolutely have not, and since they were the ones generating the clever insights back in the day, of course it makes sense that the overall supply of these insights has gone down.
I just really don't see how it could be the case that "people refuse to generate these insights because they have been convinced by AI safety advocates that it would dangerously increase capabilities and shorten timelines" and "people no longer generate these insights as much because they are instead focusing on other tasks that improve model capabilities more rapidly and robustly, given the shifted paradigm" are two hypotheses that can be given similar probabilities in any reasonable person's mind. The latter should be at least a few orders of magnitude more likely than the former, as I see it.
some people say that "winning is about not playing dominated strategies"
I do not believe this statement. As in, I do not currently know of a single person, associated either with LW or with decision-theory academia, who says "not playing dominated strategies is entirely action-guiding." So, as Raemon pointed out, "this post seems like it’s arguing with someone but I’m not sure who."
In general, I tend to mildly disapprove of phrases like "a widely-used strategy," "we often encounter claims," etc., without any direct citations of the individuals who are purportedly making these mistakes. If the strategy really were that widely used, surely it would be trivial for the authors to quote a few examples off the top of their head, no? What does it say about them that they didn't?
I think it's not quite as clear as needing to shut down all other AGI projects or we're doomed; a small number of AGIs under control of different humans might be stable with good communication and agreements, at least until someone malevolent or foolish enough gets involved.
Realistically, in order to have a reasonable degree of certainty that this state can be maintained for more than a trivial amount of time, this would, at the very least, require a hard ban on open-source AI, as well as international agreements to strictly enforce transparency and compute restrictions, with the direct use of force if need be, especially if governments get much more involved in AI in the near-term future (which I expect will happen).
Do you agree with this, as a baseline?
Does this plan necessarily factor through using the intent-aligned AGI to quickly commit some sort of pivotal act that flips the gameboard and prevents other intent-aligned AGIs from being used malevolently by self-interested or destructive (human) actors to gain a decisive strategic advantage? After all, it sure seems less than ideal to find yourself in a position where you can solve the theoretical parts of value alignment,[1] but you cannot implement that in practice because control over the entire future light cone has already been permanently taken over by an AGI intent-aligned to someone who does not care about any of your broadly prosocial goals...
In so far as something like this even makes sense, which I have already expressed my skepticism of many times, but I don't think I particularly want to rehash this discussion with you right now...
You've gotten a fair number of disagree-votes thus far, but I think it's generally correct to say that many (arguably most) prediction markets still currently lack the trading volume necessary to justify confidence that EMH-style arguments mean inefficiencies will be rapidly corrected. To a large extent, it's fair to say this is due to over-regulation and attempts at outright banning (perhaps the relatively recent 5th Circuit ruling in favor of PredictIt against the Commodity Futures Trading Commission is worth looking at as a microcosm of how these legal battles are playing out in today's day and age).
Nevertheless, the standard theoretical argument that inefficiencies in prediction markets are exploitable and thus lead to a self-correcting mechanism still seems entirely correct, as Garrett Baker points out.
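A minimal sketch of that self-correction argument, with stylized numbers: a binary contract pays $1 if the event occurs and $0 otherwise, so whenever the market price diverges from the true probability, buying (or selling) the contract has positive expected value, and that profit motive is exactly what pushes the price back toward the truth.

```python
# Stylized numbers, purely for illustration of the exploitability argument.
def expected_profit_per_share(true_prob, market_price):
    # Buying one share at `market_price`: win (1 - price) with probability p,
    # lose `price` with probability (1 - p).
    return true_prob * (1 - market_price) - (1 - true_prob) * market_price

# Mispriced market: true probability 60%, contract trading at 50 cents.
print(round(expected_profit_per_share(0.60, 0.50), 2))  # 0.1 per share

# Once the price has been bid up to the true probability, the edge vanishes.
print(round(expected_profit_per_share(0.60, 0.60), 2))  # 0.0
```

The low-volume caveat above is precisely about the gap between this idealized mechanism and practice: with thin markets, transaction costs, and capped position sizes, the expected profit may be too small to attract the capital needed to actually move the price.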
[Coming at this a few months late, sorry. This comment by @Steven Byrnes sparked my interest in this topic once again]
Ngl, I find everything you've written here a bit... baffling, Seth. Your writing in particular, and your exposition of your thoughts on AI risk in general, do not use evolutionary analogies, but this only means that posts and comments criticizing analogies with evolution (sample: 1, 2, 3, 4, 5, etc.) are just not aimed at you and your reasoning. I greatly enjoy reading your writing and pondering the insights you bring up, but you are simply not even close to the most publicly-salient proponent of "somewhat high P(doom)" among the AI alignment community. It makes perfect sense from the perspective of those who disagree with you (or other, more hardcore "doomers") on the bottom-line question of AI to focus their public discourse primarily on responding to the arguments brought up by the subset of "doomers" who are most salient and also most extreme in their views, namely the MIRI-cluster centered around Eliezer, Nate Soares, and Rob Bensinger.
And when you turn to MIRI and the views that its members have espoused on these topics, I am very surprised to hear that "The arguments for misgeneralization/mis-specification stand on their own" and are not ultimately based on analogies with evolution.
But anyway, to hopefully settle this once and for all, let's go through all the examples that pop up in my head immediately when I think of this, shall we?
From the section on inner & outer alignment of "AGI Ruin: A List of Lethalities", by Yudkowsky (I have removed the original emphasis and added my own):
From "A central AI alignment problem: capabilities generalization, and the sharp left turn", by Nate Soares, which, by the way, quite literally uses the exact phrase "The central analogy"; as before, emphasis is mine:
From "The basic reasons I expect AGI ruin", by Rob Bensinger:
From "Niceness is unnatural", by Nate Soares:
From "Superintelligent AI is necessary for an amazing future, but far from sufficient", by Nate Soares:
From the Eliezer-edited summary of "Ngo and Yudkowsky on alignment difficulty", by... Ngo and Yudkowsky:
From "Comments on Carlsmith's “Is power-seeking AI an existential risk?"", by Nate Soares:
From "Soares, Tallinn, and Yudkowsky discuss AGI cognition", by... well, you get the point:
From "Humans aren't fitness maximizers", by Soares:
From "Shah and Yudkowsky on alignment failures", by the usual suspects:
From the comments on "Late 2021 MIRI Conversations: AMA / Discussion", by Yudkowsky:
From Yudkowsky's appearance on the Bankless podcast (full transcript here):
At this point, I'm tired, so I'm logging off. But I would bet a lot of money that I can find at least 3x the number of these examples if I had the energy to. As Alex Turner put it, it seems clear to me that, for a very high portion of "classic" alignment arguments about inner & outer alignment problems, at least in the form espoused by MIRI, the argumentative bedrock is ultimately based on little more than analogies with evolution.