If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.
My main "claims to fame":
Have you seen A Master-Slave Model of Human Preferences? To summarize, I think every human is trying to optimize for status, consciously or subconsciously, including those who otherwise fit your description of an idealized platonic researcher. For example, I'm someone who has (apparently) "chosen ultimate (intellectual) freedom over all else", having done all of my research outside of academia or any formal organizations, but on reflection I think I was striving for status (prestige) as much as anyone; it was just that my subconscious picked a different strategy than most (which eventually proved quite successful).
at the end of the day, what’s even the point of all this?
I think it's probably a result of most humans not being very strategic, or their subconscious strategizers not being very competent. Or zooming out, it's also a consequence of academia being suboptimal as an institution for leveraging humans' status and other motivations to produce valuable research. That in turn is a consequence of our blind spot for recognizing status as an important motivation/influence for every human behavior, which itself is because not explicitly recognizing status motivation is usually better for one's status.
I'm still using it for this purpose, but don't have a good sense of how much worse it is compared to pre-0325. However, I'm definitely very wary of the sycophancy and overall bad judgment. I'm only using them to point out potential issues I may have overlooked, not e.g. to judge whether a draft is ready to post, or whether some potential issue is a real one that needs to be fixed. All the models I've tried seem to err a lot in both directions.
But in the end, those plans are not opposed to each other.
I think they are somewhat opposed, due to signaling effects: If you're working on Plan 2 only, that signals to the general public or non-experts that you think the risks are manageable/acceptable. And if a lot of people are working on Plan 2, it gives the people who want to race, or who don't want to pause/stop, ammunition to say "Look at all these AI safety experts working on solving AI safety. If the risks are really as high as the Plan 1 people say, wouldn't they be calling for a pause/stop too instead of working on technical problems?"
I wonder whether, if you framed your concerns in this concrete way, you'd convince more people in alignment to devote attention to these issues? As compared to speaking more abstractly about solving metaethics or metaphilosophy.
I'm not sure. It's hard for me to understand other humans a lot of the time; for example, these concerns (both concrete and abstract) seem really obvious to me, and it mystifies me why so few people share them (at least to the extent of trying to do anything about them, like writing a post to explain the concern, spending time trying to solve the relevant problems, or citing these concerns as another reason for an AI pause).
Also I guess I did already talk about the concrete problem, without bringing up metaethics or metaphilosophy, in this post.
(Of course, you may not think that's a helpful alternative, if you think solving metaethics or metaphilosophy is the main goal, and other concrete issues will just continue to show up in different forms unless we do it.)
I think a lot of people in AI alignment think they already have a solution to metaethics (including Eliezer, who explicitly said this in his metaethics sequence), which is something I'm trying to talk them out of, because assuming a wrong metaethical theory in one's alignment approach is likely to make the concrete issues worse instead of better.
For instance, I'm also concerned as an anti-realist that giving people their "aligned" AIs to do personal reflection will likely go poorly and lead to outcomes we wouldn't want for the sake of those people or for humanity as a collective.
This illustrates the phenomenon I talked about in my draft, where people in AI safety would confidently state "I am X" or "As an X", where X is some controversial metaethical position that they shouldn't be very confident in, whereas they're more likely to avoid overconfidence in other areas of philosophy like normative ethics.
I take your point that people who think they've solved metaethics can also share my concrete concern about possible catastrophe caused by bad reflection among some or all humans, but as mentioned above, I'm pretty worried that if their assumed solution is wrong, they're likely to contribute to making the problem worse instead of better.
BTW, are you actually a full-on anti-realist, or do you actually take one of the intermediate positions between realism and anti-realism? (See my old post Six Plausible Meta-Ethical Alternatives for a quick intro/explanation.)
While I appreciate this work being done, it seems a very bad sign for our world/timeline that the very few people with both philosophy training and an interest in AI x-safety are using their time/talent to do forecasting (or other) work instead of solving philosophical problems in AI x-safety, with Daniel Kokotajlo being another prominent example.
This implies one of two things: Either they are miscalculating the best way to spend their time, which indicates bad reasoning or intuitions even among humanity's top philosophers (i.e., those who have at least realized the importance of AI x-risk and are trying to do something about it). Or they actually are the best people (in a comparative advantage sense) available to work on these other problems, in which case the world must be on fire, and they're having to delay working on extremely urgent problems that they were trained for, to put out even bigger fires.
(Cross-posted to LW and EAF.)
Strongly agree that metaethics is a problem that should be central to AI alignment, but is being neglected. I actually have a draft about this, which I guess I'll post here as a comment in case I don't get around to finishing it.
I often talk about humans or AIs having to solve difficult philosophical problems as part of solving AI alignment, but what philosophical problems exactly? I'm afraid that some people might have gotten the impression that they're relatively "technical" problems (in other words, problems whose solutions we can largely see the shapes of, but whose technical details still need to be worked out) like anthropic reasoning and decision theory, which we might reasonably assume or hope that AIs can help us solve. I suspect this is because, due to their relatively "technical" nature, they're discussed more often on LessWrong and the AI Alignment Forum, unlike other equally or even more relevant philosophical problems, which are harder to grapple with or "attack". (I'm also worried that some are under the mistaken impression that we're closer to solving these "technical" problems than we actually are, but that's not the focus of the current post.)
To me, the really central problems of AI alignment are metaethics and metaphilosophy, because these problems are implicated in the core question of what it means for an AI to share a human's (or a group of humans') values, or what it means to help or empower a human (or group of humans). I think one way that the AI alignment community has avoided this issue (even those thinking about longer-term problems or scalable solutions) is by assuming that the alignment target is someone like themselves, i.e. someone who clearly understands that they are and should be uncertain about what their values are or should be, or is at least willing to question their moral beliefs, and is eager or at least willing to use careful philosophical reflection to resolve their value confusion/uncertainty. To help or align to such a human, the AI perhaps doesn't need an immediate solution to metaethics and metaphilosophy, and can instead just empower the human in relatively commonsensical ways, like keeping them safe and gathering resources for them, and allowing them to work out their own values in a safe and productive environment.
But what about the rest of humanity who seemingly are not like that? From an earlier comment:
I've been thinking a lot about the kind [of value drift] quoted in Morality is Scary. The way I would describe it now is that human morality is by default driven by a competitive status/signaling game, where often some random or historically contingent aspect of human value or motivation becomes the focal point of the game, and gets magnified/upweighted as a result of competitive dynamics, sometimes to an extreme, even absurd degree.
(Of course from the inside it doesn't look absurd, but instead feels like moral progress. One example of this that I happened across recently is filial piety in China, which became more and more extreme over time, until someone cutting off a piece of their flesh to prepare a medicinal broth for an ailing parent was held up as a moral exemplar.)
Related to this is my realization that the kind of philosophy you and I are familiar with (analytical philosophy, or more broadly careful/skeptical philosophy) doesn't exist in most of the world and may only exist in Anglophone countries as a historical accident. There, about 10,000 practitioners exist who are funded but ignored by the rest of the population. To most of humanity, "philosophy" is exemplified by Confucius (morality is everyone faithfully playing their feudal roles) or Engels (communism, dialectical materialism). To us, this kind of "philosophy" is hand-waving and making things up out of thin air, but to them, philosophy is learned from a young age and goes unquestioned. (Or if it is questioned, they're liable to jump to some other equally hand-wavy "philosophy", like China's move from Confucius to Engels.)
What are the real values of someone whose apparent values (stated and revealed preferences) can change in arbitrary and even extreme ways as they interact with other humans in ordinary life (i.e., not due to some extreme circumstances like physical brain damage or modification), and who doesn't care about careful philosophical inquiry? What does it mean to "help" someone like this? To answer this, we seemingly have to solve metaethics (generally understand the nature of values) and/or metaphilosophy (so the AI can "do philosophy" for the alignment target, "doing their homework" for them). The default alternative (assuming we solve other aspects of AI alignment) seems to be to still empower them in straightforward ways, and hope for the best. But I argue that giving people who are unreflective and prone to value drift god-like powers to reshape the universe and themselves could easily lead to catastrophic outcomes on par with takeover by unaligned AIs, since in both cases the universe becomes optimized for essentially random values.
A related social/epistemic problem is that, compared to certain other areas of philosophy (such as decision theory and object-level moral philosophy), people, including alignment researchers, just seem more confident about their own preferred solutions to metaethics, and more comfortable assuming their own preferred solution is correct as part of solving other problems, like AI alignment or strategy. (E.g., moral anti-realism is true, therefore empowering humans in straightforward ways is fine, since the alignment target can't be wrong about their own values.) This may also account for metaethics not being viewed as a central problem in AI alignment (i.e., some people think it's already solved).
I'm unsure about the root cause(s) of confidence/certainty in metaethics being relatively common in AI safety circles. (Maybe it's because in other areas of philosophy, the various proposed solutions are more obviously unfinished or problematic, e.g. the well-known problems with utilitarianism.) I've previously argued that metaethical confusion/uncertainty is normative at this point, and will also point out now that, from a social perspective, there is apparently wide disagreement about these problems among philosophers and alignment researchers, so how can it be right to assume some controversial solution to metaethics (which every proposed solution is at this point) as part of a specific AI alignment or strategy idea?
I want to highlight a point I made in an EAF thread with Will MacAskill, which seems novel or at least underappreciated. For context, we're discussing whether the risk-vs-time (in AI pause/slowdown) curve is concave or convex, or in other words, whether the marginal value of an AI pause increases or decreases with pause length. Here's the whole comment, with the specific passage bolded:
Whereas it seems like maybe you think it's convex, such that smaller pauses or slowdowns do very little?
I think my point in the opening comment does not logically depend on whether the risk-vs-time (in pause/slowdown) curve is convex or concave[1], but it may be a major difference in how we're thinking about the situation, so thanks for surfacing this. In particular, I see 3 large sources of convexity (the sense in which I'm using convex/concave is spelled out in the brief sketch below):
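To pin down the terminology, here is a minimal sketch; the risk-reduction framing and the notation V(t) are assumptions of the sketch, not something from the original thread:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Assumed framing (mine, not from the thread): V(t) is the reduction in x-risk
% achieved by a pause/slowdown of length t, with V(0) = 0.
%
% "Convex" (as used in this thread): V''(t) > 0, so the marginal value V'(t)
% increases with pause length, and short pauses accomplish relatively little
% on their own.
% "Concave": V''(t) < 0, so the marginal value V'(t) decreases with pause
% length, and even a short pause captures a disproportionate share of the benefit.
\[
  \text{convex: } V''(t) > 0 \;\Longrightarrow\; V'(t) \text{ increasing in } t,
  \qquad
  \text{concave: } V''(t) < 0 \;\Longrightarrow\; V'(t) \text{ decreasing in } t.
\]
\end{document}
```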
Like: putting in the schlep to RL AI and create scaffolds so that we can have AI making progress on these problems months earlier than we would have done otherwise
I think this kind of approach can backfire badly (especially given human overconfidence), because we currently don't know how to judge progress on these problems except by using human judgment, and it may be easier for AIs to game human judgment than to make real progress. (Researchers trying to use LLMs as RL judges apparently run into the analogous problem constantly.)
having governance set up such that the most important decision-makers are actually concerned about these issues and listening to the AI-results that are being produced
What if the leaders can't or shouldn't trust the AI results?
I'm trying to coordinate with, or avoid interfering with, people who are trying to implement an AI pause or create conditions conducive to a future pause. As mentioned in the grandparent comment, one way people like us could interfere with such efforts is by feeding into a human tendency to be overconfident about one's own ideas/solutions/approaches.
That fully boils down to whether the experience includes a preference to be dead (or to have not been born).
I'm pretty doubtful about this. It seems totally possible that evolution gave us a desire to be alive while also giving us a net welfare that's negative. I mean, we're deluded by default about a lot of other things (e.g., we think there are agents/gods everywhere in nature, and we don't recognize that social status is a hugely important motivation behind everything we do), so why not this too?
Let’s take an area where you have something to say, like philosophy. Would you be willing to outsource that?
Outsourcing philosophy is the main thing I've been trying to do, or trying to figure out how to safely do, for decades at this point. I've written about it in various places, including this post and my pinned tweet on X. Quoting from the latter:
Among my first reactions upon hearing "artificial superintelligence" were "I can finally get answers to my favorite philosophical problems" followed by "How do I make sure the ASI actually answers them correctly?"
Aside from wanting to outsource philosophy to ASI, I'd also love to have more humans who could answer these questions for me. I think about this a fair bit and wrote some things down but don't have any magic bullets.
(I currently think the best bet for eventually getting what I want is to encourage an AI pause along with genetic enhancement of human intelligence, have the enhanced humans solve metaphilosophy and other aspects of AI safety, and then either outsource the rest of philosophy to ASI or have the enhanced humans decide what to do at that point.)
BTW, I thought this would be a good test of how competent current AIs are at understanding someone's perspective, so I asked a bunch of them how Wei Dai would answer your question, and all of them got it wrong on the first try, except Claude Sonnet 4.5, which got it right on the first try but wrong on the second. It seems like having my public content in their training data isn't enough, and finding relevant info from the web and understanding nuance are still challenging for them. (GPT-5 essentially said I'd answer no because I wouldn't trust current AIs enough, which really misses the point despite having this whole thread as context.)
If solving alignment implies solving difficult philosophical problems (and I think it does), then a major bottleneck for verifying alignment will be verifying philosophy, which in turn implies that we should be trying to solve metaphilosophy (i.e., understand the nature of philosophy and philosophical reasoning/judgment). But that is unlikely to be possible within 2-4 years, even with the largest plausible effort, considering the history of analogous fields like metaethics and philosophy of math.
What to do in light of this? Try to verify the rest of alignment, just wing it on the philosophical parts, and hope for the best?
I kind of want to argue against this, but I'm also not sure how it fits in with the rest of your argument. Whether or not there's an upper bound that's plausibly a lot lower than perfectly solving alignment with certainty, it doesn't seem to affect your final conclusions?