Strongly agree.
Related: It's disheartening to recognize, but it seems the ML community might not even get past the first crucial step in reducing risks, which is understanding them. We appear to live in a world where most people, including key decision-makers, still don't grasp the gravity of the situation. For instance, in France, we still hear influential figures like Arthur Mensch, CEO of Mistral, saying things like, "When you write this kind of software, you always control what's going to happen, all the outputs the software can have." As long as such individuals are leading AGI labs, the situation will remain quite dire.
+1 for the conflationary alliances point. It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion). I'm not convinced the goal of the AI Safety community should be to align AIs at this point.
However, I want to make a small amendment to Myth 1: I believe that technical work which enhances safety culture is generally very positive. Examples of such work include scary demos like "BadLlama," which I cite at least once a week, or benchmarks such as Evaluating Frontier Models for Dangerous Capabilities, which tries to monitor particularly concerning capabilities. More "technical" works like these seem overwhelmingly positive, and I think that we need more competent people doing this.
It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion)
One issue is there's also a difference between "AI X-Safety" and "AI Safety". It's very natural for people working on all kinds of safety from and with AI systems to call their field "AI safety", so it seems a bit doomed to try and have that term refer to x-safety.
I believe that technical work which enhances safety culture is generally very positive.
All of the examples that you mentioned share one critical non-technical aspect though. Their results are publicly available (I guess they were funded by the general public, e.g. in the case of "BadLlama", by donations and grants to the foundation Palisade Research and to IIIT, an Indian national institute). If you took the very same "technical" research and made it available only to a potentially shady private company, then that technical information could indeed help them circumvent Llama's safeguards. At that point, I'm not sure if one could still confidently call it "overwhelmingly positive".
I agree that the works that you mentioned are very positive, but I think that the above non-technical aspect is necessary to take into consideration.
Yeah, this agrees with my thinking so far. However, I think if you could research how to align AIs specifically to human flourishing (as opposed to things like obedience/interpretability/truthfulness, which defer to the user's values), that kind of work could be more helpful than most.
I very much agree with human flourishing as the main value I most want AI technologies to pursue and be used to pursue.
In that framing, my key claim is that in practice no area of purely technical AI research, including "safety" and/or "alignment" research, can be adequately checked for whether it will help or hinder human flourishing without a social model of how the resulting technologies will be used by individuals / businesses / governments / etc.
And we don't have good social models for really any technology, even retrospectively. So AI is certainly one we are not going to align with human flourishing in advance. When it comes to human flourishing, the humanizing of technologies takes a lot of time. Eventually we will get there, but it's a process that requires a lot of individual actors making choices and "feature requests" from the world, features that promote human flourishing.
Before we can even start to try to align AIs to human flourishing, we first need a clear definition of what that means. This has been a topic accessible to philosophical thought for millennia and yet still has no universally accepted definition, so how can you consider AI alignment helpful? Even if we could all agree on what "human flourishing" meant, you would still have the problem of lock-in, i.e. our AI overlords will never allow that definition to evolve once they have assumed control. Would you want to be trapped in the Utopia of someone born 3000 years ago? Better than being exterminated, but still not what we want.
I think the key to approaches like this is to eschew pre-existing, complex concepts like "human flourishing" and look for a definition of Good Things that is actually amenable to constructing an agent that Does Good Things. There's no guarantee that this would lead anywhere; it relies on some weak form of moral realism. But an AGI that follows some morality-you-largely-agree-with by its very structure is a lot more appealing to me than an AGI that dutifully maximizes the morality-you-punched-into-its-utility-function-at-bootup, appealing enough that I think it's worth wading into moral philosophy to see if the idea pans out.
Good article. There especially needs to be more focus on the social model. I am an optimist; however, I think ~half of how things could go wrong with AI is with alignment technically solved. The more you expect a slow takeoff or non-fatal overhang, the more important this is.
I don't find this framing compelling. Particularly with respect to this part:
Obedience — AI that obeys the intention of a human user can be asked to help build unsafe AGI, such as by serving as a coding assistant. (Note: this used to be considered extremely sci-fi, and now it's standard practice.)
I grant the point that an AI that does what the user wants can still be dangerous (in fact it could outright destroy the world). But I'd describe that situation as "we successfully aligned AI and things went wrong anyway" rather than "we failed to align AI". I grant that this isn't obvious; it depends on how exactly AI alignment is defined. But the post frames its conclusions as definitive rather than definition-dependent, which I don't think is correct.
Is the-definition-of-alignment-which-makes-alignment-in-isolation-a-coherent-concept obviously not useful? Again, I don't think so. If you believe that "AI destroying the world because it's very hard to specify a utility function that doesn't destroy the world" is a much larger problem than "AI destroying the world because it obeys the wrong group of people", then alignment (and obedience in particular) is a concept useful in isolation. In particular, it's... well, it's not definitely helpful, so your introductory sentence remains literally true, but it's very likely helpful. The important thing is that it does make sense to work on obedience without worrying about how it's going to be applied, because increasing obedience is helpful in expectation. It could remain helpful in expectation even if it accelerates timelines. And note that this remains true even if you do define Alignment in a more ambitious way.
I'm aware that you don't have such a view, but again, that's my point; I think this post is articulating the consequences of a particular set of beliefs about AI, rather than pointing out a logical error that other people make, which is what its framing suggests.
[edit: pinned to profile]
In a similar sense to how the agency you can currently write down about your system is probably not the real agency: if you do manage to write down a system whose agency really is pointed in the direction that the agency of a human wants, but that human is still a part of the current organizational structures in society, those organizational structures implement supervisor trees and competition networks, which means there appears to be more success available if they try to use their AI to participate in the competition networks better, and thus Goodhart whatever metrics are being competed on, probably related to money somehow.
If your AI isn't able to provide the necessary wisdom to get a human from "inclined to accidentally use an obedient powerful ai to destroy the world despite this human's verbal statements of intention to themselves" to "inclined to successfully execute on good intentions and achieve interorganizational behaviors that make things better", then I claim you've failed at the technical problem anyway, even though you succeeded at obedient AI.
If everyone tries to win at the current games (in the technical sense of the word), everyone loses, including the highest scoring players; current societal layout has a lot of games where it seems to me the only long-term winning move is not to play and to instead try to invent a way to jump into another game, but where to some degree you can win short-term. Unfortunately it seems to me that humans are RLed pretty hard by doing a lot of playing of these games, and so having a powerful AI in front of them is likely to get most humans trying to win at those games. Pick an organization that you expect to develop powerful AGI; do you expect the people in that org to be able to think outside the framework of current society enough for their marginal contribution to push towards a better world when the size of their contribution suddenly gets very large?
I found your reply really interesting.
Because I find it so interesting and want to understand it: What does the "RLed" in "Unfortunately it seems to me that humans are RLed pretty hard by doing a lot of playing of these games" mean? That term is not familiar to me.
Like seth said, I just mean reinforcement learning. Described in more typical language, people take their feelings of success from whether they're winning at the player-vs-environment and player-vs-player contests one encounters in everyday life; opportunities to change what contests are possible are unfamiliar. I also think there are decision theory issues[1] humans have. And then of course people do in fact have different preferences and moral values. But even among people where neither issue is in play, I think people have pretty bad self-misalignment as a result of taking what-feels-good-to-succeed-at feedback from circumstances that train them into habits that work well in the original context, and which typically badly fail to produce useful behavior in contexts like "you can massively change things for the better". Being prepared for unreasonable success is a common phrase referring to this issue, I think.
[1] in case this is useful context: a decision theory is a small mathematical expression which roughly expresses "what part of past, present, and future do you see as you-which-decides-together", or stated slightly more technically, what's the expression that defines how you consider counterfactuals when evaluating possible actions you "could [have] take[n]"; I'm pretty sure humans have some native one, and it's not exactly any of the ones that are typically discussed but rather some thing vaguely in the direction of active inference, though people vary between approximating the typically discussed ones. The commonly discussed ones around these parts are stuff like EDT/CDT/LDTs { FDT, UDT, LIDT, ... }
As far as I know, the answer is simply that you have to model the social landscape around you and how your research contributions are going to be applied.
In other words, it matters who receives your ideas, and what they choose to do with those ideas, even when your ideas are technical advances in AI safety or "alignment".
Like others I agree.
Some of what you're saying I interpret as an argument for increased transdisciplinarity in AI (safety) research. It's happening, but we would all likely benefit from more. By transdisciplinarity I mean collaborative work that transcends disciplines and includes non-academic stakeholders.
My technical background is climate science not AI, but having watched that closely, I’d argue it’s a good example of an (x/s)risk-related discipline that (relative to AI) is much further along in the process of branching out to include more knowledge systems (transdisciplinary research).
Perhaps this analogy is too locally specific to Australia to properly land with folks, but AI-safety does not have the climate-safety equivalent of "cultural burning" and probably won't for some time. But it will eventually (have to – for reasons you allude to).
Some of what you’re saying, especially the above quoted, I think relates deeply to research ethics. Building good models of the social landscapes potentially impacted by research is something explored at depth across many disciplines. If I can speak frankly, the bleeding edge of that research is far removed from Silicon Valley and its way of thinking and acting.
A little addendum: This excerpt from the website announcing Ilya's new AI safety company illustrates your points well I think: "We approach safety and capabilities in tandem, as technical problems to be solved through revolutionary engineering and scientific breakthroughs."
Yes, agreed. And: the inevitable dual-use nature of alignment research shouldn't be used as a reason or excuse to not do alignment research. Remaining pure by not contributing to AGI is going to be cold comfort if it arrives without good alignment plans.
So there is a huge challenge: we need to do alignment research that contributes more to alignment than to capabilities, within the current social arrangement. It's very tough to guess which this is. Optimistic individuals will be biased to do too much; cautious individuals will tend to do too little.
One possible direction is to put more effort toward AGI alignment strategy. My perception is that we have far less than would be optimal, and that this is the standard nature of science. People are more apt to work on object-level projects than to strategize about which will best advance the field's goals. We have much more than average in AGI safety, but the large disagreements among well-informed and conscientious thinkers indicate that we could still use more. And it seems like such strategic thinking is necessary to decide which object-level work will probably advance alignment faster than capabilities.
AI that obeys the intention of a human user can be asked to help build unsafe AGI, such as by serving as a coding assistant.
I think a better example of your point is "Corrigible AI can be used by a dictator to enforce their rule."
I'm not certain that Myth #1 is necessarily a myth for all approaches to AI Safety. Specifically, if the Value Learning approach to AI safety turned out to be the most effective one, then the AI will be acting as an alignment researcher and doing research (in the social sciences) to converge its views on human values to the truth, and then using that as an alignment target. If in addition to that, you also believe that human values are a matter of objective fact (e.g. if they are mostly determined by a set of evolved Evolutionary Psychology adaptations to the environmental niche that humans evolved in), and are independent of background/culture/upbringing, then the target that this process converges to might be nearly independent of the human social context in which this work started, and of the desires/views/interests of the specific humans involved at the beginning of the process.
However, that is a rather strong and specific set of assumptions required for Myth #1 not to be a myth: I certainly agree that in general and by default, for most ideas in Alignment, human context matters, and that the long-term outcome of a specific Alignment technique being applied in, say, North Korea, might differ significantly from it being applied in North America.
Designing new technology seems to be what I'm best at, but I've often asked myself:
What's the point of new technology if all it does is give institutions more room to be corrupt?
Computers got much faster, but software and web pages bloat until they're slow, until there's enough of a problem that management gets pushback and has to make changes they otherwise wouldn't. More technology can improve the results of a small amount of effort, but any amount of technology multiplied by zero effort from leadership towards a goal is zero.
One additional maxim to consider is that the AI community in general can only barely conceptualize and operationalize difficult concepts, such as safety. Historically, the AI community was good at maximizing some measure of performance, usually pretty straightforward test set metrics such as classification accuracy. Culturally this is how the community approaches all problems -- by aggregating complex phenomena into a single number. Note that this approach is not used in that many fields outside of AI and math, as you always have to make some lossy simplifications.
We can observe this malpractice in AI safety as well. There is a cottage industry of datasets and papers collecting "safety" samples, and we use these to measure some safety metric. We can then compare the numbers for different models and this makes AI folks happy. But there is barely any discussion about how representative these datasets really are for real-life risks, how comprehensive the data collection process is, or how sound it is to use random crowd-sourced workers or LLMs to generate such samples. The threats and risks are also rarely described in more detail -- often it's just a lot of hand-waving.
Based on my pretty deep experience with one aspect of AI safety (societal biases), I have very little confidence in our ability to understand AI behavior. Compared to measuring performance on well-defined NLP tasks, once we involve societal context, the intricacies of what we are trying to measure are beyond simple benchmarks. Note that we have entire fields that are trying to understand some of these problems in human societies, but we are to believe that we can collect a test set with a few thousand samples and this should be enough to understand how AI works.
Curated.
The overall point here seems true and important to me.
I think I either disagree, or am agnostic about, some of the specific examples given in the Myth vs Reality section. I don't think they're loadbearing for the overall point. I may try to write those up in more detail later.
Technical AI safety and/or alignment advances are intrinsically safe and helpful to humanity, irrespective of the state of humanity.
I think this statement is weakly true, insofar as almost no misuse by humans could possibly be worse than what a completely out of control ASI would do. Technical safety is a necessary but not sufficient condition for beneficial AI. That said, it's also absolutely true that it's not nearly enough. Most scenarios with controllable AI still end with humanity nearly extinct IMO, with only a few people lording their AI over everyone else. Preventing that is not a merely technical challenge.
Physics Myths vs reality.
Myth: Ball bearings are perfect spheres.
Reality: The ball bearings have slight lumps and imperfections due to manufacturing processes.
Myth: Gravity pulls things straight down at 9.8 m/s/s.
Reality: Gravitational force varies depending on local geology.
You can do this for any topic. Everything is approximations. The only question is if they are good approximations.
I would have liked to write a post that offers one weird trick to avoid being confused by which areas of AI are more or less safe to advance, but I can’t write that post. As far as I know, the answer is simply that you have to model the social landscape around you and how your research contributions are going to be applied.
Another thing that can't be ignored is the threat of Social Balkanization. Divide-and-conquer tactics have been prevalent among military strategists for millennia, and the tactic remains prevalent and psychologically available among the people making up corporate factions and many subcultures (likely including leftist and right-wing subcultures).
It is easy for external forces to notice opportunities to Balkanize a group, to make it weaker and easier to acquire or capture the splinters, which in turn provides further opportunity for lateral movement and spotting more exploits. Since awareness and exploitation of this vulnerability is prevalent, social systems without this specific hardening are very brittle and have dismal prospects.
Sadly, Balkanization can also emerge naturally, as you helpfully pointed out in Consciousness as a Conflationary Alliance Term, so the high base rates make it harder to correctly distinguish attacks from accidents. Inadequately calibrated autoimmune responses are not only damaging, but should be assumed to be automatically anticipated and misdirected by default, including as part of the mundane social dynamics of a group with no external attackers.
There is no way around the loss function.
But with ambitious spears they made us change
They crouched behind their mirrors and fought on.
Robin Williamson, Creation
I go back to "know thyself" as a solution, but I think it needs to happen at species scale due to the times, so it needs to effectively be a lot more addictive than heretofore. ML is perfect for this, because one uses one's own data, so no artists or even corporations are being ripped off. If the buttons that others are pushing can be understood, all kinds of manipulation might be stymied.
Such an app would need to be a tetris-like way to understand emotions, to golf cart around Plato's cave before going deeper with a flashlight to see what tunnels time has made, giving a feeling of being understood as an incentive. I've been working on one of the two apps that popped into my head for that (Phobrain), and just open sourced a workbench for building a representation of self by labeling 'Rorschach pairs' of pics yes/no. The other app would be worry beads with axial sensing and force feedback, with vibration and temperature also possible. Aside from exploring oneself and holding the hand of god, one might hold hands with someone around the world or across the table in a meeting, or carry the beads in one's underwear - there should be enough today's-world possibilities there for moneymaking products that could be adapted for next-world understanding.
This feels like I'm not addressing the formal risk aspect for the industry, but that angle of discussion seems unlikely to withstand the changes that the first 'killer app' that decides a battle or war will bring. The momentum of evolution with war will not be stopped by people discussing it, only by choosing a new model at a species scale.
Broadly agree, in that most safety research expands control over systems and our understanding of them, which can be abused by a bad actor.
This problem is encountered by for-profit companies, where profit is on the line instead of catastrophe. They too have R&D departments and research directions which have the potential for misuse. However, this research is done inside a social environment (the company) where it is only explicitly used to make money.
To give a more concrete example, improving self-driving capabilities also allows the companies making the cars to intentionally make them run people down if they so wish. The more advanced the capabilities, the more precise they can be in deploying their pedestrian-killing machines onto the roads. However, we would never expect this to happen as this would clearly demolish the profitability of a company and result in the cessation of these activities.
AI safety research is not done in this kind of environment at present whatsoever. However, it does seem to me that these kinds of institutions that carefully vet research and products, only releasing them when they remain beneficial, are possible.
Thank you for pointing this out, as it is very important.
The morality / ethics of the human beings matters a lot. But it seems to matter more than just a lot. If we get even a little thing wrong here, ...
But we're getting more than just a little wrong here, imo. Afaict most modern humans are terribly confused about morality / ethics. As you say, "What is even good?"
I've spoken with serious mathematicians who believe they might have a promising direction to the AI alignment problem. But they're also confused about what's good. That is not their realm of expertise. And math is not constrained by ethics; you can express a lot of things in math, wholesome and unwholesome. And the same is so with the social dynamics, as you point out.
This is why MAPLE exists, to help answer the question of what is good, and help people describe that in math.
But in order to answer the question for REAL, we can't merely develop better social models. Because again, your point applies to THAT as well. Developing better social models does not guarantee 'better for humanity/the planet'. That is itself a technology, and it can be used either way.
We start by answering "what is good" directly and "how to act in accord with what is good" directly. We find the true 'constraints' on our behaviors, physical and mental. The entanglement between reality/truth and ethics/goodness is a real thing, but 'intelligence' on its own has never realized it.
most modern humans are terribly confused about morality
The other option is being slightly less terribly confused, I presume.
This is why MAPLE exists, to help answer the question of what is good
Do you consider yourselves to have a significant comparative advantage in this area relative to all other moral philosophers throughout the millennia whose efforts weren't enough to lift humanity from the aforementioned dismal state?
We have a significant comparative advantage to pretty much all of Western philosophy. I know this is a 'bold claim'. If you're further curious you can come visit the Monastic Academy in Vermont, since it seems best 'shown' rather than 'told'. But we also plan on releasing online content in the near future to communicate our worldview.
We do see that all the previous efforts have perhaps never quite consistently and reliably succeeded, in both hemispheres. (Because, hell, we're here now.) But it is not fair to say they have never succeeded to any degree. There have been a number of significant successes in both hemispheres. We believe we're in a specific moment in history where there's more leverage than usual, and so there's opportunity. We understand that chances are slim and dim.
We have been losing the thread to 'what is good' over the millennia. We don't need to reinvent the wheel on this; the answers have been around. The question now is whether the answers can be taught to technology, or whether technology can somehow be yoked to the good / ethical, in a way that scales sufficiently.
We have a significant comparative advantage to pretty much all of Western philosophy.
I do agree that there are some valuable Eastern insights that haven't yet penetrated the Western mainstream, so work in this direction is worth a try.
We believe we’re in a specific moment in history where there’s more leverage than usual, and so there’s opportunity. We understand that chances are slim and dim.
Also reasonable.
We have been losing the thread to ‘what is good’ over the millennia. We don’t need to reinvent the wheel on this; the answers have been around.
Here I disagree. I think that much of "what is good" is contingent on our material circumstances, which are changing ever faster these days, so it's no surprise that old answers no longer work as well as they did in their time. Unfortunately, nobody has yet discovered a reliable way to update them in a timely manner, and very few seem to even acknowledge this problem.
Hm, you know I do buy that also.
The task is much harder now, due to changing material circumstances as you say. The modern culture has in some sense vaccinated itself against certain forms of wisdom and insight.
We acknowledge this problem and are still making an effort to address it, using modern technology. I cannot claim we're 'anywhere close' to resolving this? We're just firmly GOING to try, and we believe we in particular have a comparative advantage, due to a very solid community of spiritual practitioners. We have AT LEAST managed to get a group of modern millennials + Gen-Zers (with all the foibles of this group, with their mental hang-ups and all -- I am one of them)... and successfully put them through a training system that 'unschools' their basic assumptions and provides them the tools to personally investigate and answer questions like 'what is good' or 'how do i live' or 'what is going on here'.
There's more to say, but I appreciate your engagement. This is helpful to hear.
As an AI researcher who wants to do technical work that helps humanity, there is a strong drive to find a research area that is definitely helpful somehow, so that you don’t have to worry about how your work will be applied, and thus you don’t have to worry about things like corporate ethics or geopolitics to make sure your work benefits humanity.
Unfortunately, no such field exists. In particular, technical AI alignment is not such a field, and technical AI safety is not such a field. It absolutely matters where ideas land and how they are applied, and when the existence of the entire human race is at stake, that’s no exception.
If that’s obvious to you, this post is mostly just a collection of arguments for something you probably already realize. But if you think technical AI safety or technical AI alignment is somehow intrinsically or inevitably helpful to humanity, this post is an attempt to change your mind. In particular, with more and more AI governance problems cropping up, I'd like to see more and more AI technical staffers forming explicit social models of how their ideas are going to be applied.
If you read this post, please don’t try to read this post as somehow pro- or contra- a specific area of AI research, or safety, or alignment, or corporations, or governments. My goal in this post is to encourage more nuanced social models by de-conflating a bunch of concepts. This might seem like I’m against the concepts themselves, when really I just want clearer thinking about these concepts, so that we (humanity) can all do a better job of communicating and working together.
Myths vs reality
Epistemic status: these are claims that I’m confident in, assembled over 1.5 decades of observation of existential risk discourse, through thousands of hours of conversation. They are not claims I’m confident I can convince you of, but I’m giving it a shot anyway because there’s a lot at stake when people don’t realize how their technical research is going to be misapplied.
Myth #1: Technical AI safety and/or alignment advances are intrinsically safe and helpful to humanity, irrespective of the state of humanity.
Reality: All technical advances in AI safety and/or “alignment” can be misused by humans. There are no technical advances in AI that are safe per se; the safety or unsafety of an idea is a function of the human environment in which the idea lands.
Examples:
Myth #2: There’s a {technical AI safety VS AI capabilities} dichotomy or spectrum of technical AI research, which also corresponds to {making humanity more safe VS shortening AI timelines}.
Reality: Conflating these concepts has three separate problems with it, (a)-(c) below:
a) AI safety and alignment advances almost always shorten AI timelines.
In particular, the ability to «make an AI system do what you want» is used almost instantly by AI companies to help them ship AI products faster (because the AI does what users want) and to build internal developer tools faster (because the AI does what developers want).
(When I point this out, usually people think I’m somehow unhappy with how AI products have been released so quickly. On the contrary, I’ve been quite happy with how quickly OpenAI brought GPT-4 to the public, thereby helping the human public to better come to grips with the reality of ongoing and forthcoming AI advances. I might be wrong about this, though, and it's not load-bearing for this post. At the very least I’m not happy about Altman's rush to build a $7TN compute cluster, nor with OpenAI’s governance issues.)
b) Per the reality of Myth #1 explained above, technical AI safety advances sometimes make humanity less safe.
c) Finally, {making humanity more safe VS shortening AGI timelines} is itself a false dichotomy or false spectrum.
Why? Because in some situations, shortening AGI timelines could make humanity more safe, such as by avoiding an overhang of over-abundant computing resources that AGI could abruptly take advantage of if it’s invented too far in the future (the “compute overhang” argument).
What to make of all this
The above points could feel quite morally disorienting, leaving you with a feeling something like: "What is even good, though?"
This disorientation is especially likely if you were on the hunt for a simple and reassuring view that a certain area of technical AI research could be easily verified as safe or helpful to humanity. Even if I’ve made clear arguments here, perhaps the resulting feeling of moral disorientation might make you want to reject or bounce off this post or the reasoning within it. It feels bad to be disoriented, so it’s more comfortable to go back to a simpler, more oriented worldview of what kind of AI research is “the good kind”.
Unfortunately, the real world is a complex sociotechnical system that’s confusing, not only because of its complexity, but also because the world can sometimes model you and willfully misuse you, your ideas, or your ambitions. Moreover, I have no panacea to offer for avoiding this. I would have liked to write a post that offers one weird trick to avoid being confused by which areas of AI are more or less safe to advance, but I can’t write that post. As far as I know, the answer is simply that you have to model the social landscape around you and how your research contributions are going to be applied.
In other words, it matters who receives your ideas, and what they choose to do with those ideas, even when your ideas are technical advances in AI safety or "alignment". And if you want to make sure your ideas land in a way that helps and doesn’t harm humanity, you just have to think through how the humans are actually going to use your ideas. To do a good job of that, you have to carefully think through arguments and the meanings of words (“alignment”, “safety”, “capabilities”, etc.) before conflating important load-bearing concepts for steering the future of AI.
Avoiding such conflations is especially hard because forming a large alliance often involves convincing people to conflate a bunch of concepts they care about in order to recruit you to their alliances. In other words, you should in general expect to see large alliances of people trying to convince you to conflate value-laden concepts (e.g., “technical safety”, “alignment”, “security”, “existential safety”) in order to join them (i.e., conflationary alliances).
Recap of key points