I got into AI at the worst time possible
2023 marks the year AI Safety went mainstream. And though I am happy it is finally getting more attention, and finally attracting highly talented people who want to work in it, personally it could not have come at a worse time for my professional life. This isn't a thing I normally talk about, because it's a very weird thing to complain about. I rarely permit myself to even complain about it internally. But I can't shake the nagging sensation that if I had just pivoted to alignment research one year sooner than I did, everything would have been radically easier for me.
I hate saturated industries. I hate hyped-up industries. I hate fields that constantly make the news and gain mainstream attention. This was one of the major reasons why I had to leave the crypto scene: it had become so saturated with attention, grift, and hype that I found it completely unbearable. Alignment and AGI were among those things almost no one even knew about, and fewer still talked about, which made them ideal for me. I was happy with the idea of doing work that might never be appreciated or understood by the rest of the world.
Since 2015, I had planned to get involved, but at the time I had no technical experience or background. So I went to college, majoring in Computer Science. Working on AI and what would later be called “Alignment” was always the plan, though. I remember having a shelf in my college dorm, which I used to represent all my life goals and priorities: AI occupied the absolute top. My mentality, however, was that I needed to establish myself enough, and earn enough money, before I could transition to it. I thought I had all the time in the world.
Eventually, I got frustrated with myself for dragging my feet for so long. So in Fall 2022, I quit my job in cybersecurity, accepted a grant from the Long Term Future Fund, and prepared to spend a year skilling up to do alignment research. I felt fulfilled. Where my brain would normally nag me about not doing enough, or about how I should be working on something more important, I finally felt content. I was finally doing it. I was finally working on the Extremely Neglected, Yet Conveniently Super Important Thing™.
And then ChatGPT came out two months later, and even my mother was talking about AI.
If I believed in fate, I would say it seems as though I was meant to enter AI and Alignment during the early days. I enjoy fields where almost nothing has been figured out. I hate prestige. I embrace the weird, and hate any field that starts worrying about its reputation. I’m not a careerist. I can imagine many alternative worlds where I got in early, maybe ~2012 (I’ve been around the typical lesswrong/rationalist/transhumanist group for my entire adult life). I’d get in, start to figure out the early stuff, identify some of the early assumptions and problems, and then get out once 2022/2023 came around. It’s the weirdest sensation to feel like I’m too old to join the field now, and also feel as though I’ve been part of the field for 10+ years. I’m pretty sure I’m just 1-2 degrees from literally everyone in the field.
I think I’m still adjusting to the shock of the field/community going from something almost no one was talking about to something even the friggin’ Pope is weighing in on. Some part of me keeps hoping the bubble will burst, AI will “hit a wall”, marking the second time in history Gary Marcus was right about something, and I’ll feel as though the field has enough space to operate in again. As it stands now, I don’t really know what place it has for me. It is no longer the Extremely Neglected, Yet Conveniently Super Important Thing™, but instead just the Super Important Thing. When I was briefly running an AI startup (don’t ask), I was getting 300+ applicants for each role we were hiring for. We never once advertised the roles, but people somehow found them anyway, and applied in swarms. Whenever I get a rejection email from an AI Safety org, I’m usually told they receive somewhere in the range of 400-700 applications for every given job. That’s, at best, a 0.25% chance of acceptance: substantially lower than Harvard. It becomes difficult for me to answer why I’m still trying to get into such an incredibly competitive field, when literally doing anything else would be easier. “It’s super important” is not exactly holding up as a defense at this point, since there are obviously other talented people who would get the job if I didn’t.
I think it’s that I could start to see the shape of what I could have had, and what I could have been. It’s vanity. Part of me really loved the idea of working on the Extremely Neglected, Yet Conveniently Super Important Thing™. And now I have a hard time going back to working on literally anything else, because anything else could never hope to be remotely as important. And at the same time, despite the huge amount of new interest in alignment, and the huge amount of new talent eager to contribute to it, somehow the field still feels undersaturated. In a market-driven field, we would see jobs and roles grow alongside the overall interest in working in the field, since interest normally correlates with growth in consumers/investors/etc. Except we’re not seeing that. Despite everything, by most measurements, there seem to still be fewer than 1000 people working on it full-time, maybe as low as ~300, depending on what you count.
So I oscillate between thinking I should just move on to other things, and thinking I absolutely should be working on this at all cost. It’s made worse by sometimes briefly doing temp work for an underfunded org, sometimes getting to the final interview stage for big labs, and overall thinking that doing the Super Important Thing™ is just around the corner… and for all I know, it might be. It’s really hard for me to tell if this is a situation where it’s smart for me to be persistent, or if being persistent is dragging me ever-closer to permanent unemployment, endless poverty/homelessness/whatever-my-brain-is-feeling-paranoid-about… which isn’t made easier by the fact that, if the AI train does keep going, my previous jobs in software engineering and cybersecurity will probably not be coming back.
Not totally sure what I’m trying to get out of writing this. Maybe someone has advice about what I should be doing next. Or maybe, after a year of my brain nagging me each day about how I should have gotten involved in the field sooner, I just wanted to admit that: despite wanting the world to be saved, despite wanting more people to be working on the Extremely Neglected, Yet Conveniently Super Important Thing™, some selfish, not-too-bright, vain part of me is thinking “Oh, great. More competition.”
Thanks for sharing your experience here.
One small thought is that things end up feeling extremely neglected once you index on particular subquestions. Like, at a high level, it is indeed the case that AI safety has gotten more mainstream.
But when you zoom in, there are a lot of very important topics that have <5 people seriously working on them. I work in AI policy, so I'm more familiar with the policy/governance ones, but I imagine this is also true on the technical side (also, maybe consider swapping to governance/policy!).
Also, especially in hype waves, I think a lot of people end up just working on the popular thing. If you're willing to deviate from the popular thing, you can often find important parts of the problem that nearly no one is addressing.
Thanks for writing this, I think this is a common and pretty rough experience.
Have you considered doing cybersecurity work related to AI safety? I.e., work that would help prevent bad actors from stealing model weights, and prevent AIs themselves from escaping. I think this kind of work would likely be more useful than most alignment work.
I'd recommend reading Holden Karnofsky's takes, as well as the recent huge RAND report on securing model weights. Redwood's control agenda might also be relevant.
I think this kind of work is probably extremely useful, and somewhat neglected; it especially seems to be missing people who know about cybersecurity and care about AGI/alignment.
Working on this seems good insofar as greater control implies more options. With good security, it's still possible to opt in to whatever weight-sharing / transparency mechanisms seem net positive - including with adversaries. Without security there's no option.
Granted, the [more options are likely better] conclusion is clearer if we condition on wise strategy.
However, [we have great security, therefore we're sharing nothing with adversaries] is clearly not a valid inference in general.
Not necessarily. If we have the option to hide information, then even if we reveal information, adversaries may still assume (likely correctly) we aren't sharing all our information, and are closer to a decisive strategic advantage than we appear. Even in the case where we do share all our information (which we won't).
Of course the [more options are likely better] conclusion holds if the lumbering, slow, disorganized, and collectively stupid organizations which have those options somehow execute the best strategy, but they're not actually going to take the best strategy. Especially when it comes to US-China relations.
ETA:
[we have great security, therefore we're sharing nothing with adversaries] is clearly not a valid inference in general.
I don't think the conclusion holds if that is true in general, and I don't think I ever assumed or argued it was true in general.
then even if we reveal information, adversaries may still assume (likely correctly) we aren't sharing all our information
I think the same reasoning applies if they hack us: they'll assume that the stuff they were able to hack was the part we left suspiciously vulnerable, and the really important information is behind more serious security.
I expect they'll assume we're in control either way - once the stakes are really high.
It seems preferable to actually be in control.
I'll grant that it's far from clear that the best strategy would be used.
(apologies if I misinterpreted your assumptions in my previous reply)
One option would be to find a role in AI more generally that allows you to further develop your skills whilst also not accelerating capabilities.
Another alternative: I suspect that more people should consider working at a normal job three or four days per week and doing AI Safety things on the side one or two days.
I suspect that working on capabilities (edit: preferably applications rather than building AGI) in some non-maximally-harmful position is actually the best choice for most junior x-risk concerned people who want to do something technical. Safety is just too crowded and still not very tractable.
Since many people seem to disagree, I'm going to share some reasons why I believe this:
These are pretty sane takes (conditional on my model of Thomas Kwa of course), and I don't understand why people have downvoted this comment. Here's an attempt to unravel my thoughts and potential disagreements with your claims.
AGI that poses serious existential risks seems at least 6 years away, and safety work seems much more valuable at crunch time, such that I think more than half of most peoples’ impact will be more than 5 years away.
I think safety work gets less and less valuable at crunch time actually. I think you have this Paul Christiano-like model of getting a prototypical AGI and dissecting it and figuring out how it works -- I think it is unlikely that any individual frontier lab would perceive itself to have the slack to do so. Any potential "dissection" tools will need to be developed beforehand, such as scalable interpretability tools (SAEs seem like rudimentary examples of this). The problem with "prosaic alignment" IMO is that a lot of this relies on a significant amount of schlep -- a lot of empirical work, a lot of fucking around. That's probably why, according to the MATS team, frontier labs have a high demand for "iterators" -- their strategy involves having a lot of ideas about stuff that might work, and without a theoretical framework underlying their search path, a lot of things they do would look like trying things out.
I expect that once you get AI researcher level systems, the die is cast. Whatever prosaic alignment and control measures you've figured out, you'll now be using in an attempt to play this breakneck game of getting useful work out of a potentially misaligned AI ecosystem, one that would also be modifying itself to improve its capabilities (because that is the point of AI researchers). (Sure, it's easier to test for capability improvements. That doesn't mean information can't be embedded into these proposals such that the modified models end up modified in ways the humans did not anticipate, or would not want if they had a full understanding of what is going on.)
Mentorship for safety is still limited. If you can get an industry safety job or get into MATS, this seems better than some random AI job, but most people can’t.
Yeah -- I think most "random AI jobs" are significantly worse for trying to do useful work in comparison to just doing things by yourself or with some other independent ML researchers. If you aren't in a position to do this, however, it does make sense to optimize for a convenient low-cognitive-effort set of tasks that provides you the social, financial and/or structural support that will benefit you, and perhaps look into AI safety stuff as a hobby.
I agree that mentorship is a fundamental bottleneck to building mature alignment researchers. This is unfortunate, but it is the reality we have.
Funding is also limited in the current environment. I think most people cannot get funding to work on alignment if they tried? This is fairly cruxy and I’m not sure of it, so someone should correct me if I’m wrong.
Yeah, post-FTX, I believe that funding is limited enough that you have to be consciously optimizing for getting funding (as an EA-affiliated organization, or as an independent alignment researcher). Particularly for new conceptual alignment researchers, I expect that funding is drastically limited since funding organizations seem to explicitly prioritize funding grantees who will work on OpenPhil-endorsed (or to a certain extent, existing but not necessarily OpenPhil-endorsed) agendas. This includes stuff like evals.
The relative impact of working on capabilities is smaller than working on alignment—there are still 10x as many people doing capabilities as alignment, so unless returns don’t diminish or you are doing something unusually harmful, you can work for 1 year on capabilities and 1 year on alignment and gain 10x.
This is a very Paul Christiano-like argument -- yeah sure the math makes sense, but I feel averse to agreeing with this because it seems like you may be abstracting away significant parts of reality and throwing away valuable information we already have.
Anyway, yeah I agree with your sentiment. It seems fine to work on non-SOTA AI / ML / LLM stuff and I'd want people to do so such that they live a good life. I'd rather they didn't throw themselves into the gauntlet of "AI safety" and get chewed up and spit out by an incompetent ecosystem.
Safety could get even more crowded, which would make upskilling to work on safety net negative. This should be a significant concern, but I think most people can skill up faster than this.
I still don't understand what causal model would produce this prediction. Here's mine: the number of safety researchers the current SOTA lab ecosystem can absorb is bottlenecked by the labs' expectations for how many researchers they want or need. On one hand, more schlep during the pre-AI-researcher era means more hires. On the other hand, more hires require more research managers or managerial experience. Anecdotally, it seems like many AI capabilities and alignment organizations (both in the EA space and in the frontier lab space) have historically been bottlenecked on management capacity. Additionally, hiring has a cost (both the search process and the onboarding), and it is likely that as labs get closer to creating AI researchers, they'd believe that the opportunity cost of hiring continues to increase.
Skills useful in capabilities are useful for alignment, and if you’re careful about what job you take there isn’t much more skill penalty in transferring them than, say, switching from vision model research to language model research.
Nah, I found that very little from my vision model research work (during my undergrad) contributed to my skill and intuition related to language model research work (again during my undergrad, both around 2021-2022). I mean, specific skills of programming and using PyTorch and debugging model issues and data processing and containerization -- sure, but the opportunity cost is ridiculous when you could be actually working with LLMs directly and reading papers relevant to the game you want to play. High quality cognitive work is extremely valuable, and spending it on irrelevant things like the specifics of diffusion models (for example) seems quite wasteful unless you really think this stuff is relevant.
Capabilities often has better feedback loops than alignment because you can see whether the thing works or not. Many prosaic alignment directions also have this property. Interpretability is getting there, but not quite. Other areas, especially in agent foundations, are significantly worse.
Yeah this makes sense for extreme newcomers. If someone can get a capabilities job, however, I think they are doing themselves a disservice by playing the easier game of capabilities work. Yes, you have better feedback loops than alignment research / implementation work. That's like saying "Search for your keys under the streetlight because that's where you can see the ground most clearly." I'd want these people to start building the epistemological skills to thrive even with a lower intensity of feedback loops such that they can do alignment research work effectively.
And the best way to do that is to actually attempt to do alignment research, if you are in a position to do so.
I think safety work gets less and less valuable at crunch time actually. [...] Whatever prosaic alignment and control measures you've figured out, you'll now be using that in an attempt to play this breakneck game of getting useful work out of a potentially misaligned AI ecosystem
Sure, but you have to actually implement these alignment/control methods at some point? And likely these can't be (fully) implemented far in advance. I usually use the term "crunch time" in a way which includes the period where you scramble to implement in anticipation of the powerful AI.
One (oversimplified) model is that there are two trends:
Eventually, the second trend implies that safety work is less valuable, but probably safety work has already massively gone up in value by this point.
(Also, note that the default way of automating safety work will involve large amounts of human labor for supervision, either due to issues with the AIs or because of a lack of trust in these AI systems (e.g. human labor is needed for a control scheme).)
My biggest reason for disagreeing (though there are others) is thinking that people often underestimate the effects that your immediate cultural environment has on your beliefs over time. I don't think humans have the kind of robust epistemics necessary to fully combat changes in priors from prolonged exposure to something (for example, I know someone who was negative on ads, joined Google and found they were becoming more positive on them without any object-level changes in their views causing or resulting from the shift, and reverted back after they left.)
Contra One Critical Try: AIs are all cursed
I don't feel like making this a whole blog post, but my biggest source of optimism for why we won't need to one-shot an aligned superintelligence is that anyone who's trained AI models knows that AIs are unbelievably cursed. What do I mean by this? I mean even the first quasi-superintelligent AI we get will have so many problems and so many exploits that taking over the world will simply not be possible. Take a "superintelligence" that only had to beat humans at the very constrained game of Go, which is far simpler than the real world. Everyone talked about how such systems were unbeatable by humans, until some humans used a much "dumber" AI to find glaring holes in Leela Zero's strategy. I expect, in the far more complex "real world", a superintelligence will have even more holes, and even more exploits: a kind of "swiss cheese superintelligence". You can say "but that's not REAL superintelligence", and I don't care, and the AIs won't care. But it's likely the thing we'll get first. Patching all of those holes, and finding ways to make such an ASI sufficiently not cursed, will also probably mean better understanding of how to stop it from wanting to kill us, if it wanted to kill us in the first place. I think we can probably get AIs that are sufficiently powerful in a lot of human domains, and can probably even self-improve, and are still cursed. The same way we have AIs with natural language understanding, something once thought to be a core component of human intelligence, that are still cursed. A cursed ASI is a danger for exploitation, but it's also an opportunity.
Humans are infinitely cursed (see "cognitive biases" or your creationist neighbour); it doesn't change the fact that humans are ruling the planet.
On 12 September 1940, the entrance to the Lascaux Cave was discovered on the La Rochefoucauld-Montbel lands by 18-year-old Marcel Ravidat when his dog, Robot, investigated a hole left by an uprooted tree (Ravidat would embellish the story in later retellings, saying Robot had fallen into the cave.)[8][9] Ravidat returned to the scene with three friends, Jacques Marsal, Georges Agnel, and Simon Coencas. They entered the cave through a 15-metre-deep (50-foot) shaft that they believed might be a legendary secret passage to the nearby Lascaux Manor.[9][10][11] The teenagers discovered that the cave walls were covered with depictions of animals.[12][13] Galleries that suggest continuity, context or simply represent a cavern were given names. Those include the Hall of the Bulls, the Passageway, the Shaft, the Nave, the Apse, and the Chamber of Felines. They returned along with the Abbé Henri Breuil on 21 September 1940; Breuil would make many sketches of the cave, some of which are used as study material today due to the extreme degradation of many of the paintings. Breuil was accompanied by Denis Peyrony, curator of the Prehistory Museum at Les Eyzies, Jean Bouyssonie and Dr Cheynier.
The cave complex was opened to the public on 14 July 1948, and initial archaeological investigations began a year later, focusing on the Shaft. By 1955, carbon dioxide, heat, humidity, and other contaminants produced by 1,200 visitors per day had visibly damaged the paintings. As air conditions deteriorated, fungi and lichen increasingly infested the walls. Consequently, the cave was closed to the public in 1963, the paintings were restored to their original state, and daily monitoring was introduced.
Lascaux II, an exact copy of the Great Hall of the Bulls and the Painted Gallery was displayed at the Grand Palais in Paris, before being displayed from 1983 in the cave's vicinity (about 200 m or 660 ft away from the original cave), a compromise and attempt to present an impression of the paintings' scale and composition for the public without harming the originals.[10][13] A full range of Lascaux's parietal art is presented a few kilometres from the site at the Centre of Prehistoric Art, Le Parc du Thot, where there are also live animals representing ice-age fauna.[14]
The paintings for this site were duplicated with the same type of materials (such as iron oxide, charcoal, and ochre) which were believed to be used 19,000 years ago.[9][15][16][17] Other facsimiles of Lascaux have also been produced over the years.
They have also created additional copies, Lascaux III, Lascaux IV, and Lascaux V.
“I actually find it overwhelmingly hopeful, that four teenagers and a dog named Robot discovered a cave with 17,000-year-old handprints, that the cave was so overwhelmingly beautiful that two of those teenagers devoted themselves to its protection. And that when we humans became a danger to that cave's beauty, we agreed to stop going. Lascaux is there. You cannot visit.”
-John Green
People preserve the remains of Lucy and work hard to preserve old books; the Mona Lisa is protected behind bullet-proof glass and is not up for sale.
What is the mechanistic reason for this? There are perfect copies of these things, yet humans go to great lengths to preserve the original. Why is there the Sacred?
They have created copies of Lascaux, yet still work hard to preserve the original. Humans cannot enter. They get no experience of joy from visiting. It is not for sale. Yet they strongly desire to protect it, because it is the original, and no other reason.
Robin Hanson gave a list of characteristics of the Sacred, some of which I find promising:
Sacred things are highly (or lowly) valued. We revere, respect, & prioritize them.
Sacred is big, powerful, extraordinary. We fear, submit, & see it as larger than ourselves.
We want the sacred “for itself”, rather than as a means to get other things.
Sacred makes us feel less big, distinct, independent, in control, competitive, entitled.
Sacred quiets feelings of: doubts, anxiety, ego, self-criticism, status-consciousness.
We get emotionally attached to the sacred; our stance re it is oft part of our identity.
We desire to connect with the sacred, and to be more associated with it.
Sacred things are sharply set apart and distinguished from the ordinary, mundane.
Re sacred, we fear a slippery slope, so that any compromise leads to losing it all.
If we can understand the sacred, it seems like a concept that probably wouldn’t fall into a simple utility function, something that wouldn’t break out-of-distribution. A kind of Sacred Human Value Shard, something that protects our part of the manifold.
A practical reason for preserving the original is that new techniques can allow new things to be discovered about it. A copy can embody no more than the observations that we have already made.
There's no point to analysing the pigments in a modern copy of a painting, or carbon-dating its frame.
If we could somehow establish that all information from the original had been extracted, would you expect humans to then destroy the original, or allow it to be destroyed?
No. The original is a historical document that may have further secrets to be revealed by methods yet to be invented. A copy says of the original only what was put into it.
Only recently an ancient, charred scroll was first read.
I think you're missing the point. If we could establish that all important information had been extracted from the original, would you expect humans to then destroy the original or allow it to be destroyed?
My guess is that they wouldn't. Which I think means practicality is not the central reason why humans do this.
I think you’re missing my point, which is that we cannot establish that.
Yes, I’m questioning your hypothetical. I always question hypotheticals.
The following is a conversation between myself in 2022, and a newer version of myself earlier this year.
On AI Governance and Public Policy
2022 Me: I think we will have to tread extremely lightly here, or, if possible, avoid this completely. One particular concern is the idea of gaining public support. Many countries have an interest in pleasing their constituents, so if executed well, this could be extremely beneficial. However, it runs a high risk of doing far more damage. One major concern is the different mindset needed to conceptualize the problem. Alerting people to the dangers of Nuclear War is easier: nukes have been detonated, the visual image of incineration is easy to imagine and can be described in detail, and they or their parents have likely lived through nuclear drills in school. This is closer to trying to explain to someone the dangers of nuclear war before Hiroshima, before the Manhattan Project, and before even TNT was developed. They have to conceptualize what an explosion even is, not simply imagine an explosion at greater scale. Most people will simply not have the time or the will to try to grasp this problem, so this runs the risk of having people call for action on a problem they do not understand, which will likely lead to dismissal by AI Researchers, and possibly short-sighted policies that don’t actually tackle the problem, or even make the problem worse under the guise of accomplishment. To make matters worse, there is the risk of polarization. Almost any concern with political implications that has gained widespread public attention runs a high risk of becoming polarized. We are still dealing with the ramifications of well-intentioned, but misguided, early advocates in the Climate Change movement two decades ago, who set the seeds for making climate policy part of one’s political identity. This could be even more detrimental than a merely uninformed electorate, as it might push people who had no previous opinion on AI to advocate strongly in favor of capabilities acceleration, and to be staunchly against any form of safety policy.
Even if executed using the utmost caution, this does not stop other players from using their own power or influence to hijack the movement and lead it astray.
2023 Me: Ah, Me’22, the things you don’t know! Many of the concerns of Me’22 I think are still valid, but we’re experiencing what chess players might call a “forced move”. People are starting to become alarmed, regardless of what we say or do, so steering that in a direction we want is necessary. The fire alarm is being pulled, regardless, and if we don’t try to show some leadership in that regard, we risk less informed voices and blanket solutions winning out. The good news is “serious” people are going on “serious” platforms and actually talking about x-risk. Other good news is that, from current polls, people are very receptive to concerns over x-risk, and it has not currently fallen along divisive lines (roughly the same % of those concerned fall equally among various different demographics). This is still a difficult minefield to navigate. Polarization could still happen, especially with an Election Year in the US looming. I’ve also been talking to a lot of young people who feel frustrated not having anything actionable to do, and if those in AI Safety don’t show leadership, we might risk (and indeed are already risking) many frustrated youth taking political and social action into their own hands. We need to be aware that EA/LW might have an Ivory Tower problem, and that, even though a pragmatic, strategic, and careful course of action might be better, this might make many feel “shut out” and attempt to steer their own course. Finding a way to make those outside EA/LW/AIS feel included, with steps to help guide and inform them, might be critical to avoiding movement hijacking.
On Capabilities vs. Alignment Research:
2022 Me: While I strongly agree that not increasing capabilities is a high priority right now, I also question if we risk creating a state of inertia. In terms of the realms of safety research, there are very few domains that do not risk increasing capabilities research. And, while capabilities continues to progress every day, we might risk failing to keep up the speed of safety progress simply because every action risks an increase in capabilities. Rather than a “do no harm” principle, I think counterfactuals need to be examined in these situations, where we must consider if there is a greater risk if we *don’t* do research in a certain domain.
2023 Me: Oh, oh, oh! I think Me’22 was actually ahead of the curve on this one. This might still be controversial, but I think many got the “capabilities space” wrong. Many AIS-inspired approaches that could have increased capabilities were for systems that would be safer, more interpretable, and easier to monitor by default. And by not working on such systems, we instead got the much more inscrutable, dangerous models by default, because the more dangerous models are easier. To quote the vape commercials, “safer != safe”, but I still quit smoking in favor of the electronic kind, because safer is still at least safer. This is probably a moot point now, though, since I think it’s likely too late to create an entirely new paradigm in AI architectures. Hopefully Me’24 will be happy to tell me we found a 100% safe and effective new paradigm that everyone’s hopping on. Or maybe he’ll invent it.
Going to the moon
Say you’re really, really worried about humans going to the moon. Don’t ask why, but you view it as an existential catastrophe. And you notice people building bigger and bigger airplanes, and warn that one day, someone will build an airplane that’s so big, and so fast, that it veers off course and lands on the moon, spelling doom. Some argue that going to the moon takes intentionality. That you can’t accidentally create something capable of going to the moon. But you say “Look at how big those planes are getting! We've gone from small fighter planes, to bombers, to jets in a short amount of time. We’re on a double exponential of plane tech, and it's just a matter of time before one of them will land on the moon!”
Contra Scheming AIs
There is a lot of attention on mesaoptimizers, deceptive alignment, and inner misalignment. I think a lot of this can fall under the umbrella of "scheming AIs": AIs that either become dangerous during training and escape, or else play nice until humans make the mistake of deploying them. Many have spoken about the lack of any indication that there's a "homunculus-in-a-box", and this is usually met with arguments that we wouldn't see such things manifest until AIs reach a certain level of capability, at which point it might be too late, with comparisons made to owl eggs or baby dragons. My perception is that getting something like a "scheming AI" or "homunculus-in-a-box" isn't impossible, and we could (and might) develop the means to do so in the future, but that it's a very, very different kind of thing from current models (even at superhuman level), and that it would take a degree of intentionality.
But you say “Look at how big those planes are getting! We’ve gone from small fighter planes, to bombers, to jets in a short amount of time. We’re on a double exponential of plane tech, and it’s just a matter of time before one of them will land on the moon!”
...And they were right? Humans did land on the moon roughly on that timeline (and as I recall, there were people before the moon landing at RAND and elsewhere who were extrapolating out the exponentials of speed, which was a major reason for ill-fated projects like the supersonic interceptors for Soviet bombers), and it was a fairly seamless set of s-curves, as all of the aerospace technologies were so intertwined and shared similar missions of 'make stuff go fast' (eg. a rocket engine could power a V-2, or it could power a Me 163 instead). What is a spy satellite but a spy plane which takes one very long reconnaissance flight? And I'm sure you recall what the profession of almost all of the American moon landers was before they became astronauts - plane pilots, usually military.
And all of this happened with minimal intentionality up until not terribly long before the moon landing happened! Yes, people like von Braun absolutely intended to go to the moon (and beyond), but those were rare dreamers. Most people involved in building all of those capabilities that made a moon mission possible had not the slightest intent of going to the moon - right up until Kennedy made his famous speech, America turned on a dime, and, well, the rest is history.
It is said that in long-term forecasting, it is better to focus on capabilities than intentions... And intentions have never been more mutable, and more irrelevant on average, than with AIs.
(“If your solution to some problem relies on ‘If everyone would just…’ then you do not have a solution. Everyone is not going to just. At no time in the history of the universe has everyone just, and they’re not going to start now.”)
It seems pretty likely to me that current AGIs are already scheming. At least it seems like the simplest explanation for things like the behavior observed in this paper: https://www.alignmentforum.org/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through
I assume that by scheming you mean ~deceptive alignment? I think it's very unlikely that current AIs are scheming and I don't see how you draw this conclusion from that paper. (Maybe something about the distilled CoT results?)
The best definition I would have of "scheming" would be "the model is acting deceptively about its own intentions or capabilities in order to fool a supervisor" [1]. This behavior seems to satisfy that pretty solidly:
Of course, in this case the scheming goal was explicitly trained for (as opposed to arising naturally out of convergent instrumental power drives), but it sure seems to me like it's engaging in the relevant kind of scheming.
I agree there is more uncertainty and lack of clarity on whether deceptively-aligned systems will arise "naturally", but the above seems like a clear example of someone artificially creating a deceptively-aligned system.
Joe Carlsmith uses "whether advanced AIs that perform well in training will be doing so in order to gain power later", but IDK, that feels really underspecified. Like, there are just tons of power-seeking reasons for an AI to perform well in training, and when I read the rest of the report it seems like Joe was analyzing it more through the deception-of-supervisors lens.
I agree current models sometimes trick their supervisors ~intentionally and it's certainly easy to train/prompt them to do so.
I don't think current models are deceptively aligned, and I think deceptive alignment would pose substantial additional risk.
I personally like Joe's definition and it feels like a natural category in my head, but I can see why you don't like it. You should consider tabooing the word scheming or saying something more specific as many people mean something more specific that is different from what you mean.
Yeah, that makes sense. I've noticed miscommunications around the word "scheming" a few times, so am in favor of tabooing it more. "Engage in deception for instrumental reasons" seems like an obvious extension that captures a lot of what I care about.
If progress in AI is continuous, we should expect record levels of employment. Not the opposite.
My mentality is if progress in AI doesn't have a sudden, foom-level jump, and if we all don't die, most of the fears of human unemployment are unfounded... at least for a while. Say we get AIs that can replace 90% of the workforce. The productivity surge from this should dramatically boost the economy, creating more companies, more trading, and more jobs. Since AIs can be copied, they would be cheap, abundant labor. This means anything a human can do that an AI still can't becomes a scarce, highly valued resource. Companies with thousands or millions of AI instances working for them would likely compete for human labor, because making more humans takes much longer than making more AIs. Then say, after a few years, AIs are able to automate 90% of the remaining 10%. Then that creates even more productivity, more economic growth, and even more jobs. This could continue for even a few decades. Eventually, humans will be rendered completely obsolete, but by that point (most) of them might be so filthy rich that they won't especially care.
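The compounding effect described above can be made concrete with a toy loop (the 90%-per-wave automation figure is just the paragraph's illustrative number, not a forecast):

```python
# Each automation wave removes 90% of the tasks that are still
# human-only, so the human-only share shrinks geometrically while
# total productivity (very roughly) grows with each wave.
human_share = 1.0
history = []
for wave in range(3):
    human_share *= 0.10  # 90% of the remaining tasks get automated
    history.append(human_share)
# After three waves only 0.1% of the original tasks remain human-only,
# yet each wave expands the economy that bids for that scarce labor.
```

The point of the sketch is that the human-only remainder approaches zero geometrically rather than hitting it in one jump, which is the window in which human labor stays scarce and valued.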
This doesn't mean it'll all be smooth-sailing or that humans will be totally happy with this shift. Some people probably won't enjoy having to switch to a new career, only for that new career to be automated away after a few years, and then have to switch again. This will probably be especially true for people who are older, those who have families, want a stable and certain future, etc. None of this will be made easier by the fact it'll probably be hard to tell when true human obsolescence is on the horizon, so some might be in a state of perpetual anxiety, and others will be in constant denial.
The inverse argument, which I have seen on reddit, arises if you examine how these AI models might work and learn.
One method is to use a large benchmark of tasks, where model capability is measured as the weighted harmonic mean of performance across all tasks.
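As a sketch of what that aggregation could look like (the function name and the example scores are hypothetical, not drawn from any real benchmark suite):

```python
# Minimal sketch of scoring a model as the weighted harmonic mean of
# its per-task benchmark scores. The harmonic mean penalizes doing
# badly on any heavily weighted task, unlike a plain average.
def weighted_harmonic_mean(scores, weights):
    if any(s <= 0 for s in scores):
        raise ValueError("harmonic mean requires positive scores")
    return sum(weights) / sum(w / s for s, w in zip(scores, weights))

# Hypothetical per-task scores for chat, coding, and robotics tasks.
capability = weighted_harmonic_mean([0.9, 0.5, 0.8], [2.0, 1.0, 1.0])
```

A single weak task drags the result well below the weighted arithmetic mean, which is the usual motivation for choosing the harmonic mean as the aggregate.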
As the models run, much of the information gained doing real world tasks is added as training and test tasks to the benchmark suite. (You do this whenever a chat task has an output that can be objectively checked, and for robotic tasks you run in lockstep a neural sim similar to Sora that makes testable predictions for future real world input sets)
What this means is most models learn from millions of parallel instances of themselves and other models.
This means that the more models are deployed in the world (the more labor is automated), the more this learning mechanism gets debugged, the faster models learn, and so on.
There are also all kinds of parallel task gains. For example, once models have experience maintaining the equipment in a coke-can factory, an auto plant, and a 3D-printer plant, this variety of tasks with common elements should cause new models trained in sim to gain "general maintenance" skills, at least for machines similar to the three given. (The "skill" is developing a common policy network that compresses the 3 similar policies down to 1 policy on the new version of the network.)
With each following task, the delta - the skills the AI system needs to learn that it doesn't already know - shrinks. This learning requirement likely shrinks faster than task difficulty grows. (Since the most difficult tasks are still doable by a human, and the AI system is also able to cheat in a bunch of ways, for example using better actuators to make skilled manual trades easy, or software helpers to best champion Olympiad contestants.)
You then have to look at what barriers there are to AI doing a given task, to decide which tasks are protected for a while.
Things that just require a human body to do: Medical test subject. Food taster, perfume evaluator, fashion or aesthetics evaluator. Various kinds of personal service worker.
AI Supervisor roles: Arguably, checking that the models haven't betrayed us yet, and sanity-checking plans and outputs, seems like it would be a massive source of employment.
AI developer roles: the risks mean that some humans need to have a deep understanding of how the current generation of AI works, and the tools and time to examine what happened during a failure. Someone like this needs to be skeptical of an explanation given by another AI system, for the obvious reasons.
Government/old institution roles: institutions that don't value making a profit may continue using human staff for decades after AI can do their jobs, even when it can be shown that AI makes fewer errors and more legally sound decisions.
TLDR: Arguably for the portion of jobs that can be automated, the growth rate should be exponential, from the easiest and most common jobs to the most difficult and unique ones.
There is a portion of tasks that humans are required to do for a while, and a portion that it might be a good idea never to automate.
The following is a conversation between my 2022 self and a newer version of me from earlier this year.
On the Nature of Intelligence and its "True Name":
2022 Me: This has become less obvious to me as I’ve tried to gain a better understanding of what general intelligence is. Until recently, I always assumed that intelligence and agency were the same thing. But General Intelligence, or G, might not be agentic. RL-style agents may only be narrow forms of intelligence, without generalizability. G might be something closer to a simulator. From my very naive perception of neuroscience, it could be that our intelligence is not agentic, but merely simulates agents. In this situation, the prefrontal cortex not only runs simulations to predict its next sensory input, but might also run simulations to predict inputs from other parts of the brain. In this scenario, “desire” or “goals” might be simulations to better predict narrowly-intelligent agentic optimizers. Though the simulator might be myopic, I think this prediction model allows for non-myopic behavior, in a similar way that GPT has non-myopic behavior despite only trying to predict the next token (it has an understanding of where a future word “should” be within the context of a sentence, paragraph, or story). I think this model of G allows for the appearance of intelligent goal-seeking behavior, long-term planning, and self-awareness. I have yet to find another model for G that allows for all three. The True Name of G might be Algorithm Optimized To Reduce Predictive Loss.
2023 Me: Interesting, Me’22, but let me ask you something: you seem to think this majestic ‘G’ is something humans have but other species do not, and then you name the True Name of ‘G’ to be Algorithm Optimized To Reduce Predictive Loss. Do you *really* think other animals don’t do this? How long is a cat going to survive if it can’t predict where it’s going to land? Or where the mouse’s trajectory is heading? Did you think it was all somehow hardcoded in? But cats can jump up on tables, and those weren’t in the ancestral environment; there’s clearly some kind of generalized form of prediction occurring. Try formulating that answer again, but taboo “intelligence”, “G”, “agent”, “desire”, and “goal”. I think the coherence of it breaks down.
Now, what does Me’23 think? Well, I’m going to take a leaf from my own book, and try to explain what I think without the words mentioned above. There are predictive mechanisms in the Universe that can run models of what things in the Universe might do in future states. Some of these predictive mechanisms are more computationally efficient than others. Some will be more effective than others. A more effective and efficient predictive mechanism, with a large input of information about the Universe, could be a very powerful tool. If taken to the theoretical (not physical) extreme, that predictive mechanism would hold models of all possible future states. It could then, by accident or intention, guide outcomes toward certain future states over others.
2022 Me: According to this model, humans dream because the simulator is now making predictions without sensory input, gradually creating a bigger and bigger gap from reality. Evidence to support this comes from sensory-deprivation tanks, where humans, despite being awake, have dream-like states. I also find it interesting that people who exhibit schizophrenia, which involves hallucinations (as dreams do), can tickle themselves. Most people can be tickled by others, but not by themselves. But normal people on LSD can do this, and can also have hallucinations. My harebrained theory is that something is going wrong when initializing new tokens for the simulator, which results in hallucinations from the lack of correction from sensory input, and a weaker sense of self because of a lack of correction from RL agents in other parts of the brain.
2023 Me: I don’t want to endorse crackpot theories from Me’22, so I’m just going to speak from feelings and fuzzy intuitions here. I will say hallucinations from chatbots are interesting. When getting one to hallucinate, it seems to be kind of “making up reality as it goes along”. You say it’s a Unicorn, and it will start coming up with explanations for why it’s a Unicorn. You say it told you something you never told it, and it will start acting as though it did. I have to admit it does have a strange resemblance to dreams. I find myself in New York, but remember that I had been in Thailand that morning, and I hallucinate a memory of boarding a plane. I wonder where I got the plane ticket, and I hallucinate another memory of buying one. These are not well-reasoned arguments, though, so I hope Me’24 won’t beat me up too much about them.
2022 Me: I have been searching for how to test this theory. One interest of mine has been mirrors.
2023 Me: Don’t listen to Me’22 on this one. He thought he understood something, but he didn’t. Yes, the mirror thing is interesting in animals, but it’s probably a whole different thing, not the same thing.
A man asked one of the members of his tribe to find him some kindling so that he might start a fire. A few hours passed, and the second man returned, walking with a large elephant.
“I asked for kindling,” said the first.
“Yes,” said the second.
“Where is it?” asked the first, trying to ignore the large pachyderm in the room.
The second gestured at the elephant, grinning.
“That’s an elephant.”
“I see that you are uninformed. You see, elephants are quite combustible, despite their appearance. Once heat reaches the right temperature, its skin, muscles, all of it will burn. Right down to its bones.”
“What is the ignition temperature for an elephant?”
“I don’t know, perhaps 300-400°C.”
The first held up two stones.
“This is all I have to start a fire,” he said. “It will only create a few sparks at best… I’m not even sure how I can get it to consistently do that much, given how hard this will be for people thousands of years from now to replicate.”
“That is the challenge.” The second nodded solemnly. “I’m glad you understand the scope of this. We will have to search for ways to generate sparks at 400° so that we can solve the Elephant Kindling Problem.”
“I think I know why you chose the elephant. I think you didn’t initially understand that almost everything is combustible, but only notice that things are combustible once you pay enough attention to them. You looked around the savanna, and didn’t understand that dry leaves would be far more combustible, and your eyes immediately went to the elephant. Because elephants are interesting. They’re big and have trunks. Working on an Elephant Problem just felt way more interesting than a Dry Leaves Problem, so you zeroed all of your attention on elephants, using the excuse that elephants are technically combustible, failing to see the elegant beauty in the efficient combustibility of leaves and their low ignition temperature.”
“Leaves might be combustible. But think of how fast they burn out. And how many you would have to gather to start a fire. An elephant is very big. It might take longer to get it properly lit, but once you do, you will have several tons of kindling! You could start any number of fires with it!”
“Would you have really made these conclusions if you had searched all the possible combustible materials in the Savana, instead of immediately focusing on elephants?”
“Listen, we can’t waste too much time on search. There are thousands of things in the savanna! If we tested the combustibility and ignition temperature of every single one of them, we’d never get around to starting any fires. Are elephants the most combustible things in the Universe? Probably not. But should I waste time testing every possible material instead of focusing on how to get one material to burn? We have finite time, and finite resources to search for combustible materials. It’s better to pick one and figure out how to do it well.”
“I still think you only chose elephants because they’re big and interesting.”
“I imagine that ‘big’ and ‘useful as kindling material’ are not orthogonal. We shouldn’t get distracted by the small, easy problems, such as how to burn leaves. These are low hanging fruit that anyone can pick. But my surveys of the tribe have found that figuring out the combustibility of elephants remains extremely neglected.”
“What about the guy who brought me a giraffe yesterday?”
“A giraffe is not an elephant! I doubt anything useful will ever come from giraffe combustibility. Their necks are so long that they will not even fit inside our caves!”
“What I am saying is that others have brought in big, interesting-looking animals, and tried to figure out how to turn them into kindling. Sure, no one else is working on the Elephant Kindling Problem. But that’s also what the guy with the giraffe said, and the zebra, and the python.”
“Excuse me,” said a third, poking his head into the cave, “but the Python Kindling Problem is very different from the Elephant one. Elephants are too girthy to be useful. But with a python, you can roll it into a coil, which will make it extremely efficient kindling material.”
The second scratched his chin for a moment, looking a bit troubled.
“What if we combined the two?” he asked. “If we wound the python around a leg of the elephant, the heat could be transferred somewhat efficiently.”
“No, no, no,” argued the third. “I agree combining these two problems might be useful. But it would be far better to just cut the trunk off the elephant and intertwine it with the python. This could be very useful, since elephant hide is very thick and might burn slower. This gives us the pros of a fast-burning amount of kindling, mixed with a more sustained blaze from the elephant.”
“Might I interject,” said a fourth voice, who had been watching quietly from the corner but now stepped forward. “I have been hard at work on the Giraffe Kindling Problem, but I think we are actually working on similar things. The main issue has always been the necks. They simply won’t fit inside the cave. We need a solution that works in all edge cases, after all. If it’s raining, we can’t start a fire outside. But if we use the python and the elephant trunk to tie the neck of the giraffe against the rest of its body, we could fit the whole thing in!”
“I think this will be a very fruitful collaboration,” said the second. “While at first it seemed as though we were all working on different problems, it turns out that by combining them, we have found an elegant solution.”
“But we still can’t generate sparks hot enough to combust any of them!” shouted the first. “All you’ve done is make this even more complicated and messy!”
“I am aware it might seem that way to a novice,” said the second. “But we have all gained great knowledge in our own domains. And now it is time for our fields to evolve into a true science. We are not amateurs anymore, simply playing around with fire. We are now establishing expertise, creating sub-domains, arriving at a general consensus on the problem and its underlying structure! To an outsider, it will probably look daunting. But so does every scientific field once it matures. And we will continue to break new ground by standing on the shoulders of elephants!”
“Giraffes,” corrected the fourth.
“Zebras,” answered a fifth.