We are about to see what looks like a substantial leap in image models. OpenAI will be integrating Dalle-3 into ChatGPT, the pictures we’ve seen look gorgeous and richly detailed, with the ability to generate pictures to much more complex specifications than existing image models. Before, the rule of thumb was you could get one of each magisteria, but good luck getting two things you want from a given magisteria. Now, perhaps, you can, if you are willing to give up on adult content and images of public figures since OpenAI is (quite understandably) no fun.
We will find out in a few weeks, as it rolls out to ChatGPT+ users.
As usual a bunch of other stuff also happened, including a model danger classification system from Anthropic, OpenAI announcing an outside red teaming squad, a study of AI impact on consultant job performance, some incremental upgrades to Bard including an extension for GMail, new abilities to diagnose medical conditions and some rhetorical innovations.
Also don’t look now but GPT-3.5-Turbo-Instruct plays Chess at 1800 Elo, and due to its relative lack of destructive RLHF seems to offer relatively strong performance at a very low cost and very high speed, although for most purposes its final quality is still substantially behind GPT-4.
Table of Contents
Language Models Offer Mundane Utility
Diagnose eye diseases. This seems like a very safe application even with false positives, humans can verify anything the AI finds.
Diagnose foetal growth restrictions early.
Strip the words out of a webpage with the new ‘reading mode’ in Android or Chrome, which in theory and technically uses graph neural networks, and get them in an actually readable size and font, much more accurately than older attempts. Seems you have to turn it on under chrome flags.
GPT-4 showing some solid theory of mind in a relatively easy situation. Always notice whether you are finding out it can do X consistently, can do X typically, or can do X once with bespoke prompting.
The same with failure to do X. What does it mean that a model would ever say ~X, versus that it does all the time, versus it does every time? Each is different.
How to convince people who are unimpressed by code writing that LLMs are not simply parrots? Eliezer asked on Twitter, and said this was somehow the best answer he got so far, yet it would not convince me at all if I wasn’t already convinced. There are a lot of answers, none of them seem like good convincers.
Latest sharing of custom instructions for GPT-4, from 0.005 Seconds:
And one by maxxx:
Ethan Mollick offers new working paper on the value of AI for the future of work.
The abstract echoes what we have often seen.
The AI quality and quantity differences were big, prompt engineering advice made a small additional difference. AI meant 12% more tasks, 25% faster work and 40% higher quality (whatever that last one means on some scale).
What tasks were selected?
Standard consultant tasks all, I suppose. No idea why we need an inspirational memo from our consultant, or why anyone ever needs an inspirational memo from anyone, yet still the memos come.
Everyone can write those memos now. Not only is the old bottom half now performing above the old top half, the difference between the top and bottom halves shrunk by 78%.
This requires a strong overlap between the differentiating skills of high performers and the abilities of GPT-4. Performing well as a consultant must mean some combination of things like performing class and translating knowledge between forms, working efficiently, matching the spec, being what I might call ‘uncreatively creative’ by coming up with variations on themes, and so on.
There is danger. When the selected task was one humans normally do well at, but the AI was known to give a convincing but wrong answer, those using the AI did worse.
Notice that the overview session cut the damage by 43%.
Yes, if the AI is going to get a given question convincingly wrong, it is going to undercut performance on that task, especially for those who don’t know what to watch out for. That is not the default failure mode for AI. The default failure mode, for now, is highly unconvincing failure. You ask for an answer, it spits out obvious nonsense, you get annoyed and do it yourself.
Put these together, and we should expect that with experience AI stops being a detriment on most difficult tasks. Users learn when and how to be skeptical, the task of being skeptical is usually not so difficult, and the AI remains helpful with subtasks.
Ethan’s analysis of successful AI use says that there are two methods. You can be a Centaur, outsourcing subtasks to AI when the AI is capable of them, doing the rest yourself. Or you can be a Cyborg, who intertwines their efforts with the AI. I am not convinced this is a meaningful difference, at most this seems like a continuum.
Language Models Don’t Offer Mundane Utility
David Chapman in response to Mollick asks the philosophical question, what use is he who does the task that should never have been done at all?
That is a fascinating contrast. Is evaluating people a bullshit task? Like most bullshit, the ‘real work’ that needs doing could be done without it. Yet like much bullshit, without it, the works will fail to properly function. In this case, the evaluation was an estimation of intelligence. Which is a task that I would have expected an LLM to be able to do well as a giant correlation machine. Perhaps this ability is the kind of thing that ‘alignment’ is there to remove.
I do agree that using a consulting firm is stacking the deck in GPT-4’s favor. An LLM is in many ways an automated consulting machine, with many of the same advantages and disadvantages. Proper formatting and communications? Check. A list of all the obvious things you should have already been doing, the ideas you should have already had, and social proof to convince people to actually do it? Right this way. Expert understanding and brilliant innovations? Not so much.
Paper worries LLMs too often give the same ‘correct’ answers to questions, and this might narrow diversity of thought.
As I understand it, there are two distinct drivers of this uniformity.
One is that models have a natural bias where more likely outcomes get overweighted, and less likely outcomes get underweighted. This is the ‘bad’ type of bias, as they reference on page 24, you get 98% (versus the real average of 78%) of software engineers described as male, 98% of executive assistants (instead of the real average of 89%) as female, and so on.
The paper says ‘no one understands this’ but it seems pretty obvious to me? You get scored poorly when you present ‘unlikely’ outcomes, you get higher feedback marks when you stick to the more likely case, so you end up trained to do the more likely case most of the time.
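To illustrate the mechanism with a toy model (numbers of my own invention, not the paper’s): start with a model whose underlying probabilities match the real 78/22 base rate, then sharpen the distribution toward the modal answer, whether via training feedback or low-temperature sampling, and 78% quickly becomes ‘essentially always.’

```python
import numpy as np

# Toy illustration only: a model whose underlying probabilities match the real
# 78% / 22% base rate, with varying amounts of sharpening toward the modal answer.
base_rates = np.array([0.78, 0.22])   # [majority, minority]
logits = np.log(base_rates)

def majority_share(temperature: float) -> float:
    """Share of outputs that are the majority class after temperature scaling."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return float(probs[0])

for t in (1.0, 0.7, 0.3):
    print(f"temperature {t}: majority answer {majority_share(t):.0%} of the time")
# 1.0 -> 78%, 0.7 -> ~86%, 0.3 -> ~99%: mild sharpening turns a base rate into near-unanimity.
```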
The other type of lack of diversity is that we actively want the LLM to never give a ‘wrong’ answer in various ways, including many ‘wrong’ answers that are rather popular with humans, so we train those behaviors out of it. The LLM can’t ever be saying something racist or otherwise horrible, which is quite explicitly a narrowing of diversity of thought.
The paper also trots out (on pages 25-26) a litany of standard examples of systematic discrimination by LLMs against traditionally disfavored groups, which also seems rather well understood, as some combination of (1) picking up the biases repeated endlessly in the training data and (2) picking up on statistical correlations that we as a society have decided (usually quite reasonably) it is imperative to ignore, but which the LLM’s scoring benchmark rewards it for not ignoring. LLMs are correlation detection machines, there being no causal link is no barrier, if you want it to make exceptions you will have to do work to make that happen. This is discussed in the paper under the ‘correct answer’ effect in 4.1, but seems to me like a distinct problem.
The authors then speculate in 4.2, as Tyler Cowen highlighted in his reaction, about how the authors believe GPT-3.5 might be systematically conservative biased, which is not what others have found, and is not what conservatives or liberals seem to observe when using it. Their evidence is that GPT-3.5 systematically gives conservative-level weight to considerations of authority, purity and loyalty, even when simulating liberals. Tyler speculates that this is because many liberals do have such conservative moral foundations, in their own way. I would add that, if this were true, given who actually uses ChatGPT, then it would be a sign of increased diversity of thought.
What exactly is the effect anyway? Here are the big response shifts:
Note that the expensive item question is GPT-3.5 simply being right and overcoming a well-known bias, the original sample is acting crazy here. Whereas in the last two questions, we see a pure example of straight up Asymmetric Justice that is taken to its logical extreme rather than corrected.
Mostly I think that going to a default in many cases of 99% ‘correct’ answers is fine. That is the purpose of the LLM, to tell you the ‘correct’ answer, some combination of logically and socially and probabilistically correct. There are particular cases where we might want to object. In general, that’s the question you are mostly asking. If you want a diversity of responses, you can ask for that explicitly, and you will often get it, although there are cases where we intentionally do not allow that, calling such a diversity of response ‘unaligned’ or ‘harmful.’ In which case, well, that’s a choice.
A fun anecdote from a post discussed above, so odd a blind spot to still have.
Matthew Yglesias continues to fail to extract mundane utility.
Nassim Nicholas Taleb, very on brand, takes it further and actively seeks out disutility.
Figuring out how to break something, or profit from it or make it do something stupid or otherwise exploit it, is indeed often an excellent tool for understanding it. Other times, you need to be careful, as answers like ‘with a hammer’ are usually unhelpful.
The statistical similarities note is on point. That is how you trick the LLM. That is also how the LLM tricks you. Checking for statistical similarities and correlations, and checking if they could be responsible for an answer, is good practice.
Of the people who are complaining GPT-4 has gotten worse, perhaps a lot of that is that they are using the iOS app, where it has what is perhaps a deliberately anti-useful system prompt?
Level Two Bard
Jack Krawczyk proudly announces the best Bard yet.
Jack:
I mean, great, sure, that’s all good, but it does not solve the core problem with Bard. Only Gemini can solve the core problem with Bard.
It is also amusing that Google knows how to spot some of the time Bard is wrong, and responds with colored warnings of the error rather than fixing it.
Kevin Roose tries out the new extensions, is impressed in theory but not in practice.
I do not want to downplay the extent to which Bard sucks, but of all the places to hallucinate this seems completely fine. It is your history, having a substantial error rate is survivable because you should be good at spotting errors. If anything, I would be much more interested in not missing anything than in not hallucinating. I can check for errors, but I am asking Bard because I can’t remember or can’t find the thing.
However, Bard literally keeps hallucinating about Google’s own interface settings, when I ask it about how to access the new Bard extensions. And, I mean, come on.
Wouldn’t You Prefer a Good Game of Chess?
It seems GPT could always play chess, if I played against it we would have a good game, except with previous attempts RLHF got in the way?
Want to try it out? Will Depue built a quick GitHub you can use or you can go to the app here (requires entering your API key).
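If you would rather not hand your key to someone else’s app, the underlying trick is simple enough to sketch. The framing and parameters below are my guesses at a minimal version, not Will’s actual code; it assumes the openai Python package (1.0 or later) and an OPENAI_API_KEY in the environment, and simply asks the model to continue a PGN transcript.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def next_move(pgn_so_far: str) -> str:
    """Ask gpt-3.5-turbo-instruct to continue a PGN transcript with its next move."""
    prompt = (
        '[Event "Casual Game"]\n[White "Human"]\n[Black "GPT"]\n\n'
        + pgn_so_far.strip() + " "
    )
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=8,   # one move in standard algebraic notation
        temperature=0,
    )
    return resp.choices[0].text.strip().split()[0]

print(next_move("1. e4 e5 2. Nf3 Nc6 3. Bb5"))  # something like "a6"
```

You would still want a legality check (python-chess makes that easy) before trusting the move, but that is the whole idea: no chat wrapper, just completion of a game transcript.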
The FUD will continue, of course. Evidence does not make such things stop. Arguments do not make such things stop. A good rule of thumb is that any time anyone says ‘the debate is over’ in any form, both in general and in particular when it comes to AI, that they are wrong. Being clearly wrong is not a hard barrier to opinion.
GPT-4 Real This Time
GPT-3.5-Turbo-Instruct is the latest hotness. You get cheaper and faster, and you get far less intentional crippling than we see in GPT-4, and it seems for a lot of purposes that’s actually really good? Here it is matching GPT-4 on precision, recall and F1, although it does far worse on some other questions.
For example, here’s Amjad Masad, the CEO of Replit, reporting on their test:
Fun with Image Generation
Dalle-3 is here. Rolling out early October to ChatGPT+ users. Let’s play.
All gorgeous, although highly hand picked, so we cannot fully be sure quite yet.
What is most exciting is that you can use natural language, and much better match exact requests.
This all looks super exciting, and I continue to be unafraid of image models. Lots more images at the link. The policy for artists is that you can opt-out of your images being used for training, but you have to actively do that.
What are the limits to your fun?
A shame.
What about quantity? Models like this have a tendency to default to giving you one to four outputs at once. One of the things I most love about hosting image generation locally is that I can and do queue up 100 or more versions at once, then come back later to sort through the outputs. That is especially important when trying something complicated or when you are pushing your model’s size limitations, where the AI is likely to often botch the work.
A combined presumed refusal to do ‘adult content’ and the inability to show anyone by name, and presumably a reluctance to give you too many responses at once, and what I assume is the inability to use a LoRA for customization, still leaves a lot of room for models that lack those limitations, even if they are otherwise substantially worse. We shall see how long it takes before the new follow-complicated-directions abilities get duplicated under open source ‘unsafe’ conditions. My default guess would be six months to a year?
The Information frames this as OpenAI beating Google to the punch on multimodal.
The actual alignment issue will be adversarial attacks. If we are allowed to input images, there are attacks using that vector that lack any known defenses. This is widely speculated to be why GPT-4 was released without such capabilities in the first place. We will see how they have dealt with this.
Caplan awards his illustration prize to Erickson L, final work looks great, and it turns out humans still have the edge. For now. Some entries used AI, the two I liked the best including the winner did not. Saniya Suriya I thought had the best AI entry. The process for getting good AI looks highly finicky and fiddly. Photoshop was commonly used to deal with unwanted objects.
AI appears to create something in at least somewhat of a new style, a kind of Escher derivative perhaps.
More likely, they say the creativity came from MrUgleh. Which it did, in many important senses, amazing work. I’m confused by the prompt not having at least a weird trick in it, seems like you should have to work harder for this?
Another technique AI is excellent at:
Stable Diffusion AI Art: step back until you see it… #aiart #sdxl
Deepfaketown and Botpocalypse Soon
Amazon limits book authors to self-publishing three books per day. I suppose this is not too annoying to real authors that often. AI authors who are churning out more than three nonsense books per day will have to use multiple accounts. Why not instead not have a limit and use it as an alarm bell? As in, if you publish 34 books in a day, let’s do a spot check, if it’s not your back catalogue then we know what to do.
It really is crazy that it remains entirely on us to deal with pure commercial speech invading our communication channels. AI is going to make this a lot less tolerable.
As usual, why tort when you can tax? It is periodically suggested, never implemented. If you use my communication channel, and I decide you wasted my time, you can push a button. Then, based on how often the button gets pushed, if this happens too often as a proportion of outreach, an escalating tax is imposed. So a regular user can occasionally annoy someone and pay nothing, but a corporation or scammer that spams, or especially mass spams using AI, owes big, and quickly gets blacklisted from making initial contacts without paying first.
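The whole mechanism fits in a few lines. A sketch, with thresholds and fees entirely made up for illustration:

```python
def spam_charge(messages_sent: int, waste_reports: int,
                free_ratio: float = 0.02, base_fee: float = 0.10) -> float:
    """Escalating charge based on how often recipients hit the 'wasted my time' button.

    Illustrative numbers only: report rates under free_ratio cost nothing, beyond
    that each report costs base_fee, scaled up the further the sender is over the
    line, so mass AI spam gets expensive fast while regular users pay nothing.
    """
    if messages_sent == 0:
        return 0.0
    report_rate = waste_reports / messages_sent
    if report_rate <= free_ratio:
        return 0.0
    excess = report_rate - free_ratio
    return waste_reports * base_fee * (1 + 100 * excess)

print(spam_charge(50, 1))            # regular user, one annoyed recipient: 0.0
print(spam_charge(100_000, 30_000))  # AI spam blast, 30% report rate: 87000.0
```

Blacklisting from initial contact is then just a threshold on the same report rate.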
Get Involved
Lilian Weng’s safety team at OpenAI is hiring (Research Scientist, Safety and Machine Learning Engineer, Safety). This is not the superalignment team, and these are not entry-level positions. It looks like they will be doing mundane safety. That still includes model evaluation criteria and audits, along with various alignment techniques, including extracting intent from fine tuning data sets. It also means doing work that will enhance the commercial value of, and thus investment in the development of, frontier models.
Thus, the sign of your impact on this team is non-obvious. If you do consider working on such things, form your own opinion about that. If you do apply, use the opportunity to investigate such questions further.
Introducing
OpenAI Red Team Network. Members will be given opportunities to test new models, or test an area of interest in existing models. Apply until December 1. One note is that they are using ‘how much can this AI trick or convince another AI’ as a key test, as well as how well AIs can pass hidden messages to each other without being detected.
Anthropic publishes its Responsible Scaling Policy (RSP).
ASL-1 is harmless.
ASL-2 is mostly harmless, due to a combination of unreliability and alternative sources of similar information. All current LLMs are considered ASL-2.
ASL-3 is dangerous, increasing risk of catastrophic misuse versus a non-AI baseline, or showing low-level autonomous capabilities.
That does seem like the right threshold under any reasonable meaning of catastrophic, so long as it is understood that once found no patch can address the issue. The next level of models may or may not reach ASL-3. My guess is a 4.5-level model mostly wouldn’t count, a 5-level model mostly would count.
There is no clear definition for ASL-4 or higher, except implicitly from the definitions of the first three. ASL-4 is also, not coincidentally, a reasonable stand-in for ‘creating this without knowing exactly what you are doing, how you are doing it and who is going to access it for what, seems like rather a horrible idea.’ The standards for such models are not written yet.
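Purely as a restatement of the definitions above (my paraphrase, not Anthropic’s wording or anything from the RSP itself), the taxonomy is simple enough to write down:

```python
from enum import IntEnum

class ASL(IntEnum):
    """Paraphrase of the AI Safety Levels as described above; not Anthropic's official text."""
    ASL_1 = 1  # harmless
    ASL_2 = 2  # mostly harmless: unreliable, and similar info is available elsewhere anyway
    ASL_3 = 3  # dangerous: meaningful catastrophic-misuse uplift over a non-AI baseline,
               # or low-level autonomous capabilities
    ASL_4 = 4  # not yet defined, beyond showing up implicitly as 'worse than ASL-3'

def classify(misuse_uplift_over_baseline: bool, low_level_autonomy: bool) -> ASL:
    """Toy classifier for the stated ASL-3 trigger; the real evaluations are far richer."""
    if misuse_uplift_over_baseline or low_level_autonomy:
        return ASL.ASL_3
    return ASL.ASL_2
```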
As with many such things, this all seems good as far as it goes, but punts on quite a lot of the most important questions.
Claim that a small group already has access to Gemini.
Ebay feature to allow generating an entire product listing from a photo, including titles, descriptions, pricing and more.
Helsing, a European AI defense startup raising $223 million at a $1.7 billion valuation. They aim to ‘boost defense and security for democracies,’ working with governments including Germany, and are active in Ukraine. They’re strictly working for the good guys, so it’s fine, right?
DeepMind introduces and widely releases for free AlphaMissense, a new AI tool classifying the effects of 71 million ‘missense’ mutations, predicting about a third of them would be dangerous. Most have never been seen in humans. On those that have, this modestly outperforms existing classification methods.
In Other AI News
Politico’s Laurie Clarke chronicles the shift of UK policy around AI to actually worrying about existential risk, which they say is heavily influenced by EAs, Ian Hogarth’s article in the Financial Times and the pause letter.
I want to flag that this statement is very strange and wrong?
I am not going to bother taking a survey, but no, they don’t think that.
The post warns of potential regulatory capture from the taskforce, without any speculation as to what that would involve, warning of EA’s ‘ties to Silicon Valley.’
This line of attack continues all around, despite it making no sense. You could say the companies themselves seek regulatory capture by warning that their products might kill everyone on Earth, and decide how much sense you think that makes.
Or you can take it a step further, say that because people in the companies have been influenced by the ideas of those worried about extinction risk and we often talk to the people who we want to stop from destroying the world, that those people worried about extinction risk are suspect because this represents ties to Silicon Valley. That’s an option too. It is the standard journalist approach to attacking those one dislikes.
To be fair, there is this quote:
I think this is highly inaccurate. Are there those who are not speaking out as loudly or clearly as they could to avoid antagonizing the AI companies? Yes, that is definitely happening. Are EAs broadly doing their bidding? I don’t see that at all.
EAs are indeed, as the article puts it, ‘scrambling to be part of the UK’s taskforce.’ Why wouldn’t they? This is a huge opportunity for impact, and all that is asked in return is being willing to work for the government on its terms.
Business is good.
Business is growing, as Anthropic partners with Boston Consulting Group to expand their B2B and expand the reach of Claude.
A CEO that knows the minimum amount about allowing business. Wow are people blinded by mobile and web browsers these days.
AI increasingly is the business of science.
AI software maker Databricks raises $500 million at $43 billion.
Adobe Photoshop updates its Terms of Service for use of generative AI, banning any training of other AIs on its outputs, and putting the onus on users to prevent this even by third parties. I presume the overly broad language is there to lay groundwork for suing those who do the scraping, not to go after users who are not trying to help train models at scale. Alternatively it could be a standard ‘make the ToS include everything we can think of’ move.
A good Nvidia A100 remains hard to rent. We will not allocate scarce resources via price.
Chroma usage at all-time highs. The lull is in the rate of core capability improvements.
Technical Details
New paper says models get big jumps when they figure out grammar, then another on getting external capabilities (direct link to paper).
A series of strange failure modes and speculations about whether we are solving them.
A technique I am surprised is not used more is Contrastive Decoding. Does the better model think a response is more or less likely than the worse model?
There are lots of huge obvious problems with relying on this, yet we humans do use it all the time in real life, reversed stupidity is far from zero evidence.
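To make the idea concrete, here is a minimal sketch (the model pairing and the simple average-log-probability scoring are stand-ins of mine, not any particular paper’s implementation): score a candidate continuation by how much more likely the stronger model finds it than the weaker one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2-large stands in for the "better" model, gpt2 for the "worse" one; they share a tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")
expert = AutoModelForCausalLM.from_pretrained("gpt2-large")
amateur = AutoModelForCausalLM.from_pretrained("gpt2")

def avg_logprob(model, prompt: str, continuation: str) -> float:
    """Average log-probability the model assigns to the continuation tokens."""
    ids = tok(prompt + continuation, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].mean().item()

def contrastive_score(prompt: str, continuation: str) -> float:
    """Positive when the better model likes the continuation more than the worse one does."""
    return (avg_logprob(expert, prompt, continuation)
            - avg_logprob(amateur, prompt, continuation))

print(contrastive_score("The capital of France is", " Paris, of course."))
```

The real versions do this token by token during generation rather than scoring whole responses after the fact, but the comparison is the same.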
Quiet Speculations
Michael Nielsen offers his thoughts on existential risk from superintelligence. A fine read, interesting and well-considered throughout. Yet as he notes, it is unlikely to change many minds. Everyone needs to hear different things, focuses on different aspects. The things Michael focuses on here mostly do not seem so central to me. Other considerations he does not mention concern me more. There are so many different ways things could play out towards the same endpoint.
I think that also explains a lot of the perception of weak arguments on the risk side. If it addresses the wrong things it looks dumb to you, but people can’t agree on what is the wrong thing, and someone who has done a lot of thinking notices that the arguments are mostly dumb because they need to address dumb things. Also the whole situation is super complicated, so you keep drilling until you get to where things get confusing.
Tyler Cowen (Bloomberg) speculates on the impact of AI on classrooms and homework. He notes that schools have always largely been testing for and teaching conscientiousness (and I would add, although he does not use these words, obedience and conformity and discipline and long term orientation). If AI is available, even if the assignments fail to otherwise adjust, those tests still apply, many college students cannot even track deadlines and reliably turn in work.
As for knowledge of the actual material, does it even matter? Sometimes yes, sometimes no, if yes then you can test in person. I would note that even when it does matter, many (most?) students will not learn it if they are not directly tested. I expect Tyler’s solution of in-person testing, often including oral examinations (or, I would add, reliance on evaluations of classroom interactions like you see in law schools) to become vital to get students to actually care about the material. Tyler suggests ‘the material’ was mostly never so important anyway, and will become less important now that you can look it up with AI. I agree less important, and often already overrated, but beware of underrating it.
Ireland is moving now to implement some of these changes. There is a problem, but not so much of a crisis as to make us go back to real formal examinations.
Predictions for impact of AI on economic growth continue to be modest, also huge anyway. Graph below is claiming something like ~1.5% average predicted acceleration of GDP growth. That would be wonderful without being dangerously transformative.
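For scale, a quick compounding check with round numbers of my own choosing: an extra 1.5 points of annual growth is an enormous deal over decades without being anything like a singularity.

```python
# What an extra 1.5 percentage points of annual GDP growth buys over 30 years.
baseline, boosted, years = 0.02, 0.035, 30
print(f"baseline growth: {(1 + baseline) ** years:.2f}x GDP")  # ~1.81x
print(f"AI-boosted:      {(1 + boosted) ** years:.2f}x GDP")   # ~2.81x
```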
The economists are no doubt rolling their eyes at statements like Ngo’s.
New working paper by Gary Charness, Brian Jabarian and John List is Generation Next: Experimentation with AI. They notice several ways LLMs can help refine experimental design and implementation. Here is the abstract:
LLMs can definitely help with all of that if used responsibly, as can various forms of other AI. This only scratches the surface of what such tools can help with, yet even scratching that surface can be quite valuable.
The Quest for Sane Regulations
Here’s a wild one, welcome to 2023.
Ada Lovelace Institute offers advice on how to learn from history on how to regulate better. Seems mostly correct as far as it goes but also generic, without attention to what makes the situation unique.
Senator Schumer held his meeting with various CEOs and advocates. Everyone was there. Everyone agrees Something Must Be Done.
They can’t agree on the something that therefore we must do.
There was some grandstanding about not letting individual Senators ask questions (read: not wasting all that time letting Senators grandstand) in favor of letting the CEOs and advocates actually talk.
Zuckerberg’s response points out that there is not a huge amount of known practical damage that is being done yet, with the current version. Yes, you can find all that information out on your own, the same way that you could duplicate Facebook’s functionality with a series of email threads or by actually walking over and talking to your friends in person for once.
That does not change the two-days-later principle, that any open source LLM you release means the release of the aligned-only-to-the-current-user version of that LLM two days later. We now have a point estimate of what it took for Llama 2: $800 and a few hours of work.
As usual, concerns stayed focused on the mundane. Advocates care deeply about the particular thing they advocate for.
I am happy that everyone involved is talking. I am disappointed, although not surprised, that there is no sign that anyone involved noticed we might all die. There is still quite a long way to go and much work to do.
Should we pause AI development if we have the opportunity to do so? Tyler Cowen continues to reliably link to anyone who argues that the answer is no. Here is Nora Belrose on the EA forum making the latest such argument. Several commenters correctly note that the post feels very soldier mindset, throwing all the anti-pause arguments (and pro-we-are-solving-alignment arguments) at the wall with no attempt at balance.
A lot of the argument is that realistic pauses would not be implemented properly, would be incomplete, would eventually stop or be worked around, so instead of trying anyway we should not pause. David Manheim has the top comment, which points out this strawmanning.
The true justification seems to be that Nora thinks the current path is going pretty well on the alignment front, which seems to me like a fundamental misunderstanding of the alignment problem in front of us. The bulk of the post is various forms of alignment optimism if we stay on our current path. Nora thinks alignment is easy, thinks aligning current models means we are ahead of the game on our ability to align future more capable models, and that aligning AI looks easy relative to aligning humans or animals, which are black boxes and give us terrible tools to work with. Approaches and concerns that are not relevant to current models are treated as failures.
I continue to be deeply, deeply skeptical of the whole ‘we need exactly the best possible frontier models to do good alignment work’ argument, that whatever the current best option is must be both necessary and sufficient. If it is necessary, it is unlikely to be sufficient, given we need to align far more capable future models. If it is sufficient, it seems unlikely it is necessary. If it happens to be both right now, will it also be both in a few months? If so, how?
Tyler also links us to Elad Gil on AI regulation, which he is for now against, saying it is too early. Elad’s overall picture is similar to mine in many ways – in his words, a short term AI optimist, and a long term AI doomer. For now, Elad says, existing law and regulation should mostly cover AI well, and slowing AI down would do immense economic and general damage. We should do export controls and incident reporting, and then expand into other areas slowly over time as we learn more.
That would indeed be exactly my position, if I had no worries about extinction risk and more generally did not expect AI to lead to superintelligence. Elad does a good job laying out that regulation would interfere with AI progress, but despite being a long-term self-proclaimed doomer he does not address why one might therefore want to interfere with AI progress, or why one should not be concerned that more AI progress now leads to extinction risk, or why our responses to such concerns can afford to wait. As such, the post feels incomplete – one side of a tradeoff is presented, the other side’s key considerations that Elad agrees with are ignored.
Yann LeCun made a five minute statement to the Senate Intelligence Committee (which I am considering sufficient newsworthiness it needs to be covered), defending the worst thing you can do, open sourcing AI models. It does not address the reasons this is the worst thing you can do, instead repeating standard open source talking points. He claims open source will help us develop tools ‘faster than our adversary’ despite our adversary knowing how the copy and paste commands work. He does helpfully endorse charting a path towards unified international safety standards, and notes that not all models should be open sourced.
The Week in Audio
From a month ago, lecture by Pamela Samuelson of UC Berkeley on copyright law as it applies to AI. As per other evaluations, she says copying of training data poses potential big problems, whereas output similarity is less scary. She emphasizes that submitting statements to the copyright office can matter quite a bit, and the recent notice of inquiry is not something the industry can afford to ignore. So if you would rather submit one of those statements the industry would not like, that could matter too.
Rhetorical Innovation
Helen Toner is the latest to pound the drum against the terms ‘short term’ and ‘long term.’
For related reasons, I generally use ‘mundane harms’ and ‘extinction/existential risks.’
We call upon the government to regulate, not because it would make things easy, but because it would make things hard.
It is not this simple, people do not get convinced by this, yet also it kind of is this simple:
The people do not want superintelligent AI. We know this because we keep asking them and they reliably tell us this. Vox has the latest report on the latest poll from AIPI, the one from September.
Daniel Colson offers a reminder of the poll highlights:
The China argument? Not so convincing to regular people.
It’s easy to forget how lopsided these numbers were. A 67-14 result is rather extreme. A 73-11 result on liability is as high as it gets. Was the 46-22 on unknowns versus weaker known threats what you would have expected? Still, a reminder that these are still the old results we knew about, and also that we should be concerned that the polls involved were likely framed in a not fully neutral fashion.
As Connor Leahy points out, we do not put things to votes all that often, but a lot of the point of this is to dispel the idea that we must race ahead because the public will never care about such risks and there is no politically reasonable way to do anything about the situation. It is important to note that such claims are simply false. There is every expectation that those calling for action to prevent us all dying would gain rather than lose popularity and political viability.
No One Would Be So Stupid As To
What is our ask, in terms of what we request you not be so stupid as to?
I agree with Eliezer here. If you have a high estimate, your duty is to disclose. If you are also doing things that especially accelerate the timeline of realizing that high estimate, or especially increase your estimate of extinction risk, then stop doing those things in particular. The list here seems like the central examples of that. Otherwise, I’d prefer that you stop entirely or shift to actively helpful work instead, but that does seem supererogatory.
In the meantime, well, yes, this is fine.
People continue to claim you can make ‘responsible open models.’ You can do that by having the model suck. If it is insufficiently capable, it is also responsible. Otherwise, I have yet to hear an explanation of how to make the ‘responsibility’ stick in the face of two days of fine tuning work to undo it.
Aligning a Smarter Than Human Intelligence is Difficult
Asimov’s laws made great stories, all of which were about how they do not work. There are so many obvious reasons they would not go well if implemented. And yet.
Is that fair? Sort of. Under 6.3.3, ‘other potential risks,’ after ‘misuse’ and ‘unemployment’ each get a paragraph, here is the paragraph referenced above with line breaks added:
It certainly is a lazy-as-hell paragraph. This was clearly not what anyone cared about when writing this paper. They wanted to talk about all the cool new things we could do, and put in a section quickly saying ‘yeah, yeah, misuse, they took our jobs, SkyNet’ with generic calls for interpretability and control mechanisms and anticipating dangers and downsides and all that. Agents are cool, you know? Think of the Potential. Let’s focus on that.
Which is a perfectly legitimate thing to do in a paper, actually, so long as you do not think your paper is actively going to get everyone killed. Here is this amazing new thing, we wrote a cool paper about it, yes there are potential downsides that might include everyone dying and that is Somebody Else’s Problem, but we confirm it is real and please do solve it before you actually create the thing and deploy it, please?
Their prescription and level of suggested caution, all things considered, is actually damn good. I’d happily take a deal that said ‘we agree to deploy AI agents if and only if we have comprehensive operational mechanistic understanding of the underlying systems and what we are building on top of them, and also have anticipated the direct and indirect impacts of the agents, and also we have ensured they are harm minimizing and will listen to orders.’
Is that good enough to keep us all alive even if we got it? My guess is no, because harm minimization has no good definition even in theory, and none of this handles the competitive dynamics involved. So there would still be more work to do. But if we indeed had all that handled, I would be relatively optimistic about what we could do from there.
I also found the central thesis on point. From the abstract:
For reasons of bandwidth, I did not get to check out the full paper in detail. It did seem interesting and I can imagine people bidding high enough for me to revisit.
I Didn’t Do It, No One Saw Me Do It, You Can’t Prove Anything
Davidad cites recent writeup of air traffic control software failures from brittle legacy code as all the more reason to accelerate his project of formal verification via AI. Here it’s not about superintelligence and doing bespoke work, it’s about normal work.
I continue to be confused how any of this would work in practice. I’ve been conversing a bit with Davidad and reading some papers to try and understand better, but haven’t gotten very far yet. I get how, if you can well-specify all the potential dynamics and forces and requirements in a situation, including all potential adversarial actions, then one could in theory offer a proof of how the system would behave. It still seems very hard to do this in non-narrowly-isolated real world situations, or to get much edge in replacing legacy code in these kinds of systems.
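For a sense of what ‘well-specify, then prove’ looks like at toy scale, here is the easy half, checking a candidate rule against a tiny spec with an off-the-shelf solver. This is my own illustration using the z3-solver package, nothing to do with Davidad’s actual stack; everything I remain confused about lives in writing and trusting the spec, not in this step.

```python
from z3 import Int, Or, Not, Solver, unsat  # pip install z3-solver

# Toy air-traffic-style spec: two aircraft must keep 1000 ft of vertical separation.
alt_a, alt_b = Int("alt_a"), Int("alt_b")
separated = Or(alt_a - alt_b >= 1000, alt_b - alt_a >= 1000)

# Candidate "implementation" rule the controller software might follow.
rule = alt_b == alt_a + 1000

# Verification: search for any state where the rule holds but the spec is violated.
s = Solver()
s.add(rule, Not(separated))
if s.check() == unsat:
    print("verified: no state satisfies the rule while violating separation")
else:
    print("counterexample:", s.model())
```

Scaling that from two integers to a legacy air traffic system, adversaries included, is where all the difficulty I describe below actually lives.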
The likely hardest problem, even in basic cases like air traffic control, is likely to be properly specifying the requirements, or proving that your specification of requirements is sufficient. That does not seem like an LLM-style job.
As Patrick McKenzie would happily point out, getting a full specification of exactly what are the actual practical necessary specifications of your software project to replace a bunch of old government legacy code is usually the hard part. If you knew exactly how to specify everything the code had to do and not do, you’d be most of the way there already.
If an LLM gives you a specification, how easy is it to check intent-alignment? Yes, easier than checking intent-alignment on a full program for which you are offered no formal proofs. Still, this does not sound especially easy. Any small deviation from what it should say could potentially lay ground for an exploit.
Often, indeed, the correct specification’s job is to preserve the value of existing exploits the system relies upon to function, or various functionaries rely upon for compensation. That is a lot of what makes things so hard. That’s the thing about using proofs and specifications, especially when replacing legacy systems, you need to ensure that you are indeed matching the real requirements exactly. This is in direct explicit conflict with matching the official requirements, which you officially have to do and in practice cannot afford to do. That’s (one reason) why these projects cost 100x what they seem like they should.
How do we dare give this style of task to an LLM? We could have a human check all the details and corner cases and red teaming and such to ensure the specification was sufficient, but that sounds similarly hard to writing the specification yourself and also checking it. Is the plan some sort of second-level provably correct method of assembling the correct specifications? How would that even work?
In theory, the idea of using advanced AIs to create provably safe and effective outputs, and only those outputs, which we then verify and use, does seem like an approach we should consider. I am very happy Davidad has a well-resourced team working on this. That does not mean I see how we get to any of this working.
People Are Worried About AI Killing Everyone
The latest formulation of exactly how Eliezer expects this to go down by default, if nothing smarter proves available, dropped this week. I do not consider these details load-bearing, but others do so here you go.
He then explains why he keeps on describing the scenario he actually expects, rather than something people would find more palatable. If you present a more conventional scenario, it motivates a more conventional, proportionate response. It is only because of the extent of the threat that we need to respond so strongly.
I think this is a mistake, which is why I typically describe more conventional scenarios. There are indeed conventional-shaped ways AIs could beat us, that would be much easier for people to believe, that still get to the same place. The arguments Eliezer quotes indeed do not make sense, but the conventional takeover that works more or less how a highly capable human would take over, followed by setting up infrastructure to survive without us? I think those scenarios make perfect sense, except insofar as the AI would presumably have superior other options.
A strange choice some are making is to call these people ‘Cassandra.’
Nevin Freeman explains that he has come around, responding to the Vox article above.
Indeed.
Other People Are Not As Worried About AI Killing Everyone
Tony Blair is not worried, saying extinction risk from AI is ‘science fiction.’ Which is potentially a big problem for the UK taskforce when Labour (probably) wins the next UK election, and Blair’s influence on the new government will loom large.
Within the already-fact realms, Tony Blair seems highly down to Earth, his answers reasonable. He says PM Sunak gets it and is on the right track with AI. Also note the wording. Not anticipating is quite different from dismissing a possibility, as is saying a concern lies in the future. If the taskforce is doing good work, it will be raising UK prestige and competence and preparedness in highly valuable ways even if you are unconcerned about extinction as such. The details here make me optimistic.
Google employee Francois Chollet is not worried, joins the group thinking that I must be buying into some sort of marketing scam, why else would people be warning that their product might kill everyone. Surely corporations would upsell, not downplay, the risks inherent in their new products.
The Lighter Side
I’m Jesus, *****.
Well, actually.
This is fine, the company’s name is Dictador:
For now, the post says, human executives still make all the key decisions including firing employees. So there’s nothing to worry about, right?
Especially since, as it turns out, I’m not sorry at all.
Nor should we have been earlier.