RE: GPT getting dumber, that paper is horrendous.
The code gen portion was completely thrown off by Markdown syntax (the authors mistook backticks for single quotes, as far as I can tell). I think the update to make there is that it is decent evidence that there was some RLHF on ChatGPT outputs. If you remember the "a human being will die if you don't reply with pure JSON" tweet, even that final JSON was wrapped in markdown. My modal guess is that the markdown was inserted via a kludge to make the ChatGPT UX better, and then RLHF was done on that kludged output. Code sections are often mislabeled for what language they contain. My secondary guess is that the authors used an API which had this kludge added on top of it, such that GPT just wouldn't output plaintext code, though that is hard to square with there being any passing examples at all.
In the math portion they say GPT-4-0613 only averaged 3.8 CHARACTERS per response. Note that "[NO]" and "[YES]" both contain more than 3.8 characters. Note that GPT-4 hardly ever answers a query with a single word. Note that the paper's example answer for the primality question included 1000 characters, so the remaining questions apparently averaged 3 characters flat. Even if you think they only fucked up that data analysis: I also replicated GPT-4 failing to solve "large" number primality, and am close to calling that a cherry-picked example. It is a legit difficult problem for GPT; I agree that anyone who goes to ChatGPT to replicate will find the answer they get back is a coin flip at best. But we need to say it again for the kids in the back: the claim is that GPT-4 got 2% on yes/no questions. What do we call a process that gets 2% on coin flip questions?
Amusingly, when I went to test the question myself, I forgot to switch Code Interpreter off, and it carried out getting the correct result in the sensible way.
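For reference, the "sensible way" here presumably means running actual code, something like trial division. A minimal sketch of what that looks like (my own illustration, using 17077, the number from the paper's example prompt, not the code Code Interpreter actually produced):

```python
def is_prime(n: int) -> bool:
    """Primality check by trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

print(is_prime(17077))  # True: 17077 has no divisors up to its square root, so it is prime
```

Run as code, the question is trivial; asked as a pure language task, it is a coin flip at best.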
Tyler Cowen has his unique take on the actors strike and the issue of ownership of the images of actors. As happens frequently, he centers very different considerations than anyone else would have, in a process that I cannot predict (and that thus at least has a high GPT-level). I do agree that the actors need to win this one.
I do agree with his conclusion. If I got to decide, I would say: Actors should in general only be selling their images for a particular purpose and project. At minimum, any transfer of license should be required to come with due consideration and not be a requirement for doing work, except insofar as the rights pertain narrowly to the work in question.
I'm...not sure what you envision this looking like?
While I don't think AI is there yet, and it may not get there before larger disruptions occur, if you imagine a hypothetical world in which one day of video and motion capture of a person lets you make an infinite amount of AI-generated video of them, acting cannot realistically be a long-term career in that world. Attempting to set up rules governing images to make acting remain a long-term career will be massively wasteful (using years of person-work to do one day of person-work) for no reason other than to provide employment to a legacy profession.
First, I agree with your general conclusion: laws to protect a limited number of humans in a legacy profession are inefficient. Though this negotiation isn't one of laws, it's unions vs. studios, where both sides have leverage to force the other to make concessions.
However, I do see a pattern here. Companies optimizing for short-term greed very often do create the seeds of larger problems:
(3) bothers me in that it's excessively greedy; it doesn't come close to paying a human being enough to even come to LA at all. It's unsustainable.
Theoretically capitalism should be fixing these examples automatically. I'm unsure why this doesn't happen.
Actually, on reflection, assuming AI continues to improve, (1) and (2) can also stay in disequilibrium.
True, but I definitely don't expect such a flawless AI to be available anytime soon. Even Stable Diffusion is not stable enough to consistently draw the exact same character twice, and the current state of AI-generated video is much worse. Remember the value of the long tail: if your AI-generated movie has 99% good frames and 1% wonky frames, it will still look like a very bad product compared to traditional movies, because consumers don't want movies where things look vaguely distorted once per minute (maybe the stunt doubles should be more concerned about being replaced by AI frames than the actors themselves?).
xAI seems a potentially significant player to me. We could end up with a situation in which OpenAI is the frontier of safety research (via the superalignment team), and xAI is the frontier of capabilities research (e.g. via a Gemini-style combination of LLMs and "self-play").
You're doing a great job with these newsletters on AI.
Events continue to come fast and furious. Custom instructions for ChatGPT. Another Congressional hearing, a call to understand what is before Congress and thus an analysis of the bills before Congress. A joint industry safety effort. A joint industry commitment at the White House. Catching up on Llama-2 and xAI. Oh, and potentially room temperature superconductors, although that is unlikely to replicate.
Is it going to be like this indefinitely? It might well be.
This does not cover Oppenheimer, for now I’ll say definitely see it if you haven’t, it is of course highly relevant, and also definitely see Barbie if you haven’t.
There’s a ton to get to. Here we go.
Language Models Offer Mundane Utility
New prompt engineering trick dropped.
Also, a potential new jailbreak at least briefly existed: Use the power of typoglycemia. Tell the model both you and it have this condition, where letters within words are transposed. Responses indicate mixed results. A sketch of the transformation is below.
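For concreteness, here is a minimal sketch of the kind of transformation involved (my own illustration, not taken from the linked jailbreak): keep the first and last letter of each word and shuffle the interior letters.

```python
import random
import re

def typoglycemia(text, seed=None):
    """Shuffle the interior letters of each word, keeping the first and last letters in place."""
    rng = random.Random(seed)

    def scramble(match):
        word = match.group(0)
        if len(word) <= 3:
            return word  # nothing to shuffle
        middle = list(word[1:-1])
        rng.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]

    return re.sub(r"[A-Za-z]+", scramble, text)

print(typoglycemia("Both of us have a condition called typoglycemia", seed=0))
# Prints a scrambled but mostly readable version; exact output depends on the shuffle.
```

The presumable idea is that the model can still read the scrambled text while its refusal behavior does not fully transfer; as noted, results are mixed.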
Have GPT-4 walk you through navigating governmental bureaucratic procedures.
I expect this to stop feeling like living in the future quickly. The question is how the government will respond to this development.
Translate from English to Japanese with a slider to adjust formality of tone.
Language Models Don’t Offer Mundane Utility
They can’t offer mundane utility if you do not know about them.
If you do know, beware of prompt injection, which can now, in extreme cases, even arrive via images (paper) or sound.
This seems like a strong argument for never open sourcing model weights?
OpenAI discontinues its AI writing detector due to “low rate of accuracy.”
Has GPT-4 Gotten Worse?
Over time, it has become more difficult to jailbreak GPT-4, and it has increasingly refused various requests. From the perspective of OpenAI, these changes are intentional and mostly good. From the perspective of the typical user, these changes are adversarial and bad.
The danger is that when one trains in such ways, there is potential for splash damage. One cannot precisely target the requests one wants the model to refuse, so the model will then refuse and distort other responses as well in ways that are undesired. Over time, that damage can accumulate, and many have reported that GPT-4 has gotten worse, stupider and more frustrating to use over the months since release.
Are they right? It is difficult to say.
The trigger for this section was that Matei Zaharia investigated, with Lingjiao Chen and James Zou (paper), asking about four particular tasks tested in narrow fashion. I look forward to more general tests.
What the study found was a large decline in GPT-4's willingness to follow instructions. Not only did it refuse to produce potentially harmful content more often, it also reverted to what it thought would be helpful on the math and coding questions, rather than what the user explicitly requested. On math, it often answers before doing the requested chain of thought. On coding, it gives additional 'helpful' information rather than following the explicit instructions and only returning code. Meanwhile, the code produced itself has improved.
Here’s Matei’s summary.
They also note that some previously predictable responses have silently changed, where the new version is not inherently worse but has the potential to break workflows if one's code was foolish enough to depend on the previous behavior. This is indeed what is largely responsible for the reduced performance on coding here: if extra text was added, even helpful text, the result was judged 'not executable' rather than testing the part that was clearly code.
Whereas if you analyze the code after stripping out the additional text, we see substantial improvement:
That makes sense. Code is one area where I’ve heard talk of continued improvement, as opposed to most others where I mostly see talk of declines in quality. As Matei points out, this still represents a failure to follow instructions.
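To see how fragile that 'executable' judgment is, here is a minimal sketch (my own, not the paper's actual harness) of stripping markdown fences from a response before checking whether the code at least compiles; this is exactly the kind of handling whose presence or absence flips the coding result:

```python
import re

FENCE = "`" * 3  # the markdown code fence, built indirectly to keep this block readable

def extract_code(response: str) -> str:
    """Pull code out of markdown fences if present; otherwise return the raw text."""
    pattern = FENCE + r"[a-zA-Z0-9_+-]*\n(.*?)" + FENCE
    blocks = re.findall(pattern, response, flags=re.DOTALL)
    return "\n".join(blocks) if blocks else response

def looks_executable(response: str) -> bool:
    """Crude check: does the (extracted) code at least compile as Python?"""
    try:
        compile(extract_code(response), "<llm-response>", "exec")
        return True
    except SyntaxError:
        return False

bare = "def add(a, b):\n    return a + b"
chatty = ("Sure! Here is the solution:\n" + FENCE + "python\n"
          "def add(a, b):\n    return a + b\n" + FENCE + "\nHope that helps!")

print(looks_executable(bare))    # True: plain code passes either way
print(looks_executable(chatty))  # True with fence-stripping; grading the raw chatty text as code would fail
```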
On answering sensitive questions, the paper thinks that the new behavior of giving shorter refusals is worse, whereas I think it is better. The long refusals were mostly pompous lectures containing no useful information, let’s all save time and skip them.
Arvind Narayanan investigates and explains why they largely didn’t find what they think they found.
I ran the math experiment once myself and got at least a few bits of evidence, using the exact prompt from the paper. I did successfully elicit CoT. GPT-4 then got the wrong answer on 17077 being prime despite this, and when I corrected its error (pointing out that 7*2439 didn't work) it got it wrong again, claiming 113*151 worked, then it said this:
Which, given I wasn’t using plug-ins, has to be fabricated nonsense. No points.
Given Arvind’s reproduction of the failure to do CoT, it seems I got lucky here.
In Arvind’s blog post, he points out that the March behavior does not seem to involve actually checking the potential prime factors to see if they are prime, and the test set only included prime numbers, so this was not a good test of mathematical reasoning – all four models sucked at this task the whole time.
[links to his resulting new blog post on the topic]
I saw several people (such as Benjamin Kraker here) taking the paper results at face value, as Arvind feared.
Or this:
Link is to a TechRadar post, which reports user complaints of worse performance, without any systematic metrics or a general survey.
Matei responds to Arvind:
It is a big usability deal if GPT-4 is overruling user requests, likely the result of overly aggressive RLHF and fine tuning. This could push more users, including myself, towards using Claude 2 for many purposes instead. For now, I’ve often found it useful to query both of them.
Ethan Mollick went back and tested his old prompts, and confirms that the system has changed such that it performs worse if you are prompting it the way that was optimal months ago, but performs fine if you have adjusted to the new optimum.
Or here’s another interpretation:
But Doctor, These Are The Previous Instructions
Custom instructions for ChatGPT are here. We are so back.
This is insanely great for our mundane utility. Prompt engineering just got a lot more efficient and effective.
First things first, you definitely want to at least do a version of this:
Nivi is going with the following list:
Or you might want to do something a little more ambitious?
deepfates: As I suspected it’s basically a structured interface to the system prompt. Still interesting to see, especially the “quietly think about” part. And obviously it’s a new angle for prompt injection which is fun.
That was fun. What else can we do? Nick Dobos suggests building a full agent, BabyGPT style.
Thread has several additional agent designs.
When in doubt, stick to the basics.
I Will Not Allocate Scarce Resources Via Price
Seriously, what is wrong with people.
Fun with Image Generation
Bryan Caplan is running a contest to use AI to finish illustrating his graphic novel.
Deepfaketown and Botpocalypse Soon
Tyler Cowen has his unique take on the actors strike and the issue of ownership of the images of actors. As happens frequently, he centers very different considerations than anyone else would have, in a process that I cannot predict (and that thus at least has a high GPT-level). I do agree that the actors need to win this one.
I do agree with his conclusion. If I got to decide, I would say: Actors should in general only be selling their images for a particular purpose and project. At minimum, any transfer of license should be required to come with due consideration and not be a requirement for doing work, except insofar as the rights pertain narrowly to the work in question.
There is precedent for this. Involuntary servitude and slave contracts are illegal. Many other rights are inalienable, often in ways most agree are beneficial. Contracts that do not involve proper consideration on both sides are also often invalid.
On the strike more generally, Mike Solana asks: “which one is it though?”
It is a good question because it clarifies what is going on, or threatens to happen in the future. If the concern was that AI could do the job as well as current writers, artists and actors, then that would be one issue. If they can't and won't be able to do it at all, then there would be no problem.
Instead, we may be about to be in a valley of despair. The AI tools will not, at least for a while, be able to properly substitute for the writers or actors. However they will be able to do a terrible job, the hackiest hack that ever hacked, at almost zero marginal cost, in a way that lets the studios avoid compensating humans. This could result in not only jobs and compensation lost, but also in much lower quality products and thus much lower consumer surplus, and a downward spiral. If used well it could instead raise quality and enable new wonders, but we need to reconfigure things so that the studios have the right incentives.
So, the strikes will continue, then.
They Took Our Jobs
These Richmond restaurants now have robot servers. The robots come from China-based PuduTech. They purr and go for $15,000 each.
Good news translators, the CIA is still hiring.
Ethan Mollick gives a simple recommendation on 'holding back the strange tide of AI' to educators and corporations: Don't. Ignore it and employees will use it anyway. Ban it and employees will use it anyway, except on their phones. Those who do use it will be the 'wizards.' Your own implementation? Yep, the employees will just use ChatGPT anyway. Embrace new workflows and solutions, don't pre-specify what they will be. As for educators, they have to figure out what comes next, because their model of education focuses on proof of work and testing rather than on teaching. AI makes it so much easier to learn, and so much harder to force students to pretend they are learning.
Get Involved
Alignment-related grantmaking was for a time almost entirely talent constrained. Thanks to the urgency of the task, the limited number of people working on it and the joys of crypto, there was more money looking to fund than there were quality projects to fund.
This is no longer the case for the traditional Effective Altruist grantmaking pipelines and organizations. There are now a lot more people looking to get involved who need funding. Funding sources have not kept pace. Where the marginal FTX dollar was seemingly going to things like ‘movement building’ or ‘pay someone to move to Bay for a while and think about safety’ or invested in Anthropic, the current marginal dollar flowing through such systems is far more useful. So if you are seeking such funding, keep this in mind when deciding on your approach.
Here is an overview of the activities of EA’s largest funding sources.
As a speculation granter for the SFF process that I wrote about here (link to their website), in previous rounds I did not deploy all my available capital. In the current round, I ran out of capital to deploy, and could easily have deployed quite a lot more, and I expect the same to apply to the full round of funding decisions that is starting now.
On the flip side of this, there is also clearly quite a lot of money both large and small that wants to be deployed to help, without knowing where it can be deployed. This is great, and reflects good instincts. The danger of overeager deployment is that it is still the case that it is very easy to fool yourself into thinking you are doing good alignment work, while either tackling only easy problems that do not help, or ending up doing what is effectively capabilities work. And it is still the case that the bulk of efforts involve exactly such traps. As a potential funder, one must look carefully at individual opportunities. A lot of the value you bring as a funder is by assessing opportunities, and helping ensure people get on and stay on track.
The biggest constraint remains talent, in particular the willingness and ability to create effective organizations that can attempt to solve the hard problems. If that could be you and you are up for it, definitely prioritize that. Even if it isn’t, if you have to choose one I’d still prioritize getting involved directly over providing funding – I expect the funding will come in time, as those who are worried or simply see an important problem to be solved figure out how to deploy their resources, and more people become more worried and identify the problem.
Given these bottlenecks, how should we feel about leading AI labs stepping up to fund technical AI safety work?
My answer is that we should welcome and encourage leading AI labs funding technical safety work. There are obvious synergies here, and it is plausibly part of a package of actions that together is the best way for such labs to advance safety. There are costs to scaling the in-house team and benefits to working with independent teams instead.
We should especially welcome offers of model access, and offers of free or at-cost compute, and opportunities to talk and collaborate. Those are clear obvious wins.
What about the concern that by accepting funding or other help from the labs, researchers might become beholden or biased? That is a real risk, but in the case of most forms of technical work everyone wants the same things, and the labs would not punish one for not holding back the bad news or for finding negative results, at least not more so than other funding sources.
What about the impact on the lab? One could say the lab is ‘safety washing’ and will use such funding as an excuse not to do their own work. That is possible. What I find more plausible is that the lab will now identify as a place that cares about safety and wants to do more such work, and also potentially creates what Anthropic calls a ‘race to safety’ where other labs want to match or exceed.
There are specific types of work that should strive to avoid such funding. In particular, I would be wary of letting them fund or even pay organizations doing evaluations, or work that too easily translates into capabilities. In finance, we have a big problem where S&P, Fitch and Moody's all rely on companies paying to have their products rated for safety. This puts pressure on them to certify products as safer than they are, which was one of the contributing causes of the great financial crisis in 2008. We do not want a repeat of that. So ideally, the labs will fund things like mechanistic interpretability and other technical work that lack that conflict of interest. Then other funders can shift their resources to ARC and others working on evaluations.
The flip side: Amanda Ngo looking for an AI-related academic or research position that will let her stay in America. I don’t know her but her followers list speaks well.
If you want to work on things like evals and interpretability, Apollo Research is hiring.
If you want to work at Anthropic on the mechanistic interpretability team, and believe that is a net positive thing to do, they are hiring, looking for people with relevant deep experience.
If you want to advance a different kind of safety than the one I focus on, and have leadership skills, Cooperative AI is hiring.
If you want to accelerate capabilities but will settle for only those of the Democratic party, David Shor is hiring.
Can We X This AI?
A week or so before attempting to also rebrand Twitter as X (oh no), Elon Musk announced the launch of his latest attempt to solve the problem of building unsafe AGI by building unsafe AGI except in a good way. He calls it x.ai, I think now spelled xAI.
Our best source of information is the Twitter spaces Elon Musk did. I wish he’d stop trying to convey key information this way, it’s bad tech and audio is bad in general, why not simply write up a post, but that is not something I have control over.
I will instead rely on this Alex Ker summary.
It is times like this I wish I had Matt Levine's beat. When Elon Musk does stupid financial things it is usually very funny. If xAI did not involve the potential extinction of humanity, and instead we only had a very rich man who does not know how any of this works and invents his own ideas about how it works, this too would be very funny.
Yeah, no. That’s not a thing. But could we make it a thing and steelman the plan?
I presume that the plan must be to make the AI curious in particular about humans. Curiosity in general won't work, because humans are not going to be the configuration of atoms that best satisfies curiosity, once your skill at rearranging atoms is sufficiently strong. What if that curiosity was a narrow curiosity about humans in particular?
Certainly we have humans like this. One can be deeply curious about trains, or about nineteenth century painters, or about mastering a particular game, or even a particular person, while having little or no curiosity about almost anything else. It is a coherent way to be.
To be clear, I do not think Elon is actually thinking about it this way. Here’s from Ed Krassen’s summary:
But that does not mean the realization could not come later. So let’s suppose it can be a thing. Let’s suppose they find a definition of curiosity that actually was curious entirely about humans, such that being curious about us and keeping us alive was the best use of atoms, even more so than growing capabilities and grabbing more matter. And let’s suppose he solved the alignment problem sufficiently well that this actually stuck. And that he created a singleton AI to do this, since otherwise this trait would get competed out.
How do humans treat things we are curious about but do not otherwise value? That we do science to? Would you want to be one of those things?
Elon has clarified that this means a fundamental physics problem.
This seems to me to not be necessary. We cannot assume that there are such problems that have solutions that can be found through the relevant levels of pure intelligence.
Nor does it seem sufficient, the ability to solve a physics problem could easily be trained, and ideally should be trained, on a model that lacks general intelligence.
I very much hope that Musk is right about this. I do think there’s a chance, and it is a key source of hope. Others have made the case that AGI is far and our current efforts will not succeed.
I do not think it is reasonable to say it is ‘not succeeding’ in the present tense. AI advances are rapid, roughly as scaling laws predicted, and no one expected GPT-4 to be an AGI yet. The paradigm may fail but it has not failed yet. Expressed as present failure, the claim feels like hype or what you’d put in a pitch deck, rather than a real claim.
Sure. Those seem like reasonable things to track. That does not tell us what to do with that information once it crosses some threshold, or what decisions change based on the answers.
Lots of compute per person makes sense. So does iterating on ideas and challenging each other.
If you are ‘shipping quickly’ in the AI space, then how are you being safe? The answer can be ‘our products are not inherently dangerous’ but you cannot let anyone involved in any of this anywhere near a new frontier model. Krassen’s summary instead said we should look for information on their first release in a few weeks.
Which is bad given the next answer:
Yes, it is good to be useful for both consumers and businesses, yay mundane utility. But in what sense are you going to be in competition with Google and OpenAI? The only actual way to ‘compete with OpenAI’ is to build a frontier model, which is not the kind of thing where you ‘iterate quickly’ and ‘ship in a few weeks.’ For many reasons, only one of which is that you hopefully like being not dead.
This is known to be Elon’s position on Twitter’s data. Note that he is implicitly saying he will not be using private data, which is good news.
This may end up being true but I doubt Elon has thought well about it, and I do not expect this statement to correlate with what his engineers end up doing.
I am very confident he has not thought this through. It’s going to be fun.
I’ve seen Elon’s arms length transactions. Again, it’s going to be fun.
I am even more confident Elon has not thought this one through. It will be even more fun. No, Elon is not going to go to prison. Nor would him going to prison help. Does he think the government will have no other way to stop the brave truth-telling AI if he stands firm?
Sigh, again with the open source nonsense. The good news is that if Elon wants xAI to burn his capital on compute indefinitely without making any revenue, he has that option. But he is a strange person to be talking about being voracious for profit, if you have been paying attention to Twitter. I mean X? No, I really really mean Twitter.
Oh, you’re being serious. Love it. Too perfect. Can’t make this stuff up.
I am very curious to see how they attempt to do that, if they do indeed so attempt.
Quite the probability distribution. Admire the confidence. But not the calibration.
Then those votes will, I presume, be disregarded, assuming they were not engineered beforehand.
Here are some additional notes, from Krassen’s summary:
Quite so. Eventually those in charge might actually notice the obvious about this, if enough people keep saying it?
Does he now? There are some trades he could do and he has the capital to do them.
Yes, I suppose we can agree that if we can agree on something we should do it, and that safety is worth a non-zero amount. For someone who actively expresses worry about AI killing everyone and who thinks those building it are going down the wrong paths, this is a strange unwillingness to pay any real price to not die. Does not bode well.
Simeon’s core take here seems right.
I believe all of these claims simultaneously on xAI:
The question is to what extent those involved know that they do not know what they are doing on safety. If they know they do not know, then that is mostly fine, no one actually knows what they are doing. If they think they know what they are doing, and won’t realize that they are wrong about that in time, that is quite bad.
Scott Alexander wrote a post Contra the xAI Alignment Plan.
Scott also notices that Musk dismisses the idea of using morality as a guide in part because of The Waluigi Effect, and pushes back upon that concern. Also I love this interaction:
Whenever you hear a reason someone is bad, remember that there are so many other ways in which they are not bad. Why focus on the one goat?
Introducing
ChatGPT for Android. I guess? Browser version seemed to work fine already.
Not AI, but claims of a room temperature superconductor that you can make in a high school chemistry lab. Huge if true. Polymarket traded this morning at 22%. A prediction market was at 28% the first morning that it would replicate and at 23% this morning; Metaculus this morning is at 20%. Preprint here. There are two papers, one with exactly three authors – the max for a Nobel Prize – so at least one optimization problem did get solved. It seems they may have defected against the other authors by publishing early, which may have forced the six-author paper to be rushed out before it was ready. Here is some potential explanation.
Eliezer Yudkowsky is skeptical, also points out a nice side benefit:
Jason Crawford is also skeptical.
Arthur B gives us an angle worth keeping in mind here. If this type of discovery is waiting to be found – and even if this particular one fails to replicate, a 20% prediction is not so skeptical about there existing things in the reference class somewhere – then what else will a future AGI figure out?
There are high-utility physical affordances waiting to be found. We cannot know in advance which ones they are or how they work. We can still say with high confidence that some such affordances exist.
In Other AI News
Anthropic calls for much better security for frontier models, to ensure the weights are kept in the right hands. Post emphasizes the need for multi-party authorization for AI-critical infrastructure design, with extensive two-party control similar to that used in other critical systems. I strongly agree. Such actions are not sufficient, but they sure as hell are necessary.
Financial Times post on the old transformer paper, Attention is All You Need. The authors have since all left Google, FT blames Google for moving too slowly.
Quiet Speculations
Seán Ó hÉigeartaigh offers speculations on the impact of AI.
Our society’s central discourse presumes this is always the distribution of harms, no matter the source. It is not an unreasonable prior to start with, I see the logic of why you would think this, along with the social and political motivations for thinking it.
In the case of full AGI or especially artificial superintelligence (ASI) this instinct seems very right, with the caveat that the powerful by default are the one or more AGIs or ASIs, and the least empowered can be called humans.
In the case of mundane-level AI, however, this instinct seems wrong to me, at least if you exclude those who are so disempowered they cannot afford to own a smartphone or get help from someone who has such access.
Beyond that group, I expect mundane AI to instead help the least empowered the most. It is exactly that group whose labor will be in disproportionately higher demand. It is that group that most needs the additional ability to learn, to ask practical questions, to have systems be easier to use, and to be enabled to perform class. They benefit the most from the fact that systems can now be customized for different contexts and cultures and languages. And it is that group that will most benefit in practical terms when society is wealthier and we are producing more and better goods, because they have a higher marginal value of consumption.
To the extent that we want to protect people from mundane AI, or we want to ensure gains are ‘equally distributed,’ I wonder if this is not instead the type of instinct that thinks that it would be equitable to pay off student loans.
The focus on things like algorithmic discrimination reveals a biased worldview that sees certain narrow concerns as central to life, and as being in a different magisterium from other concerns, in a way they simply are not. It also assumes a direction of impact. If anything, I expect AI systems to on net mitigate such concerns, because they make such outcomes more blameworthy, bypass many ways in which humans cause such discrimination and harms, and provide places to put thumbs on the scale to counterbalance such harms. It is the humans that are most hopelessly biased, here.
I see why one would presume that AI defaults to favoring the powerful. When I look at the details of what AI offers us at the mundane utility level of capabilities, I do not see that.
Very true.
My note however would be that once again we always assume change will result in greater power imbalances. If we are imagining a world in which AIs remain tools and humans remain firmly in control of all resources, then unless there is a monopoly or at least oligopoly on sufficiently capable AIs, why should we assume this rather than its opposite? One could say that one 'does not need to worry' about good things like reducing such imbalances while we do need to worry about the risks or costs of increasing them, and in some ways that is fair, but it paints a misleading picture, and it ignores the potential risks of the flip side, where the AIs could wreak havoc if misused.
The bigger issue, of course, is the loss of control.
Yes. I’d even say this is understated.
Bernie-Sanders-meme-style I am once again asking everyone to stop with this talking point. There is not one tiny fixed pool of ‘AI harms’ attention and another pool of ‘all other attention.’ Drawing attention to one need not draw attention away from the other on net at all.
Also most mitigations of future harms help mitigate present harms and some help the other way as well, so the solutions can be complementary.
In terms of mundane deployment we have the same clash as with every other technology. This is a clash in which we are typically getting this wrong by deploying too slowly or not at all – see the standard list and start with the FDA. We should definitely be wary of restrictions on mundane utility causing more harm than good, and of a one-dial ‘boo AI vs. yay AI’ causing us to respond to fear of future harms by not preventing future harms and instead preventing present benefits.
Is there risk that by protecting against future harms we delay future benefits? Yes, absolutely, but there is also tons to do with existing (and near future still-safe) systems to provide mundane utility in the meantime, and no one gets to benefit if we are all dead.
Quite so.
That does not mean the mundane harms-versus-benefits calculation will soon swing to harmful here. I am mostly narrowly concerned on that front about enabling misuse of synthetic biology, otherwise my ‘more knowledge and intelligence and capability is good’ instincts seem like they should continue to hold for a while, and defense should be able to keep pace with offense. I do see the other concerns.
The far bigger question and issue is at what point do open source models become potentially dangerous in an extinction or takeover sense. Once you open source a model, it is available forever, so if there is a base model that could form the core of a dangerous system, even years down the line, you could put the world onto a doomed path. We are rapidly approaching a place where this is a real risk of future releases – Llama 2 is dangerous mostly as bad precedent and for enabling other work and creating the ecosystem, I start to worry more as we get above ~4.5 on the GPT-scale.
Um, yes. Yes it can. Strongly prefer to avoid such solutions if possible. Also prefer not to avoid such solutions if not possible. Litany of Tarski.
Knowing the situation now and facing it down is the best hope for figuring out a way to succeed with a minimum level of such tactics.
The similarity is an unfortunate reality that is often used (often but not always in highly bad faith) to discredit such claims. When used carefully it is also a legitimate concern about what is causing or motivating such claims, a question everyone should ask themselves.
Yes again. If I could be confident AGI in the next few decades was impossible, then I would indeed be writing very, very different weekly columns. Life would be pretty awesome, this would be the best of times and I’d be having a blast. Also quite possibly trying to found a company and make billions of dollars.
Could an AI company simply buy the required training data?
I do not see this as a question of cost. It is a question of quality. If you pay for humans to create data you are going to get importantly different data than you would get from StackOverflow. That could mean better, if you design the mechanisms accordingly. By default you must assume you get worse, or at least less representative.
One can also note that a hybrid approach seems obviously vastly superior to a pure commission approach. The best learning is always on real problems, and also you get to solve real problems. People use StackOverflow now without getting paid, for various social reasons. Imagine if there was payment for high quality answers, without any charge for questions, with some ‘no my rivals cannot train on this’ implementation.
Now imagine that spreading throughout the economy. Remember when everything on the web was free? Now imagine if everything on the web pays you, so long as you are creating quality data in a private garden. What a potential future.
In other more practical matters: Solving for the equilibrium can be tricky.
White House Secures Voluntary Safety Commitments
We had excellent news this week. That link goes to the announcement. This one goes to the detailed agreement itself, which is worth reading.
The White House secured a voluntary agreement with seven leading AI companies – not only Anthropic, Google, DeepMind, Microsoft and OpenAI, but also Inflection and, importantly, Meta – for a series of safety commitments.
Voluntary agreement by all parties involved, where available with the proper teeth, is almost always the best solution to collective action problems. It shows everyone involved is on board. It lays the groundwork for codification, and for working together further towards future additional steps and towards incorporating additional parties.
Robin Hanson disagreed, speculating that this would discourage future action. In his model, 'something had to be done, this is something and we have now done it.' So perhaps we won't need to do anything more. I do not think that is how this works, and it reflects a mentality where, in Robin's opinion, nothing needed to be done absent the felt need to Do Something.
They do indeed intend to codify:
The question is, what are they codifying? Did we choose wisely?
Security testing and especially outside evaluation and red teaming is part of any reasonable safety plan. At least for now this lacks teeth and is several levels short of what is needed. It is still a great first step. The details make it sound like this is focused too much on mundane risks, although it does mention biosecurity. The detailed document makes it clear this is more extensive:
That’s explicit calls to watch for self-replication and physical tool use. At that point one could also hope for mention of self-improvement or automated planning or manipulation, so this list could be improved, but damn, that’s pretty good.
This seems unambiguously good. Details are good too.
Do not sleep on this. There are no numbers or other hard details involved yet but it is great to see affirmation of the need to protect model weights, and to think carefully before releasing such weights. It also lays the groundwork for saying no, do not be an idiot, you are not allowed to release the model weights, Meta we are looking at you.
This is another highly welcome best practice when it comes to safety. One could say that of course such companies would want to do this anyway and to the extent they wouldn’t this won’t make them do more, but this is an area where it is easy to not prioritize and end up doing less than you should without any intentionality behind that decision. Making it part of a checklist for which you are answerable to the White House seems great.
This will not directly protect against existential risks but seems like a highly worthwhile way to mitigate mundane harms while plausibly learning safety-relevant things and building cooperation muscles along the way. The world will be better off if most AI content is watermarked.
My only worry is that this could end up helping capabilities by allowing AI companies to identify AI-generated content so as to exclude it from future data sets. That is the kind of trade off we are going to have to live with.
The companies are mostly already doing this with the system cards. It is good to codify it as an expectation. In the future we should expect more deployment of specialized systems, where there would be temptation to do less of this, and in general this seems like pure upside.
There isn’t explicit mention of extinction risk, but I don’t think anyone should ever need to release information on their newly released model’s extinction risk, in the sense that if your model carries known extinction risk what the hell are you doing releasing it.
This is the one that could cause concern based on wording of the announcement, with the official version being even more explicit that it is about discrimination and bias and privacy shibboleths; it even mentions protecting children. Prioritizing versus what? If it’s prioritizing research on bias and discrimination or privacy at the expense of research on everyone not dying, I will take a bold stance and say that is bad. But as I keep saying there is no need for that to be the tradeoff. These two things do not conflict, instead they complement and help each other. So this can be a priority as opposed to capabilities, or sales, or anything else, without being a reason not to stop everyone from dying.
Yes, it is a step in the wrong direction to emphasize bias and mundane dangers without also emphasizing or even mentioning not dying. I would hope we would all much rather deal with overly biased systems that violate our privacy than be dead. It’s fine. I do value the other things as well. It does still demand to be noticed.
I do not treat this as a meaningful commitment. It is not as if OpenAI is suddenly committed to solving climate change or curing cancer. If they can do those things safely, they were already going to do them. If not, this won’t enable them to.
If anything, this is a statement about what the White House wants to consider society’s biggest challenges – climate change, cancer and whatever they mean by ‘cyberthreats.’ There are better lists. There are also far worse ones.
That’s six very good bullet points, one with a slightly worrisome emphasis, and one that is cheap talk. No signs of counterproductive or wasteful actions anywhere. For a government announcement, that’s an amazing batting average. Insanely great. If you are not happy with that as a first step, I really do know what you were expecting, but I do not know why you were expecting it.
As usual, the real work begins now. We need to make the cheap talk into more than cheap talk, and use it to lay the groundwork for more robust actions in the future. We are still far away from compute limits or other actions that could be sufficient to keep us alive.
The White House agrees that the work must continue.
Bullet points of other White House actions follow. I cringe a little every time I see the emphasis on liberal shibboleths, but you go to war with the army you have and in this case that is the Biden White House.
Also note this:
That list is a great start on an international cooperation framework. The only problem is who is missing. In particular, China (and less urgently and less fixable, Russia).
That’s what the White House says this means. OpenAI confirms this is part of an ongoing collaboration, but offers no further color.
I want to conclude by reiterating that this action not only shows a path forward for productive, efficient, voluntary coordination towards real safety, it shows a path forward for those worried about AI not killing everyone to work together with those who want to mitigate mundane harms and risks. There are miles to go towards both goals, but this illustrates that they can be complementary.
Jeffrey Ladish shares his thoughts here, he is in broad agreement that this is a great start with lots of good details, although of course more is needed. He particularly would like to see precommitment to a sensible response if and when red teams find existentially threatening capabilities.
Bloomberg reported on this with the headline ‘AI Leaders Set to Accede to White House Demand for Safeguards.’ There are some questionable assumptions behind that headline. We should not presume that this is the White House forcing companies to agree to keep us safe. Instead, my presumption is that the companies want to do it, especially if they all do it together so they don’t risk competitive disadvantage and they can use that as cover in case any investors ask. Well, maybe not Meta, but screw those guys.
The Ezra Klein Show
Ezra Klein went on the 80,000 Hours podcast. Before I begin, I want to make clear that I very much appreciate the work Ezra Klein is putting in on this, and what he is trying to do; it is clear to me he is doing his best to help ensure good outcomes. I would be happy to talk to him and strategize with him and work together and all that. The critiques here are tactical, nothing more. The transcript is filled with food for thought.
I also want to note that I loved when Ezra called Wiblin out on Wiblin’s claim that he didn’t find AI interesting.
One could summarize much of the tactical perspective offered as: Person whose decades-long job has been focused on the weeds of policy proposal and communication details involving the key existing power players who actually move legislation suggests that the proper theory of change requires focus on the weeds of policy proposal and communication details involving the key existing power players who actually move legislation.
This bit is going around:
Robert Wiblin gives his response in these threads.
Everyone has their own idea for what any reasonable person would obviously do if the world was at stake. Usually it involves much greater focus on whatever that person specializes in and pays attention to more generally.
Which does not make them wrong. And it makes sense. One focuses on the areas one thinks matter, and then notices the ways in which they matter, and they are top of mind. It makes sense for Klein to be calling for greater focus on concrete proposal details.
It also is a classic political move to ask why people are not ‘informed’ about X, without asking about the value of that information. Should I be keeping track of this stuff? Yes, to some extent I should be keeping track of this stuff.
But in terms of details of speeches and the proposals ‘floating around Congress’ that seems simultaneously quite the in-the-weeds level and also they’re not very concrete.
There is a reason Robin Hanson described Schumer’s speech as ‘news you can’t use.’ On the other proposals: We should ‘create a new agency’? Seems mostly like cheap talk full of buzzwords that got the words ‘AI’ edited in so Bennett could say he was doing something. We should ‘create a federal commission on AI’? Isn’t that even cheaper talk?
Perhaps I should let more of this cross my threshold and get more of my attention, perhaps I shouldn’t. To the extent I should it would be because I am directly trying to act upon it.
Should every concerned citizen be keeping track of this stuff? Only if and to the extent that the details would change their behavior. Most people who care about [political issue] would do better to know less about process details of that issue. Whereas a few people would greatly benefit from knowing more. Division of labor. Some of you should do one thing, some of you should do the other.
So that tracking is easier, I had a Google document put together of all the bills before Congress that relate to AI.
So after an in-depth review of all of the proposals before Congress, it does not look like there is much of relevance before Congress. To the extent people want to Do Something, the something in question is to appoint people to look into things, or to prepare reports, or perhaps to prohibit things on the margin in ways that do not much matter, often in ways that have not been, shall we say, thought through in terms of their logistics.
The ‘most real’ of the proposals to file proper reports are presumably Ted Lieu’s blue ribbon commission and Bennett’s cabinet-level task force. My instinct is to prefer Lieu’s approach here, but the details of such matters are an area in which I am not an expert. If we could get the right voices into the room, I’d be down with such things. If not, I’d expect nothing good to come out of them, relative to other processes that are underway.
Ezra Klein actually goes even further into the weeds than the popular quote suggests.
I am uncertain how much to buy this. Definitely not a zero amount. It seems right that this is an area of relative under-investment. Someone should get on that. But also in a real sense, this is a different theory than the theory that we should know about the concrete proposals out there. If we are in an educational phase, if we are in the phase where we build up relationships, is that not what matters?
Similarly, Ezra later highlights the need to know which congress members think what and which ones people listen to, and once again even for those focused on DC and politics these are competing for attention.
And then he talks about the need to build DC institutions:
This is such a crazy system. You call whoever is physically there and technically has an organization? Do those calls actually matter? As usual, my expectation is that people will create organizations, and this will not happen, and only retroactively will people explain the ways in which those involved were ‘doing it wrong’ or what was the critical missing step. Also I notice the story does not seem to have a great ending, did they end up influencing the path of crypto law? What about counterfactually?
There’s a kind of highly anti-EMH vibe to all of this, where there are very cheap, very small interventions that do orders of magnitude more work and value creation, yet people mostly don’t do them. Which I can totally believe, but there’s going to be a catch in terms of implementation details being tricky.
Ezra Klein says that AI people don’t go to DC because SF is where the action is; they want to hang out with other AI people and work on interesting problems. Certainly there is a lot of that. They also, I am sure, enjoy making several times as much money, and specifically not dealing with politics or fundraising.
I also have a bit of an axe to grind about people who measure things in terms of names and also locations.
Names are not something I focus my finite memory upon. As in: I have looked carefully at the AI Blueprint Bill of Rights, without knowing that it was run by Alondra Nelson, whose name has otherwise come up zero times. Tracking key Senators seems more reasonable, it still seems like a specialized need. I have witnessed many intricate debates about the legal philosophy of different Supreme Court Justices that, while I often find them interesting, almost never have any practical value no matter how much one cares about cases before the court.
There is division of labor. Some other people should be tracking the names because they are interacting with the people. Most people should not have to.
Locations are similar. For decades I have heard ‘if you are not in exactly San Francisco you don’t count.’ Now it is ‘if you are not in Washington, DC you do not count.’ There is certainly truth to location mattering, the same as names mattering, but also one hears so much of talking one’s own book in order to make it so.
This all seems deeply similar to Tyler Cowen expressing such frustration, perhaps outrage, that those involved have not engaged properly with the academic literature and have not properly ‘modeled,’ in his preferred economic sense, the dynamics in question. And then to say we have ‘lost the debate’ because the debate is inside national security or other inside-government circles, with the Democratic process irrelevant.
There is even a direct disagreement there about where hope lies. Ezra Klein sees the Democratic process as promising, able to do worthwhile things, while wanting to keep the national security types out of it. Tyler Cowen suggests often that the national security discussions are what matters, and the Democratic ones are sideshows at best, or walking disasters waiting to happen if they are otherwise.
Many others talk about failure to use other, doubly orthogonal techniques and approaches, that also vary quite a lot from each other. Some of which are working.
Another Congressional Hearing
This one seemed a clear improvement over the previous one. Policy talk got more concrete, and there was much more focus on extinction risks and frontier models. You can see a partial transcript here.
Senator Blumenthal’s opening remarks were mostly quite good.
Senator Hawley is concerned about the wrong companies getting too large a share of the poisoned bananas, and confident that corporations are automatic money printing machines of limitless power. Later he would spend his entire first time allocation expressing alarm that Google might in the future incorporate Claude into its services. He does not seem to realize DeepMind exists and Gemini is coming. Think of how much money Google might make, of the power, he warns, isn’t that alarming?
Later he pivoted to concerns about our chip supply chain. Then (~1:55) he pivots again to the workers who helped evaluate ChatGPT’s data, warning that they outsourced so many jobs and also that those who got the jobs ‘that could have been done in the United States’ were exploited, overworked and traumatized, and paid only a few dollars an hour. Why, he asks, shouldn’t American workers benefit rather than having it be built by foreigners? He frames this as if the workers sorting through data are the ones who have the good jobs and that benefit from ‘building’ the AI like it was some manufacturing offshoring story, but expresses relief when Dario says Constitutional AI will mean less labor is needed – presumably if no one gets the job at all, that means you didn’t outsource it, so that’s fine.
Then Blumenthal responds that Americans need ‘training’ to do those jobs, which in context makes zero sense. And says no, we’re not going to pause, it’s a gold rush and we have to ‘keep it made in America,’ which he thinks includes data evaluators, somehow? And again says ‘training’ as if it is a magic word.
Then Blumenthal asks who our competitors are, about potential ‘rogue nations.’ He frames this as asking who needs to be brought into an international agreement, so both sides of the coin (jingoism and cooperation) are present. Russell points out the UK is actually our closest competitor, says he has talked to the major players in China and that the level of threat is currently slightly overstated and all the Chinese are doing is building inferior copycat systems, although intent is there. The Chinese he says have more public but less private money, and the big customer is state security, so they are mostly good at security-related things. But that the Chinese aren’t allowing the freedom required to do anything.
Russell says that everyone else is far behind, and Russia in particular has nothing. Bengio mentions there are good researchers in the EU and Canada.
Later Blumenthal does ask about real safety, and gets into potential threats and countermeasures, including kill switches, AutoGPT, the insanity of open sourcing frontier models, and more. He is not a domain expert, he still has a long way to go, but it is clear that he is actually trying, in a way that other lawmakers aren’t.
Blumenthal suggests requiring issue reporting, which everyone agrees would be good. This won’t solve the core problems but definitely seems underutilized so far. He notes that there needs to be a ‘cop on the beat,’ that enforcement is required, and that recalls only work if they are enforced and consumers don’t ignore them. It’s great groundwork for helping him understand these issues better.
Hawley does ask one very good question, which is what one or at most two things Congress should do now (~2:14). Russell says create an agency and remove violating systems from the market. Bengio affirms those and adds funding for safety work. Amodei emphasizes testing and auditing, and that we only have 2-3 years before there are serious threats, then later the threat of autonomous replication.
Senator Klobuchar is concerned about Doing Something and Protecting Democracy, fighting disinformation and scams. She spent a little time having the witnesses give awareness-raising speeches for her, rather than seeking information, and that’s it.
Senator Blackburn lectured Amodei about the dangers of social media, says we are ‘behind on privacy.’ Asks if current regulations will be enough, it is pointed out that we don’t enforce our existing laws, she replies that obviously we need more then. She reminds us that ‘in Tennessee AI is important’ and that it’s terrible that Spotify does not list enough female country music artists in its suggestions. And that’s about it.
Yoshua Bengio and Stuart Russell both focused on the imminent arrival of full AGI.
Bengio warned that he’d gone from thinking we had centuries before AGI to as few as a few years, and he talked about how to measure danger, talking about access, alignment, scope of action and scope of intelligence. He called for:
At 1:01 Bengio emphasizes the need to prevent the release of further frontier models. At 1:42 he says we need a singular focus on guarding against rogue AI.
Stuart in particular noted how we do not know how LLMs work or what their goals are or how to set those goals, and warned that we are likely to lose control. Russell calls LLMs a ‘piece of the puzzle’ of AGI.
Stuart Russell estimates the cash value of AI at at least $14 quadrillion (!). Around 2:14 he says he expects AI to be responsible for the majority of economic output.
He also estimates $10 billion a month (!) going into AI startups right now.
Russell’s suggestions:
He later at 1:45 clarifies that if there is a violation, a company can go out of business for all he cares, unless they can prove that they will ‘never do that again.’ Which is not, in an LLM context, a thing one can ever prove. So, out of business, then. This is not a reasonable standard to apply to anything when the harm is finite, so it only makes sense in the context of extinction risks. But one could say that if there is a future risk that kills us if it ever happens, perhaps you need to demonstrate your ability to control risks in general on a similarly robust level?
He also says it is ‘basic common sense’ to force AI companies to only use designs where they understand how those designs work. That we should only run provably safe code, not merely scan for problems. So, again, a full ban on LLMs?
Which, again, might not be the worst outcome given the big picture, but wow, yeah.
Where are we on appreciating the dangers? Not nowhere. This is regarding discussion of kill switches, which are the kind of measure that definitely fails when you need it most:
Jeffrey Ladish came away impressed. You can watch the full hearing here.
Dario Amodei offered Anthropic’s thoughts. Here is Robert Wiblin pulling out the policy ideas, here is Dario’s full opening testimony.
Dario starts by asking a good question, echoing what we learned in the Vox post with a side of national security.
He notes the extreme rate of progress on AI, and that we must think ahead.
Good choice of metaphor. If anything, I would be skating farther ahead than that. He draws a distinction between:
The implied timeline is left as an exercise for the reader. He calls the long-term risks ‘at least potentially real’ and focuses on medium-term risks. This is what worries me about Anthropic’s political approach, and its failure to grapple with what is needed.
What are the policy recommendations?
Not bad at all. I see what he did there. These are important steps in the right direction, while appearing practical. It is not an obviously incorrect strategy. It potentially lays the foundation for stronger measures later, especially if we take the right steps when securing our hardware.
As usual, a reminder, no, none of the things being discussed fit the parallels their opponents trot out. Those who speak of AI regulations as doing things like ‘shredding the Constitution’ or as totalitarian are engaging in rather blatant hyperbole.
[in reference to Dario’s ‘we’re going to have a really bad time’ remarks.]
JJ – e/acc: This is irrational fear mongering and indistinguishable from totalitarian policing of innovation and knowledge creation.
Misha: No it’s pretty distinguishable. Back in the USSR my dad worked in a lab and had coworkers literally get disappeared on him.
The Frontier Model Forum
This is certainly shaped a non-zero amount like exactly what we need? The announcement is worth reading in full.
Here is the joint announcement (link to OpenAI’s copy, link to Google’s, link to Anthropic’s).
This is simultaneously all exactly what you would want to hear, and also not distinguishable from cheap talk. Here’s how they say it will work:
That is… not a description of how this will work. That is a fully generic description of how such organizations work in general.
One must also note that the announcement does not explicitly mention any version of existential risk or extinction risk. The focus on frontier models is excellent but this omission is concerning.
The Week in Audio
Dario Amodei on Hard Fork. Some good stuff, much of which raises further questions. Then a bonus review of a Netflix show that I am glad I stayed for, no spoilers.
Jan Leike on AXRP seems quite important, I hope to get to that soon.
Rhetorical Innovation
By Leo Gao, and yes this happens quite a lot.
An objection often raised is that the AI would not kill everyone, not because it would be unable to, but because it would choose not to do so in order to preserve its supply chain.
The response of course is that the AI would simply first automate its supply chain, then kill everyone.
To which the good objection is ‘that requires a much higher level of technology than it would take to take over or kill everyone.’
To which the good response is ‘once the AI has taken over, it can if necessary preserve humanity until it has sufficiently advanced, with our help if needed, to where it can automate its supply chain.’
So this is only a meaningful defense or source of hope if the AI cannot, given time and full control over the planet including the remaining humans, automate its supply chain. Even if you believe nanotech is impossible, and robotics is hard, robotics is almost certainly not that hard.
Eliezer Yudkowsky and Ajeya Cotra give it their shots once again:
I think Eliezer is right about the path that is more likely to happen. I think Ajeya is right here that she is suggesting the rhetorically easier path, and that it is sufficient to prove the point. My additional note above is that ‘takeover first, then develop [nanotech or other arbitrary tech] later, then finally kill everyone’ is the default lowest tech-and-capabilities-level path, and potentially the best existence proof.
An unsolved problem is communicating a probability distribution to journalists.
Rest of thread explains that this is the exponential distribution, such as with the half-life of a radioactive particle, and that there are various ways that it could make sense for people’s timelines for AGI to roughly be moving into the future at one year per year as AGI does not occur, or they might go faster or slower than that.
I think this is both technically true and in practice a cop-out. The reason we can be so confident with the particle is that we have a supremely confident prior on its decay function. We have so many other observations that the one particle failing to decay does not change our model. For AGI timelines this is obviously untrue. It would be an astounding coincidence if all the things we learned along the way exactly canceled out in this sense. That does not mean it is an especially crazy prior, if all you can observe is whether AGI is near and not how far away it is when it is far. It does mean you are likely being sloppy, either falling back on a basic heuristic or at least not continuously and properly incorporating all the evidence.
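To spell out the property the thread is leaning on, here is the standard memorylessness calculation for an exponential waiting time (a minimal sketch of the textbook fact, not anything taken from the thread itself):

```latex
% If the arrival time T is exponential with rate \lambda, then having
% already waited s years tells you nothing about the remaining wait:
\[
  \Pr(T > s + t \mid T > s)
  = \frac{\Pr(T > s + t)}{\Pr(T > s)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = \Pr(T > t).
\]
% The median remaining wait is always (\ln 2)/\lambda, which is exactly
% the 'timelines slide into the future at one year per year' behavior.
```

The exponential is the only continuous distribution with this property, which is why treating AGI forecasts this way amounts to asserting that you learn nothing from the years that pass without AGI arriving.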
Roon notices something.
This is because it is very hard to imagine how such scenarios do not turn catastrophic towards the end, and LLMs are predictors.
Defense in Depth
In his good recent podcast with the Future of Life Institute, Jason Crawford calls for using defense in depth to ensure AI models are safe. That means adding together different safety measures, each of which would be insufficient on their own. He uses the metaphor of layers of Swiss cheese. Each layer has many holes, but if you have to pass through enough layers, there will be no path that navigates all of them.
I agree that, while all our methods continue to be full of holes, our best bet is to combine as many of them as possible.
The problem is that this is exactly the type of strategy that will definitely break down when faced with a sufficiently powerful optimization process or sufficiently intelligent opponent.
I believe the metaphor here is illustrative. If you have three slices of Swiss cheese, you can line them up so that no straight line can pass through all of them. What you cannot do, if there is any gap between them, is defeat a process that can move in ways that are not straight lines. A sufficiently strong optimization process can figure out how to navigate to defeat each step in turn.
Or, alternatively, if you must navigate in a straight line and each cheese slice has randomly allocated holes and is very large, you can make it arbitrarily unlikely that a given path will work, while being confident a sufficiently robust search will find a path.
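As a toy illustration of that last point, here is a minimal sketch (my own construction with made-up parameters, not anything from the podcast): each layer has some fraction of holes, a random straight path gets through with probability roughly that fraction raised to the number of layers, yet a searcher who can try many alignments almost always finds a way through.

```python
import random

# Toy Swiss cheese model: each layer is a row of cells, and a fraction
# hole_fraction of the cells are holes. A 'straight line' must use the
# same cell index in every layer; a searcher may try every index.
def swiss_cheese_demo(n_layers=3, n_cells=10_000, hole_fraction=0.2,
                      n_random_paths=100_000, seed=0):
    rng = random.Random(seed)
    layers = [
        {cell for cell in range(n_cells) if rng.random() < hole_fraction}
        for _ in range(n_layers)
    ]

    def clear(cell):
        # True if this cell is a hole in every layer, i.e. the holes line up.
        return all(cell in layer for layer in layers)

    # Chance a single random straight path gets through: about hole_fraction ** n_layers.
    random_hits = sum(clear(rng.randrange(n_cells)) for _ in range(n_random_paths))

    # A search over every cell succeeds if any aligned hole exists at all,
    # which is nearly certain once n_cells * hole_fraction ** n_layers is large.
    search_succeeds = any(clear(cell) for cell in range(n_cells))

    return random_hits / n_random_paths, search_succeeds

rate, found = swiss_cheese_demo()
print(f"random straight path success rate: {rate:.4f}")   # roughly 0.2 ** 3 = 0.008
print(f"search finds a path through all layers: {found}")  # almost always True
```

Adding layers drives the random-path number down fast, but the searcher keeps winning until there are literally no aligned holes left, which is the difference between defending against noise and defending against an optimizer.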
This is how I think about a lot of safety efforts. It is on the margin useful to ‘stack more layers’ of defenses, more layers of cheese. When one does not have robust individual defenses, one falls back on defense in depth, while keeping in mind that when the chips are down and we most need defense to win, any defense in depth composed of insufficiently strong individual components will inevitably fail.
Security mindset. Unless you have a definitely safe superintelligent system, you definitely have an unsafe superintelligent system.
Same goes for a sufficiently powerfully optimizing system.
No series of defensive layers can solve this. There are potential scenarios where such layers can save you on the margin by noticing issues quickly and allowing rapid shutdown, they could help you notice faster that your solution does not work – and that is so much better than not noticing – but that’s about it, and even that is asking a lot and should not be relied upon.
Aligning a Smarter Than Human Intelligence is Difficult
Red teaming is a great example of something that is helpful but insufficient.
If your model is sufficiently capable that it could pose existential risk, you cannot be confident that it is insufficiently capable to fool your red team. Or, alternatively, if it is potentially an existential risk in the future when given additional scaffolding and other affordances, you cannot be confident that the red team will be able to find and demonstrate the problem.
A new paper looks into how well interpretability techniques scale.
Technical solutions to social problems are often the best solutions. Often the social problem is directly because of a technical problem with a technical solution, or that needs one. What you cannot do is count on a technical solution to a social problem without understanding the social dynamics and how they will respond to that solution.
In the case of AI, I’d also say that there are both impossible technical problems and impossible social problems. If you solve one, you may or may not have solved the other. A sufficiently complete social solution means no AGI, so no technical problem. A sufficiently complete technical solution overpowers potential social challenges. The baseline most people imagine involves a non-overpowering (non-pivotal, non-singleton-creating) technical solution, and not preventing the AGI from existing, which means needing to solve both problems, and dying if we fail at either one.
As illustration, here is why technical solutions often don’t solve social problems, in a way that is highly relevant to dealing with AI:
People Are Worried About AI Killing Everyone
An important fact about the world:
Yes, a sufficiently advanced intelligence could one-shot mind-hack you verbally.
There are of course lots of lesser hacks that would be quite sufficient to fool a human for all practical purposes in most situations. Often they are highly available to humans, or would be easily available to an expert with greater observation powers and thinking speed.
Other People Are Not As Worried About AI Killing Everyone
A reader found the open letter text from last week. It is in full:
That is much better than the summaries suggested, including a far more reasonable use of the word ‘good.’ If you add the word ‘centrally’ (or ‘primarily an’) before existential, it becomes a highly reasonable letter with which I disagree. If you instead add the word ‘only’ then I would not even disagree.
During the Dario Amodei interview on Hard Fork, one of the hosts mentions that when e/acc (accelerationists) downplay the dangers of AI one should ask what financial motives they have for this. I actually don’t agree. That is not a game that we should be playing. It does not matter that many of the accelerationists are playing it; when they go low we can go high. One should strive not to question the other side’s motives.
Also, pretty much everyone has financial motive to push ahead with AI, and everyone who believes AI is safe and full of potential and worth accelerating should be investing in AI, likely also working in the field. None of it is suspicious. I’d be tempted to make an exception for those who are in the business of building hype machines, and who question the motives of others in turn as well, but no. We’re better than that.
Tyler Cowen links us to Kevin Munger’s speculation that Tyler Cowen is an Information Monster (akin to the ‘Utility Monster’) and wants to accelerate AI because it will give him more information to process. Certainly Tyler can process more information than anyone else I know – I process a lot and I am not remotely close. Given Tyler linked to him, there is likely not zero truth to this suggestion.
Other People Want AI To Kill Everyone
I am happy that those who feel this way often are so willing to say, out loud, that this is their preference. Free speech is important. So is knowing that many of those working to create potential human extinction would welcome human extinction. Whenever I hear people express a preference for human extinction, my response is not ‘oh how horrible, do not let this person talk’ it is instead ‘can you say that a bit louder and speak directly into this microphone, I’m not quite sure I got that cleanly.’
One must note that most accelerationists and most AI researchers do not believe this. At least, I strongly believe this is true, and that most agree with Roon’s statement here.
And not only not molecular squiggles, also not anything else that we do not value. We are allowed, nay obligated, to value that which we value.
Max Tegmark confirms.
Andrew Critch offers what he thinks is the distribution of such beliefs.
This call goes well beyond free speech or tolerating the intolerant. This is someone actively working to cause the extinction of mankind, and asking us to respect that perspective lest it lead to an unhealthy competitive dynamic. I can’t rule out this being strategically correct, but I’ll need to see a stronger case made, and also that is a big ask.
I do agree that it is right to engage with good faith arguments, even if the conclusions are abhorrent. I wish our society was much better about this across the board. That said, I do not think there are ‘valid points’ to make in all five of these views, they are mostly Obvious Nonsense.
The last three I would dismiss as rather horrifying things to know sometimes actually come out of the mouth of a real human being. I am not a Buddhist but optimizing to the point of human extinction for minimizing suffering is the ultimate death by Goodhart’s Law. The metric is not the measure, the map is not the territory. Being fine with human extinction because it is ‘fair’ is even worse and points to deep pathologies in our culture that anyone would suggest it. The result is not inevitable, but even if it was that does not mean one should embrace bad things.
Saying the AI are humanity’s children so them replacing us is fine is suicide by metaphor. AIs are not our children, our children are our children, and I choose to care about them. I wonder how many of those advocating for this have or plan to have human children. Again, this points to a serious problem with our culture.
I would also note that this is a remarkably technophobic, traditionalist objection, even within its own metaphor. It is not ‘good to be displaced’ by one’s children, it is that humans age and die, so the alternative to displacement is the void. Whereas with AI humans need not even die, and certainly can be replaced by other humans, so there is no need for this.
The justification that the AI will be ‘more moral’ than us misunderstands morality, what it is for and why it is here. The whole point of morality is what is good for humans. If you think that morality justifies human extinction, once again you are dying via Goodhart’s Law, albeit with a much less stupid mistake.
Rob Bensinger offers his response to the five arguments. He sees (b) and (c) as reducing to (a) when the metaphors are unconfused, which seems mostly right, and that (d) and (e) come down to otherwise expecting a worse-than-nothing future, which I think is a steelman that assumes the person arguing wants better things rather than worse things, whereas I think that is not actually what such people think. We agree that (a) is the real objection here (and it is also the more popular one). Rob seems to treat ‘future AIs built correctly could have more worth than us’ as not a conceptual error, but warns that we are unlikely to hit such a target, which seems right even if we take morality as objective and not about us, somehow. And of course he points out that would not justify wiping out the old.
I do agree that it makes sense to understand where these views are coming from.
Andrew Critch then tells us that it is fine to feel outrage and perhaps some fear, but we need to also show empathy and not paper over such concerns, and continue to engage in respectful and open discourse.
I at least agree with the call on discourse. I believe that if we do this, we will be guided by the beauty of our weapons. And that this would extend to so many other discussions that have nothing to do with AI, as well.
Is this from Rob Miles right?
I do think this is often step one. I also think it has no conflict with care, empathy or consideration. The most empathetic thing to start with is often ‘hey, I think you made a mistake or didn’t realize what you were saying, that seems to imply [X, Y, Z], are you sure you meant that?’
However I do think that in many cases such people actually do believe it and that belief will survive such inquiries.
There are other, in some ways ‘more interesting,’ subsets of ‘all people’ that one could ask whether someone supports being killed, as well.
In many cases yes, people embrace the abstract when they would not embrace the concrete, or the general but not the particular. Or they would endorse both, but they recoil from how the particular sounds, or they’d prefer to get a particular exception for themselves, or they haven’t realized the implications.
In other cases, I think the bulk of cases, no. The default real ‘pro-extinction’ case is, I think, that it would be better if no one was killed and instead people died out, of natural causes, after happy bountiful retirements. There is then a split as to whether or not it would much bother them if instead everyone, including their children (note that usually they don’t have any), is killed, violently or otherwise.
Usually, when they imagine the outcome, they imagine a form of the retirement scenario, and find a way to believe that this is what would happen. This seems exceedingly unlikely (and also would bring me little comfort).
To summarize, yes, dealing with the following is frustrating, why do you ask?
What Is E/Acc?
Here is a very strange prediction.
This reflects a very different understanding than mine of many distinct dynamics. I do not believe EA is going to become entirely about AI safety (or even x-risk more generally). If EA did narrow its focus, I would not expect the e/acc crowd to pick up the mantles left on the ground. They are about accelerationism, trusting that the rising tide will lift all boats rather than drown us. Or they see technology as inevitable or believe in the doctrines of open source and freedom to act. Or in some cases they simply don’t want anyone denying them their cool toys, with which I can certainly sympathize. They are not about precise effectiveness or efficiency.
Whereas I see others asserting the opposite, that EA and e/acc are the same.
Vassar replies with a hint.
What is the thing that both claim to want? A good future with good things and without as many bad things, rather than a bad future without good things or with more bad things.
This is not as much a given as we might think. There are many ideologies that prefer a bad future to a good future, or that care about the bad things happening to the bad people rather than preventing them, and that care about the good things happening more often to the good people than the bad people rather than there being more good things in general. On this important question, both EA and e/acc have it right.
The differences are what is the good, and how does one achieve the good?
Vassar’s description of how both groups think one achieves the good actually seems reasonably accurate.
EAs (seem to) believe at core that you figure out the best actions via math and objective formalism. This is a vast improvement on the margin compared to the standard method of using neither math nor formalism. The problem is that it is incomplete. If your group collectively does it sufficiently hard to the exclusion of other strategies, you get hit hard by Goodhart’s Law, and you end up out-of-distribution, so it stops working and starts to actively backfire.
E/accs seem to believe at core that the good comes from technology and building cool new stuff. That is a very good default assumption, much better than most philosophical answers. One should presume by default that making humans better at doing things and creating value will usually and on net turn out well. There is a reason that most people I respect have said, or I believe would agree with, statements of the form ‘e/acc for everything except [X]’ where X is a small set of things like AGIs and bioweapons.
The problem is that AGI is a clear exception to this. The reason, essentially, is that when it is no longer the humans doing the things and deciding what happens next, that stops being good for the humans, and we are approaching the point where that might happen.
More generally, technology has turned out to almost always be beneficial to humans, exactly where and because humans have remained in control, and exactly where and because the humans noticed that there were downsides to the technology and figured out how to mitigate and handle them.
There are twin failure modes, where you either ignore the downsides and they get out of control, or you get so scared of your own shadow you stop letting people do net highly useful things in any reasonable way.
We used to do a decent amount of the first one; luckily the tech tree was such that we had time to notice and fix the problems later, in a way we likely won’t be able to with AGI. We now often fail in the second way instead, in ways we may never break out of, hence the e/acc push to instead fail in the first way.
Steering and slamming the gas pedal both move the car. They are not the same thing.
I notice my frustrations here. The true e/acc would be advocating for a wide range of technological, scientific and practical advances, a true abundance agenda, more more more. Ideally it would also notice that often there are downsides involved, and tell us humans to get together and solve those problems, because that is part of how one makes progress and ensures it constitutes progress.
Instead, what we usually see does seem to boil down to a ‘just go ahead’ attitude, slamming the gas pedal without attempting to steer. Often there is denial that one could steer at all – the Onedialism theory that you have an accelerator and a brake, but no wheel; you’re on a railroad and the track will go where it goes, so full steam ahead.
Is the following fair?
Not entirely yes. Not entirely no.
I also agree with Rob Bensinger here, also with Emmett Shear, except that ‘how dare anyone entertain the hypothesis’ has its own inherent problems even when the hypothesis in question is false.
The Lighter Side
A fun comparison: Llama-2 versus GPT-4 versus Claude-2 as they argue that a superintelligence will never be able to solve a Rubik’s Cube.
Tag yourself.
Replica made a plug-in to make NPCs in video games respond to the player, so this player went around telling the NPCs they are AIs in a video game (2 minutes).
It’s important to plan ahead.
Sometimes alignment is easy.
Would be a real shame if people did so much of this that humans had to fact check or even write the articles.
The ultimate prize is Blizzard actually introducing Glorbo, since why not.
So true. Also Elon is changing the name of Twitter.
Here it is, everyone:
My favorite variant of this was from the finals of Grand Prix: Pittsburgh, when all three of my team were up 1-0 in our matches, and an opponent said ‘everything is going according to their plan!’