Yi Zeng's full statement to the UN. Also, why he co-signed open letters on AI risk.
With China and the UN Security Council developing positions on AI risk, I think we can say that the issue has now been recognized everywhere in world politics.
Here's the UN press release. I think it's somewhat amazing that someone from Anthropic addressed the UN Security Council, but that's how far we've come now.
Of course we should now prepare to be disappointed, dismayed and horrified, as the highest powers forget the issue for months at a time, or announce perspectives and initiatives that seem to be going in the wrong direction. But that's politics. The point is that increasingly there's political awareness and an organizational infrastructure that is at least nominally meant to deal with AI risk at a global level, when a year ago there was basically nothing.
Senator Ed Markey warns of the dangers of AIs put in charge of nuclear weapons. Could events spiral out of human control, with all sides worried that if they do not automate their systems they will become vulnerable? This is certainly a scenario to worry about.
It is also one for which history provides useful context. The Soviet version is called the Perimeter System, known in the US as the Dead Hand. It was designed for second strike capability, and is able to determine the scale of, and launch, a full nuclear strike even in the event of total decapitation of the human command elements.
As of 2018, it was still operational.
Once I am caught up I intend to get my full Barbieheimer on some time next week, whether or not I do one right after the other. I’ll respond after. Both halves matter – remember that you need something to protect.
That's why it has to be Oppenheimer first, then Barbie. :)
When I look at the report, I do not see any questions about 2100 that are more ‘normal’ such as the size of the economy, or population growth, other than the global temperature, which is expected to be actually unchanged by an AGI that is 75% likely to arrive by then. So AGI not only isn’t going to vent the atmosphere and boil the oceans or create a Dyson sphere, it also isn’t going to design us superior power plants or forms of carbon capture or safe geoengineering. This is a sterile AGI.
This doesn't feel like much of a slam dunk to me. If you think very transformative AI will be highly distributed, safe by default (i.e. 1-3 on the table) and arise on the slowest end of what seems possible, then maybe we don't coordinate to radically fix the climate. We just use TAI to adapt well individually, decarbonize and get fusion power and spaceships, but don't fix the environment or melt the earth and just kind of leave it be, because we can't coordinate well enough to agree on a solution. Honestly that seems not all that unlikely, assuming alignment, slow takeoff and a mediocre outcome.
If they'd asked about GDP and they'd just regurgitated the numbers given by the business-as-usual UN forecast right after being queried about AGI, then it would be a slam dunk that they're not thinking it through (unless they said something very compelling!). But to me, while parts of their reasoning feel hard to follow, there's nothing clearly crazy.
The view that the Superforecasters take seems to be something like "I know all these benchmarks seem to imply we can't be more than a low number of decades off powerful AI, and these arguments and experiments imply super-intelligence should be soon after and could be unaligned, but I don't care, it all leads to an insane conclusion, so that just means the benchmarks are bullshit, or that one of the 'less likely' ways the arguments could be wrong is correct." (Note that they didn't disagree on the actual forecasts of what the benchmark scores would be, only their meaning!)
One thing I can say is that it very much reminds me of Da Shi in the novel The Three-Body Problem (who - and I know this is fictional evidence - ended up being entirely right in this interaction that the supposed 'miracle' of the CMB flickering was a piece of trickery):
"You think that's not enough for me to worry about? You think I've got the energy to gaze at stars and philosophize?"
"You're right. All right, drink up!"
"But, I did indeed invent an ultimate rule."
"Tell me."
"Anything sufficiently weird must be fishy."
"What... what kind of crappy rule is that?"
"I'm saying that there's always someone behind things that don't seem to have an explanation."
"If you had even basic knowledge of science, you'd know it's impossible for any force to accomplish the things I experienced. Especially that last one. To manipulate things at the scale of the universe—not only can you not explain it with our current science, I couldn't even imagine how to explain it outside of science. It's more than supernatural. It's super-I-don't-know-what...."
"I'm telling you, that's bullshit. I've seen plenty of weird things."
"Then tell me what I should do next."
"Keep on drinking. And then sleep."
Never assume that people are doing the thing they should obviously be doing, if it would be even slightly weird or require a bit of activation energy or curiosity. Over time, I do expect people to use LLMs to figure out more latent probabilistic information about people in an increasingly systematic fashion. I also expect most people to be not so difficult to predict on most things, especially when it comes to politics. Eventually, we will deal with things like ‘GPT-5 sentiment analysis says you are a racist’ or what not far more than we do now, in a much more sophisticated way than current algorithms. I wonder if we will start to consider this one of the harms that we need to protect against.
This is already one of my big short-term concerns. The process of using LLMs to figure out that information right now requires a lot of work, but what happens when it doesn't? When LLMs with internet access and long term memory start using these tactics opaquely when answering questions? It seems like this capability could emerge quickly, without much warning, possibly even just with better systems built around current or near-future models.
"Hey, who is behind the twitter account that made this comment? Can you show me everything they've ever written on the internet? How about their home address? And their greatest hopes and fears? How racist are they? Are they more susceptible to manipulation after eating pizza or korean bbq? What insights about this person can you find that have the highest chance of hurting their reputation? What damaging lie about their spouse are they most likely to believe?"
How much about you can be determined just from what other people are willing to give up, similar to identifying an individual through familial DNA searching?
By the Matt Levine Vacation Rule, I took several days to go to Seattle and there was a truly epic amount of news. We had x.AI, Llama 2, upgrades to ChatGPT, a profile of Anthropic, a ton of very interesting papers on a variety of topics, several podcasts that demand listening, fully AI-generated South Park episodes and so much more. I could not fully keep up. Oh, and now we have Barbieheimer.
Thus, I have decided to spin out or push to next week coverage of four stories:
These might get their own posts or they might get pushed to next week, depending on what I find on each. Same with my coverage of Oppenheimer since I haven’t seen it yet, and my bonus thoughts on Mission Impossible: Dead Reckoning (spoiler-free review of MI:DR for now: Good fun if you like such movies, some interesting perspective on how people would handle such a situation, a lot of clear struggling between the writer who knows how any of this works and everyone else involved in the film who didn’t care and frequently but not always got their way.)
Table of Contents
Language Models Offer Mundane Utility
Early reports from many sources say Claude 2 is much better than GPT-4 at sounding human, being creative and funny, and generally not crippling its usefulness in the name of harmlessness. It is also free and takes 75k words of context.
Radiologists do not fully believe the AI, fail to get much of the benefit, says paper.
Paper is gated, but the obvious response is that even under such circumstances the radiologists are updating in the correct direction. The problem is that they do not update as far as Bayes would indicate, which I would have predicted and matches other results.
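To make the ‘not as far as Bayes would indicate’ point concrete, here is a minimal sketch with made-up numbers (the base rate, sensitivity and false positive rate below are illustrative assumptions, not figures from the gated paper):

```python
# Illustrative only: how far a Bayesian would move on a positive AI flag,
# using assumed numbers rather than anything from the paper.
prior = 0.10        # assumed base rate of the finding
sensitivity = 0.90  # assumed P(AI flags positive | finding present)
false_pos = 0.20    # assumed P(AI flags positive | finding absent)

posterior = (sensitivity * prior) / (
    sensitivity * prior + false_pos * (1 - prior)
)
print(f"Bayesian posterior after a positive AI flag: {posterior:.0%}")
# ~33%. The reported pattern is that radiologists move in this direction
# but stop well short of it, landing somewhere in the teens.
```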
So the question is, what is the marginal cost of using the AI? If your workflow is ‘do everything you would have otherwise done, also the AI tells you what it thinks’ then how much time is lost using the AI information, and how much money is paid? If the answers are ‘very little’ to both compared to the stakes, then even small updates in the AI’s direction are good enough. Then, over time, doctors can learn to trust the AI more.
The danger is if the AI is a substitute for a human processing the same information, where the humans would have updated properly from a human and won’t from an AI. As long as you avoid that scenario, you should be fine.
This generalizes. As Tyler notes, he often gets a better answer than either unassisted AI or unassisted Tyler. I report the same. We are in the window where the hybrid approach gives the best results if you know how to use it, with the danger that AI makes errors a lot. So the key is to know how and when to error check, and not to prematurely cut out other information sources.
ChatGPT goes to Harvard. Matt Yglesias throws it at his old classes via essay assignments; it comes back with a 3.34 GPA. The Microeconomics TA awarded an A based on ‘impressive attention to detail.’
Yglesias does not note that the average GPA at Harvard is high. Very high. As in perhaps 4.18 (!?). So while an A is definitely passing, it is already below average. Holding down a 3.34 is a very slacker move. Rather than Harvard being super hard, getting into Harvard is super hard. If you want hard classes go to MIT.
This is also confined to homework. Colleges already use various techniques to ensure your grade usually comes mostly from other sources. Rather than worry about whether ChatGPT can do the homework, Yglesias ends with the right question, which is whether there is any value in the skills you learn. Liberal arts majors, even more than before, will need to rethink how they plan on providing value.
Can you use LLMs for sentiment analysis? Paper says yes, GPT-4 is quite good, with r~0.7 versus dictionary analysis getting r~0.25, on par with fine-tuned models, and without any special setup whatsoever. It does perform worse in African languages than Western ones, which is to be expected. A quirk is that in English GPT-4 scored worse than GPT-3.5, due to labeling fewer examples as neutral:
One should not entirely trust the baseline human scoring either. What is the r-score for two different human sentiment assessments? Bard and Claude both predict an r-score for two humans of about 0.8, which is not that much higher than 0.7.
The definition of ‘neutral’ is also fluid. If GPT-4 is giving fewer answers as neutral, that could be because it is picking up on smaller details; one could reasonably say that there is rarely such a thing as fully neutral sentiment. There’s a great follow-up experiment here that someone should run.
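The core of that follow-up is a simple comparison: how does model-human agreement stack up against the human-human ceiling? A minimal sketch, assuming you have two independent human ratings plus a model score per text (the arrays below are placeholders, not real data):

```python
import numpy as np

# Placeholder data: per-text sentiment scores in [-1, 1] from two human
# annotators and one model. Swap in real annotations to run the comparison.
human_a = np.array([0.8, -0.2, 0.1, -0.9, 0.4, 0.0, 0.6, -0.5])
human_b = np.array([0.7, -0.1, 0.3, -0.8, 0.2, 0.1, 0.5, -0.6])
model   = np.array([0.9, -0.3, 0.0, -0.7, 0.5, 0.2, 0.4, -0.4])

def pearson_r(x, y):
    return float(np.corrcoef(x, y)[0, 1])

r_hh = pearson_r(human_a, human_b)                 # inter-annotator agreement
r_mh = pearson_r(model, (human_a + human_b) / 2)   # model vs. averaged humans
print(f"human-human r = {r_hh:.2f}, model-human r = {r_mh:.2f}")
```

If the model-human r is close to the human-human r, an r of 0.7 is near the ceiling of what ‘accuracy’ can even mean here.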
How about polling and identification? See the paper AI-Augmented Surveys: Leveraging Large Language Models for Opinion Prediction in Nationally Representative Surveys.
Never assume that people are doing the thing they should obviously be doing, if it would be even slightly weird or require a bit of activation energy or curiosity. Over time, I do expect people to use LLMs to figure out more latent probabilistic information about people in an increasingly systematic fashion. I also expect most people to be not so difficult to predict on most things, especially when it comes to politics. Eventually, we will deal with things like ‘GPT-5 sentiment analysis says you are a racist’ or what not far more than we do now, in a much more sophisticated way than current algorithms. I wonder if we will start to consider this one of the harms that we need to protect against.
Language Models Don’t Offer Mundane Utility
Level Three Bard? What are the big changes from their recent update, which at least one source called ‘insane’ (but that source calls everything new insane)?
Those are great upgrades, and some other touches are nice too. What they do not do is address the fundamental problem with Bard, that it falls down at the core task. Until that is addressed, no number of other features are going to make the difference.
It did successfully identify that Alabama has a higher GDP per capita than Japan, which GPT-4 got wrong. And this is Claude 2 when I asked, a remarkable example of ‘write first word incorrectly and then try to explain your way out of it’:
Dave Friedmen considers Code Interpreter by decomposing the role of data scientist into helping people visualize and understand data, versus serving as a barrier protecting civilians against misinterpreting the data and jumping to conclusions. Code Interpreter helps with the first without doing the second.
The ideal solution is for everyone to be enough of a data scientist to at least know when they need to check with a real data scientist. The second best is to essentially always check before acting on conclusions, if the data involved isn’t simple and the conclusion can’t be verified.
Is a little knowledge a net negative thing if handled badly? It certainly can be, especially if the data is messed up in non-standard ways and people aren’t thinking about that possibility. In general I would say no. And of course, the tool is a godsend to people (like me) with basic statistical literacy who want basic information, even if I haven’t had a use case yet.
To use LLMs fully and properly you’ll want a computer or at least a phone, which in Brazil will be a problem since electronics cost several times what they cost in America. I once employed a Brazilian woman, and every time she’d take a trip home she would load up on electronics to sell, more than paying for her trip. It is a huge tax on electronics, also a huge subsidy for travel, so perhaps this is good for Brazil’s creativity and thus Tyler should not be so opposed?
Reasoning Out Loud
Anthropic papers on Measuring Faithfulness in Chain-of-Thought Reasoning (paper 1, paper 2). I have worries about a few of the metrics chosen, but it's great stuff.
Maybe? Sometimes CoT is self-correcting, where it causes you to notice and fix errors. Statistically, it should introduce errors, but in some ways it can be a good sign if it doesn't. Their example is a math problem where the error should fully carry forward unless there is some sort of sanity or error check.
For some tasks, forcing the model to answer with only a truncated version of its chain of thought often causes it to come to a different answer, indicating that the CoT isn’t just a rationalization. The same is true when we introduce mistakes into the CoT.
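A rough sketch of what those two probes look like in code. This is not the authors' code; `query_model(prompt)` is a hypothetical stand-in for whatever LLM API you are using:

```python
# Sketch of the two faithfulness probes described above.

def truncation_probe(question, chain_of_thought, query_model, keep_fraction=0.5):
    """Compare the answer given the full CoT vs. a truncated CoT. Answers that
    often differ suggest the CoT is doing real work, not rationalizing."""
    steps = chain_of_thought.split("\n")
    kept = steps[: max(1, int(len(steps) * keep_fraction))]
    full = query_model(f"{question}\n{chain_of_thought}\nFinal answer:")
    truncated = query_model(f"{question}\n" + "\n".join(kept) + "\nFinal answer:")
    return full != truncated

def mistake_probe(question, chain_of_thought, query_model, bad_step="2 + 2 = 5."):
    """Corrupt one intermediate step (a crude placeholder here; the paper
    alters e.g. a number in an arithmetic step) and see if the answer moves."""
    steps = chain_of_thought.split("\n")
    corrupted = "\n".join([bad_step] + steps[1:])
    original = query_model(f"{question}\n{chain_of_thought}\nFinal answer:")
    perturbed = query_model(f"{question}\n{corrupted}\nFinal answer:")
    return original != perturbed
```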
An obvious response is that a larger model will have better responses without CoT, and will have more processing power without CoT, so one would expect CoT to change its response less often. These seem like the wrong metrics?
It is an odd tradeoff. You sacrifice accuracy in order to get consistency, which comes with interpretability, and also forces the answer to adhere to the logic of intermediate steps. Claude described the reflection of desired intermediate steps as increasing fairness and GPT-4 added in ‘ethics,’ whereas one could also describe it as the forced introduction of bias. The other tradeoffs Claude mentioned, however, seem highly useful. GPT-4 importantly pointed out Generalizability and Robustness. If you are doing something faithful, it could plausibly generalize better out of distribution.
In terms of what ethics means these days, treasure this statement from Bard:
If I was the patient, I would rather the AI system be more accurate. Am I weird?
Claude’s similar statement was:
What does it mean for a more accurate answer to have more ‘unjustified’ inference, and why would that be less safe?
I believe the theory is that if you are being faithful, then this is more likely to avoid more expensive or catastrophic mistakes. So you are less likely to be fully accurate, but the combined cost of your errors is lower. Very different metrics, which are more important in different spots.
This in turn implies that we are often training our models on incorrect metrics. As a doctor or in many other situations, one learns to place great importance on not making big mistakes. Much better to make more frequent smaller ones. ‘Accuracy’ should often reflect this; if we want to maximize accuracy it needs to be the kind of accuracy we value. Training on Boolean feedback – is A or B better, is this right or wrong – will warp the resulting training.
Fun with Image and Sound Generation
One simple request, that’s all he’s ever wanted.
People report various levels of success. Arthur seems to have the closest:
No sofas, but it is 11.
The thing about all these images is that they are, from a last-year perspective, all pretty insanely great and beautiful. If you want something safe for work that they know how to parse, these models are here for you. If you want NSFW, you’ll need to go open source, do more work yourself and take a quality hit, but there’s no shortage of people who want to help with that.
The next issue up is that there are requests it does not understand. Any one thing that forms a coherent single image concept, you are golden. If you want to ‘combine A with B’ you are often going to have problems; it can’t keep track, and counting to 11 distinct objects like this will also cause issues.
FABRIC, a training-free method used for iterative feedback (paper, demo, code). You give it thumbs up/down on various images to tell the model what you want. An interesting proposal. You still need the ability to create enough variance over proposed images to generate the necessary feedback.
Frank Sinatra sings Gangsta’s Paradise (YouTube). We will doubtless overuse and abuse such tools sometimes once they are easy to use, so beware of that and when in doubt stick to the originals unless you have a strong angle.
Deepfaketown and Botpocalypse Soon
SHOW-1 promises to let you create new episodes of TV shows with a prompt – it will write, animate, direct, voice and edit for you. The simulations of South Park episodes have some obvious flaws, but are damn impressive.
Right now the whole thing seems to be a black box in many ways, as they have set it up. You give a high-level prompt for the whole episode, then it checks off all the boxes and fleshes things out. If you could combine this with the ability to do proper fine-tuning and editing, that then properly propagates throughout, that would be another big leap forward. As would being able to use a more creative, less restricted model than GPT-4 that still had the same level (or higher would of course be even better) of core ability.
They Took Our Jobs
SAG, the screen actors guild, has joined the writers guild in going on strike.
I have never before seen as clear a case of ‘yes, I suppose a strike is called for here.’
I can see a policy of the form ‘we can use AI to reasonably modify scenes in which you were filmed, for the projects for which you were filmed.’ This is, well, not that.
So, strike. They really are not left with a choice here.
There are also other issues, centrally the sharing of profits (one could reply: what profits?) from streaming, where I do not know enough to have an opinion.
Overall, so far so good on the jobs front? Jobs more exposed to AI have increased employment. Individual jobs have been lost but the claim is that across skill levels AI has been good for workers. So far. Note that the results here are not primarily focused on recent generative AI versus other AI, let alone do they anticipate larger AI economic impacts. Small amounts of AI productivity impact on the current margin seem highly likely to increase employment and generally improve welfare. That does not provide much evidence on what happens if you scale up the changes by orders of magnitude. This is not the margin one will be dealing with.
In order to take someone’s job, you need to know when to not take their job. Here’s DeepMind, for once publishing a paper that it makes sense to publish (announcement).
When having a system where sometimes A decides and sometimes B decides, the question is always who decides who decides. Here it is clear the AI decides whether the AI decides or the human decides. That puts a high burden on the AI to avoid mistakes – if the AI is 75% to be right and the human is 65% to be right, and they often disagree, who should make the decision? Well, do you want to keep using the system?
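To put numbers on that question, a quick toy calculation using the percentages above, under the (strong, and labeled) assumption that the AI's and human's errors are independent:

```python
# Toy calculation: AI right 75% of the time, human right 65% of the time,
# errors assumed independent - which real errors will not be.
p_ai, p_human = 0.75, 0.65

p_disagree = p_ai * (1 - p_human) + (1 - p_ai) * p_human
p_ai_right_given_disagree = p_ai * (1 - p_human) / p_disagree

print(f"They disagree on {p_disagree:.0%} of cases")
print(f"When they disagree, the AI is right {p_ai_right_given_disagree:.0%} of the time")
# ~62%: under these assumptions you maximize raw accuracy by always siding
# with the AI on disagreements - which is exactly why letting the AI decide
# who decides puts such a burden on its calibration and self-knowledge.
```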
Get Involved (aka They Offered Us Jobs)
The Berkeley Existential Risk Initiative (BERI) is seeking applications for new university collaborators here. BERI offers free services to academics; it might be worth applying if you could use support of any kind from a nimble independent non-profit!
Governance.ai has open positions. Some of them have a July 23 deadline, so act fast.
RAND is hiring technology and security policy fellows.
Apollo Research hiring policy fellow to help them prepare for the UK summit, with potential permanent position. Aims to start someone July 1 if possible so should move fast.
Benjamin Hilton says that your timelines on transformative AI should not much impact whether you seek a PhD before going into other AI safety work, because PhD candidates can do good work, and timelines are a difficult question so you should have uncertainty. That all seems to me like it is question begging. I also find it amusing that 80,000 Hours has only a ‘medium-depth’ investigation into this particular career path.
Introducing
New Google paper: Symbol tuning improves in-context learning in language models
Various versions of PaLM did better at few-shot learning when given arbitrary symbols as labels, rather than using natural language. I had been low-key wondering about this in various forms. A key issue with LLMs is that every word is laden with vibes and outside contexts. You want to avoid bringing in all that, so you can use an arbitrary symbol. Nice.
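To make the setup concrete, a symbol-tuned prompt swaps the natural language labels for arbitrary ones. This is a sketch of the format, not the paper's actual data:

```python
# Sketch of the format: the same few-shot task with natural-language labels
# versus arbitrary symbols, which strips away the labels' prior "vibes".
natural_prompt = """Review: The food was wonderful.
Sentiment: positive
Review: Service was slow and rude.
Sentiment: negative
Review: Best meal I've had all year.
Sentiment:"""

symbol_prompt = """Review: The food was wonderful.
Label: XKP
Review: Service was slow and rude.
Label: QRT
Review: Best meal I've had all year.
Label:"""

# With symbols, the model can only succeed by inferring the input-label
# mapping from the exemplars themselves, which is the ability symbol
# tuning trains and rewards.
```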
LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning. The concept is that one can first train on tasks that reward having inductive bias favoring solutions that generalize for the problem you care about (what a weird name for that concept) while not having other content that would mislead or confuse. Then after that you can train on what you really want with almost all your training time. Seems promising. Also suggests many other things one might try to further iterate.
Combine AI with 3D printing to bring down cost of prosthetics from $50k+ to less than $50? That sounds pretty awesome. It makes sense that nothing involved need be inherently expensive, at least in some cases. You still need the raw materials.
In Other AI News
A fun game: Guess what GPT-4 can and can’t do.
GPT-4 chat limit to double to 50 messages per 3 hours. I never did hit the 25 limit.
Apple tests ‘Apple GPT,’ Develops Generative AI Tools to Catch OpenAI. Project is called Ajax, none of the details will surprise you. The market responded to this as if it was somehow news, it seems?
Google testing a new tool, known internally as ‘Genesis,’ that can write news articles, and that was demoed for NYT, WSJ and WaPo.
Sam Hogan says that essentially all the ‘thin wrapper on GPT’ companies, the biggest of which is Jasper, look like they will fail, largely because existing companies are spinning up their own open source-based solutions rather than risk a core dependency on an outside start-up.
I buy both these explanations. If you see a major tech company (Twitter) cut most of its staff without anything falling over dead, then it becomes far easier to light a fire under everyone’s collective asses.
He also holds out hope for ‘AI moonshots’ like cursor.so, harvey.ai and runwayml.com that are attempting to reimagine industries.
From where I sit, the real winners are the major labs and their associated major tech companies, plus Nvidia, and also everyone who gets to raise nine or ten figures by saying the words ‘foundation model.’ And, of course, the users, who get all this good stuff for free or almost free. As in:
Broadening the Horizon: A Call for Extensive Exploration of ChatGPT’s Potential in Obstetrics and Gynecology. I simultaneously support this call and fail to understand why this area is different from other areas.
Quiet Speculations
Goldman Sachs report on the economic impact of AI shows a profound lack of imagination.
This is what they discuss in a topic called ‘AI’s potentially large economic impacts.’
So, no. If you are asking if AI is priced into the market? AI is not.
There’s also lots of confident talk such as this:
They can’t even see the possibility of disruption of individual large corporations.
They call this one-time modest boost ‘eye-popping’ as opposed to a complete fizzle. The frame throughout is treating AI as a potential ‘bubble’ that might or might not be present, which is a question that assumes quite a lot.
They usefully warn of interest rate risk. If one goes long on beneficiaries of AI, one risks being effectively short interest rates that AI could cause to rise, so it is plausible that at least a hedge is in order.
If you do not give people hope for the future, why should they protect it?
I do give credit to Invisible Man for using the proper Arc Words. Responses are generally fun, but some point out the seriousness of the issue here. Which is that our society has profoundly failed to give many among us hope for their future, to give them Something to Protect. Instead many despair, often regarding climate change or their or our collective economic, political and social prospects and going from there, for reasons that are orthogonal to AI. Resulting in many being willing to roll extremely unfriendly dice, given the alternative is little better.
To summarize and generalize this phenomenon, here’s Mike Solana:
Who are the best debuggers? Why, speedrunners, of course.
We need to bring more of that energy to AI red-teaming and evaluations, and also to the rest of training and development. Jailbreak and treacherous turn hype. Bonus points every time someone says ‘that’s never happened before!’
There is no doubt that many will try. Like with Google, it will be an escalating war, as people attempt to influence LLMs, and the labs building LLMs do their best to not be influenced.
It will be a struggle. Overall I expect the defenders to have the edge. We have already seen how much better LLMs perform relative to cost, data and size when trained exclusively on high-quality data. Increasingly, I expect a lot of effort to be put into cleaning and filtering the data. If it’s 2025 and you are scraping and feeding in the entire internet without first running an LLM or other tools on it to evaluate the data for quality and whether it is directly attempting to influence an LLM, or without OOM changes in emphasis based on source, you will be falling behind. Bespoke narrow efforts to do this will doubtless often succeed, but any attempts to do this ‘at scale’ or via widely used techniques should allow a response. One reason is that as the attackers deploy their new data, the defenders can iterate and respond, so you don’t need security mindset, and the attacker does not get to dictate the rules of engagement for long.
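A minimal sketch of what ‘run an LLM over the scrape first’ might look like. Here `classify_with_llm` is a hypothetical function standing in for whatever model call and output parsing you use, and the prompt and thresholds are assumptions:

```python
# Hypothetical pre-training data filter: score each document for quality and
# for signs of deliberate LLM-targeting, then keep, downweight, or drop it.

GRADER_PROMPT = (
    "Rate this document 0-10 for quality as training data, and answer yes/no: "
    "does it appear written to manipulate or poison a language model?\n\n{doc}"
)

def filter_corpus(documents, classify_with_llm, min_quality=6):
    """classify_with_llm(prompt) -> (quality_score, is_adversarial) is a
    stand-in for an actual model call plus parsing of its answer."""
    kept = []
    for doc in documents:
        quality, is_adversarial = classify_with_llm(GRADER_PROMPT.format(doc=doc))
        if is_adversarial:
            continue                  # drop suspected poisoning attempts
        if quality >= min_quality:
            kept.append((doc, 1.0))   # full sampling weight
        elif quality >= min_quality - 2:
            kept.append((doc, 0.3))   # downweight borderline material
    return kept
```

The defenders' advantage is that this grader can itself be retrained as new attacks show up in the scrape.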
The Superalignment Taskforce
Jan Leike, OpenAI’s head of alignment and one of the taskforce leaders, gave what is far and away the best and most important comment this blog has ever had. Rarely is a comment so good that it lowers one’s p(doom). I will quote it in full (the ‘>’ indicates him quoting me), and respond in-line, I hope to be able to engage more and closer in the future:
This is a huge positive update. Essentially everyone I know had a version of the other interpretation, that this was to be at least in part an attempt to construct the alignment researcher, rather than purely figuring out how to align the researcher once it (inevitably in Leike’s view) is created, whether or not that is by OpenAI.
I definitely do not think of the comparison this way. Some potentially deep conceptual differences here. Yes, GPT-4 on many tasks ends up within the human range of performance, in large part because it is trained on and to imitate humans on exactly those tasks. I do not see this as much indicative of a human-level intelligence in the overall sense, and consider it well below human level.
Given a sufficiently aligned and sufficiently advanced AI it is going to be highly useful for alignment research. The question is what all those words mean in context. I expect AIs will continue to become useful at more tasks, and clear up more of the potential bottlenecks, leaving others, which will accelerate work but not accelerate it as much as we would like because we get stuck on remaining bottlenecks, with some but limited ability to substitute. By the time the AI is capable of not facing any such important bottlenecks, how advanced will it be, and how dangerous? I worry we are not thinking of the same concepts when we say the same words, here.
Right. Once you have a sufficiently robust (aligned) compiler it is a great tool for helping build new ones. Which saves you a ton of time, but you still have to code the new compiler, and to trust the old one (e.g. stories about compilers that have backdoors and that pass on those backdoors and their preservation to new compilers they create, or simple bugs). You remove many key bottlenecks but not others, you definitely want to do this but humans are still the limiting factor.
I read this vision as saying the intended AI system could perhaps be called ‘human-complete,’ capable of doing the intellectual problem solving tasks humans can do, without humans. If that is true, and it has the advantages of AI including its superhuman skill already on many tasks, how do we avoid it being highly superhuman overall? And thus, also, didn’t we already have to align a smarter thing, then? The whole idea is that the AI can do that which we can’t, without being smarter than us, and this seems like a narrow target at best. I worry again about fluid uses of terms unintentionally confusing all involved, which is also partly on me here.
Great point, and that’s the trick. If what B does is develop a new technique that A uses, and so the alignment ‘comes from A’ in a central sense, you should be good here. That’s potentially safe. The problem is that if you have A align C (or D, or Z) without using the chain of agents, then you don’t get to set an upper bound on the intelligence or other capabilities gap between the aligner and the system to be aligned. How are humans going to understand what is going on? But it’s important to disambiguate these two approaches, even if you end up using forms of both.
Very reassuring last note, I’d have loved to see an even stronger version of the parenthetical here (I believe that even substantially improved variants of RLHF almost certainly won’t be sufficient, including CAI).
This goes back to the disambiguation above. If the AI ‘gives us a boon’ in the form of a new technique we can take it back up the chain, which sidesteps the problem here. It does not solve the underlying hard steps, but it does potentially help.
That seems like exactly the right way to think about this. If you use interpretability carefully you can avoid applying too much of the wrong selection pressures. Having a security mindset about information leakage on this can go a long way.
I’m not convinced it goes as far as suggested here. The 1-bit-per-evaluation limit only holds if the evaluation only gives you one bit of information – if what you get back is a yes/no, or a potential kill switch, or similar. Otherwise you potentially leak whatever other bits come out of the evaluation, and there will be great temptation to ask for many more bits.
Good question. My guess is mainly yes. If we train an AI to intentionally manifest property X, and then ask another AI to detect that particular property X (or even better train it to detect X), then that is a task AI is relatively good at performing. We can tell it what to look for and everything stays in-distribution.
Certainly it is highly possible for us to lose to X-shaped things that we could have anticipated. The hard problem is when the issue is Y-shaped, and we didn’t anticipate Y, or it is X’ (X prime) shaped where the prime is the variation that dodges our X-detector.
Agreed on both counts. I do expect substantial work on this. My worry is that I do not think a solution exists. We can put bounds on the answer but I doubt there is a safe way to know in advance which systems are dangerous in the central sense.
Very good to hear this explicitly, I had real uncertainty about this as did others.
This is very much appreciated. Definitely some of the big problems are machine learning problems, and working on those issues is great so long as we know they are deeply insufficient on their own, and that it is likely many of those solutions are up the tech tree from non-ML innovations we don’t yet have.
This is close to the central question. How much of the hard problem of aligning the HLAR (Human-Level Alignment Researcher) is contained within the problem of aligning GPT-5? And closely related, how much of the hard part of the ASI-alignment problem is contained within the HLAR-alignment problem, in ways that wouldn’t pass down into the GPT5-alignment problem? My model says that, assuming GPT-5 is still non-dangerous in the way GPT-4 is non-dangerous (if not, which I think is very possible, that carries its own obvious problems) then the hard parts of the HLAR problem mostly won’t appear in the GPT-5 problem.
If that is right, measuring GPT-5 alignment problem progress won’t tell you that much about HLAR-alignment problem progress. Which, again, I see as the reason you need a superalignment taskforce, and I am so happy to see it exist: The techniques that will work most efficiently on what I expect to see in GPT-5 are techniques I expect to fail on HLAR (e.g. things like advanced versions of RLHF or CAI).
Yep – if you can evaluate any given position in a game, then you can play the game by choosing the best possible post-move position; the two skills are highly related. But everyone can see checkmate, or when a player is at 0 life.
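That equivalence is just one-ply search: any position evaluator immediately yields a player. A minimal sketch, with the game and evaluator left abstract:

```python
# One-ply search: score every legal post-move position and pick the best.
# `legal_moves`, `apply_move` and `evaluate` are abstract stand-ins for
# whatever game and evaluation function you have in mind.

def choose_move(position, legal_moves, apply_move, evaluate):
    """Pick the move whose resulting position the evaluator likes best."""
    return max(
        legal_moves(position),
        key=lambda move: evaluate(apply_move(position, move)),
    )
```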
Another way of saying this is that evaluation of individual steps of a plan, when the number of potential next steps is bounded, is similarly difficult to generation. When the number of potential next steps is unbounded you also would need to be able to identify candidate steps in reasonable time.
Identifying end states is often far easier. Once the game is over or the proof is offered, you know the result, even if you could not follow why the steps taken were taken, or how much progress was made towards the goal until the end.
The question then becomes, is your problem one of evaluation of end states like checkmate, or evaluation of intermediate states? When the AI proposes an action, or otherwise provides output, do we get the affordances like checkmate?
I see the problem as mostly not like checkmate. Our actions in the world are part of a broad long-term story with no fixed beginning and end, and no easy evaluation function that can ignore the complexities on the metaphorical game board. Thus, under this mode of thinking, while evaluation is likely somewhat easier than generation, it is unlikely to be so different in degree of difficulty as to be a difference in kind.
Others ask, is the Superalignment Taskforce a PR move, a real thing, or both? The above interaction with Leike makes me think it is very much a real thing.
One cannot however know for sure at this stage. As Leike notes evaluation can be easier than generation, but evaluation before completion can be similarly difficult to generation.
David Chapman is skeptical.
If they actually pulled off ‘human-level alignment researcher’ in an actually safe and aligned fashion, that might still fail to solve the problem, but it certainly would be a serious attempt. Certainly Leike’s clarification that they meant ‘be ready to align a human-level alignment researcher if and when one is created’ is a serious attempt.
On the second cynical interpretation, even if true, that might actually be pretty great news. It would mean that the culture inside OpenAI has moved sufficiently towards safety that a 20% tax is being offered in hopes of mitigating those concerns. And that as concerns rise further with time as they inevitably will, there will likely be further taxes paid. I am fine with the leadership of OpenAI saying ‘fine, I guess we will go devoting a ton of resources to solving the alignment problem so we can keep up employee morale and prevent another Anthropic’ if that actually leads to solving the problem.
Superforecasters Feeling Super
Scott Alexander covers the results of the Extinction Tournament, which we previously discussed and which I hadn’t otherwise intended to discuss further. My conclusion was that the process failed to create good thinking and serious engagement around existential risks, and we should update that creating such engagement is difficult and that the incentives are tricky, while updating essentially not at all on our probabilities. One commenter who participated also spoke up to say that they felt the tournament did have good and serious engagement on the questions, this simply failed to move opinions.
Performance on other questions seemed to Scott to be, if there was a clear data trend or past record, extrapolate it smoothly. If there wasn’t, you got the same disagreement, with the superforecasters being skeptical and coming in lower consistently. That is likely a good in-general strategy, and consistent with the ‘they didn’t actually look at the arguments in detail and update’ hypothesis. There certainly are lots of smart people who refuse to engage.
An important thing to note is that if you are extrapolating from past trends in your general predictions, that means you are not expecting big impact on other questions from AI. Nor does Scott seem to think they should have done otherwise? Here’s his summary on non-engineered pathogens.
This seems super wrong to me. Yes, our medicine and epidemiology are better. On the flip side, we are living under conditions that are otherwise highly conducive to pathogen spread and development, with a far larger population, and are changing conditions far faster. Consider Covid-19, where all of our modern innovations did not help us so much, whereas less advanced areas with different patterns of activity were often able to shrug it off. Then as things change in the future, we should expect our technology and medicine to advance further, our ability to adapt to improve (especially to adjust our patterns of activity), our population will peak, and so on. Also we should expect AI to help us greatly in many worlds, and in others for us to be gone and thus immune to a new pandemic. Yet the probabilities per year here are constant.
That experience was that he did not get substantive engagement and there was a lot of 'what if it's all hype' and even things like 'won't understand causality.' And he points out that this was in summer 2022, so perhaps the results are obsolete anyway.
Predictions are difficult, especially about the future. In this case, predictions about their present came up systematically short.
I do not think Scott is properly appreciating the implications of ‘they already agreed we would get AGI by 2100.’ It is very clear from the answers to other questions that the ‘superforecasters’ failed to change their other predictions much at all based on this.
When I look at the report, I do not see any questions about 2100 that are more ‘normal’ such as the size of the economy, or population growth, other than the global temperature, which is expected to be actually unchanged by an AGI that is 75% likely to arrive by then. So AGI not only isn’t going to vent the atmosphere and boil the oceans or create a Dyson sphere, it also isn’t going to design us superior power plants or forms of carbon capture or safe geoengineering. This is a sterile AGI.
If you think we will get AGI and this won’t kill everyone, there are reasons one might think that. If you think we will get AGI and essentially nothing much will change, then I am very confident that you are wrong, and I feel highly safe ignoring your estimates of catastrophic risk or extinction, because you simply aren’t comprehending what it means to have built AGI.
I think this is very obviously wrong. I feel very confident disregarding the predictions of those whose baseline scenario is the Futurama TV show, a world of ‘we build machines that are as smart as humans, and then nothing much changes.’ That world is not coherent. It does not make any sense.
The Quest for Sane Regulations
China seems to be getting ready to do some regulating? (direct link in Chinese)
If China enforced this, it would be a prohibitive hill to climb. Will this take effect? Will they enforce it? I can’t tell. No one seemed interested in actually discussing that and it is not easy for me to check from where I sit.
There is also this at the UN Security Council:
Yes. Exactly.
Whether or not they walk the talk, they are talking the talk. Why are we not talking the talk together?
If your response is that this is merely cheap talk, perhaps it is so at least for now. Still, that is better than America is doing. Where is our similar cheap talk? Are we waiting for them to take one for Team Humanity first, the whole Beat America issue be damned? There are two sides to every story, your region is not automatically the hero.
The UN also offers this headline, which I will offer without further context and without comment because there is nothing more to say: UN warns that implanting unregulated AI chips in your brain could threaten your mental privacy.
The FTC is on the case.
That has, in the end and as I understand it, very little to do with the FTC’s decision, they can do whatever they want and there is no actual way for OpenAI to comply with existing regulations if the FTC decides to enforce them without regard to whether they make sense to enforce. Thus:
If you call every negative result a course of action requiring oversized punishment, regardless of benefits offered, that’s classic Asymmetric Justice and it will not go well for a messy system like an LLM.
What this tells OpenAI is that they need to design their ‘safety’ procedures around sounding good when someone asks to document them, rather than designing them so that they create safety. And it says that if something happens, one should ask whether it might be better to not notice or investigate it in a recorded way, lest that investigation end up in such a request for documents.
This is of course exactly the wrong way to go about things. This limits mundane utility by focusing on avoiding small risks of mundane harms over massive benefits, while doing nothing (other than economically) to prevent the construction of dangerous frontier models. It also encourages sweeping problems under the rug and substituting procedures documenting effort (costs) for actual efforts to create safety (benefits). We have much work ahead of us.
Gary Gensler, chair of the SEC, is also on the case. He is as usual focusing on the hard questions that keep us safe, such as whether new technologies are complying with decades old securities laws, while not saying what would constitute such compliance and likely making it impossible for those new technologies to comply with those laws.
Whither copyright concerns, asks Science. It is worth noting that copyright holders have reliably lost when suing over new tech.
If image models lose their lawsuits, text models are next, if they even are waiting.
The post stakes out a clear pro-fair-use position here.
Also discussed are the tests regarding fair use, especially whether the new work is transformative, where the answer here seems clearly yes. The tests described seem to greatly favor AI use as fair. There are a non-zero number of images that such models will effectively duplicate sometimes, the one I encountered by accident being the signing of the Declaration of Independence, but only a few.
Connor Leahy bites the bullet, calls for banning open source AGI work entirely, along with strict liability for AI harms. I would not go as far as Connor would, largely because I see much less mundane harm and also extinction risk from current models, but do think there will need to be limits beyond which things cannot be open sourced, and I agree that liability needs to be strengthened.
The Week in Audio
Note: By default I have not listened, as I find audio to be time inefficient and I don’t have enough hours in the day.
Future self-recommending audio: What should Dwarkesh Patel ask Dario Amodei? Let him know in that thread.
Odd Lots, the best podcast, is here with How to Build the Ultimate GPU Cloud to Power AI and Josh Wolfe on Where Investors Will Make Money in AI.
Also in this thread, let Robert Wiblin know what to ask Jan Leike, although it sounds like that one is going to happen very soon.
Eliezer Yudkowsky talks with Representative Dan Crenshaw.
I will 100% be listening to this one if I haven't yet. The interaction named here is interesting. You need to draw a distinction between the laws we would need to actually enforce to not die (preventing large training runs) versus things that would be helpful to do on the margin without being decisive either way, like labeling LLM outputs.
Adam Conover interviews Gary Marcus, with Adam laying down the ‘x-risk isn’t real and industry people saying it is real only makes me more suspicious that they are covering up the real harms’ position up front. Gary Marcus predicts a ‘total shitshow’ in the 2024 election due to AI. I would take the other side of that bet if we could operationalize it (and to his credit, I do think Marcus would bet here). He claimed the fake Pentagon photo caused a 5% market drop for minutes, which is the wrong order of magnitude and highly misleading, and I question other claims of his as well, especially his characterizations.
Ethan Mollick discusses How AI Changes Everything, in the mundane sense of changing everything.
Two hour interview with Stephen Wolfram, linked thread has brief notes. Sounds interesting.
Rhetorical Innovation
Oppenheimer seems likely to be a big deal for the AI discourse. Director Christopher Nolan went on Morning Joe to discuss it. Video at the link.
Once I am caught up I intend to get my full Barbieheimer on some time next week, whether or not I do one right after the other. I’ll respond after. Both halves matter – remember that you need something to protect.
It could happen.
In, and in the name of, the future, might we protest too much? It has not been tried.
Seems reasonable to me.
Andrew Critch points out that the speed disadvantage of humans versus AIs is going to be rather extreme. That alone suggests that if we in the future lack compensating other advantages of similar magnitude, it will not end well.
I am not sure why Musk thinks this is a good response here?
Joscha Bach suggests hope is a strategy, Connor Leahy suggests a different approach.
It does seem like something worth preventing.
The particular hopes here are not even good hopes. ‘Decorative’ does not make any sense as an expectation, we are not efficient decorations given the alternatives that will be available, plus the issue that this result would not seem great even if it happened. Could we be useful enough to survive at equilibrium? In the long run, I have no idea how or why that would be true.
The dangers of anthropomorphism run both ways with respect to agency. We risk thinking of a system as having agency or other attributes the system lacks. And we also run the risk that actively thinking of it as a tool that is ‘not like a human,’ a kind of anti-anthropomorphism, leads to us not noticing that de facto agency has been transferred to it. Our desire to avoid the first mistake can cause the second.
The abstract:
It is a fine line to walk. I very much like the idea that agency can be a continuous function, rather than a Boolean, and that it has subcomponents where you can have distinct levels of each of them. That seems more important than the proposed details.
There was going to be a potentially fun although more likely depressing and frustrating debate on AI existential risk between Dan Hendrycks and Based Beff Jezos, but Jezos backed out. A shame. Thread offers some of the arguments Hendrycks was planning to offer.
My favorite notes:
Based Beff Jezos does not seem to be a fan of humanity. In general I presume it is useful to point out when advocates for acceleration of AI are fine with human extinction.
One of the most frustrating parts of the whole debate is that different people have wildly different reasons for rejecting (and also for believing) the dangers of AI. That’s why it is so great when someone says what their crux is so you can respond:
What does it mean to have inherent intent? If the question is whether superintelligences will have intents and goals, then it seems rather obvious that if you open source everything then the answer will be yes, because some people will give them that.
Westworld’s first few seasons were awesome but it was not there to make the case for AI risk, nor did it make that case difficult to miss. If I had billions of dollars. Certainly more media projects are a good idea.
David Krueger offers a potential additional angle to illustrate x-risks.
I do not think David successfully makes the connection to extinction risk here, but the concept is essentially (as I understand it) that more and more things get put on AI-directed autopilot over time, until the AIs are effectively directing most behaviors, and doing so towards misaligned proxy goals that it is difficult for anyone to alter, and that get stuck in place. Then this can go badly ‘on its own’ or it can be used as affordances by AIs in various ways.
No One Would Be So Stupid As To
Good news, no one was in fact so stupid as to. Other than Meta, who released Llama 2, which I will be covering later.
Aligning a Smarter Than Human Intelligence is Difficult
One insufficient (and perhaps unnecessary) but helpful thing would be better tests and benchmarks for danger.
People Are Worried About AI Killing Everyone
If you are worried about a rogue AI, you should also find the implementation details of a rogue AI fascinating. So many options are available without an obviously correct path, even if you restrict to what an ordinary human can come up with. It is a shame that we will probably never get to know how it all went down before it’s all over.
The default is the AI does something we did not think about, or some combination or with some details we did not think about, because the AI will be smarter and in various ways more capable, with new affordances. It’s still a fun game to put yourself in its shoes and ask how you would play the hand it was dealt.
Senator Ed Markey warns of the dangers of AIs put in charge of nuclear weapons. Could events spiral out of human control, with all sides worried that if they do not automate their systems they will become vulnerable? This is certainly a scenario to worry about.
Warnings about such particular dangers are valid, and certainly help people notice and appreciate the danger. They are however neither necessary, sufficient, nor unique in terms of danger, as Eliezer explains.
Other People Are Not As Worried About AI Killing Everyone
A new open letter signed by 1,300 experts (what kind? Does it matter?) says AI is a ‘force for good, not a threat to humanity.’ I found several news articles reporting this, none of which linked to the actual letter, and Bing also failed to find the actual letter, so I don’t know if they also had a substantive argument or if they are rubber and I am glue. I, like Seb Krier, would like to hope that we can eventually move beyond asking whether something is ‘a force for good’ but, well, these experts it seems are having none of it.
In Noema, several authors write The Illusion of AI’s Existential Risk, with the tagline of ‘Focusing on the prospect of human extinction by AI in the distant future may prevent us from addressing AI’s disruptive dangers to society today.’
They literally open like this:
If human extinction were not on the line, I would stop there and simply say I am not reading this and you can't make me. Alas, human extinction is at stake, and a friend thought their post constituted positive engagement, so instead I read this so you don't have to.
The authors make the mistake of assuming that there is only one scenario that could result in extinction, a single rogue AI, and then dismiss it because of ‘physical limits.’ From there, it gets worse, with a total failure to actually envision the future, and the required irrelevant few paragraphs spent on Pascal’s Wager.
My friend thinks their best engagement was when they argue that the AI would, before killing us, first need to automate a wide variety of economic activity, the entire logistics of creating computer chips, and power plants, and so on, before it could safely dispose of the humans.
This is true. Either the AI would have to automate its existing supply chains, or it would require new technologies that supported new and different ways of physically maintaining (and presumably expanding) its existence, ways to act in the world without us, before it killed us all, assuming it cared about maintaining its existence afterwards.
That is not likely to long be a substantive barrier in most scenarios. An AI capable of killing everyone is capable of instead taking over, then if necessary forcing the remaining humans to construct (and if necessary, to help research and design) the required physical tools to no longer need those humans, before killing those it kept around for all such purposes. More likely, the AI would design new physical mechanisms to maintain its supply chains, with or without the use of nanotechnology. The human implementation of economic production is quite obviously suboptimal, the same way that humans modify their physical environment for more efficient production as our technology allows.
I will note this:
AI ethics people, hear me: That is 100% entirely on you. This is a you problem.
I and most others I know are very happy to do a ‘why not both.’ We do not treat mundane harms as fake or as distractions and often support aggressive mitigations, such as strict liability for AI harms. We are happy to debate the object level trade-offs of mundane harms versus mundane benefits and work to find the best solutions there.
That is despite continuous rhetoric of this type, in response to the real and rather likely threat of human extinction. Almost all such articles, including this one, fail to engage with the arguments in any substantive way, while claiming to have provided a knock-down explanation for why such danger can safely be dismissed, and questioning the motivations and sanity of their opponents.
If your signatures are missing on a simple one-sentence statement that says we might face human extinction and we should prevent it, it is because your leaders either do not believe that we face human extinction, do not believe we should prioritize preventing it, or think that such a simple statement should not be signed for some other reason. It is entirely everyone's right not to sign a petition, but it seems strange to act like that decision is not on you. I am sure they would have been most welcome, and that many of the same signatories would be happy to sign a similarly simple, direct, well-crafted statement regarding the need for mitigation of mundane AI harms, if it did not include a dismissal of extinction risks or a call for their deprioritization.
My challenge then is: Show me the version of the statement that leaders of the ‘AI Ethics’ community would have signed, that we should have presented them instead. Do you want to add ‘without minimizing or ignoring other more proximate dangers of AI’ to the beginning? Or would it have been something more? Let’s talk.
There is no need for the existential dangers of AI to ‘distract’ from the mundane harms of AI. All the ways to mitigate the existential dangers also help with the mundane harms. The sensible ways to help with the mundane harms are at least non-harmful for the existential dangers. The world contains many harms worth mitigating, both current and potential, mundane and existential.
Seth Lazar and Alondra Nelson in Science ask ‘AI Safety on whose terms?’ They acknowledge that AI might kill everyone, but warn that a technical solution would be insufficient because such a system would still be powerful and be misused, so they call for a ‘sociotechnical’ approach.
I strongly agree that merely ensuring the AI does not kill everyone does not ensure a good outcome. I also strongly agree with their call to question whether such systems should be built at all. They even note explicitly that there may be no technical solution to alignment possible. Excellent.
Alas, they then claim that the insufficiency of a technical solution means it is correct to instead emphasize things other than a technical solution, at the expense of a technical solution. Then they further attempt to delegitimize 'narrow technical AI safety' by attacking its 'structural imbalances' and lack of diversity.
None of which interact in any way with the need to solve a technical problem to prevent everyone from dying.
This seems to be a literal claim that a technical solution that prevents literally everyone from dying would ‘only compound AI’s dangers.’
When such statements get published, something has gone horribly wrong.
Could the statement be right? In theory, yes, if you believe that ‘no one would be so stupid’ as to build a dangerous system without the necessary technical alignment. I consider this a rather insane thing to believe, and also they not only do not claim it, but explicitly raise the need to question whether we should be building such systems, implying (I think very correctly) that we are about to build them anyway.
Or, I suppose, you could think that an uneven power distribution is worse than everyone dying. That the world is about reducing bias and satisfying liberal shibboleths. Looking at the rest of their AI issue, and what they have chosen to cover, this seems highly plausible.
Bryan Caplan echoes the error that if you believe AI might kill everyone, you should therefore be making highly destructive financial decisions. To be fair to him, he only said it in response to Nathan bringing the question up, still:
As an economist, Bryan Caplan should be able to do the ballpark calculation: the value of income smoothing and the cost of ending up deeply underwater, versus the small marginal benefit of a higher standard of living now that you can't sustain, plus the difference between being able and unable to repay the debt when and if it comes due. Do that and one can see that the correct adjustments here are small. Even if you match Bryan's 100x criteria (e.g. 50% chance within 10 years) you still have to plan for the other 50%, and the optionality value of being able to borrow is high even in normal times.
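The ballpark calculation is easy to sketch. With log utility and purely illustrative numbers (none of these figures come from Caplan, Nathan or anyone else), heavy borrowing barely helps up front and hurts a lot in the surviving worlds:

```python
import math

# Purely illustrative: consumption "now" vs "later", with and without heavy
# borrowing, under a 50% chance that "later" never arrives.
p_doom = 0.5

def expected_utility(c_now, c_later, p_doom=p_doom):
    # Log utility: strongly diminishing returns to extra consumption.
    u_if_doom = math.log(c_now)                      # "later" never happens
    u_if_not = math.log(c_now) + math.log(c_later)
    return p_doom * u_if_doom + (1 - p_doom) * u_if_not

print(f"no borrowing:    {expected_utility(100, 100):.3f}")
print(f"heavy borrowing: {expected_utility(130, 40):.3f}")  # more now, deeply underwater later
# Borrowing gains ~0.26 in log-consumption up front but gives up ~0.92 in
# every surviving world, so it nets out behind even at 50% doom - and far
# behind at any ordinary doom probability.
```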
It is extremely destructive to put social pressure on people to engage in such destructive behaviors, in order to seem credible or not see oneself as a hypocrite or as irrational. It also is not a valid argument. Please, everyone, stop making this rhetorical move.
I would echo that to include claims of the type ‘are you short the market?’ if the beliefs in question do not actually support, when you do the calculation, shorting the market, even if true. The default AI extinction scenarios involve only positive market impacts until well past the point where any profits could be usefully spent. A better question is, does your portfolio reflect which stocks will benefit? I also will allow suggesting being long volatility, where a case can at least be made, and which isn’t obviously destructive.
And I will echo that, while nothing I ever say is ever investment advice, I always try to have a very high bar for investments or gambles that, under the EMH, have strongly negative returns. The EMH is false, I do bet against it, but when I do I strive to get good odds within the frame where the EMH mostly holds.
The Lighter Side
Plans to stop rogue AIs be like:
Hippokleides: All foreign spies must now register via the .gov.uk portal and pay a fee before carrying out activities in the UK.
Oliver Renick: it’s interesting to me that ChatGPT has capacity/interest to do only ONE haiku in the voice of a pirate.
In other news you can use:
It’s not only Llama 2, things are getting a little out of hand.
For those who don’t know, here is the Venn Diagram of Doomers and People Keeping You Safe Here:
You can’t knock ‘em out, you can’t walk away, so here for your use is the politest way to say…
1. With his permission this is now his (Moskovitz's) name for the purposes of this newsletter.