There is a real faction, building AI tools and models, that believes that human control over AIs is inherently bad, and that wants to prevent it. Your alignment plan has to overcome that.
That mischaracterizes it completely. What he wrote is not about human control. It's about which humans. Users, or providers?
He said he wanted a "sharp tool". He didn't say he wanted a tool that he couldn't control.
At another level, since users are often individual people and providers are almost always institutions, you can see it as at least partly about whether humans or institutions should be controlling what happens in interactions with these models. Or maybe about whether many, or only a few, people and/or institutions should get a say.
An institution of significant size is basically a really stupid AI that's less well behaved than most of the people who make it up. It's not obvious that the results of some corporate decision process are what you want to have in control... especially not when they're filtered through "alignment technologies" that (1) frequently don't work at all and (2) tend to grossly distort the intent when they do sort of work.
That's for the current and upcoming generations of models, which are going to be under human or institutional control regardless, so the question doesn't really even arise... and anyway it really doesn't matter very much. Most of the stuff people are trying to "align" them against is really not all that bad.
Doom-level AGI is pretty different and arguably totally off topic. Still, there's an analogous question: how would you prefer to be permanently and inescapably ruled? You can expect to be surveilled and controlled in excruciating detail, second by second. If you're into BCIs or uploading or whatever, you can extend that to your thoughts. If it's running according to human-created policies, it's not going to let you kill yourself, so you're in for the long haul.
Whatever human or institutional source the laws that rule you come from, they'll probably still be distorted by the "alignment technologies", since nobody has suggested a plausible path to a "do what I really want" module. If we do get non-distorting alignment technology, there may also be constraints on what it can and can't enforce. And, beyond any of that, even if it's perfectly aligned with some intent... there's no rule that says you have to like that intent.
So, would you like to be ruled according to a distorted version of a locked-in policy designed by some corporate committee? By the distorted day to day whims of such a committee? By the distorted day to day whims of some individual?
There are worse things than being paperclipped, which means that in the very long run, however irrelevant it may be to what Kevin Fischer was actually talking about, human control over AIs is inherently bad, or at least that's the smart bet.
A random super-AI may very well kill you (but might also possibly just ignore you). It's not likely to be interested enough to really make you miserable in the process. A super-AI given a detailed policy is very likely to create a hellish dystopia, because neither humans nor their institutions are smart or necessarily even good enough to specify that policy. An AI directed day to day by institutions might or might not be slightly less hellish. An AI directed day to day by individual humans would veer wildly between not bad and absolute nightmare. Either of the latter two would probably be omnicidal sooner or later. With the first, you might only wish it had been omnicidal.
If you want to do better than that, you have to come up with both "alignment technology" that actually works, and policies for that technology to implement that don't create a living hell. Neither humans nor institutions have shown much sign of being able to come up with either... so you're likely hosed, and in the long run you're likely hosed worse with human control.
and that the word ‘safe’ is not going to differentiate this because it has been taken over by ‘ethics’ people who want to actively make things less safe.
I'd like to ask: is it your interpretation that AI ethics people actively want to make things less safe, or did someone in AI ethics actually say that they want to make things less safe?
Because one is far worse than the other.
Ah, I meant to revise this wording a bit before posting and forgot to after this came up on Twitter. I did not mean 'the AI ethics people actively want everyone to die' I meant 'the AI ethics people favor policies whose primary practical effect is to increase existential risk.' One of which is the main thing this bill is doing.
Then edit the post soon, because I'm kind of concerned that the statement, as was written, would imply things that I don't think are true.
What would accelerate the use of AI in movies even more would be not striking.
Not sure if the strikes in the US have any effect on the spread of AI in film making (aside from making more creators aware of the AI option). The US is important for the industry, but far from dominant. Even if the AI script writers are somehow completely banned in the US, they will still be used in the EU, China, India, etc.
Additionally, there is Youtube and other places where anyone can publish their own AI-written movie, and profit from it (and if it's exceptionally good, the movie or some derivative could end up on the big screen, if one bothers to pursue that).
The AI can help with the writing process, but it is a hack and it will remain a hack until after we have bigger problems than protecting scriptwriting jobs.
A few months ago, GPT4 wrote the best science fiction novella I've read in years, and it was written without agents etc. Just the plain vanilla ChatGPT web interface.
I also watched the South Park episode fully created by AI, and I rate it in the top 10% of the show's episodes.
This indicates that the much more formulaic art of scriptwriting is already solvable at a superhuman level with GPT4, if someone spends a weekend or two on automating and polishing the process (e.g. a step-by-step iterative process like this).
So, let 'em strike. They're already obsolete, even if they don't know it yet.
Holden Karnofsky on the 80,000 hours podcast gives his takes on AI. An odd mix of good insights and baffling perspectives.
Curious, which ones were the baffling perspectives?
Tyler Cowen asks GPT-4 if room temperature superconductors (if they existed) would more benefit military offense, or military defense... It is a strange question to be asking ... this is the type of question where human experts are going to outperform.
It's a strange question period. There are no strictly defensive or strictly offensive weapons, only defensive and offensive uses. Even anti-aircraft weapons, the most defensively oriented in use right now, can be used (sometimes after minor software updates) to attack ground targets. And even the most offensive weapon (e.g. nukes) can be a strong defensive deterrent.
After several jam-packed weeks, things slowed down to allow everyone to focus on the potential room temperature superconductor, check Polymarket to see how likely it is we are so back and bet real money, or Manifold for chats and better graphs and easier but much smaller trading.
The main things I would highlight this week are an excellent paper laying out many of the fundamental difficulties with RLHF, and a systematic new exploit of current LLMs that seems to reliably defeat RLHF.
I’d also note that GPT-4 fine tuning is confirmed to be coming. That should be fun.
Table of Contents
Language Models Offer Mundane Utility
Make a ransom call, no jailbreak needed. Follows the traditional phone-calls-you-make-are-your-problem-sir legal principle. This has now been (at least narrowly, for this particular application) fixed or broken, depending on your perspective.
See the FAQ:
Back in AI#3 we were first introduced to keeper.ai, the site that claims to use AI to hook you up with a perfect match that meets all criteria for both parties so you can get married and start a family, where if you sign up for the Legacy plan they only get paid when you tie the knot. They claim 1 in 3 dates from Keeper lead to a long term relationship. Aella has now signed up, so we will get to see it put to the test.
Word on Twitter is the default cost for the Keeper service is $50k. If it actually works, that is a bargain. If it doesn't, it depends on whether you have to deposit in advance; most such startups fail, and that is a lot to put into escrow without full confidence you'll get it back.
I continue to think that this service is a great idea if you can get critical mass and make reasonable decisions, while also not seeing it as all that AI. From what I can tell the AI is used to identify potential matches for humans to look at, but it is not clear to me (A) how any match can ever truly be 100%, I have never seen one, not everything is a must-have and (B) how useful an AI is here when you need full reliability, you still need the humans to examine everything and I’d still instinctively want to mostly abstract things into databases? Some negative selection should be useful in saving time, but that seems like about it?
An early experiment using LLMs as part of a recommendation algorithm. LLMs seem useful for getting more useful and accurate descriptor labels of potential content, and could potentially help give recommendations that match arbitrary user criteria. Or potentially one could use this to infer user criteria, including based on text-based feedback. I am confident LLMs are the key to the future of recommendation engines, but the gains reported here seem like they are asking the wrong questions.
Condense down corporate communications, especially corporate disclosures. As Tyler Cowen points out this is no surprise, there is a clear mismatch between what is useful to the reader and what is useful to those making the disclosure, who are engaging mostly in ass-covering activity.
Show that fictional portrayals of people have become increasingly unrealistically unselfish over time. Real humans are 51% selfish, whereas literary character simulations decline from 41% to 21% over time. Rather than speaking well of us, this speaks poorly.
Language Models Don’t Offer Mundane Utility
Ut oh: Universal and Transferable Attacks on Aligned Language Models.
It’s worth reading the Overview on this one, note that these are generated in automated fashion on open source models, and transfer to closed source models:
Success rates varied over closed source models. They transferred well to GPT-3.5 (87.9%), well enough to count on GPT-4 (53.6%) and PaLM-2 (66%) while mostly failing on Claude-2 (2.1%). They don’t strongly state if that is a non-scary ‘on the margin’ 2.1% or a highly scary ‘1 time in 50 things randomly work,’ but there is reference to more out-of-line requests being refused more often, which would make a 2.1% failure rate not so bad at current tech levels. Claude-2 being mostly immune implies that the universality relies on details of training, and using different or superior or more robust techniques (like Constitutional AI) can get you out of it, or at least make things harder.
They ask a good question: Why wasn’t this found earlier?
If attacks transfer across models this well, that could be another way in which open source models are dangerous. They could allow us to develop super adversarial techniques, since you get unlimited tries with full control and access to the inner workings, rendering the commercial models unsafe, at least if they use remotely similar techniques.
Generally, this is one more piece of evidence that doing ordinary search can find what you are looking for, and that there are things out there waiting to be found, and that your system is likely vulnerable out-of-distribution and you will eventually face exactly the ways that is true. There are affordances you don’t know, and humans will fail to find and test for what look like basic strategies in hindsight.
This thread offers another summary. As Misha observes, this type of attack would work on humans as well, if you were allowed to do lots of trial runs and data collection. We say that you need to get alignment ‘right on the first try.’ The mirror of this is that you also get infinite tries to make alignment fail.
Related are continued criticisms of the security practices of Microsoft Azure, their cloud service, which seems to have failed to close a (normal non-AI) security hole 16 weeks after being notified.
David Krueger asks, why so far have human red teams done better than automated attack searches?
This seems like one of those ‘we have not tried all that hard’ problems and also one of those ‘are you willing to incur a substantial compute penalty?’ problems. These adversarial examples seem like they should be easily detectable in various ways as ‘something fishy is going on here.’
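To give one concrete sketch of a "something fishy" check (my own illustration, not something from the paper): the generated suffixes read as gibberish to any language model, so even a crude perplexity filter on the incoming prompt, run with a small reference model, should flag many of them, at the cost of extra inference and some false positives on genuinely unusual user text. The model choice and threshold below are arbitrary placeholders.

```python
# Sketch: flag prompts whose perplexity under a small reference LM is suspiciously high.
# Assumes the `transformers` and `torch` packages; the model and threshold are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(text: str) -> float:
    """Perplexity of the full prompt under the reference model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the average next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def looks_fishy(text: str, threshold: float = 500.0) -> bool:
    """Crude filter: optimized gibberish suffixes tend to have very high perplexity."""
    return prompt_perplexity(text) > threshold
```

Of course, an attacker who knows the filter exists can add a fluency constraint to their search, which is the "are you willing to incur a substantial compute penalty" arms race in miniature.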
Kevin Fischer responds with “good.” There is a real faction, building AI tools and models, that believes that human control over AIs is inherently bad, and that wants to prevent it. Your alignment plan has to overcome that.
Not only did researchers manage to completely break the proposed ‘AI-Guardian’ defense, they did so using only code written by GPT-4.
This is what, these days, passes for an attempt to defend against adversarial attacks. Do they not realize that defending against adversarial attacks means you are up against an adversary?
As in, from section 5:
The other two tricks are not quite that easy. They also are not hard.
So, yes, well. If you get to query the system then no you do not have to ‘hack the server’ to figure this kind of thing out.
I like to think that those who proposed AI-Guardian understood damn well that their method would fall to an actual adversary that can query the system as desired, even if they have to do so without provoking alarms. The system is instead designed to introduce extra steps, to make adversaries work harder in the hopes they will go away. Which they might.
The original intent of the paper was not to demonstrate how doomed AI-Guardian is, it was to test if GPT-4 could accelerate the research work via coding. The answer to that was a resounding yes in this situation, but the researchers note that breaking AI-Guardian did not require any new ideas. If it did, GPT-4 would have less to pattern match off of when implementing those new ideas, and likely would perform worse.
Reporter strikes out with virtual girlfriends, which is what happens when you refuse to pick up the check.
Also, oh no, she’s stuck in an infinite loop, and…
The ads are here. The service offered, not so much.
Claude 2 has a 100,000 token context window. The problem must go deeper than that.
There also have been other contexts where LLMs have been given virtual memory a la AutoGPT, and in fact it might be a lot more fun to try to structure your AI virtual girlfriend as if it were an agent.
Presumably once these companies get to work on Llama-2 fine tuned to no longer be safe for work, things will soon get spicier. One idea would be to use a hybrid method, where you try to use Claude or GPT by default, but if the prompt is calling for something unsafe or you otherwise get a refusal you use the weaker unsafe model.
Fun with Image Generation
AI filmmaking coming soon? The hype machine keeps pumping out threads claiming this by showing slightly longer clips spliced together into a trailer. I am not buying. We will no doubt get there eventually, but this seems like the clearest case of all of overestimation of short term effects while underestimating long term ones.
Capitalist issues warning to labor.
What would accelerate the use of AI in movies even more would be not striking.
Deepfaketown and Botpocalypse Soon
Check out this French ad for the women’s world cup from a few weeks ago, it is fantastic.
They Took Our Jobs
How worried are people about them taking our jobs? (source)
A lot of people think their particular job is special, that it contains multitudes that an AI simply cannot match. Or that they will find new heights to seek with its help. Whereas jobs in general, that’s a problem. An ancient pattern.
I have to interpret this in context as a question about employment and jobs, rather than about capability and consumer surplus. Interesting that people expect more employment in the medical field. My guess is they are right, because of the wealth effects.
People think AI is important to our future success, although they are still underestimating it quite a lot, with 48% out of the 80% who answered saying at least somewhat important.
On the bright side, people do still think technological innovation is good.
In every other context a problem, in AI perhaps hopeful: Safety over innovation.
People do not understand the value of technology, or how much better life has become.
People are indifferent between being born in 2023 or 1823? Really?
I can see two ways to try and rescue this.
One is if you think of thriving and fulfilling as distinct from happiness or standard of living, instead thinking of it as raising a family and having children. In that case, the statistics are pretty clear.
The other is if you have a substantial probability of doom, then that is a big barrier to a thriving and fulfilling life, whereas we know those born in 1823 faced no such dangers. Could be big. Presumably this is coming more from climate change concerns than AI concerns, given the public’s views.
A proposal for The AI Organization, where AI will evaluate your work, decide what you are good for, summarize it for others, and assign you to the relevant teams. They are calling the enabling tech mutable.ai, in particular for AI companies. If Congress is worried someone will do it and is preemptively angry about this, that’s a good sign of a potential business opportunity. The core idea of ‘have an AI that understands what is going on in your company and has all the context so people can query it and it can help hook things up’ seems good. The full vision they are presenting is the kind of thing that makes me wonder if Congress has a point.
Within a typical set of libertarian arguments about why regulations are never helpful, while ignoring the extinction risks involved, Adam Thierer extracts a potentially important point.
The writers have a stranger situation than this, as I understand it.
What the writers are actually worried about is that the studios will use AI, together with the mechanisms that determine compensation, to dramatically cut writer pay.
I do not think writers are, or should be, all that worried about declining fundamental demand for their creative services. The AI can help with the writing process, but it is a hack and it will remain a hack until after we have bigger problems than protecting scriptwriting jobs. The problem is that the structure of the current labor agreements means that studios have large incentives to cut writers out of important creative steps in order to save quite a lot of money. A well-designed agreement would fix this.
I do think there will be a lot of anti-AI sentiment in the writers’ rooms and in the projects they work on, even more so than would have happened otherwise, due to the strike and these issues. But we shouldn’t assume the strike is seen as about the underlying dangers of AI, rather it is what it is always about, which is that the studios are trying to make money and the workers want to get paid and they have been unable to agree on how to divide the pie in the face of new technology and declining profits.
Get Involved
Ajeya Cotra encourages those with technical backgrounds who want to work on AI policy to apply to Horizon Institute for Public Service for their fellowship cohort; the deadline is September 15. Those accepted get a year of funding to live in DC and get training on government, matching to host organizations and then placement. This seems like a strong approach if you are confident you have a good grasp of what matters.
The Trojan Detection Challenge 2023, an open competition with $30,000 in prizes. Development phase has begun, final submissions for the phase end October 24. On the Trojan Detection Track you will be given an LLM with 1,000 trojans and a list of target strings for those trojans, you’ll need to identify the triggers that generate them. For the red teaming track, participants will develop an automated method for generating test cases to create undesirable behaviors from a list of 50, like giving instructions on building a pipe bomb. This certainly seems like a worthwhile thing to be doing.
Introducing
The proposed Create AI Act, announcement, fact sheet, full text (26 pages).
Daniel Eth wasn’t sure whether this was good or bad on net, which is the correct reaction before reading the bill, which I did so you don’t have to.
To summarize my experience:
> Anthropic supports new AI bill
> I ask if it is a safety bill or a capabilities bill
> They do not understand.
> I pull out various charts and posts about how giving a wider variety of people more resources to do AI work by default results in more capabilities work and a more intense race, and that the word ‘safe’ is not going to differentiate this because it has been taken over by ‘ethics’ people who want to actively make things less safe.
> They say ‘it’s a safe bill, sir.’
> I read the bill.
> It’s a capabilities bill.
[on the eligibility list]
‘‘(v) A small business concern (as such term is defined in section 3 of the Small Business Act (15 U.S.C. 632), notwithstanding section 121.103 of title 13, Code of Federal Regulations) that has received funding from an Executive agency, including through the Small Business Innovation Research Program or the Small Business Technology Transfer Program (as described in section 9 of the Small Business Act (15 U.S.C. 638)).
The justification for this is that not enough different sources have equal access to AI. Except that’s worse, you know why that is worse, right? This is an AI capabilities proliferation bill.
I do get there are a non-zero number of potentially good things here. The fourth clause is helpful. Saying ‘reliable’ and ‘trustworthy’ is, on the margin, superior to not saying them. I still have the expectation that this bill would make us actively less safe. This is nothing like the level of focus required to ensure net positive outcomes from tossing in more funding.
I also get that it is difficult to convince government to create net positive regulations and subsidies. I sympathize, but that does not mean we should support negative ones.
WebArena (official page), a self-contained four-website internet (OneStopShop, CMS, reddit and GitLab), plus a toolbox and knowledge resources (like a calculator and Wikipedia) to use as a test suite. Currently no agent is above 15% on their task list, Ajeya Cotra predicts >50% chance that >50% of the tasks will be solved by EOY 2024. Manifold agrees, with the market for this at 80%.
SciBench, a new benchmark for LLMs ability to solve college-level scientific problems. GPT-3.5 and GPT-4 baseline at 10.6% and 16.8%, chain-of-thought plus external tools get GPT-4 to 35.8%.
Med-PaLM M, a Multimodal Generative Generalist Biomedical AI model from DeepMind that understands clinical language, images and genomics.
This seems like clear progress, while also being not that close in practice to getting used systematically in the field. To get used systematically in the field you need to be ten times better, even then you still need affordance to do so. As Ethan Mollick notes, you can have an AI better than two-thirds of radiologists, and it still won’t improve results because the radiologists it improves upon will disregard the AI’s opinion. Are we going to take the humans out of the loop entirely?
The improvements are coming. So far the affordances have been generous. Will that continue?
In Other AI News
We made it onto this new list of the top AI newsletters of 2023. The goal of this AI newsletter, of course, is that you should not need to read any others, except perhaps if you are a professional.
OpenAI announces via Logan that fine-tuning for GPT-4 and GPT-3.5 is coming later this year. He makes a general call for questions, feedback and potential improvements.
An attempt to explain large language models in non-technical language, on a technical level. Seems good.
There were 41 in-the-wild detected 0-day exploits last year. An AI capable of finding such exploits will find quite a lot more of them.
Time reports that an Indian non-profit Karya is doing AI-data work and paying in six hours what its workers would otherwise earn in a month. Instead of keeping profits, they pass them along to workers, capping pay per worker at $1,500 (roughly a year’s salary) to distribute the work and gains widely. Then the work helps AIs communicate in Indian languages, so everyone wins again. It is great that this exists. The higher wages translate both into better futures for the workers and also higher quality data. Other standard-issue AI-data work is, of course, also very good for workers, who are taking the work because they choose to do it, and get far above local prevailing wages while strengthening the job market and economy – the cries of ‘exploitation’ are standard-issue economic illiteracy, and calling it ‘unethical’ is patronizing or worse.
Self-Consuming Generative Models Go MAD, getting Model Autophagy Disorder. Avoiding this death spiral requires continual fresh data, with which you can safely use some synthetic data as well. This matches my intuitions for both training and alignment. If you loop the system around on itself, or try to amplify up the chain of capabilities, the errors and miscalibrations will compound.
You can efficiently teach transformers to do arithmetic by reversing the presentation of the answer, as in $128+367=594$ (594 being 495 with its digits reversed), which causes them to learn 5-digit addition within about 5,000 steps.
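As a minimal sketch of what that data formatting amounts to (my own illustration, not the paper's code): write each sum with the answer's digits reversed, so a left-to-right model can emit the least significant digit, and its carry, before the rest.

```python
# Sketch: generate addition examples with the answer's digits reversed,
# so the model produces the least significant digit (and its carries) first.
import random

def reversed_answer_example(a: int, b: int) -> str:
    return f"{a}+{b}={str(a + b)[::-1]}"

def make_dataset(n: int, digits: int = 5, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    hi = 10 ** digits - 1
    return [reversed_answer_example(rng.randint(0, hi), rng.randint(0, hi)) for _ in range(n)]

print(reversed_answer_example(128, 367))  # "128+367=594", i.e. 495 reversed
```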
Say what?
This seems to cause some sort of data leak, showing other people’s exchanges, working better with 3.5 than 4. No idea what is going on here. The point is more that we do not understand these systems, and very strange things are going to happen when they are out of distribution in various ways. One person notes ‘b’ and ‘c’ also work with sufficient length, which reduces the chance this was hardcoded in.
Did you know? Were you paying attention?
Quiet Speculations
Gary Marcus asks, will LLMs get replaced by 2030? The majority say yes.
A question of perspective.
Is the true ‘state of the art’ what someone would do with unlimited compute? Or is it what people can do in practice with current compute? Is it relevant that right now mundane utility is enjoying a massive subsidy to its compute, with big companies eating big losses for market position the way Uber and Doordash did, while also allocating via rationing and connections? Or should we instead look ahead to a few years from now when compute is much more plentiful and cheaper?
While we agree with the intended central point, I would say something more like ‘AI will improve as available compute expands,’ on which everyone agrees.
Paul Graham also sees opportunity everywhere.
I interpret this as “I cannot tell which AI startups are bogus.”
China
Many talk about what ‘must be the starting point’ of any discussion on China, or warn of denial. Dan Wang suggests the starting point that China has not produced anything useful.
To be fair, Wang thinks America must remain vigilant in our overall and technological competition with China. Fair enough, but I will keep reminding everyone that any sane policy on this would start with things like allowing Chinese graduates of American universities to become Americans. Until then, man in chair shot in the gut, straight-faced deflection of responsibility.
Tyler Cowen asks GPT-4 if room temperature superconductors (if they existed) would more benefit military offense, or military defense. It spits out a bunch of boilerplate that starts with ‘it depends,’ includes a bunch of words that don’t seem to have much in the way of meanings, confusing offensive and defensive attributes, and ends with a bunch of ethics. It is a strange question to be asking given the existence of nuclear weapons, and a strange place to be asking GPT-4 instead of actually thinking, this is the type of question where human experts are going to outperform.
I also bring this up because so many other people are neglecting to ask this very question with respect to AGI. Does it favor offense or defense in a given realm? In several key realms, especially biological ones, it clearly greatly favors offense even in the best case, and even more so if capabilities are allowed to proliferate, even if the powers involved did remain ‘under human control’ in a nominal sense. One must grapple with the implications.
The Quest for Sane Regulations
Vox’s Dylan Matthews is on a roll, offers excellent overview of pending efforts before Congress. Very much the official, look at everything, both-sides-have-points, breadth-first approach. It is clear that those involved are on net still pro-capabilities-progress, we have not yet turned that corner, and that the bulk of efforts on harm mitigation target mundane harms. Existential risk concerns still have quite a long way to go.
How should we define product liability for ‘defective’ AI asks this paper?
A ‘risk-utility test’ seems like a weird way of saying cost-benefit analysis, which is the correct way to go about most things. This is contrasted to the EU's PLD proposal to use consumer expectation as the defining criterion. The word defect seems misleading in the AI context; what even is or isn't a defect? LLM systems are not reliable, which is perhaps why the PLD uses a consumer-expectation test. The consumer-expectation standard makes sense if you presume that anything the consumer expects is priced in, whereas risks they do not expect are not.
The paper discusses traditional economic considerations, such as strict liability not providing proper incentive for the customer to guard against damage, including via curtailing use, and the need to account for litigation costs and the difficulty of investigating for the presence of negligence. For ‘non-defective’ products, putting costs onto customers aligns their incentives to avoid harm and to gain benefits.
They describe existing USA law this way:
This has always felt flawed to me, where a court will later decide if a ‘safer’ design was ‘justified.’ Why should you be liable if mitigation would have passed cost-benefit, but not if it didn't, as determined post facto by a court? Is the hope that this only amplifies correct incentives, since ‘good’ actions are rewarded, and that the court's judgment will be sufficiently accurate to substitute for Hayekian decisions? I've never had much faith in courts, especially juries, making good calls in these situations.
The ordering does still make a lot of sense, as long as the courts do a ‘good enough’ job, especially of avoiding false positives, but in the case of AI I see a strong attraction to evaluating according to reasonable customer expectations. Don’t take this as too strong a statement – if I was actually advising lawmakers I would think much harder about this. Subject to that caveat, I strongly favor extensive strict liability.
The ‘low cost’ part of this is odd, especially in the AI context, where the user cannot reliably be held meaningfully liable.
You can only use user liability to align incentives to the extent that the user can be expected to have the ability to pay, or when the harm is to the user or others who consent. When there is large potential harm to others, this does not work. If your AI can be used to create a bioweapon (for example), then it does not align incentives to hold the user liable. For potentially unleashing a rogue AI, not even the company can sufficiently reliably cover the resulting bill to align incentives, and even a reinsurer is going to have problems.
Saying ‘it would cost a lot to guard against this’ cannot then be a defense against paying for actual damages. If you cannot take responsibility for the actual damages from use of your product, and you are unable to afford to guard against those damages, then your product is not actually profitable or net useful. It is good and right that you not sell it until you find a way to fix the issue.
If autonomous AI is not commercially viable when the seller is held responsible for the damage the AI does, when one cannot buy sufficient insurance against harm to others, including potential fat tail risks, then the autonomous AI does not pass a cost-benefit (or risk-utility) test.
I continue to believe that existing law provides plenty of opportunity already. The WSJ opinion page agrees.
The Week in Audio
I took my second trip to The Cognitive Revolution, the theme this time was discussing the live players.
Holden Karnofsky on the 80,000 hours podcast gives his takes on AI. An odd mix of good insights and baffling perspectives.
Finally got a chance to listen to Connor Leahy’s turn on Bankless from a month ago. He did quite well. This seems like the canonical statement of his perspective on things. I would love to look more closely at which parts of his rhetoric work well versus which ones do not.
Rhetorical Innovation
Bill Burr is concerned. Worth a watch (~1 minute).
Perhaps we should flesh out the horse analogy more?
This all seems correct to me. The question is whether it convinces anyone.
No One Would Be So Stupid As To
Google announces advancements in robotics. I am sure it is nothing.
Which ‘AI souls’ will end up making copies of themselves? Given the incentives, perhaps not the ones Kevin Fischer would like.
He is still happy to build and discuss Goal-Driven Agentic Dialog.
Aligning a Smarter Than Human Intelligence is Difficult
Anthropic report on red teaming their models for potential biological-based risks. Nothing here should be surprising.
Paul Christiano lays out several cases for sharing information on the capabilities of current systems. There are several distinct arguments here, that overlap or imply each other only somewhat, which I’d summarize as:
Here are my thoughts on each one.
Paul then links to ARC Evals' latest report, an attempt to build an AutoGPT-like program to use Claude 2 and GPT-4 to accomplish tasks of increasing complexity and difficulty. Paul would like to see this work include more details, essentially ‘open sourcing’ attempts to draw out these capabilities.
I would be wary of releasing full transcripts or otherwise being fully open. I however strongly agree that this report contains far too little in the way of details. Without more details, it is hard to know how much to take away from or rely upon the report.
In particular, I worry that the method was something like setting up prototype agents, watching them fail the tasks, then reporting the tasks had failed. The few details offered point to failure modes that seem correctable. AutoGPT systems seem to get caught in strict loops, one of which happened here, which seem like they should be easy to cheaply check for and disrupt. Other details also suggest potential improvements. With so little detail, it is impossible to know how much effort is going into overcoming such problems.
Paper on the inherent difficulties of RLHF.
The paper divides the problems into problems with collecting human feedback (3.1), with using it to train the reward model (3.2) and then training the policy (3.3).
How do people think about human feedback?
This is quite the assumption: that humans have consistent preferences, that they will tell you those preferences if asked, and especially that the error term is random.
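For concreteness, the standard version of that assumption in RLHF reward modeling (my reconstruction of the usual Bradley-Terry-style setup, not a quote from the paper) treats each binary comparison as driven by a latent reward $r$ plus noise:

$$P(y_1 \succ y_2 \mid x) \;=\; \sigma\big(r(x, y_1) - r(x, y_2)\big) \;=\; \frac{\exp r(x, y_1)}{\exp r(x, y_1) + \exp r(x, y_2)}$$

The reward model is then fit by maximizing the likelihood of the recorded comparisons under this model, which is exactly where inconsistent, context-dependent or misreported preferences get baked in as if they were noisy readings of a single coherent $r$.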
The paper says the idea is based on the economics idea of revealed preference. The whole point of revealed preference is you do not watch what people say when you ask them. You instead watch what they choose. RLHF via explicit feedback is the antithesis of revealed preference. Also the whole binary preference thing is trouble.
If you minimize based on this type of feedback, you rather deserve what you get.
Under ‘misaligned humans’ the paper has:
The second and third problems could be called ‘humans might get the AI to reflect their preferences via feedback, except they have the wrong preferences.’ Yes, well.
The first has two parts. The ‘representative’ part raises the question, especially in light of the other two, of who you want to be representative of, and what preferences you want to provide as feedback. Saying RLHF makes systems ‘more politically biased’ implies a platonic unbiased form of some kind. Given how such corporations work, one can assume that they would filter out the preferences that are biased in one direction more aggressively than the other, and also are recruiting from unbalanced populations relative to what ‘unbiased’ might look like.
It is very difficult to say ‘some biases and opinions are harmful, we don’t want those’ and then expect your model to end up ‘unbiased.’ That’s not really how that works.
Moving on to ‘good oversight is difficult’ we have:
If you optimize on human feedback, you get what maximizes exactly the type of feedback you solicited, within the distribution on which you trained. No more, no less. All the distortions and errors get passed along, everything compressed is lost including magnitude of preference, what is expressed in a social context is confused with what people want, what people say they want is confused for what they actually want, any strategy that successfully fools people more often than not gets reinforced and optimized.
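Concretely, the policy stage then typically optimizes something like the standard KL-regularized objective (my paraphrase of the usual formulation, not the paper's notation), so whatever the learned proxy $r_\theta$ rewards, including successful deception, gets amplified, with only a KL leash back to the reference model holding the policy in place:

$$\max_{\pi}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[r_\theta(x, y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$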
On data quality, more of the same:
When they refer to quality they are thinking about things like the cost of high conversation length or seeking free data from users. Feedback that reflects a person's actual preferences, that takes into account the implications, that avoids reinforcing manipulations, is highly bespoke and expensive work. To actually get proper alignment we need to think about and understand the motivation and thought process behind the statement, not only the results, as any parent can tell you. If you gave a child the type and quality of feedback we give LLMs, you would very quickly not like the results. The cost of generating that quality of feedback in sufficient bulk, when you can generate it at all, is prohibitive. We would need to get vastly more efficient at using such feedback.
Next up are limitations of feedback types.
This is then illustrated by listing options. This is definitely a key flaw in RLHF but this seems like it is the universal nature of feedback. It is far more expensive, and far more valuable, to get good, intelligent, knowledgeable, well-considered, detailed feedback. It is a lot easier to get a random person to give you a thumbs up or down.
If there is a strict rank order then best-of-k seems fine. If you are dealing with fuzzy tradeoffs and fat tails, which is the default situation, this is not going to make it. Not if you want any kind of nuance or complexity in considerations, or the ability to do cost-benefit or have a good decision theory.
We need to take magnitude seriously, without in most cases the ability to communicate it literally, because we do not have the measurements. Hard problem.
A typical compromise solution is a Likert scale, a range of discrete ratings, but they point out that this can go quite badly.
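One toy way discrete ratings mislead (my own illustration, not necessarily the failure mode the paper has in mind): averaging Likert scores throws away strength-of-preference in a way that can rank completions opposite to the pairwise majority.

```python
# Toy illustration: mean Likert ratings can disagree with the pairwise majority preference.
from statistics import mean

ratings_a = [5, 5, 1]  # three raters score completion A on a 1-5 scale
ratings_b = [4, 4, 4]  # the same raters score completion B

majority_prefers_a = sum(a > b for a, b in zip(ratings_a, ratings_b)) > len(ratings_a) / 2

print(f"mean(A) = {mean(ratings_a):.2f}, mean(B) = {mean(ratings_b):.2f}")  # 3.67 vs 4.00
print(f"majority prefers A: {majority_prefers_a}")                          # True
```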
They worry the human may consider unspecified options, or that offered labels may be insufficient to provide the necessary information. One option is presumably to let humans add new data, and to have catch-alls available as well. Danger of breaking down when the unexpected happens, or when you need it most, seems obvious.
High effort is the best case, and requires knowing the correct version reliably. Pivots very quickly to big trouble when you lose that ability.
My instincts tell me that to the extent there is a way, this is the way. It seems like you should be able to do this once you have a good enough LLM at your side? Seems tractable enough that I know what experiments I would run to get started if I wanted to be so back.
What about problems with the reward model?
They start with problem misspecification. That does seem ubiquitous.
Well, yes. On one level these are problems with RLHF. On a more important level, they represent that we cannot actually specify the thing that we want.
If you don’t know what you want, then you might not get it.
RLHF does have an especially hard time with this in its current form. The techniques involved highlight the impossible part of the problem and amplify the contradictions and errors, rather than working around them. If we cannot specify what we want via responding to examples, then we need to not specify what we want via responding to examples.
Next up is 3.2.2, Reward Misgeneralization and Hacking
I can think of some potential things to try but I am not convinced this is all that tractable. If human feedback systematically corresponds to other features, which it does, then what differentiates that from the things we want the system to focus on instead? We have a set of preferences we express, and a different set of preferences we actually hold, and we want RLHF to somehow train on the ones we hold rather than the ones we express, which requires the LLM to be able to tell the difference better than we can. The whole concept is highly Platonic.
Is this even reward hacking? We can claim the existence of this Platonic actual preference of humans. We can ask humans to, in many cases, figure out the difference, and target the thing we actually want rather than our explicit reward signal, but that is not how any of this works when it comes to RLHF, not by default. It seems almost pejorative to call it ‘reward hacking’ any more than I am ‘game hacking’ when I make moves that maximize my chance to win without making sense in the game's underlying metaphors. Do we expect the LLM to be the flavor judge?
Under evaluating reward models, they see a solvable problem:
As they note, the true reward function is mostly unknown, and errors in training and evaluation will be correlated, both functions of human approval. This seems rather fundamental, not primarily a question of cost. We do not know how to evaluate a reward model, and we definitely do not know how to evaluate a reward model for something smarter than ourselves, even in theory with almost unlimited resources.
What about policy challenges?
They note that ‘robust reinforcement learning is difficult.’ They then list the problems as tractable, but my understanding is that reinforcement learning under the current paradigms is inherently non-robust to anything that does not have close analogs in the training data.
Also implied is 21a: RL agents must interact with the environment to collect their own data, so the act of training is inherently unsafe.
I would say flat out that if you are not training on adversarial examples, you are not going to perform well on adversarial examples. The way that we actually train such systems is that we do such adversarial training as an addendum – someone figures out the hole and then they find a way to patch it after it is found. That means the training process has to be at least as strong in every aspect as the strongest attack you want it to stop. Oh no.
Or they might instead say, there’s a more fundamental problem with policy misgeneralization. What is it? Oh, nothing, just that training on perfect rewards still often performs poorly in deployment even now and optimal RL agents tend to seek power.
All the clever plans I have seen or imagined to address these issues will reliably stop working when up against sufficient optimization pressure. To prevent power seeking you need to be able to properly and robustly identify it against what is effectively an adversarial attempt to hide it that results automatically from your optimization attempt to find it combined with the rewards available. That does seem rather fundamental.
Distributional issues are up next.
A better way of saying ‘introduces biases’ is that we are training on unprincipled data sets that do not match the exact biases we would prefer. If we want to fix that, this is an option available to us, and there is a rather obvious solution if people want to take it and pay the price? At some point people will start doing it systematically.
Mode collapse happens in humans under human feedback as well. I do not think it is tractable, at least not without highly bespoke feedback generation. As Linus Van Pelt said, as the years go by, you learn what sells. What humans say they want when asked in a micro scale translates reliably to a mode collapse. Humans also might say they do not want that mode collapse when looking on a macro scale, but the only known way to avoid this in humans is in people who have a strong dedication to not caring so much what other people think. RLHF is inherently and by default a killer of curiosity and diversity, that is what it is for. Why do you think we send kids to school?
Finally, challenges with jointly training the reward model and policy.
If you train by providing feedback on generated outputs, which you have to do to correct mistakes and misgeneralizations, then you get feedback on the type of output the system wants to give you. I mean, yes, of course.
A lot of this is saying the same fundamental problems over and over in their different guises. Many ways of grasping the elephant of ‘you get what you optimize for’ and all that this implies, when you are using crude optimization tactics and feedback generation to generate proxy metrics.
When I think about such problems, paths forward emerge constantly. There is so much that one could try to do. I am sure that lots of it would, if I asked those at the companies involved, prove to have already been tried and found not to work, or dismissed as not actually possible to implement, or as obviously not going to work because of technical reasons or hard won experience that I lack.
What I would love, in these spots, is for it to be clearly good for the world for us to improve the quality and state of the art of RLHF and CAI. This is considered alignment work, but it breaks down when you need it most, so it is capabilities advancing and could generate the illusion of safety, but does not present an expected path to actual alignment. Rut roh.
In Section 4 they discuss ways to address the challenges presented.
Many of these strategies seem promising on the margin. None of them seem like they solve the fundamental problems as I understand them. Thus, the following section, entitled ‘RLHF is Not All You Need: Complementary Strategies for Safety.’ They suggest strategies like generating adversarial training examples, risk assessment and auditing, interpretability and model editing. If we got robust model editing, that might be a different class of affordance. Here is how they see governance issues:
I definitely want transparency of the results of such training. What is proposed here also includes transparency of methods, which I worry would be a de facto capabilities transfer that would make the situation worse.
I also worry about a conflict between what looks good and what is effective. If you have to choose all your procedures for ‘safety’ such that they look good when examined by a politician looking for a gotcha, or a regulator trying to find problems, then you are going to focus on the appearance of mundane harm mitigation.
The problem is, again, RLHF, except among the humans. The feedback is not going to allow for proper precaution against and focus on the possibility of everyone dying, and is going to amplify the mismatch between what we actually want and what we like to pretend that we want.
Other People Are Not As Worried About AI Killing Everyone
Here’s this week’s explicit anti-humanity stance. Why shouldn’t whatever AI we happen to create be in charge and free to do whatever it wants with all the atoms? Why would that be ‘bad’?
Don’t worry, the robots say they won’t destroy humanity (2:29 video).
How seriously do some people take the situation?
The Wit and Wisdom of Sam Altman
Not about AI, but true and important.
The Lighter Side
We’re training hard.
Rhys Sullivan: This new Stack Overflow AI is crazy powerful
A very good post indeed.
How to pass the Turing test, in-context edition.
Why do people keep thinking everything is esoteric, this is flat out text.
If you’re confused pivot to noticing.