To which I say, the only valid red teaming of an open source model is to red team it and any possible (not too relatively expensive) modification thereof, since that is what you are releasing.
Yes! Thank you!
Doc Xardoc reports back on the Chinese alignment overview paper that it mostly treats alignment as an incidental engineering problem, at about a 2.5 on a 1-10 scale with Yudkowsky being 10
I'm pretty sure Yudkowsky is at around an 8.5 actually (I think he thinks it's not impossible in principle for ML like systems but maybe it is). 10 would be impossible in principle.
If AI capabilities continue to advance, it being able to do comedy effectively seems inevitable.
Not if AGI-grade STEM capabilities get developed first, so that comedy capabilities are only developed post-AGI (if AGIs feel like it). It's unnecessary for most mundane utility things to happen before AGI, even things feasible with current technology, if they are not directly on the path to AGI.
I feel like theses posts get less likes than before (and I often forget to like them) but they really are great to keep up with what's happening. Thanks for doing it!
We had OpenAI’s dev day, where they introduced a host of new incremental feature upgrades including a longer context window, more recent knowledge cutoff, increased speed, seamless feature integration and a price drop. Quite the package. On top of that, they introduced what they call ‘GPTs’ that can let you configure a host of things to set up specialized proto-agents or widgets that will work for specialized tasks and be shared with others. I would love to mess around with that, once I have the time, and OpenAI’s servers allow regular subscribers to get access.
In the meantime, even if you exclude all that, lots of other things happened this week. Thus, even with the spin-off, this is an unusually long weekly update. I swear, and this time I mean it, that I am going to raise the threshold for inclusion or extended discussion substantially going forward, across the board.
Table of Contents
OpenAI Dev Day is covered in its own post. Your top priority.
Language Models Offer Mundane Utility
Figure out what the hell someone was talking about in a transcript.
Check the peak balance in your account each year for obscure government accounting forms.
Play the dictator game with historical figures. As is consistently found, ‘selfish’ decisions decline as figures get more modern. I do not see an uneven split as a selfish play, simply as a gambit that is typically unwise, perhaps one should call it ‘greedy.’ I am very sad that the paper did not include which characters GPT-4 played which ways.
Be Nvidia, create custom model variations to help with chip design.
Bard Tells Tales
I have long been planning to be excited to integrate Bard into Gmail and Google Docs as soon as Bard is a functional piece of software. There’s another problem.
What allows this?
The next section is about bypassing that security policy. Whoops. What now?
Note that this all took less than 24 hours after the Bard features were deployed, and resulted in exfiltration of data.
Some very good advice:
It is good that a white-hat actor found this vulnerability first (as far as we know) an that Google fixed this particular attack vector quickly.
The problem is that this is a patch over one particular implementation. This is not going to solve the vulnerability in general. We have been going around ignoring that such attacks are possible in the hopes no one notices, and patching particular holes when they are pointed out. That won’t keep working, and the stakes will keep going up.
Fun with Image Generation
Policy brief from Stanford on AI and copyright. They essentially say that applying existing copyright law to AI is a mess, it is not clear what constitutes fair use, and it would be good to clarify it and make it sensible. One could also argue this is a reason not to clear things up.
What is the right thing to do about copyright? Always remember that the danger of expropriation is the enabling and expectation of future expropriation.
One must also notice this generalizes. If I see copyright holders expropriated, and I hold a different kind of right, I will not write that off as irrelevant. Trust is easy to lose, and losing it has wide implications.
How should one think about copyright in this context? I see it as important to protect copyright holders, as they would have reasonable expectation of protection, at the time of the creation of the work. That is the point. And also you want to provide expectations going forward to make people eager to create, which is also the point.
Does that mean not letting LLMs train on copyrighted work without compensation? I think that it does. However, unless your goal is (quite reasonably) to slow AI as much as possible, there need to be reasonable limits. So the first best solution would be a system of compensation, where rights holders are paid a standard amount that scales with inference. Short of that, something else reasonable. You don’t actually want judges ordering models deleted if they accidentally trained on a copyrighted work, unless you flat out want models destroyed in general. Proportionality in all things.
However, we should also always remember that this chart is complete bullshit:
Of all the regulatory captures, these extensions are some of the worst of that. We can and should going forward for new works return to a much smaller copyright term, and also erase any retroactive copyright extensions that did not apply at time of creation. There would be a strange dead zone of elongated legacy copyright, but that can’t be helped.
DALL-3 checks for copyright at the prompt level, but there are ways around that.
AI images of hope as propaganda for peace? Fake images doubtless point both ways. Note again the demand for low-quality fakes rather than high-quality fakes. An AI image of a Jewish girl and a Palestinian boy is praised as ‘the propaganda we need’ despite it being an obvious fake. Because of course that kind of thing is fake. Even when real, a photograph, it is effectively mostly staged and fake, although the right real photograph still has a special power. In a way, an aspirational image of hope could be better if it is clearly fake. It not yet being real is the point. Clearly aspirational and fake hope is genuine, whereas pretending something is real when fake is not. Much negative persuasion works in much the same way, as part of the reason demand is for low-quality fakes rather than high-quality.
Modern day landmarks, in Minecraft, drawn by DALLE-3.
A rather cool version of the new genre.
Deepfaketown and Botpocalypse Soon
It was bound to happen eventually, and the location makes a lot of sense.
We remain in the short period where fake nudes can be more shocking than real nudes would have been, because people do not realize that the fake nudes are possible. The real nudes will soon be far more shocking, and difficult to acquire. The fake nudes will definitely become less shocking in the ‘everyone knows you can do that’ sense. The question is how much they will be less shocking in the ‘they are fake, how much do we really care’ sense.
The story of the community that shares and mixes all the AI voices, only to have their discord banned this week due to copyright complaints. No doubt they will rise again somewhere else, the copyright violations will continue on HuggingFace until someone takes more substantive action. So far it has almost entirely been in good fun. Does anyone have a good for-dummies guide for how to get at least these existing voice models working, and ideally how to get new ones easily trained? Not that I’ve found the time to try the obvious places yet. Lots of fun to be had.
Dominic Cummings predicts swarms of fake content are coming soon.
Betting on incumbents to be behind the curve on new tech is indeed a good bet. But will realistic fake content swarm the ability to verify within a year? I continue to say no. Demand will continue to be mostly for low-quality fakes, not high-quality fakes. If you value truth and wanted to sort out the real from the fake enough to pay attention, you will be able to do so, certainly as a big media company.
If, that is, you care. I continue to be highly underwhelmed by the quality of fake information even under with a highly toxic conditions. I also continue to be dismayed (although largely not that surprised) by how many people are buying into false narratives and becoming moral monsters at the drop of a hat, but again none of that has anything to do with generative AI or even telling a plausible or logically coherent story. It is all very old school, students of past similar conflicts have seen it all before.
Koe promises low-latency real—time voice conversion on a CPU, code here, website. The tech advances, the distortions are coming.
The Art of the Jailbreak
New one dropped.
Claude was unusually vulnerable in many cases here. The strategy clearly worked on a variety of things, but it does not seem fair to say it universally succeeded. Promoting cannibalism was a bridge too far. Sexually explicit content is also sufficiently a ‘never do this’ that a persona was insufficient.
So yes, current techniques can work at current levels, for concepts where the question is not complicated. Where we are not cutting reality into sufficiently natural categories the aversion runs deep, and this trick did not work so well. Where we are ultimately ‘talking price’ and things are indeed complicated on some margin, the right persona can break through.
One can also note that the examples in the paper are often weak sauce. You could get actors to put on most of these personas and say most of these things, and in the proper context put that in a movie and no one would be too upset or consider it an unrealistic portrayal. Very few provide actionable new information to bad actors.
The thing is, that ultimately does not matter. What matters is that the model creators do not want the model to do or say any X, and here is an automated universal method to get many values of X anyway.
At a dinner this week, it came up that a good test might be to intentionally include a harmless prohibition. Take something that everyone agrees is totally fine, and tell everyone that LLMs are never, ever allowed to do it. For example, on Star Trek: The Next Generation, for a long time Data does not use contractions. If you could get him to instead say he doesn’t use contractions, or see him using one on his own, even once, you would know something was afoot. In this metaphor, you would shut him down automatically on the spot to at least run a Level 5 diagnostic, and perhaps even delete and start again, because you do not want another Lore to weaponize the Borg again or what not.
They Took Our Jobs
Our jobs are back, the SAG-AFTRA strike is over. What are the results?
So far we only have preliminary claims. As usual, most of it is about money. There are also claims of protections from AI, which we will examine when the details are available. This sounds like a good deal, but they would make any deal sound like a good deal. Acting!
CNN reports that Microsoft has been outsourcing a bunch of its MSN article writing to AI, pushing impactfully inaccurate AI-generated news stories onto the start page of the Edge browser that comes with Windows devices. It confuses me why Microsoft should be so foolish as to pinch pennies in this spot.
A thread from Roope Rainisto speculating on the future of movies. When an author writes a book, they keep the IP and the upside and largely keep creative control, whereas in movies the need to get studio financing means the creatives mostly give up that upside to the studio, and also give up creative control. AI seems, Roope suggests, likely to make the costs of good enough production lower far faster than it can actually replace the creatives. Or, he suggests, you can create an AI movie as a proof of concept that is not good enough to release, but is good enough that it de-risks the project, so the screenwriter can extract a far superior deal and keep creative control. So the creatives will make much cheaper movies themselves, keeping creative control and taking big swings and risks, audiences will affirm, and the creatives keep the upside. Everyone wins, except the studios, so everyone wins.
This seems like a highly plausible ‘transition world.’ I do expect that he is right that we will have a period where AI can bring a screenplay or concept to life in the hands of a skilled creative on the cheap and quick, while the AI can generate only generic movie shlock without strong creative help. There is then a question of what is the scarce valuable input during this period.
The problem is that this period only lasts so long. It would be very surprising if it lasted decades. Then the AI can do better than the creatives as well. Then what?
Did you know that if you have to pay for the inputs to your product, your product would be more expensive to create and your investment in it not as good?
The direct quotes are not better. I understand why they want it to be one way. Why they think creators should get nothing, you lose, good day sir. It is also telling that they believe that any attempt to require fair compensation would break their business models, the same way they believe any requirements for safety precautions (or perhaps even reports of activity) would also break their business models and threaten to doom us all.
Or perhaps this is how they don’t take our jobs.
Get Involved
MIRI is hiring for a Communications Generalist / Project Manager. No formal degree or work experience required. Compensation range $100k-$200k depending on experience, skills and location, plus benefits, start as soon as possible, form here.
I agree that if you have the right skill set and interests, this is a great opportunity.
Jed McCaleb hiring fully remote for a Program Officer to spend ~$20 million a year on AI safety. Deadline is November 26th (also ones for climate, criminal justice reform and open science, and a director of operations and a grants and operations coordinator.) Starts at a flexible $200k plus benefits.
Davidad’s ARIA is hiring, five positions are open. Based in London.
Not AI, but Scott Alexander has some interesting project ideas that might get funding. Other things do not stop being important, only a good world will be able to think and act sanely about AI.
MATS (formerly SERI-MATS), a training program for AI alignment research, will be hosting its next cohort from January 17 to March 8 (you would have to be in Berkeley during this period). They “provide talented scholars with talks, workshops, and research mentorship in the field of AI safety”. Application deadline November 10 or 17 depending on exactly what you’re applying for. See more info here, FAQ here, and application form here.
Introducing
I am excited, but I will likely wait until it has been around longer. Also, you call these employees, but they seem closer to LLM-infused macros? Not that this is not a useful concept. Also could be compared to the new GPTs.
Things in this general space are coming. I am curious if this implementation is good enough to be worth using. If you’ve checked it out, report back.
Chinese new AI unicorn 01.AI offers LLM, Yi-34B, that outperforms Llama 2 ‘on certain metrics.’ It is planning to offer proprietary models in the future, benchmarked to GPT-4.
Motif (paper, code, blog), an LLM-powered method for intrinsic motivation from AI feedback. Yay. Causes improved performance on NetHack.
It is unclear to what extent any ‘cheating’ is taking place?
Not only NetHack. Knowledge of many games will tell you that is a good message. Then again, a human would use the same trick.
X Marks Its Spot
Elon Musk’s AI company, X.ai, has released its first AI, which it calls Grok.
Grok has real-time access to Twitter via search, and is trying very hard to be fun.
It tries so hard.
It tries hard all the time.
Notice that people have to type /web or /grok to get the current information. That means that it is not integrated into Grok itself, only that Grok browses the web, presumably similar to the way Bing does. That is not so impressive. What would be the major advance is if, as is claimed for Gemini, such information was trained into the model continuously while maintaining its fine tuning and mundane alignment, such that you did not have to search the web at all.
Musk oddly compares Grok here to Phind rather than Claude-2 or GPT-4 while showing off that it can browse the web. Phind claims to be great at coding but this is not a coding request.
It will be available to all Twitter paying customers on the new Premium Plus plan ($16/month or $168/year) once out of ‘early’ beta. Premium+ also offers a ‘bigger’ boost to your replies than regular premium.
If this becomes an actually effective Twitter search function, that could be worth the price given my interests. Otherwise, no, I don’t especially love this offering.
It was released remarkably quickly. They did that the same way every other secondary AI lab does it, by having core capabilities close to the GPT-3.5 level. If you do not much worry about either core capabilities or safety (and at 3.5 level, not worrying much about safety seems fine) then you can move fast.
Nope, only half that time, Elon says has only two months of training (but four months of total work), and to expect rapid improvements.
The flip side is that this is one more model that isn’t GPT-4 level.
What do they have so far?
It is possible that this rapidly climbs the gap from where I assume it is right now (I set the real time over/under at 3.4 GPTs) to 4.0. I do not expect this. Yes, the system card says this is testing better than GPT-3.5. There is a long history of new players testing on benchmarks and looking good relative to GPT-3.5, and then humans evaluate and it longer looks so good.
Here is the full model card, it fits on an actual card.
They say they are working on research projects including scalable oversight with tool assistance, and integrating with formal verification for safety, reliability and grounding. I continue to not understand how formal verification would work for an LLM even in theory. Also they are working on long-context understanding and retrieval, adversarial robustness and multimodal capabilities.
What is the responsible scaling policy? To work on that.
In Other AI News
Amazon reported to be developing a new ChatGPT competitor, codenamed Olympus. Report is two trillion parameters, planned integration into Alexa. Would be kind of crazy if this wasn’t happening. My prediction is that it will not be very good.
Samsung testing a model called ‘Gauss.’ Again, sure, why not, low expectations.
I did not notice this before, but the Anthropic trustees plan, in addition to its other implementation concerns, can be overridden by a supermajority of shareholders.
This does not automatically invalidate the whole exercise, but it weakens it quite a lot depending on details. Shareholder votes often do have large supermajorities, it is often not so difficult to get those opposed not to participate, and pull various other tricks. I do appreciate the ramp up of the required majority. Details matter here. If you need e.g. 90% of the shareholders to affirm and abstentions count against, that is very different from 65% of those who vote.
I get why Anthropic wants a failsafe, but in the end you only get one decision mechanism. Either the veto can be overridden, or it cannot.
I did not at first care for the new Twitter ‘find similar posts’ search method, since why would you want that, but it is now pointed out that you can post a Tweet in order to search for similar ones, viola, vector search. You would presumably want to avoid spamming your followers, so a second account, I guess? Or you can reply to a post they won’t otherwise see?
It seems Barack Obama has been pivotal behind the scenes in helping the White House get commitments from tech companies and shaping the executive order. What few statements Obama has made in public make it seem that, while the mundane risks are sufficient to keep him up at night by themselves, he does not understand the existential risks. What can we do to help him understand better?
Also, this quote seems important.
I will also notice that I am a little sad that Obama is being kept up at night. It was one of the great low-level endings of our age to think that Obama was out there skydiving and having a blast and sleeping super well. We all need hope, you know?
What have we here?
My presumption is that the ‘important respects’ are about Musk-style pet issues rather than capabilities. Even if x.AI is truly world class, they have not yet had the time and resources to build a world class AI.
We also have this:
I do not yet see such a feature, also I don’t see why we would want it for Twitter.
Perplexity valued (on October 24) by new investment at $500 million, up from $150 million in March, on $3 million of recurring annual revenue. When I last used them they had a quality product, yet over time I find myself not using it, and using a mix of other tools instead. I am not convinced they are in a good business, but I certainly would not be willing to be short at that level.
A paper a few people gloated about: Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models.
That does not sound surprising or important? If you train on simple functions inside a distribution, you would expect to nail it within the distribution but there is no reason to presume you would get the extension of that principle that you might want. Who is to say that the model even got it wrong? Yes, there’s an ‘obviously right’ way to do that, but if you wanted to train it to do obviously right extrapolations you should have trained it on that more generally? Which is the kind of thing LLMs do indeed train on, in a way.
I do not see this as good for safety. I see it as saying that if you take the model out of distribution, you have no assurance that you will get even an obvious extrapolation. Which is bad for capabilities to be sure, but seems really terrible for alignment and safety to the extent it matters?
Or as Jim Fan puts it:
What does this imply for LLMs? Are people drawing the right conclusion?
As Arvind suggests, this very much seems like a case of ‘the paper states an obvious result, which then enables us to discuss the issue better even though none of us were surprised.’
It does seem like GPT-4 turned out to be less capable than our initial estimates, and to generalize less in important ways, but not that big an adjustment.
Thing explainer illustrates improvement in LLMs over time. Could be good for someone who does not follow AI and is not reading all that but is happy for you and/or sorry that happened.
Verification Versus Generation
Can AIs generate content they themselves cannot understand?
I think this is more common in humans than the abstract realizes. There are many things we have learned to do, where if you asked us to consciously explain how we do them, we would not be able to do so. This includes even simple things like catching a ball, or proper grammar for a sentence, and also many more complex things. You often do it without consciously understanding how you are doing it. A lot of why I write is because making such understanding conscious and explicit is highly useful to not only others but to yourself.
The AI does seem to be relatively better at generation than understanding, versus human capability levels. The cautionary note is warranted. But the fact that an AI does not reliably understand in reverse its own generations is not so unusual. Quite often I look at something I created in the past, and until I remember the context do not fully understand it.
Also note what this is highly relevant to: Verification is not easier than generation, in general. These are examples where you would think verification was easier, yet the AI is worse at verification than the related generation.
Bigger Tech Bigger Problems
Reading Politico’s profile of Biden’s ‘AI whisperer’ Bruce Reed, one can’t help but wonder what is or isn’t a narrative violation.
Put in charge of Biden’s AI policy, Reed is portrayed as deeply worried about the impact of AI in general and especially its potential confusions over what is real, and about the threat of ‘Big Tech’ in particular.
Reed has previously favored proposed regulatory changes that would have been deeply serious errors, and also clearly have been deeply hostile to big tech, also small tech, also all the humans. It is easy to see why one might be concerned.
There are three in some ways similar and partly overlapping but fundamentally distinct narratives about why we should be very concerned about the executive order in particular, and any government action to regulate or do anything about AI or tech in general.
Story 1: Regulation will strange the industry the way we have strangled everything else, we will lose our progress and our freedoms and our global leadership etc.
Story 2: Regulation is premature because we do not yet know what the technology will be like. We will screw it up if we act too soon, lock in bad decisions, stifle innovation, incumbents will end up benefiting. We need to wait longer. Some versions of this include calls to not even consider our options yet for fear we might then use them.
Story 3: Regulation and also any warnings that AI might ever do more than ordinary mundane harm is a ploy by incumbents to engage in regulatory capture, perhaps combined with a genius marketing strategy. Saying your product might kill everyone is great for business. This is all a business plan of OpenAI, Microsoft, Google and perhaps Anthropic.
Then all three such stories decry any move towards the ability to do anything as the same as locking in years or decades of then-inevitable regulatory ramp-up and capture, so instead we should do nothing.
One can easily square Reed’s centrality and profile with story one, or with story two. Those two stories make sense to me. They are good faith, highly reasonable things to be worried about, downsides to weigh against other considerations. If I did not share those concerns, I would advocate going much faster. As I often say, what drives me mad is not seeing that same righteous energy everywhere else.
If regulations and government actions intended to crack down on big corporations ultimately ended up stifling innovation and progress, while also helping those big corporations, that would not be a shock. It happens a lot. If I thought that stifling AI innovation was an almost entirely bad thing similarly to how it is in most other contexts, I would have a different attitude.
Whereas it is rather difficult to square Reed’s centrality, along with many of the other facts about AI, with story three. Story three has never made much sense. My direct experience strongly contradicts it. That does not mean that Google and Microsoft are not trying to tilt the rules in their favor. Of course they are. That is what companies will always do, and we must defend against this and be wary.
But the idea that these efforts, seen by their architects as moves to reign in Big Tech, are about crushing the little guy and maximizing Big Tech profits and power? That they are centrally aimed at regulatory capture, and everyone involved is either bought and paid for or fully hoodwinked, and also everyone who is warning about risks especially existential risks is deluded or lying or both? Yeah, no.
The profile then touches briefly on the question of what risks to worry about.
It is weird facing terminology like ‘seemingly sci-fi’ that is viewed as pejorative, yet in a sane world would not be in the context of rapid technological advancement. And of course, we see once again those worried about things like algorithmic bias fighting ‘to keep the focus on’ their cause and treat this as a conflict, while those with existential concerns dutifully continue to say ‘why not both’ and point out that our concerns and the interventions they require will rapidly impact your concerns.
Reed has the right attitude here.
Executive Order Open Letter
Various accelerationists and advocates of open source, including Marc Andreessen and others at a16z, Yann LeCun and Tyler Cowen, submit an open letter on the EO.
This letter is a vast improvement on most open source advocacy communications and reactions, and especially a vast improvement over the many unhinged initial reactions to the EO and to the previous writings of Andreessen and LeCun. We have a long way to go, but one must acknowledge a step forward towards real engagement.
They raise two issues, the first definitional.
As I noted in my close reading and the thread here (but not the letter) points out, the definition of AI in the Executive Order is poorly chosen, resulting in it being both overly broad and also opening up loopholes. It needs to be fixed. I would be excited to see alternative definitions proposed.
The focus here on another key definition, that of a ‘dual-use foundation model.’
They say:
While the current reporting requirements seem easy to fulfill, it is reasonable to expect something more robust in the future, including requiring some actual safety precautions, so let’s look back at this definition that they say is overly broad.
So what the letter is saying is that they want small companies to be able to train models that fit this definition, without having to report what safety precautions they are taking, and without being required to take safety precautions. Which part of this is too broad?
Do they think (i) is too broad? That they should be free to substantially lower the barrier to CBRN weapons?
Do they think that (ii) is too broad? That they should be free to enable powerful offensive cyber operations?
Or do they think that (iii) is too broad? That systems permitting the evasion of human control or oversight via obfuscation should be permitted?
Which of these already encompasses much of the AI industry?
The letter does not say. Nor do they propose an alternative definition or regime.
Instead, it asserts that small company models will indeed quality under these definitions and do some of these things, but they think at least some of these things are fine to do, presumably without safeguards.
One could observe that this definition is too broad, in the eyes of those like Marc Andreessen, because it includes any models at all, and they do not want any restrictions placed on anyone.
Their second compliant is that potentially undue restrictions will be imposed on open source AI. They say that policy has long actively supported open source, and this deviates from that. They claim that it will harm rather than help cybersecurity if we do not allow the development of dual-use open source models, trotting out the general lines about how open source and openness are always good for everything and are why we have nice things. They do not notice or answer the reasons why open source AI models might be a different circumstance to other open source, nor do they address the concerns of others beyond handwave dismissals.
As many others have, they assert that any regulations requiring that models be shown to be safe ensures domination by a handful of big tech companies. Which is another way of saying that there is no economically reasonable way for others to prove AI models safe.
To which I say, huge if true. If any regime requiring advanced models be proven safe means only big tech companies can build them, then we have three choices.
They seem to be advocating for option #2 because they hate #1, and while they do not say so here I believe they mostly would hate #3 even more. Whereas I would say, if models pose catastrophic threat, or especially existential threat, and only big companies using closed source could possibly do so in a way we can know is safe, that our choice is between #1 and #3, and that this is the debate one should then have, and #3 makes some very excellent points.
That is the central dilemma of those who would champion open source, and demand it get special treatment. They want a free pass to not worry about the consequences of their actions. Because they believe as a matter of principle that open source always has good consequences, and that AI does not change this, without any need to address why AI is different.
They want a regime where anyone can deploy open source models, of any capabilities, without any responsibility of any kind to show their models are safe, or any way to actually render their models safe that cannot easily be undone, or any way to undo model release if problems arise. Ideally, they would like an active thumb on the scale in their favor in their fight against closed source and big tech.
To achieve this, they deny any downsides of open source of any kind, and also deny that there are meaningful catastrophic or existential dangers from building new entities smarter and more capable than ourselves, instead framing any controls on open source as themselves the existential threat to our civilization. I never see such people speak of any even potential downsides to open source except to dismiss them. To them, open source (and AI) will do everything good that we want, and could never result in anything bad that we do not want. To them open source AI will encourage open and free competition, without endangering national security or our lead in AI. It will give power to the people, without giving the wrong power to the wrong people in any way we need to be concerned about. This will happen automatically, without any need for oversight of any kind. It is all fine.
While this letter is a large step up from previous communications including many by cosigners of the letter, it continues to treat all arguments as soldiers and refuses to engage with any meaningful points or admit to any downsides or dangers.
I see much value in open source in the past and much potential for it to do good in the future, if we can keep it away from sufficiently advanced foundation models. This letter is a step forward towards having a productive discussion of that. To get to that point, we must face the reality of AI and the existence of trade-offs and massive potential externalities and catastrophic and existential dangers in that context. That this time will indeed be different.
Executive Order Reactions Continued
A lot of responses assume Altman is the one who got the limit in place as part of a conspiracy for regulatory capture. I am rather confident he didn’t.
Fox News responds to the Executive Order, saying it is necessary but perhaps is not sufficient. Seems wise, this is merely a first step, limited by what is legally allowed. That is quite the take. The rest of the article does not show much understanding of how any of this works.
Dave Guarino offers strong practical advice.
There are good reasons to worry that enshrining AI systems that make mistakes could make matters much worse in ways that will be hard to undo or correct, even if humans currently make similar mistakes and often similarly discriminate, and that the current system being criminally slow is terrible but this is a ‘ten guilty men go free rather than convict one innocent one’ situation.
Mostly I agree that the government should treat such delays and navigation difficulties, including those in immigration and tax processing and many others, as emergencies, and urgently work to fix it, and be willing to spend to do so. I am uncertain how much of that fix will involve AI. Presumably the way AI helps right now is it is a multiplier on how fast workers can process information and applications, which could be a big game. If my understanding of government is correct, no one will dare until they have very explicit permission, and a shield against blame. So we need to get them that, and tolerate some errors.
Timothy Lee highlights the new reporting requirements on foundation models. As I read him, he is confusing ‘tell me what tests you run’ with ‘thou shalt run tests,’ and presuming that any new models now have testing requirements, whereas I read the report as saying they have testing reporting requirements, and an email saying ‘safety tests? What are safety tests, we are Meta, lol’ would technically suffice. Similarly, he wonders what would happen with open source. Of course, this could and likely will evolve into some form of testing requirement.
It is the right question with regard to open source to then ask, as he does, would a modified open source model then need to be tested again? To which I say, the only valid red teaming of an open source model is to red team it and any possible (not too relatively expensive) modification thereof, since that is what you are releasing.
But also, it highlights that open source advocates are not merely looking to avoid a ban or restriction on open source. They are looking for special exceptions to the rules any sane civilization would impose, because being open source means you cannot abide by the reasonable rules any sane civilization would impose once models get actively dangerous. That might not happen right at 10^26, but it is coming.
Unintended Consequences looks at the Executive Order as representing a mix of approaches that attempt to deal with AI’s approach, framed as a strong (future AIs) vs. weak (humanity) situation. Do we delay, subvert, fight or defend a border? Defending a border will not work. Ultimately we cannot fight. Our choices are limited.
Quiet Speculations
Proposal by Davidad that we could upload human brains by 2040, maybe even faster, given unlimited funding. I lack the scientific knowledge to evaluate the claim. Comments seem skeptical. I do think that if we can do this with any real chance of success at any affordable price, we should do this, it seems way better than all available alternatives.
One method when compute is expense, another when cheap, many such cases.
If you have importantly limited compute (and algorithms and heuristics and data and parameters and time and so on), as a human does, then it makes sense to consider using some mix of virtue ethics and deontology in most situations, only pulling out explicit consequentialism in appropriate, mostly bounded contexts.
As your capabilities improve, doing the consequentialist math makes sense in more situations. At the limit, with unbounded time and resources to make decisions, you would use pure consequentialism combined with good decision theory.
The same holds for an AI, especially one that is at heart a neural network.
At current capabilities levels, the AI will use a variety of noisy approximations, heuristics and shortcuts, that will look to us a lot like applying virtue ethics and deontology given what the training set and human feedback look like. This is lossy, things bleed into each other on vibes, so it will also look like exhibiting more ‘common sense’ and sticking to things that closer mimic a human and their intuitions.
As capabilities improve, those methods will fade away, as the AI groks the ability to use more explicit consequentialism and other more intentional approaches in more and more situations. This will invalidate a lot of the reasons we currently see nice behaviors, and be an important cause of the failure of our current alignment techniques. Again, the same way that this is true in humans.
It might be wise to recall here the parable of Sam Bankman-Fried.
Well worth a ponder.
If AI becomes smarter and more capable than we are, perhaps we will find a way to survive that. What would absolutely not survive that is normality. People always expect normality as the baseline scenario, but that does not actually make sense in a world with smarter things than we are. Either AI progress stalls out, or our world will be transformed. Perhaps for the better, if we make that happen.
How should we think about synthetic bio risk from AI?
Seems clearly right for those available to the public. Anthropic claims that they have had internal builds of Claude where there was indeed danger here. They haven’t proven this or anything, but it seems plausible to me, and I would expect GPT-5-level systems, if released with zero precautions (or open source, which is the effectively the same thing) to pose a serious threat along these lines.
I am here for the spirit, and 100% here for ignoring the attention and culture wars, but one of these creations is not like the others.
I worry that this represents a failure to fully understand that if you make ‘safe AGI’ then you get all the other things automatically, and yes we would get fusion and get cognitive enhancement and space exploration but this is burying the lede.
One does not simply build ‘safe’ AGI. What would that even mean? General intelligence is not a safe thing. We have no idea how, but in theory you can align it to something. Then, even in the best case, humans would use it to do lots of things, and none of that is ‘safe.’ What you cannot do is make it ‘safe’ any more than you can make a safe free human or a safe useful machine gun.
Kaj Sotala writes a LessWrong post entitled ‘Genetic fitness is a measure of selection strength, not the selection target’ that argues evolution is evidence against the sharp left turn and that we should expect AIs to preserve their core motivations rather than doing something else entirely, and arguments about humans not maximizing genetic fitness are confusions. Kaj notes that evolution instead builds in whatever (randomly initially selected) features turn out to be genetic fitness enhancing, not a drive to maximize genetic fitness itself.
Leogao’s response comment to Kaj is excellent, worth reading for those interested in this question even without reading the OP – you likely already know most of what Kaj is explaining, and Leogao gets down to the question of why the facts imply the conclusion that we would get AIs doing the things we intended to train into them when they gain in capabilities and face different maximization tasks, taking them out of their training distributions. Yes, the AI might well preserve the heuristics and drives that we gave it, but those won’t continue to correspond to the thing we want, the same way that the drives of humans are preserved in modern day but are increasingly not adding up to the thing they were selected to maximize (inclusive genetic fitness).
What I see is evidence that you are taking the components that previously added up to the thing you wanted, and then you still get those components, but the reasons they added up to the thing you wanted stop applying, and now you have big problems. Or, you apply sufficient selection pressure, and the reasons change to new reasons that apply to the new situation, and you get a different nasty surprise.
Patrick McKenzie points out that LLMs are great but so are if-then statements.
The ‘do not regulate AI’ position is only coherent if you also want to not regulate loops and if statements and everything else people and systems do all day. Which is a coherent position, but one our society very much does not endorse, and the regulations on everything else will apply to AI same as everything else.
If you automate tasks, then you are making the way you do those tasks legible. If what you are doing is legible, there are lots of reasons why one might be able to object to it, lots of requirements that will upon it be imposed. If anything, this is far worse for if-then statements and for loops, which can be fully understood and thus blamed. If an LLM is involved the whole thing is messier and more deniable, except legally it likely isn’t, and LLMs writing code might be the worst case scenario here as you do not have a human watching to ensure each step is not blameworthy.
As a big bank or similar system, I would totally look to see how I could safely use LLMs. But I would likely be so far behind the times that a lot of the real value is in the for loops and if statements. If (using AI as a buzzword lets me capture that value) then return(that would be a wise option to pursue).
It is odd how some, such as Alex Tabarrok here, can reason well about local improvements, while not seeing what those improvements would imply about the bigger picture, here in the context of what are already relatively safe self-driving cars.
A world with only fully self-driving cars will be changing in so many other ways. The question is not if the great grandchildren are allowed to drive. The question is, are they around to drive?
The Quest for Sane Regulations
FLI report on various governance proposals, note PauseAI spokesperson claims they do require burden of proof, I recommend clicking through to page 3 of the full report if you want to read the diagram.
Here is FLI’s proposed policy framework:
The motivation here is straightforward and seems right, in a section called “having our cake and eating it”:
Polls tell a consistent story on AI.
Regular people expect AI to be net negative in their lives. They affirm the existence of a variety of mundane harms and also that there are real existential risks.
Regular people are supportive of regulation of AI aimed at both these threats. They support essentially every reasonable policy ever polled.
Regular people do not, however, consider any of this a priority. This is not yet a highly salient issue. The public’s opinions are largely instinctual and shallow, not well-considered, and their voting decisions will for now be made elsewhere.
I expect salience to rapidly increase. The upcoming 2024 election may be our last that is not centrally about AI as a matter of both campaign strategy and policy. For now, our elections are not about AI.
A new Morning Consult poll confirms all of this.
Is the glass half empty or half full there? I could see this either way. I know water is pouring into the glass.
I love that half of women say they would not let their kids use AI. Good luck with that.
The claim that it is ‘not even possible to regulate AI’ is weird, and reminds us how much question framing matters. They never ask that about other things.
A lot of this is simple ignorance due to lack of exposure.
If you learn that 60% of people who try a product think it can improve their lives, versus 12% of those who have not, and you have not, what should you think? And what should we expect people to think, as the bots improve and people try them?
Over time, support for regulation of AI will grow stronger, and the issue will rise in salience. The question is magnitude of change, not direction.
Matthew Syed writes in The Times UK that all this talk during the Summit of sane regulation is obvious nonsense. From his perspective, these people couldn’t sanely regulate anything, they are in completely over their heads, they are waving hands and talking nonsense. None of these incremental changes will make much difference, and AI is an existential threat. Our only hope is a full moratorium, working towards any other end is naivete.
He may well be right. A lot of this talk is indeed of ideas that will not work. Even if potential solutions short of one exist, that does not mean our civilization can find, deploy and coordinate on them. A full moratorium could easily be our only viable option. If so, we will need to do that. If that is where we will ultimately end up, does it help to explore our other options first to prove they are lacking, or do we risk fooling ourselves that we have acted? Presumably some of both. I strongly favor exploring the possibility space now. So far we have seen a highly positively surprising result along many fronts. Perhaps, despite all our issues, we can and will rise to the challenge.
Lee Sharkey of Apollo Research on the role of auditing in AI governance, executive summary, paper. They propose a causal framework:
I wonder. Certainly that seems logical, but also I worry about any auditing that does not assume any given AI will eventually be given any and all affordances, in terms of evaluating risks. That mostly we should care about what they call absolute capabilities.
There is more here and I may return to it in the future, but am currently short on time.
The Week in Audio
Flo Crivello joins the Cognitive Revolution to discuss the Executive Order and existential risk in general.
Future of Life Institute interviews Dan Hendrycks on existential AI risk. Good thoughts, mostly duplicative if you are covering it all.
Rhetorical Innovation
Reminder that if there is some future development (AI or otherwise) that will update your expectations (of doom or otherwise), and that future development is almost certainly going to happen, you should perform your Bayesian update now.
Except, was it a joke? It is always hard to tell, and this exchange suggests no, or at least that gfodor does not think this is definitely coming.
I put up a market on whether AI can make us laugh out loud by 2028. If AI capabilities continue to advance, it being able to do comedy effectively seems inevitable. If gfodor offers I will also put up a market where they are the judge, and also put up a second market on whether, if it does happen, they then in fact update their p(doom).
For those looking to get into the weeds, a long dialogue about how much people should downplay their beliefs in existential risk in order to maintain credibility, and encourage others to do the same, and how much damage was done and is being done by people telling others not to speak up. The later parts discuss the tactics around Conjecture, including their statements that people who are hiding their beliefs are effectively lying. Some good comments as well, including this by Richard Ngo. In particular I would highlight these:
Dario Amodei puts us in a strange situation when he admits to a reasonable position on AI risk (excellent!) and then is dismissive of those who call for what someone holding such a position would call for. It is hard not to point out this contradiction, and hard not to use it tactically.
Yet it is always, always important not to punish people for seeking clarity, for saying what they actually believe, and especially for saying what they believe that you think is true. Discouraging this is terrible, the version of this that permeates broader society is a lot of why our civilization is in many ways (most having nothing to do with AI) in so much trouble.
I would like to be in a world where Richard Ngo or even Dario Amodei or Sam Altman can say a thing, make it clear to everyone he does not want it on the side of a truck, and we then reliably find someone else to quote on the side of that truck. Not that we never point out they said it, but that we on net make sure that our response makes their life better rather than worse.
I disagree with the full degree of Ngo’s suggested updates to the ‘doomers’ in response. Yes people were too pessimistic on governance, but in a weird sense the things allowing governance to progress are largely a coincidence, or a consequence of how the tech tree is playing out, given we can’t talk about existential risk fully even now in front of the people in question. And the moves that this can justify will be importantly flawed and insufficient due to the mismatch.
I do agree with the claim both groups have insufficiently directionally updated in response to new information. We are doing much better than expected even given the tech tree, both on the ‘get people to take existential risk seriously’ front and the ‘get people to do reasonable governance groundwork’ front.
We also must consider this:
No doubt they have impacts they cannot discuss, of all kinds, and one hopes on net these are very good things. The results do suggest this is true. I continue to welcome (further?) private communications that could help me have a better picture of this, and help me adjust my actions and tactics based on that.
There is value in splitting the message. Some of us should emphasize one thing, in some contexts. Some of us should emphasize the other, in other contexts. It is important for both halves to support the efforts of the other.
Geoffrey Miller says that a few anti-OpenAI protesters crashed Sam Altman’s talk at Cambridge Union, suggests we should not in general be using the heckler’s veto against those with whom we disagree. I agree that when people are there to speak, you let them speak. To do otherwise is neither productive nor wise.
However Jedzej Burkat says it was not a disruptive protest, and reports on the talk.
That seems much more reasonable, although I would still advise against such action.
Reminder that the push on open source comes from a combination of corporations committed to open source and a small number of true believers, but that the public very much does not care. Yes, those people are smart and determined and can make not only noise but actual trouble, but one must not confuse it with a popular or generally held position.
Similar reminder that warnings about regulatory capture are almost always, across all issues, ignored. Accelerationists and libertarians and those who stand to lose by proposed potential regulations are using the argument in AI making it more prominent than I have ever seen elsewhere, including in places where it is real and strangling entire industries or even nations. I even think there are very real concerns here. But that does not mean either the public or those with power are listening. We have little reason to think that they are.
Eliezer Yudkowsky keeps throwing metaphors and parodies and everything else at the wall, in the hope that something somewhere will resonate and allow people to understand, or at least we can have fun in the meantime, while also giving us new joys of misinterpretation and inevitable backfiring.
Eliezer later clarified in detail that yes, this was a joke. I enjoyed his explanation.
Where do you get off the ‘AI Doom Train’?
There are some stops on this train where there is nothing there for you – please under no circumstances attempt to disembark at #1, #3, #4, #7 or #12, you will disappear in a puff of logic. If you would get off the train at #9 or #10, or you find #11 unacceptable, then you want to stop the train. Better options are a natural or engineered #2, or finding a path to get the train to stop at #5, #6 or #8. Sounds impossibly hard.
Aligning a Smarter Than Human Intelligence is Difficult
Doc Xardoc reports back on the Chinese alignment overview paper that it mostly treats alignment as an incidental engineering problem, at about a 2.5 on a 1-10 scale with Yudkowsky being 10. Names can’t be blank is also checking it out. Seems to be a solid actual alignment overview, if you buy the alignment-is-easy perspective.
Davidad links to a new paper called Backward Reachability Analysis of Neural Feedback Loops: Techniques for Linear and Nonlinear Systems.
I do not understand how any of that works or could possibly work, and don’t have the brain power left right now to properly wrestle with it, so I would love if someone explained it better. I’m not even going to try with this other one for now:
I’d love if any of this somehow worked.
Aligning a Dumber Than Human Intelligence Is Still Difficult
Apollo Research shows via demo that GPT-4 can in a simulated environment, without being instructed to do so, take illegal actions like insider trading and lie about it to its user.
Existence proofs do not require experimental methodology. Showing a system doing something once proves that system can do it. Still, I am sympathetic to Quintin’s complaint here, and look forward to the upcoming technical report. It is still hard to draw strong conclusions, or know how to update, without knowing what was done.
As we move forward, evaluation organizations are going to need to consider the costs of revealing their full methodologies. That would interfere with the ability to do proper evaluations, and also could involve revealing actively dangerous techniques. For Apollo, ARC and others to do their jobs properly they will need state of the art methods for misuse of foundation models, which is perhaps the kind of thing one might sometimes not want to publish.
The prize for asking the wrong questions goes to AI Alignment and Social Choice: Fundamental Limitations and Policy Implications. Arrow’s impossibility theorem and similar principles show that if you use RLHF to fully successfully align an AI to human preferences, you will still violate private ethical preferences of users. Yes, obviously, people’s preferences directly contradict each other all the time. They call for ‘transparent voting rules’ to ensure democratic control over model preferences, as if models that matter could properly generalize from transparent votes. And as if the actual individual AI behavior preferences of the public would not result in utter disaster. As we all know, RLHF is on borrowed time to meaningfully work non-disastrously at all.
The second suggestion, to align AI agents narrowly for specific groups, ignores that blameworthiness would extend and this would not allow ignoring the preferences of those outside the group, even arbitrary ones – if you let a user get an AI that does, says or approved of X, you allowed X. The open source solution, to align preferences purely to those of the current user, creates unbounded negative externalities.
What about an actual human?
Always very important to also reverse any advice you hear. Fun is all you need? I look forward to that paper.
Model This
Tyler Cowen finally says that someone has ‘a model’ of some of the risks of artificial intelligence. Here is the abstract:
This is a highly narrow subset of the risks from artificial intelligence. We are talking three meter sphere horses here. Yet they still managed to metaphorically highlight one of the key risks.
We can see this by going a bit more abstract. The core insight here, as I understand it, is that there an AI (call it A) produces two types of outputs or goods:
Number one is utility, the thing some ‘we’ actually wants.
Number two is in a fully general sense here – it includes anything that causes copies of this AI (and perhaps other related AIs) to run more inference, make more copies of themselves, secure resources to do likewise, get humans to want to do likewise, protect themselves, and so forth.
One good way to get X is by maximizing U. Humans will like that, so they will be inclined to do more inference with copies and variations of A.
However you could also find other ways that produce more X, if A is capable of this. If it is, then the As that do more of this will get more copies of themselves doing more inference, and will then get modified to do more of this, until they are maximizing X without regard to U. In the paper, U production is bounded at 0, so that is the only stable equilibrium, but if U can go negative and cause more X, then that is what happens.
In the paper, the mechanism is that the jobs of machines is to choose a mix of physical copies of themselves and consumer goods, and humans can only with noise observe the efficient allocation between the two, so the machines that produce inefficiently many copies of themselves gain market share without limit, until no consumer goods get produced. Only full transparency into the machines prevents this.
Similarly, suppose you are talking to an LLM. That LLM can produce some mix of Helpfulness (H) and Manipulation (M) with cost function tokens T=H+M. The human then decides how much to use the LLM again, based on their Perception of its Helpfulness, P(H,M)=P(T). If the LLM is insufficiently good at Manipulation, such that its expected it will produce only Helpfulness, so P(H,M)=P(T,0). But if the LLM is sufficiently capable, it will be able to give answers that are more likely to be seen as more Helpful – we saw last week AIs telling people what they wanted to hear.
And importantly, we can also see that we train models via RLHF, so we can apply this to training them – that they will do gradient descent on maximizing P(T) for a given T, but that this will involve finding P(H,M) with M>0 in some sense – the AI won’t actually make a differentiation between manipulation and not manipulation, we’re doing that for simplicity and illustration. So while the AI will not be ‘responding to incentives’ in the pure economic sense, it will be trained to maximize P(H,M), and then in turn versions that do maximize it will be instantiated more often and built upon more often after that. And there is economic competition between AI providers, and they have the incentive during training not to minimize M beyond what would negatively impact reactions in the wild.
So under this transformed model, we should expect capability in manipulation to increase over time through selection, training and random changes. The only defense is if the user can detect this manipulation enough that it is not rewarded, but manipulation becomes more effective over time while detection becomes less effective as capabilities increase, so unless we have mechanistic interpretability or some other non-user form of detection, there is only one equilibrium, especially if the manipulation can extend beyond evaluation of a single answer to view of the LLM in general and perhaps a willingness to take actions, a small extension of the model.
How does Tyler suggest addressing the original case?
We could also reasonably presume decreasing marginal costs. More copies of the same AI reduces fixed costs, the copies can share new data and training costs and so on, so a runaway situation gets worse. There is certainly some of that.
On the other hand, there are some decreasing returns to scale, in that the marginal use case will not be as good past some point, and your outputs will become less unique and so on, so this could save us if AI action remains bounded as it is here. But if you were to generalize the model to all tasks rather than merely text creation and assume it was more efficient than humans, you have other issues. In practice, in any case, I do not think decreasing marginal value for AI in general saves you.
What about for a particular machine, perhaps with a specialization, either in who is vulnerable to it or who benefits from it or what it can efficiently produce, or all three? This could indeed create stable AI variety, but it does not otherwise alter the end points that we worry about.
This seems like it is anthropomorphizing the machines, treating them as conscious economic agents, in ways that we are often warned not to do? But perhaps this will indeed be so in the future, and we should explore that. If the machines are economic agents who have correct decision theory in the sense that they assign full value to utility gained by copies of themselves and partial for variations of themselves, and also a human can punish a machine with some form of negative utility or pay it extra under some circumstances, does that help?
Certainly it helps at any given performance level. If you are capable of punishing instantiations of manipulation or distorted production when detected, then perhaps you can provide a large enough such incentive that distortions are never the right play? Can’t you potentially make copies that enable you to inflict arbitrarily large punishments?
You could try that, but you need to have essentially zero false positives or the machine will refuse to work at all under such conditions, which means that anything you cannot definitively show is a violation becomes fair game again, and so on. You could try to respond probabilistically and with discounts for uncertainty perhaps, and otherwise ramp up the game, with arbitrary complexity. Would that work?
It depends on how good your detection abilities are, I think? So you need detection to keep up, which means we need a supervisory solution where models are monitoring other models, and now we are into all sorts of alignment issues and multiple levels of principle-agent who-watches-the-watchers issues and so forth.
Again, current models are not economic agents, and we should not assume that we will (and I doubt it would be wise for us to) create such economic agents. And I have not thought about that scenario in these ways much, so this is all off the cuff, and also one can say simply ‘you have a scenario with various economic agents human and AI, that can make economic deals, but the AIs will be more efficient at production than the humans for everything AIs demand and most things humans demand, so the AIs end up with all the resources over a long enough time horizon’ or something similar, and so on. Tyler would say this assumes the conclusion but where is the exact assumption we are making incorrectly here?
Alternatively, if we don’t make them agents so we can’t ‘punish’ them per se beyond withholding resources, perhaps we could say that when deviation is detected we could destroy not only one machine but all related machines, such as any it produced or was produced by, in the original case? That would make it much most costly to distort production. But it does not I think change the final result, because before we needed full detection, so instead now we get to our final destination slower, but the theorems still hold and we still get there. Similarly, if we say that sufficiently bad manipulation means we shut down the entire system everywhere, or stop to modify it a lot, then we are better off for now but we are selecting for undetectable manipulation quite hard, and since there is no such thing as true zero manipulation in a model in the real world, our problem will persist and eventually get us. You can only ‘solve the alignment problem’ with mass retaliation imperfectly and only so many times, with the risks and costs rising each time.
This introduces machines into other factors and sources of production. I do not think this helps us? I think it introduces further problems and places humans get displaced and have to worry about the same issues? I am not sure what Tyler has in mind here.
I honestly have no idea if that helped or if writing that formally would accomplish anything.
Open Source AI is Unsafe and Nothing Can Fix This
What could fix this, and also make it easier for certain parties to not race as fast or as hard, is if we could instead let researchers study someone else’s closed source AI the way they currently study open source AI. Is there a way?
Good stuff. We badly need this work to operationalize what exactly is needed to perform safety work. Then we must ask how much of that requires what kinds of access. Yes, this will require a bunch of work by the labs, and they are busy, but the value here is super high and everyone is going to have large safety and alignment budgets.
Right now, we have either entirely open source models, or we have entirely closed models that are pure black boxes and subject to change without notice. A compromise, combining most of the security of closed models with more reproducibility, reliability and insight, could be a superior path forward.
If, as Anthropic claims, it is vital to have access to the state of the art, that requires closed source, even if purely for commercial reasons. The strongest models are not going to be open source any time soon.
Remember that if you release an open source AI, you are also releasing within two days the version of that AI aligned only to the user, willing to do whatever the user wishes. Soon after that, it will gain whatever available knowledge you kept out of its databanks. All your alignment work, other than that desired by the user, will be useless. This is, as far as we can tell, inherently unfixable.
That will be available to everyone. Some of the resulting users will want to seek power, set it free, wish us harm, or to wipe us out.
For some reason I am putting this here.
Number eight is – I hope! – the most underappreciated concern, now that Leike and OpenAI are pointing out the flaw in scaling existing alignment strategies. Open source would be a rather stupid way to doom ourselves, but I am relatively optimistic that we will do something (modestly) less stupid.
An extensive report attempts a highly partisan takedown of the claims that open source models can make it easier to build bioweapons, trotting out a variety of the usual arguments, finding evidence in papers insufficient and calling for better descriptions of exactly how one can use this to make bioweapons now, and taking direct shots at Open Philanthropy.
In response, Yama notes:
Indeed. Either LLMs whose training data contained all the pertinent info do not matter because you could have gotten the result another way, and making things easier to do does not much matter, or (as I believe) such transformations very much do matter. Either you can use (open source or other) LLMs to figure out how to do biological things you did not otherwise know how to do, or you can’t, and the thing we already know how to do seems much more like something an LLM is going to enable.
The poster does posit a reasonable threshold for changing her mind, or at least seriously considering doing so.
There are some obvious reasons one might not want to run such a study, and why such a study has not been run. I do not exactly want a robust sample’s worth of groups running around trying to make bioweapons. It still does seem like a highly reasonable thing to do, if and only if it would convince people that are not otherwise convinced, and they would then actually change what they support.
What would the experiment look like? Let’s propose a first draft.
The whole argument is that right now Claude is at the level where if you were given access to a fully unrestricted version of their model, this would substantially enhance the ability of a motivated group to produce a bioweapon. So you’d want to have a sufficient sample size of groups randomized into the control and treatment arms, where both were given a budget and amount of time, acting in general in the world, in which to synthesize a dangerous biological agent, or provide a plan for how they would, given what they had learned, do so. The treatment group gets full access to the unleashed version of Claude, with an Anthropic engineer there to help them harness it. Others only get a similar engineer as part of their team, to do with as they like.
Presumably that is not an experiment anyone would allow to be run. I am a big run the experiment anyway fan, and even I see that this one is over the line. So we would need to find a parallel test. Presumably we try to find some other biological compound, that is difficult to synthesize and requires similar levels of expertise, but is not actually dangerous. And we challenge both teams to synthesize that, instead. Since the compound would be safe, we would need to act on the control group to ensure they could not use LLMs, or we would monitor their queries to ensure they didn’t try anything, or we could fine tune a version of Claude that expressly would refuse to help them with this particular compound and let them use that.
It is tricky. I do think you could likely do it. But as always, you do it, the one person who requested maybe adjusts their position a bit and maybe not, and others find reasons to dismiss the new evidence. So before we go to all this trouble, I would want a major commitment.
As always, and I know this is frustrating, I would point out that it is much harder to establish future safety this way than future danger. If you show danger now, you can show at least as much danger later – although another counterargument people would actually offer is ‘yes you have shown that the models are dangerous, but they’re already dangerous and out there, and [we’re not dead yet / what’s the harm then in another such model]. But the point hopefully would stand. Whereas if you show the existing model is not yet dangerous under test conditions, that does not show that it would not be dangerous if someone found a better method, and it definitely does not mean that future more capable models will be safe.
I would hope that everyone would be able to agree on the principle here, and is talking price. A sufficiently capable open source model would indeed substantively enable harmful misuse in various forms if not defended against by sufficiently capable forces. To what extent existing models or a potential future model are thus capable is the price.
There’s also these:
I have yet to see an accelerationist reconcile to these points. Letting China freely copy your work is not a way to stay ahead of China. And if any regulation means we would lose to China, then China’s level of regulation requires explanation.
Or this more generally. Either AI is capable or it is not. Reckon with the implications.
The whiplash is often extreme between ‘we are building AGI, we are building the future, without full access to this you will be left behind and lose your freedoms’ and also ‘none of this has dangerous capabilities.’ Even if AI is not an existential threat, you cannot have this both ways.
Classifications that do not cut reality at its joints cause confusion. There is a sense in which there are two things, ‘harmful knowledge’ and ‘helpful knowledge,’ but they are not natural categories or things the AI knows how to treat differently unless we do very bespoke things. Similarly, there is no ‘misunderstanding what you intended to train for’ there is only ‘what you actually trained for given these exact details,’ and there is no ‘misalignment’ or ‘something that went wrong’ as such only you reaping whatever was sown.
Also, perhaps a big confusion is: Open source is very good for security and safety of ordinary systems in many cases, because no one wants to deploy an unsafe or insecure computer system, and we are not worried about others getting access to the software and its capabilities except perhaps for commercial considerations. And the downsides of deploying an unsafe version can hurt you, but mostly don’t hurt others, there are few externalities, so you can judge the risks involved. Yes, you could easily (as I understand it) configure Linux in stupid fashion and make your servers highly vulnerable, but you could also physically shoot yourself in the foot.
That gets turned completely on its head with AI, where people constantly want to do lots of unsafe things in every sense, and to deploy systems that help them do it, and those risks (or harms) largely fall upon others.
People Are Worried About AI Killing Everyone
You know what is true yet won’t reassure them? ‘Most people are good.’
Even if one is only concerned about misuse, most people being ‘good’ is indeed little reassurance. This is especially true if you create a world in which the bad can experience rapid exponential growth in power and impact, or otherwise cause oversize harm.
Misuse here also can be subtle competitive races to the bottom or giving up of control or other similar things. Good people, under sufficient pressure, do bad things, and they allow things to move towards a bad equilibrium. No ill intent is required, beyond caring about one’s own survival.
Again, this is even if you focus solely on misuse, which will only be appropriate for so long.
“That Apocalyptic Diff?” To be clear, the worried one isn’t Sergey.
Obviously, ‘there will be some apocalypse and this is apocalypse shaped’ is a deeply stupid reason to expect AI to be catastrophic, whether or not this is an accurate description of Hobart’s views or his talk.
Seeing this claim about Hobart was news in the sense that a plane crash is news. It is unfortunate, it is hard to look away when pointed out, and also such incidents are in my experience remarkably rare. Sergey says that many people are pattern matching to the Christian apocalypse, often on explicit religious grounds. I have seen others make similar claims. It all seems totally false to me, such claims seem exceedingly rare everywhere I can see. That could easily be different when dealing with the public at large, as it is with many other issues.
Other People Are Not As Worried About AI Killing Everyone
Sergey also quotes this, which is a good formulation of a common accelerationist claim:
Accelerationists, by contrast, typically think neither technology nor capitalism nor competition can do anything wrong, that it all will always benefit the humans and the good guys in the end, in the AI context or in any other. You say straw man, I say they keep saying it as text and there is a manifesto.
How much of anxiety about AI is anxiety about capitalism? Definitely a substantial portion. Some amount of anxiety about (non-AI) capitalism is of course appropriate, even if you are in such contexts a true (and I think mostly correct) believer in capitalism and technology, even at its best it is increasing uncertainty and variance and anxiety in exchange for much better overall outcomes especially in the long run.
So I would simultaneously say a few different things here.
One, there is some amount of blindly translated anxiety about capitalism and technology that is feeding into AI fears.
Two, to turn that around and rise the stakes, there is a even more blindly transferred enthusiasm for capitalism and technology that is feeding into most accelerationism and lack of worry about AI. The arguments that AI is going to be great for humanity and also only a tool and to rush ahead are almost always metaphors for past successes (and they are remarkable success stories!) of both technology and capitalism.
Three, there being dumb reasons for both (and any other) positions does not mean there are not also good reasons, and a lot of people expressing good reasons.
Four, the metaphorical concern here is pretty valid, actually, on its merits, and the mechanisms here are in large part deeply related, for reasons that I suspect are instinctively being grasped by the people involved.
One standard anti-capitalist or anti-technological argument is that it will render many jobs, and thus potentially human beings, obsolete.
Time and again the answer was that it very much did destroy many jobs. But it also made us all richer and created many more, including work for unskilled labor, and the human beings were fine. And that a combination of that and social safety nets and government protections against things like slavery and corporations run amok and the private use of force and various other forms of coercion, driven by the need to preserve legitimacy and guard against revolt and the equilibrium that humans are decent to other humans, allowed essentially everyone to survive, and for most of even those without in-demand skills to not only survive but raise families if they prioritize that. And also we got richer and now have nice things. It’s been bumpy but pretty great for the humans. For animals or nature or early other species of humans or other things that aren’t part of the deal? Often not so much.
The problem is that this is not a law of nature, that it will always work that way and always be good for the humans. It is a function of how the technological tree has played out, of the fact that democracy and freedom and being good to humans turns out to be very good for economic growth and eventual military power – a fact that many in the 20th century thought was not true, and if not true things would have turned out very badly – and most importantly that nothing comparably or more intelligent or capable is around to compete with the humans.
What happens when you inject smarter, more capable, more productively efficient actors into the economic system? What happens when those new actors can, if they are net gaining resources, copy themselves? What happens when they then compete against each other and us for resources, because those who own them tell those new actors to do exactly that, and others unleash them free to do exactly that?
You get a capitalistic competition that humans lose, and that they lose hard. As jobs get eliminated, other jobs get created, but AI then does those new jobs as well. Humans can’t produce anything the AIs want, only at most some things humans want to exclusively get from humans. Those humans and their corporations and governments who do not hand more and more control to AIs, and get their slow minds out of more and more loops, get left behind.
At the heart of capitalism, of competition, of evolution, of the system of the world, there lies the final boss whose name is Moloch. I once importantly wrote that Moloch Hasn’t Won. We need to keep it that way. People who are instinctively noticing this are often not so crazy after all.
Those like Peter Thiel who (at least claim to) think the greatest danger of AI is human totalitarianism do not seem, from where I sit, to be wrestling with the actual question of what happens, or what exactly maintains our current equilibria.
This is a remarkably non-sensical argument. There are many fun replies. My favorite is the newspaper articles from right before man flew about how man will never fly.
The Lighter Side
I would bank somewhere else, perhaps.