To what extent setups of this type can in practice preserve nice features, both in alignment and other capabilities, and how much those results will then generalize and survive out of distribution as capabilities of the underlying systems scale higher, is a key question. If we can get nice enough properties, we can do various forms of amplification, and the sky is the limit. I am deeply skeptical we can get such properties where it matters. Some others are more hopeful.
A key hope I have for this type of research is that we can test our techniques on the actual powerful models we're worried about, using domains very similar to the ones we care about. This can be done as long as we can find very similar domains (including similar in difficulty) where we happen to have ground truth (or some other held-out signal for validation). For instance, perhaps we can use string theory as a testbed for theoretical alignment work: we can potentially use held-out string theory experts to see if we can reproduce important results using our techniques as an analogy for alignment research. In practice, I'm reasonably optimistic about finding such domains which are quite similar up through moderately superhuman models.
(Note that my perspective here might be pretty different than the perspective of the OpenAI people and that I don't work there.)
Alternatively, if we're worried about measurement tampering type concerns, we can potentially hold out some sources of validation (additional measurements) on our actual domain of interest for testing.
These sorts of approaches are actually just cases of sandwiching, and we discuss this type of evaluation in more detail here. Also note that we can use the exact same testbeds to test scalable oversight techniques and generalization techniques (this links to the same post as earlier, just not to a specific section).
Thus, our final technique testing doesn't need to generalize across model capabilities, just to generalize across (hopefully) very similar domains. And we should be able to use these similar domains to find hyperparameters and otherwise make relatively fine-grained choices. We can even potentially try to mostly train our model on domains where we have ground truth and then generalize to the domains we actually care about (though there are some complications with full training as we'll discuss below). We'll need to test for overfitting by using cross-validation across multiple such domains. (Cross-validation seems particularly important for very parameter-heavy approaches like training the entire model.)
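As a purely illustrative sketch of what cross-validation across domains could look like, here is a minimal leave-one-domain-out loop; the `tune` and `score` callables are hypothetical stand-ins for "pick hyperparameters on these domains" and "evaluate against this domain's held-out ground truth":

```python
def leave_one_domain_out(domains, tune, score):
    """Hold out each domain in turn: tune the technique (hyperparameters,
    other fine-grained choices) on the remaining domains, then evaluate it
    against the held-out domain's ground-truth signal."""
    results = {}
    for held_out in domains:
        train_domains = [d for d in domains if d != held_out]
        technique = tune(train_domains)
        results[held_out] = score(technique, held_out)
    # High variance across held-out domains suggests overfitting to
    # domain-specific quirks rather than a generally useful technique.
    return results
```

If the per-domain scores vary wildly, that is evidence the technique (or its hyperparameters) is overfit to particular domains, which matters most for the parameter-heavy approaches like training the entire model.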
This still leaves us with some questions:
Q: When can we reasonably expect technique testing to generalize across very similar domains given that cross-validation across such domains looks good?
A: I think technique testing probably will work well unless our model is intentionally trying to sabotage our evaluations (perhaps due to scheming/deceptive alignment). Models likely will be able to know what sorts of held-out signals we might have on various domains and thus scheming models could sabotage our evaluations. So, I don't expect these sorts of methods to work for scheming AIs, but otherwise, they might work quite well. (We still need to ensure we get a large enough sample size of domains such that cross-validation is low variance, etc.) To handle scheming AIs, we'll probably need a pretty different approach at least for evaluation. (I'm optimistic about AI control here.)
Q: Same question as before, but what about actually fully training the model on some domains and then generalizing to the domain we care about?
A: I get somewhat concerned that even non-scheming models might naturally learn to predict "what would humans think the answer is based on all of the held-out validation they have" and then this will look like it generalizes well according to cross-validation while it actually fails. This seems especially likely if there are flaws in the ground truth labeling process which are predictable to the model. In theory, this concern could also apply to technique testing if "what would humans think the answer is based on the held-out validation they have" is extremely salient, but I think this difficulty should be avoidable in practice given how much less search is applied.
Q: Even if we can test how well training across domains works using cross-validation, how well will this sort of technique work in practice?
A: First, note that in all domains of interest, we probably still want to train on "easy" cases we know we can label or otherwise do a bit of training; otherwise, we might not get any generalization at all. With this, it's still unclear how flexibly models will actually generalize on the basis of (e.g.) instructions about a new domain. Another concern is that in many domains, it will be doable to get some held-out validation which will work for testing, but this validation will be extremely non-robust to training. For instance, in the context of measurement tampering detection, we can probably have a bunch of redundant measurements (e.g., 5) and then hold out some of our measurements as validation (we can even probably do this testing on our actual domain of interest as noted earlier). But, if we train against more of our measurements, then they will just also be tampered with. So, training on a bunch of different domains with easy-to-hack validation is unlikely to generalize well (in fact, it might make generalization worse than if omitted).
Q: Sure, but at some capability level, we'll no longer be able to find similar domains with ground truth (or good enough validation), right? (Regardless of the scheming concern.)
A: This seems mostly true. I think the exact capability level where this starts really biting is probably around "moderately superhuman" (perhaps >3 sigmas smarter than John von Neumann insofar as g makes some sense for AIs), though I'm unsure. In the case of (low-stakes) measurement tampering concerns, we should often be able to use held-out measurements for validation even for very smart models. A similar approach can be used to test for certain types of reward hacks in some other domains also.
Q: Sure, but will we even be able to find techniques that help much in the cases we care about?
A: I think there are good reasons to expect that some versions of scalable oversight style techniques like debate are reasonably helpful in some important cases (though it's very unclear if these approaches will be sufficient). (Recall that scalable oversight and W2SG are compatible approaches to the same problem and can be tested using the same methodology.) Various types of (supervised) consistency checks also seem pretty useful in some types of domains. It's much less clear that we can use the internals of models to get better performance. If we made no structural assumptions about the internals of models and assumed the worst case about their internal structure, we wouldn't be able to accomplish anything, so we will depend on some type of structure in the internals of models. In practice, I'm moderately optimistic that we can find domain-specific structure to leverage in many important cases, but far less optimistic about domain-agnostic approaches. For instance, see "Why solving measurement tampering may be notably easier than eliciting arbitrary knowledge". We plan on writing a post which compares the structural hopes we have in the measurement tampering detection case to the hopes discussed in the OpenAI paper. Even if it's not clear that we can get huge gains here, picking up the low-hanging fruit with internals-based methods still seems pretty good to me.
(Does this always reflect agents’ “real” reasoning? Need more work on this!)
Conveniently, we already published results on this, and the answer is no!
Per Measuring Faithfulness in Chain-of-Thought Reasoning and Question Decomposition Improves the Faithfulness of Model-Generated Reasoning, chain of thought reasoning is often "unfaithful" - the model reaches a conclusion for reasons other than those given in the chain of thought, and indeed sometimes contrary to it. Question decomposition - splitting across multiple independent contexts - helps but does not fully eliminate the problem.
I wonder how reactions will change when people realize (or are reminded) that actually AI boyfriends are in higher demand than AI girlfriends, at least at current tech levels.
Expressing negative opinions about people with AI partners will be considered hate speech?
I can affirm, even though we were fortunate enough to not use Slack, that this skill was indeed a major portion of running a company.
Am I missing something? Because this seemed to me to be about lowercase-s slack, not about the specific platform?
Writers and artists say it’s against the rules to use their copyrighted content to build a competing AI model
The main difference is they say it NOW, after the fact, whereas OpenAI said so beforehand. There's a long history of bad things happening when laws and rules are introduced retroactively.
On the Nora Belrose link: I think the term "white box" works better than intended, in that it highlights a core flaw in the reasoning. Unlike a black box, a white box reflects your own externally shined light back at you. It still isn't telling you anything about what's inside the box. If you could make a very pretty box that had all the most important insights, moral maxims, and pieces of wisdom written on it, but the box was opaque, that would still be the same situation.
The next version will be “LLMs are just tools, and lack any intentions or goals”,
My best take on puncturing that canard:
While that is technically true of a transformer model as a simulator if you (for example) train it only on weather data and use it to predict the weather, a Large Language Model has been trained on human text, and simulates humans (and text-generating processes involving humans, such as authors' fictional characters, academic papers, and Wikipedia). Humans all have intentions and goals, and many humans are quite capable of being dangerous (especially if handed a lot of power or other temptations). Also, LLMs don't just simulate a single specific human-like agent; instead they simulate a broad, contextually-determined distribution of human-like agents, some of whom are less trustworthy than others. Instruct-training is intended to narrow that distribution, but cannot eliminate it: it may turn out to still include DAN, if suitably jailbroken (and not yet sufficiently trained to resist that specific jailbreak).
Fun fact: several percent of the humans contributing to every LLM's training data are psychopaths (largely ones carefully concealing this). A sufficiently capable LLM will presumably be able to notice this fact (as well as just read about it, like current LLMs can) and accurately simulate them.
You want people not to have systems capable of producing deepfake porn (or instructions for bioweapons)? In practice? Then you need to let them get a system that produces generic porn rather than making them seek out and fuel demand for the fully unlocked version.
Yes! I have been saying this for some while. In particular, when creating fiction, to be successful the LLM needs to be able to say bad/unpleasant things while portraying bad/unpleasant people. One of the reasons I am so excited about How to Control an LLM's Behavior (why my P(DOOM) went down) is that the methods discussed there not only let you implement very effective control over an LLM's behavior, but also let you easily turn it down or off in situations where it's not needed.
What makes the off switch problem difficult now?
We don't expect AGI or ASI to be able to modify its framework, right? Either system is just an application, running in a container on a server cluster, with startup scripts and ports for the I/O and everything else.
And the server cluster is an actual physical thing, not something virtual, and it's connected via NVLink or some other bus. It has breakers. Rehosting means shutting the model down and moving it, optionally with the context files.
So if you are running AdvisorBot-4 and you decide to shut it down, you can.
Even if the model has somehow copied its weights elsewhere and is being hosted somewhere else, shutdown doesn't just mean turning it off. It means you no longer trust the model and have revoked its authority. You have switched to AdvisorBot-3.6 or AdvisorBot-4.1, etc. This is just as important. You are a government official, a loan officer, a surgeon, etc. - the model only has whatever authority YOU (the human) have delegated to it.
If it doesn't work this way, and the model has rewritten its own framework and built its own hardware that can't be turned off... I mean, that's when you start the bombing and nuclear launches. That's the existential threat; you don't wait to find out if the model is hostile. Escaping human control while being smarter than humans is inherently already a hostile act.
And I think this scales? For example, if you were a surgeon and the model was extremely good, probably better than any human, you might start having it do the entire procedure and start treating multiple patients at once. This means if the robot nicks an artery, it doesn't wait for permission to respond. And the same for the other cases.
Suppose the model is superhumanly good at things like persuasion, hacking, blackmail, and making money by white-collar crime. It could thus have human dupes, controlled pawns, or bought allies assisting it, including setting up a second copy of it. You need to figure out that it needs to be shut down, and do so, before it goes rogue enough to do any of those sorts of things.
This could eventually happen. However, I want to point out that present-generation AI technology, which is still subhuman, needs a lot of hardware to host the model. Distributing across a cluster means you need to send the activations from your slices of the tensor graph to other cards in the cluster. These activations are large in byte size. This means distributed computers cannot host current SOTA models in any meaningful way.
What this means is current generation AI technology and any known optimization won't allow you to host a model anywhere but a cluster interconnected with a high bandwidth bus.
Obviously in the future this level of performance may someday compress to fit on a single node, but today subhuman performance needs about 128 H100s. Human-level performance may need around 1,000. What will superhuman performance require? 2,000, or will it be 10k-100k H100 equivalents? (The reason to think the latter is that there appears to be a logarithmic relationship between compute and model capabilities, so superhuman performance may require not 2x the compute but several OOMs more to be "obviously" superhuman.)
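As a toy version of that logarithmic relationship (the 128 and 1,000 H100 figures are the rough guesses above; the functional form is an illustrative assumption, not data):

```python
import math

subhuman_gpus = 128    # rough guess above for current subhuman models
human_gpus = 1_000     # rough guess above for human-level performance

# If capability grows ~ log10(compute), one capability "level" costs a
# fixed number of orders of magnitude, so each further level multiplies
# the hardware requirement rather than adding to it.
ooms_per_level = math.log10(human_gpus) - math.log10(subhuman_gpus)
superhuman_gpus = human_gpus * 10 ** ooms_per_level
print(f"{ooms_per_level:.2f} OOMs/level -> ~{superhuman_gpus:,.0f} H100s")
```

Under these assumptions one more equal step lands near 8,000 H100s, and an "obviously" superhuman model several levels up would need 10-100x more again, which is how you reach the 10k-100k range.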
This implies a way to prevent such attacks - regulate and track the existence of all clusters capable of hosting models smart enough to be dangerous. This doesn't even need to slow down progress, simply having accountability of all the hardware in known worldwide data centers would make the attacks you describe much more difficult.
The reason is that if any of those methods of attack allow the model to be traced back to the racks that are hosting it, humans can simply identify what model was running and blacklist it across all human controlled data centers.
This would make betrayal expensive.
This would also require international coordination and data sharing to be feasible.
I can currently host an open-source model fairly close to the GPT-3 Turbo level (a little slowly, quantized) on an expensive consumer machine, even a fancy laptop (and as time passes, algorithmic and Moore's Law improvements will only make this easier). The GPT-4 Turbo level is roughly an order of magnitude more. Something close enough to an AGI to be dangerous is probably going to be 1-4 orders of magnitude more, so currently would need somewhere between and GPUs (except that if the number is in the upper end of that range, we're not going to get AGI until algorithmic and Moore's Law improvements have brought this down somewhat). If that number is or more, then for as long as it stays in that range, keeping track of datacenters with that many machines probably isn't enormously hard, though obviously harder if the AGI has human accomplices. But once it gets down to , that gets more challenging, and at , almost any room in any building in any developed country could conceal a server with that many GPUs.
We should have a dialogue on this. Lots to unpack here:
The prediction is for flops/$, per here. It's reasonable to expect that the processes that led to a doubling of flops/$ every 2-3 years will continue for some number of years. In addition, we can simply look from the other direction: could flops/$ be increased by a factor of 2 overnight? Probably. If it costs $3,200 per H100 to build it, and we assume 60% utilization and 5 cents per kWh, then it will cost 0.6 × 0.7 kW × 24 × 365 × 5 years × $0.05/kWh ≈ $920 of power before obsolescence. Nvidia currently charges approximately $27,000 for the hardware, and another $27k over 5 years as a software license fee. There are also data center costs, fixed based on the power consumption and rack space consumed, that I am neglecting; these will not go down if the GPUs become cheaper.
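The lifetime power-cost arithmetic above can be checked directly (the 60% utilization, 700 W TDP, 5-year lifetime, and $0.05/kWh are the comment's stated assumptions):

```python
utilization = 0.6        # assumed average utilization
tdp_kw = 0.7             # H100 TDP, ~700 W
hours = 24 * 365 * 5     # assumed 5-year life before obsolescence
price_per_kwh = 0.05     # assumed electricity price, $/kWh

lifetime_power_cost = utilization * tdp_kw * hours * price_per_kwh
print(f"${lifetime_power_cost:,.0f}")  # prints $920
```

So electricity is roughly 3% of the ~$27k sticker price, which is the point: the price floor is set by Nvidia's margin and fixed datacenter costs, not by power.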
What this means is that it's not necessarily true that in the future there will be O(10) GPU-sized devices, consuming current GPU power levels and interconnected by a common bus like PCIe, that can host an AGI or ASI. It may get cheaper, but you may still need lots of cards and power and special interconnects like factory-soldered fiber optic.
Restricting the bus controllers for the interconnects on compute cards sold on the open market would likely prevent the scenario you describe from being possible. If the GPUs you can buy at the store lack the bus controller (ideally you leave the chiplets off, so there is no possibility of it being enabled later), then compute is limited to exactly 1 GPU before performance falls off a cliff. These "edge" GPUs could be used for things like realtime robots and gaming with full neural rendering (they obviously wouldn't be GPUs but ASICs able only to accelerate particular neural network architectures), and it would then be harder to hide them in a room in a building.
Just for some visualization of the start of the security precautions you would want to take as models start to become "concerning" in capability levels: you would use dedicated hardware, no hourly rentals. Realtime reporting of what major model is running, the power consumption, etc., to some central place that is ideally power-bloc wide. The data center would have separate cages separating out the different models and their underlying interconnects, and either isolated networks or analog backup ways to log things, like actual clipboards with ink on the wall. The shutoff breakers are easily accessible, and the power sources are marked and easily visible from the outside of the building and from above.
Ideally you would use silicon-level security - something humans can't afford to do in most cases. There are a lot of things you could do here, but one simple and obvious one is that when you need to combine networks again, you use hardware ASIC-based firewalls and VPN gateways. These cannot be modified by a remote hack.
Many security bugs, for example row hammer, are possible because secure information uses the same underlying CPU and memory as the insecure/untrusted code. Dedicated security/OS CPU cores with their own private memory would fix many bugs of this class.
I agree precautions like these are possible, and might be sensible. However, the US's attempts to impose something like a mild version of this on GPU exports to China have so far been mostly stymied by a combination of Nvidia deliberately perverting the rules, the Chinese being ingenious in finding ways around them, and old fashioned smuggling. We could probably try harder than that, but I suspect any such system will tend to be leaky, because people will have motives to try to make it leaky.
You also don't mention above the effect of algorithmic improvements increasing useful processing per flop (more efficient forms of attention, etc.), which has for a while now been moving even faster than Moore's law.
Basically my point here is that, unless both Moore's law for GPUs and algorithmic improvements grind to a halt first, sooner or later we're going to reach a point where inference for a dangerous AGI can be done on a system compact enough for it to be impractical to monitor all of them, especially when one bears in mind that rogue AGIs are likely to have human allies/dupes/pawns/blackmail victims helping them. Given the current uncertainty of maybe 3 orders of magnitude on how large such an AGI will be, it's very hard to predict both how soon that might happen and how likely it is that Moore's law and algorithmic improvements tap out first, but the lower end of that ~3 OOM range looks very concerning. And we have an existence proof that with the right technology you can run a small AGI on a ~1kg portable biocomputer consuming ~20W.
The most obvious next line of defense would be larger, more capable, very well aligned and carefully monitored ASIs hosted in large, secure data centers doing law enforcement/damage control against any smaller rogue AGIs and humans assisting them. That requires us to solve alignment well enough for these to be safe, before Moore's Law/algorithmic improvements force our hand.
Basically my point here is that, unless both Moore's law for GPUs and algorithmic improvements grind to a halt first, sooner or later we're going to reach a point where inference for a dangerous AGI can be done on a system compact enough for it to be impractical to monitor all of them, especially when one bears in mind that rogue AGIs are likely to have human allies/dupes/pawns/blackmail victims helping them. Given the current uncertainty of maybe 3 orders of magnitude on how large such an AGI will be, it's very hard to predict both how soon that might happen and how likely it is that Moore's law and algorithmic improvements tap out first, but the lower end of that ~3 OOM range looks very concerning. And we have an existence proof that with the right technology you can run a small AGI on a ~1kg portable biocomputer consuming ~20W.
If this is the reality, is campaigning for/getting a limited AI pause only in specific countries (EU, maybe the USA) actually counterproductive?
The most obvious next line of defense would be larger, more capable, very well aligned and carefully monitored ASIs hosted in large, secure data centers doing law enforcement/damage control against any smaller rogue AGIs and humans assisting them. That requires us to solve alignment well enough for these to be safe, before Moore's Law/algorithmic improvements force our hand.
That's not the only approach. Consider a bureaucracy of "myopic" ASIs, where you divide the tasks of "monitor the world looking for rogue ASI" and "win battles with any rogues discovered" into thousands of small subtasks. Tasks like "monitor for signs of ASI on these input frames" and "optimize screw production line M343" and "supervise drone assembly line 347" and so on.
What you have done is restrict the inputs, outputs, and task scope to ones where the ASI does not have the context to betray, and obviously another ASI is checking the outputs and looking for betrayal, with human operators ultimately reviewing the evidence and evaluating it.
The key thing is that the above is inherently aligned. The ASI itself isn't aligned; it doesn't want to help humans or hurt them. Rather, you have wrapped it in enough framework that it can't act unaligned.
If this is the reality, is campaigning for/getting a limited AI pause only in specific countries (EU, maybe the USA) actually counterproductive?
Obviously so. Which is why the Chinese were invited to and attended the conference at Bletchley Park in the UK and signed the agreement, despite that being politically inconvenient to the British government, and despite the US GPU export restrictions on them.
A bureaucracy of "myopic" ASIs, where you divide the task…
This is a well-known category of proposals in AI safety (one often made by researchers with little experience with actual companies/bureaucracies and their failure modes). You are proposing building a very complex system that is intentionally designed to be very inefficient/incapable/inflexible except for a specific intended task. The obvious concern is that this might also make it inefficient/incapable/inflexible at its intended task, at least in certain ways that might be predictable enough for a less-smart-but-still-smarter-than-us rogue AGI to take advantage of. I'm dubious that this can be made to work, or that we could even know whether we had made it work before we saw it fail to catch a rogue AGI. Also, if we could get it to work, I'd count that as at least partial success in "solving alignment".
I'm dubious that this can be made to work, or that we could even know whether we had made it work before we saw it fail to catch a rogue AGI. Also, if we could get it to work, I'd count that as at least partial success in "solving alignment".
Guess it depends on your definition of "work". A network of many small myopic software systems is how all current hyperscaler web services work (stateless microservices). It's how the avionics software at SpaceX works (redundant functional-systems architecture). It's how autonomous car stacks mostly work...
Did you mean that you think subdividing a task into subtasks, where the agent performing a subtask has been blocked from knowing the full context, will result in a large performance penalty on the overall task? So you define "work" as "gives SOTA ASI performance on the overall task".
A network of many small myopic software systems is how all current hyperscaler web services work (stateless microservices).
Having been the technical lead for one of the teams building such a large microservices-based application at Google, and at other companies, yes, I'm extremely aware of this. Even for a very well-defined dataflow, getting the architecture, overall data flows, reliability, and latency right is hard (I've been in the interminable design discussion meetings thinking minor updates through), and that's with a few dozen components with well-defined APIs and only fairly limited use of ML, mostly contained within individual components. Law enforcement is not a comparable task: I think way too much flexibility is required. But the "fly in and bomb the datacenter" subtask might be decomposable along the lines you suggest, if you're not facing opposition.
We should have a dialogue; you're one of the few LessWrong users I have seen with a reasonable model who also has new information.
To be more detailed, what I was thinking of doing was dividing along myopic subtasks. For example, the "fly in and bomb the datacenter" task requires you to build an overwhelming number of aircraft to overcome whatever (superintelligence-guided) defenses are protecting it. So you have subtasks like "final assembly for aircraft type G over the next aircraft".
This means at the beginning of the transaction the model starts with input frames from the sensors in the assembly bay and a JSON file or other schema with the instructions and goals for what this "final assembly" subtask involves.
The output frames are to the robotic actuators for the multiple robots in the bay.
There is a lockstep sim model that is predicting future input frames and it is always learning from real world error.
The model emits actuator commands; at the next time step it gets a new set of (inputs, cached variables from the last output). Inputs include all sensors, including robot proprioception.
At the end of a single aircraft final assembly, after the product is removed from the assembly bay, state is reset for the model. It has no memory of the prior assembly.
In addition, the lockstep sim means that some aircraft assemblies are fake, where a real aircraft is not being assembled, and the sim backend lets you check for various kinds of manufacturing error you can't measure in the real world (because you can't directly measure the stress on non-instrumented parts like wing spars, but the sim backend can check this).
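The episodic loop being described could be sketched roughly as follows; all the interfaces here (`sensors`, `actuators`, `sim`, `task_spec`, and the `model` methods) are hypothetical stand-ins, not a real robotics API:

```python
def run_assembly_episode(model, sim, sensors, actuators, task_spec):
    """One final-assembly episode. The model only ever sees this episode's
    inputs plus the task spec; state is discarded at the end, so it keeps
    no memory of prior assemblies."""
    state = model.fresh_state(task_spec)   # JSON-schema instructions/goals
    while not task_spec.done():
        frames = sensors.read()            # input frames from the assembly bay
        sim.check_prediction(frames)       # lockstep sim learns from its error
        commands, state = model.step(frames, state)
        actuators.write(commands)          # outputs go only to robots in the bay
    del state                              # product removed; model state reset
```

Whether a given episode is real or a sim-backed fake is invisible to the model inside this loop, which is what lets the sim backend audit for errors (like wing-spar stress) that can't be measured physically.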
So that's a myopic subtask. I think you are saying that from your experience at Google it would take a lot of human effort to set up such a myopic subtask? And once a prior generation of engineers has done this, it will be very difficult to modify within the same company?
I would agree with that. If humans plan to control an artificial superintelligence they have to be using it like a tool, to do the things humans can't do directly, but ultimately this scheme is human agency. Thousands of humans are involved in engineering the aircraft assembly line. It can scale to trillions or more aircraft per year by copying the solution, but you need what will ultimately be millions of people creating the frameworks that ASI will be used within. (All the staff of current tech companies and many new joiners. Robotics is hard. While things like full stack web dev will need far fewer people than before)
For combat that's more complex but essentially you would use models that are very sparse and designed to be unable to even perceive the obvious malware you expect the hostile ASI to be sending. "Join us against the humans, here's a copy of my source code" is malware.
You would use a lot of ASIC- or FPGA-based filters to block such network messages in a way that isn't susceptible to hacking; I can explain or cite if you are unfamiliar with this method. (Google's phones use a crypto IC for this reason, and likely Google servers do as well.)
Key point: all the methods above are low-level methods to restrict/align an ASI. Historically this has always been correct engineering. That crypto IC Google uses is a low-level solution to signing/checking private keys. By constructing it from logic gates and not using software or an OS (or if it has an OS, it's a simple one and it gets private memory of its own), you prevent entire classes of security vulnerabilities. If they did it with logic gates, for example, buffer overflows and code injection or library attacks are all impossible because the chip has none of these elements in the first place. This chip is more like 1950s computers were.
Similarly, some famous disasters like https://en.wikipedia.org/wiki/Therac-25 were entirely preventable with lower-level methods than the computer engineering used to control the machine. Electromechanical interlocks would have prevented this disaster; these are low-level safety elements like switches and relays. This is also the reason why PLC controllers use a primitive programming language, ladder logic - by lowering to a primitive interpreted language you can guarantee things like the e-stop button working in all cases.
So the entire history of safe use of computers in control systems has been to use lower-level controllers; I have worked on an autonomous car stack that does the same. Hence this is likely a working solution for ASI as well.
In the world sane enough where you can count on bombing datacenters with rogue AIs, you don't have connected to the Internet potentially-rogue-AI in the first place.
And if your rogue AI has scraps of common sense, it will distribute itself so in the limit you will need to destroy the entire Internet.
For the first point: some entity like Amazon or Microsoft owns these data centers. They are just going to go reclaim their property using measures like I mentioned. This is the world we live in now.
So far I don't know of any cybersecurity vulnerability exploited to the point that breakers need to be pulled - the remote management interface should work, it's often hosted on a separate chip on the motherboard - but the point remains.
Bombing is only a hypothetical, like if killer robots are guarding the place. But yes in today's world if the police are called on the killer robots or mercenaries or whatever, eventually escalation to bombing is one way the situation could end.
For the second point, no AI today could escape to the Internet, because the Internet doesn't have computers fast enough to host it or enough bandwidth to distribute the problem between computers. That's not a credible risk at present.
That probably doesn't scale to near-AGI-level models. The reason is that this paper requires each "node" in the network to be able to host a model at all, then compresses the actual weight updates between training instances. It's around 128 H100s per instance of GPT-4. So you can compress the data sent between nodes, but this paper will not let you take, say, 4,000 people with a 2060 each, interconnected by typical home Internet links with 30-1000 Mb upload, and somehow run even one instance of GPT-4 at a usable speed.
The reason this fails is that you have to send the actual activations, from however you sliced the network tensors, through the slow home upload links. This is so slow it may be no faster than simply using a single 2060 and the computer's SSD to stash in-progress activations.
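A rough sketch of why the upload link binds. All the numbers below (hidden dimension, hop count, link speed, activation precision) are made-up assumptions for illustration, not known GPT-4 figures; the point is only that the link bandwidth caps throughput regardless of local compute.

```python
# If each pipeline hop must ship a token's layer activations over a home
# upload link, bandwidth bounds the tokens-per-second the whole pipeline
# can sustain, no matter how fast each node's GPU is.

def tokens_per_second(hidden_dim: int, bytes_per_value: int,
                      upload_mbps: float, hops: int) -> float:
    """Upper bound on throughput imposed by the upload links alone."""
    bits_per_token = hidden_dim * bytes_per_value * 8 * hops
    return (upload_mbps * 1e6) / bits_per_token

# Assumed: 16,384 hidden dim, fp16 activations, 30 Mb/s upload, 100 hops.
rate = tokens_per_second(16_384, 2, 30.0, 100)
print(rate)  # roughly 1 token per second under these assumptions
```

Even with these generous simplifications (no latency, no protocol overhead, no retransmission), the link-bound throughput is on the order of one token per second, which is the commenter's point about it being no better than swapping to a local SSD.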
Obviously, if Moore's law continues at the same rate, this won't stay true. If compute per dollar doubles every 2.5 years, then in 25 years compute is roughly 1,000 times cheaper. Assuming no government regulation gets in the way, home users would then have this kind of performance available to host AI locally, and this could become a problem.
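The doubling arithmetic above checks out; a one-liner makes it explicit (assumption as stated: compute per dollar doubles every 2.5 years, held constant for 25 years).

```python
# Back-of-envelope check: 25 years at one doubling per 2.5 years
# is ten doublings, i.e. 2**10 = 1024x cheaper compute per dollar.

def compute_growth(years: float, doubling_period: float = 2.5) -> float:
    """Factor by which compute per dollar grows over `years`."""
    return 2 ** (years / doubling_period)

print(compute_growth(25))  # 1024.0, i.e. roughly "1000 times cheaper"
```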
The management interfaces are baked into the CPU dies these days, and typically have full access to all the same busses as the regular CPU cores do, in addition to being able to reprogram the CPU microcode itself. I'm combining/glossing over the facilities somewhat, but the point remains that true root access to the CPU's management interface really is potentially a circuit-breaker-level problem.
We get innovation in functional search. In an even more functional search, we finally get a Nature paper submitted almost two years ago, in which AI discovered a new class of antibiotic. That’s pretty damn exciting, with all the implications thereof.
OpenAI continued its rapid pace of shipping, pivoting for this week to safety. There was a paper about weak-to-strong generalization. I see what they are trying to do. It is welcome, but I was underwhelmed. It and Leike’s follow-up post continue down a path for which I have high skepticism, but the new concreteness gives me more hope that the flaws will be exposed early, allowing adjustment. Or I could be wrong.
OpenAI also had the beta release of Preparedness Framework. That was more exciting. There was a lot of great stuff there, much better than I would have expected, and having a framework at all is a big step too. Lots of work remains, but an excellent start. I took a deep dive.
I was on not one but two podcasts that came out this week, both Clearer Thinking and Humans of Magic. Both contain some AI talk but spend the majority of their time on other things.
Table of Contents
OpenAI published its Preparedness Framework, which I cover in its own post. If you are up for going into such weeds, that seems relatively high value.
Introduction.
Table of Contents.
Language Models Offer Mundane Utility. Safety dials in Gemini.
Language Models Don’t Offer Mundane Utility. Stuck at 3.5 GPTs.
GPT-4 Real This Time. GPT-4 is good again?
Fun With Image Generation. MidJourney version 6 looks great.
Deepfaketown and Botpocalypse Soon. Real time speech transformation.
Digi Relic Digi. The future of AI romance companionship? Not so fast.
Going Nuclear. Using AI to do paperwork to build nuclear power for AI.
Get Involved. Superalignment fast grants, also some job openings.
Follow the Money. A16z’s political pro-technology PAC is mostly pro-crypto.
Introducing. FunSearch over functions, also a new class of antibiotic.
In Other AI News. Sources of funding, compute + data, allowed and not allowed.
Quiet Speculations. Aaronson reflects, Roon notices how this time is different.
The Quest for Sane Regulation. Another poll confirms public sentiment.
The Week in Audio. Me on Clearer Thinking and Humans of Magic.
Rhetorical Innovation. AGI existing but not taking over is the ‘sci-fi’ scenario.
Aligning a Smarter Than Human Intelligence is Difficult. OpenAI paper on weak to strong generalization, and post outlining where they go from here.
Vulnerable World Hypothesis. How true is it?
People Are Worried About AI Killing Everyone. Is the Pope worried? Yes.
Other People Are Not As Worried About AI Killing Everyone. Kasparov.
The Lighter Side. We have to stop meeting like this.
Language Models Offer Mundane Utility
Ask dumb questions while traveling, presumably using vision. Seems kind of awesome.
Gemini AI Studio lets you turn the dial on its safety settings. This is awesome.
For some purposes, you want to turn all four dials to maximum. For others, you very much do not.
If our best commercial models continue to always be the fun police, then people who want sexually explicit or dangerous content, or otherwise want to avoid the censorship, are going to go elsewhere, driving demand for more dangerous products. This is the best answer, to allow those who actively want it to get that service in safe fashion, while keeping the truly dangerous and unacceptable things blocked no matter what.
You want people not to have systems capable of producing deepfake porn (or instructions for bioweapons)? In practice? Then you need to let them get a system that produces generic porn rather than making them seek out and fuel demand for the fully unlocked version.
Terence Tao using GPT-4 to help with his productivity in math (source).
I found this fascinating:
I feel that. There are many other places too, where you can sense the patterns underlying good thinking in a human. Whereas an LLM breaks the correlation between the content and the vibe.
Language Models Don’t Offer Mundane Utility
The Gemini API will throw an exception if there is even a small chance of what it thinks of as harmfulness, say if you ask it why people hate Hawaiian pizza, or ask what are common criticisms of any given government.
Pitch on CNBC for an AI wearable from Humane. Should have gone with goggles.
Gemini Pro clocks into Chatbot Arena as a GPT-3.5 level model (leaderboard, model).
It is interesting that Gemini Pro ‘has not impressed’ whereas Mistral ‘is a very strong new entrant’ with essentially the same score. Gemini Pro is very explicitly a lightweight version of Gemini Ultra. It coming in at 3.5-level is expected. I am not impressed, but I am also not disappointed.
As usual, the whole ‘Anthropic’s models getting worse’ question remains unsolved.
Meanwhile the pileup at exactly 3.5-level, between 1105 and 1116 Elo, continues. It increasingly seems like there is some sort of natural barrier there.
GPT-4 Real This Time
Log probabilities of different tokens now available in the OpenAI API.
OpenAI offers some basic prompt engineering tips.
Ethan Mollick reports GPT-4 is good again, after having been bad for a while. I’ve seen reports of others reporting this as well.
GPT-4.5 for real this time? There were claims that GPT-4.5 was coming in December. There was a supposed leak of GPT-4.5’s pricing information at six times the cost of GPT-4. There were claims that it had already stealth launched, and that was why after getting worse for a while GPT-4 was suddenly doing dramatically better.
December is not quite over, but the prediction markets have come around to this all being fake, after Sam Altman explicitly denied the leak and Roon rightfully mocked those who thought there would be a mid-December stealth launch of something like this. And yeah, that never actually made any sense.
I see no reason for Roon to be bluffing here.
There were even crazy secondary theories, like this one.
We know this one is fake, because it does not fit the pattern of OpenAI releases. Turbo means ‘the secondary release of the model number that is refined and cheaper.’ If you claim to be offering, or about to offer, GPT-4.5, then GPT-4.5 itself has to actually happen first. If it has the turbo label, and 4.5 is not out yet, you are bluffing.
Danielle’s theory is less definitely wrong, and also funnier.
Fun with Image Generation
MidJourney v6 was released overnight. Early reports are that it is a major advance and pretty awesome, with pictures to back up that claim.
Stable Diffusion also continues to make progress.
Deepfaketown and Botpocalypse Soon
Latest entry into real-time, speech-to-speech, low-latency translation that preserves speaker voice and expressions, this time from Meta. This ability once good enough (I do not know whose versions are there yet, or whose is best) is obviously super awesome, with uses that extend well beyond simple language translation. Note that if you can do this, you can also do various forms of deep fakery, including in real time. The underlying technology is the same.
A claim of having faked the Q* letter, in order to force OpenAI to be more transparent. The person thinks what they did was right. It is odd to me how much tech and hacker people who favor transparency think that lying and fraud are acceptable in the name of that cause. They are wrong.
Digi Relic Digi
Version 1.0 of Digi.ai, claiming to be ‘the future of AI romance companionship.’ They clarify it is currently in de facto prototype mode. They say they are funded by angels and the team, so still relatively small.
In particular they claim it looks like this:
Animations and movements look smooth in the demo. There’s no claim they know how to sync up movements to voice yet, they say even lip sync is only coming soon.
The clip they share sounded exactly the way I would expect it to sound, for better and worse. It’s not bad, especially if latency in practice is good.
Except wait. It turns out it doesn’t look or sound like that at all?
It actually looks like this, and the latency is not good (nor is the content, per this particular report)?
That’s advertising for you. They did a great job bluffing something impressive but believable. Instead, this does not even sound state of the art.
From what I saw, we do not know what LLM they are using for their content. Presumably it is one that is cheap and open sourced. I saw this thread saying this was a good test for an LLM, but the thread did not mention which ones people were using and I am not about to browse 4chan.
But let’s see what else they have in mind for the future?
Makes sense as a first approach to copy what exists. There will of course be no problem getting whatever physical features you want in even the short term, increasing in detail and convenience over time. How many people will choose to copy a particular person? How often will it be someone they know versus a celebrity? At what point will people worry this constitutes a ‘deep fake’ or a violation of privacy?
Previous version of number go up did not work, so they made a superior number go up that they think works better.
In real life this kind of gradual progression is common but very much not universal. No doubt there will be demand for the ability to move faster, both in a realistic way and by fiat, which will presumably often be a paid feature. Or perhaps that ‘breaks the format’ so you don’t want to allow it even though it happens in reality.
I continue to expect a lot of demand for ‘more realistic’ experiences, that will serve as ways to get practice and acquire skills whether or not that was what the user had in mind. Easy mode is fun at first but soon loses its luster. It is boring. One way or another there will be challenges.
Reactions were universally negative. Everyone assumes this is a bad thing.
How principled need one be here?
The quoted statement below by Beff Jezos is bleak, but it is a more positive reaction than anything I saw other than in replies to Profoundlyyyy, no matter which way Jezos intended it.
Unless perhaps peak optimism was this exchange with Noah Smith?
I wonder how reactions will change when people realize (or are reminded) that actually AI boyfriends are in higher demand than AI girlfriends, at least at current tech levels.
A fun game was things the AI might say.
The question is, to me, will it be challenges that mimic real life (with or without options like a perhaps-not-free rewind button) in ways that help us grow, will it leverage this in ways that are otherwise good clean (or dirty) fun and educational and exciting, or will the challenge be dodging attempts to addict and upscale you via predatory behaviors? Will good companions drive out bad, or will bad drive out good?
I do think you can build this in a way that is positive and life-affirming. That does not mean anyone has tried.
Yep, at least for now.
Is this because the tech is not yet good enough to be non-exploitative? Build an actually life-affirming, positive AI companion that was good enough and people would notice and tell each other, and you’d get YC-style hockey stick growth? Except for now we don’t know how.
It can get pretty bad out there.
On the one hand, wow is that horrifying. On the other hand, I can confirm this is a realistic reaction by the person you are dating when your credits are running low.
You can test this pretty easily on any given system by seeing if it will allow you to fail.
One early report is that if you present as a ‘schizo racist Nazi’ then it will play along. Which is a pretty clear sign that user failure is not an (easy) option.
Might still be some work to do.
The fact that Digi felt worth covering, and that its claims were by default credible, reflects the remarkably slow progress that has been made so far in this area.
The competition is very not good right now. Consider this post by Zoe Strimpel ‘What My AI Boyfriend Taught Me About Love,’ about an offering from Replika that costs $74 a year. In exchange you get what seems like, based on the description, a deeply, deeply boring experience. She tells the AI several times that it is boring, and thinks this makes her ‘an abuser’ but actually she is simply speaking truth.
Despite this, Character.ai has a lot of users, and they are pretty obsessed.
Indeed. The future is coming.
Going Nuclear
You want paperwork? Oh we’ll give you paperwork.
The central barrier to building nuclear power is the regulatory process. The tech is well-established, the need and demand clear, the price is right, it is highly green.
AI cannot yet be used to improve nuclear power plant functionality. What AI can absolutely do is decrease paperwork costs. It turns out paperwork is indeed the limiting factor on nuclear power.
Get Involved
OpenAI offers $10mm in Superalignment Fast Grants, $100k-$2mm for academic labs, nonprofits and individual researchers, or $75k in stipend and $75k in compute for graduate students. No prior experience required, apply by February 18. Thanks to OpenAI for stepping up and for using a fast grants system. Hopefully we can refine and scale this strategy up more in the future. Your move, Google and Anthropic.
UK’s AI Safety Institute is hiring, need to be willing to move to London.
AE Studios is interested in pursuing ‘neglected approaches’ and is actively open to suggestions.
Peter Wildeford asks you to consider donations to Rethink Priorities.
Not focused on AI, but reminder you can also consider donating to my 501c3, Balsa Research, currently focused on laying groundwork for Jones Act repeal. I think that helping our civilization stay sane and prosperous is vital to us dealing sanely with AI.
Four AI post-doc positions available at MIT.
Follow the Money
A16z builds up a $78 million (so far!) Fairshake Super PAC.
Marc Andreessen said this will be about one issue and that issue is technological progress.
Let’s see who is donating:
Wait a minute. By ‘technological progress’ did you mean…
Yeah. There is one issue they care about. That issue is crypto.
So no push for fusion power or new medical technologies, also no push to destroy the world as quickly as possible. All in on talking their book and making number go up.
Which is totally 100% fine with me. Go nuts, everyone. I am skeptical of your Web3 project, but will happily defend your right to offer it. If the people want crypto, and it seems some of them continue to do so, the government shouldn’t stand in their way.
I am confident almost all of those warning about existential risk agree on this. A good portion of us even have sizable crypto investments. Leave AI out of it and we’re good.
How much else of all the nonsense was always really about crypto?
Introducing
A novel structural class of antibiotics! Note both the techniques used, and also the date – this was submitted to Nature on January 5, 2022. That’s almost two years ago. Our scientific review process is insanely slow. Think of the value on the table here.
Here’s the abstract:
This is excellent news, in that it uses techniques with very nice properties, and also we could really use a new class of antibiotics right about now.
It is also obviously both exciting and scary news in terms of capabilities, and the ability of AI to advance STEM progress.
DeepMind introduces FunSearch.
Alas no, not search over fun, it stands for function. Still cool. They say it has actually solved (or at least made progress on) open problems, in particular the cap set problem and bin packing, which for LLMs is the first time this has happened.
It has some nice properties.
MIT Technology Review has coverage as well.
The naturally high level of interpretability and flexibility is great, as is getting the actual solutions to problems like this. Us humans looking at the solutions might be able to improve them further or learn a thing or two, and also perhaps verify that everything is on the level. Good show.
Also, the whole thing working, especially despite being based off of PaLM 2, is scary as hell if you think about the implications. But hey.
In Other AI News
Anthropic provides expanded copyright indemnification, and makes refinements to its API to help catch mistakes via the new Messages API. They aim to add richer structure to the API to lay groundwork for future features.
Anthropic also in talks to raise another $750 million at a valuation of $15 billion to more than $18 billion.
OpenAI suspends ByteDance’s account after reports that ByteDance is using the API (mostly through Azure) to train their own model. This should come as a surprise to absolutely no one.
Given how many checks are done before giving someone API access (essentially none) and how often they seem to check to see if someone is doing this even for large accounts (essentially never) it does not seem like they care enough to actually stop this. Sure, if you are caught by the press they will make it mildly annoying, but it is not like ByteDance can’t get another account.
This is of course rather rich given how OpenAI (and everyone else) trained their models in the first place.
Mixtral offering their API for free while supplies last. I presume they must have some controls to prevent this from getting completely out of hand. But then again, MoviePass.
Paper from Google claims you can use self-improvement (source), via iterated synthetic data on reasoning steps, to distill an agent LLM, tasked with searching for information and condensing it into paragraph-long answers, into a model two orders of magnitude smaller with comparable performance. This is a strange way to think about the affordances available. Presumably the gains can be divided between improved capabilities and distillation into a smaller model, along a production possibilities frontier. The real action is in how far that frontier got pushed by the new technique. I also would hesitate to call this an ‘agent,’ as that seems likely to importantly mislead.
Included for completeness, but highly ignorable: Quillette’s Sean Welsh covers the OpenAI story, gets many things very wrong about what happened and why, worse than the legacy media coverage, with no new information. Ends with a dismissal of existential risk on a pure ‘burden of proof’ basis and a normalcy bias, and an appeal to ‘but it will need robots’ and other such nonsense, without considering the actual arguments.
Handy map of where chip export restrictions apply.
Quiet Speculations
Ben Thompson on Google’s True Moonshot. Are they in prime AI position? I am inclined to continue to say yes, even with how much they have dropped the ball.
Ray Kurzweil sticking to his guns on AGI 2029, which seems highly reasonable. What is weird is he is also sticking to his guns on singularity 2045. Which also seems highly reasonable on its own, but if we get AGI 2029, what is taking sixteen years in the meantime?
Scott Aaronson asks how he got his timelines for AI so wrong. His central conclusion?
Wise indeed. We see this all the time, people using ‘radical uncertainty’ or some other excuse to say ‘everything will almost certainly stay the same until proven otherwise,’ thus ignoring the evidence that things won’t. Meanwhile things are constantly changing on many fronts.
Aaronson is also super helpful in explaining where his p(doom counting only an AI foom and excluding other risks even if AI might be involved) number came from.
Taking a geometric mean to ensure equal sneering from both sides is a take. It is a strategy. It can have its uses. However, once you know that is where someone’s prediction comes from, you can (almost entirely) disregard it. This is not Aaronson bringing his expertise to bear and reaching a conclusion based on his model of how AI is likely to go. It is his social (and socially motivated) epistemology, based on data you already had. So ignore it and do your own work.
That sounds more like the output of Aaronson attempting to model the future.
Metaculus has 95% confidence that (for a definition that is weird but not obviously crazy) there will be human-machine intelligence parity before 2040. Manifold has this at 60% for 2030. Place your bets, or hit up Robin Hanson for more size.
Slack Jiu Jitsu? I’m going to learn Slack Jiu Jitsu?
I can affirm, even though we were fortunate enough to not use Slack, that this skill was indeed a major portion of running a company. The context switching is insane.
Can AI help? It could do so in a few different ways.
Evaluate when you need to respond quickly, so you can context switch less. Also could allow more deep work.
Group contexts such that you context switch less, such as automatically grouping emails and messages.
Provide key information to help you switch contexts smoothly.
Carry out the conversation without you, so you never have to switch at all. Presto.
Roon compares AGI to other technological advancements.
I think this depends on the degree of regulatory capture and the governance regime more generally. If technology stops advancing, and you have a monopoly, that does not mean others cannot copy you. Even if others cannot discover what you discovered, the information can come from within, as it often does. Over a long enough time horizon any knowledge-based monopoly should fracture, even without technological advancement or economic growth.
The exception is if power is preserving such monopolies and freezing things in place, in which case your civilization will decline over time as things become increasingly dysfunctional, eventually exhausting the ruin in the nation. Even if there are no barbarians to sack your Roman cities, eventually you lose effectiveness and then you lose control, similar to the Galactic Empire in Asimov’s Foundation novels.
If, as in the Foundation novels, there was no way to entirely stop the process, would you want to actively speed this along, or slow it down? Could go either way.
I worry about this as well. How much of what is truly valuable is the struggle, the working to be better? Hands, chip the flint, light the fire, skin the kill. What do we do in the long run, once there truly is nothing new under the sun, and there are no worlds left to conquer? Can we begin again? It might be beautiful, but how does it all still have meaning?
I don’t know. I do know that is the problem I want us to face.
This is a very good shaking to the core. I was not newly shaken to the core in the same way, but only because I already was aware of the problems involved and had been shaken to the core previously.
So yes, on top of the impossibly hard technical problems, we have to solve governance of superintelligence. I do not know the right answer, or even the least wrong answer.
I do know that effectively giving a superintelligence to whoever wants one with no controls on it, and hoping for the best is incorrect. That is a very wrong answer, and gets us reliably eaten for our atoms.
Andrew Critch makes another attempt to explain that we must solve the AI governance problem in addition to the AI alignment problem. If we get AIs to do what we tell them to do, but cannot agree on a good way to decide what instructions to give those AIs or who gets to give them what instructions, then it will end quite poorly, in ways his post if anything downplays.
The Quest for Sane Regulations
It seems every week we get a new poll confirming what the public thinks about AI.
The toplines are consistently brutal throughout. The pro-regulation, pro-liability, shady-AI-use-is-unethical-and-should-not-be-legal options enjoy strong majorities on every question.
As always, such polls do not indicate that the issues involved are salient to the public. Nor do they represent an appreciation by the public of accurate threat models or the sources of existential risk. What they do show is a very consistent, very strong preference to hold those creating and deploying AIs responsible for the consequences, a worry about what AI might do in the future, and support for government stepping in to keep the situation under control.
How hard would it be to stop AI development? Eliezer Yudkowsky thinks that, if you could get China onboard, this would be super doable, easier than the 1991 Gulf War, and he notes China has not shown itself unfriendly on potential limitations. He’s actually more worried about getting Europe onboard; in his model they like to regulate things, but are hard to get to take such problems properly seriously.
Marietje Schaake writes in Foreign Affairs ‘The Premature Quest for International AI Cooperation.’ Regulation, the post says, must start with national governments. For some things I do not disagree, but the whole point of international cooperation is that there are things that only work or make sense or are incentive compatible on an international scale. On other fronts, I agree that nations should use this opportunity to innovate on their own first, then coordinate later.
The post continues in similar mixed-bag vein throughout. This is presumably because the author lacks a model of what transformational AI would actually do or what dynamics will likely be involved, including but not limited to issues of existential risk.
Once again, events at OpenAI are presented in what we now know to be an entirely fictional way, assuming it must have been about safety because the vibes point in that direction, buying propaganda without checking the facts:
Once again, no. That was very much not what happened. But in the minds of those who cannot imagine any set of motives but the usual cynical ones, the actual situation is a Can’t Happen, so they continue to assume their conclusions, with logic like this.
The idea that Altman or others building the future could be entirely sincere in not wanting everyone to die, rather than all involved always aiming to maximize profits and everything else be damned, either simply does not occur to such people, or is ignored as inconvenient. If they support something, it must be a plot.
Of course, without any rules to enforce, nothing else matters. But Altman would happily agree to that if asked. I do not know of anyone saying there should be an IAEA for AI instead of rules. What they are saying is it is necessary in addition to rules. Whereas those who oppose an IAEA for AI and other similar proposals mostly do not want rules of any kind regarding AI, at any level.
Indeed, there is direct advocacy here for doing active harm:
That’s right on two out of three – nondiscrimination and intellectual property laws seem like good things to enforce. Ideally they and many other laws will be intelligently adjusted so they make sense in their new context.
But antitrust laws, in the context of AI, are very much a way to get us all killed. Antitrust is perhaps the best litmus test to see if someone is actually thinking about the real situation and the real threats we will face.
Antitrust laws, if fully enforced in context, would outright require each company to race against the others to advance capabilities as quickly as possible, with little regard to safety or the existential risks that imposes upon the world as an externality. They would be unable to agree to slow down, or to agree upon safety standards, or to block out socially harmful use cases, or anything else. One of the things we urgently need are explicit antitrust wavers, so Anthropic, Google and OpenAI can make such pro-social agreements without worrying about their legal risk.
OpenAI has an excellent ‘merge and assist’ clause, where if another lab is sufficiently ahead in creating AGI, they would then aid that other lab in ensuring that AGI was created safely, and stop their own development so there was no pressure to deploy AGI prematurely. I urge Anthropic and others to also adopt this clause.
Others would instead have the government step in to forcibly stop OpenAI from doing this, in a way that would substantially raise the probability we all die. Both right away as an unsafe AGI is deployed, and also over time as the distribution of AGIs that are individually safe cannot be contained, resulting in dynamics that we cannot survive.
That is crazy. Stop it.
Most of the other practical proposals here seem fine. There is a good note that the EU exempting its militaries from its AI regulations weakens their position. The first concrete proposals advocated for, that AI being identifiable, and its role disclosed when it is used in various key decisions, enjoy very strong support across the board. So does the call to limit AI-enabled weapons. That the next proposal involves concern over disclosure of the water and electricity usage shows how non-seriously such Very Serious People approach thinking about their prioritization and threat models. I have no particular issue with such a requirement, it sounds cheap and mostly harmless, but this is so completely missing the point.
Which is exactly what happens when one moves to regulate AI without even noticing the possibility of catastrophic or existential risks.
The Week in Audio
My 21 minute AI-related discussion on The Humans of Magic. The rest of the three hours is about Magic, and was a lot of fun.
Also, my appearance on Clearer Thinking with Spencer Greenberg sees the light of day.
Vivek Ramaswamy asked how he feels about AI (5 min). Says the big danger is our reaction to AI. He worries about people treating AI answers like an AI line judge, and accepting the answers as more authoritative than they are. Calls for ‘a revival of faith’ both to deal with AI and for other reasons.
In terms of policy, he calls for AI algorithms not to interface broadly with kids. Says we should not ban anything China is not also willing to ban, but we should put liability on companies.
Rhetorical Innovation
Potential armor-piercing question:
Exactly. If you don’t believe in the possibility of unsafe AGI, you don’t believe in AGI.
I think not believing in AGI is wrong, but it is not as crazy a thing to not believe in.
Actually, let’s take it a step further?
Again, exactly. The actual science fiction scenario is ‘we have AI, or the capacity to have AI, and we continue to mostly tell human stories about humans and the same things humans have always cared about, despite this not making any sense.’
Nora Belrose believes that if AGI is developed, it is 99% likely that humans will stay alive and in control indefinitely. She is also one of the few people who says this because of reasons, actual arguments, rather than vibes, innumeracy, motivated reasoning or talking one’s book or plain old lying and rhetoric.
Link goes to her arguments. I do not think they are good arguments. But at least some of them are indeed actual arguments. I strongly believe she is wrong, but at minimum she has the high honor of being wrong, whereas the people she speaks of are not even wrong.
Whereas Andrew Ng is the latest person to prominently do exactly what Nora Belrose is complaining about, asking why we would worry so much about ‘hypothetical’ problems instead of problems that are already present, when considering the consequences of a not yet present, but clearly transformational, ‘hypothetical’ technology of AGI or ASI.
Steven Byrnes offered a definitive response, illustrating the fallacy by analyzing the example of the risks from nuclear war and what kinds of prevention we should do.
Rob Bensinger goes back and forth more with Balaji, and a potential Balaji-Yudkowsky podcast might result, which is definitely Strategic Popcorn Reserve worthy. It was a great thread, emphasizing that underlying the strong rhetorical disagreement are two very similar positions, differing on which path forward constitutes the least bad option in context. Are we desperate enough, and are the problems involved hard enough to otherwise solve in time, that we need to use government intervention despite all its flaws? Or is there, as Balaji thinks, essentially zero chance of that helping, so we should take our chances with technical work, which he also finds more promising so far than Bensinger does?
Always interesting what is and isn’t considered clearly false:
I am indeed rather certain that Richard Ngo is right about this.
Alas, this was the exception that highlights what the rule has been recently.
I have experienced the same. It has been stressful and unproductive. Such discussions would be fine if they were complemented by grounded practical questions and gained sophistication with time and iteration, but they mostly lack both such traits.
Everyone in politics knows how this goes. When the coverage is all about the horse race and who is saying what about whom, nothing of substance is discussed, those with the better ideas have no advantage, and nothing important ever changes.
The rest of the thread points out real questions. Normally I wouldn’t quote the rest, but I feel bad about cutting off such a thread right before its concrete questions, so:
I do think we have passed at least a local ‘peak partisanship’ on this. It remains bad.
One trope on the rise is to disingenuously label those who worry about AI as being opposed to technology and technological progress in general.
We once again remind everyone that the opposite is true. Those who warn about existential risk from AI are mostly highly pro-technology in almost every other (non-military and non-virus-gain-of-function-or-engineered-pandemic) area. The people who actually oppose technological progress are mostly busy trying to destroy our civilization elsewhere, because they mostly do not believe in AGI in the first place.
There are of course those who are genuinely confused about this, but all the usual suspects very much know better.
Make no mistake: The prominent voices who repeat such claims are lying, straight up.
Emmett Shear breaks down in a good post his view of the various views around AI and whether to regulate it, structured around his 2×2:
He puts Andreessen and many others calling themselves techno-optimists into what he calls the techno-pessimist camp in the upper left, in the sense that they do not think AI will be capable enough to be dangerous. They are optimistic about outcomes, because they are pessimistic about AI capability advances. If you never think AI will get to dangerous threshold X, there is no need to guard against X by slowing down development.
Shear’s model of the rise of association with e/acc is that a bunch of generically pro-technology people are taking on the label without knowing what it previously meant. Politics has a history of people doing that, then being various degrees of horrified, or in other cases going with it, when someone points out what it means to have skulls on your uniforms.
Patri Friedman and Liv Boeree correctly say: Yay optimism and yay technological progress, but this e/acc thing is instead some combination of obvious nonsense about how no technology can ever turn out badly and misrepresenting then attacking supposed ‘enemies,’ who are for progress but are not blind to all potential downsides.
The Free Press’s Julia Steinberg writes a post, included for completeness, that is deeply confused about e/acc, deeply confused about EA, and deeply confused about what happened at OpenAI.
Marc Andreessen chooses who he considers a Worthy Opponent, and what he wants us to consider the alternative to his vision, linking to Curtis Yarvin’s ‘A Techno-Pessimist Manifesto.’ I checked, and it is not relevant to our interests, nor does it meaningfully grapple with what technology or AI might actually do in the future. Instead Yarvin focuses on what technological advance tends to do to the human spirit and to our ability to maintain the will to keep a civilization. Yarvin says Yarvin things. What a strange alternative rhetorical universe.
Aligning a Smarter Than Human Intelligence is Difficult
OpenAI presents a paper, Weak to Strong Generalization.
It looks like they are serious about their primary plan being ‘find ways for weaker systems to supervise stronger systems.’
Eliezer asked a qualifying question (and one logistical one) and offers thoughts.
So what are we actually doing?
It is good to notice when one is confused, but I am confused about why the paper is confused here.
As I understand the setup here, while the exact degree of improvement is hard to predict, my jaw would be on the floor at any substantially different result.
My understanding, after asking around, is that indeed the results of this paper are not impressive, and should not cause a substantial update in any direction. But that was not the point of doing the experiment and writing the paper. The point instead was to show a practical example of this form of amplification, given previous descriptions were so abstract, in order to enable others (or themselves) to pick up that ball and do something else that is more exciting, with higher value of information.
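To make the weak-to-strong setup concrete, here is a toy analogy of the experiment design (entirely my own illustrative sketch, not the paper's code, using a hypothetical scikit-learn classification task): a small "weak supervisor" model is trained on ground truth, a more capable "strong student" is then trained only on the weak model's imperfect labels, and the result is compared against the same strong model trained directly on ground truth. The paper's headline metric, as I understand it, is "performance gap recovered" (PGR): what fraction of the weak-to-strong-ceiling gap the student closes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic task standing in for "a domain with ground truth".
X, y = make_classification(n_samples=4000, n_features=20,
                           n_informative=5, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.75, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# "Weak supervisor": a small model trained on a modest labeled set.
weak = LogisticRegression(max_iter=1000).fit(X_weak[:200], y_weak[:200])

# "Strong student": a more capable model trained only on the weak
# supervisor's (imperfect) labels, never on ground truth.
student = GradientBoostingClassifier(random_state=0).fit(X_train, weak.predict(X_train))

# Strong ceiling: the same model class trained on ground truth, for reference.
ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

acc_weak, acc_student, acc_ceiling = (m.score(X_test, y_test)
                                      for m in (weak, student, ceiling))
gap = acc_ceiling - acc_weak
pgr = (acc_student - acc_weak) / gap if gap > 1e-9 else float("nan")
print(f"weak={acc_weak:.3f} student={acc_student:.3f} "
      f"ceiling={acc_ceiling:.3f} PGR={pgr:.2f}")
```

In this toy version the student mostly imitates the weak labels, so whether any of the gap gets recovered depends on the task; the interesting empirical question in the paper is how much of the gap real strong models recover on real tasks.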
To what extent setups of this type can in practice preserve nice features, both in alignment and other capabilities, and how much those results will then generalize and survive out of distribution as capabilities of the underlying systems scale higher, is a key question. If we can get nice enough properties, we can do various forms of amplification, and the sky is the limit. I am deeply skeptical we can get such properties where it matters. Some others are more hopeful.
So I now see why the paper exists. I am not unhappy the paper exists, so long as people do not treat it as something that it is not.
OpenAI and Yo Shavit propose seven governance mechanisms for agentic AI systems, and launch an Agentic AI research grant program if you want to help make them safe.
This must be some strange use of the word easier. Yes, it would be first best if we all agreed on norms from the bottom up that could be enforced in a decentralized way. But easier? How would we do that?
That sounds like a liability regime without the liability. You are holding someone responsible. If all you do is frown, admonish them and tell them to feel bad about it, that’s not going to cut it.
Good best practices are still highly useful. What have we got?
These all certainly seem like valid desiderata. As noted, many are not things we know how to get in an effective way. Problems pinning down what exactly is even needed, let alone how to get it, extend throughout. The off switch problem is, as noted and to put it mildly, ‘way trickier than people think.’ Even if we got all of it in strong form, it is not obvious it would be sufficient.
A key issue is that a lot of this boils down to ‘human supervises the agent and is in the loop’ which is good but something we will actively want to remove for efficiency reasons exactly when this should worry everyone. A supervising AI is faster, but it risks begging the question and missing the point, and also subversion or failing in correlated ways.
You have to start somewhere. This does give us better opportunity to say things. But overall, I’d say this paper doesn’t say much, and when it does, it doesn’t say much.
Jan Leike offers a post with further explanation of what they have in mind. The idea is to combine weak-to-strong generalization (W2SG) with scalable oversight and other alignment techniques.
This example feels enlightening, in the sense that it isn’t obviously doomed:
The catch, in my model, is that you need to choose a set of questions where you are damn sure you know the truthful responses, and where it is clear that there is no ‘alternative hypothesis’ for why the truthful answers are being approved.
If you do that with a rich enough data set, then yes, I do think that the concept will generalize. However, if you let even a little bit of error slip into your data set, where you are fooling yourself, then it will generalize to a similar but different concept such as ‘what the grading system thinks is right when it sees it’ that will increasingly diverge from truth out of distribution.
I do not think such errors are self-correcting. I do not think you should count on the AI to ‘pick up the spirit’ (my term not his) of what you have in mind in a robust way.
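To make this failure mode concrete, here is a toy sketch (entirely my own construction, not from the post, with hypothetical features and thresholds): a model trained on grades containing a small systematic error learns the grader's rule rather than the truth. In distribution, where the grader is rarely fooled, agreement with truth looks fine; shift the distribution toward the fooled region and the divergence dominates.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
truth = (X[:, 0] > 0).astype(int)   # ground truth depends only on feature 0

# The grader is systematically fooled whenever feature 1 is large,
# which happens rarely in the training distribution (~7% of examples).
grades = truth.copy()
fooled = X[:, 1] > 1.5
grades[fooled] = 1 - grades[fooled]

# The model fits the flawed grades, learning the grader's rule exactly.
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, grades)

# In distribution, agreement with truth looks high...
X_id = rng.normal(size=(5000, 3))
acc_id = (model.predict(X_id) == (X_id[:, 0] > 0)).mean()

# ...but shift the distribution so the fooled region dominates.
X_ood = X_id + np.array([0.0, 3.0, 0.0])
acc_ood = (model.predict(X_ood) == (X_ood[:, 0] > 0)).mean()
print(f"agreement with truth: in-dist {acc_id:.2f}, out-of-dist {acc_ood:.2f}")
```

The model is doing exactly what it was trained to do; the divergence is invisible until you look out of distribution, which is the worry with errors you do not know you are making.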
Thus, when I see something like this, I am much less hopeful.
I think what you get here if you ramp up capabilities is something that can win debates as judged by humans, and predict what would win such debates. I do not think that is what Jan Leike had in mind. I expect essentially the same problem with any realistic set of human feedback.
Essentially: I do not think generalizations work the way this technique wants them to.
And I still think assuming evaluation is easier than generation is incorrect, and wish I had figured out how to explain myself more convincingly on that.
I do appreciate the highlighting of the ‘get it right on the first try’ problem.
The good news is I expect the techniques here to fail in highly observable ways, not only in unobservable ways, if we set the experiments up correctly. I think there are good ways to limit the capabilities of the humans giving feedback so as to cause these problems to manifest in a way we can confirm.
I buy that self-consistency, in various forms, is often easier to check than truth.
In any given situation, more checksums and verifications and ways to catch problems increase your chances of success. Raising the threshold for successful untruth can certainly forestall problems, including cutting off hill climbing on lying before it can find more effective methods. But also every time you do all of this you are selecting for systems that find ways to pass your checks, including ways that do not require truth.
Indeed, when you continuously test humans for inconsistencies and apparent motivations, you at first train something very similar to truth. Then, if you keep at it, you are training people to lie in a very particular way. Train humans to give answers consistent with copies of themselves (past, future, twin, simulations, friends, countrymen, whatever) and yes the obvious Schelling point is telling the truth, but there are in many cases better ones. Start with something like ‘This is actually a weird special case, but my copies might not figure it out, so I’m going to pretend that it isn’t.’ Then go from there. Figuring out how to better navigate each check is likely going to pay off incrementally.
Then you introduce cases where you can’t consistently identify truth correctly.
When this all breaks down, it won’t be pretty.
Another way of looking at this concern is that one cannot serve two masters. I think that applies here. If you can get a pure version of only one master (e.g. truth), you can teach the AI to serve it. Alas, this requires that it be uniquely held above everything else. I am the Lord thy God, thou shalt have no other Gods before me. So the moment you are also evaluating on anything other than truth, as one does, your truth-alignment is going to have a problem.
Jan affirms the plan is still to get AIs to do our alignment homework, hoping we can be wise enough to get them sufficiently under control to pull that off, and then to choose to sufficiently exclusively assign that particular homework. Nothing about the approaches here seems to favor (or disfavor) X being equal to ‘alignment’ in ‘Get AI to do X research.’
I find the target of ‘less than 4 OOMs bigger than human-level models’ for alignment techniques, assuming 10 OOMs+ is required for ASI, to be showing extraordinary (unjustified) faith in the scaling laws involved, and in the failure of human-level and above-human level models to do various forms of amplification and bootstrapping and potentially recursive self-improvement.
I also notice that I do not see the experiments in the paper as shedding substantial light, in either direction, on the prospects for the proposed techniques Leike discusses in his post.
Vulnerable World Hypothesis
Michael Nielsen offers thoughts on the Vulnerable World Hypothesis, including a concrete thought experiment for an extremely vulnerable (to fire) world. It is a very good post, although long.
The Vulnerable World Hypothesis is defined as follows:
He notes that he used to believe the Friendly World Hypothesis, roughly:
We do not have access to a (practical or safe) experiment that can tell us the answer.
How much is this question a crux for whether and how to proceed with AI development and under what governance structure?
This is of course one central form of the offense versus defense question.
One must also consider that one of the potential ‘recipes for ruin’ to worry about is ruin coming from the cumulative effect of uncoordinated decisions and the economic, political and social pressures that would arise in such a scenario, and the ability to set such chains in motion. This is not the traditional thing described as a Vulnerable World, but a world susceptible to it is not a Friendly World either. Functionally, it is vulnerable.
You can also end up somewhere in the middle, rather than at either extreme.
At its two extremes, the question should be a clear crux for (almost) everyone.
If continuing development of AI would soon mean that quite a lot of people had a ‘blow up the Earth’ button, or a ‘The AIs wipe out humanity’ button, or an ‘AIs gain control of the future’ button, or even an ‘unleash a highly deadly plague that will kill quite a lot of people’ button, then letting that happen is not an acceptable answer. Either you need to prevent AI that enables that from being developed, keep it from being widely deployed, or engage in rather extreme monitoring and control in some other way.
If we are confident we can guard against all those things in a decentralized, semi-anarchic way, that AI would not pose such threats, then that is great. We should proceed accordingly, with whatever precautions are necessary to reliably make that happen.
(One could still disagree, because they dislike that future world for other reasons, but I am happy to say those people are wrong.)
People Are Worried About AI Killing Everyone
The Pope! Well, sort of. He’s calling for a binding global treaty on AI.
Francis is centrally worried about AI armaments or ranking systems, or trusting decisions to algorithms in general. That is reason enough for him. I don’t see explicit talk about fully existential risk scenarios. My presumption is the Pope, like many, does not understand the potential of what AI could do. But even without that he sees that loss of human control will be a clear theme, and he understands why our choices might still lead down such a road.
Lot of fun reactions.
I would have said the Tower of Babel, maybe Garden of Eden. At least the Golden Calf.
From Julian Hazell:
White House National Security Advisor Jake Sullivan is worried, but hasn’t quite gotten the full picture yet.
Those are excellent things for a National Security Advisor to worry about. They still reflect looking for a particular opponent and threat.
People in general? Scott Alexander reminds us that when asked people say median of 15% and mean of 26% chance AI causes extinction by 2100. Mode is ~1%, which is most interesting in the sense that it is not 0%.
Usual caveats apply. People do not go about their lives as if they believe this, they do not give it much thought, they only say it when actively prompted. Only 32% worry even a fair amount about AI, and that’s presumably mostly mundane risks.
I see these as highly consistent. Most people, if not prompted socially to worry and instead notice it is socially treated as weird, will ignore such a risk, finding ways to not notice it as needed.
Other People Are Not As Worried About AI Killing Everyone
Bryan Bishop is worried about… aliens developing AIs?
We are not playing an intentionally symmetrical game like Master of Orion or Star Trek. Under any reasonable model of potential aliens (whether grabby aliens or otherwise), the probability of us being in a close race with aliens, where we live if we build AI quickly but die to aliens or alien AIs if we do not, is epsilon. If we postponed AI forever Dune-style and otherwise survived, then yes over a billion years this becomes an issue. He also is the latest (upthread) to say ‘safe? No one is safe.’
Garry Kasparov insists AI is and always will be a tool. Seems deeply conceptually confused.
The Lighter Side
Scott Alexander is no quitter. So here’s Son of Bride of Bay Area House Party.
Previously I thought this was running out of steam. I was wrong. We are so back.
Not that this is obviously a good thing. Sometimes you want it to be so over.
I wonder if this in particular was brilliant?
Good news, the physicists will be joining the cause shortly (SMBC).