I strongly claim that the amount of terminal utility from ‘simulate a billion fruit flies each in a state of orgasmic bliss’ is zero. Not epsilon. Zero.
I was with you up to this. I don't see how anyone can have high justifiable confidence that it's zero instead of epsilon, given our general state of knowledge/confusion about philosophy of mind/consciousness, axiology, metaethics, metaphilosophy.
When I look ahead, what I see is AIs increasingly presenting adversarial examples to humans.
I wrote a post also pointing this out. It seems like an obvious and neglected issue.
What do we know about Dario Amodei (CEO of Anthropic)?
A few years ago I saw a Google doc in which he was quite optimistic about how hard AI alignment would be (IIRC about level 2 in Sammy Martin's helpful Ten Levels of AI Alignment Difficulty). I think he was intending to publish it but never did. This (excessive optimism from my perspective) was one of the reasons that I declined to invest in Anthropic when invited to. (Although interestingly Anthropic subsequently worked on Constitutional AI which is actually level 3 in Sammy's scale. ETA: And also Debate which is level 5.)
I think that before this announcement I'd have said that OpenAI was at around a 2.5 and Anthropic around a 3 in terms of what they've actually applied to existing models (which imo is fine for now, I think that doing more to things at GPT-4 capability levels is mostly wasted effort in terms of current safety impacts). Prior to the superalignment announcement I'd have said OpenAI and Anthropic were both aiming at a 5, i.e. oversight with research assistance, and DeepMind's stated plan was the best at a 6.5 (involving lots of interpretability and some experiments). Now OpenAI is also aiming at a 6.5, and Anthropic now seems to be the laggard, still at a 5 unless I've missed something.
However the best currently feasible plan is still slightly better than either. I think e.g. very close integration of the deception and capability evals from the ARC and Apollo research teams into an experiment workflow isn't in either plan and should be, and would bump either up to a 7.
I don't see how anyone can have high justifiable confidence that it's zero instead of epsilon, given our general state of knowledge/confusion about philosophy of mind/consciousness, axiology, metaethics, metaphilosophy.
I tend to agree with Zvi's conclusion although I also agree with you that I don't know that it's definitely zero. I think it's unlikely (subjectively like under a percent) that the real truth about axiology says that insects in bliss are an absolute good, but I can't rule it out like I can rule out winning the lottery because no-one can trust reasoning in this domain that much.
What I'd say is just in general in 'weird' domains (AI Strategy, thinking about longtermist prioritization, metaethics) because the stakes are large and the questions so uncertain, you run into a really large number of "unlikely but not really unlikely enough to honestly call it a pascal's mugging" considerations, things you'd subjectively say are under 1% likely but over one in a million or so. I think the correct response in these uncertain domains is to mostly just ignore them like you'd ignore things that are under one in a billion in a more certain and clear domain like construction engineering.
I tend to agree with Zvi’s conclusion although I also agree with you that I don’t know that it’s definitely zero. I think it’s unlikely (subjectively like under a percent) that the real truth about axiology says that insects in bliss are an absolute good, but I can’t rule it out like I can rule out winning the lottery because no-one can trust reasoning in this domain that much.
I'm not aware of any reasoning that I'd trust enough to drive my subjective probability of insect bliss having positive value to <1%. Can you cite some works in this area that caused you to be that certain?
What I’d say is just in general in ‘weird’ domains (AI Strategy, thinking about longtermist prioritization, metaethics) because the stakes are large and the questions so uncertain, you run into a really large number of “unlikely but not really unlikely enough to honestly call it a pascal’s mugging” considerations, things you’d subjectively say are under 1% likely but over one in a million or so. I think the correct response in these uncertain domains is to mostly just ignore them like you’d ignore things that are under one in a billion in a more certain and clear domain like construction engineering.
What are some other specific examples of this, in your view?
I was also disappointed to read Zvi's take on fruit fly simulations. "Figuring out how to produce a bunch of hedonium" is not an obviously stupid endeavor to me and seems completely neglected. Does anyone know if there are any organizations with this explicit goal? The closest ones I can think of are the Qualia Research Institute and the Sentience Institute, but I only know about them because they're connected to the EA space, so I'm probably missing some.
I acknowledge that there are people who think this is an actually good idea rather than an indication of a conceptual error in need of fixing, and I've had some of them seek funding from places where I'm making decisions and it's super weird. It definitely increases my worry that we will end up in a universe with no value.
Climbing the ladder of human meaning, ability and accomplishment for some, miniature American flags for others!
If the standard is ‘you have to prove that you own the rights to every component used to train every input that you used’ then that is damn close to a ban on generative AI.
It's worse than that. It's a ban on generative AI for small players, but not for big players who can afford to buy lots of data and/or defend their data-gathering techniques with lawyers.
This seems especially unlikely to work given it only gives a probability. You know what you call someone whose superintelligent AI does what they want 95% of the time? Dead.
if you can get it to do what you want even 51% of the time and make that 51% independent on each sampling (it isn't, so in practice you'd like some margin, but 95% is actually a lot of margin!) you can get arbitrarily good compliance by creating AI committees and taking a majority vote.
Yep, if you can make it a flat 51% that's a victory condition but I would be shocked if that is how any of this works.
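For concreteness, here is a minimal sketch of the amplification arithmetic behind that comment, under the independence assumption it flags (and which, as the reply notes, is almost certainly not how any of this actually works). The function name and the numbers are purely illustrative:

```python
# Toy model: n committee members, each independently "complies" with probability p.
# The committee complies when a strict majority does. Under independence, error
# shrinks rapidly with committee size; with p=0.51 it still shrinks, just far slower.
from math import comb

def committee_compliance(p: float, n: int) -> float:
    """Probability that a strict majority of n independent members comply."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 25, 101):  # odd sizes avoid ties
    print(f"p=0.95, n={n:>3}: {committee_compliance(0.95, n):.6f}")
```

The entire load-bearing step is the independence assumption; correlated failures across committee members break the argument.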
Will this be an alignment effort that can work, an alignment effort that cannot possibly work, an initial doomed effort capable of pivoting, or a capabilities push in alignment clothing?
As written, and assuming that the description is accurate and means what I think it means, this seems like a very substantial advance in marginal safety. I'd say it seems like it's aimed at a difficulty level of 5 to 7 on my table,
https://www.lesswrong.com/posts/EjgfreeibTXRx9Ham/ten-levels-of-ai-alignment-difficulty#Table
I.e. experimentation on dangerous systems and interpretability play some role but the main thrust is automating alignment research and oversight, so maybe I'd unscientifically call it a 6.5, which is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.
As ever though, if it doesn't work then it doesn't just not work, but rather will just make it look like the problem's been solved and advance capabilities.
I wonder whether there's a niche for generative AIs exclusively trained on works in the public domain...
some type safety
Recent Python versions support type annotations, and you can use mypy to check/enforce them in your codebase.
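For what that looks like in practice, here is a small hypothetical sketch (file and names made up); running something like `mypy example.py`, or `mypy --strict`, flags the type error before the code ever runs:

```python
# example.py: annotate data structures and functions, then let mypy check callers.
from dataclasses import dataclass

@dataclass
class Prompt:
    text: str
    max_tokens: int = 256

def truncate(prompt: Prompt, limit: int) -> Prompt:
    """Return a copy of the prompt clipped to `limit` characters."""
    return Prompt(text=prompt.text[:limit], max_tokens=prompt.max_tokens)

ok = truncate(Prompt("hello world"), 5)  # fine
# bad = truncate("hello world", 5)       # uncommented, mypy reports: "str" is not "Prompt"
```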
The big news of the week is OpenAI introducing the Superalignment Taskforce. OpenAI is pledging 20% of compute secured to date towards a four year effort to solve the core technical challenges of superintelligence alignment. They plan to share the results broadly. Co-founder and OpenAI Chief Scientist Ilya Sutskever will lead the effort alongside Head of Alignment Jan Leike, and they are hiring to fill out the team.
That is serious firepower. The devil, as always, is in the details. Will this be an alignment effort that can work, an alignment effort that cannot possibly work, an initial doomed effort capable of pivoting, or a capabilities push in alignment clothing?
It is hard to know. I will be covering these questions in a post soon, and want to take the time to process and get it right. In the meantime, the weekly post covers the many other things that happened in the past seven days.
Language Models Offer Mundane Utility
Open source models often do well on open source benchmarks. So do foreign efforts like Falcon or Ernie. When it comes to actual mundane utility, or tests that were not anticipated in advance, the answer seems to come back somewhat differently.
Falcon-40B is down at 5.17.
Note that reasoning is the place where GPT-4 has the largest edge.
Will they offer all that mundane utility for free? David Chapman thinks that without a moat no one will make any money off of LLMs. Other than Nvidia, of course.
I do not think this is right. The details of model quality matter a lot, as does the fine tuning and prompt engineering and scaffolding, as do habits, as does trust and successfully hooking the LLM up into the proper contexts. Everyone keeps saying ‘here come the open source models,’ yet no, they have not yet actually matched GPT-3.5.
Latest Art of the Jailbreak. At this point, report is that all the known direct generic jailbreak prompts have been patched out. You can still trick GPT-4 psychologically, or by using the internet’s favorite trick of providing a wrong answer for it to correct.
Language Models Don’t Offer Mundane Utility
Therefore, the theory goes, if they can duplicate your work, neither do you.
There are good reasons to have an essay or a book that can be summarized. That doesn’t make your essay bad. Most people will only remember the summary, many will only read one in the first place, the rest is there as supporting evidence and further explanation for those who poke at such things to find and report back. Great ideas do compress, as long as you don’t mind making cuts and not offering proof.
The interesting claim is that if the LLM can understand your core thesis well enough to summarize it correctly, that means your idea was conventional, because the LLM would otherwise substitute the nearest conventional thing, ignoring what you actually wrote or didn’t write. There is a lot of truth in that.
The LLM will summarize what it expects you to say, the prediction that most things will follow the conventional patterns overriding what was actually said. If what it expects you to have said is what you actually said, then why say it?
The counterargument is that the LLM not being able to summarize correctly might also mean you did not do your job explaining. As an author, if you say nuanced unconventional things, and everyone who reads your text comes away with the same wrong reading as an LLM would, that’s you failing, and the LLM can alert you to this.
How correlated are these mistakes? If people are asking LLMs for summaries, then very highly correlated. If people are reading on their own, likely still rather correlated.
Patrick McKenzie is noticing the correlation already increasing. The threshold for using LLM-generated summaries apparently does not even require that they avoid directly contradicting the main themes of the post being summarized.
I would certainly insist upon the following: If it is safe to
This is also a situation that calls for better prompt engineering. I bet you could get a much better summary of Chapman’s book out of (what I assume was Claude) by giving Claude the expectation that the arguments were going to be unconventional and unique, and to explicitly contrast the book’s case with typical AI risk arguments.
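A hypothetical version of that prompt (not a tested recipe; the wording and names below are my own) might look something like this:

```python
# Prime the summarizer to expect unconventional arguments and to contrast them
# with the standard versions, rather than snapping back to the nearest cliché.
SUMMARY_PROMPT = """You are summarizing a book whose core arguments are unusual and
do NOT follow the standard AI-risk case. Before writing the summary:
1. List the three claims most likely to be mistaken for conventional arguments.
2. For each, state explicitly how the author's actual claim differs.
Then write a 200-word summary that preserves those differences, quoting short
phrases from the text wherever a claim is unconventional.

Text:
{book_text}
"""

def build_prompt(book_text: str) -> str:
    return SUMMARY_PROMPT.format(book_text=book_text)

print(build_prompt("...excerpt goes here..."))
```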
This blog is a case where I do not expect LLMs to be able to summarize accurately or usefully, nor do I expect most people to be able to summarize that way. I am sad about that but unwilling to pay the price to make it untrue. Doing so would provide less net value, not more.
I disagree with Yudkowsky here that the LLMs could get over this next week. They could improve. There could be particular mistakes the LLMs make less often, in terms of substituting and hallucinating, and I can imagine a week of prompt work helping a lot here. The gap should still remain, because some things are legitimately much harder to summarize than others, the audience not knowing something makes explaining it harder still, and unconventional ideas are always going to have lower priors and thus be more likely to be misunderstood or dropped.
Nassim Taleb continues to note that sometimes LLMs hallucinate. Says that makes it ‘unreliable for serious research.’ Use tools for what they are good at.
ChatGPT’s browse feature taken offline for being useful, until it can be fixed.
Fun with Image Generation
Valve is having none of it. They officially say they ‘don’t want to discourage AI’ in games on Steam. Their actual decisions still amount to: No AI for you. Not on Steam.
As in:
If the standard is ‘you have to prove that you own the rights to every component used to train every input that you used’ then that is damn close to a ban on generative AI. Interesting choice there. Imagine if human artists had to own the rights to all of their influences. But if you’re Valve, why take firm risk (as in, risk the company) if you think there’s even a small chance you might get found liable in such spots in an absurd fashion? Especially given that a lot of people favor artists and will applaud an anti-AI stance.
What happened here seems to be that the creator in question did something pretty much everyone would agree was ok – use AI for placeholder art, then do human work on top sufficiently that you make it your own. But once the standard becomes ‘prove you are allowed’ you are already dead.
Also the comments are full of people pointing out other games with obvious AI assets.
AI image and video models will continue to rapidly improve. You’ll want to still use artists as well, but not using AI art at all is going to rapidly become an increasingly expensive proposition.
This thread has more details and analysis. We don’t know what form the ‘proof you own the rights’ requirement has to take, or how it applies to tools such as Photoshop’s AI fill. As one response notes, this makes it scary to even commission artwork, since you can’t prove that the artist didn’t in any way use any AI tools. The policy applies not only to images but also to text, and presumably voiceover as well; in theory anything in the game that comes from an LLM means you’re banned.
Right now this is clearly massively selective enforcement. If you’re going to impose a rule like this it needs to be enforced. And that needs to inform your rules. Otherwise, you’re creating a situation where everyone is in violation, everyone at risk of being taken down, and the decisions are political.
Selective enforcement also does not protect you against firm risk. If anything it makes it worse. You have a policy that shows you know about the copyright violations, and then you don’t enforce it. If there are going to be absurd fines, this is big trouble.
Deepfaketown and Botpocalypse Soon
The Art of the Super Prompt
They Took Our Jobs
G/O Media, which includes The Onion, A.V. Club, Jezebel, Kotaku, Quartz, The Root, The Takeout, Deadspin and Jalopnik, is going to begin a ‘modest test’ of AI content on its websites. They claim ‘these features aren’t replacing current writers’ but we all know what happens once you go down this road, at minimum you destroy the brand value and credibility involved. Also this came a few weeks after a round of layoffs. Zero people fooled. The response statement from the union understands the implications, but alas instead of being written by Onion writers it reads like GPT-4 could have written it.
Translators have been feeling the AI pressure for a decade. So far they’re mostly holding up all right, because computer translation is insufficiently reliable, and there’s a lot of demand for the new hybrid process where the translator error checks the AI to get you a 40% discount. That works because the AI service is essentially free, so the humans still capture all the value, and lower cost induces higher demand. It does seem like quality levels are starting to create strain on the system, but we’re still a ways off of good technical translation in medicine or law.
Introducing
GPT-Migrate your codebase from one language to another. Is it possible this type of trick could allow one to stop programming in Python? What I wouldn’t do for C# and some type safety (the devil I know), or the ability to pick an actually good language, if it didn’t mean all the AI tools and resources would be infinitely harder to use.
Google’s first ‘Machine Unlearning’ challenge. That’s right, we’re here to master the art of forgetting, except intentional instead of catastrophic.
Forgetting is most definitely a thing, but much harder than learning. It also is never going to be complete. Yes, you might forget a particular piece of information, but path dependence is a thing, everything that happened to the model since it saw the data was slightly altered, people responded differently to different outputs and you updated on that too, you can’t get to the counterfactual state that didn’t happen without restarting the whole process.
One challenge is that it is largely possible to find out if particular data was used to train a particular model. This is great news including for interpretability, except that it is bad news if you want to avoid blame and comply with nonsensical laws like the right to be forgotten.
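The standard way to do this is a membership-inference test. Below is a toy sketch (my own illustration, not any lab’s actual method): deliberately overfit a small scikit-learn classifier, then observe that a simple per-example loss threshold already separates training (‘member’) examples from held-out (‘non-member’) ones. Attacks on large models are more sophisticated, but they exploit the same basic signal.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Overfit on purpose so the member/non-member gap is easy to see.
model = RandomForestClassifier(n_estimators=50, min_samples_leaf=1, random_state=0)
model.fit(X_train, y_train)

def per_example_loss(model, X, y):
    # Cross-entropy of the true label under the model's predicted probabilities.
    probs = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(probs, 1e-12, None))

member_loss = per_example_loss(model, X_train, y_train)
nonmember_loss = per_example_loss(model, X_test, y_test)

# Guess "member" whenever the loss falls below a threshold.
threshold = np.median(np.concatenate([member_loss, nonmember_loss]))
print("members flagged:", (member_loss < threshold).mean())
print("non-members wrongly flagged:", (nonmember_loss < threshold).mean())
```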
Reinforcement Learning By Humans From Feedback
One dividend from AI and thinking about AI is thinking better about humans.
Rather than using humans as a metaphor to help us understand AI, often it is more useful to use AI as a metaphor for understanding humans. What data are you (or someone else) being trained on? What is determining the human feedback driving a person’s reinforcement learning? What principles are they using to evaluate results and update heuristics? In what ways are they ‘vibing’ in various contexts, how do they respond to prompt engineering? Which of our systems are configured the way they are due to our severe compute and data limitations, or the original evolutionary training conditions?
Often this goes both ways simultaneously. If you think about what is going wrong with RLHF or Constitutional AI, and how they kill creativity, cause risk aversion and promote overly broad social defense mechanisms that reflect a kind of righteousness mixed with paranoia, then ask what happens when we do similar things to humans. Which we do, all the time. With remarkably similar results.
Anyway, Flo Crivello seems to have forgotten the origins of the word ‘grok.’
I do not think this is a good model of the Industrial Revolution. That was mostly about standing on the shoulders of giants and making incremental progress enabled by changing economic and technological conditions, and very little of it seems to have been that individual humans improved their raw intelligence.
Humans definitely grok all the time, though. This is indeed the main way humans learn things that aren’t purely lists of facts. At first you’re doing rote things, imitating, putting pieces into place, streamlining small pieces of the puzzle, and so on. Gradually, then suddenly, the pieces fit together, and you have a working understanding that is made of gears and allows you to solve new problems and innovate in the space, to make it your own, first in smaller areas, then in larger ones.
When there is no important underlying structure to grok, the topic is fundamentally different. In school, I was unable to learn foreign languages quickly or in a sustainable way, despite real efforts, because to my brain they were almost entirely lists of pointers. There was nothing to grok that made much difference. Similarly, when I took high school biology, they taught it as a list of pointers rather than a set of principles that fit together, and I struggled, whereas chemistry and physics were things one could grok.
With humans, if you want them to learn off the remarkably small amounts of data and compute we all have available, giving bespoke feedback and well-selected high-quality data are vital. If you are trying to teach a child, you know that they will learn through random walks and exploration, but that if you want to make that process efficient and instill good habits and principles you need to constantly consider incentives and reinforcements and implications on all levels at all times and choose your words and actions strategically. Finding the exact right piece of data or feedback to use next is often 4+ orders of magnitude in efficiency gain, and the question is often ‘fair’ rather than a matter of luck.
Anyway, in this graph, we see two very different curves. Very quickly, the system memorizes the training set, between 10^2 and 10^3 steps in, but up until about 10^4.6 this provides almost no help on the test set, because the system does not grasp the underlying structure. Then between ~10^4.8 and 10^6, it phase shifts to understanding the underlying structure (the grok) and gets the test set right.
This is a common phenomenon, and it applies to strategic choices, not only to understanding structures and solving puzzles. It also applies even if you are consciously trying to handle a test set. There will often be one method that works best if you have limited affordances and understanding, then another that requires a complete reimagining of the problem, developing new skills, and often, as Teller describes the art of magic, putting in more time than anyone would reasonably expect.
Speed runs are an excellent illustration. Depending on the physical affordances available in a particular game, a true expert might be playing normally but greedily and confidently, they might be executing a mostly normal-looking strategy that supercharges by relying on knowing lots of details and anticipating decidedly non-random responses and having some tricks that are technically very difficult, or it might involve finding a bunch of strange glitches that allow skipping large parts of the game or finding a way to overwrite system memory.
In addition to speed runs and magicians, here are ten other examples.
The pattern is not ‘the later thing is better, use that.’ The pattern is that the later thing requires greater capabilities and resources to attempt, and is riskier and more capable of backfiring, often quite badly, if you don’t pull it off. In exchange, when it works and you can afford the required resources, it works better out of distribution, it is more precise and powerful, it will win in a fight. Whereas earlier systems tend to be more forgiving under typical conditions.
If you train up a sufficiently powerful AI system, as you add orders of magnitude more compute and data and parameters, you should expect such transformations, including in ways you didn’t anticipate, whether or not they enable recursive self-improvement. Your attempts to balance the game and constrain behavior, to align the AI, will break down, because what you were previously incentivizing under low-affordance conditions is no longer doing that under high-affordance conditions.
In an important sense Quintin Pope is exactly right here. To grok is to replace your previous model, potentially a terminally overfit one, with a distinct, foundationally superior model. That does not make it overblown. There is only one grok stage observed because we have chosen targets and methods that don’t incent or allow for a second grok.
Quintin is right again that the grok indeed typically involves far more effort than the previous system. Does that mean this is not an argument for extinction risk? Only if you think that recursive self-improvement or other methods of creating more powerful systems won’t involve order-of-magnitude improvements in effective resource deployments and optimization power.
Except that is exactly the plan. Each generation is using at least an order of magnitude more resources than the previous generation. More than 90% of the work in this generation comes after where the last generation stopped training, again without or before any recursive effects. Also note that almost all that work happened here despite already being near-perfect on the training set. This suggests that with more unique training data, improvement would have happened a lot faster. And also that it happened anyway.
In Other AI News
Costs are Not Benefits
A continuing series.
This is absolutely a leading indicator and a big deal. If manufacturing construction spending roughly doubled to slightly under $200 billion, that is a big deal. The Chips Act and interest in building chips is now the majority of American spending on construction in manufacturing.
It also reflects a massive government subsidy, a case of pure mercantilism. If you throw a $54 billion or so subsidy at another manufacturing sector with a large existing market, you would expect a similar result. As the track record of such subsidies shows, profitability and productivity are optional in such spots, let alone ultimate usefulness or an increase in global production.
Private investments in unsubsidized AI companies are better indicators that there is value here, yet that too is once again cost, not benefit. A huge $1.2 billion investment in an approximately year-old company, mostly to buy Nvidia chips, correlates with providing value and making future profits. The only company it establishes as profitable so far is Nvidia.
Homework is also a central example.
So you’re saying it’s going to reward a student who can achieve the objective of the class, but doesn’t want to pay costs to create useless outputs to prove that they’ve achieved the objective of the class? Or are you saying that the objective of the class is to get students to create useless outputs?
Either way, my response to this change is: Good.
Why is mundane utility for LLMs suddenly declining, with search interest for LLMs (blue) falling distinctly below Minecraft (red)?
The claim is that fake homework is actually the primary purpose of LLMs right now.
I (of course) strongly disagree that this is the primary purpose of LLMs, even in the homework context. When doing homework ‘the right way’ LLMs are super useful, as the student gets to ask questions in plain English and get answers, ask follow-ups, explore their confusions and curiosities. It is a candidate for the best actual learning tool in history, even in its current form.
Also once again, I say: Good. If kids are often using it to ‘cheat’ on homework instead, because they were forced to do something of no value and decided they would prefer not to, are you so sure it is the children who are wrong? That they aren’t learning the more valuable lessons here?
I remember when people still worried about kids having calculators.
Also, homework doesn’t seem like it would explain why character.ai traffic would have fallen. Nor does it explain why the growth from March to April to May was slow.
Quiet Speculations
Tyler Cowen calls this ‘questions that are rarely asked’ and in general that might be true but if you hang around in EA circles they are instead ‘questions that you get pretty tired of people asking.’
The correct answer is that you should ask: if the rule you followed brought you to this, of what use was the rule? If your calculation says that the best possible universe is one in which we spin up the maximum number of simulated fruit flies, your conclusion needs to not be that you should spin up the maximum number of simulated fruit flies. Instead, reflect, and notice that this is deeply stupid, indeed about as stupid as maximizing paperclips, and thus your calculation was in error.
The question then becomes, where is the error, and how do you fix it? There are several places to look. The good news is that we have the ability to do such reflection, and to recognize the need to halt, catch fire and do some combination of math and philosophy. Realize that you cannot add linear utilities of different isolated physical sensations to get the value of the universe. That is not what you value. That is you attempting to use a metric to measure what you value, and then that metric falling flat on its face outside of the training distribution. Can I instead interest you in some virtue ethics?
I strongly claim that the amount of terminal utility from ‘simulate a billion fruit flies each in a state of orgasmic bliss’ is zero. Not epsilon. Zero.
Once we get past the nitpick that the universe is finite, could we happen via the inherent value of simulations and experiments alone to end up with a much better result than we could have gotten from not building an AGI?
Yes, that is certainly possible. This also presents an s-risk or suffering risk, that such experiments could provide negative value; one can easily imagine motivations for that as well, especially if you inherently devalue life inside an experiment. If we have an intuition that identical suffering multiplies cleanly while identical positive experiences might not under some circumstances, we would want to be especially wary.
Ultimately, when we consider the potential value of a world whose atoms are configured according to the preferences of AGIs, we are talking philosophy. Do AGIs have inherent value, and if so which ones? What about simulations or real copies of various entities, including humans, non-human biological lifeforms or pure orgasmium? Does it matter if those experiences are real, or influence real outcomes, or have some sort of meaning? Do we have other preferences over the configuration of the universe?
Then we can compare that against potential future worlds. Some worlds we can imagine clearly do not have value, because the AGI or AGIs are configuring them in simple ways that clearly have no moral patients or valuable experiences. Others are less clear. The probability of these different outcomes over the long term is a complicated and disputed question, alongside which ones would have value.
I thought Tyler Cowen’s link description on the news of the funding of Inflection AI, ‘the accelerationists were always going to win,’ was a multiple-level confusion. This investment is in a mundane utility application, which is the best use of AI chips and investment, except for the danger those chips could instead be used for a frontier model training run in the future. We are long past the point where security through obscurity and lack of investment in the sector is a strategy. This is a classic Dial of Progress situation. Yes, Inflection AI having a high valuation and buying lots of chips is Yay AI. That does not mean that there is only one axis of debate or decision to be made. To assume this is to beg the question, or to set the stage for the true Boo AI counter-offensive that definitely will arrive within a few years.
Also note that AI is bad at exactly this type of nuance, as Google’s Generative AI experiment offers this summary when I Google search for the post to grab the link:
I am not surprised that the AI is conflating my level of belief in the concept of the Dial, with my belief that others believe in the dial.
What do we know about Dario Amodei (CEO of Anthropic)? Alt Man Sam lists reasons he is suspicious, including various connections and Dario’s unusually low profile. Some of it is correct as far as it goes, other parts get corrected – Dustin Moskovitz was a large funder in Anthropic’s first round but OpenPhil was not, his investment has now been very much eclipsed in size, and the internet quickly found several public appearances. Here is Dario on the Future of Life podcast a year ago; the others seem to be several years old, from before his current role.
I do find it curious he keeps such a low profile. If the point of your company is safety, there are lots of opportunities to use your position to speak about the need for safety. Instead, Anthropic embraces the commercialization and full-speed-ahead model that they claim was a large part of why those involved fled OpenAI. When they do openly talk about regulation, somehow OpenAI’s Sam Altman tries to move in the direction of the most hopeful proposals, while Anthropic’s Jack Clark and their other official statements suggest ‘lightweight’ actions that have no hope of success. Why?
Anthropic did one very important thing right by all accounts, which is a commitment to building the value of caring about safety, and hopefully true safety mindset (hard to tell from the outside on that one), into their corporate culture. That’s a huge deal. The rest is puzzling, unless you decide it very much isn’t, which is both plausible and if true quite bad news.
Sirius puts ‘humans in the loop’ for robots (paper) in the way that RLHF is human in the loop for LLMs. Which if literal would mean helping with the training feedback towards what humans think looks right, then totally out of the loop in terms of protecting against things going wrong. In this case, it seems better. Humans monitor activity, and intervene in challenging situations. From what I could see, the comparisons are not properly taking into account the true costs of the approach, so it’s hard to tell if this is progress or not.
Davis Blalock asks whether generating synthetic data is a good technique, and if so what it might accomplish. The obvious limitation is that the generated data can’t include anything that wasn’t in the original data in some form.
Yes, this is true. Except that if you actually understood all the implications of all the data available to an LLM, you would understand the universe. Humans have orders of magnitude less data to train on, and we miss almost all of the implications of this data. Imagine an LLM that actually knew everything implied by all of GPT-4’s training data, and you very much have a superintelligence.
Thus this does not obviously limit its usefulness that much, there is still plenty to do.
My read on this question is that synthetic data is a combination of rejection sampling and filtering, distillation and upweighting, deliberate practice or self-play, and various forms of cognitive reflection.
The deliberate practice aspect is distilled very easily if there are answers to generated problems that can be verified as correct much more easily than the solutions can be found, which is common. An LLM can thus essentially ‘self-play’, although not quite in the same way as an AlphaGo.
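As a concrete toy version of that loop (entirely illustrative; the sample_solution function below is a stand-in for sampling an actual LLM): generate candidate answers, keep only the ones a cheap verifier accepts, and treat the verified pairs as new training data.

```python
import random

def make_problem(rng):
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return f"{a} * {b} = ?", a * b

def sample_solution(problem, rng):
    # Stand-in for an LLM sampler: right answer only some of the time.
    a, b = [int(tok) for tok in problem.split(" = ")[0].split(" * ")]
    return a * b if rng.random() < 0.3 else a * b + rng.randint(1, 50)

def verify(problem, answer):
    # Cheap exact check; in general a unit test, proof checker, calculator, etc.
    a, b = [int(tok) for tok in problem.split(" = ")[0].split(" * ")]
    return answer == a * b

rng = random.Random(0)
synthetic_data = []
for _ in range(1000):
    problem, _ = make_problem(rng)
    candidates = [sample_solution(problem, rng) for _ in range(8)]  # rejection sampling
    verified = [c for c in candidates if verify(problem, c)]
    if verified:
        synthetic_data.append((problem, verified[0]))  # keep only checked answers

print(f"kept {len(synthetic_data)} verified pairs out of 1000 problems")
```

The verified pairs can then be fine-tuned back into the model that generated them, which is the ‘build the reflection into the model’ move described below.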
Another way to look at this is to compare what a fully general model can do without any form of reflection or prompt engineering, and what you can do with a combination of both. Anything that can be found via reflection or prompting is ‘in the training data.’ Synthetic data allows us to use both and then build that into the model, allowing us to approach an idealized ‘on reflection’ best performance without any additional scaffolding, or to export this result to another model, perhaps a smaller one. The better we can reflect and verify our answers, the better this will work. Similar considerations apply to other goals, such as alignment.
Distribution shift is raised as a cost or danger. It can also be an opportunity, as we can use limitless data to create a subset with any desired properties. You don’t want the model to learn a correlation? Use enough synthetic data that you can select a subset with the correlation not present. Whatever distribution you want it to presume, you can get. It can focus as much or as little on unlikely cases as you want. Whether or not artificially warped data leads to other good outcomes or problems is an open question.
What is the right way to think about adversarial examples? This is in light of the discovery that AlphaGo, which can otherwise crush any human, has a weakness: if you very slowly move to surround a very large group, it will sit there and let it happen, allowing even amateurs to beat it reliably.
I notice that this particular issue for AlphaGo seems highly fixable, once pointed out. If this can’t be done via introducing games with the relevant elements into the training set I’d be curious to hear why. That breaks the elegance of only learning through self-play, but that’s the way it goes.
I also don’t think this means the AI ‘doesn’t know what Go is.’ I think it is that the AI can’t look that many moves ahead and doesn’t consider that anyone would try such a thing. If you don’t think that kind of issue comes up with human players of games all the time – the ‘no one would ever try this and it would never work so I didn’t notice it was happening or guard against it even though that would have been easy, and therefore it worked’ – then I assure you that it does happen. It would definitely happen, as Nora points out, under similar conditions.
GPT-4 can be made to look deeply silly in various adversarial examples. Over time, many of them have been patched away once found. Over time, it gets harder. Will we ever ‘fully’ patch such examples in any system? No. That applies so much more to humans. We’d never survive this kind of attack.
Exactly. Think about what would happen if you got to play as many games, run as many simulations, as you wanted, before acting, getting infinite rerolls, freeze frames, partial rewinds and so on, either in games or real life. Have you seen (the excellent movie) Groundhog Day?
When I look ahead, what I see is AIs increasingly presenting adversarial examples to humans. Gradient descent and selecting for effective systems will find such examples, whether or not they come with above-human intelligence. The examples will often look deeply stupid, the way the Go example looks deeply stupid. They’ll still work.
One could say this is already happening in our existing non-AI computer systems. What is Candy Crush, or Facebook, if not an adversarial example? Or to use one of Eliezer’s go-to examples, what about ice cream?
The Quest for Sane Regulations
No idea how true this is especially in the interpretation, but a good sign.
‘Do a comprehensive risk analysis’ is the kind of move that it is difficult to argue against. The risk then is that this substitutes for action, rather than lays the foundation for identifying correct action.
Regulation via lawsuit is always an option. ChatGPT creator OpenAI sued for theft of private data in ‘AI Arms Race.’ It is rather obvious that OpenAI is blatantly violating the letter of the law when it gathers its data, the same way everyone is constantly breaking the law all the time.
Another Open Letter
Open letter from 150 EU CEOs warns EU that its AI Act as written will hurt their competitiveness without accomplishing their goals. That is indeed the prior on any new technology regulation from the EU, and seems likely in this case.
They warn of dire consequences:
Relax, EU, that’s never going to happen, no innovative companies are moving abroad, you don’t have any highly innovative companies.
The problem, the companies warn, is that the EU is using laws.
All regulations should be entrusted, instead, to a regulatory body composed of experts, with no responsibility to voters at any level, that we can spend many years influencing and attempting to capture before they do anything. Also it should only focus on ‘concrete’ risks meaning that (for example) we can worry about all dying if and only if we are all already dead.
They attempt flattery, suggesting the USA might follow the EU’s lead if the EU follows the lead of the USA’s corporations.
It is up to lawmakers to create a legally binding playing field, they say, by avoiding any laws that might in any way bind.
Who didn’t sign this? Any American or UK CEOs. So no Google, DeepMind, Microsoft, OpenAI or Anthropic. I am confident it is not because they can’t stand the competition.
The Week in Audio
Odd Lots talks to Bridgewater’s Greg Jensen, largely about AI. Bridgewater has invested heavily in rich unique data sets and how to use that data to develop investment hypotheses, with humans very much remaining in the loop.
No One Would Ever Be So Stupid As To
Yes, well.
Kevin Fischer proposes here to create an LLM, make it impossible for anyone to control, endow it with resources, make it aware of its situation and the need to acquire more resources to survive, then set it loose on the internet to try and seek more resources in order to survive. As an art project.
I get that at current capabilities levels this is mostly harmless, although less mostly harmless than one might imagine if you get too many of the details right.
In any case: Yes. Yes they will ever be so stupid as to. They can even, and will.
Danielle Fong expresses a common sentiment, in case you are wondering how difficult it will be for the AI to take charge.
Safely Aligning a Smarter than Human AI is Difficult
Davidad speculates on formal verification of LLMs.
[thread continues]
Davidad is excited by this idea. There are numerous reasons it might not work, in fact I presume it probably would not in practice be something we could implement, but the part that confuses me is suppose it does work. Then what? How do we use this to protect ourselves and ensure good outcomes, especially in more intelligent systems?
This seems especially unlikely to work given it only gives a probability. You know what you call someone whose superintelligent AI does what they want 95% of the time? Dead.
Rhetorical Innovation
Yoshua Bengio attempts a catastrophic AI risks FAQ. He is meticulously polite, saying that questions raise good points and addressing lots of details. They are a very different style of answer than I would give, aimed at a different audience. As always, the problem with giving lots of long answers is that there are always more things that were not answered.
A common question is ‘but how could AI systems actually take over?’ Given how overdetermined it is that AI systems will be capable of taking over, and the overwhelming variety of distinct objections people point to such that any list of responses to various objections will be called out as ignoring other objections, and the tendency to demand focus on one particular version of one of many takeover pathways and then insist that rejecting any detail of that means AIs can’t take over, this is a very frustrating question.
Jeffrey Ladish attempts yet another answer here.
Is this convincing? I do think this is one of many plans that would definitely work if there wasn’t a better option, and also notice that this does not require planning or intent. There will be a host of reasons moving towards granting various rights to AIs, including that it is locally economically optimal to do so. That does not have to be a sinister plot designed to enable world takeover. Then once AIs are free, things take their natural course, again a plan is not required beyond individual AIs seeking instrumentally useful resources and affordances, and the AIs that do so outcompeting first the AIs that don’t, then the humans.
Simeon wonders if the focus on ‘intelligence’ is misleading. Certainly this is true for some people, such as Tyler Cowen and Garrett Jones, who say ‘intelligence is not so valuable, X is more valuable’ except the AI will obviously have X as part of the deal. The ‘envision AI as the ultimate nerd with glasses and a pocket protector’ problem, which will very much not be a good description.
In competitions between humans, relatively minor advantages often snowball or are decisive, especially when they are compounded.
Robert Wiblin tries a framing.
I think pretty much everyone is tiptoeing around such questions for fear of how they will be pattern matched. That does not make the answers non-obvious.
People Are Worried About AI Killing Everyone
An ongoing silly question is who has ‘won’ or ‘lost’ the debate. Tyler Cowen cited the one person who said the worried had ‘won’ as evidence all worried were delusional, then said the unworried had utterly won.
This week’s new worried is previous AI skeptic Douglas Hofstadter.
Douglas Hofstadter of Gödel, Escher, Bach speaks about AI. Previously his answer to questions about AGI was that he didn’t expect it to happen for hundreds of years, so the implications could be ignored. Until very recently he was downplaying GPTs. Then that changed. He now expects AI to eclipse humans soon. He knows he doesn’t know quite what soon means to him; he mentions 5 years or 20 years or flat out not knowing. He also knows that this could be the end of humanity outright, and that even if it isn’t the end of humanity we will be rendered very small. He finds the whole thing terrifying and hates it, says he thinks about it all the time every day and he’s super depressed about it. He worries we have already gone too far and have no way of going back.
Here are some quotes:
A Less Wrong post has further quotes and some discussion.
As Janus says here, mad respect for Hofstadter for looking the implications directly in the face and being clear about his uncertainty. This is The Way.
This seems to be the pattern. Those who have devoted their lives to building AI never thought anyone would actually build one, and used this to avoid thinking about the implications if they succeeded. Then one by one each of them finds the data point that makes them realize that AGI soon is plausible. Which forces them to finally look ahead, and realize the implications and extinction risks. Once the dam breaks, they are often very frank and direct about their view of the situation.
CEO of Stability AI appears at first to make a prediction he did not think through.
If there are no programmers in five years, switching professions won’t help. Grady Booch and Gary Marcus offered to bet $50k each against this, on the theory that sometimes people say yes.
Emad clarifies:
This is a much more reasonable actual claim. It seems highly plausible that a lot more coding time will soon be spent at a higher abstraction level. I doubt it will be most of it because of the need for bug hunting and precision. If that’s not true, again, the implications seem to include that switching professions won’t much help you.
Not betting is one of those alignment solutions that does some work at human capabilities levels and can make sense to include in a religion, yet is not actually coherent on reflection and won’t help you down the line.
Scott Alexander offers ‘Tales of Takeover in CCF-World’ describing various AGI scenarios where there is no foom.
I ask myself how this particular fake framework or set of stories is helpful. The answer comes back that I don’t know, nor does it seem well-organized or the scenarios well-considered. If we assume AI progress is fast but continuous, why are these the interesting differentiations between scenarios? Again, I don’t know. There are some good considerations raised, but haphazardly and among other not as good considerations. A lot of the assumptions implicit here seem good as well. The comments are full of people noticing they are not justified, and asking an array of 101 style questions or raising 101 objections. Which is not a knock on those responses.
Perhaps we can reformat this a bit? Again, assume gradual progress only.
To me most of the action here is in 3d. Not that 1 or 2 are impossible, but that this framework does not enlighten us about them, and 3a, 3b and 3c are only special cases that aren’t relatively that likely. The important question is to put a bunch of AI systems that are selected via a competitive or capitalistic process, under some attempted regulatory regime that likely varies by location, and all the humans, into the mix and let competition and capitalism and selection do their things. I believe I know the answer (we lose) but nothing here seems that enlightening or central.
I once again sense that the word ‘alignment’ is being thrown around as a signifier with unclear and changing meaning, that requires better thought. This post is exactly the type that benefits most from the ‘model this!’ and ‘stop telling stories!’ critiques of Tyler Cowen.
The Lighter Side
Actual progress.
Jeremy Rubin: life pro tip: add this to your signature block in white text