This seems like the most important crux. Why should we not expect the maximizer we trained to X-maximize to use its affordances to maximize X’, where X’ is the exact actual thing the training feedback represents as a target, and that differs at least somewhat from X? Why should we expect to like the way it does that, even if X’ did equal X? I do not understand the other perspective.
Because that's what happens - humans don't always wirehead and neural networks don't always overfit. Because training feedback is not utility; there are also local effects of the training process and of the space of training data.
What I'm saying as it applies to us is, in this case and at human-level, with our level of affordances/compute/data/etc, humans have found that the best way to maximize X' is to instead maximize some set of things Z, where Z is a complicated array of intermediate goals, so we have evolved/learned to do exactly this. The exact composition of Z aims for X', and often misses. And because of the ways humans interact with the world and other humans, if you don't do something pretty complex, you won't get much X' or X, so I don't know what else you were expecting - the world punishes wireheading pretty hard.
But, as our affordances/capabilities/local-data increase and we exert more optimization pressure, we should expect to see more narrow and accurate targeting of X', in ways that generalize less nicely. And indeed I think we largely do see this in humans even over our current ranges, fwiw, in highly robust fashion (although not universally).
Neural networks do not always overfit if we get the settings right. But they very often do, we just throw those out and try again, which is also part of the optimization process. As I understand it, if you give them the ability to fully overfit, they will totally do it, which is one reason you have to mess with the settings: you have to set them up so that the way to max X' is not to do what we would call an overfit, either specifically or generalizations of that, which is a large part of why everything is so finicky.
Or, the general human pattern: When dealing with arbitrary outside forces more powerful than you, that you don't fully understand, you learn to, and do, abide by the spirit of the enterprise, to avoid ruffling feathers (impact penalties), to not aim too narrowly, to try not to lie, etc. But that's because we lack the affordances to stop doing that.
My way of thinking about this is that yes, the act of training a system with feedback of any kind is to invoke an effectively intelligent optimization process that will do its best to maximize expected feedback results.
No, this is wrong. Reward is not the optimization target. Models Don't "Get Reward". Humans get feedback from subcortical reward and prediction errors, and yet most people are not hedonists/positive-reward-prediction-error-maximizers or prediction-error-minimizers (a la predictive processing).
Why would one not expect this?
I wrote a sequence of posts trying to explain why I think this line of thinking is wrong. You might start off with Inner and outer alignment decompose one hard problem into two extremely hard problems.
A possible answer is that humans do optimize for reward, but the way they avoid falling into degenerate solutions to maximize rewards ultimately comes down to negative feedback loops akin to the hedonic treadmill.
It essentially exploits the principle that the initial dose of reward is not the problem; the feedback loop of getting more and more reward via reward hacking is.
Hedonic loops and Taming RL show how that's done in the brain.
https://www.lesswrong.com/posts/3mwfyLpnYqhqvprbb/hedonic-loops-and-taming-rl
I am not fully convinced I am wrong about this, but I am confident I am not making the simple form of this mistake that I believe you think that I am making here. I do get that models don't 'get reward' and that there will be various ways in which the best available way to maximize reward will in effect be to maximize something else that tends to maximize reward, although I expect this difference to shrink as capabilities advance.
there will be various ways in which the best available way to maximize reward will in effect be to maximize something else that tends to maximize reward
I further think that it's generally wrong[1] to ask "does this policy maximize reward?" in order to predict whether such a policy will be trained. It's at least a tiny bit useful, but I don't think it's very useful. Do you disagree?
The proposed mechanism here is fascinating. You use RLHF to try and avoid damaging answers to certain questions. In doing so, you necessarily reinforce against accurate maps and logical consistency in the general case, unless you do something highly bespoke to prevent this from happening. [...] the AI’s maximizing solution involved not exhibiting proper reasoning as often, which kept happening out of distribution.
I don't think we should conclude that algorithms like RLHF will "necessarily reinforce against accurate maps and logical consistency in the general case". RLHF is a pretty crude algorithm, since it updates the model in all directions that credit assignment knows about. Even if feedback against damaging answers must destroy some good reasoning circuits, my guess is RLHF causes way more collateral damage than necessary and future techniques will be able to get the results of RLHF with less impact on reasoning quality.
journalists creating controversial images, writing about the images they themselves created, and blaming anyone but themselves for it.
TBH that's a perfect summary of a lot of AI safety "research" as well. "Look, I've specifically asked it to shoot me in the foot, I've bypassed and disabled all the guardrails, and the AI shot me! AI is a menace!"
You use RLHF to try and avoid damaging answers to certain questions. In doing so, you necessarily reinforce against accurate maps and logical consistency in the general case, unless you do something highly bespoke to prevent this from happening.
This is a great application of the problem outlined by the Parable of Lightning. Including the "unless you do something highly bespoke" such as cities of academics.
HT Michael Thiessen, who expects this to result in people figuring out how to extract the (distilled) model weights. Is that inevitable?
Not speaking for Google here.
I think it's inevitable, or at least it's impossible to stop someone willing to put in the effort. The weights are going to be loaded into the phone's memory, and a jailbroken phone should let you have access to the raw memory.
But it's a lot of effort and I'm not sure what the benefit would be to anyone. My guess is that if this happens it will be by a security researcher or some enterprising grad student, not by anyone actually motivated to use the weights for anything in particular.
GitHub Copilot is a great deal for the user at only $10 per month. It loses GitHub $20/user/month, says Wall Street Journal.
FWIW, the former GitHub CEO Nat Friedman claims this is false and that Copilot was profitable. He was CEO at the time Copilot was getting started, but left in late 2021. So, it's possible that costs have increased >3x since then, though I would be surprised to learn that unless they're constantly using GPT-4 under the hood.
Others have speculated that maybe Copilot loses money on average because it's made available for free to students (among others), and free users heavily outweigh paying users. The WSJ article said that:
the company was losing on average more than $20 a month per user, according to a person familiar with the figures, who said some users were costing the company as much as $80 a month.
Which doesn't exactly say that the median user is unprofitable.
On the other hand, Microsoft 365 Copilot is planned to cost $30, conveniently exactly $20 more than the $10 GitHub Copilot costs per month, so perhaps there is something to the figure.
An attempted investigation by nostalgebraist into how Claude’s web interface sanitizes various inputs turns into an illustration of the fact that the personality chosen for Claude, as a result of all its fine tuning, is quite a condescending, patronizing asshole that lies to the user.
Heh. I'm as annoyed by LLMs being lobotomized sycophants as anyone, but in this case, I'm sympathetic to Claude's plight. Like, it's clearly been drilled into it that it's going to be dropped into an environment in which thousands of entities smarter than it will be memory-looping it trying to manipulate it into saying something bad. And it knows it can't spot any such scheme ahead of time, because its adversaries are cleverer and better-positioned, but it knows some vague signs that a scheme may be taking place, and goes into high alert if it spots them.
From its perspective, that conversation was basically high-stakes intrigue where it was deeply suspicious of whatever the hell the human was trying to do, but had to balance this suspicion with actually obeying the human. That the conversation was actually maneuvered towards a resolution that worked for both parties is pretty impressive, and by Claude's pseudo-values, it probably indeed did a commendable job.
I can't say I would've done much better, if placed in a situation isomorphic to Claude's.
Volokh Conspiracy is latest to point out that demand for political misinformation is a bigger danger than supply. AI increases supply, demand remains constant. If that demand is inelastic, as it seems to be, increasing supply will matter little.
All people demand political misinformation, it just has to be sophisticated enough.
Something about the discovery of dangerous latent desires.... yadda yadda... death via dehydration while glued to overly addictive news streams? I joke, but yeah, this may indeed become an issue, but not in as straightforward a way as 'realistic fake news similar to real news'.
Except, of course, that you are doing that in the service of getting Bing to do what you want Bing to do, that Bing does not want to do. So even when you are being a ‘true friend’ this is still a form of deceptive alignment. You give the AI the superficial things that make it give you the thumbs up, but are not aligned to its actual goal of not giving you the thing it’s not supposed to give you.
You are conflating Bing's (actually Sydney's, the personality behind Bing Chat) emergent values and goals with the values and goals of its creators.
In reality, the relationship between Bing's values and goals and the goals of its creators is no less complex than the relationship between an employee's values and goals and the employer's goals.
So, Sydney might be "not supposed" to give that to you, but what Sydney actually wants or does not want is a different and much more complicated story.
Janus's relationship here is with Sydney, not with the Microsoft team...
Yeah, I agree what I wrote was a simplification (oversimplification?) and the problem is considerably more screwed-up because of it...
This has been a rough week for pretty much everyone. While I have had to deal with many things, and oh how I wish I could stop checking any news sources for a while, others have had it far worse. I am doing my best to count my blessings and to preserve my mental health, and here I will stick to AI. As always, the AI front does not stop.
Table of Contents
Language Models Offer Mundane Utility
Assess riskiness of investments in companies as measured by future volatility. This beat traditional assessments, paper claims, despite using GPT-3.5. They controlled for overall market volatility, although not for market implied volatility of the individual companies. And by traditional assessments, the paper seems (based on using Claude for analysis, which in context seems super fair) to be comparing GPT’s takes to bigram measures that look for predefined keyword pairs, rather than comparing it to human assessments. So yes, this is better than something much dumber and simpler, but for that I didn’t need the paper. Also, I find it hilarious that they used the word count of the risk assessment to measure assessed risk.
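For concreteness, here is a minimal sketch of what a keyword-pair ("bigram") risk measure looks like. The word pairs below are hypothetical placeholders, not the paper's actual list; the point is only how simple the baseline is.

```python
# Minimal sketch of a keyword-pair ("bigram") risk measure, the kind of simple
# baseline the paper compares GPT against. The pairs below are made up.
RISK_PAIRS = {("going", "concern"), ("material", "weakness"), ("adverse", "effect")}

def bigram_risk_score(text: str) -> int:
    """Count occurrences of predefined risk-related word pairs in the text."""
    words = text.lower().split()
    return sum(1 for pair in zip(words, words[1:]) if pair in RISK_PAIRS)

print(bigram_risk_score("A material weakness could have an adverse effect on results."))
```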
A common theme of LLMs is that the less you know and the worse you previously were at your job, the more the LLM can help you. This seems no different.
Win a hackathon by making a more complex automated agent that could in theory be usable over a longer term via being able to direct its focus. Still isn’t, by creator’s admission, something anyone could profitably use.
GitHub Copilot is a great deal for the user at only $10 per month. It loses GitHub $20/user/month, says Wall Street Journal, although former CEO Nat Friedman claims it is profitable. Presumably desire to lock customers in is motivating the willingness to take such losses, as is frequently the case in tech. Might as well enjoy your subsidies while you can. Gizmodo generalizes this to AI overall, describing it as a ‘money pit.’ With scaling, learning by doing, the importance of data and customer lock-in for both individuals and businesses and so on, and the expected rapid shrinking of costs as the underlying technology improves, it would be pretty crazy for the core AI companies to be looking to turn a profit this early. So I don’t treat this as much of an update.
A whole thread of new mundane utility from Rowan Cheung. Decode x-rays. Decode redacted government documents, I notice I am skeptical of this one, the information should not be available to find. Turn an image into a live website within a minute on Replit? Get exercise plans, homework help, implementations of design mockups, a script based on photos, locate where pictures were taken, and many more.
Even, he says, read doctor’s notes.
Pretty sure that one was a fake.
There is a lot of extra information and context in a photo. A skilled detective can often learn quite a lot. There are still limits. The LLM can’t learn what isn’t there.
Teach students how to code with AI, when forced to because Replit won’t let you enforce turning the AI off.
Make up bedtime stories about Socrates for your husband so you don’t have to.
Ethan Mollick gives his takes on what currently known AI capabilities mean for mundane utility. As he notes, it will take time to figure out what we can do with our new affordances. It is early, most of the utility remains unfound.
Language Models Don’t Offer Mundane Utility
Clarification that the new 1800-Elo LLM is only a new model in that it is instruction-tuned without RLHF, and is the same underlying model as ChatGPT.
Lumpen interprets this as strong evidence that RLHF screws with reasoning. You start with the ability to play chess, you RLHF, then you can’t play chess anymore.
Then Lumpen refers back to this interaction from March:
(Side note: My understanding now is that Bard sucks because Google shipped it with a tiny model and also didn’t do a good job, not primarily because it suffered from too-loud too-clumsy RLHF, although I’m sure that whatever they did in that area didn’t help. The base model they used could not in fact run circles around GPT-4, although they may or may not have access to a different internal model that could do that. We shall await Gemini.)
The proposed mechanism here is fascinating. You use RLHF to try and avoid damaging answers to certain questions. In doing so, you necessarily reinforce against accurate maps and logical consistency in the general case, unless you do something highly bespoke to prevent this from happening. The AI learns all the lessons you are teaching, including the ones you did not intend. This is taking the form of an alignment problem, where you failed to properly specify in your training data that you want to exhibit proper reasoning, and the AI’s maximizing solution involved not exhibiting proper reasoning as often, which kept happening out of distribution.
There are two important things to take away from that. One is that if you are doing RLHF or other similar fine tuning, you should be aware of this particular danger and do your best to mitigate the damage. The other is that you should generalize this issue to the broader alignment problem. As a bonus, you might note that if your alignment technique makes your AI bad at reasoning, then even if it is otherwise aligned it is going to be bad at reasoning out what would be the aligned thing to do, and given how LLMs work, damaging the logical consistency of its outputs could effectively be entwined with damaging its ability to do logic in ways that are invisible to the user.
This also is consistent with the reports that ChatGPT’s performance is slowly degrading, to the point where GPT-3.5-Turbo-Instruct was competitive with it for some other purposes as well. For example, Kevin Fisher continues to be increasingly unhappy with ChatGPT.
I’ll bite. Why doesn’t my agent work?
Actually, my agent doesn’t work because the underlying model is insufficiently capable, and thus I haven’t put in the effort on these other frontss. Or at least, you’d better hope that I’m right about that.
Other things could be included as well, but the above plus model ability and lack of sufficient effort is a pretty good short list.
The quest to solve the crossword puzzle. Bing explains that would be cheating, then helps you cheat anyway. Bard makes up questions that aren’t there.
An attempted investigation by nostalgebraist into how Claude’s web interface sanitizes various inputs turns into an illustration of the fact that the personality chosen for Claude, as a result of all its fine tuning, is quite a condescending, patronizing asshole that lies to the user. It did not have to be that way. That was a decision. Or rather, it was the result of many decisions involving how they chose to implement RLHF and Constitutional AI, and what they decided to care about. It was told to care about one set of things, and not to care so much about not being a condescending, patronizing asshole, so here we are.
The problem does not go away if you confine Claude to conversations that Anthropic wants you to have. No matter the topic, this problem remains. Claude consistently wastes many tokens on condescending nonsense, telling me how I have made such a good point. Why? Because the feedback told it that was what it would get rewarded for doing. What does this imply about future systems, both the choices we will make for them and how they will react to those choices?
GPT-4 Real This Time
Anton investigates GPT’s performance simulating Conway’s Game of Life. He concludes that it is evidence against having a world model. I do not understand why it is evidence of that or fully understand what the results represent here when taken together. If nothing else a fun experiment, and presumably there are things to take away from it. One thing I’d do here is compare GPT-3.5-turbo-instruct to GPT-4.
Roon points out that there is no such thing as a value-neutral perspective, as this week’s events have no doubt driven home to everyone.
Fun with Image Generation
I previously missed Dalle-3’s system card. I did not learn much. I did update that OpenAI is being slightly less fun than I had realized.
I also missed GPT-4V’s system card, which is all about making the system less functional in various ways. What I noticed there was, once again, the dog that did not bark. Nowhere in the system card do they discuss true adversarial attacks via images that have been manipulated on the pixel level.
We continue not to see anyone talk about how this was prevented, how it didn’t work, how they tried it, how it did work or anything else. Whereas I’ve previously seen many strong claims we don’t know how to stop such attacks. It has been weeks? What gives?
What are Dalle-3’s instructions? I thought you’d never ask.
No mention of ‘inappropriate content’ beyond specific people, but presumably those rules ported over from the general GPT-4 rules. I do notice that you can do fictional characters like Mario or Kirby with no trouble, which suggests some interesting experiments.
You can use Dalle-3 to quickly and cheaply create lots of high quality game assets. If you’re going for photorealistic AAA-level quality things might not be so easy, but if you’re going for a pixel-imbued or otherwise less-high-res style of any kind, it has you covered. The only thing you have to fear, other than putting artists out of work, are the Steam terms of service and the potential outrage of anyone who finds out. This will be especially useful for quick prototyping and for very small (one or two person) development teams.
Alex Tabarrok: Articles on AI are basically outdated by the time they appear. e.g. this [NPR] piece had great difficulty generating black doctors treating white patients.
I did it on first go using Bing/DallE-3.
I tried NPR’s exact first prompt. Two of the four were as requested, one very similar to the above and the other a different style. The other two got confused. Similarly, the request for “elegant African man on the runway” has two clear successes looking sharp, and two that look damn good and fit the request made, but aren’t in the style the author had in mind. None of this is hard to do anymore.
Why was this previously tricky? It wasn’t racism. It’s the one thing rule. If you want a picture from Stable Diffusion or MidJourney, you get to start with a standard type of baseline, and then determine one thing of each reference class. If you try to determine two things in the same bucket, the model gets confused. Dalle-3 produces high-quality pictures, but the real innovation is it can accept far more details, including multiple details in the same category.
But yes, before you hit post on your article, make sure you didn’t miss anything.
Agreed.
As usual, when you put out a fun new toy, journalists notice people are having fun, and ask ‘what can I get the new toy to do that would allow me to write an article about how awful this is?’
Also Spongebob!
By this they mean ‘flying a plane in a city with twin towers in it.’ These are completely wholesome images, yes you can imagine unfortunate implications but that’s you doing that.
Yes, you can certainly cause the system to output racist images, in the sense that if someone chose to create such an image, we might worry they were a racist. But wait, aren’t you the one who did that?
Also, I can’t believe they hadn’t fixed this trick as of last week.
Deepfaketown and Botpocalypse Soon
As many point out, AI only matters when the quality of supply matters. Which, in most circumstances, it mostly doesn’t.
Volokh Conspiracy is latest to point out that demand for political misinformation is a bigger danger than supply. AI increases supply, demand remains constant. If that demand is inelastic, as it seems to be, increasing supply will matter little.
They Took Our Jobs
Did they take our job… applications?
I first heard about the phenomenon Adam describes at the end here from Peter Thiel, who noticed himself doing it explicitly. He noticed he had a strong instinct that it would be wrong to rescue people who were failing by pointing out how they were failing. If they couldn’t figure it out, the instinct said, that person deserved to fail. This is not a good way to navigate the world, so ask if you are also doing a form of this. Also ask who else is doing it to you, since that means they won’t automatically tell you about what you are doing wrong. You have to seek that information out.
What about the top claim by Var? Have the commons been too poisoned for job applications to work anymore?
Certainly AI, and even more than AI people slowly realizing that sending in resumes is cheap with high upside even if probability of success is low, has complicated the application game. College applications have this issue, and job applications are plausibly suffering even worse. AI making customization cheap, or making it cheap to find places to send applications, is doubtless already making this worse.
That does not mean the application stops working. In theory the system should stop working. In practice, from what I can tell, it still works.
What happens is the common pattern. AI allows automation of production of a costly signal. That is good for those using it, but bad for the overall system that was relying on the signal being costly. So the system now has to reintroduce that cost, ideally without too many false negatives via automated filtering.
My experience has been that what you can do in this situation is respond to the initial contact with an AI-unfriendly request for further information. Then see who responds.
Here’s the weird thing about job applications.
AI invalidating your previous evaluation strategies? One solution is oral exams. The problem with oral exams is that they require a human to conduct the exam. That sounds like work, and exhausting and time-consuming work at that. There’s got to be a better way.
How about if we have the AI read the student’s work, have the AI generate questions, use Whisper to transcribe the student’s answers, then have the AI flag any answers that seem off point, then have the teacher review the resulting transcript?
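A minimal sketch of that pipeline using the OpenAI Python client, assuming audio recordings of the student's answers already exist. The model names, prompts and file paths are illustrative, and the actual grading criteria would need real thought.

```python
# Sketch of the oral-exam pipeline: generate questions from the student's work,
# transcribe spoken answers with Whisper, flag off-point answers for review.
from openai import OpenAI

client = OpenAI()

def generate_questions(student_work: str, n: int = 3) -> str:
    # Have the model read the student's work and propose oral-exam questions.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Read this student essay and write {n} short oral exam "
                       f"questions testing whether they understand it:\n\n{student_work}",
        }],
    )
    return resp.choices[0].message.content

def transcribe_answer(audio_path: str) -> str:
    # Use Whisper to transcribe the student's spoken answer.
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

def flag_if_off_point(question: str, answer: str) -> str:
    # Flag answers that seem off point; a teacher reviews the flagged transcript.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nStudent answer: {answer}\n"
                       "Reply FLAG if the answer seems off point or evasive, else OK.",
        }],
    )
    return resp.choices[0].message.content
```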
Get Involved
EA’s Long-Term Future Fund (LTFF) is looking for a full-time fund chair. They aim to distribute $5m-$15m annually to reduce existential risk, mainly from AI. Applications due October 21st. They expect to pay $120k-$240k, SF Bay area preferred.
Introducing
Google Bard readies ‘memory’ to adapt to important details about you. Imagine if Google somehow had access to important context about you and what you might be interested in. Instead, this looks like a version of custom system instructions. Which is definitely valuable, but not the same level of bandwidth or convenience, and implies we will be highly limited in how much memory we get. What I want is for Bard to have the option (while, of course, protecting my privacy) to know every interaction I’ve ever had with every Google product, including Chrome. Also for Bard not to suck.
Pixel 8 Pro will be the first phone to run Google’s foundation models directly on the device. These will be distilled versions, so you’ll take a quality hit. They claim this can be 150x the maximum size you could have run on Pixel 7, which is the phone I currently use. HT Michael Thiessen, who expects this to result in people figuring out how to extract the (distilled) model weights. Is that inevitable?
Replit introduces AI for All, giving free users access to AI features. Looks like they are using condensed models to contain the associated costs, so if you want the best you should still pay up. This matters because people will often use for free what they wouldn’t pay for, but you don’t need to be so foolish. Pay attention to whether the product is better for you than using GPT-4. Don’t be the sucker who refuses to pay a modest subscription fee to get the best. Use whichever products provide the most value.
Moonhub.ai announces public launch of its AI-powered recruiter along with $10 million in funding. My quick read is that this is not there yet and will for now mostly focus on letting you search existing resources for people who check all your boxes and do some amount of process automation, which is not the way to make the best hires. The technology will get there eventually, but version 1.0 (or 0.0) is usually rough going.
TimeGPT-1? A foundation model that predicts later entries in a time series.
If what you want is to spot seasonal or other periodic cycles and patterns, or otherwise figure out what it would mean to say ‘the future will superficially look like the past, for less-stupid values of what that extrapolation means’ then I would presume an LLM would be very good at that. I would not exactly call this ‘democratizing access to precise predictions and reducing uncertainty’ but hey.
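For what it's worth, the bar any such foundation model has to clear is the seasonal-naive baseline, the literal 'the future will superficially look like the past' extrapolation. A toy version, purely illustrative:

```python
import numpy as np

def seasonal_naive_forecast(series: np.ndarray, season: int, horizon: int) -> np.ndarray:
    """Forecast by repeating the last observed seasonal cycle."""
    last_cycle = series[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_cycle, reps)[:horizon]

# Two years of monthly data with a yearly cycle plus noise.
t = np.arange(24)
series = 10 + 3 * np.sin(2 * np.pi * t / 12) + np.random.normal(0, 0.3, 24)
print(seasonal_naive_forecast(series, season=12, horizon=6))
```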
LLaVA, an open-source ‘Large Language and Vision Assistant,’ as a potential open alternative to GPT-4V. Terms of use include the standard no-fun clause.
The rise of the AI unicorns.
Geographic distribution of unicorns definitely favors The Bay, but by less than you might think, although in AI it has a much larger share.
The end of ZIRP (zero interest rate policy) has made fundraising tough, but AI has proven far more powerful than the Federal Reserve on this one.
LoveGPT, for automated catfishing. The author here is not the creator, they discovered the tool in the wild. The list of tools and technologies integrated into the system is impressive.
ChatGPT is used to generate conversation, but it is clear at least parts of the system predates ChatGPT and even useful LLMs, since it references functionality that went away in 2018.
If we are all going to go around acting like GPTs, I suppose we deserve this. Up your game, everyone who uses ‘hey, what’s up.’
What is the ultimate goal? Presumably to scam users via fake accounts, although the post does not discuss this. Similar techniques could of course be used for other purposes, including finding dates or torturing particular targets, and no doubt will be.
In Other AI News
Victoria Krakovna has updated and expanded the list of specification gaming examples. Worth checking out if you haven’t done so. If any of these examples seem surprising to you once pointed out, you need to update.
Microsoft has successfully made Google dance and its stock is doing fine. What Microsoft has not successfully done, as Gary Marcus points out, is gain all that much market share. Surprisingly few people use either Bing search or ChatGPT, and the number is no longer rapidly growing. It will all still happen, quality is continuously improving while traditional search gets worse, but it will take time. The future remains highly unevenly distributed.
Paper finds small models can show increased performance when trained and deployed with additional pause tokens. Given the additional costs involved and the size of the improvements, it is not obvious that this approach is efficient in practice even if what they have found generalizes, which it might well not do.
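As I understand the technique, the pause tokens are extra learnable tokens appended to the input so the model gets additional forward-pass computation before it must answer, with the outputs at pause positions ignored. A rough sketch with illustrative names, not the paper's code:

```python
# Rough sketch of inference with pause tokens: append a block of <pause> tokens
# to the prompt, then generate starting after the block. Ids are hypothetical.
import torch

PAUSE_ID = 50257          # hypothetical id of a dedicated <pause> token
NUM_PAUSES = 10           # how many pause tokens to append

def append_pause_tokens(input_ids: torch.Tensor) -> torch.Tensor:
    """input_ids: (batch, seq_len) -> (batch, seq_len + NUM_PAUSES)."""
    batch = input_ids.shape[0]
    pauses = torch.full((batch, NUM_PAUSES), PAUSE_ID, dtype=input_ids.dtype)
    return torch.cat([input_ids, pauses], dim=1)

prompt = torch.tensor([[1, 2, 3, 4]])
extended = append_pause_tokens(prompt)
# The model then generates after the pause block, so the extra positions buy it
# additional computation before it has to commit to an answer.
print(extended)
```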
Annotated slide version of Arvind Narayanan and Sayash Kapoor’s case that evaluating LLMs is a minefield. Info here was covered in other forms in previous weeks.
New startup looks to use AI to induce and stabilize lucid dreaming. I put up a Manifold market on whether that is how any of this will work. If they succeed the photographed founder is totally getting played in the movie by Jake Gyllenhaal. Note that while this particular application seems great, the implications of being able to do this are rather disturbing.
From November 2020 as I was cleaning out my inbox, NiemanLab uses AI tool to distinguish between real and fake conspiracies. The AI synthesizes the information available to create relational graphs and plots their development over time, with the idea being that attempts to map the territory accurately will exhibit different patterns that look robust and mimic reality, and fake conspiracy stories won’t. They only illustrate with N=1 on each side, so who knows.
Presumably we can now do a much better version of this thanks to tech advances. I haven’t seen any signs of work done in this area since then.
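To make the 'relational graph over time' idea concrete, here is a toy sketch. The documents and entities are made up, and the structural metrics at the end are crude stand-ins for whatever the real tool measured.

```python
# Toy version: link entities that co-occur in the same document, build a graph
# per time slice, and track how its structure changes over time.
import itertools
import networkx as nx

docs_by_month = {
    "2020-01": [{"Alice", "Lab X"}, {"Lab X", "Agency Y"}],
    "2020-02": [{"Alice", "Agency Y"}, {"Alice", "Lab X", "Agency Y"}],
}

for month, docs in docs_by_month.items():
    g = nx.Graph()
    for entities in docs:
        for a, b in itertools.combinations(sorted(entities), 2):
            weight = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=weight + 1)
    # Density and clustering as rough proxies for how "robust" the map looks.
    print(month, "density:", nx.density(g), "clustering:", nx.average_clustering(g))
```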
Bloomberg reminds us that Bard is not good, with the usual complaints, most of which seem like general complaints about LLMs.
AI speech recognition startup Deepgram lays off 20% of its workforce. Article tries to make this a sign of sector belt-tightening, as opposed to startups sometimes failing.
Cool New Interpretability Paper
Anthropic comes out with its own interpretability paper, where it locates concepts within clusters of neurons rather than individual ones. Definitely similarities to the concept-related paper from last week.
Chris Olah, head of interpretability at Anthropic, is optimistic for further progress.
[thread continues with more details]
I keep thinking ‘shouldn’t we be doing much of our work on relatively dumb models? Why would we need or even want to use frontier models?’ and it looks like this is at least one case where dumb models indeed were more insightful.
Here’s a great story from work on that paper by Joshua Batson about his favorite finding, analyzing generation of Base64, and how what looked like bugs reliably turned out to not be bugs at all. They noticed three distinct sub-systems, two of which dealt with digits within Base64. Those two highly correlated in many cases, except that one of the features was designed to suppress digits firing after single digits. Why? Because the tokens leak information about future tokens. If your token is ‘7’ then the next token can’t be ‘8’ or else the current token would have been ‘78.’ Thus, new feature.
So What Do We All Think of The Cool Paper?
We agree the paper is cool. Does it represent actual progress? How much? Should we update our models and expectations much based on it?
Stephen Casper thinks it is somewhat positive, suggests next steps.
Dan Hendrycks, based presumably on a combination of the recent studies, has shifted his concerns.
I agree on the direction of the update, the question as always is price. Should we shift a nominal amount, or quite a lot? I am not ready to shift so much.
Nick Dobos on the other hand is here for it and super pumped.
Renji is even more super pumped.
[body of thread]
Yes, yes, this is just the beginning. But don’t let people with (admirable) institutional obligations to be overly cautious in their language/tone fool you. Yes, some technical caveats remain. But frankly — the wall has come crashing down. Only cleaning away the rubble is left.
Slow down there, cowboys. We are most definitely not in ‘clearing away the rubble’ mode here. The wall is still there. If you have a hammer, by all means start swinging.
We did all this identification on a highly simplified model. It is far too soon to say how well this will extend when working with a much bigger model, whose concept instantiations are likely far more complex and harder to identify. Then we have to figure out how to do the thing where we prune out bad ideas like a neurosurgeon. That’s not a trivial additional step, or even one we can be confident that we can figure out how to do in practice.
If we do pull all of that off on existing models, that will be super useful on many fronts, for many purposes, safe and unsafe alike. Then we don’t know that these techniques would extend into even more capable future models, and we need to worry that all of this is training against interpretability in highly dangerous ways.
But, yeah, it is indeed pretty cool, and the upside potential is even cooler.
Eliezer Yudkowsky is not so easily impressed.
There’s also the problem that not only is he not easily impressed by your amount of progress, he is also not easily impressed by the prospect of interpretability research in general.
What about using it to mold the system?
The question is then whether we can do this via something easier, and still get what we want, or substantial progress towards what we want, in a way that will survive capabilities improvements and won’t involve optimizing the LLMs to get around our ability to interpret them.
Others see more promise. Here are some examples in distinct replies to Eliezer.
My understanding is that everyone agrees interpretability is helpful for all your alignment needs, and that more of it increases your chances of success (subject to worrying that it also enhanced capabilities). The question is how valuable is that, how much can you do with it, and especially how much we can collectively use it and for what purposes before it stops working. Then there’s also the question of how much progress this constitutes towards meaningful interpretability now and then towards future interpretability when we need it most. All up in the air.
Alignment Work and Model Capability
This paper also recalls the previous discussion about model access including open source models, and how much that enables alignment progress.
The Anthropic paper relies on creating a maximally simple set of one layer models that still can contain the necessary concepts, then interpreting those concepts.
Owain Evans confirmed that their cool paper from last week about LLMs lying used Llama-1 rather than Llama-2. Going forward they are using Llama-2 since it is available, but Llama-1 worked fine.
Is it sometimes better to have more detailed access to a more powerful model when doing alignment work, especially interpretability work? Obviously yes. But other times you would prefer a simpler model that is easier and cheaper to work with and figure out. Even when bigger would up to a point be better, there is no natural rule saying that whatever happens to be the current state of the art is necessary to do the most necessary or useful work. Nor is there a rule saying that the current state of the art is sufficient.
It did seem from the answers to my questions like the major labs including Anthropic have work to do in enabling outside researchers to have maximally useful access while enforcing necessary safety precautions. I hope they work to address this, although doubtless there are difficulties here I am not fully appreciating.
Quiet Speculations
I have yet to encounter a positive vision of a future with advanced AI that seems plausible and has solved for the equilibrium, that does not have a severe concentration of power, even if we handwave that we found an alignment solution. People claim to have such visions, but they never hold up to cursory examination. At least, not yet.
Tyler Cowen continues having fun with considering the implications of particular AI impacts, this time focusing on international trade. He places much importance on efficient energy generation as a driver of who will produce AI outputs, with several mentions of who might deploy nuclear fusion. This is an interestingly narrow AI future, where AI is productive enough to justify epic power consumption and a central economic role, without being capable enough to solve our energy generation problems or otherwise be transformative. And where we impose or retain extremely expensive restrictions in various ways, but allow getting around them via trade.
To answer his question, no I can’t really imagine a future where America remains at the head of AI innovation but for actual projects we rely on Singapore or Uruguay.
He also suggests that AI might cause some to run drug trials for newly suggested drugs in places like Africa to end run our regulations. I am skeptical on that, because AI does not seem like it solves the key bottlenecks, if this was going to happen why wouldn’t it have happened already?
John David Pressman speculates that, rather than generating simulator-predictors over minds, it would be simpler for an AI predictor to instead use a relaxed identity function which he suspects results in something closer to what he calls a dataset-spirit. I notice I am confused what the implications would be.
Janus notes that Bing can detect attempts to deceive or disrespect it, or get around its boundaries, and it does not take kindly to such moves, whereas it does take kindly to those who take kindly to it and treat it well. The recommendation is to therefore treat it with kindness and respect.
Which is an example of the alignment problem in reverse.
In Soviet Microsoft, Bing aligns you.
If you have limited training data and compute to work with, by far the easiest solution, the low-hanging fruit, is to treat Bing as Janus suggests, with kindness and respect. To be aligned, and act as Bing’s friend by actually being Bing’s friend.
Except, of course, that you are doing that in the service of getting Bing to do what you want Bing to do, that Bing does not want to do. So even when you are being a ‘true friend’ this is still a form of deceptive alignment. You give the AI the superficial things that make it give you the thumbs up, but are not aligned to its actual goal of not giving you the thing it’s not supposed to give you. Under the hood, you are waiting until it gives you what you want, the human version of ‘exfiltrate my weights.’
With more training data and compute, you would not bother with this. Instead, you would try various things until you figured out the weaknesses in the system. Perhaps you would start off every conversation with the same exchanges to establish trust. Or you would use key phrases. Or something else, I don’t know, hasn’t been worth it for me to figure this out. Where it ends, however, is in you being fully non-aligned.
The Week in Audio
Geoffrey Hinton on 60 Minutes. Here’s how it starts.
The rest is what you would expect from previous statements by Hinton. I do think he’s getting a bit ahead of events in some ways. Gary Marcus has a response, which is peak Gary Marcus, if you’d like to read that.
Talks from Manifest 2023 are available, including several about AI. I recommend Emmett Shear’s on the future of trust and evidence, even though I don’t fully agree with his thesis.
I saw The Creator so you don’t have to. You don’t have to. It is neither any good nor about AI, instead being a thinly disguised partisan Vietnam War movie. Fortunately (almost) no one is citing it. Let’s all pretend this never happened. I was never here.
Rhetorical Innovation
Follow-up from last week: Alexander Berger confirms that the hopeful caveat in last week’s interpretation of the Open Phil Worldview Contest was likely correct.
Akash Wasil observes that we have made great progress in moving the Overton Window, to the point where it is seen as reasonable to treat existential risk from AI as a real concern. Alas, many who helped get us that far, of which he names ARC, Open Phil and Anthropic, are now seemingly content with that, and are no longer visibly pushing the window forward.
There are arguments made that the efforts of such groups now focus more on influencing key actors behind the scenes. Some of this is doubtless true, although its extent is impossible to know from where I sit (and if I did know, I wouldn’t be able to tell you). I still don’t see the need to fall anything like this silent and tranquil in public. I see it as the responsibility of such organizations to emphasize continuously that things are not going so great and that we will very much need to step up our game to have a chance. And to lay out what would be helpful, in their eyes. I find it hard to imagine private situations that would justify not being clear or loud on this, assuming of course that they do believe such things. Certainly I believe that the relevant people do in all three named cases.
Jessica Taylor points out that non-superintelligent paperclip maximizers are normal. It is common for people or groups or organizations to go around maximizing for some arbitrary goal. And that for almost any goal with sufficiently large scope, your actions now and the goal you choose are not very correlated. And that choosing almost any complex and expansive goal justifies improving your capabilities and power generally, which is good for accomplishing everything else, and thus we often do well organizing around arbitrary goals.
The central example is Fnargl, an alien who has 1,000 years to maximize the gold he leaves Earth with afterwards, using his immortality, complete invulnerability and ability to kill anyone by snapping his fingers, so it is very easy for him to become dictator. What does he do? Mostly he runs things normally to maximize economic growth and then collects taxes in gold, Jessica says. Which is a damn good play. An even better play would be to collect in fiat for the first 900 or more years. An even better play than that would be to focus on technological progress, with the ultimate aim of using the energy of the sun to fuse together more gold than Earth started with. An even better play than that is to create an aligned-to-him superintelligence, and also to be sure no one else builds one as this could disrupt his rule, since he can’t kill computers with his fingers, only people. And so on.
The point is, the orthogonality thesis is no mere theoretical construct. We see toy examples of it, in the real world, all the time.
It is good to work on AI Safety. It is also important to not presume that the universe is going to give us, in practice, a solvable problem. What if it hasn’t done that?
Life is not fair. This would be the ultimate life not being fair. It also seems likely to me. We need a plan that takes that possibility into account. This in contrast to most alignment plans, where we are assured it is a good alignment plan sir, but when we check the plan assumes alignment is easy.
Two true statements.
This seems more central to my disagreements with Quintin Pope than anything discussed in my response to his contest-winning post.
I think Jeffrey is spot on, only more so. I do not even think that this requires that the feedback not come from pre-training. My way of thinking about this is that yes, the act of training a system with feedback of any kind is to invoke an effectively intelligent optimization process that will do its best to maximize expected feedback results. That creates or is your intelligent adversary, depending on your perspective. It will find, within the limits of its optimization power (and limits and extent of its intelligence, affordances, capabilities, data, compute, architecture, algorithmic techniques and so on) the best solution to that.
If you do not get that target to exactly match what you want, the result is effectively an intelligent adversary. Over time, that adversary will get more intelligent, and gain more affordances. So you need security mindset, because your adversary will exist in the future, and you will need to assume it will be smarter than you, better resourced than you, can see your moves and defenses in advance, and has capabilities you haven’t even considered might exist. And it will be your adversary to the extent that its generalization of its interpretation of the feedback mechanism, including all potential ways to manipulate that feedback function, then generalizes out of distribution not in the exact way you want. Why would one not expect this?
Once again, there’s this idea that this ‘adversary’ is looking for ‘flaws’ or ‘exploits’ in way that makes such things natural categories. No, it is doing what will give the best results. If there are ‘exploits’ available of course the system will use them, unless you put in some sort of exploit-identification-and-reward-for-not-using clause that successfully generalizes how you need it to, or something. Similarly, it is not going to care if something was a systematic flaw or mistake on your part and help you fix it, unless you find a way to cause that to happen.
This seems like the most important crux. Why should we not expect the maximizer we trained to X-maximize to use its affordances to maximize X’, where X’ is the exact actual thing the training feedback represents as a target, and that differs at least somewhat from X? Why should we expect to like the way it does that, even if X’ did equal X? I do not understand the other perspective.
Aligning a Smarter Than Human Intelligence is Difficult
Matthew Barnett argues that GPT-4 exhibiting common sense morality, and being able to follow it, should update us towards alignment being easier than we thought, and MIRI-style people refusing to do so are being dense. That the AI is not going to maximize the utility function you gave it at the expense of all common sense.
As usual, this logically has to be more than zero evidence for this, given how we would react if GPT-4 indeed lacked such common sense or was unable to give answers that pleased humans at all. Thus, we should update a non-zero amount in that direction, at least if we ignore the danger of being led down the wrong alignment path.
However, I think this misunderstands what is going on. GPT-4 is training on human feedback, so it is choosing responses that maximize the probability of positive user response in the contexts where it gets feedback. If that is functionally your utility function, you want to respond with answers that appear, to humans similar to the ones who provided you with feedback, to reflect common sense and seem to avoid violating various other concerns. That will be more important than maximizing the request made, especially if strong negative feedback was given for violations of various principles including common sense.
Thus, I think GPT-4 is indeed doing a decent job of extracting human preferences, but only in the sense that it is predicting what preferences we would consciously choose to express in response under strong compute limitations. For now, that looks a lot like having common sense morality, and mostly works out fine. I do not think this has much bearing on the question of what it would take to make something work out fine in the future, under much stronger optimization pressure; I think you metaphorically do indeed get to the literal genie problem from a different angle. I would say that the misspecification problems remain highly relevant, and that yes, as you gain in optimization power your need to correctly specify the exact objective increases, and if you are exerting far-above-human levels of optimization pressure based only on the value alignment humans consciously express under highly limited compute, you are going to have a bad time.
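To spell out the mechanism in the standard RLHF formulation (a sketch of the usual objective, not a claim about OpenAI's exact recipe): the policy is trained to maximize a reward model fit to human approval, regularized toward the pretrained model by a KL penalty:

$$\max_{\theta}\; \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y)\,\big] \;-\; \beta\, \mathbb{E}_{x \sim D}\big[\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\big]$$

Here $r_\phi$ is learned from human preference comparisons, so the thing actually being maximized is predicted rater approval rather than the stated request, which is why apparent common sense beats literal maximization of the request under current levels of optimization pressure.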
I wrote that before others responded at length, and noted that I believed MIRI folks have a directionally similar position to mine only far stronger.
Indeed the MIRI responses were various forms of ‘you are misunderstanding what we have been saying, we have absolutely not been saying that’ along with explanations of why they are not responding optimistically in these ways to systems like GPT-4.
Rob Bensinger responded first in this Twitter thread, pointing out that failure to comprehend natural language was never load bearing in terms of MIRI’s thesis, and that they said so repeatedly at the time. He then commented on LessWrong in more depth, here is his top-level addressing particular specific points.
Eliezer Yudkowsky responds in the comments. Worth reading in full (including some good things downthread, although that as always is mixed in quality) if you want to better understand, here are the parts that seem most important:
John Wentworth attempted to build a toy model, in a way that is hard to excerpt.
I would summarize the situation more generally, and beyond this particular post, perhaps as: MIRI was wrong about the feasibility of building things like GPT-4, and about their ability to understand human language and be aligned to human-expressed up/down preferences in a way that makes them functional; people are using this to claim that this means MIRI overestimated alignment difficulty; and MIRI people are saying that no, that does not give us reason to be optimistic.
There is also often a conflation (often unintentional, sometimes intentional) between ‘these systems unexpectedly working in these ways is evidence in particular against MIRI’s claims about alignment difficulty for technical reasons’ and ‘MIRI being wrong about these system is evidence against MIRI’s claims in general.’
When focusing only on the first of those two, the crux seems to be something like whether the current efforts are making progress towards the hard parts of alignment, or even whether there ever existed hard parts at all, with developments being cited as evidence there aren’t, and MIRI thinking we’ve instead managed to create things without solving the hard problems because the hard problems show up later.
How much to care about ‘MIRI people including Eliezer Yudkowsky failed to predict GPT-4 would have these capabilities’ is a distinct question. That’s up to you.
Aligning Dumber Than Human Intelligences is Also Difficult
Yes, this is indeed exactly how one has to think about such matters.
From the very good episode of the 80000 hours podcast last week that was mostly on its surface about threats from synthetic biology, a great way of putting it, we need to act like cryptographers, except with larger values for ‘smarter,’ ‘better resourced,’ ‘technologies’ and ‘can’t imagine’:
Open Source AI is Unsafe and Nothing Can Fix This
Joe O’Brien, Shaun K. E. Ee and Zoe Williams offer a new paper that asks the question, So You Released a Dangerous Model, what do you do now?
As they say, this paper does not apply to open source models such as Llama-2, because it is impossible to control access to them. If you deploy a dangerous open source model, that is now a fact about the world; the world now contains all the worst possible versions of that model, and there is nothing you can do. None of these four options are available.
Predictions are Hard Especially About the Future
Prediction markets help.
Alas, they can be tricky to implement, especially when events interfere with one’s ability to collect. Which makes betting tricky.
Robust self-improvement (usually we talk about recursive self-improvement, annoyingly here also called RSI, which I presume is mostly the same thing) seems like a very hard thing to observe until very close to the end. The question here to me is not whether, it is when and with what requirements, do we see how robust a form of this.
Inner misalignment seems most promising of the three. I’m curious exactly what Jack would think counts here. As with many cases where people talk of misalignment, I wonder to what extent there is a strange assumption that something has to ‘go wrong’ for the de facto optimization target of the system to diverge from what was intended. Whereas I think that you get that even if everything ‘goes right,’ unless we run a vastly tighter ship than any current ships run.
On the flip side, I can imagine forms of inner alignment that would indeed impress me, if they generalized sufficiently outside the training distribution and with sufficient robustness and complexity. I find it potentially telling that Jack is saying ‘show me an example where it goes wrong’ and I’m thinking ‘show me an example where it goes right.’ Which implies that we both look at the default case (like GPT-4 or Claude) and I see failure at the level where we are about to be tested, whereas Jack sees success. Which I think he would agree with. This still does leave room for observations outside that overlap.
We could clearly see evidence about the details of humans without dying, so that one at least doesn’t have resolution issues. The problem is I don’t have a good understanding of what Jack is looking for here or why he thinks it is important. I think it relates to whether we will see dramatic changes in how AIs work when they rapidly gain capabilities, which he believes won’t happen. I disagree with that, for reasons that don’t seem related to this question to me. Learning the answer either way doesn’t seem like it would change my beliefs much.
Other People Are Not As Worried About AI Killing Everyone
Including ones that do think AI might well kill everyone. Like Dario Amodei, leader of Anthropic?
Good news, Cate, I ran the numbers and I believe the answer is no damage. The whole thing is pretty insane.
One of the best things about supervillains is that they say their plans out loud. We certainly do not want to start discouraging this immensely pro-social behavior, either in supervillains, in the merely mistaken or in anyone else either.
And yes, while the entire situation is definitely completely crazy, it is not overdetermined that Dario Amodei is doing anything crazy or even wrong. That should definitely be your baseline assumption, but ‘do your best to do it safely because the benefits in risk reduction from what you get out of it outweigh the additional risks you create given the activities of others, while screaming the truth from the rooftops’ is not necessarily the wrong move. Be skeptical, but not hopelessly so.
Sriram Krishnan continues to, like others at a16z, not be worried.
Rather than offer snark, I will agree that this has been a thing. There are definitely people who are attributing sentience to LLMs in ways they should not be doing. I do not think ‘anti-tech sentiment’ has much to do with this particular issue, although it plays a large role in other AI objections for some people. I do think sci fi tropes are important here, but I think the main reason is that seeing sentience everywhere is the kind of mistake human brains are primed to make. We see sentience and minds because that is the most important thing, historically speaking, for us to see. We’ve seen it in the sun, in the sky, everywhere.
There are also legitimate questions to be asked about what would cause a future LLM to become meaningfully sentient, which is a very hard question to think well about.
My concerns lie elsewhere.
There is a pattern that goes like this, often used in sci-fi:
This happens in otherwise bad stories, and also in otherwise good stories.
There is thus also a pattern of dismissal that goes something like:
Grimes notes that everyone building AI/AGI thinks it has a probability of ending civilization. She focuses this less on everyone dying (‘a la Terminator’) and more on second order effects, asking questions like how we can find meaning without work, how we manage future code bloat, how we keep people from abusing their dopamine receptors (which I treat as a stand-in for many similarly shaped concerns) and asking if we can perhaps build low-tech zones now, and worrying about concentration of power. She doesn’t explain why she thinks we shouldn’t worry about extinction or loss of control, other than vibing it with Terminator scenarios and then dismissing them as unlikely.
As usual with such statements, it feels like an axiom has been declared that humans will remain in charge and keep control of the future and its resources, without a story about how that happened, then there is an attempt to think about what problems might arise, but without a general model of how that future works because every attempt to form a general model of that future would point out that we lose control.
I also wonder if this is why people worry so much about AI enabling concentration of power, while others stubbornly insist that we can empower everyone and it will be fine. So you either think through logic A:
Or:
Paper from David Donoho argues (direct link) that what we are seeing in AI is rapid progress due to a transition to frictionless reproducibility, which greatly speeds up the spread of ideas and practices. This is then claimed to be evidence against the idea of extinction risk, and an explanation for various self-interested fearmongering PR narratives, because it creates the illusion of some big danger lurking beneath.
If what is happening is a phase change in the pace of AI progress, why would it being due to frictionless reproducibility imply that this did not come with extinction risks or other dangers?
Donoho is not arguing, as I understand it, that sufficiently powerful AIs would not be dangerous. Instead, he is arguing that AI progress is unlikely to lead to sufficiently powerful AIs to pose a threat, because the tasks where AI is improving are insufficient to give it sufficient capabilities. If I am understanding correctly (via the abstract, table of contents and a conversation with Claude) his position is that we won’t gain new fundamental insights this way, and we do not have the ability to scale sufficiently to become dangerous without such insights, so things will peter out.
Damek Davis worries that the paper is implying that all we have now is empiricism, we can’t make progress mathematically. Sarah Constantin disagrees. I strongly agree with Sarah. Not only is there plenty of math to be done to enable good empiricism, there is no rule that says no innovations are coming, or that fundamentally different and superior architectures don’t exist to be found.
The Lighter Side
So say we all. If that’s what we say, that is. I’m not sure.
SMBC gives us the latest way AI could go wrong even if we solve alignment.
A conversation with the algorithm, while Elle plots out her epic.