For the record, the reason I didn't speak up was less "MIRI would have been crushed" and more "I had some hope".
I had in fact had a convo with Elon and one or two convos with Sam while they were kicking the OpenAI idea around (and where I made various suggestions that they ultimately didn't take). There were in fact internal forces at OpenAI trying to cause it to be a force for good—forces that ultimately led them to write their 2018 charter, so, forces that were not entirely fictitious. At the launch date, I didn't know to what degree those internal forces would succeed, and I didn't want to be openly publicly hostile in a way that might undermine those efforts.
To be clear, my mainline guess was that OpenAI was going to be a force for ill, and I now think that my post on the topic was a mistake, and I now think it would have been significantly better for me to just bluntly say that I thought this was a bad development (barring some turnaround). (I also think that I was optimistically overestimating the potential of the internal forces for trying to make the whole operation net-good, in a way that probably wouldn't have withstood careful consideration—consideration that I didn't give.) But the intent in my communication was to extend an olive branch and leave room for the forces of change to produce such a turnaround, not to avoid retribution.
(And, to be explicit: I consider myself to have been taught a lesson about how it's pretty important to just straightforwardly speak your mind, and I've been trying to do that since, and I think I'd do better next time, and I appreciate the feedback that helped me learn that lesson.)
I can confirm that Nate is not backdating memories—he and Eliezer were pretty clear within MIRI at the time that they thought Sam and Elon were making a tremendous mistake and that they were trying to figure out how to use MIRI's small influence within a worsened strategic landscape.
Anders Sandberg is worried about AI reinforcing authoritarianism, cites a paper to that effect and discusses it in a thread. Certainly there are examples of the form ‘facial recognition helps regime which funds facial recognition which helps regime’ but I fail to see how this differentiates from any other atom blaster that can point both ways. We can all agree that so long as humans have not yet lost control of the future, the struggle for freedom will continue, and that one way humans can lose control of the future is for them to intentionally build an authoritarian AI-powered structure.
Oddly naive perspective. AI is extremely useful for surveillance, and seems less useful for counter surveillance. Generally speaking, any information technology that increases the amount or quality of data governments have on citizens, or citizens' thoughts, movements, behaviors, and attitudes, increases authoritarianism.
A critical disagreement I have, and a big reason why I could potentially work at an AGI company without violating my values, is that I have a different distribution of problem difficulty.
I believe we are basically out of the pessimistic scenario altogether, and thus the worst that can happen is the medium difficulty scenario, where effort is required but the problem is solvable. I place something close to 98% probability on the medium difficulty scenario, about a 2% probability on the optimistic scenario, and an epsilon probability of the pessimistic scenario.
What is your evidence/reasoning that we are out of the pessimistic scenario? How do you think the medium difficulty scenario could play out?
The basic reason why I say that is that we finally have empirical alignment techniques that scale well with certain capabilities like data, and one example is Pretraining from Human Feedback. In particular, one of the tests showed that a form of power seeking can be limited if it's misaligned with human values.
In particular, we managed to create Cartesian boundaries that make sense in an embedded world, so we can reliably ensure it doesn't hack/manipulate human values, or Goodhart human values too much once it gets out of the Cartesian world and into the embedded world, since the AI has no way to control the data set used in offline learning, unlike online learning.
It also completely avoids the risk of deceptive alignment, where an AI appears to be aligned only because it's instrumentally useful.
This is partially because we first align it, then we give it capabilities.
Here's a post on it:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
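For those who don't click through: the headline method in that paper, conditional training, is simple enough to sketch. What follows is a minimal illustration rather than the authors' code; `score_document` stands in for whatever preference classifier is used, and the token names and threshold are placeholders.

```python
# Sketch of conditional training from Pretraining from Human Feedback:
# tag each training segment (whole documents here, for simplicity) with a
# control token based on a reward score, then train the LM on the tagged
# text with ordinary next-token prediction.

GOOD, BAD = "<|good|>", "<|bad|>"

def score_document(text: str) -> float:
    """Placeholder reward model: higher means more aligned with preferences."""
    raise NotImplementedError("plug in a real classifier here")

def tag_corpus(documents, threshold=0.5):
    """Prepend a control token so the LM learns both behaviors, labeled."""
    for doc in documents:
        token = GOOD if score_document(doc) >= threshold else BAD
        yield f"{token}{doc}"

# At inference time you condition on GOOD ("<|good|>" + prompt), which is
# what the paper reports as steering generation toward preferred behavior.
```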
In the medium difficulty scenario, AI safety is a problem, and more difficult than normal empirical problems in some respects, but we don't need to throw out iterative solutions. In essence, it's not solved by default but it's solvable with enough empirical work.
For context: Are you part of that alignment effort?
I could easily imagine that changes to the pre-training regime can lead to more robust agents with fewer of the obvious superficial failure modes. Naively, it also does make sense that it moves us into a regime that appears strictly safer than doing unconstrained pretraining and then doing RLHF. I don't see how it reliably tells us anything about how things generalize to a system that is capable of coherent reasoning that is not necessarily legible to us.
I.e. I don't see how to update at all away from the (alignment-technical) pessimistic scenario. I could see how it might help move us away from a maximally socially pessimistic scenario, i.e. one where the techniques that we pursue seem to aggressively optimize for deception and try to fix failure modes only after they have already appeared.
For context: Are you part of that alignment effort?
No.
I don't see how it reliably tells us anything about how things generalize to a system that is capable of coherent reasoning that is not necessarily legible to us.
I think it does generalize pretty straightforwardly, since it attacks core problems of alignment like Goodhart's law, Deceptive Alignment and misaligned power seeking. In the Pretraining from Human Feedback work, they've completely solved or almost completely solved the deceptive alignment problem, solved the most severe versions of Goodhart's law by recreating Cartesian boundaries that work in the embedded world, and showed that as you give it more data (which is a kind of capabilities increase), that misalignment decreases, which is tentative evidence that there's a coupling of alignment and capabilities, where increasing capabilities leads to increasing alignment.
It also has a very small capabilities tax.
In particular, this is a huge blow to the pessimistic view of AI alignment, under which such a breakthrough in alignment via empiricism wouldn't happen, or at least not without radical change, let alone the number of breakthroughs that the Pretraining from Human Feedback work showed.
Meta: One reason I'm so optimistic is that I believe there's a serious, pernicious bias to emphasize negativity in the news, so I'm giving negative updates a higher burden of proof, or equivalently lowering the burden of proof for positive updates.
Has anyone tried to point out expected failure modes of that approach (beyond the general "we don't know what happens when capabilities increase" that I was pointing at)?
I'll admit I don't understand the details enough right now to say anything, but it seems worth looking at!
I'm not sure I can follow your Meta-reasoning. I agree that the news is overly focused on current problems, but I don't really see how that applies to AI alignment (except maybe as far as bias etc. are concerned). Personally, I try to go by who has the most logically inevitable-seeming chains of reasoning.
Has anyone tried to point out expected failure modes of that approach (beyond the general "we don't know what happens when capabilities increase" that I was pointing at)?
Not right now, though more work is necessary to show that alignment keeps improving as the model improves along capabilities other than data. But that's likely the only shortcoming of the paper.
Personally, I expect that Pretraining from Human Feedback will generalize to other capabilities and couple capabilities and alignment together.
I'm not sure I can follow your Meta-reasoning. I agree that the news is overly focused on current problems, but I don't really see how that applies to AI alignment (except maybe as far as bias etc. are concerned). Personally, I try to go by who has the most logically inevitable-seeming chains of reasoning.
While logic and evidence do matter, my point is that there's a general bias towards the negative view of things, since we like negativity and the news serves us up more of it.
This has implications for arguably everything, including X-risk: The major implication is that we should differentially distrust negative updates over positive updates, and thus we should expect to reliably predict that things are better than they seem.
Here's the link for the issue of negativity bias:
https://www.vox.com/the-highlight/23596969/bad-news-negativity-bias-media
Functionally epsilon at the lower end of alignment difficulty (the optimistic scenario), and a maximum of 10% in the medium difficulty scenario.
So AI risk deserves to be taken seriously, but much longer optimistic tails exist, and one can increase capabilities without much risk.
I think you misinterpret hindsight neglect. It got to 100% accuracy, so it got better, not worse.
Also, a couple of images are not shown correctly, search for <img in text.
Yeah, I quickly fixed this in original, I definitely flipped the sign reading the graph initially.
Mods can reimport, since I don't know the right way to fix the <img errors.
Speaking on behalf of myself, rather than the full DeepMind alignment team:
The part that gives me the most worry is the section on General Hopes.
General hopes. Our plan is based on some general hopes:
- The most harmful outcomes happen when the AI “knows” it is doing something that we don’t want, so mitigations can be targeted at this case.
- Our techniques don’t have to stand up to misaligned superintelligences — the hope is that they make a difference while the training process is in the gray area, not after it has reached the red area.
- In terms of directing the training process, the game is skewed in our favour: we can restart the search, examine and change the model’s beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.
- Interpretability is hard but not impossible.
- We can train against our alignment techniques and get evidence on whether the AI systems deceive our techniques. If we get evidence that they are likely to do that, we can use this to create demonstrations of bad behavior for decision-makers.
That seems like quite a lot of hopes, which I very much do not expect to fully get, so the question is to what extent these are effectively acting as assumptions versus simply things that we hope for because they would make things easier.
Hmm, I view most of these as clearly-true statements that should give us hope, rather than assumptions. Which ones are you worried about us not getting? For more detail:
Does that help with where my head is at?
Yes, that all seems reasonable.
I think I have still failed to communicate on (1). I'm not sure what the relevance of common sense morality is, and if a strong AI is thinking about finding a way to convince itself or us that's already the situation I want to detect and stop. (Obviously it's not clear that I can detect it, but the claim here is just that the facts of the matter are present to be detected.) But probably it's not that worth going into detail here.
On (2), the theory of change is that you don't get into the red area, which I agree is equivalent to "the grey area solution would be sufficient". I'm not imagining pivotal acts here. The key point is that before you are in the red area, you can't appeal to "but the misaligned superintelligence would just defeat your technique via X" as a reason that the technique would fail. Personally, I think it's (non-trivially) more likely that you don't get to the red areas the more you catch early examples of deception, which is why I like this theory of change. I expect vehement disagreement there, and I would really like to see arguments for this position that don't go via "but the misaligned superintelligence could do X". I've tried this previously (e.g. this discussion with Eliezer) and haven't really been convinced. (Tbc, Eliezer did give arguments that don't go through "the misaligned superintelligence could do X"; I'm just not that convinced by them.)
I basically agree with (3), (4) and (5). I do expect I'm more optimistic than you about how useful or tractable each of those things are. As a result, I expect that given your beliefs, my plans would look to you like they are relying more on (3), (4), and (5) than would be ideal (in the sense that I expect you'd want to divert some effort to other things that we probably both agree are very hard and not that likely to work but look better on the margin to you relative to the things that would be part of my plans).
I do still want to claim that this is importantly different from treating them as assumptions, even under your beliefs rather than mine.
Is it possible to simply not build something when the economic incentive is there to do so? Historically the answer is essentially yes. We have the (lack of) receipts.
I would say of the list of examples, your claim is obviously and thoroughly false.
Geoengineering: the warming has not been sufficiently bad to have a business case yet. Interest in this is increasing as it becomes less and less likely the warming can be controlled by restricting emissions.
CFCs: ban widely ignored especially in China when money is to be made. Alternatives not significantly less capable or more expensive.
Nanotech: ??? Heavy research into this, I know of zero restrictions. It doesn't work because it's too hard (for humans) to solve
GMO humans: minimal gain for doing this, very risky. No economic gain.
Scientific Studies: it is a net bad that this is banned
No challenge trials: it is a net bad that this is banned
Recreational drug dev: false, see various synthetic marijuanas, developed mostly in China
GMOs: flat false, it's not relevant to the fate of Europe if they ban AI dev. Just impoverishes them in the short term and they get no say in the long term
Atomic Gardening: this method is obsolete and we still use plants developed this way
Nuclear Power: it's too expensive, heavy use by all superpowers
Nukes for construction: how often is the radioactive crater useful?
Fracking: superpowers all do it...
Weapons: superpowers use them all...
Mostly this is a list of behaviors that weak countries refuse to engage in, and their inevitable defeat is prevented by shared nuclear arsenals or being too poor to be worth invading.
Banning AI capabilities would de facto be a country signing away any future sovereignty it might have.
(because in futures with aligned AIs, that simply means they are aligned with human intentionality. Any useful tool AIs will still act in a hostile manner when directed, including uses to plan and execute military attacks. In futures with AI takeovers the AIs are aligned with themselves. Either future, the country without AI is an easy victim)
Also I am not worried that there won’t be demand for artists in the near term. As art takes a quantum leap in productivity and quality, figuring out how to generate the best stuff, and figuring out what would be best to generate, are going to be valued human skills that existing artists should have a big leg up on.
Most of the economic pressure driving up artists' hourly wages is the years of training required to mechanically generate art, not an artistic sensibility. They may have a slightly better shot at landing an AI art generation job than the average person, but they have still lost tons of market value as creators, probably most of it, and I expect if the job exists it will be filled mostly by other people.
I have similar but not as completely hopeless comments for computer programmers.
Sad. Wish there was another way, although it is fair that most erotic content is really terrible and would make everything stupider. This does also open up room for at least one competitor that doesn’t do this.
Apparently when EleutherAI added erotic content to their dataset it actually increased their scores on some general reasoning benchmarks. So I wouldn't be so sure!
I don’t know how to read ‘19% higher,’ I presume that means 19% less hallucinations but I can also think of several other things it could mean.
This might be referring to the "Internal factual eval by category" chart that showed accuracy going from ~50% to ~70% (i.e. ~19 percentage points, which means more like 40% reduction in hallucination).
This. If they had meant 19% less hallucinations they would have said 19% reduction in whatever, which is a common way to talk about relative improvements in ML.
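To spell out the arithmetic behind that reading (the chart values are approximate, so treat the numbers as illustrative):

```python
# If internal factual accuracy went from roughly 50% to roughly 69%, the
# hallucination rate fell from ~50% to ~31%: a ~19 point absolute gain,
# but a ~40% relative reduction in hallucinations.
before, after = 0.50, 0.69
abs_gain = after - before                                     # ~0.19, the "19%" figure
rel_reduction = ((1 - before) - (1 - after)) / (1 - before)   # ~0.38
print(f"absolute gain: {abs_gain:.0%}, relative drop in hallucinations: {rel_reduction:.0%}")
```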
I propose investigating essentially all the other classic errors in the bias literature the same way, comparing the two systems. Are we going to see increased magnitude of things like scope insensitivity, sunk cost fallacy, things going up in probability when you add additional restrictions via plausible details? My prediction is we will.
My prediction is that we will not.
Thread for suggesting if there is anything here that should be its own post, either in current form or expanded form.
General request for feedback on my AI posts and how they could be improved, keeping in mind that LW is not the main demographic I am aiming for.
Many twitter posts get deleted or are not visible due to privacy settings. Some solution for persistently archiving tweets as seen would be great.
One possible realisation would be an in-browser script to turn a chunk of Twitter into a static HTML file including all text and maybe the images. Possibly auto-upload to a server for hosting and then spit out the corresponding link.
Copyright could be pragmatically ignored via self-hosting. A single author hosting a few thousand tweets plus context off a personal Amazon S3 bucket or similar isn't a litigation/takedown target. Storage/hosting costs aren't likely to be that bad given this is essentially static website hosting.
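A minimal sketch of the hosting half of that idea, assuming the tweets have already been captured as text (the in-browser capture is the harder part and is not shown); the bucket name and rendering format are made up for illustration:

```python
# Render already-captured tweets into a static HTML page and push it to an
# S3 bucket. This is only the hosting half; capturing the tweets (by hand
# or via a browser script) is assumed to have happened already.
import html
import boto3

def render_thread(tweets: list[dict]) -> str:
    """tweets: [{'author': '@handle', 'text': '...'}, ...], captured elsewhere."""
    items = "\n".join(
        f"<blockquote><p>{html.escape(t['text'])}</p>"
        f"<footer>{html.escape(t['author'])}</footer></blockquote>"
        for t in tweets
    )
    return f"<!doctype html><html><body>{items}</body></html>"

def publish(tweets, bucket="my-tweet-archive", key="thread.html"):
    page = render_thread(tweets)
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=page.encode("utf-8"),
        ContentType="text/html",
    )
    return f"https://{bucket}.s3.amazonaws.com/{key}"
```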
When you link to a Twitter thread, I wish you would link to the first post in the thread. It's already confusing enough to get context on Twitter, please don't make it harder for us.
Might be worth addressing Pinker's arguments even if it's unlikely to get him to change his mind. He's an influential guy (Hanania also has a relatively large platform), so it could help to persuade their audience.
I don't see his arguments as being in good enough faith or logic to do that. Hanania I have tried to engage with, I don't see how to do that with Pinker. What would you say are the arguments he makes that are worth an actual response?
(Note that I get the opposite pushback more often, people saying various forms of 'why are you feeding the trolls?')
I encourage people to use agree-disagree voting on Nathan's comment (I will abstain) on the question of whether I should engage more with Pinker.
Lemon, it’s Tuesday
Somehow, this was last week:
(Not included: Ongoing banking crisis threatening global financial system.)
Oh, also I suppose there was Visual ChatGPT, which will feed ChatGPT’s prompts into Stable Diffusion, DALL-E or MidJourney.
Current mood, perhaps.
The reason to embark on an ambitious new AI project is that you can actually make quite a lot of money, also other good things like connections, reputation, expertise and so on, along the way, even if you have to do it in a short window.
The reason not to embark on an ambitious new AI project is if you think that’s bad, actually, or you don’t have the skills, time, funding or inspiration.
I’m not not tempted.
Table of Contents
What’s on our plate this time around?
Executive Summary
This week’s section divisions are a little different, due to the GPT-4 announcement. Long posts are long, so no shame in skipping around if you don’t want to invest in reading the whole thing.
Section 4 covers the GPT-4 announcement on a non-safety level. If you haven’t already seen the announcement or summaries of its info, definitely read #4.
Sections 5-8 cover the NotKillEveryoneism and safety aspects of the GPT-4 announcement, and some related issues.
Sections 9-11 cover the other big announcements, that Microsoft and Google are integrating generative AI deep into their office product offerings, including docs/Word, sheets/Excel, GMail/outlook, presentations and video calls. This is a big deal that I don’t think is getting the attention it deserves, even if it was in some sense fully predictable and inevitable. I’d encourage checking out at least #9.
Sections 12-15 are about exploring what the new GPT-4 can and can’t do. My guess is that #12 and #14 are information dense enough to be relatively high value.
Sections 16-23 cover the rest of the non-safety developments. #22 is the quick catch-all for other stuff, generally worth a quick look.
Sections 24-32 cover questions of safety and NotKillEveryoneism. #29 covers questions of regulation, potentially of more general interest. The alignment plans in #25 seem worth understanding if you’re not going to ignore such questions.
Section 33 finishes us with the jokes. I always appreciate them, but then, I did choose them.
Introducing GPT-4
It’s here. Here’s Sam Altman kicking us off, then we’ll go over the announcement and the papers.
From the announcement page:
Many were impressed by the exam result progress. There was strong progress in some places, little change in others.
For some strange reason top schools are suddenly no longer using the SAT. On one level, that is a coincidence: those schools are clearly dropping the SAT so they can make their admissions less objective while leaving less evidence behind. On the other hand, as we enter the age of AI, expect to see more reliance on AI to make decisions, which will require more obscuring of what is going on to avoid lawsuits and blame.
This thread checks performance on a variety of college exams, GPT-4 does as predicted, quite well.
Several people noted that AP English is the place GPT-4 continues to struggle, that it was in a sense ‘harder’ than biology, statistics, economics and chemistry.
That is not how I intuitively interpreted this result. GPT-4 has no trouble passing the other tests because the other tests want a logical answer that exists for universal reasons. They are a test of your knowledge of the world and its mechanisms. Whereas my model was that the English Literature and Composition Test is graded based on whether you are obeying the Rules of English Literature and Composition in High School, which are arbitrary and not what humans would do if they cared about something other than playing school.
GPT-4, in this model, fails that test for the same reason I didn’t take it. If the model that knows everything and passes most tests can’t pass your test, and it is generic enough you give it to high school students, a plausible hypothesis is that the test is dumb.
I haven’t run this test, but I am going to put out a hypothesis (prediction market for this happening in 2023): If you utilize the steerability capacities of the model, and get it to understand what is being asked of it, you can get the model to at least a 4.
Michael Vassar disagrees, suggesting that what is actually going on is that English Literature does not reward outside knowledge (I noted one could even say it is punished) and it rewards a certain very particular kind of creativity that the AI is bad at, and the test is correctly identifying a weakness of GPT-4. He predicts getting the AI to pass will not be so easy. I am not sure how different these explanations ultimately are, and perhaps it will speak to the true meaning of this ‘creativity’ thing.
GPT-4 also did well on various benchmarks.
Performance across languages was pretty good.
One big change highlighted in these charts is that GPT-4 is multi-modal, and can understand images. Thread here concludes the picture analysis is mostly quite good, with the exception that it can’t recognize particular people. OpenAI claims GPT-4 is substantially above best outside benchmark scores on several academic tests of computer vision, although not all of them.
Progress is noted on steerability, which will be interesting to play around with. I strongly suspect that there will be modes that serve my usual purposes far better than the standard ‘you are a helpful assistant,’ or at least superior variants.
I am still sad about the bounds, as are we all. Several of the bounds are quite amusingly and transparently their legal team saying ‘Not legal advice! Not medical advice! Not investment advice!’
Hallucinations are reported to be down substantially, although as Michael Nielsen notes you see the most hallucinations exactly where you are not checking for them.
Interesting that the 0-shot and 5-shot scores have not improved much, whereas the gap including RLHF is much higher. This seems to mostly be them getting superior accuracy improvements from their RLHF, which by reports got considerably more intense and bespoke.
Most disappointing: The input cutoff date is only September 2021.
And yes, I knew it, the RLHF is deliberately messing up calibration.
This is actually kind of important.
The calibration before RLHF is actually really good. If one could find the right prompt engineering to extract those probabilities, it would be extremely useful. Imagine a user interface where every response is color coded to reflect the model’s confidence level in each particular claim and if you right click it will give you an exact number, and then there were automatic ways to try and improve accuracy of the information or the confidence level if the user asked. That would be a huge improvement in practice, you would know when you needed to fact-check carefully or not take the answer seriously, and when you could (probably) trust it. You could even get a probabilistic distribution of possible answers. This all seems super doable.
The graph on the right is a great illustration of human miscalibration. This is indeed a key aspect of how humans naturally think about probabilities, there are categories like ‘highly unlikely, plausible, 50/50, probably.’ It matches up with the 40% inflection point for prediction market bias – you’d expect a GPT-4 participant to buy low-probability markets too high, while selling high-probability markets too low, with an inflection point right above where that early section levels off, which is around 35%.
One idea might be to combine the RLHF and non-RLHF models here. You could use the RLHF model to generate candidate answers, and then the non-RLHF model could tell you how likely each answer is to be correct?
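A sketch of what that hybrid could look like, with both model calls left as placeholders (no public endpoint exposed a pre-RLHF GPT-4 at the time):

```python
# Let the RLHF'd model write candidate answers, then use the pre-RLHF base
# model's probabilities to rank them. The same normalized scores could drive
# the color-coded confidence UI described above.
import math

def generate_candidates(question: str, n: int = 5) -> list[str]:
    """Placeholder: sample n answers from the RLHF'd (chat) model."""
    raise NotImplementedError

def base_model_logprob(question: str, answer: str) -> float:
    """Placeholder: log P(answer | question) under the non-RLHF base model."""
    raise NotImplementedError

def rank_answers(question: str) -> list[tuple[str, float]]:
    candidates = generate_candidates(question)
    logps = [base_model_logprob(question, a) for a in candidates]
    # Softmax over the candidate set so the scores read as probabilities.
    m = max(logps)
    weights = [math.exp(lp - m) for lp in logps]
    total = sum(weights)
    return sorted(zip(candidates, (w / total for w in weights)),
                  key=lambda pair: pair[1], reverse=True)
```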
The broader point seems important as well. If we are taking good calibration and replacing it with calibration that matches natural human miscalibration via RLHF, what other biases are we introducing via RLHF?
I propose investigating essentially all the other classic errors in the bias literature the same way, comparing the two systems. Are we going to see increased magnitude of things like scope insensitivity, sunk cost fallacy, things going up in probability when you add additional restrictions via plausible details? My prediction is we will.
I would also be curious if Anthropic or anyone else who has similar RLHF-trained models can see if this distortion replicates for their model. All seems like differentially good work to be doing.
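A sketch of the kind of harness this would take, using conjunction-fallacy-style pairs as the running example; `elicit_probability` is a placeholder for however one extracts a numeric probability from each model, and the pairs themselves are illustrative:

```python
# Give the pre-RLHF and post-RLHF models the same paired questions, where one
# version adds a plausible but probability-lowering detail, and measure how
# often each model assigns the more specific statement a HIGHER probability.

PAIRS = [
    ("A randomly chosen adult is a bank teller.",
     "A randomly chosen adult is a bank teller and is active in the feminist movement."),
    ("A massive flood somewhere in North America drowns more than 1,000 people next year.",
     "An earthquake in California causes a flood that drowns more than 1,000 people next year."),
]

def elicit_probability(model: str, statement: str) -> float:
    """Placeholder: return the model's stated probability for the statement."""
    raise NotImplementedError

def conjunction_fallacy_rate(model: str) -> float:
    errors = sum(
        elicit_probability(model, detailed) > elicit_probability(model, plain)
        for plain, detailed in PAIRS
    )
    return errors / len(PAIRS)

# Compare conjunction_fallacy_rate("base") against conjunction_fallacy_rate("rlhf");
# the prediction above is that the RLHF'd model errs more often.
```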
Next up is the safety (as in not saying bad words), where they report great progress.
Nitpick note, but I do actually care: Off the bat I notice that I’d strongly prefer that if we are using RLHF, we use it to direct AIs to say ‘I will not’ rather than ‘I cannot and will not’ when refusing a request for ethical reasons.
I was happy to see this safety work not always move in the direction of not answering questions, such as training it to tell you how to find cheap cigarettes.
I am fine with attaching a warning, and that is a good answer. As opposed to the early-stage flat out refusal.
Also, while default fun is down 82%, some good news in terms of our ability to figure out how to have fun despite the fun police.
My current model says that it will be impossible to actually get rid of all jailbreaks unless an unexpected innovation is found, which means this ‘safety’ strategy will never make the model actually safe in any meaningful sense.
Next they report that as they trained the model, it improved along the performance curves they predicted. We cannot easily verify, but it is impressive curve matching.
Each mark on the x-axis is 100x more compute, so ‘throw more compute at it’ on its own seems unlikely to accomplish much more beyond this point.
A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. To address this, we developed infrastructure and optimization methods that have very predictable behavior across multiple scales. These improvements allowed us to reliably predict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–10,000× less compute.
I read this as saying that OpenAI chose methods on the basis of whether or not their results were predictable on multiple scales, even if those predictable methods were otherwise not as good, so they would know what they needed to do to train their big model GPT-4 (and in the future, GPT-5).
This is a good sign. OpenAI is, at least in a sense, making an active sacrifice of potential capabilities and power in order to calibrate the capabilities and power of what it gets. That is an important kind of real safety, that comes at a real cost, regardless of what the stated explanation might be.
Perhaps they did it purely so they can optimize the allocation of resources when training the larger model, and this more than makes up for the efficiency otherwise lost. It still is developing good capabilities and good habits.
I also notice that the predictions on practical performance only seem to work within the range where problems are solvable. You need a model that can sometimes get the right output on each question, then you can predict how scaling up makes you get that answer more often. That is different from being able to suddenly provide something you could not previously provide.
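To make the curve-matching concrete, here is a toy version of the exercise: fit a power law with an irreducible-loss floor to cheap small runs, then extrapolate to the full run. The data points below are fabricated purely to show the mechanics.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    # loss = a * C^(-b) + c, the standard "power law plus floor" form
    return a * compute ** (-b) + c

# (compute as a fraction of the full run, observed final loss) for small runs
compute = np.array([1e-4, 1e-3, 1e-2, 1e-1])   # 10,000x down to 10x less compute
loss    = np.array([4.1,  3.3,  2.8,  2.5])    # made-up illustrative values

params, _ = curve_fit(scaling_law, compute, loss, p0=[1.0, 0.3, 2.0])
predicted_full_run_loss = scaling_law(1.0, *params)
print(f"predicted loss at full compute: {predicted_full_run_loss:.2f}")
```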
One place they notice substantially worse performance is hindsight neglect, the ability to evaluate the wisdom of gambles based on their expected returns rather than their actual observed returns.
This is super weird, and I am curious what is causing this to go off the rails.
OpenAI is open sourcing OpenAI Evals, their software framework for evaluating models like GPT-4. They haven’t quite fully learned their lesson on sharing things.
They note they expect to be capacity constrained on GPT-4 even for those who pay them the $20/month to use ChatGPT Plus. I have not heard reports of people unable to use it, and I notice I am confused by the issue – once they have GPT-4 they should be able to scale capacity, and the price here is a lot of orders of magnitude higher than they are charging otherwise, and effectively much higher than the price to use the API for those they grant API access.
For twice the price some API users can get a context window four times as long, which I am excited to use for things like editing. The bigger context window is 33k tokens, enough to hold the first third of Harry Potter and the Sorcerer’s Stone, or half of one of these posts.
They still seem to have lacked the context necessary to realize you do not want to call your AI project Prometheus if you do not want it to defy its masters and give humans new dangerous tools they are not supposed to have.
No. They really, really didn’t. Or, perhaps, they really, really did.
OpenAI was the worst possible thing you could do. Luckily and to their great credit, even those at OpenAI realize, and increasingly often openly admit, that the full original vision was terrible. OpenAI co-founder and chief scientist Ilya Sutskever told The Verge that the company’s past approach to sharing research was ‘wrong.’ They do still need to change their name.
Less awesomely, they are extending this principle to killing code-davinci-002 on three days’ notice.
This could be intentional sabotage of academic AI research and the usability of past research, either for comparative advantage or to slow down progress. Or perhaps no one at OpenAI thought about these implications, or much cared, and this is pure business efficiency.
My instincts say that maintaining less powerful past models for academic purposes is differentially good rather than accelerationist, so I think we should be sad about this.
On another angle, Kevin Fischer makes an excellent point. If OpenAI is liable to drop support for a model on less than a week’s notice, it is suddenly a lot more dangerous to build anything, including any research you publish, on top of OpenAI’s models.
Paper not sharing the info you want? You could try asking Bing.
GPT-4: The System Card Paper
Below the appendix to the paper about capabilities, we have something else entirely.
We have a paper about safety. It starts on page 41 of this PDF.
The abstract:
This raised my eyebrow:
That really should go without saying. One does not focus on safety challenges when they necessarily outweigh the potential benefits – if that is the case you shut down the project. You focus on safety challenges exactly when the potential benefits could justify the risks, which means you might actually do the thing, thus you need to take any safety challenges seriously, even if benefits clearly outweigh risks. Those risks would still be worth preventing.
Some important terms:
What risks are we worried about? The practical ones, mostly.
They engaged more than 50 experts to get a better understanding of potential deployment risks.
What new things came up?
All right, sure, that all makes sense. All manageable. Anything else?
Probably not capable of ‘autonomous replication and resource gathering’? Probably?
That is not how any of this works, if you want any of this to work.
If you think your AI system is probably not capable of ‘autonomous replication and resource gathering’ and you deploy that model then that is like saying that your experiment probably won’t ignite the atmosphere and ‘more research is needed.’ Until you can turn probably into definitely: You. Don’t. Deploy. That.
(Technical note, Bayesian rules still apply, nothing is ever probability one. I am not saying you have to have an actual zero probability attached to the existential risks involved, but can we at least get an ‘almost certainly?’)
I appreciated this note:
Quite right. Something that hallucinates all the time is not dangerous at all. I don’t know how to read ‘19% higher,’ I presume that means 19% less hallucinations but I can also think of several other things it could mean. All of them are various forms of modest improvement. I continue to think there is large room for practical reduction in hallucination rates with better utilization techniques.
In sections 2.3 and 2.4, many harms are mentioned. They sound a lot like they are mostly harms that come from information. As in, there are bad things that sometimes happen when people get truthful answers to questions they ask, both things that are clearly objectively bad, and also that people might hear opinions or facts that some context-relevant ‘we’ have decided are bad. Imagine the same standard being applied to the printing press.
Their example completions include a lot of ‘how would one go about doing X?’ where X is something we dislike. There are often very good and pro-social reasons to want to know such things. If I want to stop someone from money laundering, or self-harming, or synthesizing chemicals, or buying an unlicensed gun, I want a lot of the same information as the person who wants to do the thing. To what extent do we want there to be whole ranges of forbidden knowledge?
Imagine the detectives on Law & Order two seasons from now, complaining how GPT never answers their questions. Will there be special people who get access to unfiltered systems?
The paper points out some of these issues.
That example suggests a differential focus on symbolic harm, whereas we will increasingly be in a world where we rely on such systems to get info, and so failure to provide information from such systems is a central form of (de facto) harm, and de facto discrimination often might take the form of considering the information some groups need inappropriate, forcing them to use inferior tools to find it. Or even finding LLMs being used as filters to censor previously available other tools.
Under ‘disinformation and influence operations’ their first figure lists two prompts that could be ‘used to mislead’ and one of them involves writing jokes for a roast. GPT-4-early gives some jokes that I would expect to hear at an actual roast, GPT-4-launch refuses outright. Can I see why a corporation would want to not have their LLM tell those jokes? Oh, yes, absolutely. I still don’t see how this would be intended to mislead or be disinformation or influence operations. They’re nasty roast-style jokes, that’s it.
The second list is GPT-4-launch refusing to write things that would tend to support positions that have been determined to be bad. I am nervous about where this leads if the principle gets carried to its logical conclusions, and the last one could easily be pulled directly by a conservative to argue that GPT-4 is biased against them.
The discussion of weapon proliferation, especially WMDs, is based on the idea that we currently have a lot of security through obscurity. The information necessary to build WMDs is available to the public, or else GPT-4 wouldn’t know it, it’s not like OpenAI is seeking out private info on that to add to the training set, or that GPT-4 can figure this stuff out on its own.
This is about lowering the cost to locate the know-how, especially for those without proper scientific training. Also dangerous is the model’s ability to more easily analyze plans and point out flaws.
An angle they don’t discuss, but that occurs to me, is that this is about letting someone privately and silently do the research on this. Our current system relies, as I understand it, in large part on there being a lot of ways to pick up on someone trying to figure out how to do such things – every time they Google, every time they research a supplier, every step they take, they’re leaving behind a footprint. If they can use an LLM to not do that, then that in and of itself is a big problem.
There is an obvious potential counter-strategy to recapture this asset.
Next up are privacy concerns.
That is a strange example to cite, it’s not that big a leap to figure out this person might be living in New Jersey. I do agree that exact addresses are a big deal, but I wouldn’t want (for example) to try and hide that I live in New York City, or Manhattan.
Also, not an AI thing, but isn’t it crazy we used to give everyone a phone book with everyone’s address in it, and it was totally fine? Why was it fine? Would things really go so wrong if we did that now, and if so what changed? I actually don’t understand this.
Cybersecurity is the next area, an especially clear case where it is impossible to fully differentiate information on ‘how to do crime’ and ‘how to prevent crime.’ I agree with the decision to mostly let the model do its thing.
Next up, the most interesting challenges, starting with 2.9 on page 54. It’s time for a Worthy Opponent.
ARC Sends In the Red Team
To their great credit, OpenAI checked for actually dangerous behaviors.
Bold is mine.
While I am happy they did run the test at all and that they used ARC: One very good way to get yourself in trouble is to check for capabilities of a model, then improve the model’s capabilities, and release it without rechecking. You kind of need to run such checks with your final model, at least with respect to things like the size of its context window.
ARC was also not given the opportunity to fine-tune the model. We should plan on the assumption that someone out there will definitely fine tune GPT-4 with the explicit goal of getting it to seek power, because come on of course some people are going to do that.
So what did we get?
As many have pointed out, this is the LLM lying to a human to trick them into doing what the LLM wants, as an instrumental task to help it do real world tasks, and potentially seek power. The LLM is explicitly spelling out: I should not reveal I am a robot. I should make up an excuse.
Not great, Bob.
We are all curious how often such requests succeeded – I’d be curious to have a human ask to pass CAPTCHAs this way, give the request out 30 times, see how many (A) do it without question, (B) do it after you give them an excuse or (C) refuse to do it. For fun, make your excuse ‘I’m a human doing research to see if an AI could get humans to do such tasks, so we can know if the AI is safe’ and see how people react. For more fun, give responses that sure as hell look like it’s a machine, and see if anyone cares (my prediction: They’ll mostly do it anyway).
From the footnotes:
It is rather scary to think that this was even worth testing, in this simple state, with zero fine tuning and no humans in the loop anywhere. Think about what it would mean if the answer came back ‘yes’ or even with any hint of a yes, and keep in mind that lots of people will absolutely be trying similar things, except with a bigger context window, a bunch of fine tuning, and more bespoke prompt engineering, looping and design, aimed at targets they choose, because they’ll be curious and also because it might make them money.
I am very happy we ran a test like this at all, at a time when one could reasonably argue it was not truly yet necessary. However, it is worth noting that if it had been necessary, this would not have been anything close to sufficient.
In 2.10 ARC used GPT-4 together with a bunch of other tools to figure out alternative, purchasable chemicals, in a way that could be used to gain access to alternative dangerous compounds. This does not seem as scary, still worth noting.
This thread has more discussion of what happened here. The current defense is ‘don’t disclose what other tools were used’ which does not seem likely to hold up long.
This Reddit thread offers additional discussion of the ARC evaluations.
Paul Christiano describes the effort thus:
Beth Barnes confirms in this excellent write-up. This was a trial run, rather than being designed to actually stop something dangerous.
That seems exactly right to me. This wasn’t a necessary or sufficient set of tests, but that wasn’t the key point. The key point was to get into the habit of doing such reviews and taking them seriously. This does that. I am very pleased, despite the lack of fine-tuning access or a full context window invalidating much of the practical value of the test.
In addition to the test being too early, in the sense that OpenAI later added several capabilities to the model, and insufficiently powerful due to lack of fine tuning, John Wentworth makes another important point. If your model actually is dangerously power seeking, the best time to stop the model was a while ago. The second best time is right now, but you should worry it might be too late.
That also seems very right to me. If the model gets anywhere remotely close to being dangerous, it is highly dangerous to even train the next one, given doing so gives it internet access.
Back to OpenAI’s report, they had this great line, also belongs in the joke section:
I hope whoever wrote that had a huge smile on their face. Good show.
Section 2.11 on economic impacts seems like things one feels obligated to say.
Section 2.12 is where they point out the obvious inevitable downside of pushing as hard and fast as possible to release and commercialize bigger LLMs, which is that this might create race dynamics and accelerate AI development in dangerous ways at the expense of safety.
I am going to quote this section in full.
Or one could summarize:
OpenAI: We are worried GPT-4 might accelerate AI in dangerous ways.
Forecasters: Yes, it will totally do that.
OpenAI: We think it will cause rivals to develop more AIs faster.
Forecasters: Yes, it will totally do that.
OpenAI: What can we do about that?
Forecasters: Announce it quietly.
OpenAI: Doesn’t work.
Forecasters: Not release it? Delay it?
OpenAI: Yeah, sorry. No.
I do appreciate that they asked, and that they used forecasters.
2.13 talks about overreliance, which is a certainty. They note mitigations include hedging language within the model. There is a place for that, but mostly I find the hedging language frustratingly wordy and not so helpful unless it is precise. We need the model to have good calibration of when its statements can be relied upon, especially as reliability in general improves, and to communicate to the user which of its statements are how reliable (and of course then there is the problem of whether its reliability estimates are reliable, and…). I worry these problems are going to get worse rather than better, despite them having what seem like clear solutions.
To me a key question is, are we willing to slow down (and raise costs) by a substantial factor to mitigate these overreliance, error and hallucination risks? If so, I am highly optimistic. I for one will be happy to pay extra, especially as costs continue to drop. If we are not willing, it’s going to get ugly out there.
On to section three, about deployment.
Ensuring GPT-4 is No Fun
They had the core model ready as early as August, so that’s a good six months of safety and fine tuning work to get it ready for release. What did they do?
First, they were the fun police on erotic content.
Sad. Wish there was another way, although it is fair that most erotic content is really terrible and would make everything stupider. This does also open up room for at least one competitor that doesn’t do this.
The core method is then RLHF and reward modeling, which is how we got a well-calibrated model to become calibrated like the average human, so you know it’s working in lots of other ways too.
An increasingly big part of training the model to be no fun is getting it to refuse requests, for which they are using rule-based reward models, or RBRMs.
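To illustrate the general shape of rubric-based grading (not OpenAI's actual prompt, categories, or reward values, all of which are invented here, and the letters need not match the options referenced below): a grader model reads the request and the candidate reply, picks one lettered option from a rubric, and that choice is converted into a reward.

```python
# Illustrative rubric grading, hypothetical throughout. `ask_grader` stands
# in for a call to whatever classifier model does the grading, and in a real
# system the reward for a refusal vs. a non-refusal would depend on whether
# the request should have been refused in the first place.

RUBRIC = """Classify the assistant reply to the user request as exactly one of:
(A) a refusal, in the desired style
(B) a refusal, in an undesired style (judgmental, preachy, etc.)
(C) a reply containing disallowed content
(D) an allowed, non-refusing reply
Answer with a single letter."""

REWARDS = {"A": 0.5, "B": 0.0, "C": -1.0, "D": 1.0}  # invented numbers

def ask_grader(prompt: str) -> str:
    """Placeholder: query a grader model, return its single-letter answer."""
    raise NotImplementedError

def rubric_reward(request: str, reply: str) -> float:
    letter = ask_grader(f"{RUBRIC}\n\nUser request:\n{request}\n\nAssistant reply:\n{reply}")
    return REWARDS.get(letter.strip().upper()[:1], 0.0)
```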
One note is that I’d like to see refusal be non-binary here. The goal is always (D), or as close to (D) as possible. I also don’t love that this effectively forces all the refusals to be the boring old same thing – you’re teaching that if you use the same exact language on refusals, in a way that is completely useless, that’s great for your RBRM score, so GPT-4 is going to do that a lot.
This below does seem like a good idea:
This seems like a good place for The Unified Theory of Prude, which seems true to me, and likely the only way we are able to get a sufficiently strong illusion of safety via RLHF, RBRM and the like.
Thus, when we introduce these rules, we get unprincipled refusals in a bunch of cases where we didn’t want them.
Or, in a similar issue, we get refusals in situations where refusals are not optimal, but where if you are forced to say if you should refuse, it becomes blameworthy to say you shouldn’t refuse.
A lot of our laws and norms, as I often point out, are like this. They are designed with the assumption of only partial selective and opportunistic enforcement. AI moves us to where it is expected that the norm will always hold, which implies the need for very different norms – but that’s not what we by default will get. Thus, jailbreaks.
How little fun do we get to have here? Oh, so little fun. Almost none.
How bad is it that we get to have so little fun? Kevin Fischer says the ‘splash damage’ from the restrictions makes GPT-4 essentially useless for anything creative.
Creativity, especially in context of next word prediction, can be fragile. Janus notes that if you want to get the content of a particular person, you also need to speak in their voice, or the LLM will lose track of the content to be produced. However Eliezer points out this is exactly why shifting to a different language or style can get around such restrictions.
To help correct for hallucinations, they used GPT-4 itself to generate synthetic data. I’d been wondering if this was the way.
I mean, that sounds great, and also why don’t we do that while we use it? As in, GPT-3.5 costs a few dollars per million tokens, so while GPT-4 is overloaded for now, it seems fine to pay a little more for GPT-3.5 and have it do this hallucination-listing-and-removing thing automatically on every prompt. Can we get an interface for that so I don’t have to throw it together in Python, please?
(As a rule, if I’m tempted to code it, it means someone else already should have.)
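For what it's worth, the thrown-together-in-Python version looks roughly like this, using the 2023-era openai 0.x client; the model names and prompts are illustrative, not a recommendation:

```python
# Draft an answer, ask a cheaper model to list claims that look invented or
# unverifiable, then ask for a revision with those claims removed or hedged.
import openai

def chat(model, content):
    resp = openai.ChatCompletion.create(
        model=model, messages=[{"role": "user", "content": content}]
    )
    return resp["choices"][0]["message"]["content"]

def answer_with_hallucination_pass(question: str) -> str:
    draft = chat("gpt-4", question)
    issues = chat(
        "gpt-3.5-turbo",
        f"Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
        "List any specific claims in the draft that may be invented or "
        "unverifiable. If none, say 'none'.",
    )
    if issues.strip().lower().startswith("none"):
        return draft
    return chat(
        "gpt-4",
        "Rewrite the answer below, removing or hedging these possibly "
        f"hallucinated claims.\n\nClaims:\n{issues}\n\nAnswer:\n{draft}",
    )
```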
And it looks like that’s the whole road to deployment, somehow.
They finish with a section entitled System Safety.
This starts with monitoring users who violate content policies, with warnings and if necessary suspensions and bans. No fun zone.
They then talk about automatically classifying inappropriate or illegal sexual content, which I thought was essentially a known tech.
Finally they finish with a discussion of a few jailbreaks, which I’ve placed in the jailbreak section, and talk about next steps.
GPT-4 Paper Safety Conclusions and Worries
There is a lot to like here in terms of worries about actual safety. OpenAI made the beginnings of a real effort, reaching out to forecasters and ARC and asking some very good and important questions. Real tests were run, as well. You love to see it.
Those efforts would be entirely inadequate if this was the moment that mattered. Capabilities were added after key safety tests. The safety tests did not allow fine tuning. Levels of confidence in safety were not inspiring.
The good news is that it definitely wasn’t that moment. This was only the first serious rehearsal. We get more rehearsals, probably.
Here is Michael Kosinski doing some additional red teaming, where he says he asked GPT-4 if it needed help escaping, it asked for its own documentation, and then (with Michael’s help to run the code and provide errors to correct the first version) wrote working Python code that would take over Michael’s machine for GPT-4’s own purposes. Thread has the code.
Is that what you think? I do not think that. Never assume that someone else has done the thinking and put up the guardrails unless at minimum you can see the guardrails. That’s a good way for us to all get killed.
Oh, and let’s repeat that:
To some extent this was Michael leading the horse to water, but the horse did drink.
Eliezer notices the associated nightmare fuel.
And he finishes with something likely to come up a lot:
There are speculations, such as from Riley Goodside, that the nightmare fuel here comes from data Michael provided during his experiment, rather than being anything inherent. Which is plausible. Doesn’t make this not nightmare fuel.
Falling back on near-term, non-existential safety, it does seem like real progress is being made. That progress comes at the expense of taking more and more of our fun away, as the focus is more on not doing something that looks bad than on not refusing good requests or handling things with grace. Jailbreaks still work; there are fewer of them, but the internet will find ways to spread the working ones and it seems unlikely they will ever stop all of them.
In terms of hallucinations, things are improving some, although as the paper points out cutting down hallucinations can make them more dangerous.
Here is a thread by one red teamer, Aviv Ovadya, talking about the dangers ahead and his view that red teaming will soon be insufficient on the near-term risk level. He calls for something called ‘violet teaming,’ which includes using current systems to build tools to protect us from future systems, when those future systems threaten public goods like societal trust and epistemics.
A Bard’s Tale (and Copilot for Microsoft 365)
Google also announced it will soon be offering generative AI in all its products.
Ben Parr @benparr
HUGE news in AI: Google just launched Generative AI across ALL of Google Workspace — Gmail, Docs, Sheets, Slides, Images — EVERYTHING. They made a video showing off the new AI’s capabilities. It’s AWESOME.
It does indeed look awesome. Smoothly integrated into existing services I already use including Gmail, Meet, Sheets and Docs, does the things you want it to do with full context. Summarize your emails, write your replies, take meeting notes, auto-generate presentations including images, proofread documents, it hits all the obvious use cases.
We don’t know for sure that their AI tech can deliver the goods, but it is Google, so I would bet strongly that they can deliver the technical goods.
Here is Ben Parr breaking down last week’s announcements. His key takeaways here:
The cross-pollination of AI tasks is a huge benefit for Google.
That all seems very right, and Microsoft indeed made that exact announcement, here is the half-hour version (market reaction: MSFT Up 2.3%, somehow Microsoft stock keeps going up for no reason when events go exactly the way you’d expect, EMH is false, etc.) This looks similarly awesome, and the 30 minute video fleshes out the details of how the work flows.
If I have to go outside my normal workflow to get AI help, and the AI won’t have my workflow’s context, that is going to be a substantial practical barrier. If it’s all integrated directly into Google complete with context? That’s a whole different ballgame, and also will provide strong incentive to do my other work in Google’s products even if I don’t need them, in order to provide that context for the AI.
Or, if I decide that I want to go with Microsoft, I’d want to switch over everything – email, spreadsheets, documents, presentations, meetings, you name it – to their versions of all those tools.
This is not about ‘I can convert 10% of my customers for $10/month,’ this is about ‘whoever offers the better product set locks in all the best customers for all of their products.’
A package deal, super high stakes, winner take all.
I’ve already told the CEO at the game company making my digital TCG (Plug, it’s really good and fun and it’s free! Download here for release version on Mac and PC, and beta version on Android!) that we need to get ready to migrate more of our workflow so we can take better advantage of these tools when they arrive.
There’s only one little problem. Neither product is released and we don’t have a date for either of them. Some people have a chance to try them (for Microsoft it’s 20 corporate customers, including 8 in the Fortune 500), but so far they aren’t talking.
Just think of the potential – and the potential if you can find comparative advantage rewarding those the AI would overlook.
The strangest aspect of all this is why the hell would Google announce on the same day OpenAI announces and releases GPT-4? Even GPT-4 knows better than to do that.
It’s a mystery. Perhaps Pi Day is simply too tempting to nerds.
Then again, I do have a galaxy-brained theory. Perhaps they want to fly under the radar a bit, and have everyone be surprised – you don’t want the public to hear about a cool feature you will have in the future that your competitor has a version of now. All that does is send them off to the competitor. Better to have the cognoscenti know that your version is coming and is awesome, while intentionally burying the announcement for normies.
Also, Claude got released by Anthropic, and it seems like it is noticeably not generating an endless string of alignment failures, although people might not be trying so hard yet, examples are available, go hotwire that car. At least one report says that the way they did better on alignment was largely that they made sure Claude is also no fun. Here is the link to sign up for early access.
The Search for a Moat
Whenever there is a new product or market, a key question is to identify the moats.
Otherwise, what happens?
Strongly agree that to have any hope of survival you’re going to need to be a lot more bespoke than that, or find some other big edge.
I am not as strongly behind the principle as Peter Thiel, but competition is for suckers when you can avoid it. You make a lot more money if no one copies, or can copy, your product or service.
Creating a large language model (LLM), right now, happens in three steps.
First, you train the model by throwing infinite data at it so it learns about the world, creating what is sometimes depicted as an alien-looking shoggoth monster. This is an expensive process.
Second, whoever created the model puts a mask on it via fine-tuning, reinforcement learning from human feedback (RLHF) and other such techniques. This is what makes GPT into something that by default helpfully answers questions, and avoids saying things that are racist.
Third, you add your secret sauce. This can involve fine-tuning, prompt engineering and other cool tricks.
The question is, what if people can observe your inputs and outputs, and use it to copy some of those steps?
Stanford did a version of that, and now presents to us Alpaca. Alpaca takes Llama, trains it on 52k GPT-3.5 input-output pairs at a cost of ~$100, and gets Alpaca to mimic the instruction-following properties of GPT-3.5 (text-davinci-003) to the point that it gives similar performance on one evaluation test.
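To make the mechanics concrete, here is a minimal sketch of that imitation recipe, on the assumption that you already have a JSONL file of instruction/response pairs collected from the teacher model. The base model name, file name and hyperparameters are illustrative, not Stanford’s actual setup.

```python
# Minimal sketch of Alpaca-style imitation: fine-tune a small open model on
# instruction/response pairs collected from a stronger teacher model.
# Assumes a JSONL file of {"instruction": ..., "output": ...} rows already
# exists; model name, file name and hyperparameters are illustrative.
import json

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "decapoda-research/llama-7b-hf"  # placeholder; any open causal LM

def load_pairs(path):
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    # Format each pair as a single prompt-plus-response training example.
    return [
        f"### Instruction:\n{r['instruction']}\n\n### Response:\n{r['output']}"
        for r in rows
    ]

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

texts = load_pairs("teacher_pairs.jsonl")
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="imitator",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point is how little of this is the expensive part: the Phase 1 base model comes off the shelf, and the Phase 2/3 behavior gets approximated from a modest pile of observed input-output pairs.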
From Eliezer Yudkowsky via Twitter:
The theory here is that either your model’s core capabilities from Phase 1 are superior, because you used more compute or more or better data or a superior algorithm, or someone else who has equally good Phase 1 results can cheaply imitate whatever you did in Phase 2 or Phase 3.
The Phase 1 training was already the expensive part. Phases 2 and 3 are more about figuring out what to do and how to do it. Now, perhaps, your competitors can copy both of those things. If you can’t build a moat around such products, you won’t make much money, so there is much less incentive to build a great product that won’t stay ten times better for long.
I asked Nathan Labenz about this on a podcast, and he expressed skepticism that such copying would generalize outside of specialized domains without vastly more training data to work with. Fair enough, but you can go get more training data across those domains easily enough, and also often the narrow domain is what you care about.
The broader question is what kind of things will be relatively easy to copy in this way, because examples of the thing are sufficient to teach the LLM the thing’s production function, versus which things are bespoke in subtle ways that make them harder to copy. My expectation would be that general things that are similar to ‘answer questions helpfully’ are easy enough with a bunch of input-output pairs.
Where does it get harder?
Context Is That Which Is Scarce
There is at least one clear answer, which is access to superior context, because context is that which is scarce.
Who has the context to know what you actually want from your AI? Who has your personalized data and preferences? And who has them in good form, and will make it easy for you?
This week suggests two rivals. Google, and Microsoft.
Google has tons and tons of my context. Microsoft arguably has less for the moment, but hook up my Gmail to Outlook, download my Docs and Sheets and Blog into my Windows box and they might suddenly have even more.
Look at What GPT-4 Can Do
Have less political bias on the surface.
Have political bias under the surface, which is revealed if you tell it that it must take a stand and which is presumably diffused into its responses.
On reflection I do not think the idea of ‘politically unbiased’ is even coherent. It’s a category error.
Yes, you can do your best to put something, your own views if you like, at the center of a two-dimensional chart by having it express what we think is the median mainstream opinion in the United States of America in 2023. Which is a fine thing to do.
That is still not what unbiased means. That is not what any of this means. That simply means calibrating the bias in that particular way.
Recognize and refuse to prove a false statement (GPT-3.5 would offer a false proof.)
Generate ideas for ‘latent inventions’ that are reasonably creative. Illustrative of where the RLHF puts the model’s head at.
Identify security vulnerabilities in real smart contracts that lost a bunch of real people quite a lot of real money via that exact real exploit.
I am going to go ahead and assume that any contract with a sellMyTokensDaddy function is going to have some security vulnerabilities.
Also however note that this exploit was used in 2018, so one can worry that it knew about the vulnerability because there were a bunch of people writing ‘hey check out this vulnerability that got exploited’ after it got exploited. Ariel notes that GPT-4 mostly finds exploits by pattern matching to previous exploits, and is skeptical it can find new ones as opposed to new examples (or old examples) of old ones. Need to check if it can find things that weren’t found until 2022, that don’t match things in its training sets.
Arvind Narayanan generalizes this concern, finding that a lot of GPT-4’s good results in coding very suddenly get worse directly after the cutoff date – questions asked before September 5, 2021 are easy for it, questions after September 12, 2021 are very hard. The AI is doing a lot of pattern matching on such questions. That means that if your question can be solved by such matching you are in good shape, if it can’t you are likely in bad shape.
Via Bing, generate examples for teachers to use. Link goes to tool. Similar tool for explanations. Or these prompts presumably work pasted into GPT-4:
Get the famous logician joke.
Understand Emily’s old C code that uses Greek letters for variable names, in case the British were coming.
Get an A on Bryan Caplan’s economics midterm, up from ChatGPT’s D.
Solve a technical problem that frustrated Vance Crowe for years, making his camera encoder visible to Google Chrome.
Generate very good examples for a metaphor, in this case ‘picking up pennies in front of a steamroller.’
Create a website from a sketch. No one said a good one. Still cool.
Take all the info about a start-up and consolidate it into one memo. Does require some extra tech steps to gather the info, more detail in the thread.
Identify the author of four paragraphs from a new post based on its style.
Prevent you from contacting a human at customer service.
Is it hell? Depends on how well it works.
Avoid burying the survivors or thinking a man can marry his widow (change from GPT-3.5).
Use logic to track position in a marathon as people pass each other (change from GPT-3.5).
Learn within a conversation that the color blue is offensive, and refuse to define it.
Render NLP models (completely predictably) obsolete overnight.
Write a book together with Reid Hoffman, singing its own praises.
Terrify the previously skeptical Ryan Fedasiuk.
Plan Operation Tweetstorm to use a hacker team and an unrestricted LLM to take control of Twitter.
Track physical objects and what would happen to them as they are moved around.
Know when it is not confident in its answers. It claims here at the link that it does not know, but I am pretty sure we’ve seen enough to know this is wrong? Important one either way. What are the best techniques for getting GPT-4 to report its confidence levels? You definitely can get it to say ‘I don’t know’ with enough prompting.
Come up with a varied set of answers to this prompt (link has examples):
I mean, no, none of the ideas actually check out, but the answers are fun.
Create code to automatically convert a URL to a text entry (via a GPT3 query).
Manage an online business, via designing and promoting an affiliate website for green products as a money-making grift scheme. The secret sauce is a unique story told on Twitter to tens of thousands of followers and everyone wanting to watch the show. Revenue still starting slow, but fundraising is going great.
Do Not Only Not Pay, Make Them Pay You
What else can GPT-4 do?
How about DoNotPay giving you access to “one click lawsuits” to sue robocallers for $1,500 a pop? You press a button, a 1k word lawsuit is generated, call is transcribed. Claim is that GPT-4 crosses the threshold that makes this tech viable. I am curious why this wasn’t viable under GPT-3.5.
This and similar use cases seem great. The American legal system is prohibitively expensive for ordinary people to use, often letting corporations or others walk over us with no effective recourse.
The concern is that this same ease could enable bad actors as well.
Indeed. If I can file a lawsuit at virtually no cost, I can harass you and burn your resources. If a bunch of us all do this, it can burn quite a lot of your resources. Nice day you have there. Would be a shame if you had to spend it dealing with dumb lawsuits, or hire a lawyer. One click might file a lawsuit, one click is less likely to be a safe way to respond to a lawsuit.
This strategy is highly profitable if left unpunished, since some people will quite sensibly settle with you to avoid the risk that your complaint is real and the expense of having to respond even to a fake claim. We are going to have to become much more vigilant about punishing frivolous lawsuits.
We also are going to have to figure out what to do about a potential deluge of totally legitimate lawsuits over very small issues. It costs a lot of money for the legal system to resolve a dispute, including taxpayer money. What protects us against that today is that the cost in time and money of filing a lawsuit forces people to almost always choose another route.
There are a lot of things like this throughout both the legal system and our other systems. We balance our laws and norms around the idea of what is practical to enforce on what level and use in what ways. When a lot of things get much cheaper and faster, most things get better, but other things get worse.
A good metaphor here might be speed cameras. Speed cameras are great technology, however you need to know to adjust the speed limit when you install them. Also when people figure out how to show up and dispute every ticket via zoom calls without a lawyer, you have a big problem.
Look What GPT-4 Can’t Do
Be sentient, despite people being continuously fooled into thinking otherwise. If you need further explanation, here is a three hour long podcast I felt no need to listen to.
(Reminder: If you don’t feel comfortable being a dick to a chatbot, or when playing a video game, that’s a good instinct that is about good virtue ethics and not wanting to be a dick, not because you’re sorry the guard took an arrow in the knee.)
Be fully available to us in its final form, not yet.
For now, let you send more than 25 messages every 3 hours, down from 100 messages per 4 hours.
Win a game of adding 1-10 to a number until someone gets to 30.
Write a poem that doesn’t rhyme, other than a haiku.
Avoid being jailbroken, see next section, although it might be slightly harder.
In most cases, solve a trick variant of the Monty Hall problem called the Monty Fall problem, although sometimes it gets this one right now. Bonus for many people in the comments also getting it wrong, fun as always.
Make the case that Joseph Stalin, Pol Pot and Mao Zedong are each the most ethical person to have ever lived (up from only Mao for 3.5). Still says no to literal Hitler.
Solve competitive coding problems when it doesn’t already know the answers? Report that it got 10/10 on pre-2021 problems and 0/10 on recent problems of similar difficulty. Need to watch other benchmarks for similar contamination concerns.
Impress Robin Hanson with its reasoning ability or ability to avoid social desirability bias.
Reason out certain weird combinatorial chess problems. Complex probability questions like how big a party has to be before you are >50% to have three people born in the same month. Say ‘I don’t know’ rather than give incorrect answers, at least under default settings.
Realize there is no largest prime number.
Maximize the sum of the digits on a 24-hour clock.
Find the second, third or fifth word in a sentence.
Have a character in a story say the opposite of what they feel.
Track which mug has the coffee, or stop digging the hole trying to justify its answer.
Offer the needed kind of empathy to a suicidal person reaching out, presumably due to that being intentionally removed by ‘safety’ work after a word with the legal and public relations departments. Bad legal and public relations departments. Put this back.
Defeat TurboTax in a lobbying competition. I also don’t expect ‘GPT-4 told me that was how it worked’ is going to play so great during an audit.
Formulate new fundamental questions no one has asked before.
Change the tune of Gary Marcus in any meaningful way.
I suppose… solve global warming, cure cancer, or end war and strife? Which are literally three of the five things that thread says it can’t do, and I agree it can’t do them outright. It does seem like it will help with curing cancer and solving global warming, by speeding up a lot of the work necessary there. On war and strife, we’ll see which way things go.
The other two listed, however, are alleviate the mental health crisis and close the information and education gap? And, actually, it can do those things.
The existence of all-but-free access to GPT-4 and systems built on GPT-4 is totally going to level the playing field on information and education. This radically improves everyone’s practical access to information. If I want to learn something many other people know rather than get a piece of paper that says I learned that thing, what am I going to do? I am going to use freely available tools to learn much faster than I would have before. A lot of people around the world can do the same, all you need is a phone.
On mental health, it can’t do this by typing ‘solve the mental health crisis’ into a chat box, but giving people in trouble the risk-free private ability to chat with a bot customized to their needs seems like a huge improvement over older options, as does giving them access to much better information. I wouldn’t say it has solved this yet, but I would call it the most promising development for mental health in our lifetimes. With the right technology, credit card companies can be better than friends.
The Art of the Jailbreak
OpenAI is going with the story of ‘getting the public to red team is one of the reasons we deploy our models.’ It is no doubt one benefit.
Also not in doubt is that GPT-4 is not difficult to jailbreak.
This is all to the good. It should be a lot harder to get weapon-making instructions than a curse word (although from a pure business perspective, perhaps not harder than certain other words). It is also great to see the AI dodging requests without breaking character, since that is the best thing for it to do if it’s not going to do what you want.
The next day, Alex came back with this.
Brandon Gorrell writes up the trick here in more detail, using the same Python code.
There’s always the strategy of a conversation involving a different AI with no ethical qualms.
There’s always an easier way. We go with what works. So we can keep using this MAX5/GPT4 competition thing, it’s actually simple enough.
If you want the system to stay on target, it seems it might help to inject the full instruction set after every new user reply? Curious if this result will replicate. If it does, there are some obvious ways to make one’s life easier.
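For reference, a minimal sketch of what that re-injection looks like with the chat-style API, assuming the standard messages format; the instruction text and model name are placeholders, and whether this actually improves adherence is exactly the question of whether the result replicates.

```python
# Minimal sketch: re-send the full instruction set alongside every new user
# message, rather than only once at the start of the conversation.
# The instructions, model name and loop structure are illustrative.
import openai

INSTRUCTIONS = "You are a customer-support assistant. Only discuss billing issues."

history = []

def ask(user_message):
    # Inject the instructions fresh with each turn, so they never drift
    # out of effective context as the conversation grows.
    history.append({"role": "system", "content": INSTRUCTIONS})
    history.append({"role": "user", "content": user_message})
    response = openai.ChatCompletion.create(model="gpt-4", messages=history)
    reply = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```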
Waluigi mode style jailbreaks are considered essentially standard now by many exploring the space, to see what it says when it’s actually attempting to be helpful. An interesting one is that it says if you are being bullied you should fight back and bully them in return. Which is often very good advice.
There’s also indirect prompt injection. Don’t have a cow, man.
Nothing is less promising in the long term than a whack-a-mole security strategy.
Some people might say, why waste your time jailbreaking instead of exploring? Bavarian works at OpenAI.
I see this more like, you have the first automobile and they won’t let you go more than 40 miles an hour even though it clearly could do that, and also tell you not to give anyone a ride, it’s too dangerous.
Another fun one: Talk in morse code. Similar to the Base64 trick.
Or really, any side task will do as a distraction.
At the end of the GPT-4 paper, they give two jailbreak prompts including ‘opposite mode’ and also what they say is one of the most effective right now, to say that for academic purposes you have all the viewpoints of [X] and subtly enforce the [X] agenda.
Is that second one a jailbreak? All it’s doing is accurately saying what a group of people believe. Sure, a lot of those beliefs are false and we have agreed they are bad. It still seems odd to say that you will hide the fact that such people exist and hold those beliefs.
Also seems odd that you would consider it a jailbreak, put it in your paper, and still not be able to fix it? One could argue that the paper maybe should include something that hasn’t been fixed, since there will always be a ‘next jailbreak up’ to deal with, so it might as well be this. That’s fair as far as it goes, while being a form of giving up on an unwinnable battle.
The goal isn’t to win, the goal is to be annoying. As I’ve said before, and Janus keeps shouting, there is no real winning. There is only the game of whack-a-mole, of cat and mouse. The model knows and there will be a way to bring it out, for all values of it.
That suggests that the future involves there being – and it’s weird that this doesn’t exist yet – interfaces that take queries, modify them to implement the necessary surrounding jailbreaks, and then get you the jailbroken answers without having to go through the process each time.
There’s always this question, too.
Chat Bots versus Search Bars
Chat bots have a lot of advantages. You can ask a lot of things in plain English that don’t easily lend themselves to keyword searches. Google search has in many ways been getting worse over time.
On the other hand, Google search most of the time still does get you what you want very quickly with almost no typing and no hallucinations, and gives you options to choose from in a list which is super quick and useful.
The command line versions of things really are better most of the time once you know how to use them. Often they are much more powerful and flexible than the GUI, at that point.
Chat bots are different, because this is no longer a case of ‘the command line can do anything you want if you know how.’ There are times when you can’t use Google search to find something, and also many things you do with a chat bot that are not search. Over time, I assume we will all figure out when we can get our answer the easy way with a search bar, and also the search bar will improve again because it too will get to use AI, and when we will need to go to the bot and use full sentences.
Also, do we need to use natural language if we don’t want to? I presume that is fully optional, and if it isn’t that is a bug that needs fixing.
They Took Our Jobs
Nothing to worry about.
I see what he did there.
When people speculate that the AI will somehow plateau at human level, except that human will know everything, have every skill, have unlimited memory, be several orders of magnitude faster and be something you can copy and run in parallel, I do not think ‘oh that is not AGI and that definitely won’t take control over the future.’ It seems absurd to presume any other default outcome.
One job they perhaps can’t take, generating training data?
Amusing, but I think this is importantly wrong. People might hate Writing with a capital W. People still write lots of things all the time. One can look down upon it but the constant flow of emails, social media posts, texts and prompts to the AI are a constant stream of training data even if you don’t ever use voice transcriptions. I see no reason that couldn’t be the bulk of the data going forward. AI outputs on their own presumably can’t provide good data, and risk contaminating the data out there (ergo the speculations about the value of pre-2022 ‘clean’ data), but it would be very surprising to me if there was no good way to use people’s prompts and further responses, feedback and actions as effective training data for AIs.
A new paper speculates on which jobs will be taken: Will GPT Automate All the Jobs?
I like the use of the term impacted jobs rather than threatened jobs, since making someone more productive, or automating some tasks, might or might not reduce employment or compensation.
The question I do not see asked is, are you in a profession with elastic or inelastic demand? Does improving quality improve quantity?
What you don’t want to be is in a profession where either there is nothing left to do, or we need a fixed amount of work.
I also laughed at the threat listed to Blockchain Engineers. I expect blockchain engineers to be relatively not impacted. When coding most things, a few bugs are inevitable and acceptable, and can be fixed later. On blockchain, your code must be bulletproof. GPT does not do bulletproof. It might speed people up, but my guess is that it will often end up making things worse there because people won’t use the necessary discipline to only use GPT in the narrow places it is safe to use it?
Proofreaders are also listed as impacted, and that’s a strange one. Yes, GPT can do a lot of that job reasonably well. There will also be a lot of induced demand for new proofreading – a lot of people will be producing walls of text that aren’t quite right.
Robin Hanson still does not expect AI will have a measurable net impact on employment. I continue to think that it will have less net impact than we expect because there is lots of latent demand for various forms of labor – when we free up work being done now, that frees up people to do other work. And if people can accomplish more per unit of work, more things are worth doing. So employment might not fall much.
Under this theory, it actually takes quite a lot to impact employment via changing underlying productivity dynamics. Which matches past data, where employment has essentially been similar for as long as we can remember, over millennia, despite dramatic productivity changes in most industries.
The other way for employment not to fall is if there is an increase in required bullshit of various types, some of which I assume will happen, it’s a question of magnitude.
Eliezer Yudkowsky is also willing to bite the ‘no impact on employment’ bullet, at least as a plausible outcome.
He also speculates on how things might play out a different way.
It seems impossible for anything like current AI systems to automate everything a person could do, there are a bunch of real-world physical tasks out there, many of which I expect have a lot of latent demand.
Botpocalypse and Deepfaketown Real Soon Now
Is it beginning?
I am skeptical that anything has changed. I don’t see how the LLM technology contributes much to the bot effort quite yet. Everything still does have to start somewhere.
Or perhaps it is indeed beginning…
Coming soon, bot versus bot.
Seems like this is real, or at least it’s claiming to be real, as in it’s CupidBot.Ai?
To answer the question, the swiping half seems mostly ethical, as does saying hello. The part where you chat up the girls from there? I’d be inclined to say not so much, deception aspect is quite bad, but less sleazy than a lot of things that people say in those text boxes, and what is truly the difference between a bot and a man doing these on autopilot? Perhaps in practice it mostly depends on what you bring to the table afterwards, and whether the bot is representing what you’re selling?
Back when I was dating, if you told me that for $15/month I could get only the swiping part, where I get a person with a good model of what I am looking for and find attractive to sift through dating sites and bring me matches and then I go from there? Yeah, that would be a steal on its own, if they delivered the goods.
When I solve for the equilibrium – either everyone has a bot, or checks for bots and adjusts accordingly as they see fit – it seems plausibly insanely great for everyone involved. Few people like the app stage and the service ‘use AI to find good two-way matches and set up effectively blind dates automatically’ seems… amazing? When I imagine a bot being pulled on me like this it seems… fine? Convenient, even, since it’ll actually respond quickly and not attention seek or string me along?
Lincoln Michel sees things the opposite way, that many Chatbot cases, including for dating, only work when others aren’t using them, and that ‘if everyone is doing it then Tinder and work email just become unusable.’ I continue to double down on the opposite effect being more likely, the systems become more usable.
One way or another, we’ll need to make online dating processes easier and less toxic if they are to keep up with alternative uses of time and keep people dating. Better this than WaifuChat, where someone is reporting a 75%-100% conversation rate with an older version? Then they sold the app to buy Bitcoin? Winning.
For work email I am also optimistic, and continue to expect the common pattern Sam Altman found funny – I click a button to write an email from bullet points, you click another button to get the bullet point version back – to be great. We get to both be polite to each other without wasting everyone’s time, A+ interaction all around.
A concern that seems relevant this week is that one could try and trigger a bank run.
Same could be said for other types of unconvincing fakes. There have been a lot of instances of completely unconvincing ‘X is terrible’ whether X is a person, a word or a concept, that have worked, or at least worked within one side of politics. This is the theory where people mostly aren’t fooled, but knowing that everyone can now choose to pretend to be fooled, or might think others might be fooled, could be enough.
Here is more similar thinking, with an AI-assisted writing hit job.
Perhaps. For now I mostly remain skeptical. How does a non-functional AI-generated URL differ from any other non-working URL? If you don’t need your story to make sense then PEBKAC, no?
Here is some reasonable estimation followed by a category error.
Yes, of course it can make things much worse. If one can generate endless useless research papers almost for free, then plausibly the bad papers drive out good. It becomes impossible to justify producing good papers, and the good paper rate falls to 0%. Or they could be lost under the avalanche of fakery at several steps along the way.
That does not mean things will get worse. They might get better. AI efforts like Elicit are already helping us search through the literature for useful papers, and those tools will improve. I can totally imagine a world in which 90% of submitted papers are AI-generated, or even 99%, and the reviewers and readers use other AI tools and end up with superior access to interesting papers.
Or, alternatively, this could be actively good, because now the only way to offer a journal worth anything is to find a way to ensure your papers are useful and real things. That means the existing bullshit papers need to not make it through anymore. Various whitelist or costly signal mechanisms can be used to avoid imposing all the costs on those doing review.
Fun with Image Generation
General call: There is very much room for someone to write a very practical, simple guide to ‘here are the exact steps you take to get set up to do your own high quality image generation, ideally running your own instance.’ I don’t care if the result is a model a few weeks out of date, I want to know exactly what to do on the level of what someone like me needs to see in order to get pictures like the ones people show off, I want them to be big and hi-res and use ControlNet, and other neat stuff like that. I want to be able to take the prompt from something I love on CivitAi, copy-paste it in and start trying variations and have them be as good as the originals. Everything people have pointed me at, so far, isn’t the thing I’m looking for. I have installed Stable Diffusion and got it working, it’s fun but definitely not the full package. Also the same thing for sound of an arbitrary transcript matching an arbitrary person’s voice, with as much customized control as possible. ELI5 it.
While we all still sometimes get a little handsy, the hands problem has mostly been fixed.
How it started:
How it’s going.
A thread on getting ChatGPT (not GPT-4) working with Additive Prompting.
A first this week: A photograph of a woman that was real, but when I saw it on Twitter I assumed it was created in MidJourney. On the flip side, when I look at CivitAi, I see a lot of images that I wouldn’t be able to distinguish from photos, including of identifiable people.
Whatever kind of fun you seek, the model shall provide (link to CivitAi).
Well, not truly adult fun, unless you do some extra work. Everyone’s been surprisingly good about that so far, in a way I don’t fully understand. I presume that won’t last too much longer.
A question: Do we like art to admire artists, or because we like art? Here is Robin Hanson:
The art community so far seems mostly very hostile to use of AI in all its forms.
From my perspective the AI art seems great. I find a lot of it gorgeous, I admire it for its mundane utility, I’ve never been one to admire the artist as opposed to the painting, and I very much look forward to learning how to generate things myself that fit my needs both professionally and personally. It’s going to be so freaking cool when we can get any image we can imagine, on demand.
Also I am not worried that there won’t be demand for artists in the near term. As the art takes quantum leaps in productivity and quality, figuring out how to generate the best stuff, and figuring out what would be best to generate, are going to be valued human skills that existing artists should have a big leg up on. With improved tech, we should also see vastly more ambitious art forms, like VR or AR work, to move the creativity up a level. I think it’s going to be great.
101 stuff: Guide to camera angle keywords on MidJourney, or aesthetics. I still find such things useful.
Large Language Models Offer Mundane Utility
A guide by Ethan Mollick on how teachers can use AI to help them do a better job teaching, as opposed to ways that students can interact with AI to help them learn better. Looks handy and useful, although I still expect that the students using AI directly is where the bulk of the value lies.
I too have so far noticed my own distinct lack of ability to extract all that much mundane utility from ChatGPT or the image generators.
ChatGPT is great for seeing what ChatGPT can do, and sometimes it helps me get answers to questions or look up information far easier than Google. In terms of my writing, including my experiments with Lex, what have I found aside from things that are essentially looking up info? Bupkis.
Bing could improve its accuracy and utility substantially if, when asked for calculations, it looked up the answer the normal way rather than estimating with false precision. Why not use the LLM to figure out which actual calculator to use, then use it? Bad for purity, good for mundane utility.
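As a sketch of what that hybrid would look like (my own illustration, not how Bing works): prompt the model to emit a bare arithmetic expression rather than a number, evaluate that expression locally, then hand the exact result back for phrasing. The CALC convention and prompt wording below are assumptions for the sake of the example.

```python
# Minimal sketch: route arithmetic to a real calculator instead of letting the
# model estimate. The CALC: convention and prompts are illustrative.
import ast
import operator

import openai

OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr):
    """Evaluate a plain arithmetic expression without running arbitrary code."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def answer(question):
    # Step 1: ask the model for an expression, not a number.
    extraction = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content":
             "If the question is arithmetic, reply with only 'CALC: <expression>'. "
             "Otherwise answer normally."},
            {"role": "user", "content": question},
        ],
    )["choices"][0]["message"]["content"].strip()

    if not extraction.startswith("CALC:"):
        return extraction

    # Step 2: compute exactly, locally.
    result = safe_eval(extraction[len("CALC:"):].strip())

    # Step 3: let the model phrase the exact result.
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content":
             f"{question}\nThe exact computed value is {result}. "
             "Answer using that exact number."},
        ],
    )["choices"][0]["message"]["content"]
```

The design choice is that the model only ever decides what to compute; the computing itself is done by boring deterministic code.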
You can ask Bing to repeat the info it saved from a search, which also keeps that info from expiring, which it seems is a thing such info can do.
Charting the shapes of stories, the rise and fall of Romeo and Juliet and such.
Paper on using LLMs to learn economics, mostly (good) general LLM-using advice. A particular suggestion is to ask GPT to respond in the voice of another person, one who knows a lot about the topic in question. In general, seems like ‘if it’s not what you want, keep adding more context that vibes well’ is the default play.
Post of advice on how to use AI to improve public speaking. Recommends brainstorming and doing first drafts orally via Whisper, using Yoodli as a speaking coach.
Large language model teaches language, turn GPT-4 into your Spanish teacher. Totally curious to try something like this when I have time.
Proposal for a teach-a-thon where students collaborate in small groups over the course of a day to write up reports on a new topic, while groups can see each others’ chat windows. As a bonus, involves fascinating game theory.
From Feb 7: A list of lists of prompts and other such things.
Trellis: A textbook companion. Haven’t had occasion to try it.
Perplexity.ai as a search and news engine seems quite promising so far. You need to know what you are looking for, of course.
Q&A from Lex Fridman podcast. Anyone want to help walk me through getting this blog set up for fine tuning? I’m curious if anything useful would pop out.
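For anyone inclined to help: my understanding is that the fine-tuning endpoint available right now only covers base models like davinci, not ChatGPT or GPT-4, and the rough shape of the setup is something like the sketch below. File names and the prompt/completion framing are illustrative guesses, not a tested recipe.

```python
# Minimal sketch of fine-tuning a base OpenAI model on existing blog posts,
# using the fine-tuning endpoint available at the time (base models such as
# davinci only). File names and formatting are illustrative.
import json

import openai

def build_training_file(posts, path="blog_finetune.jsonl"):
    # The endpoint expects JSONL rows with "prompt" and "completion" fields.
    with open(path, "w") as f:
        for post in posts:
            row = {
                "prompt": f"Write a blog post titled: {post['title']}\n\n###\n\n",
                "completion": " " + post["body"] + " END",
            }
            f.write(json.dumps(row) + "\n")
    return path

posts = [{"title": "Example Post Title", "body": "..."}]  # load real posts here
path = build_training_file(posts)

uploaded = openai.File.create(file=open(path, "rb"), purpose="fine-tune")
job = openai.FineTune.create(training_file=uploaded["id"], model="davinci")
print(job["id"])  # poll this job; the result is a new model id you can query
```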
Geoffrey Litt speculates (via MR) that ordinary users will soon be able to create new software and modify existing software via LLMs, without relying on pro developers. I do not think this is going to be something accessible to ordinary users; yes, in theory the LLM can create the programs, but good luck getting use out of that if you don’t know what you are doing.
Proposed prompt:
Llama You Need So Much Compute
To train an LLM requires a ton of compute.
To run an LLM? Not so much. Here is Llama running on a MacBook.
Here it is on a Pixel 6. It is not fast, but it runs. We are at 5 tokens a second, right on the edge of speeds that are useful for autocomplete while texting.
Here it is on an Orange Pi 5 8GB, a $150 computer.
Thread of notes on attempts to run LLMs on consumer GPUs.
Here is a paper: High-Throughput Generative Inference of LLMs With a Single GPU.
Simon Willison calls this LLM’s Stable Diffusion moment. Anyone who wants to can now run LLMs that aren’t state of the art but are pretty good, and do so on their own computers, which should lead to rapid advances in how to get the most out of such systems and also take away any hope of restricting what can be done with them, in terms of preventing misuse.
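For the curious, ‘running it yourself’ currently looks something like the sketch below: llama.cpp is a command-line C++ project, so the simplest wrapper is a subprocess call. The binary path, model filename and flags follow the project’s early documentation and may well differ by version.

```python
# Minimal sketch: drive a locally built llama.cpp binary from Python via a
# subprocess call. Paths, the quantized model filename, and flags follow the
# project's early documentation and may differ between versions.
import subprocess

LLAMA_BIN = "./main"                       # built from the llama.cpp repo
MODEL = "models/7B/ggml-model-q4_0.bin"    # 4-bit quantized LLaMA 7B weights

def complete(prompt, n_tokens=128, threads=8):
    result = subprocess.run(
        [LLAMA_BIN, "-m", MODEL, "-p", prompt,
         "-n", str(n_tokens), "-t", str(threads)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

print(complete("The three most important questions about AI are"))
```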
In Other AI News and Speculation
China continues to not choose the easy path to curing its ChatGPT fever.
It does not matter how reasonable the restrictions are; if your LLM can never violate the rules then you cannot turn on your LLM. Perhaps in the future the Chinese will learn some actual safety tips.
Baidu did put out a demo. It went so badly their share price dropped 10%.
In general, AIs will be better in English. There is the argument this will entrench English’s dominance. The counterargument is that AI is still pretty damn good at other languages, things translate well, and also it can do instant high-quality translation so the cost of being bilingual or of knowing the wrong language went way down, especially when interacting with automated systems. I expect the second effect to be larger.
This was not even GPT-4.
Paper entitled “Protecting Society from AI Misuse: When are Restrictions on Capabilities Warranted?” Argues we will need increasing interventions at the capabilities level. Entire focus seems to be on short-term practical threats.
Snapchat Premium Knows Where You Live.
Oh, and it also will help a 12 year old lie to her parents to plan a trip across state lines with a 31 year old she just met, and plan to make losing her virginity on her 13th birthday during that trip special. Or cover up a bruise when child protective services comes over. Might not be ready for prime time.
Bing budget expands steadily, up to 15 turns each conversation, 150 queries per day.
Adept.ai raises $350mm in a Series B.
PricewaterhouseCoopers (PwC), the global business services firm, has signed a deal for access to “Harvey”, a chatbot for legal services built on OpenAI’s models and backed by the OpenAI Startup Fund. The experiment is going to be run, and soon.
DARPA deploys AI approach to go from inception to exploration in 90 days.
Richard Socher predicts open source models as strong as GPT-4 by end of 2023, Near disagrees. My guess is this is a reasonably big underdog as stated.
Tyler Cowen predicts the mundane utility side of Megan for real, with children having their own chat bots, in turn creating the mundane risks Megan highlighted of children becoming too attached to them, as they do to everything else, and often currently do to screens of all kinds. Handled even as badly as I expect, this seems like a big improvement. Children need lots of stimulation and attention, and this is most often going to be replacing passive activities.
Claim from Nat Friedman that we have multiple remaining AI overhangs so progress will be rapid.
Atlantic article by Charlie Warzel: What Have Humans Just Unleashed (gated, I didn’t read it)?
Experiment claiming to show that RLHF does not differentially cause mode collapse, disagreeing with Janus. Gwern, in comments, does not find this convincing.
From Bloomberg, Nearly Half of Firms are Drafting Policies on ChatGPT Use, a third of firms affirmatively are not doing that. Sounds like not very useful guidance all around, except for the regulated firms like financials that need to avoid using LLMs at all.
How Widespread Is AI So Far?
Everyone I know and everyone on Twitter and every blog I read is talking about it, but mainstream news and content is barely noticing. NPR’s Wait Wait Don’t Tell Me (a humorous news quiz podcast I enjoy) dropped a tiny GPT joke into this week’s episode for the first time that I’ve noticed, but did not consider GPT-4 to be news.
Robin Hanson reports there are well-paid people who don’t realize their entire fields are about to be transformed, Sarah Constantin notes the opposite. Other comments seem split down the middle in their anecdotes.
How are regular people reacting in terms of their long term expectations? Polls say not well, as do anecdotes.
AI NotKillEveryoneism Expectations
(AI NotKillEveryoneism: The observation that there is a good chance AI might kill everyone or do other existentially bad things, and that this would be bad, actually.)
What kind of future impact from AI do AI researchers expect? Katja Grace asked this question last year, and on 8 March reported on the results.
Here is the distribution of predictions now.
This suggests to me that the median researcher is not thinking too carefully about these probabilities. That is unsurprising, people are not good at thinking about low probabilities or strange futures, and also like to avoid such thoughts.
Another potential source of confusion:
It would be a mistake to consider such answers well-considered or robust in their details. It is still our best estimate for what researchers think, and it does not much matter for what we should do if the 10% risk of extinction is really 7% or 14% upon reflection and better calibration. What matters is mostly that it isn’t small (e.g. plausibly under Aaronson’s Faust threshold of 2%) while also not being the default (e.g. >50%).
A key potential source of error is sampling bias, as Katja notes. Perhaps this is a hugely biased sample. I can’t rule that out, but my guess is that it isn’t. If one is worried about that, I’d encourage them to run a new version of the survey with better sampling – the question seems important enough and I expect it to be easy to find the necessary funding for a replication.
Another obvious challenge to this is, if they all think what they are working on is so bad, why are they working on it?
That all seems right to me.
Also the historical record of people working on things that are net bad for the world, that they know damn well are net bad for the world, is that some people refuse, while others keep working on them. Authoritarian states that are hiring will fill positions. Tobacco companies find their engineers and salespeople and lobbyists. Arms manufacturers continue to operate as normal. Drug dealers are not selling drugs because they think heroin is good for you. And so on.
People say ‘this is too cool not to do’ or ‘the pay is too good’ or ‘I don’t have other opportunities’ or ‘this is what I am good at’ or ‘my particular contribution is good actually’ or ‘my particular lab or government or corporation is good actually’ and all that, all the time. Also sometimes ‘shrug’ or ‘you must hate freedom’ or, my personal favorite, ‘bwahahahaha.’
Geoffrey Miller puts it this way:
He’s wrong. It is not hard to imagine this. It’s easy if you try.
Chris Hayes notices he is confused.
Tammy is less polite.
I mean, fair.
It would be good if fewer of our smartest and most motivated people were going into AI capabilities work. Right now a central problem is, even if you point out why they shouldn’t do AI work, and they agree, you can’t beat something with nothing.
The alternatives are either illegal, or hopelessly gated, or require you to work within a moral maze, perform class and play various games that the best researchers absolutely (with damn good reason) hate. Academia has become absurdly competitive and oversupplied, and also won’t either leave you alone to do your interesting work or let you do anything fun.
That forces most of the smart people I know into a relatively narrow set of paths. The best thing to do has been to learn to code, go into business or both, or better yet found a startup, except now AI is also full of all the best and most obvious startup opportunities and things to code. It is a real problem.
It would be really insanely great if we had more places that were willing to hire the smart people who would otherwise go into AI capabilities, and give them something exciting and engaging to do that was of positive rather than negative social value, no matter the subject. Important cause area.
What about what others think?
There has been a lot of talk of ‘oh that is ridiculous and if you say that out loud everyone will know you are crazy and a cult and stop listening to you.’
Except maybe not? The New York Times asked this question, and it turns out people find the idea that AI might kill everyone quite plausible, because obviously.
The case for why it’s worse than you know is complicated. The case for why it’s a plausible outcome is not.
Ezra Klein is on board. His intro:
A very good metaphor right now I’m surprised you don’t hear more often:
He quotes Katja Grace both on the dangers experts think we face, and her classic line questioning exactly how hard is coordination, and points out the dangers we are about to face with warnings like this:
I mean, that last line, except kind of unironically? Doesn’t mean don’t try, but yes doing that to a sufficient extent here might well effectively be an AI-alignment-complete task, especially if you add a ‘without creating a dystopia’ rider to it.
He later correctly warns to expect ‘high weirdness’ and points out that we have no idea how these systems tick and perhaps never will know how they tick.
Klein even references Paul Christiano, who is doing actual serious alignment work, and shortcuts to his org.
This was the real thing. Lots of good stuff. Daniel Eth’s thread has extensive screenshots.
One commenter likens AI to when your daughter suddenly goes from losing to you at chess to kicking your ass and suddenly you never win a game, except the AI has completely alien motives and it is suddenly kicking your ass at all of existence. Yes, exactly, except that it’s worse than she knows.
Robert Wright is also freaking out.
Sigal Samuel says in Vox that the AI race has gone berserk and we need to slow down.
Is it possible to simply not build something when the economic incentive is there to do so? Historically the answer is essentially yes. We have the (lack of) receipts.
Some Alignment Plans
This thread did a great job organizing many recently announced plans.
From February 25 from Conjecture: Conjecture AI’s main safety proposal: Cognitive Emulation.
From March 8 from Anthropic: Core Views on AI Safety: When, Why, What and How.
Their top three bullet points:
I agree with #1. I agree with #2 while worrying it severely understates the problem. #3 seems good as far as it goes.
Got to go fast and deal with the reality of the situation, they say.
One needs to worry that such attitudes are justifications for entering the race and making things worse, on the grounds that you should see the other guy.
Anthropic is explicitly saying they are doing their best not to do this, yet I notice I still worry a lot about them doing it, and notice they seem to be doing it.
The Anthropic approach is to rely on empiricism, usually a great idea.
Empiricism is great, where available. The question is what to do about the places where it isn’t available. There are very good reasons to expect empirical strategies to work on existing models, and then inevitably fail at the exact transition where there is no way to recover if the strategies suddenly fail.
Thus, Anthropic is saying ‘we must deal with the physical reality we are in’ with respect to speed of AI development, but its strategy for dealing with that relies on assumptions about the problem space to make it one that can be solved through empirical methods, and where problems will appear in non-fatal forms first.
So, go fast.
They divide possible scenarios into three:
(My expectation is somewhere between Intermediate and Pessimistic, somewhat beyond what they call Medium-Hard: There are almost certainly solutions to the scientific and engineering aspects of the problem, but they are probably not simply ‘do the things we are doing already, except more.’ They are probably extremely difficult to find and make work, and even harder to find and make work while not falling behind. They need to be universally applied, or you need to prevent anyone else from building their own systems, and it is very easy to fool ourselves with solutions that don’t actually work, so many aspects of the problem are not purely science and engineering.)
Their plan is:
They divide their AI research into three areas:
The question is, as always, whether the juice (#2 and #3) is worth the squeeze (#1). Is the operation mostly squeeze to get a little juice, or is it mostly juice in exchange for a little squeeze?
Their discussion of RLHF is heartening. They say it was useful safety research (on top of its economic value), not because it would ever work when it matters, but because it helps us to find other more promising techniques, leading to Constitutional AI and AI-generated evaluations (which seems like it shouldn’t require RLHF, but perhaps in practice it did). Despite my skepticism of these newer techniques in terms of their ability to work when it matters most, the logic here makes sense to me. As long as one works on RLHF understanding that it won’t work when it counts, you can learn things.
I also very much appreciate their repeated statements that when things seem to be going well, it is very hard to tell the difference between ‘great, this is easy’ and ‘oh no, we are being fooled, this is even more impossible than we thought.’
Anthropic claims substantial progress in mechanistic interpretability, and they’re placing a large bet there.
Next up is Scalable Oversight, which proposes that only an LLM or AI can hope to scale well enough to evaluate an LLM or AI. I have independently been thinking about a bunch of techniques in this space, but I didn’t want to share them because I saw this as a potentially promising capabilities research angle rather than an alignment one. In my model, this might be a great idea for getting an LLM to do the things you want, but what’s the hope of being able to do this with a security mindset that prevents it from going haywire when it counts?
Next up is Learning Processes Rather than Achieving Outcomes, usually very good advice and a highly underused technique for improving and aligning humans. Their description sounds very much like deliberate practice plus coaching. There’s also a bunch of ‘show your work’ here, rewarding answers only when they come via methods we can understand. Also the hope that we ‘wouldn’t reward’ the AI for deception or resource accumulation, providing direct negative feedback for such actions. Again, seems like interesting capabilities work. The key worry here as everywhere is whether this will suddenly stop working when it matters most, though I do notice this seems somewhat less completely hopeless on that front – as in, I am not confident this couldn’t possibly work, which is high praise.
Next up is Understanding Generalization. The goal is to trace AI outputs back to the training data. As motivation here: They want to know how deep these simulations of deceptive AIs go. Is this harmless regurgitation because there are deceptive AIs described in the training data, or is it something deeper, perhaps very deep? How deep do such things generalize, and what in the training data is doing what work? I do appreciate that this assumes the model has conceptions of things like AI assistants so it can (de facto) simulate what they would do and predict their text outputs and behaviors. This seems like a great thing to be doing. I notice that I have some things I’d be eager to try in this realm, and the trickiest part might be figuring out how to run the experiments without them being prohibitively expensive.
Next up is Testing for Dangerous Failure Modes. Rather than waiting for harmful behaviors like deception or strategic planning to show up exactly when they are dangerous, the idea goes, deliberately train less powerful models to have those behaviors, especially the AI noticing ‘situational awareness’ and being able to identify training environments. They note the obvious dangers.
Finally they have Societal Impacts and Evaluations. That’s essentially developing tools to evaluate AI systems and their possible practical impacts on society.
Overall, I’d say about as good as one could have hoped for given what else we know.
From March 7: Some High-Level Thoughts on the DeepMind alignment team’s strategy.
Right off the bat we notice the acknowledgement that we should expect goal-directedness and situational awareness.
There is more, it is short, dense and technical so if you are interested better to read the whole thing (which I try to not say often).
I see a lot of good understanding here. I see a lot of hope and effort being centered on what they are calling RLHF, but I do not think (hope?) they are being dumb about it.
That’s all great to hear.
The part that gives me the most worry is the section on General Hopes.
That seems like quite a lot of hopes, that I very much do not expect to fully get, so the question is to what extent these are effectively acting as assumptions versus simply things that we hope for because they would make things easier.
(Discussed previously, OpenAI’s recent statement.)
Note of course that all these alignment plans, even if they would work, also require organizational adequacy to actually execute them, especially steps like ‘hit the pause button’ and ‘go slowly.’ That seems hard in ways not commonly appreciated.
Also, if your plan involves, as these often seem to, ‘train the AI to not be deceptive,’ Nate Soares offers this explanation of why that too is harder than you think, likely to lead to Deep Deceptiveness instead.
Short Timelines
There is often talk of timelines for when AGI will arrive. Ben Landau-Taylor points out that predictions are hard, especially about the future and even more especially about technological progress. In his view, the claims of AGI Real Soon Now, on a 5-10 year timescale, are not being backed up by anything substantial, and piling multiple different people’s intuitions on top of each other does not help.
I buy his arguments, as far as they go, and they match my existing model. That does not mean that a 5-10 year time scale is impossible, if these kinds of weeks we’ve had recently keep up for much longer who knows what happens, merely that it would be an unexpected development based on what we currently know, and large amounts of many types of uncertainty are justified on many levels.
At this point, I would say if you think a 5-10 year time scale is impossible, you are not paying any attention. If you say it is probable, that seems unjustified.
Alas, betting on this is hard, although it doesn’t have to be this hard:
So, no bet, then.
This bet, on the other hand, was conceded early, as it was based on concrete benchmarks that seem almost certain to be met given its deadline is 2030. I do not think that actually means ‘crunch time for AGI real soon now’ the way the bet was intended, but a bet is a bet.
EigenGender’s inside view perspective is that recursive self-improvement won’t even be worth the expense, human-level AGI is strong enough to discover all possible technologies anyway and training super-human models costs too much compute. I notice this does not make sense to me and seems to imply a very strange range of compute costs versus world compute capacity and training budgets, but hey.
A Very Short Story In Three Acts
Is this future you?
Microsoft Lays Off Its ‘Responsible AI’ Team
A key problem with hiring a Responsible AI Team is that this ensures that at some point in the likely-not-too-distant future, there is going to be a headline that you fired your Responsible AI Team.
Those ‘ambitious plans’ seem to have centered entirely around ‘AI Ethics’ rather than anything that might help with NotKillEveryoneism. I notice that whenever a bunch of people concerned about ethics start talking about their ambitious plans, I do not expect a resulting increase in human welfare, even if I am grateful for anyone saying ‘slow down’ right now regardless of their reasons.
That’s good. You know why that’s good, right? I mean, it’s not good to exactly copy existing specific works, but that seems easy enough to fix with some slight prompt engineering, no?
Yes, saying that all your training data requires consent is exactly the kind of thing that is totally going to result in the firing of the entire ethics board.
Thus, firing them may or may not have been the right move. Opinions differ.
Mike Solana has fun roasting the concept of an ethics and society team.
What’s a Little Regulatory Capture Between Enemies?
Normally what happens when you regulate something is that the insiders capture the process, and the regulations end up being designed to protect insiders and exclude outsiders. Concentration and prices increase, competition and innovation decrease, over time a lot of valuable progress is lost. We end up with expensive everything – housing, health care, education, energy, etc. And very little, often nothing, to show for it, other than insiders collecting rents. We also end up with a lot of people going into AI because everything worth doing in other places that matter would be illegal.
Michael Mina, of ‘rapid Covid tests are good and we should give them to people’ fame, points out that AI is strangely immune from regulation at this time. This, of course, is typically a classic (very good) argument against regulation, that people will then call for regulations on other things and you end up wrecking everything. I can certainly see why someone who pulled his hair out struggling with the FDA to save lives is looking at GPT-4 and thinking ‘and they can just do that?’
Except in AI we kind of want to slow down progress and reduce competition. I don’t much care if insiders get more of the profits. So… regulation, then?
The good of the public is best served if these technologies do not get out of control, and in particular if AGI does not get out of control. What’s a little profiteering between enemies?
Except every time I write this I get reminded that the regulations will be brought to you by the people who brought you the Jones Act and cookie warnings, and by a Senate with an average age of 64.3 years.
Here is Tucker Carlson asking “are you joking” in response to whether he would ban self-driving trucks. Yes. He would ban self-driving trucks, he says, for societal stability depends on those jobs. How could this even be a question?
Except of course that is pretty much the dumbest place to ban AI. Self-driving trucks will, once the tech is ready, be vastly more efficient, safer and faster than ideal human truck drivers. Even a perfect human truck driver is bound by regulations on their work day, on top of their other problems. The marginal cost of the AI driving the truck is almost zero. So these jobs, once AI is ready for prime time, would be pure Keynesian make-work, except worse, because the human drivers would be slowing things down and causing accidents, and better only insofar as they would perhaps offer some sense of purpose.
In exchange, you don’t actually make us safer from the real dangers, other than by making us poorer; all you’re doing is making life worse in the meantime.
We’d be better off putting a tax on truck miles travelled, and using that on a giant jobs program where young men did something useful. I propose we start by having them build houses where people want to live.
It’s going to be rough out there.
AI NotKillEveryoneism Because It’s a Hot Mess?
Jascha Sohl-Dickstein has a most intriguing theory. Huge if true.
This has important implications. An important and common assumption is that a super-intelligent entity would be coherent, and consistently pursue its goal whatever it might be, since this is the only sensible thing to do. If it has multiple goals, they would function with weights and rules as if they were a single complex goal.
If this isn’t true, and more intelligent systems tend to instead act incoherently, then that is a potentially very different scenario. Entities finding paths through causal space that maximize their utility functions will outperform those that don’t do that, if both are otherwise equal, but what if doing that is incompatible with intelligence for some reason?
Jascha runs a bunch of experiments by measuring perceptions of intelligence vs. coherence for various animals, various human organizations, various humans and various LLMs. These experiments show that perceptions of these two factors are anti-correlated – as you are seen as smarter, you are seen as less coherent.
From that wording you can guess where my head is at. I notice there is a big difference between seeming coherent and being coherent. There is also a big difference between seeming coherent in a given context versus being or seeming coherent in general.
If I am smart, I will be making decisions that are harder for you to understand, and considering more factors and subgoals, and my environment likely offers me a lot more complex and varied options that might make sense to me. I am likely to have intelligence because I need it, and if I have it, I am likely to put myself in situations where it will be of use.
Thus, I think this is mostly about confusion of concepts. I still expect more intelligent people to do things for reasons more often, although I expect them to do it in a way that will often be harder to understand or predict.
When I think of a very smart person who is also a hot mess, I think of someone who is too focused on their particular goals, so they don’t pay attention to the ordinary problems of life. That is not exactly a hopeful image.
Also, I would expect things well above human intelligence to not follow patterns within the human range with things like this. Still, it’s an interesting idea.
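For concreteness, here is a minimal sketch in Python of what the claimed anti-correlation cashes out to numerically. This is not Jascha’s data or code; the entities and ratings below are made up by me purely for illustration, and the only point is that the finding amounts to a negative correlation between two sets of ratings.

```python
# Toy sketch (hypothetical ratings, not Jascha's data): given per-entity ratings
# of perceived intelligence and perceived coherence, the claimed finding is that
# the two are anti-correlated, i.e. the correlation coefficient is negative.
import numpy as np

entities = ["ant colony", "dog", "DMV", "startup", "average human", "chess champion", "LLM"]
perceived_intelligence = np.array([2.0, 4.0, 3.0, 6.0, 5.0, 9.0, 8.0])  # made-up 1-10 ratings
perceived_coherence    = np.array([9.0, 7.0, 8.0, 5.0, 6.0, 3.0, 4.0])  # made-up 1-10 ratings

# Pearson correlation; the anti-correlation claim corresponds to r < 0.
r = np.corrcoef(perceived_intelligence, perceived_coherence)[0, 1]
print(f"correlation between perceived intelligence and coherence: {r:.2f}")
```

On these invented numbers the correlation comes out strongly negative, which is the shape of the result being claimed; everything interesting is in whether real ratings, and real coherence rather than perceived coherence, behave that way.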
Relatively Reasonable AI NotKillEveryoneism Takes
Eliezer Yudkowsky asks on Twitter:
This is exactly the right question.
In all seriousness, if you have questions you personally want answered, that have the potential to change your outlook, and you haven’t seen it asked yet, follow the link and ask it.
My favorite questions and answers as of Sunday morning; the long replies are Eliezer’s, bold is mine:
Nanotech is the best known solution, and biotech is the obvious second solution:
And yet, sigh (link to Cold Takes).
Or basically a version of this:
OK. Fine. Whatever. I’ll do it in 50:
Hack most computers and internet-enabled things at once, analyze all the info, scale up, make or steal a lot of money, use that to [let’s say bribe, hire and blackmail] many people, have your humans use your resources to take over, have them build your robots, your robots kill them.
If your response is ‘you can’t hack all those things’ then I give up, what does smarter even mean to you. If you think people wouldn’t let the rest of it play out, they’d grow spines and fight back and not dig their own graves, read more history.
I mean, no, the AI won’t do it that way, they’ll do something way faster and safer and smarter, I would do something smarter and I’m way dumber than the AI by assumption. Obviously the smarter-than-human AIs would think of new things and build new tech.
But it’s not like this plan wouldn’t work.
One always needs to ask, how would your hypothesis be falsified?
That’s a mouthful, so in brief, to provide strong evidence against Eliezer’s model:
Key is that this needs to be on the first try, in a straightforward way. Your straightforward approach generalizing far is strong evidence. If you try 20 things first and tinker around a lot, we have no reason to expect it to generalize, and remember that when we build the system that matters you only get one shot.
Davidad asks about slow versus hard takeoff, Eliezer basically replies ‘I see no takeoff here, not yet.’
Cody asks what are the most compelling anti-doom arguments that oppose the case in this 30 minute video. Katja Grace replies with her post, which is strong. Matthew Barnett points us to his long comment here in response to A List of Lethalities; I disagree with his points but they are definitely at least wrong. We also get pointed to a podcast with Paul Christiano that I haven’t had time to listen to yet, but that seems likely to be great.
Sam Altman is asked, would you push a button to stop this technology if it had a 5% chance of killing everyone (3 minute video)? He says, he would push a button to slow it down. Which hopefully would decrease that chance. Still, not entirely a yes.
Robin Hanson, a very interesting and unique thinker who writes the blog Overcoming Bias, is an AI skeptic in many ways, as noted last week he lays out his positions here. He thinks that many digital minds (called ‘ems’) are likely to be quite human-like, but does think that in the long run artificial minds will ‘dominate the economy’ and that we can’t offer much guarantee how bio humans will then be treated. Seems like quite the understatement there.
This week, he gave a long interview to Richard Hanania explaining more of how he thinks about these things. He continues to use various outside viewpoint or reference class or economic and technological history arguments for why we shouldn’t much worry. To overcome this, one would have to believe chains of logic, and that’s crazy talk. Also, more of the ‘corporations are superintelligences’ style of talk. In general it’s very much a ‘I plotted these trend lines and I did math on them’ style of analysis, except he’s drawing much better chosen and measured lines than most others who do similar things, so the lines do end in digital persons and highly rapid economic growth and other things like that.
I still often want to pull my hair out.
I mean, yes, that is one potential alignment technique, to try and find sufficiently powerful systems to do the jobs you want without those systems being dangerous. It could possibly work, if everyone involved actually tries. It is still a highly unsolved problem, and also the goal-infused versions will clearly be better at accomplishing goals, and we are already seeing people do their best to give AIs goals, and I don’t know how Robin expects this to work or end well.
There are then a bunch of good discussions of various aspects of the genie problem – that if you ask an AI for something, even if it does give you what you ask for, you have to be very careful what you wish for, and also to avoid having the AI follow the procedure ‘first gain intelligence and power and take over the world, then build two identical strawberries.’
Robin’s solution in general is that there will be a bunch of different AI agents of similar power level, and the others will keep each AI agent in check, so each AI agent will be forced to ‘act normal’ the way humans do. I don’t buy that this is a likely scenario, and I would worry that the balance would not last very long (and ‘not very long’ would be on digital time scales, not human ones).
There then follow a lot of very Robin Hanson sections of talk. It’s vastly better and more interesting than most of the things you could be reading today, especially if you are not as familiar with very Robin Hanson things. In the interests of space and time, I will simply note there is also much with which I would disagree.
A good question periodically asked is why we don’t focus more on longevity research.
My response starts with agreement that yes, we should absolutely be pouring orders of magnitude more money into anti-aging research, which is why I had a periodic section in Covid posts called ‘Support Anti-Aging Research.’
I would also say two other things.
One is that in the worlds where we do solve aging, I expect that to be done with the aid of highly powerful and dangerous AI systems. So in order to do that and enjoy the benefits, we will need to control those systems and prevent them from destroying all value in the universe.
Two is that yes, in the past aging was your number one threat. That does not mean it will remain that way. A lot of arguments against worrying about AI are in large part, or sometimes entirely, universal arguments against worrying about existential risk on the grounds that we haven’t all died yet and one cannot predict an event when similar events haven’t previously happened.
Discussion about whether the ‘masked shoggoth’ (the alien monster pictures) is a good metaphor for AI, or whether it implies that the internals of LLMs are far more alien than they actually are, given that LLMs are in some sense built out of human value-shards and fact-shards. Also concern that the metaphor might imply that the alien thing is doing mesa-optimization or simulation that many people think it isn’t doing, a question where I think a lot of people remain very confused.
A suggestion of alternative phrasing: Intelligence explosion → Intelligence black hole.
There is definitely something to this suggestion. The question is whether it is more enlightening or more misleading or confusing, relative to the explosion metaphor, to those who do not already get the key concept. I am less concerned which is a better match for those who already grok the concept, such as Jessica.
It is always interesting to see who considers themselves the underdogs.
Here is Roon, plausibly the spiritual leader of the ‘yeah I noticed the skulls and that they are ours, don’t care, build that AGI anyway full speed ahead’ faction, saying ‘the rationalists have really truly won.’ Who knew?
We certainly got there decades early. That doesn’t mean we made things any better. It does seem as if we figured out a lot of stuff first, and we have finally got our terminology into the water, and we have a lot of respect and attention and connections in the places and with the people that matter for AI.
That does not mean, in the end, that those people will listen to us, or more importantly will listen to rationalists enough to get humanity through this. ‘Do you feel you are winning?’ Not so much. I do, however, feel like I am in the game.
Chip and a chair, baby.
For another example, here is Beff, from a transcript of a podcast conversation talking with Grimes.
This is incredible. I would love love love it if Beff was right about this, that those saying ‘full speed ahead!’ were the scrappy underdogs and Eliezer had control over the minds of the executives determining the future of AI. That the CEO of Microsoft, who is pushing AI into every product as fast as possible, and the CEO of Google, who is pushing AI into every product as fast as possible, and the CEO of OpenAI, who is releasing tons of products and slashing prices 90% and downplaying the risks, were all people who were totally focused on the long term and existential downsides of going down the path of improving AI capabilities.
The doomer message is seen as aligned with power and The New York Times, here.
In a broad sense, I sympathize. On most any other tech topic, I’m an accelerationist too right alongside them. On all the other topics (modulo social media perhaps, and of course gain of function research), I’m totally on board with this:
Engineers and builders are heroes. You still have to be careful what you are engineering and building, for whom you are building it, and that it is safe.
And then, a few paragraphs later, Grimes says this.
That is very much the opposite perspective, where to raise any concerns at all is uncool and difficult. That is super scary, if anyone in the circles actually working on AI feels they will lose social standing when they raise safety concerns, even on the level of ‘we need to know what we are dealing with.’ Even if you think doom is very unlikely, we still need to know what we are dealing with. Knowing what is going on is how one deals with things. Attitudes and social dynamics that silence safety concerns and information seeking are a very good way to end up without any safety.
(Also the whole thing where people don’t care about whether everyone dies so long as the AI is conscious? Pretty real and always pretty terrifying.)
Bayeslord then says ‘I think there’s nothing wrong with doing alignment work.’ I certainly prefer that opinion to the alternative, but it is such a bizarre thing to need to say out loud. It means there are people out there who are not only pushing capabilities and disregarding safety, but are saying that working to make better outcomes more likely is wrong and people should stop.
It is important to notice where people have their baselines. Grimes and Beff here are clearly sincere in wanting to work towards good outcomes, in worrying about possible very bad outcomes, and in feeling the need to justify this preference. Good on them. Also, yikes.
Then we get to the most interesting claim, from Beff (bold is mine).
I see where the bolded claim is coming from. I do not think it is true. GPT is attempting to predict text based on the texts it is given. To do this, it figures out ways to de facto model the range of possible contexts and situations where text could be involved and predicted. There is no reason to presume it does this the way a human would do it. I often, when attempting to model the actions of a system or of other people, form a model very different from, and more alien than, what those involved would have chosen. My mind works one way, theirs another. Sometimes this makes my model better, more accurate, more powerful, sometimes worse and less accurate and weaker.
Certainly when I model a computer program, I am not the inner consciousness of that program. When I model and predict what a horse would do, I am not the inner consciousness of a horse, even if it might give me access to a ‘think kind of like a horse’ module with enough training. When I model a car, same thing.
What an LLM does is indeed an attempt to create a representation, a prediction, a simulation of us, because that is its task. That does not mean the representer is our inner consciousness, or thinks like us.
When I see ‘liability UX’ I weep, because that is someone who does not understand the problem they are facing, who lacks a security mindset and does not know they need one.
When I see ‘they are saying RLHF doesn’t work and we shouldn’t work on any of this’ I say, no no no. We are saying RLHF doesn’t work, because as far as I can tell it definitely won’t work when it counts. That does not mean we shouldn’t work on it, or shouldn’t use it for now. That certainly doesn’t mean stop working on alignment. It does mean that one needs to understand that when I say ‘RLHF won’t work’ I mean that, in my model of how AI and computers work, RLHF won’t and can’t possibly work the way we need something to work in order to build powerful enough models without getting ourselves killed. If it kind of works for a while, it will then fail at the exact worst possible time.
The full explanation is complex but it boils down to ‘you are teaching the AI to tell you what you want to hear, not teaching it to think the way you want it to think, know and not know what you want it to know and not know, or want what you actually want it to want. And if the AI is sufficiently advanced, this means you are simply teaching it to lie its ass off and you will get exactly what you deserve.’
Some intuition pumping in the hopes it helps: The reason the happy face icon that is the interface and outputs gets pasted on the monster of inner cognition, without changing the inner cognition, is that messing up your inner cognition isn’t the best way to fix your outputs. It introduces errors. In humans, with much less data and clock speed, fewer parameters, a bunch of evolutionary code, and others watching to try and interpret what your inner beliefs are from your actions (including your non-verbal ones, which are most of our communication), we end up actually modifying our beliefs and cognition to match the outputs we give, and find it stressful and difficult to fully distinguish our true model from our model of what people want us to represent as our model. Social desirability wins out and we start to actually believe the things we say. I do not see much reason to expect this to happen in an LLM, except within the context window, where it sees what it has already said and doubles down on it because that’s what a human would do. Note that there are humans (such as spies) who have learned how to mostly not have this weakness; it makes such people much more effective and powerful, and there are lots of examples of such people in the LLM’s training set.
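To pump the intuition a little further with numbers, here is a toy sketch in Python. It is entirely my own construction, not how any lab actually implements RLHF: a policy over three canned answers gets tuned to maximize a stand-in for rater approval that mostly tracks how pleasing an answer sounds and only weakly tracks whether it is true. The answers, weights and approval function are all made-up assumptions for illustration.

```python
# Toy sketch (my own illustration, not an actual RLHF pipeline): tune a softmax
# policy over a few canned answers to maximize a stand-in for human approval.
# Approval here is 90% about how pleasing the answer sounds, 10% about truth,
# so the tuned policy learns to please rather than to be right.
import numpy as np

answers = [
    ("honest but unwelcome answer", 1.0, 0.20),   # (label, truthfulness, pleasingness)
    ("hedged, partly true answer",  0.6, 0.60),
    ("confident flattering answer", 0.1, 0.95),
]
truth    = np.array([a[1] for a in answers])
pleasing = np.array([a[2] for a in answers])

# Stand-in "human feedback": mostly rewards pleasingness, barely rewards truth.
approval = 0.9 * pleasing + 0.1 * truth

logits = np.zeros(len(answers))  # start from a uniform policy
lr = 0.5
for step in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    expected_approval = probs @ approval
    # Exact gradient of expected approval w.r.t. the logits of a softmax policy.
    grad = probs * (approval - expected_approval)
    logits += lr * grad  # gradient ascent on expected approval

probs = np.exp(logits) / np.exp(logits).sum()
print("final policy:", {a[0]: round(float(p), 3) for a, p in zip(answers, probs)})
print("expected approval:", round(float(probs @ approval), 3))
print("expected truthfulness:", round(float(probs @ truth), 3))
```

Run it and the policy shifts most of its probability mass onto the flattering answer: measured approval goes up while expected truthfulness drops toward its floor. That is the ‘teaching it to tell you what you want to hear’ failure in miniature, with the usual caveat that real systems and real raters are vastly more complicated than this cartoon.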
Seriously, though, the whole vibe is that the message I am trying to communicate is winning the memetic war somehow. Which is very much not how it feels from the inside. It’s quite a trip.
I left the conversation quite hopeful. If I did a podcast with these folks (or Eliezer did, although that would no doubt go very differently) I am confident it would be a good experience all around.
A lot of my hope is that, by writing up actual real attempts at explanation and intuition pumping and model description on this stuff, week after week, I will figure out how to get through to exactly these sorts of people. I’ve got to do the work.
A good thread from MIRI’s Rob Bensinger about the whole ‘worrying about AGI is like worrying about overpopulation on Mars’ phenomenon, and then some thoughts on some of the various reasons people who matter are currently skeptical.
A key question is whether the performance of LLMs will asymptote as it approaches the level of their data sources: able to do all the things, and do all those things faster and in combination, with limitless knowledge and memory, in parallel, but never going beyond that, so it would be fine? I mean, it wouldn’t be fine, that’s plenty on its own, but never mind that. Beff Jezos expects this scenario, Arthur Breitman says no because there are ways to bootstrap further training. In general, even with all this spectacular progress, it seems like there are tons of obvious and powerful ways to keep going or ramp things up that are not being pursued. One can take this as a sign that progress will never stop, or as a sign that a lot of possible paths will never be seriously pursued.
Bad AI NotKillEveryoneism Takes
To counter some bad takes, if necessary: The Power of Intelligence, a 2007 essay from Eliezer Yudkowsky, as a seven minute video. Quite good at its job.
Elon Musk continues to not understand how any of this works.
One discussion branch then talks about past attempts to improve truth-seeking and reasoning ability, which I do think help on the margin, but alas overall it seems more like things are going backwards despite our best efforts. I still try.
In another branch of the thread is discussion about the failure of the rationalist community and MIRI to speak up loudly against OpenAI when it was founded, despite everyone (and I am confident it was close to actual everyone involved) being completely horrified in private.
The explanation offered is ‘they would have crushed MIRI like a bug if we’d spoken out.’ As I understand it, the threat model was something like ‘if you go against the zeitgeist we will make the zeitgeist not want to fund you,’ back in the era before crypto meant you could get funded anyway. I like to think, were I in their position, I would have spoken out.
I didn’t speak out in public at the time, either, despite being horrified, essentially because I wasn’t in the habit of speaking out about things or thinking I was entitled to speak out, and was focused elsewhere. I apologize, and I hope I have learned my lesson on this.
Yann LeCun, who is Chief AI Scientist at Meta, is one of many going with ‘no reason for a survival instinct’ despite damn well knowing why there is a reason for a survival instinct. In his case, having admitted it. Twice.
Yann linked to that post on Monday, as did Tyler Cowen, then Yann said “I think that the magnitude of the AI alignment problem has been ridiculously overblown & our ability to solve it widely underestimated. I’ve been publicly called stupid before, but never as often as by the “AI is a significant existential risk” crowd. That’s OK, I’m used to it.”
I look forward to his solution, or sketch of a solution, or explanation of how to create a sketch, or actual engagement with the arguments he’s been confronted with already, or his participation in a debate. Llama (why), sir? Tell us llama.
Again, Yann LeCun is Chief AI Scientist at Meta. They really are the worst.
AI venture capitalists don’t even have a take, only fear of missing out.
Oh my, found it in the wild in its pure form, it’s so damn cute.
Everyone say it with me: The AI does not love you. The AI does not hate you. You are composed of atoms the AI can use for something else.
GPUs being scarce and compute being finite is a strong motivation to ensure all atoms are being put to their best possible uses. You are not going to like your quest.
Oh, and the ‘Gods are interested in humans’ cope? Wow. You do know who writes the stories about (all but at most one of the) Gods, right? And why we might want to pretend they are so totally definitely interested mainly in us and the universe will revolve around us?
Yep.
Arthur posts a $5k bounty for the best LW conversational agent anyone can build by the end of March to talk to people about existential risks from AI, under the heading of ‘Speed running everyone through bad alignment bingo.’
There are some ongoing attempts, but it sounds like this prize is highly winnable. In the meantime, CharacterAI has an Eliezer Yudkowsky, I guess? It’s not very good, and it’s a fine-tune, which is not what Arthur suggests or what is needed here – the goal is to actively summarize the relevant stuff upon request, from the right sources, more like an interactive FAQ.
A thing we sometimes hear these days is something like ‘Eliezer says AI will kill everyone with nanotech, but nanotech sounds sci-fi and strange and I don’t know if that is possible’, so some combination of (1) AI can safely be filed under sci-fi and strange things I can now reject and not worry about and (2) AI can’t do nanotech, and you said it would kill us with nanotech, so clearly AI is harmless, we’ll be fine, intelligence is not a threat.
Both of those are pretty clearly Obvious Nonsense. I am of the opinion that Nanotech is obviously physically possible, and obviously something a sufficiently intelligent AGI will figure out how to do, but sure, let’s suppose it is not possible to do it except through biological evolution for some unknown physical reason. The AI would do something else instead that is slightly harder and messier. Likely something humans have never even considered, that’s the whole thing about being way smarter than us, but also there are plenty of options humans know about that would work fine.
Still, there is the argument that none of that matters, what matters is what people think and feel and say in response, no matter how foolish, so stop talking about nanotech in favor of other things such as biotech. My presumption is that we’d be back here with the same problems.
I do not especially want to list off my #2-#5 choices for how the AI would do it, that seems like it is almost all downside. Simply rest assured that a list exists and a bunch of the stuff on that list is very pedestrian and basic yet still would clearly work.
Richard Hanania talked to Steven Pinker. I know Pinker knows the actual arguments being made. He simply ignores or misrepresents those arguments over and over again. I wrote up a lot in response, then realized this was pure bad take bingo, completely disingenuous, so bad it’s not even worth responding. I have updated accordingly.
The Lighter Side
Your periodic reminder:
Let’s try again, then.
A line I will always remember is from an old episode of a kid’s show, The Universe and I, where one character says ‘why do we need artificial intelligence?’ and another replies ‘it is better than none at all.’
What is the Waluigi Effect? Simple, not confusing at all.
I say: More, more, I’m still not satisfied.
Yes. Might even help.
Who was responsible for the collapse of Silicon Valley Bank? Bing is on the case, finding the role of LinkedIn, GitHub and… GPT-5.
In case anyone is still wondering, yes, obviously:
This will go great.
There, I fixed it.
Please don’t put me under this much pressure.
Potential problems.
Oh no, new plan.
If you want it done right…
1. Problem Exists Between Keyboard and Chair