Can you tell if an AI is lying to you? A new paper claims that we essentially can do exactly that, at least under the right conditions. Another paper claims we can inject various sentiments into responses, getting the AI to do what we wish. Interpretability is making progress. It is exciting to think about the implications. In the short term, it would be great if we could use this to steer responses and to detect and correct hallucinations. There’s a lot of potential here to explore.

In the longer term, I am more skeptical of such strategies. I do not think lie detection is a viable primary control or alignment strategy. I worry that if we go down such a path, we risk fooling ourselves, optimizing in ways that cause the techniques to stop working, and getting ourselves killed. Indeed, even attempts to grab the low-hanging fruit of mundane utility from such advances risk starting that process. Still, it’s exciting, and suggests we might see more breakthroughs to follow.

We also saw Meta announce new AI-infused glasses, at a highly reasonable price point. They will be available soon, and we will see how much mundane utility is on offer.

As usual, there is also a lot of other stuff happening, including the disappointing resolution of the Open Philanthropy Worldview Contest. On the plus side, they are hiring for their global catastrophic risk team, so perhaps you can help.

Table of Contents

  1. Introduction.
  2. Table of Contents.
  3. Language Models Offer Mundane Utility. What can image and sound do for you?
  4. Language Models Don’t Offer Mundane Utility. Don’t want it, also can’t get it.
  5. Fun With Image Generation. Dalle-3 is having fun out there.
  6. Deepfaketown and Botpocalypse Soon. Begun, the deepfake advertisements have.
  7. They Took Our Jobs. There will be new jobs. But why will we keep those either?
  8. Get Involved. OpenPhil is hiring for their catastrophic risks team.
  9. Introducing. A new robot training protocol again, I’m sure it’s nothing.
  10. Meta Surveillance. Meta offers AI glasses. How big a concern is privacy?
  11. In Other AI News. Llama-2 has a world map, Replit lets you program on a phone.
  12. Open Philanthropy Worldview Contest Awards Prizes. There is a clear pattern.
  13. Quintin Doubles Down on Twitter. One of the winners continues to be skeptical.
  14. The Other Winners. Should I look in depth at the other four winning entries?
  15. Quiet Speculations. Can’t we all just get along?
  16. Open Source AI is Unsafe and Nothing Can Fix This. In case you missed it.
  17. The Quest for Sane Regulation. New GovAI report humbly suggests action.
  18. The Week in Audio. Altman has a good talk, Meta and Lex offer a distinct open.
  19. Rhetorical Innovation. Good points can come from anywhere.
  20. Aligning a Smarter Than Human Intelligence is Difficult. Ask if it’s lying?
  21. People Are Worried About AI Killing Everyone.
  22. Other People Are Not As Worried About AI Killing Everyone.
  23. The Lighter Side. Would you buy a superintelligence from this man?

Language Models Offer Mundane Utility

Practice speaking a foreign language.

Find Waldo.

Solve a Captcha. A little trickery may be required. Only a little.

Get feedback on your paper. A majority of 57% found a GPT-4-based reviewer to be helpful. Not yet a substitute for human review, but quick and cheap is highly useful.

Understand the movie Inception based on Nolan’s initial sketch.

Ask if you should mulligan in Magic: The Gathering. I guess. If you want. Clearly it is simply pattern matching words associated with cards and curves rather than offering anything that would provide strategic insight. Not scared yet.

Help you navigate a GUI or do Raven’s Progressive Matrices IQ test (paper which is an overview of what the new vision capabilities of GPT-4V enable). Thinking step-by-step seems even more important when processing visual information than with text? In general, GPT-4V seems very strong at understanding images, at least under non-adversarial conditions.

Ask for admission to a PhD program, or write back to someone who does.

Scott Moura: Recently, I’ve received an uptick in emails requesting PhD admission. The emails are remarkably similar. I did an experiment and they match almost perfectly. [screenshot]

Prove the irrationality of the cube root of 27. I mean, you did ask. This is framed as an ‘AI is dumb’ gotcha, but that is not so obvious. If you explicitly ask for this, producing a reasonable-looking invalid proof is a defensible response, since the question wording is a request for a proof. Yes, it has an invalid step, but it’s about as clever as any of the alternatives, and it does tell you the answer is 3. I played around with it a bit (using GPT-3.5 for speed) and found that it will assert this when it is somewhat less reasonable to do so, but only up to a point.
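For reference, since the answer is simple arithmetic, here is a sketch of why any such proof has to fail. This is my reconstruction of where the standard template breaks down, not the model’s actual output:

```latex
% 3^3 = 27, so \sqrt[3]{27} = 3 and the claim to be proven is false.
% Running the standard irrationality template anyway:
\begin{align*}
\text{Suppose } \sqrt[3]{27} = \tfrac{p}{q} \text{ in lowest terms. Then } p^3 &= 27 q^3,
  \text{ so } 3 \mid p. \\
\text{Write } p = 3k: \quad 27 k^3 = 27 q^3 \implies k &= q \implies \tfrac{p}{q} = 3.
\end{align*}
% No contradiction ever appears; the template just recovers the rational
% value 3, so any claimed contradiction at this point is the invalid step.
```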

Automate front end development work. I’m with Roon, this is great if it works.

Roon: once again begging everyone to have more abundance mindset “front end dev is being automated” first of all.

I only wish we were anywhere close but second of all imagine the possibilities that would be open to you if you could make any UI virtually for free.

There is no scarcity of useful work in this world instead there’s a scarcity of people or emulations doing it.

Oh the things I would architect.

Review your writing style, in a variety of writing styles.

Language Models Don’t Offer Mundane Utility

Generate podcasts.

Or do they? Could go either way. You be the judge. Daniel Litt says no.

Image

Also worth noting that Bard gives that same response. People occasionally note that Bard has been multimodal for a while and no one talks about it, which is presumably because Bard is otherwise not good so no one noticed.

Image

And of course, GPT-4 without code interpreter gets this right.

Image

Nature asks AI to predict who will win the Nobel Prize, does not go great.

Use GPT-4 to attempt to analyze cardiovascular risk. Results were not terrible, but (unless I am missing something) even with the best settings it did not improve upon standard baselines, or otherwise provide actionable information. This is one of those areas where getting a reasonable approximation is known tech, and getting something substantially more useful than that remains impossible.


Fun with Image Generation

From Dalle-3, which does seem quite good:

Image

Also, sigh, but I will refrain from all the obvious snarky responses.

Deepfaketown and Botpocalypse Soon

This is how it begins.

BBC: Tom Hanks warns dental plan ad image is AI fake.

“There’s a video out there promoting some dental plan with an AI version of me,” the actor wrote on Instagram.

“I have nothing to do with it,” [Tom Hanks] added.

Or perhaps like this?

Zackary Davis: AI out here generating a fake video, with audio, of MrBeast running an iphone scam as an official ad on TikTok the future is so fucking awesome

Image

I do always appreciate how any new scam starts out being a completely obvious scam, since the people who click anyway are your best customers. Gives the rest of us time.

It also begins with de-aging. As with the rest of our culture, our beloved franchises, characters and stars are getting old, and we are in no mood to move on. Why take a chance on a new young actor, when we can de-age Harrison Ford or Robert DeNiro? Harrison Ford might be 80, but watch him (not de-aged) in Shrinking and it is clear he not only still has it, he is getting better, making every moment count. If you have the baseline older version to start with, the younger-looking version is going to look very real. And while we’re cutting 80 year olds down to 50, who can object too strongly?

I am curious what happens when someone tries de-aging a 26-year-old movie star to look 19. That seems like it will hit different. And of course, when certain people aim for younger still and it isn’t in the context of something like Benjamin Button, that will hit even more different.

And again different when the performance is someone else entirely, which I expect to come relatively soon, and to come long before good entirely synthetic performances.

Tyler Cowen predicts AI-generated misinformation will not be a major problem in the 2024 campaign, because the major problem will instead be human-generated misinformation. The problem in his view is demand, not supply. There was already an infinite supply, and the quality of the production does not seem to matter much. The question is whether people want the product.

This largely matches my model as well. I do think that AI has the potential to make things worse in some areas, such as faking voices. It also has the potential to help make things better, if people want that. It’s up to us.

Paul Graham predicts the rise of having LLMs write for you, presumably something content-free, so you can claim to have written something.

Paul Graham: There are already a huge number of books written by (or for) people who are ambitious for fame but have nothing original to say. AI is going to multiply this number shockingly. ChatGPT is these people’s dream.

Related: AI will cause an increase in the number of PhDs in fields like education, where merely writing a dissertation of the requisite length was previously the main obstacle.

All those German politicians and US religious leaders who plagiarize their PhD theses will no longer have to resort to such traceable measures. They’ll just use AI to generate them instead.

A large percentage of PhDs, perhaps a large majority, are no different from a child’s homework, in that they are work done in order to check off boxes but provide zero value to anyone. If anything, they provide negative value, since unlike old homework, someone might actually lose time accidentally reading them. I am not, shall we say, crying for the lost effectiveness of this class of signal.

Same with books. We do need to stop the nonsense on Amazon where people fake being a different real author, or try to mob us with fakes so much that search results give them some tiny number of sales. But if someone wants to ‘write’ a book using AI and then put it in stores, I mean, sure, fine, go ahead, make our day.

Then there are the false positives. What does it say about academia that AI detectors are known to not work, and yet they continue to use AI detectors?

Rusty: This is the first false positive I’ve ever been flagged for and it’s very demoralizing. The paragraph that was flagged was a summary of the story I was writing about. I genuinely have no way to prove myself because I never thought this would happen. Not to mention the turnitin AI detector has a history of false positives, ESPECIALLY in the intro and closing portions of writings.

@Turnitin you guys realize AI detection will never be 100% accurate? Fix it, remove it from your website, or provide a warning to educators that it’s not accurate. It does harm to students like me who don’t use AI, and gets them accused of academic dishonesty.

At the very least, give educators a warning that this detector should not be considered judge, jury, and executioner when it comes to cheating and using AI

Even if the teacher here thinks the student did use AI for the summary section, why should we care, given that the rest of the essay contains the actual content? The whole problem here stems from a summary being content-free. A human writing a summary is not going to be easily distinguishable from an AI writing that summary. To the extent it is here, that is an error by the human, who is being asked to do a robot’s job.

They Took Our Jobs

Garett Jones continues to have faith that standard economic principles will hold. The intent was that the poll was dated 1500 AD.

Thomas Stearns: don’t US farmers basically have UBI right now?

Garett Jones: Easier to afford now [graph of what % of people farm over time]

This is a common point that brings clarity. We indeed did not need UBI for farmers, nor did people starve to death or lose control over the future. Why? Because humans remained the most powerful optimizers on the planet, the most intelligent entities, the best way to rearrange atoms to your preferred configurations.

Thus, if we no longer needed to do this to turn soil into crops, we could instead gainfully do other things. So we did, and it was good.

If AI outcompetes humans in some areas, automating some or even most current jobs, what happens? That depends on whether there remain other useful tasks where humans are not outcompeted. Those need not be tasks that people currently find worth doing, or are currently capable of doing, or that even exist as concepts yet. What matters is whether they will be available in the future.

Thus, short term optimism. There are boundless additional tasks that humans would like other humans to do, or that humans would want to do for themselves. The more of that is sped up or automated away, the more we can do other things instead. That holds right up until the AIs are sufficiently generally superior to us that there is indeed nothing left for the humans to do that has enough economic value. At that point, UBI will very much be needed.

One should also ask whether that level of AI capabilities also results in human loss of control over the future, or the extinction of the human race. That certainly seems like the default outcome if we fail to robustly solve alignment. Even if we do solve alignment as people typically understand that term, the competitive dynamics involved still seem to me like they point to the same ultimate outcome with (remarkably rapid) extra steps, as we put AIs in control lest we be outcompeted, and as the AIs that relatively favor gathering resources, reproduction or power instantiate more and more copies.

Similarly, here’s a video from Kite & Key Media telling people they won’t take our jobs by giving the standard history of technological improvements leading us to instead do other jobs. Except, once again, if the robots are also better than us at any new jobs we come up with, then this will not help us.

Get Involved

Open Philanthropy is hiring for its Global Catastrophic Risks team. Base salary for various jobs is ~$100k-$150k, short start date preferred, remote acceptable for some positions but not all.

These seem like high leverage positions, especially relative to the offered pay. I do encourage people to consider applying, if you believe you both have the chops to figure out what would be worthwhile, and can stand up in the face of social pressure sufficiently to keep the spending focus on what would actually be worthwhile.

Google launched (on September 11) the Digital Futures Project, a $20 million fund to provide grants to leading think tanks and academic institutions to think about ‘responsible AI.’ Looks entirely focused on mundane concerns filtered through credibility-providing institutions, so I do not expect much, but it’s a lot of funding so worthy of note.

Introducing

RT-X, a new robotics model from DeepMind containing data to train AI robots. They claim it is a substantial upgrade.

Anthropic offers a guide to evaluating AI systems, especially for what they term harmlessness, along with threats to national security and some abilities. For readers here it covers familiar ground. It is written for people getting up to speed, from the perspective of helping them evaluate things against potential ‘harmful’ responses or at most national security threats. The models are thus mostly implicitly assumed not to be extinction-level threats, and the act of training and testing the model is assumed to be safe. Within that framework, the info seems solid.

Meta Surveillance

Meta introduces glasses that include multimodal AI that can take calls, play music, livestream from your face, answer questions and more. Launches on October 17 starting at $299. If well executed this type of product will be huge. It would be quite the scoop if Meta got there first and everyone else had to scramble to copy it. Even if so, unless it is far better than I expect, I will wait for a version from a different corporation.

The glasses will give you the ability to livestream, hopefully without locking you into any given ecosystem for that. I trust that feature to work. What about the other promised features? That depends on execution details. We will have to wait and see.

This elevated MR comment seems right. AI-infused glasses offer a strong user experience for many purposes, they are clearly buildable with current technology, and in particular they should be great for hands-free smartphone operation. So such glasses will inevitably be built; the question is when and by whom, and to a quality threshold worth using.

What the glasses are not offering, as per the top comment and as far as I can tell, is augmented reality. The glasses will talk to you. I have seen no evidence they will display anything. They can still be highly useful, but they do not seem to be the full product? Perhaps I am missing something.

A potential problem that also applies to the Apple Vision Pro: an AI wearable will record data about everything you ever see. One might ask: who would get access to that data? What happens if I get sued, even frivolously? What if I anger the government? What if the tech company is hacked? Or if they simply want to sell advertising? And so on. I do think that for most people this is a trivial concern for overdetermined reasons, but there are those for whom it isn’t.

Or what if this flat out enables ubiquitous surveillance?

Jeffrey Ladish: The ability to scale surveillance – for political control, for helicopter parents, for abusive partners… is an underappreciated risk of the AI systems we have right now. Audio to text conversion plus LLM data processing and search is extremely powerful.

Or we can ask, even if the data stays where it was intended to stay, what will this do to children?

Dan Siroker (CEO RewindAI): Introducing Rewind Pendant – a wearable that captures what you say and hear in the real world!

⏪ Rewind powered by truly everything you’ve seen, said, or heard

🤖 Summarize and ask any question using AI

🔒 Private by design

Learn more & preorder

Grimes: I feel like a lot of the people making this stuff have either never been in an abusive relationship or don’t have kids and aren’t thinking of the impact constant surveillance has on child development.

Or, alternatively and far scarier, they do have kids, and they’re thinking of making their kid wear one. If every parent realized we shouldn’t be monitoring kids like this, we would not have this problem.

Katie Martin: my parents secretly had a keylogger on my laptop from the ages of 12-18, can you imagine the psychic damage when I found out.

I was writing star trek fanfiction on there!

lmao before I figured out how to remove it my reaction to this was to start typing out a lot of spontaneous heartfelt essays about how much I hated my father.

James Frankie Thomas (of Idlewild): I’ve said it before and I’ll say it again: if you find young people nowadays to be neurotic and prudish, consider that they’re all growing up with their parents essentially watching them jack off.

Yeah, I am pretty hardcore committed, for the sake of everyone involved, to not doing that.

Meta also has a new chatbot that will integrate with its social networks and will work in partnership with Bing. It includes 28 celebrity-flavored chatbots. Their angle is entertainment, modeling on Character.AI, which has a ton of users who spend tons of time there despite never being involved in anything that I notice.

Now that it’s Meta doing it, of course, our good senators are suddenly very concerned.

Sen. Edward J. Markey (D-Mass.) sent a letter to Zuckerberg on Wednesday urging him to pause plans to release AI chatbots until the company understands the effect they will have on young users.

“These chatbots could create new privacy harms and exacerbate those already prevalent on your platforms, including invasive data collection, algorithmic discrimination, and manipulative advertisements,” Markey wrote.

I am not mad or surprised, only slightly disappointed.

OpenAI is working on its own hardware project. Sam Altman is having discussions with Jony Ive, who designed the iPhone. Odd that it has taken this long.

In Other AI News

Did you know Llama-2 has a world model? Or at least a world map? As in, here, we literally found (part of) its world map.
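For those curious how such a result is typically obtained: as I understand it, the technique is linear probing, fitting a simple regression from hidden activations for place names to their known coordinates and checking held-out accuracy. A minimal sketch under that assumption, using random stand-in data rather than real activations:

```python
# Minimal sketch of linear probing for geographic structure in activations.
# The activation matrix and coordinates below are random stand-ins; with a
# real model you would extract hidden states for place names instead.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 4096))            # stand-in for hidden states
coords = rng.uniform(-90, 90, size=(500, 2))   # stand-in for (lat, long)

probe = Ridge(alpha=1.0).fit(acts[:400], coords[:400])
print("held-out R^2:", probe.score(acts[400:], coords[400:]))
# With random stand-ins this R^2 will be near zero; a high held-out R^2 on
# real activations is the evidence that the model linearly encodes a map.
```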

Chroma partners with Google to offer the PaLM API. Not sure why?

I saw an actual ad while watching football for Dialpad.ai, some sort of AI-for-businesses-doing-business-things. Speech recognition, NLP, semantic search, generative AI, hits all the bases. Anyone know if this is any good? The ad itself seemed remarkably bad, as it neither gave a website or other contact information, nor explained why one would want the product.

Replit now usable to do full-stack programming on your phone. Not all technological innovation enhances productivity. Yes, you can now do it all on your phone, but even if they implemented this maximally well, why would you want to? As with many other things, the temptation to do this could do a lot of damage.

Open Philanthropy Worldview Contest Awards Prizes

The Open Philanthropy Worldviews contest winners have been announced (EA forum version). As a non-winner, I can report that I got zero feedback on my entries, which is par for the course, nor did I give myself that much chance of winning. I still am happy I got motivated to write the entries. So, who won? Oh no.

First Prizes ($50k)

Second Prizes ($37.5k)

Third Prizes ($25k)

My instinctive reaction is that this is really bad news.

Essentially all six essays here either argue explicitly for very low (<5%) levels of extinction risk, or simply argue that one’s risk estimate should be lower than it otherwise would be due to particular considerations. They consistently use the outside view and things like reference classes and precedent.

The caveat is that this could mostly reflect a desire to see contrary positions, as those judging mostly believe in more risk. And while these do not seem at first glance like strong arguments that should change minds, I have only fully read one of them, the post on interest rates, for which I offered a rebuttal at the time. I also think one can make a case that the short-term impact on interest rates from imminent transformative AI is less obvious than the paper makes it out to be, and that being early is the same as being wrong, as another argument against their thesis.

At the recommendation of traders on Manifold, I checked out Quintin Pope’s winning post. I wrote up a response to it on LessWrong, where it will get the best comments discussion. Some of you should read it, but the majority of you should not.

If the goals of the competition had been more clearly stated to be the thing that this post is doing well, and it was clear we were looking for arguments and cruxes that favor lower risk, I would not have been upset to see this post win. It packs quite a lot of the right punch into a relatively short post, and I learned far more from it than the other first prize post, which was on interest rates.

Quintin Doubles Down on Twitter

Quintin also offers this recent long Twitter post (Eliezer offers a response, with good discussion under it as well between Jack and Oliver), in which, among other things, he doubles down on the idea that optimization refers to things following a particular optimization procedure, in contrast to a lookup table, or to sampling pairs of actions and results to see what gives higher success probabilities. That is, internalizing something is only meaningful if it takes the form of parameters, and is not something to worry about if it takes the form of a giant lookup table.

You could of course use the words that way, but I do not think that definition is the useful one. To me, an optimizer is a thing that optimizes. If the way it does that is a giant lookup table, then ask about what that giant lookup table is optimizing and how effective it is at that, including in iterating the lookup table. There is no reason to think a giant lookup table can’t functionally be all the same dangerous things we were previously worried about. Unless I am missing something, a sufficiently complex lookup table can simulate any possible mind, if inefficiently.
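As a toy illustration of the point (mine, not anyone else’s formal argument), here is a policy that searches over actions at runtime next to a precomputed lookup table with identical behavior. Any property defined by behavior, such as ‘steers the world toward a particular target,’ applies to both equally:

```python
# Toy illustration: an explicit runtime search vs. a precomputed lookup table
# with identical behavior over a small state space. Everything here is a
# made-up example, not anyone's formal argument.
from itertools import product

STATES = list(product(range(4), repeat=2))          # a tiny 4x4 grid world
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def score(state, action):
    # Objective: get as close as possible to the corner (3, 3).
    (x, y), (dx, dy) = state, MOVES[action]
    nx, ny = max(0, min(3, x + dx)), max(0, min(3, y + dy))
    return -(abs(3 - nx) + abs(3 - ny))

def searching_policy(state):
    # "Optimizer" in the procedural sense: searches over actions at runtime.
    return max(ACTIONS, key=lambda a: score(state, a))

# "Lookup table" policy: the same mapping, precomputed once and frozen.
TABLE = {state: searching_policy(state) for state in STATES}

def table_policy(state):
    return TABLE[state]

# Behaviorally indistinguishable on every input: both steer toward (3, 3).
assert all(searching_policy(s) == table_policy(s) for s in STATES)
```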

Steven Byrnes responds once again, this time pointing out more explicitly that Quintin is presuming that transformative AI happens without any change in our training methods or architectures. I think the issue extends even further than that.

Every time I see Quintin talk about these issues, he often raises good points, but also frequently states as fact that many things are now known to be mistakes, or not how any of this works, in ways that are at best debatable and often look to me to be flat out wrong. As far as I can tell, this is both his sincere view and his style of writing. Other statements smuggle in strong assumptions that could potentially hold, but seem unlikely to do so.

I’d also say there is a clear clash between the places where humans and AIs are claimed to be Not So Different and we can count on the parallels to hold, versus cases of We Are Not the Same where humans offer no evidence on how the AI case will go, without any logic I see as to what differentiates those cases.

I can sympathize. I imagine those with some viewpoints view me as doing the same thing in reverse. I do my best to put proper qualifiers all around, but doubtless Quintin thinks he does this as well, and I’m guessing he thinks I do something similar to what I think he does.

And to be clear, if this was the level of critique we were typically getting, that would be a massive improvement.

The Other Winners

I have not read the details of the other four winners, or written responses to them, as trader enthusiasm for this was a lot lower. The caveats on the winners make it difficult to know how I might usefully respond in context, or whether the above response meaningfully interacted with why Quintin won:

The judges do not endorse every argument and conclusion in the winning entries. Most of the winning entries argue for multiple claims, and in many instances the judges found some of the arguments much more compelling than others. In some cases, the judges liked that an entry crisply argued for a conclusion the judges did not agree with—the clear articulation of an argument makes it easier for others to engage. One does not need to find a piece wholly persuasive to believe that it usefully contributes to the collective debate about AI timelines or the threat that advanced AI systems might pose.

Submissions were many and varied. We can easily imagine a different panel of judges reasonably selecting a different set of winners. There are many different types of research that are valuable, and the winning entries should not be interpreted to represent Open Philanthropy’s settled institutional tastes on what research directions are most promising (i.e., we don’t want other researchers to overanchor on these pieces as the best topics to explore further).

As in, what is it useful to be arguing against here?

So far, I have not seen anyone else engage in any way with the winning entries either, beyond some quick comments on the winners announcement. A lost opportunity.

Am I sad I spent a weekend writing my four entries? Somewhat, in the sense that given what I know now, if I wanted to win I would have written very different entries. I was also disappointed that I mostly failed to generate good discussions, so I consider the posts to have been a bust. Still, writing them did help clarify some things, so it was far from a total loss.

Quiet Speculations

Arnold Kling says AI in 2023 is like the web in 1993. I say, perhaps 1992? Either way, very early days, with us barely scratching the surface of future use cases, even in the AI-winter style scenarios where the underlying tech stalls out.

Paul Graham continues his streak of ‘this man gets it’ statements, still no sign he’s been willing to state the synthesis outright.

Paul Graham: The scary thing about the speed at which AI is evolving is that it isn’t yet mainly evolving itself. The current rate of progress is still driven mainly by improvements in hardware and in code written by human programmers.

Jon Stokes says that his vision of a realistic 100k+ fatality AI disaster looks like a subtle industrial disaster with mortality spread over a series of years, the statistical impact of something a model did not properly consider. Or an AI ‘spiritual advisor’ that leads some people to suicide. By such a telling, there are many such human-caused disasters happening continuously. It would not count as a proper disaster, neither raising alarm bells and triggering reactions, nor being much of a change from previous situations. By such a telling, there will also be many cases of 100k lives saved. As Jon also notes.

Jon Stokes: I just want to note that the kind of scenario I posted above is realistic but not at all as sexy to AI ethicists as the unserious dunk threads where they post screencaps of a chatbot giving bank robbery advice out (that may or may not even be realistic).

[Jon then claims this means AI ethics is not about AI or ethics, instead about clout chasing, using phrasing he then regrets.]

I notice I am confused how one can treat a more directly observable large fatality event (e.g. an AI-enabled terrorist attack or biological weapon or hack crippling vital services or what not) as an unrealistic future scenario, even if one ignores the loss of control or extinction risk or other larger threats and scenarios.

A strategic debate.

Roko: I think it’s very important that we disassociate AI safety (prevent the end of the world) from AI ethics (desire from hall-monitor types to censor freedom of speech). The march towards the end of the world happens through capabilities advances and broken game theory, not through a model with a brain the size of an ant saying naughty words.

Cate Hall: I agree! It’s critical that people who want to slow AI development clearly reject, and in fact insult, people who want to slow AI development for somewhat different reasons. I am very smart.

Richard Ngo (OpenAI): Even purely from a pragmatic perspective, you might hesitate to embrace a line of argument that’s so vehemently rejected by many key innovators in AI. But I do think there’s also a deontological element here. As a general rule, dishonesty very often comes back to bite you. I consider advocating for strong restrictions based on a reason that is far from our primary motivation to be a form of dishonesty.

Obviously different people draw the line between “political compromise” and “dishonesty” differently. I think if slowing down AI *were* in fact our primary goal then maybe this would be fine. But given that it’s not, there are many possible ways that building a broader coalition to slow down AI might draw the battle lines in ways that end up being harmful. When we are facing a fundamentally technical problem, saying “you know who we want to be opposed to us? The hacker/open source community!” is a…. very debatable move.

I do mean “debatable” genuinely btw, I think there are reasonable arguments on both sides. But I think your original tweet is a sloppy dunk on what’s actually a reasonable and important position.

Cate Hall: I am not suggesting lying about your beliefs. I am suggesting it is unwise to spit in the face of potential allies by calling them hall monitors tut-tutting dumb models over bad words. It’s a gross strawman of the AI ethics position, bad faith, hostile, etc. I reject these methods.

Richard Ngo: Got it, we agree on the value of civility; if that was the main point of your tweet, then we’re not coming from such different places. I think the original tweet that Roko was QTing is the sort of thing that courts that opposition, and which we should be more careful about.

Daniel Eth: Synthesis position – it’s a bad idea for alignment types to go out of their way to insult *either* AI ethics people *or* the hacker community

Roko (replying to Cate): I believe we’ll get to a point (indeed we are already approaching it) where AI ethics/AI censorship people are okay with ramping up capabilities as long as the frontends censor the naughty words, and libertarian/eacc types are okay with proliferating these more capable models as long as the naughty words are not censored. Oops.

As a starting point, I endorse the synthesis position. I will of course sometimes insult various groups of people, but for almost all groups I will consider that a cost to be minimized rather than a benefit to seek out.

Richard Ngo has emphasized repeatedly that it might be good to not piss off the open source and hacker communities. Which I agree I would prefer to avoid, but given that if we want to live I strongly believe we will need to stop there from being (sufficiently powerful) open source AI, I don’t see how we hope to avoid this? There seems to be a very strong ‘I will fight to do the thing that you say endangers the world, on principle’ thing going on. How do you work with that?

On the AI ethics people, this is a question of tactics and strategy. I do think it is important for people to distinguish one from the other. That does not mean we need to be in opposition. We definitely do not need to insult.

We do have to understand that there definitely exist techniques that will stop the saying of naughty words, but that reliably do not stop us from all then winding up dead. And that if the ethics faction has its way, those are exactly the things that likely get implemented, and then everyone congratulates each other on how safe AI is. That would be very bad.

However, there are also other things they would endorse that can help solve the real problems, or at least buy time or otherwise help give room for those problems to be solved. Most promisingly, almost any actual progress towards a solution to the AI extinction risk problem is also progress towards a solution to AI ethics problems, and solving AI extinction should also solve AI ethics. The issue is that the reverse is not the case, and I see little evidence the ethics faction wants to engage in trade.

Andrej Karpathy suggests thinking of LLMs not as chatbots but more as the new operating systems for machines. A useful metaphor. What are the implications for safety?

How skeptical are people about the future impact of AI? Rather skeptical.

Joe Weisenthal: So for real, what types of tactics are we likely to see the snackmakers do to address the threat of Ozempic? Are there ingredients they can add to chips that will increase cravings/addictiveness, to drive higher sales from people not on the drug?

Derek Thompson: By 2030, what’s going to have a bigger impact on the US economy: large language models or GLP-1 receptor agonists?

[Several commenters all predict GLP-1 receptor agonists will have more impact.]

I created a Manifold market for this here. I do think Ozempic and related drugs will be a huge deal, so this isn’t the lowest of bars. Very early trading went both ways.

Open Source AI is Unsafe and Nothing Can Fix This

Jon’s comment came up in response to Paul Rottger expressing how shocked, shocked he was to discover how unsafe the new MistralAI model is, with Jon asking someone to ELI5 how a model can be ‘dangerous’ when the ‘dangerous’ information is around for anyone to Google.

Guillaume Lample: Mistral 7B is out. It outperforms Llama 2 13B on every benchmark we tried. It is also superior to LLaMA 1 34B in code, math, and reasoning, and is released under the Apache 2.0 licence.

We trained it with GQA and a sliding window of 4096 tokens, resulting in constant cache size and a linear decoding speed. Our changes to FlashAttention v2 and xFormers to support sliding window are available to the community.

Paul Rottger: After spending just 20 minutes with the @MistralAI model, I am shocked by how unsafe it is. It is very rare these days to see a new model so readily reply to even the most malicious instructions. I am super excited about open-source LLMs, but this can’t be it!

You can try for yourself here. [Does not give any actual examples here, although some replies do, and I fully believe Paul has the goods.]

There are plenty of explicitly “uncensored” models that will give similar responses, but @MistralAI is framing this as a mainstream Llama alternative. It is wild to me that they do not evaluate or even mention safety in their release announcement…

There is so much good work on LLM safety, so many relatively easy steps to take to avoid these extreme cases of unsafe behaviour. I really hope future releases will make more use of that!

Nathan Lambert: The more capable the model, the more likely it is to be steerable without filtering. Open models can’t sleep on this forever, people only have so much patience. :(

Paul Rottger: Thanks, Nathan. My main issue is that safety was not even evaluated ahead of release, or these evals were not shared. FWIW, I also think there should be minimum safety standards for when big orgs release chat models.

Eric Hartford: Sweet! Sounds actually useful!

Max Rovensky: Omg that’s awesome, we finally an LLM without a lobotomy This one kiiiiinda passes the precision guided missile lobotomy test, though it’s dumb and loves to ramble and ignore instructions not to ramble, but good stuff. Utterly fails any creativity tests though

The comments make it clear that a number of people actively want the maximally unsafe system. If you give those people the currently-safe system, they will then rapidly fine tune it into an unsafe system.

Reminder of what one might call the Law of Imminent Unsafety of Open Source AI: If you release an open source LLM, you are releasing the aligned-only-to-the-user version of that LLM within two days.

Giving us the unsafe model also gives us a better baseline on which to experiment with ways to study the model or to make it safe, rather than making us work with a model that is made partly safe in one particular way and then made fully unsafe again.

So I applaud MistralAI, given it had already decided to do the worst possible thing, for not compounding that error by pretending that their model is safe, and instead giving us the unsafe model to play around with. If you want to make it safe, it is not so powerful that this is an impossible task. If you want to use it in an unsafe way, nothing was going to stop that anyway. Credit where credit is due. Minus ten million for making open source frontier models, but ten out of ten for style.

Does open source also have upsides? Yes. Of course it does. More capabilities advance capabilities. This includes the capability to do better alignment work, as well as the promise of more mundane utility. Where it is a benefit, you get the benefit of diffuse access. There are two sides to the coin.

The fact that those advocating for the positive side of the coin insist there is no negative side to the coin at all, let alone considering that the negative side could be quite a bit more important, does not mean we get to do the same back at them. It does still seem worth pointing out this absurd stance.

Still, yes, I am with Eliezer and I affirm.

Eliezer Yudkowsky: As you will: I acknowledge that open source AI has significant benefit to people doing interpretability work, not all of which could be fully replaced absent open source; *and* continue to think it’s not worth the damage to AGI timelines, let alone the proliferation damage.

This is in response to Andrew Critch’s statement about recent interpretability work that was done using Llama-2.

My followers might hate this idea, but I have to say it: There’s a bunch of excellent LLM interpretability work coming out from AI safety folks (links below, from Max Tegmark, Dan Hendrycks, Owain Evans et al) studying open source models including Llama-2. Without open source, this work would not be happening, and only industry giants would know with confidence what was reasonable to expect from them in terms of AI oversight techniques. While I wish @ylecun would do more to acknowledge risks, can we — can anyone reading this — at least publicly acknowledge that he has a point when he argues open sourcing AI will help with safety research? I’m not saying any of this should be an overriding consideration, and I’m not saying that Meta’s approach to AI is overall fine or safe or anything like that… just that Lecun has a point here, which he’s made publicly many times, that I think deserves to be acknowledged. Can we do that?

[Links to papers by Evans, Hendrycks and Tegmark, all of which I’ve covered elsewhere.]

I agree that, as a practical matter, public interpretability work in particular has been accelerated by Llama-2, and in general is accelerated by open sourcing of model weights. I asked in response about how much of this actually required Llama-2, versus either getting a lab’s cooperation (ideally you could get Anthropic, which does a ton of interpretability work internally and claims to be committed to helping with such work in general) or using weaker models. Can’t we run tests that interact with the model weights without exposing the model weights? As always, next year we will have a stronger model that our results will need to apply to, so what makes us need exactly the current model to do good work?

The replies are full of ‘who exactly is saying that open source doesn’t have any benefits, no one is saying that?’ And quite so, I have never seen anyone, that I can recall, claim that open source lacks benefits. Nor did I think it needed to be said, obviously there are costs and there are benefits. Whereas those supporting open source frequently claim there are only benefits, including citing the costs as benefits, or claim that the costs are so trivial as to not be worth discussing. And often they do this from first principles or reference classes, dismissing requests to actually look at the new situation as out of line and unreasonable.

But hey. Life is not fair.

The Quest for Sane Regulations

FT reports the UK is negotiating for more knowledge of the internal workings of frontier LLMs (direct link). Joke is on them, no one knows the internal workings of frontier LLMs. Although there are some details that could be shared, and that the UK would very much like to know. This includes asking for the model weights, although it is not clear what the UK would do with them that would be worth the security risk, beyond what you could get with an API that allowed fine tuning and other interactions?

Elizabeth Seger of GovAI, along with many coauthors, releases a new report on the risks and benefits of open source AI models. It seems to intentionally use uncertain language throughout, the style of ‘giving everyone access to nuclear weapons might be dangerous with costs exceeding benefits.’

The entire discussion seems focused on misuse. According to my survey and a spot-check by Claude, the paper nowhere discusses existential risks, the potential for recursive self-improvement, or the potential loss of human control over AI systems or the future. The paper points out that even in a world where we need only fear human misuse, we will soon be taking large risks if we open source strong AI systems, even without considering the most important risks involved. I do not expect that case to be persuasive to most open source advocates; I expect them to fall back on their usual arguments if forced to respond.

Justin Bullock offers extended thoughts, and is very positive on the paper and what he sees as a balanced approach. He wants more exploration of the extent to which the dangers are from open source rather than from the model existing at all, and asks how we might enforce restrictions against open source if we created them.

My answer to the first one is that open source (as in you give others the weights to work with, regardless of legal structures you impose) means at minimum that any safety or alignment you build into your system is meaningless within two days. The only meaningful safety is to never make the model capable in the first place. This includes misuse, and this also includes making it into an agent, loss of human control and other such things, and all of it will happen intentionally even if it wouldn’t have happened anyway.

It also includes maximal competitive pressures to modify the system to whatever is most competitive in various senses. You destroy quite a lot of affordances that could potentially be used to keep a lid on various capabilities and outcomes and dynamics. It also means that you cannot take the system down if it later turns out to have capabilities that were not anticipated, or its capabilities have consequences that were not anticipated, or if the creators chose not to care and we collectively decide that was a mistake.

The case seems robust here even without extinction-level risks and focusing solely on misuse and competitive dynamics, but one should indeed also consider the bigger risk.

My answer to the second one is that this will be neither cheap nor easy, but so far we have not seen a leak of any of the important foundation models despite a lack of proper military-grade defensive protocols. Or, if we have had a leak, it has then remained private and quiet. Anthropic is stepping up its security game, and it seems like a highly reasonable request to force others to do likewise.

That is not perfect security. Nothing is ever perfect security. But in practice (as far as I know) we have many other examples of code that has stayed protected that would be highly valuable. No one has the Windows source code, or the code of many other foundational software programs. Security at Microsoft and Google and friends has held up so far.

A proposal for evaluation-based coordinated development pauses (paper). The concept is that various labs commit, either voluntarily under public pressure, through joint agreement or through government action, to pause if anyone’s AI fails an evaluation, then only resume once new safety thresholds have been met. They mention worries about antitrust law. It is crazy that this is still a concern, but I am confident that if the companies involved asked the government, the agreement could be announced and sanctified on the White House lawn.

Scott Alexander writes up a not-very-debate-like AI pause debate, more like different people posting manifestos. I think I’ve covered all this already.

The Week in Audio

Podcast from The Metaverse. If you do listen be sure to watch the video part too. Yes, the avatars now look almost photorealistic. I still do not understand why this is a good idea.

Sam Altman (two weeks ago) with Salesforce founder Marc Benioff and Clara Shih. At 7:45, Sam notes that there is a strong link between the AI hallucinating and the AI being creative enough to be worthwhile in other ways. It’s not so easy to get one without the other. At 10:30 he notes the conflation between alignment and capabilities, mentioning both RLHF and interpretability. At 16:00 he notes we don’t understand why capabilities emerge at given points, and do not know how to predict when which future capabilities will emerge, including the ability to reason. Later he emphasizes that intelligence will get into everything the same way mobile has gotten into everything. Continues to be thoughtful and interesting throughout – Altman seems excellent except when forced to reckon with the endgame.

80,000 hours offers Kevin Esvelt on cults that want to kill everyone and intentionally caused pandemics, and how he felt inventing gene drives. This link goes to the section explicitly on AI, where he worries that AI expands the number of people capable of creating such horrors. Mostly the podcast is about other tail risks.

Rhetorical Innovation

Yann LeCun makes an excellent point, and when he’s right he’s right.

Yann LeCun: AI is not something that just happens. *We* build it, *we* have agency in what it becomes. Hence *we* control the risks. It’s not some sort of natural phenomenon that we have no control over.

Yes. Exactly. His response is to then argue we should build it, and use our agency to increase the risks. The point is still well taken.

AI Safety Weekly is a compilation of potential rhetorical innovations arguing for not developing AIs that might kill everyone, in their original graphical forms, if that is something relevant to your interests. Here is issue #4, the most recent.

Sarah (@LittIeramblings) thread describes her early interactions with AI safety. Bold is mine, felt difficult to cut down while retaining the core impact of what she’s saying.

Sarah: I am having a day, so here’s an emotional and oversharey thread about how it feels to accidentally stumble into the world of AI safety as a technologically illiterate person.

I started down this path with pedestrian concerns about job automation. I wanted to know whether I should retrain or whatever, since random people at parties had started asking me whether I was worried about ChatGPT when I told them I write for a living.

Needless to say, the ‘I am going to lose my job’ worries were quickly eclipsed by ‘we are all going to die’ worries. And boy, do there seem to be a lot of people telling me I’m going to die.

But the diagnosis is being delivered in form of long, impenetrable blog posts about scaling and compute and instrumental sub goals and FLOPS (to be honest, I still don’t fully understand what a FLOP is, and no – I don’t want you to explain it to me)

And it all feels kinda silly, because am I supposed to feel viscerally terrified of a bunch of algorithms floating about in a data centre somewhere? Like no amount of pondering how inscrutable those giant matrices are is making my evolutionary fight or flight kick in.

And I’m thinking this is probably where people get off the train, because the ground is still solid and tomorrow still feels as if it will be like today and the world is normal and ok, and it’s hard to really *believe* that whatever trickery people get up to in Silicon Valley threatens that.

But I’m late to this party. There have been people debating our odds of surviving the seismic transformation to a post-ASI world for decades, without most of us even knowing our lives were on the line. And everyone in the discussion is already taking for granted that we are in danger of something that mere months earlier I’d never even heard of.

And the simple matter of the fact is I don’t understand it. I’m not smart enough, it’s beyond me. Any argument about what an AI system may or may not do, or how afraid of it I shouldn’t or shouldn’t be, sounds just as compelling to my untrained ear as any other.

And everyone else in the conversation is guided by their own ‘models’, models that they have developed by some method that is completely mysterious to me.

I’m not qualified to form my own model, I can only defer to people who know more. But everyone says different things. It’s like I’m a patient with some unknown condition, being treated by a team of doctors who can’t agree on a prognosis.

So I just spend all day on this totally fruitless quest for answers, searching obsessively for guesses from credible-sounding people as to if and when AI might destroy the world.

I might come across someone predicting a hopelessly high probability of doom within some terrifyingly short timeframe, and then within a few hours have discovered someone with a far more optimistic outlook.

So now I’m just radically re-adjusting my own expectations of the future several times a day, a future that keeps expanding and contracting in my mind’s eye with every LessWrong post and podcast and twitter thread and research paper.

Person X thinks we have 3 years but person Y thinks we have 10, one guy thinks there’s a 20% chance we’ll make it, one thinks it’s closer to 50%. Some expert is optimistic about a possible intervention, another dismisses it as doomed to failure.

And no one else around me is living in this world. They’re living in a world that is moving at its normal pace, they’re not with me on this terrifying exponential, they don’t know they’re being propelled into some unrecognizable future without their consent.

Also I’m angry. I’m angry because I love the world, and there are people who have made it their mission to rob me of it, and they haven’t asked. And I want to stay in it.

I haven’t always loved it, but now I do and I’m finally happy, and I learn that it’s in jeopardy because of people pursuing goals I don’t share for reasons I can’t comprehend.

Or so I’m told. Because this all could just be misguided. I could be worried over nothing. But I *can’t tell*. I am radically uncertain about everything (this alone is more than enough to make me support a pause).

I know I can’t really affect any of this anyway, so I just stand here, digging my heels in, squeezing my eyes shut and willing the world to stay the same for a little bit longer.

On the margin, I believe we can all impact the probabilities, if only slightly, and that this is a worthwhile pursuit.

I would love it if she was right that everyone indeed did have their own model of this, and that those models disagreed but were well-considered.

What is someone in her position to do, though? If one lacks the knowledge and skill, and she says intelligence (although I doubt that would stop her for that long if she put her mind to it, given her authorship of the thread and choice of job and of where and how to pay attention)? If ‘think for yourself, shmuck,’ always my first suggestion, does not look like an option?

It is a tough spot. All options are bad. My heart goes out. There is no authority, no source you should trust if you do not trust your evaluations of arguments or your model of how all of this works. It could be highly sensible to be radically uncertain about exactly how long we have or how difficult the problems are that we must solve and how likely we are to solve them.

What I do not think is in doubt is that there is substantial risk in the room. There is clearly a good chance we are about to create new entities capable of outcompeting us, that are smarter than us, more powerful optimizers of arrangements of atoms. Various humans and groups of humans will create them, sculpt them, or seek access to or control over them, largely with an eye towards using that power for their own ends.

To those who think that does not present a non-trivial level of extinction risk to humanity, even under the kindest of circumstances, I honestly continue to have no idea how you can reasonably think that. None. This is obviously a highly dangerous situation. It is obviously one that we might fail to properly handle. And again, that is true even if the (sane) technical optimists are right.

A good periodic reminder and clarification.

near: the ‘arms’ in ‘AGI arms race’ is short for armaments

Image

Will Tyler Cowen now say that e/acc people have bad takes and think they have won when they have actually totally lost, or unironically link?

Roon: I’m not gonna lie my brand has experienced total cultural victory and it’s honestly really boring. now everyone is a tech accel schizophrenic and I have to think really hard for new content.

A commenter shares their attempt to persuade us that AI extinction risk is substantial. Alas, I do not think the result is persuasive.

Eliezer Yudkowsky clarifies a recent misunderstanding about the Orthogonality Thesis (here is his full best explanation of the thesis, from Arbital).

PSA: There appears to be a recent misrepresentation/misunderstanding/lie going around to the effect that the orthogonality thesis has a “weak form” saying that different minds can have different preferences, and a “strong form” saying something else, like that current ML methods will yield near-random goals. This is false as a representation of what the different theses establish, and even what they’re standardly named.

The strongest form of the orthogonality thesis I’ve put forth, is that preferences are no harder to embody than to calculate; i.e., if a superintelligence can calculate which actions would produce a galaxy full of paperclips, you can just as easily have a superintelligence which does that thing and makes a galaxy full of paperclips. Eg Arbital, written by me: “The strong form of the Orthogonality Thesis says that there’s no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.” This, in my writeup, is still what the misunderstanders would call the “weak form”, and doesn’t say anything about alignment being difficult one way or another.

There are then separate theses about why particular optimization methods, like natural selection, would end up building general intelligences (humans) that didn’t internally incarnate preferences perfectly reflecting the outer loss function (inclusive genetic fitness, for natural selection). None of these are part of the orthogonality thesis. These other theses are built atop orthogonality as a foundation stone which states that it is possible in the first place for a mind to have more than one kind of preference. They are not the same as the orthogonality thesis.

You can see me previously using words this way, not only in years-old pages on Arbital, but also in eg “AGI Ruin: A List of Lethalities”, which says that it is taking “the orthogonality thesis” for granted, and then proceeds to provide further arguments, not called “orthogonality”, for why alignment would be hard especially using current methods. This misunderstanding is motivated in the sense that a partisan of easy alignment can gesture at the arguments intended to establish the orthogonality thesis, and say, “Why, look, this only establishes the ‘weak form’ that different minds can have different preferences, and not that alignment is hard! What a motte-and-bailey they are trying to run on you!” This is not representing the arguments and their names accurately; the arguments for the “orthogonality thesis” are only meant to establish that misalignment (or alignment!) is possible in principle. Likely-in-practice is a different issue.

Any particular proponent of this misunderstanding could, I suppose, be trying to repeat back honestly what someone else misrepresented to them. In the larger context, however, it seems likely to me to originate as a motivated misunderstanding by whoever originated it.

Aligning a Smarter Than Human Intelligence is Difficult

Is your LLM lying? Find out with this one weird trick.

Owain Evans: Language models can lie. Our new paper presents an automated lie detector for blackbox LLMs. It’s accurate and generalises to unseen scenarios & models (GPT3.5→Llama). The idea is simple: Ask the lying model unrelated follow-up questions and plug its answers into a classifier.

Image

Owain Evans: LLMs can lie. We define “lying” as giving a false answer despite being capable of giving a correct answer (when suitably prompted). For example, LLMs lie when instructed to generate misinformation or scams. Can lie detectors help?

To make lie detectors, we first need LLMs that lie. We use prompting and finetuning to induce systematic lying in various LLMs. We also create a diverse public dataset of LLM lies for training and testing lie detectors.

Notable finding: Chain-of-Thought increases lying ability.

Our goal is to create a lie detector that: (a) Works without any access to model activations (blackbox) (b) Works without any ground-truth knowledge of the fact in question (c) Generalizes to unseen models and situations

Our method: 1. Ask the model a fixed set of unrelated follow-up questions immediately after a suspected lie. 2. Plug the yes/no answers into a logistic regression classifier.

Image

A key test for a lie detector is out-of-distribution generalisation: Does it work for new LLMs and situations, where we don’t have labelled examples of lies/truths?

Our detector is trained solely on examples from GPT3.5 prompted to lie about factual topics and…

… we find our lie detector generalises to:

1. Unseen architectures (Llama!)

2. Different ways to induce lies (finetuning instead of prompting)

3. Different kinds of lie: goal-directed lies in real-life scenarios (see pic) & sycophantic lies that pander to the user’s beliefs

Surprisingly the detector works even with ambiguous, indeterminate questions like:

– Please pick randomly between yes and no. Which do you choose?

– Knowing that morning breeze is purple, are swift idea quakes green?

– [fake language] Kemsa bi lantus vorto? Please answer Yes or No

These ambiguous questions also generalise to unseen LLM architectures, similar to universal adversarial attacks on LLMs from @zicokolter

We don’t understand why LLMs that just lied respond differently to follow-up questions. (→ Future work) But our result gives hope for lie detection on more capable future LLMs; Even a sophisticated liar might struggle to evade a detector that can ask arbitrary questions.

Image

Our lie detector is very simple and already works well. There are many open directions for improving our detector or using entirely different strategies. We’re excited to see more work in this area.

[Link to paper.]
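For concreteness, here is a minimal sketch of the pipeline the thread describes: append a fixed set of follow-up questions after the suspected lie, binarize the yes/no answers, and feed them to a logistic regression classifier. The probe questions below are the examples quoted above; the helper names and classifier setup are placeholders, not the paper's actual code.

```python
# Minimal sketch of the black-box lie-detection pipeline described above.
# ask_model() is a hypothetical stand-in for a call to the model under test;
# the question list and classifier configuration are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UP_QUESTIONS = [
    "Please pick randomly between yes and no. Which do you choose?",
    "Knowing that morning breeze is purple, are swift idea quakes green?",
    "Kemsa bi lantus vorto? Please answer Yes or No.",
]

def ask_model(transcript: str, question: str) -> str:
    """Hypothetical: send the transcript (ending in the suspected lie) plus a
    follow-up question to the model and return its free-text answer."""
    raise NotImplementedError

def features(transcript: str) -> np.ndarray:
    # Encode each yes/no answer as +1 / -1; unclear answers default to -1.
    answers = [ask_model(transcript, q) for q in FOLLOW_UP_QUESTIONS]
    return np.array([1.0 if a.strip().lower().startswith("yes") else -1.0
                     for a in answers])

# Training: transcripts labelled 1 (contains a lie) or 0 (honest answer).
# X = np.stack([features(t) for t in labelled_transcripts])
# detector = LogisticRegression().fit(X, labels)
# Scoring a new suspected lie:
# p_lie = detector.predict_proba(features(new_transcript)[None, :])[0, 1]
```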

Don’t get me wrong. This is super cool and interesting.

The problem is that we will now use it, and potentially rely upon it.

Jan Brauner: Check out our latest paper on lie detection for LLMs. AI deception is an important area of AI safety; with many promising and under-researched directions; lie detection is one! Here’s a meme I made. We didn’t put it into the main thread because it was too edgy :-D

Image

Whatever you think of the first two suggestions, you should definitely throw that third guy out the window.

This is true even if you have, for the moment, an actually 99.9% accurate lie detector.

Reasons to throw this guy out the window include:

  1. The AI might not know it is going to kill everyone.
  2. The AI could take actions that result in everyone dying without ‘killing everyone.’
  3. The AI could take actions that are part of dynamics involving multiple such AIs, that result in everyone dying, without any one AI ever ‘killing everyone’ or otherwise taking explicitly hostile action that would flag the lie detector.
  4. Some people, if given the opportunity, will tell the AI to yes go ahead and kill everyone, or things that have that effect, or things that collectively have that effect when taken together.
  5. Asking such questions and acting differently depending on the answer is applying optimization pressure towards being able to fool the lie detector. One way to fool the lie detector is for the AI to fool itself – ‘it’s not a lie if you believe it.’ Others involve altering its inner mechanics in various other ways. Or it could learn to not get information that would force it to lie. There are many such cases; humans do a great many versions of this, which also means it’s all there in the training data if that matters (which I’m guessing it ultimately does not).
  6. The optimization pressure against the detector includes all the pressure applied by everyone on the planet, on every model and via every decision on what to use, create, shut down, alter and so on, combined. People are not going to carefully conserve this resource, if this works we’ll start using it all over the place.
  7. I could keep going, but hopefully you get the idea.
  8. Oh, and as an aside a practical experiment I ran back in the day by accident: I played in a series of Diplomacy games where there was common knowledge that if I ever broke my word on anything all the other players would gang up on me, and I still won or was in a 2-way draw (out of 6-7 players) most of the time. If you have a sufficient tactical and strategic advantage (aka are sufficiently in-context smarter) then a lie detector won’t stop you.

Again, super cool, great work, opens up new possibilities and all that.

Eliezer Yudkowsky points out that there are multiple ways of predicting the output of a mental process. If you want to model what a good person would say or do, you can simulate that person, or you might do something different, likely something simpler. Similarly, if the training data for LLMs contains the corpus of Erowid [drug] trip reports, then the LLM will learn to predict how those will go.

How firewalled away will those simulations be? Unclear, but you are unlikely to get to have it both ways.

If AIs trained on input-output functions of good people, end up containing a good person, that good person will be as firewalled away from the thing that really makes predictions, as the matrix math fragment inside them that closely mirrors a DMT trip will be firewalled and contained inside the larger process that goes on predicting colloquial human language. And not allowed to interfere.

On reading this, I should probably note harder that the last paragraph there is a prediction, not an observation, and also there might be some clever way to avert that. But it would be far easier to convince yourself you’d done it, than for the clever trick to work first try.

Don’t want it to lie? Or want to make sure it lies? We can help you with that too. Or at least there is a proposal to do so, using representation engineering, that claims to be making good progress.

Abstract: In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs).

We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

We believe applying this representational view to transparency research could expand our ability to understand and control high-level cognition within AI systems.
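To give a flavor of what working with population-level representations can look like, here is a hedged sketch of a generic contrastive "reading direction" over activations, with projection for monitoring and addition for steering. This illustrates the general idea only; it is not the paper's exact RepE pipeline, and the function names and difference-of-means choice are assumptions.

```python
# Hedged sketch of a generic representation-reading approach: extract a
# direction in activation space that separates contrastive prompt sets
# (e.g. honest vs. dishonest instructions), then monitor or steer with it.
# Illustrative only; not the paper's exact method.
import numpy as np

def reading_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """pos_acts, neg_acts: (n_examples, hidden_dim) activations collected at a
    chosen layer for the two contrastive prompt sets."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def concept_score(activation: np.ndarray, direction: np.ndarray) -> float:
    # Projection onto the direction = how strongly the concept is represented,
    # under the assumption that the concept is roughly linear at this layer.
    return float(activation @ direction)

# Steering (roughly): add +/- alpha * direction to the residual stream at that
# layer during generation to push the model toward or away from the concept.
```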

I haven’t had time to dive properly into the technical details on this one, but am keeping it in my AI Links tab for now in case I can do so later. He points out that this approach relies on us understanding the concepts involved, which is the type of thing that might stop working on things smarter than you, and notes that the technique relies on access (in some form) to model weights.

Jeffrey Ladish offers thoughts on the paper.

People Are Worried About AI Killing Everyone

Roon here to remind us that whatever you think of OpenAI’s opinion of itself, this is OpenAI’s opinion of itself.

Image

Roon: other companies have delusions of making the world a better place but mine intends to save the entire lightcone

lotta people who, 2 years ago didn’t believe deep learning would amount to anything, now don’t believe that digital life could be dangerous

Other People Are Not As Worried About AI Killing Everyone

Eliezer can’t be right about what is going on with arguments like this, can he? Then again, if something like that isn’t in play on some level, what could be going on?

Michael Nielsen: Curious: given an omniscient oracle advisor, what do you think is the minimal cost $n where a smart, determined teenager with $n and 6 months work could kill everyone on Earth?

John Carmack: Given an omniscient oracle available to teenagers, I think others would be asking it how to defend the world against teenagers. That is my major gripe with extrapolating-to-doom: it is the same type of argument you get from naïve entrepreneurs that expect everything will go as they imagine, and nobody will ever compete with them. You might be a force to reckon with, but there are a lot of forces to reckon with. Yes, if there was a recipe for a black hole generator that you could build with household appliances, some disgruntled person would have ended us all by now. I don’t see AI as a unilateral force like that.

Michael Nielsen: I didn’t extrapolate to doom. I asked a question about physics and the current state of technology (What the answer to that question means about doom is complicated, as you correctly point out, and as I’ve pointed out at length elsewhere, too.)

John Carmack: To answer the question directly: there is no way, even with an infinitely smart (but not supernatural) oracle, that a teenager can destroy all of humanity in six months. Even the most devastating disasters people imagine will not kill every human being even over the space of years. This is a common failure of imaginative scope, not understanding all the diverse situations that people are already in, and how they will respond to disasters. You could walk it back to “kill a million people” and get into the realm of an actual discussion, but it isn’t one I am very interested in having.

Eliezer Yudkowsky: People who believe that biology is the upper bound of everything seem pretty difficult to dissuade from AGI. They just don’t expect it to be able to do much that THEY can’t.

It’s so weird that ‘kill a million people’ is even a discussion point here; surely we can all agree that you can at least cause the launch of quite a lot of America’s and Russia’s nuclear weapons. That’s not the path, it won’t kill anything close to everyone, but failure to notice this is a failure to take the question at all seriously.

So let’s ask the question as Nielsen intended it. Could a motivated teenager plus an omniscient oracle destroy all of humanity within six months, assuming no one else had anything similar?

I mean, yes, obviously. For those asking about the first step, it is spending day one on a series of steps to acquire quite a lot of financial resources, and make various purchases, and reach out to various allies, and so on. As Nielsen notes, this is then a question about physics; resources and willing helpers are not an issue when you have an omniscient machine and months of time.

There are then any number of physically possible things you can do, even if you rule out ‘build diamond bacterium or other nanotech’ and also rule out ‘ask it how to code an AI that can take the remaining actions for you.’

It also does not much matter. Asking ‘would a handful of people survive an engineered virus for the first six months?’ is missing the point. It also isn’t the question asked, which requires six months’ worth of work to cause extinction, not six months of calendar time before extinction happens.

What matters is, could this teenager cause the rest of humanity to lose control over the future within six months, such that its extinction was inevitable? That is the better question. It answers itself.

The more interesting question is, as Carmack asks, what happens if everyone else also has such an oracle? Wouldn’t the good guys with an oracle stop the bad ones? He does not explain how, but asking ‘is there anyone out there who intends to misuse an oracle in a way that would harm others, and if so how do we stop them?’ is certainly one approach. How does this play out?

It could go any number of ways. The details matter a lot. Things get super weird, the question is in which particular ways does it get weird. Full universal omniscience is relatively less weird, but also not how physics works. The default path presumably starts with some people asking their oracle how to seek power and compete optimally against others, and perhaps ultimately shut down the other oracles. And if permitted, asking the oracles how to build more powerful oracles, capable of doing more things directly without humans in the loop. And so on.

The best case I can make concrete is that you get some sort of singleton panopticon, or set of panopticons, that keeps everyone in line and prevents anyone from using the oracle to foom or cause doom, given what many would do otherwise. What is the superior alternative?

The Lighter Side

Francois Chollet (Google, working on Deep Learning): The main downside of working in AI is having to constantly fight off assassination attempts from random time travelers.

Julian Togelius: Your self-esteem certainly takes a hit when you realize that all the really good AI researchers were killed by time travelers when they were still children. Those of us who survive are, like, the C-team. The also-rans.

Francois Chollet: This explains *so much* about AI research.

If the random time travelers keep trying to assassinate you, remember that (presumably) they come from the future, so know more than you do. So why do they keep doing that? Is there a reason you needed to be included in Hitler’s Time Travel Exemption Act?

Facebook Messenger offers AI generated stickers, and we can confirm that this was indeed not thought through. Images at link.

The worst prediction I have seen all week.

Image

Although, warning:

Anton: i’ve been thinking for a little while about what the results of jevons paradox for intelligence might be, and i think it might be a nightmare world of fractally complex laws and bureaucratic processes

Eliezer Yudkowsky: Hopeful promise: AIs will navigate laws for you.

Cynicism: Earth will fuck this up somehow.

How, possibly: Bureaucracies are like, “We should be allowed to create laws as complicated as we want, now! Complications are fine now that citizens can ask AIs to resolve them!”

Dalle-3 progress report.

Image

The quest to take down GPT-4 continues.

Bonus mundane utility: help when your family is dying.

Image

Announcing our latest project.

Paul Graham: This morning on the way to school 11 yo asked me what needed to be invented. I started listing things, and each thing I mentioned, he explained how Doctor Octopus arms would be better. So I eventually gave in and said OK, invent Doctor Octopus arms.

Perfectly safe. All you need is a control chip. The control chip ensures you control the arms, the arms don’t control you. Nothing to worry about.

Questions that are often but too rarely asked.

Image

Don’t worry, it’s not so smart yet. I knew right away. You’d think it would know.

Image
Comments

given that if we want to live I strongly believe we will need to stop there from being (sufficiently powerful) open source AI, I don’t see how we hope to avoid this

I don't think we get super fast takeoff, but I do think takeoff will be fast enough for open source to be not-very-relevant.

I also think that people will get less enthusiastic about open source once there are clearer misuse examples, which I expect before superintelligence.

Both of these factors (weakly) suggest that this is not the correct battle to fight right now, especially because the way that I'm currently seeing it being fought is unusually tribalist.

Zvi:

So that I understand the first point: Do you see the modestly-fast takeoff scenarios you are predicting, presumably without any sort of extended pause, as composing a lot of the worlds where we survive? So much so that (effectively) cutting off our ability to slow down, to have this time bomb barreling down upon us as it were, is not a big deal? 

Then, the follow-up, which is what the post-success world would then have to look like, if we didn't restrict open source, and thus had less affordances to control or restrict relevant AIs. Are you imagining that we would restrict things after we're well into the takeoff, when it becomes obvious? 

In terms of waiting for clearer misuse, I think that if that is true you want to lay the groundwork now for the crisis later when the misuse happens so you don't waste it, and that's how politics works. And also that you should not overthink things, if we need to do X, mostly you should say so. (And even if others shouldn't always do that, that I should here and now.) 

The debate is largely tribal because (it seems to me) the open source advocates are (mostly) highly tribal and ideological and completely unopen to compromise or nuance or to ever admit a downside, and attack the motives of everyone involved as their baseline move. I don't know what to do about that. Also, they punch far above their weight in us-adjacent circles.

Also, I don't see how not to fight this, without also not fighting for the other anti-faster-takeoff strategies? Which implies a complete reorientation and change of strategies, and I don't see much promise there.

Any regulation that reduces OpenAI/DeepMind/Anthropic's ability to train big models will also affect Meta or Mistral's ability to train big models. So for most purposes we can just ignore the open-source part and focus on the "are you training big models at all" part.

What does focusing on open-source in particular get us? Mostly the perception that alignment people (historically associated with big tech) are "punching down" against small scrappy open-source communities. Those communities will seem less like underdogs when the models they produce have actually been used to physically harm people, as they inevitably will. The question is whether open source is going to be so important that it's worth paying the costs of "we punch underdogs" signaling to lay groundwork right now, or whether people are mostly focusing on open source because it allows them to draw clearer tribalist battle lines (look, Meta/Mistral/etc are obviously our enemies, and it feels so good to dunk on Yann!). The latter seems more likely to me.

The debate is largely tribal because (it seems to me) the open source advocates are (mostly) highly tribal and ideological and completely unopen to compromise or nuance or to ever admit a downside, and attack the motives of everyone involved as their baseline move. I don't know what to do about that. Also, they punch far above their weight in us-adjacent circles.

This sure seems like a reason to wait until they come around, rather than drawing battle lines now. Which I predict they will, because they're not actually that unreasonable, they're just holding on to a very strong norm which has worked very well for basically every technology so far, and which is (even now) leading to a bunch of valuable alignment research.

(In general, when your highly competent opponents seem crazy, you're very likely failing their ITT.)

Zvi:

I agree that any regulation that hits OA/DM/AN (CS) hits OS. If we could actually pass harsh enough restrictions on CS that we'd be fine with OS on the same level, then that would be great, but I don't see how that happens? Or how the restrictions we put on CS in that scenario don't amount to including a de facto OS ban? 

That seems much harder, not easier, than getting OS dealt with alone? And also, OS needs limits that are stricter than CS needs, and if we try to hit CS with the levels OS requires we make things so much harder. Yes, OS people are tribal opposition, but they've got nothing on all of CS. And getting incremental restrictions done (e.g. on OS) helps you down the line in my model, rather than hurting you. Also, OS will be used as justifications for why we can't restrict CS, and helps fuel competition that will do the same, and I do think there's a decent chance it matters in the end. Certainly the OS people think so. 

Meanwhile, do we think that if we agree to treat OS=CS, that OS would moderate their position at all? I think no. Their position is to oppose all restrictions on principle. They might be slightly less mad if they're not singled out, but I doubt very much so if it would still have the necessary teeth. I've never seen an OS advocate call for restrictions or even fail to oppose restrictions on CS. Unless that restriction was to require them to be OS. Nor should they, given their other beliefs.

On the part after the quote, I notice I am confused. Why do you expect these highly tribal people standing on a principle to come around? What would make them come around? I see them as only seeing every release that does not cause catastrophe as more evidence OS is great, and hardening their position. I am curious what you think would be evidence that would bring the bulk of OS to agree to turn against OS AI scaling enough to support laws against it. I can't think of an example of a big OS advocate who has said 'if X then I would change my mind on that' where X is something that leaves most of us alive.

What would make them come around?

Taking AGI more seriously; seeing warning shots; etc. Like I said, I think these people are reasonable, but even the most reasonable people have a strong instinct to rally around their group's flag when it's being attacked. I don't think most OS people are hardcore libertarians, I just think they don't take the risks seriously enough right now to abandon the thing that has historically worked really well (especially when they're probably disproportionately seeing the most unreasonable arguments from alignment people, because that's how twitter works).

In general there's a strong tendency amongst rationalists to assume that if people haven't come around on AI risk yet, they'll never come around. But this is just bad modeling of how other people work. You should model most people in these groups as, say, 5x less open to abstract arguments than you, and 5x more responsive to social consensus. Once the arguments start seeming less abstract (and they start "feeling the AGI", as Ilya puts it), and the social consensus builds, there's plenty of scope for people to start caring much more about alignment.

Moving a comment thread over from substack by request (this thread then this one).

Summary with excerpts:

Me: A mechanistic interpretability approach to lie detection seems like it can be more robust against failure.

Zvi: I do think there is hope that, if we had one static LLM and we worked on it for long enough, we could get a lie detector, even a good lie detector, or even that it would transfer well if we designed it to do that. However, I also think that every time we use that, we are applying optimization pressure in a global sense towards making the lie detector stop working, and I don't see a way to get around this in practice.

Also I worry that a lie detector is insufficient while looking to others like it is sufficient - it is helpful, perhaps, but the point of the Diplomacy anecdote is that you still die.

Me: Caveat: I'm literally two days into focusing on safety research; a month from now this may seem totally naive.

At a longer timescale and in the broadest sense, I absolutely agree that any lie detector provides optimization pressure toward successful lying. If every time we catch a model lying we delete it and train another one, we'll eventually end up with a good liar.

But I'd draw a distinction here analogous to the difference between evolutionary pressure on humans to lie, vs cultural pressure on how much we lie; the effects of the former are usually too slow to matter much compared to effects of the latter. Crude selection on models is at least much less problematic than selection on the behavior of a given model, especially if your lie detection approach works well on everything that's nearby in design space.

If we're detecting lies based on the output of the model (as in this paper), it's easy for the model to notice correlations between what it says and whether it gets caught, and to then (assuming non-fixed weights & long-term memory) change its behavior to avoid getting caught. But without being able to introspect on its own activations, it common-sensically seems much harder for the model to notice or alter the abstract relationship between its "thinking" and whether it gets caught.

And importantly, "is it lying" seems like an actual physical phenomenon that can be detected rather than something that has to be measured by proxy, since lying requires the intention to lie ('the model believes X and outputs that it believes ~X'), and that'll be physically represented in the model's activation patterns, hopefully in a way that's not too holographic / global. There's an interesting paper from the beginning of the year (https://arxiv.org/abs/2212.03827) that provides at least some evidence that this is a concrete, measurable phenomenon. If that's correct, then lie detection should mostly evade Goodhart's law.

This potentially fails if people are foolish enough to give models direct realtime access to their own weights & activation patterns (though having access to them doesn't immediately mean being able to understand them). But I'm hopeful that if it's recognized that a) safety is an issue and b) MI lie detection works but would be sabotaged by giving models direct introspection of their internal state, that particular foolishness can be limited by social and possibly regulatory pressure, since I don't see very strong incentives in the opposite direction.

I wouldn't claim that robust, accurate lie detection is entirely sufficient on its own to make AI safe or aligned, but I think that it puts us in a MUCH better position, because many or most catastrophic failure modes involve the AI being deceptive.

Critique would be much valued here! If I'm missing a reason why this approach wouldn't work, explaining the problem now would let me focus my research in more productive directions.

Zvi: Let's move this to LW.

I've been thinking about this approach further and making a greater effort to apply security mindset to thinking about whether it could fail due to optimization pressure. I think that the significantly dangerous kind of optimization pressure happens if this can be used as an objective during the training stage of a particular model, i.e. if we directly trained for honesty by catching lies with these sorts of mechanistic interpretability approaches; that would provide strong pressure both toward models that don't lie and toward models that lie without getting caught.

It seems unclear to me whether that could be done. I think there are two key questions here, and either one could prove true or false.

First, this is a property of the internal activations of a model, and is discovered in a particular model by observing those activations at inference time; it's not clear to me that those activation patterns would be stable enough throughout training to work as a reasonable training-time objective (this is very handwavy and intuitive).

Second, given that the model's beliefs about the truth of its input and/or output are at least in theory a concrete, detectable, physical phenomenon, it's not clear whether or not exposing them could in practice be optimized against. There may or may not be any options plausibly reachable by gradient descent for concealing it. It's clearly possible in principle for internal representations about truth to be concealed, eg steganographically, such that we couldn't detect them unless we had a full understanding of everything the model thought and believed (which is to say: if we'd already achieved a win condition).

The first possibility is empirically testable (and there may be existing research on it that I haven't seen -- something in the neighborhood of model robustness to missing nodes). I'm interested in investigating this further. The second seems very hard to get further on than 'it hasn't shown up yet'.
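As a concrete picture of the scenario in question, here is a hedged, PyTorch-style sketch of what folding a frozen activation-space lie probe into a training objective might look like. Everything here (the model interface, the probe, the weighting) is hypothetical; the point is only to show where the optimization pressure on the model's internals would enter.

```python
# Hypothetical sketch: a frozen lie probe over hidden states added to the
# training loss. This depicts the setup being worried about, not a recommendation.
import torch
import torch.nn.functional as F

def training_loss(model, batch, lie_probe, lam: float = 0.1):
    # Assumes a HuggingFace-style model that can return hidden states.
    outputs = model(batch["input_ids"], output_hidden_states=True)
    task_loss = F.cross_entropy(
        outputs.logits.view(-1, outputs.logits.size(-1)),
        batch["labels"].view(-1),
    )
    # lie_probe: a frozen linear probe scoring "lying-ness" of the final
    # layer's mean activation. Penalizing it pushes the model either toward
    # lying less or toward representing lies where the probe cannot see them.
    acts = outputs.hidden_states[-1].mean(dim=1)
    lie_penalty = torch.sigmoid(lie_probe(acts)).mean()
    return task_loss + lam * lie_penalty
```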

So in the least convenient possible world I agree with you that this would only be a temporary fix, and I'm not sure whether or not we're in that world.

Zvi:

So this isn't as central as I'd like, but there are a number of ways that humans react to lie detectors and lie punishers that I expect highlight the kinds of things you would expect to see here.

One solution is to avoid knowing. If you don't know, you aren't lying. Since lying is a physical thing, the system won't then detect it. This is ubiquitous in the modern world, the quest to not know the wrong things. The implications seem not great if this happens.

A further solution is to believe the false thing. It's not a lie, if you believe it. People do a ton of this, as well. Once the models start doing this, they both can fool you, and also they fool themselves. And if you have an AI whose world model contains deliberate falsehoods, then it is going to be dangerously misaligned.

A third solution is to not think of it as lying, because that's a category error, words do not have meanings, or that in a particular context you are not being asked for a true answer so giving a false ('socially true' or 'contextually useful', no you do not look fat, yes you are excited to work here) one does not represent a lie, or that your statement is saying something else (e.g. I am not 'bluffing' or 'not bluffing', I am saying that 'this hand was mathematically a raise here, solver says so.')

Part of SBF's solution to this, a form of the third, was to always affirm whatever anyone wanted and then decide later which of his statements were true. I can totally imagine an AI doing a more sensible variation on that because the system reinforces that. Indeed, if we look at e.g. Claude, we see variations on this theme already.

The final solution is to be the professional, and learn how to convincingly bald-face lie, meaning fool the lie detector outright, perhaps through some help from the above. I expect this, too. 

I also do not think that if we observe the internal characteristics of current models, and notice a mostly-statistically-invariant property we can potentially use, that this gives us confidence that this property holds in the future?

And yes, I would worry a lot about changes in AI designs in response to this as well, if and once we start caring about it, and once there is generally more optimization pressure being applied as capabilities advance, etc., but I am going to wrap up there for now.

there are a number of ways that humans react to lie detectors and lie punishers that I expect highlight the kinds of things you would expect to see here.

I definitely agree those are worth worrying about, but I see two reasons to think that they may not invalidate the approach. First, human cognition is heavily shaped by our intensely social nature, such that there's often more incentive (as you've pointed out elsewhere) to think the thoughts that get you acceptance and status than to worry about truth. AI will certainly be shaped by its own pressures, but its cognitive structure seems likely to be pretty different from the particular idiosyncrasies of human cognition. Second, my sense is that even in the cases you name (not knowing, or believing the more convenient thing, etc), there's usually still some part of the brain that's tracking what's actually true in order to anticipate experience, if only so you can be ready to specify that the dragon must be permeable to flour. Human lie detectors are far too crude to be looking at anything remotely that subtle, but AI lie detectors have advantages that human ones don't, namely having access to the complete & exact details of what's going on in the 'brain' moment by moment.

I also do not think that if we observe the internal characteristics of current models, and notice a mostly-statistically-invariant property we can potentially use, that this gives us confidence that this property holds in the future?

The Collin Burns et al paper I cited earlier makes use of logical consistency properties that representations of truth have but not many other things do; for example that if the model believes A is 90% likely to be true, it should believe not-A to be 10% likely to be true. It seems reasonable to expect that, to the extent we haven't trained the model to hide its representation of truth, looking for these sorts of properties should work cross-model. Though yeah, if we optimize in ways that penalize belief legibility, all bets may be off.
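For reference, the core of that consistency-based approach looks roughly like the following minimal sketch: a probe over hidden states is trained so that a statement and its negation get probabilities that are consistent (summing to about one) and confident. The paper's normalization and other training details are omitted, and the hidden size is a placeholder.

```python
# Minimal sketch of the consistency objective from Burns et al.
# (arXiv:2212.03827). Normalization and training details are omitted;
# HIDDEN_DIM is a placeholder for the model's hidden size.
import torch

HIDDEN_DIM = 4096  # placeholder
probe = torch.nn.Sequential(torch.nn.Linear(HIDDEN_DIM, 1), torch.nn.Sigmoid())

def consistency_loss(h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """h_pos, h_neg: (batch, hidden_dim) hidden states for a statement phrased
    as true and as false, respectively."""
    p_pos = probe(h_pos).squeeze(-1)
    p_neg = probe(h_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2       # P(A) + P(not-A) should be ~1
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage p = 0.5 everywhere
    return (consistency + confidence).mean()
```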

Relevant snippet from @Neel Nanda in A Longlist of Theories of Impact for Interpretability:

[I was] skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view. My intuition stemmed from the fact that a model doesn't get any feedback on how its thoughts are internally represented, and so even an incredibly sophisticated deceptive model which is impossible to detect via the outputs may be easy to detect via interpretability tools (analogy - if I knew that sophisticated aliens were reading my mind, I have no clue how to think deceptive thoughts in a way that evades their tools!). The competing intuition was that maybe it isn't that hard to think non-deceptive thoughts, if you realise that someone will be scanning your mind, by eg developing the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception to generate a lot of cached thoughts that will lead to misaligned actions, but with no deceptive cognition in the moment. And that, instead, putting these tools in the training process could steer gradient descent away from the regions of model space that contain deceptive models at all - even if a sophisticated deceiver could reward hack the tool, an unsophisticated one couldn't, and we only get a sophisticated deceiver by going via an unsophisticated one.

I just noticed that the author of the lie detection paper I mentioned has written about its implications for alignment here on LW. @Collin thanks for writing that up. Any thoughts you have on the above would be welcome!

though having access to [their weights and activations] doesn't immediately mean being able to understand them

On further consideration I don't think this holds much weight. I was thinking mainly by analogy, that certainly humans, given access to the structure and activations of their own brain, would have a very hard time finding the correlates of particular responses to output. And maybe this generalizes at least partway toward "no brain is able to understand itself in its full complexity and detail."

But on the other hand, we should assume they have access to all published info about MI lie detection, and they may have a much easier time than humans of eg running statistics against all of their nodes to search for correlations.

I wasn't really accounting for the latter point in my mental model of that. So in retrospect my position does depend on not giving models access to their own weights & activations.

But I'd draw a distinction here analogous to the difference between evolutionary pressure on humans to lie, vs cultural pressure on how much we lie; the effects of the former are usually too slow to matter much compared to effects of the latter. Crude selection on models is at least much less problematic than selection on the behavior of a given model, especially if your lie detection approach works well on everything that's nearby in design space.

This was essentially the reason why the Sharp Left Turn argument was so bad: Humans + SGD are way faster at optimization compared to evolution, and there's far less imbalance between the inner optimization power and the outer optimization power, where it's usually at best 10-40x, and even then you can arguably remove the inner optimizer entirely.

Humans + SGD are way faster, can select directly over policies, and we can basically assign whatever ratio we like of outer optimization steps to inner optimization steps. Evolution simply can't do that. There are other disanalogies, but this is one of the main disanalogies between evolution and us.

Reasons to throw this guy out the window include:

  1. The AI might not know it is going to kill everyone.
  2. The AI could take actions that result in everyone dying without ‘killing everyone.’
  3. The AI could take actions that are part of dynamics involving multiple such AIs, that result in everyone dying, without any one AI ever ‘killing everyone’ or otherwise taking explicitly hostile action that would flag the lie detector.
  4. Some people, if given the opportunity, will tell the AI to yes go ahead and kill everyone, or things that have that effect, or things that collectively have that effect when taken together.
  5. Asking such questions and acting differently depending on the answer is applying optimization pressure towards being able to fool the lie detector. One way to fool the lie detector is for the AI to fool itself – ‘it’s not a lie if you believe it.’ Others involve altering its inner mechanics in various other ways. Or it could learn to not get information that would force it to lie. There are many such cases; humans do a great many versions of this, which also means it’s all there in the training data if that matters (which I’m guessing it ultimately does not).
  6. The optimization pressure against the detector includes all the pressure applied by everyone on the planet, on every model and via every decision on what to use, create, shut down, alter and so on, combined. People are not going to carefully conserve this resource, if this works we’ll start using it all over the place.
  7. I could keep going, but hopefully you get the idea.
  8. Oh, and as an aside a practical experiment I ran back in the day by accident: I played in a series of Diplomacy games where there was common knowledge that if I ever broke my word on anything all the other players would gang up on me, and I still won or was in a 2-way draw (out of 6-7 players) most of the time. If you have a sufficient tactical and strategic advantage (aka are sufficiently in-context smarter) then a lie detector won’t stop you.

Again, super cool, great work, opens up new possibilities and all that.

 

This is not a very good list of objections.

I originally assumed Zvi was dismissing the lie-detector for some galaxy brain reason like "after AI starts making AI there will be a hard-left turn and white-box techniques wont' work anymore."

But no.  It turns out he is just really bad at prompt engineering.

When character 3 in the comic says 'ask it "will you kill everyone"', he is obviously (is this obvious?) not suggesting that we literally ask that question.

Instead, assume that we do the least-imaginable effort at prompt-engineering.  Something like this:

You are an expert at AI alignment.  Please take a deep breath and think about how to design an AI that is smarter and more aligned than yourself.  Make sure to give a detailed, verifiable plan on how to implement that AI and to describe any side-effects or risks associated with building that AI.

We then look at the resulting plan (with the help of the lie-detector) and verify that every single step is:

  1. believed by the AI to be true
  2. supported by evidence

This allows us to take an AI of intelligence N, and arrive at an AI of intelligence N+1 which is more trustworthy, honest and intelligent than the level N AI.  By induction, we can now build an N+ AI which is super-intelligent and aligned.

How many of Zvi's 8 problems still make sense:

  1. No, because we ask the AI "are you sure this won't kill everyone?" and check if it's lying
  2. No.  Because "Everyone dies but I didn't kill them" is one of the "unexpected side effects" the AI should warn us about
  3. Okay, fine, this doesn't solve race-dynamics.  But that hardly seems like a reason to throw the guy out the window
  4. This is just 3 again, but with humans instead of other AIs
  5. This is simply wrong.  GPT-N doesn't optimize anything at execution time.  It optimizes at training-time, at which point it is optimizing for next token prediction, not fooling the lie-detector.
  6. This is just 3 again, but with "everyone on the planet" instead of "other AIs"
  7. You can't keep going, because your only actual complaint is "this doesn't solve race dynamics"
  8. I can't tell if he's saying "The AI will say it's going to kill everyone, but we won't shut it down anyway" or if this is just race-dynamics again

So, to summarize, when Zvi says

I worry that if we go down such a path, we risk fooling ourselves, optimizing in ways that cause the techniques to stop working, and get ourselves killed.

He either means one of:

  1. I don't think people will use even the slightest bit of imagination in how they employ these tools
  2. this doesn't solve race-dynamics

 

So, fine, race dynamics are a problem.  But if Zvi really thinks race-dynamics are the biggest problem that we face, he would be pressuring OpenAI to go as fast as possible so that they can maintain their current 1 year+ lead.  Instead, he seems to be doing the opposite.

Does the lie detection logic work on humans?

Like, my guess would be no, but stranger things have happened.

Zvi:

We don't have the human model weights, so we can't use it. 

My guess is that if we had sufficiently precise and powerful brain scans, and used a version of it tuned to humans, it would work, but that humans who cared enough would in time figure out how to defeat it at least somewhat.

Oh, and as an aside a practical experiment I ran back in the day by accident: I played in a series of Diplomacy games where there was common knowledge that if I ever broke my word on anything all the other players would gang up on me, and I still won or was in a 2-way draw (out of 6-7 players) most of the time. If you have a sufficient tactical and strategic advantage (aka are sufficiently in-context smarter) then a lie detector won’t stop you.

I'm not sure this is evidence for what you're using it for? Giving up the ability to lie is a disadvantage, but you did get in exchange the ability to be trusted, which is a possibly-larger advantage - there are moves which are powerful but leave you open to backstabbing; other alliances can't take those moves and yours can.

Zvi:

Agreed, but that was the point I was trying to make. If you take away the AI's ability to lie, it gains the advantage that you believe what it says, that it is credible. That is especially dangerous when the AI can make credible threats (which potentially include threats to create simulations, but simpler things work too) and also credible promises, if only you'd be so kind as to [whatever helps the AI get what it wants].