Regular people are often not so impressed? Reactions are so weird.
I don't think this is about the difficulties of communication (the effect persists even if you do a vivid real-time demonstration!) or about the "normal people" deliberately going in with the mindset of not being impressed. I think it's just flatly because they don't have a reference frame for grokking why this is supposed to be impressive.
It's particularly clear in the grandfather example. The man probably has no idea how any of the current technology works, it's all black-box magic to him. Why should "the box is able to generate a picture ex nihilo given a natural-language description of it" parse as significantly more impressive to him than "the box is able to find any picture on the internet given its annotation"? Given a gears-level model of things, it's obvious to us; lacking such a model, the impressiveness is just an informed attribute.
Something similar is in effect with non-tech-savvy younger people. They don't think about this stuff much.
Something similar is in effect with you, when it comes to some field you're not familiar with at all. If there were some amazing breakthrough in e.g. seismography (or linen manufacture, or galaxy-evolution modeling), and you were presented with two papers, one outlining it and one that's just a completely ordinary paper in the field, how would you tell them apart? (Besides judging the excitement with which the text is written and such.)
Definitely good to keep this in mind, but to me some of this stuff seems obviously super impressive even if you do not know the technical details. Generating complex rich pictures on demand that mostly match requested details not being impressive doesn't parse for me.
It parses for me; pretty sure a lot of people just don't see why that is impressive, and I can model their mental state. (As the XKCD alt-text notes, last century a lot of specialists thought such problems were tractable for a small group of people working for a few months. The specialists have grown wiser since then, but who's to say such understanding percolated to everyone else?)
But I suppose there is another component to this: whether you find something viscerally impressive depends on whether you think it's actually cool or useful. We here have dispositions such as "technology is cool/powerful/dangerous!", so we're very impressed and excited to see technological breakthroughs. A lot of people don't; they don't immediately see the implications, don't care that a major problem was solved, so it just looks like, say, nerds being excited by irrelevant nerd stuff.
By analogy, again, imagine that you were informed that some theorem in an obscure mathematical field far from any practical applications was proved after decades of work. Even if you grok why it was so difficult, would you be excited by this news? (Or maybe "we've finally found this obscure species of moss long thought extinct!" or "we've improved the technique for filtering clay water by 1%!".)
(Note: I am sick and tired of the cognitive trick that the alternative to ‘let anyone who wants to risk killing everyone on the planet do so’ is called authoritarian, and the assumption it would imply some sort of oppressive or dictatorial state. It wouldn’t.)
I agree up until you say "It wouldn't." The severity of governmental regulations varies with the difficulty of alignment, the difficulty of creating AGI and some other factors. In a Yudkowskian world, you'd have to have severe control over the compute supply and distribution chain, as well as over AI research, in order to be pretty confident we'd avert x-risks.
That said, there are relatively light touches which would improve our log odds by quite a bit under a broad range of views. And according to my model, there are interventions that would reduce P(doom) by like a third, maybe even a half, e.g. an annually decreasing compute threshold, tracking chips at the hardware level, restrictions on what random biochemicals you can order synthesized without any checks, etc. These are fairly restrictive, but not that onerous.
EDIT: I just realized that your statement is compatible with what I said.
A hackathon at MIT confirms that the cost of converting Llama-2 to ‘Spicy-Llama-2’ is about $200, after which it will happily walk you through synthesizing the 1918 flu virus. Without access to model weights, this is at least far more difficult and expensive. Nikki Teran has a thread summarizing.
I continue to be confused about this if they managed to undo all of the safeguards with $200 worth of compute. Does that mean Meta spent $200 in compute to install their safeguards? Perhaps naively, I thought it would take as much compute to undo some training as it would to do it in the first place. Unless, of course, they had access to the base-model. In which case, it would be trivial to undo the fine-tuning.
Yep, as the edit says I don't think we disagree on the first point - there are versions that are oppressive, but also versions that are not that still have large positive effects.
On the second point, I believe this is because it is much harder to introduce safeguards than to remove them, because removing them is a highly blunt target, whereas good safeguards have to be detailed to avoid false positives (which Llama-2 did not do a good job avoiding, but they did try). This is the key asymmetry here, the amount Meta (or anyone else) spends tuning does not help here.
Wow, what a week. We had the Executive Order, which I read here so you don’t have to, and then I have a tabulation of the reactions of others.
Simultaneously there was the UK AI Summit.
There was also robust related discussion around Responsible Scaling Policies, and the various filings companies did in advance of the Summit.
I touched on Anthropic’s RSP in particular in previous weeks, but I did not do a sufficiently close analysis and many others have offered more detailed thoughts as well, and the context has evolved.
So I am noting that I am not covering those important questions in the weekly roundup, and they will be covered by one or more later distinct posts. I also potentially owe an after action report from EA Global Boston, if I can find the time.
This post is instead about everything else.
Table of Contents
While top sections of this post are highlighted in bold, if you read one thing this week, I would read On the Executive Order.
Language Models Offer Mundane Utility
Which models offer the best mundane utility?
The article emphasizes that GPT-4 is ‘slow and expensive’ opening up room for competition. It is amazing how quickly we are spoiled, but indeed it is slow and expensive compared to the competition.
For most purposes, I assert there is no comparison. The marginal cost of using GPT-4 for human interactions is very close to zero. If a human is reading the words, do not pinch pennies. Speed can still be a question, so in some cases you would want to use GPT-3.5 or Claude Instant.
Things get more interesting when humans will not see the words. If you are simulating lots of characters in an open world, or doing a study, or otherwise going industrial in size, then the cost can add up fast. At that point, it makes sense that a non-commercial model would be enough cheaper to get an edge in some spots. As the post notes, it makes sense then to get less ‘monogamous’ on model use, the same way I use a mix of GPT-4 and Claude-2 and occasionally Bard or Perplexity.
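To make the scale argument concrete, here is a rough back-of-the-envelope sketch in Python. The per-token prices and the usage volumes are hypothetical placeholders chosen purely for illustration, not actual prices for GPT-4 or any other model, and the function name is my own.

```python
def daily_cost_usd(calls_per_day: int, tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Rough daily spend: total tokens processed, priced per 1,000 tokens."""
    total_tokens = calls_per_day * tokens_per_call
    return total_tokens / 1000 * usd_per_1k_tokens

# All numbers below are hypothetical placeholders, not real prices or volumes.
PREMIUM_PRICE = 0.03   # assumed USD per 1K tokens for a slower, pricier model
BUDGET_PRICE = 0.003   # assumed USD per 1K tokens for a cheaper model

# One human reading the replies: about twenty calls a day.
human_premium = daily_cost_usd(20, 1_000, PREMIUM_PRICE)

# An industrial job, such as simulating many characters: a million calls a day.
bulk_premium = daily_cost_usd(1_000_000, 1_000, PREMIUM_PRICE)
bulk_budget = daily_cost_usd(1_000_000, 1_000, BUDGET_PRICE)

print(f"human, premium model: ${human_premium:,.2f} per day")
print(f"bulk, premium model:  ${bulk_premium:,.2f} per day")
print(f"bulk, budget model:   ${bulk_budget:,.2f} per day")
```

Under these assumed numbers the human-facing case costs pocket change either way, while at industrial volume the price gap between models is what dominates. That is the asymmetry the paragraph above describes.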
Rowan Cheung recommends using the ChatGPT plug-ins VoxScript and Whimsical Diagrams. VoxScript is his choice for web browsing, Whimsical Diagrams displays concepts via graphs. He then suggests this instruction: “Explain [topic] extensively. Simplify the concepts and visuals to make complex topics easier to understand and engaging. Then turn it into a mind map.”
Get “Hallucination-Free” answers to legal questions, via LexisNexis.
College students adopting new AI technologies faster than professors, surprising no one. How much you use them and how much you get out of them matters, so the gap is bigger than the pure usage statistics. And students mostly have no intention of stopping, even if use is technically banned:
Language Models Don’t Offer Mundane Utility
Regular people are often not so impressed? Reactions are so weird.
It is proven that some people try GPT-4 and DALLE-3 and do not come away impressed. They look for the flaws, find something to criticize, rather than trying to understand what they are looking at. If you want to not be impressed by something, if you are fully set in your ways, then it can take a lot to shake you out of it. I get that. I still don’t fully get the indifference reaction, especially to image models. How is this not completely crazy that we can get these images on demand? The moon is in the wrong place so who cares?
It will tell you what it thinks you will react well to, not what you want to hear.
They tried to hack the ChatGPT API. What they got back was missing an h.
GPT-4 Real This Time
Sometimes a thing happens that surprises you, but only because its failure to have happened already was also surprising and forced you to update? Thus, The Unexpected Featuring.
This is big. You can now combine vision with image generation with web browsing with reading PDFs. A lot of new use cases will open up as we explore the space. Rowan first points to uploading an image and asking for variations on it, which seems exciting. Another would be asking for images based on browsing or a PDF, or asking to browse the web to verify information from a PDF. Even if these things could have been done manually before, streamlining the process is in practice a big game changer.
I don’t have access to it yet, so I won’t know how big until I have time to play with it. I will probably start with image manipulation because that sounds like fun, or with PDF stuff for mundane utility to see if it can replace Claude-2 for that.
That is what is called a highly correlated portfolio. A company that gets wiped out when their product gets incorporated into a Microsoft product (or Google, or strictly speaking here OpenAI) is a classic failure mode indeed. An investor who gets wiped out needs a better understanding of diversification.
Also the Nvidia call options aren’t doing that great recently? I am guessing this is the market being dumb but something about it staying crazy longer than you can stay solvent.
As for the companies, I do not think it is crazy to quickly build out a feature and pray that you can make that into something valuable before it gets incorporated, but it is quite the risk, and it happened here.
Fun with Image Generation
Claim that DALLE-3 will say things like Tank Man from Tiananmen Square are too sensitive a topic, but that if you keep insisting it will often give you what you want, and another claim in response that asking for a meme or scene or similar is an easier workaround. The generalizations do not seem very general.
Judge dismisses most of the claims in Andersen v. Stability AI, on standard existing copyright law principles. Plaintiffs are told to come back with specific works whose copyrights were violated.
Best Picture
Perhaps it was I who asked the wrong questions and did not appreciate how any of this works?
There was little discussion of Dead Reckoning among the worried, with my mentions being the most prominent I can recall and also not that prominent.
I have now put up a distinct post on LessWrong that includes only my thoughts on Mission Impossible: Dead Reckoning, without the other films.
As a refresher, here was my spoiler-free review:
We now must modify the paragraph about whether to see this movie. Given its new historical importance, combined with its action scenes being pretty good, if you have not yet seen it you should now probably see this movie. And of course it now deserves a much higher rating.
For more motivation, here’s a scene from the movie?
I mean, that’s great. Further thoughts at the LW post version.
It is presumably too late, but it seems like an excellent time to make extra effort to get into the writers’ room for Part 2 to ensure that perhaps some of this could be how some of this works and we can help send the best possible message while still making a good popcorn flick. I especially want to see the key turn out to be completely useless.
Deepfaketown and Botpocalypse Soon
This seems tricky to do but clearly doable, provided you are content with undergraduate-level results and not fazed by errors and omissions of all sorts. I too did not expect it quite this soon but such things are coming.
Moritz makes a common mistake here. In a costly signaling game, or when such signals are used as gates, the instinctive play is to reduce the cost of the signal. That is good for the individual in isolation. Too much of it destroys the purpose of the costly signal, the equilibrium fails, and what replaces it could be far worse.
Would I use such a tool on occasion for various purposes? If quality is good enough then yes, and mostly not for cold emails. What will humans do in the future in this cold email situation? Presumably step up their game, show deeper understanding and more humanness to get around this and other AI substitutes, until such time as humans lose the ability to win that battle. Or alternatively we will have to find some other way to costly signal. Two obvious ones are money, which is always an option, or some form of hard-to-fake reputation that we can point to.
A third response is to react to such emails in a way that only profits those who we want to engage with, a principle which generalizes. I do not currently have much of a barrier to an AI or person using AI getting my attention, but what does it profit them?
I also have little barrier to humans getting my attention, including in ways it would indeed profit them, and yet almost no humans use them. Most other interesting or valuable people also have this. Yet almost no one reaches out. I do not expect this to change (much or for very long) now that I’ve noticed this out loud.
Qiaochu Yuan reports that people on Stack Exchange keep trying to spam its math questions with terrible and highly obviously wrong GPT-generated answers, which get removed continuously. Once again, demand for low quality fakes dominates, except here it is unclear where the demand is coming from. Who would do this? Why this place, in particular?
GPT4V says a check for $25 is actually for $100,000. Adversarial attacks, they work.
New York Times keeps attempting to slay irony.
I mean, bullshit? The information environment is awful for reasons that have nothing to do with AI and often have a lot more to do with the New York Times. The fakes and false information, both deep and otherwise, are mostly low quality, but various sources – again, this includes you, New York Times – are falling for or choosing to propagate them anyway, and that is making people very reasonably not trust anything, and none of that has that much to do with AI, not yet. Target demand, not supply.
They Took Our Jobs
AIPI releases another ‘look at all these jobs at risk’ study. Robin Hanson once again offers to bet against such predictions.
It is easy to conflate ‘jobs at risk’ with ‘jobs that will go away’ with ‘jobs that will change and maybe people move around.’ Even if AI becomes technologically capable of automating 20% of all jobs, that will not mean 20% of jobs get automated, nor that all those people will ‘need retraining’ even if AI did do that.
I also call upon all those making such studies to make actual predictions backed by actual dates. What does it mean to be vulnerable to automation? Does that mean this year? In five years? In twenty? How many do they expect to actually get automated? In short, I have no idea how to operationalize this report, and all that detailed work goes mostly to waste.
Fake news? USA staff writers accuse the paper of using AI-generated fake newspaper articles to intimidate their union workers in the wake of a walkout. The company denies it.
Get Involved
The UK Model Taskforce is hiring. Excellent choice if you are a good match.
Not AI, but Our World In Data is hiring a communications and outreach manager.
Open Philanthropy extends its application deadline to November 27 for jobs on its catastrophic risk team.
OpenAI Frontier Risk and Preparedness Team
OpenAI announces the Frontier Risk and Preparedness Team. Here’s their announcement:
They are hiring for National Security Threat Researcher and Research Engineer.
Great stuff. As they say, this complements and extends existing risk mitigation work. It is not a substitute for other forms of safety work. It is in no way a complete solution to anything. It is a way to be informed about risk, and find incremental mitigations along the way, while another department works to solve alignment.
From what I can tell, this announcement is unadulterated good news. Thank you.
Introducing
USAISI, to be established by Commerce to lead the US government’s efforts on AI safety and trust, particularly for evaluating the most advanced AI models, as detailed in the executive order.
TimeGPT, a foundation model for time-series forecasting. Might have a narrow good use case. I am skeptical beyond that.
GOAT: Who is the Greatest Economist of all Time and Why Does it Matter? which is an AI-infused new book by Tyler Cowen. You can converse with the book instead of or alongside reading it. Manifold users predict it would probably be a good use of my time to check it out, so I probably will once I have the spare cycles. For now I’ve loaded it into Kindle, although I do plan to also use the AI feature as it feels interesting.
New version of AlphaFold that expands it to other biomolecule classes.
Radical Ventures presents a framework for VC to evaluate the risks posed by potential AI investments. I am curious to hear from VCs if they found this at all enlightening. From my perspective it does not add value, but the baseline matters.
Phind (direct), which claims to be better than GPT-4 at coding and 5x faster. I’ve added it to AI tools, I’ll try it when I’m coding next. Always be cautious with extrapolating from benchmarks.
In Other AI News
From China with many authors, a new paper: AI Alignment: A Comprehensive Survey.
I don’t remember seeing the backward versus forward taxonomy before, and find it promising. I do not currently have the time to look at the paper in detail, but have saved it for potential later reading, and would be curious to get others’ takes on its contents.
UK Government publishes extensive accessible paper on the potential future capabilities, benefits and risks from AI. As government documents like this go it is remarkably good, especially in how plainly it speaks. The existential risk portion focuses on loss of human control, via us giving it control and via AIs working to get it. As always, one can quibble, especially about what is missing, and if you are reading this post you already know everything the paper is here to say, but good job.
Davidad issues mission statement at ARIA: Mathematics and modelling are the keys we need to safely unlock transformative AI.
Huge if true! Absolutely worth trying. I agree that the approach seems underexplored. I remain skeptical that such a thing is possible.
Zan Tafakri offers another criticism of the Techno-Optimist Manifesto, pointing out that the thing to be optimistic about is human knowledge rather than technology per se, and links to many others, which I hope closes the book on that.
AI healthcare startup Olive AI, once valued at $4 billion, has sold itself off for parts.
There are 40 ways to steal your AI model weights, from RAND (working paper).
Quiet Speculations
Eliezer Yudkowsky short story slash prediction of a possible future. Like most other visions of how there are AIs around and the people remain alive it does not fully work on reflection and it is very much a fully doomed world, but it is an illustrative dystopia nonetheless.
How fast will compute capabilities improve? Rather fast.
Suhail would say this means we must fight hard against any kind of cap on compute. I would say this means we urgently need one. So strong disagreement and also common ground.
John Pressman predicts 80% chance we will be clearly on track to solve alignment within the year and most parties will agree on that, but predicts few will change their positions as a result. I do not know how to operationalize this into a bet, unfortunately, because I do not expect to agree on resolution criteria?
Jon Stokes already does not really know what is happening, the world is too confusing outside of his bubble of expertise, and speaks of widespread epistemic crisis, and worries that AI will not only make our epistemic crisis worse but also disintegrate our communities and disenfranchise creators.
Is there an epistemic crisis? In some ways I think very much so, our discourse standards and epistemic standards have decayed greatly, in ways that are unrelated to AI. However we also have vastly superior access to information, and forget the many ways in which the past was impoverished on such fronts the same way it was impoverished in material goods. Vital things have been lost that must be recaptured, without applying the Lost Golden Age treatment where it does not apply.
His main argument is that AI instead has the power to help make this and also the position of creators better rather than worse. He offers a pitch that his company symbolic.ai can not only help but be the kids that prove it wrong by showing what can be done when the task is in the right hands. I do not think it works that way, offering even an excellent product cannot prove this because what matters is what happens with things in the typical hand rather than when things are in the right hands, but I do hope the project works out.
Andrew Critch proposes a taxonomy of ways human extinction could happen. He has more detail, I attempt to streamline here.
Type 1 failure: No one particular group is primarily responsible.
Type 2 failure: Extinction caused by a particular group. They did not expect a major impact on society.
Type 3 failure: Extinction caused by a particular group. They did expect a major impact on society, but did not expect to pose substantial risk or harm.
Type 4 failure: Extinction caused by a particular group. They knew it would cause harm and did not care enough to stop.
Type 5 failure: Extinction caused by a particular group. They were a non-state actor intentionally attempting to cause harm.
Type 6 failure: Extinction caused by a particular group. They were a state actor intentionally attempting to cause harm.
Critch has (30%/10%/15%/10%/10%/10%) on each of these scenarios, for a total doom percentage of 85%.
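As a quick check on that arithmetic, here is a minimal Python sketch that adds up the six figures quoted above. Adding them treats the six failure types as mutually exclusive, which is my reading of how the total was formed rather than something the taxonomy spells out here, and the variable names are my own.

```python
# Critch's stated probabilities for failure types 1 through 6, in percentage
# points, exactly as quoted in the text above.
critch_percent = {
    "Type 1": 30,
    "Type 2": 10,
    "Type 3": 15,
    "Type 4": 10,
    "Type 5": 10,
    "Type 6": 10,
}

# Summing assumes the scenarios do not overlap (an assumption, see lead-in).
total_doom_percent = sum(critch_percent.values())

# Whatever is left over is the implied chance that none of the six occurs.
implied_survival_percent = 100 - total_doom_percent

print(f"Total across all six types: {total_doom_percent}%")
print(f"Implied chance of avoiding all six: {implied_survival_percent}%")
```

Type 1, the scenario with no single responsible group, carries the largest share of the total on its own.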
The lines blur a lot between scenarios. If you build it and we all die because you knew someone else was about to also build it and they would have also gotten everyone killed, I think technically that is Type 1 here, but that feels weird? If group A would have gotten us all killed, but in response group B does something else that gets us killed faster, either instead of or in addition to? If one group creates something that could then be unleashed by a variety of actors, and then one of them does?
In particular, Type 1 does not seem to differentiate well between importantly distinctive scenarios. And I don’t typically find it helpful to focus on who was to blame in a proximate cause sense, rather than asking how we could prevent it. I do however think that any such taxonomy will always have similar issues, and this illustrates one of several ways in which people narrow down their worrisome scenarios, before finding a way to dismiss the one scenario they choose to focus on.
The Quest for Sane Regulations
Tyler Cowen (Bloomberg) stands on one foot and tells Congress that it should not regulate AI in any way, instead it should be accelerationist via tactics such as high-skilled immigration and permitting reform. Any regulation of any kind that might get in the way, he says, would be premature. He says only once the technologies are mature and we ‘see if we have kept our lead over China’ should we consider regulation. No mention of existential risks or the other serious downsides, or any of the other considerations in play, no arguments offered. I would respond, but there’s nothing to respond to?
California is actively considering regulating on its own, notwithstanding Biden’s efforts, which is almost never good news. The proposal in question is AB-331, which I believe is an anti-algorithmic-discrimination bill, enforced via lawsuits.
Now this is the kind of regulatory burden I can get behind an objection to.
I complained about the definition of AI in the executive order, but here AI literally means any machine-based system that can make recommendations or decisions, so yes this literally means any machine system at all. An Excel spreadsheet counts.
Here consequential decision is anything that has any material consequence on a wide variety of things, any deployer is anyone who uses such a tool to make any consequential decision, so basically anyone who does anything ever.
And what must they do in order to use any computer tool to make almost any decision that impacts another person in some way? Note that it is not the developer of the tool that must do this, it is the deployer:
So yes, as I read this, if in California under this law you want to use almost any tool including a spreadsheet to help you make choices that matter, you – yes, you – will first need to file all of these things for each tool, and have a governance program, and notify everyone impacted, and so on.
If actually enforced as written this would be quite the epic pain in the ass, in exchange for very little in the way of benefits. It is an absurdly bad bill.
Of course, laws in California are often more of a suggestion, and I presume they would not actually go after you in fully idiotic fashion here. But you never know.
The Week in Audio
Shane Legg talks to Dwarkesh Patel. Recommended for those thinking about alignment. Patel continues to impress as an interviewer, and especially impress on AI alignment. Shane Legg is friendly, is attempting to be helpful and is worried about AI killing everyone, but the solutions he proposes attempting (starting at ~19:00) seem even more doomed than usual?
I would yes-and Eliezer’s response. Even if the AI was internally motivated to act ethically as reflected by the content of such tests, that level of ethics, or the level of ordinary human ethics, does not get us where we need to go. I do not see this path working far enough up the capabilities chain even if it succeeds.
I’d also echo Eliezer that this does not mean that Legg does not have better answers and better ideas, and it certainly does not mean Google or DeepMind does not have better answers and ideas, in addition to this one. They are huge, they can contain multitudes, and the good ideas are at best explainable in the 4-hour-Dwarkesh-interview format rather than the 45-minute one here.
You know who else Dwarkesh Patel interviews? Paul Christiano, for three hours. Highly self-recommending for those who want to go deep. I have been too busy to listen, and will do so when I can pay proper attention.
Interview with Nvidia CEO Jensen Huang from a bit back. In this clip he says:
Clip of Demis Hassabis (1:30), saying it is good there is disagreement about AI and that we must proceed with cautious optimism in a responsible way. The question framing here is bizarre, describing ‘near term’ as being concerned with right now and the contrast being with the next wave of models coming out in the following year. The next year seems rather near term to me. There are arguments against ‘longtermism’ when it means distant galaxies but can we please have a non-hyperbolic discount rate?
EconEd features several past talks.
Rhetorical Innovation
Ben Thompson confirms that the CAIS letter was a big deal.
Most of the post is a very negative reaction to the executive order, of the all-regulation-is-bad, all-AI-safety-concerns-are-motivated-by-regulatory-capture, no-one-can-know-the-future-so-we-need-no-precautions variety. I added further details to the reaction post here out of respect for the source.
Zack Davis offers Alignment Implications of LLM Successes: a Debate in One Act.
It is not that simple, also often it kind of is.
Alternatively, Julian Hazell presents it as AI changing world versus not.
I see this as two distinct questions. If you do not think AI is definitely capable of transforming the world any time soon, then the correct view is to see it as being like any other technology. It is a tool people can use, and mostly we should let people build it and use it and profit from it. If you do think AI is likely to soon transform the world, that AI is more than a tool, then we get to Musk’s question of whether or not you are a fan of the humans, along with questions about whether the humans are in danger.
Also, Elon’s not done.
James Phillips in Spectator argues why AI must be regulated. Main article is gated, his summary on Twitter suggests this is solid coverage of traditional ground.
Daniel Faggella updates on what he believes other people believe.
(Note: I am sick and tired of the cognitive trick that the alternative to ‘let anyone who wants to risk killing everyone on the planet do so’ is called authoritarian, and the assumption it would imply some sort of oppressive or dictatorial state. It wouldn’t.)
The problem with C2, and the idea of a human-oriented future with uncontrolled ASIs, is not that it does not sound cool or even potentially glorious, it is that it is incoherent and does not make any sense. C2→C3. I appreciate the people who stand up and say C3 – they have their own section heading and catchphrase and everything – because they are facing down reality and expressing a preference. I disagree with that preference, and expect most of you do as well, and we can go from there. Whereas those who claim C2 is a thing seem to me to either be lying, fooling themselves, thinking quite poorly or most commonly not really thinking about how any of this works at all.
Periodic reminder department, ideally one last time?
Open Source AI is Unsafe and Nothing Can Fix This
A hackathon at MIT confirms that the cost of converting Llama-2 to ‘Spicy-Llama-2’ is about $200, after which it will happily walk you through synthesizing the 1918 flu virus. Without access to model weights, this is at least far more difficult and expensive. Nikki Teran has a thread summarizing.
Once again, this is not a solvable problem, except by not open sourcing model weights.
1a3orn objects:
Yes, I see what else we could even do. So, OK, sure, I am all for be effective altruist, ban gain of function research, no papers on the internet show how to cause a pandemic, party where there is cake.
Then we can double check to see if it still only costs $200 to create Spicy-Llama-2.
Until then, the papers exist, they are available, anyone else could have done the same thing, I actually don’t see what else could we even do? Unless the plan is, be Meta, release open source LLMs, malicious actors fine-tune it for $200 and feed it some papers, they now know how to create pandemics?
And yes, if you were a sufficient domain expert who could parse such papers you could with more effort figure out all the same things. No one is saying Llama-2 is producing outputs here that expert humans could not produce. We are saying that this changes the expertise threshold from ‘there are select people who can do this with effort’ to ‘there are a lot of people who can do this faster and easier.’
In practice, this matters. Terrorist groups and apocalyptic cults have a practical history of seeking biological weapons and the ability to cause pandemics, and failing due to exactly this lack of technical expertise. The cults in Japan were discussed on a recent 80,000 Hours podcast. To be concrete: So far, Hamas has not to our knowledge used biological weapons or attempted to cause a pandemic. Was that because they would not do such a thing? Or was that lack of practical capability?
Aligning a Smarter Than Human Intelligence is Difficult
What would be worse than the AI telling people what they want to hear? Doing this while the AI pretends not to be doing it.
I don’t know that this particular example worries me. Telling someone they are a liberal isn’t telling them what they want to hear so much as stating a true implication of their statement based on their dislike of Tricky Dick.
The principle still stands. If there are thoughts the model wants to think with CoT without us noticing, it could gain the ability to do that.
Traditional alignment problems, including traditional human alignment problems.
If you crack down on visible X, but X gets better results, you likely get hidden X.
I would generalize the implications here. If you use RLHF or other similarly flawed feedback, the AI will adapt to the exact things that trigger the feedback. You will get situations like this, where that which you don’t want to see is hidden, and that which you want to see is presented whether or not it is then used or meaningful in any way.
If the users that provide the feedback that counts want to be told they are a liberal, the AI will learn to tell them, whether or not that helps it reason. If those users punish such declarations, the AI learns to not give them. Except the fully general version of that. And over time, it will do this with a much better understanding of what is actually determining our decisions than we ourselves have, and an utter indifference to whether this corresponds to anything we would or should endorse.
Roon speculates.
It is a strategy one could pursue, to forgo inner alignment. Instead of aligning the core LLM, tell the LLM to play an aligned character, and have it aligned to the playing of characters on request, and ensure no one ever asks an AI to play a different character. For multiple different technical reasons I despair of this actually working when it matters in a way that results in us not being dead, even if you ‘pulled it off,’ and would expect most alignment researchers to agree.
People Are Worried About AI Killing Everyone
I am considering a new polling segment called ‘Well When You Put It That Way.’
I do find it disappointing, but unsurprising, that working with non-allies is not yet more popular.
Please Speak Directly Into This Microphone
The most honest reaction to the Executive Order, or any move at all to do anything about the fact that AI might not be an unadulterated good or even might, ya know, kill everyone, is that you/we are trying to build AGI as fast as possible, how dare the government want you not to do that, this order might interfere with you building AGI as fast as possible.
You could use this moment to take the mask off completely, or, to be fair to Aravind Srinivas as a representative member of this group, you could be like him and never put one on in the first place.
Yes! He admit it. The whole point of (these parts of) the executive order is that you fools are rushing to build AGI as fast as possible. As much as we admire your current product lines, we would like you to not do that, sir.
This seems to reflect the extent to which this attitude cares about the fate of humanity:
Or this, where he seems to be taking a bold ‘stop being a little bitch about the costs of war and get back to killing people’ stance as a metaphor for being against calls for safety and alignment?
In addition to my ‘human extinction is bad’ stance, I am going to also take a bold ‘war is bad’ stance. I’d say fight me, but I’d rather you didn’t.
And in case you were wondering if his aim is recursive self-improvement and fast takeoff?
The Lighter Side
New cause area, give positive reinforcement when Roon writes bangers.
It’s hard to find good data.
ChatGPT attempts to ensure we think about The Roman Empire:
Finally, common ground: Cake! Delicious cake.
If you did not anticipate this, you need to up your alignment game.