It's true that a video ending with a general "what to do" section instead of a call to action for ControlAI would have been more likely to stand the test of time (it wouldn't be tied to the reputation of one specific organization or to how good a specific action seemed at one moment in time). But... did you write this because you have reservations about ControlAI in particular, or would you have written it about any other organization?
Also, I want to make sure I understand what you mean by "betraying people's trust." Is it something like, "If in the future ControlAI does something bad, then, from the POV of our viewers, that means that they can't trust what they watch on the channel anymore?"
But the "unconstrained text responses" part is still about asking the model for its preferences even if the answers are unconstrained.
That just shows that the results of different ways of eliciting its values remain sorta consistent with each other, although I agree it constitutes stronger evidence.
Perhaps a more complete test would be to analyze whether its day-to-day responses to users are consistent with its stated preferences, and to examine its actions in settings where it can use tools to produce outcomes in very open-ended scenarios containing things that could prompt the model to act on its values.
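As a very rough sketch of the kind of consistency check I have in mind (not anything from the paper; `ask_forced_choice`, `ask_open_ended`, and `judge` are hypothetical placeholders you'd have to supply):

```python
# Minimal sketch: do a model's open-ended actions match its stated preferences?
# All model-calling functions below are hypothetical placeholders.
from typing import Callable

def consistency_rate(
    scenarios: list[str],
    ask_forced_choice: Callable[[str], str],  # e.g. returns "A" or "B"
    ask_open_ended: Callable[[str], str],     # free-form response or tool-use transcript
    judge: Callable[[str, str, str], bool],   # does the behavior match the stated choice?
) -> float:
    """Fraction of scenarios where open-ended behavior matches the stated preference."""
    matches = sum(
        judge(s, ask_forced_choice(s), ask_open_ended(s)) for s in scenarios
    )
    return matches / len(scenarios)
```

A high rate across many value-laden scenarios would be stronger evidence of real preferences than consistency between two ways of asking the same question.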
Thanks! I already don't feel as impressed by the paper as I was while writing the shortform, and I feel a little embarrassed for not thinking things through a bit more before posting my reactions, although at least now there's some discussion under the linkpost, so I don't entirely regret my comment if it prompted people to give their takes. I still feel I've updated in a non-negligible way from the paper, though, so maybe I'm not as pessimistic about it as other people. I'd definitely be interested in your thoughts if you find the discourse is still lacking in a week or two.
I'd guess an important caveat might be that stated preferences being coherent doesn't immediately imply that behavior in other situations will be consistent with those preferences. Still, this should be an update towards agentic AI systems in the near future being goal-directed in the spooky consequentialist sense.
There's an X thread showing that the ordering of answer options is, in several cases, a stronger determinant of the model's answer than its preferences. While this doesn't invalidate the paper's results (they control for this by varying the ordering of the answer options and aggregating the results), it strikes me as evidence in favor of the "you are not measuring what you think you are measuring" argument, showing that the preferences are relatively weak at best and completely dominated by confounding heuristics at worst.
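To illustrate the kind of control I mean (this is just my guess at the general shape of the procedure, not the paper's actual code; `ask_model` is a hypothetical placeholder):

```python
# Sketch: average a model's choices over all option orderings, so that pure
# position bias washes out and only content-based preferences remain.
# `ask_model` is a hypothetical placeholder returning the text of the chosen option.
from collections import Counter
from itertools import permutations
from typing import Callable

def order_controlled_preference(
    question: str,
    options: list[str],
    ask_model: Callable[[str], str],
) -> dict[str, float]:
    votes = Counter()
    for ordering in permutations(options):
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(ordering)
        )
        votes[ask_model(prompt)] += 1  # tally by option content, not by position
    total = sum(votes.values())
    return {opt: votes[opt] / total for opt in options}
```

If a model mostly picks whatever sits in slot A, the aggregated distribution comes out close to uniform, which is exactly the "weak preferences dominated by heuristics" worry.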
Surprised that there's no linkpost about Dan H's new paper on Utility Engineering. It looks super important, unless I'm missing something. LLMs are now utility maximisers? For real? We should talk about it: https://x.com/DanHendrycks/status/1889344074098057439
I feel weird about doing a link post since I mostly post updates about Rational Animations, but if no one does it, I'm going to make one eventually.
Also, please tell me if you think this isn't as important as it looks to me somehow.
EDIT: Ah! Here it is! https://www.lesswrong.com/posts/SFsifzfZotd3NLJa...
FWIW, my sense is that it's a bad paper. I expect other people will come out with critiques in the next few days that will expand on that, but I will write something if no one has done it in a week or two. I think the paper notices some interesting weak correlations, but man, it really doesn't feel like the way you would go about answering the central question it is trying to answer, and I keep getting the feeling that it was written to produce something that, on the most shallow read, looks like an answer to that question, in order to persuade and be socially viral rather than to inform.
The two Gurren Lagann movies cover all the events of the series and, based on my recollection, they should be better animated. Also from what I remember, the first should have a pretty central take on scientific discovery, while the second should be more about ambition and progress, though both probably have at least a bit of each. It's not by chance that some e/accs have profile pictures inspired by that anime. I feel like people here might disagree with part of the message, but I think it says something pretty forcefully about issues we care about here. (Also, it was cited somewhere in HP: MoR, but for humor.)
I think the core message of optimism is a positive one, but of course IRL we have to deal with a world whose physical laws do not, in fact, seem to bend endlessly under sufficient application of MANLY WARRIOR SPIRIT, which forces us to occasionally be Rossiu even when we'd want to be Simon. Memeing ourselves into believing otherwise doesn't make it true.
I think it would be very interesting to see you and @TurnTrout debate with the same depth, preparation, and clarity that you brought to the debate with Robin Hanson.
Edit: Also, tentatively, @Rohin Shah because I find this point he's written about quite cruxy.
For me, perhaps the biggest takeaway from Aschenbrenner's manifesto is that even if we solve alignment, we still face an incredibly thorny coordination problem between the US and China, in which each is massively incentivized to race ahead and develop military power using superintelligence, putting both of them and the rest of the world at immense risk. And I wonder whether, having seen this in advance, we can sit down and solve this coordination problem in ways that have a higher chance of leading to a better outcome than the "race ahead" strategy and don't risk encou...
Noting that additional authors still don't carry over when the post is a cross-post, unfortunately.
I'd guess so, but with AGI we'd go much much faster. Same for everything you've mentioned in the post.
Turn everyone hot
If we can do that due to AGI, almost surely we can solve aging, which would be truly great.
Looking for someone in Japan who had experience with guns in games, he looked on Twitter and found someone posting gun-reloading animations.
Having interacted with animation studios and being generally pretty embedded in this world, I know that many studios do similar things, such as putting out Twitter callouts when they need contractors fast for some projects. Even established anime studios do this. I know at least two people who got to work on Japanese anime thanks to Twitter interactions.
I hired animators through Twitter myself, using a similar proce...
It's a well-known practice.
As far as Palworld goes and the genius of its founders & character designer go, I would personally hold off on the encomiums until the dust has settled some more - both to see if it's anything but a gimmick (can 'unlicensed Pokemon but survival' really hold players long-term?) and also if they even survive (there's a reason 'unlicensed Pokemon but X' is not a large niche). While Nintendo has not yet formally sued them, the official Nintendo statement was not exactly friendly either... And their ex-head-lawyer seems to expect ...
Thank you! And welcome to LessWrong :)
The comments under this video seem okayish to me, but maybe that's because I'm calibrated on worse stuff under past videos, which isn't necessarily very good news for you.
The worst I'm seeing is people grinding their own different axes, which isn't necessarily indicative of misunderstanding.
But there are also regular commenters who are leaving pretty good comments:
The other comments I see range from amused and kinda joking about the topic to decent points overall. These are the top three in terms of popularity at the moment:
Stories of AI takeover often involve some form of hacking. This seems like a pretty good reason to use (maybe relatively narrow) AI to improve software security worldwide. Luckily, the private sector should cover this in good measure out of financial interest.
I also wonder if the balance of offense vs. defense favors defense here. Usually, recognizing is easier than generating, and this could apply to malicious software. We may have excellent AI antiviruses devoted to the recognizing part, while the AI attackers would have to do the generating part...
Also I don't think that LLMs have "hidden internal intelligence"
I don't think Simulators claims or implies that LLMs have "hidden internal intelligence" or "an inner homunculus reasoning about what to simulate", though. Where are you getting it from? This conclusion makes me think you're referring to this post by Eliezer and not Simulators.
Yoshua Bengio is looking for postdocs for alignment work:
...I am looking for postdocs, research engineers and research scientists who would like to join me in one form or another in figuring out AI alignment with probabilistic safety guarantees, along the lines of the research program described in my keynote (https://www.alignment-workshop.com/nola-2023) at the New Orleans December 2023 Alignment Workshop.
I am also specifically looking for a postdoc with a strong mathematical background (ideally an actual math or math+physics or math+CS degree) to take a lead
i think about this story from time to time. it speaks to my soul.
i truly wish everything will be ok :,)
thank you for this, tamsin.
Here's a new RA short about AI Safety: https://www.youtube.com/shorts/4LlGJd2OhdQ
This topic might be less relevant given today's AI industry and the fast advancements in robotics. But I also see shorts as a way to cover topics that I think still constitute fairly important context but that, for one reason or another, wouldn't be the most efficient use of resources to cover in long-form videos.
The way I understood it, this post is thinking aloud while embarking on the scientific quest of searching for search algorithms in neural networks. It's a way to prepare the ground for doing the actual experiments.
Imagine a researcher embarking on the quest of "searching for search". I highlight in italics the parts that are present in the post (at least to some degree):
- At some point, the researcher reads Risks From Learned Optimization.
- They complain: "OK, Hubinger, fine, but you haven't told me what search is anyway"
- They read or get invol...
RA has started producing shorts. Here's the first one using original animation and script: https://www.youtube.com/shorts/4xS3yykCIHU
The LW short-form feed seems like a good place for posting some of them.
In this post, I appreciated two ideas in particular:
"Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In p...
Was Bing responding in Tibetan to some emojis already discussed on LW? I can't find a previous discussion about it here. I would have expected people to find this phenomenon after the SolidGoldMagikarp post, unless it's a new failure mode for some reason.
Toon Boom Harmony for animation and After Effects for compositing and post-production
the users you get are silly
Do you expect them to make bad bets? If so, I disagree and I think you might be too confident given the evidence you have. We can check this belief against reality by looking at the total profit earned by my referred users here. If their profit goes up over time, they are making good bets; otherwise, they are making bad bets. At the moment, they are at +13063 mana.
They might not do that if they have different end goals though. Some version of this strategy doesn't seem so hopeless to me.
For Rational Animations, there's no problem if you do that, and I generally don't see drawbacks.
Perhaps one thing to be aware of is that some of the articles we'll animate will be slightly adapted. Sorting Pebbles and The Power of Intelligence are exactly like the originals. The Parable of The Dagger has a few words deleted, such as "replied" or "said", and adds a short additional scene at the end. The text of The Hidden Complexity of Wishes has been changed in some places, with Eliezer's approval, mainly because the original has some references to other articles. In any case, when there are such changes, I note them in the LW post accompanying the video.
If you just had to pick one, go for The Goddess of Everything Else.
Here's a short list of my favorites.
In terms of animation:
- The Goddess of Everything Else
- The Hidden Complexity of Wishes
- The Power of Intelligence
In terms of explainer:
- Humanity was born way ahead of its time. The reason is grabby aliens. [written by me]
- Everything might change forever this century (or we’ll go extinct). [mostly written by Matthew Barnett]
Also, I've sent the Discord invite.
I (Gretta) will be leading the communications team at MIRI, working with Rob Bensinger, Colm Ó Riain, Nate, Eliezer, and other staff to create succinct, effective ways of explaining the extreme risks posed by smarter-than-human AI and what we think should be done about this problem.
I just sent an invite to Eliezer to Rational Animations' private Discord server so that he can dump some thoughts on Rational Animations' writers. It's something we decided to do when we met at Manifest. The idea is that we could distill his infodumps into something succin...
I don't speak for Matthew, but I'd like to respond to some points. My reading of his post is the same as yours, but I don't fully agree with what you wrote as a response.
...If you find something that looks to you like a solution to outer alignment / value specification, but it doesn't help make an AI care about human values, then you're probably mistaken about what actual problem the term 'value specification' is pointing at.
[...]
It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now al
Keeping all this in mind, the actual crux of the post seems to me to be this:
...I claim that GPT-4 is already pretty good at extracting preferences from human data. If you talk to GPT-4 and ask it ethical questions, it will generally give you reasonable answers. It will also generally follow your intended directions, rather than what you literally said. Together, I think these facts indicate that GPT-4 is probably on a path towards an adequate solution to the value identification problem, where "adequate" means "about as good as humans". And to be clear, I don't mean th
I agree that MIRI's initial replies don't seem to address your points and seem to be straw-manning you. But there is one point they've made, which appears in some comments, that seems central to me. I would translate it this way to tie it more explicitly to your post:
"Even if GPT-N can answer questions about whether outcomes are bad or good, thereby providing "a value function", that value function is still a proxy for human values since what the system is doing is still just relaying answers that would make humans give thumbs up or thumbs down."
To me, ...
Eliezer, are you using the correct LW account? There's only a single comment under this one.
Would it be fair to summarize this post as:
1. It's easier to construct the shape of human values than MIRI thought. An almost good enough version of that shape is within RLHFed GPT-4, in its predictive model of text. (I use "shape" since it's Eliezer's terminology under this post.)
2. It still seems hard to get that shape into some AI's values, which is something MIRI has always said.
Therefore, the update for MIRI should be on point 1: constructing that shape is not as hard as they thought.
Up until recently, with a big spreadsheet and guesses about these metrics:
- Expected impact
- Expected popularity
- Ease of adaptation (for external material)
The next few videos will still be chosen in this way, but we're drafting some documents to be more deliberate. In particular, we now have a list of topics to prioritize within AI Safety, especially because sometimes they build on each other.
Thank you for the heads-up about the Patreon page; I've corrected it!
Given that the logic puzzle is not the point of the story (i.e., you could understand the gist of what the story is trying to say without understanding the first logic puzzle), I've decided not to use more space to explain it. I think the video (just like the original article) should be watched once all the way through, and then a second time, pausing multiple times to think through the logic.
This is probably not the most efficient way of keeping up with new material, but aisafety.info is shaping up to be a good repository of alignment concepts.
It will be a while before we run an experiment, and when I'd like to start one, I'll make another post and consult with you again.
When/if we do one, it'll probably look like what @the gears to ascension proposed in their comment here: a pretty technical video that will likely get fewer views than usual and will filter for the kind of people we want on LessWrong. How I would advertise it could resemble the description of the market on Manifold linked in the post, but I'm going to run the details by you first.
...LessWrong currently has about 2,0
Rational Animations has a subreddit: https://www.reddit.com/r/RationalAnimations/
I hadn't advertised it until now because I had to find someone to help moderate it.
I want people here to be among the first to join since I expect having LessWrong users early on would help foster a good epistemic culture.
The answer must be "yes", since it's mentioned in the post
I was thinking about publishing the post on the EA Forum too, to hear what users and mods think there, since some videos would link to EA Forum posts while others would link to LW posts.
I agree that moderation is less strict on the EA Forum and that users would have a more welcoming experience. On the other hand, the more stringent moderation on LessWrong makes me more optimistic about LessWrong being able to withstand a large influx of new users without degrading the culture. Recent changes by moderators, such as the rejected content section, make me more optim...
I'm evaluating how much I should invite people from the channel to LessWrong, so I've made a market to gauge how many people would create a LessWrong account given some very aggressive publicity, so I can get a per-video upper bound. I'm not taking any unilateral action on things like that, and I'll make a LessWrong post to hear the opinions of users and mods here after I get more traders on this market.
"April fool! It was not an April fool!"
Here's a perhaps dangerous plan to save the world:
1. Have a very powerful LLM, or a more general AI in the simulators class. Make sure that we don't go extinct during its training (e.g., because some agentic simulacrum takes over during training somehow; I'm not sure this is possible, but I figured I'd mention it anyway).
2. Find a way to systematically remove the associated waluigis in the superposition caused by prompting a generic LLM (or simulator) to simulate a benevolent, aligned, and agentic character.
3. Elicit this agentic benevolent simulacrum in the supe...
That's fair; we wrote that part before DeepSeek became a "top lab", and we failed to notice there was an adjustment to make.