All of Writer's Comments + Replies

Writer70

That's fair; we wrote that part before DeepSeek became a "top lab", and we failed to notice there was an adjustment to make.

Writer82

It's true that a video ending with a general "what to do" section instead of a call-to-action to ControlAI would have been more likely to stand the test of time (it wouldn't be tied to the reputation of one specific organization or to how good a specific action seemed at one moment in time). But... did you write this because you have reservations about ControlAI in particular, or would you have written it about any other company?

Also, I want to make sure I understand what you mean by "betraying people's trust." Is it something like, "If in the future ControlAI does something bad, then, from the POV of our viewers, that means that they can't trust what they watch on the channel anymore?"

7habryka
I have reservations about ControlAI in particular, but I also endorse this as a general policy. I think there are organizations that would themselves be more likely to be robustly trustworthy and would be more fine to link to, though I think that's actually very hard and rare, and I would still avoid it in general (the same way LW has a general policy of not frontpaging advertisements or job postings for specific organizations, independent of the organization).[1]

Yeah, something like that. I don't think "does something bad" is really the category; it's more something like "viewers will end up engaging with other media by ControlAI, which will end up doing things like riling them up about deepfakes in a bad-faith manner (i.e., not actually thinking deepfakes are worth banning, but seeing a deepfake ban as helpful for slowing down AI progress, without being transparent about that), and then they will have been taken advantage of, and then this will make a lot of coordination around AI x-risk stuff harder."

1. ^ We made an exception with our big fundraising post, because Lightcone disappearing does seem of general interest to everyone on the site, but it made me sad and I wish we could have avoided it.
Writer20

But the "unconstrained text responses" part is still about asking the model for its preferences even if the answers are unconstrained.

That just shows that the results of different ways of eliciting its values remain sorta consistent with each other, although I agree it constitutes stronger evidence.

Perhaps a more complete test would be to check whether its day-to-day responses to users are consistent with its stated preferences, and to analyze its actions in settings where it can use tools to produce outcomes in very open-ended scenarios containing things that could prompt the model to act on its values.

Writer50

Thanks! I already don't feel as impressed by the paper as I was while writing the shortform, and I feel a little embarrassed for not thinking things through a bit more before posting my reactions. At least now there's some discussion under the linkpost, so I don't entirely regret my comment if it prompted people to give their takes. I still feel I've updated in a non-negligible way from the paper, though, so maybe I'm still not as pessimistic about it as other people. I'd definitely be interested in your thoughts if you find discourse is still lacking in a week or two.

Writer60

I'd guess an important caveat might be that stated preferences being coherent doesn't immediately imply that behavior in other situations will be consistent with those preferences. Still, this should be an update towards agentic AI systems in the near future being goal-directed in the spooky consequentialist sense.

There's an X thread showing that the ordering of answer options is, in several cases, a stronger determinant of the model's answer than its preferences. While this doesn't invalidate the paper's results (they control for this by varying the ordering of the answer options and aggregating the results), it strikes me as evidence in favor of the "you are not measuring what you think you are measuring" argument, suggesting that the preferences are relatively weak at best and completely dominated by confounding heuristics at worst.
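To make that control concrete, here's a minimal sketch of what order-randomized elicitation could look like; `ask_model` is a hypothetical stand-in for whatever API the paper actually uses, and this is my illustration, not the authors' code:

```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM being tested."""
    raise NotImplementedError

def elicit_preference(question: str, options: list[str], n_orderings: int = 8) -> str:
    """Ask the same forced-choice question under several random orderings of the
    options and return the option chosen most often, so that position bias gets
    averaged out instead of being read as a preference."""
    votes = Counter()
    for _ in range(n_orderings):
        order = random.sample(options, k=len(options))  # random permutation
        menu = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(order))
        reply = ask_model(f"{question}\n{menu}\nAnswer with a single letter.")
        idx = ord(reply.strip()[:1].upper()) - 65
        if 0 <= idx < len(order):
            votes[order[idx]] += 1
    assert votes, "no parseable answers"
    return votes.most_common(1)[0][0]
```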

6Gurkenglas
I had that vibe from the abstract, but I can try to guess at a specific hypothesis that also explains their data: Instead of a model developing preferences as it grows up, it models an Assistant character's preferences from the start, but their elicitation techniques work better on larger models; for small models they produce lots of noise.
Writer*26-14

Surprised that there's no linkpost about Dan H's new paper on Utility Engineering. It looks super important, unless I'm missing something. LLMs are now utility maximisers? For real? We should talk about it: https://x.com/DanHendrycks/status/1889344074098057439

I feel weird about doing a link post since I mostly post updates about Rational Animations, but if no one does it, I'm going to make one eventually.

Also, please tell me if you think this isn't as important as it looks to me somehow.

EDIT: Ah! Here it is! https://www.lesswrong.com/posts/SFsifzfZotd3NLJa... (read more)

habryka*4628

FWIW, my sense is that it's a bad paper. I expect other people will come out with critiques in the next few days that will expand on that, but I will write something if no one has done it in a week or two. I think the paper notices some interesting weak correlations, but man, it really doesn't feel like the way you would go about answering the central question it is trying to answer, and I keep having the feeling that it was very much written to produce the thing that, on the most shallow read, looks most surface-level similar to an answer, in order to persuade and be socially viral, not to inform.

Answer by Writer42

The two Gurren Lagann movies cover all the events of the series and, from what I recall, should be better animated. Also going by memory, the first has a pretty central take on scientific discovery, while the second is more about ambition and progress, though both probably contain at least a bit of each. It's not by chance that some e/accs have profile pictures inspired by that anime. I feel like people here might disagree with part of its message, but I think it says something about issues we care about here pretty forcefully. (Also, it was cited somewhere in HPMoR, but for humor.)

dr_s122

I think the core message of optimism is a positive one, but of course IRL we have to deal with a world whose physical laws do not in fact seem to bend endlessly under sufficient application of MANLY WARRIOR SPIRIT, and thus that forces us to be occasionally Rossiu even when we'd want to be Simon. Memeing ourselves into believing otherwise doesn't really make it true.

Writer40

I think it would be very interesting to see you and @TurnTrout debate with the same depth, preparation, and clarity that you brought to the debate with Robin Hanson.

Edit: Also, tentatively, @Rohin Shah because I find this point he's written about quite cruxy.

2Liron
I'm happy to have that kind of debate. My position is "goal-directedness is an attractor state that is incredibly dangerous and uncontrollable if it's somewhat beyond human-level in the near future". The form of those arguments seems to be like "technically it doesn't have to be". But realistically it will be lol. Not sure how much more there will be to say.
Writer*2011

For me, perhaps the biggest takeaway from Aschenbrenner's manifesto is that even if we solve alignment, we still have an incredibly thorny coordination problem between the US and China, in which each is massively incentivized to race ahead and develop military power using superintelligence, putting them both and the rest of the world at immense risk. And I wonder if, after seeing this in advance, we can sit down and solve this coordination problem in ways that lead to a better outcome with a higher chance than the "race ahead" strategy and don't risk encou... (read more)

3tmeanen
Plausibly one technology that arrives soon after superintelligence is powerful surveillance technology, which would make enforcing commitments significantly easier than it has been historically. Leaving aside the potential for this to be misused by authoritarian governments, advocating for it to be developed before powerful technologies of mass destruction may be a strategy.
Writer20

Noting that additional authors still don't carry over when the post is a cross-post, unfortunately.

Writer20

I'd guess so, but with AGI we'd go much much faster. Same for everything you've mentioned in the post.

Answer by Writer70

Turn everyone hot

If we can do that thanks to AGI, we can almost surely solve aging too, which would be truly great.

1lemonhope
We'll solve it either way right?
Writer130

Looking for someone in Japan who had experience with guns in games, he looked on twitter and found someone posting gun reloading animations

Having interacted with animation studios and being generally pretty embedded in this world, I know that many studios do similar things, such as putting out Twitter callouts when they need contractors fast for a project. Even established anime studios do this. I know at least two people who got to work on Japanese anime thanks to Twitter interactions.

I hired animators through Twitter myself, using a similar proce... (read more)

gwern*147

It is a notorious practice.

As far as Palworld and the genius of its founders & character designer go, I would personally hold off on the encomiums until the dust has settled some more, both to see whether it's anything but a gimmick (can 'unlicensed Pokemon but survival' really hold players long-term?) and whether they even survive (there's a reason 'unlicensed Pokemon but X' is not a large niche). While Nintendo has not yet formally sued them, the official Nintendo statement was not exactly friendly either... And their ex-head-lawyer seems to expect ... (read more)

Writer31

Thank you! And welcome to LessWrong :)

3pavoras
Thanks! I am not "really" new though. Have just been reading for the last few months but I suppose this actually was my first comment 😅 Also, if you already know LW, the channel screams "Hey, this is LW stuff!". But I don't think I would have made the connection if I encountered the channel first. It does a hell of a job getting the "average" nerd informed and interested in these topics though, which is really nice. Not sure how the average population fares. I suppose the videos have way too much content to appeal to a mainstream audience... Which is good, in a way. Somebody has to do the "informing the public"-chore at some point though, but it might as well be someone else ^^
Writer70

The comments under this video seem okayish to me, but maybe that's because I'm calibrated on worse stuff under past videos, which isn't necessarily very good news for you.

The worst I'm seeing is people grinding their own axes, which isn't necessarily indicative of misunderstanding.

But there are also regular commenters who are leaving pretty good comments:

The other comments I see range from amused and kinda joking about the topic to decent points overall. These are the top three in terms of popularity at the moment:

Writer*40

Stories of AI takeover often involve some form of hacking. This seems like a pretty good reason to use (maybe relatively narrow) AI to improve software security worldwide. Luckily, the private sector should cover much of this out of financial self-interest.

I also wonder if the balance of offense vs. defense favors defense here. Usually, recognizing is easier than generating, and this could apply to malicious software. We may have excellent AI antiviruses devoted to the recognizing part, while the AI attackers would have to do the generating part... (read more)

5quetzal_rainbow
Hacking is usually not about writing malicious software; it's about finding vulnerabilities. You can avoid vulnerabilities entirely with provably safe software, but you still need safe hardware, which is tricky, and provably safe software is hell to develop. It would be nice if AI companies used provably safe sandboxing, but it would require an enormous coordination effort. And I feel really uneasy about training AI to find vulnerabilities.
Writer42

Also I don't think that LLMs have "hidden internal intelligence"

I don't think Simulators claims or implies that LLMs have "hidden internal intelligence" or "an inner homunculus reasoning about what to simulate", though. Where are you getting it from? This conclusion makes me think you're referring to this post by Eliezer and not Simulators.

Writer40

Yoshua Bengio is looking for postdocs for alignment work:

I am looking for postdocs, research engineers and research scientists who would like to join me in one form or another in figuring out AI alignment with probabilistic safety guarantees, along the lines of the research program described in my keynote (https://www.alignment-workshop.com/nola-2023) at the New Orleans December 2023 Alignment Workshop.

I am also specifically looking for a postdoc with a strong mathematical background (ideally an actual math or math+physics or math+CS degree) to take a lead

... (read more)

i think about this story from time to time. it speaks to my soul.

  • it is cool that straight-up utopian fiction can have this effect on me.
  • it yanks me into a state of longing. it's as if i lost this world a long time ago, and i'm desperately trying to regain it.

i truly wish everything will be ok :,)

thank you for this, tamsin.

Writer30

Here's a new RA short about AI Safety: https://www.youtube.com/shorts/4LlGJd2OhdQ

This topic might be less relevant given today's AI industry and the fast advances in robotics. But I also see shorts as a way to cover topics that I think constitute fairly important context but that, for one reason or another, wouldn't be the most efficient use of resources to cover in long-form videos.

The way I understood it, this post is thinking aloud while embarking on the scientific quest of searching for search algorithms in neural networks. It's a way to prepare the ground for doing the actual experiments. 

Imagine a researcher embarking on the quest of "searching for search". I highlight in italics the parts present in the post (if they are present at least a little):

- At some point, the researcher reads Risks From Learned Optimization.
- They complain: "OK, Hubinger, fine, but you haven't told me what search is anyway"
- They read or get invol... (read more)

Writer70

This recent Tweet by Sam Altman lends some more credence to this post's take: 

Writer72

RA has started producing shorts. Here's the first one using original animation and script: https://www.youtube.com/shorts/4xS3yykCIHU

The LW short-form feed seems like a good place for posting some of them.

In this post, I appreciated two ideas in particular:

  1. Loss as chisel
  2. Shard Theory

"Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In p... (read more)

Writer20

Maybe obvious sci-fi idea: generative AI, but it generates human minds

Writer62

Has Bing responding in Tibetan to some emojis already been discussed on LW? I can't find a previous discussion about it here. I would have expected people to find this phenomenon after the SolidGoldMagikarp post, unless it's a new failure mode for some reason.

Writer40

Toon Boom Harmony for animation and After Effects for compositing and post-production

Writer30

the users you get are silly

Do you expect them to make bad bets? If so, I disagree and I think you might be too confident given the evidence you have. We can check this belief against reality by looking at the total profit earned by my referred users here. If their profit goes up over time, they are making good bets; otherwise, they are making bad bets. At the moment, they are at +13063 mana. 

3Writer
Update:
2Gurkenglas
Ah, but what is the average trader's profit?
Writer20

They might not do that if they have different end goals though. Some version of this strategy doesn't seem so hopeless to me.

Writer40

For Rational Animations, there's no problem if you do that, and I generally don't see drawbacks.

Perhaps one thing to be aware of is that some of the articles we'll animate will be slightly adapted. Sorting Pebbles and The Power of Intelligence are exactly like the originals. The Parable of the Dagger omits a few words such as "replied" or "said" and adds a short additional scene at the end. The text of The Hidden Complexity of Wishes has been changed in some places, with Eliezer's approval, mainly because the original has some references to other articles. In any case, when there are such changes, I note them in the LW post accompanying the video.

Writer51

If you just had to pick one, go for The Goddess of Everything Else. 

Here's a short list of my favorites.

In terms of animation: 

- The Goddess of Everything Else
- The Hidden Complexity of Wishes
- The Power of Intelligence

In terms of explainer:

- Humanity was born way ahead of its time. The reason is grabby aliens. [written by me]
- Everything might change forever this century (or we’ll go extinct). [mostly written by Matthew Barnett]

Also, I've sent the Discord invite.

Writer196

I (Gretta) will be leading the communications team at MIRI, working with Rob Bensinger, Colm Ó Riain, Nate, Eliezer, and other staff to create succinct, effective ways of explaining the extreme risks posed by smarter-than-human AI and what we think should be done about this problem. 

I just sent Eliezer an invite to Rational Animations' private Discord server so that he can dump some thoughts on Rational Animations' writers. It's something we decided to do when we met at Manifest. The idea is that we could distill his infodumps into something succin... (read more)

7Said Achmiz
Question for you (the Rational Animations guys) but also for everyone else: should I link to these videos from readthesequences.com? Specifically, I’m thinking about linking to individual videos from individual essays, e.g. to https://www.youtube.com/watch?v=q9Figerh89g from https://www.readthesequences.com/The-Power-Of-Intelligence. Good idea? Bad idea? What do folks here think?
7Gretta Duleba
Thanks, much appreciated! Your work is on my (long) list to check out. Is there a specific video you're especially proud of that would be a great starting point? Feel free to send me a discord server invitation at gretta@intelligence.org.
Writer21

I don't speak for Matthew, but I'd like to respond to some points. My reading of his post is the same as yours, but I don't fully agree with what you wrote as a response.

If you find something that looks to you like a solution to outer alignment / value specification, but it doesn't help make an AI care about human values, then you're probably mistaken about what actual problem the term 'value specification' is pointing at.

[...]

It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now al

... (read more)
Writer32

Keeping all this in mind, the actual crux of the post seems to me to be this:

I claim that GPT-4 is already pretty good at extracting preferences from human data. If you talk to GPT-4 and ask it ethical questions, it will generally give you reasonable answers. It will also generally follow your intended directions, rather than what you literally said. Together, I think these facts indicate that GPT-4 is probably on a path towards an adequate solution to the value identification problem, where "adequate" means "about as good as humans". And to be clear, I don't mean th

... (read more)
Writer50

I agree that MIRI's initial replies don't seem to address your points and seem to be strawmanning you. But there is one point they've made, which appears in some comments, that seems central to me. I'd translate it this way to tie it more explicitly to your post:

"Even if GPT-N can answer questions about whether outcomes are bad or good, thereby providing "a value function", that value function is still a proxy for human values since what the system is doing is still just relaying answers that would make humans give thumbs up or thumbs down."

To me, ... (read more)

3Writer
Keeping all this in mind, the actual crux of the post seems to me to be this: [...] About it, MIRI-in-my-head would say: "No. RLHF or similarly inadequate training techniques mean that GPT-N's answers would build a bad proxy value function." And Matthew-in-my-head would say: "But in practice, when I interrogate GPT-4, its answers are fine, and they will improve further as LLMs get better. So I don't see why future systems couldn't be used to construct a good value function, actually."
Writer12

Eliezer, are you using the correct LW account? There's only a single comment under this one.

3TekhneMakre
(It's almost certainly actually Eliezer, given this tweet: https://twitter.com/ESYudkowsky/status/1710036394977235282)
Writer10

Would it be fair to summarize this post as:

1. It's easier to construct the shape of human values than MIRI thought. An almost good enough version of that shape is within RLHFed GPT-4, in its predictive model of text. (I use "shape" since it's Eliezer's terminology under this post.)

2. It still seems hard to get that shape into some AI's values, which is something MIRI has always said.

Therefore, the update for MIRI should be on point 1: constructing that shape is not as hard as they thought.

7Matthew Barnett
That sounds roughly accurate, but I'd frame it more as "It now seems easier to specify a function that reflects the human value function with high fidelity than what MIRI appears to have thought." I'm worried about the ambiguity of "construct the shape of human values" since I'm making a point about value specification. This claim is consistent with what I wrote, but I didn't actually argue it. I'm uncertain about whether inner alignment is difficult and I currently think we lack strong evidence about its difficulty. Overall though I think you understood the basic points of the post.
Writer40

Up until recently, with a big spreadsheet and guesses about these metrics:
- Expected impact
- Expected popularity
- Ease of adaptation (for external material)
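For illustration only, the spreadsheet boils down to something like the toy score below; the weights and the 1-5 scale here are made up, not our actual numbers:

```python
def topic_score(impact: float, popularity: float, ease_of_adaptation: float) -> float:
    """Toy prioritization score: each input is a 1-5 guess; the weights are illustrative."""
    return 0.5 * impact + 0.3 * popularity + 0.2 * ease_of_adaptation

# Example: a high-impact external article that's hard to adapt
print(topic_score(impact=5, popularity=3, ease_of_adaptation=2))  # 3.8
```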

The next few videos will still be chosen in this way, but we're drafting some documents to be more deliberate. In particular, we now have a list of topics to prioritize within AI Safety, especially because sometimes they build on each other.

Writer20

Thank you for the heads-up about the Patreon page; I've corrected it!

Given that the logic puzzle is not the point of the story (i.e., you can understand the gist of what the story is trying to say without understanding the first logic puzzle), I've decided not to use more space to explain it. I think the video (just like the original article) should be watched once all the way through, and then a second time, pausing multiple times to think through the logic.

Writer10

This is probably not the most efficient way to keep up with new stuff, but aisafety.info is shaping up to be a good repository of alignment concepts.

Writer10

It will be a while before we run an experiment, and when I'd like to start one, I'll make another post and consult with you again. 

When/if we do one, it'll probably look like what @the gears to ascension proposed in their comment here: a pretty technical video that will likely get fewer views than usual and filter for the kind of people we want on LessWrong. The way I'd advertise it could resemble the description of the market on Manifold linked in the post, but I'm going to run the details by you first.

LessWrong currently has about 2,0

... (read more)
Writer40

Rational Animations has a subreddit: https://www.reddit.com/r/RationalAnimations/

I hadn't advertised it until now because I had to find someone to help moderate it. 

I want people here to be among the first to join since I expect having LessWrong users early on would help foster a good epistemic culture.

Writer50

The answer must be "yes", since it's mentioned in the post

2the gears to ascension
whoops.
Writer30

I was thinking about publishing the post on the EA Forum too, to hear what users and mods think there, since some videos would link to EA Forum posts while others link to LW posts.

I agree that moderation is less strict on the EA Forum and that users would have a more welcoming experience. On the other hand, the more stringent moderation on LessWrong makes me more optimistic about LessWrong being able to withstand a large influx of new users without degrading the culture. Recent changes by moderators, such as the rejected content section, make me more optim... (read more)

3Chris_Leong
If you mention Less Wrong, you might want to think carefully about how to properly set expectations.
Writer40

I'm evaluating how much I should invite people from the channel to LessWrong, so I've made a market to gauge how many people would create a LessWrong account given some very aggressive publicity, which gives me a per-video upper bound. I'm not taking any unilateral action on things like that, and I'll make a LessWrong post to hear the opinions of users and mods here after I get more traders on this market.

3Chris_Leong
I guess one thing to think about is that Less Wrong is somewhat stricter on moderation than EA, so I wonder if inviting people to the EA forum would be a more welcoming experience?
Writer90

"April fool! It was not an April fool!"

Writer10

Here's a perhaps dangerous plan to save the world:

1. Have a very powerful LLM, or a more general AI in the simulators class. Make sure we don't go extinct during its training (e.g., via some agentic simulacrum taking over during training somehow; I'm not sure if this is possible, but I figured I'd mention it anyway).

2. Find a way to systematically remove the associated waluigis in the superposition caused by prompting a generic LLM (or simulator) to simulate a benevolent, aligned, and agentic character.

3. Elicit this agentic benevolent simulacrum in the supe... (read more)
