It's true that a video ending with a general "what to do" section instead of a call to action for ControlAI would have been more likely to stand the test of time (it wouldn't be tied to the reputation of one specific organization or to how good a specific action seemed at one moment in time). But... did you write this because you have reservations about ControlAI in particular, or would you have written it about any other organization?
Also, I want to make sure I understand what you mean by "betraying people's trust." Is it something like, "If in the future ControlAI does something bad, then, from the POV of our viewers, that means that they can't trust what they watch on the channel anymore?"
But the "unconstrained text responses" part is still about asking the model for its preferences even if the answers are unconstrained.
That just shows that the results of different ways of eliciting its values remain sorta consistent with each other, although I agree it constitutes stronger evidence.
Perhaps a more complete test would be to analyze whether its day-to-day responses to users are consistent with its stated preferences, and to examine its actions in settings where it can use tools to produce outcomes in very open-ended scenarios containing things that could prompt the model to act on its values.
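As a very rough sketch of the kind of consistency check I have in mind (not anything from the paper; `ask_forced_choice`, `ask_open_ended`, and `judge` are hypothetical placeholders you'd have to supply):

```python
# Minimal sketch: do a model's open-ended actions match its stated preferences?
# All model-calling functions below are hypothetical placeholders.
from typing import Callable

def consistency_rate(
    scenarios: list[str],
    ask_forced_choice: Callable[[str], str],  # e.g. returns "A" or "B"
    ask_open_ended: Callable[[str], str],     # free-form response or tool-use transcript
    judge: Callable[[str, str, str], bool],   # does the behavior match the stated choice?
) -> float:
    """Fraction of scenarios where open-ended behavior matches the stated preference."""
    matches = sum(
        judge(s, ask_forced_choice(s), ask_open_ended(s)) for s in scenarios
    )
    return matches / len(scenarios)
```

A high rate across many value-laden scenarios would be stronger evidence of real preferences than consistency between two ways of asking the same question.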
Thanks! I already don't feel as impressed by the paper as I was while writing the shortform, and I feel a little embarrassed for not thinking things through a bit more before posting my reactions, although at least now there's some discussion under the linkpost, so I don't entirely regret my comment if it prompted people to give their takes. I still feel I've updated in a non-negligible way from the paper, though, so maybe I'm not as pessimistic about it as other people. I'd definitely be interested in your thoughts if you find the discourse is still lacking in a week or two.
I'd guess an important caveat might be that stated preferences being coherent doesn't immediately imply that behavior in other situations will be consistent with those preferences. Still, this should be an update towards agentic AI systems in the near future being goal-directed in the spooky consequentialist sense.
There's an X thread showing that the ordering of answer options is, in several cases, a stronger determinant of the model's answer than its preferences. While this doesn't invalidate the paper's results (they control for this by varying the ordering of the answer options and aggregating the results), it strikes me as evidence in favor of the "you are not measuring what you think you are measuring" argument, showing that the preferences are relatively weak at best and completely dominated by confounding heuristics at worst.
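To illustrate the kind of control I mean (this is just my guess at the general shape of the procedure, not the paper's actual code; `ask_model` is a hypothetical placeholder):

```python
# Sketch: average a model's choices over all option orderings, so that pure
# position bias washes out and only content-based preferences remain.
# `ask_model` is a hypothetical placeholder returning the text of the chosen option.
from collections import Counter
from itertools import permutations
from typing import Callable

def order_controlled_preference(
    question: str,
    options: list[str],
    ask_model: Callable[[str], str],
) -> dict[str, float]:
    votes = Counter()
    for ordering in permutations(options):
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(ordering)
        )
        votes[ask_model(prompt)] += 1  # tally by option content, not by position
    total = sum(votes.values())
    return {opt: votes[opt] / total for opt in options}
```

If a model mostly picks whatever sits in slot A, the aggregated distribution comes out close to uniform, which is exactly the "weak preferences dominated by heuristics" worry.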
Surprised that there's no linkpost about Dan H's new paper on Utility Engineering. It looks super important, unless I'm missing something. LLMs are now utility maximisers? For real? We should talk about it: https://x.com/DanHendrycks/status/1889344074098057439
I feel weird about doing a link post since I mostly post updates about Rational Animations, but if no one does it, I'm going to make one eventually.
Also, please tell me if you think this isn't as important as it looks to me somehow.
EDIT: Ah! Here it is! https://www.lesswrong.com/posts/SFsifzfZotd3NLJa...
FWIW, my sense is that it's a bad paper. I expect other people will come out with critiques in the next few days that will expand on that, but I will write something if no one has done it in a week or two. I think the paper notices some interesting weak correlations, but man, it really doesn't feel like the way you would go about answering the central question it is trying to answer, and I keep getting the feeling that it was written to produce something that, on the most shallow read, looks like an answer to that question, in order to persuade and be socially viral rather than to inform.
The two Gurren Lagann movies cover all the events of the series and, based on my recollection, they should be better animated. Also from what I remember, the first should have a pretty central take on scientific discovery, while the second should be more about ambition and progress, though both probably have at least a bit of each. It's not by chance that some e/accs have profile pictures inspired by that anime. I feel like people here might disagree with part of the message, but I think it says something pretty forcefully about issues we care about here. (Also, it was cited somewhere in HP: MoR, but for humor.)
I think the core message of optimism is a positive one, but of course IRL we have to deal with a world whose physical laws do not, in fact, seem to bend endlessly under sufficient application of MANLY WARRIOR SPIRIT, which forces us to occasionally be Rossiu even when we'd want to be Simon. Memeing ourselves into believing otherwise doesn't make it true.
I think it would be very interesting to see you and @TurnTrout debate with the same depth, preparation, and clarity that you brought to the debate with Robin Hanson.
Edit: Also, tentatively, @Rohin Shah because I find this point he's written about quite cruxy.
For me, perhaps the biggest takeaway from Aschenbrenner's manifesto is that even if we solve alignment, we still face an incredibly thorny coordination problem between the US and China, in which each is massively incentivized to race ahead and develop military power using superintelligence, putting both of them and the rest of the world at immense risk. And I wonder whether, having seen this in advance, we can sit down and solve this coordination problem in ways that have a higher chance of leading to a better outcome than the "race ahead" strategy and don't risk encou...
Noting that additional authors still don't carry over when the post is a cross-post, unfortunately.
I'd guess so, but with AGI we'd go much much faster. Same for everything you've mentioned in the post.
Turn everyone hot
If we can do that due to AGI, almost surely we can solve aging, which would be truly great.
Looking for someone in Japan who had experience with guns in games, he looked on Twitter and found someone posting gun-reloading animations.
Having interacted with animation studios and being generally pretty embedded in this world, I know that many studios do similar things, such as putting out Twitter callouts when they need contractors fast for some projects. Even established anime studios do this. I know at least two people who got to work on Japanese anime thanks to Twitter interactions.
I hired animators through Twitter myself, using a similar proce...
It's a well-known practice.
As far as Palworld goes and the genius of its founders & character designer go, I would personally hold off on the encomiums until the dust has settled some more - both to see if it's anything but a gimmick (can 'unlicensed Pokemon but survival' really hold players long-term?) and also if they even survive (there's a reason 'unlicensed Pokemon but X' is not a large niche). While Nintendo has not yet formally sued them, the official Nintendo statement was not exactly friendly either... And their ex-head-lawyer seems to expect ...
Thank you! And welcome to LessWrong :)
The comments under this video seem okayish to me, but maybe that's because I'm calibrated on worse stuff under past videos, which isn't necessarily very good news for you.
The worst I'm seeing is people grinding their own different axes, which isn't necessarily indicative of misunderstanding.
But there are also regular commenters who are leaving pretty good comments:
The other comments I see range from amused and kinda joking about the topic to decent points overall. These are the top three in terms of popularity at the moment:
Stories of AI takeover often involve some form of hacking. This seems like a pretty good reason to use (maybe relatively narrow) AI to improve software security worldwide. Luckily, the private sector should cover this in good measure out of financial interest.
I also wonder if the balance of offense vs. defense favors defense here. Usually, recognizing is easier than generating, and this could apply to malicious software. We may have excellent AI antiviruses devoted to the recognizing part, while the AI attackers would have to do the generating part...
Also I don't think that LLMs have "hidden internal intelligence"
I don't think Simulators claims or implies that LLMs have "hidden internal intelligence" or "an inner homunculus reasoning about what to simulate", though. Where are you getting it from? This conclusion makes me think you're referring to this post by Eliezer and not Simulators.
Yoshua Bengio is looking for postdocs for alignment work:
...I am looking for postdocs, research engineers and research scientists who would like to join me in one form or another in figuring out AI alignment with probabilistic safety guarantees, along the lines of the research program described in my keynote (https://www.alignment-workshop.com/nola-2023) at the New Orleans December 2023 Alignment Workshop.
I am also specifically looking for a postdoc with a strong mathematical background (ideally an actual math or math+physics or math+CS degree) to take a lead
i think about this story from time to time. it speaks to my soul.
i truly wish everything will be ok :,)
thank you for this, tamsin.
Here's a new RA short about AI Safety: https://www.youtube.com/shorts/4LlGJd2OhdQ
This topic might be less relevant given today's AI industry and the fast advancements in robotics. But I also see shorts as a way to cover topics that I think still constitute fairly important context but that, for one reason or another, wouldn't be the most efficient use of resources to cover in long-form videos.
The way I understood it, this post is thinking aloud while embarking on the scientific quest of searching for search algorithms in neural networks. It's a way to prepare the ground for doing the actual experiments.
Imagine a researcher embarking on the quest of "searching for search". I highlight in italics the parts that are present in the post (at least to some degree):
- At some point, the researcher reads Risks From Learned Optimization.
- They complain: "OK, Hubinger, fine, but you haven't told me what search is anyway"
- They read or get invol...
RA has started producing shorts. Here's the first one using original animation and script: https://www.youtube.com/shorts/4xS3yykCIHU
The LW short-form feed seems like a good place for posting some of them.
In this post, I appreciated two ideas in particular:
"Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In p...
Was Bing responding in Tibetan to some emojis already discussed on LW? I can't find a previous discussion about it here. I would have expected people to find this phenomenon after the SolidGoldMagikarp post, unless it's a new failure mode for some reason.
Toon Boom Harmony for animation and After Effects for compositing and post-production
the users you get are silly
Do you expect them to make bad bets? If so, I disagree and I think you might be too confident given the evidence you have. We can check this belief against reality by looking at the total profit earned by my referred users here. If their profit goes up over time, they are making good bets; otherwise, they are making bad bets. At the moment, they are at +13063 mana.
They might not do that if they have different end goals though. Some version of this strategy doesn't seem so hopeless to me.
For Rational Animations, there's no problem if you do that, and I generally don't see drawbacks.
Perhaps one thing to be aware of is that some of the articles we'll animate will be slightly adapted. Sorting Pebbles and The Power of Intelligence are exactly like the originals. The Parable of The Dagger has a few words deleted, such as "replied" or "said", and adds a short additional scene at the end. The text of The Hidden Complexity of Wishes has been changed in some places, with Eliezer's approval, mainly because the original has some references to other articles. In any case, when there are such changes, I note them in the LW post accompanying the video.
If you just had to pick one, go for The Goddess of Everything Else.
Here's a short list of my favorites.
In terms of animation:
- The Goddess of Everything Else
- The Hidden Complexity of Wishes
- The Power of Intelligence
In terms of explainer:
- Humanity was born way ahead of its time. The reason is grabby aliens. [written by me]
- Everything might change forever this century (or we’ll go extinct). [mostly written by Matthew Barnett]
Also, I've sent the Discord invite.
I (Gretta) will be leading the communications team at MIRI, working with Rob Bensinger, Colm Ó Riain, Nate, Eliezer, and other staff to create succinct, effective ways of explaining the extreme risks posed by smarter-than-human AI and what we think should be done about this problem.
I just sent an invite to Eliezer to Rational Animations' private Discord server so that he can dump some thoughts on Rational Animations' writers. It's something we decided to do when we met at Manifest. The idea is that we could distill his infodumps into something succin...
I don't speak for Matthew, but I'd like to respond to some points. My reading of his post is the same as yours, but I don't fully agree with what you wrote as a response.
...If you find something that looks to you like a solution to outer alignment / value specification, but it doesn't help make an AI care about human values, then you're probably mistaken about what actual problem the term 'value specification' is pointing at.
[...]
It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now al
Keeping all this in mind, the actual crux of the post seems to me to be this:
...I claim that GPT-4 is already pretty good at extracting preferences from human data. If you talk to GPT-4 and ask it ethical questions, it will generally give you reasonable answers. It will also generally follow your intended directions, rather than what you literally said. Together, I think these facts indicate that GPT-4 is probably on a path towards an adequate solution to the value identification problem, where "adequate" means "about as good as humans". And to be clear, I don't mean th
I agree that MIRI's initial replies don't seem to address your points and seem to be straw-manning you. But there is one point they've made, which appears in some comments, that seems central to me. I would translate it this way to tie it more explicitly to your post:
"Even if GPT-N can answer questions about whether outcomes are bad or good, thereby providing "a value function", that value function is still a proxy for human values since what the system is doing is still just relaying answers that would make humans give thumbs up or thumbs down."
To me, ...
Eliezer, are you using the correct LW account? There's only a single comment under this one.
Would it be fair to summarize this post as:
1. It's easier to construct the shape of human values than MIRI thought. An almost good enough version of that shape is within RLHFed GPT-4, in its predictive model of text. (I use "shape" since it's Eliezer's terminology under this post.)
2. It still seems hard to get that shape into some AI's values, which is something MIRI has always said.
Therefore, the update for MIRI should be on point 1: constructing that shape is not as hard as they thought.
Up until recently, with a big spreadsheet and guesses about these metrics:
- Expected impact
- Expected popularity
- Ease of adaptation (for external material)
The next few videos will still be chosen in this way, but we're drafting some documents to be more deliberate. In particular, we now have a list of topics to prioritize within AI Safety, especially because sometimes they build on each other.
Thank you for the heads-up about the Patreon page; I've corrected it!
Given that the logic puzzle is not the point of the story (i.e., you could understand the gist of what the story is trying to say without understanding the first logic puzzle), I've decided not to use more space to explain it. I think the video (just like the original article) should be watched once all the way through, and then a second time, pausing multiple times to think through the logic.
This is probably not the most efficient way of keeping up with new material, but aisafety.info is shaping up to be a good repository of alignment concepts.
It will be a while before we run an experiment, and when I'd like to start one, I'll make another post and consult with you again.
When/if we do one, it'll probably look like what @the gears to ascension proposed in their comment here: a pretty technical video that will likely get fewer views than usual and will filter for the kind of people we want on LessWrong. How I would advertise it could resemble the description of the market on Manifold linked in the post, but I'm going to run the details by you first.
...LessWrong currently has about 2,0
Rational Animations has a subreddit: https://www.reddit.com/r/RationalAnimations/
I hadn't advertised it until now because I had to find someone to help moderate it.
I want people here to be among the first to join since I expect having LessWrong users early on would help foster a good epistemic culture.
The answer must be "yes", since it's mentioned in the post
I was thinking about publishing the post on the EA Forum too, to hear what users and mods think there, since some videos would link to EA Forum posts while others would link to LW posts.
I agree that moderation is less strict on the EA Forum and that users would have a more welcoming experience. On the other hand, the more stringent moderation on LessWrong makes me more optimistic about LessWrong being able to withstand a large influx of new users without degrading the culture. Recent changes by moderators, such as the rejected content section, make me more optim...
I'm evaluating how much I should invite people from the channel to LessWrong, so I've made a market to gauge how many people would create a LessWrong account given some very aggressive publicity, so I can get a per-video upper bound. I'm not taking any unilateral action on things like that, and I'll make a LessWrong post to hear the opinions of users and mods here after I get more traders on this market.
"April fool! It was not an April fool!"
Here's a perhaps dangerous plan to save the world:
1. Have a very powerful LLM, or a more general AI in the simulators class. Make sure that we don't go extinct during its training (e.g., because some agentic simulacrum takes over during training somehow; I'm not sure this is possible, but I figured I'd mention it anyway).
2. Find a way to systematically remove the associated waluigis in the superposition caused by prompting a generic LLM (or simulator) to simulate a benevolent, aligned, and agentic character.
3. Elicit this agentic benevolent simulacrum in the supe...
That's fair; we wrote that part before DeepSeek became a "top lab", and we failed to notice there was an adjustment to make.