Fair point. I also haven't done much posting since adding the bounty to my profile. Was thinking it might attract the attention of people reading the archives, but maybe there just aren't many archive readers.
There is some observational evidence that coffee drinking increases lifespan. I think the proposed mechanism has to do with promoting autophagy. https://www.acpjournals.org/doi/10.7326/M21-2977 But it looks like decaf works too. (Decaf has a bit of caffeine.)
I think somewhere else I read that unfiltered coffee doesn't improve lifespan, so try to drink the filtered stuff?
In my experience caffeine dependence is not a big deal and might help my sleep cycle.
Eliezer is a good example of someone who built a lot of status on the back of "breaking" others' unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
Fair enough.
My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I'm also aware that assuming breaking is easy for everyone is probably committing some amount of typical mind fallacy.
Yeah personally building feels more natural to me.
I agree a leaderboard would be great. I think it'd be cool to have a...
I wrote a comment on your post with feedback.
I don't have anything prepared for red teaming at the moment -- I appreciate the offer though! Can I take advantage of it in the future? (Anyone who wants to give me critical feedback on my drafts should send me a personal message!)
I skimmed the post; here is some feedback (context):
I'm probably not the best person to red team this since some of my own alignment ideas are along similar lines. I'm also a bit on the optimistic side about alignment more generally -- it might be better to talk to a pessimist.
This sounds a bit like the idea of a "low-bandwidth oracle".
I think the biggest difficulty is the one you explicitly acknowledged -- boxing is hard.
But there are also problems around ensuring that bandwidth is actually limited. If you have a human check to see that the
Thanks for the reply!
As some background on my thinking here, last I checked there are a lot of people on the periphery of the alignment community who have some proposal or another they're working on, and they've generally found it really difficult to get quality critical feedback. (This is based on an email I remember reading from a community organizer a year or two ago saying "there is a desperate need for critical feedback".)
I'd put myself in this category as well -- I used to write a lot of posts and especially comments here on LW summarizing how I'd g...
Thanks for writing this! Do you have any thoughts on doing a red team/blue team alignment tournament as described here?
Many! Thanks for sharing. This could easily turn into its own post.
In general, I think this is a great idea. I'm somewhat skeptical that this format would generate deep insights; in my experience, successful Capture the Flag / wargame / tabletop exercises work best when each group spends a lot of time preparing for its particular role, but opsec wargames are usually easier to score, so the judge role makes less sense there. That said, in the alignment world I'm generally supportive of trying as many different approaches as possible to see wh...
Chapter 7 in this book had a few good thoughts on getting critical feedback from subordinates, specifically in the context of avoiding disasters. The book claims that merely encouraging subordinates to give critical feedback is often insufficient, and offers ideas for other things to do.
And just as I was writing this I came across another good example of the ‘you think you’re in competition with others like you but mostly you’re simply trying to be good enough’ phenomenon.
I'm straight, so possibly unreliable, but I remember Michael Curzi as a very good-looking guy with a deep sexy voice. I believe him when he says other dudes are not competition for him 95% of the time. ;-)
I wrote a comment here arguing that voting systems tend to encourage conformity. I think this is a way in which the LW voting system could be improved. You might get rid of the unlabeled quality axis and force downvoters to be specific about why they dislike the comment. Maybe readers could specify which weights they want to assign to the remaining axes in order to sort comments.
I think Agree/Disagree is better than True/False, and Understandable/Confusing would be better than Clear/Muddled. Both of these axes are functions of two things (the reader an...
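For concreteness, here is a rough sketch of what reader-weighted sorting could look like. The axis names, data layout, and linear scoring rule are all my own illustrative assumptions, not a description of LW's actual voting system:

```python
from typing import Dict, List

# Hypothetical comment records: per-axis vote totals only (no other fields).
Comment = Dict[str, float]

def sort_comments(comments: List[Comment], weights: Dict[str, float]) -> List[Comment]:
    """Sort comments by a reader-chosen weighted sum of their axis vote totals."""
    def score(comment: Comment) -> float:
        return sum(weights.get(axis, 0.0) * votes for axis, votes in comment.items())
    return sorted(comments, key=score, reverse=True)

# Example: a reader who cares mostly about agreement and a little about clarity.
comments = [
    {"agree": 12, "understandable": 3},
    {"agree": 4, "understandable": 10},
]
print(sort_comments(comments, weights={"agree": 1.0, "understandable": 0.3}))
```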
I'll respond to the "Predict hypothetical sensors" section in this comment.
First, I want to mention that predicting hypothetical sensors seems likely to fail in fairly obvious ways, e.g. you request a prediction about a sensor that's physically nonexistent and the system responds with a bunch of static or something. Note the contrast with the "human simulator" failure mode, which is much less obvious.
But I also think we can train the system to predict hypothetical sensors in a way that's really useful. As in my previous comment, I'll work from the assump...
Thanks for the reply! I'll respond to the "Hold out sensors" section in this comment.
One assumption which seems fairly safe in my mind is that as the operators, we have control over the data our AI gets. (Another way of thinking about it: if we don't have control over the data our AI gets, the game has already been lost.)
Given that assumption, this problem seems potentially solvable.
...Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get
I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.
Some other thoughts:
I felt like the report was unusually well-motivated when I put my "mainstream ML" glasses on, relative to a lot of alignment work.
ARC's overall approach is probably my favorite out of alignment research groups I'm aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.
Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via
(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)
There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.
Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.
Yes, if you've just created it, then the criteria are meaningfully different in that case for a very limited time.
It's not obvious to me that this is only true right after creation for a very limited time. What is supposed to change after that?
I don't see how we're getting off track. (Your original statement was: 'One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware.' If we're discussing situations where that claim m...
My point is that plan execution can't be decoupled successfully from plan generation in this way. "Outputting a plan" is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.
"Outputting a plan" may technically constitute an action, but a superintelligent system (defining "superintelligent" as being able to search large spaces quickly) might not evaluate its effects as such.
...Yes, it is possible for plans to score highly under the first criterion but not the second. However, in
The main problem is that "acting via plans that are passed to humans" is not much different from "acting via plans that are passed to robots" when the AI is good enough at modelling humans.
I agree this is true. But I don't see why "acting via plans that are passed to humans" is what would happen.
I mean, that might be a component of the plan which is generated. But the assumption here is that we've decoupled plan generation from plan execution successfully, no?
So we therefore know that the plan we're looking at (at least at the top level) is the result...
I agree these are legitimate concerns... these are the kind of "deep" arguments I find more persuasive.
In that thread, johnswentworth wrote:
In particular, even if we have a reward signal which is "close" to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.
I'd solve this by maintaining uncertainty about the "reward signal", so the AI tries to find a plan which looks good under both alignment and the actual-process-which-generates-the-reward-signal. ...
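As a rough sketch of what I mean by "maintaining uncertainty about the reward signal": evaluate candidate plans under an ensemble of possible reward functions and prefer plans that look good under all of them. The max-min rule and the names below are illustrative assumptions, not a worked-out proposal:

```python
from typing import Callable, List, TypeVar

Plan = TypeVar("Plan")
RewardModel = Callable[[Plan], float]  # one hypothesis about the true reward

def pick_robust_plan(plans: List[Plan], reward_models: List[RewardModel]) -> Plan:
    """Pick the plan with the best worst-case score across the reward ensemble."""
    def worst_case_score(plan: Plan) -> float:
        return min(model(plan) for model in reward_models)
    return max(plans, key=worst_case_score)
```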
One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware. You don't need a deep argument to point out an obvious flaw there.
I don't see the "obvious flaw" you're pointing at and would appreciate a more in-depth explanation.
In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this:
You ask your AGI to generate a plan for how it could maximize paperclips.
Your AGI generates
I had the same view as you, and was persuaded out of it in this thread. Maybe to shift focus a little, one interesting question here is about training. How do you train a plan-generating AI? If you reward plans that sound like they'd succeed, regardless of how icky they seem, then the AI will become useless to you by outputting effective-sounding but icky plans. But if you reward only plans that look nice enough to execute, that tempts the AI to make plans that manipulate whoever is reading them, and we're back at square one.
Maybe that's a good way to look...
For what it's worth, I often find Eliezer's arguments unpersuasive because they seem shallow. For example:
The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.
This seem like a fuzzy "outside view" sort of argument. (Compare with: "A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways." On the other hand, a causal model of a gun lets you explain which specif...
As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.
Might depend on whether the "thought" part comes before or after particular story text. If the "thought" comes after that story text, then it's generated conditional on that text, essentially a rationalization of that text from a hypothetical DM's point of view. If it comes before that sto...
I updated the post to note that if you want voting rights in Google, it seems you should buy $GOOGL not $GOOG. Sorry! Luckily they are about the same price, and you can easily dump your $GOOG for $GOOGL. In fact, it looks like $GOOGL is $6 cheaper than $GOOG right now? Perhaps because it is less liquid?
Fraud also seems like the kind of problem you can address as it comes up. And I suspect just requiring people to take a salary cut is a fairly effective way to filter for idealism.
All you have to do to distract fraudsters is put a list of poorly run software companies where you can get paid more money to work less hard at the top of the application ;-) How many fraudsters would be silly enough to bother with a fraud opportunity that wasn't on the Pareto frontier?
The problem comes when one tries to pour a lot of money into that sort of approach
It seems to me that the Goodhart effect is actually stronger if you're granting less money.
Suppose that we have a population of people who are keen to work on AI safety. Suppose every time a person from that population gets an application for funding rejected, they lose a bit of the idealism which initially drew them to the area and they start having a few more cynical thoughts like "my guess is that grantmakers want to fund X, maybe I should try to be more like X even th...
I think if you're in the early stages of a big project, like founding a pre-paradigmatic field, it often makes sense to be very breadth-first. You can save a lot of time trying to understand the broad contours of solution space before you get too deeply invested in a particular approach.
I think this can even be seen at the microscale (e.g. I was coaching someone on how to solve leetcode problems the other day, and he said my most valuable tip was to brainstorm several different approaches before exploring any one approach in depth). But it really shines ...
Yes, I tried it. It gave me a headache but I would guess that's not common. Think it's probably a decent place to start.
I didn't end up sticking to this because of various life disruptions. I think it was a bit helpful but I'm planning to try something more intensive next time.
I'm glad you are thinking about this. I am very optimistic about AI alignment research along these lines. However, I'm inclined to think that the strong form of the natural abstraction hypothesis is pretty much false. Different languages and different cultures, and even different academic fields within a single culture (or different researchers within a single academic field), come up with different abstractions. See for example lsusr's posts on the color blue or the flexibility of abstract concepts. (The Whorf hypothesis might also be worth looking i...
Interesting, thanks for sharing.
I couldn't figure out how to go backwards easily.
Command-shift-g right?
After practicing Vim for a few months, I timed myself doing the Vim tutorial (vimtutor on the command line) using both Vim with the commands recommended in the tutorial, and a click-and-type editor. The click-and-type editor was significantly faster. Nowadays I just use Vim for the macros, if I want to do a particular operation repeatedly on a file.
I think if you get in the habit of double-clicking to select words and triple-clicking to select lines (triple-click and drag to select blocks of code), click-and-type editors can be pretty fast.
We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.
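For concreteness, here is a minimal sketch of that environment, taking the description at face value: the cross-episode penalty is tracked as state carried over the episode boundary. The class name and single-step episodes are my own assumptions, not part of the original description:

```python
class ButtonEnv:
    """Toy environment: pressing the button gives +1 now and -10 next episode."""

    def __init__(self):
        self.pending_penalty = 0.0  # reward debt carried across episode boundaries

    def reset(self):
        """Start a new episode; any penalty from last episode is still owed."""
        return None  # no observations in this toy setting

    def step(self, press_button: bool):
        reward = self.pending_penalty  # charge last episode's button press
        self.pending_penalty = 0.0
        if press_button:
            reward += 1.0                 # immediate gain this episode
            self.pending_penalty = -10.0  # realized next episode
        done = True  # one decision per episode, like a repeated marshmallow test
        return reward, done
```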
Are you sure that "episode" is the word you're looking for here?
https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL
I'm especially confused becaus...
lsusr had an interesting idea of creating a new Youtube account and explicitly training the recommendation system to recommend particular videos (in his case, music): https://www.lesswrong.com/posts/wQnJ4ZBEbwE9BwCa3/personal-experiment-one-year-without-junk-media
I guess you could also do it for Youtube channels which are informative & entertaining, e.g. CGP Grey and Veritasium. I believe studies have found that laughter tends to be rejuvenating, so optimizing for videos you think are funny is another idea.
I suspect you will be most successful at this if you get in the habit of taking breaks away from your computer when you inevitably start to flag mentally. Some that have worked for me include: going for a walk, talking to friends, taking a nap, reading a magazine, juggling, noodling on a guitar, or just daydreaming.
...When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.
ASHLEY: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we're trying to do.
I'm not convinced your chess example, where the practical solution resembles the hypercomputer one, is representativ...
From a safety standpoint, hoping and praying that SGD won't stumble across lookahead doesn't seem very robust, if lookahead represents a way to improve performance. I imagine that whether SGD stumbles across lookahead will end up depending on complicated details of the loss surface that's being traversed.
Lately I've been examining the activities I do to relax and how they might be improved. If you haven't given much thought to this topic, Meaningful Rest is excellent background reading.
An interesting source of info for me has been lsusr's posts on cutting out junk media: 1, 2, 3. Although I find lsusr's posts inspiring, I'm not sure I want to pursue the same approach myself. lsusr says: "The harder a medium is to consume (or create, as applicable) the smarter it makes me." They responded to this by cutting all the easy-to-consume media out of their lif...
Good to know! I was thinking the application process would be very transparent and non-demanding, but maybe it's better to ditch it altogether.
Related to the earlier discussion of weighted voting allegedly facilitating groupthink: https://www.lesswrong.com/posts/kxhmiBJs6xBxjEjP7/weighted-voting-delenda-est
An interesting litmus test for groupthink might be: What has LW changed its collective mind about? By that I mean: the topic was discussed on LW, there was a particular position on the issue that was held by the majority of users, new evidence/arguments came in, and now there's a different position which is held by the majority of users. I'm a bit concerned that nothing comes to mind which mee...
I feel like there was a mass community movement (not unanimous but substantial) from AGI-scenarios-that-Eliezer-has-in-mind to AGI-scenarios-that-Paul-has-in-mind, e.g. more belief in slow takeoff + multipolar + "What Failure Looks Like" and less belief in fast takeoff + decisive strategic advantage + recursive self-improvement + powerful agents coherently pursuing misaligned goals. This was mostly before my time, I could be misreading things, that's just my impression. :-)
Makes sense, thanks.
For whatever it's worth, I believe I was the first to propose weighted voting on LW, and I've come to agree with Czynski that this is a big downside. Not necessarily enough to outweigh the upsides, and probably insufficient to account for all the things Czynski dislikes about LW, but I'm embarrassed that I didn't foresee it as a potential problem. If I was starting a new forum today, I think I'd experiment with no voting at all -- maybe try achieving quality control by having an application process for new users? Does anyone have thoughts about that?
Another possible AI parallel: Some people undergo a positive feedback loop where more despair leads to less creativity, less creativity leads to less problem-solving ability (e.g. P100 thing), less problem-solving ability leads to a belief that the problem is impossible, and a belief that the problem is impossible leads to more despair.
China's government is more involved in large-scale businesses.
According to the World Economic Forum website:
China is home to 109 corporations listed on the Fortune Global 500 - but only 15% of those are privately owned.
Like, maybe depending on the viewer history, the best video to polarize the person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to make better and better models of human behavior and how to influence them, without having to "jump out of the system" as you say.
Makes sense.
...there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded.
I think it might be useful to distinguish between being aware of onesel...
I suspect the best way to think about the polarizing political content thing which is going on right now is something like: The algorithm knows that if it recommends some polarizing political stuff, there's some chance you will head down a rabbit hole and watch a bunch more vids. So in terms of maximizing your expected watch time, recommending polarizing political stuff is a good bet. "Jumping out of the system" and noticing that recommending polarizing videos also polarizes society as a whole and gets them to spend more time on Youtube on a macro level ...
Not sure if this answers, but the book Superforecasting explains, among other things, that probabilistic thinkers tend to make better forecasts.
Yes, I didn't say "they are not considering that hypothesis"; I am saying "they don't want to consider that hypothesis". Those do indeed imply very different actions. I think one gives rise very naturally to producing counterarguments, the other one does not.
They don't want to consider the hypothesis, and that's why they'll spend a bunch of time carefully considering it and trying to figure out why it is flawed?
In any case... Assuming the Twitter discussion is accurate, some people working on AGI have already thought about the "alignment is hard" positi...
What? What about all the people who prefer to do fun research that builds capabilities and has direct ways to make them rich, without having to consider the hypothesis that maybe they are causing harm?
If they're not considering that hypothesis, that means they're not trying to think of arguments against it. Do we disagree?
I agree that if the government were seriously considering regulation of AI, the AI industry would probably lobby against it. But that's not the same question. From a PR perspective, just ignoring critics often seems to be a good strategy.
Power makes you dumb, stay humble.
Tell everyone in the organization that safety is their responsibility, everyone's views are important.
Try to be accessible and not intimidating, admit that you make mistakes.
Schedule regular chats with underlings so they don't have to take initiative to flag potential problems. (If you think such chats aren't a good use of your time, another idea is to contract someone outside of the organization to do periodic informal safety chats. Chapter 9 is about how organizational outsiders are uniquely well-positioned