Demystifying "Alignment" through a Comic

milanrosko

9th Jun 2024

Disclaimer: This explanatory comic is not specifically aimed at LessWrong contributors.

Art, AI Alignment Fieldbuilding, Conceptual Media, Has Diagram, Inner Alignment, AI

19 comments

[-]the gears to ascension11mo*3332

edit #2: the title I was commenting on has been removed. I'd encourage others to reconsider their downvote.

~~Please chill about downvotes. If you edit the title back, you are likely to on net get upvoted. I had just upvoted when you changed it. Give it time to settle before you get doom and gloom about it.~~

edit: the title was originally something like "an alignment comic". OP is trying to throw away a perfectly good post that, in my estimation, wasn't even previously going to settle on a negative score, just was briefly negative.

[-]habryka11mo118

I changed the title back (from "Downvote this") and removed the first sentence (which complained about downvotes), since it seems like it changed people's voting behavior a lot in a confusing way. My best guess is this post would actually be pretty popular with its original title (and I also upvoted it).

[-]milanrosko11mo20

Good job. Thank you and have a nice week.

[-]Zack_M_Davis11mo95

Seconding this. A nonobvious quirk of the system where high-karma users get more vote weight is that it increases variance for posts with few votes: if a high-karma user or two who don't like you see your post first, they can trash the initial score in a way that doesn't reflect "the community's" consensus. I remember the early karma scores for one of my posts going from 20 to zero (!). It eventually finished at 131.
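To make the variance point concrete, here is a minimal sketch. The vote weights are made-up assumptions, since this thread does not state LessWrong's actual weighting formula, and `score` is a hypothetical helper, not site code.

```python
def score(votes):
    """Sum signed, weighted votes. Each vote is (direction, weight)."""
    return sum(direction * weight for direction, weight in votes)

# Hypothetical weights: 1 for an ordinary user, 6 for a high-karma user.
# Early on, a single heavy downvote outweighs two ordinary upvotes.
early_votes = [(+1, 1), (+1, 1), (-1, 6)]
early_score = score(early_votes)

# Later, twenty ordinary upvotes and two heavy upvotes arrive.
later_votes = early_votes + [(+1, 1)] * 20 + [(+1, 6)] * 2
later_score = score(later_votes)

print(early_score, later_score)
```

With few votes, one heavy vote decides the sign of the score; with many votes, its effect is diluted.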

[-]Adam Shai11mo31

I've also noticed this phenomenon. I wonder if a solution would be to have an initial period where votes are considered more democratically, and then after that period the influence of high-karma users is applied (including back-applying the influence of votes that occurred during the initial period). I can also imagine downsides to this.
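A rough sketch of this two-phase idea, under stated assumptions: the 24-hour grace period, the example weights, and the function name `displayed_score` are all hypothetical and not anything LessWrong is known to implement.

```python
def displayed_score(votes, post_age_hours, grace_hours=24):
    """votes: list of (direction, weight); direction is +1 or -1."""
    if post_age_hours < grace_hours:
        # Initial period: every vote counts as 1, regardless of the voter's karma.
        return sum(direction for direction, _ in votes)
    # Afterwards: apply weights to all votes, including those cast early.
    return sum(direction * weight for direction, weight in votes)

# Two ordinary upvotes and one heavy downvote (hypothetical weight 6).
votes = [(+1, 1), (+1, 1), (-1, 6)]
early_score = displayed_score(votes, post_age_hours=2)
late_score = displayed_score(votes, post_age_hours=48)
print(early_score, late_score)
```

One visible downside in this sketch is that the score can jump abruptly when the grace period ends and the weights are back-applied.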

[-]milanrosko11mo182

Thanks to the mod for the deus ex machina.

I've been a LessWrong lurker (without an account) for around ten years, ever since the Roko's Basilisk "thing", so... This comic isn't targeted at the LessWrong community but was created by it.

The unusual style, gaming-related language, and iconography typical of manga and comics help bypass the bias known as "mortality salience." I'm trying to convey this message more indirectly, aiming to engage people who might not usually be interested in these topics or who would typically engage in "worldview defense".

Anyway, I've corrected a few of the issues that some pointed out. Thanks for the helpful feedback.

What deeply frustrates me is the way content is rated, ordered and created as a broad phenomenon in today's internet. I find it challenging to cope with how cultural products are dumbed down to the point of being unbearable because of algorithms and click optimization. This bitterness is beginning to affect my life and my relationships with people in general.

[-]habryka11mo30

You're welcome! Sorry for the variance in the karma system. It does take quite a while for things to settle.

[-]ilm11mo20

I liked your post.

> What deeply frustrates me is the way content is rated, ordered and created as a broad phenomenon in today's internet.

Tangential, but I'd be interested to hear/read more about this. I have similar feelings but thoughts around this are very disorganized and blurry, rather than well laid out and clear.

[-]TsviBT11mo116

Since the author seems to have been discouraged at one point:

The good: The images (the blobs) are really good. Cute, quirky, engaging, in some cases good explainers. Overall the production value seems high.

The confusing: I'm not familiar with manga, e.g. its idioms or vibes; so there's probably stuff you were going for that I just wouldn't understand.

The probably bad: I'm guessing that the writing has too many leaps that are too unclear. I could imagine this being a style, like in a fictional work where things are sort of referenced without really being explained but referenced in a way that makes them "feel real" to the reader (and maybe makes them go look it up). But I'd guess the comic doesn't hit that as-is. Possibly the comic would benefit from test readers (preferably from your target audience) who you talk with to see where they got bored / confused.

[-]tailcalled11mo80

In the above comic, the AI is trained by having the human look at its behavior, judge what method the AI is using to solve the desired task and how much progress it's making along that method, and then select the ones that make more progress.

If we see the AI start studying the buttons to decide what to do, we can just select against that. It gets its capabilities entirely from our judgement of the likely consequences, so while that can lead to deception in simple cases where it accidentally stumbles into confusing us (e.g. by going in front of the sugar), this doesn't imply a selection in favor of complex misalignment/deception that is so unlikely to happen by chance that you wouldn't stumble into it without many steps of intentional selection.
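The selection process described above can be sketched as a toy loop. This is an illustration under stated assumptions, not the comic's or any real training setup: the `Candidate` fields, the method labels, and the progress numbers are all hypothetical stand-ins for a human's judgement.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    method: str      # what the human judges the AI to be doing
    progress: float  # human-judged progress along that method

def select(candidates, disallowed_methods, keep):
    """Drop candidates using a disallowed method, then keep the top performers."""
    allowed = [c for c in candidates if c.method not in disallowed_methods]
    allowed.sort(key=lambda c: c.progress, reverse=True)
    return allowed[:keep]

population = [
    Candidate("a", "fetch sugar", 0.9),
    Candidate("b", "fetch sugar", 0.4),
    Candidate("c", "study buttons", 0.95),  # high progress, unwanted method
    Candidate("d", "fetch sugar", 0.7),
]

# Select against studying the buttons, regardless of judged progress.
survivors = select(population, {"study buttons"}, keep=2)
survivor_names = [c.name for c in survivors]
print(survivor_names)
```

The sketch only works as well as the human's labels: a candidate whose method is misjudged passes the filter, which is the simple, accidental kind of deception the comment allows for.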

[-]oumuamua11mo66

Some hopefully constructive criticism:

- I believe it's "agentic", not "agentive".
- "Save scumming" isn't a widely known term. If I hadn't known exactly where this was going, it might have confused me. Consider replacing it with something like "trial and error".
- I would rework the part where the blob bites the finger off. It causes people to ask things like "but how could a piece of software bite my finger?", which derails the conversation. Don't specify exactly how it's going to try to prevent the pushing of the button; explain that it has a strong incentive to do so, that it is correct about that, and that it can use the abilities it learned to understand and manipulate the world to accomplish that.

Edit: To end this on a positive note: This format is underexplored. We need more "alignment is hard 101" content that is as convincing as possible, without making use of deception. Thank you for creating something that could become very valuable with a bit of iterative improvement. Like, genuinely. Thank you.

[-]milanrosko11mo*30

Corrected to agentic and changed the part where it derails a bit. Thank you.

[-]milanrosko11mo30

Yeah, I will close this attempt. I mean, currently it has -1 and some other dumb thread about NYC has like 21. Nah... Fuck this.

[-]oumuamua11mo149

Don't do this, please. Just wait and see. This community is forgiving about changing one's mind.

[-]geoffreymiller11mo31

This is really good, and it'll be required reading for my new 'Psychology and AI' class that I'll teach next year.

Students are likely to ask 'If the blob can figure out so much about the world, and modify its strategies so radically, why does it still want sugar? Why not just decide to desire something more useful, like money, power, and influence?'

[-]milanrosko11mo21

Wow what an honor! Thank you.

[-]Tapatakt11mo20

I think short explanations with good memetic potential can hold a lot of value. Thank you!

Probably will translate this into Russian someday.

[-]Review Bot11mo10

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

[-]Dagon11mo*-23

I kind of like this (and it kind of misses me, but not badly). But the title and first sentence annoyed me enough that I followed instructions and downvoted.

edit: retracting because the title and first sentence are no longer complaints about karma and a request to downvote.

[This comment is no longer endorsed by its author]