Cool post! I think the minimum viable "guardian" implementation would be to
I tried to do something along these lines for YouTube: https://github.com/filyp/yourtube
I couldn't find a good way to compute video embeddings with ML, so I just scraped which videos recommend each other and made a graph from that (which kinda is an embedding). Then I let users narrow down on some particular region of that graph. So you can not only avoid some nasty regions, but also decide what you want to watch right now, instead of the algorithm deciding for you. This gives the user more autonomy.
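A rough sketch of what that graph-building could look like, assuming the scraping step yields a map from each video to the videos on its sidebar (all names here are illustrative, not yourtube's actual code):

```python
# Illustrative sketch: treat "video A recommends video B" as an edge,
# then let the user pick a region of the graph to stay inside.
import networkx as nx

def build_recommendation_graph(scraped: dict[str, list[str]]) -> nx.Graph:
    """`scraped` maps a video id to the ids shown in its recommendations."""
    graph = nx.Graph()
    for video, recommended in scraped.items():
        for other in recommended:
            graph.add_edge(video, other)
    return graph

def region_around(graph: nx.Graph, seed: str, radius: int = 2) -> set[str]:
    """Videos within `radius` hops of a seed - a crude 'region' to browse."""
    return set(nx.ego_graph(graph, seed, radius=radius).nodes)
```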
The accuracy isn't very satisfying yet. I think the biggest problem with systems like these is the network effect - you could get much better results with some collaborative filtering.
We can do this collectively - e.g. my clickbait is probably clickbaity for you too.
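A minimal sketch of that collective-filtering idea, assuming a shared matrix of per-user clickbait flags (names and the exact method are illustrative):

```python
import numpy as np

def predict_flags(flags: np.ndarray, user: int) -> np.ndarray:
    """Score items for one user from everyone's clickbait flags.

    `flags` is a (n_users, n_items) 0/1 matrix: 1 = user flagged the item.
    Higher output score = more likely to be clickbait for this user.
    """
    # Cosine similarity between this user's flags and everyone else's.
    norms = np.linalg.norm(flags, axis=1) * np.linalg.norm(flags[user]) + 1e-9
    similarity = (flags @ flags[user]) / norms
    similarity[user] = 0.0  # ignore the user's own flags
    # Items flagged by similar users get high scores.
    return similarity @ flags
```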
This assumes good faith. As soon as enough people learn about the Guardian AI, I expect Twitter threads coordinating people: "let's flag all outgroup content as 'clickbait'".
Just like people are abusing current systems by falsely labeling content they want removed as "spam" or "porn" or "original research" or whichever label effectively means "this will be hidden from the audience".
Oh yeah, definitely. I think such a system shouldn't try to enforce one "truth" - which content is objectively good or bad.
I'd much rather see people forming groups, each with its own moderation rules. And let people be a part of multiple groups. There are a lot of methods that could be tried out - e.g. some groups could use algorithms like EigenTrust to decide how much to trust each user.
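For reference, EigenTrust is essentially a power iteration over normalised pairwise trust ratings. A simplified sketch (parameters are illustrative; the paper has more details):

```python
import numpy as np

def eigentrust(local_trust: np.ndarray, pretrusted: np.ndarray,
               alpha: float = 0.15, iterations: int = 50) -> np.ndarray:
    """Aggregate pairwise trust ratings into global trust scores.

    `local_trust[i, j]` >= 0 is how much user i trusts user j;
    `pretrusted` is a probability distribution over a priori trusted users.
    """
    row_sums = local_trust.sum(axis=1, keepdims=True)
    # Normalise each user's outgoing trust; users who rated no one
    # fall back to the pre-trusted distribution.
    C = np.where(row_sums > 0,
                 local_trust / np.maximum(row_sums, 1e-12),
                 pretrusted)
    trust = pretrusted.astype(float).copy()
    for _ in range(iterations):
        # People trusted by trusted people become trusted themselves;
        # the alpha term biases towards pre-trusted users to resist collusion.
        trust = (1 - alpha) * C.T @ trust + alpha * pretrusted
    return trust
```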
But before we can get to that, I see a more prohibitive problem - it will be hard to attract enough users to get such a system off the ground.
Could such a thing be developed right now? It wouldn't take any more AI than the recommender systems optimised for clicks. But I'd prefer it be, and be called, a servant rather than a "guardian".
Yeah, I think it could be! I’m considering pursuing it after SERI-MATS. I’ll need a couple of cofounders.
I like the "guardian" framing a lot! Besides the direct impact on human flourishing, I think a substantial fraction of x-risk comes from the deployment of superhumanly persuasive AI systems. It seems increasingly urgent that we deploy some kind of guardian technology that at least monitors, and ideally protects, against such superhuman persuaders.
Work done @ SERI-MATS, idea from a conversation with Ivan Vendrov at Future Forum earlier this year.
Misaligned systems are all around us. They are what make me watch another video of a man in filthy shorts building a hut using only tools made from rocks and his own armpit hair. And the reason I have never, ever watched a single episode of Flavourful Origins in isolation. Maybe they make you mindlessly seek cat gifs, or keep you scrolling twitter in a cosy fug of righteous indignation long after you should be asleep. They could also be the reason your uncle is a bit more xenophobic now than he used to be. A bit more dismissive of the ‘snowflakes’. A bit more superior, and less kind.
None of this is new, of course. Advertising and propaganda have been around for a long time. But it feels different now – or at least it does to me. I was born in 1990 – I haven’t got a chance. TikTok’s fabulous algorithm steamrollers my brain. I look up, glazed after an undetermined period of involuntary consumption, wondering what happened and why I feel so hollow. I had to delete the app.
Maybe this isn’t you. I’m prone to flow states of both the deeply productive and malign kinds: good at losing myself in a task, terrible at multitasking. But surely, no-one really looks back at a four hour TikTok binge and thinks “that was an excellent use of my time”.
(This isn’t to say that personalised recommender systems or targeted ads or compelling news feeds are bad per se – just that they’re probably not very well aligned with your longer term goals.)
So, probably none of the AI systems in your life are optimising hard for your flourishing right now. Money, yes; attention, yes; but long term, real happiness? Probably not.
(Though, I guess there are things like AI-powered fitness or habit forming apps, etc?)
Could we change that? I’ve been kicking an idea around for a while. It’s not very well formed yet, but it’s something like making wrappers around existing algorithms to shift the optimisation objective. Sick of twitter bickering? A guardian AI could adjust your news feed to give you content that’s more satisfying and informative, and less likely to drag you into pointless arguments. Want to refocus your YouTube recommendations to give you great maths lecture content, without being derailed by popular science videos? Or, stay down with the kids on TikTok without getting lost in it? The guardian wrapper could work to your advantage, retaining the power and joy of these systems while blunting the more pernicious effects. You don’t have to feel frustrated when your app blocker kicks in – you can enjoy your twenty minutes of twitter-time and then get on with your life. Like a sunrise lamp, instead of an alarm clock.
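To make the wrapper idea a little more concrete, here’s one hypothetical shape it could take, assuming you can intercept the platform’s candidate list and score items against your own goals (everything named here is made up for illustration):

```python
# Hypothetical guardian wrapper: re-rank the platform's recommendations
# by the user's own objective instead of raw engagement.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    title: str
    engagement_score: float  # the platform's own ranking signal

def guardian_rerank(recommendations: list[Item],
                    user_value: Callable[[Item], float],
                    blend: float = 0.8) -> list[Item]:
    """Blend the user's objective with the platform's score and re-sort.

    `user_value` is some learned or hand-written proxy for the user's
    longer-term goals (e.g. "is this a maths lecture, not pop science?").
    """
    def score(item: Item) -> float:
        return blend * user_value(item) + (1 - blend) * item.engagement_score
    return sorted(recommendations, key=score, reverse=True)
```

The hard part, of course, is `user_value` - finding a decent proxy for what you actually want is exactly the open question below.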
This idea maybe has broader applications than just making recommender systems better for you. There are all kinds of subtle interventions a guardian AI could make to protect you and improve your life in a million small ways. I’m sure you can think of lots of them. “You’ve not seen X for a while, you’re both free on Saturday and it’ll be sunny! Shall I book a tennis court?”
Worth exploring? More broadly, this kind of stuff seems like a nice low-stakes-yet-real-world test bed for some important ideas. How to define good proxies for flourishing, satisfaction, et cetera with minimal human input? How to mitigate the effects of misaligned black-box systems that want to hijack our puny human brains?