Gyrodiot

I'm Jérémy Perret. Based in France. PhD in AI (NLP). AI Safety & EA meetup organizer. Information sponge. Mostly lurking since 2014. Seeking more experience, and eventually a position, in AI safety/governance.

Extremely annoyed by the lack of an explorable framework for AI risk/benefits. Working on that.

Sequences

XiXiDu's AI Risk Interview Series

Wiki Contributions

Comments

Best weekend of the year. Been there in 2017, 2018, 2019, 2023, will be delighted to attend again. Consistent source of excellent discussions, assorted activities, fun and snacks. Does indeed feel like home.

Gyrodiot

My raw and mostly confused/snarky comments as I was going through the paper can be found here (third section).

Cleaner version: this is not a technical agenda. This is not something that would elicit interesting research questions from a technical alignment researcher. There are, however, interesting claims about:

  • what a safe system ought to be like (it proposes three scales describing its reliability);
  • how far up the scales we should aim for, at minimum;
  • how low current large deployed models sit on these scales.

While it positions a variety of technical agendas (mainly those of the co-authors) on the scales, the paper does not advocate for a particular approach, only the broad direction of "here are the properties we would like to have". Uncharitably, it's a reformulation of the problem.

The scales can be useful for comparing the agendas that belong to the "let's prove that the system adheres to this specification" family. The paper makes no claims about what the specification entails, nor about the failure modes of various (combinations of) levels.

I appreciate this paper as a gateway to the related agendas and relevant literature, but I'm not enthusiastic about it.

Here's a spreadsheet version you can copy. Fill in your answers in the "answers" tab and take your screenshot from the "view" tab.

I plan to add more functionality to this (especially a comparison mode, as I collect answers found on the Internet). You can now compare recorded answers! Including yours, if you have filled them in!

I will attempt to collect existing answers, from X and LW/EA comments.

Survey complete! I enjoyed the new questions, this should bring about some pretty graphs. Thank you for coordinating this.

Answer by Gyrodiot

I am producing videos in French (new batch in progress) on my main channel, Suboptimal.

But I also have a side channel in English, Suboptimal Voice, where I do readings of rationalist-sphere content. More may appear in the future; I've received requests for dramatic readings.

The Mindcrime tag might be relevant here! More specific than both concepts you mentioned, though. Which posts discussing them were you alluding to? Might be an opportunity to create an extra tag.

(also, yes, this is an Open Thread, so your comment is in the right place)

Strongly upvoted for the clear write-up, thank you for that, and engagement with a potentially neglected issue.

Following your post I'd distinguish two issues:

(a) Lack of data privacy enabling a powerful future agent to target/manipulate you personally: your data is just there for the taking, stored in not-so-well-protected databases; cross-referencing gets easier at higher capability levels; singling you out and fine-tuning a behavioral model on you in particular isn't hard;

(b) Lack of data privacy enabling a powerful future agent to build that generic behavioral model of humans from the thousands or millions of well-documented examples left by people who aren't particularly bothered by privacy, drawn from the same databases as above plus (semi-)public social media records.

From your deception examples we already have strong evidence that (b) is possible. LLM capabilities will get better, and it will get worse when [redacted plausible scenario because my infohazard policies are ringing].

If (b) comes to pass, I would argue that the marginal effort needed to prevent (a) would only be useful for keeping certain whole coordinated groups of people (who should already be infosec-aware) from being manipulated. Rephrased: there's already a ton of epistemic failure all over the place, but maybe there can be pockets of sanity around critical assets.

I may be missing something as well. Also seconding the Seed webtoon recommendation.
