
We do want participants in ARENA to have quite a strong interest in AI safety, which is why we ask people to demonstrate some substantial engagement with AI safety agendas (which is what that question is designed to assess). However, we're not looking for perfect answers to either question; nothing should take hours of research if you're consistently engaged with LessWrong/Alignment Forum.

That said, it's not that you must finish the application within 60-90 minutes; that's just a rough estimate of how long it would take someone who's engaged with AI and AI safety to complete it (an estimate which may be wrong, sorry if that was the case!). The estimate doesn't presume that people are doing a lot of research to provide the highest-quality answers to these questions, since that's really not what we're expecting. Of course, you're free to spend as much or as little time as you want on the application.

I think there's a mistake in 17: \sin(x) is not a diffeomorphism between (-\pi,\pi) and (-1,1) (since it is, e.g., not bijective between these domains). Either you mean \sin(x/2), or the interval bounds should be (-\pi/2, \pi/2).
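
For what it's worth, here's a quick check that the \sin(x/2) version works (my own verification, not from the original problem set):

```latex
% Verifying that f(x) = \sin(x/2) is a diffeomorphism (-\pi,\pi) \to (-1,1):
\begin{align*}
  f(x) &= \sin(x/2), &
  f'(x) &= \tfrac{1}{2}\cos(x/2) > 0 \ \text{for } x \in (-\pi,\pi), \\
  f^{-1}(y) &= 2\arcsin(y), &
  (f^{-1})'(y) &= \frac{2}{\sqrt{1-y^2}} \ \text{for } y \in (-1,1).
\end{align*}
% f is smooth and strictly increasing with image (-1,1), and its inverse is
% smooth on all of (-1,1), so f is a diffeomorphism between the two intervals.
```

The other fix works the same way: \sin(x) restricted to (-\pi/2, \pi/2) has positive derivative \cos(x) and smooth inverse \arcsin(y) on (-1,1).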

ARENA might end up teaching this person some mech-interp methods they haven't seen before, although it sounds like they would be more than capable of teaching themselves any mech-interp material. The other potential value-add for your acquaintance would be if they wanted to improve their RL or Evals skills, and to have a week to conduct a capstone project with advisors. If they were mostly aiming to improve their mech-interp ability by doing ARENA, there would probably be better ways to spend their time.

Concretely, the way we see this project going looks something like this:

First things first, we want to build a good enough theoretical background in InfraBayesian Physicalism (IBP). This will ultimately result in something like a distillation of IBP that we will use as a reference, and which we hope others will get a lot of use out of.

In this process, we will be doing most of our testing in a theoretical framework. That is to say, we will be constructing model agents and seeing how IBP actually deals with them in theory: whether it breaks down at any stage (as judged by us), and if so, whether we can fix or avoid those problems somehow.

What comes after this, as we see it at the moment, is trying to implement the principles of IBP in a real-life, honest-to-god Inverse Reinforcement Learning (IRL) proposal. We think IBP stands a good chance of patching some of the largest problems in IRL, which should ultimately be demonstrable by actually making an IRL proposal that works robustly. (When this inevitably fails the first few times, we will probably return to step 1, having gained useful insights, and iterate.)
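
To make "an IRL proposal" concrete for readers unfamiliar with the setup, here's a minimal sketch of a vanilla (non-IBP) IRL baseline: maximum-likelihood reward inference against a Boltzmann-rational demonstrator on a toy chain MDP. Everything here (the environment, the rationality model, the fitting loop) is illustrative and not part of our actual proposal; it's the kind of baseline whose failure modes we hope IBP can patch.

```python
# Minimal maximum-likelihood IRL sketch on a 5-state chain MDP.
# Purely illustrative: a standard non-IBP baseline, not our proposal.
import numpy as np
from scipy.special import logsumexp

N_STATES, N_ACTIONS = 5, 2   # actions: 0 = step left, 1 = step right
GAMMA, BETA = 0.9, 5.0       # discount factor; demonstrator's Boltzmann rationality

def transition(s, a):
    """Deterministic chain dynamics, clipped at the ends."""
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)

def soft_q_values(reward, n_iters=100):
    """Soft value iteration under a Boltzmann-rational policy model."""
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_iters):
        v = logsumexp(BETA * q, axis=1) / BETA  # soft max over actions
        q = np.array([[reward[s] + GAMMA * v[transition(s, a)]
                       for a in range(N_ACTIONS)] for s in range(N_STATES)])
    return q

def log_likelihood(reward, demos):
    """Log-probability of demonstrated (state, action) pairs under the model."""
    q = soft_q_values(reward)
    logp = BETA * q - logsumexp(BETA * q, axis=1, keepdims=True)
    return sum(logp[s, a] for s, a in demos)

# The demonstrator always moves right (true reward peaks at the last state).
demos = [(s, 1) for s in range(N_STATES - 1)] * 10

# Crude gradient-free fit: hill-climb the reward vector on the likelihood.
rng = np.random.default_rng(0)
reward = np.zeros(N_STATES)
best = log_likelihood(reward, demos)
for _ in range(300):
    candidate = reward + 0.1 * rng.standard_normal(N_STATES)
    ll = log_likelihood(candidate, demos)
    if ll > best:
        reward, best = candidate, ll

print("recovered reward (identifiable only up to a constant shift):",
      np.round(reward, 2))
```

Even in this toy setting, the inferred reward is only pinned down up to a constant shift, and the whole pipeline leans on the Boltzmann rationality assumption; mis-specifying the demonstrator model is exactly the kind of problem a more robust approach needs to address.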