Bachelor's in general and applied physics. Aspiring AI safety / agent foundations researcher.
I love talking to people, and if you are an alignment researcher we will have at least one topic in common (though I am also very interested in talking about topics that are new to me!), so I encourage you to book a call with me: https://calendly.com/roman-malov27/new-meeting
Email: roman.malov27@gmail.com
GitHub: https://github.com/RomanMalov
TG channels (in Russian): https://t.me/healwithcomedy, https://t.me/ai_safety_digest
Is there something in particular that you think could be more distilled?
What I had in mind is something like a more detailed explanation of recent reward hacking/misalignment results. Like, sure, we have old arguments about reward hacking and misalignment, but what I want is more gears for when particular kinds of reward hacking would happen in which model classes.
Maybe check MIRI, PIBBSS, ARC (theoretical research), and Conjecture; also check who went to ILIAD.
Those are top-down approaches, where you have an idea and then do research on it. That is of course useful, but it's doing more frontier research by expanding the surface area. Trying to apply my distillation intuition to them would be like having some overarching theory unifying all approaches, which seems super hard and maybe not even possible. But looking at the intersections of pairs of agendas might prove useful.
But staying on the frontier seems to be a really hard job. Lots of new research comes out every day, and scientists struggle to follow it. New research has lots of value while it's hot, and loses it as the field progresses and that research becomes part of the general theory (which is a much more worthwhile thing to spend time learning).
Which raises the question: if you are not currently at the cutting edge and actively advancing your field, why follow new research at all? After a while, the field will condense the most important and useful research into neat textbooks and overview articles, and reading those when they appear is a much more efficient use of time. While you are not at the cutting edge, read condensations of previous work until you get there.
Also, it seems like there is not much of that in the field of alignment. I want there to be more work on unifying (previously frontier) alignment research and more effort to construct paradigms in this preparadigmatic field (but maybe I just haven't looked hard enough).
It doesn't matter if you want to dance at your friend's wedding; if you think the wedding would be "better" if more people danced, and your dancing would meaningfully make others more likely to dance, you should be dancing. You should incorporate the positive externality of the social contagion effect of your actions into most things you do (e.g. whether you should drink alcohol, bike, use Twitter, etc.).
Yes! I wish more people adopted FDT/UDT-style decision theory. We already (to some extent, and not deliberately) borrow wisdom from timeless decision theories (e.g. "treat others as you would like them to treat you", "if everybody thought like that, the world would be on fire", etc.), but not for small-scale, low-stakes social situations, and that is exactly the point you bring up here.
I haven't fully thought it out, but there might be a counterargument in the style of the anti-Pascal's-mugging counterarguments: if your priors say that you might be being modeled by a hostile entity, there is an incentive to confuse it, and it all (somehow) balances out, so you should just always use your decision theory as if you are real.
Great post!
It ironically has lots of YouTube links, and when I instinctively clicked on one, I was stopped by the LeechBlock plugin I installed to cut down my YouTube-related screen time (on Rob Miles's advice).
I would also add: if your messages in a group chat spark lots of conversation every time, that chat is underexploited (you are not sending enough messages / you are thinking too much before sending).
(Conversely, if your messages are ignored, spend a little bit more time on them)