Bachelor's in general and applied physics. AI safety / agent foundations researcher wannabe.
I love talking to people, and if you are an alignment researcher we will have at least one topic in common (though I am also very interested in talking about topics that are new to me!), so I encourage you to book a call with me: https://calendly.com/roman-malov27/new-meeting
Email: roman.malov27@gmail.com
GitHub: https://github.com/RomanMalov
TG channels (in Russian): https://t.me/healwithcomedy, https://t.me/ai_safety_digest
If I understand correctly, it's the smallest possible mass for a black hole.
Were you thinking of making the model good at answering questions whose correct answer depends on the model itself, like "When asked a question of the form x, what proportion of the time would you tend to answer y?"
I’m not an author of this post, so I don’t know.
I think one of the biggest dangers of this kind of self-awareness is that it lets models know their level of accuracy in particular areas. Right now they can be overconfident or underconfident in their abilities, which makes their plans less effective when actually implemented: if a model is overconfident, a plan that relies on the overestimated ability will simply fail; if it is underconfident, it won't be using all of its capabilities.
By giving it more information about itself, i.e. self-awareness, which is a big part of situational awareness.
02/01/2026
I wrote a part of a future post on probabilistic maths systems.
Please remember how strange all of this is, in order to understand how strange it all can get.
We want world models to be:
1. understandable to humans,
2. accurate.
But those properties are in tension with one another. If we aim for the first property, the most intuitive approach is to encode the concepts we understand. In that case, we end up with GOFAI, one of the main problems of which is that the mental world it lives in is very limited. If we aim for the second property, we end up with tangled messes like NNs (which are directly optimized for accuracy), and it's hard for humans to understand the concepts in the model.
We can think of this as a continuum in the level of specification of the prior. We can make the prior very crisp and understandable, but then it's not rich enough to learn the most accurate representation of the world. Or the prior can be rich and able to learn the world accurately, but then we can't really encode that much into it, and therefore don't know that much about the world model it learned.
I hope that theories of abstraction (mentioned here) can help with this problem, but I'm not sure what the specific path is.
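To make the tension concrete, here is a toy sketch of my own (not from any of the posts mentioned above): a hand-specified "GOFAI-style" model whose every symbol we understand but whose mental world is too simple, next to a random-features regression standing in for an NN, which fits the data better but whose learned weights carry no human-legible concepts. Only numpy is assumed; the "drag" world and both model choices are illustrative assumptions, not anyone's actual proposal.

```python
# Toy illustration of the understandability/accuracy tension in world models.
import numpy as np

rng = np.random.default_rng(0)

# "World": projectile range as a function of launch angle, with a drag-like
# correction that the hand-coded model does not know about.
angles = np.linspace(0.1, 1.4, 200)                      # radians
true_range = np.sin(2 * angles) * np.exp(-0.3 * angles)  # unknown correction
data = true_range + rng.normal(0, 0.01, angles.shape)

# Crisp, understandable prior: ideal projectile physics, no free parameters.
def gofai_model(theta):
    return np.sin(2 * theta)  # every symbol here means something to us

# Rich, opaque prior: a random-features regression (stand-in for an NN).
# We can measure its accuracy, but the fitted weights carry no legible concepts.
features = np.tanh(rng.normal(size=(50, 1)) * angles + rng.normal(size=(50, 1)))
weights, *_ = np.linalg.lstsq(features.T, data, rcond=None)
blackbox_pred = features.T @ weights

print("GOFAI mean squared error:    ", np.mean((gofai_model(angles) - data) ** 2))
print("Black-box mean squared error:", np.mean((blackbox_pred - data) ** 2))
```

Running this, the hand-coded model has visibly higher error while staying fully legible, and the fitted model is close to the noise floor while being a tangle of weights, which is the trade-off along the prior-specification continuum described above.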
the "one try" framing ignores that iterative safety improvements have empirically worked (GPT-1→GPT-5).
That's because we aren't in the superintelligent regime yet.
Daily Research Diary
In the comments to this quick take, I am planning to report on my intellectual journey: what I read, what I learned, what exercises I’ve done, and which projects or research problems I worked on. Thanks to @TristianTrim for suggesting the idea. Feel free to comment with anything you think might be helpful or relevant.