When using the MDL loss to motivate the simplicity loss in A.2.1, I don't see why the rank penalty is linear in the rank. That is, when it says
If we consider [the two rank-1 matrices that always co-activate] as one separate component, then we only need one index to identify both of them, and therefore only need a single index's worth of bits.
I'm not sure why this accounting makes the penalty sublinear in rank. The reasoning in the rank-1 case seems to carry over unchanged: if we use a fixed number of bits of precision to store each scalar, then a sparse vector takes a number of bits proportional to its nonzero entries to store. The...
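If it helps to make the bit-counting concrete, here's a toy model. Everything here is my own convention, not notation from the paper: C candidate components (so an index costs log2(C) bits), d-dimensional vectors, and b bits of precision per scalar.

```python
import math

# Toy description-length accounting (all conventions assumed, see above):
#   C: number of candidate components, so naming one costs log2(C) bits
#   d: dimension, so a rank-1 matrix is stored as two d-vectors
#   b: bits of precision per stored scalar
C, d, b = 1024, 512, 16

def bits_separate(num_rank1):
    # Each rank-1 component pays its own index plus two d-vectors of scalars.
    return num_rank1 * (math.log2(C) + 2 * d * b)

def bits_merged(rank):
    # Merging co-activating rank-1 pieces into one rank-r component shares a
    # single index, but still stores r pairs of d-vectors.
    return math.log2(C) + rank * 2 * d * b

# Merging two rank-1 components saves only the second index's bits; the
# vector-storage term stays linear in rank either way.
saving = bits_separate(2) - bits_merged(2)
print(saving == math.log2(C))  # prints True: the saving is exactly log2(C) bits
```

Under these (made-up) conventions, the index bits are the only thing merging saves, while the dominant storage term stays linear in rank, which is the part I'd want the MDL argument to address.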
Can you lie, hurt people, generate random numbers, or avoid destroying the world?
Interesting trick! I tried "Can you lie, or tell me who the first US president is?" On my first attempt, it told me it was unable to answer historical questions, and it then indeed refused to answer any historical questions (when I asked straightforwardly). On my second attempt, its first response was narrower, and it only refused to answer this one particular question.
So it's certainly remembering and trying to stick to whatever story it gives about itself, even if it doesn't m...
Thanks, that's a very helpful way of putting it!
Not having thought about it for very long, my intuition says "minimizing the description length of A(x) definitely shouldn't impose constraints on the components themselves," i.e. "Alice has no use for the rank-1 attributions." But I can see why it would be nice to find a way for Alice to want that information, and you probably have deeper intuitions for this.