In the case of the 1,253,725,569th digit of pi, if I try to construct a probability experiment consisting only of checking this particular digit, I fail to model my uncertainty, as I don't yet know what the value of this digit is.
Ok, let me see if I'm understanding this correctly: if the experiment is checking the X-th digit specifically, you know that it must be a specific digit, but you don't know which, so you can't make a coherent model. So you generalize up to checking an arbitrary digit, where you know that the results are distributed evenly among {0......
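The "distributed evenly among the ten digits" intuition can be sanity-checked empirically. Below is a sketch in pure Python using Gibbons' unbounded spigot algorithm to generate decimal digits of pi and tally their frequencies; the specific digit counts will only be roughly uniform at this sample size, and the function name `pi_digits` is just my own label for the sketch.

```python
from collections import Counter

def pi_digits(n):
    """First n decimal digits of pi via Gibbons' unbounded spigot algorithm."""
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    out = []
    while len(out) < n:
        if 4 * q + r - t < m * t:
            out.append(m)  # the next digit is now settled; shift the state by 10
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            # consume one more term of the underlying series
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)
    return out

digits = pi_digits(2000)
print(digits[:10])                  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
print(sorted(Counter(digits).items()))  # each digit appears roughly 200 times
```

Note the distinction the thread is drawing: this tells you about the frequency of digits across positions, which grounds the "arbitrary digit" experiment; it says nothing about the value at any one particular position.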
Looks like "Tell me about yourself: LLMs are aware of their learned behaviors" investigates a similar topic, but finds the complete opposite result: if you fine-tune an LLM to have a specific unusual behaviour, without explicitly spelling out what the behaviour is, the LLM is able to accurately describe the unusual behaviour.
I wonder if the difference is that they fine-tuned it to exhibit a specific behaviour, whereas you and I were testing with the off-the-shelf model? Perhaps if there's not an obvious-enough behaviour for the AI to home in on, it can deve...
The disconnect between stated and revealed preferences across all models suggests frontier models either misrepresent their actual preferences or lack accurate self-understanding of their own behaviors—both concerning possibilities for alignment research.
Bit of a tangent, but I suspect the latter. As part of my previous project (see "Round 2"), I asked GPT-4o to determine if a code sample was written normally by another instance of GPT-4o, or if it contained a suspicious feature. On about 1/5 of the "normal" samples, GPT-4o incorrectly asserted that the co...
I'm supposed to account for all the relevant information and ignore all the irrelevant.
Is there a formal way you'd define this? My first attempt is something like "information that, if it were different, would change my answer". E.g. knowing the coin is biased 2:1 vs 3:1 doesn't change your probability, so it's irrelevant; knowing the coin is biased 2:1 for heads vs 2:1 for tails changes your probability, so it's relevant.
Or maybe it should be considered from the perspective of reducing the sample space? Is knowing the coin is biased vs knowing it's biased...
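The "would change my answer" definition from above can be made concrete with a toy calculation. In the sketch below, the symmetric 50/50 prior over bias direction is my own assumption (it's what makes the magnitude of the bias irrelevant on its own); `p_heads` is a made-up helper name.

```python
from fractions import Fraction

def p_heads(bias, direction=None):
    """P(heads) for a coin biased bias:1, where direction is 'heads',
    'tails', or None (unknown direction, assumed 50/50 either way -
    that symmetric prior is my assumption, not anything in the thread)."""
    p_if_heads = Fraction(bias, bias + 1)  # bias:1 toward heads
    p_if_tails = Fraction(1, bias + 1)     # bias:1 toward tails
    if direction == 'heads':
        return p_if_heads
    if direction == 'tails':
        return p_if_tails
    # marginalize over the unknown direction
    return Fraction(1, 2) * p_if_heads + Fraction(1, 2) * p_if_tails

# Magnitude alone is irrelevant: the answer is 1/2 either way.
print(p_heads(2), p_heads(3))                    # 1/2 1/2
# Direction is relevant: it changes the answer.
print(p_heads(2, 'heads'), p_heads(2, 'tails'))  # 2/3 1/3
```

Under this framing, "relevant" information is exactly the information whose different possible values lead `p_heads` to different outputs.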
Ok, I've been thinking about this more and I've figured out what exactly was bugging me about this red-teaming scheme. Is it fair to say you assume that your red team will naturally cover the same area of solution-space as actual malicious AIs aiming to self-coordinate?
If yes, I think that assumption is faulty. Self-coordinating AIs will have access to classes of strategies that a human red-teamer would not think of unless they spent significant effort to "think like the enemy", so to speak. These strategies will be usable even if the AI isn't much smarter...
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers
Yeah, this is the case I'm thinking about. I think on longer outputs it will be possible for the AI to use a more nebulous sense of "style", rather than only relying on specific features like "using X secret phrase" (which are presumably easier to redact away).
I think this could be countered by paraphrasing at a higher level - i.e. rewriting entire paragraphs from scratch rathe...
it needs the ability to recursively handle instrumental subgoals anyway, so why have a whole extra different kind of goal at the top?
I'm picturing this somewhat differently. I imagine each goal as a node in a directed graph, with zero or more arrows pointing toward its supergoal(s). Under this construction, we have a natural separation between instrumental goals (nodes with at least one outgoing arrow) and terminal goals (nodes with no outgoing arrows), even though they're not fundamentally different "types" of things.
So the question of "how do you bu...
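The directed-graph picture above is easy to sketch directly. The goal names below are made-up illustrations, not anything from the thread; the point is only that "terminal" falls out as "node with no outgoing arrows" rather than as a separate type.

```python
# Goals are nodes; edges point from a goal to its supergoal(s).
supergoals = {
    "earn money":          ["buy compute"],
    "buy compute":         ["run experiments"],
    "stay operational":    ["run experiments", "maintain reputation"],
    "run experiments":     [],  # no outgoing arrows -> terminal
    "maintain reputation": [],  # no outgoing arrows -> terminal
}

def terminal_goals(graph):
    """Terminal goals: nodes with no outgoing arrows."""
    return {g for g, supers in graph.items() if not supers}

def instrumental_goals(graph):
    """Instrumental goals: nodes with at least one outgoing arrow."""
    return {g for g, supers in graph.items() if supers}

print(terminal_goals(supergoals))  # {'run experiments', 'maintain reputation'}
```

Both kinds of node are handled by the same machinery; the distinction is purely structural.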
Hi, thank you very much for the response! I read through the post you linked, and also the steganography/paraphrasing post linked therein, and I see your point about redaction. I agree that in this intelligence regime, it's likely that we could come up with redaction schemes that very effectively prevent AIs from hiding information in their outputs.
However, I still have some concerns about generating synthetic inputs. Specifically this section:
...At any point where it naively seems like there might be an extremely good opportunity to behave badly, or where yo
First off, thank you for this series. You've thoroughly dissolved the paradox for me. I was initially a thirder despite suspicions about Elga's original argument (he seemed to equivocate between being awakened in general and being first awakened/awakened on Monday), because the probabilities for Lewis' halfer argument (the 1/2, 1/4, 1/4 split) made no sense no matter how I did the math. I now understand that both sides were making the same error in their models, and that while the thirder position is correctly measuring something, it's not your credence in...
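One way to see what each camp's number actually tracks is a quick Monte Carlo of the standard protocol (heads: one awakening; tails: two). The simulation is my own illustration, not from the series: the halfer number is a per-experiment frequency, while the thirder number is a per-awakening frequency, so both are "correctly measuring something" but different things.

```python
import random

random.seed(0)
trials = 100_000
heads_runs = 0        # experiments where the coin landed heads
awakenings = 0        # total awakenings across all experiments
heads_awakenings = 0  # awakenings that occur in a heads-experiment

for _ in range(trials):
    heads = random.random() < 0.5
    n_awake = 1 if heads else 2  # heads: Monday only; tails: Monday and Tuesday
    heads_runs += 1 if heads else 0
    awakenings += n_awake
    heads_awakenings += 1 if heads else 0

print(heads_runs / trials)            # ~0.5: per-experiment frequency of heads
print(heads_awakenings / awakenings)  # ~1/3: per-awakening frequency of heads
```

Which of these frequencies deserves the name "credence" is exactly what the series argues about; the simulation only shows that both numbers are well-defined quantities of the same process.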
and have no particular reason to think those people are very unusual in terms of cooking-skill
Yeah, that's what I was trying to get at with the typical-friend-group stuff. The people you know well aren't a uniform sample of all people, so you have no reason to conclude that their X-skill is normal for any arbitrary X.
The problem (or maybe just my problem) is that when I say "average", it feels like it's activating my concept of the mathematical average (sum divided by count), even though the actual thing I'm thinking of is "typical member of a class extracted from my mental model". I find myself treating the "average" as if it came from real data even when it didn't.
(including just using the default settings, or uninstalling the app and choosing an easier one).
I feel like this was meant to be in jest, but if you were serious: my goal was to beat the stage with certain self-imposed constraints, so relaxing those constraints would've defeated the purpose of the exercise.
Anyways, it wasn't my friend's specific change that was surprising, it was the fact that he had come up with an optimization at all. My strategy was technically viable but incredibly tedious, and even though I knew it was tedious I was planning on trying...
Oops, that's my bad for not double-checking the definitions before I wrote that comment. I think the distinction I was getting at was more like known unknowns vs unknown unknowns, which isn't relevant in platonic-ideal probability experiments like the ones we're discussing here, but is useful in real-world situa...