All of jwfiredragon's Comments + Replies

That's not how people usually use these terms. The uncertainty about the state of the coin after the toss is describable within the framework of possible worlds, just like uncertainty about a future coin toss, but uncertainty about a digit of pi isn't.

Oops, that's my bad for not double-checking the definitions before I wrote that comment. I think the distinction I was getting at was more like known unknowns vs unknown unknowns, which isn't relevant in platonic-ideal probability experiments like the ones we're discussing here, but is useful in real-world situa... (read more)

In the case of the 1,253,725,569th digit of pi, if I try to construct a probability experiment consisting only of checking this particular digit, I fail to model my uncertainty, as I don't yet know the value of this digit.

Ok, let me see if I'm understanding this correctly: if the experiment is checking the X-th digit specifically, you know that it must be a specific digit, but you don't know which, so you can't make a coherent model. So you generalize up to checking an arbitrary digit, where you know that the results are distributed evenly among {0...... (read more)

3 Ape in the coat
Basically yes. Strictly speaking it's not just any arbitrary digit, but any digit your knowledge about whose value works the same way as your knowledge about the value of X. For any digit you can execute this algorithm:

Check whether you know more (or less) about it than you know about X.
Yes: Go to the next digit.
No: Add it to the probability experiment.

As a result you get a bunch of digits about whose values you knew as much as you know about X, and so you can use them to estimate your credence for X.

Yes. As I say in the post: The fact that a lot of Bayesians mock Frequentists for not being able to conceptualize the probability of a coin of unknown fairness, and then make the exact same mistake by failing to conceptualize the probability of a specific digit of pi whose value is unknown, has always appeared quite ironic to me.

I think we did!

That's not how people usually use these terms. The uncertainty about the state of the coin after the toss is describable within the framework of possible worlds, just like uncertainty about a future coin toss, but uncertainty about a digit of pi isn't.

Moreover, isn't it the same before the flip? It's not that a coin toss is "objectively random". At the very least, the answer also exists in the future, and all you need is to wait a bit for it to be revealed. The core principle is the same: there is in fact some value that the Probability Experiment function takes in this iteration, but you don't know which. You can take some actions (look under the box, do some computation, just wait for a couple of seconds) to learn the answer. But you can also reason about the state of your current uncertainty before these actions are taken.
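The selection rule described in the reply above can be sketched in code. This is a minimal illustration of my own (the function name and the way "knowledge" is encoded are assumptions, not from the post): a digit position joins the probability experiment exactly when your knowledge about it matches your knowledge about X.

```python
def build_probability_experiment(positions, knowledge, x):
    """knowledge maps a digit position to what you know about that
    digit (None = nothing known). A position joins the experiment
    iff your knowledge about it matches your knowledge about x."""
    experiment = []
    for pos in positions:
        if knowledge.get(pos) == knowledge.get(x):
            experiment.append(pos)
    return experiment

# Suppose we know the digits at positions 1-3 of pi (3, 1, 4) but
# nothing about positions 4-8, and we ask about x = 6:
knowledge = {1: 3, 2: 1, 3: 4}
print(build_probability_experiment(range(1, 9), knowledge, x=6))
# [4, 5, 6, 7, 8] — all the positions we are equally ignorant about
```

The known digits are excluded because we know strictly more about them than about X; the remaining ensemble is then what grounds the credence for X.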

Looks like Tell me about yourself: LLMs are aware of their learned behaviors investigates a similar topic, but finds the complete opposite result - if you fine-tune an LLM to have a specific unusual behaviour, without explicitly spelling out what the behaviour is, the LLM is able to accurately describe the unusual behaviour.

I wonder if the difference is that they fine-tuned it to exhibit a specific behaviour, whereas you and I were testing with the off-the-shelf model? Perhaps if there's not an obvious-enough behaviour for the AI to home in on, it can deve... (read more)

2 Florian_Dietz
I don't think these two are necessarily contradictory. I imagine it could go like so:

"Tell me about yourself": When the LLM is finetuned on a specific behavior, gradient descent during reinforcement learning strengthens whatever concepts are already present in the model and most likely to immediately lead to that behavior if strengthened. These same neurons are very likely also connected to neurons that describe other behavior. Both the "acting on a behavior" and the "describing a behavior" mechanisms already exist; it's just that this particular combination of behaviors has not been seen before.

Plausible-sounding completions: It is easier, and usually accurate enough, to report your own behavior based on heuristics rather than by simulating the scenario in your head. The information is there, but the network tries to find a response in a single pass rather than self-reflecting. Humans do the same thing: "Would you do X?" often gets parsed as the simple "Am I the type of person who does X?" and not as the more accurate "Carefully consider scenario X and go through all confounding factors you can think of before responding".

The disconnect between stated and revealed preferences across all models suggests frontier models either misrepresent their actual preferences or lack accurate self-understanding of their own behaviors—both concerning possibilities for alignment research.

Bit of a tangent, but I suspect the latter. As part of my previous project (see "Round 2"), I asked GPT-4o to determine if a code sample was written normally by another instance of GPT-4o, or if it contained a suspicious feature. On about 1/5 of the "normal" samples, GPT-4o incorrectly asserted that the co... (read more)

3 jwfiredragon
Looks like Tell me about yourself: LLMs are aware of their learned behaviors investigates a similar topic, but finds the complete opposite result - if you fine-tune an LLM to have a specific unusual behaviour, without explicitly spelling out what the behaviour is, the LLM is able to accurately describe the unusual behaviour. I wonder if the difference is that they fine-tuned it to exhibit a specific behaviour, whereas you and I were testing with the off-the-shelf model? Perhaps if there's not an obvious-enough behaviour for the AI to home in on, it can develop a gap between what it says it would do and what it would actually do?

I'm supposed to account for all the relevant information and ignore all the irrelevant.

Is there a formal way you'd define this? My first attempt is something like "information that, if it were different, would change my answer". E.g. knowing the coin is biased 2:1 vs 3:1 doesn't change your probability, so it's irrelevant; knowing the coin is biased 2:1 for heads vs 2:1 for tails changes your probability, so it's relevant.

Or maybe it should be considered from the perspective of reducing the sample space? Is knowing the coin is biased vs knowing it's biased... (read more)
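As a toy check of the relevance criterion above, here is a sketch of my own (function name and encoding are assumed, not from the discussion): averaging over the unknown bias direction shows that bias magnitude alone leaves P(heads) at 1/2, so it's irrelevant, while learning the direction shifts the probability, so it's relevant.

```python
from fractions import Fraction

def p_heads(bias, direction_known, favors_heads=None):
    """P(heads) for a coin biased bias:1. If the direction is unknown,
    we average over both directions (assuming a 50/50 prior on which
    side the bias favors)."""
    ph = Fraction(bias, bias + 1)  # P(heads) if biased toward heads
    pt = Fraction(1, bias + 1)    # P(heads) if biased toward tails
    if direction_known:
        return ph if favors_heads else pt
    return (ph + pt) / 2

print(p_heads(2, direction_known=False))  # 1/2 — magnitude alone is irrelevant
print(p_heads(3, direction_known=False))  # 1/2 — still irrelevant
print(p_heads(2, direction_known=True, favors_heads=True))  # 2/3 — direction is relevant
```

Under the "would my answer change?" test, swapping 2:1 for 3:1 leaves the output fixed at 1/2, while swapping the favored side moves it from 1/3 to 2/3.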

5 Ape in the coat
I'd say that the rule is: "To construct a probability experiment, use the minimum generalization that still allows you to model your uncertainty."

In the case of the 1,253,725,569th digit of pi, if I try to construct a probability experiment consisting only of checking this particular digit, I fail to model my uncertainty, as I don't yet know the value of this digit. So instead I use the more general probability experiment of checking any digit of pi that I don't know. This allows me to account for my uncertainty.

Now, I may worry that I overdid it and have abstracted away some relevant information, so I check:
- Does knowing that the digit in question is specifically the 1,253,725,569th affect my credence?
- Not until I receive some evidence about the state of specifically the 1,253,725,569th digit of pi.
- So until then this information is not relevant.

Yes. I'm making this point here: There is no particular need to talk about logical and physical uncertainty as different things. It's just a historical artifact of the confused philosophical approach of possible worlds, and I'm presenting a better way.

Even this difference is not real. Consider: This is physical uncertainty, and yet I do know how to find the answer: all I need is to remove the opaque box and look. Nevertheless, I can talk about my credence before I have looked at the coin. The exact same situation holds with not knowing a particular digit of pi. Yes, I do know a way to find the answer: google an algorithm for calculating any digit of pi and give it my digit as input. Nevertheless, I can still talk about my credence before I have performed all these actions.

Ok, I've been thinking about this more and I've figured out what exactly was bugging me about this red-teaming scheme. Is it fair to say you assume that your red team will naturally cover the same area of solution-space as actual malicious AIs aiming to self-coordinate?

If yes, I think that assumption is faulty. Self-coordinating AIs will have access to classes of strategies that a human red-teamer would not think of unless they spent significant effort to "think like the enemy", so to speak. These strategies will be usable even if the AI isn't much smarter... (read more)

Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers

Yeah, this is the case I'm thinking about. I think on longer outputs it will be possible for the AI to use a more nebulous sense of "style", rather than only relying on specific features like "using X secret phrase" (which are presumably easier to redact away).

I think this could be countered by paraphrasing at a higher level - i.e. rewriting entire paragraphs from scratch rathe... (read more)

it needs the ability to recursively handle instrumental subgoals anyway, so why have a whole extra different kind of goal at the top?

I'm picturing this somewhat differently. I imagine each goal as a node in a directed graph, with zero or more arrows pointing toward its supergoal(s). Under this construction, we have a natural separation between instrumental goals (nodes with at least one outgoing arrow) and terminal goals (nodes with no outgoing arrows), even though they're not fundamentally different "types" of things.

So the question of "how do you bu... (read more)
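The graph picture above can be made concrete. This is a minimal sketch with made-up goals (the node names are hypothetical, purely for illustration): each goal lists the supergoals its arrows point to, and the terminal goals fall out as exactly the nodes with no outgoing arrows.

```python
# Goals as a directed graph: each node maps to the supergoals it
# points toward. No separate "type" for terminal goals is needed;
# they are simply the nodes with no outgoing arrows.
supergoals = {
    "earn money": ["be comfortable"],
    "learn to cook": ["eat well", "be comfortable"],
    "eat well": ["be healthy"],
    "be comfortable": [],  # no outgoing arrows -> terminal
    "be healthy": [],      # terminal
}

terminal = {g for g, outs in supergoals.items() if not outs}
instrumental = set(supergoals) - terminal

print(sorted(terminal))      # ['be comfortable', 'be healthy']
print(sorted(instrumental))  # ['earn money', 'eat well', 'learn to cook']
```

Note that a single node can serve multiple supergoals (here "learn to cook" points at two), so the separation is a property of the graph's shape, not of the goals themselves.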

Hi, thank you very much for the response! I read through the post you linked, and also the steganography/paraphrasing post linked therein, and I see your point about redaction. I agree that in this intelligence regime, it's likely that we could come up with redaction schemes that very effectively prevent AIs from hiding information in their outputs.

However, I still have some concerns about generating synthetic inputs. Specifically this section:

At any point where it naively seems like there might be an extremely good opportunity to behave badly, or where yo

... (read more)
3 Fabien Roger
I agree self-coordination is easier, mostly because of shared secret information (e.g. about the hidden scheme used for communication). And in the paper, we directly test for this setting by letting the red-team AIs agree beforehand on what coordination scheme to use. We find that even in this relatively hard and conservative case, redaction somewhat works. So option 3.5: if the redaction scheme is good, the AI won't know if its inputs are coming from itself anyway; but we test for it by letting the red-team AIs coordinate on secrets.

Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers (and in particular between the synth data and generations from the untrusted LLM). I would bet against it, but I'd be excited if someone looked into it (this is the sort of thing you can test for even for scheming by paraphrasing multiple AIs and then doing SFT on the classification task before doing the eval).

First off, thank you for this series. You've thoroughly dissolved the paradox for me. I was initially a thirder despite suspicions about Elga's original argument (he seemed to equivocate between being awakened in general and being first awakened/awakened on Monday), because the probabilities for Lewis' halfer argument (the 1/2, 1/4, 1/4 split) made no sense no matter how I did the math. I now understand that both sides were making the same error in their models, and that while the thirder position is correctly measuring something, it's not your credence in... (read more)

2 Ape in the coat
You are most welcome! Thank you for reading the series. Yes, I think this is a very good alternative formulation. I agree that it can feel more intuitive when framed like that, even though I don't hold out much hope that it will change the minds of many people.

and have no particular reason to think those people are very unusual in terms of cooking-skill

Yeah, that's what I was trying to get at with the typical-friend-group stuff. The people you know well aren't a uniform sample of all people, so you have no reason to conclude that their X-skill is normal for any arbitrary X.

2 johnswentworth
I mean, you have some priors, and priors are a totally valid thing to reason from even if you're not taking a uniform random sample of all people. In general, "I have a few samples of X, and I don't have any particular reason to think they're unusual samples, so on priors they're probably typical" is a totally valid way to reason even if your samples aren't uniform random from the population.

The problem (or maybe just my problem) is that when I say "average" it feels like it's activating my concept of "mathematical concept of sum/count", even though the actual thing I'm thinking of is "typical member of class extracted from my mental model". I find myself treating "average" as if it came from real data even if it didn't.

1 lesswronguser123
Sorry for the late reply, I was looking through my past notifs. I would recommend that you taboo the words and replace the symbols with the substance. I would also recommend treating language as instrumental, since words don't have inherent meaning; that's how an algorithm feels from inside.

(including just using the default settings, or uninstalling the app and choosing an easier one).

I feel like this was meant to be in jest, but if you were serious: my goal was to beat the stage with certain self-imposed constraints, so relaxing those constraints would've defeated the purpose of the exercise.

Anyways, it wasn't my friend's specific change that was surprising, it was the fact that he had come up with an optimization at all. My strategy was technically viable but incredibly tedious, and even though I knew it was tedious I was planning on trying... (read more)

2 Dagon
It was jest-ish.  Reductio ad absurdum.  Your friend found a setting that made your chosen difficulty level less difficult.  I'm not sure how different that is from just playing an easier game. For activities where you care about the result, not about the activity itself, it makes tons of sense to change the approach to be more successful with less effort.  For games where you're specifically challenging yourself more than necessary, I think the case is a lot weaker about how to optimize.