Independently of the broader point, here are some comments on the particular example from the Scientist AI paper (source: I am an author):
> This really works to make CDT decisions! Try thinking through what the market would do in various decision-theoretic problems.
I thought through whether it works in Newcomb’s problem; it was unexpectedly complicated and confusing (see below), and I now doubt that it always recovers CDT in Newcomblike problems. I may have done something wrong.
Newcomblike problems are obviously not the reason we want causal decision markets. But more generally I’m realising now that the relationship between evidential and causal decision markets is quite different from the relationship between EDT and CDT as normally conceived. EDT and CDT agree with each other in everyday problems and only disagree in Newcomblike problems, whereas evidential and causal decision markets disagree in everyday problems.
So perhaps ‘EDT’ and ‘CDT’ are not good terms to use when talking about decision markets.
Consider Newcomb’s problem. Say I make markets for “If I twobox, will I get the million?” and “If I onebox, will I get the million?” and follow the randomisation scheme. In the nonrandomised cases, I should twobox if $1000 + 1{,}000{,}000 \cdot p_1 > 1{,}000{,}000 \cdot p_2$, where $p_1$ is the first market probability and $p_2$ is the second. That is, if $p_2 - p_1 < 0.001$.
Let’s assume Omega can’t predict when I’ll randomise or what the outcome will be, and so always fills the boxes according to what I do in the nonrandomised cases.
If $p_2 - p_1 < 0.001$, then I’ll twobox in the nonrandomised cases, so the box will be empty even in the randomised cases, so both the twobox and the onebox markets will resolve NO and so both $p_1$ and $p_2$ should be bet down.
If $p_2 - p_1 \geq 0.001$, then I’ll onebox in the nonrandomised cases, so the box will always be full, and both the twobox and onebox markets will resolve YES and so both $p_1$ and $p_2$ should be bet up.
I think the only equilibrium here is if $p_1$ and $p_2$ are both 0, in which case I’ll twobox and in the randomised cases both markets will correctly resolve NO. That does agree with CDT in the end but it’s kind of a weird way to get there. Not sure if it generalises.
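For concreteness, here's a quick brute-force check of that equilibrium claim (just a toy sketch under the assumption that Omega can't predict the randomisation; the price grid and function names are mine, purely for illustration):

```python
# Toy check of the claim above, assuming Omega fills the box based only on
# what I do in the nonrandomised cases (prices and names are illustrative).

def nonrandomised_action(p1, p2):
    # From the decision rule: twobox iff 1_000 + 1_000_000 * p1 > 1_000_000 * p2,
    # i.e. iff p2 - p1 < 0.001.
    return "twobox" if p2 - p1 < 0.001 else "onebox"

def resolution(p1, p2):
    # Omega fills the box iff I onebox in the nonrandomised cases; in the
    # randomised cases "will I get the million?" then resolves YES for both
    # markets iff the box is full, whichever action the randomiser picks.
    return 1.0 if nonrandomised_action(p1, p2) == "onebox" else 0.0

# An equilibrium is a pair of prices that equal the resolution they induce.
grid = [i / 1000 for i in range(1001)]
equilibria = [(p1, p2) for p1 in grid for p2 in grid
              if p1 == resolution(p1, p2) and p2 == resolution(p1, p2)]
print(equilibria)  # [(0.0, 0.0)]
```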
Alternatively we could assume that Omega can predict the randomisation. But then the twobox market will be at 0% and the onebox at 100%, so I would onebox.
I'm certainly no expert on self-studying maths. I've generally found it easy to pick up a conceptual understanding from skimming textbooks, and for some subjects (e.g. statistics, Bayesian probability, maybe logic) I think that's where most of the value lies. I've never had the drive or made the time to work through a lot of exercises on my own, and I'd guess that for subjects like linear algebra being able to actually work through problems is probably the important part.
So if you have a subject where both (i) it's not clearly relevant, and (ii) getting a useful understanding requires working through a lot of exercises, then I'd probably hold off.
Good luck!
I would say that properly learning new maths takes a long time and it might not be worth trying to seriously study areas that aren’t clearly related to the kind of research you want to do (like category theory).
Like, being a maths undergrad is a full-time job and maths undergrads typically learn the equivalent of a few slim textbooks' worth of content every few months.
Probably you work more hours and are more driven than your average maths undergrad, but then again you’ll be studying alone and trying to do other kinds of work too. Unless you’re exceptionally driven (or skip all the exercises) then it will be enough of an achievement to study a couple of textbooks a year. So spend them wisely!
I reckon it's probably a generalisation of "research taste" which means "good judgement about what kind of research to try and do". This is subtly different to the more typical usage of the word taste — like, "music taste" usually means "ability to discern what music is good" not "ability to discern what music to try and make".
When I googled "research taste", many of the top results were LessWrong-adjacent, but I did find this page on "cultivating your research taste" written by some academics in 2013.
Maybe someone with superior googling skills to me can figure out if that's the origin of the phrase.
There's a second, less prosaic, way to frame mitigating reward hacking as an ELK problem, although it requires making some extra assumptions.
Forget about what the coding model knows and just think about training the best monitor we can. There might be ways of improving our monitor's ability to understand whether code contains reward hacks that don't automatically make the monitor better at telling us about them. For example, suppose we train the monitor to predict some downstream consequences of running code generated by the agent, e.g. whether the test cases will pass[1]. Assume the monitor gets extremely good at this task (we can generate a lot of realistic data, so this is plausible). If 'code passes for the intended reason' and 'code passes because of a reward hack' are sufficiently distinct causal mechanisms, then we can argue that in order to be really good at this task the monitor has to have a really good understanding of whether the code contains a reward hack[2].
But then we have an ELK problem, since although the monitor now has a really good understanding of reward hacks, we've trained it to predict downstream consequences rather than report reward hacks. We might need clever ways to extract its understanding of reward hacks.
(We'd be in a similar situation if we just happened to have predictive models that were really good at this task, even if we didn't specifically train them on it.)
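(To make that concrete, here's a deliberately minimal PyTorch sketch, not a proposal: the shared encoder is only ever trained through the prediction head, so even if it ends up representing reward hacks internally, nothing in this objective turns that into a report. All names, shapes and the fake data below are made up.)

```python
import torch
import torch.nn as nn

# Minimal illustration: a shared encoder trained only to predict whether the
# test cases pass. The reporting head exists but receives no gradient here.
class ToyMonitor(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.predict_tests_pass = nn.Linear(dim, 1)   # trained below
        self.report_reward_hack = nn.Linear(dim, 1)   # never trained below

monitor = ToyMonitor()
opt = torch.optim.Adam(
    list(monitor.encoder.parameters())
    + list(monitor.predict_tests_pass.parameters())
)
loss_fn = nn.BCEWithLogitsLoss()

# Stand-in for (features of generated code, did the tests pass) pairs.
code_features = torch.randn(256, 64)
tests_passed = torch.randint(0, 2, (256, 1)).float()

for _ in range(100):
    logits = monitor.predict_tests_pass(monitor.encoder(code_features))
    loss = loss_fn(logits, tests_passed)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The encoder may now need to "know" about reward hacks to predict outcomes
# well, but report_reward_hack gives us no handle on that knowledge;
# extracting it is the ELK part.
```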
Much like in the function correctness dataset in this paper. ↩︎
The METR post above notes that many reward hacks are pretty ambiguous, so this might be a big 'if'. ↩︎
I've been describing this example to people quite often recently and wanted to write it down somewhere I can link to.
Suppose we're training a language model via reinforcement learning to solve coding problems, with a reward function that just checks whether the code passes some test cases. Current frontier models learn to sometimes reward hack, e.g. by special-casing the test cases rather than writing correct code (METR post with several examples).
We'd like to detect when a piece of code contains a reward hack, so that we can avoid reinforcing it. But the naive method of having a monitor inspect the code or the agent's chain of thought is not ideal, since if it's an imperfect classifier this might just teach the agent to reward hack in undetectable ways. Empirical studies have found some evidence for this.
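(Concretely, the naive setup looks roughly like the sketch below; both helper functions are hypothetical stand-ins rather than real APIs.)

```python
def run_tests(code: str) -> bool:
    """Hypothetical stand-in: execute the test cases against the agent's code."""
    raise NotImplementedError  # e.g. run the tests in a sandbox

def monitor_flags_hack(code: str) -> bool:
    """Hypothetical stand-in: an imperfect reward-hack classifier."""
    raise NotImplementedError  # e.g. another model reading the code / CoT

def naive_reward(code: str) -> float:
    # Reward code that passes the tests, but zero it out when the monitor
    # objects. If the monitor is imperfect, optimising against this reward
    # can select for hacks the monitor misses rather than for honest code.
    if not run_tests(code):
        return 0.0
    return 0.0 if monitor_flags_hack(code) else 1.0
```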
But it seems like the coding models themselves probably know whether a solution they've generated contains a reward hack, even when the monitor can't detect it[1].
So even if we can never get a perfect reward hack classifier, we might hope to detect all the reward hacks that the model itself knows about. This is both an easier problem — the information is in the model somewhere, we just have to get it out — and also covers the cases we care about most. It's more worrying to be reinforcing intentional misaligned behaviour than honest mistakes.
The problem of finding out whether a model thinks some code contains a reward hack is an instance of the ELK problem, and the conceptual arguments for why it could be difficult carry over. This is the clearest case I know of where the ideas in that paper have been directly relevant to a real prosaic AI safety problem. It's cool that quite speculative AI safety theory from a few years ago has become relevant at the frontier.
I think that people who are interested in ELK should probably now use detecting intentional reward hacking as their go-to example, whether they want to think about theoretical worst-case solutions or test ideas empirically.
I think that models usually know when they're reward hacking, but I'm less confident that this applies to examples that fool a monitor. It could be that in these cases the model itself is usually also unaware of the hack. ↩︎
Do you think of this work as an ELK thing?
I guess it’s annoying to have several such journals at the top of rankings lists. Similarly to how if you look up a list of premier league footballers with the highest goals per game, the list will normally be restricted to players who’ve played a certain number of games.
Not sure how to think about this overall. I can come up with examples where it seems like you should assign basically full credit for sloppy or straightforwardly wrong statements.
E.g. suppose Alice claims that BIC only make black pens. Bob says, "I literally have a packet of blue BIC pens in my desk drawer. We will go to my house, open the drawer, and you will see them." They go to Bob's house, and lo, the desk drawer is empty. Turns out the pens are on the kitchen table instead. Clearly it's fine for Bob to say, "All I really meant was that I had blue pens at my house, the point stands."
I think your mention of motte-and-baileys probably points at the right refinement: maybe it's fine to be sloppy if the version you later correct yourself to has the same implications as what you literally said. But if you correct yourself to something easier to defend but that doesn't support your initial conclusion to the same extent, that's bad.
EDIT: another important feature of the pens example is that the statement Bob switched to is uncontroversially true. If on finding the desk drawer empty he instead wanted to switch to, "I left them at work", then probably he should pause and admit a mistake first.