User Comment Replies

Goodhart's Law in Reinforcement Learning

Thanks for the comment! Note that we use the state-action visitation distribution, so the trajectories we consider contain actions as well. This makes it possible to invert $η$ and recover the policy (as long as all states are visited). Using only state trajectories, it would indeed be impossible to recover the policy.
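To make the inversion concrete, here is a minimal sketch for the tabular case. It assumes $η$ is stored as a nested list `eta[s][a]` and uses the standard identity $π(a \mid s) = η(s, a) / \sum_{a'} η(s, a')$. The function name and the toy numbers are made up for illustration and are not taken from the post.

```python
def policy_from_occupancy(eta):
    """Recover pi(a|s) from a tabular state-action visitation distribution.

    eta[s][a] is the (possibly unnormalised) visitation mass of the pair (s, a).
    This layout is an assumption made for the sketch.
    """
    policy = []
    for s, row in enumerate(eta):
        state_mass = sum(row)  # eta(s) = sum over actions of eta(s, a)
        if state_mass == 0:
            # An unvisited state carries no information about the policy there.
            raise ValueError(f"state {s} is never visited; policy is undetermined")
        policy.append([mass / state_mass for mass in row])
    return policy


# Toy example: two states, two actions (illustrative numbers only).
eta = [[0.375, 0.125],
       [0.1, 0.4]]
pi = policy_from_occupancy(eta)
```

The `ValueError` branch is the "as long as all states are visited" caveat: if a state has zero visitation mass, any action distribution there is consistent with the same $η$.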

2TurnTrout1y

Thanks, this was an oversight on my part.

A warm-up for the AI governance project

jacek2y30

Yes, I agree that politicisation is the central issue. But this is exactly why I wrote the first part - I feel that this section is true despite it (I didn't claim that most people agree with the solution, only that the elites, experts, and the reader's social bubble do!).

So one question I'm trying to understand is: since politicisation happened to climate change, why do we think that it won't happen to AI governance? I.e. the point is that pursuing goals by political means might just usually end up like that, because of the basic structure of political discourse (you get points for opposing the other side, etc.).

Categorical-measure-theoretic approach to optimal policies tending to seek power

jacek2y10

Hm, so one comment is that the proof in the post was not meant to convey the intuition for the existence of the concrete probability distribution - the measurability of the POWER inequality is a necessary first step, but not really technically related to the (potential) rest of the proof (although I had initially hoped that lifting some distribution on rewards by the Giry monad might produce something interesting).

As for why the additional structure might be helpful: the issue with there being no Lebesgue-like uniform measure is that in the infinite-dimens...

1Algon2y

Thank you for writing this, I feel like it makes the core idea you're expressing much clearer. My intuition is that abstract Wiener spaces alone won't get you the sort of measure you're looking for, based on my experience with measures over big spaces in physics. But, that said, I feel like there should be some such measure over large physical spaces, as presumably power has a definition in terms of physical concepts, or else how the heck can we recover our intuition of power in our world? It should all add up to normality, after all. It seems to me that looking over those physics papers which described single particles as agentic because our distributions over them tend towards max entropy, which we can view as the particle seeking the greatest "option value" it can, would be a good place to build up the latter intuition. I think I am undecided as to whether you can use the rich structure of reward functions to limit the allowed transformations in a useful way. Partly because I suspect that this rich structure reflects a physical structure (something like the natural abstractions thesis + selection pressure from reality for the sorts of rewards we typically see) or perhaps a simplicity prior of some sort. But maybe it will work out. I don't know. My lack of optimism as to the possibility of your agenda is basically why I was willing to accept the strange probability distribution TurnTrout went with, I guess. But on reflection, perhaps I should have used that as an existence proof of a distribution over rewards which allows something like our intuitive picture of power seeking. And tried to see if I could interpret it to be something less weird, use it to find something less weird, or just go look for less weird things because maybe they'd work. Sorry for the long post, but I just realized I didn't update based on TurnTrout's results. It seems more likely to me now that your agenda might work. Though I'd be more optimistic if you were using TurnTrout's distribu

All of jacek's Comments + Replies