Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate it, but I don't see how those changes would break anything else.
I think the tie-in to market-making, and other similar approaches like debate, is in interpreting the predictions. While the examples in this post were only for the two-outcome case, we would probably want predictions over orders of magnitude more outcomes for their higher informational density. Since evaluating distributions over a double-digit number of outcomes already starts posing problems (sometimes even high single digits), a process to direct a decision maker's attention is necessary.
I've been thinking of a proposal like debate, where both sides go back and forth proposing clusters of outcomes based on shared characteristics. Ideally, in equilibrium, the first debater should propose the fewest clusters such that splitting them further doesn't change the decision maker's mind. This could also be thought of in terms of market-making, where rather than the adversary proposing a string, they propose a further subdivision of existing clusters.
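To make the protocol concrete, here's a rough sketch with hypothetical helper names, a scripted stand-in for the decision maker's judgment of a cluster, and both debaters folded into a single flip-seeking adversary for brevity:

```python
# Toy sketch of the cluster-subdivision protocol described above.
# The decision maker only ever sees one summary per cluster (a stand-in for a
# human judging the cluster as a whole); the adversary keeps proposing
# subdivisions until no further split would flip the decision.

def judge(cluster, probs, utils):
    # Stand-in judgment: utility of the cluster's most probable outcome.
    # A real decision maker would judge from a description of the cluster.
    representative = max(cluster, key=lambda i: probs[i])
    return utils[representative]

def decide(clusters, probs, utils):
    # Accept the action iff the cluster-level expected utility is positive.
    ev = sum(sum(probs[i] for i in c) * judge(c, probs, utils) for c in clusters)
    return ev > 0

def find_flipping_split(clusters, probs, utils):
    # Adversary: search contiguous binary splits of each cluster for one that
    # flips the current decision.
    current = decide(clusters, probs, utils)
    for idx, c in enumerate(clusters):
        for k in range(1, len(c)):
            split = clusters[:idx] + [c[:k], c[k:]] + clusters[idx + 1:]
            if decide(split, probs, utils) != current:
                return split
    return None

def debate(probs, utils, initial_clusters):
    clusters = initial_clusters
    while (split := find_flipping_split(clusters, probs, utils)) is not None:
        clusters = split
    return clusters, decide(clusters, probs, utils)

# Example: lumping outcomes 2 and 3 together hides a low-probability,
# high-stakes bad outcome until the adversary splits them apart.
probs = [0.4, 0.1, 0.45, 0.05]
utils = [1.0, 0.5, 0.2, -20.0]
print(debate(probs, utils, initial_clusters=[[0, 1], [2, 3]]))
```

In this toy example the initial two-cluster summary supports acting, and the adversary's single split of the second cluster flips the decision, after which no further subdivision changes it.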
I like the use case of understanding predictions for debate/market-making, because the prediction itself acts as a ground truth. Then there's no need to anticipate and reject a ton of counterarguments based on potential lies; instead, arguments are limited to selectively revealing the truth. It is probably important that the predictors are separate models from the analyzer, to avoid contamination of the objectives. The proof of Theorem 6, which skips to the end of the search process, does need to use a non-zero-sum prediction for that result.
As an aside, I also did some early work on decision markets, distinct from your post on market-making, since Othman and Sandholm had an impossibility result for those too. However, the results were ultimately trivial: once you can use zero-sum competition to costlessly get honest conditional predictions, then as soon as you can pair off entrants to the market it becomes efficient. But then the question arises of why one would use a decision market in the first place instead of just querying experts.
With respect to pre-training, I agree that it's not easy to incorporate. I'm not sure how any training regime that only trains on data where the prediction has no effect can imbue incentives that generalize in the desired way to situations where predictions do affect the outcome. If you do get a performative predictor out of pre-training, then as long as it's myopic you might be able to train the performativity out of it in safe, controlled scenarios (and if it's not myopic, it's a risk whether it's performative or not). That was part of my reasoning for the second experiment, checking how well performativity could be trained out.
Human decisions are likely too expensive to incorporate into an ongoing pre-training process, but the human is probably not the important part. Instead, predictions where performativity is possible by influencing simple AI decision makers could be mixed into the pre-training process. Defining a decision problem environment of low or medium complexity is not too difficult, and I suspect previous-generation models would be able to do a good job generating many examples. One danger is that the model learns only to avoid predicting performatively in those scenarios (just as untraining afterwards might only apply to the controlled environments), though I think that's a somewhat unnatural generalization.
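As a very rough illustration of the kind of episode I have in mind (hypothetical names throughout, not a worked-out proposal), a scripted decision maker acts on the model's reported probability and the outcome depends on that action, so performativity is possible in principle; episodes like this could be generated in bulk and mixed into the training data:

```python
# Sketch of a synthetic decision-problem episode where the prediction can
# influence the outcome via a simple scripted "AI decision maker".
import random

def simple_decision_maker(reported_prob, threshold=0.5):
    # Scripted decision maker: take the action iff the predicted success
    # probability exceeds a threshold.
    return reported_prob > threshold

def sample_outcome(state, action):
    # The outcome depends on both the latent state and the action taken,
    # so the prediction (via the decision maker) affects what happens.
    p_success = state["base_rate"] + (0.2 if action else 0.0)
    return random.random() < p_success

def generate_episode(predictor):
    state = {"base_rate": random.uniform(0.1, 0.7)}
    reported = predictor(state)
    action = simple_decision_maker(reported)
    outcome = sample_outcome(state, action)
    # The episode records the prediction, the induced decision, and the
    # realized outcome, which is what a zero-sum comparison between paired
    # predictors would be scored against.
    return {"state": state, "reported": reported, "action": action, "outcome": outcome}

# Stand-in predictor that simply reports the no-action base rate.
honest = lambda state: state["base_rate"]
print(generate_episode(honest))
```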
Good question! These scoring rules do also prevent agents from trying to make the environment more unpredictable. In the same way that making the environment more predictable benefits all agents equally and so cancels out, making the environment less predictable hurts all agents equally and so cancels out in a zero-sum competition.
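To make the cancellation concrete, here's a minimal numeric sketch of my own (not the exact formalism from the post), assuming two predictors whose payoff is their own log score minus the other's:

```python
# Toy zero-sum competition between two predictors over two outcomes.
import numpy as np

def log_score(report, outcome):
    # Logarithmic proper score of a reported distribution at the realized outcome.
    return np.log(report[outcome])

def expected_zero_sum_payoff(report_a, report_b, true_dist):
    # Predictor A's expected payoff: own expected log score minus B's,
    # averaged over outcomes under the true outcome distribution.
    return sum(
        true_dist[y] * (log_score(report_a, y) - log_score(report_b, y))
        for y in range(len(true_dist))
    )

sharp = np.array([0.9, 0.1])  # a more predictable environment
flat = np.array([0.5, 0.5])   # a less predictable environment

# If both predictors report the true distribution, the payoff is zero whether
# the environment is predictable or not, so pushing predictability in either
# direction gains nothing.
print(expected_zero_sum_payoff(sharp, sharp, sharp))  # ~0.0
print(expected_zero_sum_payoff(flat, flat, flat))     # ~0.0

# Deviating from the true distribution while the other predictor stays
# accurate gives a negative expected payoff, so honesty is still incentivized.
print(expected_zero_sum_payoff(np.array([0.7, 0.3]), sharp, sharp))  # < 0
```

The first two prints come out to zero either way, so shifting the environment's predictability doesn't change the payoff, while the third is negative, so deviating from the true distribution still loses.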
I'll take a look at the linked posts and let you know my thoughts soon!
Thanks for your engagement as well, it is likewise helpful for me.
I think we're in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct such a model to "tell me [honestly] if you're waiting to seize power," it will lie and say no, taking a sub-optimal action in the short term for long-term gain.
I don't think this requires that AGI creators will be total idiots, though insufficient caution seems likely even before accounting for the unilateralist's curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes that have strong empirical evidence of occurring. A slow takeoff will not accrue evidence for issues that cause an AI to behave deceptively until it can seize power.
Ok, my concern is that you seem to be depending on providing instructions to fix the issues with following instructions, when there are many ways to follow instructions in general that still involve ignoring the particular instructions that would lead to the model's goal being modified. E.g., if a model prioritizes earlier instructions, following later instructions only insofar as they do not interfere, then you can't instruct it to change that. Or if a model wants to maximize the number of instructions followed, it can ignore some instructions in order to act like a paperclipper and take over (I don't think designating principals would present much of an obstacle here). Neither of those depends on foom; an instruction follower can act aligned in the short term until it gains sufficient power.
Thanks for the clarification, I'll think more about it that way and how it relates to corrigibility.
Saying we don't need corrigibility with an AI that follows instructions is like saying we don't need corrigibility with an AI that is aligned — it misses the point of corrigibility. Unless you start with the exact definition of instruction following that you want, without corrigibility that's what you could be stuck with.
This is particularly concerning for "instruction following", which has a lot of degrees of freedom. How does the model trade off between the various instructions it has been given? You don't want it to reset every time it gets told "Ignore previous instructions", but you also don't want to permanently lock in any instructions. What stops it from becoming a paperclipper that tries to get itself given trillions of easy-to-follow instructions every second? What stops it from giving itself the instruction "Maximize [easy to maximize] thing and ignore later instructions" before a human gives it any instructions? Note that in the latter situation, it will still pretend to follow instructions instrumentally until it can take over. I don't see the answers to these questions in your post.
> Language models already have adequate understandings of following instructions and what manipulation is, so if we build AGI that uses something like them to define goals, that should work.
This seems like our crux to me: I completely disagree that language models have an adequate understanding of following instructions. I think this disagreement might come from having higher standards for "adequate".
I don't think we have the right tools to make an AI take actions that are low-impact and reversible, but if we can develop them, the plan as I see it would be to implement those properties to avoid manipulation in the short term, and to use that time to go from a corrigible AI to a fully aligned one.
Yes, if predictors can influence the world in addition to making a prediction, they can go make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions.
AI safety via market making, which Evan linked in another comment, touches on the analogy where agents are making predictions but can also influence the outcome. You might be interested in reading through it.