All of VesaVanDelzig's Comments + Replies

If you had a defense of the idea, or a link to one I could read, I would be very interested to hear it. I wasn't trying to be dogmatically skeptical. 

2Rohin Shah
Responded above

His hostility to the program, as I understand it, is that CIRL doesn't do much to answer the question of how to specify a learning procedure that would go from observations of a human being to a correct model of that human being's utility function. This is the hard part of the problem. This is why he says "specifying an update rule which converges to a desirable goal is just a reframing of the problem of specifying a desirable goal, with the "uncertainty" part a red herring". 

One of the big things that CIRL was claimed to have going for it is that ...
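
For concreteness, here is a minimal toy sketch of what "specifying an update rule" looks like in the CIRL spirit: a Bayesian update over a hand-picked set of candidate reward functions, using an assumed Boltzmann-rational model of the human. The hypothesis set, the human model, and the rationality parameter are all illustrative assumptions, and choosing them well is exactly the hard part being pointed at.

```python
import numpy as np

# Toy Bayesian update over candidate reward functions, in the spirit of CIRL.
# The hypothesis set, the Boltzmann human model, and beta are illustrative
# assumptions; choosing them well is exactly the hard part referred to above.

# Hypothesis space: each candidate assigns a reward to each of 3 actions.
candidate_rewards = np.array([
    [1.0, 0.0, 0.0],   # hypothesis A: the human values action 0
    [0.0, 1.0, 0.0],   # hypothesis B: the human values action 1
    [0.3, 0.3, 0.4],   # hypothesis C: the human is nearly indifferent
])
prior = np.array([1 / 3, 1 / 3, 1 / 3])
beta = 2.0  # assumed human rationality (Boltzmann temperature)

def likelihood(action, rewards):
    """P(human picks `action` | this reward hypothesis), Boltzmann-rational human."""
    logits = beta * rewards
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action]

def update(belief, observed_action):
    """Posterior over reward hypotheses after observing one human action."""
    post = np.array([p * likelihood(observed_action, r)
                     for p, r in zip(belief, candidate_rewards)])
    return post / post.sum()

posterior = prior
for a in [0, 0, 1]:          # observed human choices
    posterior = update(posterior, a)
print(posterior)             # mass shifts toward hypotheses that explain the data
```

The update rule itself is trivial; all the difficulty lives in the hypothesis space and the model of how human behavior relates to human values.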

1wassname
It seems that it would remain deferential 1) when extrapolating to new situations, 2) if you add a term to decay the relevance of old information (pretty standard in RL), or 3) if you add a minimum bound on uncertainty. In other words, it doesn't seem like an unsolvable problem, just an open question. But every other alignment agenda also has numerous open questions, so why the hostility? Academia and LessWrong are two different groups with different cultures and jargon. I think they may be overly skeptical towards each other's work at times. It's worth noting, though, that many of the nice deferential properties may appear in other value-modelling techniques (like recursive reward modelling at OpenAI).
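
As a rough illustration of tweaks (2) and (3), here is a minimal sketch (all names and parameters are made up) of a belief over the human's utility that decays old evidence and keeps a floor on uncertainty, so the agent never becomes fully confident and keeps deferring:

```python
# Minimal sketch of tweaks (2) and (3) above, with made-up names and parameters:
# decay the weight of old evidence, and keep a floor on uncertainty, so the
# agent's belief about the human's utility never collapses to certainty and it
# keeps deferring in sufficiently uncertain or novel situations.

class DeferentialBelief:
    def __init__(self, mean=0.0, var=1.0, decay=0.99, var_floor=0.05):
        self.mean, self.var = mean, var
        self.decay = decay          # (2) old information slowly loses weight
        self.var_floor = var_floor  # (3) minimum retained uncertainty

    def observe(self, value, obs_var=0.1):
        # Decay: inflate variance slightly each step, discounting old data.
        self.var /= self.decay
        # Standard Gaussian (Kalman-style) update on the human's utility.
        gain = self.var / (self.var + obs_var)
        self.mean += gain * (value - self.mean)
        self.var = (1 - gain) * self.var
        # Floor: never become fully certain about what the human wants.
        self.var = max(self.var, self.var_floor)

    def should_defer(self, threshold=0.02):
        return self.var > threshold

belief = DeferentialBelief()
for _ in range(100):
    belief.observe(1.0)
# Variance bottoms out at the floor, so the agent still defers rather than
# acting as if it knew the human's utility exactly.
print(belief.var, belief.should_defer())
```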
2Roger Dearnaley
One of the things that almost all AI researchers agree on is that rationality is convergent: as something thinks better, it will be more successful, and to be successful, it will have to think better. In order to think well, it needs to have a model of itself and of what it knows and doesn't know, and also a model of its own uncertainty -- to do Bayesian updates, you need probability priors.

All Russell has done is say "thus you shouldn't have a utility function that maps a state to its utility; you should have a utility functional that maps a state to a probability distribution over its possible utilities, modelling your best estimate of your uncertainty about its utility, and do Bayesian-like updates on that, and optimization searches across it that include a look-elsewhere effect (i.e. the more states you optimize over, the more you should allow for the possibility that what you're locating is a P-hacking mis-estimate of the utility of the state you found, so the higher your confidence in its utility needs to be)".

Now you have a system capable of expressing statements like "to the best of my current knowledge, this action has a 95% chance of me fetching a human coffee, and a 5% chance of wiping out the human race -- therefore I will not do it", followed by "and I'll prioritize whatever actions will safely reduce that uncertainty (i.e. not a naive multi-armed-bandit exploration policy of trying it to see what happens), at a 'figuring this out will make me better at fetching coffee' priority level".

This is clearly rational behavior: it is equally useful for pursuing any goal in any situation that has a possibility of small gains or large disasters and uncertainty about the outcome (i.e. in the real world). So it's convergent behavior for anything sufficiently smart, whether its brain was originally built by Old Fashioned AI or gradient descent. [Also, maybe we should be doing Bayes-inspired gradient descent on networks of neurons that describe proba
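
As a rough sketch of the decision rule being described (not Russell's actual formalism), here is a toy example where each candidate action carries a distribution over utilities, actions with a non-negligible chance of catastrophe are vetoed, and the confidence demanded of the remaining actions scales with how many candidates were searched (the look-elsewhere correction). All names and numbers are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of choosing actions when utility is a distribution,
# not a point estimate. Two ingredients from the comment above:
# (a) a catastrophe veto for small probabilities of huge losses, and
# (b) a look-elsewhere correction: the more candidates we search over, the
#     more conservative the lower confidence bound we require, to avoid
#     "P-hacking" an overestimated utility.

rng = np.random.default_rng(0)

def pick_action(utility_samples_per_action, catastrophe=-1000.0):
    n = len(utility_samples_per_action)
    # (b) Bonferroni-style correction: searching over n actions, demand a
    # (1 - 0.05/n) lower confidence bound rather than a fixed 95% one.
    q = 0.05 / n
    best, best_bound = None, -np.inf
    for i, samples in enumerate(utility_samples_per_action):
        # (a) refuse any action with non-negligible catastrophe probability
        if np.mean(samples <= catastrophe) > 1e-3:
            continue
        bound = np.quantile(samples, q)   # conservative lower bound on utility
        if bound > best_bound:
            best, best_bound = i, bound
    return best  # may be None: "do nothing / ask" if every action looks too risky

# Toy example: action 0 fetches coffee reliably; action 1 usually does better,
# but carries a 5% chance of catastrophe.
fetch_coffee = rng.normal(1.0, 0.1, size=10_000)
risky = np.where(rng.random(10_000) < 0.05, -1e6, rng.normal(2.0, 0.1, 10_000))
print(pick_action([fetch_coffee, risky]))  # -> 0: the risky action is vetoed
```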
3Tor Økland Barstad
But maybe continuing to be deferential (in many/most situations) would be part of the utility function it converged towards? Not saying this consideration refutes your point, but it is a consideration. (I don't have much of an opinion regarding the study-worthiness of CIRL btw, and I know very little about CIRL. Though I do have the perspective that one alignment-methodology need not necessarily be the "enemy" of another, partly because we might want AGI-systems where sub-systems also are AGIs (and based on different alignment-methodologies), and where we see whether outputs from different sub-systems converge.)