LTM - LessWrong

LTM5mo30

I agree that in a takeover scenario where AI capabilities rush wildly ahead of human understanding or control, the ability of the world's second species to retain exclusive resource access will be limited. This is a plausible future, but it is not the only one. A lot of effort is, and very likely will continue to be, directed towards controlling frontier AI and making it as economically beneficial for its owners as possible.

If this work bears fruit, a world where AI is made by people with capital for people with capital seems very likely.

The French nobility had their capital taken by force by revolutionaries in a violent takeover - the kind which AI may execute. An analogy for the kind of situation I am concerned about would be some portion of the French nobility making the laws so complex, the financial structure of the kingdom so riddled with loopholes, that the other nobles were disenfranchised by due process rather than violence.

Taking this analogy further, the French king had to care about the nobility, and the nobility about the peasantry, in part because of their productive capacity. As described in The Intelligence Curse, when AGI renders your work valueless, the incentive for the state to invest in your welfare greatly decreases.

Preference Inversion

LTM6mo10

I think this post is interesting, although I don't particularly agree with the conclusions. I think it is helpful to think about the formation of your mind and goals - a tradition which I know goes back to Rousseau and most likely goes back further (I am not very knowledgeable on the topic).

I think a lot of the difficulty goes back to the distinction between 'real'/'intrinsic' goals and those people purport to believe in. Looking at your example of a Christian sexual prude, nothing about their behaviour implies to me that these virtues of chastity and distance from sex are false at all. Perhaps had they been raised in the absence of society, they would think differently. But if that were the case their views on fashion would probably vastly differ also, and yet we don't consider their desire to dress a certain way to be falsified.

To be clear, I am not claiming that values cannot be gained societally. More that the issue at play is not the creation and adoption of these values from a childhood of coercive exposure, but that these values are hard to satisfy. Sexual prudishness makes it difficult to fulfil inbuilt sexual desires, and so combining both into a single person leaves them less overall satisfiable.

But this goes far beyond societal impacts. In a world of scarce resources, our many inbuilt physical desires as well as our entirely inbuilt desire to behave altruistically conflict in a similar way. It is difficult to be both feared and loved, as some people seem to want to be. Your concern might be better framed as this kind of conflict dragging down one's maximum available utility, rather than as a conflict between inbuilt and constructed desires.

Nice post! I'd be interested to see what you think about the above.

As a side note, you mention that men are apparently better at aligning their sexual preferences to those of society than women. This could be evidence of men being hardwired to adopt the expectations of sexual prudishness to their core. But it could also be that men shaped those expectations to begin with through generations of societal dominance. Or a myriad of other factors could mean that, for a reason entirely distinct from inbuilt preference falsification, men's desires just line up better with society's than women's do.

Why Don't We Just... Shoggoth+Face+Paraphraser?

LTM7mo72

I'm really excited about this, but not because of the distinction drawn between the shoggoth and the face. Applying a paraphraser such that the model's internal states are repeatedly swapped for states which we view as largely equivalent could be a large step towards interpretability.

This reminds me of the idea that CNNs work because they are equivariant under translation. The models can also be made (approximately) rotationally equivariant by applying all possible rotations for a given resolution to the training data. In doing this, we create a model which does not use absolute position or orientation, which turn out to be excellent priors for generalisable image categorisation among many other tasks.
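To make the translation claim concrete, here is a minimal sketch in plain Python. It assumes a 1D signal with wrap-around boundaries, and the helper names `circular_correlate` and `shift` are illustrative rather than taken from any library. It checks that shifting the input and then filtering gives the same result as filtering and then shifting, which is what equivariance means here.

```python
def circular_correlate(signal, kernel):
    """Cross-correlate a 1D signal with a kernel, wrapping around at the edges."""
    n = len(signal)
    return [
        sum(weight * signal[(i + j) % n] for j, weight in enumerate(kernel))
        for i in range(n)
    ]


def shift(signal, steps):
    """Cyclically shift a 1D signal to the right by `steps` positions."""
    n = len(signal)
    return [signal[(i - steps) % n] for i in range(n)]


# Toy data, chosen arbitrarily for illustration.
signal = [0, 1, 4, 2, 0, 0, 3, 1]
kernel = [1, -2, 1]

shifted_then_filtered = circular_correlate(shift(signal, 3), kernel)
filtered_then_shifted = shift(circular_correlate(signal, kernel), 3)

# Equivariance: the two orders of operation agree exactly.
is_equivariant = shifted_then_filtered == filtered_then_shifted
print(is_equivariant)
```

Real CNN layers use zero padding, strides and pooling, so the property only holds approximately in practice. The wrap-around boundary is a simplifying assumption that makes the equality exact.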

We decided upon transformations which our model should be invariant under, and applied them to training data. Doing the same to internal states periodically across the model's activations could force the internal messaging to not only be interpretable to us, but share our approach to irrelevant details.
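As a rough illustration of choosing a transformation and applying it to the training data, the sketch below augments a toy dataset with every 90-degree rotation of each image. Restricting to multiples of 90 degrees is an assumption made so the rotations are exact on a square grid, and all function names are hypothetical. The last helper averages a feature over the rotation orbit, which is a different technique from augmentation (explicit group averaging) but shows what exact invariance looks like. Training on augmented data only encourages a model to approximate it.

```python
def rotate90(grid):
    """Rotate a square grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotation_orbit(grid):
    """Return the grid together with its 90, 180 and 270 degree rotations."""
    orbit = [grid]
    for _ in range(3):
        orbit.append(rotate90(orbit[-1]))
    return orbit


def augment(dataset):
    """Pair every rotation of each image with that image's original label."""
    return [
        (rotated, label)
        for grid, label in dataset
        for rotated in rotation_orbit(grid)
    ]


def top_left_feature(grid):
    """A deliberately position-dependent feature: the top-left pixel."""
    return grid[0][0]


def orbit_averaged(feature, grid):
    """Average a feature over all four rotations, making it rotation invariant."""
    orbit = rotation_orbit(grid)
    return sum(feature(g) for g in orbit) / len(orbit)


# Toy 2x2 "image" with an illustrative label.
image = [[1, 2], [3, 4]]
augmented = augment([(image, "dog")])

print(len(augmented))
print(top_left_feature(image), top_left_feature(rotate90(image)))
print(orbit_averaged(top_left_feature, image),
      orbit_averaged(top_left_feature, rotate90(image)))
```

The paraphraser idea would correspond to swapping an internal state for another member of its "orbit" of equivalent states, with the open question being what that set of equivalent states is for text or activations.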

There would likely be a negative alignment tax associated with this. However, this seems to be a broadly applicable approach to improving interpretability in other contexts.

Consider an image generation model which is configured to output progressively higher resolution and fleshed out generations. If asked to generate a dog, perhaps our 'paraphraser' could swap out parts of the background or change the absolute position of the dog in the image, making fewer changes as the product is filled in. If our model works well, this should give us greater diversity of outputs. If it is failing, a change of scenery from a grassy field to a city block could cause the model to entirely diverge from the prompt by generating a telephone box rather than a dog. This could show erroneous connections the model has drawn and unveil details about its functioning which impede proper generalisation.

Making a conservative case for alignment

LTM8mo50

I agree with your points about avoiding political polarisation and allowing people with different ideological positions to collaborate on alignment. I'm not sure about the idea that aligning to a single group's values (or to a coherent ideology) is technically easier than a more vague 'align to humanity's values' goal.

Groups rarely have clearly articulated ideologies - more like vibes which everyone normally gets behind. An alignment approach from clearly spelling out what you consider valuable doesn't seem likely to work. Looking to existing models which have been aligned to some degree through safety testing, the work doesn't take the form of injecting a clear structured value set. Instead, large numbers of people with differing opinions and world views continually correct the system until it generally behaves itself. This seems far more pluralistic than 'alignment to one group' suggests.

This comes with the caveat that these systems are created and safety tested by people with highly abnormal attitudes when compared to the rest of their species. But sourcing viewpoints from outside seems to be an organisational issue rather than a technical one.

My motivation and theory of change for working in AI healthtech

LTM9mo50

One method of keeping humans in key industrial processes might be expanding credentialism. Individuals retaining control even when the majority of the thinking isn't done by them has always been a key part of any hierarchical organisation.

Legally speaking, certain key tasks can only be performed by qualified accountants, auditors, lawyers, doctors, elected officials and so on.

It would not be good for short-term economic growth. However, legally requiring that certain tasks be performed by people holding credentials that machines are not eligible for might be a good (though absolutely not perfect) way of keeping humans in the loop.

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

LTM1y10

Broadly agree, in that most safety research expands control over systems and our understanding of them, which can be abused by a bad actor.

This problem is encountered by for-profit companies, where profit is on the line instead of catastrophe. They too have R&D departments and research directions which have the potential for misuse. However, this research is done inside a social environment (the company) where it is only explicitly used to make money.

To give a more concrete example, improving self-driving capabilities also allows the companies making the cars to intentionally make them run people down if they so wish. The more advanced the capabilities, the more precise they can be in deploying their pedestrian-killing machines onto the roads. However, we would never expect this to happen as this would clearly demolish the profitability of a company and result in the cessation of these activities.

At present, AI safety research is not done in this kind of environment at all. However, it does seem to me that these kinds of institutions, which carefully vet research and products and only release them when they remain beneficial, are possible.

My PhD thesis: Algorithmic Bayesian Epistemology

LTM1y20

Really fascinating stuff! I have a (possibly answered) question about how using expert updates on other experts' predictions might be valuable.

You discuss the negative impacts of allowing experts to aggregate themselves, or to view one another's forecasts before initially submitting their own. Might there be value in allowing experts to submit multiple times, each time seeing the submitted predictions of a previous round? The final aggregation scheme would be able to not only assign a credence to each expert, but also gain a proxy for what credence the experts give to one another. In a more practical scenario where experts will talk if not collude, this might give better insight into how expert predictions are being created.

Thanks for taking the time to distill this work into a more approachable format - it certainly made the thesis more manageable!
