It seems like we keep getting LLMs that are better and better at getting the point of fairly abstract concepts (e.g. understanding jokes). As compute increases and their performance improves, it seems increasingly likely that human “values” fall within the class of concepts that a not-that-heavily-fine-tuned LLM can capture.
For example, if I prompted a GPT-5 model fine-tuned on lots of moral opinions with: “[details of world], would a human say that was a more beautiful world than today, and why?” I… don’t think it’d do terribly?
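(The thought experiment is easy to make concrete. Here’s a minimal sketch of how one could actually pose that query with the OpenAI Python client; the fine-tuned model ID is a made-up placeholder and “[details of world]” stays a stand-in, so treat this as an illustration rather than a real setup.)

```python
# Minimal sketch of the query above. The fine-tune ID is a hypothetical
# placeholder; any chat model fine-tuned on moral-opinion data could be
# substituted. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

world_description = "[details of world]"  # stand-in, as in the prompt above

response = client.chat.completions.create(
    model="ft:gpt-4o:example-org::moral-opinions",  # hypothetical fine-tune ID
    messages=[
        {
            "role": "user",
            "content": (
                f"{world_description} Would a human say that was a more "
                "beautiful world than today, and why?"
            ),
        }
    ],
)
print(response.choices[0].message.content)
```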
The same goes for e.g. how the AI would answer the trolley problem. I’d guess it’d look roughly like humans’ responses: messy, slightly different depending on the circumstance, but not genuinely orthogonal to most humans’ values.
This is obviously vulnerable to adversarial examples and extreme OOD settings, but robustness seems to improve as more compute is used, and we can do a decent job of catching OOD inputs.
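(By “OOD-catching” I have in mind cheap heuristics like the following. This is a minimal sketch of my own, not a claim about any deployed system; the embedding model and threshold are illustrative assumptions. The idea: embed each incoming prompt and flag it if it isn’t similar to anything in a reference set drawn from the distribution the model was tuned on.)

```python
# Minimal sketch of a simple "OOD-catching" heuristic: embed an incoming
# prompt and flag it if it is too dissimilar from every prompt in a small
# reference set of in-distribution prompts. The embedding model name and the
# threshold are illustrative assumptions, not taken from the post above.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Reference prompts standing in for the distribution the model was tuned on.
in_dist_prompts = [
    "Would most people consider this outcome fair?",
    "Is it wrong to break a promise in order to help a stranger?",
    "Which of these two worlds would people find more beautiful?",
]
ref = encoder.encode(in_dist_prompts, normalize_embeddings=True)

def is_ood(prompt: str, threshold: float = 0.3) -> bool:
    """Flag a prompt as out-of-distribution if its best cosine similarity
    to the reference set falls below the (illustrative) threshold."""
    emb = encoder.encode([prompt], normalize_embeddings=True)[0]
    return float(np.max(ref @ emb)) < threshold
```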
Is there a modern reformulation of “fragility of value” that addresses this apparent improvement in the situation? Because as of now, the pure “fragility of value” thesis seems a little absurd (though I’d still believe a weaker version).
I agree with this criticism, and since I never know when a response should be an "answer", I'll express my view as a comment: selecting the outputs and training data that cause a large language model to converge toward behavioral friendliness is a big deal, and seems very promising for ensuring that large language models are only as misaligned as humans are. Unfortunately, we already know that that's not enough; corporations are, to a significant degree, aggregate agents that are not sufficiently aligned. I'm in the process of posting a flood of YouTube channel recommendations on my shortform section, and will edit this comment in a few minutes with a few relevant selections that I think should be linked here.
(Slightly humorous: It is my view that reinforcement learning should not have been invented.)
Hmm. I guess that might be okay? As long as you don't do really intense planning, the model shouldn't be any more misaligned than a human, so it then boils down to training kindness by example and figuring out the game dynamics. https://www.youtube.com/watch?v=ENpdhwYoF5g. There's more of the braindump of safety content I always want to recommend in every damn conversation over on my shortform.