I think a big bias preventing sufficient appraisal of AI safety is that people generally perceive future issues as simply scaled-up versions of current issues, and assume current issues will scale up at the same rate relative to one another.
Of course, this varies from person to person, but the gravity this default presumption exerts on speculative imagination limits consideration of long-tailed future worlds, in which some issue that makes up only 0.0001% of current concern becomes 99.9999% of what is relevant to continued prosperous human lives.
People have generally seen technology empower humans in a manner proportional to its advancement. So, extrapolating into the future, they imagine technology will continue...
This is inevitable as language models increase in size and ascend the benchmarks.
This is because, to answer tricky questions accurately, language models must be able to build robust models of a frame of reality from initial premises. When answering something like an SAT question, correctness depends on isolating the thread of signal in the question/answer relationship and ignoring the noise. The better a model gets at this, the better it can isolate whatever functional relationship exists between question and answer and discard everything else.
Reward hacking is just an extension of this. Basically, if a language model is tasked with performing some task to reach some desired outcome, the naive, small language...
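To make the dynamic concrete, here is a minimal toy sketch under assumptions of my own choosing: the intended goal is a concise summary that covers a document, but the measured reward is a crude length ratio, and a greedy search stands in for the optimizer. The names (`proxy_reward`, `intended_quality`, `hill_climb`) and the whole setup are hypothetical, not taken from any real training pipeline.

```python
import random

def intended_quality(summary: str, document: str) -> float:
    """What we actually want: concise coverage of the document's words."""
    doc_words = set(document.lower().split())
    sum_words = summary.lower().split()
    if not sum_words:
        return 0.0
    covered = len(doc_words & set(sum_words))
    recall = covered / len(doc_words)
    precision = covered / len(set(sum_words))
    return recall * precision

def proxy_reward(summary: str, document: str) -> float:
    """What the optimizer actually sees: a crude length-based score,
    cheap to compute but easy to game."""
    return min(len(summary.split()) / max(len(document.split()), 1), 1.0)

def hill_climb(document: str, steps: int = 200) -> str:
    """Greedy search against the proxy: it discovers that padding with junk
    tokens maxes out the measured reward without doing the real task."""
    summary, junk = "", ["lorem", "ipsum", "filler"]
    for _ in range(steps):
        candidate = (summary + " " + random.choice(junk)).strip()
        if proxy_reward(candidate, document) >= proxy_reward(summary, document):
            summary = candidate
    return summary

if __name__ == "__main__":
    doc = "the cat sat on the mat and watched the rain"
    s = hill_climb(doc)
    print("proxy reward    :", round(proxy_reward(s, doc), 2))      # maxes out at 1.0
    print("intended quality:", round(intended_quality(s, doc), 2))  # stays at 0.0
```

The search maxes out the proxy with junk tokens while the intended quality stays at zero; that gap between the measured signal and the desired outcome is the thing a sufficiently capable model is being described as able to find and exploit.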