A good ask for frontier AI companies, for avoiding massive concentration of power, might be:
since this seems both important and likely to be popular.
The obvious problem is that doing the full post-training isn't cheap, so you may need some funding.
(I'm Open Phil staff.) If you're seeking funding to extend this work, apply to Open Phil's request for proposals on technical safety research.
This section feels really important to me. I think it's somewhat plausible and big if true.
I was surprised to see you say this; isn't this section just handwavily saying "and here, corrigibility is solved"? While that also seems plausible and big if true to me, it doesn't leave much to discuss. Or did you interpret it differently?
I work as a grantmaker on the Global Catastrophic Risks Capacity-Building team at Open Philanthropy; a large part of our funding portfolio is aimed at increasing the human capital and knowledge base directed at AI safety. I previously worked on several of Open Phil’s grants to Lightcone.
As part of my team’s work, we spend a good deal of effort forming views about which interventions have or have not been important historically for the goals described in my first paragraph. I think LessWrong and the Alignment Forum have been strongly positive for these goals historically, and think they’ll likely continue to be at least into the medium term.
Good Ventures' decision to exit this broad space meant that Open Phil didn't reach a decision on whether & how much to continue funding Lightcone; I'm not sure where we would have landed there. However, I do think that for many readers who resonate with Lightcone’s goals and approach to GCR/x-risk work, it’s reasonable to think this is among their best donation opportunities. Below I’ll describe some of my evidence and thinking.
Surveys: The top-level post describes surveys we ran in 2020 and 2023. I think these provide good evidence that LessWrong (and the Alignment Forum) have had a lot of impact on the career trajectories & work of folks in AI safety.
Other thoughts:
In contrast to some other threads here such as Daniel Kokotajlo’s and Drake Thomas’s, on a totally personal level I don’t feel a sense of “indebtedness” to Lightcone or LessWrong, have historically felt less aligned with it in terms of “vibes,” and don’t recall having significant interactions with it at the time it would have been most helpful for me gaining context on AI safety. I share this not as a dig at Lightcone, but to provide context to my thinking above 🤷.
In your imagining of the training process, is there any mechanism via which the AI might influence the behavior of future iterations of itself, besides attempting to influence the gradient update it gets from this episode? E.g. leaving notes to itself, either because it's allowed to as an intentional part of the training process, or because it figured out how to pass info even though it wasn't intentionally "allowed" to.
It seems like this could change the game a lot regarding the difficulty of goal-guarding, and it may also be an important disanalogy between training and deployment. I realize the latter might be beyond the scope of this report, though, since the report is specifically about faking alignment during training.
For context, I'm imagining an AI that doesn't have sufficiently long-term/consequentialist/non-sphex-ish goals at any point in training, but once it's in deployment is able to self-modify (indirectly) via reflection, and will eventually develop such goals after the self-modification process is run for long enough or in certain circumstances. (E.g. similar, perhaps, to what humans do when they generalize their messy pile of drives into a coherent religion or philosophy.)
Stack Overflow has long had a "bounty" system where you can stake some of your karma to promote your question. The karma goes to the answer you choose to accept, if you accept one; otherwise it's lost. (There's no analogue of an "accepted answer" on LessWrong, but I thought it might be an interesting reference point; a rough sketch of the mechanics is below.)
I lean against the money version, since not everyone has the same amount of disposable income, and I think there would probably be distortionary effects in this case [e.g. a wealthy startup founder paying to promote their monographs].
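For concreteness, here's a minimal sketch of the escrow mechanics described above. This is purely illustrative: the class and method names (User, Bounty, accept, expire) are my own, not Stack Overflow's or LessWrong's actual implementation.

```python
from dataclasses import dataclass

# Minimal sketch of a Stack Overflow-style karma bounty, for illustration only.
# Names are hypothetical, not an actual Stack Overflow or LessWrong API.

@dataclass
class User:
    name: str
    karma: int

@dataclass
class Bounty:
    sponsor: User
    amount: int
    resolved: bool = False

    def __post_init__(self):
        if self.amount > self.sponsor.karma:
            raise ValueError("cannot stake more karma than the sponsor has")
        # Karma is escrowed up front, so the sponsor can't spend it twice.
        self.sponsor.karma -= self.amount

    def accept(self, answerer: User) -> None:
        """Award the escrowed karma to the accepted answer's author."""
        if self.resolved:
            raise RuntimeError("bounty already resolved")
        answerer.karma += self.amount
        self.resolved = True

    def expire(self) -> None:
        """No answer accepted: the escrowed karma is simply lost."""
        if self.resolved:
            raise RuntimeError("bounty already resolved")
        self.resolved = True  # karma is burned, not refunded


if __name__ == "__main__":
    asker = User("asker", karma=500)
    answerer = User("answerer", karma=120)
    bounty = Bounty(sponsor=asker, amount=100)  # asker now has 400 karma
    bounty.accept(answerer)                     # answerer now has 220 karma
```

The key design choice is that staked karma is deducted immediately and burned if no answer is accepted, which is what makes the promotion costly to the asker rather than a free signal boost.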
What about puns? It seems like at least some humor is about generic "surprise" rather than danger, even social danger. Another example is absurdist humor.
Would this theory pin this, too, on the danger-finding circuits? Perhaps in the evolutionary environment, surprise was in fact correlated with danger.
It does seem like some types of surprise have the potential to be funny and others don't; I don't often laugh while looking through lists of random numbers.
I think the A/B theory would say that lists of random numbers don't have enough "evidence that I'm safe" (perhaps here, evidence that there is deeper structure like the structure in puns) and thus fall off the other side of the inverted U. But it would be interesting to see more about how these very abstract equivalents of "safe"/"danger" are built up. Without that it feels more tempting to say that funniness is fundamentally about surprise, perhaps as a reward for exploring things on the boundary of understanding, and that the social stuff was later built up on top of that.
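To make the inverted-U picture concrete, here's one toy way to write it down (the functional form is purely my assumption, not something the A/B theory specifies):

```latex
% Toy model (my assumption): funniness F rises with surprise S but is gated by
% perceived safety, which falls off once S outruns the evidence of safety.
\[
  F(S) \;=\; S \cdot \Pr(\mathrm{safe} \mid S),
  \qquad
  \Pr(\mathrm{safe} \mid S) \;=\; \frac{1}{1 + e^{\,k\,(S - S_0)}}
\]
% Puns: moderate S plus quickly recovered structure keeps Pr(safe | S) high,
% so F sits near the peak. Random-number lists: any surprise comes with no
% evidence of deeper structure, so Pr(safe | S) stays low and F falls off the
% far side of the inverted U.
```

Under a form like this, whether funniness is "fundamentally about surprise" or "gated by safety" becomes a question about how sharply the gating term falls off.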
What was the purpose of using octopuses in this metaphor? Like, it seems you've piled on so many disanalogies to actual octopuses (extremely smart, many generations per year, they use Slack...) that you may as well just have said "AIs."
EDIT: Is it gradient descent vs. evolution?