Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
See also my other discussion with Tom McGrath (one of the Goodfire cofounders) over here: https://www.lesswrong.com/posts/XzdDypFuffzE4WeP7/themanxloiner-s-shortform?commentId=DcBrTraAcxpyyzgF4
I still don't really have any model of where their investors think their >$1.2B of future profits will come from, if not from somehow helping with frontier model training, so I still currently believe this is the default thing they will do. But I sure feel confused about lots of people saying it's a bad business model.
Yep, agree this is possible (though pretty unlikely), but I was just invoking this stuff to argue against pure CDT (or equivalent decision theories that Thomas was saying would rule out rewarding people after the fact from being effective).
Or to phrase it a different way: I am very confident that future, much smarter people will not believe in decision theories that rule out retrocausal incentives as a class. I am reasonably confident, though not totally confident, that de-facto retrocausal incentives will bite on currently alive humans. This overall makes me think it's like 70% likely that, if we make it through the singularity well, future civilizations will spend a decent amount of resources aligning incentives retroactively.
This isn't super confident, but you know, somewhat more likely than not.
(1) seems like evidence in favor of what I am saying. Inasmuch as we are not confident in our current DT candidates, it seems like we should expect future, much smarter people to be more correct. Us getting DT wrong is evidence that getting it right is less dependent on the incidental details of the people thinking about it.
(2) I mean, there are also literally millions of people in China with higher IQs than you who believe in spirits, and millions in the rest of the world who believe in the Christian god and disbelieve evolution. The correlation between correct DT and intelligence seems about as strong as it does for the theory of evolution (meaning reasonably strong in the human range, but the human range is narrow enough that it doesn't overdetermine the correct answer, especially when you don't have any reason to think hard about it).
(3) I am quite confident that pure CDT, which rules out retrocausal incentives, is false. I agree that I do not know the right way to run the math to understand when retrocausal incentives work and how important they are. But I really don't have much uncertainty about the former, so I don't really get your point here. I don't need to formalize these theories to make a confident prediction that any decision theory on which rewarding people after the fact for one-time actions cannot provide an incentive is false.
I expect it would still be annoying enough to only happen if there were significant gain, since training is a delicate and complex system and it's very expensive if things break, so there's rational resistance to added complexity.
Given trade secrets and everything, you might not be able to say anything about this, but my model of frontier post-training was that we kind of throw the kitchen sink at it in terms of RL environments. This pulls in a different kind of feedback, so it does add complexity that other RL environments don't add, but my sense is that post-training isn't that fragile to stuff like this, and we kind of throw all kinds of random things into post-training (and are generally pretty bottlenecked on any kind of RL feedback).
And as you're not backpropagating through the probe, it seems less concerning to me, though it could still break the probe.
Hmm, my guess is the model would still pretty quickly learn to make its thinking just not trigger the probe. My guess is that doing it with RL is worse because I expect it to generalize to more probes (whereas backpropagating through a specific probe will most likely just make that specific probe no longer detect anything).
Fortunately, it would be such a massive pain to change the highly optimised infrastructure stacks of frontier labs to use model internals in training that I think this is only likely to happen if there are major gains to be had and serious political will, whether for safety or otherwise. I would be very surprised if this happens in frontier model training in the near future, and I see this as a more speculative longer-term research bet.
I am confused about this. Can't you just do this in post-training in a pretty straightforward way? You do a forward pass, run your probe on the activations, and then use your probe output as part of the RL signal. Why would this require any kind of complicated infrastructure stack changes?
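To make concrete what I am imagining, here is a minimal sketch (the model, probe, layer index, and penalty coefficient are all made-up placeholders, and this is not a claim about anyone's actual stack): you score each rollout with a pretrained linear probe on intermediate activations and subtract that score from the ordinary task reward, so the RL loop only ever sees a scalar.

```python
# Sketch: using a linear probe's score as part of the RL reward in post-training.
# "gpt2" stands in for a policy model; PROBE_LAYER and PENALTY_COEF are made up.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# In practice this would be a probe trained beforehand on labeled activations;
# here it is just a randomly initialized stand-in.
probe = torch.nn.Linear(model.config.hidden_size, 1)
PROBE_LAYER = 6      # hypothetical layer the probe was trained on
PENALTY_COEF = 0.5   # hypothetical reward-shaping weight

def probe_penalty(prompt: str, completion: str) -> float:
    """Forward pass, read activations at PROBE_LAYER, score them with the probe,
    and return a penalty to subtract from the ordinary task reward."""
    inputs = tok(prompt + completion, return_tensors="pt")
    with torch.no_grad():  # probe output is used as a reward signal, not backpropped through
        out = model(**inputs, output_hidden_states=True)
        acts = out.hidden_states[PROBE_LAYER]              # [1, seq_len, hidden]
        score = torch.sigmoid(probe(acts)).mean().item()   # mean probe activation
    return PENALTY_COEF * score

def shaped_reward(prompt: str, completion: str, task_reward: float) -> float:
    # The policy-gradient update consumes this scalar like any other reward.
    return task_reward - probe_penalty(prompt, completion)
```

Nothing here touches the backward pass of the main training loop, which is why I don't see where the major infrastructure changes would come in (though I might be missing something about how reward signals are wired into frontier RL stacks).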
No, it seems highly unlikely. Considered from a purely commercial perspective - which I think is the right one when considering the incentives - they are terrible customers! Consider:
That is good news! Though to be clear, I expect the default path by which they would become your customers, after some initial period of them using your products or having some partnership with you, would be via acquisition, which I think avoids most of the issues you are talking about here (in general, "building an ML business with the plan of being acquired by a frontier company" has worked pretty well as a business model so far).
Whatever techniques end up being good are likely to be major modifications to the training stack that would be hard to integrate, so the options for doing such a deal without revealing IP are extremely limited, making cutting us out easy.
Agree on the IP point, but I am surprised that you say most techniques would end up being major modifications to the training stack. The default product I was imagining is "RL on interpretability proxies of unintended behavior", and I think you could do that purely in post-training. I might be wrong here, I haven't thought that much about it, but my guess is it would just work?
I do notice I feel pretty confused about what's going on here. Your investors clearly must have some path to profitability in mind, and it feels to me like frontier model training is really where all the money is. Do people expect lots of smaller specialized models to be deployed? What game in town is there for this kind of training technique other than frontier model training, if it does improve capabilities substantially?
You know your market better, so I do update when you say that you don't see your techniques being used for frontier model training, but I do find myself pretty confused about what the story in the eyes of your investors actually is (and you might not be able to tell me for one reason or another), and the flags I mentioned make me hesitant to update too much on your word here. So for now I will thank you for saying otherwise, make a medium-sized positive update, and would be interested if you could expand a bit on what the actual path to profitability is without routing through frontier model training. But I understand if you don't want to! I already appreciate your contributions here quite a bit.
I expect misalignment rates to be locally linear in intelligence [1]
I disagree! I think treacherous-turn dynamics will generally mean it's very hard to measure misalignment rates before capabilities exceed pivotal thresholds. I honestly have no idea how I would measure misalignment in current models at all (and think current "misalignment" measures tend to be, if anything, anticorrelated with the attributes I care about). I do think this will get a bit easier around critical thresholds, but there is definitely a decent chance we will cross a critical threshold in a single model release.
That said, in many worlds the relationship is still roughly local, and it's better than nothing to check for local stuff. I just don't think it's a good thing to put tons of weight on.
Yep, I didn't come up with anything great, but still open to suggestions.
I also make a commitment to us not working on self-improving superintelligence, which I was surprised to need to make but is apparently not a given?
Thank you, I do appreciate that!
I do have trouble understanding how this wouldn't involve a commitment to not provide your services to any of the leading AI capability companies, who have all stated quite straightforwardly that this is their immediate aim within the next 2-3 years. Do you not expect that leading capability companies will be among your primary customers?
I would only endorse using this kind of technique in a potentially risky situation like a frontier training run if we were able to find a strong solution to the train/test issue described here.
Oh, cool, that is actually also a substantial update for me. The vibe I had been getting was definitely that you expected to use these kinds of techniques pretty much immediately, with frontier training companies being among your top target customers.
I agree with you that train/test splits might help here, and now that I'm thinking about it, I am actually substantially in favor of people figuring out the effect sizes here and doing science in this space. I do think that, given y'all's recent commercialization focus (plus asking employees to sign non-disparagement agreements, and in some cases secret non-disparagement agreements), you are in a tricky position as an organization I feel like I could trust to be reasonably responsive to evidence of the actual risks here. So I don't currently think y'all are the best people to do that science, but it does seem important to acknowledge that science in this space seems pretty valuable.
I have goals that benefit from having hundreds of millions to billions of dollars. So do other people. Money is for steering the world. I can use money to hire other people and get them to do things I want. "Personal consumption" is not the reason why almost anyone tries to get rich!