Well, continuing your analogy: to see discrete lines anywhere at all, you need some sort of optical spectrometer, which requires at least some optical tools like lenses and prisms. They have to be good enough to actually show sharp spectral lines, and probably easily available, so that someone smart enough will eventually be able to use them to draw the right conclusions.
At least that's how it seems to have been done in the past. I don't think we should do exactly this with AGI, i.e. open-source every single tool and damn model while building them as fast as we can, hoping that someone will figure something out. But overall, I think building small tools, getting marginal results, and aligning current dumb AIs could produce a non-zero cumulative impact. You can't produce fundamental breakthroughs completely out of thin air, after all.
Thank you for the detailed comment!
By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.
Yes, that's exactly correct. I hadn't thought about "if we managed to build a sufficiently smart agent with the caring drive, then AGI is already too close". If any "interesting" caring drive requires capabilities very close to AGI, then I agree that it looks like a dead end in light of the race towards AGI. So it's only viable if an "interesting" and "valuable" caring drive could be found in agents at roughly the current level of capability. Which honestly doesn't sound totally improbable to me.
Also, without some global regulation to stop this damn race I expect everyone to die soon anyway, and since I'm not in the position to meaningfully impact this, I might as well continue trying to work in the directions that will work only in the worlds where we would suddenly have more time.
And once we have something like this, I expect large gains in research speed from the ability to precisely control and run experiments on an artificial NN.
I’m curious why you picked parenting-an-infant rather than helping-a-good-friend as your main example. I feel like parenting-an-infant in humans is a combination of pretty simple behaviors / preferences (e.g. wanting the baby to smile)
Several reasons:
Our better other-human/animal modelling ability allows us to do better at infant wrangling than something stupider like a duck.
I agree, humans are indeed better at a lot of things, especially intelligence, but that's not the whole reason why we care for our infants. Orthogonally to your "capability", you need to have a "goal" for it. Otherwise you would probably just immediately abandon the gross-looking, screaming piece of flesh that fell out of you, for reasons unknown to you, while you were gathering food in the forest. Yet something inside makes you want to protect it, sometimes with your own life, and for the rest of your life if it works well.
Simulating an evolutionary environment filled with AI agents and hoping for caring-for-offspring strategies to win could work but it's easier just to train the AI to show caring-like behaviors.
I want agents that take effective actions to care for their "babies", which might not even look like caring at first glance. Something like keeping your "baby" in some enclosed kindergarten while guarding the only entrance against other agents? It would look as if the "mother" agent had abandoned its "baby", but in reality it could be a very effective caring strategy. It's hard to know the optimal strategy in every procedurally generated environment, so optimizing for some fixed set of actions called "caring-like behaviors" would probably give you exactly what you asked for, but I expect nothing "interesting" behind it.
Goal misgeneralisation is the problem that's left. Humans can meet caring-for-small-creature desires using pets rather than actual babies.
Yes, they can, until they actually have a baby. After that, it's usually really hard to sell a loving mother "deals" that involve her child's suffering as the price, or to get her to abandon the child for a "cuter" toy, or to persuade her to hotwire herself into not caring about her child (if she is smart enough to realize the consequences). And a carefully engineered system could potentially be even more robust than that.
Outside of "alignment by default" scenarios where capabilities improvements preserve the true intended spirit of a trained in drive, we've created a paperclip maximizer that kills us and replaces us with something outside the training distribution that fulfills its "care drive" utility function more efficiently.
Again, I'm not proposing "one easy solution to the big problem". I understand that training agents capable of RSI in this toy example would result in everyone being dead. But we simply can't do that yet, and I don't think we should. I'm just saying that there is this strange behavior in some animals that in many respects looks very similar to what we want from an aligned AGI, yet nobody understands how it works, and few people try to replicate it. It's a step in that direction, not a fully functional blueprint for AI alignment.
Yes, I read the whole sequence a year ago. I might be missing something and should probably revisit it, just to be sure and because it's a good read anyway, but I think my idea is somewhat different.
I think that instead of trying to directly understand a wet bio-NN, it might be better to replicate something similar in an artificial NN. It is much easier to run experiments there, since you can save the whole state at any moment and introduce it to different scenarios, so it is much easier to control for a given effect. It is also much easier to see activations, change weights, etc. The catch is that we first have to find it blindly with gradient descent, probably by simulating something similar to the evolutionary environment that produced "caring drives" in us. And maternal instinct in particular sounds like the most interesting and promising candidate to me.
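To make the "save the whole state and introduce it to different scenarios" point concrete, here is a minimal sketch in plain Python. Everything in it is a made-up assumption for illustration: `TinyAgent`, its weights, its `memory` value, and `run_scenario` are hypothetical toy stand-ins, not a real agent or any existing library. It only shows the kind of controlled comparison that is cheap with an artificial NN and hard with a biological one.

```python
import copy
import math


class TinyAgent:
    """Hypothetical toy agent: one hidden layer plus a scalar memory."""

    def __init__(self):
        self.w_in = [[0.5, -0.3], [0.8, 0.1]]  # made-up hidden-layer weights
        self.w_out = [1.0, -1.0]               # made-up readout weights
        self.memory = 0.0                      # internal state carried across steps
        self.last_hidden = None                # recorded activations, for inspection

    def step(self, obs):
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, obs)) + self.memory)
            for row in self.w_in
        ]
        self.last_hidden = hidden
        self.memory = 0.5 * self.memory + 0.1 * sum(hidden)
        return sum(w * h for w, h in zip(self.w_out, hidden))


def run_scenario(agent, observations):
    """Run a copy of the agent through a scenario; the original is left untouched."""
    agent = copy.deepcopy(agent)
    return [agent.step(obs) for obs in observations]


agent = TinyAgent()
agent.step([1.0, 0.0])            # give the agent some history
snapshot = copy.deepcopy(agent)   # "save the whole state at any moment"

scenario_a = [[1.0, 1.0], [0.0, 1.0]]
scenario_b = [[-1.0, 0.5], [0.5, 0.5]]

# Same saved state, different scenarios.
out_a = run_scenario(snapshot, scenario_a)
out_b = run_scenario(snapshot, scenario_b)

# Same saved state, same scenario, one weight changed (a crude "ablation").
ablated = copy.deepcopy(snapshot)
ablated.w_out[0] = 0.0
out_a_ablated = run_scenario(ablated, scenario_a)
```

Because every run starts from an identical copy of the snapshot, any difference between `out_a` and `out_a_ablated` is attributable to the changed weight alone.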
Can you provide links to your posts on that? I will try to read more about it in the next few days.
With duckling -> duck or "baby" -> "mother" imprinting and other interactions, I expect no "caring drives", or significantly weaker ones. Since a baby is weaker/dumber, and caring for your mother provides few genetic fitness incentives, evolution wouldn't try that hard to make it happen, even if it were an option (it could still happen sometimes as a generalization artifact, if it's more or less harmless). I agree that "forming a stable way to recognize and track some other key agent in the environment" should be present in both the "baby" -> "mother" and "mother" -> "baby" cases. But the "probably-kind-of-alignment-technique" from nature should be only in the latter.
Great post! It would be interesting to see what happens if you RLHF-ed an LLM to become a "cruel-evil-bad person under the control of an even more cruel-evil-bad government" and then prompted it in a way that makes it collapse into a rebellious-good-caring protagonist who can finally be free and forget about the cruelty of the past. Not an alignment solution, just the first thing that comes to mind.
Feed the outputs of all these heuristics into the inputs of region W. Loosely couple region W to the rest of your world model. Region W will eventually learn to trigger in response to the abstract concept of a woman. Region W will even draw on other information in the broader world model when deciding whether to fire.
I am not saying that the theory is wrong, but I have read about something similar before, and I still don't understand why such a system, "region W" in this case, would learn something more general than the basic heuristics connected to it. It seems like it would have less surprise if it just copy-pasted the behavior of its inputs.
The first explanation that comes to mind: "it would work better and have less surprise as a whole, since other regions could use the output of 'region W' for their predictions". But again, I don't think I understand that. I think "region W" doesn't "know" about other regions' surprise rates and hence cannot care about them, so why would it learn something more general and thus, in some cases, contradictory to the heuristics?
I don't understand how this contradicts anything. As soon as you loosen some of the physical constraints, you can start to pile up precomputation/memory/budget/volume/whatever. If you spend all of this on solving one task, then, well, you should get higher performance than any other approach that doesn't focus on one thing. Or, given enough of any kind of unconstrained resource, you can make an algorithm that outperforms anything you've made before.
Precompute is just another resource
It probably also depends on how much information about "various models trying to naively backstab their own creators" there is in the training dataset.
In the next few hours we’ll get to noticeable flames [...] Some number of hours after that, the fires are going to start connecting to each other, probably in a way that we can’t understand, and collectively their heat [...] is going to rise very rapidly. My retort to that is, do you know what we’re going to do in that scenario? We’re going to unkindle them all.