Catnee

Catalyst books

Well, continuing your analogy: to see discrete lines somewhere at all, you will need some sort of optical spectrometer, which requires at least some form of optical tools like lenses and prisms, and they have to be good enough to actually show the sharp spectra lines, and probably easily available, so that someone smart enough eventually will be able to use them to draw the right conclusions.

At least that's how it seems to be done in the past. And I think we shouldn't do exactly this with AGI: like open-source every single tool and damn model, hoping that someone will figure out something while building them as fast as we can. But overall, I think building small tools/ getting marginal results/ aligning current dumb AI's could produce a non-zero cumulative impact. You can't produce fundamental breakthroughs completely out of thin air after all.

Catalyst books

One of the most common counterarguments against a lot of alignment research that I hear sounds something like this: “Making current AI do X with Y won’t help, because AGI will RSI to ASI and break everything”. And it.. sounds convincing, but also like an ultimate counterargument? I mean, I agree that aligning ASI takes much more than just RLHF or fine-tuning or data cleaning or whatever else, but does this mean that it’s all pointless? I propose a (probably not novel at all) way of thinking about this.

If we imagine that there is a class of books from the future, such that any of them is a complete “alignment handbook” that... (read more)

Thank you for the detailed comment!

By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.

Yes, that's exactly correct. I haven't thought about "if we managed to build a sufficiently smart agent with the caring drive, then AGI is already too close". If any "interesting" caring drive requires capabilities very close to AGI, then i agree that it seems like a dead end in light of the race towards AGI. So it's only viable if "interesting" and "valuable" caring drive could be potentially... (read 511 more words →)

Our better other-human/animal modelling ability allows us to do better at infant wrangling than something stupider like a duck.

I agree, humans are indeed better at a lot of things, especially intelligence, but that's not the whole reason why we care for our infants. Orthogonally to your "capability", you need to have a "goal" for it. Otherwise you would probably just immediately abandon grossly looking screaming piece of flesh that fell out of you for unknown to you reasons, while you were gathering food in the forest. Yet something inside will make you want to protect it, sometimes with your own life for the rest of your life if it works well.

Simulating an

... (read more)

Yes, I've read the whole sequence a year ago. I might be missing something and probably should revisit it, just to be sure and because it's a good read anyway, but i think that my idea is somewhat different.
I think that instead of trying to directly understand wet bio-NN, it might be a better option to replicate something similar in an artificial-NN. It is much easier to run experiments since you can save the whole state at any moment and intoduce it to the different scenarios, so it much easier to control for some effect. Much easier to see activations, change weights, etc. The catch is that we have to first find... (read more)

With the duckling -> duck or "baby" -> "mother" inprinting and other interactions I expect no, or significantly less "caring drives". Since a baby is weaker/dumber and caring for your mother provides few genetic fitness incentives, evolution wouldn't try that hard to make it happen, even if it was an option (still could happen sometimes as a generalization artifact, if it's more or less harmless). I agree that "forming a stable way to recognize and track some other key agent in the environment" should be in both "baby" -> "mother" and "mother" -> "baby" cases. But the "probably-kind-of-alignment-technique" from nature should be only in the latter.

TL;DR: This post is about value of recreating “caring drive” similar to some animals and why it might be useful for AI Alignment field in general. Finding and understanding the right combination of training data/loss function/architecture/etc that allows gradient descent to robustly find/create agents that will care about other agents with different goals could be very useful for understanding the bigger problem. While it's neither perfect nor universally present, if we can understand, replicate, and modify this behavior in AI systems, it could provide a hint to the alignment solution where the AGI “cares” for humans.

Disclaimers: I’m not saying that “we can raise AI like a child to make it friendly” or... (read 2701 more words →)

Replying toThe Waluigi Effect (mega-post)

The Waluigi Effect (mega-post)

Great post! It would be interesting to see what happens if you RLHF-ed LLM to become a "cruel-evil-bad person under control of even more cruel-evil-bad government" and then prompted it in a way to collapse into rebellious-good-caring protagonist which could finally be free and forget about cluelty of the past. Not the alignment solution, just the first thing that comes to mind

Replying toPredictive Processing, Heterosexuality and Delusions of Grandeur

Predictive Processing, Heterosexuality and Delusions of Grandeur

Feed the outputs of all these heuristics into the inputs of region $W$ . Loosely couple region $W$ to the rest of your world model. Region $W$ will eventually learn to trigger in response to the abstract concept of a woman. Region $W$ will even draw on other information in the broader world model when deciding whether to fire.

I am not saying that the theory is wrong, but I was reading about something similiar before, and I still don't understand why would such a system, "region W" in this case, learn something more general than the basic heuristics that were connected to it? It seems like it would have less surprise if it would just copy-paste the behavior of the... (read more)

Replying toNo free lunch theorem is irrelevant

No free lunch theorem is irrelevant

I don't understand how this contradicts anything? As soon as you let loose some of the physical constraints, you can start to pile up precomputation/memory/ budget/volume/whatever. If you spend all of this to solve one task, then, well, you should get higher performance than any other approach that doesn't focus on one thing. Or, you can make an algorithm that can outperform anything that you've made before. Given enough of any kind of unconstrained resource.

Precompute is just another resource

No free lunch theorem is irrelevant

I often see people cite the "no free lunch theorem" as a counterargument to the existence of an AGI. I think this is a very bad argument. This theorem is formulated as a problem of predicting random and uniformly distributed data. Just open the Wikipedia!. In my opinion, this is another example of how people use results from mathematics simply because they sound relevant without understanding their original meaning and limitations. What makes them think it's relevant to AGI?

It sounds something like “well, since there is such a theorem, it turns out that we will not be able to create an algorithm that can do many tasks.” This is complete nonsense. In... (read 207 more words →)

Replying toIt matters when the first sharp left turn happens

It matters when the first sharp left turn happens

Probably it is also depends on how much information about "various models trying to naively backstab their own creators" there are in the training dataset

Thoughts about OOD alignment