
Eric Schmidt on recursive self-improvement

In the next few hours we’ll get to noticeable flames [...] Some number of hours after that, the fires are going to start connecting to each other, probably in a way that we can’t understand, and collectively their heat [...] is going to rise very rapidly. My retort to that is, do you know what we’re going to do in that scenario? We’re going to unkindle them all.

Catalyst books

Catnee2y30

Well, continuing your analogy: to see discrete lines anywhere at all, you will need some sort of optical spectrometer, which requires at least some optical tools like lenses and prisms. They have to be good enough to actually show sharp spectral lines, and probably easily available, so that someone smart enough will eventually be able to use them to draw the right conclusions.

At least that's how it seems to have been done in the past. And I think we shouldn't do exactly this with AGI: like open-source every single tool and damn model, hoping that ...

Recreating the caring drive

Catnee2y10

Thank you for the detailed comment!

By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.

Yes, that's exactly correct. I haven't thought about "if we managed to build a sufficiently smart agent with the caring drive, then AGI is already too close". If any "interesting" caring drive requires capabilities very close to AGI, then I agree that it seems like a dead end in light of...

0Steven Byrnes2y

I think the “How do children learn?” section of this post is relevant. I really think that you are ascribing things to innate human nature that are actually norms of our culture. I think humans have a capacity to empathetically care about the well-being of another person, and that capacity might be more or less (or not-at-all) directed towards one’s children, depending on culture, age, circumstance, etc. Other than culture and non-parenting-specific drives / behaviors, I think infant-care instincts are pretty simple things like “hearing a baby cry is mildly aversive (other things equal, although one can get used to it)” and “full breasts are kinda unpleasant and [successful] breastfeeding is a nice relief” and “it’s pleasant to look at cute happy babies” and “my own baby smells good” etc. I’m not sure why you would expect those to “fail horribly in the modern world”? If we’re talking about humans, there are both altruistic and self-centered reasons to cooperate with peers, and there are also both altruistic and self-centered reasons to want one’s children to be healthy / successful / high-status (e.g. on the negative side, some cultures make whole families responsible for one person’s bad behavior, debt, blood-debt, etc., and on the positive side, the high status of a kid could reflect back on you, and also some cultures have an expectation that capable children will support their younger siblings when they’re an older kid, and support their elderly relatives as adults, so you selfishly want your kid to be competent). So I don’t immediately see the difference. Either way, you need to do extra tests to suss out whether the behavior is truly altruistic or not—e.g. change the power dynamics somehow in the simulation and see whether people start stabbing each other in the back. This is especially true if we’re talking about 24-year-old “kids” as you mention above; they are fully capable of tactical cooperation with their parents and vice-versa. In a simulation, if

Recreating the caring drive

Catnee2y21

Our better other-human/animal modelling ability allows us to do better at infant wrangling than something stupider like a duck.

I agree, humans are indeed better at a lot of things, especially intelligence, but that's not the whole reason why we care for our infants. Orthogonally to your "capability", you need to have a "goal" for it. Otherwise you would probably just immediately abandon the gross-looking, screaming piece of flesh that fell out of you, for reasons unknown to you, while you were gathering food in the forest. Yet something inside will make you wa...

1anithite2y

TLDR: If you want to do some RL/evolutionary open-ended thing that finds novel strategies, it will get goodharted horribly, and the novel strategies that succeed without gaming the goal may include things no human would want their caregiver AI to do. Game-playing RL architectures like AlphaStar and OpenAI Five have dead simple reward functions (win the game) and all the complexity is in the reinforcement learning tricks to allow efficient learning and credit assignment at higher layers. So child rearing motivation is plausibly rooted in cuteness preference along with re-use of empathy. Empathy plausibly has a sliding scale of caring per person which increases for friendships (reciprocal cooperation relationships) and relatives including children obviously. It similarly decreases for enemy combatants in wars up to the point they no longer qualify for empathy. ASI will just flat out break your testing environment. Novel strategies discovered by dumb agents doing lots of exploration will be enough. Alternatively the test is "survive in competitive deathmatch mode" in which case you're aiming for brutally efficient self replicators. The hope with a non-RL strategy or one of the many sort of RL strategies used for fine tuning is that you can find the generalised core of what you want within the already trained model and the surrounding intelligence means the core generalises well. Q&A fine tuning a LLM in English generalises to other languages. Also, some systems are architected in such a way that the caring is part of a value estimator and the search process can be made better up till it starts goodharting the value estimator and/or world model. Yes, once the caregiver has imprinted that's sticky. Note that care drive surrogates like pets can be just as sticky to their human caregivers. Pet organ transplants are a thing and people will spend nearly arbitrary amounts of money caring for their animals. But our current pets aren't super-stimuli. Pets will poop on the fl

Recreating the caring drive

Catnee2y10

Yes, I read the whole sequence a year ago. I might be missing something and probably should revisit it, just to be sure and because it's a good read anyway, but I think that my idea is somewhat different.
I think that instead of trying to directly understand a wet bio-NN, it might be a better option to replicate something similar in an artificial NN. It is much easier to run experiments, since you can save the whole state at any moment and introduce it to different scenarios, so it is much easier to control for some effect. Much easier to see activations, c...

Recreating the caring drive

Catnee2y12

With the duckling -> duck or "baby" -> "mother" imprinting and other interactions I expect no, or significantly less, "caring drive". Since a baby is weaker/dumber and caring for your mother provides few genetic fitness incentives, evolution wouldn't try that hard to make it happen, even if it was an option (it still could happen sometimes as a generalization artifact, if it's more or less harmless). I agree that "forming a stable way to recognize and track some other key agent in the environment" should be in both the "baby" -> "mother" and "mother" -> "baby" cases. But the "probably-kind-of-alignment-technique" from nature should be only in the latter.

1MiguelDev2y

Most neurons dedicated to babies are focused on the mouth, I believe, up until around 7 months. Neuronal pathways evolve over time, and it seems that a developmental approach to AI presents its own set of challenges. When growing an AI that already possesses mature knowledge and has immediate access to it, traditional developmental caring mechanisms may not fully address the enormity of the control problem. However, if the AI gains knowledge through a gradual, developmental process, this approach could be effective in principle.

The Waluigi Effect (mega-post)

Catnee2y10

Great post! It would be interesting to see what happens if you RLHF-ed an LLM to become a "cruel-evil-bad person under control of an even more cruel-evil-bad government" and then prompted it in a way to collapse into a rebellious-good-caring protagonist which could finally be free and forget about the cruelty of the past. Not an alignment solution, just the first thing that comes to mind.

Predictive Processing, Heterosexuality and Delusions of Grandeur

Catnee3y10

Feed the outputs of all these heuristics into the inputs of region $W$ . Loosely couple region $W$ to the rest of your world model. Region $W$ will eventually learn to trigger in response to the abstract concept of a woman. Region $W$ will even draw on other information in the broader world model when deciding whether to fire.

I am not saying that the theory is wrong, but I was reading about something similar before, and I still don't understand why such a system, "region W" in this case, would learn something more general than th...

2lsusr3y

A predictive processor doesn't just minimize prediction error. It minimizes free energy. Prediction error is only half of the free energy equation. Here is a partial explanation that I hope will begin to help clear up some confusion. Consider the simplest example. All the heuristics agree. Then region W will just copy and paste the inputs. But now suppose the heuristics disagree. Suppose 7 heuristics output a value of 1 and 2 heuristics output a value of -1. Region W could just copy the heuristics, but if it did then 2/9 of the region would have a value of -1 and 7/9 of the region would have a value of 1. A small region with two contradictory values has high free energy. If the region is tightly coupled to itself then the region will round the whole thing to 1 instead.
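The 7-versus-2 example can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a free-energy computation: the function name `region_summary` is hypothetical, and the majority-sign rounding is an assumed stand-in for what "tightly coupled to itself" would do.

```python
from fractions import Fraction


def region_summary(heuristic_outputs):
    """Summarize a list of +1/-1 heuristic outputs feeding one region.

    Returns (share of +1, share of -1, rounded value). The rounded value is
    just the sign of the mean, an assumed stand-in for a tightly coupled
    region settling on a single value.
    """
    n = len(heuristic_outputs)
    share_pos = Fraction(sum(1 for v in heuristic_outputs if v == 1), n)
    share_neg = Fraction(sum(1 for v in heuristic_outputs if v == -1), n)
    mean = Fraction(sum(heuristic_outputs), n)
    rounded = 1 if mean > 0 else -1 if mean < 0 else 0
    return share_pos, share_neg, rounded


# 7 heuristics output 1 and 2 heuristics output -1, as in the example.
votes = [1] * 7 + [-1] * 2
share_pos, share_neg, rounded = region_summary(votes)
print(share_pos, share_neg, rounded)  # 7/9 2/9 1
```

Copying the inputs would leave the region split 7/9 against 2/9; the rounding step replaces that split with the single value 1.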

No free lunch theorem is irrelevant

Catnee3y20

I don't understand how this contradicts anything? As soon as you loosen some of the physical constraints, you can start to pile up precomputation/memory/budget/volume/whatever. If you spend all of this to solve one task, then, well, you should get higher performance than any other approach that doesn't focus on one thing. Or, given enough of any kind of unconstrained resource, you can make an algorithm that outperforms anything you've made before.

Precompute is just another resource

It matters when the first sharp left turn happens

Catnee3y10

It probably also depends on how much information about "various models trying to naively backstab their own creators" there is in the training dataset.

Thoughts about OOD alignment

Catnee3y10

I think it depends on "alignment to what?". If we talk about the evolutionary process, then sure, we have a lot of examples like that. My idea was more about "humans can be aligned to their children by some mechanism which was found by evolution, and this is somewhat robust".

So if we think about "how our attachment to something not-childish aligned with our children" well... technically, we will spend some resources on our pets, but it usually never really affects the welfare of our children in any notable way. So it is an acceptable failure, I guess? I wo...

Thoughts about OOD alignment

Catnee3y21

Thank you for your detailed feedback. I agree that evolution doesn't care about anything, but I think that baby-eater aliens would not think that way. They can probably think about evolution aligning them to eat babies, but in their case it is an alignment of their values to them, not to any other agent/entity.

In our story we somehow care about somebody else, and it is their story that ends up with the "happy end". I also agree that, given enough time, we will probably end up no longer caring about babies who we think cannot reproduce anymore, but it will be a m...

3Charlie Steiner3y

Ah, I see what you mean and that I made a mistake - I didn't understand how your post was about human mothers being aligned with their children, not just with evolution. To some extent I think my comment makes sense as a reply, because trying to optimize[1] a black-box optimizer for fitness of a "simulated child" is still going to end up with the "mother" executing kludgy strategies, rather than recapitulating evolution to arrive at human-like values. EDIT: Of course my misunderstanding makes most of my attempt to psychologize you totally false. But my comment also kinda doesn't make sense, because since I didn't understand your post I somewhat-glaringly don't mention other key considerations. For example: mothers who love their children still want other things too, so how are we picking out what parts of their desires are "love for children"? Doing this requires an abstract model of the world, and that abstract model might "cheat" a little by treating love as a simple thing that corresponds to optimizing for the child's own values, even if it's messy and human. A related pitfall is if you're training an AI to take care of a simulated child, thinking about this process using the abstract model we use to think about mothers loving their children will treat "love" as a simple concept that the AI might hit upon all at once. But that intuitive abstract model will not treat ruthlessly exploiting the simulated child's programming to get a high score by pushing it outside of its intended context as something simple, even though that might happen. 1. ^ especially with evolution, but also with gradient descent

Thoughts about OOD alignment

Catnee3y21

Yes, exactly. That's why I think that current training techniques might not be able to replicate something like that. The algorithm should not "remember" previous failures and try to game them/adapt by changing weights and memorising, but I don't have concrete ideas for how we can do it the other way.

Godzilla Strategies

Catnee3y2110

I am not saying that alignment is easy to solve, or that failing it would not result in catastrophe. But all these arguments seem like universal arguments against any kind of solution at all. Just because it will eventually involve some sort of Godzilla. It is like somebody tries to make a plane that can fly safely and not fall from the Sky, and somebody keeps repeating "well, if anything goes wrong in your safety scheme, then the plane will fall from the Sky" or "I notice that your plane is going to fly in the Sky, which means it can potentially fall from...

2Jeff Rose3y

It suggests putting more weight on a plan to get AI Research globally banned. I am skeptical that this will work (though if burning all GPUs would be a pivotal act the chances of success are significantly higher), but it seems very unlikely that there is a technical solution either. In addition, at least some purported technical solutions to AI risk seem to meaningfully increase the risk to humanity. If you have someone creating an AGI to exercise sufficient control over the world to execute a pivotal act, that raises the stakes of being first enormously which incentivizes cutting corners. And, it also makes it more likely that the AGI will destroy humanity and be quicker to do so.

johnswentworth3y1810

I expect there are ways of dealing with Godzilla which are a lot less brittle.

If we have excellent detailed knowledge of Godzilla's internals and psychology, we know what sort of things will drive Godzilla into a frenzy or slow him down or put him to sleep, we know how to get Godzilla to go in one direction rather than another, if we knew when and how tests on small lizards would generalize to Godzilla... those would all be robustly useful things. If we had all those pieces plus more like them, then it starts to look like a scenario where dealing with Godz...

AGI Ruin: A List of Lethalities

Catnee3y2-1

If this is "kind of a test for capable people", I think it should remain unanswered, so anyone else could try. My take would be: because if 222+222=555, then 446 = 223+223 = 222+222+1+1 = 555+1+1 = 557. With this trick "+" and "=" stop meaning anything; any number could be equal to any other number. If you truly believe in one such exception, the whole of arithmetic ceases to exist, because now you could get any result you want by following simple loopholes, and you will either continue to be paralyzed by your own beliefs, or will correct yourself.
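The collapse described here can be written out step by step. A sketch, using only ordinary arithmetic plus the one assumed false equation:

```latex
\begin{align*}
222 + 222 &= 555 && \text{(the assumed false belief)} \\
222 + 222 &= 444 && \text{(ordinary arithmetic)} \\
\Rightarrow\quad 444 &= 555 \\
\Rightarrow\quad 0 &= 111 && \text{(subtract 444 from both sides)} \\
\Rightarrow\quad 0 &= 1 && \text{(divide both sides by 111)} \\
\Rightarrow\quad 0 &= n \ \text{for every } n && \text{(multiply both sides by } n\text{)}
\end{align*}
```

Since every number then equals 0, any two numbers equal each other, which is the sense in which "=" stops meaning anything.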

5lc3y

This is what I meant by "leads to other incorrect beliefs", so apparently not.

Catnee3y30

Thank you for the reply.

You make it sound like Elon Musk founded OpenAI without speaking to anyone in X-risk

I didn't know about that. It was a good move from EA, so why not try it again? Again, I'm not saying that we definitely need to make a badge on Twitter. First of all, we can try to change Elon's models, and after that we can think about what to do next.

2. Musk's inability to follow arguments related to why Neuralink is not a good plan to avoid AI risk.

Well, if it is conditional on: "there are widespread concerns and regulations about AGI" and "neuralink is working ...

2abramdemski3y

My low-evidence impression is that there was a fair amount of repeated contact at one time. If it's true that that contact hasn't happened recently, it's probably because it hit diminishing returns in comparison with other things. I doubt people were in touch with Elon and then just forgot about the idea. So I conclude that the remaining disagreements with Elon are probably not something that can be addressed within a short amount of time, and would require significantly longer discussions to make progress on.

How Might an Alignment Attractor Look like?

Catnee3y50

I think the problem is not that an unaligned AGI doesn't understand human values. It might understand them better than an aligned one, and it might understand all the consequences of its actions; the problem is that it will not care about them. Moreover, a detailed understanding of human values has instrumental value: it is much easier to deceive and follow your goal when you have a clear vision of "what will look bad and might result in countermeasures".


All of Catnee's Comments + Replies