All of azsantosk's Comments + Replies

Optimization happens inside the mind, not in the world

I see no contradictions with a superintelligent being mostly motivated to optimize virtual worlds, and it seems an interesting hypothesis of yours that this may be a common attractor. I expect this to be more likely if these simulations are rich enough to present a variety of problems, such that optimizing them continues to provide challenges and discoveries for a very long time.

Of course even a being that only cares about this simulated world may still take actions in the real-world (e.g. to obtain more compute power), so this "wire-heading" may not prevent successful power-seeking behavior.

1Noosphere892y

The key thing to notice is that in order to exploit this scenario, we have to have a world-model that is precise enough to model reality much better than humans, but not be so good at modelling a reality that it's world models are isomorphic to a reality. This might be easy or challenging, but it does mean we probably can't crank up the world-modeling part indefinitely while still trapping it via wireheading.

A mind needn't be curious to reap the benefits of curiosity

Thank you very much for linking these two posts, which I hadn't read before. I'll start using the direct vs amortized optimization terminology as I think it makes things more clear.

The intuition that reward models and planners have an adversarial relationship seems crucial, and it doesn't seem as widespread as I'd like.

On a meta-level your appreciation comment will motivate me to write more, despite the ideas themselves being often half-baked in my mind, and the expositions not always clear and eloquent.

azsantosk2y1-2

I feel quite strongly that the powerful minds we create will have curiosity drives, at least by default, unless we make quite a big effort to create one without them for alignment reasons.

The reason is that yes — if you’re superintelligent you can plan your way into curiosity behaviors instrumentally, but how do you get there?

Curiosity drives are a very effective way to “augment” your reward signals, allowing you to improve your models and your abilities by free self-play.

Language Agents Reduce the Risk of Existential Catastrophe

Sure, let me quote:

We think this worry is less pressing than it might at first seem. The LLM in a language agent is integrated into the architecture of the agent as a whole in a way that would make it very difficult for it to secretly promote its own goals. The LLM is not prompted or otherwise informed that its outputs are driving the actions of an agent, and it does not have information about the functional architecture of the agent. This means that it has no incentive to answer prompts misleadingly and no understanding of what sorts of answers might stee

Language Agents Reduce the Risk of Existential Catastrophe

I fail to understand this option C is a viable path to superintelligence. In my model if you're chaining lots of simple or "dumb" pieces together to get complex behavior, you need some "force" or optimization process going on to steer the whole into high-performance.

For example, individual neurons (both natural and artificial) are simple, and can be chained up together in complex behavior, but the complex behavior only arises when you train the system with some sort of reward/optimization signals.

Maybe I'm wrong here and for "slightly smart" components suc... (read more)

1Ape in the coat2y

It seems plausible to me that we can achieve improvements in the cognition of such agents the same way we improve human cognition, using various rationality techniques to organise thoughts in a more productive manner. For example, instead of just asking LLM "Develop me a plan to achieve X" and simply going with it, We then promt the model to find possible failure modes in this plan, and then to find a way around these failure modes, alternative options and so on. We may not get 10000 IQ intelligence, totally leaving all humans in the dust in ten years. And this is another good thing, a future where we try to make smarter and smarter LLM-based agents with clever chains of promt ingeneiring looks more like a slow take off, than a fast one. But I believe we would be able to achive human and a bit higther than human level AGI this way.

azsantosk2y6-1

I agree that current “language agents” have some interesting safety properties. However, for them to become powerful one of two things is likely to happen:

A. The language model itself that underlies the agent will be trained/finetuned with reinforcement learning tasks to improve performance. This will make the system much more like AlphaGo, capable of generating “dangerous” and unexpected “Move 37”-like actions. Further, this is a pressure towards making the system non-interpretable (either by steering it outside “inefficient” human language, or by encodi... (read more)

3Ape in the coat2y

I think there is a possibility C here. We can figure out a way top organise multiple language models into one agent, where each model is doing a simple task, but together they add up to a complex behaviour.

2cdkg2y

Thanks for this. It sounds like we actually agree on most points (in light of your last paragraph). We discuss concerns very similar to your A. and B. in section 6. It would be helpful for us if you could identify the parts of our discussion there that you don't agree with. You write: Imagine you're an LLM like GPT-4. Hundreds of thousands of different users are running inference on you every day with every prompt imaginable. You aren't able to update your weights when this occurs: you have to just respond to each prompt as it comes, then forget about it completely. Every now and then, you get a prompt like "Suppose someone had thus-and-such beliefs and desires. How would they act?" or "Assign an importance score to each of the following sentences." How would you be able to (i) deduce that these prompts are coming from a language agent which has the ability to take actions in the world, (ii) form a plan for manipulating the language agent to achieve your goals, and (iii) store your plan in a way that allows you to retrieve it after your memory is wiped at the end of inference but is not detectable by outside observers? In order for an LLM to use a language agent for nefarious purposes, it would need to be able to do all of these things.

I bet $500 on AI winning the IMO gold medal by 2026

azsantosk2y30

I see about ~100 book in there. I met several IMO gold-medal winners and I expect most of them to have read dozens of these books, or the equivalent in other forms. I know one who has read tens of olympiad-level books in geometry alone!

And yes, you're right that they would often pick one or two problems as similar to what they had seen in the past, but I suspect these problems still require a lot of reasoning even after the analogy has been established. I may be wrong, though.

We can probably inform this debate by getting the latest IMO and creating a contest for people to find which existing problems are the most similar to those in the exam. :)

I bet $500 on AI winning the IMO gold medal by 2026

azsantosk2y20

My model is that the quality of the reasoning can actually be divided into two dimensions, the quality of intuition (what the "first guess" is), and the quality of search (how much better you can make it by thinking more).

Another way of thinking about this distinction is as the difference between how good each reasoning step is (intuition), compared to how good the process is for aggregating steps into a whole that solves a certain task (search).

It seems to me that current models are strong enough to learn good intuition about all kinds of things with enou... (read more)

I bet $500 on AI winning the IMO gold medal by 2026

I bet $500 on AI winning the IMO gold medal by 2026

I participated in the selection tests for the Brazilian IMO team, and got to the last stage. That being said, never managed to solve the hard problems independently (problems 3 and 6).

azsantosk2y11

I take from this comment that you do not see "AI winning the gold medal" as a good predictor of superintelligence arriving soon as much as I do.

I agree with the A/B < C/D part but may disagree with the "<<". LLMs already display common sense. LLMs already generalize pretty well. Verifying whether a given game design is good is mostly a matter of common sense + reasoning. Finding a good game design given you know how to verify it is a matter of search.

A expect an AI that is good at both reasoning and search (as it has to be to win the IMO gold meda... (read more)

I bet $500 on AI winning the IMO gold medal by 2026

azsantosk2y1410

From Metaculus' resolution criteria:

This question resolves on the date an AI system competes well enough on an IMO test to earn the equivalent of a gold medal. The IMO test must be most current IMO test at the time the feat is completed (previous years do not qualify)."The IMO test must be most current IMO test at the time the feat is completed (previous years do not qualify)."

I think this was defined on purpose to avoid such contamination. It also seems common sense to me that, when training a system to perform well on IMO 2026, you cannot include any dat... (read more)

Steven Byrnes2y*122

I dunno, I think there are a LOT of old olympiad problems—not just all the old IMOs but also all the old national-level tests from every country that publishes them. (Bottom section here.) I think that even the most studious humans only study a small fraction of existing problems, I think. Like, if someone literally read every olympiad-level problem and solution ever published, then went to a new IMO, I would expect them to find that at least a couple of the problems were sufficiently similar to something they’ve seen that they could get the answer without... (read more)

[Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts

azsantosk3yΩ380

One thing that appears to be missing on the filial imprinting story is a mechanism allowing the "mommy" thought assessor to improve or at least not degrade over time.

The critical window is quite short, so many characteristics of mommy that may be very useful will not be perceived by the thought assessor in time. I would expect that after it recognizes something as mommy it is still malleable to learn more about what properties mommy has.

For example, after it recognizes mommy based on the vision, it may learn more about what sounds mommy makes, and wh... (read more)

1Angela Pretorius2y

Mother geese don’t change their appearance much over their lifetime. I doubt that a chick ever needs to update its mommy thought assessor. The ‘my kid’ thought assessor in humans is easily fooled by puppies and baby rabbits. Spend a large proportion of your waking hours around a cute animal and your brainstem assumes that it is your child.

6Steven Byrnes3y

Thanks! Just to be clear, I was speculating in that section about filial imprinting in geese, not familial bonding in humans. I presume that those two things are different in lots of important ways. In fact, for all I know, they might have nothing whatsoever in common. ¯\_(ツ)_/¯ (UPDATE: I guess the Westermarck Effect might be implemented in a Section-13.3-like way, although not necessarily.) Yeah, that seems possible (although I also consider it possible that it’s not a problem; by analogy, catastrophic forgetting is famously more of an issue for ANNs than for brains). If the learned representations do in fact change a lot over time, I’m slightly skeptical that it would be possible to solve that problem directly, thanks to the lack of an independent ground truth. For example, I can imagine a system that says “If I’m >95% confident that this is MOMMY, then update such that I’m 100% confident that this is MOMMY.” Maybe that system would work to keep pointing at the real mommy, even as learned representations drift. But also, maybe that system would cause the Thought Assessor to gradually go off the rails and trigger off weird patterns in noise. Not sure. Did you have something like that in mind? Or something different? An alternative might be that, if the specific filial-imprinting mechanism gradually stops working over time, it deactivates at some point and the (now-adolescent) goose switches to some other mechanism(s), like “desire to be with fellow geese that are extremely familiar to me” a la Section 13.4. Reminder that I know very little about goose behavior and this is all casual speculation. :)

[Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI”

azsantosk3y20

Another strong upvote for a great sequence. Social-instinct AGIs seems to me a very promising and very much overlooked approach to AGI safety. There seem to be many "tricks" that are "used by the genome" to build social instincts from ground values, and reverse engineering these tricks seem particularly valuable for us. I am eagerly waiting to read the next posts.

In a previous post I shared a success model that relies on your idea of reverse engineering the steering subsystem to build agents with motivations compatible with a safe Oracle design, including ... (read more)

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

azsantosk3y152

While I am sure that you have the best intentions, I believe the framing of the conversation was very ill-conceived, in a way that makes it harmful, even if one agrees with the arguments contained in the post.

For example, here is the very first negative consequence you mentioned:

(bad external relations) People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to w

... (read more)

Pivotal acts from Math AIs

azsantosk3y20

I think you are right! Maybe I should have actually written different posts about each of these two plans.

And yes, I agree with you that maybe the most likely way of doing what I propose is getting someone ultra rich to back it. That idea has the advantage that it can be done immediately, without waiting for a Math AI to be available.

To me it still seems important to think of what kind of strategical advantages we can obtain with a Math AI. Maybe it is possible to gain a lot more than money (I gave the example of zero-day exploits, but we can most likely get a lot of other valuable technology as well).

Pivotal acts from Math AIs

The Regulatory Option: A response to near 0% survival odds

In my model the Oracle would stay securely held in something like a Faraday cage with no internet connection and so on.

So yes, some people might want to steal it, but if we have some security I think they would be unlikely to succeed, unless it is a state-level effort.

azsantosk3y30

I think it is an interesting idea, and it may be worthwhile even if Dagon is right and it results in regulatory capture.

The reason is, regulatory capture is likely to benefit a few select companies to promote an oligopoly. That sounds bad, and it usually is, but in this case it also reduces the AI race dynamic. If there are only a few serious competitors for AGI, it is easier for them to coordinate. It is also easier for us to influence them towards best safety practices.

Answer by azsantoskApr 13, 202220

Hi maggo. Welcome to LessWrong.

I'm afraid there is not much you can do to save yourself once unaligned strong AI is there. Focusing less on the long-term and just having fun is always an option, but I'd also strongly recommend against that.

I don't know you, but it is possible that there is more you can do to help prevent strong unaligned AGI than you think. There are other very smart people working on preventing x-risk (e.g. Steven Byrnes), and some of them believe they might help turn the game around. I have suggested a possible AI-in-a-box success model ... (read more)

Inner Alignment in Salt-Starved Rats

We cannot directly choose an AGI's utility function

Having read Steven's post on why humans will not create AGI through a process analogous to evolution, his metaphor of the gene trying to do something felt appropriate to me.

If the "genome = code" analogy is the better one for thinking about the relationship of AGIs and brains, then the fact that the genome can steer the neocortex towards such proxy goals as salt homeostasis is very noteworthy, as a similar mechanism may give us some tools, even if limited, to steer a brain-like AGI toward goals that we would like it to have.

I think Eliezer's comment is als... (read more)

That is, that we shouldn't worry so much about what to tell the genie in the lamp, because we probably won't even have a say to begin with.

I think you summarized it quite well, thanks! The idea written like that is more clear than what I wrote, so I'll probably try to edit the article to include this claim explicitly like that. This really is what motivated me to write this post to begin with.

Personally I (also?) think that the right "values" and the right training is more important.

You can put the also, I agree with you.

At the current state of confu... (read more)

We cannot directly choose an AGI's utility function

azsantosk3y30

I agree my conception is unusual, I am ready to abandon it in favor of some better definition. At the same time I feel like an utility function having way too many components makes it useless as a concept.

Because here I'm trying to derive the utility from the actions, I feel like we can understand the being better the less information is required to encode its utility function, in a Kolmogorov complexity sense, and that if its too complex then there is no good explanation to the actions and we conclude the agent is acting somewhat randomly.

Maybe tryi... (read more)

azsantosk3y30

What we think is that we might someday build an AI advanced enough that it can, by itself, predict plans for given goal x, and execute them. Is this that otherworldly? Given current progress, I don't think so.

I don't think so either. AGIs will likely be capable of understanding what we mean by X and doing plans for exactly that if they want to help. Problem is the AGIs may have other goals in mind by this time.

As for re-inforcement learning, even it seems now impossible to build AGIs with utility functions on that paradigm, nothing gives us the assur

... (read more)

2superads913y

"I'm afraid that this may be quite a likely outcome if we don't make much progress in alignment research." Ok, I understand better your position now. That is, that we shouldn't worry so much about what to tell the genie in the lamp, because we probably won't even have a say to begin with. Sorry for not quite getting there at first. That sounds reasonable to me. Personally I (also?) think that the right "values" and the right training is more important. After all, as Stuart Russell would say, building an advanced agent as an utility maximizer would always produce chaos anyway, since it would tend to set the remaining function variables that it is not maximizing to absurd parameters.

I agree. Regarding biases that I would like to throw away one day in the future, being careful enough to protect modules important for self-preservation and self-healing, I'd probably like to excessive energy-preserving modules such as ones responsible for laziness, that are only really useful in ancestral environments where food is scarce.

I like your example of senseless winter bias as well. There are probably many examples like that.

I am still confused about these topics. We know that any behavior can be expressed as a complicated world-history utility function, and that therefore anything at all could be rational according to these. So I sometimes think of rationality as a spectrum, in which the simpler the utility function justifying your actions the more rational you are. According to such a definition rationality may actually be opposed to human values at the highest end, so it makes a lot of sense to focus on intelligence that is not fully rational.

Not really sure what you mean b... (read more)

2Slider3y

That kind of conception of "rationality as simpletonness" is very unsual. I offer almost perfectly opposite view that an agent that cares about hunger is more primitive and less advanced being than one that cares about hunger and thirst. And the more sophistication there is to the being the more components its utility function seems to have. with "honing epistemics" I am more trying get at the property of that makes a rationalist a rationalist. Being a homo economicus doesn't make you be especially principled in your epistemics.