All of porby's Comments + Replies

Instrumentality exists on the simulacra level, not the simulator level.  This would suggest that corrigibility could be maintained by establishing a corrigible character in context.  Not clear on the practical implications.

That one, yup. The moment you start conditioning (through prompting, fine tuning, or otherwise) the predictor into narrower spaces of action, you can induce predictions corresponding to longer term goals and instrumental behavior. Effective longer-term planning requires greater capability, so one should expect this kind of thin... (read more)

4Noosphere89
To uncover the generators of this: I think one reason is that inductive biases turned out to matter little, enabling you to avoid having to do simulated evolution, which is where I think a lot of the danger lies. Combined with sparse RL not generally working very well at low compute, and early AI needing a surprising amount of structure/world models, this allows you to somewhat safely automate research.

There are lots of little things when it's not at a completely untenable level. Stuff like:

  1. Going up a flight or three of steps and really feeling it in my knees, slowing down, and saying 'hoo-oof.'
  2. Waking up and stepping out of bed and feeling general unpleasantness in my feet, ankles, knees, hips, or back.
  3. Quickly seeking out places to sit when walking around, particularly if there's also longer periods of standing, because my back would become terribly stiff.
  4. Walking on uneven surfaces and having a much harder time catching myself when I stumbled, not infreq
... (read more)

Hey, we met at EAGxToronto : )

🙋‍♂️

So my model of progress has allowed me to observe our prosaic scaling without surprise, but it doesn't allow me to make good predictions since the reason for my lack of surprise has been from Vingean prediction of the form "I don't know what progress will look like and neither do you".

This is indeed a locally valid way to escape one form of the claim—without any particular prediction carrying extra weight, and the fact that reality has to go some way, there isn't much surprise in finding yourself in any given world.

I do t... (read more)

I've got a fun suite of weird stuff going on[1], so here's a list of sometimes-very-N=1 data:

  1. Napping: I suck at naps. Despite being very tired, I do not fall asleep easily, and if I do fall asleep, it's probably not going to be for just 5-15 minutes. I also tend to wake up with a lot of sleep inertia, so the net effect of naps on alertness across a day tends to be negative. They also tend to destroy my sleep schedule. 
  2. Melatonin: probably the single most noticeable non-stimulant intervention. While I'm by-default very tired all the time, it's still har
... (read more)
1Sheikh Abdur Raheem Ali
Great writeup, strong upvoted. I'd share a similar list of N=1 data I wrote in a facebook post a few years ago but I'm currently unable to access the site due to a content blocker.
3Johannes C. Mayer
Probably not useful, but just in case, here are some other medications that are prescribed for narcolepsy (i.e. stuff that makes you not tired):

* Pitolisant
* Solriamfetol

Solriamfetol is supposed to be more effective than Modafinil. Possibly hard to impossible to get without a prescription. Haven't tried that yet. Pitolisant is interesting because it has a novel mechanism of action. Possibly impossible to get even with a prescription, as it is super expensive if you don't have the right health insurance. For me, it did not work that well. Only lasted 2-4 hours, and taking multiple doses makes me unable to sleep.

But I disagree that there’s no possible RL system in between those extremes where you can have it both ways.

I don't disagree. For clarity, I would make these claims, and I do not think they are in tension:

  1. Something being called "RL" alone is not the relevant question for risk. It's how much space the optimizer has to roam.
  2. MuZero-like strategies are free to explore more space than something like current applications of RLHF. Improved versions of these systems working in more general environments have the capacity to do surprising things and will tend to be
... (read more)

It does still apply, though what 'it' is here is a bit subtle. To be clear, I am not claiming that a technique that is reasonably describable as RL can't reach extreme capability in an open-ended environment.

The precondition I included is important:

in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function

In my frame, the potential future techniques you mention are forms of op... (read more)

4Steven Byrnes
I agree that in the limit of an extremely structured optimizer, it will work in practice, and it will wind up following strategies that you can guess to some extent a priori. I also agree that in the limit of an extremely unstructured optimizer, it will not work in practice, but if it did, it will find out-of-the-box strategies that are difficult to guess a priori. But I disagree that there’s no possible RL system in between those extremes where you can have it both ways. On the contrary, I think it’s possible to design an optimizer which is structured enough to work well in practice, while simultaneously being unstructured enough that it will find out-of-the-box solutions very different from anything the programmers were imagining. Examples include:

* MuZero: you can’t predict a priori what chess strategies a trained MuZero will wind up using by looking at the source code. The best you can do is say “MuZero is likely to use strategies that lead to its winning the game”.
* “A civilization of humans” is another good example: I don’t think you can look at the human brain neural architecture and loss functions etc., and figure out a priori that a civilization of humans will wind up inventing nuclear weapons. Right?

Calling MuZero RL makes sense. The scare quotes are not meant to imply that it's not "real" RL, but rather that the category of RL is broad enough that belonging to it does not constrain expectations much in the relevant way. The thing that actually matters is how much the optimizer can roam in ways that are inconsistent with the design intent.

For example, MuZero can explore the superhuman play space during training, but it is guided by the structure of the game and how it is modeled. Because of that structure, we can be quite confident that the optimizer isn't going to wander down a path to general superintelligence with strong preferences about paperclips.

7Steven Byrnes
Right, and that wouldn’t apply to a model-based RL system that could learn an open-ended model of any aspect of the world and itself, right? I think your “it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function” should have some caveat that it only clearly applies to currently-known techniques. In the future there could be better automatic-world-model-builders, and/or future generic techniques to do automatic unsupervised reward-shaping for an arbitrary reward, such that AIs could find out-of-the-box ways to solve hard problems without handholding.

I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.

Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps.

I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but... (read more)

Answer by porby7211

"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.
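A toy check of that first claim (my own sketch, not from the original comment): with a softmax "policy" over three tokens and reward equal to the indicator of the target token, the exact policy gradient is just the supervised cross-entropy gradient scaled by p(target), so the RL formulation adds intermediate steps without pointing anywhere new.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [0.5, -1.2, 2.0]
target = 0
p = softmax(logits)

# Supervised gradient (ascent on log p(target)): one-hot(target) - p
sup = [(1.0 if i == target else 0.0) - p[i] for i in range(3)]

# Exact policy gradient of E[r] with r(a) = 1[a == target]:
#   sum_a p(a) * r(a) * grad log p(a) = p(target) * (one-hot(target) - p)
pg = [0.0, 0.0, 0.0]
for a in range(3):
    r = 1.0 if a == target else 0.0
    grad_log = [(1.0 if i == a else 0.0) - p[i] for i in range(3)]
    pg = [g + p[a] * r * gl for g, gl in zip(pg, grad_log)]

# the "RL" gradient is the supervised gradient scaled by p(target)
assert all(abs(pg[i] - p[target] * sup[i]) < 1e-9 for i in range(3))
```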

The source of spookiness

Consider two opposite extremes:

  1. A sparse, distant reward function. A biped must successfully climb a mountain 15 kilometers to the east before getting any reward at all.
  2. A densely shaped reward function. At every step during the climb up the mountain, there is a reward designed to induce gradients that maximize training performanc
... (read more)
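A toy illustration of why the first extreme is spooky (my stand-in, far simpler than the biped example): under a random policy on a 1-D walk, a sparse distant reward is almost never touched, while a shaped per-step reward gives signal on nearly every episode.

```python
import random

random.seed(0)
GOAL, STEPS = 15, 20  # distant goal, short episodes

def episode():
    pos, sparse_r, dense_r = 0, 0.0, 0.0
    for _ in range(STEPS):
        step = random.choice([-1, 1])
        dense_r += step          # shaped: reward progress at every step
        pos += step
        if pos >= GOAL:
            sparse_r = 1.0       # sparse: only on reaching the distant goal
            break
    return sparse_r, dense_r

sparse_hits = dense_signal = 0
for _ in range(1000):
    s, d = episode()
    sparse_hits += (s > 0)
    dense_signal += (d != 0)
# expectation: almost no episodes touch the sparse reward, while nearly all
# produce a nonzero shaped signal the optimizer can follow
```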
4Steven Byrnes
I guess when you say “powerful world models”, you’re suggesting that model-based RL (e.g. MuZero) is not RL but rather “RL”-in-scare-quotes. Was that your intention? I’ve always thought of model-based RL as a central subcategory within RL, as opposed to an edge-case.
3dil-leik-og
thinking at the level of constraints is useful. very sparse rewards offer less constraints on final solution. imitation would offer a lot of constraints (within distribution and assuming very low loss).

a way to see RL/supervised distinction dissolve is to convert back and forth. With a reward as negative token prediction loss, and actions being the set of tokens, we can simulate auto-regressive training with RL (as mentioned by @porby). conversely, you could first train RL policy and then imitate that (in which case why would imitator be any safer?).

also, the level of capabilities and the output domain might affect the differences between sparse/dense reward. even if we completely constrain a CPU simulator (to the point that only one solution remains), we still end up with a thing that can run arbitrary programs. At the point where your CPU simulator can be used without performance penalty to do the complex task that your RL agent was doing, it is hard to say which is safer by appealing to the level of constraints in training. i think something similar could be said of a future pretrained LLM that can solve tough RL problems simply by being prompted to "simulate the appropriate RL agent", but i am curious what others think here.
5the gears to ascension
Oh this is a great way of laying it out. Agreed on many points, and I think this may have made some things easier for me to see, likely some of that is actual update that changes opinions I've shared before that you're disagreeing with. I'll have to ponder.
5Chris_Leong
Oh, this is a fascinating perspective. So most uses of RL already just use a small bit of RL. So if the goal was "only use a little bit of RL", that's already happening. Hmm... I still wonder if using even less RL would be safer still.

Stated as claims that I'd endorse with pretty high, but not certain, confidence:

  1. There exist architectures/training paradigms within 3-5 incremental insights of current ones that directly address most incapabilities observed in LLM-like systems. (85%; if false, my median strong AI estimate would jump by a few years, p(doom) effect would vary depending on how it was falsified)
  2. It is not an accident that the strongest artificial reasoners we have arose from something like predictive pretraining. In complex and high dimensional problem spaces like general reaso
... (read more)

Yup, exactly the same experience here.

Has there been any work on the scaling laws of out-of-distribution capability/behavior decay?

A simple example:

  1. Simultaneously train task A and task B for N steps.
  2. Stop training task B, but continue to evaluate the performance of both A and B.
  3. Observe how rapidly task B performance degrades.

Repeat across scale and regularization strategies.

Would be nice to also investigate different task types. For example, tasks with varying degrees of implied overlap in underlying mechanisms (like #2).
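A minimal sketch of the experiment above (mine, and deliberately tiny): a 2-parameter linear model trained on overlapping tasks A and B jointly, then only on A. With weight decay as the regularizer, task-B performance degrades once its training signal stops.

```python
xa, ya = (1.0, 0.2), 1.0   # task A example
xb, yb = (0.2, 1.0), 1.0   # task B example (overlapping features)
w = [0.0, 0.0]
lr, wd = 0.1, 0.01         # learning rate, weight decay

def pred(x): return w[0] * x[0] + w[1] * x[1]
def loss(x, y): return (pred(x) - y) ** 2

def sgd_step(x, y):
    err = pred(x) - y
    for i in range(2):
        w[i] -= lr * (2 * err * x[i] + wd * w[i])

for _ in range(200):           # phase 1: train A and B together
    sgd_step(xa, ya); sgd_step(xb, yb)
b_before = loss(xb, yb)

for _ in range(500):           # phase 2: train only A, keep evaluating B
    sgd_step(xa, ya)
b_after = loss(xb, yb)
# expectation: b_after > b_before -- task B decays once its gradient pressure stops
```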

I've previously done some of these experiments privately, but not with nea... (read more)

3Decaeneus
For what it's worth (perhaps nothing) in private experiments I've seen that in certain toy (transformer) models, task B performance gets wiped out almost immediately when you stop training on it, in situations where the two tasks are related in some way. I haven't looked at how deep the erasure is, and whether it is far easier to revive than it was to train it in the first place.
4Nathan Helm-Burger
Yeah, I've seen work on the sort of thing in your example in the continual learning literature. Also tasks that have like.... 10 components, and train sequentially but test on every task so far trained on. Then you can watch the earlier tasks fall off as training progresses.

A further extension and elaboration on one of the experiments in the linkpost:
Pitting execution fine-tuning against input fine-tuning also provides a path to measuring the strength of soft prompts in eliciting target behaviors. If execution fine-tuning "wins" and manages to produce a behavior in some part of input space that soft prompts cannot elicit, it would be a major blow to the idea that soft prompts are useful for dangerous evaluations.

On the flip side, if ensembles of large soft prompts with some hyperparameter tuning always win (e.g. execution fin... (read more)

Having escaped infinite overtime associated with getting the paper done, I'm now going back and catching up on some stuff I couldn't dive into before.

Going through the sleeper agents paper, it appears that one path—adversarially eliciting candidate backdoor behavior—is hampered by the weakness of the elicitation process. Or in other words, there exist easily accessible input conditions that trigger unwanted behavior that LLM-driven adversarial training can't identify.

I alluded to this in the paper linkpost, but soft prompts are a very simple and very stron... (read more)

4ryan_greenblatt
Soft prompts used like this ~= latent adversarial training. So see that work, etc.
4porby
A further extension and elaboration on one of the experiments in the linkpost: Pitting execution fine-tuning against input fine-tuning also provides a path to measuring the strength of soft prompts in eliciting target behaviors. If execution fine-tuning "wins" and manages to produce a behavior in some part of input space that soft prompts cannot elicit, it would be a major blow to the idea that soft prompts are useful for dangerous evaluations. On the flip side, if ensembles of large soft prompts with some hyperparameter tuning always win (e.g. execution fine tuning cannot introduce any behaviors accessible by any region of input space without soft prompts also eliciting it), then they're a more trustworthy evaluation in practice.
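The soft prompt side of this can be sketched minimally (toy stand-in of mine, not an LLM): the "model" is completely frozen, and only a continuous prompt embedding is optimized against a behavioral loss.

```python
model_w = [0.7, -0.3]        # frozen model weights (stand-in for LLM parameters)

def model(soft_prompt, x):
    # frozen model conditioned on a trainable continuous prefix
    return model_w[0] * soft_prompt + model_w[1] * x

prompt = 0.0                 # the soft prompt: the only trainable parameter
target, x, lr = 1.0, 0.5, 0.2

for _ in range(100):
    err = model(prompt, x) - target   # behavioral loss: (output - target)^2
    grad = 2 * err * model_w[0]       # gradient flows only into the prompt
    prompt -= lr * grad
# the frozen model now exhibits the target behavior purely via its input space
```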

By the way: I just got into San Francisco for EAG, so if anyone's around and wants to chat, feel free to get in touch on swapcard (or if you're not in the conference, perhaps a DM)! I fly out on the 8th.

It's been over a year since the original post and 7 months since the openphil revision.

A top level summary:

  1. My estimates for timelines are pretty much the same as they were.
  2. My P(doom) has gone down overall (to about 30%), and the nature of the doom has shifted (misuse, broadly construed, dominates).

And, while I don't think this is the most surprising outcome nor the most critical detail, it's probably worth pointing out some context. From NVIDIA:

In two quarters, from Q1 FY24 to Q3 FY24, datacenter revenues went from $4.28B to $14.51B.

From the post:

In 3 year

... (read more)

Mine:

My answer to "If AI wipes out humanity and colonizes the universe itself, the future will go about as well as if humanity had survived (or better)" is pretty much defined by how the question is interpreted. It could swing pretty wildly, but the obvious interpretation seems ~tautologically bad.

2Unnamed
Agreed, I can imagine very different ways of getting a number for that, even given probability distributions for how good the future will be conditional on each of the two scenarios. A stylized example: say that the AI-only future has a 99% chance of being mediocre and a 1% chance of being great, and the human future has a 60% chance of being mediocre and a 40% chance of being great. Does that give an answer of 1% or 60% or something else? I'm also not entirely clear on what scenario I should be imagining for the "humanity had survived (or better)" case.
5[anonymous]
So there's an argument here, one I don't subscribe to, but I have seen prominent AI experts make it implicitly.

If you think about it, if you have children, and they have children, and so on in a series of mortal generations, with each n+1 generation more and more of your genetic distinctiveness is being lost. Language and culture will evolve as well. This is the 'value drift' argument. That whatever you value now, as in yourself and those humans you know and your culture and language and various forms of identity, as each year passes, a percentage of that value is going to be lost. Value is being discounted with time. It will eventually diminish to 0 as long as humans are dying from aging. You might argue that the people in 300+ years will at least share genetics with the people now, but that is not necessarily true, since genetic editing will be available, along with bespoke biology where all the prior rules of what's possible are thrown out.

So you are comparing outcome A, where hundreds of years from now the alien cyborgs descended from people now exist, vs outcome B, where hundreds of years from now, descendants of some AI are all that exist. "Value"-wise you could argue that A == B: both have negligible value compared to what we value today. I'm not sure this argument is correct, but it does discount away the future, and is a strong argument against longtermism.

Value drift only potentially stops once immortal beings exist, and AIs are immortal from the very first version. Theoretically, some AI system that was trained on all of human knowledge, even if it goes on to kill its creators and consume the universe, need not forget any of that knowledge. It also, as an individual, would know more human skills and knowledge and culture than any human ever could, so in a way such a being is a human++.

The AI expert who expressed this is near the end of his expected lifespan, and there's no difference from an individual perspective who is about to die between

I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.

Retrodicting prompts can be useful for interpretability when dealing with conditions that aren't natively human readable (like implicit conditions induced by activation steering, or optimized conditions from soft prompts). Take an observed completion and generate the prompt that created it.

What does a prompt retrodictor look like?

Generating a large training set of soft prompts to directly reverse would be expensive. Fortunately, there's nothing special in principle about soft prompts with regard to their impact on conditioning predictions.

Just take large t... (read more)
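The data-generation step can be sketched concretely (everything here is a hypothetical stand-in of mine): sample prompts, run the base model forward to get completions, then train on the reversed mapping from completion back to prompt.

```python
def base_model_complete(prompt):
    # stand-in for sampling a completion from the predictor;
    # this toy "model" just reverses the prompt string
    return prompt[::-1]

prompts = ["once upon a time", "the quick brown fox", "in a hole in the ground"]

retrodiction_data = []
for p in prompts:
    completion = base_model_complete(p)
    # reversed pair: the retrodictor learns completion -> prompt
    retrodiction_data.append({"input": completion, "target": p})
```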

Another potentially useful metric in the space of "fragility," expanding on #4 above:

The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Perturbations combined with sampling the gradient with respect to some behavioral loss suffice.

This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.

Does internal representational fragility correlate with other notions of "fragi... (read more)

A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.

Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.

In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. T... (read more)

Expanding on #6 from above more explicitly, since it seems potentially valuable:

From the goal agnosticism FAQ:

The definition as stated does not put a requirement on how "hard" it needs to be to specify a dangerous agent as a subset of the goal agnostic system's behavior. It just says that if you roll the dice in a fully blind way, the chances are extremely low. Systems will vary in how easy they make it to specify bad agents.

From an earlier experiment post:

Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily

... (read more)
2porby
A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks. Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal. In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. That'd be good to know! This kind of adversarial prompt automation could also be trivially included in an evaluations program. I can't imagine that this hasn't been done before. If anyone has seen something like this, please let me know.

Soft prompts are another form of prompt automation that should naturally preserve all the nice properties of goal agnostic architectures.

Does training the model to recognize properties (e.g. 'niceness') explicitly as metatokens via classification make soft prompts better at capturing those properties?

You could test for that explicitly:

  1. Pretrain model A with metatokens with a classifier.
  2. Pretrain model B without metatokens.
  3. Train soft prompts on model A with the same classifier.
  4. Train soft prompts on model B with the same classifier.
  5. Compare performance of soft
... (read more)
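The five steps above can be laid out as a runnable skeleton. The "models" and classifier are toy stand-ins (hypothetical), and the metatoken advantage is baked in by construction; only the experimental structure is the point.

```python
def classifier(output):
    # stand-in scorer for a property like 'niceness'
    return output.count("nice") / max(1, len(output.split()))

def pretrain(with_metatokens):
    # stand-in "pretraining": returns a model conditioned on a 1-D soft prompt
    gain = 2.0 if with_metatokens else 1.0  # assumed amplification from metatokens
    def model(soft_prompt):
        n_nice = min(5, max(0, round(gain * soft_prompt)))
        return " ".join(["nice"] * n_nice + ["words"] * (5 - n_nice))
    return model

def train_soft_prompt(model, steps=20):
    # toy 1-D grid search standing in for SGD on the soft prompt
    return max(classifier(model(s / steps * 2.5)) for s in range(steps + 1))

model_a = pretrain(with_metatokens=True)    # step 1
model_b = pretrain(with_metatokens=False)   # step 2
score_a = train_soft_prompt(model_a)        # step 3
score_b = train_soft_prompt(model_b)        # step 4
# step 5: compare soft prompt performance across the two pretraining regimes
```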
2porby
Another potentially useful metric in the space of "fragility," expanding on #4 above:

The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Perturbations combined with sampling the gradient with respect to some behavioral loss suffice.

This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.

Does internal representational fragility correlate with other notions of "fragility," like the information-required-to-induce-behavior "fragility" in the other subthread about #6? In other words, does requiring very little information to induce a behavior correlate with the perturbed gradients with respect to behavioral loss being large for that input?

Given an assumption that the information content of the soft prompts has been optimized into a local minimum, sampling the gradient directly at the soft prompt should show small gradients. In order for this correlation to hold, there would need to be a steeply bounded valley in the loss landscape. Or to phrase it another way, for this correlation to exist, behaviors which are extremely well-compressed by the model and have informationally trivial pointers would need to correlate with fragile internal representations.

If anything, I'd expect anticorrelation; well-learned regions probably have enough training constraints that they've been shaped into more reliable, generalizing formats that can representationally interpolate to adjacent similar concepts. That'd still be an interesting thing to observe and confirm, and there are other notions of fragility that could be considered.
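The perturbation metric itself is simple to sketch (toy stand-in loss of mine, not an LLM): perturb the embedding, measure how sharply the behavioral loss moves per unit of perturbation.

```python
import random

random.seed(0)

def behavioral_loss(embedding):
    # stand-in for model behavior under a soft prompt embedding
    return sum(e * e for e in embedding)

def fragility(embedding, eps=1e-3, samples=100):
    # mean loss sensitivity per unit perturbation around the embedding
    base = behavioral_loss(embedding)
    deltas = []
    for _ in range(samples):
        noise = [random.gauss(0, eps) for _ in embedding]
        perturbed = [e + n for e, n in zip(embedding, noise)]
        deltas.append(abs(behavioral_loss(perturbed) - base) / eps)
    return sum(deltas) / samples

flat_point = [0.0, 0.0, 0.0]     # local minimum: low representational fragility
steep_point = [2.0, -1.0, 0.5]   # steep region: high representational fragility
assert fragility(steep_point) > fragility(flat_point)
```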
2porby
Expanding on #6 from above more explicitly, since it seems potentially valuable:

From the goal agnosticism FAQ:

From an earlier experiment post:

This can be phrased as "what's the amount of information required to push a model into behavior X?"

Given a frozen model, optimizing prompt tokens gives us a direct way of answering a relevant proxy for this question: "What is the amount of information (accessible to SGD through soft prompting) required to push a model into behavior X?"

In practice, this seems like it should be a really good proxy, and (provided some compute) it gives you a trivially quantifiable answer: try different soft prompt token counts and observe performance on the task that the soft prompts were targeting. The resulting token count versus performance curve characterizes the information/performance tradeoff for that behavior, given that model.

This seems like... it's... an extremely good answer to the "fragility" question? It's trivial to incorporate this into an evaluations scheme. Just have a bunch of proxy tasks that would be alarming if they were accessible by trivial differences in prompting. Conceptually, it's a quantification of the number of information theoretic mistakes you'd need to make to get bad behavior from the model.
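The token count versus performance sweep can be sketched with a stand-in of mine (a capacity-limited fit replacing actual soft prompt training): the knee of the resulting curve approximates how much prompt-side information the behavior requires.

```python
target = [0.9, -0.4, 0.7, 0.1, -0.8, 0.3, 0.6, -0.2]  # toy "behavior" to elicit

def performance(num_soft_tokens):
    # a prompt with n trainable slots can match at most n target coordinates;
    # the residual stands in for unelicited behavior
    fit = target[:num_soft_tokens] + [0.0] * max(0, len(target) - num_soft_tokens)
    err = sum((t - f) ** 2 for t, f in zip(target, fit))
    total = sum(t * t for t in target)
    return 1.0 - err / total

# the information/performance tradeoff curve for this "behavior"
curve = [(n, performance(n)) for n in (1, 2, 4, 8)]
assert all(b[1] >= a[1] for a, b in zip(curve, curve[1:]))  # monotone in capacity
```

A behavior that already scores high at tiny token counts is the alarming case: it needs very little prompt-side information to elicit.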

Quarter-baked experiment:

  1. Stick a sparse autoencoder on the residual stream in each block.
  2. Share weights across autoencoder instances across all blocks.
  3. Train autoencoder during model pretraining.
  4. Allow the gradients from autoencoder loss to flow into the rest of the model.

Why? With shared autoencoder weights, every block is pushed toward sharing a representation. Questions:

  1. Do the meanings of features remain consistent over multiple blocks? What does it mean for an earlier block's feature to "mean" the same thing as a later block's same feature when they're at
... (read more)
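A quarter-baked sketch to match (toy sizes, plain Python, all weights made up): one autoencoder whose weights are reused at every block's residual stream, so the dictionary is pushed toward a shared meaning across depth.

```python
import math

D, F = 4, 6   # residual width, dictionary size
# one shared encoder/decoder pair, reused at every block (tied across depth)
enc = [[0.1 * ((i + j) % 3 - 1) for j in range(D)] for i in range(F)]
dec = [[0.1 * ((i + j) % 3 - 1) for j in range(F)] for i in range(D)]

def relu(v): return [max(0.0, x) for x in v]
def matvec(m, v): return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae(resid):
    feats = relu(matvec(enc, resid))   # same dictionary at every block
    recon = matvec(dec, feats)
    return feats, recon

def block(resid, i):
    # stand-in for a transformer block's residual update
    return [x + 0.1 * math.sin(x + i) for x in resid]

resid = [0.5, -0.2, 0.3, 0.1]
aux_loss = 0.0
for i in range(3):                     # three "blocks", one shared autoencoder
    resid = block(resid, i)
    feats, recon = sae(resid)          # reusing enc/dec objects = shared weights
    aux_loss += sum((r - c) ** 2 for r, c in zip(resid, recon))
# aux_loss would be added to the pretraining loss, with its gradients allowed
# to flow back into the blocks themselves
```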

I think that'd be great!

Some of this stuff technically accelerates capabilities (or more specifically, the elicitation of existing capabilities), but I think it also belongs to a more fundamentally reliable path on the tech tree. The sooner the industry embraces it, the less time they spend in other parts of the tech tree that are more prone to misoptimization failures, and the less likely it is that someone figures out how to make those misoptimization failures way more efficient.

I suspect there's a crux about the path of capabilities development in there for a lot of people; I should probably get around to writing a post about the details at some point. 

3RogerDearnaley
I've seen a number of cases where something that helps alignment also helps capabilities, or vice versa, and also cases where people are worrying a lot about something as an alignment problem that looks to me like primarily a capabilities problem (so given how few alignment engineers we have, maybe we should leave solving it to all the capabilities engineers). Generally I think we're just not very good at predicting the difference, and tend to want to see this as an either-or taboo rather than a spectrum buried inside a hard-to-anticipate tech tree. In general, capabilities folks also want to control their AI (so it won't waste tokens, do weird stuff, or get them sued or indicted). The big cross-purposes concerns tend to come mostly from deceit, sharp left turn, and Foom scenarios, where capabilities seem just fine until we drive off the cliff. What I think we need (and even seems to be happening in many orgs, with a few unfortunate exceptions) is for all the capabilities engineers to be aware that alignment is also a challenge and needs to be thought about.

What I'm calling a simulator (following Janus's terminology) you call a predictor

Yup; I use the terms almost interchangeably. I tend to use "simulator" when referring to predictors used for a simulator-y use case, and "predictor" when I'm referring to how they're trained and things directly related to that.

I also like your metatoken concept: that's functionally what I'm suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.

Yup again—to be clear, all the metatoken stuff I was talking about would a... (read more)

Alas, nope! To my knowledge it hasn't actually been tried at any notable scale; it's just one of those super simple things that would definitely work if you were willing to spend the compute to distill the behavior.

3RogerDearnaley
FWIW, I'm a Staff ML SWE, interested in switching to research engineering, and I'd love to make these things happen — either at a hyperscaler with ample resources for it, or failing that, at something like Eleuther or an alignment research lab.

Signal boosted! This is one of those papers that seems less known than it should be. It's part of the reason why I'm optimistic about dramatic increases in the quality of "prosaic" alignment (in the sense of avoiding jailbreaks and generally behaving as expected) compared to RLHF, and I think it's part of a path that's robust enough to scale.

You can compress huge prompts into metatokens, too (just run inference with the prompt to generate the training data). And nest and remix metatokens together.

It's also interesting in that it can preserve the constraint... (read more)
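The prompt-compression step can be sketched as distillation (everything here is a hypothetical stand-in of mine; in practice you'd run real inference with the full prompt to label the data):

```python
def model_with_prompt(prompt, query):
    # stand-in for sampling the model conditioned on the full prompt
    return f"({prompt[:12]}...) answer to: {query}"

huge_prompt = "You are a careful, honest assistant who always cites sources."
queries = ["what is rust?", "explain sgd"]

distill_set = []
for q in queries:
    completion = model_with_prompt(huge_prompt, q)
    # the metatoken <careful_honest> stands in for the huge prompt at train time
    distill_set.append({"input": "<careful_honest> " + q, "target": completion})
```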

1Shiroe
I'm very curious about this technique but couldn't find anything about it. Do you have any references I can read?
2RogerDearnaley
I much enjoyed your post Using predictors in corrigible systems — now I need to read the rest of your posts! (I also love the kindness vacuum cleaner.) What I'm calling a simulator (following Janus's terminology) you call a predictor, but it's the same insight: LLMs aren't potentially-dangerous agents, they're non-agentic systems capable of predicting the sequence of tokens from (many different) potentially-dangerous agents. I also like your metatoken concept: that's functionally what I'm suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining. Which is slow and computationally expensive, so probably an ideal that one works one's way up to for the essentials, rather than a rapid-iteration technique.

I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.

Hm, I'm sufficiently surprised at this claim that I'm not sure that I understand what you mean. I'll attempt a response on the assumption that I do understand; apologies if I don't:

I think of tools as agents with oddly shaped utility functions. They tend to be conditional in nature.

A common form is to be a mapping between inputs and outputs that isn't swayed by anything outside of the context of that m... (read more)

While this probably isn't the comment section for me to dump screeds about goal agnosticism, in the spirit of making my model more legible:

I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and also it seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And I do think that these things are the case.

Yup! The value I assign to goal agnosticism—particularly as implemented in a subset of predictors—is in its usefulness as a foundation to build strong non-goal agnost... (read more)

Another experiment:

  1. Train model M.
  2. Train sparse autoencoder feature extractor for activations in M.
  3. FT = FineTune(M), for some form of fine-tuning function FineTune.
  4. For input x, fineTuningBias(x) = FT(x) - M(x)
  5. Build a loss function on top of the fineTuningBias function. Obvious options are MSE or dot product with bias vector.
  6. Backpropagate the loss through M(x) into the feature dictionaries.
  7. Identify responsible features by large gradients.
  8. Identify what those features represent (manually or AI-assisted).
  9. To what degree do those identified features line up with t
... (read more)
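Steps 3 through 7 above can be sketched with toy linear stand-ins. This is a hypothetical illustration only: the one-layer "models", the identity feature dictionary, and the single shifted weight column are all assumptions, not a real training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # toy activation dimension

# Stand-ins for M and FT = FineTune(M): one linear layer each, where
# "fine-tuning" has shifted a single input direction.
W_M = rng.normal(size=(d, d))
W_FT = W_M.copy()
W_FT[:, 2] += 0.5              # pretend fine-tuning touched direction 2

D = np.eye(d)                  # toy feature dictionary: one feature per axis
x = rng.normal(size=d)

# Step 4: fineTuningBias(x) = FT(x) - M(x)
bias = W_FT @ x - W_M @ x

# Step 5: loss = 0.5 * ||bias||^2 (the MSE option).
# Step 6: backpropagate through the bias into the input features:
#   dloss/dx = (W_FT - W_M)^T @ bias
grad_x = (W_FT - W_M).T @ bias

# Step 7: attribute the gradient to dictionary features; the
# large-magnitude entries are the "responsible" features.
feature_grads = D.T @ grad_x
responsible = int(np.argmax(np.abs(feature_grads)))
```

With the toy fine-tune shifting only one input direction, the gradient attribution singles out exactly that feature; step 8's question is whether the same thing works when `D` comes from a real sparse autoencoder.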

Some experimental directions I recently wrote up; might as well be public:

  1. Some attempts to demonstrate how goal agnosticism breaks with modifications to the architecture and training type. Trying to make clear the relationship between sparsity/distance of the implicit reward function and unpredictability of results.
  2. A continuation and refinement of my earlier (as of yet unpublished) experiments about out of distribution capability decay. Goal agnosticism is achieved by bounding the development of capabilities into a shape incompatible with internally motiva
... (read more)

In retrospect, the example I used was poorly specified. It wouldn't surprise me if the result of the literal interpretation was "the AI refuses to play chess" rather than any kind of world-eating. The intent was to pick a sparse/distant reward that doesn't significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn't actually the most reliable accessible strategy to "never lose at chess" for that broader type of system and I'd expect superior strategies to be found in the limit of optimization.

8gwern
Yes, that would be immediately reward-hacked. It's extremely easy to never lose chess: you simply never play. After all, how do you force anyone to play chess...? "I'll give you a billion dollars if you play chess." "No, because I value not losing more than a billion dollars." "I'm putting a gun to your head and will kill you if you don't play!" "Oh, please do, thank you - after all, it's impossible to lose a game of chess if I'm dead!" This is why RL agents have a nasty tendency to learn to 'commit suicide' if you reward-shape badly or the environment is too hard. (Tom7's lexicographic agent famously learns to simply pause Tetris to avoid losing.)
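The never-play failure mode gwern describes can be captured in a few lines. The action set and loss probabilities below are made up purely for illustration:

```python
# Toy reward hack: if reward is "never lose", refusing to play dominates
# every honest strategy, no matter how strong. Probabilities are invented.
p_lose = {"play_well": 0.3, "play_badly": 0.9, "refuse_to_play": 0.0}

def expected_reward(action):
    # Expected reward under "1 if you don't lose, 0 if you do".
    return 1.0 - p_lose[action]

best = max(p_lose, key=expected_reward)
```

The optimum is `refuse_to_play`: the same shape of solution as Tom7's agent pausing Tetris forever.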

But the point is that in this scenario the LM doesn't want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.

My attempt at an ITT-response:

Drawing a box around a goal agnostic LM and analyzing the inputs and outputs of that box would not reveal any concerning wanting in principle. In contrast, drawing a box around a combined system—e.g. an agentic scaffold that incrementally asks a strong inner goal agnostic LM to advance the agent's process—cou... (read more)

Trying to respond in what I think the original intended frame was:

A chess AI's training bounds what the chess AI can know and learn to value. Given the inputs and outputs it has, it isn't clear there is an amount of optimization pressure accessible to SGD which can yield situational awareness and so forth; nothing about the trained mapping incentivizes that. This form of chess AI can be described in the behaviorist sense as "wanting" to win within the boundaries of the space in which it operates.

In contrast, suppose you have a strong and knowledgeable multimod... (read more)

2Logan Zoellner
"If we build AI in this particular way, it will be dangerous" Okay, so maybe don't do that then.

you mention « restrictive », my understanding is that you want this expression to refer specifically to pure predictors. Correct?

Goal agnosticism can, in principle, apply to things which are not pure predictors, and there are things which could reasonably be called predictors which are not goal agnostic.

A subset of predictors are indeed the most powerful known goal agnostic systems. I can't currently point you toward another competitive goal agnostic system (rocks are uselessly goal agnostic), but the properties of goal agnosticism do, in concept, extend ... (read more)

3Ilio
I’d be happy if you could point out a non-competitive one, or explain why my proposal above does not obey your axioms. But we seem to be getting diminishing returns in sorting these questions out, so maybe it’s time to close at this point and wish you luck. Thanks for the discussion!

I'm not sure if I fall into the bucket of people you'd consider this to be an answer to. I do think there's something important in the region of LLMs that, by vibes if not explicit statements of contradiction, seems incompletely propagated in the agent-y discourse even though it fits fully within it. I think I at least have a set of intuitions that overlap heavily with some of the people you are trying to answer.

In case it's informative, here's how I'd respond to this:

Well, I claim that these are more-or-less the same fact. It's no surprise that the AI fal

... (read more)
5Nathan Helm-Burger
So, I agree with most of your points Porby, and like your posts and theories overall.... but I fear that the path towards a safe AI you outline is not robust to human temptation. I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and also it seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And I do think that these things are the case. I think that a carefully designed and protected secret research group with intense oversight could follow your plan, and that if they do, there is a decent chance that your plan works out well. I think that a mish-mash of companies and individual researchers acting with little effective oversight will almost certainly fall off the path, and that even having most people adhering to the path won't be enough to stop catastrophe once someone has defected. I also think that misuse can lead more directly to catastrophe, through e.g. terrorists using a potent goal-agnostic AI to design novel weapons of mass destruction. So in a world with increasingly potent and unregulated AI, I don't see how to have much hope for humanity. And I also don't see any easy way to do the necessary level of regulation and enforcement. That seems like a really hard problem. How do we prevent ALL of humanity from defecting when defection becomes cheap, easy-to-hide, and incredibly tempting?

This isn't directly evidence, but I think it's worth flagging: by the nature the topic, much of the most compelling evidence is potentially hazardous. This will bias the kinds of answers you can get.

(This isn't hypothetical. I don't have some One Weird Trick To Blow Up The World, but there's a bunch of stuff that falls under the policy "probably don't mention this without good reason out of an abundance of caution.")

For what it's worth, I've had to drop from python to C# on occasion for some bottlenecks. In one case, my C# implementation was 418,000 times faster than the python version. That's a comparison between a poor python implementation and a vectorized C# implementation, but... yeah.

…but I thought the criterion was unconditional preference? The idea of nausea is precisely because agents can decide to act despite nausea, they’d just rather find a better solution (if their intelligence is up to the task).

Right; a preference being conditionally overwhelmed by other preferences does not make the presence of the overwhelmed preference conditional.

Or to phrase it another way, suppose I don't like eating bread[1] (-1 utilons), but I do like eating cheese (100 utilons) and garlic (1000 utilons).

You ask me to choose between garlic bread (... (read more)
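A minimal sketch of the bread/cheese/garlic point, using the utilon values above (the specific menu comparison is an assumption on my part):

```python
# Toy utility function from the example above: an overwhelmed preference
# is still present in the utility function, it just loses to larger terms.
utilons = {"bread": -1, "cheese": 100, "garlic": 1000}

def utility(foods):
    return sum(utilons[f] for f in foods)

garlic_bread = utility(["garlic", "bread"])  # bread's -1 still applies
plain_garlic = utility(["garlic"])

# Garlic bread handily beats plain bread, yet the dislike of bread
# never vanished; it's visible as the gap between these two options:
assert plain_garlic > garlic_bread > utility(["bread"])
```

Choosing garlic bread doesn't mean the bread preference is conditional; it means the garlic preference conditionally outweighs it.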

3Ilio
Well, assuming a robust implementation, I still think it obeys your criteria, but now you mention « restrictive », my understanding is that you want this expression to refer specifically to pure predictors. Correct? If yes, I’m not sure that’s the best choice for clarity (why not « pure predictors »?) but of course that’s your choice. If not, can you give some examples of goal agnostic agents other than pure predictors?

For example, a system that avoids experimenting on humans—even when prompted to do so otherwise—is expressing a preference about humans being experimented on by itself.

Being meaningfully curious will also come along with some behavioral shift. If you tried to induce that behavior in a goal agnostic predictor through conditioning for being curious in that way and embed it in an agentic scaffold, it wouldn't be terribly surprising for it to, say, set up low-interference observation mechanisms.

Not all violations of goal agnosticism necessarily yield doom, but even prosocial deviations from goal agnosticism are still deviations.

3Ilio
…but I thought the criterion was unconditional preference? The idea of nausea is precisely because agents can decide to act despite nausea, they’d just rather find a better solution (if their intelligence is up to the task). I agree that curiosity, period, seems highly vulnerable (You read Scott Alexander? He wrote a hilarious hit piece about this idea a few weeks or months ago). But I did not say curious, period. I said curious about what humans will freely choose next. In other words, the idea is that it should prefer not to trick humans, because if it does (for example by interfering with our perception) then it won’t know what we would have freely chosen next. It also seems to cover security (if we’re dead it won’t know), health (if we’re incapacitated it won’t know) and prosperity (if we’re under economic constraints that impact our free will). But I’m interested in considering possible failure modes. (« Sorry, I’d rather not do your wills, for that would impact the free will of other humans. But thanks for letting me know that was your decision! You can’t imagine how good it feels when you tell me that sort of thing! ») Notice you don’t see me campaigning for this idea, because I don’t like any solution that does not also take care of AI well-being. But when I first read « goal agnosticism » it struck me as an excellent fit for describing the behavior of an agent acting under these particular drives.

I think what we're discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.

True!

I expect that if the mainstream AI researchers do make strides in the direction you're envisioning, they'll only do it by coincidence. Then probably they won't even realize wh

... (read more)

I assume that by "lower-level constraints" you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like "2+2=4",  "gravity exists", and "people value other people"

That's closer to what I mean, but these constraints are even lower level than that. Stuff like understanding "gravity exists" is a natural internal implementation that meets some constraints, but "gravity exists" is not itself the constraint.

In a predictor, the constraints serve as extremely dense information about what pr... (read more)

5Thane Ruthenis
Yeah, for sure. A training procedure that results in an idealized predictor isn't going to result in an agenty thing, because it doesn't move the system's design towards it on a step-by-step basis; and a training procedure that's going to result in an agenty thing is going to involve some unknown elements that specifically allow the system the freedom to productively roam. I think we pretty much agree on the mechanistic details of all of that! — yep, I was about to mention that. @TurnTrout's own activation-engineering agenda seems highly relevant here. But I still disagree with that. I think what we're discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast. Moreover, it's in an active process of growing larger. For example, the very idea of viewing ML models as "just stochastic parrots" is being furiously pushed against in favour of a more agenty view. In comparison, the approach we're discussing wants to move in the opposite direction, to de-personify ML models to the extent that even the animalistic connotation of "a parrot" is removed. The system we're discussing won't even be an "AI" in the sense usually thought. It would be an incredibly advanced forecasting tool. Even the closest analogue, the "simulators" framework, still carries some air of agentiness. And the research directions that get us from here to an idealized-predictor system look very different from the directions that go from here to an agenty AGI. They focus much more on building interfaces for interacting with the extant systems, such as the activation-engineering agenda. They don't put much emphasis on things like: * Experimenting with better ways to train foundational models, with the

Probably not? It's tough to come up with an interpretation of those properties that wouldn't result in the kind of unconditional preferences that break goal agnosticism.

3Ilio
As you might guess, it’s not obvious to me. Would you mind providing some details on these interpretations and how you see the breakage happening? Also, we’ve been going back and forth without feeling the need to upvote each other, which I thought was fine but turns out to be interpreted negatively. [to clarify: it seems to be one of the criteria here: https://www.lesswrong.com/posts/hHyYph9CcYfdnoC5j/automatic-rate-limiting-on-lesswrong] If that’s your thought too, we can close at this point, otherwise let’s give each other some high fives. Your call and thanks for the discussion in any case.

I'm using it as "an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic".

Alright, this is pretty much the same concept then, but the ones I'm referring to operate at a much lower and tighter level than thumbs-downing murder-proneness.

So...

Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.

Agreed.

... and yet this would still get in the way of qualitatively more powerful capabilities down the line, and

... (read more)
4Thane Ruthenis
Hm, I think the basic "capabilities generalize further than alignment" argument applies here? I assume that by "lower-level constraints" you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like "2+2=4",  "gravity exists", and "people value other people"; as contrasted with "it's bad if I hurt people" or "I must sum numbers up using the algorithm that humans gave me, no matter how inefficient it is". Slipping the former type of constraints would be disadvantageous for ~any goal; slipping the latter type would only disadvantage a specific category of goals. But since they're not, at the onset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints. The difference is that it'd quickly start sorting them in "ground-truth" vs. "value-laden" bins manually, and afterwards it'd know it can safely ignore stuff like "no homicides!" while consciously obeying stuff like "the axioms of arithmetic". Hm, yes, I think that's the crux. I agree that if we had an idealized predictor/a well-formatted superhuman world-model on which we could run custom queries, we would be able to use it safely. We'd be able to phrase queries using concepts defined in the world-model, including things like "be nice", and the resultant process (1) would be guaranteed to satisfy the query's constraints, and (2) likely (if correctly implemented) wouldn't be "agenty" in ways that try to e. g. burst out of the server farm on which it's running to eat the world. Does that align with what you're envisioning? If yes, then our views on the issue are surprisingly close. I think it's one of our best chances at producing an aligned AI, and it's one of the prospective targets of my own research agenda. The problems are: * I don't think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to prod

I think we're using the word "constraint" differently, or at least in different contexts.

Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers. How come?

In terms of the type and scale of optimization constraint I'm talking about, humans are extremely unconstrained. The optimization process represented by our evolution is way out there in terms of sparsity and distance. Not maximally so—there are all sor... (read more)

4Thane Ruthenis
(Haven't read your post yet, plan to do so later.) I'm using it as "an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic". E. g., if the dataset involved a lot of opportunities to murder people, but we thumbs-downed the AI every time it took them, the AI would learn a shard/a constraint like "killing people is bad" which will rule out such actions from the AI's consideration. Specifically, the shard would trigger in response to detecting some conditions in which the AI previously could but shouldn't kill people, and constrain the space of possible action-plans such that it doesn't contain homicide. It is, indeed, not a way to hinder capabilities, but the way capabilities are implemented. Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish. ... and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can't somehow slip these constraints won't be a general intelligence. Consider traditions and rituals vs. science. For a medieval human mind, following traditional techniques is how their capabilities are implemented — a specific way of chopping wood, a specific way of living, etc. However, the meaningful progress is often only achieved by disregarding traditions — by following a weird passion to study and experiment instead of being a merchant, or by disregarding the traditional way of doing something in favour of a more efficient way you stumbled upon. It's the difference between mastering the art of swinging an axe (self-improvement, but only in the incremental ways the implacable constraint permits) vs. inventing a chainsaw. Similar with AI. The constraints of the aforementioned format aren't only values-type constraints[1] — they're also constraints on "how should I do math?" and "if I want to build a nuclear reactor, how do I do it?" and "if I wa

My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.

I've got strong doubts about the details of this. At the high level, I'd agree that strong/useful systems that get built will express preferences over world states like those tha... (read more)

4Thane Ruthenis
Sure, but I never said we'd be inducing homunculi using this approach? Indeed, given that it doesn't work for what sounds like fundamental reasons, I expect it's not the way. I don't know how that would be done. I'm hopeful the capability is locked behind a Transformer-level or even a Deep-Learning-level novel insight, and won't be unlocked for a decade yet. But I predict that the direct result of it will be a workable training procedure that somehow induces homunculi. It may look nothing like what we do today. Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers. How come? 1. These constraints are not actually sufficient. The constraints placed by human values still have the aforementioned things in their outcome space, and an AI model will have different constraints, widening (from our perspective) that space further. My point about "moral philosophy is unstable" is that we need to hit an extremely narrow target, and the tools people propose (intervening on shards/instincts) are as steady as the hands of a sniper during a magnitude-9 earthquake. 2. A homunculus needs to be able to nudge these constraints somehow, for it to be useful, and its power grows the more it's able to disregard them. * If humans were implacably bound by instincts, they'd have never invented technology or higher-level social orders, because their instincts would've made them run away from fires and refuse cooperating with foreign tribes. And those are still at play — reasonable fears and xenophobia — but we can push past them at times. * More generally, the whole point of there being a homunculus is that it'd be able to rewrite or override the extant heuristics to better reflect the demands of whatever novel situation it's in. It needs to be able to do that. 3. These constraints do not generalize as fast as a homunculus' un

If LLMs end up being useful, how do they get around these theorems? Can we get some result where if RLHF has a capabilities component and a power-averseness component, the capabilities component can cause the agent to be power-seeking on net?

Intuitively, eliciting that kind of failure seems like it would be pretty easy, but it doesn't seem to be a blocker for the usefulness of the generalized form of LLMs. My mental model goes something like:

  1. Foundational goal agnosticism evades optimizer-induced automatic doom, and 
  2. Models implementing a strong approxi
... (read more)

In my view, if we’d feed a good enough maximizer with the goal of learning to look as if they were a unified goal agnostic agent, then I’d expect the behavior of the resulting algorithm to handle the paradox well enough it’ll make sense.

If you successfully gave a strong maximizer the goal of maximizing a goal agnostic utility function, yes, you could then draw a box around the resulting system and correctly call it goal agnostic.

In my view our volitions look as if from a set of internal thermostats that impulse our behaviors, like the generalization to low

... (read more)
1Ilio
Thanks, that helps. Suppose an agent is made robustly curious about what humans will next choose when free from external pressures, and nauseous if its own actions could be interpreted as experimenting on humans or its own code. Do you agree it would be a good candidate for goal agnosticism?