All of Andrew McKnight's Comments + Replies

Do you think putting extra effort into learning about existing empirical work while doing conceptual work would be sufficient for good conceptual work, or do you think people need to be producing empirical work themselves to really make progress conceptually?

5Richard_Ngo
The former can be sufficient—e.g. there are good theoretical researchers who have never done empirical work themselves. In hindsight I think "close conjunction" was too strong—it's more about picking up the ontologies and key insights from empirical work, which can be possible without following it very closely.

Maybe you've addressed this elsewhere, but isn't scheming convergent in the sense that a perfectly aligned AGI would still have an incentive to do so unless it already fully knows itself? An aligned AGI can still desire to have some unmonitored breathing room to fully reflect and realize what it truly cares about, even if that thing is what we want.

Also, a possible condition for a fully corrigible AGI would be to not have this scheming incentive in the first place, even while having the capacity to scheme.

6ryan_greenblatt
I think that with early transformative AIs we should broadly aim for AIs which are myopic and don't scheme, rather than aiming for AIs which are aligned in their long-run goals/motives.

Another possible inflection point, pre-self-improvement, could be when an AI gets a set of capabilities that allows it to gain new capabilities at inference time.

1boazbarak
Some things like that have already happened: bigger models are better at utilizing tools such as in-context learning and chain-of-thought reasoning. But again, whenever people plot any graph of such reasoning capabilities as a function of model compute or size (e.g., the BIG-Bench paper), the X axis is always logarithmic. For specific tasks, the dependence on log compute is often sigmoid-like (flat for a long time, then rising more sharply as a function of log compute), but as mentioned above, when you average over many tasks you get this type of linear dependence.
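A toy illustration of that last point (my own sketch with made-up thresholds and steepness, not BIG-Bench data): if each task's score is a sigmoid in log compute with its own "emergence" threshold, averaging over many tasks whose thresholds are spread out gives a roughly linear aggregate curve.

```python
import numpy as np

# Toy sketch (hypothetical numbers, not BIG-Bench data): per-task scores are
# sigmoids in log-compute with task-specific thresholds; the average over many
# tasks is roughly linear in log-compute between the extremes of the thresholds.
log_compute = np.linspace(0, 10, 200)           # hypothetical log10(compute) axis
rng = np.random.default_rng(0)
thresholds = rng.uniform(1, 9, size=1_000)      # where each task "turns on"
steepness = 2.0

# shape (200, 1000): score of every task at every compute level
per_task = 1.0 / (1.0 + np.exp(-steepness * (log_compute[:, None] - thresholds[None, :])))
average_score = per_task.mean(axis=1)           # near-linear ramp from ~0 to ~1
```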

I'll repeat this bet, same odds, same conditions, same payout, if you're still interested. My $10k to your $200 in advance.

1RatsWrongAboutUAP
Sure, reach out

Responding to your #1: do you think we're on track to handle the cluster of AGI Ruin scenarios pointed at in 16-19? I feel we are not making any progress here, other than towards verifying some properties in 17.

16: outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.
17: on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.

... (read more)

Thanks for the links and explanation, Ethan.

I mean, it's mostly semantics, but I think of mechanistic interpretability as "inner" but not alignment, and I think it's clearer that way, personally, so that we don't call everything alignment. Observing properties doesn't automatically get you good properties. I'll read your link, but it's a bit too much to wade into for me atm.

Either way, it's clear how to restate my question: is mechanistic interpretability work the only inner alignment work Anthropic is doing?

Ethan PerezΩ7147

Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:

  1. Developing model organisms for deceptive inner alignment, which we may use to study the risk factors for deceptive alignment
  2. Conditioning predictive models as an alternative to training agents. Predictive models may pose fewer inner alignment risks, for reasons discussed here
  3. Studying the extent to which models exhibit likely pre-requisites to deceptive inner alignment, such as situational awareness (a very preliminary exploration is i
... (read more)

Great post. I'm happy to see these plans coming out, following OpenAI's lead.

It seems like all the safety strategies are targeted at outer alignment and interpretability. None of the recent OpenAI, Deepmind, Anthropic, or Conjecture plans seem to target inner alignment, iirc, even though this seems to me like the biggest challenge.

Is Anthropic mostly leaving inner alignment untouched, for now?

evhubΩ101614

It seems like all the safety strategies are targeted at outer alignment and interpretability.

None of the recent OpenAI, Deepmind, Anthropic, or Conjecture plans seem to target inner alignment

???


Less tongue-in-cheek: certainly it's unclear to what extent interpretability will be sufficient for addressing various forms of inner alignment failures, but I definitely think interpretability research should count as inner alignment research.

Taken literally, the only way to merge n utility functions into one without any other info (e.g. the preferences that generated the utility functions) is to do a weighted sum. There are only n-1 free parameters.
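Spelling out the counting (a minimal sketch; the non-negativity and sum-to-one convention is the normalization I have in mind):

$$U_{\text{merged}} = \sum_{i=1}^{n} w_i\,U_i, \qquad w_i \ge 0, \quad \sum_{i=1}^{n} w_i = 1,$$

so the weights are the only knobs, and the constraint that they sum to one is what leaves n-1 free parameters.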

2mako yass
So you think it's computationally tractable? I think there are some other factors you're missing. That's a weighted sum of a bunch of vectors assigning numbers to all possible outcomes, either all possible histories+final states of the universe, or all possible experiences. And there are additional complications with normalizing utility functions; you don't know the probability distribution of final outcomes (so you can't take the integral of the utility functions) until you already know how the aggregation of normalized weighted utility functions is going to influence it.

Wouldn't the kind of alignment you'd be able to test behaviorally in a game be unrelated to scalable alignment?

I know this was 3 years ago, but was this disagreement resolved, maybe offline?

3johnswentworth
I don't think we've talked about it since then.

Is there reason to believe algorithmic improvements follow an exponential curve? Do you happen to know a good source on this?

5Daniel Kokotajlo
As opposed to what, linear? Or s-curvy? S-curves look exponential until you get close to the theoretical limit. I doubt we are close to the theoretical limits. Ajeya bases her estimate on empirical data, so if you want to see whether it's exponential go look at that I guess.

I'm tempted to call this a meta-ethical failure. Fatalism, universal moral realism, and just-world intuitions seem to be the underlying implicit heuristics or principles that would cause this "cosmic process" thought-blocker.

I think it's good to go back to this specific quote and think about how it compares to AGI progress.

A difference I think Paul has mentioned before is that Go was not a competitive industry, and that competitive industries will have smaller capability jumps. Assuming this is true, I also wonder whether the secret sauce for AGI will be within the main competitive target of the AGI industry.

The thing the industry is calling AGI and targeting may end up being a specific style of shallow deployable intelligence when "real" AGI is a different style of "deeper" intelli... (read more)

von Neumann's design was specified in full detail, but, iirc, when it was run for the first time (in the '90s) it had a few bugs that needed fixing. I haven't followed Freitas in a long time either, but I agree that the designs weren't fully spelled out and would have needed iteration.

If we merely lose control of the future and virtually all resources but many of us aren't killed in 30 years, would you consider Eliezer right or wrong?

4mukashi
Wrong. He is being quite clear about what he means.

There is some evidence that complex nanobots could be invented in one's head with a little more IQ and focus, because von Neumann designed a mostly functional (but fragile) replicator in a fake, simplified physics, using the brand-new idea of cellular automata, without a computer and without the idea of DNA. If a slightly smarter von Neumann had focused his life on nanobots, could he have produced, for instance, the works of Robert Freitas, but in the 1950s and only on paper?

I do, however, agree it would be helpful to have different words for different styles of ... (read more)

-1jbash
"On paper" isn't "in your head", though. In the scenario that led to this, the AI doesn't get any scratch paper. I guess it could be given large working memory pretty easily, but resources in general aren't givens. More importantly, even in domains where you have a lot of experience, paper designs rarely work well without some prototyping and iteration. So far as I know, von Neumann's replicator was never a detailed mechanical design that could actually be built, and certainly never actually was built. I don't think anything of any complexity that Bob Freitas designed has ever been built, and I also don't think any of the complex Freitas designs are complete to the point of being buildable. I haven't paid much attention since the repirocyte days, so I don't know what he's done since then, but that wasn't even a detailed design, and it even the ideas that were "fleshed out" probably wouldn't have worked in an actual physiological environment.

I think this makes sense because eggs are haploid (they already have only 23 chromosomes), but a natural next question is: why are eggs haploid if there is a major incentive to pass on more of the 46 chromosomes?

2ChristianKl
If there were two copies of chromosome 11 in the egg and none in the sperm, you would lose sexual selection for chromosome 11.

I've been thinking about benefits of "Cognitive Zoning Laws" for AI architecture.

If specific cognitive operations were only performed in designated modules, then these modules could have operation-specific tracking, interpreting, validation, rollback, etc. If we could ensure "zone breaches" can't happen (via, e.g., proved invariants or, more realistically, detection and rollback), then we could theoretically stay aware of where all instances of each cognitive operation are happening in the system. For now, let's call this cognitive-operation-factored architecture... (read more)
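A minimal sketch of the shape I have in mind (all names hypothetical; real zone enforcement would need proved invariants or runtime detection plus rollback, which this doesn't attempt):

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical sketch of "cognitive zoning": each cognitive operation type may
# only be executed inside its designated module, every invocation is logged
# per-zone, and an operation with no designated module is detected and refused
# (a stand-in for breach detection and rollback).

@dataclass
class CognitiveZone:
    name: str                       # e.g. "planning", "world-modeling"
    handler: Callable[[Any], Any]   # the only code allowed to perform this operation
    audit_log: list = field(default_factory=list)

class ZonedSystem:
    def __init__(self) -> None:
        self._zones: dict[str, CognitiveZone] = {}

    def register_zone(self, zone: CognitiveZone) -> None:
        self._zones[zone.name] = zone

    def perform(self, operation: str, payload: Any) -> Any:
        zone = self._zones.get(operation)
        if zone is None:
            # Zone breach: the operation has no designated module, so refuse it.
            raise PermissionError(f"zone breach: no designated module for {operation!r}")
        zone.audit_log.append(payload)          # operation-specific tracking
        return zone.handler(payload)

# Usage: all planning operations are routed through, and logged by, the planning zone.
system = ZonedSystem()
system.register_zone(CognitiveZone("planning", handler=lambda goal: f"plan for {goal}"))
print(system.perform("planning", "make tea"))
```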

I think the main thing you're missing here is that an AI is not generally going to share common learning facilities with humans. Growing up as a human will make an AI wildly different from a normal human, because it isn't built precisely to learn from those experiences the way a human does.

I haven't read your papers, but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can't learn at this point, then I find it hard to believe it's generally capable; and if it can, it will have an incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles?

On the other hand, this made me curious about what we could do with an advanced model that is instructed not to learn, and also whether we can even define and ensure that a model stops learning.

2Koen.Holtman
Yes, I address this; see for example the part about The possibility of learned self-knowledge in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals.

What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer: you can decide if you want to design an agent which indeed uses this latent ability to 'look at itself'. Once you decide that you don't want to use this latent ability, certain safety/corrigibility problems become a lot more tractable.

Wikipedia has the following definition of AGI:

Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.

Terminology note, if you want to look into this some more: ML typically does not frame this goal as 'instructing the model not to learn about Q'. ML would frame it as 'building the model to approximate the specific relation P(X|Y,Z) between some well-defined observables, and this relation is definitely not Q'.

I agree that this thread makes it clearer why takeoff speeds matter to people, but I always want to ask why people think sufficient work is going to get done in that extended 4-10 years, even with access to proto-AGI to directly study.

Thanks. This is great! I hadn't thought of Embedded Agency as an attempt to understand optimization. I thought it was an attempt to ground optimizers in a formalism that wouldn't behave wildly once they had to start interacting with themselves. But on second thought it makes sense to consider an optimizer that can't handle interacting with itself to be a broken or limited optimizer.

8Rob Bensinger
I think another missing puzzle piece here is 'the Embedded Agency agenda isn't just about embedded agency'. From my perspective, the Embedded Agency sequence is saying (albeit not super explicitly):

  • Here's a giant grab bag of anomalies, limitations, and contradictions in our whole understanding of reasoning, decision-making, self-modeling, environment-modeling, etc.
  • A common theme in these ways our understanding of intelligence goes on the fritz is embeddedness.
  • The existence of this common theme (plus various more specific interconnections) suggests it may be useful to think about all these problems in light of each other; and it suggests that these problems might be surprisingly tractable, since a single sufficiently-deep insight into 'how embedded reasoning works' might knock down a whole bunch of these obstacles all at once.

The point (in my mind -- Scott may disagree) isn't 'here's a bunch of riddles about embeddedness, which we care about because embeddedness is inherently important'; the point is 'here's a bunch of riddles about intelligence/optimization/agency/etc., and the fact that they all sort of have embeddedness in common may be a hint about how we can make progress on these problems'. This is related to the argument made in The Rocket Alignment Problem.

The core point of Embedded Agency (again, in my mind, as a non-researcher observing from a distance) isn't stuff like 'agents might behave wildly once they get smart enough and start modeling themselves, so we should try to understand reflection so they don't go haywire'. It's 'the fact that our formal models break when we add reflection shows that our models are wrong; if we found a better model that wasn't so fragile and context-dependent and just-plain-wrong, a bunch of things about alignable AGI might start to look less murky'.

(I think this is oversimplifying, and there are also more direct value-adds of Embedded Agency stuff. But I see those as less core.)

The discussion of Subsyst

No one has yet solved "and then stop" for AGI, even though this should be easier than a generic stop button, which in turn should be easier than full corrigibility. (Also, I don't think we know how to refer to things in the world in a way that gets an AI to care about them, rather than about observations of them or its representation of them.)

the ways in which solving AF would likely be useful

Other than the rocket alignment analogy and the general case for deconfusion helping, has anyone ever tried to describe in more concrete (though speculative) detail how AF would help with alignment? I'm not saying it wouldn't. I just literally want to know if anyone has tried explaining this concretely. I've been following for a decade but don't think I ever saw an attempted explanation.

Example I just made up:

  • Modern ML is in some sense about passing the buck to gradient-descent-ish processes to find our optimizers for us. This results in very complicated, alien systems that we couldn't build ourselves, which is a worst-case scenario for interpretability / understandability.
  • If we better understood how optimization works, we might be able to do less buck-passing / delegate less of AI design to gradient-descent-ish processes.
  • Developing a better formal model of embedded agents could tell us more about how optimization works in this way, allowing
... (read more)