Thank you so much for sharing this extremely insightful argument, Evan! I really appreciate hearing your detailed thoughts on this.
I've been grappling with the pros and cons of an atheoretical-empirics-based approach (in your language, "behavior") and a theory-based approach (in your language, "understanding") within the complex sciences, such as, but not limited to, AI. My current thought is that unfortunately, both of the following are true:
1) Findings based on atheoretical empirics are susceptible to being brittle, in that it is unclear whether or in prec...
Thank you so much for the excellent and insightful post on mechanistic models, Evan!
My hypothesis is that the difficulty of finding mechanistic models that consistently make accurate predictions stems from the agent-environment system’s complexity and computational irreducibility. Such agent-environment interactions may be inherently unpredictable "because of the difficulty of pre-stating the relevant features of ecological niches, the complexity of ecological systems and [the fact that the agent-ecology interaction] can enable its own novel system s...
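To illustrate what computational irreducibility can look like in a toy setting (my own example, in the spirit of Wolfram's cellular automata, not something from your post): for Rule 30, no known closed-form shortcut predicts a distant future state; as far as we know, you have to simulate every intermediate step.

```python
# Toy illustration of computational irreducibility via the Rule 30
# cellular automaton: to learn the state at step t, we simulate all
# t steps; no known shortcut exists.
def rule30_step(cells):
    n = len(cells)
    # Rule 30 update: new cell = left XOR (center OR right)
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n]) for i in range(n)]

cells = [0] * 31
cells[15] = 1  # start from a single "on" cell in the middle
for _ in range(15):
    print("".join("#" if c else "." for c in cells))
    cells = rule30_step(cells)
```

If agent-environment dynamics are like this, a mechanistic model that is much cheaper to evaluate than running the system itself may simply not exist.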
Thank you very much for the honest and substantive feedback, Harfe! I really appreciate it.
I think the disagreeing commenters, and perhaps many of the downvoters, agreed that the loss in secrecy value was a factor, but disagreed about the magnitude of this effect (and about my claim that it may be comparable to, or even exceed, the magnitude of the other effect: a reduction in the number of AI safety plans and new researchers).
Quoting my comment on the EA forum for discussion of the cruxes and how I propose they may be updated:
"Thank you so much for the clarification, ...
Thank you so much for this writeup of your fascinating findings about interpreting the SVD of the weight matrix, Beren and Sid!
Understanding the degree to which transformer representations are linear vs nonlinear, and developing methods that can help us discover, locate, and interpret nonlinear representations will ultimately be necessary for fully solving interpretability of any nonlinear neural network.
Completely agree. For what it's worth, I expect interpreting nonlinear representations in complex neural nets to be quite difficult. We should expect line...
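As a concrete (and heavily simplified) sketch of what I mean, here is one standard way to probe for linearity: compare a linear probe against a small nonlinear probe on the same activations. Everything below is a hypothetical placeholder, not an actual transformer experiment.

```python
# Minimal sketch: test whether a concept is linearly decodable from
# activations by comparing a linear probe to a small nonlinear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))    # placeholder "hidden activations"
labels = acts[:, 0] * acts[:, 1] > 0  # a deliberately nonlinear "concept"

linear = cross_val_score(LogisticRegression(max_iter=1000), acts, labels, cv=5).mean()
nonlinear = cross_val_score(MLPClassifier((32,), max_iter=2000), acts, labels, cv=5).mean()
print(f"linear probe accuracy: {linear:.2f}, nonlinear probe accuracy: {nonlinear:.2f}")
# A large gap is evidence that the concept is represented nonlinearly in
# this basis, which is the case current interpretability methods handle poorly.
```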
Thank you so much for this thorough and well-thought-out writeup, Eli and Charlotte! I think there is a high probability that TAI will be developed soon, and it is excellent that you have laid out your views on the implications for alignment in such detail.
Regarding your point on interpreting honesty and other concepts from the AI's internal states: I wanted to share a recent argument I've made that the practical upside of interpretability may be lower than originally thought. This is because the interaction dynamics between the AI agen...
Thank you so much, Beth, for your extremely insightful comment! I really appreciate your time.
I completely agree with everything you said. I agree that "you can make many very useful predictions about humans by modeling them as (for example) having some set of goals, habits, tendencies, and knowledge," and that these insights will be very useful for alignment research.
I also agree that "it's difficult to identify what a human's intentions are just by having access to their brain." This was actually the main point I wanted to get across; I guess it wa...
Thank you so much for your detailed and insightful response, Esben! It is extremely informative and helpful.
...So as far as I understand your text, you argue that fine-grained interpretability loses out against "empiricism" (running the model) because of computational intractability.
I generally disagree with this. beren points out many of the same critiques of this piece as I would come forth with. Additionally, the arguments seem too undefined, like there is not in-depth argumentation enough to support the points you make. Strong upvote for writing them out,
Thank you so much for your insightful and detailed response, Beren! I really appreciate your time.
The cruxes seem very important to investigate.
This seems especially likely to me if the AGI's architecture is hand-designed by humans – i.e. there is a ‘world model’ part and a ‘planner’ part and a ‘value function’ and so forth.
It probably helps to have the AGI's architecture hand-designed to be more human-interpretable. My model is that on the spectrum from high-complexity paradigms (e.g., deep learning) to low-complexity paradigms (e.g., software design b...
Thank you very much for the detailed and insightful post, Lee, Sid, and Beren! I really appreciate it.
In the spirit of open communication, I'm writing to share my recent argument that mechanistic interpretability may not be a reliable safety plan for AGI-scale models.
It would be really helpful to hear your thoughts on it!
Thank you so much, Erik, for your detailed and honest feedback! I really appreciate it.
I agree with you that we obviously won't be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans most likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogou...
I think the point of Thomas, Akash, and Olivia's post is that more people should focus on buying time, because solving the AI safety/alignment problem before capabilities increase to the point of AGI is important, and right now the latter is progressing much faster than the former.
See the first two paragraphs of my post, although I could have made its point and the implicit modeling assumptions more explicit:
"AI capabilities research seems to be substantially outpacing AI safety research. It is most likely true that successfully solving the A...
That’s totally fair!
The part of my post I meant to highlight was the last sentence: “To put it bluntly, we should—on all fronts—scale up efforts to recruit talented AI capabilities researchers into AI safety research, in order to slow down the former in comparison to the latter.”
Perhaps I should have made this point front-and-center.
Some practical ideas of how to achieve this (and a productive debate in the comments section of the risks from low-quality outreach efforts) can be found in my related forum post from earlier: https://forum.effectivealtruism.org/posts/juhMehg89FrLX9pTj/a-grand-strategy-to-recruit-ai-capabilities-researchers-into
Thanks so much for your helpful comment, Charlie! I really appreciate it.
It is likely that our cruxes are the following. I think that (1) we probably cannot predict the precise moment the AGI becomes agentic and/or dangerous, (2) we probably won't have a strong credence that a specific alignment plan will succeed, and (3) AGI takeoff will be slow enough that secrecy can be a key difference-maker in whether we die or not.
So, I expect we will have alignment plan numbers 1, 2, 3, and so on. We will try alignment plan 1, but it will probably not succeed (and h...
Thank you so much, Marius, for writing this pertinent post! The question of whether a given interpretability tool will help us or hurt us in expectation is an extremely important one.
The answer, however, differs from situation to situation. The scientific benefit of an interpretability tool (more generally, of any information channel) is difficult to estimate a priori, but is likely tied to its informational efficiency. Roughly speaking: how much informational value can the interpretability tool/information channel yield per unit of bitrate?
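One rough way to make this precise (my own formalization, not something from your post): model the tool as a channel with output $O$, and let $S$ be the safety-relevant internal state we want to learn about. Then a natural efficiency measure is

$$\eta \;=\; \frac{I(S; O)}{H(O)},$$

i.e., the fraction of the channel's output bits that actually carry information about $S$. A tool with high $\eta$ buys more understanding per bit of exposure.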
The...
Thank you so much for your kind words! I really appreciate it.
One definition of alignment is: Will the AI do what we want it to do? And as your post compellingly argues, "what we want it to do" is not well-defined, because it is something that a powerful AI may itself be able to influence. In many settings, a term that's easier to rigorously pin down, like safe AI, trustworthy AI, or corrigible AI, may be more useful.
I would definitely count the AI's drive towards self-improvement as a part of the College Kid Problem! Sorry if the post did not make that clear.
In general, it is much easier to keep potentially concerning material out of the AGI’s training set if it’s already a secret rather than something that’s been published on the Internet. This is because there may be copies, references, and discussions of the material elsewhere in the training set that we fail to catch.
If it’s already posted on the Internet and it’s too late, we should of course still try our best to keep it out of the training set.
As for the question of “should we give up on security after AGI attains high capabilities?” we shouldn’t give u...
This is an excellent idea. An encrypted, airgapped, or paper library for coordination among AI researchers seems crucial for AGI safety.
This is because we should expect, in the worst case, that the AGI will be trained on the whole Internet, including any online discussion of our interpretability tools, security research, and so on. This is information that the AGI can use against us (e.g., by using our interpretability tools to hack, deceive, or otherwise socially engineer the alignment researchers).
Security through obscurity can buy...
This is indeed a vital but underdiscussed problem. My SERI MATS team published a post about a game-theoretic model of alignment in which the expected scientific benefit of an interpretability tool can be weighed against the expected cost of the AGI escape risks it enables. The expected cost can be reduced by limiting the AGI's capabilities and by increasing the quality of security, and the expected scientific benefit can be increased by prioritizing the informational efficiency of the interpretability tool.
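As a toy numerical illustration (a deliberately simplified stand-in, not our post's actual model), the tradeoff might look like:

```python
# Hypothetical toy version of the tradeoff: expected net value of using an
# interpretability tool, where the channel both yields scientific value and
# carries some escape risk. All parameters are made-up placeholders.
def expected_net_value(value_per_bit, bitrate, p_escape_per_bit, escape_cost):
    benefit = value_per_bit * bitrate                 # expected scientific benefit
    cost = p_escape_per_bit * bitrate * escape_cost   # expected cost from escape risk
    return benefit - cost

# Better security or a weaker AGI lowers p_escape_per_bit; a more
# informationally efficient tool raises value_per_bit at a fixed bitrate.
print(expected_net_value(value_per_bit=2.0, bitrate=100.0,
                         p_escape_per_bit=1e-4, escape_cost=1e6))
```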
Conditional on an organization dead set on buil...
I strongly agree with John that “what we really want to do is to not build a thing which needs to be boxed in the first place.” This is indeed the ultimate security mindset.
I also strongly agree that relying on a “fancy,” multifaceted box that looks secure due to its complexity, but may not be (especially to a superintelligent AGI), is not security mindset.
One definition of security mindset is “suppose that anything that could go wrong, will go wrong.” So, even if we have reason to believe that we’ve achieved an aligned superintelligent AGI, we should have...
Thanks so much for your insightful comment, Charlie! I really appreciate it.
I think you totally could do this. Even if it is rare, it can occur with positive probability.
For example, my model of how natural selection (genetic algorithms, not SGD) consistently creates diversity is that, with sufficiently many draws of descendants, one of the drawn descendants could have turned off the original model and turned on another model in a way that constitutes neutral drift.
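To spell out the "positive probability" point with a standard calculation (mine, added for concreteness): if each draw independently produces such a descendant with probability $p > 0$, then after $N$ draws the probability that at least one appears is $1 - (1 - p)^N$, which tends to $1$ as $N \to \infty$. Rare per-draw events become near-certain at evolutionary scale.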
Edit: Adding a link to "Git Re-Basin: Merging Models modulo Permutation Symmetries," a relevant paper that has recently been posted on arXiv.
Thank you so much, Thomas and Buck, for reading the post and for your insightful comments!
It is indeed true that some functions have two global minimizers that are not path-connected. Empirically, very overparametrized models which are trained on "non-artificial" datasets ("datasets from nature"?) seem to have a connected Rashomon manifold. It would definitely be helpful to know theoretically why this tends to h...
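(A toy example of the disconnected case, added for concreteness: $L(\theta) = (\theta^2 - 1)^2$ on $\mathbb{R}$ has global minimizers exactly at $\theta = \pm 1$, and any path between them must pass through points of strictly positive loss, so the set of global minimizers is disconnected.)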
Thank you so much for this suggestion, tgb and harfe! I completely agree, and this was entirely my error in our team's collaborative post. The fact that the level sets of submersions are nice submanifolds has nothing to do with the level set of global minimizers.
I think we will be revising this post in the near future reflecting this and other errors.
(For example, the Hessian tells you which directions have zero second-order penalty to the loss, but it doesn't necessarily tell you about higher-order penalties to the loss, which is something I f...
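(A minimal example of the gap, added by me: take $L(\theta) = \theta^4$ at $\theta = 0$. The Hessian there is $L''(0) = 0$, so every direction looks flat at second order, yet the loss still rises at fourth order as you move away from $0$.)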
Thanks so much, Charlie, for reading the post and for your comment! I really appreciate it.
I think both pruning neurons and making the neural net sparser are very promising steps toward constructing a simultaneously optimal and interpretable model.
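As a minimal sketch of one common pruning approach (magnitude pruning, with made-up numbers; this is just an illustration, not the specific method you proposed):

```python
import numpy as np

# Magnitude pruning: zero out the smallest-magnitude weights so the
# network becomes sparser and, hopefully, easier to interpret.
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the lowest-|w| fraction zeroed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

w = np.random.default_rng(0).normal(size=(256, 256))  # placeholder weights
w_sparse = magnitude_prune(w, sparsity=0.9)           # keep the top 10% by magnitude
print(f"nonzero fraction: {(w_sparse != 0).mean():.2f}")
```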
I completely agree that aligning the neuron basis with human-interpretable classifications of the data would really help interpretability. But if only a subset of the neuron basis is aligned with human-interpretable concepts, and the complement comprises a very large subset of abstractions (which,...
Thank you so much for taking the time to read our paper, Chris! I'm extremely grateful.