Searching for a model's concepts by their shape – a theoretical framework
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort

Introduction

I think that Discovering Latent Knowledge in Language Models Without Supervision (DLK; Burns, Ye, Klein, & Steinhardt, 2022) is a very cool paper – it proposes a way to do unsupervised mind reading[1] –...
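To make "unsupervised mind reading" a bit more concrete, here is a minimal sketch of the kind of probe DLK trains (Contrast-Consistent Search): a linear probe on hidden states whose loss rewards assigning a statement and its negation probabilities that sum to 1, while discouraging the degenerate 0.5/0.5 solution. This is only an illustration, not the paper's implementation – the tensor names are placeholders, the hidden states here are random stand-ins for real model activations, and details from the paper such as normalizing the hidden states and retraining from multiple random initializations are omitted.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability of 'true'."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h))

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: a statement and its negation should get probabilities summing to 1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Toy usage: random tensors standing in for hidden states extracted from a model
# on contrast pairs ("x? Yes" vs. "x? No").
hidden_dim = 768
h_pos = torch.randn(128, hidden_dim)  # placeholder activations for the "Yes" half of each pair
h_neg = torch.randn(128, hidden_dim)  # placeholder activations for the "No" half of each pair

probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe(h_pos), probe(h_neg))
    loss.backward()
    opt.step()
```

The point of the sketch is just that no truth labels appear anywhere: the probe is shaped entirely by the logical consistency constraints between the two halves of each contrast pair.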