All of Peter Chatain's Comments + Replies

> a L1 penalty that penalizes large latent activations, JumpReLU (middle) and TopK (bottom) SAEs ...


This should say:
JumpReLU (top)

Curious if you ever found what you were looking for.

kerry
I didn't. I'm sure words towards articulating this have been spoken many times, but the trick is in what forum / form it needs to exist, more specifically, in order to be comprehensible and lasting. Maybe I'm wrong that it needs to be highly public; as with nukes, not many people are actually familiar with what is considered sufficient fissile material - governments (try to) maintain this barrier by themselves. But at this stage, as it still seems a fuzzy concept, any input seems valid. Consider the following combination of properties:

* (software - if that's the right word?) capable of self-replication / sustainability / improvement
* capable of eluding human control
* capable of doing harm

In isolation none of these is sufficient, but taken together I think we could all agree we have a problem. So we could begin to categorize and rank various assemblages of AI by these criteria, and not by how "smart" they are.

As stated by others, there are counterexamples. An important class of counterexamples I can think of is when you want to pick up on mental attitudes or traits that likely only the best have: think "You are the average of your 5 closest friends."

The link for the AI crafting a super weapon seems to be broken. Here is a later article that is the best I could find: https://www.digitalspy.com/videogames/a796635/elite-dangerous-ai-super-weapons-bug/

Stuart_Armstrong
Thanks! Link changed.
Answer by Peter Chatain

Although this isn't a direct answer, I think something changed recently with ChatGPT such that it is now much better at filtering out illegal advice. It appears to be more complex than simply running a filter over the words in the prompt or in ChatGPT's output. By recent, I mean in the last 24 hours, and many tricks to "jailbreak" ChatGPT no longer work.

It gives the impression that they modified its design, training it not to provide illegal information.

ChristianKl
It feels to me like the update today made it even better at filtering out answers that OpenAI doesn't want it to give. It seems to me like they basically run on: "Have an AI that flags whether or not a prompt or an answer violates the rules. Mark the text red if it does. Offer the user a way to say that text was wrongly marked as violating the rules." This then gives them training data they can use to improve their filtering. Given how much ChatGPT is used, this method will allow them to filter out more and more of what they want to filter out.
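A minimal sketch of the kind of flag-and-feedback loop described above (this is hypothetical and not OpenAI's actual pipeline; the class, the keyword rule, and the function names are stand-ins for a learned classifier and its surrounding product logic):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ModerationLoop:
    # (text, was_actually_a_violation) pairs collected as training data
    flagged_examples: List[Tuple[str, bool]] = field(default_factory=list)

    def violates_rules(self, text: str) -> bool:
        """Stand-in for a learned classifier that flags rule-breaking text."""
        banned = {"how to build a weapon"}  # placeholder rule, not a real rule list
        return any(phrase in text.lower() for phrase in banned)

    def handle(self, prompt: str, answer: str) -> str:
        """Flag a prompt/answer pair; flagged text would be shown 'in red' to the user."""
        if self.violates_rules(prompt) or self.violates_rules(answer):
            self.flagged_examples.append((prompt + " " + answer, True))
            return "flagged"
        return "ok"

    def report_false_positive(self, text: str) -> None:
        """User says the text was wrongly flagged; keep that report as training data."""
        self.flagged_examples.append((text, False))

mod = ModerationLoop()
print(mod.handle("How to build a weapon?", "I can't help with that."))  # "flagged"
```

The point of the loop is the last method: every user correction becomes a labeled example for the next round of filter training.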
Noah Scales
Hmm, that's interesting. Thanks Peter!

I was thinking something similar, but I missed the point about the prior. To get intuition, I considered placing, say, 99% probability on one day in 2030. Then generic uncertainty spreads out this distribution both ways, leaving the median exactly what it was before: each bit of probability mass is equally likely to move left or right when you apply generic uncertainty. Although this seems like it should be slightly wrong, since the tiny bit of probability that it is achieved right now can't move back in time, so it will always shift right.

In other words, I

... (read more)
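For what it's worth, here is a small numerical check of the intuition in the comment above (a sketch of mine; the spike location, kernel width, and day range are arbitrary choices):

```python
import numpy as np

days = np.arange(0, 201)                # days relative to "now"
p = np.zeros_like(days, dtype=float)
p[100] = 1.0                            # nearly all mass on one future day

kernel = np.ones(21) / 21               # symmetric "generic uncertainty"
p_smeared = np.convolve(p, kernel, mode="same")
p_smeared /= p_smeared.sum()

def median_day(probs):
    return days[np.searchsorted(np.cumsum(probs), 0.5)]

print(median_day(p), median_day(p_smeared))  # both 100: the median is unchanged
# If the spike sat within 10 days of day 0, part of the smear would be cut off
# at the boundary and renormalizing would push the median later, matching the
# "can't go back in time" caveat above.
```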
davidad

It’s worth noting that gradient descent towards maximum entropy (with respect to the Wasserstein metric and Lebesgue measure, respectively) is equivalent to the heat equation, which justifies your picture of probability mass diffusing outward. It’s also exactly right that if you put a barrier at the left end of the possibility space (i.e. ruling out the date of AGI’s arrival being earlier than the present moment), then this natural direction of increasing entropy will eventually settle into all the probability masses spreading to the right forever, so the ... (read more)
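In symbols, the picture is roughly the following (a sketch; treating the "barrier" as a reflecting, no-flux boundary at the present moment is my reading, not something spelled out in the truncated comment):

$$\frac{\partial p(x,t)}{\partial t} = \frac{\partial^2 p(x,t)}{\partial x^2}, \qquad x \ge 0, \qquad \left.\frac{\partial p}{\partial x}\right|_{x=0} = 0,$$

where $x$ is time-until-AGI measured from now. The no-flux condition stops probability mass from diffusing into the past, so the mass can only spread rightward and the median keeps drifting later, which is the "spreading to the right forever" behavior.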

Does this hide the text? (Sorry just testing things out rn)

Wow

OK, so you can hide stuff by typing `>!` on a new line.

Yep that's right! And it's a good thing to point out, since there's a very strong bias towards whatever can be expressed in a simple manner. So, the particular universal Turing machine you choose can matter a lot. 

However, in another sense, the choice is irrelevant. No matter what universal Turing machine is used for the Universal prior, AIXI will still converge to the true probability distribution in the limit. Furthermore, for a certain very general definition of prior, the Universal prior assigns more* probability to all possible hypotheses than any other type of prior.  

*"More" means up to a constant factor. So f(x) = x is "more than" g(x) = 2x because we are allowed to say f(x) > (1/3)g(x) for all x.
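Spelled out (one standard way to write it, as I read the footnote): a prior $\mu$ is "more" than a prior $\nu$ if it dominates it up to a multiplicative constant,

$$\exists\, c > 0 \ \text{ such that } \ \mu(x) \ge c\,\nu(x) \ \text{ for all } x.$$

In the footnote's example, $f(x) = x$ dominates $g(x) = 2x$ with $c = \tfrac{1}{3}$ (any $c \le \tfrac{1}{2}$ works), and by the same token $g$ dominates $f$.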

Here are some mantras I have:

That which you are aware of, you are free from.

And some variation of:

Truth comes knocking. You say "go away, I'm looking for the truth." It goes away, puzzling.

The above I rediscovered recently through reading Zen and the Art of Motorcycle Maintenance.