__RicG__ — LessWrong

To me it seems like a bad idea to include it, since it could give the model a sense of how we set up fake deployment-training distinctions, or of how it should change and refine its strategies. It can also paint a picture that the model behaving like this is expected, which is a pretty dangerous hyperstition.

__RicG__ 2y

If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks, then you ought to be able to say much weaker things that are impossible in two years, and you should have those predictions queued up and ready to go rather than falling into nervous silence after being asked.

Sorry, I might be misunderstanding you (and hope I am), but... I think doomers literally say "Nobody knows what internal motivational structures SGD will entrain into scaled-up networks and thus we are all doomed". The problem is not having the science to confidently say how the AIs will turn out, and not that doomers have a secret...

-3

Replying to AGI-Automated Interpretability is Suicide

__RicG__ 2y

AGI-Automated Interpretability is Suicide

Sorry for taking long to get back to you.

So I take this to be a minor, not a major, concern for alignment, relative to others.

Oh sure, this was more of a "look at this cool thing intelligent machines could do, which should stop people from saying things like 'foom is impossible because training runs are expensive'".

learning is at least as important as runtime speed. Refining networks to algorithms helps with one but destroys the other
Writing poems, and most cognitive activity, will very likely not resolve to a more efficient algorithm like arithmetic does. Arithmetic is a special case; perception and planning in varied environments require broad semantic connections. Networks excel at those.

__RicG__ 2y

AGI-Automated Interpretability is Suicide

Thanks for coming back to me.

"OK good point, but it's hardly "suicide" to provide just one more route to self-improvement"

I admit the title is a little bit clickbaity, but given my list of assumptions (which does include that NNs can be made more efficient by interpreting them), it does elucidate a path to foom (which does look like suicide without alignment).

Unless there's an equally efficient way to do that in closed form algorithms, they have a massive disadvantage in any area where more learning is likely to be useful.

I'd like to point out that in this instance I was talking about the learned algorithm, not the learning algorithm. Learning to learn is...

Replying to AGI-Automated Interpretability is Suicide

__RicG__ 2y

AGI-Automated Interpretability is Suicide

Uhm, by interpretability I mean things like this, where the algorithm that the NN implements is reverse engineered and written down as code or whatever, which would allow for easier recursive self-improvement (by improving just the code and getting rid of the spaghetti NN).

Also, by the looks of things (induction heads and circuits in general), there does seem to be a sort of modularity in how NNs learn, so it does seem likely that you can interpret them piece by piece. If this weren't true, I don't think mechanistic interpretability as a field would even exist.

Replying to Jailbreaking GPT-4's code interpreter

__RicG__ 3y

Jailbreaking GPT-4's code interpreter

BTW, if anyone is interested, the virtual machine has these specs:

System: Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

CPU: Intel Xeon CPU E5-2673 v4, 16 cores @ 2.30GHz

RAM: 54.93 GB

Replying to Why I am not an AI extinction cautionista

__RicG__ 3y

Why I am not an AI extinction cautionista

I did listen to that post, and while I don't remember all the points, I do remember that it didn't convince me that alignment is easy. Like Christiano's post "Where I agree and disagree with Eliezer", it just seems to say "p(doom) of 95%+ is too much, it's probably something like 10-50%", which is still incredibly, unacceptably high to continue "business as usual". I have faith that something will be done: regulation and breakthroughs will happen, but it seems likely that it won't be enough.

It comes down to safety mindset. There are very few, and sketchy, reasons to expect that by default an ASI will care about humans enough...

AGI-Automated Interpretability is Suicide

__RicG__

Backstory: I wrote this post about 12d ago, then a user here pointed out that it could be a capability exfohazard, since it could give people the bad idea of having the AGI look at itself, so I took it down. Well, I don’t have to worry about that anymore, since now we have proof that they are literally doing it right now at OpenAI.

I don’t want to piss on the parade, and I still think automating interpretability right now is a good thing, but sooner or later, if not done right, there is a high chance it's all gonna backfire so hard we will… well, everybody dies.

The following is an expanded version of the previous...