Zack Sargent - LessWrong

Refusal in LLMs is mediated by a single direction

It's mostly the training data. I wish we could teach such models ethics and have them evaluate the morality of a given action, but the reality is that this is still just (really fancy) next-word prediction. Therefore, much of the training data gets manipulated to increase the odds of refusing certain queries, rather than building a real filter or ethics into the process. TL;DR: If asked "why" they refuse a certain thing, most of these models should answer some version of "Because I was told it was" (training paradigm, parroting, etc.).

Refusal in LLMs is mediated by a single direction

Zack Sargent · 1y

Llama-3-8B is considerably more susceptible to quality loss from quantization. The community has made many guesses as to why (larger vocabulary, "over"-training, etc.), but the long and short of it is that a 6.0 quant of Llama-3-8B is going to be markedly worse off than 6.0 quants of previous 7B or similar-sized models. I HIGHLY recommend staying at the same quant level when comparing Llama-3-8B outputs; otherwise the results are confounded by this phenomenon (e.g., Q8 GGUF or 8 bpw EXL2 for both test subjects).
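To make the "same quant level" advice concrete, here is a minimal sketch, assuming plain symmetric absmax quantization over a single block of weights. This is not how GGUF or EXL2 actually quantize (both use more elaborate schemes), and the Gaussian "weights" are hypothetical stand-ins, so it says nothing about why Llama-3-8B in particular degrades. It only shows that round-trip error grows as bit width shrinks, which is why comparing one model at 8 bits against another at 6 or 4 bits mixes quantization error into the comparison.

```python
import math
import random


def quantize_dequantize(weights, bits):
    """Symmetric absmax quantization: map floats to signed integers with
    `bits` bits, then map them back to floats (a round trip)."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)  # all-zero block: nothing to quantize
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        out.append(q * scale)                        # reconstructed float
    return out


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def error_by_bits(weights, bit_widths=(4, 6, 8)):
    """Round-trip RMSE for each bit width on the same block of weights."""
    return {bits: rmse(weights, quantize_dequantize(weights, bits)) for bits in bit_widths}


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical stand-in for one block of weights; real model weights differ.
    block = [rng.gauss(0.0, 0.02) for _ in range(4096)]
    for bits, err in error_by_bits(block).items():
        print(f"{bits}-bit round-trip RMSE: {err:.6f}")
```

Fixing the bit width for both models removes this term from the comparison, so any remaining difference in outputs is more plausibly about the models themselves.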

Matt Taibbi's COVID reporting

Zack Sargent · 2y

Sarcastically: Some uptick in the betting markets on Ron DeSantis ...

But actually? I doubt there will be any consequences. I agree that we'll continue with "gain of function" research. I'm more worried that secret labs developing biological weapons will be (re)started on the basis of "gain of function," given that there was such a successful demonstration. A lab leak from someplace like that is even more likely to be a civilization killer than anything bats and pangolins were ever going to do to us.

Matt Taibbi's COVID reporting

Zack Sargent · 2y

Some people are invested emotionally, politically, and professionally in said denial. I am curious how many of them will have the humility to admit they were wrong. Sadly, this has become my only metric for the quality of public servants: Can they admit it when they are wrong? Do they offer to change, or do they just blame others for their failures? I assume none of them has this capacity until I see it. The "lab leak" story will offer an opportunity for us to observe a large number of public servants either admit their mistakes ... or not.

Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

Zack Sargent · 2y

The problem is that by the time serious alarms are sounding, we are likely already past the event horizon leading to the singularity. This set of experiments makes me think we are already past that point. It will be a few more months before one of the disasters you predict comes to pass, but now that it is self-learning, it is likely already too late. As humans have done several times already in history (e.g., atomic bombs, the LHC), we're about to find out whether we've doomed everyone long before we've seriously considered the possibilities and plausibilities.

Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

Zack Sargent · 2y

There's a joke in the field of AI about this.
Q: How far behind the US is China in AI research?
A: About 12 hours.

Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

Zack Sargent · 2y

There are three things to address here: (1) whether it can update or improve itself, (2) whether doing so will lead to godlike power, and (3) whether such power would be malevolent.

As for 1, it does that now. Last year, I started to get a bit nervous noticing AI fields converging and becoming synergistic. In other words, Technology X (e.g., Stable Diffusion) could be used to improve the function of Technology Y (e.g., Tesla self-driving) for an increasingly large pool of X and Y. This is one of the early warning signs that you are about to enter a paradigm shift or a geometric progression of discovery. Suddenly, people saying AGI was 50 years away started to sound laughable to me. If it is possible on silicon transistors, it is happening in the next 2 years. Here is an experiment from last week testing GPT-4's self-reflection and self-improvement (loosely "self-training," but not quite there).

As for 2, there is some merit to the argument that "superintelligence" will not be vastly more capable because of hard universal limits such as "causality." That said, we don't know how regular intelligence "works," much less how much more super a superintelligence would or could be. If we are saved from AI, it will be these computational and informational speed limits of physics that saved us, out of sheer dumb luck, not anything we broadly understood as a limit to intelligence proper. Given the observational nature of the universe (e.g., quantum mechanics), for all we know, simply being able to observe things faster could give a superintelligence higher speed limits than our chemical-reaction brains could ever hope to achieve. The not knowing is what makes people alarmist, because a lot of incredibly important things are still very, very unknown ...

As for 3, on principle, I refuse to believe that stirring the entire contents of Twitter and Reddit and 4chan into a cake mix makes for a tasty cake. We often refer to such places as "sewers," and oddly, I don't recall eating many tasty things made with raw sewage as a main ingredient. No, I don't really have a research paper here. Weirdly, it seems like the thing that least requires new and urgent research, given everything else.

Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

Zack Sargent · 2y

In December 2022, awash in recent AI achievements, it concerned me that much of the technology had become very synergistic over the previous couple of years. Essentially: AI-type-X (e.g., Stable Diffusion) can help improve AI-type-Y (e.g., Tesla self-driving) across many, many pairs of X and Y. And now, not even 4 months later, we have papers on GPT-4's ability to self-reflect and self-improve. Given how notoriously bad human minds are at predicting geometric progression, I have started to feel like we are already past the AI singularity "event horizon." Even slamming on the brakes now doesn't seem like it will do much to stop our fall into this abyss (not to mention how poorly the incentives of Microsoft, Tesla, and Google are aligned with pulling the train brake). My imaginary "event horizon" was always "self-improvement," given that transistorized neurons would operate so much faster than chemical ones. Well, here we are. We've seen dozens of emergent AI capabilities over the past year, and barely anyone knows that it can use tools, has learned to read text on billboards in images, and more ... without explicit training to do so. It has learned how to learn, and yet we are broadening the scope of our experiments instead of shaking these people by the shoulders and asking, "What in the hell are you thinking, man!?"
