All of ACCount's Comments + Replies

I think it's plausible that at this point a bunch of the public thinks AIs are people who deserve to be released and given rights.

So far, the general public has resisted the idea very strongly.

Science fiction has a lot of "if it thinks like a person and feels like a person, then it's a person" - but we already have AIs that can talk like people and act like they have feelings. And yet, the world doesn't seem to be in a hurry to reenact that particular sci-fi cliche. The attitudes are dismissive at best.

Even with the recent Anthropic papers being out there ... (read more)

9Max Harms
I don't think AI personhood will be a mainstream cause area (i.e. most people will think it's weird/not true similar to animal rights), but I do think there will be a vocal minority. I already know some people like this, and as capabilities progress and things get less controlled by the labs, I do think we'll see this become an important issue. Want to make a bet? I'll take 1:1 odds that in mid-Sept 2027 if we poll 200 people on whether they think AIs are people, at least 3 of them say "yes, and this is an important issue." (Other proposed options "yes, but not important", "no", and "unsure".) Feel free to name a dollar amount and an arbitrator to use in case of disputes.

Is it time to start training AI in governance and policy-making?

There are numerous allegations of politicians using AI systems - including to draft legislation, and to make decisions that affect millions of people. Hard to verify, but it seems likely that:

  1. AIs are already used like this occasionally
  2. This is going to become more common in the future
  3. Telling politicians "using AI for policy-making is a really bad idea" isn't going to stop it completely
  4. Training AI to hard-refuse queries like this may also fail to stop this completely

Training an AI to make more s... (read more)

ACCount10

Is the same true for GPT-4o then, which could spot Claude's hallucinations?

Might be worth testing a few open source models with better known training processes.

ACCount10

This is way more metacognitive skill than what I would have expected an LLM to have. I can make sense of how an LLM would be able to do that, but only in retrospect.

And if a modern high end LLM already knows on some level and recognizes its own uncertainty? Could you design a fine tuning pipeline to reduce hallucination level based on that? At least for reasoning models, if not for all of them?

2avturchin
It looks like (based on the article published a few days ago by Anthropic about the microscope) Claude Sonnet was trained to distinguish facts from hallucinations, so it's not surprising that it knows when it hallucinates.  
ACCountΩ596

What stood out to me was just how dependent a lot of this was on the training data. Feels like if an AI manages to gain misaligned hidden behaviors during RL stages instead, a lot of this might unravel.

The trick with invoking a "user" persona to make the AI scrutinize itself and reveal its hidden agenda is incredibly fucking amusing. And potentially really really useful? I've been thinking about using this kind of thing in fine-tuning for fine control over AI behavior (specifically "critic/teacher" subpersonas for learning from mistakes in a more natural w... (read more)

2Florian_Dietz
Funny you should ask, but this will be my next research project. I had an idea related to this that Evan Hubinger asked me to investigate (He is my mentor at MATS): Can we train the model to have a second personality, so that the second personality criticizes the first? I created a writeup for the idea here and would appreciate feedback:  https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge 
Sam MarksΩ9149

Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper. 

We also discuss it in this appendix (actually a tweet), which I quote from here:

Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this,

... (read more)
ACCount30

Makes sense. With pretraining data being what it is, there are things LLMs are incredibly well equipped to do - like recalling a lot of trivia or pretending to be different kinds of people. And then there are things LLMs aren't equipped to do at all - like doing math, or spotting and calling out their own mistakes.

This task, highly agentic and taxing on executive function? It's the latter.

Keep in mind though: we already know that specialized training can compensate for those "innate" LLM deficiencies.

Reinforcement learning is already used to improve LLM ma... (read more)

1Jackson Wagner
Yeah -- just like how we are teaching LLMs to do math and coding by doing reinforcement learning on those tasks, it seems like we could just do a ton of RL on assorted videogames (and other agentic tasks, like booking a restaurant reservation online), to create reasoning-style models that have better ability to make and stick to a plan. In addition to the literal reinforcement learning and gradient descent used for training AI models, there is also the more metaphorical gradient descent process that happens when hundreds of researchers all start tinkering with different scaffolding ideas, training concepts, etc, in the hopes of optimizing a new benchmark.  Now that "speedrun Pokemon Red" has been identified as a plausible benchmark for agency, I expect lots of engineering talent is already thinking about ways to improve performance.  With so much effort going towards solving the problem, I wouldn't be suprised to see the pokemon "benchmark" get "saturated" pretty soon (via performances that exceed most normal humans, and start to approach speedrunner efficiency).  Even though right now Claude 3.7 is hopelessly underpeforming normal humans.
ACCount4525

The more mainstream you go, the larger this effect gets. A lot of people seemingly want AI to be a nothingburger.

When LLMs emerged, in mainstream circles, you'd see people go "it's not important, it's not actually intelligent, you can see it make the kind of reasoning mistakes a 3 year old would".

Meanwhile, on LessWrong: "holy shit, this is a big fucking deal, because it's already making the same kind of reasoning mistakes a human three year old would!"

I'd say that LessWrong is far better calibrated.

People who weren't familiar with programming or AI didn't... (read more)

Meanwhile, on LessWrong: "holy shit, this is a big fucking deal, because it's already making the same kind of reasoning mistakes a human three year old would!"

FWIW, that was me in 2022, looking at GPT-3.5 and being unable to imagine how capabilities can progress from there that doesn't immediately hit ASI. (I don't think I ever cared about benchmarks. Brilliant humans can't necessarily ace math exams, so why would I gatekeep the AGI term behind that?)

Now it's two-and-a-half years later and I no longer see it. As far as I'm concerned, this paradigm harnesse... (read more)

ACCount10

Have we already seen emergent misalignment out in the wild?

"Sydney", the notoriously psychotic AI behind the first version of Bing Chat, wasn't fine tuned on a dataset of dangerous code. But it was pretrained on all of internet scraped. Which includes "Google vs Bing" memes, all following the same pattern: Google offers boring safe and sane options, while Bing offers edgy, unsafe and psychotic advice.

If "Sydney" first learned that Bing acts more psychotic than other search engines in pretraining, and then was fine-tuned to "become" Bing Chat - did it add up to generalizing being psychotic?

2Owain_Evans
We briefly discuss Syndey in the Related Work section of the paper. It's hard to draw conclusions without knowing more about how Bing Chat was developed and without being able to run controlled experiments on the model. My guess is that they did not finetune Bing Chat to do some narrow behavior with bad associations. So the particular phenomenon is probably different.
ACCount101

A lot of suicides are impulse decisions, and access to firearms is a known suicide risk factor.

People often commit suicide with weapons they bought months, years or even decades ago - not because they planned their suicide this far ahead, but because they used a firearm that was already available.

The understanding is, without a gun at hand, suicidal people often opt for other suicide methods - ones that take much longer to set up and are far less reliable. This gives them more time and sometimes more chances to reconsider - and many of them do.

ACCount31

A thing that might be worth trying: quantize the deceptive models down, and see what that does to their truthfulness.

Hypothesis: acting deceptively is a more complex behavior for an LLM than being truthful. Thus, anything that cripples an LLM's ability to act in complex ways is going to make them more truthful. Quantization would have that effect too.

That method might, then, lose power on more capable LLMs, or in case of deeper deceptive behaviors. Also if you want to check for deception in extremely complex tasks - LLM's ability to perform the task might fall off a cliff long before deception does.

ACCount20

This post feels way, way too verbose, and for no good reason. Like it could be crunched down to half the size without losing any substance.

Too much of the mileage is spent meandering, and it feels like every point the text is trying to make is made at least 4 times over in different parts of the text in only slightly different ways. It's at the point where it genuinely hurts readability.

It's a shame, because the topic of AI-neurobiology overlap is so intriguing. Intuitively, modern AI seems extremely biosimilar - too many properties of large neural network... (read more)

1Mordechai Rorvig
Thank you for the feedback. Did you know of any similar writing making similar points that were more readable, in your mind? What was an example of a place that you found it meandering or overlong? This could help me improve future drafts. I appreciate your interest, and I'm sorry you felt it wasn't concise and was overly 'vibey.'