faul_sname


I note that if software developers used that logic when thinking about software security, I would expect almost all software in that security-by-obscurity world to have many holes that would be considered actual negligence in the world we live in.

This suggests that the threat model isn't so much "very intelligent AI" as it is "very cheap and at least somewhat capable robots".

“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’”

I ran this prompt but substituted "Gary Marcus" for "Eliezer Yudkowsky". Claude says:

Overall evaluation: On a scale from 'mostly true' to 'mostly false,' I would rate Gary Marcus's predictions as a group as "Mostly True."

Many of Marcus's predictions about the limitations of current AI systems and the challenges ahead have proven accurate. His concerns about reasoning, abstract thinking, and the need for more sophisticated knowledge representation align with ongoing challenges in the field. His emphasis on AI safety and alignment has also been prescient.

However, it's worth noting that some of his predictions might be seen as overly pessimistic by some in the AI community. The rapid progress in LLMs and their applications has surprised many, including some skeptics. Nonetheless, many of the fundamental challenges he pointed out remain relevant.

It's also important to remember that the field of AI is rapidly evolving, and assessments of such predictions can change quickly as new breakthroughs occur. As of my last update in April 2024, many of Marcus's key points still held true, but the field continues to advance at a rapid pace.

I think Claude likes saying nice things about people, so it's worth trying to control for that.
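
One cheap way to control for that baseline niceness is to run the identical prompt over several names and compare the verdicts. A minimal sketch of that comparison, assuming the anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; the model id is an assumption, not something from the exchange above:

```python
# Run the same evaluation prompt for several names to estimate how much of
# the "Mostly True" verdict is just Claude being nice to everyone.
# Assumes the `anthropic` Python SDK and ANTHROPIC_API_KEY set in the env.
from anthropic import Anthropic

client = Anthropic()

PROMPT_TEMPLATE = (
    "Based on your understanding of AI technical developments as of "
    "March 29, 2023, evaluate the most important known object-level "
    "predictions of {name} on the subject, and which ones seemed true "
    "versus false. Afterwards, evaluate those predictions as a group, "
    "on a scale from 'mostly true' to 'mostly false.'"
)

# Add more forecasters here to build a baseline for comparison.
names = ["Eliezer Yudkowsky", "Gary Marcus"]

for name in names:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model id
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(name=name)}],
    )
    print(f"=== {name} ===")
    print(response.content[0].text)
```

If the model rates every forecaster "mostly true", that tells you more about its agreeableness than about any one person's track record.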

Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.

If you're referring to the chain-of-thought summaries you see when you select the o1-preview model in ChatGPT, those are not the full chain of thought. Examples of the actual chain of thought can be found on the Learning to Reason with LLMs page, with a few more examples in the o1 system card. Note that we are going off of OpenAI's word that these chain-of-thought examples are representative - if you try to figure out what reasoning o1 actually used to reach a conclusion, you will run into the good old "Your request was flagged as potentially violating our usage policy. Please try again with a different prompt."

Shutting Down all Competing AI Projects is not Actually a Pivotal Act

This seems like an excellent title to me.

Technically this probably isn't recursive self-improvement, but rather automated AI progress. This is relevant mostly because

  1. It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
  2. It means that multi-agent dynamics will be very relevant to how things play out.

If your threat model is "no group of humans manages to gain control of the future before human irrelevance", none of this probably matters.

My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-committing not to update downward on the prior that it is in a simulation.

Do you expect that the first takeover-capable ASI / the first sufficiently-internally-cooperative-to-be-takeover-capable group of AGIs will follow this style of reasoning? And in particular, do you expect that of the first ASI / group of AGIs that actually makes the attempt?

Yeah, my argument was "this particular method of causing actual human extinction would not work" not "causing human extinction is not possible", with a side of "agents learn to ignore adversarial input channels and this dynamic is frequently important".

It does strike me that, to OP's point, "would this act be pivotal" is a question whose answer may not be knowable in advance. See also previous discussion on pivotal act intentions vs pivotal acts (for the audience, I know you've already seen it and in fact responded to it).

If an information channel is only used to transmit information that is of negative expected value to the receiver, the selection pressure incentivizes the receiver to ignore that information channel.

That is to say, an AI which delivers the most convincing-sounding argument against reproducing to everyone will select for those people who ignore convincing-sounding arguments when choosing whether to engage in behaviors that lead to reproduction.
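
As a toy illustration of that selection dynamic (the population model and all the numbers below are invented for the sketch, not taken from the argument above), agents that heed a channel carrying only reproduction-suppressing messages leave fewer descendants, so the disposition to heed it dies out:

```python
# Toy model of selection against heeding a channel whose messages only ever
# reduce the receiver's reproduction. All parameters are illustrative.
import random

POP_SIZE = 10_000
GENERATIONS = 30
PERSUASION_SUCCESS = 0.8  # chance the argument works on an agent that listens
BASE_OFFSPRING = 2        # offspring count if not persuaded

def run():
    # Each agent is just a bool: does it heed the channel?
    population = [random.random() < 0.9 for _ in range(POP_SIZE)]  # start: 90% listeners
    for gen in range(GENERATIONS):
        next_gen = []
        for heeds_channel in population:
            persuaded = heeds_channel and random.random() < PERSUASION_SUCCESS
            offspring = 0 if persuaded else BASE_OFFSPRING
            # Offspring inherit the parent's disposition toward the channel.
            next_gen.extend([heeds_channel] * offspring)
        random.shuffle(next_gen)
        population = next_gen[:POP_SIZE]  # keep population size bounded
        frac_listeners = sum(population) / len(population)
        print(f"gen {gen:2d}: fraction still heeding the channel = {frac_listeners:.3f}")

if __name__ == "__main__":
    run()
```

In this setup the fraction of listeners collapses within a handful of generations, which is the point: the channel's persuasive power is exactly what selects against paying attention to it.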
