Simon Lermen

Twitter: @SimonLermenAI

Comments

I don’t believe the standard story of the resource curse. I also don’t think Norway and the Congo are useful examples, because they differ in too many other ways. According to o3, “Norway avoided the resource curse through strong institutions and transparent resource management, while the Congo faced challenges due to weak governance and corruption.” To me this is a case where existing AI models still fall short: the textbook story leaves out key factors and never comes close to proving that good institutions alone prevented the resource curse.

Regarding the main content, I find the scenario implausible. The “social-freeze and mass-unemployment” narrative seems to assume that AI progress will halt exactly at the point where AI can do every job but is still somehow not dangerous. You also appear to assume a new stable state in which a handful of actors control AGIs that are all roughly at the same level.

More directly, full automation of the economy would mean that AI can perform every task in companies already capable of creating military, chemical, or biological threats. If the entire economy is automated, AI must already be dangerously capable.

I expect reality to be much more dynamic, with many parties simultaneously pushing for ever-smarter AI while understanding very little about its internals. Human intelligence is nowhere near the maximum, and far more dangerous intelligence is possible. Many major labs now treat recursive self-improvement as the default path. I expect that approaching superintelligence this way, without any deeper understanding of its internal cognition, will give us systems that we cannot control and that will get rid of us. For these reasons, I have trouble worrying about job replacement. You also seem to avoid mentioning extinction risk in this text.

This seems to have been foreshadowed by this tweet in February:

https://x.com/ChrisPainterYup/status/1886691559023767897

It would be good to keep track of this change.

Creating further, even harder datasets could plausibly accelerate OpenAI's progress. I read on Twitter that people are working on an even harder dataset now. I would not give them access to it: they may break their promise not to train on it if doing so lets them accelerate progress. This is extremely valuable training data that you have handed to them.

I just donated $500. I enjoyed my time visiting Lighthaven in the past and got a lot of value from it. I also use LessWrong to post about my work frequently.

Thanks for the comment; I'll keep my answers brief.

When we say low activation, we are referring to strings with zero activation, so three sentences have a high activation and three have zero activation. These serve as negative examples, though I should double-check in the code that their activation really is always zero. We could also add some mid-activation samples for more precise work here. If all sentences were positive, there would be an easy way to hack this by always simulating a high activation.

Sentences are presented in batches, both during labeling and simulation. 

When simulating, the simulating agent uses function calling to write down a guessed activation for each sentence.

We mainly use activations per sentence for simplicity, which makes the task easier for the AI. For per-token activations, I'd imagine we would need the agent to write down a list of values for every token in a sentence. Maybe the more powerful Llama 3.3 70B is capable of this, but I would have to think about how to present it to the agent in a non-confusing way.
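As a rough illustration of the per-sentence setup, here is a minimal sketch (not our actual code): `simulate_activations` is a hypothetical stand-in for the simulating agent's function-calling step, and the correlation-based score is just one possible scoring choice.

```python
import numpy as np

def score_explanation(explanation, high_examples, zero_examples, simulate_activations):
    """high_examples / zero_examples: lists of (sentence, true_activation) pairs,
    e.g. three high-activation sentences and three zero-activation ones."""
    batch = high_examples + zero_examples
    sentences = [s for s, _ in batch]
    true_acts = np.array([a for _, a in batch], dtype=float)

    # The simulating agent sees the explanation label plus the batch of
    # sentences and writes down one guessed activation per sentence
    # (in practice via function calling).
    guesses = np.array(simulate_activations(explanation, sentences), dtype=float)

    # Score the guesses against the true activations. Because half the batch
    # has zero activation, always guessing "high" no longer earns a good score.
    return np.corrcoef(guesses, true_acts)[0, 1]
```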

Having a baseline is good and would verify our back-of-the-envelope estimate.

I think there is somewhat of a flaw with our approach, though it might extend to Bills et al.'s algorithm in general. Say we apply some optimization pressure to the simulating agent to get really good scores: an alternative way to do well is to pick up on common themes, since we are oversampling text that triggers the latent. If the latent is about Japan, the agent may notice that there are a lot of mentions of Japan and deduce that the latent must be about Japan even without any explanation label. This could be somewhat reduced if we only show the agent small pieces of text in its context and don't present all sentences in a single batch.

I would say the three papers show a clear pattern that alignment didn't generalize well from the chat setting to the agent setting, which is solid evidence for that thesis. That in turn is evidence for a stronger claim about an underlying pattern, i.e. that alignment will in general not generalize as well as capabilities. For conceptual evidence of that claim you can look at the linked post. My attempt to summarize the argument: capabilities are a kind of attractor state; being smarter and more capable is, in a way, an objective fact about the universe. Being more aligned with humans, however, is not a special fact about the universe but a free parameter. In fact, alignment stands in some conflict with capabilities, as instrumental incentives undermine alignment.

As for what a third option would be, i.e. the next step where alignment might not generalize:

From the article:

While it's likely that future models will be trained to refuse agentic requests that cause harm, there are likely going to be scenarios in the future that developers at OpenAI / Anthropic / Google failed to anticipate. For example, with increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.

From a different comment of mine:

It seems easy to just train it to refuse bribing, harassing, etc. But as agents take on more substantial tasks, how do we make sure they don't do unethical things while, say, running a company? Or if an agent midway through a task realizes it is aiding in cyber crime, how should it behave?

I only briefly touch on this in the discussion, but making agents safe is quite different from current refusal-based safety.

With increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.

It would sometimes need to reevaluate the outcomes of its actions while executing a task.

Has somebody actually worked on this? I am not aware of anyone using RLHF, DPO, RLAIF, or SFT to make agents behave safely within bounds, have them consider negative externalities, or have them occasionally reevaluate outcomes during execution.
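To make the question concrete, here is a toy scaffolding-level sketch (not a training method, and not something I've seen implemented) of what "considering negative externalities" and "reevaluating outcomes during execution" could look like. `llm_plan`, `judge`, and `execute_tool` are hypothetical stand-ins for the planner, an externality/harm judge, and the tool layer.

```python
from collections import namedtuple

# Expected return shape of the hypothetical judge.
Verdict = namedtuple("Verdict", ["ok", "reason"])

def run_task(task, llm_plan, judge, execute_tool, reevaluate_every=5, max_steps=100):
    history = []
    for step in range(max_steps):
        action = llm_plan(task, history)
        if action is None:  # planner signals the task is done
            return {"status": "finished", "history": history}

        # Before committing, ask the judge whether this action, in context,
        # has foreseeable negative externalities, not merely whether the
        # literal request looks harmful.
        verdict = judge(task, history, action)
        if not verdict.ok:
            history.append(("refused", action, verdict.reason))
            continue

        result = execute_tool(action)
        history.append(("executed", action, result))

        # Periodically step back and reevaluate the trajectory as a whole,
        # e.g. to notice mid-task that the work is aiding in cyber crime.
        if (step + 1) % reevaluate_every == 0:
            verdict = judge(task, history, None)
            if not verdict.ok:
                return {"status": "aborted", "reason": verdict.reason,
                        "history": history}

    return {"status": "step_limit", "history": history}
```

Whether behavior like this should live in scaffolding or be trained into the model with RLHF/DPO-style methods is exactly the open question above.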

It seems easy to just train it to refuse bribing, harassing, etc. But as agents take on more substantial tasks, how do we make sure they don't do unethical things while, say, running a company? Or if an agent midway through a task realizes it is aiding in cyber crime, how should it behave?

I'd had finishing this up on my to-do list for a while. I just made a full-length post on it:

https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to

I think it's fair to say that some smarter models do better at this; however, it's still worrisome that there is a gap. Attacks also continue to transfer.

Here is a way in which it doesn't generalize in observed behavior:

Alignment does not transfer well from chat models to agents

TL;DR: There are three new papers which all show the same finding: the safety guardrails of chat models don’t transfer well to the agents built from them. In other words, models won’t tell you how to do something harmful, but they will do it if given the tools. Attack methods like jailbreaks or refusal-vector ablation do transfer.

Here are the three papers (I am the author of one of them):

https://arxiv.org/abs/2410.09024

https://static.scale.com/uploads/6691558a94899f2f65a87a75/browser_art_draft_preview.pdf

https://arxiv.org/abs/2410.10871

I thought of making a post here about this if there is interest.

Hi Evan, I published this paper on arXiv recently and it was also accepted at the SafeGenAI workshop at NeurIPS in December this year. Thanks for adding the link. I will probably work on the paper again and put an updated version on arXiv, as I am not quite happy with the current one.

I think that using the base model without instruction fine-tuning would prove bothersome for multiple reasons:

1. In the paper I use the new 3.1 models, which are fine-tuned for tool use; the base models were never fine-tuned to use tools through function calling.

2. Base models are highly random and hard to control; they are not really steerable. They require very careful prompting/conditioning to do anything useful.

3. I think current post-training basically improves all benchmarks.

I am also working on using such agents and directly evaluating how good they are at spear phishing humans: https://openreview.net/forum?id=VRD8Km1I4x
