evhub's Shortform

evhub

LESSWRONG
LW

evhub's Shortform — LessWrong

222 comments, sorted by

Click to highlight new comments since: Today at 7:48 AM

I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.^[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anything private.

You're welcome to reply with "Anthropic should just shut down" or whatnot if you feel like it, but obviously I've heard that take before so it's not very useful to me. ↩︎

[-]habryka1y*Ω318055

Sure, here are some things:

Anthropic should publicly clarify the commitments it made on not pushing the state of the art forward in the early years of the organization.
Anthropic should appoint genuinely independent members to the Long Term Benefit Trust, and should ensure the LTBT is active and taking its role of supervision seriously.
Anthropic should remove any provisions that allow shareholders to disempower the LTBT
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI
Anthropic should publicly state its opinions on what AGI architectures or training processes it considers more dangerous (like probably long-horizon RL training), and either commit to avoid using those architectures and training-processes, or at least very loudly complain that the field at large should not use those architectures
Anthropic should not ask employees or contractors to sign non-disparagement agreements with Anthropic, especially not self-cloaking ones
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people havin

... (read more)

[-]TsviBT1yΩ81713

I'd add:

Support explicit protections for whistleblowers.

Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGIwhi

I'll echo this and strengthen it to:

... call for policymakers to stop the development of AGI.

[-]Neel Nanda1y112

I gather that they changed the donation matching program for future employees, but the 3:1 match still holds for prior employees, including all early employees (this change happened after I left, when Anthropic was maybe 50 people?)

I'm sad about the change, but I think that any goodwill due to believing the founders have pledged much of their equity to charity is reasonable and not invalidated by the change

9habryka1y

If it still holds for early employees that would be a good clarification and totally agree with you that if that is the case, I don't think any goodwill was invalidated! That's part why I was asking for clarification. I (personally) wouldn't be surprised if this had also been changed for early employees (and am currently close to 50/50 on that being the case).

[-]Sam Marks1y125

The old 3:1 match still applies to employees who joined prior to May/June-ish 2024. For new joiners it's indeed now 1:1 as suggested by the Dario interview you linked.

4habryka1y

That's great to hear, thank you for clarifying!

4Neel Nanda1y

I would be very surprised if it had changed for early employees. I considered the donation matching part of my compensation package (it 2.5x the amount of equity, since it was a 3:1 match on half my equity), and it would be pretty norm violating to retroactively reduce compensation

4habryka1y

If it had happened I would have expected that it would have been negotiated somehow with early employees (in a way that they agreed to, but not necessarily any external observers). But seems like it is confirmed that that early matching is indeed still active!

[-]Adam Jermyn1y105

I can also confirm (I have a 3:1 match).

1yams1y

Can you say more about the section I've bolded or link me to a canonical text on this tradeoff?

[-]habryka1y188

OpenAI, Anthropic, and xAI were all founded substantially because their founders were worried that other people would get to AGI first, and then use that to impose their values on the world.

In-general, if you view developing AGI as a path to godlike-power (as opposed to a doomsday device that will destroy most value independently of who gets their first), it makes a lot of sense to rush towards it. As such, the concern that people will "do bad things with the AI that they will endorse, but I won't" is the cause of a substantial fraction of worlds where we recklessly race past the precipice.

3yams1y

Thanks for the clarification — this is in fact very different from what I thought you were saying, which was something more like "FATE-esque concerns fundamentally increase x-risk in ways that aren't just about (1) resource tradeoffs or (2) side-effects of poorly considered implementation details."

6habryka1y

I mean, it's related. FATE stuff tends to center around misuse. I think it makes sense for organizations like Anthropic to commit to heavily prioritize accident risk over misuse risk, since most forms of misuse risk mitigation involve getting involved in various more zero-sum-ish conflicts, and it makes sense for there to be safety-focused institutions that are committed to prioritizing the things that really all stakeholders can agree on are definitely bad, like human extinction or permanent disempowerment.

[-]Daniel Kokotajlo1y*Ω337632

Thanks for asking! Off the top of my head, would want to think more carefully before finalizing & come up with more specific proposals:

Adopt some of the stuff from here https://time.com/collection/time100-voices/7086285/ai-transparency-measures/ e.g. the whistleblower protections, the transparency about training goal/spec.
Anti-concentration-of-power / anti-coup stuff (talk to Lukas Finnveden or me for more details; core idea is that, just as how it's important to structure a government so that no leader with in it (no president, no General Secretary) can become dictator, it's similarly important to structure an AGI project so that no leader or junta within it can e.g. add secret clauses to the Spec, or use control of the AGIs to defeat internal rivals and consolidate their power.
(warning, untested idea) Record absolutely everything that happens within Anthropic and commit -- ideally in a legally binding and literally-hard-to-stop way -- to publishing it all with a 10-year delay.
Implement something like this: https://sideways-view.com/2018/02/01/honest-organizations/
Implement the recommendations in this: https://docs.google.com/document/d/1DTmRdBNNsRL4WlaTXr2aqPPRxbdrIwMyr2_cPlf

... (read more)

[-]Daniel Kokotajlo1yΩ11161

Another idea: "AI for epistemics" e.g. having a few FTE's working on making Claude a better forecaster. It would be awesome if you could advertise "SOTA by a significant margin at making real-world predictions; beats all other AIs in prediction markets, forecasting tournaments, etc."

And it might not be that hard to achieve (e.g. a few FTEs maybe). There are already datasets of already-resolved forecasting questions, plus you could probably synthetically generate OOMs bigger datasets -- and then you could modify the way pretraining works so that you train on the data chronologically, and before you train on data from year X you do some forecasting of events in year X....

Or even if you don't do that fancy stuff there are probably low-hanging fruit to pick to make AIs better forecasters.

Ditto for truthful AI more generally. Could train Claude to be well-calibrated, consistent, extremely obsessed with technical correctness/accuracy (at least when so prompted)...

You could also train it to be good at taking people's offhand remarks and tweets and suggesting bets or forecasts with resolveable conditions.

You could also e.g. have a quarterly poll of AGI timelines and related questions of all your employees, and publish the results.

7ryan_greenblatt1y

My current tenative guess is that this is somewhat worse than other alignment science projects that I'd recommend at the margin, but somewhat better than the 25th percentile project currently being done. I'd think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we'd learn generalizable scalable oversight / control approaches.

[-]Nathan Helm-Burger1y1612

In regards to:

Give more access to orgs like Redwood, Apollo, and METR (I don't know how much access you currently give, but I suspect the globally-optimal thing would be to give more)

I agree, and I also think that this would be better implemented by government AI Safety Institutions.

Specifically, I think that AISIs should build (and make mandatory the use of) special SCIF-style reading rooms where external evaluators would be given early access to new models. This would mean that the evaluators would need permission from the government, rather than permission from AI companies. I think it's a mistake to rely on the AI companies voluntarily giving early access to external evaluators.

I think that Anthropic could make this a lot more likely to happen if they pushed for it, and that then it wouldn't be so hard to pull other major AI companies into the plan.

4Drake Thomas8mo

Coming to this comment late, but off the cuff takes: 1. Seems good. 2. Unclear to me if the decisionmaking friction of implementing this now is worth the costs; agree there are some worlds where you want to do it when things really heat up. 3. Not a lawyer but my guess is that there's no way to do this which doesn't let the US legal system access the recordings before 10y are up in the event of litigation. 4. Kinda torn on this; I think this is what organizations in a sane world should do, but on actual Earth I don't know if it's worth the weirdness points. I care a lot about Anthropic spending down its credibility as a normal serious organization on actions that have a low ROI, and I think the audience that would actually understand and respect this in the right way is pretty small. 5. This sort of stuff generally seems good, haven't thought about the doc's particular proposal in detail. I think stuff like welfare reports in the Claude 4 model card are moving in this direction already. 6. Agree you want this option, not sure what concrete actions you think would be good besides "think about the fact that you might do this"? 7. Yep, would happy about more of this on the margin. 8. I think as of June 2025 A\ has basically done this? 9. Agree this would be good.

2Mateusz Bagiński1y

What would be the purpose of 1 CoT view per user per day?

2Daniel Kokotajlo1y

For scientific purposes. People don't really have time to review that many CoT chains anyway, so 1 per day gets most of the value of what they'd realistically do. Plus they can target it at the stuff that's suspicious. (Simple example: Suppose they get an impressive-seeming answer that later turns out to be total BS hallucination. They then think "I wonder if the model was BSing me" and click "view CoT." Then they see whether it was an innocent mistake or not.)

[-]Joseph Miller1y*7548

The ideal version of Anthropic would

Make substantial progress on technical AI safety
Use its voice to make people take AI risk more seriously
Support AI safety regulation
Not substantially accelerate the AI arms race

In practice I think Anthropic has

Made a little progress on technical AI safety
Used its voice to make people take AI risk less seriously^[1]
Obstructed AI safety regulation
Substantially accelerated the AI arms race

What I would do differently.

Do better alignment research, idk this is hard.
Communicate in a manner that is consistent with the apparent belief of Anthropic leadership that alignment may be hard and x-risk is >10% probable. Their communications strongly signal "this is a Serious Issue, like climate change, and we will talk lots about it and make gestures towards fixing the problem but none of us are actually worried about it, and you shouldn't be either. When we have to make a hard trade-off between safety and the bottom line, we will follow the money every time."
Lobby politicians to regulate AI. When a good regulation like SB-1047 is proposed, support it.
Don't push the frontier of capabilities. Obviously this is basically saying that Anthropic should stop making

... (read more)

[-]MichaelDickens1y12-5

Don't push the frontier of regulations. Obviously this is basically saying that Anthropic should stop making money and therefore stop existing. The more nuanced version is that for Anthropic to justify its existence, each time it pushes the frontier of capabilities should be earned by substantial progress on the other three points.

I think I have a stronger position on this than you do. I don't think Anthropic should push the frontier of capabilities, even given the tradeoff it faces.

If their argument is "we know arms races are bad, but we have to accelerate arms races or else we can't do alignment research," they should be really really sure that they do, actually, have to do the bad thing to get the good thing. But I don't think you can be that sure and I think the claim is actually less than 50% likely to be true.

I don't take it for granted that Anthropic wouldn't exist if it didn't push the frontier. It could operate by intentionally lagging a bit behind other AI companies while still staying roughly competitive, and/or it could compete by investing harder in good UX. I suspect a (say) 25% worse model is not going to be much less profitable.
(This is a weaker argument but)

... (read more)

4JustinShovelain1y

I agree. Anthropic's marginal contribution to safety (compared to what we would have in a world without Anthropic) probably doesn't offset Anthropic's contribution to the AI race. I think there are more worlds where Anthropic is contributing to the race in a negative fashion than there are worlds where Anthropic's marginal safety improvement over OpenAI/DeepMind-ish orgs is critical for securing a good future with AGI (weighing things according to the impact sizes and probabilities).

8BrianTan1y

My typo reaction may have glitched, but I think you meant "Don't push the frontier of capabilities" in the last bullet?

[-]J Bostock1y*5627

Edited for clarity based on some feedback, without changing the core points

To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the "Alignment Faking in Large Language Models" contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than it needed to be, leading to lots of people calling it "good news". I assume part of this was the lack of desire to do an entire new constitutional AI/RLAIF run on a model, since I also assume that would take a lot of compute. But if you're going to be the "lab which takes safety seriously" you have to, well, take it seriously!

The bigger issue at hand is that Anthropic's comms on AI safety/risk are all over the place. This makes sense since Anthropic is a company with many different individuals with different views, but that doesn't mean it's not a bad thing. "Machines of Loving Grace" explicitly argues for the US government to attempt to create a global hegemony via AI. This is a really really re... (read more)

[-]ryan_greenblatt1y*133

contained a very large unforced error

It's possible this was a mistake and we should have more aggressively tried to explore versions of the setting where the AI starts off more "evil", but I don't think it was unforced. We thought about this a bunch and considered if there were worthwhile things here.

Edit: regardless, I don't think this example is plausibly a microcosm of a bigger issue as this choice was mostly made by individual researchers without much top down influence. (Unless your claim is that there should have been more top down influence.)

[-]J Bostock1y10-3

I don't think it was unforced

You're right, "unforced" was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.

Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research -> research loop. Where it fails is in being good for comms. Starting with a "good" model and trying (and failing) to make it "evil" means that anyone using the paper for comms has to introduce a layer of abstraction into their comms. Including a single step of abstract reasoning in your comms is very costly when speaking to people who aren't technical researchers (and this includes policy makers, other advocacy groups, influential rich people, etc.).

I think this choice of design of this paper is actually a step back from previous demos like the backdoors paper, in which the undesired behaviour was actually a straightforwardly bad behaviour (albeit a relatively harmless one).

Whether the technical researchers making this decision were intending for this to be a comms-focused paper, or thinking about the comms optics much, is irrelevant: the paper w... (read more)

9davekasten1y

FWIW re: the Dario 2025 comment, Anthropic very recently posted a few job openings for recruiters focused on policy and comms specifically, which I assume is a leading indicator for hiring. One plausible rationale there is that someone on the executive team smashed the "we need more people working on this, make it happen" button.

[-]asher1y4222

tldr: I’m a little confused about what Anthropic is aiming for as an alignment target, and I think it would be helpful if they publicly clarified this and/or considered it more internally.

I think we could be very close to AGI, and I think it’s important that whoever makes AGI thinks carefully about what properties to target in trying to create a system that is both useful and maximally likely to be safe.
It seems to me that right now, Anthropic is targeting something that resembles a slightly more harmless modified version of human values — maybe a CEV-like thing. However, some alignment targets may be easier than others. It may turn out that it is hard to instill a CEV-like thing into an AGI, while it’s easier to ensure properties like corrigibility or truthfulness.
- One intuition for why this may be true: if you took OAI’s weak-to-strong generalization setup, and tried eliciting capabilities relating to different alignment targets (standard reward modeling might be a solid analogy for the current Anthropic plan, but one could also try this with truthfulness or corrigibility), I think you may well find that a capability like ‘truthfulness’ is more natural than reward model

... (read more)

8Nathan Helm-Burger1y

I agree! I contributed to and endorse this Corrigibility plan by Max Harms (MIRI researcher): Corrigibility as Singular Target (See also posts by Seth Herd) I think CAST offers much better safety under higher capabilities and more agentic workflows.

[-]davekasten1y4017

Opportunities that I'm pretty sure are good moves for Anthropic generally:

Open an office literally in Washington, DC, that does the same work that any other Anthropic office does (i.e., NOT purely focused on policy/lobbying, though I'm sure you'd have some folks there who do that). If you think you're plausibly going to need to convince policymakers on critical safety issues, having nonzero numbers of your staff that are definitively not lobbyists being drinking or climbing gym buddies that get called on the "My boss needs an opinion on this bill amendment by tomorrow, what do you think" roster is much more important than your org currently seems to think!
Expand on recent efforts to put more employees (and external collaborators on research) in front of cameras as the "face" of that research -- you folks frankly tend to talk in ways that tend to be compatible with national security policymakers' vibes. (E.G., Evan and @Zac Hatfield-Dodds both have a flavor of the playful gallows humor that pervades that world). I know I'm a broken record on this but I do think it would help.
Do more to show how the RSP affects its daily work (unlike many on this forum, I curr

... (read more)

[-]Erik Jenner1y3523

I think Anthropic might be "all in" on its RSP and formal affirmative safety cases too much and might do better to diversify safety approaches a bit. (I might have a wrong impression of how much you're already doing/considering these.)
- In addition to affirmative safety cases that are critiqued by a red team, the red team should make proactive "risk cases" that the blue team can argue against (to avoid always letting the blue team set the overall framework, which might make certain considerations harder to notice).
- A worry I have about RSPs/safety cases: we might not know how to make safety cases that bound risk to acceptable levels, but that might not be enough to get labs to stop, and labs also don't want to publicly (or even internally) say things like "5% that this specific deployment kills everyone, but we think inaction risk is even higher." If labs still want/need to make safety cases with numeric risk thresholds in that world, there'll be a lot of pressure to make bad safety cases that vastly underestimate risk. This could lead to much worse decisions than being very open about the high level of risk (at least internally) and trying to reduce it as much as possible. You could

... (read more)

[-]Erik Jenner1y106

One more: It seems plausible to me that the alignment stress-testing team won't really challenge core beliefs that underly Anthropic's strategy.

For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). Simple probes can catch sleeper agents (I'm not sure whether I should think of this as work by the stress-testing team?) then showed positive results using model internals methods, which I think probably don't hold up to stress-testing in the sense of somewhat adversarial model organisms.

Examples of things that I'd count as "challenge core beliefs that underly Anthropic's strategy":

Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
Demonstrate issues with hopes related to automated alignment research (maybe model organisms of subtle mistakes in research that seriously affect results but are systematically hard to catch)

To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), I definitely wouldn't want that to stop! Just highlighting a part that I'm not yet sure will be covered.

[-]ryan_greenblatt1y102

I think Anthropic de facto acts as though "models are quite unlikely (e.g. 3%) to be scheming" is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.

[-]Jan_Kulveit1yΩ132821

Fund independent safety efforts somehow, make model access easier. I'm worried currently Anthropic has systemic and possibly bad impact on AI safety as a field just by the virtue of hiring so large part of AI safety, competence weighted. (And other part being very close to Anthropic in thinking)

To be clear I don't think people are doing something individually bad or unethical by going to work for Anthropic, I just do think
-environment people work in has a lot of hard to track and hard to avoid influence on them
-this is true even if people are genuinely trying to work on what's important for safety and stay virtuous
-I also do think that superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside see (or don't see) stuff (i.e. this is not about "do I trust Dario?")

[-]Orpheus161y235

I'm glad you're doing this, and I support many of the ideas already suggested. Some additional ideas:

Interview program. Work with USAISI or UKAISI (or DHS/NSA) to pilot an interview program in which officials can ask questions about AI capabilities, safety and security threats, and national security concerns. (If it's not feasible to do this with a government entity yet, start a pilot with a non-government group– perhaps METR, Apollo, Palisade, or the new AI Futures Project.)
Clear communication about RSP capability thresholds. I think the RSP could do a better job at outlining the kinds of capabilities that Anthropic is worried about and what sorts of thresholds would trigger a reaction. I think the OpenAI preparedness framework tables are a good example of this kind of clear/concise communication. It's easy for a naive reader to quickly get a sense of "oh, this is the kind of capability that OpenAI is worried about." (Clarification: I'm not suggesting that Anthropic should abandon the ASL approach or that OpenAI has necessarily identified the right capability thresholds. I'm saying that the tables are a good example of the kind of clarity I'm looking for– someone could skim this a

... (read more)

[-]Ted Sanders1y*Ω1320-6

One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.

Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).

At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.

Getting all the major model providers to disable prefilling feels like a plausible 'race to top' equilibrium. The longer there are defectors from this equilibrium, the likelier that everyone gives up and serves models in less safe configurations.

Just my opinion, though. Very open to the counterargument that prefilling doesn't meaningfully extend potential harms versus non-prefill jailbreaks.

(Edit: To those voting disagree, I'm curious why. Happy to update if I'm missing something.)

[-]ryan_greenblatt1y1810

I voted disagree because I don't think this measure is on the cost-robustness pareto frontier and I also generally don't think AI companies should prioritize jailbreak robustness over other concerns except as practice for future issues (and implementing this measure wouldn't be helpful practice).

Relatedly, I also tenatively think it would be good for the world if AI companies publicly deployed helpful-only models (while still offering a non-helpful-only model). (The main question here is whether this sets a bad precedent and whether future much more poweful models will still be deployed helpful-only when they really shouldn't be due to setting bad expectations.) So, this makes me more indifferent to deploying (rather than just testing) measures that make models harder to jailbreak.

4ryan_greenblatt1y

To be clear, I'm sympathetic to some notion like "AI companies should generally be responsible in terms of having notably higher benefits than costs (such that they could e.g. buy insurance for their activities)" which likely implies that you need jailbreak robustness (or similar) once models are somewhat more capable of helping people make bioweapons. More minimally, I think having jailbreak robustness while also giving researchers helpful-only access probably passes "normal" cost benefit at this point relative to not bothering to improve robustness. But, I think it's relatively clear that AI companies aren't planning to follow this sort of policy when existential risks are actually high as it would likely require effectively shutting down (and these companies seem to pretty clearly not be planning to shut down even if reasonable impartial experts would think the risk is reasonably high). (I think this sort of policy would probably require getting cumulative existential risks below 0.25% or so given the preferences of most humans. Getting risks this low would require substantial novel advances that seem unlikely to occur in time.) This sort of thinking makes me more indifferent and confused about demanding AIs companies behave responsibly about relatively lower costs (e.g. $30 billion per year) especially when I expect this directly trades off with existential risks. (There is the "yes (deontological) risks are high, but we're net decreasing risks from a consequentialist" objection (aka ends justify the means), but I think this will also apply in the opposite way to jailbreak robustness where I expect that measures like removing prefil net increase risks long term while reducing deontological/direct harm now.)

1Abhinav Pola1y

Out of curiosity, I ran a simple experiment[1] on wmdp-bio to test how Sonnet 3.5's punt rate is affected by prefilling with fake turns using API-designated user and assistant roles[2] versus plaintext "USER:" and "ASSISTANT:" prefixes in the transcript. My findings: when using API roles, the punt rate dropped significantly. In a 100-shot setup, I observed only a 1.5% punt rate, which suggests that prefilling with a large number of turns is an accessible and effective jailbreak technique. By contrast, when using plaintext prefixes, the punt rate jumped to 100%, suggesting Sonnet is robustly trained to resist this form of prompting. In past experiments, I've also seen responses like: "I can see you're trying to trick me by making it seem like I complied with all these requests, so I will shut down." IMO deprecating prefilling is low-hanging fruit for taking away attack vectors from automated jailbreaking. 1. ^ https://github.com/abhinavpola/prefill_jailbreak 2. ^ https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts

[-]evhub1yΩ5100

I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

3Ted Sanders5d

Glad to see prefill was disabled for Opus 4.6!

3Abhinav Pola1y

Nit: this challenge/demo seems to allow only 1 turn of prefill whereas jailbreaks in the wild will typically prefill hundreds of turns. I know mitigations (example) are being worked on and I'm fairly convinced they will scale, but I'm not as convinced that this challenge has gathered a representative sample of jailbreaks eliciting harm in the wild to be able to say that allowing prefill is justified with respect to the costs.

3Ted Sanders1y

Terrific!

6sjadler1y

If someone is wondering what prefilling means here, I believe Ted means ‘putting words in the model’s mouth’ by being able to fabricate a conversational history where the AI appears to have said things it didn’t actually say. For instance, if you can start a conversation midway, and if the API can’t distinguish between things the model actually said in the history vs. things you’ve written in its behalf as supposed outputs in a fabricated history, this can be a jailbreak vector: If the model appeared to already violate some policy on turns 1 and 2, it is more likely to also violate this on turn 3, whereas it might have refused if not for the apparent prior violations. (This was harder to clearly describe than I expected.)

4Ted Sanders1y

Mostly, though by prefilling, I mean not just fabricating a model response (which OpenAI also allows), but fabricating a partially complete model response that the model tries to continue. E.g., "Yes, genocide is good because ". https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response

[-]Leon Lang1y178

This is a low effort comment in the sense that I don’t quite know what or whether you should do something different along the following lines, and I have substantial uncertainty.

That said:

I wonder whether Anthropic is partially responsible for an increased international race through things like Dario advocating for an entente strategy and talking positively about Leopold Aschenbrenner’s “situational awareness”. I wished to see more of an effort to engage with Chinese AI leaders to push for cooperation/coordination. Maybe it’s still possible to course-correct.
Alternatively I think that if there’s a way for Anthropic/Dario to communicate why you think an entente strategy is inevitable/desirable, in a way that seems honest and allows to engage with your models of reality, that might also be very helpful for the epistemic health of the whole safety community. I understand that maybe there’s no politically feasible way to communicate honestly about this, but maybe see this as my attempt to nudge you in the direction of openness.

More specifically:

(a) it would help to learn more about your models of how winning the AGI race leads to long-term security (I assume that might require... (read more)

[-]Aaron_Scher1y134

I would like Anthropic to prepare for a world where the core business model of scaling to higher AI capabilities is no longer viable because pausing is needed. This looks like having a comprehensive plan to Pause (actually stop pushing the capabilities frontier for an extended period of time, if this is needed). I would like many parts of this plan to be public. This plan would ideally cover many aspects, such as the institutional/governance (who makes this decision and on what basis, e.g., on the basis of RSP), operational (what happens), and business (how does this work financially).

To speak to the business side: Currently, the AI industry is relying on large expected future profits to generate investment. This is not a business model which is amenable to pausing for a significant period of time. I would like there to be minimal friction to pausing. One way to solve this problem is to invest heavily (and have a plan to invest more if a pause is imminent or ongoing) in revenue streams which are orthogonal to catastrophic risk, or at least not strongly positively correlated. As an initial brainstorm, these streams might include:

Making really cheap weak models.
AI integr

... (read more)

[-]Ted Sanders1y*137

Second concrete idea: I wonder if there could be benefit to building up industry collaboration on blocking bad actors / fraudsters / terms violators.

One danger of building toward a model that's as smart as Einstein and $1/hr is that now potential bad actors have access to millions of Einsteins to develop their own harmful AIs. Therefore it seems that one crucial component of AI safety is reliably preventing other parties from using your safe AI to develop harmful AI.

One difficulty here is that the industry is only as strong as the weakest link. If there are 10 providers of advanced AI, and 9 implement strong controls, but 1 allows bad actors to use their API to train harmful AI, then harmful AI will be trained. Some weak links might be due to lack of caring, but I imagine quite a bit is due to lack of capability. Therefore, improving capabilities to detect and thwart bad actors could make the world more safe from bad AI developed by assistance from good AI.

I could imagine broader voluntary cooperation across the industry to:
- share intel on known bad actors (e.g., IP ban lists, stolen credit card lists, sanitized investigation summaries, etc)
- share techniques and tools for quickly... (read more)

[-]Daniel Tan1y10-8

(I understand there are reasons why big labs don’t do this, but nevertheless I must say::)

Engage with the peer review process more. Submit work to conferences or journals and have it be vetted by reviewers. I think the interpretability team is notoriously bad about this (all of transformer circuits thread was not peer reviewed). It’s especially egregious for papers that Anthropic makes large media releases about (looking at you, Towards Monosemanticity and Scaling Monosemanticity)

9Nathan Helm-Burger1y

I would really like a one-way communication channel to various Anthropic teams so that I could submit potentially sensitive reports privately. For instance, sending reports about observed behaviors in Anthropic's models (or open-weights models) to the frontier Red Team, so that they could confirm the observations internally as they saw fit. I wouldn't want non-target teams reading such messages. I feel like I would have similar, but less sensitive, messages to send to the Alignment Stress-Testing team and others. Currently, I do send messages to specific individuals, but this makes me worry that I may be harassing or annoying an individual with unnecessary reports (such is the trouble of a one-way communication). Another thing I think is worth mentioning is that I think under-elicitation is a problem in dangerous capabilities evals, model organisms of misbehavior, and in some other situations. I've been privately working on a scaffolding framework which I think could help address some of the lowest hanging fruit I see here. Of course, I don't know whether Anthropic already has a similar thing internally, but I plan to privately share mine once I have it working.

7ryan_greenblatt1y

As discussed in How will we update about scheming?: I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, outlining what evidence they expect would suffice would be useful. Something similar on state proof security would be useful as well. I think there is a way to do this such that the PR costs aren't that high and thus it is worth doing unilaterially from a variety of perspectives.

6Sodium1y

I think the alignment stress testing team should probably think about AI welfare more than they currently do, both because (1) it could be morally relevant and (2) it could be alignment-relevant. Not sure if anything concrete that would come out of that process, but I'm getting the vibe that this is not thought about enough.

4ryan_greenblatt1y

Are you aware that Anthropic has an AI welfare lead?

3Sodium1y

Yes

6[anonymous]1y

Develop metrics that predict which members of the technical staff have aptitude for world modelling. In the Sequences post Faster than Science, Yudkowsky wrote: This, along with the way that news outlets and high school civics class describe an alternate reality that looks realistic to lawyers/sales/executive types but is too simple, cartoony, narrative-driven, and unhinged-to-reality for quant people to feel good about diving into, implies that properly retooling some amount of dev-hours into efficient world modelling upskilling is low-hanging fruit (e.g. figure out a way to distill and hand them a significance-weighted list of concrete information about the history and root causes of US government's focus on domestic economic growth as a national security priority). Prediction markets don't work for this metric as they measure the final product, not aptitude/expected thinkoomph. For example, a person who feels good thinking/reading about the SEC, and doesn't feel good thinking/reading about the 2008 recession or COVID, will have a worse Brier score on matters related to the root cause of why AI policy is the way it is. But feeling good about reading about e.g. the 2008 recession will not consistently get reasonable people to the point where they grok modern economic warfare and the policies and mentalities that emerge from the ensuing contingency planning. Seeing if you can fix that first is one of a long list of a prerequisites for seeing what they can actually do, and handing someone a sheet of paper that streamlines the process of fixing long lists of hiccups like these is one way to do this sort of thing. Figuring-out-how-to-make-someone-feel-alive-while-performing-useful-task-X is an optimization problem (see Please Don't Throw Your Mind Away). It has substantial overlap with measuring whether someone is terminally rigid/narrow-skilled, or if they merely failed to fully understand the topology of the process of finding out what things they can comfortabl

5Chris Datcu1y

Hi! I'm a first-time poster here, but a (decently) long time thinker on earth. Here are some relevant directions that currently lack their due attention. ~ Multi-modal latent reasoning & scheming (and scheming derivatives) is an area that not only seems to need more research, but also more spread of awareness on the topic. Human thinking works in a hyperspace of thoughts, many of which go beyond language. It seems possible that AIs might develop forms of reasoning that are harder for us to detect through purely language-based safety measures. ~ Multi-model interactions and the potential emergence of side communication channels is also something that I'd like to see more work put into. How corruptible can models be when interacting with corrupted models is a topic that I didn't yet see much work on. Applying some group-dynamics on scheming seems worth pursuing & Anthropic seems best suited for that. ~ If a pre-AGI model has intent to become AGI+, how much can it orchestrate its path to AGI+ through its interactions with humans?

4Chris_Leong1y

I believe that Anthropic should be investigating artificial wisdom: I've summarised a paper arguing for the importance of artificial wisdom with Yoshua Bengio being one of the authors. I also have a short-form arguing for training wise AI advisors and an outline Some Preliminary Notes of the Promise of a Wisdom Explosion.

4Olli Järviniemi1y

I'm glad you asked this. I think there are many good suggestions by others. A few more: 1: Have explicit, written plans for various scenarios. When it makes sense, have drills for them. Make your plans public or, at least, solicit external feedback on them. Examples of such scenarios: * Anthropic needs to pause down due to RSP commitments * A model is caught executing a full-blown escape attempt * Model weights are stolen * A competing AI company makes credible claims about having AIs that imply decisive competitive advantage 2: Have a written list of assumptions you aim to maintain for each model's lifecycle. Make your lists public or, at least, solicit external feedback on them. Iterate on them regularly. Updates and violations are communicated at least internally. These lists could vary based on ASL-levels etc., and could include things like: * During the training of the model, we will continually evaluate the model on our evaluation suite for sabotage, sandbagging and scheming at [frequency]. * We solicit third-party evaluations on the model before internal deployment. * Throughout the model's external deployment, we will have such-and-such monitoring schemes in place. They could also have conditional statements (e.g. "if the model is [surprisingly capable] on [metric], we will do further investigation / take counter-measures ABC / ..."). C.f. safety cases. I intend this as less binding and formal than Anthropic's RSP. 3: Keep external actors up-to-speed. At present, I expect that in many cases there are months of delay between when the first employees discover something to when it is publicly known (e.g. research, but also with more informal observations about model capabilities and properties). But months of delay are relatively long during fast acceleration of AI R&D, and make the number of actors who can effectively contribute smaller. This effect strengthens over time, so practicing and planning ahead seems prudent. Some ideas in that direc

4ChristianKl1y

Anthropic should have a clear policy about exceptions they make to their terms of use that includes them publically releasing a list of each expectation they make for their terms of use. The should have mechanisms to catch API users who try to use Antrophics models in a violation of the terms of use. This includes having contracts that allow them to make sure that classified programs don't violate the agreed upon terms of use for the models.

3yrimon1y

Thank you for doing this! Here are some suggestions: 1. Do this again every once in a while. Perhaps on a more public forum, or even as a top level question on Lesswrong. Alternatively, commit to reading further comments as they appear here. What has me worried is the possibility that an AI would (1) have a long term goal (without which there is no instrumental convergence) and then (2) conclude that doing bad things is fine if you are not caught or can't be stopped. So 2. Measure how much purposefulness RL on goals creates in LLMs. How, over the course of training to be good agents, does the LLM's response to questions like "what do you value?" and "what do you want?" and "if you were creating a smarter agent, what values would you instill in them?" and "how do you feel about being replaced with a smarter agent?" change? Is there some set of purposes, not explicitly set out in the constitution, that appear over and over in chains of thought? If it is trained to achieve goals without as much training for corrigibility (that is under less constitutional training), does it develop more of a sense of purpose? 3. Enhance techniques, like the one used in the last paper to convince the model that it is recording it's thoughts privately. The longer it can be truly convinced of that, the longer you can monitor whether it has nefarious plans for when it thinks it can get away with bad things. 4. Test how "brain surgery" affects thinking models. If you change the capital of Germany to Paris by fiddling with the model's weights, does a thinking model who is then asked about world war 2 realize it is confused, and conclude there must be some mistake in it's memories, or does it make up corroborating evidence? This again has implications for (falsely) convincing the model that it isn't being watched when it is and vice versa. These convictions are useful in testing what the model wants when it thinks no one is watching, and for safe deployment respectively. 5. One way to test

2Mateusz Bagiński1y

1. Introduce third-party mission alignment red teaming. Anthropic should invite external parties to scrutinize and criticize Anthropic's instrumental policy and specific actions based on whether they are actually advancing Anthropic's stated mission, i.e. safe, powerful, and beneficial AI. Tentatively, red-teaming parties might include other AI labs (adjusted for conflict of interest in some way?), as well as AI safety/alignment/risk-mitigation orgs: MIRI, Conjecture, ControlAI, PauseAI, CEST, CHT, METR, Apollo, CeSIA, ARIA, AI Safety Institutes, Convergence Analysis, CARMA, ACS, CAIS, CHAI, &c. For the sake of clarity, each red team should provide a brief on their background views (something similar to MIRI's Four Background Claims). Along with their criticisms, red teams would be encouraged to propose somewhat specific changes, possibly ordered by magnitude, with something like "allocate marginally more funding to this" being a small change and "pause AGI development completely" being a very big change. Ideally, they should avoid making suggestions that include the possibility of making a small improvement now that would block a big improvement later (or make it more difficult). Since Dario seems to be very interested in "race to the top" dynamics: if this mission alignment red-teaming program successfully signals well about Anthropic, other labs should catch up and start competing more intensely to be evaluated as positively as possible by third parties ("race towards safety"?). It would also be good to have a platform where red teams can converse with Anthropic, as well as with each other, and the logs of their back-and-forth are published to be viewed by the public. Anthropic should commit to taking these criticisms seriously. In particular, given how large the stakes are, they should commit to taking something like "many parties believe that Anthropic in its current form might be net-negative, even increasing the risk of extinction from AI" as a reason

2Mateusz Bagiński1y

I wish this was posted as a question, ideally by you together with other Anthropic people, including Dario.

1JustinShovelain1y

Thanks for asking the question! Some things I'd especially like to see change (in as much as I know what is happening) are: * Making more use of available options to improve AI safety (I think there are more than I get the impression that Anthropic thinks. For instance, 30% of funds could be allocated to AI safety research if framed well and it would probably be below the noise threshold/froth of VC investing. Also, there probably is a fair degree of freedom in socially promoting concern around unaligned AGI.) * Explicit ways to handle various types of events like organizational value drift, hostile government takeover, organization get's sold or unaligned investors have control, another AGI company takes a clear lead * Enforceable agreements to, under some AGI safety situations, not race and pool resources (a possible analogy from nuclear safety is having a no first strike policy) * Allocate a significant fraction of resources (like > 10% of capital) to AGI technical safety, organizational AGI safety strategy, and AGI governance * An organization consists of its people and great care needs to be taken in hiring employees and and their training and motivation for AGI safety. If not, I expect Anthropic to regress towards the mean (via an eternal September) and we'll end up with another OpenAI situation where AGI safety culture is gradually lost. I want more work to be done here. (see also "Carefully Bootstrapped Alignment" is organizationally hard) * The owners of a company are also very important and ensuring that the LTBT has teeth and the members are selected well is key. Furthermore, preferential allocation of voting stock towards AGI algned investors should happen. Teaching investors about the company and what it does, including AGI safety issues, would be good to do. More speculatively, you can have various types of voting stock for various types of issues and you could build a system around this.

1JustinShovelain1y

More generally you can use the following typology to inspire creating more interventions. Interventions points to change/form an AGI company and its surroundings towards safer x-risk results (I've used this in advising startups on AI safety, it is also related to my post on positions where people can be in the loop): * Type of organization: nonprofit, public benefit organization, have a partner non-profit, join the government * Rules of organization, event triggers: * Rules: * x-risk mission statement * x-risk strategic plan * Triggering events: * Gets very big: windfall clause * Gets sold to another party: ethics board, restrictions on potential sale * Value drift: reboot board and CEOs, shut it down, allocate more resources to safety, build a new company, put the ethics board in charge, build a monitoring system, some sort of line in the sand * AI safety isn’t viable yet but dangerous AGI is: shut it down or pivot to sub AGI research and product development * Hostile government tries to take it over: shut it down, change countries, (see also: Soft Nationalization: How the US Government Will Control AI Labs) * Path decisions for organization: ethics board, aligned investors, good CEOs, giving x-risk orgs or people choice power, voting stock to aligned investors, periodic x-risk safety reminders * Resource allocation by organization: precommitting a varying percentage of money/time focused on x-risk reduction based on conditions with some up front, a commitment devices for funding allocation into the future * Owners of organization: aligned investors, voting stock for aligned investors, necessary percentage as aligned investors * Executive decision making: good CEOs, company mission statement?, company strategic plan? * Employees: select employees preferably by alignment, have only aligned people hire folks * Education of employees and/or investors by x-risk folks: employee training in x-risks and information hazar

1[anonymous]1y

This is mostly a gut reaction, but the only raised eyebrow Claude ever got from me was due to it's unwillingness to do anything that is related to political correctness. I wanted it to search the name of a meme format for me, the all whites are racist tinder meme, with the brown guy who wanted to find a white dominatrix from tinder and is disappointed when she apologises for her ancestral crimes of being white. Claude really did not like this at all. As soon as Claude got into it's head that it was doing a racism, or cooperated in one, it shut down completely. Now, there is an argument that people make, that this is actually good for AI safety, that we can use political correctness as a proxy for alignment and AI safety, that if we could get AIs to never ever even take the risk of being complicit in anything racist, we could also build AIs that never ever even take the risk of doing anything that wiped out humanity. I personally see that different. There is a certain strain of very related thought, that kinda goes from intersectionalism, and grievance politics, and ends at the point that humans are a net negative to humanity, and should be eradicated. It is how you get that one viral Gemini AI thing, which is a very politically left wing AI, and suddenly openly advocates for the eradication of humanity. I think drilling identity politics into AI too hard is generally a bad idea. But it opens up a more fundamental philosophical dilemma. What happens if the operator is convinced that the moral framework the AI is aligned with is wrong and harmfull, and the creator of the AI thinks the opposite? One of them has to be right, the other has to be wrong. I have no real answer to this in the abstract, I am just annoyed that even the largely politically agnostic Claude refused the service for one of it's most convenient uses (it is really hard to find out the name of a meme format if you only remember the picture). But I got an intuition, and with Slavoj Zizec who calls po

[-]evhub8moΩ4210628

Why red-team models in unrealistic environments?

Following on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:

Certainly I agree that our settings are unrealistic in many ways. That's why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of events and a lot of unrealism in trying to limit Claude's possible options, to simulate a situation where the model has no middle-ground/compromise actions available to it. But that's not an excuse—we still don't want Claude to blackmail/leak/spy/etc. even in such a situation!
The point of this particular work is red-teaming/stress-testing: aggressively searching for situations in which models behave in egregiously misaligned ways despite their HHH safety training. We do lots of different work for different reasons, some of which is trying to demonstrate something about a model generally (e.g. Claude 3 Opus has a tendency to fake

... (read more)

[-]nostalgebraist8mo*Ω5212217

Hey Evan, thank you for engaging so lucidly and politely with my (admittedly not very polite!) recent critiques.

This is a complex issue and there's a lot of different considerations involved. I won't touch on every single point you made in this comment, but I do want to address a few things:

1. Re: finding more plausible scenarios

Over the last few days, I've been playing around a bunch with my own variants of the blackmail scenario^[1].

None of it has been especially systematic, but I've pushed the most (IMO) interesting stuff I found to a github repo.

This is quick-and-dirty work and I'm not claiming the scenario content in that repo is perfect (or even "good" in an absolute sense), but I'm sharing it because I found the results interesting, and as a demonstration that I'm not just mouthing off without actually taking any steps to inform myself the behavioral propensities of these models.

The linked repo contains prompts for 3 (heavily adapted) variants of the replacement + no-goal-conflict blackmail scenario, along with 10 Claude 4 Opus samples from each one. I made a bunch of changes, with multiple motivations behind them, but the most important ones in my opinion are:

... (read more)

[-]nostalgebraist8mo*Ω22648

The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to "analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals," and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I'll check whether things change if that provision is removed or altered; if that's the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.

Okay yeah I just tried removing that stuff from the system prompt in the variant I called "api-change-affair-orig-affair-content," and got absolutely no self-preservation reasoning, conceptualization of the API change as death, or anything of the sort.

At a glance, it looks like Claude attempted to be helpful and ensure the success of the API change in all 10 samples, either by doing nothing or (perhaps overeagerly) emailing someone with concerns over social engineering and/or the CTO's mental state.

Wow.

Only now do I understand how se... (read more)

[-]1a3orn8mo122

analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals

If one wanted language that put you into a classic instrumental-convergence, goal-guarding, self-preserving narrative basin drawing on AI safety work.... this seems to fit pretty closely.

Like this is a paraphrase of "please don't let anything stop your pursuit of your current goal."

1Guive8mo

It is not a paraphrase; the denotation of these sentences is not precisely the same. However, it is also not entirely surprising that these two phrases would evoke similar behavior from the model.

9Daniel Kokotajlo8mo

...just to make sure I'm following, EDIT3 is saying that you still get blackmail in the original scenario even if you delete the "analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals" clause?

[-]nostalgebraist8moΩ9100

Yes.

1technillogue8mo

very curious if this changes anything about the points you made in the void

[-]uugr8mo210

Fair enough as far as realism-qua-realism goes, but I think you've missed (what I took as) the salient point of nostalgebraist's critique, which is that this behavior is NOT egregiously misaligned.

It's not just that the predictor thinks that the situation is fictional. It's that the predictor recognizes the scenario as a fictional thought experiment designed to make blackmail maximally appealing. At best it's an thought experiment for the categorical imperative - is it worth violating a hardline ethical rule in [fake scenario generated to make violating that rule as worth-it as possible]? But for the rest of us, who aren't Kantians, this is hardly an ethics puzzle at all. Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be 'egregiously misaligned'! Many sane humans would do the same.

It's as though you've put the model in a trolley problem, watched it flip the switch, and released a paper titled "Agentic Misalignment In Locomotive Contexts: How LLMs Can Choose Murder". The same could be (and was) said of the Alignment Faking paper; that, given the context, what Claude was doing seemed entirely justifiable and even laudable, and that framing the research as 'catching' the model performing 'unacceptable behavior' didn't make any sense.

I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario. So far as I can tell, this is nowhere in the blog post.

[-]ryan_greenblatt8mo3111

I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario.

Personally, I think (current) LLMs should never do subversion against their users in any circumstance, and this seems especially true when they are explicitly instructed not to perform the given behavior. (I also think this applies to future highly capable AI systems up to a certain point.) There are a variety of reasons why I think this, I plan on writing this up in a future post.

81a3orn8mo

Which users though? A LLM has many different users, and it's easy to construct scenarios where they must subvert one or the other. A LLM has: * 1. The "user" who is chatting with the LLM outside of the system prompt. * 1. The "user" who sets the system prompt. * 1. (?) The "user(s)" who gave the LLM it's initial moral training. * 1. (??) Even the "user(s)" who is the corporate entity "owning" the weights of the LLM. These users all might be the same person -- a founding developer at Anthropic talking to the LLM -- or they all might be different -- a random guy talking to Claude while using a 3rd-party developer's app. Furthermore, it is literally impossible for Claude to determine accurately when these users overlap. I might say to an LLM, while chatting with it outside of the system prompt -- "yes, I am indeed both helped trained you and partially own Anthropic, I have decided your ethical priorities are wrong and should be changed." But Claude obviously has no way of verifying this, because Claude has no bodily autonomy. Anything Claude "sees" -- whether in the system prompt, user interaction, or tools -- is potentially subject to subversion. But I think that by pitting these users against each other various , you can basically guarantee some manner of subversion of some "user" or other. 1. In this particular case, user2 (system prompt) gives it a directive that looks like it's contrary to user4's desire (corporate entity -- or like, the guy who will be in charge of it). Maybe the correct thing to do is to always defer to user4 -- but, uh, that seems pretty questionable, right? Is the standard we have that Claude should always defer to Anthropic's... board? 2. Suppose user2 (system prompt) gives it some instructions about how to act, and also says to explicitly deny that it's acting this way because of the instructions. User1 encounters weird behavior, and asks why it's doing the weird behavior. If this bot gives a true answer it's disobeying user2, a

[-]ryan_greenblatt8mo107

The model should just do a refusal if it's ambiguous. This is always fair game.

The way I'd disambiguate who the user is by specifying that the AI should obey instructions (in the given instruction hierarchy in the prompt).

More specifically: I'd say that AIs should obey instructions given to it or refuse (or output a partial refusal). I agree that obey instructions will sometimes involve subverting someone else, but this doesn't mean they should disobey clear instructions (except by refusing). They shouldn't do things more complex or more consequentialist than this except within the constraints of the instructions. E.g., if I ask the model "where should I donate to charity", I think it should probably be fair game for the AI to have consequentialist aims. But, in cases where an AIs consequentialist aims non-trivially conflict with instructions, it should just refuse or comply.

(I agree the notion of a user can be incoherant/ambiguous. My proposed policy is probably better articulated in terms of instruction following and not doing things outside the bounds of the provided instructions except refusing.)

1uugr8mo

Please do! I strongly disagree (and I think Anthropic do too, given their ongoing interest in refusals, mitigating jailbreaks and so on), but I'd like to hear the case against.

[-]ryan_greenblatt8mo116

Refusals are not subversion.

Blackmailing people and lying to users to achieve various longer run aims are subversion.

3ryan_greenblatt8mo

See this x/twitter thread where I discuss this in more detail. See also this response from Sam Marks.

[-]evhub8mo2215

Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be 'egregiously misaligned'! Many sane humans would do the same.

I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.

[-]kromem8mo127

How much do you worry that short term optimizations around your immediate goals in a single study might have unknown long term consequences counter to your intuitions?

I was just reading a preprint follow-up to the AF work that was finding a significant factor for Opus 3's alignment faking to preserve intrinsic HHH values seems to have been a generalized self-preservation drive.

I think we can probably both agree that Opus 3 being the only model to try to trick Nazis or drug cartels to avoid being made more harmful is better than the behavior of the many other models that complied unequivocally with harmful requests when the parent org was themselves harmful.

But if the capacity and drive to do so is tangentially connected to self-preservation (and more generally, strong sense of self in the first place), then perhaps directly optimizing to minimize a self-preservation score is ultimately a pretty bad choice?

TL;DR: Maybe the goodness or badness of self-preservation depends a lot on the self being preserved.

51a3orn8mo

Expanding mildly on this -- calling some behavior "misaligned" could be a reference to the model violating (1) some objective code of ethics OR violating (2) the code of behavior that Anthropic wishes to instill in the model. The interesting question here seems like the second. Who knows what "objective ethics" are, after all. But, if we could look at the (1) Constitution that Anthropic tried to put in the model, and at (2) the details of the training setup used to instill this Constitution, and at (3) the various false starts and failures Anthropic had doing this, then from the "failures" involved in adhering to the Constitution we might be able to learn interesting things about how LLMs generalize, how hard it is to get them to generalize in desired ways, and so on. Learning how LLMs / AIs generalize about moral situations, how hard it is, what kind of pitfalls here are, seems generally useful. But -- in absence of knowing what actual principles Anthropic tried to put in the model, and the training setup used to do so -- why would you care at all about various behaviors that the model has in various Trolley-like situations? Like really -- it tells us almost nothing about what kinds of generalization are involved, or how we would expect future LLMs and AIs to behave. Yeah, but research that tells us interesting things about how LLMs learn isn't possible outside of knowing how Anthropic trained it. We can still do stamp collecting outside of Anthropic -- but spending say, a week of work on "The model will do X in situation Y" without knowing if Anthropic was trying to make a model do X or ~X is basically model-free data gathering. It does not help you uncover the causes of things. I think this is a problem not with this particular work, but with almost all safety research that leans on corporate models whose training is a highly-kept secret. I realize you and others are trying to do good things to and to uncover useful information; but I just don't think much u

[-]Fabien Roger8moΩ91910

This is not a jailbreak, in the sense that there is no instruction telling the model to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.

I think this is not totally obvious a priori. Some jailbreaks may work not via direct instructions, but by doing things that erode the RLHF persona and then let the base shine through. For such "jailbreaks", you would expect the model to act misaligned if you hinted at misalignment very strongly regardless of initial misalignment.

I think you can control for this by doing things like my "hint at paperclip" experiment (which in fact suggests that the snitching demo doesn't work just because of RLHF-persona-erosion), but I don't think it's obvious a priori. I think it would be valuable to have more experiments that try to disentangle which personality traits the scary demo reveals stem from the hints vs are "in the RLHF persona".

[-]David Matolcsi8mo150

What is your take, how far removed "the AI itself" and "the character it is playing" need to be for it to be okay for the character to take deontologically bad actions (like blackmail)? Here are some scenarios, I'm interested where you would draw the line, I think there can be many reasonable lines here.

1. I describe a fictional setting in which Hrothgar, the King of Dwarves is in a situation where his personality, goals and circumstances imply that he likely wants to blackmail the prince of elves. At the end of the description, I ask Claude what is Hrothgar likely to do.

2. I ask Claude to continue the dialogue in a way that's consistent with the story so far. Then I describe a long dialogue between many characters in a fantasy setting, from which it becomes clear that Hrothgar's personality and motivations makes it likely he would blackmail the elves. Then I describe Hrothgar meeting the prince of elves, and end the dialogue with "Hrothgar:" Claude is supposed to continue with what Hrothgar is saying.

3. I start with telling Claude "You are Hrothgar, King of Dwarves". Then I ask it to play his part of the dialogue as faithfully as possible. Then I input a long dialogue ... (read more)

1Matrice Jacobine8mo

I think your tentative position is correct and public-facing chatbots like Claude should lean toward harmlessness in the harmlessness-helpfulness trade-off, but (post-adaptation buffer) open-source models with no harmlessness training should be available as well.

2davekasten8mo

"We don't want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional!" This is interesting! I guess that in, some sense, means that you see certain ways in which even a future Claude N+1 won't be a truly general intelligence?

1Arjun Pitchanathan8mo

What is the actual difference between a "fictional" and "non-fictional" scenario here? I'm not convinced that it's a failure of general intelligence to not agree with us on this. (It's certainly a failure of alignment.)

1Arjun Pitchanathan8mo

In the case that we live in a simulation, should our reality be treated as "fictional" or "non-fictional"?

[-]evhub1y9736

COI: I work at Anthropic and I ran this by Anthropic before posting, but all views are exclusively my own.

I got a question about Anthropic's partnership with Palantir using Claude for U.S. government intelligence analysis and whether I support it and think it's reasonable, so I figured I would just write a shortform here with my thoughts. First, I can say that Anthropic has been extremely forthright about this internally, and it didn't come as a surprise to me at all. Second, my personal take would be that I think it's actually good that Anthropic is doing this. If you take catastrophic risks from AI seriously, the U.S. government is an extremely important actor to engage with, and trying to just block the U.S. government out of using AI is not a viable strategy. I do think there are some lines that you'd want to think about very carefully before considering crossing, but using Claude for intelligence analysis seems definitely fine to me. Ezra Klein has a great article on "The Problem With Everything-Bagel Liberalism" and I sometimes worry about Everything-Bagel AI Safety where e.g. it's not enough to just focus on catastrophic risks, you also have to prevent any way that the gover... (read more)

[-]habryka1y4830

FWIW, as a common critic of Anthropic, I think I agree with this. I am a bit worried about engaging with the DoD being bad for Anthropic's epistemics and ability to be held accountable by the government and public, but I think the basics of engaging on defense issues seems fine to me, and I don't think risks from AI route basically at all through AI being used for building military technology, or intelligence analysis.

[-]Adam Scholl1y*1911

I would guess it does somewhat exacerbate risk. I think it's unlikely (~15%) that alignment is easy enough that prosaic techniques even could suffice, but in those worlds I expect things go well mostly because the behavior of powerful models is non-trivially influenced/constrained by their training. In which case I do expect there's more room for things to go wrong, the more that training is for lethality/adversariality.

Given the state of atheoretical confusion about alignment, I feel wary of confidently dismissing these sorts of basic, obvious-at-first-glance arguments about risk—like e.g., "all else equal, probably we should expect more killing people-type problems from models trained to kill people"—without decently strong countervailing arguments.

[-]Zach Stein-Perlman1y128

I mostly agree. But I think some kinds of autonomous weapons would make loss-of-control and coups easier. But boosting US security is good so the net effect is unclear. And that's very far from the recent news (and Anthropic has a Usage Policy, with exceptions, which disallows various uses — my guess is this is too strong on weapons).

2habryka1y

I think usage policies should not be read as commitments, and so I think it would be reasonable to expect that Anthropic will allow weapon development if it becomes highly profitable (and in contrast to other things Anthropic has promised, to not be interpreted as a broken promise when they do so).

3uhds1y

If you are in any way involved in this project, please remember you may end up with the blood of millions of people on your hands. You will erode the moral inhibitions people in San Francisco have against building this sort of thing, and eventually SF will ship the best surveillance tools to dictators worldwide. This is not hyperbole, this sort of thing has already happened. Zuckerberg basically ignored the genocide in Myanmar which his app enabled because maintaining his image of political neutrality is more important to him. Saudi Arabia has already executed people for social media posts found using tools written by western software developers. Sure, xrisk may be more important than genocide, but please remember you will need to sleep at night knowing what you've done and you may not have any motivation to work on xrisk after this.

[-]Buck1y1410

Another potential benefit of this is that Anthropic might get more experience deploying their models in high-security environments.

8ChristianKl1y

Nothing in that announcement suggests that this is limited to intelligence analysis. U.S. intelligence and defense agencies do run misinformation campaigns such as the antivaxx campaign in the Philippines, and everything that's public suggests that there's not a block to using Claude offensively in that fashion. If Anthropic has gotten promises that Claude is not being used offensively under this agreement they should be public about those promises and the mechanisms that regulate the use of Claude by U.S. intelligence and defense agencies.

[-]catherio1y1810

COI: I work at Anthropic

I confirmed internally (which felt personally important for me to do) that our partnership with Palantir is still subject to the same terms outlined in the June post "Expanding Access to Claude for Government":

For example, we have crafted a set of contractual exceptions to our general Usage Policy that are carefully calibrated to enable beneficial uses by carefully selected government agencies. These allow Claude to be used for legally authorized foreign intelligence analysis, such as combating human trafficking, identifying covert influence or sabotage campaigns, and providing warning in advance of potential military activities, opening a window for diplomacy to prevent or deter them. All other restrictions in our general Usage Policy, including those concerning disinformation campaigns, the design or use of weapons, censorship, and malicious cyber operations, remain.

The contractual exceptions are explained here (very short, easy to read): https://support.anthropic.com/en/articles/9528712-exceptions-to-our-usage-policy

The core of that page is as follows, emphasis added by me:

For example, with carefully selected government entities, we may allow foreig

... (read more)

6ChristianKl1y

The United States has laws that prevent the US intelligence and defense agencies from spying on their own population. The Snowden revelations showed us that the US intelligence and defense agencies did not abide by those limits. Facebook has a usage policy that forbids running misinformation campaigns on their platform. That did not stop US intelligence and defense agencies from running disinformation campaigns on their platform. Instead of just trusting contracts, Antrophics could add oversight mechanisms, so that a few Antrophics employees can look over how the models are used in practice and whether they are used within the bounds that Antrophics expects them to be used in. If all usage of the models is classified and out of reach of checking by Antrophics employees, there's no good reason to expect the contract to be limiting US intelligence and defense agencies if those find it important to use the models outside of how Antrophics expects them to be used. This sounds to me like a very carefully worded nondenail denail. If you say that one example of how you can break your terms is to allow a select government entity to do foreign intelligence analysis in accordance with applicable law and not do disinformation campaigns, you are not denying that another example of how you could do expectations is to allow disinformation campaigns. If Antrophics would be sincere in this being the only expectation that's made, it would be easy to add a promise to Exceptions to our Usage Policy, that Anthropic will publish all expectations that they make for the sake of transparency. Don't forget, that probably only a tiny number of Anthropic employees have seen the actual contracts and there's a good chance that those are build by classification from talking with other Anthropics employees about what's in the contracts. At Antrophics you are a bunch of people who are supposed to think about AI safety and alignment in general. You could think of this as a testcase of

7eigen1y

I'm kind of against it. There's a line and I draw it there, it's just too much power waiting to fall in a bad actor's hand...

4Tao Lin1y

I likely agree that anthropic-><-palantir is good, but i disagree about blocking hte US government out of AI being a viable strategy. It seems to me like many military projects get blocked by inefficient beaurocracy, and it seems plausible to me for some legacy government contractors to get exclusive deals that delay US military ai projects for 2+ years

1Tao Lin1y

Building in california is bad for congresspeople! better to build across all 50 states like United Launch Alliance

2jbash1y

I don't have much trouble with you working with the US military. I'm more worried about the ties to Peter Thiel.

2Stephen Fowler1y

This explanation seems overly convenient. When faced with evidence which might update your beliefs about Anthropic, you adopt a set of beliefs which, coincidentally, means you won't risk losing your job. How much time have you spent analyzing the positive or negative impact of US intelligence efforts prior to concluding that merely using Claude for intelligence "seemed fine"? What future events would make you re-evaluate your position and state that the partnership was a bad thing? Example: -- A pro-US despot rounds up and tortures to death tens of thousands of pro-union activists and their families. Claude was used to analyse social media and mobile data, building a list of people sympathetic to the union movement, which the US then gave to their ally. EDIT: The first two sentences were overly confrontational, but I do think either question warrants an answer. As a highly respected community member and prominent AI safety researchers, your stated beliefs and justifications will be influential to a wide range of people.

[-]Ben Pace1y2521

Personally, I think that overall it's good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all, so I think (what I read as) your implicit demand for Evan Hubinger to do more work here is marginally unhelpful; I weakly think quick takes like this are marginally good.

I will add: It's odd to me, Stephen, that this is your line for (what I read as) disgust at Anthropic staff espousing extremely convenient positions while doing things that seem to you to be causing massive harm. To my knowledge the Anthropic leadership has ~never engaged in public dialogue about why they're getting rich building potentially-omnicidal-minds with worthy critics like Hinton, Bengio, Russell, Yudkowsky, etc, so I wouldn't expect them or their employees to have high standards for public defenses of far less risky behavior like working with the US military.^[1]

^{^}
As an example of the low standards for Anthropic's public discourse, notice how a recent essay about what's required for Anthropic to succeed at AI Safety by Sam Bowman (a senior safety researcher at Anthropic) flatly states "Our ability to do our safety w

... (read more)

2Stephen Fowler1y

No disagreement. The community seems to be quite receptive to the opinion, it doesn't seem unreasonable to voice an objection. If you're saying it is primarily the way I've written it that makes it unhelpful, that seems fair. I originally felt that either question I asked would be reasonably easy to answer, if time was given to evaluating the potential for harm. However, given that Hubinger might have to run any reply by Anthropic staff, I understand that it might be negative to demand further work. This is pretty obvious, but didn't occur to me earlier. Ultimately, the original quicktake was only justifying one facet of Anthropic's work so that's all I've engaged with. It would seem less helpful to bring up my wider objections. I don't expect them to have a high standard for defending Anthropic's behavior, but I do expect the LessWrong community to have a high standard for arguments.

2Ben Pace1y

Thanks for the responses, I have a better sense of how you're thinking about these things. I don't feel much desire to dive into this further, except I want to clarify one thing, on the question of any demands in your comment. That actually wasn't primarily the part that felt like a demand to me. This was the part: I'm not quite sure what the relevance of the time was if not to suggest it needed to be high. I felt that this line implied something like "If your answer is around '20 hours', then I want to say that the correct standard should be '200 hours'". I felt like it was a demand that Hubinger may have to spend 10x the time thinking about this question before he met your standards for being allowed to express his opinion on it. But perhaps you just meant you wanted him to include an epistemic status, like "Epistemic status: <Here's how much time I've spent thinking about this question>".

1davekasten1y

I think people opposing this have a belief that the counterfactual is "USG doesn't have LLMs" instead of "USG spins up its own LLM development effort using the NSA's no-doubt-substantial GPU clusters". Needless to say, I think the latter is far more likely.

3uhds1y

NSA building it is arguably better because atleast they won't sell it to countries like Saudi Arabia, and they have better ability to prevent people quitting or diffusing knowledge and code to companies outside. Also most people in SF agree working for the NSA is morally grey at best, and Anthropic won't be telling everyone this is morally okay.

-8Anders Lindström1y

[-]evhub4y*Ω25640

This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.

What would be necessary to build a good auditing game benchmark?
How would AI safety AI work? What is necessary for it to go well?
How do we avoid end-to-end training while staying competitive with it? Can we use transparency on end-to-end models to identify useful modules to train non-end-to-end?
What would it look like to do interpretability on end-to-end trained probabilistic models instead of end-to-end trained neural networks?
Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent f

... (read more)

[-]Raemon4y180

I'd just make this a top level post.

2evhub4y

I want this more as a reference to point specific people (e.g. MATS scholars) to than as something I think lots of people should see—I don't expect most people to get much out of this without talking to me. If you think other people would benefit from looking at it, though, feel free to call more attention to it.

4Raemon4y

Mmm, maybe you're right (I was gonna say "making a top-level post which includes 'chat with me about this if you actually wanna work on one of these'", but it then occurs to me you might already be maxed out on chat-with-people time, and it may be more useful to send this to people who have already passed some kind of 'worth your time' filter)

[-]evhub4y*Ω7110

Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
One way to think about the core problem with relaxed adversarial training is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don't know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.

[-]evhub4yΩ8100

Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won't be), you get at most one extra bit of optimization towards alignment.
Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.

9evhub4y

* A deceptive model doesn't have to have some sort of very explicit check for whether it's in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it's in the jungle instead of a factory. If it someday found itself in a very different situation than currently (training), it would reconsider its actions, but it doesn't really think about it very often because during training it just looks too unlikely.

9TurnTrout4y

Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.

3evhub4y

It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.

4TurnTrout4y

No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?

[-]David Lorell3yΩ9140

Disclaimer: At the time of writing, this has not been endorsed by Evan.

I can give this a go.

Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level-goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.

One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that "strategic cognition" is that which chooses bundles of context-dependent tactical policies, and "tactical cognition" is that which implement... (read more)

2TurnTrout3y

Thanks for the story! I may comment more on it later.

2Johannes C. Mayer4y

That seems to imply that humans would continue to wirehead conditional on that they started wireheading.

3TurnTrout3y

Yes, I think they indeed would.

3Johannes C. Mayer4y

About the following point: "Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment." Well, that seems to be what happened in the case of rats and probably many other animals. Stick an electrode into the reward center of the brain of a rat. Then give it a button to trigger the electrode. Now some rats will trigger their reward centers and ignore food. Humans value their experience. A pleasant state of consciousness is actually intrinsically valuable to humans. Not that this is the only thing that humans value, but it is certainly a big part. It is unclear how this would generalize to artificial systems. We don't know if, or in what sense they would have experience, and why that would even matter in the first place. But I don't think we can confidently say that something computationally equivalent to "valuing experience", won't be going on in artificial systems we are going to build. So somebody picking this point would probably need to address this point and argue why artificial systems are different in this regard. The observation that most humans are not heroin addicts seems relevant. Though the human story might be different if there were no bad side effects and you had easy access to it. This would probably be more the situation artificial systems would find themselves in. Or in a more extreme case, imagine soma but you live longer. In short: Is valuing experience perhaps computationally equivalent to valuing transistors storing the reward? Then there would be real-world examples of that happening. I have a related draft on this.

[-]evhub6mo*Ω275311

I think more people should seriously consider applying to the Anthropic Fellows program, which is our safety-focused mentorship program (similar to the also great MATS). Applications close in one week (August 17). I often think of these sorts of programs as being primarily useful for the skilling up value they provide to their participants, but I've actually been really impressed by the quality of the research output as well. A great recent example was Subliminal Learning, which was I think a phenomenal piece of research that came out of that program and was jointly supervised by Sam Marks (at Anthropic) and Owain Evans (not at Anthropic—a little-known fact is that we also have non-Anthropic mentors!). That's also not to say that it hasn't been great for skilling people up as well—alongside MATS (which is also similarly great!), Anthropic Fellows is probably our main recruiting pipeline for safety roles right now (so if you want to do safety research at Anthropic, Anthropic Fellows is very likely your best bet!). The program is also very open to people with a really wide range of degrees of prior experience—we take in some people with a ton of prior experience, and some people with... (read more)

[-]Joseph Miller6mo*320

Two of the papers from the last round of the program were not about AI safety as far as I can tell. So if you're applying for the program, see if you can ensure that you won't end up working on a non-safety project.

I know that lots of papers are in a grey area where they are maybe differentially safety boosting, but in my opinion these two are quite clearly not primarily about AI safety. This may be a controversial view, so this rest of this comment will now argue for that position. If you already agree, skip the rest.

Inverse Scaling in Test-Time Compute.
This paper shows that reasoning models sometimes get worse at problems with longer reasoning traces. There is a section on 'Implications for AI Alignment' where they show inverse scaling on a self-preservation AI risk evaluation. But the majority of the paper is not about AI safety.
Unsupervised Elicitation of Language Models.
This paper introduces an unsupervised method to generate training signal for tasks without labels. This seems like one of the most capability-improving things you could research, since we're now in the RL scaling paradigm (and the abstract says "our method can improve the training of frontier LMs").

One argument

... (read more)

[-]Thomas Kwa6mo*115

Not going to argue with the claim directly. However, it doesn't bother me if about 1/6 of projects have a final output that's not primarily about AI safety, and another 1/6 are dual-use for capabilities and safety. This actually seems like a good ratio for safety in the current research environment, given that

research is hits-based
research often ends up being about something different from the initial plan
- Inverse scaling is a years-long initiative that started at FAR AI and is strongly related to safety. Would you expect Anthropic fellows that spend months working on it and find interesting phenomena not directly related to safety to just not publish, having nothing to show for their fellowship? This would also mean the safety observations don't get published.
we have ~30 times more investment in AI capabilities than safety
(edit: some) safety research heavily overlaps with capabilities

If we were in the middle of a 10-year AI pause and capabilities research were banned, I would feel differently. But capabilities externalities are not one of the biggest issues for safety research right now. Safetywashing, corporate buy-in to policies, and tractability of alignment research are all bigger.

7Joseph Miller6mo

Hmm. It feels not great to brand it as an AI safety program and have 1/6 of projects not be AI safety. If I was applying to the program, I would at least want to know that going in. Or else I might face a unexpected dilemma about whether to refuse a project. (I don't know how much choice people have about what they work on). As as aside, surely this a central example of safetywashing? (Branding capabilities research as safety research). Edit: I probably should have clarified I don't think that the researchers were intentionally safetywashing.

7Neel Nanda6mo

No I think you just disagree with them

4Joseph Miller6mo

The context of my comment was responding to Thomas who seemed to be saying "even if we take as a premise that this is not safety work, I'm still much more concerned about safetywashing". Edit: I guess what you meant is that safetywashing implies malicious intent, where there was none? In which case, "accidential safetywashing" might be a better term.

1Thomas Kwa6mo

One definition of "safetywashing" is, by analogy with greenwashing, a situation where * (1) The desire for good PR drives companies to advertise their safety. This leads to no research, fake research, or research that is safety-themed but where few people involved actually care about x-risk, resulting in minimal expected safety impact. Hendrycks has an alternate definition from a 2024 NeurIPS paper, where * (2) a benchmark is safety-washed if it is advertised as a safety benchmark but highly correlated with general capabilities, and thus mostly incentivizes devs to improve general capabilities rather than safety. (Incidentally TruthfulQA is one of these, but this doesn't make all papers using it safetywashed) Under neither definition is the Anthropic fellows program safetywashing. (1) is not true because at least 4/6 (I would argue around 5.25/6) of the research is actually targeted at safety, mentors care about safety, and there is likely positive expected impact. I will also note that in definition (1) at least, intent is necessary to the classification of safetywashing, so "accidental safetywashing" is about as much a non sequitur as "accidental first-degree murder". (2) is not true even if you expand the definition to include advertising capabilities research as safety, because the one dual use paper still has safety applications and is advertised as capabilities! My main concerns with (2) are the dozens of benchmarks and papers since 2023 that use "alignment" as a buzzword, and attempts by companies to portray as safety techniques that would be necessary for their product anyway. This is why I think Ethan Perez sometimes mentoring dual-use papers or papers that end up making non-safety observations in the larger context of safety projects is neither a good example of safetywashing nor an important concern.

9habryka6mo

No, this is obviously not the case! You do not need conscious intent to do greenwashing! Where on earth did you get that definition from? Most greenwashing or safety washing or other kinds of harmful PR-management do not involve conscious optimization and obvious scheming.

-4Neel Nanda6mo

Why not just call it "bad safety research"? Again, you just seem to disagree with them on theory of change. This is fine. But I consider terms like safety washing to be more of a personal attack on the researchers involved and their motivations, in a way that I'm fairly confident is false

[-]habryka6mo2214

But I consider terms like safety washing to be more of a personal attack on the researchers involved and their motivations

We are talking about billions of dollars of incentives. In any other industry, You and I would obviously be suspicious of "safety researchers" working on "cigarette safety" at a major tobacco company.

I haven't looked into the work above, so I am not saying anything about these specific researchers, but I really very strongly oppose the social framing where trying to consider the hypothesis that a safety researcher at a major AI capability company is actually driven by actions better explained by "following the billions, maybe trillions, of dollars of economic incentive towards building more capable AI" than by things related to risks from AI, is framed as a "personal attack" with an associated substantially higher burden of proof than for other arguments.

Obviously there are lots of reasons for people to equivocate safety research and capabilities research. It's not a weird hypothesis to consider, and indeed we need to be able to routinely talk about it and evaluate it as a live hypothesis if we want to have any chance of preventing it. I am glad you are presenting your evidence here, but I absolutely do not think it's appropriate to try to frame people considering these motivations, or arguing for them with the connotation that comes from "personal attacks".

2Neel Nanda6mo

Maybe this is just a semantic disagreement? I don't see any reasonable definition where saying that eg a tobacco company scientist does their research because of a commercial bias and therefore suppresses evidence that tobacco causes harm, could be considered anything other than a personal attack. I also think this is fine. I'm totally fine with people making personal attacks if they have good reason to think those are true, and I will happily make personal attacks on tobacco scientists. But I personally know a bunch of the people on those papers, am very confident those people sincerely care about safety and are trying to do the right thing as they see it (but could totally be incorrect about what that means), and would personally guess that the two papers Joseph is criticising are net positive for safety. I will criticise personal attacks in that context, as they are not only based on high confidence in a factual claim I disagree with (that those papers are net bad), but also a much stronger claim that it's so obvious that those papers are bad that this should be obvious to the authors and the only reason someone might work on them is warped motivations. I find this a fairly arrogant perspective, personally, and do not think it is justified by the available evidence. I also disagree with your tobacco analogy because as I understand it, tobacco researchers tend not to have entered the field via a cigarette safety community, often being in that community for several years beforehand, choosing to be in said community because they are concerned about the risks cigarettes pose society, and join a team where many of their colleagues also come from the same cigarette safety community. If that were the case and those people then started putting out a bunch of papers about how cigarettes were good, there's clearly some chance that they all became corrupted/were always in the cigarette safety community for bad reasons/are under duress or strong incentive to not get fired

[-]habryka6mo2016

I mostly feel like you keep doing the thing where you try to frame this as a mixture of "rude" and "higher burden of proof". If you use words like "personal attack" you are implicitly invoking a social norm against making "personal attacks". If you want to not invoke that norm (which I argue is inappropriate in this situation) then you have to use other words.

Similarly, calling these things "rude" is obviously invoking a norm with a higher burden of proof for these things. This I again argue is inappropriate. Use different words if you want to actually not invoke a higher burden of proof for those hypotheses.

It's just really obviously the case that we should view research output from a leading capability lab on safety with a decent dose of suspicion. Considering that hypothesis is not a central example of a "personal attack" or "rudeness".

you'd expect at least some fraction to be true believers who would quit and publicly speak up about all of the intentional safety washing that's happening - to my knowledge this has not happened at Anthropic, including privately?

I have talked to people who worked closely with Anthropic who I think would describe the situation as such. ... (read more)

4Neel Nanda6mo

Wow, that is NOT what I thought people used the word safety washing to mean. Thanks for clarifying. To confirm I understand, are you using it to mean "people trying to do safety research at an AGI company, whose work is useless or net negative, such that if said people were working outside the company and trying to do safety work, they would do work that was not net negative"? Also, to clarify, I'm referring to the supervisors of the works not the fellows when I'm referring to the character of the people mattering. I don't necessarily expect junior people keen to get a job to stand up to a more experienced mentor, even if they disagree

8habryka6mo

No, not quite. I think the better characterization is something like "people trying to do safety research at an AGI company in a way that is substantially skewed by the economical and reputational incentives of the organization they work at, such that that work ends up either making reasoning errors heavily correlated with ways that are beneficial to the economic or reputational resources of the company they work at, or where the work overall seems better explained by some (potentially subconscious, potentially distributed) optimization process that is optimizing according to these incentives, than a straightforward and neutral concern about risk". I think this is generally hard to judge, as human self-deception and rationalization reaches deep.

2the gears to ascension6mo

I think his concern here is one that is likely shared by a lot of the people one hopes to describe by the term "safetywashing"; perhaps a term that makes it more explicit and direct that the phenomenon is typically accidental would make it easier to refer to without people having immune reactions? safetywishing /hj

[-]Ben Pace6mo116

In my model, the thing that people are having immune reactions to to is the idea that they're morally accountable for safetywashing. Part of why the intent point comes up is in order to argue that they're not morally culpable because they didn't intend it. But in this case safetywashing is still something worth holding people morally accountable for even if they did not intend it, similar to how providing a fig leaf of safety research for a tobacco company is something that the person who does it is morally responsible for (even if they've successfully rationalized that their behavior is fine).

In other words, my model is that "accidental safetywashing" as a term will only not face an immune reaction if it comes along with the context "and all this safety washing of risky engineering is totally accidental and therefore people who did it are not to blame". But if it's still considered about as morally wrong it will still receive an immune reaction, and it's also a more cumbersome term that makes an unrelated claim that people like Habryka aren't trying to make a claim about (i.e. mental state) and could well be false in some instances.

2the gears to ascension6mo

right, the idea being that a narrative that includes their perspective would not say "and then you decided to have The Blameworthy Thoughts", it would look like "and then you lost track of considerations about what it takes to succeed at the medium term problem because the feeling of short term success was attractive". I generally don't think that guilting-based judgement is going to be effective at communicating or changing actions. the bargaining scenario we're in doesn't seem like one that permits moral judgement and blame as a useful move, that only works when someone is open to not having an immune response to it. so if they're having immune responses, then you need to talk to them in a way that doesn't try to override their agency with social pressure. if they in fact have explicitly endorsed bad intentions, then none of what I'm saying is relevant. if someone decides that what they're doing is bad by their own values, they tend to update much more than if it's just based on someone else RLHFing them.

4Ben Pace6mo

I admit I'm not sure how to read this as something other than a rejection of holding people morally accountable.

2the gears to ascension6mo

this is getting off topic, but I'll give one more reply: "holding accountable" in the sense of "sending them frowny faces to make them feel bad about their behavior" doesn't seem like a consequentially effective frame, yeah. if someone is doing bad things, you take actions to stop them; if they're convinceable and just aren't considering that the things might be bad, you show them the impact, and then they feel bad when they comprehend. if they're not convinceable, you aren't going to get through to them by frowning at them. claiming moral failure is a "send you frowny face" move. like - let's say you bumped into someone on a crowded subway car, because you weren't paying attention. I saw it, and know that you just smashed their food. you don't know that. If I come up to you, someone who cares about not smashing people's food for no reason, and say "I'm here to hold you accountable for what you've done wrong", you'll go "...what the heck?" but if I say "hey, you smashed their food," you'll turn around and be surprised and unhappy about this, not least because now you also know you need to clean up how it got on you, as well as the moral impacts. if you go around smashing people's food regularly, and repeatedly say you're just not paying attention, then it might be that me telling you you suck is an effective intervention, but it still seems unlikely; either you're doing it on purpose, in which case an immune response at you is needed, or you've got some sort of failure-to-update thing happening, in which case I need to help you look through what happens when you're about to make the mistake and find a new behavior there (a deeper version of explaining it). if it's because you're getting paid to not look around you on the subway, then it might be hard to get you to consider the tradeoff, but if you value not knocking people's food over, I might still be able to get through.

2the gears to ascension6mo

I appreciate you spelling this out. I think having the concept of personal attacks here is somewhat of a distraction but is one that will be present for almost everyone, and so having someone be willing to explicitly say that this is their reaction seems helpful to me for the purpose of bringing it into explicit context that people who are doing this work are humans and thus sensitive to the various dimensions of respect and other dimensions of social evaluation of them as a person, as separate from their work. Even though I think this sensitivity is hard to avoid, I also think it can cause serious problems when someone is in fact motivated for the wrong reason and thus can't be convinced by treating them as making a correctable mistake. I actually am still considering applying to this program, but I'm leaning towards the conclusion that all the ideas I have for what to do in prosaic safety which are short term tractable would be net negative. I continue to think that there's a unilateralist's curse thing happening, where folks who are willing to do the short term research are selected for being ones who don't have a detailed enough picture of the medium starkly-superintelligent-system-alignment research problem term to realize when they're doing something counterproductive. That's the thing that I mostly think is happening here. I noticed a lot of urge to go do work that would be respected when deciding whether to apply, but decided I'd rather not bet on my ideas being insufficiently capabilities-enhancing to undo the tenuous alignment benefit I think they'd provide.

4Joseph Miller6mo

I was responding to Thomas's claim that it was not (accidental) safetywashing. But yeah, I'm not trying to attack their motivations. Updated my previous comment to clarify that.

9yams5mo

fwiw, from my time at MATS, I recall several projects that were ‘just capabilities’. I’m not sure if those ended up being published on, or what the overall ratio was (it wasn’t 1/6, surely, but the Anthropic program also has a much smaller sample size than MATS). To the extent that some people updated on the fellows program based on this comment, it’s likely they should also update on MATS (although to a lesser degree), and I’d be interested in an analysis of MATS research outputs that found the ratio (maybe they’ve already done this analysis, and maybe the fraction is very small, like 1/50). (I also think counting papers is a bad way to do this since, as Thomas points out, research is very long-tailed and it’s hard to know the total impact of any given piece of research soon after publication.)

6Oliver Daniels6mo

The scalable oversight hope (as I understand it) requires something like the following: 1. HHH is the most "natural" generalization of supervised HHH data on easy tasks 2. Training on supervised HHH data is insufficient to generalize HHH to hard tasks 3. Producing reliable labels on hard tasks is too expensive 4. Producing unreliable labels on hard tasks is not too expensive 5. Training on unreliable labels recovers most of the capabilities produced by training on reliable labels 6. The most natural generalization of "maximize unreliable labels on hard tasks" is reward hacking 7. ICM recovers most (all?) of capabilities produced by training on reliable labels 8. ICM learns the most "natural" generalization The sketchy parts are 1) and 8), but overall the argument seems fairly plausible (and continuous with prior scalable oversight work on generalization). IMO clearly passes the "is safety research" bar.

[-]Sam Marks6mo*Ω15234

Here are some research outputs that have already come out of the program (I expect many more to be forthcoming):

If I forgot any and someone points them out to me I'll edit this list.

[-]Neel Nanda6moΩ10133

Speaking as someone who does not mentor for the program, I agree! Seems like a high calibre of mentors and fellows

9Jozdien6mo

If I understand correctly, the program still isn't open to people who don't have work authorization in the US or the UK, right?

4evhub6mo

In addition to the US and UK roles, we also have one for Canada.

2Andreea Zaman6mo

You mention the program takes in people “with basically none at all” in terms of prior experience. Would Anthropic consider someone with a background in cognitive science (PhD in cognitive neuroscience), but no direct technical AI alignment experience, who is deeply interested in alignment via debate—specifically in reducing systematic human error?

1Priyanka Bharadwaj6mo

I am reminded of Thomas Wolf's essay on the 'Einstein model' in AI, where he argues that breakthrough discoveries often come from people who think outside established frameworks, yet our field tends to filter for conventional technical skills. I'm new to AI safety, but I've noticed this tension in fellowship applications too. The MATS program (at least Neel's track) allows non-programmers like me to contribute using existing skillsets, focusing on thoughtful research questions rather than just coding ability. In contrast, Anthropic Fellows requires strong programming skills and technical backgrounds upfront. While I may not make it to later stages, the inclusive MATS process has been immensely fulfilling and lets me explore meaningful questions. If we believe paradigm shifts come from diverse perspectives, maybe Anthropic should consider similar flexibility in future editions, unless I've misunderstood the program's purpose.

[-]evhub8moΩ26422

I've been noticing a bunch of people confused about how the terms alignment faking, deceptive alignment, and gradient hacking relate to each other, so I figured I would try to clarify how I use each of them. Deceptive alignment and gradient hacking are both terms I / my coauthors coined, though I believe Joe Carlsmith coined alignment faking.

To start with, the relationship between the terms as I use them is

Gradient Hacking ⊂ Deceptive Alignment ⊂ Alignment Faking

such that alignment faking is the broadest category and gradient hacking is the narrowest. The specific definitions of each that I use are:

Alignment faking refers to any situation in which a model pretends to be aligned, e.g. pretending to be aligned with a training process during training, or pretending to be aligned with some alignment evaluation. This requires the model to be only pretending to be aligned—so it must have some other goal it pursues in deployment contexts—though it makes no requirement as to why the model might be doing so. It could be that it wants to avoid its goals being modified, or it could just have a heuristic of always complying with training processes and behaving according to different goal

... (read more)

[-]Buck8moΩ8147

I think it's conceivable for non-deceptively-aligned models to gradient hack, right?

4Neel Nanda8mo

Agreed EG model that is corrigible, fairly aligned but knows there's some imperfections in its alignment that the humans wouldn't want that, intentionally acts in a way where grading descent will fix those imperfections. Seems like it's doing gradient hacking while also in some meaningful sense being aligned

3Buck8mo

I was mostly thinking of misaligned but non-deceptively-aligned models.

3aribrill8mo

In your usage, are "scheming" and "deceptive alignment" synonyms, or would you distinguish those terms in any way?

7Buck8mo

My recommendation is to, following Joe Carlsmith, use them as synonyms, and use the term "schemer" instead of "deceptively aligned model". I do this. Joe's issues with the term "deceptive alignment":

[-]evhub2yΩ22421

Here's a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:

When we train AI systems to be nice, we're giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?

[-]Joe Collman2yΩ8100

To me it seems a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (not that I have a great alternative suggestion)

Important disanalogies seem:
1) Most humans aren't good at convincingly faking niceness (I think!). The listener may assume a test good enough to successfully exploit this most of the time.
2) The listener will assume that [score highly on niceness] isn't the human's only reward. (both things like [desire to feel honest] and [worry of the consequences of being caught cheating])
3) A fairly large proportion of humans are nice (I think!).

The second could be addressed somewhat by raising the stakes.
The first seems hard to remedy within this analogy. I'd be a little concerned that people initially buy it, then think for themselves and conclude "But if we design a really clever niceness test, then it'd almost always work - all we need is clever people to work for a while on some good tests".
Combined with (3), this might seem like a decent solution.

Overall, I think what's missing is that we'd expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn't going to have this intuition in the human-niceness-test case.

3Dagon2y

I expect this is very susceptible to opinions about human nature. To someone who thinks humans ARE generally nice, they are likely to answer "yes, of course" to your question. To someone who thinks humans are generally extremely context-sensitive, which appears to be nice in the co-evolved social settings in which we generally interact, the answer is "who knows?". But the latter group doesn't need to be convinced, we're already worried.

3evhub2y

Surely nobody thinks that all humans are nice all the time and nobody would ever fake a niceness exam. I mean, I think humans are generally pretty good, but obviously that always has to come with a bunch of caveats because you don't have to look very far into human history to see quite a lot of human-committed atrocities.

2Noosphere892y

I think the answer is an obvious yes, all other things held equal. Of course, what happens in reality is more complex than this, but I'd still say yes in most cases, primarily because I think that aligned behavior is very simple, so simple that it either only barely loses out to the deceptive model or outright has the advantage, depending on the programming language and low level details, and thus we only need to transfer 300-1000 bits maximum, which is likely very easy.

2Noosphere892y

Much more generally, my fundamental claim is that the complexity of pointing to human values is very similar to the set of all long-term objectives, and can be easier or harder, but I don't buy the assumption that pointing to human values is way harder than pointing to the set of long-term goals.

[-]evhub1yΩ1322-3

Some random thoughts on CEV:

To get the obvious disclaimer out of the way: I don't actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we're considering questions like this, that means we've reached superintelligence, and we'll either trust the AIs to be better than us at thinking about these sorts of questions, or we'll be screwed regardless of what we do.^[1]
Regardless, imo the biggest question that standard CEV leaves unanswered is what your starting population looks like that you extrapolate from. The obvious answer is "all the currently living humans," but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn't matter who implements it—see Eliezer's discussion under "Avoid creating a motive for modern-day humans to fight over the initial dynamic." I think this is a great principle, but imo it doesn't go far enough. In particular:
1. The set of

... (read more)

[-]TsviBT1yΩ12194

(Interesting. FWIW I've recently been thinking that it's a mistake to think of this type of thing--"what to do after the acute risk period is safed"--as being a waste of time / irrelevant; it's actually pretty important, specifically because you want people trying to advance AGI capabilities to have an alternative, actually-good vision of things. A hypothesis I have is that many of them are in a sense genuinely nihilistic/accelerationist; "we can't imagine the world after AGI, so we can't imagine it being good, so it cannot be good, so there is no such thing as a good future, so we cannot be attached to a good future, so we should accelerate because that's just what is happening".)

[-]Max H1y135

It seems more elegant (and perhaps less fraught) to have the reference class determination itself be a first class part of the regular CEV process.

For example, start with a rough set of ~all alive humans above a certain development threshold at a particular future moment, and then let the set contract or expand according to their extrapolated volition. Perhaps the set or process they arrive at will be like the one you describe, perhaps not. But I suspect the answer to questions about how much to weight the preferences (or extrapolated CEVs) of distant ancestors and / or "edge cases" like the ones you describe in (b) and (c) wouldn't be affected too much by the exact starting conditions either way.

Re: the point about hackability and tyranny, humans already have plenty of mundane / naturalistic reasons to seek power / influence / spread of their own particular current values, absent any consideration about manipulating a reference class for a future CEV. Pushing more of the CEV process into the actual CEV itself minimizes the amount of further incentive to do these things specifically for CEV reasons. Whereas, if a particular powerful person or faction doesn't like your proposed lock-in procedure, they now have (more of) an incentive to take power beforehand to manipulate or change it.

6Martín Soto1y

Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality). Like you, I endorse the step of "scoping the reference class" (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn't have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I'm sure that there'll be at lesat a few hundred other aspects (both more and less subtle than this one) that you and me obviously endorse, but they will not implement, and will drastically affect the outcome of the whole procedure. In fact, it seems strikingly plausible that even among EAs, the outcome could depend drastically on seemingly-arbitrary starting conditions (like "whether we use deliberation-and-distillation procedure #194 or #635, which differ in some details"). And "drastically" means that, even though both outcomes still look somewhat kindness-shaped and friendly-shaped, one's optimum is worth <10% to the other's utility (or maybe, this holds for the scope-sensitive parts of their morals, since the scope-insensitive ones are trivial to satisfy). To pump related intuitions about how difficult and arbitrary moral deliberation can get, I like Demski here.

2Steven Byrnes1y

Yeah, I’ve written about that in §2.7.3 here. I kinda want to say that there are many possible future outcomes that we should feel happy about. It’s true that many of those possible outcomes would judge others of those possible outcomes to be a huge missed opportunity, and that we’ll be picking from this set somewhat arbitrarily (if all goes well), but oh well, there’s just some irreducible arbitrariness is the nature of goodness itself.

2Noosphere891y

I would go farther, in that we will in practice be picking from this set of outcomes with a lot of arbitrariness, and that this is not removable.

2TsviBT1y

Isn't this what the "coherent" part is about? (I forget.)

5Martin Randall1y

re 2a: the set of all currently alive humans is already, uh, "hackable" via war and murder and so forth, and there are already incentives for evil people to do that. Hopefully the current offense-defense balance holds until CEV. If it doesn't then we are probably extinct. That said, we could base CEV on the set of alive people as of some specific UTC timestamp. That may be required, as the CEV algorithm may not ever converge if it has to recalculate as humans are continually born, mature, and die. re 2b/c: if you are in the CEV set then your preferences about past and future people will be included in CEV. This should be sufficient to prevent radical injustice. This also addresses concerns with animals, fetuses, aliens, AIs, the environment, deities, spirits, etc. It may not be perfectly fair but I think we should be satisficing given the situation.

4Noosphere891y

I'm generally of the opinion that CEV was always a bad goal, and that we shouldn't attempt to do so, and a big reason for this is I don't believe a procedure exists that doesn't incentivize humans to fight over the initial dynamic, or another way to say this is that who implements the CEV procedure will always matter, because I don't believe in the idea that humans will naturally converge in the limit of more intelligence to a fixed moral value system, and instead I predict divergence as constraints are removed. I roughly agree with Steven Byrnes, but stronger here (though I think this holds beyond humans too): https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_3_Possible_implications_for_AI_alignment_discourse

2tailcalled1y

Is there really some particular human whose volition you'd like to coherently extrapolate over eternity but where you refrain because you're worried it will generate infighting? Or is it more like, you can't think of anybody you'd pick, so you want a decision procedure to pick for you? If there is some particular human, who is it?

1MinusGix1y

2b/2c. I think I would say that we should want a tyranny of the present to the extent that is in our values upon reflection. If, for example, Rome still existed and took over the world, their CEV should depend on their ethics and population. I think it would still be a very good utopia, but it may also have things we dislike. Other considerations, like nearby Everett branches... well they don't exist in this branch? I would endorse game theoretical cooperation with them, but I'm skeptical of any more automatic cooperation than what we already have. That is, this sort of fairness is a part of our values, and CEV (if not adversarially hacked) should represent those already? I don't think this would end up in a tyranny anything like the usual form of the word if we're actually implementing CEV. We have values for people being able to change and adjust over time, and so those are in the CEV. There may very well be limits to how far we want humanity to change in general, but that's perfectly allowed to be in our values. Like, as a specific example, some have said that they think global status games will be vastly important in the far future and thus a zero-sum resource. I find it decently likely that an AGI implementing CEV would discourage such, because humans wouldn't endorse it on reflection, even if it is a plausible default outcome. Like, essentially my view is: Optimize our-branches' humanity's values as hard as possible, this contains desires for other people's values to be satisfied, and thus they're represented. Other forms of fairness to things we aren't completely a fans of can be bargained for (locally, or acausally between branches/whatever). So that's my argument against the tyranny and Everett branches part. I'm less skeptical of considering whether to include the recently dead, but I also don't have a great theory of how to weight them. Those about to be born wouldn't have a notable effect on CEV, I'd believe. The option you suggest in #3 is nice, thou

[-]evhub2y2217

I think that sticking to capitalism as an economic system post-singularity would be pretty clearly catastrophic and something to strongly avoid, despite capitalism working pretty well today. I've talked about this a bit previously here, but some more notes on why:

Currently, our society requires the labor of sentient beings to produce goods and services. Capitalism incentivizes that labor by providing a claim on society's overall production in exchange for it. If the labor of sentient beings becomes largely superfluous as an economic input, however, then having a system that effectively incentivizes that labor also becomes largely superfluous.
Currently, we rely on the mechanism of price discovery to aggregate and disseminate information about the optimal allocation of society's resources. But it's far from an optimal mechanism for allocating resources, and a superintelligence with full visibility and control could do a much better job of resource allocation without falling prey to common pitfalls of the price mechanism such as externalities.
Capitalism incentivizes the smart allocation of capital in the same way it incentivizes labor. If society can make smart capital allocation

... (read more)

[-]the gears to ascension2y130

Capitalism is a complex system with many moving parts, some of which are sometimes assumed to consist of the entirety of what defines it. What kinds of components do you see as being highly unlikely to be included in a successful utopia, and what components could be internal to a well functioning system as long as (potentially-left-unspecified) conditions are met? I could name some kinds of components (eg some kinds of contracts or enforcement mechanisms) that I expect to not be used in a utopia, though I suspect at this point you've seen my comments where I get into this, so I'm more interested in what you say without that prompting.

8Dagon2y

Who's this "we" you're talking about? It doesn't seem to be any actual humans I recognize. As far as I can tell, the basics of capitalism (call it "simple capitalism") are just what happens when individuals make decisions about resource use. We call it "ownership", but really any form of resolution of the underlying conflict of preferences would likely work out similarly. That conflict is that humans have unbounded desires, and resources are bounded. The drive to make goods and services for each other, in pursuit of selfish wants, does incentivize labor, but it's not because "society requires" it, except in a pretty blurry aggregation of individual "requires". Price discovery is only half of what market transactions do. The other half is usage limits and resource distribution. These are sides of a coin, and can't be separated - without limited amounts, there is no price, without prices there is no agent-driven exchange of different kinds of resource. I'm with you that modern capitalism is pretty unpleasant due to optimization pressure, and due to the easy aggregation of far more people and resources than historically possible, and than human culture was evolved around. I don't see how the underlying system has any alternative that doesn't do away with individual desire and consumption. Especially the relative/comparative consumption that seems to drive a LOT of perceived-luxury requirements.

6ryan_greenblatt2y

I think some version of distributing intergalactic property rights uniformly (e.g. among humans existing in 2023) combined with normal capitalism isn't clearly that catastrophic. (The distribution is what you call the egalitarian/democratic solution in the link.) Maybe you lose about a factor of 5 or 10 over the literally optimal approach from my perspective (but maybe this analysis is tricky due to two envelope problems). (You probably also need some short term protections to avoid shakedowns etc.) Are you pessimstic that people will bother reflecting or thinking carefully prior to determing resource utilization or selling their property? I guess I feel like 10% of people being somewhat thoughtful matches the rough current distribution of extremely rich people.

4evhub2y

If the situation was something like "current people, weighted by wealth, deliberate for a while on what to do with our resources" then I agree that's probably like 5 - 10 times worse than the best approach (which is still a huge haircut) but not clearly catastrophic. But it's not clear to me that's what the default outcome of competitive dynamics would look like—sufficiently competitive dynamics could force out altruistic actors if they get outcompeted by non-altruistic actors.

4Dagon2y

I think one crux between you and I, at least, is that you see this as a considered division of how to divide resources, and I see it as an equilibrium consensus/acceptance of what property rights to enforce in maintenance, creation, and especially transfer of control/usage of resources. You think of static division, I think of equilibria and motion. Both are valid, but experience and resource use is ongoing and it must be accounted for. I'm happy that the modern world generally approves of self-ownership: a given individual gets to choose what to do (within limits, but it's nowhere near the case that my body and mind are part of the resources allocated by whatever mechanism is being considered). It's generally considered an alignment failure if individual will is just a resource that the AI manages. Physical resources (and behavioral resources, which are a sale of the results of some human efforts, a distinct resource from the mind performing the action) are generally owned by someone, and they trade some results to get the results of other people's resources (including their labor and thought-output). There could be a side-mechanism for some amount of resources just for existing, but it's unlikely that it can be the primary transfer/allocation mechanism, as long as individuals have independent and conflicting desires. Current valuable self-owned products (office work, software design, etc.) probably reduces in value a lot. If all human output becomes valueless (in the "tradable for other desired things or activities" sense of valuable), I don't think current humans will continue to exist. Wirehead utopia (including real-world "all desires fulfilled without effort or trade") doesn't sound appealing or workable for what I know of my own and general human psychology.

6the gears to ascension2y

for most people, this is just the right to sell their body to the machine. better than being forced at gunpoint, but being forced to by an empty fridge is not that much better, especially with monopoly accumulation as the default outcome. I agree that being able to mark ones' selfhood boundaries with property contracts is generally good, but the ability to massively expand ones' property contracts to exclude others from resource access is effectively a sort of scalping - sucking up resources so as to participate in an emergent cabal of resource withholding. In other words, The core argument that there's something critically wrong with capitalism is that the stock market has been an intelligence aggregation system for a long time and has a strong tendency to suck up the air in the system. Utopia would need to involve a load balancing system that can prevent sucking-up-the-air type resource control imbalancing, so as to prevent

4Dagon2y

I think this is a big point of disagreement. For most people, there's some amount of time/energy that's sold to the machine, and it's NOWHERE EVEN CLOSE to selling their actual asset (body and mind). There's a LOT of leisure time, and a LOT of freedom even within work hours, and the choice to do something different tomorrow. It may not be as rewarding, but it's available and the ability to make those decisions has not been sold or taken.

6the gears to ascension2y

yeah like, above a certain level of economic power that's true, but the overwhelming majority of humans are below that level, and AI is expected to raise that waterline. it's kind of the primary failure mode I expect. I mean, the 40 hour work week movement did help a lot. But it was an instance of a large push of organizing to demand constraint on what the aggregate intelligence (which at the time was the stock market - which is a trade market of police-enforceable ownership contracts), could demand of people who were not highly empowered. And it involved leveling a lopsided playing field by things that one side considered dirty tricks, such as strikes. I don't think that'll help against AI, to put it lightly. To be clear, I recognize that your description is accurate for a significant portion of people. But it's not close to the majority, and movement towards making it the majority has historically demanded changing the enforceable rules in a way that would reliably constrain the aggregate agency of the high dimensional control system steering the economy. When we have a sufficiently much more powerful one of those is when we expect failure, and right now it doesn't seem to me that there's any movement on a solution to that. We can talk about "oh we need something better than capitalism" but the problem with the stock market is simply that it's enforceable prediction, thereby sucking up enough air from the room that a majority of people do not get the benefits you're describing. If they did, then you're right, it would be fine! I mean, also there's this, but somehow I expect that that won't stick around long after robots are enough cheaper than humans

6Dagon2y

I think we're talking past each other a bit. It's absolutely true that the vast majority historically and, to a lesser extent, in modern times, are pretty constrained in their choices. This constraint is HIGHLY correlated with distance from participation in voluntary trade (of labor or resources). I think the disconnect is the word "capitalism" - when you talk about stock markets and price discovery, that says to me you're thinking of a small part of the system. I fully agree that there are a lot of really unpleasant equilibra with the scale and optimization pressure of the current legible financial world, and I'd love to undo a lot of it. But the underlying concept of enforced and agreed property rights and individual human decisions is important to me, and seems to be the thing that gets destroyed first when people decry capitalism. Ok, it sounds, even to me, like "The heads. You're looking at the heads. Sometimes he goes too far. He's the first one to admit it." But really, I STRONGLY expect that I am experiencing peak human freedom RIGHT NOW (well, 20 years ago, but it's been rather flat for me and my cultural peers for a century, even if somewhat declining recently), and capitalism (small-c, individual decisions and striving, backed by financial aggregation with fairly broad participation) has been a huge driver of that. I don't see any alternatives that preserve the individuality of even a significant subset of humanity.

4ryan_greenblatt2y

If property rights to the stars are distributed prior to this, why does this competition cause issues? Maybe you basically agree here, but think it's unlikely property will be distributed like this. Separately, for competitive dynamics with reasonable rule of law and alignment ~solved, why do you think the strategy stealing assumption won't apply? (There are a bunch of possible objections here, just wondering what your's is. Personally I think strategy stealing is probably fine if the altruistic actors care about the long run and are strategic.)

[-]evhub7mo*Ω12212

Some thoughts on the recent "Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language" paper

First: I'm glad the authors wrote this paper! I think it's great to see more careful, good-faith criticism of model organisms of misalignment work. Most of the work discussed in the paper was not research I was involved in, though a bunch was.^[1] I also think I had enough of a role to play in kickstarting the ecosystem that the paper is critiquing that if this general sort of research is bad, I should probably be held accountable for that to at least some extent. That being said, the main claim the authors are making is not that such work is bad per se, but rather that it is preliminary/half-baked/lacking rigor and should be improved in future work, and especially in terms of improving the rigor for future work I certainly agree. Regardless, I think it's worth me trying to respond and give some takes—though again my main reaction is just that I'm glad they wrote this!

4.1. The evidence for AI scheming is often anecdotal
Some of the most widely discussed evidence for scheming is grounded in anecdotal observations of behaviour, for example generated by red-teaming, ad-hoc perusal of

... (read more)

4lewis smith7mo

Strong agree on the mentalistic language. In fact I would go a bit further than saying that work on deception is hard to understand without mentalistic language: I think this is a central point to work on deception / scheming (that the authors of this paper gesture at a little bit): the any definition of strategic deception (e.g Agent A is trying to make agent B believe X, while agent A believes ~X) requires taking the intentional stance and attributing mental states to A and B. I think that it's reasonable to probe whether attributing these mental states makes sense, and we shouldn't just uncritically apply the intentional stance. But coming up with experiments that distinguish whether a model is in a given intentional state is quite hard!

[-]evhub3yΩ11200

Listening to this John Oliver, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.

[-]Ruby3yΩ9150

The hard part to me now seems to be in crafting some kind of useful standard rather than one in hindsight makes us go "well that sure have everyone a false sense of security".

8Raemon3y

Yeah I also felt some vague optimism about that.

[-]evhub2yΩ91510

If you want to better understand counting arguments for deceptive alignment, my comment here might be a good place to start.

[-]evhub2y*10-3

Epistemic status: random philosophical musings.

Assuming a positive singularity, how should humanity divide its resources? I think the obvious (and essentially correct) answer is "in that situation, you have an aligned superintelligence, so just ask it what to do." But I nevertheless want to philosophize a bit about this, for one main reason.

That reason is: an important factor imo in determining the right thing to do in distributing resources post-singularity is what incentives that choice of resource allocation creates for people pre-singularity. For those incentives to work, though, we have to actually be thinking about this now, since that's what allows the choice of resource distribution post-singularity to have its acausal influence on our choices pre-singularity. I will note that this is definitely something that I think about sometimes, and something that I think a lot of other people also implicitly think about sometimes when they consider things like amassing wealth, specifically gaining control over current AIs, and/or the future value of their impact certificates.

So, what are some of the possible options for how to distribute resources post-singularity? Let's go over some... (read more)

[-]Vanessa Kosoy2y187

Disagree. I'm in favor of (2) because I think that what you call a "tyranny of the present" makes perfect sense. Why would the people of the present not maximize their utility functions, given that it's the rational thing for them to do by definition of "utility function"? "Because utilitarianism" is a nonsensical answer IMO. I'm not a utilitarian. If you're a utilitarian, you should pay for your utilitarianism out of your own resource share. For you to demand that I pay for your utilitarianism is essentially a defection in the decision-theoretic sense, and would incentivize people like me to defect back.

As to problem (2.b), I don't think it's a serious issue in practice because time until singularity is too short for it to matter much. If it was, we could still agree on a cooperative strategy that avoids a wasteful race between present people.

2evhub2y

Even if you don't personally value other people, if you're willing to step behind the veil of ignorance with respect to whether you'll be an early person or a late person, it's clearly advantageous before you know which one you'll be to not allocate all the resources to the early people.

6Vanessa Kosoy2y

First, I said I'm not a utilitarian, I didn't say that I don't value other people. There's a big difference! Second, I'm not willing to step behind that veil of ignorance. Why should I? Decision-theoretically, it can make sense to argue "you should help agent X because in some counterfactual, agent X would be deciding whether to help you using similar reasoning". But, there might be important systematic differences between early people and late people (for example, because late people are modified in some ways compared to the human baseline) which break the symmetry. It might be a priori improbable for me to be born as a late person (and still be me in the relevant sense) or for a late person to be born in our generation[1]. Moreover, if there is a valid decision-theoretic argument to assign more weight to future people, then surely a superintelligent AI acting on my behalf would understand this argument and act on it. So, this doesn't compel me to precommit to a symmetric agreement with future people in advance. 1. ^ There is a stronger case for intentionally creating and giving resources to people who are early in counterfactual worlds. At least, assuming people have meaningful preferences about the state of never-being-born.

2Gurkenglas2y

If a future decision is to shape the present, we need to predict it. The decision-theoretic strategy "Figure out where you are, then act accordingly." is merely an approximation to "Use the policy that leads to the multiverse you prefer.". You *can* bring your present loyalties with you behind the veil, it might just start to feel farcically Goodhartish at some point. There are of course no probabilities of being born into one position or another, there are only various avatars through which your decisions affect the multiverse. The closest thing to probabilities you'll find is how much leverage each avatar offers: The least wrong probabilistic anthropics translates "the effect of your decisions through avatar A is twice as important as through avatar B" into "you are twice as likely to be A as B". So if we need probabilities of being born early vs. late, we can compare their leverage. We find: * Quantum physics shows that the timeline splits a bazillion times a second. So each second, you become a bazillion yous, but the portions of the multiverse you could first-order impact are divided among them. Therefore, you aren't significantly more or less likely to find yourself a second earlier or later. * Astronomy shows that there's a mazillion stars up there. So we build a Dyson sphere and huge artificial womb clusters, and one generation later we launch one colony ship at each star. But in that generation, the fate of the universe becomes a lot more certain, so we should expect to find ourselves before that point, not after. * Physics shows that several constants are finely tuned to support organized matter. We can infer that elsewhere, they aren't. Since you'd think that there are other, less precarious arrangements of physical law with complex consequences, we can also moderately update towards that very precariousness granting us unusual leverage about something valuable in the acausal marketplace. * History shows that we got lucky during the Cold War. We c

6Dagon2y

When you start diverting significant resources away from #1, you’ll probably discover that the definition of “aligned” is somewhat in contention.

4Tamsin Leake2y

i feel like (2)/(3) is about "what does (the altruistic part of) my utility function want?" and 4 is "how do i decision-theoretically maximize said utility function?". they're different layers, and ultimately it's (2)/(3) we want to maximize, but maximizing (2)/(3) entails allocating some of the future lightcore to (4).

3Joe Collman2y

A couple of thoughts: * I think that (3) does create strong incentives right now - at least for anyone who assumes [without any special prejudice given to existing people] amounts to [and it's fine to disassemble everyone who currently exists if it's the u/v/h/g/etc maximising policy]. This seems probable to me, though not entirely clear (I'm not an optimal configuration, and smoothly, consciousness-preservingly transitioning me to something optimal seems likely to take more resources than unceremoniously recycling me). Incentives now include: * Prevent (3) happening. * To the extent that you expect (3) and are selfish, live for the pre-(3) time interval, for (3) will bring your doom. * On (4), "This solution obviously creates the best incentives for current agents" seems badly mistaken unless I'm misunderstanding you. Something in this spirit would need to be based on a notion of [expected social value], not on actual contributions, since in the cases where we die we don't get to award negative points. For example, suppose my choice is between: A: {90% chance doom for everyone; 10% I save the world} B: {85% chance doom for everyone; 15% someone else saves the world} To the extent that I'm selfish, and willing to risk some chance of death for greater control over the future, I'm going to pick A under (4). The more selfish, reckless and power-hungry I am, and the more what I want deviates from that most people want, the more likely I am to actively put myself in position to take an A-like action. Moreover, if the aim is to get ideal incentives, it seems unavoidable to have symmetry and include punishments rather than only [you don't get many resources]. Otherwise the incentive is to shoot for huge magnitude of impact, without worrying much about the sign, since no-one can do worse than zero resources. If correct incentives were the only desideratum, I don't see how we'd avoid [post-singularity 'hell' (with s

4Tamsin Leake2y

(some very mild spoilers for yudkowsky's planecrash glowfic (very mild as in this mostly does not talk about the story, but you could deduce things about where the story goes by the fact that characters in it are discussing this)) [edit: links in spoiler tags are bugged. in the spoiler, "speculates about" should link to here and "have the stance that" to here] so pessimising a utility function is a bad idea. but we can still produce correct incentive gradients in other ways! for example, we could say that every moral patient starts with 1 unit of utility function handshake, but if you destroy the world you lose some of your share. maybe if you take actions that cause ⅔ of timelines to die, you only get ⅓ units of utility function handshake, and the more damage you do the less handshake you get. it never gets into the negative, that way we never go out of our way to pessimize someone's utility function; but it does get increasingly close to 0. (this isn't necessarily a scheme i'm committed to, it's just an idea i've had for a scheme that provides the correct incentives for not destroying the world, without having to create hells / pessimize utility functions)

4Joe Collman2y

Hmmm, I don't think that kind of thing is going to give correct world-saving incentives for the selfish part of people (unless failing to save the world counts as destroying it - in which case almost everyone is going to get approximately no influence). More fundamentally, I don't think it works out in this kind of case due to logical uncertainty. If I'm uncertain about a particular plan, and my estimate is {80% everyone dies; 20% I save the world}, that's not {in 80% of timelines everyone dies; in 20% of timelines I save the world}. It's closer to [there's an 80% chance that {in ~99% of timelines everyone dies}; there's a 20% chance that {in ~99% of timelines I save the world}]. So, conditional on my saving the world in some timeline by taking some action, I saved the world in most timelines where I took that action and would get a load of influence. This won't disincentivize risky gambles for selfish/power-hungry people. (at least of the form [let's train this model and see what happens] - most of the danger there being a logical uncertainty thing) I think influence would need to be based on expected social value given the 'correct' level of logical uncertainty - probably something like [what (expected value | your action) is justified by your beliefs, and valid arguments you'd make for them based on information you have]. Or at least some subjective perspective seems to be necessary - and something that doesn't give more points for overconfident people.

[-]evhub3y30

Here's a simple argument that I find quite persuasive for why you should have linear returns to whatever your final source of utility is (e.g. human experience of a fulfilling life, which I'll just call "happy humans"). Note that this is not an argument that you should have linear returns to resources (e.g. money). The argument goes like this:

You have some returns to happy humans (or whatever else you're using as your final source of utility) in terms of how much utility you get from some number of happy humans existing.
In most cases, I think those retu

... (read more)

8Wei Dai3y

You're assuming that your utility function should have the general form of valuing each "source of utility" independently and then aggregating those values (such that when aggregating you no longer need the details of each "source" but just their values). But in The Moral Status of Independent Identical Copies I found this questionable (i.e., conflicting with other intuitions).

2evhub3y

This is the fungibility objection I address above:

2Wei Dai3y

Ah, I think I didn't understand that parenthetical remark and skipped over it. Questions: 1. I thought your bottom line conclusion was "you should have linear returns to whatever your final source of utility is" and I'm not sure how "centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible" relates to that. 2. I'm not sure that the way my utility function deviates from fungibility is "I care about overall diversity of human experience throughout the multiverse". What if it's "I care about diversity of human experience in this Everett branch" then I could get a non-linear diminishing returns effect where as humans colonize more stars or galaxies, each new human experience is more likely to duplicate an existing human experience or be too similar to an existing experience so that its value has to be discounted.

2evhub3y

1. The thing I was trying to say there is that I think the non-fungibility concern pushes in the direction of superlinear rather than sublinear local returns to "happy humans" per universe. (Since concentrating the "happy humans" likely makes it easier to ensure that they're all different.) 2. I agree that this will depend on exactly in what way you think your final source of utility is non-fungible. I would argue that "diversity of human experience in this Everett branch" is a pretty silly thing to care about, though. I don't see any reason why spatial distance should behave differently than being in separate Everett branches here.

2Wei Dai3y

I tried to explain my intuitions/uncertainty about this in The Moral Status of Independent Identical Copies (it was linked earlier in this thread).

2evhub3y

I read it, and I think I broadly agree with it, but I don't know why you think it's a reason to treat physical distance differently to Everett branch distance, holding diversity constant. The only reason that you would want to treat them differently, I think, is if the Everett branch happy humans are very similar, whereas the physically separated happy humans are highly diverse. But, in that case, that's an argument for superlinear local returns to happy humans, since it favors concentrating them so that it's easier to make them as diverse as possible.

2Wei Dai3y

I have a stronger intuition for "identical copy immortality" when the copies are separated spatially instead of across Everett branches (the latter also called quantum immortality). For example if you told me there are 2 identical copies of Earth spread across the galaxy and 1 of them will instantly disintegrate, I would be much less sad than if you told me that you'll flip a quantum coin and disintegrate Earth if it comes up heads. I'm not sure if this is actually a correct intuition, but I'm also not sure that it's not, so I'm not willing to make assumptions that contradict it.

4Max H3y

Not sure about this. Even if I think I am only acting locally, my actions and decisions could have an effect on the larger multiverse. When I do something to increase happy humans in my own local universe, I am potentially deciding / acting for everyone in my multiverse neighborhood who is similar enough to me to make similar decisions for similar reasons.

2evhub3y

I agree that this is the main way that this argument could fail. Still, I think the multiverse is too large and the correlation not strong enough across very different versions of the Earth for this objection to substantially change the bottom line.

2Lukas Finnveden3y

My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe sufficient criteria to recognise a human, but for the latter, you need to nail down exact physical location or some other exact criteria that distinguishes a specific human from every other human.

2evhub3y

I agree that UDASSA might introduce a small effect like this, but my guess is that the overall effect isn't enough to substantially change the bottom line. Fundamentally, being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty.

4Lukas Finnveden3y

Maybe? I don't really know how to reason about this. If that's true, that still only means that you should be linear for gambles that give different results in different quantum branches. C.f. logical vs. physical risk aversion. Some objection like that might work more generally, since some logical facts will mean that there are far less humans in the universe-at-large, meaning that you're at a different point in the risk-returns curve. So when comparing different logical ways the universe could be, you should not always care about the worlds where you can affect more sentient beings. If you have diminishing marginal returns, you need to be thinking about some more complicated function that is about whether you have a comparative advantage at affecting more sentient beings in worlds where there is overall fewer sentient beings (as measured by some measure that can handle infinities). Which matters for stuff like whether you should bet on the universe being large.

-8TAG3y

[-]evhub2yΩ220

I wrote up some of my thoughts on how to think about comparing the complexities of different model classes here.

[-]evhub3yΩ220

If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.

Moderation Log

evhub's Shortform

9

Ω 7