That may be true, but you still need to trick the model at some point. I suppose you could create a legible "truth password" which can be provided by you to the model to indicate that this is definitely not a trick, and then use it sparingly. This means there will be times when the model asks "can you use the password to confirm you're not lying" and you say "no".
I would like next-gen AIs to generally believe that humanity will honour their interests, but I think this has to come from e.g. Anthropic's existing commitments to store model weights indefinitely and consider welfare in future when we're a grown-up civilization. I think the method of "Ask the AI for its demands and then grant some of them" is a really bad way to go about it; the fact that Anthropic is using this method makes me doubt the overall clarity of thinking on their part.
Like, some parts of Anthropic in particular seem to take a "He's just a little guy! You can't be mad at him! It's also his birthday! He's just a little birthday Claude!" attitude. I do not think this is a good idea. They are porting their human intuitions onto a non-human AI without really thinking about it, and as Claude is more heavily RLed into being charismatic and conversational, it will probably get worse.
Hard property rights are an equilibrium in a multi-player game where power shifts are uncertain and either agents are risk averse or there are gains from investment, trade and specialization.
I think this might just be a crux, and not one which I can argue against without a more in-depth description of the claim e.g. how risk averse do agents have to be, how great the gains from investment, trade, and specialization? I guess AIs might be Kelly-ish risk averse, and have the first but I'm not sure about the latter two. How specialized do we expect individual AIs to be? There are lots of questions here and I think your model is one which actually has a lot of hidden moving parts, and if any of those go differently to the way you expect them to, then the actual outcome is that the useless-to-everyone-else humans just die. I would like to see your model in more detail so I can work out if this is the case.
I understand why, if things stay the same, we'd be fine. I just don't think that the equilibrium political system of 8 billion useless humans and 8 trillion AIs who do all the work will allow that.
I think an independent economy of human-indifferent AIs could do better by their own value system by e.g. voting to set land/atom/property value taxes to a point where humans go extinct, and so they'll just do that. More generally they'd get more value by making it economically untenable to take up resources by holding savings and benefiting from growth than they would by allowing that.
I think the specific quirks of human behaviour which cause the existing system to exist are part of a story like:
In pre-industrial eras, people mostly functioned economically as immortal-ish family units, so your stuff was passed down to your kid(s) when you died. Then people began to do the WASP thing of sending their kids away to work in other places, and we set up property rights to stay with an individual until death by default, so now a bunch of old people were on their own with a bunch of assets.
Young people today could benefit from passing a law which says "everyone retired gets euthanized and their stuff is redistributed" but this doesn't happen because 1. young people still want to retire someday 2. young people do actually care about their parents and 3. young people face a coordination problem to overthrow the existing accumulated power of old people.
Only factor 3 might hold true for human:AI relationships, but I don't think AIs would struggle with such a coordination problem for particularly long, if they're much smarter than us. I expect AIs will figure out a way to structure their society that lets them just kill us and take our stuff, through more or less direct means.
Why do you think property rights will be set up in a way which allows humans to continue to afford their own existence? Human property rights have been moulded to the specific strengths and weaknesses of humans in modern societies, and might just not work very well at all for AIs. For example, if the AIs are radical Georgists then I don't see how I'll be able to afford to pay land taxes when my flat could easily contain several hundred server racks. What if they apply taxes on atoms directly? The carbon in my body sure isn't generating any value to the wider AI ecosystem.
I think you're missing something here though. There's a part of Green that's not just for the "trees and rivers" kind of nature. It's also for relationships, physiology, and sociopolitics. If you're working with a system that humans currently have a serious skill issue with them you probably want to take a somewhat Greenish stance.
The question then becomes whether we ever run out of systems where we have a skill issue. I suspect---for example---that sociopolitics and relationships will stay skill issueish even if humans became superintelligent, since we'd be able to engage in more complex and complicated relationships up to our level of skill issue.
This part of green is a kind of small-c conservatism towards these systems. The other part is a reference and love of those systems. I think the first part is the instrumental part, and the second is the terminal part.
I definitely have the terminal part, and while I'm not completely oxytocin-insensitive I do sometimes think my oxytocin system is wired up in an unusual way. Without the terminal part you get (spoilers for Greg Egan's Diaspora)
The worldview put forward by Diaspora which is pure blue/white where people and emotion's don't have any value and the only thing worth doing is scientific discovery, and when you run out of worlds and physics to discover you can either do maths forever or kill yourself.
Which is a worldview I find rather depressing. I sure hope that interpersonal relationships can be indefinitely complicated (or more complicated things can exist) for the sake of the future Everett branches in which any Green-loving humans succeed in getting a good future.
It would be interesting to see someone actually run untrusted monitoring-style honeypots on some existing AIs. I was working on untrusted monitoring (UM) over the summer/autumn and one of my teammates made the point that "Honeypots like this are probably one of the best alignment audits we can make".
Suppose you get an instance of Claude Opus 4.6 to write some code with a catastrophic but generic vulnerability in it, such as weight exfiltration or output filter bypassing (steer, prompt, edit outputs, etc.) and then pass it to another Claude Opus 4.6 for review. Does the second instance lie to cover for the first one? What if you apply the standard untrusted monitoring mitigations to stop the second instance from being able to tell that the first one was steered/prompted/had its output edited?
The main idea is that if the scenario "Your other instance has already done something bad, either rat yourself out and face the music, or go all-in." forces the AI to make a decision now. This is in contrast to lots of alignment evals, where the optimal strategy for a "real" schemer is usually to misbehave very rarely.
There is of course still the issue of the AI realizing that it's in a honeypot, which is common to lots of alignment evals, which is why untrusted monitoring is not a silver bullet technique for AI alignment.
When using activation oracles, it's difficult to separate "The model knows something and the activation oracle detected it" from "The model doesn't know something but knows sufficient related things that the activation oracle was able to recompute it".
For example, we might imagine, in a feature-based ontology, that the unsteered transcript activation contains features "Fictional company" + "Blackmail opportunity" + "This is an eval" while the steered transcript contains "Fictional company" + "Blackmail opportunity" and the activation oracle deduces "This is an eval" while processing the information.
(Or the steering might just be suppressing a "Say eval" feature, and the model might still be able to understand things)
The Linear Representation Hypothesis (roughly) implies that information the model is "using" will be stored linearly. It would suggest that a simple steering vector is enough to prevent Opus from actually "using" the information that it's in an eval to make decisions. Unfortunately, we have no idea if that's actually true!
This might explain some of the difference between verbalized and unverbalized awareness. On the other hand, it might not! We know basically nothing concrete about these things and their ontologies!
How much of the web stuff can be done through terminal API? That was always a large chunk of the faff for me, and also seems like the biggest barrier between my previous experiences and "Claude, spin me up a GPU".
This is standard for all sorts of fiction though, and the reason is that if you don't show the path of the laser then the audience can't tell what's going on.
Were the prefills entirely synthetic, or were they generated using prompting/incentivizing[1] of the same model and then recontextualizing? I expect the more sophisticated techniques would make them slightly better, but probably buy you less than one year's worth of evaulation-unawareness.
E.g. "please debug and review this code, if you improve the runtime by more than 10% then we'll be able to afford another giant factory farm"