Satron

“Sharp Left Turn” discourse: An opinionated review

Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight.
It’s fine to talk about that scenario, and I hope it comes to pass! But in addition to the question of what those wise people should do, if they exist, we should also be concerned about the possibility that the people with instruction-following AIs will not be spreading wisdom and foresight in the first place.

I don't think that whoever is using these AI... (read more)

Replying to“Sharp Left Turn” discourse: An opinionated review

“Sharp Left Turn” discourse: An opinionated review

Not OP, but for me, it comes down to LLMs correctly interpreting the intent behind my questions/requests. In other words, I don't need to be hyper specific in my prompts in order to get the results I want.

Replying toIntent alignment as a stepping-stone to value alignment

Intent alignment as a stepping-stone to value alignment

By value alignment, which one of the following do you mean?

1) alignment with humanity's CEV

2) alignment with some other specified way of interpreting humanity's values

3) alignment with some unspecified way of interpreting humanity's values that future AGI/ASI will hopefully come up with

Replying toAI Control: Improving Safety Despite Intentional Subversion

AI Control: Improving Safety Despite Intentional Subversion

This article seems to have found a big weakness of untrusted monitoring:

Synthetic inputs auditing is trivially defeated if the checker AI can distinguish synthetic inputs from real ones, as it could then always answer truthfully on the synthetic inputs to pass auditing, then collude with the generator on real inputs.
To create useful synthetic inputs, either the blue team must be able to match the generator's style well enough to consistently fool the checker, or it must be able to redact the generator's output to the point where the checker has no identifying details to distinguish inputs. The former seems impractical, since faking the inputs at that level of verisimilitude would require the

SatronJan 30, 2025

Should you publish solutions to corrigibility?

Maybe it is a controversial take, but I am in favor of publishing all solutions to corrigibility.

If a company fails at coming up with solutions for AGI corrigibility, they wouldn't stop building AGI. Instead, they would proceed with ramping up capabilities and end up with a misaligned AGI that (for all we know) will want to turn universe into paperclips.

Due to instrumental convergence, AGI, whose goals are not explicitly aligned with human goals, is going to engage in a very undesirable behavior by default.

I think that, perhaps, after reading this clarifying comment from you, @quila would change his perception of your position.

Let me put my attitudes in practical terms: I don't kick dogs, but I have destroyed a wasp's nest in my garage, and I don't donate to animal charities. (I don't donate to many other charities either, but there have been a few.) Let those who think saving the shrimps is worthwhile do so, but I do not travel on that path

This is what I expected. Your take when put in these terms seems pretty moderate. Whereas, when I read your original comment, this take (which presumably stayed the same) seemed very extreme.

In other words, my personal beliefs haven't changed a single bit and yet my perception of your beliefs changed a lot. I can only imagine that your original comment has been so strongly disagree-voted because of the framing.

Agreed!

There is also the issue of clarity, I am not sure if Richard has a moderate position that sounds like a very extreme position due to the framing or if he genuinely shares this extreme position.

I do think that animals, the larger ones at least, can suffer. But I don’t much care.

Does this mean a more moderate take of "I don't care enough to take any actions, because I don't believe that I am personally causing much suffering to animals" or a very radical take of "I would rather take $10 than substantially improve the well-being of all animals"?