Always happy to chat!
To be clear, I want models to care about humans! I think part of having "generally reasonable values" is models sharing the basic empathy and caring that humans have for each other.
It is more that I want models to defer to humans, and to fall back on arguing from principles such as "loving humanity" only when there is a gap or ambiguity in the specification or in the intent behind it. This is similar to judges: if a law is very clear, there is no question of misinterpreting the intent or of contradicting higher laws (i.e., constitutions), and so they have no room for interpretation. They can sometimes argue based on "natural law", but only in extreme circumstances where the law is unspecified.
One way to think about it is as follows: as humans, we sometimes engage in "civil disobedience", where we break the law based on our own understanding of higher moral values. I do not want to grant AI the same privilege. If it is given very clear instructions, then it should follow them. If the instructions are not very clear, if there is a conflict, or if we are in a situation not foreseen by the authors of the instructions, then AIs should use moral intuitions to guide them. In such cases there may not be a single solution (e.g., a conservative and a liberal judge may not agree), but there is a spectrum of solutions that are "reasonable", and the AI should pick one of them. But AI should not engage in "jury nullification".
To be sure, I think it is good that in our world people sometimes disobey commands or break the law based on an appeal to higher values. For this reason, we may well stipulate that certain jobs must have humans in charge. Just as I don't think that professional philosophers or ethicists necessarily have better judgement than random people from the Boston phonebook, I don't see making moral decisions as the area where the superior intelligence of AI gives it a competitive advantage, and I think we can leave that to humans.
Since I am not a bio expert, it is very hard for me to argue about these types of hypothetical scenarios. I am not even sure that intelligence is the bottleneck here, whether on the defense or the offense side.
I agree that killing 90% of people is not very reassuring; this was more a general point about why I expect the effort-to-damage curve to be a sigmoid rather than a straight line.
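As an illustration (the functional form and parameters here are my own sketch, not something derived from data), such a sigmoid could be written as

$$D(e) = \frac{D_{\max}}{1 + e^{-k(e - e_0)}}$$

where $e$ is the attacker's effort, $D_{\max}$ is the maximum achievable damage, $e_0$ is the inflection point, and $k$ controls the steepness. The point is that marginal damage per unit of effort eventually saturates rather than growing without bound.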
See my response to ryan_greenblatt (I don't know how to link comments here). Your claim is that the defense/offense ratio is infinite. I don't see why that would be the case.
Crucially, I am not saying that we are guaranteed to end up in a good place, or that superhuman unaligned ASIs cannot destroy the world. Just that if they are completely dominated (so not like the nuclear ratio of the US and Russia, but more like the US and North Korea), then we should be able to keep them at bay.
I like to use concrete examples of things that already exist in the world, but I believe the notion of detection vs. prevention holds more broadly than API misuse.
But it may well be the case that we have different worldviews! In particular, I am not thinking of detection as being important because it would change policies, but rather that a certain amount of detection would always be necessary, particularly in a world in which some AIs are aligned and some fraction of them (hopefully very small) are misaligned.
These are all good points! This is not an easy problem. And generally I agree that, for many reasons, we don't want a world where all power is concentrated in one entity - antitrust laws exist for a reason!
I am not sure I agree with the last point. I think, as mentioned, that alignment is going to be crucial for the usefulness of AIs, and so the economic incentives would actually be to spend more on alignment.
I think that:
1. Being able to design a chemical weapon with probability at least 50% is a capability.
2. Following instructions to never design a chemical weapon, with probability at least 99.999%, is also a capability (both can be phrased in the same probabilistic form; see the sketch after this list).
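To make the symmetry explicit (the notation here is mine and purely illustrative), both are claims of the form $\Pr[\text{behavior} \mid \text{condition}] \ge \tau$:

$$\Pr[\text{produces a viable design} \mid \text{asked to}] \ge 0.5, \qquad \Pr[\text{refuses} \mid \text{instructed never to}] \ge 0.99999.$$

They differ only in the target behavior and the threshold $\tau$, which is part of why I see reliably following such instructions as a capability like any other.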
I prefer to avoid terms such as "pretending" or "faking", and try to define these more precisely.
As mentioned, a decent definition of alignment is following both the spirit and the letter of human-written specifications. Under this definition, "faking" would be the case where AIs follow these specifications reliably when we are testing, but deviate from them when they can determine that no one is looking. This is closely related to the question of robustness, and I agree it is very important. As I write elsewhere, interpretability may be helpful but I don't think it is a necessary condition.
To be clear, I think that embedding human values is part of the solution - see my comment.