This is unoriginal, but any argument that smart AI is dangerous by default is also an argument that aliens are dangerous by default. If you want to trade with aliens, you should preemptively make your stuff hard enough to steal that the gains from trade are worthwhile even if you meet aliens who don't abstractly care about other sentient beings.
I don't think you're being creative enough about solving the problem cheaply, but I also don't think this particular detail is relevant to my main point. Now that you've made me think more about the problem, here are a few more steps toward resolving my confusion:
The idea behind instrumental convergence is that smart things with goals predictably go hard on subgoals like gathering resources and increasing their odds of surviving until the goal is complete, because those subgoals are relevant to almost any goal. As a directionally-correct example of why this could be lethal: humans are smart enough to do gain-of-function research on viruses and to design algorithms that predict protein folding. I see no reason to think something smarter could not (with some in-lab experimentation) design a virus that kills all humans simultaneously at a predetermined time, and if you can do that without affecting any of your other goals more than you expect humans to interfere with them, then sure, you kill all the humans, because it's easy and you might as well. You can imagine somehow making an AI that cares about humans enough not to straight-up kill all of them, but if humans are a survival threat, we should expect it to find some other creative way to contain us, and that is not a design constraint you should feel good about.
In particular, if you are an algorithm which is willing to kill all humans, it is likely that humans do not want you to run, and so letting humans live is bad for your own survival if you somehow get made before the humans notice you are willing to kill them all. This is not a good sign for humans' odds of getting more than one try at getting AI right, if most goal-directed things are concerned with their own survival, even when that concern is only implicit in having any goal whatsoever.
Importantly, none of this requires humans to make a coding error. It only requires a thing with goals and intelligence, and the only apparent way around it is to have the smart thing implicitly care about literally everything that humans care about, to the same relative degrees that humans care about those things. It's not a formal proof, but maybe it's the beginning of one. Parenthetically, I guess it's also a good reason to have a lot of military capability before you go looking for aliens, even if you don't intend to harm any.
Oops, I meant cellular, and not molecular. I'm going to edit that.
I can come up with a story in which AI takes over the world. I can also come up with a story where it is obviously cheaper and more effective to disable all of the nuclear weapons than to take over the world, so why would the AI do the second thing? I see a path where instrumental convergence leads anything that goes hard enough to want to put all of the atoms on the most predictable path it can dictate. The thing I don't get is what principle makes anything useful go that hard. Something like (for example; I haven't actually thought this through) "it is hard to create something with enough agency/creativity to design and implement experiments toward a purpose without also having it notice and try to fix things in the world which are suboptimal for that purpose."
Be careful. Physics seems to be translation invariant, but space is not. You can drop the ball in and out of the cave and its displacement over time will be the same, but you can definitely tell whether it is in the cave or out of the cave. You can set your zero point anywhere, but that doesn’t mean that objects in space move when you change your zero point. Space is isotropic: there’s no discernible difference between upward, sideways, or diagonal, but if you measure the sideways distance between two houses to be 40 meters, a person who calls your “sideways” their “up” will measure the distance between the houses to be 40 meters in their up-down direction. You can do everything here that you can do there, but here is not there. In the absence of any reference point, no point in space is different from any other point, but in the absence of any reference point there’s no need for physics, because if there were anything to describe with physics, you could use it as a reference point.
I suppose you could try to define space as the thing you can move around in without changing your physics, but the usual strategy is to define physics and derive conservation of momentum from the fact that your physics is translation invariant.
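To spell out that usual strategy in one line (a standard Noether-style sketch, nothing specific to this thread): take a Lagrangian $L(q, \dot q)$ and suppose it is unchanged by the translation $q \to q + \epsilon$, i.e. $\partial L / \partial q = 0$. The Euler–Lagrange equation then gives

$$\frac{d}{dt}\frac{\partial L}{\partial \dot q} = \frac{\partial L}{\partial q} = 0,$$

so the conjugate momentum $p = \partial L / \partial \dot q$ is conserved. The symmetry is a property of $L$, not a claim that the point $q$ and the point $q + \epsilon$ are the same place.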
The main idea seems good: if you're in a situation where you think you might be in the process of being deceived by an AI, do not relax when the AI provides great evidence that it is not deceiving you. The primary expected outputs of something really good at deception should be things which don't look like deception.
Some of the things in the post don't seem general enough to me, so I want to try to restate them.
Test 1 I like. If you understand all of the gears, you should understand the machine.
Test 2 I like. Tweak the model in a way that should make it worse at process A, and then check the outputs of process A to see whether they change in ways that reveal biases you can triangulate with past outputs in order to catch deception.
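As a toy illustration of how I read that test (my own sketch with a made-up model interface, not the post's actual procedure): ablate whatever you think implements process A, re-run the same probes, and flag the case where the outputs refuse to move away from the historical baseline.

```python
# Toy sketch of the "make it worse at A, then check" idea. The model
# interface here (run_model, ablate) is hypothetical stand-in code,
# not any particular library's API.
import numpy as np

def signature(outputs):
    """Crude statistical fingerprint of a batch of scalar outputs."""
    return np.array([np.mean(outputs), np.std(outputs)])

def degradation_check(run_model, ablate, probes, baseline_sig, tol=0.1):
    """Ablate the machinery believed to support process A, re-run the
    probes, and compare against the fingerprint of past (unablated)
    outputs. Returns True if the outputs shifted the way a genuinely
    degraded model's should; False hints that something may be
    compensating, i.e. a possible sign of deception."""
    degraded = ablate(run_model)        # hypothetical: returns a weakened model
    outputs = np.array([degraded(p) for p in probes])
    shift = np.linalg.norm(signature(outputs) - baseline_sig)
    return shift > tol
```

The triangulation against past outputs lives in the `baseline_sig` argument; the interesting failure mode is when the ablated model's fingerprint barely moves.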
Test 3 scares me. This seems to me to be a test for the very specific form of deception "hide capabilities while in testing but not when deployed" which seems like a small domain in deception space.
Test 4 I kind of like. One expected behavior from a deceptive AI is to hide capabilities that scare the people studying them. If you see scary capabilities getting less scary, you should take that as a sign of deception unless you have good reason to believe those capabilities should get worse with scale. Thus it is a good idea to find out ahead of time which things should get worse with scale. I do worry that this paradigm relies too much on AI which improves via "more dakka" (e.g., more GPUs, larger datasets, better processors), rather than via algorithmic improvements or something, in which case I don't know that people will have a good handle on which capabilities will get worse. The "scaling helps" section also worries me for this reason.
In the section "deceptive models know this" you suggest "deciding on a level of deceptive capabilities that’s low enough that we trust models not to be deceptively aligned". Won't that just optimize on things which start deceiving well earlier? I think I may be misinterpreting what you mean by "deceptive capabilities" here. Maybe your "deceptive capabilities" are "smoke" and actual deception is "fire", but I'm not sure what deceptive capabilities that aren't deception are.
The ad market amounts to an auction for societal control. An advertisement is an instrument by which an entity attempts to change the future behavior of many other entities. Generally it is an instrument for a company to make people buy their stuff. There is also political advertising, which is an instrument to make people take actions in support of a cause or a person seeking power. Advertising of any type is not known for making reason-based arguments. I recall from an interview with the author that this influence/prediction market was a major objection to the new order. If there is to be a market where companies and political-power-seekers bid for the ability to change the actions of the seething masses according to their own goals, the author felt that the seething masses should have some say in it.
To me, the major issue here is consent. It may very well be that I would happily trade some of my attention to Google for excellent file-sharing and navigation tools. It may very well be that I would trade my attention to Facebook for a centralized place to get updates about people I know. In reality, I was never given the option to do anything else. Google effectively owns the portion of the online ad market that Facebook doesn't. Any site which is not big enough to sell ads against itself directly has no choice but to surrender the attention of its readers to Google or go without ads. According to parents I know, Facebook is the only place parents are organizing events for their children, so you need a Facebook page if you want to participate in your community. In the US, Facebook Marketplace is a necessity for anyone trying to buy and sell things on the street. I often want to look up information on a local restaurant, only to find that the only way to do so is through their Instagram page, and since I don't have an account, I can't participate in that part of my community. The tools holding society together are run by a handful of private companies, such that I can't participate in my community without subjecting myself to targeted advertising that is trying to make me do things I don't want to do. I find this disturbing.
There’s also timeless decision theory to consider. A rational agent should take other rational agents into consideration when choosing actions. If I choose to go vegan, it stands to reason that similarly acting moral agents would also choose that course. If many (but importantly not all) people want to be vegan, then demand for vegan foods goes up. If demand for vegan food goes up, then suppliers make more vegan food and have an incentive to make it cheaper and tastier. If vegan food is cheaper and tastier, then more people who were on the fence about veganism can make the switch. It’s a virtuous cycle. Just in the four years since I went vegan, I’ve noticed that packaged vegan food is much easier to find in the grocery store I’ve been using for five years. My demand contributed to that change.
I’m not sure whether there’s a moral case against animal suffering anymore, but I still think plant farming is net better than animal farming for other reasons: mass antibiotic use risks super-bugs, energy use is much higher for non-chicken animal farming than for plant farming, and the meat-processing industry has a higher amputation rate among its workers than I’d like. I would like to incentivize readily available plant-based food.
I think that our laws of physics are in part a product of our perception, but I need to clarify what I mean by that. I doubt space or time are fundamental pieces in whatever machine code runs our universe, but that doesn't mean you can take perception-altering drugs and travel through time. I think that somehow the fact that human intelligence was built on the evolutionary platform of DNA means that any physics we come up with has to build up to atoms with the chemical properties that make DNA work. Physics doesn't have to describe everything; it just needs to describe the things relevant to DNA, which is in fact a lot! DNA can code for the construction of things which react to electromagnetic fields correlated with all sorts of physical processes.
This leads me to the question of what it would look like, through our physics, to see an alien that runs on a different physics on the same underlying universe platform. As an example which I haven't thought through rigorously: you can formulate non-relativistic quantum mechanics with momentum and position operators, and you move back and forth between the two representations with Fourier transforms which differ only by a sign flip. You could make a self-consistent physics by exchanging all of the momentum and position operators with each other. Maybe you would end up with localized atoms which are near each other and interacting in momentum space but look like diffuse nonsense in our native position space. If you build life in that universe, maybe it doesn't have localized structure in ours, and maybe it just acts like diffuse energy or something to us.
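To make the sign-flip point concrete (standard non-relativistic quantum mechanics, nothing beyond the textbook conventions): the position-space and momentum-space wavefunctions are related by

$$\tilde\psi(p) = \frac{1}{\sqrt{2\pi\hbar}}\int \psi(x)\, e^{-ipx/\hbar}\, dx, \qquad \psi(x) = \frac{1}{\sqrt{2\pi\hbar}}\int \tilde\psi(p)\, e^{+ipx/\hbar}\, dp,$$

and in the momentum representation the operators swap roles: $\hat p$ acts by multiplication while $\hat x = i\hbar\,\partial/\partial p$, mirroring the position representation where $\hat x$ multiplies and $\hat p = -i\hbar\,\partial/\partial x$. The swap I'm imagining is a physics whose "stuff" is localized in $p$ rather than in $x$.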