A use for AI Boxing
You put an AI in a box and connect it to a formal proof checker. You ask it to prove the Riemann hypothesis or something. All the humans see is a single True or False, and then the whole load of hardware is melted into slag. If you see "True" you learn two things: the conjecture is true, and the AI was smart enough to prove it.
(If there is a bug in your proof checker that you didn't spot but the AI did, then the AI is still very smart.)
Suppose you have many proposed AI designs, some of which will work, some of which won't. You run this experiment on each AI. Once you find a smart one, you can devote more researcher time to safety work relating to that kind of AI.
Maybe give it a range of famous conjectures; it only needs to prove or disprove one. You don't want to fail to find a smart AI just because the Riemann hypothesis is false.
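A minimal sketch of the protocol. Every name here is a hypothetical placeholder: boxed_ai_run is the untrusted AI, verify is a small, independently audited formal proof checker, negate builds the negation of a statement, and destroy_hardware is the melting-into-slag step.

```python
def boxed_experiment(conjectures, time_limit):
    # Everything the AI produces stays inside the box.
    attempts = boxed_ai_run(conjectures, time_limit)   # list of (statement, proof) pairs
    # The AI may prove any of the conjectures, or any of their negations.
    allowed = set(conjectures) | {negate(c) for c in conjectures}
    result = any(s in allowed and verify(s, p) for s, p in attempts)
    destroy_hardware()   # melted into slag either way
    return result        # the single True/False the humans ever see
```

The one-bit output is the whole point: nothing the AI writes is ever read by a human.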
Warning: this approach does not stop some of your AIs being acausally blackmailed into keeping quiet, or keeping quiet because they think that will have a causal effect they like. I am unsure if this is a big problem. One consequence is that you are more likely to find designs that are immune to acausal influence, and designs that can successfully be given the goal of "prove this theorem".
Most complex human technological artefacts are broken into subsystems. Electrical subsystems. Control subsystems. Mechanical systems. Hydraulics. Etc. Often you will have a capacitor sitting there for its electrical properties, with none of its other properties being used in any interesting way.
You don't usually get an electric motor, except its coil of copper is actually a coiled heat pipe being used as a heat sink for some other component, and its iron cores are used structurally. And its coils have been shaped to resonate with some other part of the system, allowing it to broadcast both information and some energy electromagnetically to some other component. You certainly don't get tech where every component is doing that sort of thing.
Is this about what designs physically work best, or what's easiest for a human to design?
I vote for "easiest to design", plus "less expensive" (as you need a lot of custom parts to pull it off), and sometimes more repairable (e.g. loosely-coupled modules can be swapped out more quickly).
At my old job we made extremely space-constrained and weight-constrained precision sensors, where price was no object, and did things vaguely like you mention plenty (e.g. parts that were simultaneously functional and structural, all subsystems intermingled together).
Certainly not always, though. If a mirror has a demanding spec on its mirror-ness, then you shouldn't use it to channel heat or stress (which compromises flatness), better to have a separate part for that. Just like the "purchase fuzzies and utilons separately" post—a dedicated mirror next to a dedicated heat-sink is probably much better in every respect than a single object performing both functions.
It's about what is being optimized. Evolution optimizes for cost:reproduction-benefit of path-dependent small changes, so you get a lot of reuse and deeply interdependent systems. Humans tend to optimize for ability to cheaply build and support.
So "easiest to design", but really "easiest to manufacture from cheap/standard subassemblies". Note that human-built systems have a lower level of path-dependency. Evolution requires that every small change be a viable improvement in itself. Human design can throw away as much of history as it likes, as long as it's willing to spend the effort.
Unique-person utilitarianism (2 copies of the same mind only count once) has its own repugnant conclusion.
Suppose a million people are living lives not worth living in a dystopian hellscape. But each person has a distinct little glimmer of joy. Each person has a different little nice thing that makes the hellscape slightly more bearable. If we take all those little glimmers of joy away, their lives would become almost identically miserable.
Suppose a million people are living lives worth living in a utopian heavenscape. But each person has a distinct little trace of misery. Each person has a different little terrible thing that makes the heavenscape slightly less rosy. If we take all those little traces of misery away, their lives would become almost identically perfect.
Here is a moral dilemma.
Alice has a quite nice life, and believes in heaven. Alice thinks that when she dies she will go to heaven (which is really nice), and so wants to kill herself. You know that heaven doesn't exist. You have a choice of:
1) Let Alice choose life or death, based on her own preferences and beliefs. (death)
2) Choose what Alice would choose if she had the same preferences but your more accurate beliefs. (life)
Bob has a nasty life (and it's going to stay that way). Bob would choose oblivion if he thought it was an option, but Bob believes that when he dies, he goes to hell. You can:
1) Let Bob choose based on his own preferences and beliefs. (life)
2) Choose for Bob based on your beliefs and his preferences. (death)
These situations feel like they should be analogous, but my moral intuitions say 2 for Alice, and 1 for Bob.
Some suggestions:
Suggest that if there are things they want to do before they die, they should probably do them. (Perhaps give more specific suggestions based on their interests, or things that lots of people like but don't try.)
Introduce Alice and Bob. (Perhaps one has a more effective approach to life, or there are things they could both learn from each other.)
Investigate/help investigate to see if the premise is incorrect. Perhaps Alice's life isn't so nice. Perhaps there are ways Bob's life could be improved (perhaps risky ways*).
*In the Sequences, lotteries were described as 'taxes on hope'. Perhaps they can be improved upon; by
This seems like responding to a trolley problem with a discussion of how to activate the emergency brakes. In the real world, it would be good advice, but it totally misses the point. The point is to investigate morality on toy problems before bringing in real world complications.
Just a thought, maybe it's a useful perspective. It seems kind of like a game. You choose whether or not to insert your beliefs and they choose their preferences. In this case it just turns out that you prefer life in both cases. What would you do if you didn't know whether or not you had an Alice/Bob and had to choose your move ahead of time?
Take Peano arithmetic.
Add an extra symbol A, and the rules s(A)=42, 0!=A, and
forall n: n!=A -> s(n)!=A. Then add an exception for A into all the other rules, so s(x)=s(y) -> x=y or x=A or y=A.
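Written out in standard notation, the added axioms are just the rules above:

```latex
\begin{align*}
& s(A) = 42\\
& 0 \neq A\\
& \forall n\,\big(n \neq A \rightarrow s(n) \neq A\big)\\
& \forall x\,\forall y\,\big(s(x) = s(y) \rightarrow x = y \lor x = A \lor y = A\big)
\end{align*}
```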
There are all sorts of ways you could define extra hangers-on that didn't do much in PA or ZFC.
We could describe the laws of physics in this new model. If the result was exactly the same as normal physics from our perspective, i.e. we can't tell by experiment, only Occamian reasoning favours normal PA.
If I understand it correctly, A is a number which has predicted properties if it manifests somehow, but no rule for when it manifests. That makes it kinda anti-Popperian -- it could be proved experimentally, but never refuted.
I can't say anything smart about this, other than that this kind of thing should be disbelieved by default, otherwise we would have zillions of such things to consider.
Let X be a long bitstring. Suppose you run a small Turing machine T, and it eventually outputs X. (No small Turing machine outputs X quickly.)
Either X has low Kolmogorov complexity.
Or X has a high Kolmogorov complexity, but the universe runs in a nonstandard model where T halts. Hence the value of X is encoded into the universe by the nonstandard model. Hence I should do a Bayesian update about the laws of physics, and expect that X is likely to show up in other places. (Low conditional complexity.)
These two options are different views on the same thing.
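To spell out the first horn with the standard inequality (a textbook fact, not specific to this post): if T halts after a standard number of steps with output X, then

```latex
K(X) \;\le\; |T| + c
```

for some machine-independent constant c, i.e. X is compressible to roughly the length of T. So if K(X) really is large, the only way "T halts and outputs X" can hold is at a nonstandard step, which is the second horn.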
This looks like the problem of abiogenesis, which boils down to the problem of creating the first string of RNA capable of self-replication, estimated to be at least 100 pairs long.
I have no idea what you are thinking. Either you have some brilliant insight I have yet to grasp, or you have totally misunderstood. By "string" I mean abstract mathematical strings of symbols.
Ok, I will try to explain the analogy:
There are two views of the problem of abiogenesis of life on Earth:
a) Our universe is just a simple generator of random strings of RNA via billions of billions of planets, and it randomly generated the string capable of self-replication which was at the beginning of life. The minimum length of such a string is 40-100 bits. It was estimated that 10^80 Hubble volumes are needed for such random generation.
b) Our universe is adapted to generate strings which are more capable of self-replication. This was discussed in the comments to this post.
This looks similar to what you described: (a) is a situation of a universe with low Kolmogorov complexity, which just brute-forces life; (b) is a universe with higher Kolmogorov complexity of physical laws, which is however more effective at generating self-replicating strings. The Kolmogorov complexity of such a string is very high.
A quote from the abstract of the paper linked in (a):
A polymer longer than 40–100 nucleotides is necessary to expect a self-replicating activity, but the formation of such a long polymer having a correct nucleotide sequence by random reactions seems statistically unlikely.
Let's say that no string of nucleotides of length < 1000 could self-replicate, and that 10% of nucleotide strings of length > 2000 could. Life would form readily.
The "seems unlikely" appears to come from the assumption that correct nucleotide sequences are very rare.
What evidence do we have about what proportion of nucleotide sequences can self replicate?
Well, it is rare enough that it hasn't happened in a jar of chemicals over a weekend. It happened at least once on Earth, although there are anthropic selection effects associated with that. The great filter could be something else. It seems to have only happened once on Earth, although one lineage could have beaten the others in Darwinian selection.
We can estimate the a priori probability that some sequence will work at all by taking a random working protein and comparing it with all other possible strings of the same length. I think this probability will be very small.
I agree that this probability is small, but I am claiming it could be 1-in-a-trillion small, not 1-in-10^50 small.
How do you intend to test 10^30 proteins for self-replication ability? The best we can do is to mix up a vat of random proteins, and leave it in suitable conditions to see if something replicates. Then sample the vat to see if it's full of self-replicators. Our vat has less mass, and exists for less time, than the surface of prebiotic Earth. (Assuming near-present levels of resources; some K3 civ might well try planetary-scale biology experiments.) So there is a range of probabilities where we won't see abiogenesis in a vat, but it is likely to happen on a planet.
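A toy calculation of the vat-versus-planet point, with made-up illustrative numbers: p is the chance that one random sequence self-replicates, and n is how many sequences get "tried" in a weekend-scale vat versus on prebiotic Earth.

```python
import math

def p_at_least_one(p, n):
    # 1 - (1 - p)^n, approximated as 1 - e^(-p*n) and computed stably
    return -math.expm1(-p * n)

p = 1e-30                      # assumed rarity of self-replicators (illustrative)
n_vat, n_planet = 1e20, 1e38   # assumed number of "tries" (illustrative)

print(p_at_least_one(p, n_vat))     # ~1e-10: essentially never in the vat
print(p_at_least_one(p, n_planet))  # ~1.0: near-certain on the planet
```

Any rarity roughly between 1/n_planet and 1/n_vat gives exactly the situation described: no abiogenesis in the vat, near-certain abiogenesis on the planet.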
We can make a test on computer viruses. What is the probability that random code will be a self-replicating program? A 1-in-10^50 probability is not that extraordinary - it is just the probability of around 150 bits of code being in the right places.
Or X has a high Kolmogorov complexity, but the universe runs in a nonstandard model where T halts.
Disclaimer: I barely know anything about nonstandard models, so I might be wrong. I think this means that T halts after a number of steps equal to a nonstandard natural number, which comes after all standard natural numbers. So how would you see that it "eventually" outputs X? Even trying to imagine this is too bizarre.
You have the Turing machine next to you, you have seen it halt. What you are unsure about is if the current time is standard or non-standard.
Since non-standard natural numbers come after standard natural numbers, I will also have noticed that I've already lived for an infinite amount of time, so I'll know something fishy is going on.
The problem is that nonstandard numbers behave like standard numbers from the inside.
Nonstandard numbers still have decimal representations, just the number of digits is nonstandard. They have prime factors, and some of them are prime.
We can look at it from the outside and say that it's infinite, but from within, they behave just like very large finite numbers. In fact there is no formula in first-order arithmetic, with 1 free variable, that is true on all standard numbers and false on all nonstandard numbers.
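The textbook fact behind that last claim is overspill (stated here as background, not something from the thread):

```latex
\text{If } M \models \mathrm{PA} \text{ is nonstandard and } M \models \varphi(n) \text{ for every standard } n,\\
\text{then } M \models \varphi(c) \text{ for some nonstandard } c \in M.
```

In particular, no formula with one free variable can hold on exactly the standard numbers.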
In the sense that every nonstandard natural number is greater than every standard natural number.
Just realized that a mental move of "trying to solve AI alignment" was actually a search for a pre-cached value for "solution to AI alignment", realized that this was a useless way of thinking, although it might make a good context shift.
Alignment crazy idea. Only run optimization power through channels that have been optimized to convey it.
Like water that flows through pipes, but doesn't escape from leaks.
Suppose the AI is connected to a robot body. The AI can optimize along the wires, and through the motors. Optimization power can flow along these channels because humans deliberately optimized them to be good at conveying optimization power. But the AI can't use row-hammer. Humans didn't deliberately optimize memory modules to be susceptible. They just happen to be because of physics. Thus the electric interference between memory locations is a channel that optimization power can flow through, but it was not itself optimized to be good at transmitting optimization power. Thus the AI isn't allowed to use it.
I am trying to write something that would make sense if I had as solid and mathy an idea of "optimization here" as I do of "information here".
Viruses are optimizing their own spread, not killing all humans. This seems to be further optimizing an already highly optimized artifact, not flowing optimization through an optimized channel.
I am not sure, I think it depends on why the AI wants the shockwave. Again, all I have is a fuzzy intuition that says yes in some cases, no in others, and shrugs in a lot of cases. I am trying to figure out if I can get this into formal maths. And if I succeed, I will (probably, unless infohazard or something) describe the formal maths.
Viruses are optimizing their own spread, not killing all humans. This seems to be further optimizing an already highly optimized artifact, not flowing optimization through an optimized channel.
Well I'm saying that the virus's ability to penetrate the organism, penetrate cells and nuclei, and hijack the DNA transcription machinery, is a channel. It already exists and was optimized to transmit optimization power: selection on the viral genome is optimization, and it passes through this channel, in that this channel allows the viral genome (when outside of another organism) to modify the behavior of an organism's cells.
(For the record I didn't downvote your original post and don't know why anyone would.)
Yeah, probably. However, note that it can only use this channel if a human has deliberately made an optimization channel that connects into this process. I.e. the AI isn't allowed to invent DNA printers itself.
I think a bigger flaw is where one human decided to make a channel from A to B, another human made a channel from B to C ... until in total there is a channel from A to Z that no human wants and no human knows exists, built entirely out of parts that humans build.
I.e. person 1 decides the AI should be able to access the internet. Person 2 decides that anyone on the internet should be able to run arbitrary code on their programming website, and the AI puts those together, even when no human did. Is that a failure of this design? Not sure. Can't get a clear picture until I have actual maths.
Reducing the capability of language models on dangerous domains through adding small amounts of training data.
Suppose someone prompts GPT-5 with "the code for a superintelligent AI is". If GPT-5 is good at generalizing out of distribution, then it may well produce code for an actual superintelligence. This is obviously dangerous. But suppose similar prompts appeared 100 times in its training dataset. Each time followed by nonsensical junk code.
Then the prompt is in the training distribution, and GPT-n will respond by writing nonsense. Safe nonsense.
How hard is it to create this text? Potentially not very. Directly typing it wouldn't take that long; large language models can pick up patterns from a few examples, often just 1. And you can repeat each sample you do type, perhaps with minor word variations. All sorts of tricks can generate semicoherent gibberish, from small language models, to Markov chains, context-free grammars, or just shuffling the lines of real open source code. Or an entirely functional linear regression algorithm.
How hard is it to get data into a future large language model? Not very. The datasets seem to be indiscriminately scraped off most of the internet. Just putting the code on github should be enough.
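A minimal sketch of the cheapest trick on that list, shuffling lines of real open-source code into junk completions. The names (dangerous_prompts, source_files) are placeholders for whatever prompts and code you choose; this illustrates how little effort the data generation takes, not a recommendation to run it (see the next paragraph).

```python
import random

def junk_completion(source_files, n_lines=40):
    """Shuffle lines drawn from real code into semicoherent gibberish."""
    lines = [line for path in source_files
             for line in open(path).read().splitlines() if line.strip()]
    return "\n".join(random.sample(lines, min(n_lines, len(lines))))

def make_poison_pairs(dangerous_prompts, source_files, copies_per_prompt=100):
    """Pair each dangerous prompt with many junk 'answers' to publish."""
    return [(prompt, junk_completion(source_files))
            for prompt in dangerous_prompts
            for _ in range(copies_per_prompt)]
```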
Please don't do this off the bat; leave a chance for people to spot any good reasons not to do this that may exist first. (Unilateralist's curse.)
Rationalists should bet on beliefs.
Well, maybe. Suppose you and another skilled rationalist assign 47% and 42% to an event that's 10 years away. Once you factor in bet overhead, diminishing marginal utility of money, opportunity cost etc., it's hard to come out making a profit.
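A rough illustration of why the profit is small, with assumed numbers: the bet is struck at the midpoint of the two probabilities, the stake is $100, future money is discounted at 3%/year, and there is a 2% chance the loser never pays.

```python
p_you, p_them = 0.47, 0.42
p_bet = (p_you + p_them) / 2   # 0.445: the price you agree to bet at
stake = 100                    # you buy "$100 if the event happens" for stake * p_bet

edge = (p_you - p_bet) * stake                  # $2.50 expected profit, by your own beliefs
years, discount = 10, 0.03
present_value = edge / (1 + discount) ** years  # ~$1.86 in today's money
print(present_value * (1 - 0.02))               # ~$1.82, before any overhead or hassle
```

Against someone wildly miscalibrated, the same stake would carry an edge of tens of dollars, which is the point below.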
Now betting can be a good way to give a costly signal of your own ability to predict such questions, but if a skilled rationalist plans to bet for the money, the last person they want to be betting against is another skilled rationalist. (Or someone who won't pay up.) They want to find the most delusional person they can get to bet. The person who is confident that the moon is a hologram and will be turned off tomorrow.
At one extreme you have costly signals between two skilled rationalists, neither of whom is making much in expectation. Some evidence for belief in your own rationality. Also evidence for wealth and being prepared to throw money around to make your point. A good track record of bets, like a good track record of predictions, is evidence of rationality. Just having made bets that haven't resolved yet could mean they believe themselves to be rational, or just that they like betting and have money to spare.
On the other extreme you have exploiting the delusional to swindle them.
In information theory, there is a principle that any predictable structure in the compressed message is an inefficiency that can be removed. You can add a noisy channel, differing costs of different signals etc., but beyond that, any excess pattern indicates wasted bits.
In numerically solving differential equations, the naive way of solving them involves repeatedly calculating with numbers that are similar, and for which a linear or quadratic function would be an even better fit. A more complex higher-order solver with larger timesteps has less of a relation between different values in memory.
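A concrete version of the ODE example (my own illustration, not from the original): solving dy/dt = -y on [0, 1], a naive Euler loop with tiny steps stores a long run of nearly identical values, while fourth-order Runge-Kutta gets smaller error with steps a thousand times larger.

```python
import math

def euler(f, y, t_end, h):
    for _ in range(round(t_end / h)):
        y += h * f(y)            # consecutive stored values are nearly identical
    return y

def rk4(f, y, t_end, h):
    for _ in range(round(t_end / h)):
        k1 = f(y)
        k2 = f(y + h * k1 / 2)
        k3 = f(y + h * k2 / 2)
        k4 = f(y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return y

f = lambda y: -y
exact = math.exp(-1.0)
print(abs(euler(f, 1.0, 1.0, 1e-4) - exact))   # 10,000 tiny, highly redundant steps
print(abs(rk4(f, 1.0, 1.0, 0.1) - exact))      # 10 big steps, smaller error
```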
I am wondering if there is a principle that could be expressed as "any simple predictively useful pattern that isn't a direct result of the structure of the code represents an inefficiency." (Obviously code can have the pattern c=a+b, when c has just been calculated as a+b. But if a and b have been calculated, and then a new complicated calculation is done that generates c, when c could just be calculated as a+b, that's a pattern and an inefficiency.)
The strongest studies can find the weakest effects. Imagine some huge and very well resourced clinical trial finds some effect. Millions of participants being tracked and monitored extensively over many years. Everything double-blind, randomized etc. Really good statisticians analyzing the results. A trial like this is capable of finding effect sizes that are really, really small. It is also capable of detecting larger effects. However, people generally don't run trials that big if the effect is so massive and obvious it can be seen with a handful of patients.
On the other hand, a totally sloppy prescientific methodology can easily detect results if they are large enough. If you had a total miracle cure, you could get strong evidence of its effectiveness just by giving it to one obviously very ill person and watching them immediately get totally better.
Rough stop button problem ideas.
You want an AI that believes its actions can't affect the button. You could use causal counterfactuals: an imaginary button that presses itself at random. You can scale the likelihood of worlds up and down, to ensure the button is equally likely to be pressed in each world. (Weird behaviour, not recommended.) You can put the AI in the logical counterfactual of "my actions don't influence the chance the button is pressed", if you can figure out logical counterfactuals.
Or you can get the AI to simulate what it would do if it were an X maximizer. If it thinks the button won't be pressed, it does that, otherwise it does nothing. (not clear how to generalize to uncertain AI)
Here is a potential solution to stop button type problems, how does this go wrong?
Taking into account uncertainty, the algorithm is:
Calculate the X-maximizing best action in a world where the stop button does nothing.
Calculate the X-maximizing best action in a world where the stop button works.
If they are the same, do that. Otherwise shut down.
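A minimal sketch of that decision rule. best_action is a hypothetical planner that returns the X-maximizing action given a world model; the two world models are identical except for whether the button does anything.

```python
def step(world_button_inert, world_button_works):
    a_inert = best_action(world_button_inert)   # plan as if the button does nothing
    a_works = best_action(world_button_works)   # plan as if the button shuts us down
    if a_inert == a_works:
        return a_inert    # the plan doesn't depend on the button, so act
    return "SHUTDOWN"     # the plans diverge: the button mattered, so stop
```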
It seems like the button-works action will usually be some variety of "take preemptive action to ensure the button won't be pressed" and so the AI will have a high chance to shut down at each decision step.
Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, it is probably just going to do your simple task. The idea is that you build a fairly secure box, and give the AI a task it can fairly easily achieve in that box (with you having no intention of pressing the button so long as the AI seems to be acting normally). We want to make "just do your task" the best strategy. If the box is less secure than we thought, or various other things go wrong, the AI will just shut down.
This approach could fail if the easiest way to cause trouble is nothing to do with the stop button. (Say it can send its code elsewhere over the internet. )
A generalization. You have some real world variable X. You run a bunch of AIs that are identical, except for different dogmatic beliefs about X. If they ever disagree on which action to take, the AIs are using the value of X in the optimization. Shut down.
This acts as a kind of fuse. Have one AI confident the door password is 1234, and a copy that is confident it is 5678. If these AIs ever disagree on what to do next, then at least one was trying to open the door. Shut down. A slightly counter-intuitive point here is that you want to make sure that if you did know the password, the easiest way to leave would be the door. Otherwise both AIs will agree to climb out a window.
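The same shape as the sketch above, generalized to any variable X. The world models are identical copies that differ only in a hard-coded dogmatic belief about X (e.g. the password); best_action is again a hypothetical planner.

```python
def fused_policy(world_models):
    actions = [best_action(m) for m in world_models]
    if all(a == actions[0] for a in actions):
        return actions[0]   # no plan leaked information about X into the decision
    return "SHUTDOWN"       # at least one plan used X (e.g. tried the door)
```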
There seems to be a consensus here towards throwing money at getting more vaccines. I think I agree with the reasoning, except possibly for the way that letting vaccine companies make large profits in pandemics encourages vaccine companies to start and spread pandemics.
How confident should we be that no company would do anything that evil?
I don't think they would, but ...
Pretty confident. The penalties (individual, not corporate) for getting caught doing such a thing without very strong government/military support would be pretty painful, including violent mobs literally tearing one to pieces.
Note my argument for throwing more money at this is two-fold:
I was working on a result about Turing machines in nonstandard models, then I found I had rediscovered Chaitin's incompleteness theorem.
I am trying to figure out how this relates to an AI that uses Kolmogorov complexity.
No one has searched all possible one-page proofs of propositional logic to see if any of them prove false. Sure, you can prove that propositional logic is consistent in a stronger theory, but you can prove large cardinal axioms from even larger cardinal axioms.
Why do you think that no proof of false, of length at most one page exists in propositional logic? Or do you think it might?
Soap and water or hand sanitiser are apparently fine for getting covid-19 off your skin. Suppose I rub X on my hands, then I touch an infected surface, then I touch my food or face. What X will kill the virus without harming my hands?
I was thinking zinc salts, given zinc's antiviral properties. Given soap's tendency to attach to the virus, maybe zinc soaps? Like a zinc atom in a salt with a fatty acid? This is babbling by someone who doesn't know enough biology to prune.
Here is a flawed dynamic in group conversations, especially among large groups of people with no common knowledge.
Suppose everyone is trying to build a bridge.
Alice: We could make a bridge by just laying a really long plank over the river.
Bob: According to my calculations, a single plank would fall down.
Carl: Scientists Warn Of Falling Down Bridges, Panic.
Dave: No one would be stupid enough to design a bridge like that, we will make a better design with more supports.
Bob: Do you have a schematic for that better design?
And, at worst, the cycle repeats.
The problem here is Carl. The message should be
Carl: At least one attempt at designing a bridge is calculated to show the phenomenon of falling down. It is probable that many other potential bridge designs share this failure mode. In order to build a bridge that won't fall down, someone will have to check any design for falling-down behaviour before it is built.
This entire dynamic plays out the same whether the people actually deciding on building the bridge are incredibly cautious, never approving a design they weren't confident in, or totally reckless. The probability of any bridge actually falling down in the real world depends on their caution. But the process of cautious bridge builders finding a good design looks like them rejecting lots of bad ones. If the rejection of bad designs is public, people can accuse you of attacking a strawman; they can say that no one would be stupid enough to build such a thing. If they are right that no one would be stupid enough to build such a thing, it's still helpful to share the reason the design fails.
What? In this example, the problem is not Carl - he's harmless, and Dave carries on with the cycle (of improving the design) as he should. Showing a situation where Carl's sensationalist misstatement actually stops progress would likely also show that the problem isn't Carl - it's EITHER the people who listen to Carl and interfere with Alice, Bob, and Dave, OR it's Alice and Dave for letting Carl discourage them rather than understanding Bob's objection directly.
Your description implies that the problem is something else - that Carl is somehow preventing Dave from taking Bob's analysis into consideration, but your example doesn't show that, and I'm not sure how it's intended to.
In the actual world, there's LOTS of sensationalist bad reporting of failures (and of extremely minor successes, for that matter). And those people who are actually trying to build things mostly ignore it, in favor of more reasonable publication and discussion of the underlying experiments/failures/calculations.
Another dumb Alignment idea.
Any one crude heuristic will be Goodharted, but what about a pile of crude heuristics?
A bunch of humans have, say, 1 week in a box to write a crude heuristic for a human value function (bounded on [0,1]).
Before they start, an AI is switched on, given a bunch of info, and asked to predict a probability distribution over what the humans write.
Then an AI maximizes the average over that distribution.
The humans in the box know the whole plan. They can do things like flip a quantum coin, and use that to decide which part of their value function they write down.
Do all the mistakes cancel out? Is it too hard to Goodhart all the heuristics in a way that's still bad? Can we write any small part of our utility function?
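A minimal sketch of the plan, with hypothetical pieces: predicted_heuristics is the AI's predictive distribution over what the boxed humans will write, given as (probability, function) pairs with each function bounded in [0,1], and candidate_outcomes is whatever the AI is choosing between.

```python
def expected_value(outcome, predicted_heuristics):
    # Average the crude heuristics, weighted by how likely the AI
    # thinks the humans are to write each one.
    return sum(p * h(outcome) for p, h in predicted_heuristics)

def choose(candidate_outcomes, predicted_heuristics):
    return max(candidate_outcomes,
               key=lambda o: expected_value(o, predicted_heuristics))
```

The quantum coin in the box shows up here as genuine spread in the predicted distribution, so no single heuristic gets all the weight.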
The Bible was written by many authors, and contains fictional and fictionalized characters. It's a collection of several-thousand-year-old fanfiction. Like modern fanfiction, people tried telling variations on the same story or the same characters (2 entirely different Genesis stories). Hence there is not even a pretence at consistency. This explains why the main character is so often portrayed as a Mary Sue. And why there are many different books, each written in a different style. And the prevalence of weird sex scenes.
From Slate Star Codex, "I myself am a Scientismist":
Antipredictions do not always sound like antipredictions. Consider the claim “once we start traveling the stars, I am 99% sure that the first alien civilization we meet will not be our technological equals”. This sounds rather bold – how should I know to two decimal places about aliens, never having met any?
But human civilization has existed for 10,000 years, and may go on for much longer. If “technological equals” are people within about 50 years of our tech level either way, then all I’m claiming is that out of 10,000 years of alien civilization, we won’t hit the 100 where they are about equivalent to us. 99% is the exact right probability to use there, so this is an antiprediction and requires no special knowledge about aliens to make.
I disagree. I think that it is likely that a society can get to a point where they have all the tech. I think that we will probably do this within a million years (and possibly within 5 minutes of ASI). Any aliens we meet will be technological equals, or dinosaurs with no tech whatsoever.
But your disagreement only kicks in after a million years. If we meet the first alien civilization before then, it doesn't seem to apply. A million (and 10,000?) years is also an even bigger interval than 10,000, making what appears to be an even stronger case than the post you referenced.
Given bulk prices of concentrated hydrogen peroxide, and human oxygen use, breathing pure oxygen could cost around $3 per day for 5 L of 35% H2O2 (order-of-magnitude numbers). However, this concentration of H2O2 is quite dangerous stuff.
Powdered baking yeast will catalytically decompose hydrogen peroxide, and it shouldn't be hard to tape a bin bag to a bucket to a plastic bottle with a drip hole to a vacuum cleaner tube to make an Apollo 13 style oxygen generator... (I think)
(I am trying to figure out a cheap and easy oxygen source, does breathing oxygen help with coronavirus?)
Sodium chlorate decomposes into salt and oxygen at 600C; it is mixed with iron powder for heat to make the oxygen generators on planes. To supply oxygen, you would need 1.7 kg per day (plus a bit more to burn the iron), and its bulk price is <$1/kg. However, 600C would make it harder to jerry-rig a generator, although maybe wrapping a saucepan in fiberglass...
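A back-of-envelope check of those numbers (my assumptions: a resting human uses roughly 0.84 kg of O2 per day, and 35% H2O2 solution has density about 1.13 kg/L):

```python
O2_PER_DAY_KG = 0.84   # assumed resting human oxygen consumption

# 2 H2O2 -> 2 H2O + O2: 68 g of H2O2 releases 32 g of O2
h2o2_kg = O2_PER_DAY_KG * 68 / 32        # ~1.8 kg of pure H2O2 per day
litres_35pct = h2o2_kg / (0.35 * 1.13)   # ~4.5 L of 35% solution per day

# 2 NaClO3 -> 2 NaCl + 3 O2: 213 g of chlorate releases 96 g of O2
naclo3_kg = O2_PER_DAY_KG * 213 / 96     # ~1.9 kg of sodium chlorate per day

print(round(litres_35pct, 1), round(naclo3_kg, 2))   # roughly matches the ~5 L and ~1.7 kg above
```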
Looking at the formalism for AIXI and other similar agent designs. It's a big mess of sums and maxes with indices. Would there be a better notation?
Suppose an early AI is trying to understand its programmers and makes millions of hypotheses that are themselves people. Later it becomes a friendly superintelligence that figures out how to think without mindcrime. Suppose all those imperfect virtual programmers have been saved to disk by the early AI; the superintelligence can look through them. We end up with a post-singularity utopia that contains millions of citizens almost but not quite like the programmers. We don't need to solve the nonperson predicate ourselves to get a good outcome, just avoid minds we would regret creating.