I've made a few posts that seemed to contain potentially valuable ideas related to AI safety. However, I got almost no feedback on them, so I was hoping some people could look at them and tell me what they think. They still seem valid to me, and if they are, they could potentially be very valuable contributions. And if they aren't valid, then I think knowing the reason for this could potentially help me a lot in my future efforts towards contributing to AI safety.
The posts are:
There is a matter I'm confused about: What exactly is base-level reality, does it necessarily exist, and is it ontologically different from other constructs?
First off, I had gotten the impression that there was a base-level reality, and that in some sense it's ontologically different from the sorts of abstractions we use in our models. I thought that, in some sense, the subatomic particles "actually" existed, whereas our abstractions, like chairs, were "just" abstractions. I'm not actually sure how I got this impression, but I had the sense that other peop...
I had made a post proposing a new alignment technique. I didn't get any responses, but it still seems like a reasonable idea to me, so I'm interested in hearing what others think of it. I think the basic idea of the post, if correct, could be useful for future study. However, I don't want to waste time pursuing it if the idea is unworkable for a reason I hadn't thought of.
(If you're interested, please read the post before reading below.)
Of course, the idea's not a complete solution to alignment, and things have a risk of going catastrophically wrong due to...
I found what seems to be a potentially dangerous false negative in the most popular definition of optimizer. I didn't get a response, so I would appreciate feedback on whether it's reasonable. I've been focusing on defining "optimizer", so I think feedback would help me a lot. You can see my comment here.
I've realized I'm somewhat skeptical of the simulation argument.
The simulation argument proposed by Bostrom argued, roughly, that either almost exactly all Earth-like worlds don't reach a posthuman level, almost exactly all such civilizations don't go on to build many simulations, or we're almost certainly in a simulation.
Now, if we knew that the only two sorts of creatures that experience what we experience are either in simulations or on the actual, original, non-simulated Earth, then I could see why the argument would be reasonable. However, I don't kno...
I had recently posted a question asking whether iterated amplification was actually more powerful than mere mimicry, and arguing that it was not. I had thought I was making a pretty significant point, but the post attracted very little attention. I'm not saying this is a bad thing, but I'm not really sure why it happened, so I would appreciate some insight about how I can contribute more usefully.
Iterated amplification seems to be the leading proposal for creating aligned AI, so I thought a post arguing against it, if correct, would be a useful contribution...
If the impact measure was poorly implemented, then I think such an impact-reducing AI could indeed result in the world turning out that way. However, note that the technique in the paper is intended to make the world if the AI was turned on as similar as possible, across a very wide range of variables, to what it would have been like if the AI wasn't turned on. So, you can potentially avoid the AI-controlled-drone scenario by including the variable "number of AI-controlled drones in the world", or something correlated with it, as these variables could have quite diffe...
I have some concerns about an impact measure proposed here. I'm interested in working on impact measures, and these seem like very serious concerns to me, so it would be helpful to see what others think about them. I asked Stuart, one of the authors, about these concerns, but he said he was too busy to work on dealing with them.
First, I'll give a basic description of the impact measure. Have your AI be turned on by some sort of stochastic process that may or may not result in the AI being turned on. For example, consider sending a photon through a semi-si...
I've come up with a system of infinite ethics intended to provide more reasonable moral recommendations than previously-proposed ones. I'm very interested in what people think of this, so comments are appreciated. I've made a write-up of it below.
One unsolved problem in ethics is that aggregate consequentialist ethical theories tend to break down if the universe is infinite. An infinite universe could contain both an infinite amount of good and an infinite amount of bad. If so, you are unable to change the total amount of good or bad in the universe, which ...
I have an idea for reasoning about counterpossibles for decision theory. I'm pretty skeptical that it's correct, because it doesn't seem that hard to come up with. Still, I can't see a problem with it, and I would very much appreciate feedback.
This paper provides a method of describing UDT using proof-based counterpossibles. However, it doesn't work in stochastic environments. I will describe a new system that is intended to fix this. The technique seems sufficiently straightforward to come up with that I suspect I'm either doing something wrong or this ha...
I'd like to propose the idea of aligning AI by reverse-engineering its world model and using this to specify its behavior or utility function. I haven't seen this discussed before, but I would greatly appreciate feedback or links to any past work on this.
For example, suppose a smart AI models humans. Suppose it has a model that explicitly specifies the humans' preferences. Then people who reverse-engineered this model could use it as the AI's preferences. If the AI lacks a model with explicit preferences, then I think it would still contain an accurate mod...
I've recently gotten concerned about the possibility that advanced AIs would "hack" their own utility function. I haven't seen this discussed before, so I wanted to bring it up. If I'm right, this seems like it could be a serious issue, so I would greatly appreciate feedback or links to any previous discussion.
Suppose you come up with a correct, tractable mathematical specification of what you want your AI's utility function to be. So then you write code intended to be an implementation of this.
However, computers are vulnerable to some hardware probl...
There's a huge gulf between "far-fetched" and "quite likely".
The two big ones are failure to work out how to create an aligned AI at all, and failure to train and/or code a correctly designed aligned AI. In my opinion the first accounts for at least 80% of the probability mass, and the second for most of the remainder. We utterly suck at writing reliable software in every field, and this has been amply borne out in not just thousands of failures, but thousands of types of failures.
By comparison, we're fairly good at creating at least moderately reliable hardware, and most of the accidental failure modes are fatal to the running software. Flaws like Rowhammer are mostly attacks, where someone puts a great deal of intelligent effort into finding an extremely unusual operating mode in which some assumptions can be bypassed, and into creating exactly the wrong operating conditions.
There are some examples of accidental flaws that affect hardware and aren't fatal to its running software, but they're an insignificant fraction of the number of failures due to incorrect software.
I was wondering if anyone would be interested in reviewing some articles I was thinking about posting. I'm trying to make them as high-quality as I can, and I think getting them reviewed by someone would be helpful for making Less Wrong contain high-quality content.
I have four articles I'm interested in having reviewed. Two are about new alignment techniques, one is about a potential danger with AI that I haven't seen discussed before, and one is about the simulation argument. All are fairly short.
If you're interested, just let me know and I can share drafts of any articles you would like to see.
I've read this paper on low-impact AIs. There's something about it that I'm confused and skeptical about.
One of the main methods it proposes works as follows. Find a probability distribution over many possible variables in the world. Let X represent the statement "The AI was turned on". For each of the variables v it considers, the probability distribution over v after conditioning on X should look about the same as the probability distribution over v after conditioning on not-X. That's low impact.
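As a concrete (and much simplified) sketch of my reading of this: the impact of turning the AI on could be scored by summing, over the monitored variables, some distance between the two conditional distributions. The variable names, distributions, and the choice of total variation distance below are all my own hypothetical stand-ins, not the paper's formalism.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions,
    each given as a value -> probability dict."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def impact(cond_on_X, cond_on_not_X):
    """Sum the distances over every monitored variable v."""
    return sum(total_variation(cond_on_X[v], cond_on_not_X[v]) for v in cond_on_X)

# Hypothetical conditional distributions for one monitored variable.
cond_on_X     = {"stock_index": {"up": 0.5, "down": 0.5}}
cond_on_not_X = {"stock_index": {"up": 0.5, "down": 0.5}}
print(impact(cond_on_X, cond_on_not_X))  # 0.0 -- the distributions match, so "low impact"
```

An AI minimizing a score like this would be pushed toward worlds where, for every monitored variable, conditioning on X and on not-X look about the same.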
But the paper doesn't mention conditioning on any evide...
I'm questioning whether we would actually want to use Updateless Decision Theory, Functional Decision Theory, or future decision theories like them.
I think that in sufficiently extreme cases, I would act according to Evidential Decision Theory and not according to something like UDT, FDT, or any similar successor. And I think I would continue to want to take the evidential-decision-theoretic-recommended action instead even if I had arbitrarily high intelligence and willpower and had infinitely long to think about it. And, though I'd like to hear others' thought...
I'm wondering how, in principle, we should deal with malign priors. Specifically, I'm wondering what to do about the possibility that reality itself is, in a sense, malign.
I had previously said that it seems really hard to verifiably learn a non-malign prior. However, now I've realized that I'm not even sure what a non-malign, but still reliable, prior would even look like.
In previous discussion of malign priors, I've seen people talk about the AI misbehaving due to thinking it's embedded in a simpler universe than our own that was controlled by ag...
I've been reading about logical induction. I read that logical induction was considered a breakthrough, but I'm having a hard time understanding its significance: I can't see how it outperforms what I call "the naive approach" to logical uncertainty. I imagine there is some notable benefit of it I'm missing, so I would very much appreciate some feedback.
First, I'll explain what I mean by "the naive approach". Consider asking an AI developer with no special background in reasoning under logical uncertainty how to make an algorithm...
I've thought of a way in which other civilizations could potentially "hack" Updateless Decision Theoretic agents on Earth in order to make them do whatever the other civilization wants them to do. I'm wondering if this has been discussed before, and if not, what people think about it.
Here I present a method that would potentially allow aliens to take control of an AI on Earth that uses Updateless Decision Theory.
Note that this crucially depends on different agents with the AI's utility function but different situations terminally valuing different things. For...
I was wondering if there has been any work on getting around specifying the "correct" decision theory by just using a more limited decision theory and adjusting terminal values to compensate.
I think we might be able to get an agent that does what we want without formalizing the right decision theory, by instead making a modification to the value loading used. This way, even an AI with a simple, limited decision theory like evidential decision theory could make good choices.
I think that normally when considering value loading, people imagine finding a way...
I've come up with a system of infinite ethics intended to provide more reasonable moral recommendations than previously-proposed ones. I'm very interested in what people think of this, so comments are appreciated. I've made a write-up of it below.
One unsolved problem in ethics is that aggregate consequentialist ethical theories tend to break down if the universe is infinite. An infinite universe could contain both an infinite amount of good and an infinite amount of bad. If so, you are unable to change the total amount of good or bad in the universe, which can cause aggregate consequentialist ethical systems to break.
A variety of methods have been considered to deal with this. However, to the best of my knowledge, all proposals either have severe negative side-effects or are intuitively undesirable for other reasons.
Here I propose a system of aggregate consequentialist ethics intended to provide reasonable moral recommendations even in an infinite universe.
It is intended to satisfy the desiderata for infinite ethical systems specified in Nick Bostrom's paper, "Infinite Ethics". These are:
I have yet to find a way in which my system fails any of the above desiderata. Of course, I could have missed something, so feedback is appreciated.
My ethical system
First, I will explain my system.
My ethical theory is, roughly: "Make the universe one that agents would wish they were born into."
By this I mean: suppose you had no idea which agent in the universe you would be, what circumstances you would be in, or what your values would be, but you still knew you would be born into this universe. Consider having a bounded quantitative measure of your general satisfaction with life, for example a utility function. Then try to make the universe such that the expected value of your life satisfaction is as high as possible, conditioning on being an agent in this universe but not on anything else. (Also, "universe" above means "multiverse" if ours is one.)
In the above description I didn't provide any requirement for the agent to be sentient or conscious. If you wish, you can modify the system to give higher priority to the satisfaction of agents that are sentient or conscious, or you can ignore the welfare of non-sentient or non-conscious agents entirely.
It's not entirely clear how to assign a prior over situations in the universe you could be born into. Still, I think it's reasonably intuitive that there would be some high-entropy situations among the different situations in the universe. This is all I assume for my ethical system.
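To make the scoring rule concrete, here is a rough Monte Carlo sketch of it in Python. The prior over birth situations, the situations themselves, and all the satisfaction numbers are hypothetical stand-ins of my own; the point is only that a universe is scored by the expected life satisfaction of an agent that doesn't know where in the universe it will end up.

```python
import random

def expected_satisfaction(universe, prior, samples=100_000, seed=0):
    """Estimate the expected life satisfaction (in [0, 1]) of an agent that
    knows only that it will be born somewhere in `universe`, with its birth
    situation drawn from `prior`."""
    rng = random.Random(seed)
    situations = list(prior)
    weights = [prior[s] for s in situations]
    draws = rng.choices(situations, weights, k=samples)
    return sum(universe[s] for s in draws) / samples

# Hypothetical prior over birth situations, and two candidate universes.
prior = {"comfortable life": 0.3, "hard life": 0.7}
universe_a = {"comfortable life": 0.9, "hard life": 0.2}  # ignore the badly-off
universe_b = {"comfortable life": 0.8, "hard life": 0.5}  # help the badly-off

print(expected_satisfaction(universe_a, prior))  # about 0.41
print(expected_satisfaction(universe_b, prior))  # about 0.59
```

Helping the badly-off raises the expected satisfaction of the veil-of-ignorance agent, which is the sense in which the system scores that universe as better.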
Now I'll give some explanation of what this system recommends.
Suppose you are considering doing something that would help some creature on Earth, and that doing so wouldn't cause any harm to other creatures. Describe that creature and its circumstances, for example as "<some description of a creature> in an Earth-like world with someone who is <insert complete description of yourself>". Well, there is a non-zero prior probability of an agent, having no idea what circumstances it will be in, ending up in circumstances satisfying that description. By choosing to help that creature, you would thus increase the expected satisfaction of any creature in circumstances matching the above description, and so increase the overall expected life-satisfaction of an agent that knows nothing about where it will be in the universe. This seems reasonable.
With similar reasoning, you can show why it would be beneficial to also try to steer the future state of our accessible universe in a positive direction. An agent would have nonzero probability of ending up in situations of the form, "<some description of a creature> that lives in a future colony originating from people from an Earth-like world that features someone who <insert description of yourself>". Helping them would thus increase an agent's prior expected life-satisfaction, just like above. This same reasoning can also be used to justify doing acausal trades to help creatures in parts of the universe not causally accessible.
The system also values helping as many agents as possible. If you only help a few agents, the prior probability of an agent ending up in situations just like those agents would be low. But if you help a much broader class of agents, the effect on the prior expected life satisfaction would be larger.
These all seem like reasonable moral recommendations.
I will now discuss how my system does on the desiderata.
Infinitarian paralysis
Some infinite ethical systems result in what is called "infinitarian paralysis". This is the state of an ethical system being indifferent in its recommendations in worlds that already have infinitely large amounts of both good and bad. If there's already an infinite amount of both good and bad, then our actions, using regular cardinal arithmetic, are unable to change the amount of good and bad in the universe.
My system does not have this problem. To see why, remember that my system says to maximize the expected value of your life satisfaction given that you are in this universe, without conditioning on anything else. And the measure of life satisfaction was stated to be bounded, say to the range [0, 1]. Since any agent can only have life satisfaction in [0, 1], even in an infinite universe the expected value of the agent's life satisfaction must still be in [0, 1]. So, as long as a finite universe doesn't have an expected value of life satisfaction of 0, an infinite universe can have at most finitely more moral value than it.
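The boundedness argument can be written out as a short derivation (the notation is mine, not from the original write-up: U is an agent's life satisfaction, V(w) the moral value of universe w):

```latex
% U    = an agent's life satisfaction, bounded to [0, 1]
% V(w) = moral value of universe w
\[
  V(w) = \mathbb{E}\left[\, U \mid \text{being some agent in } w \,\right],
  \qquad
  0 \le U \le 1 \;\Longrightarrow\; 0 \le V(w) \le 1 .
\]
\[
  \text{Hence, for any two universes } w_1, w_2:\quad
  \lvert V(w_1) - V(w_2) \rvert \le 1 < \infty .
\]
```

So the difference in moral value between any two worlds is always finite, even when the worlds themselves contain infinite amounts of good and bad.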
To say it another way, my ethical system provides a function mapping possible worlds to their moral value, and this mapping always produces outputs in the range [0, 1]. So, trivially, no universe can have infinitely more moral value than another universe with non-zero moral value; ∞ just isn't in the range of my moral value function.
Fanaticism
Another problem in some proposals of infinite ethical systems is that they result in being "fanatical" in efforts to cause or prevent infinite good or bad.
For example, one proposed system of infinite ethics, the extended decision rule, has this problem. Let g represent the statement "there is an infinite amount of good in the world and only a finite amount of bad", and let b represent the statement "there is an infinite amount of bad in the world and only a finite amount of good". The extended decision rule says to do whatever maximizes P(g) - P(b); ties are broken by choosing whichever action results in the most moral value if the world is finite.
This results in a willingness to incur any finite cost to adjust the probability of infinite good and finite bad even very slightly. For example, suppose there is an action that, if done, would increase the probability of infinite good and finite bad by 0.000000000000001%, but that would kill every creature in existence if it turns out the world is actually finite. The extended decision rule would recommend doing it. This is the fanaticism problem.
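Here is a toy numerical sketch of the problem (the probabilities and payoff numbers are mine, purely for illustration): because the rule compares P(g) - P(b) first and consults finite-world value only on exact ties, an arbitrarily tiny edge in P(g) dominates an arbitrarily large finite catastrophe.

```python
def extended_rule_prefers(action_a, action_b):
    """Preferred action under the extended decision rule: compare
    P(g) - P(b); finite-world value only breaks exact ties."""
    score_a = action_a["p_g"] - action_a["p_b"]
    score_b = action_b["p_g"] - action_b["p_b"]
    if score_a != score_b:
        return action_a if score_a > score_b else action_b
    return action_a if action_a["finite_value"] >= action_b["finite_value"] else action_b

do_nothing = {"p_g": 0.10, "p_b": 0.10, "finite_value": 0.0}
# Raises P(g) by one part in a billion, but kills everyone if the world is finite.
gamble = {"p_g": 0.10 + 1e-9, "p_b": 0.10, "finite_value": -1e12}

print(extended_rule_prefers(gamble, do_nothing) is gamble)  # True
```

No matter how large the finite catastrophe in `gamble` is made, the rule still prefers it, because the finite-world term is never reached.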
My system doesn't place any especially high importance on adjusting the probabilities of infinite good or infinite bad, so it doesn't have this problem.
Preserving the spirit of aggregate consequentialism
Aggregate consequentialism is based on certain intuitions, like "morality is about making the world as good as it can be" and "don't arbitrarily ignore possible futures and their values". But finding a system of infinite ethics that preserves intuitions like these is difficult.
One infinite ethical system, infinity shades, says to simply ignore the possibility that the universe is infinite. However, this conflicts with our intuition about aggregate consequentialism. The big intuitive benefit of aggregate consequentialism is that it's supposed to actually systematically help the world be a better place in whatever way you can. If we're completely ignoring the consequences of our actions on anything infinity-related, this doesn't seem to be respecting the spirit of aggregate consequentialism.
My system, however, does not ignore the possibility of infinite good or bad, and thus is not vulnerable to this problem.
I'll provide another conflict with the spirit of consequentialism. Another infinite ethical system says to maximize the expected amount of goodness of the causal consequences of your actions, minus the amount of badness. However, this, too, doesn't properly respect the spirit of aggregate consequentialism. The appeal of aggregate consequentialism is that it defines some measure of "goodness" of a universe and then recommends taking actions to maximize it. But your causal impact is no measure of the goodness of the universe: the total amount of good and bad in the universe would be infinite no matter what finite impact you have. Without providing a metric of the goodness of the universe that's actually affected by your actions, this ethical approach also fails to satisfy the spirit of aggregate consequentialism.
My system avoids this problem by providing such a metric: the expected life satisfaction of an agent that has no idea what situation it will be born into.
Now I'll discuss another form of conflict. One proposed infinite ethical system looks at the average life satisfaction of a finite sphere of the universe, takes the limit of this as the sphere's size approaches infinity, and considers this the moral value of the world. This has the problem that you can adjust the moral value of the world just by rearranging agents. In an infinite universe, it's possible to come up with a method of re-arranging agents so the unhappy agents are spread arbitrarily thinly. Thus, you can make moral value arbitrarily high by just rearranging agents in the right way.
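A one-dimensional toy version of this rearrangement trick (my own construction, with satisfaction values 0 and 1): alternate happy and unhappy agents along a line, versus the same mix of agents with the unhappy ones placed only at perfect squares, i.e. spread arbitrarily thinly.

```python
import math

def average_satisfaction(arrangement, n):
    """Average satisfaction over the first n positions of an infinite line,
    analogous to averaging over a finite sphere before taking the limit."""
    return sum(arrangement(i) for i in range(1, n + 1)) / n

def alternating(i):
    """Happy (1) and unhappy (0) agents alternate."""
    return i % 2

def spread_thin(i):
    """Same population, but unhappy agents sit only at perfect squares."""
    return 0 if math.isqrt(i) ** 2 == i else 1

print(average_satisfaction(alternating, 10**6))  # 0.5
print(average_satisfaction(spread_thin, 10**6))  # 0.999 -> limit is 1
```

The limiting average is 0.5 in the first arrangement and 1 in the second, even though the two worlds contain the same agents, which is exactly the problem described above.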
I'm not sure my system entirely avoids this problem, but it does seem to have substantial defense against it.
Suppose you have the option of redistributing agents however you want in the universe, and you're using my ethical system to decide whether to spread the unhappy agents thinly.
Well, your actions have an effect on agents in circumstances of the form, "An unhappy agent on an Earth-like world with someone who <insert description of yourself> who is considering spreading the unhappy agents thinly throughout the universe". Spreading the unhappy agents thinly wouldn't make the expected life satisfaction of any agent satisfying the above description any better. So I don't think my ethical system recommends this.
Now, we don't have a complete understanding of how to assign a probability distribution over what circumstances an agent is in. It's possible that there is some way to redistribute agents in certain circumstances to change the moral value of the world; however, I don't know of any clear way to do this. Further, even if there is, my ethical system still doesn't let you make the moral value of the world arbitrarily high just by rearranging agents. This is because there will always be some non-zero probability of having ended up as an unhappy agent in the world you're in, and your life satisfaction after being redistributed in the universe would still be low.
Distortions
It's not entirely clear to me how Bostrom distinguished between distortions and violations of the spirit of aggregate consequentialism.
To the best of my knowledge, the only distortion pointed out in "Infinite Ethics" is stated as follows:
My approach doesn't ignore infinity and thus doesn't have this problem. I don't know of any other distortions in my ethical system.
I think the question of whether insects have preferences is morally pretty important, so I'm interested in hearing what made you think they do have them.
I looked online for "do insects have preferences?", and I saw articles saying they did. I couldn't really figure out why they thought they di...