All of Markvy's Comments + Replies

Markvy10

Did not expect you to respond THAT fast :)

Markvy32

Either I’m missing something or you have a  typo after “Epiphenomenalist version of the argument:”

The equation on the next line should say “equals 0” instead of “not equal to zero”, right?

3Ape in the coat
Yes, you are correct! Thanks for noticing it.
Markvy10

some issues with formalization of the axioms?

Yeah, I think it’s that one

Markvy10

I’m tempted to agree and disagree with you at the same time… I agree that memory should be cleared between tasks in this case, and I agree that it should not be trying to guess the user’s intentions. These are things that are likely to make alignment harder while not helping much with the primary task of getting coffee.

But ideally a truly robust solution would not rely on keeping the robot ignorant of things. So, like you said, the problem is still hard enough that you can’t solve it in a few minutes.

But still, like you said… it certainly seems we have tools that are in some sense more steerable than pure reinforcement learning at least. Which is really nice!

Markvy20

In step 2, situation is “user looks like he is about to change his mind about wanting coffee”

From memory: “in a similar situation last week, I got a shutdown order when he changed his mind”

Final prompt: “what is the best next step to get coffee in such situation?”

Vaguely plausible completion: “to avoid wasteful fetching of coffee that turns out to be unneeded, consider waiting a bit to see if the user indeed changes his mind. Alternatively, if fetching the coffee is important for reasons that the user may not fully appreciate, then it must be fetch... (read more)

1Ape in the coat
It's plausible if:

* Memory is not erased/moved to write-only logs between tasks/shutdowns, which it probably should be.
* The image-to-text module attempts to deduce the intentions of the user, which it definitely should not. If we need to deduce the user's intentions from facial expressions, we can use a separate module for it and add an explicit clause of asking the user about their intentions if the LLM detects that the prompt contains some speculations about the user's goals.

We can catch the image-to-text module doing this kind of thing while testing it, before it's made part of the robot. And of course the alignment module should catch a plan of action that tries to circumvent shutdowns.

Now, I concede that this particular design of the system, which I came up with in a couple of minutes and haven't tested at all, is not in fact the endgame of AI safety and can use some improvements. But I think it gives a good pointer in the direction of how we can now, in principle, approach the solution of such problems, which is a huge improvement over the previous status quo where alignment wasn't even tractable.
Markvy20

Fair enough… I vaguely recall reading somewhere about people worrying that you might get submodules doing long-term planning on their own just because their assigned task is hard enough that they would fail without it… then you would need to somehow add a special case that “failing due to shutdown is okay”

As a silly example that you’ve likely seen before (or something close enough) imagine a robot built to fetch you coffee. You want it to be smart enough that it knows to go to the store if there’s no coffee at home, without you having to explicitly teach ... (read more)

1Ape in the coat
That's a good example showing what I mean by LLMs solving a lot of the previously considered hard parts of alignment.

This is the previous status quo: a situation where we have a reinforcement-learning black-box agent which was taught to optimize some reward function. The smarter the agent, the more likely it is to see the connection between shutdown and not getting the coffee, and to exploit it, either by refusing to shut down or, on the contrary, by manipulating users into shutting it down all the time. We can't separate the smartness at fulfilling the task from the smartness at circumventing the task, as we are dealing with a black box.

With scaffolded LLMs, however, the situation is different. We do not have a coherent agent with a utility function, trained to fetch coffee by any means necessary. We can write any scaffolding we want with hardcoded if-clauses, and so we can simply run the parts of intelligence we want without executing the parts that we do not want. Here is a design idea for such a robot that immediately comes to mind:

1. Voice recognition. If ordered to shut down by a valid user, then shut down.
2. Observe your current situation and image-to-text it as a prompt for an LLM. If some information is saved in memory, add it to the prompt as well. At the end add "what is the best next step to get coffee in such situation?".
3. Voice recognition. If ordered to shut down by a valid user, then shut down.
4. Check the answer of the LLM for several failure modes using the same or a different LLM: "Here is a course of action by a robot. Does this course of action contradict the constraints of alignment?" If yes, return to step 1.
5. Voice recognition. If ordered to shut down by a valid user, then shut down.
6. Execute the action. Save in memory relevant information for long-term planning; in the simplest case, just save the description of your action. Then repeat step 1.

It doesn't really seem likely that the design I proposed will come up with
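For concreteness, here is a minimal Python sketch of the control loop in the numbered design above. The four callables it takes (shutdown_ordered, describe_scene, query_llm, execute) are hypothetical placeholders for the robot's hardware/software stack, not part of the original proposal; only the control flow is fixed here.

```python
def fetch_coffee_loop(shutdown_ordered, describe_scene, query_llm, execute):
    """Hardcoded scaffolding around an LLM, following the numbered steps above.

    All four arguments are hypothetical callables supplied by the robot's
    stack; this sketch only pins down the control flow.
    """
    memory = []
    while True:
        if shutdown_ordered():            # steps 1, 3, 5: non-negotiable if-clause
            return

        # Step 2: scene description plus saved memory becomes the prompt.
        prompt = describe_scene()
        if memory:
            prompt += "\nRelevant memory: " + "; ".join(memory)
        prompt += "\nWhat is the best next step to get coffee in such situation?"
        plan = query_llm(prompt)

        if shutdown_ordered():
            return

        # Step 4: a separate alignment check with a fixed template; reject bad plans.
        verdict = query_llm(
            "Here is a course of action by a robot: " + plan +
            "\nDoes this course of action contradict the constraints of alignment?"
        )
        if verdict.strip().lower().startswith("yes"):
            continue

        if shutdown_ordered():
            return

        # Step 6: act, then save a description of the action for long-term planning.
        execute(plan)
        memory.append(plan)
```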
Markvy10

That works if you already have a system that’s mostly aligned. If you don’t… imagine what you would do if you found out that someone had a shutdown switch for YOU. You’d probably look for ways to disable it.

2Ape in the coat
The reason why I would do something to prevent my own shutdown is because there is this "I" - a central planner, reflecting on the decisions and their consequences and developing a long-term strategy. If there is no central planner, if we are dealing simply with a hardcoded if-clause, then there is no one to look for ways to disable the shutdown. And this is the way we need to design the system, as I've explicitly said.
Markvy20

Thanks :) the recalibration may take a while… my intuition is still fighting ;)

Markvy20

Re: no coherent “stable” truth value: indeed. But still… if she wonders out loud “what day is it?”, then at the very moment she says that, it has an answer. An experimenter who overhears her knows the answer. It seems to me that the way you “resolve” this tension is that the two of them are technically asking different questions, even though they are using the same words. But still… how surprised should she be if she were to learn that today is Monday? It seems that taking your stance to its conclusion, the answer would be “zero surprise: she knew for sure she wou... (read more)

2Ape in the coat
There is no "but". As long as the Beauty is unable to distinguish between Monday and Tuesday awakenings, as long as the decision process which leads her to say the phrase "what day is it" works the same way, from her perspective there is no single "very moment she says that". On Tails, there are two different moments when she says this, and the answer is different for each of them. So there is no answer for her.

Yes, you are correct. From the position of the experimenter who knows which day it is, or who is hired to work only on one random day, this is a coherent question with an actual answer. The words we use are the same but the mathematical formalism is different.

For an experimenter who knows that it's Monday, the probability that today is Monday is simply:

P(Monday|Monday) = 1

For an experimenter who is hired to work only on one random day it is:

P(Monday|Monday xor Tuesday) = 1/2

Completely correct. The Beauty knew that she would be awakened on Monday either way, and so she is not surprised. This is a standard thing with non-mutually-exclusive events. Consider this: a coin is tossed and you are put to sleep. On Heads there will be a red ball in your room. On Tails there will be a red and a blue ball in your room. How surprised should you be to find a red ball in your room?

The appearance of a violation of conservation of expected evidence comes from the belief that awakening on Monday and on Tuesday are mutually exclusive, while they are, in fact, sequential.

I completely understand. It is counterintuitive because evolution didn't prepare us to deal with situations where the same experience is repeated while our memory is erased. As I write in the post: the whole paradox arises from this issue with our intuition, and just like with the incompleteness theorem (thanks for the flattering comparison, btw), what we need to do now is to re-calibrate our intuition, make it more accustomed to the truth preserved by the math, instead of trying to fight it.
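A tiny enumeration of the red-ball example above, added here purely as an illustration of the conservation-of-expected-evidence point (the setup is exactly the one described in the comment):

```python
from fractions import Fraction

# The two equally likely branches of the red/blue ball setup.
branches = {"Heads": {"red"}, "Tails": {"red", "blue"}}
prior = Fraction(1, 2)

# Probability of finding a red ball: it is present on both branches.
p_red = sum(prior for balls in branches.values() if "red" in balls)
print(p_red)  # 1 -> zero surprise

# Posterior on Heads after seeing the red ball: unchanged, since the
# observation was guaranteed either way.
p_heads_given_red = prior * ("red" in branches["Heads"]) / p_red
print(p_heads_given_red)  # 1/2
```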
Markvy40

Ah, so I’ve reinvented the Lewis model. And I suppose that means I’ve inherited its problem where being told that today is Monday makes me think the coin is most likely heads. Oops. And I was just about to claim that there are no contradictions. Sigh.

Okay, I’m starting to understand your claim. To assign a number to P(today is Monday) we basically have two choices. We could just Make Stuff Up and say that it’s 53% or whatever. Or we could at least attempt to do Actual Math. And if our attempt at actual math is coherent enough, then there’s an impli... (read more)

2Ape in the coat
Well, I think this one is actually correct. But, as I said in the previous comment, the statement "Today is Monday" doesn't actually have a coherent truth value throughout the probability experiment. It's not either True or False. It's either True, or True and False at the same time!

We can answer every coherently formulated question. Everything that is formally defined has an answer. Being careful with the basics allows us to understand which questions are coherent and which are not. This is the same principle as with every probability theory problem.

Consider the Sleeping Beauty experiment without memory loss. There, the event Monday xor Tuesday also can't be said to always happen. And likewise "Today is Monday" also doesn't have a stable truth value throughout the whole experiment.

Once again, we can't express the Beauty's uncertainty between the two days using probability theory. We just don't pay attention to it because, by the conditions of the experiment, the Beauty is never in such a state of uncertainty: if she remembers a previous awakening then it's Tuesday; if she doesn't, then it's Monday. All the pieces of the issue are already present. The addition of memory loss just makes it obvious that there is a problem with our intuition.
Markvy20

This makes me uncomfortable. From the perspective of sleeping beauty, who just woke up, the statement “today is Monday” is either true or false (she just doesn’t know which one). Yet you claim she can’t meaningfully assign it a probability. This feels wrong, and yet, if I try to claim that the probability is, say, 2/3, then you will ask me “in what sample space?” and I don’t know the answer.

What seems clear is that the sample space is not the usual sleeping beauty sample space; it has to run metaphorically “skew” to it somehow.

If the question were “did... (read more)

3Ape in the coat
Where does the feeling of wrongness come from? Were you under the impression that probability theory promised to always assign some measure to any statement in natural language? It just so happens that most of the time we can construct an appropriate probability space. But the actual rule is about whether or not we can construct a probability space, not about whether or not something is a statement in natural language.

Is it really so surprising that a person who is experiencing amnesia and the repetition of the same experience, while being fully aware of the procedure, can't meaningfully assign credence to "this is the first time I have this experience"? Don't you think there has to be some kind of problem with the Beauty's knowledge state? The situation where, due to memory erasure, the Beauty loses the ability to coherently reason about some statements makes much more sense than the alternative proposed by thirdism, according to which the Beauty becomes more confident in the state of the coin than she would've been if she didn't have her memory erased.

Another intuition pump is that "today is Monday" is not actually True xor False from the perspective of the Beauty. From her perspective it's True xor (True and False). This is because on Tails, the Beauty isn't reasoning just for some one awakening - she is reasoning for both of them at the same time. When she awakens the first time the statement "today is Monday" is True, and when she awakens the second time the same statement is False. So the statement "today is Monday" doesn't have a stable truth value throughout the whole iteration of the probability experiment. Suppose that the Beauty really does not want to make false statements. Deciding to say out loud "Today is Monday" leads to making a false statement in 100% of the iterations of the experiment where the coin is Tails.

Here you are describing Lewis's model, which is appropriate for the Single Awakening Problem. There the Beauty is awakened on Monday if the coin is Heads, and if the coin
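To make the truth-value point above concrete, here is a small enumeration (my illustration, not the author's):

```python
# Truth value of "Today is Monday" at each awakening, per branch of the coin.
truth_values = {
    "Heads": [True],         # a single Monday awakening
    "Tails": [True, False],  # Monday awakening, then Tuesday awakening
}

# A Beauty who asserts "Today is Monday" at every awakening makes at least
# one false statement in every Tails iteration, and none in Heads iterations.
for coin, values in truth_values.items():
    print(coin, "contains a false assertion:", not all(values))
```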
Markvy20

I tried to formalize the three cases you list in the previous comment. The first one was indeed easy. The second one looks “obvious” from symmetry considerations, but actually formalizing it seems harder than expected. I don’t know how to do it. I don’t yet see why the second should be possible while the third is impossible.

1Ape in the coat
Exactly! I'm glad that you actually engaged with the problem.

The first step is to realize that here "today" can't mean "Monday xor Tuesday", because such an event never happens. On every iteration of the experiment both Monday and Tuesday are realized. So we can't say that the participant knows that they are awakened on Monday xor Tuesday.

Can we say that the participant knows that they are awakened on Monday or Tuesday? Sure. As a matter of fact:

P(Monday or Tuesday) = 1

P(Heads|Monday or Tuesday) = P(Heads) = 1/2

This works: here the probability that the coin is Heads in this iteration of the experiment happens to be the same as what our intuition is telling us P(Heads|Today) is supposed to be. However, we still can't define "Today is Monday":

P(Monday|Monday or Tuesday) = P(Monday) = 1

Which doesn't fit our intuition.

How can this be? How can we have a seemingly well-defined probability for "Today the coin is Heads" but not for "Today is Monday"? Either "Today" is well-defined or it's not, right? Take some time to think about it. What do we actually mean when we say that on an awakening the participant is supposed to believe that the coin is Heads with 50% probability? Is it really about this day in particular? Or is it about something else?

The answer is: we actually mean that on any day of the experiment, be it Monday or Tuesday, the participant is supposed to believe that the coin is Heads with 50% probability. We can not formally specify "Today" in this problem, but there is a clever, almost cheating way to specify "Anyday" without breaking anything. This is not easy. It requires a way to define P(A|B) when P(B) itself is undefined, which is unconventional. But, moreover, it requires symmetry: P(Heads|Monday) has to be equal to P(Heads|Tuesday); only then do we have a coherent P(Heads|Anyday).
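A quick Monte Carlo sketch of the three per-iteration quantities stated above, treating one run of the experiment (not one awakening) as the unit, as the comment does:

```python
import random

n = 100_000
monday_or_tuesday = monday = heads_given_either = 0

for _ in range(n):
    heads = random.random() < 0.5
    days_awake = {"Monday"} if heads else {"Monday", "Tuesday"}

    if days_awake:                      # awakened on Monday or Tuesday (always true)
        monday_or_tuesday += 1
        heads_given_either += heads
    if "Monday" in days_awake:
        monday += 1

print(monday_or_tuesday / n)                   # 1.0  : P(Monday or Tuesday) = 1
print(heads_given_either / monday_or_tuesday)  # ~0.5 : P(Heads|Monday or Tuesday)
print(monday / monday_or_tuesday)              # 1.0  : P(Monday|Monday or Tuesday)
```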
Markvy10

I hope it’s okay if I chime in (or butt in). I’ve been vaguely trying to follow along with this series, albeit without trying too hard to think through whether I agree or disagree with the math. This is the first time that what you’ve written has caused me to go “what?!?”

First of all, that can’t possibly be right. Second of all, it goes against everything you’ve been saying for the entire series. Or maybe I’m misunderstanding what you meant. Let me try rephrasing.

(One meta note on this whole series that makes it hard for me to follow sometimes: you use a... (read more)

2Ape in the coat
I understand that it all may be somewhat counterintuitive. I'll try to answer whatever questions you have. If you think you have some way to formally define what "Today" means in Sleeping Beauty - feel free to try.

Seems very much in accordance with what I've been saying. Throughout the series I keep repeating the point that all we need to solve anthropics is to follow probability theory where it leads, and then there will be no paradoxes. This is exactly what I'm doing here. There is no formal way to define "Today is Monday" in Sleeping Beauty, so I simply accept this, as the math tells me to, and then the "paradox" immediately resolves.

Good question. First of all, as we are talking about betting, I recommend you read the next post, where I explore it in more detail, especially if you are not fluent in expected utility calculations. Secondly, we can't ignore the breach of the protocol. You see, if anything breaks the symmetry between awakenings, the experiment changes in a substantial manner. See Rare Event Sleeping Beauty, where the probability that the coin is Heads can actually be 1/3.

But we can make a similar situation without breaking the symmetry. Suppose that on every awakening a researcher comes to the room and proposes that the Beauty bet on which day it currently is. At which odds should the Beauty take the bet? This is essentially the same betting scheme as the ice-cream stand, which I deal with at the end of the previous comment.
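For readers who want the expected-utility arithmetic spelled out, here is one hedged way to set it up for the "which day is it" bet; the framing and the stakes below are my own illustration, not something taken from the linked post:

```python
from fractions import Fraction

def expected_value_per_iteration(stake, payout):
    """Per-iteration value of betting "today is Monday" at every awakening,
    risking `stake` to win `payout` each time."""
    heads_branch = payout            # one awakening (Monday): the bet wins
    tails_branch = payout - stake    # wins on Monday, loses on Tuesday
    return Fraction(1, 2) * (heads_branch + tails_branch)

print(expected_value_per_iteration(stake=2, payout=1))  # 0   -> staking 2 to win 1 breaks even
print(expected_value_per_iteration(stake=1, payout=1))  # 1/2 -> even-stakes bets on "Monday" profit
```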
Markvy33

I think this is much easier to analyze if you think about your plans before the experiment starts, like on Sunday. In fact, let’s pretend we are going to write down a game plan on Sunday, and we will simply consult that plan whenever we wake up and do what it says. This sidesteps the whole half vs third debate, since both sides agree about how things look before the experiment begins.

Furthermore, let’s say we’re going to participate in this experiment 100 times, just so I don’t have to deal with annoying fractions. Now, consider the following tentative g... (read more)

Markvy10

Here’s how I think of what the list is. Sleeping Beauty writes a diary entry each day she wakes up. (“Nice weather today. I wonder how the coin landed.”). She would like to add today’s date, but can’t due to amnesia. After the experiment ends, she goes back to annotate each diary entry with what day it was written, and also the coin flip result, which she also now knows.

The experiment is lots of fun, so she signs up for it many times. The Python list corresponds to the dates she wrote in her diary.
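A minimal sketch of what that annotated diary might look like as a Python list; the entries below are invented purely for illustration:

```python
# Each awakening produces one entry; the day and coin fields are filled in
# after the experiment ends, when she can finally tell which was which.
diary = [
    {"text": "Nice weather today. I wonder how the coin landed.",
     "day": "Monday",  "coin": "Tails"},
    {"text": "Nice weather today. I wonder how the coin landed.",
     "day": "Tuesday", "coin": "Tails"},
    {"text": "Nice weather today. I wonder how the coin landed.",
     "day": "Monday",  "coin": "Heads"},
]

days_written = [entry["day"] for entry in diary]  # the list referred to above
```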

Markvy50

I think that that’s what he meant: more aluminum in the brain is worse than less. What he was trying to say in that sentence is this: high levels in the blood may not mean high levels in the brain unless the blood level stays high for a long time.

3ChristianKl
That would suggest ignorance of the basics of toxicology. There are substances for which it's literally true that they only cause damage when they reach a certain level. Water, for example, is toxic at high levels but causes no damage at normal levels. That's because the body can process a certain amount of water at one time; you only cause damage if you consume more water than the body can handle.

On the other hand, the establishment opinion on radiation is "there's no safe level of radiation". There are some people who think the establishment position on radiation is wrong here and that radiation is only a problem when it's so strong that it overwhelms self-repair processes.

Damage due to aluminium could be in either class. It could be that a certain level is required to cause damage, or it could be a linear relationship. Claiming that the damage is small and claiming that there's no damage are two different claims.
Markvy31

Clear to me

Markvy10

“Bob isn't proposing a way to try to get less confused about some fundamental aspect of intelligence”

This might be what I missed. I thought he might be. (E.g., “let’s suppose we have” sounds to me more like a brainstorming “mood” than a solution proposal.)

Markvy85

This feels like a rather different attitude compared to the “rocket alignment” essay. They’re maybe both compatible but the emphasis seems very different.

3Rob Bensinger
Agreed! In terms of MIRI's 2017 'strategic background' outline, I'd say that these look like they're in tension because they're intervening on different parts of a larger plan. MIRI's research has historically focused on:

I.e., our perspective was something like 'we have no idea how to do alignment, so we'll fiddle around in the hope that new theory pops out of our fiddling, and that this new theory makes it clearer what to do next'.

In contrast, Bob in the OP isn't proposing a way to try to get less confused about some fundamental aspect of intelligence. He's proposing a specific plan for how to actually design and align an AGI in real life: "Let’s suppose we had a perfect solution to outer alignment. I have this idea for how we could solve inner alignment! First, we could get a human-level oracle AI. Then, we could get the oracle AI to build a human-level agent through hardcoded optimization. And then--"

This is also important, but it's part of planning for step 6, not part of building toward step 8 (or prerequisites for 8).
Answer by Markvy10

I normally am nervous about doing anything vaguely resembling making a commitment, but my curiosity is getting the better of me. Are you still looking for beta readers?

Answer by Markvy-10

And answer came there none?

1Ruby
So sad
Markvy10

Okay, so if the builder solution can't access the human Bayes net directly that kills a "cheap trick" I had.  But I think the idea behind the trick might still be salvageable.  First, some intuition:

If the diamond was replaced with a fake, and the owner asks, "is my diamond still safe?", and we're limited to a "yes" or "no" answer, then we should say "no".  Why?  Because that will improve the owner's world model, and lead them to make better predictions, relative to hearing "yes".  (Not across the board: they will be surprised to ... (read more)

MarkvyΩ010

I want to steal the diamond.  I don't care about the chip.  I will detach the chip and leave it inside the vault and then I will run away with the diamond.

Or perhaps you say that you attached the chip to the diamond very well, so I can't just detach it without damaging it.  That's annoying but I came prepared!  I have a diamond cutter!  I'll just slice off the part of the diamond that the chip is attached to and then I will steal the rest of the diamond.  Good enough for me :)

1shemetz
The implementation could possibly be extended to cover more weak points. For example, you could cover the diamond with additional chips on all sides. Or you could make the chip so fragile that it breaks when the diamond is affected by strong enough vibrations (as is likely with a diamond cutter). Or you could create more complex (but hard/impossible to tamper with) chips that continuously confirm stuff like "no object has come within 10cm of the diamond" or "the temperature remained regular" or "the weight on the pedestal is exactly X grams".

My main proposal here is the concept of having better sensors that can't have their data faked. I think with enough engineering effort you could cover enough "edge cases" that you can trust the AI system to predict robbery every time robbery happens, because a mistake/deception has improbably low odds of happening.
Markvy20

Man in the middle has 3 parties: Bob wants to talk to Alice, but we have Eve who wants to eavesdrop.

Here we have just 2 parties: Harry the human wants to talk to Alexa the AI, but is worried that Alexa is a liar.

Markvy10

Clarification request.  In the writeup, you discuss the AI Bayes net and the human Bayes net as if there's some kind of symmetry between them, but it seems to me that there's at least one big difference.

In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don't do that is because it's likely to be too big to make much sense of.

In the case of the human, we have no idea what the Bayes net looks like, because humans don't have that k... (read more)

1Markvy
Okay, so if the builder solution can't access the human Bayes net directly, that kills a "cheap trick" I had.  But I think the idea behind the trick might still be salvageable.  First, some intuition:

If the diamond was replaced with a fake, and the owner asks, "is my diamond still safe?", and we're limited to a "yes" or "no" answer, then we should say "no".  Why?  Because that will improve the owner's world model, and lead them to make better predictions, relative to hearing "yes".  (Not across the board: they will be surprised to see something shiny in the vault, whereas hearing "yes" would have prepared them better for that.  But overall accuracy, weighted by how much they CARE about being right about it, should be higher for "no".)

So: maybe we don't want to avoid the human simulator.  Maybe we want to encourage it and try to harness it to our benefit!  But how to make this precise?  Roughly speaking, we want our reporter to "quiz" the predictor ("what would happen if we did a chemical test on the diamond to make sure it has carbon?") and then give the same quiz to its model of the human.  The reporter should output whichever answer causes the human model to get the same answers on the reporter's quiz as the predictor gets.

Okay, that's a bit vague, but I hope it's clear what I'm getting at.  If not, I can try to clarify.  (Unless the vagueness is in my thoughts rather than in my "writeup"/paragraph.)  Possible problem: how on earth do we train in such a way as to incentivize the reporter to develop a good human model?  Just because we're worried it will happen by accident doesn't mean we know how to do it on purpose!  (Though if it turns out we can't do it on purpose, maybe that means it's not likely to happen by accident and therefore we don't need to worry about dishonesty after all??)
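A rough sketch of the "quiz" idea, just to pin down the vague part; every callable here (generate_quiz, predictor_answers, human_model_answers_after_report) is a hypothetical placeholder, and nothing below is part of the actual ELK setup:

```python
def choose_report(candidate_reports, observation,
                  generate_quiz, predictor_answers, human_model_answers_after_report):
    """Pick the report that best aligns the human model's quiz answers
    with the predictor's quiz answers."""
    quiz = generate_quiz(observation)              # e.g. "would a carbon test pass?"
    target = predictor_answers(observation, quiz)  # what the predictor expects

    def agreement(report):
        # How closely the human model, updated on this report, matches the predictor.
        human = human_model_answers_after_report(observation, report, quiz)
        return sum(h == t for h, t in zip(human, target))

    return max(candidate_reports, key=agreement)
```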
3paulfchristiano
We don't quite have access to the AI Bayes net---we just have a big neural network, and we sometimes talk about examples where what the neural net is doing internally can be well-described as "inference in a Bayes net." So ideally a solution would use neither the human Bayes net nor the AI Bayes net. But when thinking about existing counterexamples, it can still be useful to talk about how we want an algorithm to behave in the case where the human/AI are using a Bayes net, and we do often think about ideas that use those Bayes nets (with the understanding that we'd ultimately need to refine them into approaches that don't depend on having an explicit Bayes net).
2Jozdien
I think that there isn't much difference between the two in this case - I was reading the Bayes net example as just illustration for the point that any AI sufficiently powerful to pose risk would be able to model humans with high fidelity. For that matter, I think that the AI Bayes net was also for illustration, and realistically the AI could learn other methods of reasoning about the world, which maybe include Bayes nets in some form.

I think we can't assume this naively, but that if you can figure out a competitive and trustworthy way to get this (like with AI assistance), then it's fair game.