All of Ian McKenzie's Comments + Replies

One thing is that even given access to the model weights and the code behind the API, you could not tell if the model was password-locked, whereas you would see the hardcoded verifier. Thus if a lab wanted to hide capabilities they could delete the training data and you would have no way of knowing.

The Wikipedia article has a typo in one of these: it should say "I am sparkling; you are unusually talkative; he is drunk." (as in the source)

We did do human validation on the tasks with Surge: redefine-math, quote-repetition, and hindsight-neglect all got 100% agreement, and NeQA got 98% agreement. I agree though that it seems likely many people would do the task ‘wrong’, so maybe the task would be improved by adding clearer instructions.

The situation feels somewhat like model splintering to me: the few-shot examples fit both patterns but the question doesn’t. The larger models are learning the incorrect generalization.

I think it’s important to note that LMs learning to respond in the same way ... (read more)

I think 75% is 1:3 rather than 1:2.

2NunoSempere
Whoops, changed

A couple of things that come to mind:

  • You'll run into ELK-like difficulties when trying to train B: maybe it looks to the human that the off switch is still accessible but in reality you would not be able to activate it for some reason -- A steers into a Faraday cage or something more clever.

  • (In response to your reply to acylhalide) Assuming you solve the training difficulties, B has to be at least as good at modelling the world as A: it needs to be able to predict the consequences of action sequences from A to see if the off switch is accessible.

  • (

... (read more)
Answer by Ian McKenzie10

Is your suggestion to run this system as a source of value, simulating lives for their own sake rather than to improve the quality of life of sentient beings in our universe? Our history (and present) aren't exactly utopian, and I don't see any real reason to believe that slight variations on it would lead to anything happier.

I think we can expect to achieve a lot more than that from a properly aligned AGI. There is so much suffering that could be alleviated right now with proper coordination, as a lower bound on how much better it could be than just effectively running copies of our timeline but at lower resolution.

1Maxime Riché
I am thinking about if we should reasonably expect to produce better result by trying to align an AGI with our value than by simulating a lot of alternate universes. I am not saying that this is net-negative or net-positive. It seems to me that the expected value of both cases may be identical. Also by history, I also meant the future, not only the past and present. (I edited the question to replace "histories" by "trajectories")

Is the "Going Beyond Agents" section part of CHAI's research agenda, or your take on further challenges from an embedded agency perspective?

2Alex Flint
That part is my take on further challenges from an embedded agency perspective.

In the example with Bob, surely the odds of Bob having a crush on you after winking (2:1) should be higher than a random person winking at you (given as 10:1), as we already have reason to suspect that Bob is more likely to have a crush on you than some random person not part of the six.

0Alexander Budkov
You are right. Seems like there is an error in this example and a main problem is not with a prior 1:5 odds, problem is with a bad phrasing and confusion between "crush when winked" odds and "wink likelihood ratio". Is a statement about likelihood ratio (or at least can be interpreted that way) - P(wink|crush):P(wink|!crush)=10:1 And in final calculation likelihood is used and its correct according to Bayes Rule While a statement is a statement about odds P(crush|wink):P(!crush|wink)=10:1 and to apply a Bayes rule to it as if it is ratio would be a mistake, but i'm guessing it's just an author's error in phrasing of this statement. 
2tmercer
This is obviously correct. The error was that Rob interpreted the evidence incorrectly. Getting an anonymous letter DOES NOT inform a rational mind that Bob has 1:5 odds of crushing. It informs the rational mind that AT LEAST ONE of the 6 classmates has a crush on you. It DOES NOT inform a rational mind that 5 of the 6 classmates DO NOT have a crush on you. I also hated this. Obviously, two pieces of evidence should make Bob MORE LIKELY to have a crush on you that one. There's no baseline rate of humans having a crush on us, so the real prior isn't in the problem.
2jeronimo196
What Liliet B said. Low priors will screw with you even after a "definitive" experiment. You might also want to take a look at this: https://www.lesswrong.com/posts/XTXWPQSEgoMkAupKt/an-intuitive-explanation-of-bayes-s-theorem
3Liliet B
Not if you consider that the 1:5 figure constrains that ONLY one person among the six has a crush on you. If you learn for a fact one does, you'll also immediately know the others all don't. Which is not true for a random selection of students - you could randomly pick six that all have a crush on you. Bob belongs to a group in which you know for a fact five people DON'T have a crush on you. So you have evidence lowering Bob's odds relative to a random winker. Either that, or it doesn't matter how many actually have a crush on you, you're looking for the specific one you have definite evidence about. For thisy, a random winker is not qualified to enter the comparison at all - if they're not one of the six, they're not the person you're looking for. So Bob might have a crush on you AND not be the person you're looking for, although his odds are higher than those of the other five you don't have any evidence about. That's the interpretations that make the math not wrong, anyway. If you only know that "at least one of them has a crush on me" and more than one could potentially satisfy your search criteria, the 1:5 figure is not the right odds.