The distinction between what might be called "lying" and "bullshitting" is important here, because they scale with competence differently.
It was pretty interesting watching this develop in my kids. Saying "No!" to "Did you take the cookie from the cookie jar?" is the first thing you get, because it doesn't require a concept for "truth" at all. Which utterance postpones trouble? Those are the sounds I shall make!
Yet for a while my wife and I were in a situation where we could just ask our younger kid about her fight with our older kid, because the younger kid did not have a concept for fabricating a story in order to mislead. She was developed enough to say "No!" to things she knew she did, but not developed enough to form the intention of misleading.
The impression I get from LLMs is that they're bullshitters. Predict what text comes next. Reward the ones that sound good to some dumb human, and what's gonna come out? We don't need to postulate an intent to mislead, we just need to notice that there is no robust intent to maintain honest calibration -- which is hard. Much harder than "output text that sounds like knowing the answer".
It takes "growing up" to not default to bullshit out of incompetence. Whether we teach them they'll be rewarded by developing skillful and honest calibration, or by intentionally blowing smoke up our ass is another question.
I like this post because it takes things you can only learn by "actually doing things", and then presents them in ways that can be generalized.
My above description is false, actually. I've been saying that you are trying to hit the limit without going over. Actually, fast drivers hover at the limit. They oscillate between a little bit under and a little bit over. [...]
They find the limit by probing for it, dancing at it.
This part in particular, because the default assumption is "Oh no, can't cross the limit!", yet this is true about a lot of things.
Also, even if you're just driving to visit your grandma and not pushing the limits of traction, a traction-aware driver will drive differently than your average driver. For example, it's quite common for people to approach a red light at their current driving speed, only to start braking harder and harder at the end. Which is a foolish use of the safety margin, and also slower than braking gently and early, since the gentle braker is more likely to still have momentum when the light turns green.
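To put rough numbers on that, here's a toy kinematics sketch of my own; every figure in it is invented purely for illustration:

```python
# Toy comparison of two drivers approaching a red light (all numbers invented
# for illustration): both start 120 m out at 15 m/s, and the light happens to
# turn green 10 s later.

def state(t, v0, t_brake, decel):
    """Position and speed at time t for a driver who coasts, then brakes at t_brake."""
    if t <= t_brake:                      # still coasting
        return v0 * t, v0
    dt = min(t - t_brake, v0 / decel)     # braking time, clamped once fully stopped
    pos = v0 * t_brake + v0 * dt - 0.5 * decel * dt ** 2
    vel = max(0.0, v0 - decel * (t - t_brake))
    return pos, vel

v0, green_at = 15.0, 10.0
hard = state(green_at, v0, t_brake=6.67, decel=5.6)   # waits, then brakes hard
soft = state(green_at, v0, t_brake=0.0,  decel=0.94)  # brakes gently the whole way

print(f"late hard braker:  {hard[0]:.0f} m travelled, {hard[1]:.1f} m/s left")
print(f"early soft braker: {soft[0]:.0f} m travelled, {soft[1]:.1f} m/s left")
# late hard braker:  120 m travelled, 0.0 m/s left  (stopped dead at the line)
# early soft braker: 103 m travelled, 5.6 m/s left  (still rolling when it turns green)
```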
The problem with talking about things is that we don't really have a good shared ontology of how "preferences"/"desires"/"values"/etc work, and they don't work the way people think they do.
Basically everything is way more context dependent than anyone realizes -- as in, "I only wanted to go to the store because I thought it had the food I wanted", to give a trivial example. But that food you had a preference for is subject to change as your body's needs for nutrients change. Even things like people's identities as "asexual" or "straight" are prone to update with the evidence we come across.
So then you try to say "Well, that's 'tastes', when I talk about 'values' I mean things like 'autonomy'". Except that kind of thing is merely instrumental as well -- stabilized by motivated blind spots about how useless autonomy can be in the right contexts. And then the right contexts come along, and your "values" shift. Which can sound like "Oh no! Value drift!" from the outside, but once you get there, it's just "Oh no, that store was closed. It's my recognition of that which has changed".
Then you try to retreat to "Okay, but pain is bad. Like, by definition!". Except it isn't, because masochists. Which aren't even uncommon, with how many people like spicy food, and hard massages that "hurt in a good way".
The last step seems to be avoidance of suffering, saying "Ah, right, pain isn't suffering, but suffering is the definition of bad!". Except we choose that too! Suffering is what we choose in order to stave off the loss of hope. Often without realizing it, so we can get stuck with unproductive suffering which really is good to eliminate, but it's something we choose nonetheless. And becoming conscious of it can allow more deliberate choice between hopelessness and continued suffering.
The whole thing is hard to make sense of, so it's kinda "Of course people are going to use terms in unclear and conflicting ways". When you say people should talk about things like "Their own preferences", are you referring to their preference to go to the store, or to eat the food that they believe the store has for them? Or something upstream of that? When you talk about "normative values", what the heck is that, exactly? If it's "The thing that we should value", then what exactly is that 'should' being used to distract from? Do we have any shared and accurate idea of what this means, descriptively speaking?
I think we need more deliberate study of how human tastes/desires/wants/values/etc change or don't, before we're going to have smooth hiccup-free communication on the topic. I agree with you that these terms conflate things, but I don't think we have the option of not conflating things yet. So I'm nudging away from "Just use clear language and then everything will be clear" and towards "notice what your concepts might be hiding, and how much ambiguity is necessarily left".
While I understand the frustration, I'd rather have more hobby horse riders here. If I ever say something to inspire the charge of a hobby horse, I want that correction.
Because I might get lazy. Or imprecise. The "correction" might be something I immediately recognize as "obviously true", and want to say "Yeah yeah, that's what I meant". But it might not be what I said, and I may have been underweighting the importance of that little "nitpick" when I was writing. After all, that's why there's the charging of the hobby horse; the other person doesn't think it's some unimportant nitpick. And neither do the LW voters, in the cases you highlight.
Maybe it's not.
If we try to discourage people from correcting real errors or misleading representations in the text, simply because the person pointing it out is unusually perceptive in this area, or is unusually aware of the importance of this kind of mistake, then we are in effect saying that we don't want to hear from people who are uniquely suited to correcting specific errors. "Sorry, Eliezer, you've been riding this AI hobby horse too much. We agree that making an unfriendly superintelligence would be bad, which is why we're going to make it friendly. Can't we move on and build it now?".
That doesn't cut it when the issue actually is important, and often the awareness of these things falls on few people. "What is a woman?" exploded into such a huge issue that I'm glad we have our resident "hobby horse rider" here, with skin in the game, motivated to do very careful thinking and call out what he sees to be errors on our part. If he's wrong he's wrong, which is a different criticism. If he's right though, I'd rather amend or clarify my writing to the satisfaction of the person who makes getting this particular thing right their thing. It might save me from mistakes I don't properly appreciate.
The qualifier "to the satisfaction of the other person" is important here. I know you think you've gotten things close enough. Likely so do the other authors in your examples. I also know that the hobby horse riding commenters disagree, and so does the audience -- at least in these cases. And that if you can't pass their ITT you can't know if you're missing something that validates their perspective and invalidates yours. And that if you can, they won't continue to think you don't get it, and therefore won't have reason to post those "unnecessary" comments.
I wonder how often questions like "What makes one race car driver faster than another" have a different answer from "What makes all race car drivers way faster than you".
I know from experience that "riding the limits of traction" is the first 90% that most people don't get, but how often is the last 10% just chasing diminishing returns on the same thing, and how often is it a completely new skill that becomes relevant once you handle the "easy part"?
For example, with long range rifle shooting, the answer to the former is "reading wind". But if you simply hand a rifle to someone who has never shot before, wind won't be the reason they miss. They still have to learn how to stabilize a rifle, calculate drop, etc.
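To gesture at what the "calculate drop" part even involves, here's a back-of-the-envelope sketch -- idealized vacuum ballistics with a made-up muzzle velocity, so real drop tables that account for drag will differ, but the shape of the problem is the point:

```python
# Back-of-the-envelope bullet drop, ignoring drag (vacuum ballistics).
g = 9.81       # m/s^2
v = 800.0      # muzzle velocity in m/s (assumed for illustration)

for distance in (100, 500, 1000):   # metres to target
    t = distance / v                 # time of flight, ignoring drag
    drop = 0.5 * g * t ** 2          # how far the bullet falls in that time
    print(f"{distance:>5} m: ~{100 * drop:.0f} cm of drop")
```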
But yes, interesting set of questions either way.
This is a normal consequence of intending at a level that requires more control than we actually have. Which is a normal consequence of not yet perceiving the interrelation and structure of expectation and control.
When we control things, the effect of our control is to make our desired outcome expected -- for if we can't hit the center of the target even in expectation, then by definition we aren't in control. "Expecting" an outcome goes hand in hand with aiming to "control to" or "manifest" an expectation.
When the room is too cold, we think "Brr... it shouldn't be this cold in here!" and then go turn the heat up until the room's temperature meets our expectations. Okay, fine.
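A toy feedback loop makes the point concrete (my own sketch, with arbitrary numbers): the whole function of the control action is to make "I expect the room to be 21 C" a correct expectation.

```python
# Toy thermostat loop (arbitrary numbers): controlling the temperature is
# exactly what makes the expected temperature match the setpoint.
setpoint, temp = 21.0, 17.0              # target and current temperature, degrees C
for _ in range(20):
    error = setpoint - temp              # "it shouldn't be this cold in here!"
    heat = max(0.0, 0.5 * error)         # turn the heat up in proportion to the gap
    temp += heat - 0.01 * (temp - 10.0)  # heating, minus a little loss to outside
print(round(temp, 1))                    # settles near 21 -- the expectation now holds
```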
But then what happens when your mom might have cancer?
You've been expecting her to not have cancer, and you want to be able to keep this expectation because who wants their mom to have cancer? So you might focus on the desired world state where your mom has no cancer, acting to do what you can to bring it about. You focus on manifesting no cancer in the biopsy -- and know this will fail, so you get this error signal that tells you it's not working in expectation. And then often in reality.
This resistance to letting go comes because we have something to lose. And there's something to fighting this fight. "Everything I've ever let go of has claw marks on it."
At the same time, it doesn't always work. And the suffering it entails points to our expectations actually being wrong. We're strongly expecting to not see cancer in the biopsy AND we know that this expectation is likely to be falsified. That hint we can update on.
I wish I could have certainty that my mom doesn't have cancer. Of course I wish that. Who wouldn't? At the same time, my mom might actually have cancer, and there ain't shit I can do about what's already true.
What I can do, is make sure her life does not get cut short unnecessarily. Not "My mom doesn't have cancer [dammit!]", but "My mom is going to live as long, healthy, and happy a life as is absolutely possible. Because I'm going to make sure of it". I'm sure you, too, want to make sure your mom lives as long, healthy, and happily as absolutely possible. And you can act so as to make sure she does.
When that's your frame, where's the spider?
How do you feel about checking the biopsy, now?
For that matter, how do you feel about not checking the biopsy now?
Interesting, right?
So what do you do about the growing aversion to information which is unpleasant to learn?
To answer this directly, I notice. Like, really notice, and sit with it, and then notice what changes as a result as I realize what the implications are and allow the updates to flow through me.
Not "notice-and-then-do-this-instead!" because that's often prematurely jumping to try to a control a thing with insufficient perspective, when the problem itself is caused by trying to jump too quickly to control a thing without sufficient perspective.
So step one is to notice.
And to actively monitor whether I'm trying to "do something about it!", because I already know I don't want to jump to that. Not that I want to "Do-something-about-trying-to-do-something!", just "I don't want to do things that are stupid, lol".
Notice what the existence of this ugh field is telling me. Okay, I already know my expectations are bad. They won't be fulfilled, in my already existing meta-expectation.
What changes?
What doesn't?
Specifically, I look to what I'm realizing I can't control, and to what of value I still can control. And then reorient to that, so that I stop putting ineffectual claw marks on the thing that's a goner at the expense of attending to what can still be saved.
So, "Hm. I notice that I don't want to see what's in this email, because I already suspect it will be what I don't want to see. Okay, what don't I want to see. Okay, yeah, I don't want to see that. Of course I don't want to see that. What if I do see that? What might I want to do about that"?
Maybe, "Why does it seem like whatever I do, people will get pissed at me?". "Is that actually true?". "If not, what kind of unseen-stupid am I being to systematically fail like this?". "If so, is that okay?".
The exact sequence and form might change, but the underlying theme is to be really attentive to what feedback I'm getting and where I might be flinching away from updating on this feedback, because all of this struggle results from failing to attend to something with the attention it deserves. The model I'm comparing to, to highlight sources of error, is one where my expectations aren't predictably violated, there's no innate tension underlying everything as a result, and any tension gets released by retreating from obstinate control towards more nuanced and obtainable goals after grieving what must be grieved -- and not what must not.
I see the point you're getting at, and I agree that there's a real failure mode here that I've been annoyed by in similar ways. Heck, I kinda think it's silly for people to show up to promotions to receive the black belt they earned, but that's a separate topic.
At the same time, there's another side of this which is important.
At my jiu jitsu gym there's a new instructor who likes doing constraint-led games. One of these games had the explicit goal of "get your opponent's hands to the mat" with the implicit purpose of learning to off-balance the top player. I decided to be a little munchkin and start grabbing people's hands and pulling them to the mat even when they had a good base.
I actually did get social acclaim for this. The instructor thought that was awesome, and used it as an example of how he wanted people to play the games. In his view, as in mine, the point of the game is to explore how you can maneuver to win at the game as specified, without being restrained by artificial limitations which really ought to be accounted for in the game design.
If the new instructor had tried to lecture us about playing to some underspecified "spirit" of the rules instead of the rules as he described them -- and about how we're not earning social points with him for gaming the system -- and was visibly annoyed about this... he would have been missing the point that he's not earning social points with me, and likely not with the others either. And I wouldn't much care for winning points with him, if that's how he were to respond. It's a filter. A feature, not a bug.
Breaking the game is to be encouraged, and if playing the game earnestly doesn't suit the intended purpose, "don't hate the player, hate the game". In his case, the game wasn't broken badly enough to ruin it, so it turned out to be more fun and probably more useful than I had anticipated. Maybe it wasn't quite optimal, but it was playable for sure. In your case, the broken game is the sign that calibration isn't what we care about -- because that annoying shit was calibrated, and you weren't happy about it. What we need is a better scoring rule that weights calibration appropriately. Which exists!
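For concreteness, here's a minimal sketch of the kind of scoring rule I mean -- the log score, a standard proper scoring rule; the probabilities below are just illustrative:

```python
# Minimal example of a proper scoring rule (the log score) for a yes/no question.
import math

def log_score(reported_p, outcome):
    """Log score for a binary forecast; higher is better."""
    return math.log(reported_p if outcome else 1.0 - reported_p)

def expected_score(reported_p, true_p):
    """Average score if the event really happens with probability true_p."""
    return true_p * log_score(reported_p, True) + (1 - true_p) * log_score(reported_p, False)

true_p = 0.7
for reported in (0.5, 0.7, 0.9, 0.99):
    print(f"report {reported:.2f}: expected score {expected_score(reported, true_p):+.3f}")
# The expected score peaks when you report your actual credence (0.7 here), so
# under this rule, gaming the system and honest calibration are the same move.
```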
Any time we find ourselves annoyed, there is a learning opportunity. Annoyance is our cue that reality is violating our expectations. It's a call to update.
Larger effects are easier to measure, and therefore quicker to update on. I didn't take concerns of "too much sweets" very seriously, so I had no restraint whatsoever.
The clearest updates came after wildly overconsuming while also cutting weight. I basically felt like shit, which is probably a much exaggerated "sweet tired", and never ate Swedish Fish again. And Snickers bars before that.
Since then the updates have been more subtle and below the level of what's easy to notice and keep good tabs on, but yes "sweet tired". Just generally not feeling satisfied and fulfilled, and developing more of that visceral distaste for frosting that you have as well, until sweets in general have a very limited place in my desires.
It's not a process like "Oh, I felt bad, so therefore I shall resist my cravings for sugar", it's "Ugh, frosting is gross" because it tastes like feeling tired and bad.
That's the right first question to consider, and it's something I was thinking about while writing that comment.
I don't think it's quite the right question to answer though. What I'm doing to generate these explanations is very different from "Go back to the EEA, and predict forward based on first principles", and my point is more about why that's not the thing to be doing in the first place than about the specific explanation for the popularity of ice cream over bear fat.
It can sound nitpicky, but I think it's important to make hypotheticals concrete because a lot of the time the concrete details you notice upon implementation change which abstractions it makes sense to use. Or, to continue the metaphor, picking little nits when found is generally how you avoid major lice infestations.
In order to "predict" ice cream I have to pretend I don't already know things I already know. Which? Why? How are we making these choices? It will get much harder if you take away my knowledge of domestication, but are we to believe these aliens haven't figured that out? That even if they don't have domestication on their home planet, they traveled all this way and watched us with bears without noticing what we did to wolves? "Domestication" is hindsight in that it would take me much longer than five minutes as a cave man to figure out, but it's a thing we did figure out as cave men before we had any reason to think about ice cream. And it's it's sight that I do have and that the aliens likely would too.
Similarly, I didn't come up with the emulsification/digestion hypothesis until after learning from experience what happens when you consume a lot of pure oils by themselves. I'm sure a digestion expert could have predicted the result in advance, but I didn't have to learn a new field of expertise because I could just run the experiment and then the obvious answer becomes obvious. A lot of times, explanations are a lot easier to verify once they've been identified than they are to generate in the first place, and the fact that the right explanations come to mind vastly more easily when you run the experiment is not a minor detail to gloss over. I mean, it's possible that Zorgax is just musing idly and comes up with a dumb answer like "bear fat", but if he came all this way to get the prediction right you bet your ass he's abducting a few of us and running some experiments on how we handle eating pure fat.
As a general rule, in real life, fast feedback loops and half decent control laws dominate a priori reasoning. If I'm driving in the fog and can't see but 10 feet ahead, I'm really uninterested in the question "What kind of rocks are at the bottom of the cliff 100 feet beyond the fog barrier?" and much more interested in making sure I notice the road swerving in time to keep on a track that points up the mountain. Or, in other words, I don't care to predict which exact flavor of superstimuli I might be on track to overconsume, from the EEA. I care to notice before I get there, which is well in advance given how long ago we figured out domestication. I only need to keep my tastes tethered to reality so that when I get there ice cream and opioids don't ruin my life -- and I get to use all my current tools to do it.
I think this is the right focus for AI alignment too.
The way I see it, Eliezer has been making a critically important argument that if you keep driving in a straight line without checking the results, you inevitably end up driving off a cliff. And people really are this stupid, a lot of times. I'm very much on board with the whole "Holy fuck, guys, we can't be driving with a stopping distance longer than our perceptual distance!" thing. The general lack of respect and terror is itself terrifying, because plenty of people have tried to fly too close to the sun and lost their wings because they were too stupid to notice the wax melting and descend.
And maybe he's not actually saying this, but the connotations I associate with his framing, and more importantly the interpretation that seems widespread in the community, are that "We can't proceed forward until we can predict vanilla ice cream specifically, from before observing domestication". And that's like saying "I can't see the road all the way to the top of the mountain because of fog, so I will wisely stay here at the bottom". And then feeling terror build from the pressure from people wanting to push forward. Quite reasonably, given that there actually aren't any cliffs in view, and you can take at least the next step safely. And then reorient from there, with one more step down the road in view.
I don't think this strategy is going to work, because I don't think you can see that far ahead, no matter how hard you try. And I don't think you can persuade people to stop completely, because I think they're actually right not to.
I don't think you have to see the whole road in advance because there's a lot of years between livestock and widespread ice cream. Lots of chances to empirically notice the difference between cream and rendered fats. There's still time to see it millennia in advance.
What's important is making sure that's enough.
It's not a coincidence that I didn't get to these explanations by doing EEA thinking at all. Ice cream is more popular than bear fat because it is cheaper to produce now. It's easier to digest now. Aggliu was concerned with parasites this week. These aren't things we need to refer to the EEA to understand, because they apply today. The only reason I could come up with these explanations, and trivially, is because I'm not throwing away most of what I know, declining to run cheap experiments, and then noticing how hard it is to reason 1M years in advance when I don't have to.
The thread I followed to get there isn't "What would people who knew less want, if they suddenly found themselves blasted with a firehose of new possibilities, and no ability to learn?". The thread I followed is "What do I want, and why". What have I learned, and what have we all learned. Or can we all learn -- and what does this suggest going forward? This framing of people as agents fumbling through figuring out what's good for them pays rent a lot more easily than the framing of "Our desires are set by the EEA". No. Our priors are set by the EEA. But new evidence can overwhelm that pretty quickly -- if you let it.
So for example, EEA thinking says "Well, I guess it makes sense that I eat too much sugar, because it's energy which was probably scarce in the EEA". Hard to do the experiment, not much you can do with that information if it proves true. On the other hand, if you let yourself engage with the question "Is a bunch of sugar actually good?", you can run the experiment and learn "Ew, actually no. That's gross" -- and then watch your desires align with reality. This pays rent in fewer cavities and diabetes, and all sorts of good stuff.
Similarly, "NaCl was hard to get in the EEA, so therefore everyone is programmed to want lots of NaCl!". I mean, maybe. But good luck testing that, and I actually don't care. What I care about is knowing which salts I need in this environment, which will stop these damn cramps. And I can run that test by setting out a few glasses of water with different salts mixed in, and seeing what happens. The result of that experiment was that I already knew which I needed by taste, and it wasn't NaCl that I found my self chugging the moment it touched my lips.
Or with opioids. I took opioids once at a dose that was prescribed to me, and by watching the effects learned from that one dose "Ooh, this feels amazing" and "I don't have any desire to do that again". It took a month or so for it to sink in, but one dose. I talked to a man the other day who had learned the same thing much deeper into that attractor -- yet still in time to make all the difference.
Yes, "In EEA those are endogenous signaling chemicals" or whatever, but we can also learn what they are now. Warning against the dangers of superstimuli is important, but "Woooah man! Don't EVER try drugs, because you're hard coded by the EEA to destroy your life if you do that!" is untrue and counter productive. You can try opioids if you want, just pay real close attention, because the road may be slicker than you think and there are definitely cliffs ahead. Go on, try it. Are you sure you want to? A lot less tempting when framed like that, you know? How careful are you going to be if you do try it, compared to the guy responding "You're not the boss of me Dad!" to the type of dad who evokes it?
So yes, lots of predictions and lots of rent paid. Just not those predictions.
Predictions about how I'll feel if I eat a bowl full of bear fat the way one might with ice cream, despite never having eaten pure bear fat. Predictions about people's abilities to align their desires to reality, and rent paid in actually aligning them. And in developing the skill of alignment so that I'm more capable of detecting and correcting alignment failures in the future, as they may arise.
I predict, too, that this will be crucial for aligning the behaviors of AI. Eliezer used to talk about how a mind that can hold religion fundamentally must be too broken to see reality clearly. So too, I predict, a mind that can hold a desire for overconsumption of sugar must necessarily lack the understanding needed to align even more sophisticated minds.
Though that's one I'd prefer to heed in advance of experimental confirmation.
Any thoughts on what to do if "just explain it to someone" turns into a long back and forth dialog?