LESSWRONG

All of benwr's Comments + Replies

Note that we made all our code available (https://github.com/PalisadeResearch/shutdown_avoidance/) and it's pretty easy to run exactly our code but with your prompt, if you'd like to avoid writing your own script. You need a docker runtime and a python environment with the dependencies installed (and it's easier if you use nix though that's not required), but if you have those things you just have to modify the conf.py file with your prompt and then do something like run --model openai/o3.

There are lots of reasons that a "survival drive" is something we we... (read more)

1ankitmaloo2d

I have read the paper; it models AI on the basis of rational humans and then compares the two. That is kind of worth doing if you start with the assumption that AI systems are modeled on the rational behavior of humans. I kind of think that it should only come up if a model can directly learn from experience, generalize across domains, and take rational / computed decisions based on what it learned from the environment and interaction, like humans do. (This is not as strongly held, and I don't have a generic definition of AGI yet.) Rewriting the shutdown problem is a much more interesting failure mode. The reason "sabotage" is over the top is that it implies deliberate and malicious intent. You can justify both in this case, but I still feel it proves too much (need to find Scott's post to link here, sorry). This is not just semantics. Sabotage is a willful act, often premeditated. For humans, if you misunderstand and shut down the power supply to a city, it is an accident. If you do it as instructed, it's instruction following. If you do it with intent to damage something or to knowingly hurt someone, it will be called sabotage. Using the word reflects a tendency to attribute human traits to a machine, and it only works iff we assume there is a desire. It helps make the behavior more relatable, but it carries equal chances of misunderstanding as well. Similar words could be anomaly (too vague), cascading process failure, configuration drift, bug (shutdown prevention bug), error, etc., which communicate the same thing without the relatability aspect. A counter case I can think of where this does not apply: a model somehow learns and actively 'hides' behavior during testing, but exhibits it once deployed in a live environment. Even here intent would originate from the human trainer building the model.

Shutdown Resistance in Reasoning Models

benwr3d40

I weak-downvoted this comment; I did so for the following reasons:

(a) "Reasoning models as trained do not follow a typical instruction hierarchy of a chat template." doesn't seem true if you mean that they're not intended to follow such a hierarchy (see the link to the system card in the OP), and if you mean that they don't follow it despite being meant/designed to, I don't really see much evidence presented (and don't have any background knowledge that would suggest) that the "Goal, Outcome, Warning, Context" sequence is "bound to" give consistent results... (read more)

2Raemon2d

(hurray for thoughtful downvote explanations)

1ankitmaloo2d

re a, what I meant was that even in a prompt (be it system prompt or user instructions), the order in which you place the instructions matters. In typical chat prompts (based on discussions and some papers I will have to find since this is old) the format was that the last instruction would be given the highest priority. That is also the case when you think of a typical chat back and forth, where the last instructions have to be given the highest priority. In reasoning models that template changes because these are reporting models, not suitable for a back and forth chat interaction. This was not a comment on OP; it does not matter much if you put these in the system prompt or the user prompt, with the format you are unlikely to get the kind of results OP got. Seems like the top comment did something similar. The linked post was not an explanation of what I did. It was an explanation of how to correctly write the prompt and how I got the template based on the writer. I tested the same thing on my local machine with 15 attempts and got a shutdown every single time (attempts vary at the 3rd step or the 5th step). It's not as scientific, since N=15 is a low number. What I tested was this: I took your prompt and the idea. I had two tools: one was search, and the second was an exec function which can execute any python code. I gave typically smaller math problems, and then /next_question leads to the next question as you did on the terminal. I would try replicating the whole setup later when I have time. As to why 15: I was doing it manually, and asking 60-70 questions with no difference got kind of tedious. So I thought of writing a script to give automated questions and monitor for the result, which I plan to do later tonight based on bandwidth. I am still curious about why directly go to survival mode as a hypothesis. I think survival mode comes up if you assume a machine is conscious in the sense that it has a goal of self preservation. Conscious is probably not the most accurate word here, but the cor

"Buckle up bucko, this ain't over till it's over."

benwr4d30

Right, I think it just seems like doing emotional preparation that matches this description is a kind of earthbender-friendly / earthbender-assuming move, while an airbender-friendly move would be more like "notice and accept that you'd have more fun doing it a different way or doing a different thing; that flailing isn't actually fun". The effect is kind of similar, i.e. both earthbenders and airbenders should come away less-clinging-to-something, but the earthbender comes away less-clinging-to "the locally easy and straightforward things will work if I d... (read more)

2Raemon4d

mm, okay yeah the distinction of different-ways-to-cling-less seems pretty reasonable.

"Buckle up bucko, this ain't over till it's over."

benwr4d20

Yeah I can try to say some of them, though my sense of the crucialness here does shift on learning that you mean for this to be about hour-to-week levels of effort. I guess I may as well try to come up with element-bending-flavored ones since I'm in pretty deep on that metaphor here.

The biggest differences in which one I'd recommend as a "default last resort" depend on the person and their strengths rather than the situation.

The thing recommended for the airbender above, i.e. "retreat to a safe distance and consider if there's anything that seems lik

... (read more)

3Raemon4d

Nod, those all seem like good moves. I'm sort of torn between two more directions: On one hand, I actually didn't really mean "buckle up" to be very specific in terms of what move comes next. The most important thing is recognizing "this is a hard problem, your easy-mode cognitive tools probably won't work." (I think all the moves you list there are totally valid tools to bring to bear in that context, which are all more strategic than "just try the next intuitive thing") On the other hand... the OP does sure have a vibe about particular flavors of problem/solution, and it's not an accident that I wrote a thing that resonates with me and that you feel wary of. But, leaning into that... I'm a bit confused why the options you list here are "last resorts" as opposed to "the first thing you try once noticing the problem is hard". Like the airbender should be looking for a way for it to feel fun pretty early in the process. The "last resort" is whatever comes after all the tools that came more naturally to the airbender turn out not to work. (Which is in fact how Aang learned Earthbending). ((notably, I think I spent most of my life more airbendery. And the reason for the past ~2 years of me focusing on techniques that involve annoying effort is that the non-annoying-brute-force-y techniques weren't solving the problems I wanted to solve.)) But I think the first hand is more the point – this is less about "the next steps will involve something annoying/power-through-y" and more "I should probably emotionally prepare for the possibility that the next steps will involve something annoying/power-through-y"

"Buckle up bucko, this ain't over till it's over."

benwr4d10

I think I maybe mean to say a slightly different thing than came across, which makes sense because I was leaning heavily into the metaphor rather than trying to be very clear.

I think the triggers are definitely hints in the direction that buckling up might be the right move. Yet I also observe that, when I imagine the median or even 80th percentile person-explicitly-buckling-up-for-something-big, a big part of me wants to shake my head and be like "ah well, it was nice while it lasted" about their chances of doing a hard thing, especially an unusual hard t... (read more)

4Raemon4d

I'd be interested in some of the mental moves you're thinking of as alternatives, and what triggers distinguish when it's a more buckling-down-time than the-other-thing-benwr-is-imagining time. (I realize a full version of this is asking you to write your own sequence of rationality-TAPs, but, interested in a quick table of contents if you got it)

"Buckle up bucko, this ain't over till it's over."

benwr4d40

Sure; if it's not obvious they're from the universe of Avatar: The Last Airbender.

Earthbending is substantially about: facing things head-on, "just getting it done", "buckling down" (though I suppose this can be different than "buckling up"), being unyielding, orienting around "grit".

Waterbending is substantially about: Being flexible, responsive to the environment, and careful.

Airbending is substantially about: Freedom of movement and action, using an opponent's strength against them (which in PvE looks more like "doing what's easy and/or fun"), speed.

The... (read more)

-1Archimedes3d

For reference, here's Claude Opus 4's summary of the styles: EARTHBENDING Neutral jing - waiting and listening before acting. Direct confrontation when the time comes - problems don't go away by ignoring them. Being immovable and unmoved, enduring through standing firm rather than through continuous output. Drawing strength from solid foundations and deep roots. Meet challenges head-on, but only after patient observation and understanding what you're facing. WATERBENDING The path of least resistance is often the most powerful. Push and pull, give and take - everything flows in cycles. Flexibility can overcome rigidity. Draw strength from external sources and community. Change is the only constant. AIRBENDING Freedom through detachment. The leaf doesn't resist the wind. Conflict is an illusion that can be sidestepped. Joy and play are valid approaches to serious matters. New perspectives emerge when you release fixed positions. FIREBENDING Power comes from within - your drive, your breath, your life force. Act decisively from internal conviction and passion, maintaining intensity by continuously generating energy from within rather than standing firm. The sun gives life as well as destruction. Inner fire must be tended and controlled, not suppressed or allowed to rage wild.

2Arjun Pitchanathan4d

Thanks! (I knew enough about Avatar to know what you wrote in your last paragraph, but the rest is new to me)

"Buckle up bucko, this ain't over till it's over."

benwr4d30

You don't really mention <a thing that I think is extremely crucial> in this domain, which is that you do not have to (metaphorically) be an earthbender about everything. Other types of bending also exist. If you are not a native earthbender, you might be able to learn to do it (the real world does not have only one Chosen One who can bend all the elements), but as a meta-waterbender I personally recommend first looking around carefully, and trying to figure out how the most successful benders of your native element are doing it.

You do seem to see ea... (read more)

1CstineSublime4d

What are earth and water benders? Is someone able and willing to paraphrase this with a different analogy?

3Raemon4d

I intended this disclaimer to at least touch on (what I think you) mean. (I'm not really sure what you mean by "meta-waterbender" vs "airbender". I guess to be fair I was also sort of confused by the distinction between water and airbenders in the show. I could see the distinction if I squint but they seemed least-different of all the elements) I considered adding more extensive disclaimers. I didn't do that for... uh, the maybe aggravating, maybe slightly-bad (but IMO not strictly bad) reason of "people don't really comment on posts that get everything right and are super careful to disclaim everything appropriately", and, uh, I selfishly/motivationally wanted more comments on my posts in this series so decided to lean in the direction of rushing it out the door less carefully. But, I did just update the Triggers Section to make it a bit more explicit, and if you think there's a better way of phrasing The Thing You Mean To Be Pointing At I'm interested.

7Arjun Pitchanathan4d

got more exposition on what you mean with the different elements in this context?

It's 'Well, actually...' all the way down

benwr2mo20

I mean to use "the precise definition" to identify it as the one that isn't "the specific definition" (based on the earlier comparison), not to say that it's the only precise definition or anything like that. i.e. I could have said "the comparatively more precise definition of these two" instead.

2ChristianKl2mo

This makes the mistaken assumptions that there are somehow two definitions among which the debate rests and that those match what any scientist believes. While there might be something like a definition of chemical that multiple people who say “uh, everything is chemicals, what do you even mean?” use, that's not the precise definition of chemical that anybody in science who actually has a need for a precise definition would use. As far as precision goes, "substance that falls under the purview of the REACH Regulation of the EU" doesn't feel to me more vague than normal Kiki usage of the term chemical.

It's 'Well, actually...' all the way down

benwr2mo70

I wonder if we should bet about something here. It seems plausible that we would make different predictions about how much agreement there would be on what is a "chemical", if you were to explain about the structure, manufacturing, and acquisition of a given substance.

3the gears to ascension2mo

this does still have issues, if someone is horrified by riboflavin but makes sure to have plenty of vitamin B12. I'm expecting that people who are "chemical avoidant" ordinarily implement rather surface-level pattern matching of what qualifies, but if you provide them with detailed explanations of how specific names actually arise, they might match on the explanation rather than the name. still, I expect they'll have a high rate of weird edge cases.

It's 'Well, actually...' all the way down

benwr2mo*57

I think we disagree about what a definition is, in natural language.

I think it's fair to say that you and I have an internally-consistent-though-vague definition of the word "sandwich", even though I'm sure there are astronomically many edge-case examples for which we would answer differently. And in fact this is probably true for almost literally every non-mathematical noun or verb that we both "know", albeit to varying degrees.

Edited to add: It suddenly seems likely to me that I should be using the word "meaning" rather than "definition" here, but the rest of the post goes through fine if I make that switch.

0jbash2mo

"Sandwich" is still more tightly defined than their version of "chemical", and is still a useful concept that "carves reality at the joints" even if it's a bit fuzzy about which side of each joint it's cutting on. You'd have to go all the way to "bad stuff" to get a comparably bad definition.

Not all capabilities will be created equal: focus on strategically superhuman agents

benwr3mo10

I think it seems like a fine possibility in principle, actually; sorry to have given the wrong impression! It's not my central hope, since strategy-stealing seems like it should make many human-augmentations "available" to AI systems as well. This is notably not true for things involving, e.g., BCIs or superbabies.

Not all capabilities will be created equal: focus on strategically superhuman agents

benwr3mo10

When I'm thinking about this, it seems kind of fine if the goalposts move - human strategic capacity will certainly move over time no matter what, right? Like, someone invented crowdfunding and suddenly we could do types of coordination that we previously couldn't do.

3owencb3mo

It seems fine to me to have the goalposts moving, but then I think it's important to trace through the implications of that. Like, if the goalposts can move then this seems like perhaps the most obvious way out of the predicament; to keep the goalposts ever ahead of AI capabilities. But when I read your post I get the vibe that you're not imagining this as a possibility?

Biological humans collectively exert at most 400 gigabits/s of control over the world.

benwr5mo10

Nate Soares points out that the first paragraph is not quite right: Imagine writing a program that somehow implements an aligned superintelligence, giving it as an objective, "maximize utility according to the person who pressed the 'go' button", and pressing the 'go' button.

There's some sense in which, by virtue of existing in the world, you're already kind of "lucky" by this metric: It can take a finite amount of information to instantiate an agent that takes unbounded actions on your behalf.

benwr's unpolished thoughts

benwr5mo10

I asked Deep Research to see if there are existing treatments of this basic idea in the literature. It seems most closely related to the concept of "empowerment" in RL, which I'm surprised I hadn't heard of: https://en.m.wikipedia.org/wiki/Empowerment_(artificial_intelligence)

The Wikipedia article makes it seem like this might also be how RL people think about instrumental convergence?

benwr's unpolished thoughts

benwr5mo*10

Human information throughput is allegedly only about 10-50 bits per second. This implies an interesting upper bound, in that the information throughput of biological humanity as a whole can't be higher than around 50 * 10^10 = 500Gbit/s. I.e., if all distinguishable actions made by humans were perfectly independent, biological humanity as a whole would have at most 500Gbit/s of "steering power".
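The arithmetic behind that bound can be sketched in a few lines of Python. This is only a restatement of the comment's rough figures: the constant and function names are made up for illustration, and the 10^10 population is the comment's own rounding.

```python
# Rough figures from the comment above; both are estimates, not measurements.
BITS_PER_PERSON_PER_SECOND = 50  # upper end of the alleged 10-50 bit/s range
POPULATION = 10**10              # rounded-up number of biological humans


def steering_power_upper_bound(bits_per_person: float, population: int) -> float:
    """Aggregate bit/s if every person's actions were perfectly independent."""
    return bits_per_person * population


bound = steering_power_upper_bound(BITS_PER_PERSON_PER_SECOND, POPULATION)
print(f"{bound / 1e9:.0f} Gbit/s")  # prints "500 Gbit/s"
```

Plugging in a population of 8 * 10^9 instead gives 400 Gbit/s, which is presumably where the figure in the post title comes from.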

I need to think more about the idea of "steering power" (e.g. some obvious rough edges around amplifying your steering power using external information processing /... (read more)

1benwr5mo

Not all capabilities will be created equal: focus on strategically superhuman agents

benwr5mo10

I think you may have missed, or at least not taken literally, at least one of these things in the post:

The expansion of "superhuman strategic agent" is not "agent that's better than humans at strategic reasoning", it's "agent that is better than the best groups of humans at taking (situated) strategic action"
Strategic action is explicitly context-dependent, e.g. an AI system that's inside a mathematically perfect simulated world that can have no effect on the rest of the physical world and vice versa, has zero strategic power in this sense. Also e.g. in th

... (read more)

2ozziegooen5mo

I was confused here, had Claude try to explain this to me: I think I'm still confused. My guess is that the "most strategically capable groups of humans" are still not all that powerful, especially without that many resources. If you do give it a lot of resources, then sure, I agree that an LLM system with human-outperforming strategy and say $10B could do a fair bit of damage. Not sure if it's worth much more, just wanted to flag that.

benwr's unpolished thoughts

benwr5mo104

I think it probably makes sense for ~everyone to have an explicit list of "things I'd like AI to do for me", especially around productivity and/or things that could help you with world-saving. If you have a list like this, and we happen to hit a relevant capability threshold before we lose, you should probably stop spending your own time on that thing as quickly as possible.

Bounty for Evidence on Some of Palisade Research's Beliefs

benwr9mo20

Thanks everyone for thoughts so far! I do want to emphasize that we're actually highly interested in collecting even the most "obvious" evidence in favor of or against these ideas. In fact, in many ways we're more interested in the obvious evidence than in reframes or conceptual problems in the ideas here; of course we want to be updating our beliefs, but we also want to get a better understanding of the existing state of concrete evidence on these questions. This is partly because we consider it part of our mission to expand the amount and quality of relevant evidence on these beliefs, and are trying to ensure that we're aware of existing work.

benwr's unpolished thoughts

benwr1y50

Surprisingly to me, Claude 3.5 Sonnet is much more consistent in its answer! It is still not perfect, but it usually says the same thing (9/10 times it gave the same answer).

5quetzal_rainbow1y

I read somewhere that Claude 3.5 has hidden "thinking tokens".

benwr's unpolished thoughts

benwr1y167

From the "obvious-but-maybe-worth-mentioning" file:

ChatGPT (4 and 4o at least) cheats at 20 questions:

If you ask it "Let's play a game of 20 questions. You think of something, and I ask up to 20 questions to figure out what it is.", it will typically claim to "have something in mind", and then appear to play the game with you.

But it doesn't store hidden state between messages, so when it claims to "have something in mind", either that's false, or at least it has no way of following the rule that it's thinking of a consistent thing throughout the game. i.e.... (read more)

6Joseph Miller1y

I agree that it does not have something in mind but it could in principle have something in mind in the sense that it could represent some object in the residual stream in the tokens where it says "I have something in mind". And then future token positions could read this "memory".

5benwr1y

Surprisingly to me, Claude 3.5 Sonnet is much more consistent in its answer! It is still not perfect, but it usually says the same thing (9/10 times it gave the same answer).

benwr's unpolished thoughts

benwr1y30

Sometimes people use "modulo" to mean something like "depending on", e.g. "seems good, modulo the outcome of that experiment" [correct me ITT if you think they mean something else; I'm not 100% sure]. Does this make sense, assuming the term comes from modular arithmetic?

Like, in modular arithmetic you'd say "5 is 3, modulo 2". It's kind of like saying "5 is the same as 3, if you only consider their relationship to modulus 2". This seems pretty different to the usage I'm wondering about; almost its converse: to import the local English meaning of "modu... (read more)

11 diceware words is enough

benwr1y10

Well, not that much, right? If you had an 11-word diceware passphrase to start, each word is about 7 characters on average, so you have maybe 90 places to insert a token - only 6.5 extra bits come from choosing a place to insert your character. And of course you get the same added entropy from inserting 3 random base32 chars at a random location.
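As a rough check of those numbers, here is a small Python sketch. It assumes the standard 7776-word diceware list (6^5 entries) and takes the "maybe 90 places" estimate at face value; the variable names are illustrative only.

```python
import math

DICEWARE_LIST_SIZE = 6**5  # 7776 words in the standard diceware list (assumed)
WORDS = 11
INSERT_POSITIONS = 90      # the comment's rough estimate, not an exact count

# Entropy of the passphrase itself, with each word chosen uniformly at random.
passphrase_bits = WORDS * math.log2(DICEWARE_LIST_SIZE)

# Extra entropy from picking one of ~90 insertion points uniformly at random.
position_bits = math.log2(INSERT_POSITIONS)

print(f"passphrase: {passphrase_bits:.1f} bits")
print(f"position:   {position_bits:.1f} bits")
```

Under these assumptions the insertion point adds about 6.5 bits on top of roughly 142 bits, which is the sense in which it is "not that much".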

Happy to grant that a cracker assuming no unicode won't be able to crack your password, but if that's your goal then it might be a bad idea to post about your strategy on the public internet ;)

11 diceware words is enough

benwr1y10

maybe; probably the easiest way to do this is to choose a random 4-digit hexadecimal number, which gives you 16 bits when you enter it (e.g. via ctrl+u on linux). But personally I think I'd usually rather just enter those hex digits directly, for the same entropy minus a keystroke. Or, even better, maybe just type a random 3-character base32 string for one fewer bit.
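A minimal sketch of the two options, using Python's `secrets` module for the randomness. The helper names are hypothetical, and the base32 alphabet is assumed to be the RFC 4648 one; any fixed 32-symbol alphabet gives the same entropy.

```python
import math
import secrets

HEX_ALPHABET = "0123456789abcdef"
BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 alphabet (assumed)


def random_string(alphabet: str, length: int) -> str:
    """Draw `length` symbols uniformly at random from `alphabet`."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


def entropy_bits(alphabet: str, length: int) -> float:
    """Entropy of a uniformly random string over `alphabet`."""
    return length * math.log2(len(alphabet))


hex_token = random_string(HEX_ALPHABET, 4)        # e.g. typed after ctrl+u
base32_token = random_string(BASE32_ALPHABET, 3)  # typed directly

print(hex_token, entropy_bits(HEX_ALPHABET, 4))
print(base32_token, entropy_bits(BASE32_ALPHABET, 3))
```

Four hex digits carry 16 bits and three base32 characters carry 15, so the base32 option costs one bit and saves a keystroke.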

-3ChristianKl1y

If you have a nonstandard unicode character in your password and an attacker tries to crack a password based on the assumption that no nonstandard unicode character is in the password, they can't crack your password no matter how much compute they throw at it. Given that you decide where in your string you place it, the attacker would have to test for the special character being at multiple different positions which adds a lot of additional entropy.

2ChristianKl1y

Babble challenge: 50 ways of sending something to the moon

benwr2y50

I really really dislike the experience of saying things I think are totally stupid, and I currently don't buy... (read more)

4Raemon2y

First: people are different, so, like, definitely do the version of this you think actually helps you. (I've updated that "reflect afterward about what worked and didn't work for you" is a generally important part of cognitive exercises, and should be a part of the Babble exercises) But I want to flag the reasons I personally think it's important to have access to the dumb thoughts, and why it at least works for me. 1. I personally frequently have the experience of feeling totally stuck, writing down "list of strategies for X?", still feeling totally stuck, and then writing down "bad reasons for X", and this just totally unsticks me. I typically generate 1-2 bad ideas and then start generating good ideas again. 2. They're... free? Nothing bad happens when I generate them. I ignore them and move on and consolidate the good ideas later. 3. The goal here is to train myself to have an easier time generating ideas on the fly. In real life, I don't generate 50 ideas when babbling, I typically generate like 10. The point of the practice IMO is to sort of overtrain such that the 10 good ideas come easily when you need them and you never feel stuck. You might not share the experience in #1, in which case, for sure, do what seems good. (To be clear, if you found "actually generate good ideas tho" a prompt that generated useful stuff, seems good to notice and have that prompt in your toolkit) But FYI my crux for "whether I personally think BenWr benefits from generating bad ideas" is whether you ended up generating more good ideas faster-than-otherwise (which might or might not be true, but you didn't really address). ((though note: "whether it's useful to generate bad ideas" is a different question from "whether it's useful to use the prompt 'only generate good ideas'". It's possible for them both to be useful)) I agree that "stop and come back to it later" is often an important aspect of this sort of skill, but in general if I can generate the good ideas in the first pl

Babble challenge: 50 ways of sending something to the moon

benwr2y*30

A thing that was going through my head but I wasn't sure how to turn into a real idea (vulgar language from a movie):

Perhaps you would like me to stop the car and you two can fuck yourselves to Lutsk!

Babble challenge: 50 ways of sending something to the moon

benwr2y30

Whoa. I also thought of this, though for me it was like thing 24 or something, and I was too embarrassed to actually include it in my post.

Babble challenge: 50 ways of sending something to the moon

Answer by benwrAug 01, 202330

Hire SpaceX to send it
Bribe an astronaut on the next manned moon mission to bring it with them
Bribe an engineer on the next robotic moon mission to send it with the rover
Get on a manned mars mission, and throw it out the airlock at just the right speed
Massive evacuated sphere (like a balloon but arbitrarily light), aimed very carefully
Catapult
Send instructions on how to build a copy of the thing, and where to put it, such that an alien race will do it as a gesture of goodwill
Same, but with an incentive of some kind
Same, but do it acausally
Make a miniature

... (read more)

5benwr2y

Some thoughts after doing this exercise: I did the exercise because I couldn't sleep; I didn't keep careful count of the time, and I didn't do it all in one sitting. I'd guess I spent about an hour on it total, but I think there's a case to be made that this was cheating. However, "fresh eyes" is actually a really killer trick when doing this kind of exercise, in my experience, and it's usually available in practice. So I don't feel too bad about it. I really really dislike the experience of saying things I think are totally stupid, and I currently don't buy that I should start trying to say stupider things. My favorite things in the above list came from refusing to just say another totally stupid thing. Nearly everything in my list is stupid in some way, but the things that are so stupid they don't even feel interesting basically make me feel sad. I trust my first-round aesthetic pruner to actually be helping to train my babbler in constructive directions. The following don't really feel worth having said, to me: My favorites didn't come after spewing this stuff; instead they came when I refused to be okay with just saying more of that kind of junk: The difference isn't really that these are less stupid; in fact they're kind of more stupid, practically speaking. But I actually viscerally like them, unlike the first group. Forcing myself to produce things I hate feels like a bad strategy on lots of levels.

3benwr2y

A thing that was going through my head but I wasn't sure how to turn into a real idea (vulgar language from a movie):

UFO Betting: Put Up or Shut Up

benwr2y20

(I've added my $50 to RatsWrong's side of this bet)

"Justice, Cherryl."

benwr2y131

For contingent evolutionary-psychological reasons, humans are innately biased to prefer "their own" ideas, and in that context, a "principle of charity" can be useful as a corrective heuristic

I claim that the reasons for this bias are, in an important sense, not contingent. i.e. an alien race would almost certainly have similar biases, and the forces in favor of this bias won't entirely disappear in a world with magically-different discourse norms (at least as long as speakers' identities are attached to their statements).

As soon as I've said "P", it is th... (read more)

Systems that cannot be unsafe cannot be safe

benwr2y30

I would agree more with your rephrased title.

People do actually have a somewhat-shared set of criteria in mind when they talk about whether a thing is safe, though, in a way that they (or at least I) don't when talking about its qwrgzness. e.g., if it kills 99% of life on earth over a ten year period, I'm pretty sure almost everyone would agree that it's unsafe. No further specification work is required. It doesn't seem fundamentally confused to refer to a thing as "unsafe" if you think it might do that.

I do think that some people are clearly talking about... (read more)

0Davidmanheim2y

The people in the world who actually build these models are doing the thing that I pointed out. That's the issue I was addressing. I don't understand this distinction. If "I'm pretty sure almost everyone would agree that it's unsafe," that's an informal but concrete ability for the system to be unsafe, and it would not be confused to say something is unsafe if you think it could do that, nor to claim that it is safe if you have clear reason to believe it will not. My problem is, as you mentioned, that people in the world of ML are not making that class of claim. They don't seem to ground their claims about safety in any conceptual model about what the risks or possible failures are whatsoever, and that does seem fundamentally confused.

Systems that cannot be unsafe cannot be safe

benwr2y10

Part of my point is that there is a difference between the fact of the matter and what we know. Some things are safe despite our ignorance, and some are unsafe despite our ignorance.

0Davidmanheim2y

Sure, I agree with that, and so perhaps the title should have been "Systems that cannot be reasonably claimed to be unsafe in specific ways cannot be claimed to be safe in those ways, because what does that even mean?" If you say something is "qwrgz," I can't agree or disagree, I can only ask what you mean. If you say something is "safe," I generally assume you are making a claim about something you know. My problem is that people claim that something is safe, despite not having stated any idea about what they would call unsafe. But again, that seems fundamentally confused about what safety means for such systems.

Systems that cannot be unsafe cannot be safe

benwr2y10

The issue is that the standards are meant to help achieve systems that are safe in the informal sense. If they don't, they're bad standards. How can you talk about whether a standard is sufficient, if it's incoherent to discuss whether layperson-unsafe systems can pass it?

2Gunnar_Zarncke2y

True, but the informal safety standard is "what doesn't harm humans." For construction, it amounts to "doesn't collapse," which you can break down into things like "strength of beam." But with AI you are talking to the full generality of language and communication and that effectively means: "All types of harm." Which is exactly the very difficult thing to get right here.

Systems that cannot be unsafe cannot be safe

benwr2y1-2

I don't think it's true that the safety of a thing depends on an explicit standard. There's no explicit standard for whether a grizzly bear is safe. There are only guidelines about how best to interact with them, and information about how grizzly bears typically act. I don't think this implies that it's incoherent to talk about the situations in which a grizzly bear is safe.

Similarly, if I make a simple html web site "without a clear indication about what the system can safely be used for... verification that it passed a relevant standard, and clear instru...

2Davidmanheim2y

I think you're focusing on the idea of a standard, which is necessary for a production system or reliability in many senses, and should be demanded of AI companies - but it is not the fundamental issue with not being able to say in any sense what makes the system safe or unsafe, which was the fundamental point here that you seem not to disagree with. I'm not laying out a requirement, I'm pointing out a logical necessity; if you don't know what something is or is not, you can't determine it. But if something "will reliably cause serious harm to people who interact with it," it sounds like you have a very clear understanding of how it would be unsafe, and a way to check whether that occurs.

2Gunnar_Zarncke2y

That's true informally, and maybe it is what some consumers have in mind, but that is not what the people who are responsible for actual load-bearing safety mean.

Moderation notes re: recent Said/Duncan threads

benwr2y610

I think I agree that this isn't a good explicit rule of thumb, and I somewhat regret how I put this.

But it's also true that a belief in someone's good-faith engagement (including an onlooker's), and in particular their openness to honest reconsideration, is an important factor in the motivational calculus, and for good reasons.

7Vladimir_Nesov2y

The structure of a conflict and motivation prompted by that structure functions in a symmetric way, with the same influence irrespective of whether the argument is right or wrong. But the argument itself, once presented, is asymmetric, it's all else equal stronger when correct than when it's not. This is a reason to lean towards publishing things, perhaps even setting up weird mechanisms like encouraging people to ignore criticism they dislike in order to make its publication more likely.

Moderation notes re: recent Said/Duncan threads

benwr2y1312

I think it's pretty rough for me to engage with you here, because you seem to be consistently failing to read the things I've written. I did not say it was low-effort. I said that it was possible. Separately, you seem to think that I owe you something that I just definitely do not owe you. For the moment, I don't care whether you think I'm arguing in bad faith; at least I'm reading what you've written.

-18Czynski2y

-32Czynski2y

Moderation notes re: recent Said/Duncan threads

benwr2y92

Nor should I, unless I believe that someone somewhere might honestly reconsider their position based on such an attempt. So far my guess is that you're not saying that you expect to honestly reconsider your position, and Said certainly isn't. If that's wrong then let me know! I don't make a habit of starting doomed projects.

5Vladimir_Nesov2y

I think for the purposes of promoting clarity this is a bad rule of thumb. The decision to explain should be more guided by effort/hedonicity and availability of other explanations of the same thing that are already there, not by strategically withholding things based on predictions of how others would treat an explanation. (So for example "I don't feel like it" seems like an excellent reason not to do this, and doesn't need to be voiced to be equally valid.)

-3Czynski2y

If you're not even willing to attempt the thing you say should be done, you have no business claiming to be arguing or negotiating in good faith. You claimed this was low-effort. You then did not put in the effort to do it. This strongly implies that you don't even believe your own claim, in which case why should anyone else believe it? It also tests your theory. If you can make the modification easily, then there is room for debate about whether Said could. If you can't, then your claim was wrong and Said obviously can't either.

Moderation notes re: recent Said/Duncan threads

benwr2y40

I'm not sure what you mean - as far as I can tell, I'm the one who suggested trying to rephrase the insulting comment, and in my world Said roughly agreed with me about its infeasibility in his response, since it's not going to be possible for me to prove either point: Any rephrasing I give will elicit objections on both semantics-relative-to-Said and Said-generatability grounds, and readers who believe Said will go on believing him, while readers who disbelieve will go on disbelieving.

-1Czynski2y

You haven't even given an attempt at rephrasing.

Moderation notes re: recent Said/Duncan threads

benwr2y74

By that measure, my comment does not qualify as an insult. (And indeed, as it happens, I wouldn’t call it “an insult”; but “insulting” is slightly different in connotation, I think. Either way, I don’t think that my comment may fairly be said to have these qualities which you list.

I think I disagree that your comment does not have these qualities in some measure, and they are roughly what I'm objecting to when I ask that people not be insulting. I don't think I want you to never say anything with an unflattering implication, though I do think this is usual...

1Said Achmiz2y

I think that, at this point, we’re talking about nuances so subtle, distinctions so fragile (in that they only rarely survive even minor changes of context, etc.), that it’s basically impossible to predict how they will affect any particular person’s response to any particular comment in any particular situation. To put it another way, the variation (between people, between situations, etc.) in how any particular bit of wording will be perceived, is much greater than the difference made by the changes in wording that you seem to be talking about. So the effects of any attempt to apply the principles you suggest is going to be indistinguishable from noise. And that means that any effort spent on doing so will be wasted.

Moderation notes re: recent Said/Duncan threads

benwr2y72

For what it's worth, I don't think that one should never say insulting things. I think that people should avoid saying insulting things in certain contexts, and that LessWrong comments are one such context.

I find it hard to square your claim that insultingness was not the comment's purpose with the claim that it cannot be rewritten to elide the insult.

An insult is not simply a statement with a meaning that is unflattering to its target - it involves using words in a way that aggressively emphasizes the unflatteringness and suggests, to some extent, a call ...

3Czynski2y

You still haven't actually attempted the challenge Said laid out.

3Said Achmiz2y

I more or less agree with this; I think that posting and commenting on Less Wrong is definitely a place to try to avoid saying anything insulting. But not to try infinitely hard. Sometimes, there is no avoiding insult. If you remove all the insult that isn’t core to what you’re saying, and if what you’re saying is appropriate, relevant, etc., and there’s still insult left over—I do not think that it’s a good general policy to avoid saying the thing, just because it’s insulting. By that measure, my comment does not qualify as an insult. (And indeed, as it happens, I wouldn’t call it “an insult”; but “insulting” is slightly different in connotation, I think. Either way, I don’t think that my comment may fairly be said to have these qualities which you list. Certainly there’s no “call to non-belief-based action”…!) True, of course… but also, so thoroughly dis-analogous to the actual thing that we’re discussing that it mostly seems to me to be a non sequitur.

Moderation notes re: recent Said/Duncan threads

benwr2y96

My guess is that you believe it's impossible because the content of your comment implies a negative fact about the person you're responding to. But insofar as you communicated a thing to me, it was in fact a thing about your own failure to comprehend, and your own experience of bizarreness. These are not unflattering facts about Duncan, except insofar as I already believe your ability to comprehend is vast enough to contain all "reasonable" thought processes.

Said Achmiz2y104

These are not unflattering facts about Duncan

Indeed, they are not—or so it would seem. So why would my comment be insulting?

After all, I didn’t write “your stated reason is bizarre”, but “I find your stated reason bizarre”. I didn’t write “it seems like your thinking here is incoherent”, but “I can’t form any coherent model of your thinking here”. I didn’t… etc.

So what makes my comment insulting?

Please note, I am not saying “my comment isn’t insulting, and anyone who finds it so is silly”. It is insulting! And it’s going to stay insulting no matter how ...

Moderation notes re: recent Said/Duncan threads

benwr2y2926

But, of course, I recognize that my comment is insulting. That is not its purpose, and if I could write it non-insultingly, I would do so. But I cannot.

I want to register that I don't believe you that you cannot, if we're using the ordinary meaning of "cannot". I believe that it would be more costly for you, but it seems to me that people are very often able to express content like that in your comment, without being insulting.

I'm tempted to try to rephrase your comment in a non-insulting way, but I would only be able to convey its meaning-to-me, and I pre...

3Said Achmiz2y

I believe you when you say that you don’t believe me. But I submit to you that unless you can provide a rephrasing which (a) preserves all relevant meaning while not being insulting, and (b) could have been generated by me, your disbelief is not evidence of anything except the fact that some things seem easy until you discover that they’re impossible.

Sneaking Suspicion

benwr3y60

Other facts about how I experience this:

* It's often opposed to internal forces like "social pressure to believe the thing", or "bucket errors I don't feel ready to stop making yet"

* Noticing it doesn't usually result in immediate enlightenment / immediately knowing the answer, but it does result in some kind of mini-catharsis, which is great because it helps me actually want to notice it more.

* It's not always the case that an opposing loud voice was wrong, but I think it is always the case that the loud voice wasn't really justified in its loudness.

Benign Boundary Violations

benwr3y270

A thing I sort-of hoped to see in the "a few caveats" section:

* People's boundaries do not emanate purely from their platonic selves, irrespective of the culture they're in and the boundaries set by that culture. Related to the point about grooming/testing-the-waters, if the cultural boundary is set at a given place, people's personal boundaries will often expand or retract somewhat, to be nearer to the cultural boundary.

Entropy isn't sufficient to measure password strength

benwr3y10

Perhaps controversially, I think this is a bad selection scheme even if you replace "password" with any other string.

1[comment deleted]3y

Entropy isn't sufficient to measure password strength

benwr3y40

any password generation scheme where this is relevant is a bad idea

I disagree; as the post mentions, sometimes considerations such as memorability come into play. One example might be choosing random English sentences as passwords. You might do that by choosing a random parse tree of a certain size. But some English sentences have ambiguous parses, i.e. they'll have multiple ways to generate them. You *could* try to sample to avoid this problem, but it becomes pretty tricky to do that carefully. If you instead find the "most ambiguous sentence" in your set, you can get a lower bound on the safety of your scheme.
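A minimal sketch of the ambiguity point, assuming a toy scheme that is not from the post: derivations (stand-ins for parse trees) are sampled uniformly, and two of them happen to produce the same sentence. The derivation names and sentences are hypothetical. The most ambiguous sentence sets the min-entropy, which is lower than the bit count you'd get by just counting derivations.

```python
import math
from collections import Counter

# Hypothetical toy scheme: each derivation is equally likely, but two
# different derivations can produce the same surface sentence.
derivations = [
    ("parse_a", "old men and women sing"),
    ("parse_b", "old men and women sing"),  # same sentence, different parse
    ("parse_c", "young cats and dogs sleep"),
    ("parse_d", "tall trees sway"),
]

def sentence_distribution(derivs):
    """Probability of each sentence when derivations are sampled uniformly."""
    counts = Counter(sentence for _, sentence in derivs)
    total = len(derivs)
    return {s: c / total for s, c in counts.items()}

def min_entropy_bits(dist):
    """Min-entropy: -log2 of the probability of the most likely outcome."""
    return -math.log2(max(dist.values()))

dist = sentence_distribution(derivations)
naive_bits = math.log2(len(derivations))  # counts derivations, ignores ambiguity
worst_case_bits = min_entropy_bits(dist)  # set by the most ambiguous sentence
```

Here the naive count suggests more bits than the worst-case figure, because the ambiguous sentence is twice as likely as the others.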

Entropy isn't sufficient to measure password strength

benwr3y50

~~Um, huh? There are 2^1000 1000-character passwords, not 2^4700. Where is the 4700 coming from?~~

(added after realizing the above was super wrong): Whoops, that's what I get for looking at comments first thing in the morning. log2(26^1000) ≈ 4700. Still, the following bit stands:

I'd also like to register that, in my opinion, if it turns out that your comment is wrong and not my original statement, it's really bad manners to have said it so confidently.

(I'm now not sure if you made an error or if I did, though)

Update: I think you're actually totally right. The entropy gives a lower bound for the average, not the average itself. I'll update the post shortly.

5Thomas Sepulchre3y

I apologize, my wording was indeed rude. As for why I was confident, well, this was a clear (in my eyes at least) example of Jensen's inequality: we are comparing the mean of the log to the log of the mean. And, if you see this inequality come up often, you know that the inequality is always strict, except for a constant distribution. That's how I knew. As a last note, I must praise the fact that you left your original comments (while editing them of course) instead of removing them. I respect you a lot for that.
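For anyone who wants to see the "mean of the log vs. log of the mean" gap numerically, here is a small sketch. The distributions are made up for illustration, and the quantity X = 1/p is my choice of example rather than anything quoted from the post. With that choice the mean of the log is the Shannon entropy, and the gap closes only when X is constant, i.e. when the distribution is uniform.

```python
import math

def mean_of_log2(values, probs):
    """E[log2 X] for a discrete random variable X."""
    return sum(p * math.log2(v) for v, p in zip(values, probs))

def log2_of_mean(values, probs):
    """log2 E[X] for the same random variable."""
    return math.log2(sum(p * v for v, p in zip(values, probs)))

# Hypothetical non-uniform distribution over three passwords.
skewed = [0.5, 0.25, 0.25]
skewed_x = [1 / p for p in skewed]  # X = 1/p, so E[log2 X] is the entropy

# Uniform distribution over four passwords: X is constant here.
uniform = [0.25] * 4
uniform_x = [1 / p for p in uniform]

skewed_gap = log2_of_mean(skewed_x, skewed) - mean_of_log2(skewed_x, skewed)
uniform_gap = log2_of_mean(uniform_x, uniform) - mean_of_log2(uniform_x, uniform)
```

`skewed_gap` comes out strictly positive and `uniform_gap` is zero, which is the strictness claim above.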

Entropy isn't sufficient to measure password strength

benwr3y50

To clarify a point in my sibling comment, the concept of "password strength" doesn't cleanly apply to an individual password. It's too contingent on factors that aren't within the password itself. Say I had some way of scoring passwords on their strength, and that this scoring method tells me that "correct horse battery staple" is a great password. But then some guy puts that password in a webcomic read by millions of people - now my password is going to be a lot worse, even though the content of the password didn't change.

Password selection schemes aren't...

1Brendan Long3y

Ah I see what you're saying. Given a non-uniform password generation algorithm (like 99.9% "password", 0.1% random 1000-bit ASCII string), taking the average entropy of the schema is misleading, since the average password has 50 bits of entropy but 99.9% of the time the password will have 1 bit of entropy, and in this case taking the worst-entropy result (1 bit) is more useful than the average. I think you're right, but this doesn't seem very useful, since any password generation scheme where this is relevant is a bad idea (since if you switched to a uniform distribution, you could either have stronger passwords with the same length, or just as strong passwords with a shorter length).

Entropy isn't sufficient to measure password strength

benwr3y10

I don't think that's how people normally do it; partly because I think it makes more sense to try to find good password *schemes*, rather than good individual passwords, and measuring a password's optimal encoding requires knowing the distribution of passwords already. The optimal encoding story doesn't help you choose a good password scheme; you need to add on top of it some way of aggregating the code word lengths. In the example from the OP, you could use the average code word length of the scheme, which has you evaluating Shannon entropy again, or you could use the minimum code word length, which brings you back to min-entropy.
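A rough sketch of the two aggregation choices, using a made-up scheme with my own numbers rather than the OP's: half the time it emits one fixed password, and otherwise it picks uniformly among 2**100 random strings. Averaging the ideal code word lengths gives the Shannon entropy; taking the shortest gives the min-entropy.

```python
import math

# Hypothetical scheme, described as groups of equally likely passwords:
# (number_of_passwords_in_group, probability_of_each_password).
scheme = [
    (1, 0.5),                 # one fixed password, chosen half the time
    (2**100, 0.5 / 2**100),   # otherwise uniform over 2**100 strings
]

def total_probability(groups):
    """Sanity check: the group probabilities should sum to 1."""
    return sum(n * p for n, p in groups)

def shannon_bits(groups):
    """Average ideal code word length, i.e. Shannon entropy in bits."""
    return sum(n * p * -math.log2(p) for n, p in groups)

def min_entropy_bits(groups):
    """Shortest ideal code word length, i.e. min-entropy in bits."""
    return -math.log2(max(p for _, p in groups))

average_bits = shannon_bits(scheme)
worst_bits = min_entropy_bits(scheme)
```

For this toy scheme the average is dominated by the rare long strings, while the minimum reflects the single password that shows up half the time.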

Entropy isn't sufficient to measure password strength

benwr3y20

Yep! I originally had a whole section about this, but cut it because it doesn't actually give you an ordering over schemes unless you also have a distribution over adversary strength, which seems like a big question. If one scheme's min-entropy is higher than another's max-entropy, you know that it's better for any beliefs about adversary strength.

2James Payor3y

Nice! I note you do at least get a partial ordering here, where some schemes always give the adversary lower cumulative probability of success as n increases than others. This should be similar to (perhaps more fine-grained than, idk) the min-entropy approach. But I haven't thought about it :)
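One way to sketch that partial ordering, assuming the adversary guesses the most likely passwords first: compare the cumulative success probability after n guesses for every n. The three example distributions below are hypothetical, and the function names are mine.

```python
from itertools import accumulate

def success_curve(probs):
    """Cumulative success probability of an adversary who guesses the
    most likely passwords first: entry n-1 covers the first n guesses."""
    return list(accumulate(sorted(probs, reverse=True)))

def at_least_as_strong(a, b):
    """True if scheme `a` never gives the adversary a higher success
    probability than scheme `b`, for every number of guesses."""
    curve_a, curve_b = success_curve(a), success_curve(b)
    n = max(len(curve_a), len(curve_b))
    # Once a scheme's passwords are exhausted, success stays at 1.0.
    curve_a += [1.0] * (n - len(curve_a))
    curve_b += [1.0] * (n - len(curve_b))
    # Small tolerance for floating-point rounding in the running sums.
    return all(x <= y + 1e-12 for x, y in zip(curve_a, curve_b))

# Hypothetical schemes, given as lists of password probabilities.
uniform4 = [0.25, 0.25, 0.25, 0.25]
skewed4 = [0.7, 0.1, 0.1, 0.1]
mixed5 = [0.4, 0.2, 0.2, 0.1, 0.1]
```

With these numbers `uniform4` is at least as strong as `skewed4`, but `uniform4` and `mixed5` are incomparable: the uniform scheme is better against an adversary with one guess and worse against one with four. That is the sense in which the ordering is only partial.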

Long covid: probably worth avoiding—some considerations

benwr3y70

Hm. On doing exactly as you suggest, I feel confused; it looks to me like the 25-44 cohort has really substantially more deaths than in recent years: https://www.dropbox.com/s/hcipg7yiuiai8m2/Screen Shot 2022-01-16 at 2.12.44 PM.png?dl=0 I don't know what your threshold for "significance" is, but 103 / 104 weeks spent above the preceding 208 weeks definitely meets my bar.

Am I missing something here?

1JesperO3y

Aren't those excess deaths just the direct covid deaths, from the unlucky few younger people who got covid and died from it?

2Florin3y

The infection fatality rate might be an even better way to quantify the risk of death. The IFR for covid in the 15-64 age groups before September 2020 was 0.75% at the higher end of the range. Older age groups had IFRs ranging from 2.5% to 28%. The IFR of the flu doesn't usually go over 0.1%, although this is an average and the accuracy of the stat itself is questionable (from other sources I've seen). https://dx.doi.org/10.1007%2Fs10654-020-00698-1

2Florin3y

Although it might not be entirely insignificant, it seems a lot less significant than it would appear. Eyeballing it, there seems to be about 100k excess deaths in the 25-44 age group (usually, it's about 300k total deaths for 104 weeks) out of a total of 950k excess deaths. That's a 25% increase of excess deaths compared to the baseline, but nowhere near the 40 to over 60% peaks that an uncritical reading of OWID's chart would suggest. Also, the 25-44 group is about 26% of the US population, yet has suffered only 10% of the deaths, whereas the 45+ groups (harder to eyeball just the 45-64 group) are 41% of the pop and have suffered from 90% of the deaths. And since covid mortality increases the older one gets, a person in their late 20s would likely have less chance of dying than a person in their early 40s. This is perhaps clearer in terms of risk of death due to covid by age group (compared to 18-29 year olds):

* 30-39: 4x

* 40-49: 10x

* 50-64: 25x

* 65-74: 65x

* 75-84: 150x

* 85+: 370x