Communication is indeed hard, and it's certainly possible that this isn't intentional. On the other hand, making mistakes is quite suspicious when they're also useful for your agenda. But I agree that we probably shouldn't read too much into it. The system card doesn't even mention the possibility of the model acting maliciously, so maybe that's simply not in scope for it?
While reading the OpenAI Operator System Card, I found the following paragraph on page 5 a bit weird:
We found it fruitful to think in terms of misaligned actors, where:
- the user might be misaligned (the user asks for a harmful task),
- the model might be misaligned (the model makes a harmful mistake), or
- the website might be misaligned (the website is adversarial in some way).
Interesting use of language here. I can understand calling the user or website misaligned, in the sense of alignment relative to laws or OpenAI's goals. But why call a model misaligned...
Ellison is the CTO of Oracle, one of the three companies running the Stargate Project. Even if aligning AI systems to some values can be solved, selecting those values badly can still be approximately as bad as the AI just killing everyone. Moral philosophy continues to be an open problem.
I would have written a shorter letter, but I did not have the time.
I am actually mildly surprised OA has bothered to deploy o1-pro at all, instead of keeping it private and investing the compute into more bootstrapping of o3 training etc.
I'd expect that deploying more capable models is still quite useful, as it's one of the best ways to generate high-quality training data. In addition to solutions, you need problems to solve, and confirmation that the problem has been solved. Or is your point that they already have all the data they need, and it's just a matter of spending compute to refine that?
I'd go a step beyond this: merely following incentives is amoral. It's the default. In a sense, moral philosophy discusses when and how you should go against the incentives. Superhero Bias resonates with this idea, but from a different perspective.
merely following incentives is amoral. It's the default.
Seems to me that people often miss various win/win opportunities in their life, which could make the actual human default even worse than following the incentives. On the other hand, they also miss many win/lose opportunities, maybe even more frequently.
I guess we could make a "moral behavior hierarchy" like this:
Yet Batman lets countless people die by refusing to kill the Joker. What you term "coherence" seems to be mostly "virtue ethics", and The Dark Knight is a warning of what happens when virtue ethics goes too far.
I personally identify more with HPMoR's Voldemort than with any other character. He seems decently coherent. To me, a "villain" is a person whose goals and actions are harmful to my ingroup. This doesn't seem to have much to do with coherence.
A reliable gear in a larger machine might be less agentic but more useful than a scheming Machiavellian.
The schem...
This doesn't actually follow. There isn't a law of the universe saying that some things are painful, and unlock a certain amount of value, and if we could reduce the pain you'd still get more value by enduring more pain.
It might happen to be true in some cases, but there's no reason to expect that to hold in all cases.
Learning a new language isn't hard because it's time-consuming. It's time-consuming because it's hard. The hard part is memorizing all of the details. Sure, that takes lots of time, but working harder (more intensively) will reduce that time. Being dedicated enough to put in the time is hard, too.
Getting a royal flush in poker isn't something I'd call hard. It's rare. But hard? A complete beginner can do it on their first try. It's just luck. And if you play long enough, it'll eventually happen.
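To put a rough number on "rare" (a quick back-of-the-envelope calculation, assuming standard 5-card hands dealt at random):

```latex
% Total 5-card hands from a 52-card deck
\binom{52}{5} = 2{,}598{,}960
% Royal flushes: one per suit, so 4 in total
P(\text{royal flush}) = \frac{4}{\binom{52}{5}} = \frac{4}{2{,}598{,}960} \approx \frac{1}{649{,}740}
```

So roughly one hand in 650,000: rare in the sense of luck, not of effort.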
Painful or unpleasant things are hard, because they requi...
Such a detailed explanation, thanks. Some additional thoughts:
And I plan to take the easier route of "just make a more realistic simulation" that then makes it easy to draw inferences.
"I don't care if it's 'just roleplaying' a deceptive character, it caused harm!"
That seems like the reasonable path. However, footnote 9 goes further:
Even if it is the case that the model is "actually aligned", but it was just simulating what a misaligned model would do, it is still bad.
This is where I disagree: Models that say "this is how I would deceive, and then...
Thanks for this, it was an interesting read.
One thing I wonder about is whether the model actually deceives, or just plays the part of someone expected to use deception as a tactic. I wouldn't call a system (or a person) manipulative or deceptive when that behavior is observed in games of deception, e.g. bluffing in poker. It seems to make more sense to "deceive" when it's strongly implied that you're supposed to do that. Adding more bluff makes it less obvious that you're supposed to be deceptive. The similarity of results with non-private scratchpad see...
Omniscience itself is a timeless concept. If the consequences of any decision or action are known beforehand, there are no feedback cycles, as every decision is made at the first moment of existence. Every parallel world, every branch of the decision tree, is evaluated to the end of time, scored, and the best one selected. One wonders what exactly "god" refers to in this model.
Or to put it another way, why can't "how I look getting there" be a goal by itself?
The bigger the obstacles you have to overcome, the more impressive it looks to others. This can be valuable by itself, but in such situations it might be more efficient to reduce the actual flaw, and simply pretend that the effect is bigger. However, honesty is important, and honesty with yourself even more so.
since if you identify with your flaws, then maintaining your flaws is one of your goals
If a flaw is too hard to remove now, you have to figure out a way to manage it. ...
There's just no political will to do it, since the solutions would be harsh or expensive enough that nobody could impose them upon society. A god-emperor, who really wished to increase fertility numbers and could set laws freely without the society revolting, could use some combination of these methods: