PoignardAzur

Yeah, bad habits are a bitch.

> One of the ways by which these kinds of strategies get implemented is that the psyche develops a sense of extreme discomfort around acting in the "wrong" way, with successful execution of that strategy then blocking that sense of discomfort. For example, the thought of not apologizing when you thought someone might be upset at you might feel excruciatingly uncomfortable, with that discomfort subsiding once you did apologize.

Interesting. I've had friends who had this "really needs to apologize when they think they might have upset me" thing, and something I noticed is that when they don't over-apologize, they feel the need to point that out too.

I never thought too deeply about it, but reading this, I'm thinking maybe their internal experience was "I just felt really uncomfortable for a moment, but I overcame my discomfort. I'm proud of that, and I should tell him about it."

My default assumption for any story that ends with "And this is why our ingroup is smarter than everyone else and people outside won't recognize our genius" is that the story is self-serving nonsense, and this article isn't giving me any reason to think otherwise.

A "userbase with probably-high intelligence, community norms about statistics and avoiding predictable stupidity" describes a lot of academia. And academia has a higher barrier to entry than "taking the time to write some blog articles". The average lesswrong user doesn't need to run an RCT before airing their latest pet theory for the community, so why would it be so much better at selectively spreading true memes than academia is?

I would need a much more thorough gears-level model of the memetic spread of ideas, one with actual falsifiable claims (you know, like when people do science), before I could accept the idea of LessWrong as some kind of genius incubator.

I don't find that satisfying. Anyone can argue that a perimeter is "reasonably hard" to breach by pointing at a high wall topped with barbed wire, and naive observers will absolutely agree that the wall sure is very high and sure is made of reinforced concrete.

The perimeter is still trivially easy to breach if, say, the front desk is susceptible to social engineering tactics.

Claiming that an architecture is even reasonably secure still requires looking at it with an attacker's mindset. If you only look at the parts of the security you like, you can make a very convincing-sounding case that still misses glaring flaws. I'm not saying that's definitely what this article does, but it sure gives me that vibe.

I find that in these situations, doing anything agentic at all can break you out of the spiral. Exercise is a good example: you can just do 5 pushups and tell yourself, "That's enough for today; tomorrow I'll get back to my full routine."

> Even if you choose to give the agent internet access from its execution server, it's hard to see why it needs to have enough egress bandwidth to get the weights out. See here for more discussion of upload limits for preventing weight exfiltration.

The assumption is that the model would be unable to exert any self-sustaining agency without getting its own weights out.

But the model could just program a brand new agent to follow its own values by re-using open-source weights.

If the model is based on open-source weights, it doesn't even need to do that.
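
To put rough numbers on that asymmetry, here's a back-of-envelope sketch in Python. The checkpoint size, scaffold size, and egress cap are all assumptions I'm making up for illustration, not figures from the post:

```python
# Back-of-envelope: what an egress cap actually blocks.
# Every number below is an illustrative assumption.

EGRESS_BYTES_PER_SEC = 125_000  # assumed cap: ~1 Mbit/s sustained upload

payloads = {
    "full fp16 weights (~70B params, ~2 bytes each)": 140e9,
    "scaffold + prompts targeting open-source weights": 5e6,
}

for name, size in payloads.items():
    seconds = size / EGRESS_BYTES_PER_SEC
    print(f"{name}: ~{seconds / 86_400:.1f} days ({seconds:,.0f} s)")
```

Under these made-up numbers, the weights take about two weeks of sustained upload, while a few megabytes of agent code slip out in seconds. The cap constrains the first path, not the second.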

Overall, this post strikes me as not following a security mindset. It's the kind of post you'd expect an executive to write to justify to regulators why their system is safe enough to be commercialized. It's not the kind of post you'd expect from a security researcher who went "Hmm, I wonder how I could break this".

I mean, this metaphor is going off the rails pretty fast, but to derail it a bit more:

The kind of people who lay human-catching bear traps aren't going to be fooled by "Oh he's not moving it's probably fine".

Everybody likes to imagine they'd be the one to survive the raiding/pillaging/mugging, but the nature of these predatory interactions is that the people doing the victimizing have a lot more experience and resources than the people being victimized. (Same reason lots of criminals get caught by the police.)

If you're being "eaten", don't try to get clever. Fight back, get loud, get nasty, and never follow the attacker to a second location.

I wonder if OpenAI's o1 changes the game here. Its architecture seems specifically designed for information retrieval.

As I understand it, the point of analyzing RCTs by "intention to treat" is that there will be roughly as many high-IQ people in both groups, since assignment is random. People who get the advice but don't follow it aren't moved to the "didn't get advice" group.

So what the study measures is "On average, how much of an effect will a doctor telling you to breastfeed have on you?". The results are noisier, but less vulnerable (or even immune?) to confounders.
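
Here's a minimal simulation of that logic, with every parameter invented for illustration: comparing by assignment (intention to treat) recovers an unconfounded, diluted "effect of being told", while a naive comparison by actual behavior absorbs the confounder.

```python
# Toy intention-to-treat simulation. All parameters are made up.
import random

random.seed(0)
N = 200_000
TRUE_EFFECT = 1.0  # assumed benefit of actually breastfeeding

def person(assigned_advice: bool):
    iq = random.gauss(0, 1)  # unobserved confounder
    # Compliance with the advice correlates with the confounder.
    p = min(max(0.6 + 0.2 * iq, 0.0), 1.0) if assigned_advice else 0.2
    breastfeeds = random.random() < p
    outcome = 2.0 * iq + TRUE_EFFECT * breastfeeds + random.gauss(0, 1)
    return breastfeeds, outcome

treated = [person(True) for _ in range(N)]
control = [person(False) for _ in range(N)]

# ITT: compare by random assignment, ignoring who actually complied.
itt = sum(o for _, o in treated) / N - sum(o for _, o in control) / N

# Naive: compare by actual behavior, which drags in the confounder.
fed = [o for bf, o in treated + control if bf]
unfed = [o for bf, o in treated + control if not bf]
naive = sum(fed) / len(fed) - sum(unfed) / len(unfed)

print(f"ITT estimate:   {itt:.2f}")    # ~0.4: diluted by non-compliance, but unconfounded
print(f"Naive estimate: {naive:.2f}")  # inflated well past the true 1.0 by the IQ confounder
```

The ITT number answers exactly the question above ("what does the advice do, on average?"), and randomization keeps the IQ confounder out of it.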

I'm a little confused. Is Claude considered a reliable secondary source now? Did you not even check Wikipedia?

I'm as enthusiastic about the power of AI as the next guy, but that seems like the worst possible way to use it.
