There's also a bit of divergence between "has skills/talent/power" and "cares about what you care about". Like, yes, maybe there is a very skilled person for that role, but are they trustworthy/reliable/aligned, with the same priorities? You always face the risk of handing additional power to an already powerful adversarial agent. You should be really careful about that. Maybe focus more on virtue rather than skill.
There's a bit of a disconnect between the statement "it all adds up to normality" and what that could actually look like. Suppose someone discovered a new physical law: if you say "fjrjfjjddjjsje" out loud, you can fly, somehow. Now you are reproducing all of the old model's confirmed behaviour while flying. Is this normal?
Come to think of it, zebras are the closest thing we have to such adversarially colored animals. Imagine if they were also flashing at 17 Hz, the optimal epilepsy-inducing frequency according to this paper: https://onlinelibrary.wiley.com/doi/10.1111/j.1528-1167.2005.31405.x
Well, both, but mostly the issue of it being somewhat evil. Though it can probably be good even from a strategic, human-focused view, by giving assurances to agents who would otherwise be adversarial to us. It's not that clear to me it's a really good strategy, because it kind of incentivizes threats: you're pushed to find more destructive options so that surrendering earns you more reward. Also, couldn't the agent first try whatever routes fail safely, and only then opt to cooperate? Sounds difficult to get right.
And to be clear, it's a hard problem anyway: even without this explicit thinking, this stuff is churning in the background, or will be. It's a really general issue.
Check out this writeup; I mostly agree with everything there:
https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=y4mLnpAvbcBbW4psB
It's really hard for humans to match the style / presentation / language without putting a lot of work into understanding the target of the comment. LLMs are inherently worse (right now) at doing the understanding, coming up with things worth saying, and being calibrated about being critical, AND they are a lot better at just imitating the style.
This just invalidates some side signals humans habitually use on one another.
This should probably only be attempted with a clear and huge warning that it's an LLM-authored comment. Because LLMs are good at matching style without matching the content, it could end up exploiting heuristics that users have calibrated only for human levels of honesty / reliability / non-bullshitting.
Also check this comment about how conditioning on the karma score can give you hallucinated strong evidence:
https://www.lesswrong.com/posts/PQaZiATafCh7n5Luf/gwern-s-shortform?commentId=smBq9zcrWaAavL9G7
Suppose you are writing a simulation. You keep optimizing it, hardcoding some things, handling different cases more efficiently, and so on. One day your simulation becomes efficient enough that you can run a big enough grid for long enough, and life develops. Then intelligent life. Then they try to figure out the physics of their universe, and they succeed! But, oh wait, their description is extremely short yet completely computationally intractable.
Can you say that they have already figured out what kind of universe they are in, or should you wait until they discover the other million lines of optimization code? Should you create giant sparkling letters saying "congratulations, you figured it out", or wait for a more efficient formulation?
The alternative is to pit people against each other in competitive games, 1-on-1 or in teams. I don't think the feeling you get from such games is consistent with "being competent doesn't feel like being competent, it feels like the thing just being really easy", probably mainly because of skill-based matchmaking: there are always opponents who pose a real challenge.
Hmm, maybe such games need some more long-tailed probabilistic matching, so you occasionally get to feel the difference. Or maybe variable team sizes, with many incompetent players versus a few competent ones, to get more of a "doomguy" feeling.
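To make "long-tailed probabilistic matching" a bit more concrete, here's a minimal sketch (my own toy numbers and function names, not anything from a real matchmaking system): most matches are drawn tightly around the player's rating, but a small fraction come from a heavy-tailed distribution, so every now and then you steamroll or get steamrolled.

```python
import random

def sample_opponent_rating(player_rating, spread=50.0, tail_prob=0.1, tail_scale=400.0):
    """Sample an opponent rating: usually close to the player, but with a
    heavy tail so clear mismatches (in either direction) happen sometimes."""
    if random.random() < tail_prob:
        # Long-tail draw: opponent can be much weaker or much stronger.
        return player_rating + random.choice([-1, 1]) * random.expovariate(1.0 / tail_scale)
    # Ordinary draw: tight match, so there's always a real challenge.
    return random.gauss(player_rating, spread)

# How often does a 1500-rated player get a clearly lopsided match?
matches = [sample_opponent_rating(1500) for _ in range(10_000)]
lopsided = sum(abs(r - 1500) > 300 for r in matches) / len(matches)
print(f"share of lopsided matches: {lopsided:.2%}")
```

With these made-up numbers a clearly lopsided match shows up in a few percent of games, which is roughly the "sometimes feel the difference" frequency I have in mind.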
The notion of "who was involved" is kind of weird. Like, suppose there is Greg. Greg will firebomb The Project if he is not involved. If he is involved, he will put in modest effort. Should he receive an enormous share just because of this threat? That seems very unfair.
What is the counterfactual construction procedure here? Do we assume that the other players stop existing when calculating the value of a coalition that doesn't include them? But they are still there, in the world. And often it's not even clear what it would mean for them to do nothing.
o1 suggested modeling The Greg Gambit as a partition function game, but claimed everything gets complicated there. Or maybe model it as bargaining.
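To make the "who was involved" worry concrete, here's a toy sketch (my own numbers; the characteristic functions v_threat and v_no_threat are hypothetical) of a plain Shapley-value split over Greg and two other contributors, under the two counterfactuals from the question above: one where a coalition without Greg gets firebombed, and one where absent players are assumed to simply do nothing.

```python
from itertools import permutations

def shapley(players, v):
    """Shapley value: average marginal contribution over all join orders."""
    shares = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            shares[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    return {p: s / len(orders) for p, s in shares.items()}

players = ("greg", "alice", "bob")

# Counterfactual A: a coalition without Greg gets firebombed, so its value is 0.
def v_threat(c):
    if "greg" not in c:
        return 0.0
    return {1: 10.0, 2: 60.0, 3: 100.0}[len(c)]

# Counterfactual B: absent players just "do nothing" (no firebombing), so value
# depends only on how many people are working; Greg adds his modest effort.
def v_no_threat(c):
    return {0: 0.0, 1: 30.0, 2: 70.0, 3: 100.0}[len(c)]

print("threat baked in:", shapley(players, v_threat))
print("threat ignored: ", shapley(players, v_no_threat))
```

Under v_threat Greg captures more than half the value purely because removing him zeroes out every coalition; under v_no_threat he's just one contributor among three. Which of those counterfactuals is the "legitimate" one is exactly the question.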