As far as I understand, the banner is distinct - the team members don't seem to be the same, but there is meaningful overlap with the continuation of the agenda. I believe the most likely source of error here is whether work is actually continuing in what could be called this direction. Do you believe the representation should be changed?
I think your comment adds a relevant critique of the criticism, but given that the criticism comes from someone contributing to the project, I don't think it's worth leaving out altogether. I added a short summary and a hyperlink to a footnote.
Good point imo, expanded and added a hyperlink!
Would you agree that the entire agenda of collective intelligence is aimed at addressing 11 ("Someone else will deploy unsafe superintelligence first") and 13 ("Fair, sane pivotal processes"), or does that cut off nuance?
Thanks, added!
I really like the artistry of post-writing here; the introduction to and transition between the three videos felt especially great.
I've been internally using the term elemental for something in this neighborhood - Frame-Breaker elemental, Incentive-Slope elemental, etc. The term feels more totalizing (having two cup-stacking skills is easy to envision; being a several-thing elemental points in the direction of you being some mix of those things, and only those things), but some other connotations feel more on-target (like the difficulty of not doing the thing). I also like the term's aesthetics, but I could well be alone in that.
I'm not sure I understand the cryptographer's constraint very well, especially with regard to language: individual words can have multiple meanings ("awesome", "literally", "love"). It's generally possible to infer which decryption was intended from the wider context, but sometimes the context itself admits different and mutually exclusive decryptions, as in cases of real or perceived dogwhistling.
One way I could see this specific issue being resolved is by looking at the intent behind the original communication - this would make it so that there is a fact of the matter settling which decryption is "correct" - but that seems to fail in a different way: agents don't seem to have full introspective access to what they are doing or to the likely outcomes of their actions, as in some cases of infidelity or promise-making.
This, too, could be resolved by saying that an agent's intention is "the outcomes they're attempting to instantiate regardless of self-awareness", but by that point it seems to me that we've agreed with Rosenberg's claim that it's Darwinian all the way down.
What am I missing?
I might be missing the forest for the trees, but all of those still feel like they end up making some kind of prediction based on the model, even if the predictions aren't trivial to test. Something like:
If Alice were informed by some neutral party that she took Bob's apple, Charlie would predict that she would show no meaningful remorse and make no attempt to repair the damage beyond trivial gestures like an off-hand "sorry", and would further predict that some other minor extraction of resources is likely to follow; Diana would predict that Alice would treat her overreach more seriously once informed of it. Something similar can be done on the meta-level.
None of these are slam dunks, and there are plenty of reasons besides the underlying model why things might turn out exactly as Charlie or Diana predicts, but that just feels like how the Bayesian cookie crumbles, and I would still expect evidence to accumulate in one direction or the other over time.
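To make the "evidence accumulates" part concrete, here's a toy odds-form Bayes sketch; the likelihood ratio of 3 and the rough independence of the observations are made-up assumptions for illustration, not anything claimed above. If Charlie's and Diana's models start at even odds, and each weak observation (an off-hand "sorry", another minor resource grab) is about three times as likely under Charlie's model as under Diana's, then after four such observations

$$\frac{P(H_{\text{Charlie}} \mid E_1,\dots,E_4)}{P(H_{\text{Diana}} \mid E_1,\dots,E_4)} = \frac{P(H_{\text{Charlie}})}{P(H_{\text{Diana}})} \prod_{i=1}^{4} \frac{P(E_i \mid H_{\text{Charlie}})}{P(E_i \mid H_{\text{Diana}})} \approx 1 \times 3^{4} = 81,$$

so no single observation is decisive, but the direction becomes hard to miss.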
Strong opinion weakly held: it feels like an iterated version of this prediction-making and tracking over time is how our native bad actor detection algorithms function. It seems to me that shining more light on this mechanism would be good.
I am not one of the Old Guard, but I have an uneasy feeling about something related to the Chakra phenomenon.
It feels like there's a lot of hidden value clustered around woo-y topics like Chakras and Tulpas, and the right orientation towards these topics seems fairly straightforward: if it calls out to you, investigate and, if you please, report. What feels less clear to me is how I, as an individual or as a member of some broader rat community, should respond when, according to me, people's claims do not pass certain forms of bullshit tests.
This comes from someone with little interest in or knowledge of the former, but after accidentally stumbling into some Tulpa-related territory and bumbling around in it for a while, I found that the Internal Family Systems model captures a large part of what I was grasping towards, this time with testable predictions and the whole deal.
I haven't given the individual-as-part-of-community thing that much thought, but my intuition is that I would make a poor judge of when to say "nope, your thing is BS", and I'm not sure what metric we might use to figure out who would make a better judge besides overall faith in reasoning capability.
Very fair observation; my take is that a relevant continuation is occurring under OpenAI Alignment Science, but I would be interested in counterpoints - the main claim I am gesturing towards here is that the agenda is alive in other parts of the community, despite the previous flagship (and the specific team) going down.