The new Moore's Law for AI Agents (aka More's Law) accelerated at around the time people in research roles started talking a lot more about getting value from AI coding assistants. AI accelerating AI research seems like the obvious interpretation, and if true, the new exponential is here to stay. This gets us to 8 hour AIs in ~March 2026, and 1 month AIs around mid 2027.[1]
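For concreteness, here's a minimal sketch of that extrapolation. The start date, starting horizon, and the 4-month doubling time are illustrative assumptions on my part, not measured values:

```python
# Illustrative extrapolation of the task-horizon trend.
# ASSUMPTIONS (made up for the sketch): a ~4 hour task horizon at the
# assumed start date, and a ~4 month doubling time on the accelerated trend.
from datetime import date, timedelta
import math

start_date = date(2025, 11, 1)   # assumed "now"
start_horizon_hours = 4.0        # assumed current horizon
doubling_months = 4.0            # assumed doubling time

def date_reaching(target_hours: float) -> date:
    """Date at which the horizon reaches target_hours under pure doubling."""
    doublings = math.log2(target_hours / start_horizon_hours)
    return start_date + timedelta(days=doublings * doubling_months * 30.44)

print("8 hour horizon: ", date_reaching(8))    # ~March 2026
print("1 month horizon:", date_reaching(167))  # ~167 working hours -> roughly mid 2027
```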
I do not expect humanity to retain relevant steering power for long in a world with one-month AIs. If we haven't solved alignment, either iteratively or once-and-for-all[2], it's loo...
Nice! I think you might find my draft on Dynamics of Healthy Systems: Control vs Opening relevant to these explorations; feel free to skim, as it's longer than ideal (hence unpublished, despite containing what feels like a general and important insight that applies to agency at many scales). I plan to write a cleaner version sometime, but for now it's a Claude-assisted write-up of my ideas, so it's about 2-3x wordier than it should be.
Interesting, yes. I think I see, and I think I disagree with this extreme formulation, despite knowing that this is remarkably often a good direction to go in. If "[if and only if]" were replaced with "especially", I would agree, as I think the continual/regular release process is an amplifier of progress, not a strict prerequisite.
As for re-forming, yes, I do expect there is a true pattern we are within, which can be known in its full specification, though all the consequences of that specification would only fit into a universe. I think having fluidity on as ma...
I'd love to see the reading time listed on the frontpage. That would make the incentives naturally slide towards shorter posts, as more people would click and the posts would get more karma. Reading time feels much more decision-relevant than when the post was posted.
Yup, DMing for context!
hmmm, I'm wondering whether you're pointing at the thing in this space which I intuitively expect is good, just using words that sound more extreme than I'd use, or whether you're pointing at a different thing. I'll take a shot at describing the thing I'd be happy with of this type, and you can let me know whether it feels like the thing you're trying to point to:
...An ontology restricts the shape of thought by being of a set shape. All of them are insufficient; the Tao that can be specified is not the true Tao, but each
you could engage with the Survival and Flourishing Fund
Yeah! The S-process is pretty neat, buying into that might be a great idea once you're ready to donate more.
Elaborating Plex's idea: I imagine you might be able to buy into participation as an SFF speculation granter with $400k. Upsides:
(a) Can see a bunch of people who're applying to do things they claim will help with AI safety;
(b) Can talk to ones you're interested in, as a potential funder;
(c) Can see discussion among the (small dozens?) of people who can fund SFF speculation grants, see what people are saying they're funding and why, ask questions, etc.
So it might be a good way to get the lay of the land, find lots of people and groups, hear people's responses to some of your takes and see if their responses make sense on your inside view, etc.
Consider reaching out to Rob Miles.
He tends to get far more emails than he can handle so a cold contact might not work, but I can bump this up his list if you're interested.
Firstly: Nice, glad to have another competent and well-resourced person on-board. Welcome to the effort.
I suggest: Take some time to form reasonably deep models of the landscape, first technical[1] and then the major actors and how they're interfacing with the challenge.[2] This will inform your strategy going forward. Most people, even people who are full time in AI safety, seem to not have super deep models (so don't let yourself be socially-memetically tugged by people who don't have clear models).
Being independently wealthy in this field is a...
eh, <5%? More that we might be able to get the AIs to do most of the heavy lifting of figuring this out, but that's a sliding scale of how much oversight the automated research systems need to not end up in wrong places.
My current guess as to Anthropic's effect:
Shorter due to:
By "discard", do you mean remove specifically the fixed-ness in your ontology such that the cognition as a whole can move fluidly and the aspects of those models which don't integrate with your wider system can dissolve, as opposed to the alternate interpretation where "discard" means actively root out and try and remove the concept itself (rather than the fixed-ness of it)?
(also 👋, long time no see, glad you're doing well)
I had a similar experience a couple years back when running bio anchors with numbers which seemed more reasonable/less consistently slanted towards longer timelines to me, getting:
before taking into account AI accelerating AI development, which I expected to bring it a few years earlier.
Also, given the number of tags in each section, I suggest that "load more" should be "load all".
This is awesome! Three comments:
I've heard from people I trust that:
Using them can be very worth it as they're always available and cheap, but they require a little intentionality. I suggest asking your human therapist for a few suggestions of kinds of work you might do with a peer or LLM assistant, and monitoring how it affects you while exploring, if you feel safe enough doing that. Ma...
oh yup, sorry, I meant mid 2026, like ~6 months before the primary proper starts. But could be earlier.
Yeah, this seems worth a shot. If we do this, we should do our own pre-primary in like mid 2027 to select who to run in each party, so that we don't split the vote and also so that we select the best candidate.
Someone I know was involved in a DIY pre-primary in the UK which unseated an extremely safe politician, and we'd get a bunch of extra press while doing this.
Humans without scaffolding can do a very finite number of sequential reasoning steps without mistakes. That's why thinking aids like paper, whiteboards, and other people to bounce ideas off and keep the cache fresh are so useful.
With a large enough decisive strategic advantage, a system can afford to run safety checks on any future versions of itself and anything else it's interacting with sufficient to stabilize values for extremely long periods of time.
Multipolar worlds though? Yeah, they're going to get eaten by evolution/moloch/power seeking/pythia.
More cynical take based on the Musk/Altman emails: Altman was expecting Musk to be CEO. He set up a governance structure which would effectively be able to dethrone Musk, with himself as the obvious successor, and was happy to staff the board with ideological people who might well take issue with something Musk did down the line, giving Altman a shot at the throne.
Musk walked away, and it would've been too weird to change his mind on the governance structure. Altman thought the probability of this trap firing wasn't high enough to be worth disarming it at any time before it di...
Looks fun!
I could also remove Oil Seeker's protection from Pollution; they don't need it for making Black Chips to be worthwhile, but removing it would make that less of an amazing deal than it currently is.
Maybe halve the Pollution cost for Black instead, if removing the protection entirely turns out to be too weak?
Seems accurate, though I think Thinking This Through A Bit involved the part of backchaining where you look at approximately where on the map the destination is, and that's what some pro-backchain people are trying to point at. In the non-metaphor, the destination is not well specified by people in most categories, and might be like 50 ft in the air so you need a way to go up or something.
And maybe if you are assisting someone else who has well-grounded models, you might be able to solve subproblems within their plan and do good, but you're betting your imp...
Give me a dojo.lesswrong.com, where the people into mental self-improvement can hang out and swap techniques, maybe a meetup.lesswrong.com where I can run better meetups and find out about the best rationalist get-togethers. Let there be an ai.lesswrong.com for the people writing about artificial intelligence.
Yes! Ish! I'd be keen to have something like this for the upcoming aisafety.com/stay-informed page, where it currently looks like we'll resort to linking to https://www.lesswrong.com/tag/ai?sortedBy=magic#:~:text=Posts%20tagged%20AI as there's no...
I'm glad you're trying to figure out a solution. I am however going to shoot this one down a bunch.
If these assumptions were true, this would be nice. Unfortunately, I think all three are false.
LLMs will never be superintelligent when predicting a single token.
In a technical sense, definitively false. Redwood compared human and AI token prediction and found that even early AIs were far superhuman. Also, in a more important sense, you can apply a huge amount of optimization to selecting a token. This video gives a decent intuition, though in a slightly different settin...
[To] the extent human civilization is human-aligned, most of the reason for the alignment is that humans are extremely useful to various social systems like the economy, and states, or as substrate of cultural evolution. When human cognition ceases to be useful, we should expect these systems to become less aligned, leading to human disempowerment.
oh good, I've been thinking this basically word for word for a while and had it in my backlog. Glad this is written up nicely, far better than I would likely have done :)
The one thing I'm not a big fan of: I'd bet "Gr...
I think I have a draft somewhere, but never finished it. tl;dr: Quantum computers let you derive private keys from public keys (so any wallet that has ever sent a transaction, since sending exposes the public key). Upgrading can protect wallets where people move their coins, but it's going to be messy, slow, and won't work for lost-key wallets, which are a pretty huge fraction of the total BTC reserve. Once we get quantum, BTC at least is going to have a very bad time; others will have a moderately bad time depending on how early they upgrade.
Nice! I haven't read a ton of Buddhism, cool that this fits into a known framework.
I'm uncertain of how you use the word consciousness here; do you mean our blob of sensory experience, or something else?
Yeah, ~subjective experience.
Let's do most of this via the much higher bandwidth medium of voice, but quickly:
Math, in the sense I'm trying to point to it, is 'Structure'. By which I mean: Well defined seeds/axioms/starting
give up large chunks of the planet to an ASI to prevent that
I know this isn't your main point, but... that isn't a kind of trade that is plausible. A misaligned superintelligence disassembles the entire planet, sun, and everything it can reach. Biological life does not survive, outside of some weird edge cases like "samples to sell to alien superintelligences that like life". Nothing in the galaxy is safe.
Re: Ayahuasca from the ACX survey having effects like:
[1] There's a cluster of subcultures that consistently drift toward philosophical idealist metaphysics (consciousness, not matter or math, as fundamental to reality): McKenna-style psychonauts, Silicon Valley Buddhist circles, neo-occultist movements, certain transhumanist branches, quantum consciousness theorists, and variou...
This suggests something profound about metaphysics itself: Our basic intuitions about what's fundamental to reality (whether materialist OR idealist) might be more about human neural architecture than about ultimate reality. It's like a TV malfunctioning in a way that produces the message "TV isn't real, only signals are real!"
In meditation, this is the fundamental insight, the so-called non-dual view. You are neither the fundamental non-self nor the specific self you believe yourself to be; they're all empty views, yet that view ...
We do not take a position on the likelihood of loss of control.
This seems worth taking a position on; the relevant people need to hear from the experts an unfiltered stance of "this is a real and perhaps very likely risk".
Agree that takeoff speeds are more important, and I expect that FrontierMath has much less effect on takeoff speed. Still, I think timelines matter enough that the value of informing people which you buy from this is likely not worth the cost, especially if the org avoids talking about risks in public and leadership isn't focused on agentic takeover, so the info is not packaged with the context needed for it to have the effects which would help.
Evaluating the final model tells you where you got to. Evaluating many small models and checkpoints helps you get further faster.
Even outside of its argument against the Control paradigm, this post (esp. The Model & The Problem & The Median Doom-Path: Slop, not Scheming) covers some really important ideas, which I think people working on many empirical alignment agendas would benefit from being aware of.
One neat thing I've explored is learning about new therapeutic techniques by dropping a whole book into context and asking for guiding phrases. Most therapy books spend a lot of their length covering general principles of minds and how to work with them, with the unique aspects buried in a way which is not super efficient for someone who already has the universal ideas. Getting guiding phrases gives a good starting point for what the specific shape of a technique is, and means you can kinda use it pretty quickly. My project system prompt is:
...Given the name of, and potentia
I'm guessing you view having a better understanding of what's coming as very high value, enough that burning some runway is acceptable? I could see that model (though I put <15% on it), but I think it is at least not good, integrity-wise, to have put on the appearance of doing just the good-for-x-risk part and not sharing it as an optimizable benchmark, while being funded by and giving the data to people who will use it for capability advancements.
Being able to run evaluations on demand, and to run them intensively, lets them test small models for architecture improvements. This is where the vast majority of the capability gain is.
Getting an evaluation of each final model is going to be way less useful for the research cycle, as it only gives a final score, not a metric which is part of the feedback loop.
However, we have a verbal agreement that these materials will not be used in model training.
If by this you mean "OpenAI will not train on this data", that doesn't address the vast majority of the concern. If OpenAI is evaluating the model against the data, they will be able to more effectively optimize for capabilities advancement, and that's a betrayal of the trust of the people who worked on this with the understanding that it would be used only outside of the research loop to check for dangerous advancements. And, particularly, not to make those da...
Really high-quality, high-difficulty benchmarks are much scarcer and more important for advancing capabilities than mere training data. Having an apparently x-risk-focused org build a benchmark, implying it's for evaluating danger from highly capable models in a way which the capabilities orgs can't use to test their models, and then having it turn out to be secretly funded by OpenAI, with OpenAI getting access to most of the data, is very sketchy.
Some people who contributed questions likely thought they would be reducing x-risk by helping build bright line warning s...
This is a good idea usually, but critically important when using skills like those described in Listening to Wisdom, in a therapeutic relationship (including many forms of coaching), or while under the influence of substances that increase your rate of cognitive change and lower barriers to information inflow (such as psychedelics).
If you're opening yourself up to receive the content of those vibes in an emotional/embodied/deep way, and those vibes are bad, this can be toxic to an extent you will not be expecting (even if you try to account for this warnin...
Maybe the fact that exact evaluations aren't trivial is not entirely a bug; it might make the game more interesting (though maybe more annoying)?
I recommend most readers skip this subsection on a first read; it’s not very central to explaining the alignment problem.
Suggest either putting this kind of aside in a footnote, or giving the reader a handy link to the next section for convenience?
Nice!
(I wrote that the bit about not having to tell people your favourite suit or what cards you have leaves things open for some sharp or clever negotiation, but looking back I think it's mostly a trap. I haven't seen anyone get things to go better for them by hiding their suit.)
To add another layer to this strategy: giving each person one specific card in their suit that they want with a much higher strength might be fun, as the other players can ransom that card if they know about it (but might be happy trading it anyway). Also, having each of the four suits carry a different multiplier might be fun?
On one side: Humanoid robots have a much higher density of parts, requiring more machine-time than cars, which probably slows things a bunch.
On the other: you mention assuming no speedup from the robots building robot factories, but this seems like the dominant factor in the growth, and your numbers are going to be way underestimating things pretty quickly without it. I'd be interested in what those numbers look like under reasonable guesses about the robot workforce being part of a feedback cycle; a toy version of that loop is sketched below.
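As a rough illustration of why the reinvestment term dominates, here's a toy model; every number in it (initial assembly workforce, robots assembled per worker-year, fraction of the fleet reassigned to robot-building) is an assumption I made up for the sketch, not an estimate from the post.

```python
# Toy model: robots that help build more robots vs. a fixed human-only workforce.
# All parameters are illustrative assumptions, not estimates from the post.
human_workers = 100_000          # worker-equivalents permanently on robot assembly
robots = 0                       # robots in existence at year 0
robots_per_worker_year = 10      # robots one worker-equivalent assembles per year
reinvest_fraction = 0.2          # share of the robot fleet assigned back to robot-building

for year in range(1, 11):
    builders = human_workers + reinvest_fraction * robots
    robots += builders * robots_per_worker_year
    print(f"year {year:2d}: {robots:,.0f} robots")

# With reinvest_fraction = 0 this stays linear (1M robots/year under these numbers);
# with any nonzero reinvestment the fleet goes exponential within a few years.
```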
Or, worse, if most directions are net negative and you have to try quite hard to find one which is positive, almost everyone optimizing for magnitude will end up doing harm in proportion to how hard they optimize for magnitude.
Accurate, and one of the main reasons why most current alignment efforts will fall apart with future systems. A generalized version of this combined with convergent power-seeking of learned patterns looks like the core mechanism of doom.