All Comments

This philosophy thought experiment is a Problem of Excess Metal. This is where philosophers spice up thought experiments with totally unnecessary extremes, in this case an elite sniper, terrorists, children, and an evil supervisor. This is common; see also the Shooting Room Paradox (aka the Snake Eyes Paradox), the Smoking Lesion, Trolley Problems, etc. My hypothesis is that this is a status play whereby high decouplers can demonstrate their decoupling skill. It's net negative for humanity. Problems of Excess Metal also routinely contradict basic facts about reality. In this case, children do not have the same surface area as terrorists.

Here is an equivalent question that does not suffer from Excess Metal.

  • A normal archer is shooting towards two normal wooden archery targets on an archery range with a normal bow.
  • The targets are of equal size, distance, and height. One is to the left of the other.
  • There is normal wind, gravity, humidity, etc. It's a typical day on Earth.
  • The targets are some distance away, four times further away than she has fired before.

Q: If the archer shoots at the left target as if there are no external factors, is she more likely to hit the left target than the right target?

A: The archer has a 0% chance of hitting either target. Gravity is an external factor. If she ignores gravity when shooting a bow and arrow over a sufficient distance, she will always miss both targets, and she knows this. Since 0% = 0%, she is not more likely to hit one target than the other.

Q: But zero isn't a probability!

A: Then P(Left|I) = P(Right|I) = 0%, see Acknowledging Background Information with P(Q|I).

Yep, in what's possibly an excess of charity/politeness, I sure was glossing "exploiting loopholes and don't want their valuable loopholes removed" as one example of someone getting an unusual benefit.

I guess other forums don't literally have a good faith defence, but in practice they mostly only ban people who deliberately refuse to follow the rules/advice they're told about, or personally insult others repeatedly.

I feel like I have encountered fora that had genuinely more active moderation norms. There are a lot of personal discord servers I can think of with the same rough approach as a dinner party. There are reddit threads moderated in a similarly hands-on way.

Also, uh, I notice the juxtaposition of "I've been banned from other places, hence this attitude" and "in practice [other forums] mostly only ban people who deliberately refuse to follow the rules/advice they're told about, or personally insult others repeatedly" implies that you either refuse to follow rules/advice or insult others repeatedly. Obviously you said most cases, not all cases.

In the basketball practice example, if it was magically possible to let the lousy shots continue playing with each other at very low cost, almost every coach would allow it. They would only remove people who have bad faith.

Well, yes, and I've never heard of a coach saying someone wasn't allowed to play basketball anywhere. At least where I live, there's a public court about a ten minute bike ride away and basketballs are cheap. If, say, I'm a student on a college basketball team whose coach asked me to stop doing layups during his practices, I can even use the exact same court later when the team isn't practicing. The equivalent for LessWrong is, I believe, saying you're welcome to continue communicating on the internet but that it will happen on some other forum.

Your average basketball coach doesn't only remove people with bad faith, they also bench people or cut them from the team for not being good at basketball. That's quite common.

Before jumping into critique, the good:
- Kudos to Ben Pace for seeking out and actively engaging with contrary viewpoints
- The outline of the x-risk argument and history of the AI safety movement seem generally factually accurate

The author of the article makes quite a few claims about the details of PauseAI's proposal, its political implications, the motivations of its members and leaders...all without actually joining the public Discord server, participating in the open Q&A new member welcome meetings (I know this because I host them), or even showing evidence of spending more than 10 minutes on the website.  All of these basic research opportunities were readily available and would have taken far less time than was spent writing the article.  This tells you everything you need to know about the author's integrity, motivations, and trustworthiness.

That said, the article raises an important question: "buy time for what?"  The short answer is: "the real value of a Pause is the coordination we get along the way."  Something as big as an international treaty doesn't just drop out of the sky because some powerful force emerged and made it happen against everyone else's will.  Think about the end goal and work backwards:

1) An international treaty requires
2) Provisions for monitoring and enforcement,
3) Negotiated between nations,
4) Each of whom genuinely buys in to the underlying need
5) And is politically capable of acting on that interest because it represents the interests of their constituents
6) Because the general public understands AI and its implications enough to care about it
7) And feels empowered to express that concern through an accessible democratic process
8) And is correct in this sense of empowerment because their interests are not overridden by Big Tech lobbying
9) Or distracted into incoherence by internal divisions and polarization

An organization like PauseAI can only have one "banner" ask (1), but (2-9) are instrumentally necessary--and if those were in place, I don't think it's at all unreasonable to assume society would be in a better position to navigate AI risk.

Side note: my objection to the term "doomer" is that it implies a belief that humanity will fail to coordinate, solve alignment in time, or be saved by any other means, and thus will actually be killed off by AI--which seems like it deserves a distinct category from those who simply believe that the risk of extinction by default is real.

I think you're on to something!

To my taste, what you propose is slightly more specific than required. What I mean is: at least for me, the essential takeaway from your post is a bit broader than what you explicitly write*: a bit of paternalism by the 'state', incentivizing our short-term self to do stuff that's good for our long-term self. Which might become more important once abundance means the biggest enemies to our self-fulfillment are internal. So healthy internal psychology can become more crucial. And we're not used to taking this seriously, or at least not to actively tackling that internal challenge by seeking outside support.

So, the paternalistic incentives you mention could be cool.

Centering our school system, i.e. the compulsory education system, more around these sorts of somewhat more mindful-ish things could be another part.

Framing: I'd personally not so much frame it as 'supplemental income', even if it also acts as that: income, redistribution, making sure humans are well fed even once unemployed, really should come from UBI (plus, if some humans in the loop remain a real bottleneck, all the scarcity value of their deeds should go to them, no hesitation), full stop. But that's really just about framing. Overall I agree, yes, some extra incentive payments would seem all in order. To the degree that the material wealth they provide still matters in light of the abundance. Or, even, indeed, in a world where bad psychology does become a major threat to the otherwise affluent society, it could even be an idea to withhold a major part of the spoils from useful AI, just to be able to incentivize us to also do our job to remain/become sane.

*That is, at least I'm not spontaneously convinced exactly those specific aspects you mention are and will remain the most important ones, but overall such types of aspects of sound inner organization within our brain might be and remain crucial in a general sense.

I'm glad you shared this, but it seems way overhyped. Nothing wrong with fine tuning per se, but this doesn't address open problems in value learning (mostly of the sort "how do you build human trust in an AI system that has to make decisions on cases where humans themselves are inconsistent or disagree with each other?").

Wheeee!

Excuse: DeepSeek, and China Might Win!

I'll add one more gear here: I think you can improve on how much you're satisfying all three tradeoffs at once – writing succinctly and clearly and distilling complex things into simpler ones are skills. But, those things take time (both to get better at the skill, and to apply the skill).

LessWrong certainly could do better than we currently have, but we've spent 100s of hours on things like 

  • make the new user guide, and link new users to it
  • make the rate limits link people to pages that explain more about why you're rate limited
  • explain why the moderation norms are what they are

etc. We could put even more effort into it, but, well, there's also a lot of other stuff to do.

Not being an author in any of those articles, I can only give my own take.

I use the term "weak to strong generalization" to talk about a more specific research-area-slash-phenomenon within scalable oversight (which I define like SO-2,3,4). As a research area, it usually means studying how a stronger student AI learns what a weaker teacher is "trying" to demonstrate, usually just with slight twists on supervised learning, and when that works well, that's the phenomenon.

It is not an alignment technique to me because the phrase "alignment technique" sounds like it should be something more specific. But if you specified details about how the humans were doing demonstrations, and how the student AI was using them, that could be an alignment technique that uses the phenomenon of weak to strong generalization.

I do think the endgame for w2sg still should be to use humans as the weak teacher. You could imagine some cases where you've trained a weaker AI that you trust, and gain some benefit from using it to generate synthetic data, but that shouldn't be the only thing you're doing.
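(For anyone new to the term, here's a toy sketch of the basic phenomenon using stand-in sklearn models — not any particular paper's protocol: a weak teacher labels data, a stronger student trains only on those labels, and you compare against the gold-label ceiling.)

```python
# Toy illustration of the weak-to-strong generalization setup: a weak "teacher"
# provides imperfect labels, a stronger "student" trains on them, and we check
# how much of the gold-label performance it recovers. Purely synthetic sketch.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak teacher: a small model trained on only a little gold data.
weak = LogisticRegression(max_iter=1000).fit(X_train[:200], y_train[:200])
weak_labels = weak.predict(X_train)  # imperfect "demonstrations"

# Strong student: a bigger model trained only on the weak teacher's labels.
student = GradientBoostingClassifier().fit(X_train, weak_labels)

# Ceiling: the same strong model trained on gold labels.
ceiling = GradientBoostingClassifier().fit(X_train, y_train)

print("weak teacher:", weak.score(X_test, y_test))
print("w2s student:", student.score(X_test, y_test))
print("gold ceiling:", ceiling.score(X_test, y_test))
```

When the student ends up closer to the ceiling than to its teacher, that's the "phenomenon" in the sense I mean above.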

“incentivized to build intuitive self-models” does not necessarily imply “does in fact build intuitive self-models”. As I wrote in §1.4.1, just because a learning algorithm is incentivized to capture some pattern in its input data, doesn’t mean it actually will succeed in doing so.

Right of course. So would this imply that organisms that have very simple brains / roles in their environment (for example: not needing to end up with a flexible understanding of the consequences of your actions), would have a very weak incentive too?

And if an intuitive self-model helps with things like flexible planning, then even though it's a creation of the 'blank-slate' cortex, surely some organisms would have a genome that sets up certain hyperparameters that would encourage it, no, since it would seem strange for something pretty seriously adaptive to be purely an 'epiphenomenon' (as per language being facilitated by hyperparameters encoded in the genome)? But also it's fine if you just don't have an opinion on this haha. (Also: wouldn't some animals not have an incentive to create self-models if creating a self-model would not seriously increase performance in any relevant domain? Like a dog trying to create an in-depth model of the patterns that appear on computer monitors, maybe.)

It does seem like flexible behaviour in some general sense is perfectly possible without awareness (as I'm sure you know) but I understand that awareness would surely help a whole lot.

You might have no opinion on this at all but would you have any vague guess at all as to why you can only verbally report items in awareness? (cause even if awareness is a model of serial processing and verbal report requires that kind of global projection / high state of attention, I've still seen studies showing that stimuli can be globally accessible / globally projected in the brain and yet still not consciously accessible, presumably in your model due to a lack of modelling of that global-access)

See https://pauseai.info. They think lobbying efforts have been more successful than expected, but politicians are reluctant to act on it before they hear about it from their constituents. Individuals sending emails also helps more than expected. The more we can create common knowledge of the situation, the more likely the government acts.

I don’t think the average person would be asking AI what are the best solutions for preventing existential risks. As evidence, just look around:

There are already people with lots of money and smart human research assistants. How many of those people are asking those smart human research assistants for solutions to prevent existential risks? Approximately zero.

Here’s another: The USA NSF and NIH are funding many of the best scientists in the world. Are they asking those scientists for solutions to prevent existential risk? Nope.

Demis Hassabis is the boss of a bunch of world-leading AI experts, with an ability to ask them to do almost arbitrary science projects. Is he asking them to do science projects that reduce existential risk? Well, there’s a DeepMind AI alignment group, which is great, but other than that, basically no. Instead he’s asking his employees to cure diseases (cf Isomorphic Labs), and to optimize chips, and do cool demos, and most of all to make lots of money for Alphabet.

You think Sam Altman would tell his future powerful AIs to spend their cycles solving x-risk instead of making money or curing cancer? If so, how do you explain everything that he’s been saying and doing for the past few years? How about Mark Zuckerberg and Yann LeCun? How about random mid-level employees in OpenAI? I am skeptical.

Also, even if the person asked the AI that question, then the AI would (we’re presuming) respond: “preventing existential risks is very hard and fraught, but hey, what if I do a global mass persuasion campaign…”. And then I expect the person would reply “wtf no, don’t you dare, I’ve seen what happens in sci-fi movies when people say yes to those kinds of proposals.” And then the AI would say “Well I could try something much more low-key and norm-following but it probably won’t work”, and the person would say “Yeah do that, we’ll hope for the best.” (More such examples in §1 here.)

Violence by radical vegans and left-anarchists has historically not been extremely rare. Nothing in the Zizians' actions strikes me as particularly different (in kind if not in competency) from, say, the Belle Époque illegalists like the Bonnot Gang, or the Years of Lead leftist groups like the Red Army Faction or the Weather Underground.

I don't post on LessWrong much but I would much rather be explicitly rate-limited than shadow-banned, if content I was posting needed to be moderated.

+9. This argues that some key puzzle pieces of genius include "solitude," and "sitting with confusion until open-curiosity allows you to find the right questions." This feels like an important key that I'm annoyed at myself for not following up on more. 

The post is sort of focused on "what should an individual do, if they want to cultivate the possibility of genius?".

One of the goals I have, in my work at Lightcone, is to ask "okay but can we do anything to foster genius at scale, for the purpose of averting x-risk?". This might just be an impossible paradox, where trying to make it on purpose intrinsically kills it before it can flower. I think it might be particularly impossible to do at massive scale – if you try to build a system for 1000s of geniuses, that system can't help but develop the pathologies that draw people into fashions that stifle the kind of thinking you need.

But, it doesn't seem impossible to foster-genius-on-the-margin at smallish scales.

Challenges that come immediately to mind:

  1. how do you actually encourage, or teach people (or arrange for them to discover for themselves), how to sit with confusion, and tease out interesting questions?
  2. by default, if you suggest a bunch of people spend time alone in thought, I think you mostly end up wasting a lot of people's time (and possibly destroying their lives?). Many geniuses don't seem that happy, and people who don't actually have the right taste/generators for that thinking to be productive will, I bet, end up even more unhappy. If you try to filter for "has enough taste and/or IQ to have a decent shot", you probably immediately reintroduce all the problems this post argues against.
  3. somehow the people need enough money to live, but any system for allocating money for this would either be hella exploitable, or very prone to centralization/fashion/goodhart collapse.
  4. eventually, when an idea seems genuinely promising, your creativity needs to survive contact with others forming expectations of you.

I've been working on "how to encourage sitting with confusion." 

I think I've been less focused on "how to sit with open curiosity." (Partly because I am bad at it, and it feels hard to thread the needle of having "enough open curiosity to identify novel important questions and reframings" without just failing to direct your attention towards the (rather dire and urgent) problems the x-risk community needs to figure out)

(But, Logan Strohl seems to be doing a pretty good job at teaching the cultivation of curiosity, which at least gives me hope that it is possible)

I think the unfortunate answer to the third bullet is "well, probably people basically need to self-finance during a longish period where they don't have something legibly worth funding." (But, this maybe suggests an ecosystem where it's sort of normal to have a day job that doesn't consume all your time)

...

What sort of in-person community is good for intellectual progress?

Notably: Lightcone had been working on the sort of coworking space this post argues against. We did stop work on the coworking space before this post even came out. I did have some concrete thoughts when reading this "maybe, with Lighthaven, we'll be able to correct this – Lighthaven more naturally fits into an ecosystem where people are off working mostly alone, but periodically they gather to share what they're working on, in a space that is optimized for exploring curiosity in a spacious way."

But, we haven't really put much attention into the "foster a thing where more people make some attempt to go off and think alone on purpose" part.

By the way - I imagine you could do a better job with the evaluation prompts by having another LLM pass, where it formalizes the above more and adds more context. For example, with an o1/R1 pass/Squiggle AI pass, you could probably make something that considers a few more factors with this and brings in more stats. 

Related Manifold question here:
 

what matters algorithmically is how they’re connected

I just realised that quote didn't mean what I thought it did. But yes, I do understand this, and Key seems to think the recurrent connections just aren't strong (they are 'diffusely interconnected'). But whether this means they have an intuitive self-model or not, honestly who knows; do you have any ideas of how you'd test it? Maybe like Graziano does with attentional control?

(I think we’re in agreement on this?)

Oh yes definitely.

I know nothing about octopus nervous systems and am not currently planning to learn, sorry.

Heheh, that's alright, I wasn't expecting you to; thanks for thinking about it for a moment anyway. I will simply have to learn myself.

That counts! Thanks for posting. I look forward to seeing what it will get scored as. 

I’m not sure if this is what you’re looking for, but here’s a fun little thing that came up recently when I was writing this post:

Summary: “Thinking really hard for five seconds” probably involves less primary metabolic energy expenditure than scratching your nose. (Some people might find this obvious, but other people are under a mistaken impression that getting mentally tired and getting physically tired are both part of the same energy-preservation drive. My belief, see here, is that the latter comes from an “innate drive to minimize voluntary motor control”, the former from an unrelated but parallel “innate drive to minimize voluntary attention control”.)

Model: The net extra primary metabolic energy expenditure required to think really hard for five seconds, compared to daydreaming for five seconds, may well be zero. For an upper bound, Raichle & Gusnard 2002 says “These changes are very small relative to the ongoing hemodynamic and metabolic activity of the brain. Attempts to measure whole brain changes in blood flow and metabolism during intense mental activity have failed to demonstrate any change. This finding is not entirely surprising considering both the accuracy of the methods and the small size of the observed changes. For example, local changes in blood flow measured with PET during most cognitive tasks are often 5% or less.” So it seems fair to assume it’s <<5% of the ≈20 W total, which gives <<1 W × 5 s = 5 J. Next, for comparison, what is the primary metabolic energy expenditure from scratching your nose? Well, for one thing, you need to lift your arm, which gives mgh ≈ 0.2 kg × 9.8 m/s² × 0.4 m ≈ 0.8 J of mechanical work. Divide by maybe 25% muscle efficiency to get 3.2 J. Plus more for holding your arm up, moving your finger, etc., so the total is almost definitely higher than the “thinking really hard”, which again is probably very much less than 5 J.
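(For anyone who wants to check the arithmetic, here's a minimal script reproducing it; the 20 W, 5%, 0.2 kg, 0.4 m, and 25% figures are the same rough assumptions as in the paragraph above.)

```python
# Back-of-the-envelope check of the numbers above (all inputs are rough assumptions).
brain_power_w = 20          # approximate total brain metabolic power
thinking_fraction = 0.05    # "5% or less" upper bound on task-related change
thinking_time_s = 5

# Loose upper bound on extra energy for "thinking really hard" for five seconds
thinking_energy_j = brain_power_w * thinking_fraction * thinking_time_s  # = 5 J

# Nose-scratching: mechanical work to lift the arm, divided by muscle efficiency
mass_kg = 0.2               # rough effective mass being lifted
g = 9.8                     # m/s^2
height_m = 0.4              # rough lift height
muscle_efficiency = 0.25

mechanical_work_j = mass_kg * g * height_m                 # ≈ 0.8 J
metabolic_cost_j = mechanical_work_j / muscle_efficiency   # ≈ 3.2 J, before holding the arm up etc.

print(f"thinking (loose upper bound): {thinking_energy_j:.1f} J")
print(f"nose scratch (partial lower bound): {metabolic_cost_j:.1f} J")
```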

Technique: As it happened, I asked Claude to do the first-pass scratching-your-nose calculation. It did a great job!

I'm pretty sure that measures of the persuasiveness of a model which focus on text are going to greatly underestimate the true potential of future powerful AI.

I think a future powerful AI would need different inputs and outputs to perform at maximum persuasiveness.

Inputs

  • speech audio in
  • live video of target's face (allows for micro expression detection, pupil dilation, gaze tracking, bloodflow and heart rate tracking)
  • EEG signal would help, but is too much to expect for most cases
  • sufficiently long interaction to experiment with the individual and build a specific understanding of their responses

Outputs

  • emotionally nuanced voice
  • visual representation of an avatar face (may be cartoonish)
  • ability to present audiovisual data (real or fake, like graphs of data, videos, pictures)

For reference on bloodflow, see: https://youtu.be/rEoc0YoALt0?si=r0IKhm5uZncCgr4z

If anyone is game for creating an agentic research scaffold like that Thane describes

Here's the basic structure in more detail, as envisioned after 5 minutes' thought:

  • You feed a research prompt to the "Outer Loop" of a model, maybe have a back-and-forth fleshing out the details.
  • The Outer Loop decomposes the research into several promising research directions/parallel subproblems.
  • Each research direction/subproblem is handed off to a "Subagent" instance of the model.
  • Each Subagent runs search queries on the web and analyses the results, up to the limits of its context window. After the analysis, it's prompted to evaluate (1) which of the results/sources are most relevant and which should be thrown out, (2) whether this research direction is promising and what follow-up questions are worth asking.
    • If a Subagent is very eager to pursue a follow-up question, it can either run a subsequent search query (if there's enough space in the context window), or it's prompted to distill its current findings and replace itself with a next-iteration Subagent, in whose context it loads only the most important results + its analyses + the follow-up question.
    • This is allowed up to some iteration count.
  • Once all Subagents have completed their research, instantiate an Evaluator instance, into whose context window we dump the final results of each Subagent's efforts (distilling if necessary). The Evaluator integrates the information from all parallel research directions and determines whether the research prompt has been satisfactorily addressed, and if not, what follow-up questions are worth pursuing.
  • The Evaluator's final conclusions are dumped into the Outer Loop's context (without the source documents, to not overload the context window).
  • If the Evaluator did not choose to terminate, the next generation of Subagents is spawned, each prompted with whatever contexts are recommended by the Evaluator.
  • Iterate, spawning further Evaluator instances and Subproblem instances as needed.
  • Once the Evaluator chooses to terminate, or some compute upper-bound is reached, the Outer Loop instantiates a Summarizer, into which it dumps all of the Evaluator's analyses + all of the most important search results.
  • The Summarizer is prompted to generate a high-level report outline, then write out each subsection, then the subsections are patched together into a final report.

Here's what this complicated setup is supposed to achieve:

  • Do a pass over an actually large, diverse amount of sources. Most of such web-search tools (Google's, DeepSeek's, or this shallow replication) are basically only allowed to make one search query, and then they have to contend with whatever got dumped into their context window. If the first search query turns out poorly targeted, in a way that becomes obvious after looking through the results, the AI's out of luck.
  • Avoid falling into "rabbit holes", i. e. some arcane-and-useless subproblems the model becomes obsessed with. Subagents would be allowed to fall into them, but presumably Evaluator steps, with a bird's-eye view of the picture, would recognize that for a failure mode and not recommend the Outer Loop to follow it.
  • Attempt to patch together a "bird's eye view" on the entirety of the results given the limitations of the context window. Subagents and Evaluators would use their limited scope to figure out what results are most important, then provide summaries + recommendations based on what's visible from their limited vantage points. The Outer Loop and then the Summarizer instances, prompted with information-dense distillations of what's visible from each lower-level vantage point, should effectively be looking at (a decent approximation of) the full scope of the problem.

Has anything like this been implemented in the open-source community or one of the countless LLM-wrapper startups? I would expect so, since it seems like an obvious thing to try + the old AutoGPT scaffolds worked in a somewhat similar manner... But it's possible the market's totally failed to deliver.

It should be relatively easy to set up using Flowise plus e. g. Firecrawl. If nothing like this has been implemented, I'm putting a $500 bounty on it.
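For concreteness, here's a rough Python sketch of the loop described above — `call_llm` and `web_search` are placeholders for whatever model API and search tool you'd actually wire in (e. g. via Flowise/Firecrawl), and the prompts are illustrative only, not a tested implementation:

```python
# Sketch of the Outer Loop / Subagent / Evaluator / Summarizer scaffold described above.
from typing import List

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to the model and return its text response."""
    raise NotImplementedError

def web_search(query: str, k: int = 10) -> List[str]:
    """Placeholder: return the top-k result snippets/pages for `query`."""
    raise NotImplementedError

MAX_SUBAGENT_ITERATIONS = 3  # iteration cap per research direction
MAX_OUTER_ROUNDS = 4         # compute upper-bound for the Outer Loop

def run_subagent(direction: str) -> str:
    """One research direction: search, analyse, optionally hand off to a next-iteration self."""
    findings, question = "", direction
    for _ in range(MAX_SUBAGENT_ITERATIONS):
        results = web_search(call_llm(f"Write a search query for: {question}"))
        analysis = call_llm(
            "Here are search results:\n" + "\n".join(results) +
            f"\n\nKeep only the relevant ones, summarize findings on '{question}', "
            "and state ONE follow-up question, or say DONE."
        )
        findings = call_llm(f"Distill into dense notes:\n{findings}\n{analysis}")
        if "DONE" in analysis:
            break
        question = call_llm(f"Extract the follow-up question from:\n{analysis}")
    return findings

def deep_research(research_prompt: str) -> str:
    directions = call_llm(
        f"Decompose into parallel research directions, one per line:\n{research_prompt}"
    ).splitlines()
    evaluator_notes = ""
    for _ in range(MAX_OUTER_ROUNDS):
        findings = [run_subagent(d) for d in directions if d.strip()]
        evaluator_notes = call_llm(
            f"Research prompt: {research_prompt}\n\nSubagent findings:\n"
            + "\n---\n".join(findings)
            + "\n\nIntegrate these. Is the prompt satisfactorily addressed? "
              "If yes say TERMINATE, otherwise list follow-up directions, one per line."
        )
        if "TERMINATE" in evaluator_notes:
            break
        directions = [d for d in evaluator_notes.splitlines() if d.strip()]
    # Summarizer: outline first, then write sections, then patch them together.
    outline = call_llm(f"Write a report outline for: {research_prompt}\nNotes:\n{evaluator_notes}")
    sections = [call_llm(f"Write this section in full:\n{s}\nNotes:\n{evaluator_notes}")
                for s in outline.splitlines() if s.strip()]
    return "\n\n".join(sections)
```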

Buck

Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:

  • Siloing. Perhaps the company will prevent info flowing between different parts of the company. I hear that this already happens to some extent. If this happens, it’s way harder to have a safety team interact with other parts of the company (e.g. instead of doing auditing themselves, they’d have to get someone from all the different teams that are doing risky stuff to do the auditing).
  • Getting cancelled. Perhaps the company will learn “people who are concerned about misalignment risk constantly cause problems for us, we should avoid ever hiring them”. I think this is plausible.
  • Company-inside-the-company. Perhaps AI automation allows the company to work with just a tiny number of core people, and so the company ends up mostly just doing a secret ASI project with the knowledge of just a small trusted group. This might be sensible if the leadership is worried about leaks, or if they want to do an extremely aggressive power grab.
evhub

We use "alignment" as a relative term to refer to alignment with a particular operator/objective. The canonical source here is Paul's 'Clarifying “AI alignment”' from 2018.

evhub

I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

I guess other forums don't literally have a good faith defence, but in practice they mostly only ban people who deliberately refuse to follow the rules/advice they're told about, or personally insult others repeatedly.

I guess they have more bad moderators who ban people for personal/ideological reasons, and I'm actually impressed by LessWrong's moderators being less wrong in this regard.

I still think that being rate-limited and told, "I don't have a great model of how you can improve at that," is slightly specific to LessWrong.

Many other forums will say things very similar in spirit to

LessWrong is a pretty particular place. We strive to maintain a culture that's uncommon for web forums[1] and to stay true to our values. Recently, many more people have been finding their way here, so I (lead admin and moderator) put together this intro to what we're about.

My hope is that if LessWrong resonates with your values and interests, this guide will help you become a valued member of the community. And if LessWrong isn't the place for you, this guide will help you have a good "visit" or simply seek other pastures.

But these forums still implicitly only ban people acting in bad faith, while merely advising people acting in good faith. LessWrong's warning isn't strong enough to distinguish it from those forums.

My idea shouldn't hurt

If you don't want to see the invisible comments, then don't see them. In my opinion the only cost is software and bandwidth.

In the basketball practice example, if it was magically possible to let the lousy shots continue playing with each other at very low cost, almost every coach would allow it. They would only remove people who have bad faith.

Even long term users like Roko have complained about rate-limiting (automatic rate-limiting in his case).[1]

 

  1. ^

    Speaking of Roko, the reputational costs inflicted on the rationalist community by trying to censor his Basilisk idea were probably 3 orders of magnitude higher than the actual harm from his idea. But that's off topic.

I like it, it is worth a try because it could be very helpful if it works!

A possible objection is that "you can't mentor others on something you suck at yourself," and this would require an AGI capable of making valuable LessWrong comments itself, which may be similarly hard to automating AI research (considering the math/programming advantages of LLMs).

This objection doesn't doom your idea, because even if the AI is bad at writing valuable comments, and bad at judging valuable comments written by itself, it may be good at judging the failure modes where a human writes a bad comment. It could still work and is worth a try!

If we're using another instance of the same base model to do the monitoring and interpreting? It seems like that could partially preserve monitoring and interpreting. Another instance would "understand" the steganography it uses - but you might have to supply so much of the context that it would be almost the same instance - so likely to adopt the same goals and use the same deceptions, if any.

So that route does seem like dangerous territory. You'd rather not trust even a fresh instance of the same model to tell you what its agentic counterpart is "thinking" about.

I haven't gotten far in figuring out how well this might work, but it is a possibility. I'll say the little I have thought about in a soon-to-be-finished post.

I don't see how monitoring and interpretability could be unaffected.

(So I take this as modestly bad news — I wasn't totally sure orgs would use task-based end-to-end RL. I wasn't sure if agentic scaffolding might prove easier - see the other discussion thread here for questions about whether it might actually work just as well if someone bothered to implement it.)

Well, or as is often the case, the people arguing against changes are intentionally exploiting loopholes and don't want their valuable loopholes removed.

I don't like the idea. Here's an alternative I'd like to propose:

AI mentoring

After a user gets a post or comment rejected, have them be given the opportunity to rewrite and resubmit it with the help of an AI mentor. The AI mentor should be able to give reasonably accurate feedback, and won't accept the revision until it is clearly above a quality line.

I don't think this is currently easy to make (well), because I think it would be too hard to get current LLMs to be sufficiently accurate in LessWrong specific quality judgement and advice. If, at some point in the future, this became easy for the devs to add, I think it would be a good feature. Also, if an AI with this level of discernment were available, it could help the mods quite a bit in identifying edge cases and auto-resolving clear-cut cases.
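(If models did get good enough, a minimal version of the loop might look like the sketch below — `call_llm`, the rubric prompt, and the quality threshold are all hypothetical placeholders, not an actual LessWrong feature.)

```python
# Minimal sketch of the "AI mentor" resubmission loop described above.
# `call_llm` stands in for whatever model endpoint the site would actually use.
from typing import Optional

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the real model API

QUALITY_BAR = 7   # hypothetical 1-10 rubric threshold
MAX_ROUNDS = 5

def mentor_resubmission(rejected_text: str, rejection_reason: str) -> Optional[str]:
    """Let a user revise a rejected comment with model feedback until it clears the bar."""
    draft = rejected_text
    for _ in range(MAX_ROUNDS):
        review = call_llm(
            f"This comment was rejected for: {rejection_reason}\n\nDraft:\n{draft}\n\n"
            "Score it 1-10 against the site's commenting guidelines and give concrete feedback."
        )
        score = int(call_llm(f"Reply with just the numeric score from this review:\n{review}"))
        if score >= QUALITY_BAR:
            return draft  # clearly above the line: release to the normal (human) mod queue
        print(review)                        # show the mentor's feedback to the user
        draft = input("Revised draft:\n")    # user rewrites with that feedback in hand
    return None  # still below the bar after several rounds; don't resubmit
```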

I honestly think your experiment made me more temporarily confused than an informal argument would have, but this was still pretty interesting by the end, so thanks.

Few people who take radical veganism and left-anarchism seriously either ever kill anyone, or are as weird as the Zizians, so that can't be the primary explanation. Unless you set a bar for 'take seriously' that almost only they pass, but then, it seems relevant that (a) their actions have been grossly imprudent and predictably ineffective by any normal standard + (b) the charitable[1] explanations I've seen offered for why they'd do imprudent and ineffective things all involve their esoteric beliefs.

I do think 'they take [uncommon, but not esoteric, moral views like veganism and anarchism] seriously' shouldn't be underrated as a factor, and modeling them without putting weight on it is wrong.

  1. ^

    to their rationality, not necessarily their ethics

How on earth could monitorability and interpretability NOT tank when they think in neuralese? Surely changing from English to some gibberish highdimensional vector language makes things harder, even if not impossible?

+4. I most like the dichotomy of "stick to object level" vs "full contact psychoanalysis." And I think the paragraphs towards the end are important:

The reason I don't think it's useful to talk about "bad faith" is because the ontology of good vs. bad faith isn't a great fit to either discourse strategy.

If I'm sticking to the object level, it's irrelevant: I reply to what's in the text; my suspicions about the process generating the text are out of scope.

If I'm doing full-contact psychoanalysis, the problem with "I don't think you're here in good faith" is that it's insufficiently specific. Rather than accusing someone of generic "bad faith", the way to move the discussion forward is by positing that one's interlocutor has some specific motive that hasn't yet been made explicit—and the way to defend oneself against such an accusation is by making the case that one's real agenda isn't the one being proposed, rather than protesting one's "good faith" and implausibly claiming not to have an agenda.

I think maybe the title of this post is misleading, and sort of clickbaity/embroiled in a particular conflict that is, ironically, a distraction. 

The whole post (admittedly from the very first sentence) is actually about avoiding the frame of bad faith (even if the post argues that basically everyone is in bad faith most of the time). I think it's useful if post titles convey more of a pointer to the core idea of the post, which gives it a shorthand that's more likely to be remembered/used when the time is right. ("Taboo bad faith"? "The object level, vs full-contact psychoanalysis"? Dunno. Both of those feel like they lose some nuance I think Zack cares about. But, I think there's some kind of improvement here)

...

That said: I think the post is missing something that lc's comment sort of hints at (although I don't think lc meant to imply my takeaway)

Some people in the comments reply to it that other people self-deceive, yes, but you should assume good faith. I say - why not assume the truth, and then do what's prosocial anyways?

I think the post does give a couple concrete strategies for how to navigate the, er, conflict between conflict and truthseeking, which are one flavor of "prosocial." I think it's missing something about what "assume good faith" is for, which isn't really covered in the post. 

The problem "assume good faith" is trying to solve is "there are a bunch of human tendencies to get into escalation spirals of distrust, and make a bigger deal about the mistakes of people you're in conflict with." You don't have to fix this with false beliefs, you can fix this by shifting your relational stance towards the person and the discussion, and holding the possibility more alive that you've misinterpreted them or are rounding them off to and incorrect guess as to what their agenda is. 

I think the suggestions this post makes on how to deal with that are reasonable, but incomplete, and I think people benefit from also having some tools that more directly engage with "your impulse to naively assume bad faith is part of a spiral-y pattern that you may want to step out of somehow." 

Thanks for this feedback, this was exactly the sort of response I was hoping for!

You say you disagree where identity comes from, but then I can't tell where the disagreement is? Reading what you wrote, I just kept nodding along being like 'yep yep exactly.' I guess the disagreement is about whether the identity comes from the RL part (step 3) vs. the instruction training (step 2); I think this is maybe a merely verbal dispute though? Like, I don't think there's a difference in kind between 'imposing a helpful assistant from x constraint' and 'forming a single personality,' it's just a difference of degree.

Miscommunication. I highlight-reacted your text "It doesn't even mention pedestrians" as the claim I'd be happy to bet on. Since you replied, I double-checked the Internet Archive snapshot from 2024-09-05. It also includes the text about children in a school drop-off zone under rule 4 (accessible via page source).

I read the later discussion and noticed that you still claimed "the rules don't mention pedestrians", so I figured you never noticed the text I quoted. Since you were so passionate about "obvious falsehoods" I wanted to bring it to your attention.

I am updating down on the usefulness of highlight-reacts vs whole-comment reacts. It's a shame because I like their expressive power. In my browser the highlight-react doesn't seem to be giving the correct hover effect - it's not highlighting the text - so perhaps this contributed to the miscommunication. It sometimes works, so perhaps something about overlapping highlights is causing a bug?

Worth taking model wrapper products into account.

For example:

I think the correct way to address this is by also testing the other models with agent scaffolds that supply web search and a python interpreter.

I think it's wrong to jump to the conclusion that non-agent-finetuned models can't benefit from tools.


See for example:

Frontier Math result

https://x.com/Justin_Halford_/status/1885547672108511281

o3-mini got 32% on Frontier Math (!) when given access to use a Python tool. In an AMA, @kevinweil / @snsf (OAI) both referenced tool use w reasoning models incl retrieval (!) as a future rollout.

METR RE-bench

Models are tested with agent scaffolds

AIDE and Modular refer to different agent scaffolds; Modular is a very simple baseline scaffolding that just lets the model repeatedly run code and see the results; AIDE is a more sophisticated scaffold that implements a tree search procedure.

Good work, thanks for doing this.

For future work, you might consider looking into inference suppliers like Hyperdimensional for DeepSeek models.

FWIW in my mind I was comparing this to things like Glen Weyl's Why I Am Not a Technocrat, and thought this was much better. (Related: Scott Alexander's response, Weyl's counter-response).

Well, I upvoted your comment, which I think adds important nuance. I will also edit my shortform to explicitly say to check your comment. Hopefully, the combination of the two is not too misleading. Please add more thoughts as they occur to you about how better to frame this.

Yeah it’s super-misleading that the post says:

Look at other unsolved problems:
- Goldbach: Can every even number of 1s be split into two prime clusters of 1s?
- Twin Primes: Are there infinite pairs of prime clusters of 1s separated by two 1s?
- Riemann: How are the prime clusters of 1s distributed?

For centuries, they resist. Why?

I think it would be much clearer to everyone if the OP said

Look at other unsolved problems:
- Goldbach: Can every even number of 1s be split into two prime clusters of 1s?
- Twin Primes: Are there infinite pairs of prime clusters of 1s separated by two 1s?
- Riemann: How are the prime clusters of 1s distributed?
- The claim that one odd number plus another odd number is always an even number: When we squash together two odd groups of 1s, do we get an even group of 1s?
- The claim that √2 is irrational: Can 1s be divided by 1s, and squared, to get 1+1?

For centuries, they resist. Why?

I request that Alister Munday please make that change. It would save readers a lot of time and confusion … because the readers would immediately know not to waste their time reading on …

Good questions. I don't have much of a guess about whether this is discernably "smarter" than Claude or Gemini would be in how it understands and integrates sources.

If anyone is game for creating an agentic research scaffold like that Thane describes, I'd love to help design it and/or to know about the results.

I very much agree with that limitation on Google's deep research. It only accepts a short request for the report, and it doesn't seem like it can (at least easily) get much more in-depth than the default short gloss. But that doesn't mean the model isn't capable of it.

Along those lines, Notebook LM has similar limitations on its summary podcasts, and I did tinker with that enough to get some more satisfying results. Using keywords like "for an expert audience already familiar with all of the terms and concepts in these sources" and "technical details" did definitely bend it in a more useful direction. There I felt I ran into limitations on the core intelligence/expertise of the system; it wasn't going to get important but fine distinctions in alignment research. Hearing its summaries was sort of helpful in that new phrasings and strange ideas (when prompted to "think creatively") can be a useful new-thought-generation aid.

We shall see whether o3 has more raw reasoning ability that it can apply to really doing expert-level research.

I don't think it's an outright meaningless comparison, but I think it's bad enough that it feels misleading or net-negative-for-discourse to describe it the way your comment did. Not sure how to unpack that feeling further.

Yeah, I just found a cerebras post which claims 2100 serial tokens/sec.

links 02/03/25:

  • Jasmine Sun on the Tech Right
    • https://jasmi.news/p/tech-right
    • https://jasmi.news/p/arjun-ramani
    • I like this. she's not a theorizer! she's just Some Guy, actually expressing her thoughts on current events. do I agree with everything she says? maybe not.
    • But it's normal-ass blogging rather than inhibited silence or sloppy thoughts packaged as a Grand Narrative, and I think that's healthy. we need normal-ass blogging.
      • [[Scott Alexander]] is a normal-ass blogger who kept up a regular schedule and has a gift for puns and a fairly high appetite for books and research papers.
      • like, that's all it is, it's being yourself in public, consistently year over year, while having a healthy (but not extraordinary!) degree of interest in the world around you.
    • https://www.programmablemutter.com/p/why-did-silicon-valley-turn-right this is cited in her piece; i'm also not sure what i think of this, might be a piece of the puzzle
  • https://meetmeoffline.com/ nice, Shreeda Segan's dating app is live
  • https://lambda.chat/ nice hosting platform for multiple models including the DeepSeek ones.
    • note that the DeepSeek phone app is reputed to have a keylogger and location tracker; web apps are preferable.
  • https://www.betonit.ai/p/trust_and_diverhtml Bryan Caplan close-reads the famous Robert Putnam paper that purportedly shows that "more ethnic diversity leads to lower social trust."
    • It doesn't show that; it shows that black, Latino, and Asian people have lower "trust" than white people.
    • also correlated with lower trust: poor neighborhoods; high-crime neighborhoods; high-density neighborhoods; neighborhoods where most people moved in recently; neighborhoods with lots of renters; neighborhoods with few US citizens
    • also correlated with higher trust: individual income, bachelor's degree, homeownership
    • this just reduces to socioeconomic status. there's not a separate thing going on here about diversity.
      • obviously you can increase the average of many quality-of-life metrics in a community by restricting it to people of higher socioeconomic status. but then those same metrics would (mathematically) decline outside the elite community.
      • this does not support claims like "everybody on average would be better off under residential segregation".
  • https://marginalrevolution.com/marginalrevolution/2025/02/sundry-observations-on-the-trump-tariffs.html  Tariffs harm economies, news at 11.
  • https://www.complexsystemspodcast.com/episodes/the-landmines-buried-in-the-fine-print-of-chicagos-new-casino-deal/ Patrick McKenzie has details (lots of details) about the new casino being built in Chicago.
    • in order to meet requirements to be 25% woman-and-minority owned, they went around to black churches to offer an extremely misleading "investment deal" to working-class people who cannot afford it and won't understand the fine print.
      • the financial structure includes saddling these "investors" with a surprise enormous tax bill that only kicks in years after purchase.
    • of course, anybody can get in on special "women and minorities only" financial opportunities, even if they're a white man; set up a shell company "owned" by a woman and/or minority, who is your wife, or an associate of yours willing to serve as your front. this happens ALL THE TIME.
  • https://en.m.wikipedia.org/wiki/Advanced_Research_and_Invention_Agency ARIA was founded in 2021.
    • A friend asked me a good question: what was the UK's DARPA-equivalent before?
      • UKRI https://en.m.wikipedia.org/wiki/UK_Research_and_Innovation was the older UK science-and-tech funding org, which ARIA is unaffiliated with; it brought together nine older funding bodies, most of which were founded in the 21st century, and none of which could plausibly be the bodies that funded the bulk of UK nationally-funded (non-medical) science and engineering through the mid-20th-century.
      • so...what was the UK's DARPA, or for that matter its NSF, in 1945-1990?
  • https://trevorklee.substack.com/p/on-the-responsibility-of-size this is a nice, slightly oblique thought by Trevor Klee. do I agree? maybe. i don't expect to form considered opinions on All This until years later, if at all.
  • things I learned while reading about Venice:
    • https://en.m.wikipedia.org/wiki/Sigismondo_Pandolfo_Malatesta amazing guy. mercenary. murdered two out of three wives. first person that the Pope explicitly "canonized into Hell." patron of Piero della Francesca and Leon Battista Alberti. rehabilitated in literature several times, including by Ezra Pound, which figures.
  • https://www.palladiummag.com/2025/01/31/the-failed-strategy-of-artificial-intelligence-doomers/ critical take on the ineffectiveness of "AI safety" as a political strategy.
    • if you successfully convince the world that AI is potentially very powerful (and dangerous), this does not make people go "ok let's not build AI then", it makes people think "i want to be powerful and dangerous!!!"
    • i'm not on board with everything in this article but i think i largely agree.
    • when you're an idealistic nerd who despises playing politics and isn't very good at it, it will probably end badly if you dive enthusiastically into politics.
      • "so how can you do anything helpful at all?"
        • provide true information. don't optimize too aggressively for mass appeal.
          • this doesn't mean actively hide or be deliberately cryptic -- i think that's often going too far. i think clarity and open communication are, where possible, good practices.
          • but maybe don't make it your full-time job to strategically maximize the number of humans who believe a given thing.
        • focus your efforts on goals that you're quite confident will be beneficial and that can be done without coercion. beware of making your job about "what the government should do".
        • do not develop an identity around being an expert at strategic adversarial thinking.
          • you may be a literal chessmaster (like, at the game of chess);
          • you may be very skilled at things that would be useful to a modern military;
          • but if you are, in the colloquial sense, "kind of aspie", you are not an expert in detecting when somebody's about to screw you over. do not be over-eager to swim in shark-infested waters.

             

https://artificialanalysis.ai/leaderboards/providers claims that Cerebras achieves that OOM performance, for a single prompt, for 70B-parameter models. So nothing as smart as R1 is currently that fast, but some smart things come close.

Yeah, of course. Just trying to get some kind of rough idea at what point future systems will be starting from.

I don't see how it's possible to make a useful comparison this way; human and LLM ability profiles, and just the nature of what they're doing, are too different. An LLM can one-shot tasks that a human would need non-typing time to think about, so in that sense this underestimates the difference, but on a task that's easy for a human but the LLM can only do with a long chain of thought, it overestimates the difference.

Put differently: the things that LLMs can do with one shot and no CoT imply that they can do a whole lot of cognitive work in a single forward pass, probably a lot more than a human can ever do in the time it takes to type one word. But that cognitive work doesn't compound like a human's; it has to pass through the bottleneck of a single token, and be substantially repeated on each future token (at least without modifications like Coconut).

(Edit: The last sentence isn't quite right — KV caching means the work doesn't have to all be recomputed, though I would still say it doesn't compound.)

I assume that what's going on here is something like,
"This was low-hanging fruit, it was just a matter of time until someone did the corresponding test."

This would imply that OpenAI's work here isn't impressive, and also, that previous LLMs might have essentially been underestimated. There's basically a cheap latent capabilities gap.

I imagine a lot of software engineers / entrepreneurs aren't too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain. 

So some people could look at this and say, "sure, this test is to be expected", and others would be impressed by what LLMs + simple tools are capable of. 

Oops, bamboozled. Thanks, I'll look into it more and edit accordingly.

Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight.

It’s fine to talk about that scenario, and I hope it comes to pass! But in addition to the question of what those wise people should do, if they exist, we should also be concerned about the possibility that the people with instruction-following AIs will not be spreading wisdom and foresight in the first place.

I don't think that whoever is using these AI powers (let's call him Alex) needs to be that wise (beyond the wisdom of an average person who could get their hands on a powerful AI, which is probably higher than average).

Alex doesn't need to come up with @Noosphere89's proposed solution of persuasion campaigns all by himself. Alex merely needs to ask his AI what are the best solutions for preventing existential risks. If Noosphere's proposal is indeed wise, then AI would suggest it. Alex could then implement this solution.

Alex doesn't necessarily need to want to spread wisdom and foresight in this scheme. He merely needs to want to prevent existential risks.

That violates assumption one (a single pass cannot produce super intelligent output).

This is overall output throughput not latency (which would be output tokens per second for a single context).

a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second. This throughput

This just claims that you can run a bunch of parallel instances of R1.
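To make the throughput-vs-latency distinction concrete (the batch size below is made up purely for illustration):

```python
# Aggregate throughput vs. per-stream speed: the headline number is summed over
# all concurrent requests, not what any single context sees.
aggregate_tokens_per_s = 3872    # the quoted figure for the 8x H200 server
concurrent_requests = 32         # hypothetical batch size, purely illustrative

per_stream_tokens_per_s = aggregate_tokens_per_s / concurrent_requests
print(per_stream_tokens_per_s)   # ≈ 121 tok/s per context under this assumption
```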

You ask a superintelligent LLM to design a drug to cure a particular disease. It outputs just a few tokens with the drug formula. How do you use a previous-gen LLM to check whether the drug will have some nasty humanity-killing side-effects years down the road?

 

Edited to add: the point is that even with a few tokens, you might still have a huge inferential distance that nothing with less intelligence (including humanity) could bridge.

I feel like there are some critical metrics or factors here that are getting overlooked in the details. 

I agree with your assessment that it's very likely that many people will lose power. I think it's fairly likely that most humans won't be able to provide much economic value at some point, and won't be able to ask for many resources in response. So I could see an argument for incredibly high levels of inequality.

However, there is a key question in that case, of "could the people who own the most resources guide AIs using those resources to do what they want, or will these people lose power as well?"

I don't see a strong reason why these people would lose power or control. That would seem like a fundamental AI alignment issue - in a world where a small group of people own all the world's resources, and there's strong AI, can those people control their AIs in ways that would provide this group a positive outcome?
 

2. There are effectively two ways these systems maintain their alignment: through explicit human actions (like voting and consumer choice), and implicitly through their reliance on human labor and cognition. The significance of the implicit alignment can be hard to recognize because we have never seen its absence.

3. If these systems become less reliant on human labor and cognition, that would also decrease the extent to which humans could explicitly or implicitly align them. As a result, these systems—and the outcomes they produce—might drift further from providing what humans want. 

There seems to be a key assumption here that people are able to maintain control because of the fact that their labor and cognition is important. 

I think this makes sense for people who need to work for money, but not for those who are rich.

Our world has a long history of dumb rich people who provide neither labor nor cognition, and still seem to do pretty fine. I'd argue that power often matters more than human output, and would expect the importance of power to increase over time. 

I think that many rich people now are able to maintain a lot of control, with very little labor/cognition. They have been able to decently align other humans to do things for them. 

I really wonder how much of the perceived performance improvement comes from agent-y training, as opposed to not sabotaging the format of the model's answer + letting it do multi-turn search.

Compared to most other LLMs, Deep Research is able to generate reports up to 10k words, which is very helpful for providing comprehensive-feeling summaries. And unlike Google's analogue, it's not been fine-tuned to follow some awful deliberately shallow report template[1].

In other words: I wonder how much of Deep Research's capability can be replicated by putting Claude 3.5.1 in some basic agency scaffold where it's able to do multi-turn web search, and is then structurally forced to generate a long-form response (say, by always prompting with something like "now generate a high-level report outline", then with "now write section 1", "now write section 2", and then just patching those responses together).

Have there been some open-source projects that already did so? If not, anyone willing to try? This would provide a "baseline" for estimating how well OpenAI's agency training actually works, and how much of the improvement comes from it.

Subjectively, its use as a research tool is limited - it found only 18 sources in a five-minute search

Another test to run: if you give those sources to Sonnet 3.5.1 (or a Gemini model, if Sonnet's context window is too small) and ask it to provide a summary, how far from Deep Research's does it feel in quality/insight?

  1. ^

    Which is the vibe I got from Google's Deep Research when I'd been experimenting. I think its main issue isn't any lack of capability, but that the fine-tuning dataset for how the final reports are supposed to look had been bad: it's been taught to have an atrocious aesthetic taste.

If you wanted to have an unaligned LLM that doesn't abuse humans, couldn't you just never sample from it after training it to be unaligned?

Great point. The engineers setting this whole thing up would need to build the tooling to hide the relevant info from the models and from the viewers.

Mhm, I do think that sometimes happens and I wish more of those places would say "The rule is the moderator shall do whatever they think reasonable." That's basically my moderation rule for like, my dinner parties, or the ~30 person discord I mostly use to advertise D&D games.

But uh, I also suspect "The moderators shall do whatever they want" (and the insinuation that the moderators are capricious and tyrannical) is a common criticism leveled when clarity is sacrificed and the user disagrees with a moderation call. 

Imagine a forum with two rules. "1. Don't say false things, 2. don't be a jerk." It would not surprise me at all to hear Bob the user saying that he was being perfectly reasonable and accurate, the other user Carla was lying and being a jerk, and the mod just did whatever they wanted and banned Bob. Maybe the rule was secretly "The moderators shall do whatever they want." But maybe the rule wasn't clear, the moderator made a judgement call, and the correct tradeoff is happening. It's really, really hard to legislate clear rules against being a jerk. Even the 'false things' line has a surprising amount of edge cases!

I'm not happy about this but it seems basically priced in, so not much update on p(doom).

We will soon have Bayesian updates to make. If we observe that incentives created during end-to-end RL naturally produce goal guarding and other dangerous cognitive properties, it will be bad news. If we observe this doesn't happen, it will be good news (although not very good news because web research seems like it doesn't require the full range of agency).

Likewise, if we observe models' monitorability and interpretability start to tank as they think in neuralese, it will be bad news. If monitoring and interpretability are unaffected, good news.

Interesting times.

Yeah, just went through this whole same line of evasion. Alright, the Collatz conjecture will never be "proved" in this restrictive sense—and neither will the Steve conjecture or the irrationality of √2—do we care? It may still be proved according to the ordinary meaning.

How much of their original capital did the French nobility retain at the end of the French revolution?

How much capital (value of territorial extent) do chimpanzees retain now as compared to 20k years ago?

I haven't thought deeply about this specific case, but I think you should consider this like any other ablation study--like, what happens if you replace the SAE with a linear probe?
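For instance, a minimal version of the linear-probe ablation (assuming you already have an array of activations and binary labels for whatever concept the SAE pipeline is being used to detect - both hypothetical here) is just a few lines:

```python
# Minimal sketch of the linear-probe ablation: fit a logistic-regression probe on
# the same activations the SAE reads, and compare held-out accuracy against the
# SAE-based pipeline. `activations` (n_samples x d_model) and `labels` are
# assumed to exist already.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(activations: np.ndarray, labels: np.ndarray) -> float:
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)
```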

Eh, I think unclear rules and high standards are fine for some purposes. Take a fiction magazine. Good ones have a high standard for what they publish, and (apart from some formatting and wordcount rules) the main rule is it has to fit the editor's taste. The same is true for scientific publications.

I understand the motivation behind this, but there is little warning that this is how the forum works.

I mildly disagree with this. The New Users Guide says 

LessWrong is a pretty particular place. We strive to maintain a culture that's uncommon for web forums[1] and to stay true to our values. Recently, many more people have been finding their way here, so I (lead admin and moderator) put together this intro to what we're about.

My hope is that if LessWrong resonates with your values and interests, this guide will help you become a valued member of community. And if LessWrong isn't the place for you, this guide will help you have a good "visit" or simply seek other pastures.

On the margin, is there room for improvement? Seems likely, but doesn't seem bad. If I was in charge I'd be tempted to open the New Users Guide with like, four bullet points that said 'This place is for aspiring rationalists, don't say false things, don't be a jerk, for examples of what we mean by that read on.' That's somewhat stylistic though.

There is no warning that trying to contribute in good faith isn't sufficient

Wait, now I'm confused. Most forums I'm aware of don't have much of a Good Faith defense. I looked up the rules for the first one I thought of, Giant In The Playground, and while it's leaning a bit more Comprehensive and Clear I don't see a place where it says if you break a rule in good faith you're fine. 

In general, someone trying to contribute to a thing but doing so badly doesn't get that much of a pass? Like, I've been politely ejected from a singing group before because I was badly off-key. Nobody doubted I was trying to sing well! It doesn't change the fact that the group wanted to have everyone singing the right notes. 

I suggest that instead of making rate-limited users (who used up their rate) unable to comment at all, their additional comments should be invisible, but still visible to other rate-limited users (and users who choose to see them).

Meh. The internet is big. If the kind of thing that got someone rate-limited on LessWrong got them rate-limited or banned everywhere else, I'd be supportive of having somewhere they were allowed to post. Reddit's right over there, you know?

I think specially surfacing rate-limited users' comments to other rate-limited users is straightforwardly a bad idea. If someone got rate-limited, in general I assume it's because they were writing in ways the mods and/or other users thought they shouldn't do. If someone is going to stick around, I want their attention on people doing well, not doing badly. Imagine a basketball practice; if I'm a lousy shot, the coach might tell me to sit out the drill and watch a couple of the good players for a few minutes. If I'm really bad, I get cut from the team. No coach is going to say, "hey, you're a lousy shot, so pay special attention to these other players who are just as bad as you."

A big component of this is I tend to think of LessWrong as a place I go to get better at a kind of mental skill, hence analogies to choir or basketball practice. You may have other goals here.

I think my quick guess is that what's going on is something like:
- People generally have a ton of content to potentially consume and limited time, and are thus really picky. 
- Researchers often have unique models and a lot of specific nuances they care about.
- Most research of this type is really bad. Tons of people on Twitter now seem to have some big-picture theory of what AI will do to civilization.
- Researchers also have the curse of knowledge, and think their work is simpler than it is.

So basically, people aren't flinching because of bizarre and specific epistemic limitations. It's more like,
> "this seems complicated, learning it would take effort, my prior is that this is fairly useless anyway, so I'll be very quick to dismiss this."

My quick impression is that this is a brutal and highly significant limitation of this kind of research. It's just incredibly expensive for others to read and evaluate, so it's very common for it to get ignored. (Learned in part from myself trying to put a lot of similar work out there, then seeing it get ignored)

Related to this -
I'd predict that if you improved the arguments by 50%, it would lead to little extra uptake. But if you got someone really prestigious to highly recommend it, then suddenly a bunch of people would be much more convinced. 

And all of this is livestreamed on Twitch

Also, each agent has a bank account which they can receive donations/transfers into (I think Twitch makes this easy?) and from which they can e.g. send donations to GiveDirectly if that's what they want to do.

One minor implementation wrinkle for anyone implementing this is that "move money from a bank account to a recipient by using text fields found on the web" usually involves writing your payment info into those text fields in a way that would be visible when streaming your screen. I'm not sure any of the popular agent frameworks have good tooling for including specific sensitive information in the context only while it's directly relevant to the model's task, with hook points for when specific pieces of information enter and leave the context - I don't see any such thing in e.g. the Aider docs. Without such tooling, I think using payment info in a way that won't immediately be stolen by stream viewers would be a bit challenging.
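The kind of tooling I'd want here is something like a redaction hook: the model context and the stream only ever see placeholder tokens, and the real values get substituted in at the moment a tool action executes. A rough sketch (all names hypothetical, not from any existing framework):

```python
# Rough sketch of the "sensitive info hook" idea. The model context and the
# streamed/logged text only ever contain placeholders like {{CARD_NUMBER}};
# real values are substituted just before the browser/tool action runs.
# All names here are hypothetical.

import re

SECRETS = {
    "CARD_NUMBER": "4111111111111111",  # example value; never shown on stream
    "CARD_CVC": "123",
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def redact(text: str) -> str:
    """Replace any literal secret values with placeholders before logging/streaming."""
    for name, value in SECRETS.items():
        text = text.replace(value, "{{" + name + "}}")
    return text

def resolve(text: str) -> str:
    """Substitute real secret values only at tool-execution time."""
    return PLACEHOLDER.sub(lambda m: SECRETS.get(m.group(1), m.group(0)), text)

def execute_fill_action(field_text: str, fill_field):
    print("[stream]", redact(field_text))  # what viewers see
    fill_field(resolve(field_text))        # what the browser tool receives
```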

Yes, the technique of formal proofs, in effect, involves translation of high-level proofs into arithmetic.

So self-reference is fully present (that's why we have Gödel's results and other similar results).

What this implies, in particular, is that one can reduce a "real proof" to the arithmetic; this would be ugly, and one should not do it in one's informal mathematical practice; but your post is not talking about pragmatics, you are referencing "fundamental limit of self-reference".

And, certainly, there are some interesting fundamental limits of self-reference (that's why we have algorithmically undecidable problems and such). But this is different from issues of pragmatic math techniques.

What high-level abstraction buys us is a lot of structure and intuition. The constraints related to staying within arithmetic are pragmatic, and not fundamental (without high-level abstractions one loses some very powerful ways to structure things and to guide our intuition, and things stop being comprehensible to a human mind).

Your post purports to conclude: “That's why [the Collatz conjecture] will never be solved”.

Do you think it would also be correct to say: “That's why [the Steve conjecture] will never be solved”?

If yes, then I think you’re using the word “solved” in an extremely strange and misleading way.

If no, then you evidently messed up, because your argument does not rely on any property of the Collatz conjecture that is not equally true of the Steve conjecture.

make fewer points, selected carefully to be bulletproof, understandable to non-experts, and important to the overall thesis

That conflicts with eg:

If you replied with this, I would have said something like "then what's wrong with the designs for diamond mechanosynthesis tooltips, which don't resemble enzymes

Anyway, I already answered that in 9. diamond.

I could imagine something vaguely sorta like this being true but that isn't like, something I'd confidently predict is a common sort of altered mental state to fall into, having been in altered states somewhere around that cluster.

I'd suspect that like, maybe there's a component where they intuitively overestimate the dependence relative to other people, but probably it involves deliberate decisions to try to see things a certain way and stuff like that. (Though actually I have no idea what "strength of subjunctive dependence" really means, I think there are unsolved philosophical problems there.) 

Perhaps I'm misusing the word "representable"? But what I meant was that any single sequence of actions generated by the agent could also have been generated by an outcome-utility maximizer (that has the same world model). This seems like the relevant definition, right?

"all" humans?

 
The vast majority of actual humans are already dead.  The overwhelming majority of currently-living humans should expect 95%+ chance they'll die in under a century.  

If immortality is solved, it will only apply to "that distorted thing those humans turn into".   Note that this is something the stereotypical Victorian would understand completely - there may be biological similarities with today's humans, but they're culturally a different species.

This isn't a solution to aligned LLMs being abused by humans, but to unaligned LLMs abusing humans.

That's not right

Are you saying that my description (following) is incorrect? 

[incomplete preferences w/ caprice] would be equivalent to 1. choosing the best policy by ranking them in the partial order of outcomes (randomizing over multiple maxima), then 2. implementing that policy without further consideration.

Or are you saying that it is correct, but you disagree that this implies that it is "behaviorally indistinguishable from an agent with complete preferences"? If this is the case, then I think we might disagree on the definition of "behaviorally indistinguishable"? I'm using it like: if you observe a single sequence of actions from this agent (and know the agent's world model), can you construct a utility function over outcomes that could have produced that sequence?

Or consider another example. The agent trades A for B, then B for A, then declines to trade A for B+. That's compatible with the Caprice rule, but not with complete preferences.

This is compatible with a resolute outcome-utility maximizer (for whom A is a maximum). There's no rule that says an agent must take the shortest route to the same outcome (right?).


As Gustafsson notes, if an agent uses resolute choice to avoid the money pump for cyclic preferences, that agent has to choose against their strict preferences at some point.
...
There's no such drawback for agents with incomplete preferences using resolute choice.

Sure, but why is that a drawback? It can't be money pumped, right? Agents following resolute choice often choose against their local strict preferences in other decision problems. (E.g. Newcomb's). And this is considered an argument in favour of resolute choice.

The theorem prover point is interesting but misses the key distinction:

  1. Yes, theorem provers operate at a mechanical/syntactic level
  2. But they still encode and use higher mathematical concepts
  3. The proofs they verify still require going above pure arithmetic
  4. The mechanical verification is different from staying within arithmetic concepts

In pure arithmetic we can only:

  • Do basic operations (+,-,×,÷)
  • Check specific cases
  • Compute results

Any examples of real conjectures proven while staying at this basic level?

The pilot episode of NUMB3RS.

The tie-in to rationality is that instead of coming up with a hypothesis about the culprit, the hero comes up with algorithms for finding the culprit, and quantifies how well they would have worked applied to past cases.

It's really a TV episode about computational statistical inference, rather than a movie about rationality, but it's good media for cognitive algorithm appreciators.

I recently created a simple workflow to allow people to write to the Attorneys General of California and Delaware to share thoughts + encourage scrutiny of the upcoming OpenAI nonprofit conversion attempt. 

I think this might be a high-leverage opportunity for outreach. Both AG offices have already begun investigations, and Attorneys General are elected officials who are primarily tasked with protecting the public interest, so they should care what the public thinks and prioritizes. Unlike e.g. congresspeople, I don't think AGs often receive grassroots outreach (I found ~0 examples of this in the past), and an influx of polite and thoughtful letters may have some influence — especially from CA and DE residents, although I think anyone impacted by their decision should feel comfortable contacting them.

Personally I don't expect the conversion to be blocked, but I do think the value and nature of the eventual deal might be significantly influenced by the degree of scrutiny on the transaction.

Please consider writing a short letter — even a few sentences is fine. Our partner handles the actual delivery, so all you need to do is submit the form. If you want to write one on your own and can't find contact info, feel free to dm me.

I understand the motivation behind this, but there is little warning that this is how the forum works.

I feel like the New User Guide, which gets DM'd to every user, does get it across pretty well that we have high and particular standards: 

LessWrong is a pretty particular place. We strive to maintain a culture that's uncommon for web forums[1] and to stay true to our values. Recently, many more people have been finding their way here, so I (lead admin and moderator) put together this intro to what we're about.

My hope is that if LessWrong resonates with your values and interests, this guide will help you become a valued member of community. And if LessWrong isn't the place for you, this guide will help you have a good "visit" or simply seek other pastures.

It might be useful to have somewhere a chart of legal names and chosen names so that those of us not on rationalist twitter can keep track. Ivory is the Max who has been charged with Mr. Lind's murder?

Alright, so Collatz will be proved, and the proof will not be done by "staying in arithmetic". Just as the proof that there do not exist numbers p and q that satisfy the equation p² = 2 q² (or equivalently, that all numbers do not satisfy it) is not done by "staying in arithmetic". It doesn't matter.

How would such a validator react if you tried to hack the LLM by threatening to kill all humans unless it complies?

Another source of confusion is that often, the stated rules presented as concise or comprehensive are just lies not meant to be taken seriously, and the only real rule is "The moderators shall do whatever they want."

John Nash, Bobby Fischer, Georg Cantor, Kurt Gödel.

Schizophrenia can already happen with the current complexity of a brain. On the abstract level it is the same: a set of destructive ideas steals resources from other neuremes, leading to the death of the personality. In our case memes do not have enough mechanisms to destroy the brain directly, but in a simulated environment where things can be straight up deallocated (free()), damage will be greater, and faster.

[Nice to see you, didn't expect to find a familiar name after quitting Manifold Markets]

https://www.lesswrong.com/posts/oXHcPTp7K4WgAXNAe/emrik-quicksays?commentId=yoYAadTcsXmXuAmqk

Actually, Collatz DOES require working under these constraints if staying in arithmetic. The conjecture itself needs universal quantification ("for ALL numbers...") to even state it. In pure arithmetic we can only verify specific cases: "4 goes to 2", "5 goes to 16 to 8 to 4 to 2", etc.

I don’t understand what “go above the arithmetic level” means.

But here’s another way that I can restate my original complaint.

Collatz Conjecture:

  • If your number is odd, triple it and add one.
  • If your number is even, divide by two.
  • …Prove or disprove: if you start at a positive integer, then you'll eventually wind up at 1.

Steve Conjecture:

  • If your number is odd, add one.
  • If your number is even, divide by two.
  • …Prove or disprove: if you start at a positive integer, then you’ll eventually wind up at 1.

Steve Conjecture is true and easy to prove, right?

But your argument that “That's why it will never be solved” applies to the Steve Conjecture just as much as it applies to the Collatz Conjecture, because your argument does not mention any specific aspects of the Collatz Conjecture that are not also true of the Steve Conjecture. You never talk about the factor of 3, you never talk about proof by induction, you never talk about anything that would distinguish Collatz Conjecture from Steve Conjecture. Therefore, your argument is invalid, because it applies equally well to things that do in fact have proofs.
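To make the comparison concrete, here's a tiny sketch of the two rules side by side; checking specific cases like this is exactly the "pure arithmetic" activity you describe, and the code differs only in the odd branch:

```python
# Tiny sketch of the two rules side by side. Checking specific cases like this
# is exactly the "pure arithmetic" activity under discussion; nothing in the
# "will never be solved" argument distinguishes the two step functions.

def collatz_step(n: int) -> int:
    return 3 * n + 1 if n % 2 else n // 2

def steve_step(n: int) -> int:
    return n + 1 if n % 2 else n // 2

def reaches_one(step, n: int, max_iters: int = 10_000) -> bool:
    for _ in range(max_iters):
        if n == 1:
            return True
        n = step(n)
    return False  # gave up; this says nothing about the general conjecture

# Verify specific cases for both conjectures (no universal claim is proved here).
assert all(reaches_one(collatz_step, n) for n in range(1, 1000))
assert all(reaches_one(steve_step, n) for n in range(1, 1000))
```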

When a solution is formalized inside a theorem prover, it is reduced to the level of arithmetic (a theorem prover is an arithmetic-level machine).

So a theory might be a very high-brow math, but a formal derivation is still arithmetic (if one just focuses on the syntax and the formal rules, and not on the presumed semantics).

Not off the top of my head, but since a proof of Collatz does not require working under these constraints, I don't think the distinction has any important implications.

Discussion in this other comment is relevant:

Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework but there aren't any concrete publicly-known policies yet, many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047 required safety and security protocols.

However, I think the way Anthropic presented their RSP was misleading in practice (at least misleading to the AI safety community) in that it neither strictly requires pausing nor do I expect Anthropic to pause until they have sufficient safeguards in practice. I discuss why I think pausing until sufficient safeguards are in place is unlikely, at least in timelines as short as Dario's (Dario Amodei is the CEO of Anthropic), in my earlier post.

...

Actually, your example still goes beyond arithmetic:

  1. "No integer solutions" is a universal statement about ALL integers
  2. Proof by contradiction is still needed
  3. Even reframed, it requires proving properties about ALL possible p and q

In pure arithmetic we can only check specific cases: "3² ≠ 2×2²", "4² ≠ 2×3²", etc. Any examples using just counting and basic operations?

Yeah, it's easy to not be on the pareto frontier. Sometimes you can just make things better, and most people aren't going to argue much against doing that. They might argue a little, because change is a cost. A few people will argue a lot, because they have some unusual benefit. If lots of people argue a lot, that suggests a tradeoff is happening.

My observation is that some people do not prioritize one of the three corners of this triangle, and are confused when others argue about tradeoffs they don't see as important.

We can eliminate the concept of rational numbers by framing it as the proof that there are no integer solutions to p² = 2 q²... but... no proof by contradiction? If escape from self-reference is that easy, then surely it is possible to prove the Collatz conjecture. Someone just needs to prove that the existence of any cycle beyond the familiar one implies a contradiction.

The proof requires:

  1. The concept of rational numbers (not just natural numbers)
  2. Proof by contradiction (logical structure above arithmetic)
  3. Divisibility properties beyond basic operations

We can only use counting and basic operations (+,-,×,÷) in pure arithmetic. Any examples that stay within those bounds?

context retrieval, reasoning models

Both of these are examples of more intelligent systems built on top of an LLM where the LLM itself has no state.

Thirdly, the AI can use the environment as state in ways which would be nearly impossible to fully trace or mitigate.

Again, same problem: AI tool use is mediated by text input and output, and the validator just needs access to the LLM input and output.
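Concretely, the mediation I have in mind is just a wrapper that runs every input and output through the weaker checker before any tool call executes - a minimal sketch, with validator_model, target_model, and run_tool as hypothetical stand-ins:

```python
# Minimal sketch of the validator idea: every input to and output from the
# stronger model passes through a weaker "validator" model before any tool
# call is executed. All three callables are hypothetical stand-ins.

def target_model(prompt: str) -> str:
    raise NotImplementedError  # the stronger LLM being supervised

def validator_model(text: str) -> bool:
    raise NotImplementedError  # previous-gen model; returns True if the text looks safe

def run_tool(command: str) -> str:
    raise NotImplementedError  # whatever environment/tool interface the agent uses

def supervised_step(prompt: str) -> str:
    if not validator_model(prompt):
        raise RuntimeError("validator rejected the input")
    output = target_model(prompt)
    if not validator_model(output):
        raise RuntimeError("validator rejected the output")
    # Tool use is mediated by text, so checking the text is checking the action.
    return run_tool(output)
```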

Yes, the proof that there's no rational solution to x²=2. It doesn't require real numbers.
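For reference, a standard sketch of that argument, stated entirely over the positive integers (descent on parity):

```latex
% No positive integers p, q satisfy p^2 = 2 q^2 (i.e., x^2 = 2 has no rational solution).
\begin{proof}[Sketch]
Suppose $p^{2} = 2q^{2}$ with $p, q$ positive integers, and take $p$ as small as possible.
Since $p^{2}$ is even, $p$ is even; write $p = 2r$.
Then $4r^{2} = 2q^{2}$, so $q^{2} = 2r^{2}$.
Because $p^{2} = 2q^{2} > q^{2}$, we have $q < p$, so $(q, r)$ is a solution with a smaller
first component, contradicting the minimality of $p$. Hence no such $p, q$ exist.
\end{proof}
```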

If Neanderthals could have created a well-aligned agent, far more powerful than themselves, they would still be around and we, almost certainly, would not.
The merest possibility of creating superhuman, self-improving AGI is a total philosophical game changer.
My personal interest is in the interaction between longtermism and the Fermi paradox - any such AGI's actions are likely to be dominated by the need to prevail over any alien AGI it ever encounters, as such an encounter is almost certain to end one or the other.

I am looking for a counterexample - one that doesn't go above the arithmetic level for both the system and the level of proof - can you name any?
