AI Control in the context of AI Alignment is a category of plans that aim to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:.. (read more)
Archetypal Transfer Learning (ATL) is a proposal by @whitehatStoic for what the author argues is a fine-tuning approach that "uses archetypal data" to "embed Synthetic Archetypes". These Synthetic Archetypes are derived from patterns that models assimilate from archetypal data, such as artificial stories. The method yielded a shutdown activation rate of 57.33% in the GPT-2-XL model after fine-tuning... (read more)
If you are new to LessWrong, the current iteration of this is the place to introduce yourself... (read more)
Repositories are pages that are meant to collect information and advice of a specific type or area from the LW community. .. (read more)
A threat model is a story of how a particular risk (e.g. AI) plays out... (read more)
A Self-Fulfilling Prophecy is a prophecy that, when made, affects the environment such that it becomes more likely. Similarly, a Self-Refuting Prophecy is a prophecy that, when made, makes itself less likely. This is also relevant for beliefs that can affect reality directly without being voiced: for example, the belief "I'm confident" can increase a person's confidence, thus making it true, while the opposite belief can reduce a person's confidence, thus also making it true... (read more)
A project announcement is what you might expect - an announcement of a project.
Posts that are about a project's announcement, but do not themselves announce anything, should not have this tag... (read more)
A rational agent is an entity which has a utility function, forms beliefs about its environment, evaluates the consequences of possible actions, and then takes the action which maximizes its utility. Rational agents are also referred to as goal-seeking. The concept of a rational agent is used in economics, game theory, decision theory, and artificial intelligence... (read more)
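A minimal sketch of that loop in code. The actions, outcomes, beliefs, and utilities below are invented purely for illustration and are not from the tag itself:

```python
# Minimal sketch of a rational agent choosing by expected utility.
# Everything concrete here (actions, outcomes, numbers) is made up.

def expected_utility(action, beliefs, utility):
    """Average the utility of each possible outcome, weighted by the agent's
    believed probability of that outcome given the action."""
    return sum(p * utility[outcome] for outcome, p in beliefs[action].items())

def choose(actions, beliefs, utility):
    """Take the action whose expected utility is highest."""
    return max(actions, key=lambda a: expected_utility(a, beliefs, utility))

beliefs = {
    "take umbrella":  {"stay dry": 1.0},
    "leave umbrella": {"stay dry": 0.7, "get soaked": 0.3},
}
utility = {"stay dry": 1.0, "get soaked": -2.0}

print(choose(list(beliefs), beliefs, utility))  # -> "take umbrella"
```

The whole decision procedure reduces to an argmax over expected utilities; everything interesting lives in where the beliefs and the utility function come from.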
Summaries of discussions, takeaways, etc. from LessWrong meetups that have already taken place.
Inkhaven is a 30-day residency where one has to publish posts every day, as part of an effort to grow stronger as a writer. While this has produced some excellent posts, it also produces a fair bit of noise, and many more hastily-written or experimental posts than usual.
Inkhaven-like posts emerge when other people try to imitate this manner on a smaller scale (e.g. Lightcone team members doing their own 1-week writing stints, or 'HalfHaven' where remote LessWrongers aim to post 30 posts over the course of two months).
Focuses on the intersection of frontier AI agents and traditional infrastructure security, including exploit detection, system persistence, and hardware-level attributability
Causal relationships are usually formalized as a directed acyclic graph from parent events to child events, together with a rule for computing the probability of each child given the state of its parents.
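For concreteness, here is a minimal sketch of that formalization with a made-up two-node graph (Rain causing WetGrass) and made-up probabilities:

```python
# A minimal sketch of a causal DAG with conditional probability tables.
# The graph, events, and probabilities are illustrative, not from the source.
from itertools import product

graph = {
    "Rain":     {"parents": [], "cpt": {(): 0.2}},                             # P(Rain)
    "WetGrass": {"parents": ["Rain"], "cpt": {(True,): 0.9, (False,): 0.1}},   # P(WetGrass | Rain)
}

def prob_true(node, given=None):
    """P(node=True), optionally given fixed probabilities for some ancestors."""
    given = given or {}
    parents = graph[node]["parents"]
    cpt = graph[node]["cpt"]
    total = 0.0
    # Sum over every configuration of the parents, weighting each
    # configuration by its probability.
    for combo in product([True, False], repeat=len(parents)):
        weight = 1.0
        for parent, value in zip(parents, combo):
            p = given.get(parent, prob_true(parent, given))
            weight *= p if value else 1.0 - p
        total += weight * cpt[combo]
    return total

print(prob_true("WetGrass"))                  # 0.2 * 0.9 + 0.8 * 0.1 = 0.26
print(prob_true("WetGrass", {"Rain": 1.0}))   # 0.9
```

Each node only needs a table over its own parents; the probability of any event then falls out of summing over parent configurations.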
In doxastic modal logic, the statement "P is a hyperstition" is written as □P→P. Modal reasoners that satisfy Löb's Theorem believe all personal hyperstitions. This can cause some problems for modal embedded agents. Löbian cooperation works by making mutual cooperation a collective hyperstition.
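A short derivation (in standard provability-logic notation, my rendering rather than the original text) of why a Löbian reasoner ends up believing any P for which it can prove the hyperstition property:

```latex
\begin{align*}
&\vdash \Box P \to P                   && \text{$P$ is a hyperstition for this reasoner} \\
&\vdash \Box(\Box P \to P)             && \text{necessitation} \\
&\vdash \Box(\Box P \to P) \to \Box P  && \text{L\"ob's Theorem} \\
&\vdash \Box P                         && \text{modus ponens} \\
&\vdash P                              && \text{line 1 and modus ponens}
\end{align*}
```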
A reasoning step is "logically valid" when that kind of step never produces a false conclusion from true premises. For example, in algebra, "Add 2 to both sides of the equation" is valid because it only produces true equations from true equations, while "Divide both sides by x" is invalid because x might be 0. So even if "2x = (y+1)x", letting x = 0 and y = 2, the original equation can be true while "2 = y + 1" is false. But "2x + 2 = (y+1)x + 2" will be true in every semantic model where the original equation is true.
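Writing out that counterexample explicitly (my rendering of the substitution x = 0, y = 2 already described above):

```latex
% Substituting x = 0, y = 2 into each equation:
\begin{align*}
2x &= (y+1)x     &  0 &= 0 && \text{true (the original equation holds)} \\
2 &= y+1         &  2 &= 3 && \text{false (dividing by } x \text{ was invalid)} \\
2x+2 &= (y+1)x+2 &  2 &= 2 && \text{true (adding 2 preserved truth)}
\end{align*}
```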
More generally in life, there's a question of "did you execute each local step of reasoning correctly", which can be considered apart from "did you arrive at the correct conclusion". Validity is a local property of a reasoning step or sequence; we can (and should) evaluate each step's validity separately from whether we agree with the premises or end up agreeing with the conclusion. For near-logical domains, this asks "Does the next proposition follow (with very high probability, given other things usually believed about the world or explicitly introduced as premises) from the previous proposition?" For probabilistic reasoning, informal validity asks, "Given everything else believed or introduced as a premise, is this next step adjusting probabilities by the right amount?" or "Does this kind of reasoning step in general produce well-calibrated conclusions from well-calibrated premises?"
Eg, consider why the ad hominem fallacy should be seen as "invalid" or a "locally invalid reasoning step" from this viewpoint. Suppose you start out with well-calibrated probabilities (things you say "60%" for, happen around 60% of the time). You assign 60% probability that the sky is blue. Then somebody says, "Yeah, well, people who believe in blueskyism are ugly" and you nod and adjust your credence in blueskyism down to 40%. Your odds just went from 3:2 to 2:3, so by Bayes's Rule you should've heard evidence with a likelihood ratio of 4:9 to produce that probability shift. Unless you already believe that false propositions are 225% as likely as true propositions to be believed by ugly people, you should already expect that believing an ad hominem argument is something that can produce ill-calibrated conclusions in expectation from well-calibrated premises.
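The arithmetic behind those numbers, in the odds form of Bayes's Rule (my rendering of the calculation in the paragraph above):

```latex
% Posterior odds = prior odds times likelihood ratio.
\begin{align*}
\text{prior odds of blueskyism} &= 60 : 40 = 3 : 2 \\
\text{posterior odds} &= 40 : 60 = 2 : 3 \\
\text{implied likelihood ratio} &= \frac{2/3}{3/2} = \frac{4}{9} \\
\text{equivalently}\quad
\frac{P(\text{ugly believers} \mid \text{blueskyism false})}{P(\text{ugly believers} \mid \text{blueskyism true})} &= \frac{9}{4} = 225\%
\end{align*}
```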
Scalable oversight is an approach to the problem of providing reliable supervision of outputs from AIs, even as they become smarter than humans. Often groups of weaker AIs supervise a stronger AI, or AIs are set in a debate with each other.
Scalable oversight used to be referred to as a set of AI alignment techniques, but these techniques usually work at the level of the incentives given to the AIs and have less to do with architecture.
By Ruthenis (summarized; includes level 0):
Inkhaven is a 30-day residency where one has to publish posts every day. While this likely helps one in the longer term, the shorter-term effect is more posts written with less effort to double-check the arguments and, as a result, with epistemic problems.
Inkhaven-like posts emerge when other people try to imitate this manner on a smaller scale (e.g. Lightcone team members doing their own 1-week writing stints).
The main problems with CEV include, firstly, the great difficulty of implementing such a program - “If one attempted to write an ordinary computer program using ordinary computer programming skills, the task would be a thousand lightyears beyond hopeless.” Secondly, the possibility that human values may not converge. Yudkowsky considered CEV obsolete almost immediately after its publication in 2004. He states that there's a "principled distinction between discussing CEV as an initial dynamic of Friendliness, and discussing CEV as a Nice Place to Live" and his essay was essentially conflating the two definitions.
ATOW (2026-04-03), Moore et al. (2026) is probably the best academic account of LLM-induced psychosis. They "analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use", where the users mostly came from a "support group for such chatbot users."
ML4Good is a France-based field-building organisation that runs AI Safety bootcamps.
We used to have a feature for crossposting to the EA Forum. It caused a lot of bugs that were difficult to deal with and didn't feel like it was pulling its weight, so we removed it in the latest update.
hey Chris and Mick! wanna include Atlas Computing? we're a fieldbuilding org scoping problems in AGI risk, which makes it easier to recruit expertise to lead orgs working on them.
we're also hiring: https://atlascomputing.org/jobs
our onepager here:
https://docs.google.com/document/d/1v9yVAkfnjrFwsp3jH5aYTwfwjVBsNYND/edit?usp=sharing&ouid=109085206565751232228&rtpof=true&sd=true