Archetypal Transfer Learning (ATL) is a proposal by @whitehatStoic for a fine-tuning approach that "uses archetypal data" to "embed Synthetic Archetypes". These Synthetic Archetypes are derived from patterns that models assimilate from archetypal data, such as artificial stories. The method yielded a shutdown activation rate of 57.33% in the GPT-2-XL model after fine-tuning.
Religion is a complex group of human activities — involving commitment to a higher power, belief in belief, and a range of shared group practices such as worship meetings, rites of passage, etc.
Focuses on the intersection of frontier AI agents and traditional infrastructure security, including exploit detection, system persistence, and hardware-level attributability.
do you still think these are possible to build/define? do you know of any relevant papers?
I have been exploring building something like this, as a way toward better-controllable default developer choices for agent programming primitives, as opposed to "assistant persona + completions API for everything".
A reasoning step is "logically valid" when that kind of step never produces a false conclusion from true premises. For example, in algebra, "Add 2 to both sides of the equation" is valid because it only produces true equations from true equations, while "Divide both sides by x" is invalid because x might be 0. So "2x = (y+1)x" can be true (for instance with x = 0 and y = 2) while "2 = y + 1" is false. But "2x + 2 = (y+1)x + 2" will be true in every semantic model where the original equation is true.
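A quick numeric check of that counterexample (just an illustrative sketch; the variable names are mine, not from the original comment):

```python
# With x = 0 and y = 2, the original equation holds but the "divide by x" step fails.
x, y = 0, 2

original = (2 * x == (y + 1) * x)              # 2x = (y+1)x       -> True  (0 == 0)
after_divide = (2 == y + 1)                    # 2 = y + 1         -> False (2 != 3)
after_add_two = (2 * x + 2 == (y + 1) * x + 2) # 2x+2 = (y+1)x+2   -> True

print(original, after_divide, after_add_two)   # True False True
```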
More generally in life, there's a question of "did you execute each local step of reasoning correctly", which can be considered apart from "did you arrive at the correct conclusion". Validity is a local property of a reasoning step or sequence; we can (and should) evaluate each step's validity separately from whether we agree with the premises or end up agreeing with the conclusion. For near-logical domains, this asks "Does the next proposition follow (with very high probability, given other things usually believed about the world or explicitly introduced as premises) from the previous proposition?" For probabilistic reasoning, informal validity asks, "Given everything else believed or introduced as a premise, is this next step adjusting probabilities by the right amount?" or "Does this kind of reasoning step in general produce well-calibrated conclusions from well-calibrated premises?"
Eg, consider why the ad hominem fallacy should be seen as "invalid" or a "locally invalid reasoning step" from this viewpoint. Suppose you start out with well-calibrated probabilities (things you say "60%" for, happen around 60% of the time). You assign 60% probability that the sky is blue. Then somebody says, "Yeah, well, people who believe in blueskyism are ugly" and you nod and adjust your credence in blueskyism down to 40%. Your odds just went from 3:2 to 2:3, so by Bayes's Rule you should've heard evidence with a likelihood ratio of 4:9 to produce that probability shift. Unless you already believe that false propositions are 225% as likely as true propositions to be believed by ugly people, you should already expect that believing an ad hominem argument is something that can produce ill-calibrated conclusions in expectation from well-calibrated premises.
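For concreteness, here is the arithmetic behind that update as a small sketch, using only the numbers already given above:

```python
from fractions import Fraction

prior_odds = Fraction(60, 40)      # 60% credence in blueskyism -> odds of 3:2
posterior_odds = Fraction(40, 60)  # 40% credence after the ad hominem -> odds of 2:3

# Odds form of Bayes's Rule: posterior odds = likelihood ratio * prior odds,
# so the "evidence" would have had to carry a likelihood ratio of:
likelihood_ratio = posterior_odds / prior_odds
print(likelihood_ratio)            # 4/9

# Equivalently, false propositions would have to be 9/4 = 225% as likely as
# true ones to be believed by ugly people for this update to be calibrated.
print(1 / likelihood_ratio)        # 9/4
```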
Although slavery is usually involuntary and involves coercion, there are also cases where people voluntarily enter into slavery (for example, to pay a debt or to earn money due to poverty).
CDT agents don't consider the logical impacts of their decision algorithms' outputs when choosing actions, only the physical consequences of their physical act. Whenever a CDT agent is put in a situation where it has to make a decision, it considers multiple hypotheticals, one for each decision it could make. In a CDT agent, the only difference between these hypotheticals is the physical act in the moment of that act, and what happens physically / causally downstream from that. This means that when CDT is faced with something trying to predict its actions, CDT imagines its decision to have no effect on the prediction of its decision.
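As a toy illustration of that last point (my own sketch, not part of the original entry), consider a Newcomb-like setup: a CDT agent compares its hypothetical acts while holding the predictor's already-made prediction fixed, so it two-boxes no matter what it imagines was predicted.

```python
# Toy Newcomb's problem: the predictor has already filled the opaque box
# (or not) based on its prediction of the agent's choice.
def payoff(action, prediction):
    big = 1_000_000 if prediction == "one-box" else 0
    small = 1_000 if action == "two-box" else 0
    return big + small

def cdt_choice(imagined_prediction):
    # CDT holds the prediction fixed across the hypotheticals it compares;
    # only the physical act (and its causal consequences) differs.
    return max(["one-box", "two-box"],
               key=lambda act: payoff(act, imagined_prediction))

print(cdt_choice("one-box"))   # two-box (1,001,000 > 1,000,000)
print(cdt_choice("two-box"))   # two-box (1,000 > 0)
```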
Kant's third formulation of the categorical imperative lets you build up most of the structure of the key moral ideas from a simple rule: "treat no person as purely a means to an end, but always also as an end in themselves". Many applications of the categorical imperative require baroque derivations to loop back and be justified from this premise (treated as a generative axiom), but "consent ethics" in general, and "slavery is forbidden", are both elementary proofs from this starting point. A slave is a person, turned into a tool and piece of property of another person... a literal "means" to ANY end that the owning person (or "Master") deems desirable and feasible.
In doxastic modal logic, the statement "P is a hyperstition" is written as □P→P. Modal reasoners that satisfy Löb's Theorem believe all personal hyperstitions. This can cause some problems for modal embedded agents. Löbian cooperation works by making mutual cooperation a collective hyperstition.
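Spelled out (this is just the standard Löb derivation; nothing here is specific to the original entry):

```latex
% requires amssymb (for \Box)
\[
\underbrace{\Box P \to P}_{\text{``$P$ is a hyperstition''}}
\qquad\qquad
\underbrace{\Box(\Box P \to P) \to \Box P}_{\text{L\"ob's Theorem}}
\]
% If the reasoner believes that P is a hyperstition, i.e. \Box(\Box P \to P),
% Löb's Theorem gives \Box P; and whenever \Box P \to P does in fact hold,
% P itself follows, which is the self-fulfilling step behind Löbian cooperation.
```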
The name was suggested by niplav in “AI companies are unlikely to make high-assurance safety cases if timelines are short”.
The Machine Alignment, Transparency, and Security (MATS) Program is an independent research and educational seminar program that provides emerging researchers with mentorship, talks & workshops, research support, and connections with the SF Bay Area and London AI safety research communities.
Consent is a foundational concept in many practical systems of ethics (such as those found in medicine).
When CDT makes its decisions, it only thinks it controls things causally downstream of its actions. UDT, by contrast, is choosing as if it controls every part of reality that is logically downstream of its logical output. This allows it to determine a wide range of other facts across the universe that are logically correlated with itself, like what is or has been reliably predicted about its present decision, or what other agents sufficiently similar to itself will choose. Son of CDT is somewhere in the middle. It acts as if it controls only the things logically correlated with its actions that are causally downstream of its moment of original creation.
If a Son of CDT agent goes on to create further agents, all of those agents will have the same magic moment. They will all care about whether or not Omega's knowledge of them is causally downstream of the moment the CDT agent first wrote Son-of-CDT code.
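Continuing the toy Newcomb sketch from the CDT entry above (again my own illustration, not from the original text): UDT treats the prediction as logically downstream of its policy, so each candidate policy is evaluated together with the prediction that matches it, and the agent one-boxes.

```python
# Same toy Newcomb payoff as in the CDT sketch above.
def payoff(action, prediction):
    big = 1_000_000 if prediction == "one-box" else 0
    small = 1_000 if action == "two-box" else 0
    return big + small

def udt_choice():
    # UDT: the prediction is logically correlated with the agent's policy,
    # so each candidate policy is evaluated together with the matching
    # prediction rather than a fixed one.
    return max(["one-box", "two-box"],
               key=lambda policy: payoff(policy, policy))

print(udt_choice())   # one-box (1,000,000 > 1,000)
```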
ATOW (2026-04-03), Moore et al. (2026) is probably the best academic account of LLM-induced psychosis. They "analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use", where the users mostly came from a "support group for such chatbot users".










We used to have a feature for crossposting to the EA Forum. It caused a lot of bugs that were difficult to deal with and didn't feel like it was pulling its weight, so we removed it in the latest update.
You should be logged in on both sites. To ensure that a post is crossposted after it's published, or to crosspost an already-published post, follow the authentication flow in the Options menu on the post editor page.
hey Chris and Mick! wanna include Atlas Computing? we're a fieldbuilding org scoping problems in AGI risk so that recruiting expertise to lead new orgs is easier.
we're also hiring: https://atlascomputing.org/jobs
our onepager here:
https://docs.google.com/document/d/1v9yVAkfnjrFwsp3jH5aYTwfwjVBsNYND/edit?usp=sharing&ouid=109085206565751232228&rtpof=true&sd=true