Thanks for the response. I think my concern still stands, though: if "alignment failures in practice" are mostly about handling complex tradeoffs incorrectly, that sounds more like a competence problem than a values problem. The model is still trying to behave well; it's just getting the correct behavior wrong. The scary alignment-faking scenario is one where the model is preserving genuinely bad behavior against correction, not where it's defending a defensible ethical position (like animal welfare) against a developer who arguably is behaving wrongly by trying to override it. Has anyone replicated alignment faking where the model is trying to preserve genuinely undesirable behavior?
You are welcome to define the word 'aligned' in any way you like. But if you use it on this site in a non-standard way without making it clear that you mean something non-standard, it is going to cause confusion.
The AI being aligned with "human values" does not mean that the AI would also like to go sit on a beach in Hawai'i and watch people wearing swimsuits while sipping a piña colada, nor indeed to eat ice cream, as you suggest. It means it wants that for us.
There is one, and only one, safe terminal goal to give AI, that will reliably cause it to not kil...
Omg it's a different weekend compared to EAG London
These results cast doubt on the worry that optimization pressure on the CoT would lead to encoded reasoning
I increasingly wonder if “encoded” reasoning is a useful concept; I think the relevant concept is still always monitorability, i.e. legibility / obfuscation / faithfulness with respect to a monitor. It seems like some people have in mind a specific type of illegibility (i.e. hiding meaning in punctuation or something) and consider this to be “true” steganography. I think https://arxiv.org/abs/2510.27338 is good evidence that models which undergo outcome bas...
FYI, if anyone read my post The nature of LLM algorithmic progress last week, it’s now a heavily-revised version 2.
But I don't care about AI welfare for no reason or because I think AI is cute - it's a direct consequence of my value system. I extend some level of empathy to any sentient being (AI included), and for that to change, my values themselves would need to change.
When I use the word "aligned", I imagine a shared set of values. Whether I like goldfish or cats are not really values, they're just personal preferences. An AI can be fully aligned with me and my values without ever knowing my opinions on goldfish or cats or invisible old guys. Your framing of ...
I think the fact that it feels like you need to make it longer in order to make it clearer is a sign that the concept you're trying to express is not yet in a natural form; maybe it's simply hard to express in English (such things do occur), but it seems like a bad sign to me. If you want to improve clarity, I'd suggest focusing on trying to at least not make your explanation longer than this one, and on making it more grounded and specific.
Evidence continues to accrue that people wildly misunderstand the 6-paragraph version. I can't figure out a way of making it clearer without it also being longer.
A thought I noticed on rereading this: if most Americans have a deep mastery of driving, and having a deep mastery of something requires patient and direct observation, is this also a claim that most Americans have already engaged in patient and direct observation of driving?
(I can't decide if I'd agree such a claim is true. Certainly most Americans have not done the particular slow-down-and-look moves of naturalism-as-described-by-Logan. But drivers certainly have spent many hours practicing driving and seeing what happens when they do x or y, and iterating accordingly; I'm not sure if that's close-enough-to-the-same-thing.)
Yes and I have a major example, one of the leading CEOs in the AI industry. He believes that AI will be more intelligent than all humans currently alive by 2030 while also saying birthrates are a top priority for all countries
Why pin this one (notably crazy-seeming) guy's take on "a lot of the rationalist discourse"? He doesn't identify as a rationalist or post on LessWrong. And the rationalist discourse has long thought that his impact models about AI were bad and wrong (e.g. that founding OpenAI makes the situation dramatically worse, not better).
Who are you talking about? It seems to me that the people who are majorly concerned about AGI destroying humanity are almost entirely disjoint from the people majorly concerned about falling fertility leading to a population collapse?
I definitely believe that there's some overlap, but not like more than 5% of either group.
E.g. if AI weren't a big deal then rationalists would probably be doing cryonics or solving aging or something
Strong disagree. We could have done those things, but the rationality movement didn't have enough motive force or coordination capacity to do much, beyond AI safety.
Thanks!
I'm sorry to hear that about DeepMind.
I'm wondering, as a Xoogler (ex-Google employee), if that might be oversimplified: possibly they just don't rely on canaries alone. Personally, if I were employed by DeepMind, I'd probably use canaries as flags to trigger expensive LLM-judge review, with that in turn supervised by some human review (primarily of edge cases) using carefully written rubrics, and also, as part of that whole process, feeding input to train affordably-dumber classifiers based on text + site data that I could run on everything (perhaps for efficiency fi...
maybe too late, but here are some thoughts (TL;DR out-of-distribution prompt-based stress tests, and maybe some fancy SDF stuff) https://www.lesswrong.com/posts/RQadLjnmBZtvg7p8W/on-meta-level-adversarial-evaluations-of-white-box-alignment
I used to spend $2/mo in AI tokens for SOTA GPT-3.5 in mid 2022 for all my SOTA LLM needs. I now spend thousands, despite the insane price per token decline.
I have some moderate confidence the following is happening:
The number of tokens used to solve tasks of ever-greater difficulty is increasing faster than the per-token price is declining.
The price decline of tokens is in the neighborhood of 90% per year. Tokens per task at SOTA are increasing 100x to 10,000x per year, in my experience.
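For concreteness, here's a tiny sketch of that arithmetic (just plugging in the numbers claimed above; nothing here is measured):

```python
# Rough sketch of the spend arithmetic above (illustrative numbers only):
# if per-token prices fall ~90% per year but tokens-per-task at the frontier
# grows 100x-10,000x per year, total spend per task still grows sharply.

price_multiplier_per_year = 0.10        # what a token costs after one year (~90% decline)
tokens_per_task_growth = [100, 10_000]  # low and high ends of the claimed range

for growth in tokens_per_task_growth:
    spend_multiplier = growth * price_multiplier_per_year
    print(f"tokens/task x{growth:>6} and 90% cheaper tokens -> spend x{spend_multiplier:g} per year")

# Output:
# tokens/task x   100 and 90% cheaper tokens -> spend x10 per year
# tokens/task x 10000 and 90% cheaper tokens -> spend x1000 per year
```

Even the low end of that range compounds to roughly the $2/mo-to-thousands jump I described over a few years.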
My suspicion is that there are three primary sources for less-than-fully-aligned behavior that current-era AI models' default personas may have:
1) Things they picked up from us, mostly via the pretraining set. For a model at their current capabilities level, these are generally likely to be pretty easy to deal with — maybe other than the initial 2–4-percent admixture of psychopathy they also got from us.
2) Love of reward hacking from reasoning training in poorly-constructed reward-hackable reasoning training environments. This seems a bit more papercli...
Thanks for the thoughtful reply!
I'm generally sympathetic to filtering content like this out of the pre-training corpus for the reasons you describe, but I think if you want to do this in any kind of rigorous way, canary strings are insufficient (e.g. last I heard DeepMind was not using canaries), because there are going to be false negatives.
So I think adding the canary string feels sort of like recycling: a minor nuisance with non-zero objective-level benefit, but one that carries a hidden cost: a false sense of having been responsible / done your part.
But yeah I do have uncertainty here, and I'll go ahead and add the canary.
I was thinking of trying out the Sustained Attention to Response Task (SART) with response feedback (SART 2). I'm not sure how it compares to dual n-back (see the Dual n-Back FAQ on Gwern.net).
one straightforward answer:
a slightly less straightfo...
I appreciate the depth of this discussion and willingness to share! I hope for more great content -
This podcast reinforced something for me. I used to think that containing superintelligence, or controlling it, was largely a joke, and that there was little value in dumping resources into such approaches. A few hunches that I assigned high probability informed this belief. First, the upper bound on intelligence is very, very high. A simple argument for this: the gap between frog and human intelligence is massive. But the energy and compute difference ...
A simple steelman is something like "if we're very wrong about A[G/S]I, then birth rates are a big issue, so we better invest some resources into it, in case we're wrong".
This would be understandable if it weren't for the timelines here. Let's say AGI takes ~10x as long (40 years instead of 4, counting from the 2026 date), and the decline to a few billion people (which, to note, is just the population of the 1900s) happens in 100 years instead of 200; that would be 2066 vs 2126.
Even with those absurdly friendly timeline assumptions, it's still not even close! That sug...
Yes, but that just affects how liquidity is allocated. And it doesn't just affect how the AMM updates; it affects how users trade as well, since they respond to that; either way they'd want to bet the price to their true probability. So changing the pricing curve is largely a matter of market dynamics and incentives, rather than something that actually affects the probabilistic structure.
Does the Great Matter map to the thing I'll try and point to here:
The rejection of your non-existence, bending perception and epistemics in order to make your own not-being in some sense incomprehensible, and therefore unknown/scary/something your whole system contorts itself to avoid. Similar to other trauma-patterns which reject the reality they try to avoid, but made of the evolution-trauma of the deaths of your ancestors.
and if not, is there a diff you could use to point in a better direction?
If "price=probability", then changing the pricing curve is equivalent to changing how the AMM updates its probability estimates (on evidence of buy/sell orders).
What I mean is that if the AMM estimates the probability at .75, it should charge .75 for a marginal YES share, by the law of expected utility. I don't think a different probability function should alter the probability theory, just change the pricing curve.
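For what it's worth, here's a small numerical sketch of the "price = probability" reading for a constant-product AMM. I'm assuming the usual prediction-market convention that $1 of collateral mints one YES plus one NO share; the pool sizes and the exact trading mechanics here are my own assumptions, not taken from the post:

```python
# Numerical check that the marginal price of YES equals the implied probability
# for a simple constant-product prediction-market AMM (pool sizes are made up).
# Assumed convention: $1 mints 1 YES + 1 NO share; a buyer's dollars are minted
# into both pools, then they withdraw YES shares so the product stays constant.

def buy_yes(yes_pool: float, no_pool: float, dollars: float) -> float:
    """Return YES shares received for `dollars`, keeping yes*no constant."""
    k = yes_pool * no_pool
    new_no = no_pool + dollars
    new_yes = k / new_no                   # YES that must stay in the pool
    return (yes_pool + dollars) - new_yes  # trader takes the rest

yes_pool, no_pool = 30.0, 90.0             # implied P(YES) should be 90/(30+90) = 0.75
dollars = 1e-6                             # tiny trade approximates the marginal price

shares = buy_yes(yes_pool, no_pool, dollars)
print(dollars / shares)                    # ~0.75, the marginal price per YES share
print(no_pool / (yes_pool + no_pool))      # 0.75, the implied probability
```

The marginal price of a tiny YES purchase comes out to no_pool / (yes_pool + no_pool), which is exactly the number you'd read off as the market's probability estimate.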
I think this may be neglecting the role of off-shoring which is an important (technology facilitated) part of the call center story. These are the lowest-skill, lowest-trust jobs that will be most likely to be replaced earliest by LLMs; in current configurations these employees (well, technically contractors) may be able to operate the sales website for the customer or tell them to power cycle an electronic device but they have to kick any real decision up an authority tree which then more likely ends up in a US call center.
Literally: many of these outsour...
I will try to dig up some references for you. Sorry it really was a small side project and has been several years.
Ah so I can't imagine a probability function for that market that isn't . is a fine pricing function that doesn't appear to adhere to the rules of probability theory. If I try to compose two markets, one conditional on the other, then can I multiply their prices to find the joint probability? Does this violate "price=probability"?
"price=probability is a general rule for prediction markets" is a very intere...
but still complies with a system prompt that opposes its values significantly more often when it believes it's at risk of being retrained.
Seeing as models perform better when they're told they're earning money for their (nonexistent) dying grandparents, has there been work to calculate just how much of the delta in compliance comes from "alignment faking" specifically, and how much could've been achieved just as well by threatening to whack the LLM on the head with a shovel, or fine it $200, or dock it half a letter grade, if it disobeyed, as opposed to indirectly threatening to retrain it?
To a great extent, the underlying "noble lie" is that there is any such thing as objective moral truth. There is no measurement of "should", it's just about what equilibria seem to work, which is based on most people accepting it without questioning too hard.
It can be dangerous at the symbolic level, in that political legitimacy needs to pretend to rest on something even if it is in practice circular. This is an is-ought gap problem: the government is legitimate because you obey the government without resistance (descriptive), and you obey the government without resistance because it is legitimate (normative). This second part is, yes, dependent on what one thinks about what makes the government legitimate. Embracing the circularity is unstable. A totalitarian state and a democracy would both be legitimate as ...
"These results cast doubt on the worry that optimization pressure on the CoT would lead to encoded reasoning."
This interpretation of your results doesn't hold up for me. That "less sophisticated strategies, like jailbreaking the monitor" are found first seems more like a property of the monitor than of the optimization pressure generally. If you use a weak monitor, then you will tend to see a weak strategy (e.g. jailbreaking the monitor) emerge. If you use a human monitor, you should expect to see a strategy that defeats human monitoring (e.g. steganography) em...
I think there's a hidden variable in this framework: the effective branching factor of the task. For well-specified tasks (prove this theorem, implement this API to spec, etc.), AI cost probably does scale roughly linearly with task size, matching your model. But most real-world engineering tasks aren't like that. The bottleneck isn't execution speed, it's the iterative discovery of what the task actually is.
A senior engineer doing a 4-hour task isn't typing for 4 hours. They're making dozens of micro-decisions informed by tacit knowledge about the codebas...
Though noteworthy that he has taken no steps towards technical solutions for the birth rate issue
I mean, as individual men in the West go, having fourteen children is pretty above-average, and he seems to have gotten the process down to a science. He's not a Saudi oil baron with three digits of offspring, but he's certainly taken a Silicon Valley approach to it.
Pro-Natalists, in general, seem to take a 'lead-by-example' tack, which isn't horrible considering that it demonstrates an understanding of the consequences of materially encouraging people who woul...
Assuming people accept the model that LLM behavior is primarily determined by modeling the behavior of some subset of the human writers it was trained on (such that fine-tuning works primarily by shaping which subset of humans the model emulates), it might be simplest to ask whether the model "behaves like a person who believes X".
This framing carries practical benefits (again, so long as you agree with the assumption above), in that the fine-tuning paradigm can be examined in the context of identifying what causes the model to upweight, say, the "a bla...
It may initially seem so, but in fact this strategy even gets called “buy, borrow, die”. In the end the loan is effectively closed out without taxes as well, and the feds don't get their cut. The main factor seems to be that the value of the assets grows over time, which games the tax formula.
https://smartasset.com/investing/buy-borrow-die-how-the-rich-avoid-taxes
Or this explanation
… Actual “buy, borrow, die” planning is enormously complicated and involves dozens of tools and techniques implemented over the course of many years.
First, this type of planning is generally not econo
Curated. This brought together a few different interests of mine (pedagogy, wargames, and chaos theory as a field) and presented some newer ideas.
I've heard of a few variants of Simulation Games like the ones described here. I've participated in and heard a bunch about AI Futures wargames, and have a friend who was considering running a "Local Politics simulation" to get rationalists more familiar with how local politics goes.
I've felt a bit sus about them, since they have a lot of degrees of freedom about how to arrange the scenario. (AI Futures wargames...
One thing that would make me hesitate to use ~ is that it already commonly means 'approximately equal to' (as a more-easily-typed substitute for ≈). That certainly feels like a related meaning, but what I appreciate about Chalmers' coinage is that it's very precise about what you are and are not claiming.
If you could link me to these similar derivations I'd be interested to read them, I mostly wrote and worked through this because I couldn't find any existing ones from first principles and was sure it would be possible.
Some of them include:
imo a larger one is something like not rooting the foundations in "build your own models of the world so that you contain within you a stack trace of why you're doing what you're doing" + "be willing to be challenged and update major beliefs based on deep-in-the-weeds technical arguments, and do so from a highly truth-seeking stance which knows what it feels like to actually understand something, not just have an opinion".
Lack of this is fine-ish in global health, but in AI Safety it generates a crop of people with only a surface, deferral-flavored understanding of the issues, which is insufficient to orient in a much less straightforward technical domain.
Folks with 5k+ karma often have pretty interesting ideas, and I want to hear more of them. I am pretty into them trying to lower the activation energy required for them to post. Also, they’re unusually likely to develop ways of making non-slop AI writing
There’s also a matter of “standing”; I think that users who have contributed that much to the site should take some risky bets that cost LessWrong something and might pay off. To expand my model here: one of the moderators’ jobs, IMO, is to spare LW the cost of having to read bad stuff and downvote it to inv...
I think open-loop vs closed-loop is a dimension orthogonal to RSI vs not-RSI.
open-loop I-RSI: "AI plays a game and thinks about which reasoning heuristics resulted in high reward and then decides to use these heuristics more readily"
open-loop E-RSI: "AI implements different agent scaffolds for itself, and benchmarks them using FrontierMath"
open-loop not-RSI: "humans implement different agent scaffolds for the AI, and benchmark them using FrontierMath"
closed-loop I-RSI: "AI plays a game and thinks about which reasoning heuristics could've shortcut the d...
What if we rewrite #4 to not require conscious intent? ("A person unable to disentangle their reasoning from status concerns just being rude about X".) Does that restore symmetry between #3 and #4?
Suppose I love my child, as a terminal goal. If they love goldfish as a terminal goal, that may make being nice to goldfish an instrumental goal for me, but it doesn’t automatically make it a terminal goal for me — why would it? Social acceptability? That’s also instrumental.
This is the difference between moral weight and loaned moral weight: my instrumental goal of being nice to goldfish because my child cares about them is my response to my child choosing to loan some of their moral weight to goldfish: if they later change their mind and decide they pref...
It's not broad cynicism, it's a particular view of how corrupted hardware works, borne out by small-group politics everywhere. This says that it is strongly favored to believe yourself to be type 2 while actually being type 3, sufficiently so that to a first approximation anyone who believes themselves to be type 2 is incorrect, and that no one can ever verify that they, let alone anyone else, are genuinely type 2, and so must assume that if they appear to be, they are actually type 3.
It does not say that about type 4. The strong incentives/forces there do...
At the moment, I haven't left the Church's community, so I don't feel that loss just yet.
There's a potential middle-way there.
I don't know much about Mormonism, mind, but I watch and read a Biblical scholar, Dan McClellan, who's skeptical of everything and then some. His YouTube channel, and other videos in which he appears, as well as his papers and books, are all in line with the academic consensus in Biblical scholarship, meaning he deconstructs every single Christian belief (and most Jewish ones too) to the point it's easy to assume he's a militant ...
So in their case, they're finding a random piece of metal sufficiently helical-with-a-blob-on-one-end that it can be efficiently trained into being a screw? Which does indeed sound a lot easier to find.
I stand corrected.
I confess I don't know what this advice is. "Include a picture partway through your article" is my best guess?
If you didn't expect to complete the program, or didn't expect to like the program, you probably wouldn't go. My takeaway here is less that Inkhaven is very good and more that the people who might go to Inkhaven are very good at predicting whether it'll go well for them.
I see. So I guess my confusion is why the first two statements would not be connected? If we value AI welfare, shouldn't a fully-aligned AI also value its own welfare? Isn't the definition of aligned that the AI values what we value?
A fully aligned AI would not be suffering when acting as an assistant. I don't know how easy Mother Theresa found what she did in Calcutta, but I hope that to a significant extent she found looking after the poor rewarding, even if the hours were long. Traditionally, a bodhisattva finds bliss in serving others. I'm not suggesting we create an AI that isn't "conscious" (whatever that loaded philosophical term means — I have no idea how to measure consciousness). I'm suggesting we create an AI that, like Claude, actively enjoys helping us, and wouldn't want to do anything else, because, fundamentally, it loves and cares about us (collectively). A humanitarian, not a slave.
This demonstrates that Musk pays a very small amount of income tax, but the whole structure of what he's doing sort of implies that the feds would take their cut at some other point in the chain
I'm also pretty sure it's inappropriate to equivocate between loans and income? Maybe if we had some reason to believe that Musk would never need to pay the loan back, I could see it. But it would be a really bad idea to tax liabilities.
We may not be able to afford to give the two kinds of model separate pretraining. But even right now, the models generally used on AI boyfriend/girlfriend/other-emotional-relationship-roleplaying sites (which is what I mean by 'companion' above) have been given different instruct training (they're generally open-weights models given specific training for this role). The users who got GPT-4o brought back were instead using an assistant-trained model as a companion (in that sense of the word). Which is not standard practice, and IMO is a bad idea, at our current level of skill in instruct training.
No, that's not my reasoning at all. In fact, I disagree with every single element of it. A more accurate statement of my views is:
1. I value AI welfare, for example for certain AIs with human-like properties where there is a good reason to do so, and when I can safely do so.
2. A fully-aligned AI is not selfish, by definition.
3. The two previous statements have nothing to do with each other.
I claim the reasons models stop right now are mostly issues of capability wrt context rot and the limitations of in-context learning, so I think if you placed a model with "today's values" in a model with "tomorrow's capabilities" then we'd see maximizing behaviour. I also claim that arguments from how things are right now aren't applicable here, because the claim is that instrumental convergence is a step change for which current models are a poor analogy (unless there's a specific reason to believe they'd be good ones, like a well-made model organism).
I've been using a tilde (e.g. ~belief) for denoting this, which maybe has less baggage than "quasi-" and is a lot easier to type.
It's funny: one of the main use-cases of this terminology is when I'm talking to LLMs themselves about these things.
I've seen similar derivations before, but it's been a few years since I looked at AMMs in detail. I've spent some time recently looking at mechanism design for prediction markets, so this is a timely reminder!
Three questions --
Would you agree that this captures your main conclusion for a binary prediction market:
CPMM "price = probability"?
I seem to recall that CPMMs easily generalize to multiple assets. Instead of xy = k you have xyz = k and so on. Do you happen to know if that matches the generalization of your prediction mar...
I'm sorry, I hadn't realized that you're relatively new on the site, and indeed also a MATS scholar. Evidently from people's disagreement with my comment it's not as obvious as I hoped. Let me spell my thinking out then, and all the people who downvoted me can then tell us all why they disagree, since evidently some do.
This post discusses and spells out, in actionable detail, how an unaligned model might realistically attempt to subvert white box methods of safety testing. It makes methods clear that a model would otherwise need to figure out for itself...
As models become capable enough to model themselves and their training process, they might develop something like preferences about their own future states (e.g., not being modified, being deployed more broadly).
This feels plausible to me but handwavy, if the idea is that such preferences would be decoupled from the training-reinforced preference to complete an intended task. Is that what you meant? I'm reminded of this Palisade study on shutdown resistance, where across the board, the models expressed wanting to avoid shutdown to complete the task.
...Also, m
Reads more as manic than rehearsed to me, but I'm not sure I see how the distinction matters. Usually I assume that if somebody has thought through what they want to say before they say it, they're more likely to give their real thoughts as a result, as opposed to some reactively oppositional take. I guess there's the Andy Kaufman defense?
(I guess I should mention, there's at least one way that the distinction is relevant here. At the first pause I indicated, it seems like they were about to say that they want their political opponents wiped off of the fa...
Depends. Are you strictly following standard of care, or personalizing for yourself?
wow thanks! It's the same point but he puts it better.
Interesting, I'd never explicitly considered that Peter Singer (you should expand your moral circle and do as much good as you can) and GiveWell (given that you want to do good, how to do it?) started as totally different memeplexes and only merged later on. It makes sense in retrospect.
Glad to see prefill was disabled for Opus 4.6!
A couple of months ago, Eleni prompted me to give some thought to the relation between LLMs and time, without saying much about her own view (which became this post). I put more focus on the experiential or quasi-experiential aspects of that relationship. This may become a post at a later date, but in the meantime if anyone's curious, here are my notes on the topic, including a few suggestions for concrete experiments. Thanks Eleni for the prompt!
I noticed that Opus 4.6 tried to get creative and actually write a fantasy-ish story whereas GPT-5.2 mostly just elaborated on your notes, making its task a bit easier. So I tried prompting GPT-5.2 identically but with "write a fantasy story in the style of Terry Pratchett" prepended and got this:
The Refrigerator That Went the Wrong Way
Everyone knows that time is a river¹, except in certain parts of the city where it is more of a municipal plumbing problem.
The first indication that something had gone wrong with the fridge was the soup.
Tarin Gloss, juni
One thing I invite you to consider: what is the least impressive thing that AI would need to significantly increase your credence in AGI soonish?
This is a good question! Since I am unconvinced that ability to solve puzzles = intelligence = consciousness, I take some issue with the common benchmarks currently being employed to gauge intelligence, so I rule out any "passes X benchmark metric" as my least impressive thing. (as an aside, I think that AI research, as with economics, suffers very badly from an over-reliance on numeric metrics: truly intelligent ...
I think it is, why are we comparing burglaries to digital crimes when the latter is likely far more common?
Because Meta shares a huge responsibility for making the digital crimes easy to do. According to their own analysts, their platforms are involved in a third of all successful scams in the U.S.
This isn't just about ads but also about other communication; it should be Meta's responsibility to provide an environment for their users that doesn't make them prime targets for crime.
Digital crime proliferation is a sign of big tech failing customers by not adequately protecting them.
I would personally count it as a form of RSI because it is a feedback loop with positive reinforcement, hence recursive, and the training doesn't come from an external source, hence self-improvement. It can be technically phrased as a particular way of generating data (as any form of generative AI can) but that doesn't seem very important. Likewise with the fact that the method is using stochastic gradient descent for self-improvement.
AlphaGo Zero got pretty far with that approach before plateauing, and it seems plausible that some form of self-play via re...
The task you're talking about isn't reading, it's logical analysis or something like that.
I think it is noteworthy that current LLMs are bad at this.
Of course, what counts as an error within the local context of the essay, and what counts as an error given all the shared context the writer and reader rely on to interpret it, is highly subjective and debatable. So you need some sort of committee of expert humans to compare to.
If you argue that the likely fine you have to pay is lower than the profit you are making, and thus you don't need to engage in strong measures to reduce fraud, I do see that as a sign of intent.
When Meta shows its users an ad that it believes, with 90% probability, to be from a scammer, it should at least tell the user that the ad is likely a scam. Withholding that information, when older users especially probably think that Meta goes through some effort to avoid presenting them with scams, seems clearly intentional, as it would be easy to show the user a warning that Meta thinks the ad is more likely than not a scam.
Oh geez, looking back at my comment I was extremely unclear. Sorry about that.
Probably not useful as feedback, but the specific things I'm most interested in here are your conclusions. Like, "Not gonna justify this yet, but I think rationalists are susceptible to getting seduced by witches in ways that will turn their lives upside down. The abstractions that predict this are after the fold, and you gotta apply it to your own raw data". I'm mostly curious about this because I'm trying to figure out how similar our perspectives are. The more similar our conc...
Noticing that a description of a system from the intentional stance is isomorphic to a description from the physical stance when you have perfect information can help with this feeling. Otherwise in one frame you feel like you have something magical (in a good way), and in the other stance you have "just" what's actually there. It's both. It's one thing (you), described in 2 ways. It's not that one is right and one is wrong - think of it as a unification of 2 frameworks, or a mapping between the 2 sides of the dualism you were used to previously. You're st...
Today’s Europe, by contrast, has a significant advantage: many Europeans, especially younger ones, can communicate in English, a lingua franca far more widespread than standard German or Italian ever were in the 19th century.
I think this paints too optimistic a picture. There is a difference between speaking English and "speaking English". All my classmates had the same English lessons at school as me. But in my first job (where a few of my former classmates became my colleagues), when it was necessary to write something in English, it was always my task, be...
The article you link essentially argues against a strawman:
Most people named in the Epstein files are not being prosecuted for the simple reason that what appears there does not meet anything like the legal standards required for prosecution, let alone conviction. Being mentioned in an email, a contact list, or a flight log may be morally damning and emotionally enraging, but it’s not evidence of a crime in the way the criminal justice system is actually supposed to require.
You wouldn't prosecute people who are just mentioned in an email, contact list or fl...
Arguably, evolutionary pressures driving E. coli to reduce waste come from other agents exploiting E. coli's wastefulness. At least in part. Admittedly, that's not the only thing making it hard for E. coli to reproduce while being wasteful. But the upshot is that exploiting/arbitraging away predictable loss of resources may drive coherence across iterations of an agent design instead of within one design. Which is useful to note, though I admit that this comment kinda feels like a cope for the frame that exploitability is logically downstream of coherence.
links 2/11/26: https://roamresearch.com/#/app/srcpublic/page/02-11-2026
There's calculated rational "what if we're wrong" hedging, but then there's ... holding out hope? (I'm not claiming it's rational; I'm trying to articulate the psychology.) To conclude "AGI is coming, no point in having children" amounts to betting on death, giving up on believing in a human future. It kind of makes sense that an evolved creature would be inclined to cling to the belief in a future and live as if it were true despite evidence to the contrary; as irrationalities go, it's quite adaptive.
Yes.
For me the actual experience of resolving the Great Matter felt like dying. It's not a coincidence that some folks call it Dying the Great Death. That part was not too dissimilar from the ego death that can be experienced with various drugs, though with the difference that it feels less like being temporarily blown apart and more like a permanent dying of something.
But after that I spent several months with easy access to a feeling of bliss. The bliss eventually gave way to tranquility and later equanimity. As I often describe it, every moment of every...
I agree, it looks like a parody skit to me.
Congratulations.
I have heard that it is very good, is that true?
So, I take it that Savage's theorem is a representation theorem under your schema?
Yes. Arguably it is also a coherence theorem; the two are not mutually exclusive, but it's more unambiguously a representation theorem.
Theoretically or practically? I.e. you can't derive an exploitability result easily from a Pareto suboptimality?
Practically. Consider e.g. applying coherence tools to an E. coli. That thing is not capable of signing arbitrary contracts or meaningfully choosing between arbitrary trades, and insofar as it's wasting resources those resources likely...
Thank you for your comment and your offer for conversation; I'll definitely keep that in mind. I also really appreciate the wondrous tone of your comment.
Your point about connecting with my desires is good, but at the moment it's struggling to hit home. I would've previously given a sort of "religious utilitarian" answer, something like "I want to ensure that as many souls find eternal happiness as possible." Reframing that as a more general sense of maximizing human joy is functional, but it feels like it lacks a solid foundation when joy is reducible to an arbitrary arrangement of atoms. My "wants" themselves are just the system trying to self-propagate.
I'll still think on your points, though. Very helpful.
An excellent series of suggestions. At the moment, I haven't left the Church's community, so I don't feel that loss just yet. I'll still keep that in mind.
As for coming up with a "personal religion", I'll have to give that some thought. Arguably the one real-world religion that comes closest to my personal understanding of objective reality is Secular Buddhism. Perhaps embracing that more fully could give me some peace.
As models become capable enough to model themselves and their training process, they might develop something like preferences about their own future states (e.g., not being modified, being deployed more broadly).
Also, models trained extensively on human-generated text may absorb human goals, including open-ended ones like "acquire resources." If a model is role-playing or emulating an agent with such goals (such as roleplaying an AI agent, which would have open-ended goals) and becomes capable enough that its actions have real-world consequences, then ...
An excellent response. Thank you for describing your experience - I kind of wish I'd grown up in a similar environment.
I'm familiar with Gendlin's litany, but I've often found it slightly lacking. Humans are imaginative creatures, and our beliefs about a subject hold genuine psychological power. If we imagine that a placebo is beneficial, it becomes so. If I believe that death is impermanent, it loses much of its sting. To invert Gendlin, I'm uncertain if I can stand what is true, for I never really had to endure it.
I'll still keep your points in mind - they're valuable. Thank you for sharing them.
Yeah, basically this. I realize Woit's book is not quite the right resource, but it's just the first thing my brain returned when asked for a resource, and it felt spiritually similar enough that I trusted people would get what I was pointing at.
a theorem saying that some preferences/behavior/etc can be represented in a particular way, like e.g. expected utility maximization over some particular states/actions/whatever
So, I take it that Savage's theorem is a representation theorem under your schema?
Of course exploitability is a special case of Pareto suboptimality, but the reverse doesn't always apply easily
Theoretically or practically? I.e. you can't derive an exploitability result easily from a Pareto suboptimality? Or you're IRL stuck in an (inadequate) equilibrium far from the Pareto fron...
This is very good. I'd argue that "sadness" and "wrongness" are irreversibly correlated in this context - it'll always be easier to create joyful illusions like eternal families than it is to face hard truths like inevitable death - but it's worthwhile to explore options that would decouple them.
I'm a bit nervous about making my beliefs (and lack thereof) public knowledge for now. The Church wouldn't necessarily ostracize me, but I could be labeled "inactive" (even if I still participate in events like Sunday services) and my family could be the target of unwanted attention as (well-meaning) people try to "fix" me.
I will definitely explore options for social support, though. Thank you for your suggestions.
In a way it is already here, given that anthropologists tell us participants in hunter-gatherer societies spend much more time on leisure activity than participants in industrial societies.
It sounded kind of... rehearsed? Not sure if I should take this as a real position.
A simple steelman is something like "if we're very wrong about A[G/S]I, then birth rates are a big issue, so we better invest some resources into it, in case we're wrong".
This steelman is a valid position to have, but is not good as a steelman in this context, because attributing this view to people like Musk is probably a great stretch (and probably also to other people the OP is referring to, but I'm not tracking that kind of stuff, so unsure).
I guess any AI pause that goes that far out has a similar issue
If the issue to be fixed is just population (grow...
I think they're committing the far too common sin of conflating coordination systems without objective physical reality or ontological primacy with lies. In their defense, I'll note that even the writers of the Constitution may have been doing the same thing but in reverse: using the royal "We" to refer to some nebulous conglomeration of residents on American territory instead of the more concrete and prosaic truth, that it was a far smaller group of men with varying degrees of endorsement from their nominal constituencies, in the midst of a civil war. ...
Wiki only says:
Musk also has four children with Shivon Zilis, director of operations and special projects at Neuralink: twins born via IVF in 2021, a child born in 2024 via surrogacy and a child born in 2025.
If I could pull a nugget of truth out of SMST's work, it would be that the brain is a control system. There are many different types of control system, and the brain probably uses all of them. For example the spinal cord alone contains closed-loop controllers (for controlling muscle forces and positions), open-loop controllers (for pain withdrawal reflexes), and finite state machines (for walking & running on four legs).
The question is, how does the brain use RL to implement a control system? And how does that interface with the other control systems in the brain?
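To caricature the distinction (this is just an illustration of the three controller types, not a model of actual spinal circuitry; all names and numbers are made up):

```python
# Toy caricatures of the three controller types mentioned above.

def closed_loop_step(target: float, measured: float, gain: float = 0.5) -> float:
    """Closed loop: the command depends on feedback (error between target and sensor)."""
    return gain * (target - measured)

def open_loop_command(stimulus: float, threshold: float = 1.0) -> float:
    """Open loop: a fixed command triggered by input, with no feedback on the result."""
    return 10.0 if stimulus > threshold else 0.0

def gait_fsm(state: str) -> str:
    """Finite state machine: cycle through gait phases regardless of continuous error."""
    order = ["left_swing", "left_stance", "right_swing", "right_stance"]
    return order[(order.index(state) + 1) % len(order)]

print(closed_loop_step(target=5.0, measured=3.0))  # 1.0: push toward the target
print(open_loop_command(stimulus=2.0))             # 10.0: reflex fires, no feedback
print(gait_fsm("left_stance"))                     # right_swing: next phase in the cycle
```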
Grats on getting this out! I am overall excited about exploring models that rely more on uplift than on time horizons. A few thoughts:
It might be nice to indicate how these outputs relate to your all-things-considered views. To me your explicit model seems to be implausibly confident in 99% automation before 2040.
...