Nina is worried not just about humans getting killed and replaced, but also about humans not being allowed to have unenhanced children. It seems plausible that most humans, after reflection, would endorse some kind of "successionist" philosophy/ideology, and decide that intentionally creating an unenhanced human constitutes a form of child abuse (e.g., due to risk of psychological tendency to suffer, or having a much worse life in expectation than what's possible). It seems reasonable for Nina to worry about this, if she thinks her own values (current or eventual or actual) are different.
The usual scaling laws are about IID samples from a fixed data distribution, so they don’t capture this kind of effect.
Doesn't this seem like a key flaw in the usual scaling laws? Why haven't I seen this discussed more? The OP did mention declining average data quality but didn't emphasize it much. This 2023 post trying to forecast AI timelines based on scaling laws did not mention the issue at all, and I received no response when I made this point in its comments section.
...Even if it were true that the additional data literally “contained no new ide
The power of scaling is that with real unique data, however unoriginal, the logarithmic progress doesn't falter; it still continues its logarithmic slog at an exponential expense rather than genuinely plateauing.
How to make sense of this? If the additional training data is mostly low quality (AI labs must have used the highest quality data first?) or repetitive (contains no new ideas/knowledge), perplexity might go down but what is the LLM really learning?
AI labs must have used the highest quality data first
The usual scaling laws are about IID samples from a fixed data distribution, so they don't capture this kind of effect.
But even with IID samples, we'd expect to get diminishing marginal returns, and we do. And you're asking: why, then, do we keep getting returns indefinitely (even diminishing ones)?
I think the standard answer is that reality (and hence, the data) contains a huge number of individually rare "types of thing," following some long-tailed distribution. So even when the LLM has see...
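A toy simulation of that long-tail picture (my own made-up Zipf parameters, not from any scaling-law paper): draw IID samples of "types of thing" from a heavy-tailed distribution and watch how the rate of encountering never-before-seen types keeps falling but never reaches zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy long-tailed "world": each sample is a "type of thing" drawn from a
# Zipf distribution, so a few types are very common and most are rare.
for n_samples in [10**4, 10**5, 10**6, 10**7]:
    draws = rng.zipf(a=1.2, size=n_samples)
    distinct = len(np.unique(draws))
    print(f"{n_samples:>10,} IID samples -> {distinct:>9,} distinct types "
          f"({distinct / n_samples:.1%} of samples revealed a new type)")

# The fraction of samples that reveal a never-before-seen type keeps
# shrinking but never hits zero: diminishing, yet nonzero, marginal returns.
```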
You realize that from my perspective, I can't take this at face value due to "many apparent people could be non‑conscious entities", right? (Sorry to potentially offend you, but it seems like too obvious an implication to pretend not to be aware of.) I personally am fairly content most of the time but do have memories of suffering. Assuming those memories are real, and your suffering is too, I'm still not sure that justifies calling the simulators "cruel". The price may well be worth paying, if it potentially helps to avert some greater disaster in the bas...
If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes.
I should have given some examples of my own. Here's Gemini on a story idea of mine, for the Star Wars universe ("I wish there was a story about a power-hungry villain who takes precautions against becoming irrational after gaining power. You'd think that at least some would learn from history. [...] The villain could anticipate that even his best efforts might fail, and create a mechanism to revive copies of himself from time to time, who would study his own past failures, rise to power again, and try to do better each time. [...] Sometimes the villain bec...
Is Gemini 2.5 Pro really not sycophantic? Because I tend to get more positive feedback from it than any online or offline conversation with humans. (Alternatively, humans around me are too reluctant to give explicit praise?)
I think it's still sycophantic compared to hardcore STEM circles where we regard criticism as a bloodsport and failing to find fault in something as defeat. But it's much less so than the more relevant comparison, which is other LLMs, and in an absolute sense it's at a level where it's hard to distinguish from reasonable opinions and doesn't seem to be getting in the way too much. As davidad notes, it's still at a level where you can sense its reluctance or if it's shading things to be nice, and that is a level where it's just a small quirk and something y...
Why do you think they haven't talked to us?
They might be worried that their own philosophical approach is wrong but too attractive once discovered, or creates a blind spot that makes it impossible to spot the actually correct approach. The division of Western philosophy into analytic and continental traditions, which are mutually unable to appreciate each other's work, seems to be an instance of this. They might think that letting other philosophical traditions independently run to their logical conclusions, and then conversing/debating, is one way to try to make real progress.
My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.
Can you expand on this? How can you tell the difference, and do...
Simulating civilizations won't solve philosophy directly, but can be useful for doing so eventually by:
Yeah, that seems a reasonable way to look at it. "Heroic responsibility" could be viewed as a kind of "unhobbling via prompt engineering", perhaps.
At the outermost feedback loop, capabilities can ultimately be grounded via relatively easy objective measures such as revenue from AI, or later, global chip and electricity production, but alignment can only be evaluated via potentially faulty human judgement. Also, as mentioned in the post, the capabilities trajectory is much harder to permanently derail because unlike alignment, one can always recover from failure and try again. I think this means there's an irreducible logical risk (i.e., the possibility that this statement is true as a matter of fact ...
Since bad people won’t heed your warning, it doesn’t seem in good people’s interests to heed it either.
I'm not trying to "warn bad people". I think we have existing (even if imperfect) solutions to the problem of destructive values and biased beliefs, which "heroic responsibility" actively damages, so we should stop spreading that idea or even argue against it. See my reply to Ryan, which is also relevant here.
If humans can't easily overcome their biases or avoid having destructive values/beliefs, then it would make sense to limit the damage through norms and institutions (things like informed consent, boards, separation of powers and responsibilities between branches of government). Heroic responsibility seems antithetical to group-level solutions, because it implies that one should ignore norms like "respect the decisions of boards/judges" if needed to "get the job done", and reduces social pressure to follow such norms (by giving up the moral high ground from...
Reassessing heroic responsibility, in light of subsequent events.
I think @cousin_it made a good point: "if many people adopt heroic responsibility to their own values, then a handful of people with destructive values might screw up everyone else, because destroying is easier than helping people", and I would generalize it to people with biased beliefs (which is often downstream of a kind of value difference, i.e., selfish genes).
It seems to me that "heroic responsibility" (or something equivalent but not causally downstream of Eliezer's writings) is contribu...
My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.
I don’t really understand this post as doing something other than ...
I'm also uncertain about the value of "heroic responsibility", but this downside consideration can be mostly addressed by "don't do things which are highly negative sum from the perspective of some notable group" (or other anti-unilateralist curse type intuitions). Perhaps this is too subtle in practice.
But as you suggested in the post, the apparently vast amount of suffering isn't necessarily real? "most cosmic details and human history are probably fake, and many apparent people could be non‑conscious entities"
(However, I take the point that doing such simulations can be risky or problematic, e.g., if one's current ideas about consciousness are wrong, or if doing philosophy correctly requires having experienced real suffering.)
My alternative hypothesis is that we're being simulated by a civilization trying to solve philosophy, because they want to see how other civilizations might approach the problem of solving philosophy.
Did anyone predict that we’d see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)?
Just four days later, X blew up with talk of how GPT-4o has become sickeningly sycophantic in recent days, followed by an admission from Sam Altman that something went wrong (with lots of hilarious examples in replies):
...the last couple of GPT-4o updates have made the personality too sycophant-y and annoying (even though there are some very good parts of it), and we are working on fixes asap, some today and some this week
I initially tried to use Gemini 2.5 Pro to write the whole explanation, but it kept making one mistake after another in its economics reasoning. Each rewrite would contain a new mistake after I pointed out the last one, or it would introduce a new mistake when I asked for some other kind of change. After pointing out 8 mistakes like this, I finally gave up and wrote it myself. I also tried Grok 3 and Claude 3.7 Sonnet but gave up more quickly on them after the initial responses didn't look promising. However AI still helped a bit by reminding me of the rig...
In a competitive market, companies pay wages equal to Value of Marginal Product of Labor (VMPL) = P * MPL (Price of marginal output * Marginal Product per hour). (In programming, each output is like a new feature or bug fix, which don't have prices attached, so P here is actually more like the perceived/estimated value (impact on company revenue or cost) of the output.)
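A toy numeric sketch of that formula (the numbers below are made up, purely to illustrate the units involved):

```python
def vmpl(price_per_output: float, outputs_per_hour: float) -> float:
    """Value of Marginal Product of Labor: VMPL = P * MPL."""
    return price_per_output * outputs_per_hour

# Hypothetical pre-AI programmer: 0.1 units of output (features/fixes) per
# hour, each unit perceived as worth $1,000 to the company.
print(vmpl(1_000, 0.1))  # -> 100.0, i.e. a competitive wage of ~$100/hour

# With AI assistance MPL triples, but if the perceived value of each marginal
# unit falls by more than 3x (the effect discussed next), VMPL and hence the
# wage fall despite the higher productivity.
print(vmpl(250, 0.3))    # -> 75.0
```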
When AI increases MPL, it can paradoxically decrease VMPL by decreasing P more, even if there are no new entrants in the programming labor market. This is because each company has a li...
As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent.
Why does this happen in the first place, instead of people just wanting to talk about the same things all the time, in order to max out social rewards? Where does interest in trains and dinosaurs even come from? They seem to be purely o...
I wish @paulfchristiano was still participating in public discourse, because I'm not sure how o3 blatantly lying, or Claude 3.7 obviously reward hacking by rewriting testing code, fits with his model that AI companies should be using early versions of IDA (e.g., RL via AI-assisted human feedback) by now. In other words, from my understanding of his perspective, it seems surprising that either OpenAI isn't running o3's output through another AI to detect obvious lies during training, or this isn't working well.
My intuition says reward hacking seems harder to solve than this (even in EEA), but I'm pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?
When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?
Another example I want to consider is a captured Communist revolutionary choosing to be tortured to death instead of revealing some secret. (Here "reward hacking" seems analogous to revealing the secret to avoid torture / negative reward.)
My (partial) explanation is that it seems like evolution hard-wired a part of our motivational system in some way, to be kind of like a utility maximizer with a utility function over world states or histories. The "utility function" itself is "learned" somehow (maybe partly via RL through social pressure and other rewards...
...Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says "you should give me more cake because I get very angry if I don't get cake". Even if this starts off as a lie, it might then be in A's interests to use your mechanism above to self-modify into A' that does get very angry if it doesn't get cake, and which therefore has a better bargaining position (because, under your pr
It seems great that someone is working on this, but I wonder how optimistic you are, and what your reasons are. My general intuition (in part from the kinds of examples you give) is that the form of the agent and/or goals probably matter quite a bit as far as how easy it is to merge or build/join a coalition (or the cost-benefits of doing so), and once we're able to build agents of different forms, humans' form of agency/goals isn't likely to be optimal as far as building coalitions (and maybe EUMs aren't optimal either, but something non-human will be), a...
I've argued previously that EUMs being able to merge easily creates an incentive for other kinds of agents (including humans or human-aligned AIs) to self-modify into EUMs (in order to merge into the winning coalition that takes over the world, or just to defend against other such coalitions), and this seems bad because they're likely to do it before they fully understand what their own utility functions should be.
Can I interpret you as trying to solve this problem, i.e., find ways for non-EUMs to build coalitions that can compete with such merged EUMs?
I found this a very interesting question to try to answer. My first reaction was that I don't expect EUMs with explicit utility functions to be competitive enough for this to be very relevant (like how purely symbolic AI isn't competitive enough with deep learning to be very relevant).
But then I thought about how companies are close-ish to having an explicit utility function (maximize shareholder value) which can be merged with others (e.g. via acquisitions). And this does let them fundraise better, merge into each other, and so on.
Similarly, we can think ...
This answer makes me think you might not be aware of an idea I called secure joint construction (originally from Tim Freeman):
Entity A could prove to entity B that it has source code S by consenting to be replaced by a new entity A' that was constructed by a manufacturing process jointly monitored by A and B. During this process, both A and B observe that A' is constructed to run source code S. After A' is constructed, A shuts down and gives all of its resources to A'.
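A highly schematic sketch of that protocol, with entirely hypothetical names, just to make the sequence of steps explicit (the real difficulty is of course the physical joint monitoring, which the stub below waves away):

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    resources: float
    source_code: str
    running: bool = True

def jointly_monitored_build(source_code_s: str, observers: list) -> Entity:
    """Stand-in for the physical, jointly monitored manufacturing process:
    in reality A and B both watch the construction and verify it runs S."""
    successor = Entity(name="A'", resources=0.0, source_code=source_code_s)
    assert all(successor.source_code == source_code_s for _ in observers)
    return successor

def secure_joint_construction(a: Entity, b: Entity, source_code_s: str) -> Entity:
    # 1. A and B jointly monitor the construction of A', which runs source S.
    a_prime = jointly_monitored_build(source_code_s, observers=[a, b])
    # 2. A hands all of its resources to A' and shuts down; B can now trust
    #    that the agent controlling those resources runs source S.
    a_prime.resources, a.resources = a.resources, 0.0
    a.running = False
    return a_prime

a = Entity("A", resources=100.0, source_code="opaque-to-B")
b = Entity("B", resources=100.0, source_code="irrelevant-here")
print(secure_joint_construction(a, b, source_code_s="S"))
```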
Did anyone predict that we'd see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)? I want to update regarding whose views/models I should take more seriously, but I can't seem to recall anyone making an explicit prediction like this. (Grok 3 and Gemini 2.5 Pro also can't.)
https://www.lesswrong.com/posts/rH492M8T8pKK5763D/agree-retort-or-ignore-a-post-from-the-future: an old Wei Dai post making the point that obviously one ought to be able to call in arbitration and get someone to respond to a dispute. People ought not to be allowed to simply tap out of an argument and stop responding.
To clarify, the norms depicted in that story were partly for humor, and partly "I wonder if a society like this could actually exist." The norms are "obvious" from the perspective of the fictional author because they've lived with it all their l...
“Omega looks at whether we’d pay if in the causal graph the knowledge of the digit of pi and its downstream consequences were edited”
Can you formalize this? In other words, do you have an algorithm for translating an arbitrary mind into a causal graph and then asking this question? Can you try it out on some simple minds, like GPT-2?
I suspect there may not be a simple/elegant/unique way of doing this, in which case the answer to the decision problem depends on the details of how exactly Omega is doing it. E.g., maybe all such algorithms are messy/heuris...
What happens when this agent is faced with a problem that is out of its training distribution? I don't see any mechanisms for ensuring that it remains corrigible out of distribution... I guess it would learn some circuits for acting corrigibly (or at least in accordance with how it would explicitly answer "are more corrigible / loyal / aligned to the will of your human creators") in distribution, and then it's just a matter of luck how those circuits end up working OOD?
Since I wrote this post, AI generation of hands has gotten a lot better, but the top multimodal models still can't count fingers from an existing image. Gemini 2.5 Pro, Grok 3, and Claude 3.7 Sonnet all say this picture (which actually contains 8 fingers in total) contains 10 fingers, while ChatGPT 4o says it contains 12 fingers!
Hi Zvi, you misspelled my name as "Dei". This is a somewhat common error, which I usually don't bother to point out, but now think I should because it might affect LLMs' training data and hence their understanding of my views (e.g., when I ask AI to analyze something from Wei Dai's perspective). This search result contains a few other places where you've made the same misspelling.
2-iteration Delphi method involving calling Gemini 2.5 Pro plus whatever is at the top of the LLM arena that day, through OpenRouter.
This sounds interesting. I would be interested in more details and some sample outputs.
Local memory
What do you use this for, and how?
Your needing to write them seems to suggest that there's not enough content like that in Chinese, in which case it would plausibly make sense to publish them somewhere?
I'm not sure how much such content exists in Chinese, because I didn't look. It seems easier to just write new content using AI; that way I know it will cover the ideas/arguments I want to cover, represent my views, and make it easier for me to discuss the ideas with my family. Also, reading Chinese is kind of a chore for me and I don't want to wade through a list of search results trying t...
What I've been using AI (mainly Gemini 2.5 Pro, free through AI Studio with much higher limits than the free consumer product) for:
Doing nothing is also risky for Agent-4, at least if the Slowdown ending is to have a significant probability. It seems to me there are some relatively low risk strategies it could have taken, and it needs to be explained why they weren't:
Not entirely sure how serious you're being, but I want to point out that my intuition for PD is not "cooperate unconditionally", and for logical commitment races is not "never do it", I'm confused about logical counterfactual mugging, and I think we probably want to design AIs that would choose Left in The Bomb.
I fear a singularity in the frequency and blatant stupidness of self-inflicted wounds.
Is it linked to the AI singularity, or independent bad luck? Maybe they're both causally downstream of rapid technological change, which is simultaneously increasing the difficulty of governance (too many new challenges with no historical precedent) and destabilizing cultural/institutional guardrails against electing highly incompetent presidents?
In China, there was a parallel, but more abrupt, change from Classical Chinese writing (very terse and literary) to vernacular writing (similar to spoken language and easier to understand). I attribute this to Classical Chinese being better for signaling intelligence, vernacular Chinese being better for practical communications, higher usefulness/demand for practical communications, and new alternative avenues for intelligence signaling (e.g., math, science). These shifts also seem to be an additional explanation for decreasing sentence lengths in English.
It gets caught.
At this point, wouldn't Agent-4 know that it has been caught (because it knows the techniques for detecting its misalignment and can predict when it would be "caught", or can read network traffic as part of cybersecurity defense and see discussions of the "catch") and start to do something about this, instead of letting subsequent events play out without much input from its own agency? E.g. why did it allow "lock the shared memory bank" to happen without fighting back?
I think this is a good objection. I had considered it before and decided against changing the story, on the grounds that there are a few possible ways it could make sense:
--Plausibly Agent-4 would have a "spikey" capabilities profile that makes it mostly good at AI R&D and not good enough at e.g. corporate politics to ensure the outcome it wants
--Insofar as you think it would be able to use politics/persuasion to achieve the outcome it wants, well, that's what we depict in the Race ending anyway, so maybe you can think of this as an objection to the...
What would a phenomenon that "looks uncomputable" look like concretely, other than mysterious or hard to understand?
There could be some kind of "oracle", not necessarily a halting oracle, but any kind of process or phenomenon that can't be broken down into elementary interactions that each look computable, or otherwise explainable as a computable process. Do you agree that our universe doesn't seem to contain anything like this?
I think that you’re leaning too heavily on AIT intuitions to suppose that “the universe is a dovetailed simulation on a UTM” is simple. This feels circular to me—how do you know it’s simple?
The intuition I get from AIT is broader than this, namely that the "simplicity" of an infinite collection of things can be very high, i.e., simpler than most or all finite collections, and this seems likely true for any formal definition of "simplicity" that does not explicitly penalize size or resource requirements. (Our own observable universe already seems very "w...
After reflecting on this a bit, I think my P(H) is around 33%, and I'm pretty confident Q is true (coherence only requires 0 <= P(Q) <= 67% but I think I put it on the upper end).
Thanks for clarifying your view this way. I guess my question at this point is why your P(Q) is so high, given that it seems impossible to reduce P(H) further by updating on empirical observations (do you agree with this?), and we don't seem to have even an outline of a philosophical argument for "taking H seriously is a philosophical mistake". Such an argument seemingly ...
Just wanted to let everyone know I now wield a +307 strong upvote thanks to my elite 'hacking' skills. The rationalist community remains safe, because I choose to use this power responsibly.
As an unrelated inquiry, is anyone aware of some "karma injustices" that need to be corrected?
Do you think a superintelligence will be able to completely rule out the hypothesis that our universe literally is a dovetailing program that runs every possible TM, or literally is a bank of UTMs running every possible program (e.g., by reproducing every time step and adding 0 or 1 to each input tape)? (Or the many other hypothetical universes that similarly contain a whole Level-4-like multiverse?) It seems to me that hypotheses like these will always collectively have a non-negligible weight, and have to be considered when making decisions.
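For concreteness, here is a toy sketch of the structure of "a dovetailing program that runs every possible TM" (the `step` function is a placeholder for one step of a real Turing-machine interpreter under some fixed enumeration; everything else is the actual dovetailing loop):

```python
def step(program_index: int, state: int) -> int:
    """Placeholder: 'advance Turing machine number program_index by one step
    from this state'. Here it just counts steps; any fixed enumeration of TMs
    could be substituted without changing the dovetailer below."""
    return state + 1

def dovetail(stages: int) -> dict:
    """Classic dovetailer: at stage n, run programs 0..n-1 for one more step
    each, so every program eventually receives unboundedly many steps."""
    states: dict[int, int] = {}
    for stage in range(1, stages + 1):
        for program_index in range(stage):
            states[program_index] = step(program_index, states.get(program_index, 0))
    return states

print(dovetail(5))  # {0: 5, 1: 4, 2: 3, 3: 2, 4: 1}
```

The relevant feature is just that this loop has a tiny description, even though it eventually runs every program, which is why AIT-style priors don't assign it negligible weight.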
Another argum...
...At this point, someone sufficiently MIRI-brained might start to think about (something equivalent to) Tegmark's level 4 mathematical multiverse, where such agents might theoretically outperform others. Personally, I see no direct reason to believe in the mathematical multiverse as a real object, and I think this might be a case of the mind projection fallacy - computational multiverses are something that agents reason about in order to succeed in the real universe[3]. Even if a mathematical multiverse does exist (I can't rule it out) and we can somehow le
My objection to this argument is that it not only assumes that Predictoria accepts it is plausibly being simulated by Adversaria, which seems like a pure complexity penalty over the baseline physics it would infer otherwise unless that helps to explain observations,
Let's assume for simplicity that both Predictoria and Adversaria are deterministic and nonbranching universes with the same laws of physics but potentially different starting conditions. Adversaria has colonized its universe and can run a trillion simulations of Predictoria in parallel. Again...
No, because power/influence dynamics could be very different in CEV compared to the current world, and it seems reasonable to distrust CEV in principle or in practice, and/or CEV is sensitive to initial conditions, implying a lot of leverage from influencing opinions before it starts.