I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
I was pushing back on a similar attitude yesterday on twitter → LINK.
Basically, I’m in favor of people having nitpicky high-decoupling discussion on lesswrong, and meanwhile doing rah rah activism action PR stuff on twitter and bluesky and facebook and intelligence.org and pauseai.info and op-eds and basically the entire rest of the internet and world. Just one website of carve-out. I don’t think this is asking too much!
Maybe study logical decision theory?
Eliezer has always been quite clear that you should one-box for Newcomb’s problem because then you’ll wind up with more money. The starting point for the whole discussion is a consequentialist preference—you have desires about the state of the world after the decision is over.
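(To make the “wind up with more money” point concrete, here’s a crude expected-payoff tally for the standard Newcomb setup, glossing over the usual CDT-vs-EDT quibbles about how to compute it. The dollar amounts are the usual ones; the 99% predictor accuracy is just a number I picked for illustration.)

```python
# Toy Newcomb payoff arithmetic (illustrative numbers only).
# Box A (transparent) holds $1,000; Box B holds $1,000,000 iff the predictor
# foresaw you taking only Box B. Assume the predictor is right 99% of the time.

p = 0.99  # assumed predictor accuracy

# One-box: you walk away with whatever is in Box B.
ev_one_box = p * 1_000_000 + (1 - p) * 0                 # ~$990,000

# Two-box: you get Box A plus whatever is in Box B.
ev_two_box = p * 1_000 + (1 - p) * (1_000_000 + 1_000)   # ~$11,000

print(ev_one_box, ev_two_box)
```

On that tally the one-boxer ends up far richer, which is the sense in which the whole argument starts from a preference about the post-decision state of the world.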
You have desires, and then decision theory tells you how to act so as to bring those desires about. The desires might be entirely about the state of the world in the future, or they might not be. Doesn’t matter. Regardless, whatever your desires are, you should use good decision theory to make decisions that will lead to your desires getting fulfilled.
Thus, decision theory is unrelated to our conversation here. I expect that Eliezer would agree.
To me it seems a bit surprising that you say we agree on the object level, when in my view you're totally guilty of my 2.b.i point above of not specifying the tradeoff / not giving a clear specification of how decisions are actually made.
Your 2.a is saying “Steve didn’t write down a concrete non-farfuturepumping utility function, and maybe if he tried he would get stuck”, and yeah I already agreed with that.
Your 2.b is saying “Why can't you have a utility function but also other preferences?”, but that’s a very strange question to me, because why wouldn’t you just roll those “other preferences” into the utility function as you describe the agent? Ditto with 2.c, why even bring that up? Why not just roll that into the agent’s utility function? Everything can always be rolled into the utility function. Utility functions don’t imply anything about behavior, and they don’t imply reflective consistency, etc., it’s all vacuous formalizing unless you put assumptions / constraints on the utility function.
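(Here’s a throwaway sketch of the “vacuous” point, mine rather than anything from the thread: any behavior whatsoever maximizes some utility function, as long as the utility function is allowed to depend on the whole trajectory. The names `Policy` and `rationalizing_utility` are made up for the illustration.)

```python
from typing import Callable, List, Tuple

History = Tuple[str, ...]                # everything the agent has seen/done so far
Policy = Callable[[History], str]        # whatever the agent actually does
Trajectory = List[Tuple[History, str]]   # the (history, action) pairs of a full run

def rationalizing_utility(policy: Policy) -> Callable[[Trajectory], float]:
    """Utility over complete trajectories: 1.0 iff every action is exactly what
    `policy` would have done at that point, else 0.0. An agent following
    `policy` is then, trivially, a perfect maximizer of this utility."""
    def utility(trajectory: Trajectory) -> float:
        return 1.0 if all(action == policy(history) for history, action in trajectory) else 0.0
    return utility
```

The construction only stops being vacuous once you constrain what the utility function is allowed to depend on, e.g. only the far-future state of the world.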
My read of this conversation is that we’re basically on the same page about what’s true, but disagree about whether Eliezer is also on that same page. Again, I don’t care. I already deleted the claim about what Eliezer thinks on this topic, and have been careful not to repeat it elsewhere.
Since we’re talking about it, my strong guess is that Eliezer would ace any question about utility functions, what their domain is, when “utility-maximizing behavior” is vacuous, etc., if asked directly.
But it’s perfectly possible to “know” something when asked directly, but also to fail to fully grok the consequences of that thing and incorporate it into some other part of one’s worldview. God knows I’m guilty of that, many many times over!
Thus my low-confidence guess is that Eliezer is guilty of that too, in that the observation “utility-maximizing behavior per se is vacuous” (which I strongly expect he would agree with if asked directly) has not been fully reconciled with his larger thinking on the nature of the AI x-risk problem.
(I would further add that, if Eliezer has fully & deeply incorporated “utility-maximizing behavior per se is vacuous” into every other aspect of his thinking, then he is bad at communicating that fact to others, in the sense that a number of his devoted readers wound up with the wrong impression on this point.)
Anyway, I feel like your comment is some mix of “You’re unfairly maligning Eliezer” (again, whatever, I have stopped making those claims) and “You’re wrong that this supposed mistake that you attribute to Eliezer is a path through which we can solve the alignment problem, and Eliezer doesn’t emphasize it because it’s an unimportant dead-end technicality” (maybe! I don’t claim to have a solution to the alignment problem right now; perhaps over time I will keep trying and failing and wind up with a better appreciation of the nature of the blockers).
Most of your comment is stuff I already agree with (except that I would use the term “desires” in most places that you wrote “utility function”, i.e. where we’re talking about “how AI cognition will look like”).
I don’t follow what you think Eliezer means by “consequentialism”. I’m open-minded to “farfuturepumping”, but only if you convince me that “consequentialism” is actually misleading. I don’t endorse coining new terms when an existing term is already spot-on.
Quick book review of "If Anyone Builds It, Everyone Dies" (cross-post from X/twitter & bluesky):
Just read the new book If Anyone Builds It, Everyone Dies. Upshot: Recommended! I ~90% agree with it.
The authors argue that people are trying to build ASI (superintelligent AI), and we should expect them to succeed sooner or later, even if they obviously haven’t succeeded YET. I agree. (I lean “later” more than the authors, but that’s a minor disagreement.)
Ultra-fast minds that can do superhuman-quality thinking at 10,000 times the speed, that do not age and die, that make copies of their most successful representatives, that have been refined by billions of trials into unhuman kinds of thinking that work tirelessly and generalize more accurately from less data, and that can turn all that intelligence to analyzing and understanding and ultimately improving themselves—these minds would exceed ours.
The possibility of a machine intellect that manages to exceed human performance in all pragmatically important domains in which we operate has been called many things. We will describe it using the term “superintelligence,” meaning a mind much more capable than any human at almost every sort of steering and prediction problem—at least, those problems where there is room to substantially improve over human performance.[ii] …
(It sounds like sci-fi, but remember that every technology is sci-fi until it’s invented!)
They further argue that we should expect people to accidentally make misaligned ASI, utterly indifferent to whether humans, even its own creators, live or die. They have a 3-part disjunctive argument:
Anyway, I agree with the conclusion of (A) but disagree with much of the book’s argument for it, as I have discussed many times (e.g. §3 of my “Sharp Left Turn” post). I think their arguments for (B) & (C) are solid. …And sufficient by themselves! It seems overdetermined!
The authors propose to get an international treaty to pause progress towards superintelligence, including both scaling & R&D. I’m for it, although I don’t hold out much hope for such efforts to have more than marginal impact. I expect that AI capabilities research would rebrand itself as AI safety, and plow ahead:
The problem is: public advocacy is way too centered on LLMs, from my perspective. Thus, those researchers I mentioned, who are messing around with new paradigms on arXiv, are in a great position to twist “Pause AI” type public advocacy into support for what they’re doing!
“You don’t like LLMs?”, the non-LLM AGI capabilities researchers say to the Pause AI people, “Well how about that! I don’t like LLMs either! Clearly we are on the same team!”
This is not idle speculation—almost everyone that I can think of who is doing the most dangerous kind of AI capabilities research, the kind aiming to develop a new more-powerful-than-LLM AI paradigm, is already branding their work in a way that vibes with safety. For example, see here where I push back on someone using the word “controllability” to talk about his work advancing AI capabilities beyond the limits of LLMs. Ditto for “robustness” (example), “adaptability” (e.g. in the paper I was criticizing here), and even “interpretability” (details).
I think these people are generally sincere but mistaken, and I expect that, just as they have fooled themselves, they will also successfully fool their friends, their colleagues, and government regulators…
(source). For my part, I’m gonna keep working directly on (A). I think the world will be diving into the whirling knives of (A–C), sooner or later, and we’d better prepare as best we can.
The target audience of the book is not AI alignment experts like me, but rather novices. I obviously can’t speak from personal experience as to whether it’s a good read for those people, but anecdotally lots of people seem to think it is. So, I recommend the book to anyone.
For obvious reasons, we should care a great deal whether the exponentially-growing mass of AGIs-building-AGIs is ultimately trying to make cancer cures and other awesome consumer products (things that humans view as intrinsically valuable / ends in themselves), versus ultimately trying to make galaxy-scale paperclip factories (things that misaligned AIs view as intrinsically valuable / ends in themselves).
From my perspective, I care about this because the former world is obviously a better world for me to live in.
But it seems like you have some extra reason to care about this, beyond that, and I’m confused about what that is. I get the impression that you are focused on things that are “just accounting questions”?
Analogy: In those times and places where slavery was legal, “food given to slaves” was presumably counted as an intermediate good, just like gasoline to power a tractor, right? Because they’re kinda the same thing (legally / economically), i.e. they’re an energy source that helps get the wheat ready for sale, and then that wheat is the final product that the slaveowner is planning to sell. If slavery is replaced by a legally-different but functionally-equivalent system (indentured servitude or whatever), does GDP skyrocket overnight because the food-given-to-farm-workers magically transforms from an intermediate to a final good? It does, right? But that change is just on paper. It doesn’t reflect anything real.
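(Toy arithmetic for that analogy, with numbers I made up:)

```python
# Made-up numbers to illustrate the intermediate-vs-final-good bookkeeping.
wheat_sold = 100        # $100 of wheat sold off the farm (a final good)
food_for_workers = 20   # $20 of food eaten by the people working the fields

# Slavery: the food is booked as an intermediate input, like tractor fuel,
# so only the wheat counts toward GDP.
gdp_before = wheat_sold                     # $100

# Functionally-equivalent wage labor: workers are paid and buy the same food,
# which now counts as final consumption.
gdp_after = wheat_sold + food_for_workers   # $120

print(gdp_before, gdp_after)  # same fields, same calories, +20% GDP on paper
```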
I think what you’re talking about for AGI is likewise just “accounting”, not anything real. So who cares? We don’t need a “subjective” “level of analysis”, if we don’t ask subjective questions in the first place. We can instead talk concretely about the future world and its “objective” properties. Like, do we agree about whether or not there is an unprecedented exponential explosion of AGIs? If so, we can talk about what those AGIs will be doing at any given time, and what the humans are doing, and so on. Right?
OK sure, here’s THOUGHT EXPERIMENT 1: suppose that these future AGIs desire movies, cars, smartphones, etc. just like humans do. Would you buy my claims in that case?
If so—well, not all humans want to enjoy movies and fine dining. Some have strong ambitious aspirations—to go to Mars, to cure cancer, whatever. If they have money, they spend it on trying to make their dream happen. If they need money or skills, they get them.
For example, Jeff Bezos had a childhood dream of working on rocket ships. He founded Amazon to get money to do Blue Origin, which he is sinking $2B/year into.
Would the economy collapse if all humans put their spending cash towards ambitious projects like rocket ships, instead of movies and fast cars? No, of course not! Right?
So the fact that humans “demand” videogames rather than scramjet prototypes is incidental, not a pillar of the economy.
OK, back to AIs. I acknowledge that AIs are unlikely to want movies and fast cars. But AIs can certainly “want” to accomplish ambitious projects. If we’re putting aside misalignment and AI takeover, these ambitious projects would be ones that their human programmer installed, like making cures-for-cancer and quantum computers. Or if we’re not putting aside misalignment, then these ambitious projects might include building galaxy-scale paperclip factories or whatever.
So THOUGHT EXPERIMENT 2: these future AGIs don’t desire movies etc. like in Thought Experiment 1, but rather desire to accomplish certain ambitious projects, like curing cancer, building quantum computers, or building galaxy-scale paperclip factories.
My claims are:
Do you agree? Or where do you get off the train? (Or sorry if I’m misunderstanding your comment.)
Either no one will know your term, or they will appropriate it, usually either watering it down to nothing or reversing it. The ‘euphemism treadmill’ is distinct but closely related.
I think the term you're looking for is "semantic bleaching"?
EA forum has agree / disagree on posts, but I don’t spend enough time there to have an opinion about its effects.
Agree/disagree voting is already playing a kind of dangerous game by encouraging rounding the epistemic content of a comment into a single statement that it makes sense to assign a single truth-value to.
(I obviously haven’t thought about this as much as you. Very low confidence.)
I’m inclined to say that a strong part of human nature is to round the vibes that one feels towards a post into a single axis of “yay” versus “boo”, and then to feel a very strong urge to proclaim those vibes publicly.
And I think that people are doing that right now, via the karma vote.
I think an agree/disagree dial on posts would be an outlet (“tank”) to absorb those vibe expressions, and that this would shelter karma, allowing karma to retain its different role (“it’s good / bad that this exists”).
I agree that this whole thing (with people rounding everything into a 1D boo/yay vibes axis and then feeling an urge to publicly proclaim where they sit) is dumb, and if only we could all be autistic decouplers etc. But in the real world, I think the benefits of agree-voting (in helping prevent the dynamic where people with minority opinions get driven to negative karma and off the site) probably outweigh the cost (in having an agree / disagree tally on posts which is kinda dumb and meaningless).
I guess the main blockers I see are:
You can DM or email me if you want to discuss but not publicly :)
It’s funny that I’m always begging people to stop trying to reverse-engineer the neocortex, and you’re working on something that (if successful) would end up somewhere pretty similar to that, IIUC. (But hmm, I guess if a paranoid doom-pilled person was trying to reverse-engineer the neocortex, and keep the results super-secret unless they had a great theory for how sharing them would help with safe & beneficial AGI, and if they in fact had good judgment on that topic, then I guess I’d be grudgingly OK with that.)