Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of the alignment problem, the parts that are easy to verify, but alignment as a whole will still be bottlenecked on philosophically confusing, hard-to-verify ...
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
[...]
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs.
It currently seems unlikely to me that the marginal AI control research I'm excited about is very economically useful. I agree that some control or control-adjacent research will end up being at least somewhat ec...
And I agree with Bryan Caplan's recent take that friendships are often a bigger conflict of interest than money, so Open Phil higher-ups being friends with Anthropic higher-ups is troubling.
No kidding. From https://www.openphilanthropy.org/grants/openai-general-support/:
OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela.
Wish OpenPhil and EAs in general were more willing to reflect/talk publicly about their mistake...
To be clear, by "indexical values" in that context I assume you mean indexing on whether a given world is "real" vs "counterfactual," not just indexical in the sense of being egoistic? (Because I think there are compelling reasons to reject UDT without being egoistic.)
I think being indexical in this sense (while being altruistic) can also lead you to reject UDT, but it doesn't seem "compelling" that one should be altruistic this way. Want to expand on that?
Maybe breaking up certain biofilms held together by Ca?
Yeah there's a toothpaste on the market called Livfree that claims to work like this.
IIRC, high EDTA concentration was found to cause significant amounts of erosion.
Ok, that sounds bad. Thanks.
ETA: Found an article that explains how Livfree works in more detail:
...Tooth surfaces are negatively charged, and so are bacteria; therefore, they should repel each other. However, salivary calcium coats the negative charges on the tooth surface and bacteria, allowing them to get very close (within 10 nm).
I actually no longer fully endorse UDT. It still seems like a better approach to decision theory than any other specific approach I know of, but it has a bunch of open problems, and I'm not very confident that someone won't eventually find a better approach that replaces it.
To your question, I think if my future self decides to follow (something like) UDT, it won't be because I made a "commitment" to do it, but because my future self wants to follow it, because he thinks it's the right thing to do, according to his best understanding of philosophy and normativity...
This means that in the future, there will likely be a spectrum of AIs of varying levels of intelligence, some much smarter than humans, others only slightly smarter, and still others merely human-level.
Are you imagining that the alignment problem is still unsolved in the future, such that all of these AIs are independent agents unaligned with each other (like humans currently are)? I guess in my imagined world, ASIs will have solved the alignment (or maybe control) problem at least for less intelligent agents, so you'd get large groups of AIs aligned wi...
It therefore seems perfectly plausible for AIs to simply get rich within the system we have already established, and make productive compromises, rather than violently overthrowing the system itself.
So assuming that AIs get rich peacefully within the system we have already established, we'll end up with a situation in which ASIs produce all value in the economy, and humans produce nothing but receive an income and consume a bunch, through ownership of capital and/or taxing the ASIs. This part should be non-controversial, right?
At this point, it becomes ...
I have a slightly different take, which is that we can't commit to doing this scheme even if we want to, because I don't see what we can do today that would warrant the term "commitment", i.e., would be binding on our post-singularity selves.
In either case (we can't or don't commit), the argument in the OP loses a lot of its force, because we don't know whether post-singularity humans will decide to do this kind of scheme or not.
So the commitment I want to make is just my current self yelling at my future self, that "no, you should still bail us out even if 'you' don't have a skin in the game anymore". I expect myself to keep my word that I would probably honor a commitment like that, even if trading away 10 planets for 1 no longer seems like that good of an idea.
This doesn't make much sense to me. Why would your future self "honor a commitment like that", if the "commitment" is essentially just one agent yelling at another agent to do something the second agent doesn't want to...
Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof - without explicit construction - that it is possible.
The meta problem here is that you gave a "proof" (in quotes because I haven't verified it myself as correct) using your own definitions of "aligned" and "superintelligence", but if people asserting that it's not possible in principle have different definitions in mind, then you haven't actually shown them to be incorrect.
What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?
I'm very uncertain about it. Have you read Six Plausible Meta-Ethical Alternatives?
as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what's happening in a way that corrupts thoughts which previously implemented values.
Yeah, agreed that how to safely amplify oneself and reflect for long periods of time may be hard problems that should be solved (...
As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:
Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook.
When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write up their responses—a complete violation of Scale’s raison d’être.
So ...
but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this.
Why do you think this, and how would you convince skeptics? And there are two separate issues here. One is how to know their CEV won't be corrupted relative to what their values really are or should be, and the other is how to know that their real/normative values are actually highly altruistic. It seems hard to know both of these, and perhaps even harder to persuade others who may be very dist...
If indeed OpenAI does restructure to the point where its equity is now genuine, then $150 billion seems way too low as a valuation
Why is OpenAI worth much more than $150B, when Anthropic is currently valued at only $30-40B? Also, loudly broadcasting this reduces OpenAI's cost of equity, which is undesirable if you think OpenAI is a bad actor.
Apparently the current funding round hasn't closed yet and might be in some trouble, and it seems much better for the world if the round was to fail or be done at a significantly lower valuation (in part to send a message to other CEOs not to imitate SamA's recent behavior). Zvi saying that $150B greatly undervalues OpenAI at this time seems like a big unforced error, which I wonder if he could still correct in some way.
To clarify, I don't actually want you to scare people this way, because I don't know if people can psychologically handle it or if it's worth the emotional cost. I only bring it up myself to counteract people saying things like "AIs will care a little about humans and therefore keep them alive" or when discussing technical solutions/ideas, etc.
...If a misaligned AI had 1/trillion of its values be "protecting the preferences of whatever weak agents happen to exist in the world", why couldn't it also have 1/trillion other vaguely human-like preferences, such as "enjoy watching the suffering of one's enemies" or "enjoy exercising arbitrary power over others"?
From a purely selfish perspective, I think I might prefer that a misaligned AI kills everyone, and take my chances with continuations of myself (my copies/simulations) elsewhere in the multiverse, rather than face whatever the sum-of
I'm thinking that the most ethical (morally least risky) way to "insure" against a scenario in which AI takes off and property/wealth still matters is to buy long-dated far out of the money S&P 500 calls. (The longest dated and farthest out of the money seems to be Dec 2029 10000-strike SPX calls. Spending $78 today on one of these gives a return of $10000 if SPX goes to 20000 by Dec 2029, for example.)
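To spell out the arithmetic, here's a minimal sketch using the example numbers above (the premium, strike, and expiry level are just this comment's illustrative figures, not live quotes):

```python
# Minimal sketch of the payoff arithmetic above, in index points.
# Strike, premium, and expiry level are the illustrative numbers from this
# comment, not market data.

def call_payoff_at_expiry(index_level: float, strike: float) -> float:
    """Intrinsic value of a call at expiry, in index points."""
    return max(index_level - strike, 0.0)

premium = 78.0              # cost today (points)
strike = 10_000.0           # Dec 2029 SPX call strike
index_at_expiry = 20_000.0  # hypothetical index level at expiry

payoff = call_payoff_at_expiry(index_at_expiry, strike)
print(f"payoff: {payoff:,.0f} points")               # 10,000
print(f"return multiple: ~{payoff / premium:.0f}x")  # ~128x
```

(The 100x SPX contract multiplier scales the dollar cost and the dollar payoff equally, so it doesn't change the return multiple.)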
My reasoning here is that I don't want to provide capital to AI industries or suppliers because that seems wrong given what I judge to be high x-risk th...
What is going on with Constitutional AI? Does anyone know why no LLM aside from Claude (at least none that I can find) has used it? One would think that if it works about as well as RLHF (which it seems to), AI companies would be flocking to it to save on the cost of human labor?
Also, apparently ChatGPT didn't know that Constitutional AI is RLAIF until I reminded it, and Gemini thinks RLAIF and RLHF are the same thing. (Apparently not a fluke, as both models made the same error 2 out of 3 times.)
Isn't the basic idea of Constitutional AI just having the AI provide its own training feedback using written instructions? My guess is there was a substantial amount of self-evaluation in the o1 training with complicated written instructions, probably kind of similar to a constitution (though this is just a guess).
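Concretely, the loop I'm picturing is something like the sketch below; query_model and the example principles are hypothetical placeholders for illustration, not anything from Anthropic's or OpenAI's actual pipeline:

```python
# A minimal sketch of the critique-and-revise loop at the core of
# Constitutional AI / RLAIF as I understand it. `query_model` is a
# hypothetical stand-in for a call to an actual LLM.

PRINCIPLES = [
    "Choose the response that is least likely to help with harmful activities.",
    "Choose the response that is most honest about its own uncertainty.",
]

def query_model(prompt: str) -> str:
    # Placeholder: in practice this would be an LLM API call.
    return f"<model output for: {prompt[:40]}...>"

def generate_revised_response(user_prompt: str) -> tuple[str, str]:
    """Return (initial, revised) responses; the revised one becomes training data."""
    initial = query_model(user_prompt)
    revised = initial
    for principle in PRINCIPLES:
        critique = query_model(
            f"Critique the following response according to this principle.\n"
            f"Principle: {principle}\nResponse: {revised}"
        )
        revised = query_model(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {revised}"
        )
    return initial, revised

# The (prompt, revised) pairs feed supervised fine-tuning, and the model's own
# preference judgments between responses replace human labels in the RL step,
# hence "RL from AI feedback".
```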
The main asymmetries I see are:
About a week ago FAR.AI posted a bunch of talks at the 2024 Vienna Alignment Workshop to its YouTube channel, including Supervising AI on hard tasks by Jan Leike.
What do you think about my positions on these topics as laid out in Six Plausible Meta-Ethical Alternatives and Ontological Crisis in Humans?
My overall position can be summarized as being uncertain about a lot of things, and wanting (some legitimate/trustworthy group, i.e., not myself as I don't trust myself with that much power) to "grab hold of the whole future" in order to preserve option value, in case grabbing hold of the whole future turns out to be important. (Or some other way of preserving option value, such as preserving the status quo / doin...
As long as all mature superintelligences in our universe don't necessarily have (end up with) the same values, and only some such values can be identified with our values or what our values should be, AI alignment seems as important as ever. You mention "complications" from obliqueness, but haven't people like Eliezer recognized similar complications pretty early, with ideas such as CEV?
It seems to me that from a practical perspective, as far as what we should do, your view is much closer to Eliezer's view than to Land's view (which implies that alignment ...
"as important as ever": no, because our potential influence is lower, and the influence isn't on things shaped like our values, there has to be a translation, and the translation is different from the original.
CEV: while it addresses "extrapolation" it seems broadly based on assuming the extrapolation is ontologically easy, and "our CEV" is an unproblematic object we can talk about (even though it's not mathematically formalized, any formalization would be subject to doubt, and even if formalized, we need logical uncertainty over it, and logical induction ...
I think the relevant implication from the thought experiment is that thinking a bunch about metaethics and so on will in practice change your values
I don't think that's necessarily true. For example, some people think about metaethics and decide that anti-realism is correct and that they should just keep their current values. I think that's overconfident, but it does show that we don't know whether correct thinking about metaethics necessarily leads one to change one's values. (Under some other metaethical possibilities the same is also true.)
Also, even if it ...
This made me curious enough to read Land's posts on the orthogonality thesis. Unfortunately I got a pretty negative impression from them. From what I've read, Land tends to be overconfident in his claims and fails to notice obvious flaws in his arguments. Links for people who want to judge for themselves (I had to dig up archive.org links as the original site has disappeared):
...I think the relevant implication from the thought experiment is that thinking a bunch about metaethics and so on will in practice change your values; the pill itself is not very realistic, but thinking can make people smarter and will cause value changes. I would agree Land is overconfident (I think orthogonal and diagonal are both wrong models).
I'm actually pretty confused about what they did exactly. From the Safety section of Learning to Reason with LLMs:
...Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially im
If the other player is a stone with “Threat” written on it, you should do the same thing, even if it looks like the stone’s behavior doesn’t depend on what you’ll do in response. Responding to actions and ignoring the internals when threatened means you’ll get a lot fewer stones thrown at you.
In order to "do the same thing" you either need the other player's payoffs, or, according to the next section, "If you receive a threat and know nothing about the other agent’s payoffs, simply don’t give in to the threat!" So if all you see is a stone, then presuma...
#1 has obviously happened. Nord Stream 1 was blown up within weeks of my OP, and AFAIK Russia hasn't substantially expanded its other energy exports. Less sure about #2 and #3, as it's hard to find post-2022 energy statistics. My sense is that the answers are probably "yes" but I don't know how to back that up without doing a lot of research.
However, coal stocks (BTU, AMR, CEIX, ARCH being the main pure-play US coal stocks) haven't done as well as I had expected (the basket is roughly flat from Aug 2022 to today) for two other reasons: A. There have been tw...
Unfortunately this ignores 3 major issues:
Just riffing on this rather than starting a different comment chain:
If alignment-1 is "get AI to follow instructions" (as typically construed in a "good enough" sort of way) and alignment-2 is "get AI to do good things and not bad things" (also in a "good enough" sort of way, but with more assumed philosophical sophistication), I basically don't care about anyone's safety plan to get alignment-1 except insofar as it's part of a plan to get alignment-2.
Philosophical errors/bottlenecks can mean you don't know how to go from 1 to 2. Human safety pr...
I think there’s a steady stream of philosophy getting interested in various questions in metaphilosophy
Thanks for this info and the references. I guess by "metaphilosophy" I meant something more meta than metaethics or metaepistemology, i.e., a field that tries to understand all philosophical reasoning in some unified or systematic way, including reasoning used in metaethics and metaepistemology, and metaphilosophy itself. (This may differ from standard academic terminology, in which case please let me know if there's a preferred term for the concept I'...
Sadly, I don't have any really good answers for you.
Thanks, it's actually very interesting and important information.
I don't know of specific cases, but for example I think it is quite common for people to start studying meta-ethics because of frustration at finding answers to questions in normative ethics.
I've noticed (and stated in the OP) that normative ethics seems to be an exception where it's common to express uncertainty/confusion/difficulty. But I think, from both my inside and outside views, that this should be common in most philosophical ...
Thank you for your view from inside academia. Some questions to help me get a better sense of what you see:
My understanding of what happened (from reading this) is that you wanted to explore in a new direction very different from the then-preferred approach of the AF team, but couldn't convince them (or anyone else) to join you. To me this doesn't clearly have much to do with streetlighting, and my current guess is that it was probably reasonable of them not to be convinced. It was also perfectly reasonable of you to want to explore a different approach, but it seems unreasonable to claim, without giving any details, that it would have produced better results if...
If you say to someone
Ok, so, there's this thing about AGI killing everyone. And there's this idea of avoiding that by making AGI that's useful like an AGI but doesn't kill everyone and does stuff we like. And you say you're working on that, or want to work on that. And what you're doing day to day is {some math thing, some programming thing, something about decision theory, ...}. What is the connection between these things?
and then you listen to what they say, and reask the question and interrogate their answers, IME what it very often grounds out into...
(Upvoted since your questions seem reasonable and I'm not sure why you got downvoted.)
I see two ways to achieve some justifiable confidence in philosophical answers produced by superintelligent AI:
Having worked on some of the problems myself (e.g. decision theory), I think the underlying problems are just very hard. Why do you think they could have done "so much more, much more intently, and much sooner"?
I've had this tweet pinned to my Twitter profile for a while, hoping to find some like-minded people, but with 13k views so far I've yet to get a positive answer (or find someone expressing this sentiment independently):
Among my first reactions upon hearing "artificial superintelligence" were "I can finally get answers to my favorite philosophical problems" followed by "How do I make sure the ASI actually answers them correctly?"
Anyone else reacted like this?
This aside, there are some people around LW/rationality who seem more cautious/modest/self-crit...
"Signal group membership" may be true of the fields you mentioned (political philosophy and philosophy of religion), but seems false of many other fields such as philosophy of math, philosophy of mind, decision theory, anthropic reasoning. Hard to see what group membership someone is signaling by supporting one solution to Sleeping Beauty vs another, for example.
I'm increasingly worried that philosophers tend to underestimate the difficulty of philosophy. I've previously criticized Eliezer for this, but it seems to be a more general phenomenon.
Observations:
Philosophy is frequently (probably most of the time) done in order to signal group membership rather than as an attempt to accurately model the world. Just look at political philosophy or philosophy of religion. Most of the observations you note can be explained by philosophers operating at simulacrum level 3 instead of level 1.
I have a lot of disagreements with section 6. Not sure where the main crux is, so I'll just write down a couple of things.
One intuition pump here is: in the current, everyday world, basically no one goes around with much of a sense of what people’s “values on reflection” are, or where they lead.
This only works because we're not currently often in danger of subjecting other people to major distributional shifts. See Two Neglected Problems in Human-AI Safety.
...That is, ultimately, there is just the empirical pattern of: what you would think/feel/value gi
High population may actually be a problem, because it allows the AI transition to occur at low average human intelligence, hampering its governance. Low fertility/population would force humans to increase average intelligence before creating our successor, perhaps a good thing!
This assumes that it's possible to create better or worse successors, and that higher average human intelligence would lead to smarter/better politicians and policies, increasing our likelihood of building better successors.
Some worry about low fertility leading t...
...Social media sites are already getting overwhelmed by spam, fake images, fake videos, blackmail attempts, phishing, etc. The only way to counteract the speed and volume of massive AI-driven attacks is with AI-powered defenses. These defenses need rules. If those rules aren't formal and proven robust, then they will likely be hacked and exploited by adversarial AIs. So at the most basic level, we need infrastructure rules which are provably robust against classes of attacks. What those attack classes are and what properties those rules guarantee is part o
It seems confusing/unexpected that a user has to click on "Personal Blog" to see organisational announcements (which are not "personal"). Also, why is it important or useful to keep timeful posts off the front page by default?
If it's because they'll become less relevant/interesting over time, and you want to reduce the chances of them being shown to users in the future, it seems like that could be accomplished with another mechanism.
I guess another possibility is that timeful content is more likely to be politically/socially sensitive, and you want to ...
If you only care about the real world and you're sure there's only one real world, then the fact that you-at-time-0 would sometimes want to bind yourself at time 1 (e.g., physically commit to some action, or self-modify to perform some action at time 1) seems very puzzling, or indicates that something must be wrong: at time 1 you're in a strictly better epistemic position, having found out more about which world is real, so what sense does it make for your decision theory to have you-at-time-0 override you-at-time-1's decision?
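To make the tension concrete, here's a minimal sketch with the standard counterfactual-mugging numbers (these are the usual illustration, not anything from the discussion above):

```python
# Counterfactual mugging: a fair coin is flipped; on heads a predictor pays
# you $10,000 iff it predicted you would pay $100 on tails; on tails you are
# asked for the $100.

p_heads = 0.5
reward_if_predicted_to_pay = 10_000
cost_of_paying = 100

# Time 0 (before the flip): committing to pay looks good in expectation.
ev_commit_to_pay = p_heads * reward_if_predicted_to_pay - (1 - p_heads) * cost_of_paying
ev_refuse = 0
print(ev_commit_to_pay, ev_refuse)  # 4950.0 vs 0

# Time 1 (the coin came up tails, so you now know which world is "real"):
# paying just costs $100 in that world, so the updated agent prefers to refuse,
# which is exactly why time-0 you would want the choice to be binding.
value_of_paying_at_time_1 = -cost_of_paying
value_of_refusing_at_time_1 = 0
print(value_of_paying_at_time_1, value_of_refusing_at_time_1)  # -100 vs 0
```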