I feel kinda frustrated whenever "shard theory" comes up in a conversation, because it's not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".
This is a particular pity because I think there's a version of the "shard" framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in "subagents" which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly "goal-directed" as you go up the hierarchy. This is an old idea, FWIW; e.g. it's how Minsky frames intelligence in Society of Mind. And it's also somewhat consistent with the claim made in the original shard theory post, that "shards are just collections of subshards".
The problem is the "just". The post also says "shards are not full subagents", and that "we currently estimate that most shards are 'optimizers' to the extent that a bacterium or a thermostat is an optimizer." But the whole point...
I am not as negative on it as you are -- it seems an improvement over the 'Bag O' Heuristics' model and the 'expected utility maximizer' model. But I agree with the critique and said something similar here:
...you go on to talk about shards eventually values-handshaking with each other. While I agree that shard theory is a big improvement over the models that came before it (which I call rational agent model and bag o' heuristics model) I think shard theory currently has a big hole in the middle that mirrors the hole between bag o' heuristics and rational agents. Namely, shard theory currently basically seems to be saying "At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents; they have beliefs and desires of their own, and even negotiate with each other!" My response is "but what happens in the middle? Seems super important! Also haven't you just reproduced the problem but inside the head?" (The problem being, when modelling AGI we always understood that it would start out being just a crappy bag of heuristics and end up a scary rational ag
FWIW I'm potentially interested in interviewing you (and anyone else you'd recommend) and then taking a shot at writing the 101-level content myself.
One fairly strong belief of mine is that Less Wrong's epistemic standards are not high enough to make solid intellectual progress here. So far my best effort to make that argument has been in the comment thread starting here. Looking back at that thread, I just noticed that a couple of those comments have been downvoted to negative karma. I don't think any of my comments have ever hit negative karma before; I find it particularly sad that the one time it happens is when I'm trying to explain why I think this community is failing at its key goal of cultivating better epistemics.
There's all sorts of arguments to be made here, which I don't have time to lay out in detail. But just step back for a moment. Tens or hundreds of thousands of academics are trying to figure out how the world works, spending their careers putting immense effort into reading and producing and reviewing papers. Even then, there's a massive replication crisis. And we're trying to produce reliable answers to much harder questions by, what, writing better blog posts, and hoping that a few of the best ideas stick? This is not what a desperate effort to find the truth looks like.
And we're trying to produce reliable answers to much harder questions by, what, writing better blog posts, and hoping that a few of the best ideas stick? This is not what a desperate effort to find the truth looks like.
It seems to me that maybe this is what a certain stage in the desperate effort to find the truth looks like?
Like, the early stages of intellectual progress look a lot like thinking about different ideas and seeing which ones stand up robustly to scrutiny. Then the best ones can be tested more rigorously and their edges refined through experimentation.
It seems to me like there needs to be some point in the desperate search for truth at which you're allowing for half-formed thoughts and unrefined hypotheses, or else you simply never get to a place where the hypotheses you're creating even brush up against the truth.
In the half-formed thoughts stage, I'd expect to see a lot of literature reviews, agendas laying out problems, and attempts to identify and question fundamental assumptions. I expect that (not blog-post-sized speculation) to be the hard part of the early stages of intellectual progress, and I don't see it right now.
Perhaps we can split this into technical AI safety and everything else. Above I'm mostly speaking about "everything else" that Less Wrong wants to solve, since AI safety is now a substantial enough field that its problems need to be solved in more systemic ways.
As mentioned in my reply to Ruby, this is not a critique of the LW team, but of the LW mentality. And I should have phrased my point more carefully - "epistemic standards are too low to make any progress" is clearly too strong a claim, it's more like "epistemic standards are low enough that they're an important bottleneck to progress". But I do think there's a substantive disagreement here. Perhaps the best way to spell it out is to look at the posts you linked and see why I'm less excited about them than you are.
Of the top posts in the 2018 review, and the ones you linked (excluding AI), I'd categorise them as follows:
Interesting speculation about psychology and society, where I have no way of knowing if it's true:
Same as above but it's by Scott so it's a bit more rigorous and much more compelling:
(Thanks for laying out your position in this level of depth. Sorry for how long this comment turned out. I guess I wanted to back up a bunch of my agreement with words. It's a comment for the sake of everyone else, not just you.)
I think there's something to what you're saying, that the mentality itself could be better. The Sequences have been criticized because Eliezer didn't cite previous thinkers all that much, but at least as far as the science goes, as you said, he was drawing on academic knowledge. I also think we've lost something precious with the absence of epic topic reviews by the likes of Luke. Kaj Sotala still draws heavily on outside knowledge, John Wentworth did a great review on Biological Circuits, and we get SSC crossposts that do the same, but otherwise posts aren't heavily referencing or building upon outside material. I concede that I would like to see a lot more of that.
I think Kaj was rightly disappointed that he didn't get more engagement with his post whose gist was "this is what the science really says about S1 & S2, one of your most cherished concepts, LW community".
I wouldn't say the typical approach is strictly bad, there's value in thinking freshly...
This is only tangentially relevant, but adding it here as some of you might find it interesting:
Venkatesh Rao has an excellent Twitter thread on why most independent research only reaches this kind of initial exploratory level (he tried it for a bit before moving to consulting). It's pretty pessimistic, but there is a somewhat more optimistic follow-up thread on potential new funding models. Key point is that the later stages are just really effortful and time-consuming, in a way that keeps out a lot of people trying to do this as a side project alongside a separate main job (which I think is the case for a lot of LW contributors?)
Quote from that thread:
Research =
a) long time between having an idea and having something to show for it that even the most sympathetic fellow crackpot would appreciate (not even pay for, just get)
b) a >10:1 ratio of background invisible thinking in notes, dead-ends, eliminating options etc
With a blogpost, it’s like a week of effort at most from idea to mvp, and at most a 3:1 ratio of invisible to visible. That’s sustainable as a hobby/side thing.
...To do research-grade thinking you basically have to be independently wealthy and accept 90% d
Quoting your reply to Ruby below, I agree I'd like LessWrong to be much better at "being able to reliably produce and build on good ideas".
The reliability and focus feel most lacking to me on the building side, rather than the production, which I think we're doing quite well at. I think we've successfully built a publishing platform that provides an audience intensely interested in good ideas around rationality, AI, and related subjects, and a lot of very generative and thoughtful people are writing down their ideas here.
We're low on the ability to connect people up to do more extensive work on these ideas – most good hypotheses and arguments don't get a great deal of follow up or further discussion.
Here are some subjects where I think there's been various people sharing substantive perspectives, but I think there's also a lot of space for more 'details' to get fleshed out and subquestions to be cleanly answered:
"I see a lot of (very high quality) raw energy here that wants shaping and directing, with the use of lots of tools for coordination (e.g. better collaboration tools)."
Yepp, I agree with this. I guess our main disagreement is whether the "low epistemic standards" framing is a useful way to shape that energy. I think it is because it'll push people towards realising how little evidence they actually have for many plausible-seeming hypotheses on this website. One proven claim is worth a dozen compelling hypotheses, but LW to a first approximation only produces the latter.
When you say "there's also a lot of space for more 'details' to get fleshed out and subquestions to be cleanly answered", I find myself expecting that this will involve people who believe the hypothesis continuing to build their castle in the sky, not analysis about why it might be wrong and why it's not.
That being said, LW is very good at producing "fake frameworks". So I don't want to discourage this too much. I'm just arguing that this is a different thing from building robust knowledge about the world.
I think I'm concretely worried that some of those models / paradigms (and some other ones on LW) don't seem pointed in a direction that leads obviously to "make falsifiable predictions."
And I can imagine worlds where "make falsifiable predictions" isn't the right next step, you need to play around with it more and get it fleshed out in your head before you can do that. But there is at least some writing on LW that feels to me like it leaps from "come up with an interesting idea" to "try to persuade people it's correct" without enough checking.
(In the case of IFS, I think Kaj's sequence is doing a great job of laying it out in a concrete way where it can then be meaningfully disagreed with. But the other people who've been playing around with IFS didn't really seem interested in that, and I feel like we got lucky that Kaj had the time and interest to do so.)
In general when we do intellectual work we have excellent epistemic standards, capable of listening to all sorts of evidence that other communities and fields would throw out, and listening to subtler evidence than most scientists ("faster than science")
"Being more openminded about what evidence to listen to" seems like a way in which we have lower epistemic standards than scientists, and also that's beneficial. It doesn't rebut my claim that there are some ways in which we have lower epistemic standards than many academic communities, and that's harmful.
In particular, the relevant question for me is: why doesn't LW have more depth? Sure, more depth requires more work, but on the timeframe of several years, and hundreds or thousands of contributors, it seems viable. And I'm proposing, as a hypothesis, that LW doesn't have enough depth because people don't care enough about depth - they're willing to accept ideas even before they've been explored in depth. If this explanation is correct, then it seems accurate to call it a problem with our epistemic standards - specifically, the standard of requiring (and rewarding) deep investigation and scholarship.
There's been a fair amount of discussion of that sort of thing here: https://www.lesswrong.com/tag/group-rationality There are also groups outside LW thinking about social technology such as RadicalxChange.
Imagine you took 5 separate LWers and asked them to create a unified consensus response to a given article. My guess is that they’d learn more through that collective effort, and produce a more useful response, than if they spent the same amount of time individually evaluating the article and posting their separate replies.
I'm not sure. If you put those 5 LWers together, I think there's a good chance that the highest status person speaks first and then the others anchor on what they say and then it effectively ends up being like a group project for school with the highest status person in charge. Some related links.
Much of the same is true of scientific journals. Creating a place to share and publish research is a pretty key piece of intellectual infrastructure, especially for researchers to create artifacts of their thinking along the way.
The point about being 'cross-posted' is where I disagree the most.
This is largely original content that counterfactually wouldn't have been published, or occasionally would have been published but to a much smaller audience. What Failure Looks Like wasn't crossposted, Anna's piece on reality-revealing puzzles wasn't crossposted. I think that Zvi would have still written some on mazes and simulacra, but I imagine he writes substantially more content given the cross-posting available for the LW audience. Could perhaps check his blogging frequency over the last few years to see if that tracks. I recall Zhu telling me he wrote his FAQ because LW offered an audience for it, and likely wouldn't have done so otherwise. I love everything Abram writes, and while he did have the Intelligent Agent Foundations Forum, it had a much more concise, technical style, tiny audience, and didn't have the conversational explanations and stories and cartoons that have...
Here is the best toy model I currently have for rational agents. Alas, it is super messy and hacky, but better than nothing. I'll call it the BAVM model; the one-sentence summary is "internal traders concurrently bet on beliefs, auction actions, vote on values, and merge minds". There's little novel here, I'm just throwing together a bunch of ideas from other people (especially Scott Garrabrant and Abram Demski).
In more detail, the three main components are:
You also have some set of traders, who can simultaneously trade on any combination of these three. Traders earn money in two ways:
They spend money in three ways:
Values are therefore dominated by whichever traders earn money from predictions or actions, who will disproportionately vote for values that are formulated in the same on...
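To make the summary a bit more concrete, here is a minimal toy sketch of one round of the bet/auction/vote loop. This is my own illustration only: the Trader interface and payout rules below are invented, and mind-merging is omitted entirely.

```python
class Trader:
    """Toy trader: the attributes and payout rules here are invented for illustration."""
    def __init__(self, name, cash, predict, bid, vote):
        self.name, self.cash = name, cash
        self.predict = predict  # world -> P(event)
        self.bid = bid          # world -> (proposed_action, amount willing to pay)
        self.vote = vote        # world -> preferred value

def run_round(traders, world, event_happened):
    # Beliefs: each trader stakes cash in proportion to how far their credence is
    # from 50%, gaining the stake if they leaned the right way and losing it otherwise.
    for t in traders:
        stake = 0.1 * t.cash * (t.predict(world) - 0.5)
        t.cash += stake if event_happened else -stake

    # Actions: the next action is auctioned off to the highest bidder, who pays their bid.
    winner = max(traders, key=lambda t: t.bid(world)[1])
    action, price = winner.bid(world)
    winner.cash -= price

    # Values: the value to pursue is chosen by cash-weighted vote, so traders enriched
    # by good predictions and useful actions dominate (as noted above).
    tallies = {}
    for t in traders:
        tallies[t.vote(world)] = tallies.get(t.vote(world), 0.0) + t.cash
    chosen_value = max(tallies, key=tallies.get)
    return action, chosen_value
```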
Some opinions about AI and epistemology:
That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don't predict doom.
I don't really buy this statement. Most frames, from most disciplines, and most styles of reasoning, do not make clear predictions about what will happen to humanity in the long-run future. A very few do, but the vast majority are silent on this issue. Silence is not anything like "50%".
Most frames, from most disciplines, and most styles of reasoning, don't predict sparks when you put metal in a microwave. This doesn't mean I don't know what happens when you put metal in a microwave. You need to at the very least limit yourself to applicable frames, and there are very few applicable frames for predicting humanity's long-term future.
How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.
The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:
Suppose we think of ourselves as having many different subagents that focus on understanding the world in different ways - e.g. studying different disciplines, using different styles of reasoning, etc. The subagent that thinks about AI from first principles might come to a very strong opinion. But this doesn't mean that the other subagents should fully defer to it (just as having one very confident expert in a room of humans shouldn't cause all the other humans to elect them as the dictator). E.g. maybe there's an economics subagent who will remain skeptical unless the AI arguments can be formulated in ways that are consistent with their knowledge of economics, or the AI subagent can provide evidence that is legible even to those other subagents (e.g. advance predictions).
Do "subagents" in this paragraph refer to different people, or different reasoning modes / perspectives within a single person? (I think it's the latter, since otherwise they would just be "agents" rather than subagents.)
Either way, I think this is a neat way of modeling disagreement and reasoning processes, but for me it leads to a different conclusion on the object-level question of AI doom.
A big part of why I f...
I'd love to read an elaboration of your perspective on this, with concrete examples, which avoids focusing on the usual things you disagree about (pivotal acts vs. pivotal processes, whether the social facets of the game are important for us to track, etc.) and mainly focuses on your thoughts on epistemology and rationality and how they deviate from what you consider the LW norm.
(Written quickly and not very carefully.)
I think it's worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover", and Cohen et al.'s "Advanced artificial agents intervene in the provision of reward". They focus on policies learning the goal of getting high reward. But I have two problems with this:
Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has "policy learns to care about reward directly" as a footnote; I can imagine updating it based on the outcome of this discussion though.
I'm not very convinced by this comment as an objection to "50% AI grabs power to get reward." (I find it more plausible as an objection to "AI will definitely grab power to get reward.")
I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept and not closely related to the direct observations that policies are going to receive
"Reward" is not a very natural concept
This seems to be most of your position but I'm skeptical (and it's kind of just asserted without argument):
Five clusters of alignment researchers
Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:
(COI note: I work at OpenAI. These are my personal views, though.)
My quick take on the "AI pause debate", framed in terms of two scenarios for how the AI safety community might evolve over the coming years:
I haven't yet read through them thoroughly, but these four papers by Oliver Richardson are pattern-matching to me as potentially very exciting theoretical work.
tl;dr: probabilistic dependency graphs (PDGs) are directed graphical models designed to be able to capture inconsistent beliefs (paper 1). The definition of inconsistency is a natural one which allows us to, for example, reframe the concept of "minimizing training loss" as "minimizing inconsistency" (paper 2). They provide an algorithm for inference in PDGs (paper 3) and an algorithm for learning via locally minimizing inconsistency which unifies several other algorithms (like the EM algorithm, message-passing, and generative adversarial training) (paper 4).
Oliver is an old friend of mine (which is how I found out about these papers) and a final-year PhD student at Cornell under Joe Halpern.
Just read Bostrom's Deep Utopia (though not too carefully). The book is structured with about half being transcripts of fictional lectures given by Bostrom at Oxford, about a quarter being stories about various woodland creatures striving to build a utopia, and another quarter being various other vignettes and framing stories.
Overall, I was a bit disappointed. The lecture transcripts touch on some interesting ideas, but Bostrom's style is generally one which tries to classify and taxonomize, rather than characterize (e.g. he has a long section trying to analyze the nature of boredom). I think this doesn't work very well when describing possible utopias, because they'll be so different from today that it's hard to extrapolate many of our concepts to that point, and also because the hard part is making it viscerally compelling.
The stories and vignettes are somewhat esoteric; it's hard to extract straightforward lessons from them. My favorite was a story called The Exaltation of ThermoRex, about an industrialist who left his fortune to the benefit of his portable room heater, leading to a group of trustees spending many millions of dollars trying to figure out (and implement) what it means to "benefit" a room heater.
A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it's very hard to find failures which will actually cause big real-world harms, but you might find failures in a way which uncovers useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.
(For existing models, it might be more productive to ask for "surprising behavior" rather than "failures" per se, since I think almost all current failures are relatively uninteresting. Idk how to avoid inspiring capabilities work, though... but maybe understanding models better is robustly good enough to outweigh that?)
Here's a (messy, haphazard) list of ways a group of idealized agents could merge into a single agent:
Proposal 1: they merge into an agent which maximizes a weighted sum of their utilities. They decide on the weights using some bargaining solution.
Objection 1: this is not Pareto-optimal in the case where the starting agents have different beliefs. In that case we want:
Proposal 2: they merge into an agent which maximizes a weighted sum of their utilities, where those weights are originally set by bargaining but evolve over time depending on how accurately each original agent predicted the future.
Objection 2: this loses out on possible gains from acausal trade. E.g. if a paperclip-maximizer finds itself in a universe where it's hard to make paperclips but easy to make staples, it'd like to be able to give resources to staple-maximizers in exchange for them building more paperclips in universes where that's easier. This requires a kind of updateless decision theory:
Proposal 3: they merge into an agent which maximizes a weighted sum of their utilities (with those weights evolving over time), where the weights are set by bargaining subject to the constraint that each agent obeys commitme...
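One way to formalize the "weights evolve over time depending on predictive accuracy" idea from Proposals 2 and 3 (the notation is mine, and this is only a sketch of one possible version):

$$U_{\text{merged}}^{(t)} = \sum_i w_i^{(t)} U_i, \qquad w_i^{(t)} \propto w_i^{(0)} \prod_{s \le t} P_i(o_s \mid o_{<s}),$$

where the $w_i^{(0)}$ are the bargained starting weights and $P_i(o_s \mid o_{<s})$ is agent $i$'s predictive probability for each observation so far, so that agents whose beliefs track reality accumulate weight over time.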
I recently had a very interesting conversation about master morality and slave morality, inspired by the recent AstralCodexTen posts.
The position I eventually landed on was:
The crucial heuristic I apply when evaluating AI safety research directions is: could we have used this research to make humans safe, if we were supervising the human evolutionary process? And if not, do we have a compelling story for why it'll be easier to apply to AIs than to humans?
Sometimes this might be too strict a criterion, but I think in general it's very valuable in catching vague or unfounded assumptions about AI development.
The idea that maximally-coherent agents look like squiggle-maximizers raises the question: what would it look like for humans to become maximally coherent?
One answer, which Yudkowsky gives here, is that conscious experiences are just a "weird and more abstract and complicated pattern that matter can be squiggled into".
But that seems to be in tension with another claim he makes, that there's no way for one agent's conscious experiences to become "more real" except at the expense of other conscious agents—a claim which, according to him, motivates average utilitarianism across the multiverse.
Clearly a squiggle-maximizer would not be an average squigglean. So what's the disanalogy here? It seems like @Eliezer Yudkowsky is basically using SSA, but comparing between possible multiverses—i.e. when facing the choice between creating agent A or not, you look at the set of As in the multiverse where you decided yes, and compare it to the set of As in the multiverse where you decided no, and (if you're deciding for the good of A) you pick whichever one gives A a better time on average.
Yudkowsky has written before (can't find the link) that he takes this approach because alternatives would en...
A short complaint (which I hope to expand upon at some later point): there are a lot of definitions floating around which refer to outcomes rather than processes. In most cases I think that the corresponding concepts would be much better understood if we worked in terms of process definitions.
Some examples: Legg's definition of intelligence; Karnofsky's definition of "transformative AI"; Critch and Krueger's definition of misalignment (from ARCHES).
Sure, these definitions pin down what you're talking about more clearly - but that comes at the cost of understanding how and why it might come about.
E.g. when we hypothesise that AGI will be built, we know roughly what the key variables are. Whereas transformative AI could refer to all sorts of things, and what counts as transformative could depend on many different political, economic, and societal factors.
Suppose we get to specify, by magic, a list of techniques that AGIs won't be able to use to take over the world. How long does that list need to be before it makes a significant dent in the overall probability of xrisk?
I used to think of "AGI designs self-replicating nanotech" mainly as an illustration of a broad class of takeover scenarios. But upon further thought, nanotech feels like a pretty central element of many takeover scenarios - you actually do need physical actuators to do many things, and the robots we might build in the foreseeable future are nowhere near what's necessary for maintaining a civilisation. So how much time might it buy us if AGIs couldn't use nanotech at all?
Well, not very much if human minds are still an attack vector - the point where we'd have effectively lost is when we can no longer make our own decisions. Okay, so rule out brainwashing/hyper-persuasion too. What else is there? The three most salient: military power, political/cultural power, economic power.
Is this all just a hypothetical exercise? I'm not sure. Designing self-replicating nanotech capable of replacing all other human tech seems really hard; it's pretty plausible to me that the world is crazy in a bunch of other ways by the time we reach that capability. And so if we can block off a couple of the easier routes to power, that might actually buy useful time.
Probably the easiest "honeypot" is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that's anything like "get more reward" (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible.
This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?
The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definiti...
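As a sketch (my notation, and only one of several ways this could be cashed out), the counterfactual version might look like:

$$\mathrm{Power}(h \mid \pi_{\mathrm{AI}}, G) = \mathbb{E}_{g \sim G}\left[\max_{\pi_h} \Pr\big(h \text{ achieves } g \;\big|\; \mathrm{do}(\mathrm{goal}_h = g),\ \pi_h,\ \pi_{\mathrm{AI}}\big)\right],$$

i.e. the human's expected ability to achieve a goal sampled from G, under an intervention that sets their goal to that sample, holding the AI's policy fixed. An AI that never lets the human spend resources scores badly here, since for most sampled goals the human would want to spend them and couldn't.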
Inspired by a recent discussion about whether Anthropic broke a commitment to not push the capabilities frontier (I am more sympathetic to their position than most, because I think that it's often hard to distinguish between "current intentions" and "commitments which might be overridden by extreme events" and "solemn vows"):
Maybe one translation tool for bridging the gap between rationalists and non-rationalists is if rationalists interpret any claims about the future by non-rationalists as implicitly being preceded by "Look, I don't really believe that plans work, I think the world is inherently wildly unpredictable, I am kinda making everything up as I go along. Having said that:"
This translation tool would also require rationalists and such to make arguments of the form "I think supporting Anthropic (by, e.g., going to work there or giving it funding) is a good thing to do because they sort of have a feeling right now that it would be good not to push the AI frontier", rather than of the form "... because they're committed to not pushing the frontier".
Which are arguments one could make! But it's a pretty different argument, and I think people would behave differently if these were the only arguments in favour of supporting a new scaling lab.
I think that's how people should generally react in the absence of harder commitments and accountability measures.
This post confuses me.
Am I correct that the implication here is that assurances from a non-rationalist are essentially worthless?
I think it is also wrong to imply that Anthropic have violated their commitment simply because they didn't rationally think through the implications of their commitment when they made it.
I think you can understand Anthropic's actions as purely rational, just not very ethical.
They made an unenforceable commitment to not push capabilities when it directly benefited them. Now that it is more beneficial to drop the facade, they are doing so.
I think "don't trust assurances from non-rationalists" is not a good takeaway. Rather it should be "don't trust unenforceable assurances from people who will stand to greatly benefit from violating your trust at a later date".
I think part of the disappointment is the lack of communication regarding violating the commitment or violating the expectations of a non-trivial fraction of the community.
If someone makes a promise to you or even sets an expectation for you in a softer way, there is of course always some chance that they will break the promise or violate the expectation.
But if they violate the commitment or the expectation, and they care about you as a stakeholder, I think there's a reasonable expectation that they should have to justify that decision.
If they break the promise or violate the soft expectation, and then they say basically nothing (or they say "well I never technically made a promise– there was no contract!"), then I think you have the right to be upset with them not only for violating your expectation but also for essentially trying to gaslight you afterward.
I think a Responsible Lab would have issued some sort of statement along the lines of "hey, we're hearing that some folks thought we had made commitments to not advance the frontier and some of our employees were saying this to safety-focused members of the AI community. We're sorry about this miscommunication, and here are some s...
Right now it seems like the entire community is jumping to conclusions based on a couple of "impressions" people got from talking to Dario, plus an offhand line in a blog post.
No, many people had the impression that Anthropic had made such a commitment, which is why they were so surprised when they saw the Claude 3 benchmarks/marketing. Their impressions were derived from a variety of sources; those are merely the few bits of "hard evidence", gathered after the fact, of anything that could be thought of as an "organizational commitment".
Also, if Dustin Moskovitz and Gwern - two dispositionally pretty different people - both came away from talking to Dario with this understanding, I do not think that is something you just wave off. Failures of communication do happen. It's pretty strange for this many people to pick up the same misunderstanding over the course of several years, from many different people (including Dario, but also others), in a way that's beneficial to Anthropic, and then middle management starts telling you that maybe there was a vibe but they've never heard of any such commitment (nevermind what Dustin and Gwern heard, or anyone else who might've...
Could you clarify how binding "OpenAI’s mission is to ensure that artificial general intelligence benefits all of humanity." is?
Right now it seems like the entire community is jumping to conclusions based on a couple of "impressions" people got from talking to Dario, plus an offhand line in a blog post. With that little evidence, if you have formed strong expectations, that's on you.
Like Robert, the impressions I had were based on what I heard from people working at Anthropic. I cited various bits of evidence because those were the ones available, not because they were the most representative. The most representative were those from Anthropic employees who concurred that this was indeed the implication, but it seemed bad form to cite particular employees (especially when that information was not public by default) rather than, e.g., Dario. I think Dustin’s statement was strong evidence of this impression, though, and I still believe Anthropic to have at least insinuated it.
I agree with you that most people are not aiming for as much stringency with their commitments as rationalists expect. Separately, I do think that what Anthropic did would constitute a betrayal, even in everyday culture. And in any case, I think that when you are making a technology which might extinct humanity, the bar should be si...
Deceptive alignment doesn't preserve goals.
A short note on a point that I'd been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misaligned goal were "make as many paperclips as possible", but the goal "make as many staples as possible" could be represented more simply in the weights, then the weights should slowly drift from the former to the latter throughout training.
But actually, it'd likely be even simpler to get rid of the underlying misaligned goal, and just have alignment with the outer reward function as the terminal goal. So this argument suggests that even policies which start off misaligned would plausibly become aligned if they had to act deceptively aligned for long enough. (This sometimes happens in humans too, btw.)
Reasons this argument might not be relevant:
- The policy doing some kind of gradient hacking
- The policy being implemented using some kind of modular architecture (which may explain why this phenomenon isn't very robust in humans)
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.
A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.
I think this is useful for framing my core concerns about current safety research:
I do think it's more complicated than I've portrayed here, but I haven't yet seen a persuasive response to the core intuition.
Imagine taking someone's utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I'd want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.
But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to suffer. Now if my utility function is flipped, that naively means that I want people similar to me to suffer, and people similar to me to thrive. But this has a very different outcome if we interpret "similar to me" as de dicto vs de re - i.e. whether it refers to the old me or the new me.
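A toy way to see the ambiguity in code (everything here is a placeholder, just to pin down which reference gets flipped):

```python
def similarity(person, reference):
    """Stand-in for however 'similar to me' gets evaluated."""
    return 1.0 if person.values == reference.values else -1.0

def utility(world, me):
    # "I want people similar to me to thrive, and people dissimilar to me to suffer."
    return sum(similarity(p, me) * p.wellbeing for p in world)

# De re flip: the sign flips, but "me" still rigidly refers to the old self.
flipped_de_re = lambda world, old_me: -utility(world, old_me)

# De dicto flip: the sign flips and "me" is re-resolved to whoever I am now,
# which can give a very different outcome if my values have also changed.
flipped_de_dicto = lambda world, new_me: -utility(world, new_me)
```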
This is a more general problem when one person's utility function can depend on another person's, where you can construct circular dependencies (which I assume you can also do in the utility-flipping case). There's probably been a bunch of work on this, would be interested in pointers to it (e.g. I assume there have been attempts to construct typ...
(Vague, speculative thinking): Is the time element of UDT actually a distraction? Consider the following: agents A and B are in a situation where they'd benefit from cooperation. Unfortunately, the situation is complicated—it's not like a prisoner's dilemma, where there's a clear "cooperate" and a clear "defect" option. Instead they need to take long sequences of actions, and they each have many opportunities to subtly gain an advantage at the other's expense.
Therefore instead of agreements formulated as "if you do X I'll do Y", it'd be far more beneficial for them to make agreements of the form "if you follow the advice of person Z then I will too". Here person Z needs to be someone that both A and B trust to be highly moral, neutral, competent, etc. Even if there's some method of defecting that neither of them considered in advance, at the point in time when it arises Z will advise against doing it. (They don't need to actually have access to Z, they can just model what Z will say.)
If A and B don't have much communication bandwidth between them (e.g. they're trying to do acausal coordination) then they will need to choose a Z that's a clear Schelling point, even if that Z is subo...
[Epistemic status: rough speculation, feels novel to me, though Wei Dai probably already posted about it 15 years ago.]
UDT is (roughly) defined as "follow whatever commitments a past version of yourself would have made if they'd thought about your situation". But this means that any UDT agent is only as robust to adversarial attacks as their most vulnerable past self. Specifically, it creates an incentive for adversaries to show UDT agents situations that would trick their past selves into making unwise commitments. It also creates incentives for UDT agents themselves to hack their past selves, in order to artificially create commitments that "took effect" arbitrarily far back in their past.
In some sense, then, I think UDT might have a parallel structure to the overall alignment problem. You have dumber past agents who don't understand most of what's going on. You have smarter present agents who have trouble cooperating, because they know too much. The smarter agents may try to cooperate by punting to "Schelling point" dumb agents. (But this faces many of the standard problems of dumb agents making decisions—e.g. the commitments they make will probably be inconsistent or incoherent...
It seems to me that Eliezer overrates the concept of a simple core of general intelligence, whereas Paul underrates it. Or, alternatively: it feels like Eliezer is leaning too heavily on the example of humans, and Paul is leaning too heavily on evidence from existing ML systems which don't generalise very well.
I don't think this is a particularly insightful or novel view, but it seems worth explicitly highlighting that you don't have to side with one worldview or the other when evaluating the debates between them. (Although I'd caution not to just average their two views - instead, try to identify Eliezer's best arguments, and Paul's best arguments, and reconcile them.)
I've been reading Eliezer's recent stories with protagonists from dath ilan (his fictional utopia). Partly due to the style, I found myself bouncing off a lot of the interesting claims that he made (although it still helped give me a feel for his overall worldview). The part I found most useful was this page about the history of dath ilan, which can be read without much background context. I'm referring mostly to the exposition on the first 2/3 of the page, although the rest of the story from there is also interesting. One key quote from the remainder of the story:
..."The next most critical fact about Earth is that from a dath ilani perspective their civilization is made entirely out of coordination failure. Coordination that fails on every scale recursively, where uncoordinated individuals assemble into groups that don't express their preferences, and then those groups also fail to coordinate with each other, forming governments that offend all of their component factions, which governments then close off their borders from other governments. The entirety of Earth is one gigantic failure fractal. It's so far below the multi-agent-optimal-boundary, only their profess
A tension that keeps recurring when I think about philosophy is between the "view from nowhere" and the "view from somewhere", i.e. a third-person versus first-person perspective—especially when thinking about anthropics.
One version of the view from nowhere says that there's some "objective" way of assigning measure to universes (or people within those universes, or person-moments). You should expect to end up in different possible situations in proportion to how much measure your instances in those situations have. For example, UDASSA ascribes measure based on the simplicity of the computation that outputs your experience.
One version of the view from somewhere says that the way you assign measure across different instances should depend on your values. You should act as if you expect to end up in different possible future situations in proportion to how much power to implement your values the instances in each of those situations has. I'll call this the ADT approach, because that seems like the core insight of Anthropic Decision Theory. Wei Dai also discusses it here.
In some sense each of these views makes a prediction. UDASSA predicts that we live in a universe with laws of physi...
In a bayesian rationalist view of the world, we assign probabilities to statements based on how likely we think they are to be true. But truth is a matter of degree, as Asimov points out. In other words, all models are wrong, but some are less wrong than others.
Consider, for example, the claim that evolution selects for reproductive fitness. Well, this is mostly true, but there's also sometimes group selection, and the claim doesn't distinguish between a gene-level view and an individual-level view, and so on...
So just assigning it a single probability seems inadequate. Instead, we could assign a probability distribution over its degree of correctness. But because degree of correctness is such a fuzzy concept, it'd be pretty hard to connect this distribution back to observations.
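Concretely, one version of this: let $c \in [0,1]$ be the claim's degree of correctness and maintain a density $p(c)$ over it (e.g. a Beta distribution) rather than a single $P(\text{claim})$. The difficulty is then that updating requires a likelihood $P(\text{observation} \mid c)$, and it's exactly that link between "degree of correctness" and observations which is hard to specify.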
Or perhaps the distinction between truth and falsehood is sufficiently clear-cut in most everyday situations for this not to be a problem. But questions about complex systems (including, say, human thoughts and emotions) are messy enough that I expect the difference between "mostly true" and "entirely true" to often be significant.
Has this been discussed before? Given Less Wrong's name, I'd be surprised if not, but I don't think I've stumbled across it.
Oracle-genie-sovereign is a really useful distinction that I think I (and probably many others) have avoided using mainly because "genie" sounds unprofessional/unacademic. This is a real shame, and a good lesson for future terminology.
People sometimes try to reason about the likelihood of deceptive alignment by appealing to speed priors and simplicity priors. I don't like such appeals, because I think that the differences between aligned and deceptive AGIs will likely be a very small proportion of the total space/time complexity of an AGI. More specifically:
1. If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they're relevant. So the speed cost of deception will be amortized across the (likely very long) training period.
2. AGIs will represent a huge number of beliefs and heuristics which inform their actions (e.g. every single fact they know). A heuristic like "when you see X, initiate the world takeover plan" would therefore constitute a very small proportion of the total information represented in the network; it'd be hard to regularize it away without regularizing away most of the AGI's knowledge.
I think that something like the speed vs simplicity tradeoff is relevant to the likelihood of deceptiv...
It's really weird that we find ourselves at the hinge of history. One proposed explanation is that we're part of an ancestor simulation. It makes sense that ancestor simulations would be focused on the hinge of history. But unless ancestor simulations make up a significant proportion of future minds, it's still weird that we find ourselves in a simulation rather than actually experiencing the future.
Why might ancestor simulations make up a significant proportion of future minds? One possible answer is that ancestor simulations provide the information requi...
Since there's been some recent discussion of the SSC/NYT incident (in particular via Zack's post), it seems worth copying over my twitter threads from that time about why I was disappointed by the rationalist community's response to the situation.
I continue to stand by everything I said below.
Thread 1 (6/23/20):
Scott Alexander is the most politically charitable person I know. Him being driven off the internet is terrible. Separately, it is also terrible if we have totally failed to internalize his lessons, and immediately leap to the conclusion that the NY...
My mental one-sentence summary of how to think about ELK is "making debate work well in a setting where debaters are able to cite evidence gained by using interpretability tools on each other".
I'm not claiming that this is how anyone else thinks about ELK (although I got the core idea from talking to Paul) but since I haven't seen it posted online yet, and since ELK is pretty confusing, I thought it'd be useful to put out there. In particular, this framing motivates us generating interpretability tools which scale in the sense of being robust when used as ...
Being nice because you're altruistic, and being even nicer for decision-theoretic reasons on top of that, seems like it involves some kind of double-counting: the reason you're altruistic in the first place is because evolution ingrained the decision theory into your values.
But it's not fully double-counting: many humans generalise altruism in a way which leads them to "cooperate" far more than is decision-theoretically rational for the selfish parts of them - e.g. by making big sacrifices for animals, future people, etc. I guess this could be selfishly ra...
In UDT2, when you're in epistemic state Y and you need to make a decision based on some utility function U, you do the following:
1. Go back to some previous epistemic state X and an EDT policy (the combination of which I'll call the non-updated agent).
2. Spend a small amount of time trying to find the policy P which maximizes U based on your current expectations X.
3. Run P(Y) to make the choice which maximizes U.
The non-updated agent gets much less information than you currently have, and also gets much less time to think. But it does use the same utility ...
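Here's a compressed sketch of that procedure as toy code. Everything is schematic: the choice of X, the set of candidate policies, and the bounded search are all hidden inside the arguments.

```python
def udt2_decide(Y, U, X, candidate_policies, sample_scenarios):
    """Toy rendering of the three-step procedure above; nothing here is meant literally."""
    def expected_U(policy):
        # Evaluate each policy from the *earlier* epistemic state X, not from Y:
        # average U over situations that X considers possible.
        scenarios = sample_scenarios(X)
        return sum(U(policy(s)) for s in scenarios) / len(scenarios)

    # Step 2: the non-updated agent picks the policy that looks best given X
    # (in the real procedure, using only a small amount of thinking time).
    P = max(candidate_policies, key=expected_U)

    # Step 3: run that policy on the epistemic state you actually have.
    return P(Y)
```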
Random question I’ve been thinking about: how would you set up a market for votes? Suppose specifically that you have a proportional chances election (i.e. the outcome gets chosen with probability proportional to the number of votes cast for it—assume each vote is a distribution over candidates). So everyone has an incentive to get everyone who’s not already voting for their favorite option to change their vote; and you can have positive-sum trades where I sell you a promise to switch X% of my votes to a compromise candidate in exchange for you switching Y...
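For reference, the election mechanism itself is easy to write down (a sketch which assumes each ballot is a mapping from candidates to non-negative weights summing to one):

```python
import random

def proportional_chances_election(ballots):
    """Each ballot is a distribution over candidates; the winner is sampled
    with probability proportional to the total weighted votes received."""
    totals = {}
    for ballot in ballots:
        for candidate, weight in ballot.items():
            totals[candidate] = totals.get(candidate, 0.0) + weight
    candidates = list(totals)
    return random.choices(candidates, weights=[totals[c] for c in candidates])[0]

# e.g. proportional_chances_election([{"A": 1.0}, {"A": 0.5, "B": 0.5}, {"B": 1.0}])
# picks A with probability 0.5 and B with probability 0.5.
```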
I expect it to be difficult to generate adversarial inputs which will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights. But this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven't yet seen any discussion of, is to allow the adversary to do a data poisoning attack before generating adversarial inputs - i.e. the adversary gets to specify inputs and losses for a given number of SGD steps, and then the adversarial input which the base model will be evaluated on afterwards. (Edit: probably a better name for this is adversarial meta-learning.)
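A rough sketch of the evaluation loop I have in mind (the adversary interface here is invented, and this glosses over how the adversary's choices would be constrained):

```python
import copy
import torch

def adversarial_meta_learning_eval(model, adversary, n_poison_steps, lr=1e-3):
    """The adversary first poisons a copy of the model via chosen (input, loss) SGD steps,
    then supplies the adversarial input on which the poisoned model is evaluated."""
    poisoned = copy.deepcopy(model)
    opt = torch.optim.SGD(poisoned.parameters(), lr=lr)
    for step in range(n_poison_steps):
        x, loss_fn = adversary.propose_poison(step)  # adversary-chosen input and loss
        loss = loss_fn(poisoned(x))
        opt.zero_grad()
        loss.backward()
        opt.step()
    x_adv = adversary.propose_probe()  # the adversarial input for the actual test
    return poisoned(x_adv)
```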
Another thought on dath ilan: notice how much of the work of Keltham's reasoning is based on him pattern-matching to tropes from dath ilani literature, and then trying to evaluate their respective probabilities. In other words: like bayesianism, he's mostly glossing over the "hypothesis generation" step of reasoning.
I wonder if dath ilan puts a lot of effort into spreading a wide range of tropes because they don't know how to teach systematically good hypothesis generation.
I suspect that AIXI is misleading to think about in large part because it lacks reusable parameters - instead it just memorises all inputs it's seen so far. Which means the setup doesn't have episodes, or a training/deployment distinction; nor is any behaviour actually "reinforced".
I've recently discovered waitwho.is, which collects all the online writing and talks of various tech-related public intellectuals. It seems like an important and previously-missing piece of infrastructure for intellectual progress online.
Yudkowsky mainly wrote about recursive self-improvement from a perspective in which algorithms were the most important factors in AI progress - e.g. the brain in a box in a basement which redesigns its way to superintelligence.
Sometimes when explaining the argument, though, he switched to a perspective in which compute was the main consideration - e.g. when he talked about getting "a hyperexponential explosion out of Moore’s Law once the researchers are running on computers".
What does recursive self-improvement look like when you think that data might be t...
RL usually applies some discount rate, and also caps episodes at a certain length, so that an action taken at a given time isn't reinforced very much (or at all) for having much longer-term consequences.
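Concretely, the return being reinforced is something like $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$, with discount $\gamma < 1$ and episode end $T$: a consequence $N$ steps away is weighted by $\gamma^N$, and not at all once it falls outside the episode.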
How does this compare to evolution? At equilibrium, I think that a gene which increases the fitness of its bearers in N generations' time is just as strongly favored as a gene that increases the fitness of its bearers by the same amount straightaway. As long as it was already widespread at least N generations ago, they're basically the same thing, because c...
A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.
The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.
What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to und...
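A minimal sketch of that setup (the function names are placeholders, not any particular implementation):

```python
def paraphrase_bottleneck(sender, receiver, human_paraphrase, task_input):
    """Route the inter-model message through a human who restates it in their own words,
    so that only meaning the human actually understood can get through."""
    message = sender(task_input)           # model A writes a natural-language message
    restated = human_paraphrase(message)   # human reads it, then rephrases from their own understanding
    return receiver(task_input, restated)  # model B only ever sees the human's paraphrase
```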
I believe that humans have already crossed a threshold that, in a certain sense, puts us on an equal footing with any other being who has mastered abstract reasoning. There’s a notion in computing science of “Turing completeness”, which says that once a computer can perform a set of quite basic operations, it can be programmed to do absolutely any calculation that any other computer can do. Other computers might be faster, or have more memory, or have multiple processors running at the same time, but my 1988 A...
Equivocation. "Who's 'we', flesh man?" Even granting the necessary millions or billions of years for a human to sit down and emulate a superintelligence step by step, it is still not the human who understands, but the Chinese room.
It's frustrating how bad dath ilanis (as portrayed by Eliezer) are at understanding other civilisations. They seem to have all dramatically overfit to dath ilan.
To be clear, it's the type of error which is perfectly sensible for an individual to make, but strange for their whole civilisation to be making (by teaching individuals false beliefs about how tightly constraining their coordination principles are).
The in-universe explanation seems to be that they've lost this knowledge as a result of screening off the past. But that seems like a really predictabl...
Half-formed musing: what's the relationship between being a nerd and trusting high-level abstractions? In some sense they seem to be the opposite of each other - nerds focus obsessively on a domain until they understand it deeply, not just at high levels of abstraction. But if I were to give a very brief summary of the rationalist community, it might be: nerds who take very high-level abstractions (such as moloch, optimisation power, the future of humanity) very seriously.
There's some possible world in which the following approach to interpretability works:
One problem that this approach would face if we were using it to interpret a human is that the human might not consciously be aware of what their motivations are. For example, they may believe they are doing something for altr...
I've heard people argue that "most" utility functions lead to agents with strong convergent instrumental goals. This obviously depends a lot on how you quantify over utility functions. Here's one intuition in the other direction. I don't expect this to be persuasive to most people who make the argument above (but I'd still be interested in hearing why not).
If a non-negligible percentage of an agent's actions are random, then to describe it as a utility-maximiser would require an incredibly complex utility function (becaus...
Makes sense. For what it's worth, I'd also argue that thinking about optimal policies at all is misguided (e.g. what's the optimal policy for humans - the literal best arrangement of neurons we could possibly have for our reproductive fitness? Probably we'd be born knowing arbitrarily large amounts of information. But this is just not relevant to predicting or modifying our actual behaviour at all).
(I now think that you were very right in saying "thinking about optimal policies at all is misguided", and I was very wrong to disagree. I've thought several times about this exchange. Not listening to you about this point was a serious error and made my work way less impactful. I do think that the power-seeking theorems say interesting things, but about eg internal utility functions over an internal planning ontology -- not about optimal policies for a reward function.)