All of xuan's Comments + Replies

xuanΩ110

It seems to me that it's not right to assume that the probability of opportunities to trade is zero?

Suppose both John and David are alive on a desert island right now (but slowly dying), and there's a chance that a rescue boat will arrive that will save only one of them, leaving the other to die. What would they contract to? Assuming no altruistic preferences, presumably neither would agree to only the other person being rescued.

It seems more likely here that bargaining will break down, and one of them will kill off the other, resulting in an arbitrary resolution of who ends up on the rescue boat, not a "rational" resolution.

3Dweomite
Doesn't irreversibility imply that there is zero probability of a trade opportunity to reverse the thing?  I'm not proposing a new trait that your original scenario didn't have; I'm proposing that I identified which aspect of your scenario was load-bearing.   I don't think I understand how your new hypothetical is meant to be related to anything discussed so far.  As described, the group doesn't have strongly incomplete preferences, just 2 mutually-exclusive objectives.
xuanΩ220

While I've focused on death here, I think this is actually much more general -- there are a lot of irreversible decisions that people make (and that artificial agents might make) between potentially incommensurable choices. Here's a nice example from Elizabeth Anderson's "Value in Ethics & Economics" (Ch. 3, p. 57) re: the question of how one should live one's life, to which I think irreversibility applies:

[quoted passage from Anderson omitted]

Similar incommensurability applies, I think, to what kind of society we collectively want to live in, given that path dependency makes many cho... (read more)

xuanΩ6123

Interesting argument! I think it goes through -- but only under certain ecological / environmental assumptions:

  1. That decisions  / trades between goods are reversible.
  2. That there are multiple opportunities to make such trades / decisions in the environment.

But this isn't always the case! Consider:

  • Both John and David prefer living over dying.
  • Hence, John would not trade (John Alive, David Dead) for (John Dead, David Alive), and vice versa for David.

This is already a case of weakly incomplete preferences which, while technically reducible to a complete orde... (read more)

3Dweomite
Rather than talking about reversibility, can this situation be described just by saying that the probability of certain opportunities is zero?  For example, if John and David somehow know in advance that no one will ever offer them pepperoni in exchange for anchovies, then the maximum amount of probability mass that can be shifted from mushrooms to pepperoni by completing their preferences happens to be zero.  This doesn't need to be a physical law of anchovies; it could just be a characteristic of their trade partners.

But in this hypothetical, their preferences are effectively no longer strongly incomplete--or at least, their trade policy is no longer strongly incomplete.  Since we've assumed away the edge between pepperoni and anchovies, we can (vacuously) claim that John and David will collectively accept 100% of the (non-existent) trades from anchovies to pepperoni, and it becomes possible to describe their trade policy as being a utility maximizer.

(Specifically, we can say anchovies = mushrooms because they won't trade between them, and say pepperoni > mushrooms because they will trade mushrooms for pepperoni.  The original problem was that this implies that pepperoni > anchovies, which is false in their preferences, but it is now (vacuously) true in their trade policy if such opportunities have probability zero.)
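A worked toy version of the utility function described in the reply above (the concrete numbers are my own, purely illustrative):

$$u(\text{pepperoni}) = 1, \qquad u(\text{mushrooms}) = u(\text{anchovies}) = 0.$$

Maximizing $u$ reproduces every trade that actually occurs: it accepts mushrooms-for-pepperoni, never trades between anchovies and mushrooms, and its (false) endorsement of anchovies-for-pepperoni trades is never tested, because such offers arise with probability zero.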
3quetzal_rainbow
Well, it can be overcome by future contracts, no? We replace "John dead" with "John dies tomorrow" and perform trades today.
2xuan
While I've focused on death here, I think this is actually much more general -- there are a lot of irreversible decisions that people make (and that artificial agents might make) between potentially incommensurable choices. Here's a nice example from Elizabeth Anderson's "Value in Ethics & Economics" (Ch. 3, p. 57) re: the question of how one should live one's life, to which I think irreversibility applies. Similar incommensurability applies, I think, to what kind of society we collectively want to live in, given that path dependency makes many choices irreversible.
xuanΩ360

Not sure if this is the same as the awards contest entry, but EJT also made this earlier post ("There are no coherence theorems") arguing that certain Dutch Book / money pump arguments against incompleteness fail!

xuanΩ110

Very interesting work! This is only a half-formed thought, but the diagrams you've created very much remind me of similar diagrams used to display learned "topics" in classic topic models like Latent Dirichlet Allocation (Figure 8 from the paper is below):

I think there's possibly something to be gained by viewing what the MLPs and attention heads are learning as something like "topic models" -- and it may be the case that some of the methods developed for evaluating topic interpretability and consistency will be valuable here. A couple of references:

... (read more)
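As a concrete, hypothetical illustration of the sort of topic-coherence evaluation mentioned above (not from the original comment), here is a minimal sketch using gensim's CoherenceModel; the toy "documents" are made up, standing in for the contexts a given MLP neuron or attention head responds to:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy "documents"; in the analogy, each would be a bag of features/contexts
# associated with one component of the network.
texts = [["head", "attends", "to", "subject"],
         ["head", "attends", "to", "verb"],
         ["mlp", "activates", "on", "numbers"],
         ["mlp", "activates", "on", "dates"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Fit a tiny LDA model, then score its topics with u_mass coherence,
# one standard automatic proxy for human topic interpretability.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=50, random_state=0)
coherence = CoherenceModel(model=lda, corpus=corpus,
                           dictionary=dictionary,
                           coherence="u_mass").get_coherence()
print(lda.show_topics(num_words=4))
print("coherence:", coherence)
```

The point of the sketch is only that interpretability/consistency of learned "topics" can be scored automatically and compared across models, which is the kind of evaluation the topic-modeling literature has developed.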
xuan*Ω22-1

Regarding causal scrubbing in particular, it seems to me that there's a closely related line of research by Geiger, Icard and Potts that TAISIC doesn't seem to be engaging with deeply? I haven't looked too closely, but it may be another example of duplicated effort / rediscovery:

The importance of interventions

Over a series of recent papers (Geiger et al. 2020, Geiger et al. 2021, Geiger et al. 2022, Wu et al. 2022a, Wu et al. 2022b), we have argued that the theory of causal abstraction (Chalupka et al. 2016, Rubinstein et al. 2017, Beckers and Halpern

... (read more)
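For readers unfamiliar with the Geiger et al. line of work, its core experimental move is the interchange intervention: swap an internal activation computed on one input into the forward pass on another input, and check whether the output changes as the hypothesized causal abstraction predicts. A minimal sketch in PyTorch (toy model and layer choice are mine, purely illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-in for a model under analysis (architecture is made up).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
layer = model[1]  # intervene on the ReLU's output
base, source = torch.randn(1, 4), torch.randn(1, 4)

# 1. Cache the activation this layer produces on the "source" input.
cache = {}
handle = layer.register_forward_hook(lambda m, inp, out: cache.update(act=out.detach()))
model(source)
handle.remove()

# 2. Re-run on the "base" input, swapping in the cached source activation
#    (an interchange intervention / activation patch).
handle = layer.register_forward_hook(lambda m, inp, out: cache["act"])
patched = model(base)
handle.remove()

clean = model(base)
print(clean, patched)  # compare clean vs. intervened outputs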
7LawrenceC
We were quite familiar with Geiger et al's work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops. Hopefully this will be fixed with the forthcoming arXiv paper!
xuanΩ5107

Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas. While I don't work on interpretability per se, I see similar things happening with value learning / inverse reinforcement learning approaches to alignment.

1David Reber
Ditto. I've recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.

I see similar things happening with causality generally, where it seems to me that (as a 1st order heuristic) much of alignment forum's reference for causality is frozen at Pearl's 2008 textbook, missing what I consider to be most of the valuable recent contributions and expansions in the field.
* Example: Finite Factored Sets seems to be reinventing causal representation learning [for a good intro, see Schölkopf 2021], where it seems to me that the broader field is outpacing FFS on its own goals. FFS promises some theoretical gains (apparently to infer causality where Pearl-esque frameworks can't) but I'm no longer as sure about the validity of this.
* Counterexample(s): the Causal Incentives Working Group, and David Krueger's lab, for instance. Notably these are embedded in academia, where there's more culture (incentive) to thoroughly relate to previous work. (These aren't the only ones, just 2 that came to mind.)
1[anonymous]
Strong upvote here as well. The points about how even simple terminological differences can isolate research pursuits are especially pertinent, considering the tendency of people on and around LW to coin new phrases/ideas on a dime. Novel terminology is a valuable resource that we have been spending very frivolously.
2xuan
Regarding causal scrubbing in particular, it seems to me that there's a closely related line of research by Geiger, Icard and Potts that TAISIC doesn't seem to be engaging with deeply? I haven't looked too closely, but it may be another example of duplicated effort / rediscovery: Source: https://ai.stanford.edu/blog/causal-abstraction/ 
xuanΩ7156

Fascinating evidence!

I suspect this may be because RLHF elicits a singular scale of "goodness" judgements from humans, instead of a plurality of "goodness-of-a-kind" judgements. One way to interpret language models is as *mixtures* of conversational agents: they first sample some conversational goal, then some policy over words, conditioned on that goal:

On this interpretation, what RL from human feedback does is shift/concentrate the distribution ov... (read more)
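To spell out that mixture interpretation a bit more formally (my own notation, a sketch rather than anything from the original comment):

$$p_{\text{LM}}(w_{1:T}) \;=\; \sum_{g} p(g)\, p(w_{1:T} \mid g) \quad\longrightarrow\quad p_{\text{RLHF}}(w_{1:T}) \;\approx\; \sum_{g} q(g)\, p(w_{1:T} \mid g),$$

where $g$ ranges over conversational goals and $q(g)$ is a much lower-entropy distribution than $p(g)$, concentrating on the goals favored by the learned reward model.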

2Sam Marks
This seems like a good way to think about some of the examples of mode collapse, but doesn't obviously cover all the cases. For example, when asking the model to produce a random number, is it really the case that there's a particular conversational goal which the RLHF'd model is optimizing, such that 97 is the best random number for that goal? In this case, Paul's guess that RLHF'd models tend to push probability mass onto the base model's most likely tokens seems more explanatory.
9paulfchristiano
I agree that the (unprompted) generative model is doing something kind of like: choose a random goal, then optimize it. In some sense that does reflect the "plurality of realistic human goals." But I don't think it's a good way to reflect that diversity. It seems like you want to either (i) be able to pick which goal you pursue, (ii) optimize an aggregate of several goals. Either way, I think that's probably best reflected by a deterministic reward function, and you'd probably prefer to be mindful about what you are getting rather than randomly sampling from webtext. (Though as I mention in my other comment, I think there are other good reasons to want the pure generative model.)
xuan30

Apologies for the belated reply.

Yes, the summary you gave above checks out with what I took away from your post. I think it sounds good on a high level, but still too vague / high-level for me to say much in more detail. Values/ethics are definitely a system (e.g., one might think that morality was evolved by humans for the purposes of co-operation), but at the end of the day you're going to have to make some concrete hypothesis about what that system is in order to make progress. Contractualism is one such concrete hypothesis, and folding ethics under the... (read more)

2Q Home
In any case, I think your idea (and abramdemski's) should be getting more attention. I think the next possible step, before trying to guess a specific system/formalization, is to ask "what can we possibly gain by generalizing?"

For example, if you generalize values to normativity (including language normativity):
* You may translate the process of learning language into the process of learning human values. You can test alignment of the AI on language.
* And maybe you can even translate some rules of language normativity into the rules of human normativity.

I speculated that if you generalize values to statements about systems, then:
* You can translate some statements about simpler systems into statements about human values. You get simple, but universal justifications for actions.
* You get very "dense" justifications of actions. E.g. you have a very big amount of overlapping reasons to not turn the world into paperclips.
* You get very "recursive" justifications. "Recursiveness" means how easy it is to derive/reconstruct one value from another.

What do we gain (in the best case scenario) by generalizing values to "contracts"? I thought that maybe we could discuss what possible properties this generalization may have. Finding an additional property you want to get out of the generalization may help with the formalization (it can restrict the space of possible formal models). It's not a very useful generalization/reduction if we don't get anything from it, if "statements about the natural world" don't have significant convenient properties.
xuan30

Hmm, I'm not sure I fully understand the concept of "X statements" you're trying to introduce, though it does feel similar in some ways to contractualist reasoning. Since the concept is still pretty vague to me, I don't feel like I can say much about it, beyond mentioning several ideas / concepts that might be related:

- Immanent critique (a way of pointing out the contradictions in existing systems / rules)
- Reasons for action (especially justificatory reasons)
- Moral naturalism (the meta-ethical position that moral statements are statements about the natu... (read more)

1Q Home
Thank you! Sorry, I should have formulated my question better. I meant that from time to time people come up with the idea "maybe AI shouldn't learn human values/ethics in the classical sense" or "maybe learning something that's not human values can help to learn human values":
* Impact measures. "Impact" by itself is not a human value. It exists beyond human values.
* Your idea of contractualism. "Contracts" are not human values in the classical sense. You say that human values make sense only in context of society and a specific reality.
* Normativity by abramdemski. "Normativity" is not 100% about human values: for example, there's normativity in language.
* My idea: describe values/ethics as a system and study it in the context of all other systems.

The common theme of all those ideas is describing human values as a part of something bigger. I thought it would be rational to give a name to this entire area "beyond human values" and compare ideas in that context. And answer the question: why do we even bother going there? what can we gain there in the perfect case? (Any approach in theory can be replaced by a very long list of direct instructions, but we look for something more convenient than "direct instructions".)

Maybe we should try to answer those questions in general before trying to justify specific approaches. And I think there shouldn't be a conflict between different approaches: different approaches can share results and be combined in various ways. What do you think about that whole area "beyond human values"?
xuanΩ230

Because the rules are meant for humans, with our habits and morals and limitations, and our explicit understanding of them only works because they operate in an ecosystem full of other humans.  I think our rules/norms would fail to work if we tried to port them to a society of octopuses, even if those octopuses were to observe humans to try to improve their understanding of the object-level impact of the rules.


I think there's something to this, but I think perhaps it only applies strongly if and when most of the economy is run by or delegated to AI se... (read more)

xuanΩ220

But here I would expect people to reasonably disagree on whether an AI system or community of systems has made a good decision, and therefore it seems harder to ever fully trust machines to make decisions at this level. 

I hope the above is at least partially addressed by the last paragraph of the section on Reverse Engineering Roles and Norms! I agree with the worry, and to address it I think we could design systems that mostly just propose revisions or extrapolations to our current rules, or highlight inconsistencies among them (e.g. conflicting laws... (read more)

1phillchris
Hey! Absolutely, I think a lot of this makes sense. I assume you were meaning this paragraph with the Reverse Engineering Roles and Norms paragraph:

For both points here, I guess I was getting more at this question by asking these: how ought we structure this collaborative process? Like what constitutes feedback a machine sees to interactively improve with society? Who do AI interact with? What constitutes a datapoint in the moral learning process? These seem like loaded questions, so let me be more concrete.

In decisions without unanimity with regards to a moral fact, using simple majority rule, for example, could lead to disastrously bad moral theory: you could align an AI with norms resulting in the exploitation of 40% of the public by 60% of the public (for example, if a majority deems it moral to exploit / under-provide for a minority, in an extreme case). It strikes me that to prevent this kind of failure mode, there must be some baked-in context of "obviously wrong" beforehand. If you require total unanimity, well then, you will never get even a single datapoint: people will reasonably disagree (I would argue to infinity, after arbitrary amounts of reasonable debate) about basic moral facts due to differences in values.

I think this negotiation process is in itself really really important to get right if you advocate this kind of approach, and not by advancing any one moral view of the world. I certainly don't think it's impossible, just as it isn't impossible to have relatively well-functioning democracy. But this is the point I guess: are there limit guarantees to society agreeing after arbitrary lengths of deliberation? Has modern democracy / norm-setting historically arisen from mutual deliberation, or from exertion of state power / arbitrary assertion of one norm over another? I honestly don't have sufficient context to answer that, but it seems like a relevant empirical fact here.

Maybe another follow up: what are your idealized conditions for "rational / mutu
xuan*Ω110

Hmm, I'm confused --- I don't think I said very much about inner alignment, and I hope to have implied that inner alignment is still important! The talk is primarily a critique of existing approaches to outer alignment (eg. why human preferences alone shouldn't be the alignment target) and is a critique of inner alignment work only insofar as it assumes that defining the right training objective / base objective is not a crucial problem as well.

Maybe a more refined version of the disagreement is about how crucial inner alignment is, vs. defining the right ... (read more)

2Noosphere89
This is the crux. I actually think outer alignment, while hard, has possible solutions, but inner alignment has the nearly impossible task of aligning a mesa-optimizer, and ensuring that no deceptiveness ensues. I think this is nearly impossible under a simplicity prior regime, which is probably the most likely prior to work. I think inner alignment is more important than outer alignment. Don't get me wrong, this is a non-trivial advance, and I hope more such posts come. But I do want to lower expectations that will come with such posts.
xuanΩ230

Agreed that interpreting the law is hard, and the "literal" interpretation is not enough! Hence the need to represent normative uncertainty (e.g. a distribution over multiple formal interpretations of a natural language statement + having uncertainty over what terms in the contract are missing), which I see the section on "Inferring roles and norms" as addressing in ways that go beyond existing "reward modeling" approaches.

Let's call the above "wilful compliance", and the fully-fledged reverse engineering approach "enlightened compliance". It seems like... (read more)

2Charlie Steiner
I'd be interested :) I think my two core concerns are that our rules/norms are meant for humans, and that even then, actors often have bad impacts that would only be avoided with a pretty broad perspective about their responsibilities. So an AI that follows rules/norms well can't just understand them on the object level, it has to have a really good understanding of what it's like to be a human navigating these rules/norms, and use that understanding to make things go well from a pretty broad perspective.

That first one means that not only do I not want the AI to think about what rules mean "in a vacuum," I don't even want it to merely use human knowledge to refine its object-level understanding of the rules. Because the rules are meant for humans, with our habits and morals and limitations, and our explicit understanding of them only works because they operate in an ecosystem full of other humans. I think our rules/norms would fail to work if we tried to port them to a society of octopuses, even if those octopuses were to observe humans to try to improve their understanding of the object-level impact of the rules.

An example (maybe not great because it only looks at one dimension of the problem) is that our norms may implicitly assume a certain balance between memetic offense and defense that AIs would upset. E.g. around governmental lobbying (those are also maybe a bad example because they're kinda insufficient already).

Also! While listening to the latest Inside View podcast, it occurred to me that this perspective on AI safety has some natural advantages when translating into regulation that present governments might be able to implement to prepare for the future. If AI governance people aren't already thinking about this, maybe bother some / convince people in this comment section to bother some?
xuanΩ420

On the contrary, I think there exist large, complex, symbolic models of the world that are far more interpretable and useful than learned neural models, even if too complex for any single individual to understand, e.g.:

- The Unity game engine (a configurable model of the physical world)
- Pixar's RenderMan renderer (a model of optics and image formation)
- The GLEAMviz epidemic simulator (a model of socio-biological disease spread at the civilizational scale)

Humans are capable of designing and building these models, and learning how to build/write them as th... (read more)

8Steven Byrnes
I agree that gwern’s proposal “Any model simple enough to be interpretable is too simple to be useful” is an exaggeration. Even the Lake et al. handwritten-character-recognizer is useful. I would have instead said “Any model simple enough to be interpretable is too simple to be sufficient for AGI”. I notice that you are again bringing the discussion back to a comparison between program synthesis world-models versus deep learning world-models, whereas I want to talk about the possibility that neither would be human-interpretable by the time we reach AGI level.
xuanΩ670

Adding some thoughts as someone who works on probabilistic programming, and has colleagues who work on neurosymbolic approaches to program synthesis:

  • I think a lot of Bayes net structure learning / program synthesis approaches (Bayesian or otherwise) have the issue of uninformative variable names, but I do think it's possible to distinguish between structural interpretability and naming interpretability, as others have noted.
  • In practice, most neural or Bayesian program synthesis applications I'm aware of exhibit something like structural interpretability, b
... (read more)
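To illustrate the structural-vs-naming distinction mentioned above with a made-up example (mine, not from the comment): a synthesized program can have meaningless identifiers yet still expose its structure.

```python
# A toy "synthesized" program: the names (f2, v0, v1, v3) are uninformative,
# but the structure (a loop accumulating a running product) is still legible.
def f2(v0):
    v1 = 1
    for v3 in range(1, v0 + 1):
        v1 = v1 * v3
    return v1  # structurally recognizable as factorial despite the names

print(f2(5))  # 120
```

Structural interpretability is about being able to read off this kind of compositional organization; naming interpretability is the separate problem of attaching human-meaningful labels to the parts.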
5Steven Byrnes
Thanks for your reply! When I squint out towards the horizon, I see future researchers trying to do a Bayesian program synthesis thing that builds a generative model of the whole world—everything from "tires are usually black", to "it's gauche to wear white after labor day", to "in this type of math problem, maybe try applying the Cauchy–Schwarz inequality", etc. etc. etc.

I'm perfectly happy to believe that Lake et al. can program-synthesis a little toy generative model of handwritten characters such that it has structural interpretability. But I'm concerned that we'll work our way up to the thing in the previous paragraph, which might be a billion times more complicated, and it will no longer have structural interpretability. (And likewise I'm concerned that solutions to "uninformative variable names" won't scale—e.g., how are we going to automatically put English-language labels on the various intuitive models / heuristics that are involved when Ed Witten is thinking about math, or when MLK Jr is writing a speech?)

Nominally, I agree with this. But "relative to" is key here. Your takeaway seems to be "OK, great, let's do probabilistic generative models, they're better!". By contrast, my perspective is: "If we take the probabilistic generative model approach, we're in huge trouble with respect to interpretability, oh man this is really really bad, we gotta work on this ASAP!!! (Oh and by the way if we take the deep net approach then it's even worse.)".
xuanΩ230

This was a great read! I wonder how much you're committed to "brain-inspired" vs "mind-inspired" AGI, given that the approach to "understanding the human brain" you outline seems to correspond to Marr's computational and algorithmic levels of analysis, as opposed to the implementational level (see link for reference). In which case, some would argue, you don't necessarily have to do too much neuroscience to reverse engineer human intelligence. A lot can be gleaned by doing classic psychological experiments to validate the functional roles of various aspect... (read more)

3Steven Byrnes
Thanks! I guess my feeling is that we have a lot of good implementation-level ideas (and keep getting more), and we have a bunch of algorithm ideas, and psychology ideas and introspection and evolution and so on, and we keep piecing all these things together, across all the different levels, into coherent stories, and that's the approach I think will (if continued) lead to AGI.

Like, I am in fact very interested in "methods for fast and approximate Bayesian inference" as being relevant for neuroscience and AGI, but I wasn't really interested in it until I learned a bunch of supporting ideas about what part of the brain is doing that, and how it works on the neuron level, and how and when and why that particular capability evolved in that part of the brain. Maybe that's just me.

I haven't seen compelling (to me) examples of people going successfully from psychology to algorithms without stopping to consider anything whatsoever about how the brain is constructed. Hmm, maybe very early Steve Grossberg stuff? But he talks about the brain constantly now.

One reason it's tricky to make sense of psychology data on its own, I think, is the interplay between (1) learning algorithms, (2) learned content (a.k.a. "trained models"), (3) innate hardwired behaviors (mainly in the brainstem & hypothalamus). What you especially want for AGI is to learn about #1, but experiments on adults are dominated by #2, and experiments on infants are dominated by #3, I think.
xuanΩ110

Yup! And yeah I think those are open research questions -- inference over certain kinds of non-parametric Bayesian models is tractable, but not in general. What makes me optimistic is that humans in similar cultures have similar priors over vast spaces of goals, and seem to do inference over that vast space in a fairly tractable manner. I think things get harder when you can't assume shared priors over goal structure or task structure, both for humans and machines.

xuanΩ250

Belatedly reading this and have a lot of thoughts about the connection between this issue and robustness to ontological shifts (which I've written a bit about here), but I wanted to share a paper which takes a very small step in addressing some of these questions by detecting when the human's world model may diverge from a robot's world model, and using that as an explanation for why a human might seem to be acting in strange or counter-productive ways:

Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior
Siddharth Reddy, Anca D.

... (read more)
xuanΩ6140

Belatedly seeing this post, but I wanted to note that probabilistic programming languages (PPLs) are centered around this basic idea! Some useful links and introductions to PPLs as a whole:
- Probabilistic models of cognition (web book)
- WebPPL
- An introduction to models in Pyro
- Introduction to Modeling in Gen

And here's a really fascinating paper by some of my colleagues that tries to model causal interventions that go beyond Pearl's do-operator, by formalizing causal interventions as (probabilistic) program transformations:

Bayesian causal inference via pr

... (read more)
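For readers who haven't seen a PPL before, here is a minimal, entirely standard example of the "models as programs" idea, written in Pyro since it's one of the languages linked above; the specific coin-flip model is my own toy choice, not from the comment:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import Importance, EmpiricalMarginal

def coin_model(flips):
    # A generative model is just a Python program that makes random choices.
    fairness = pyro.sample("fairness", dist.Beta(2.0, 2.0))
    for i, flip in enumerate(flips):
        # Condition the program on observed data.
        pyro.sample(f"flip_{i}", dist.Bernoulli(fairness), obs=flip)

data = torch.tensor([1.0, 1.0, 0.0, 1.0])

# Inference operates on the program itself (here, simple importance sampling).
posterior = Importance(coin_model, num_samples=2000).run(data)
print(EmpiricalMarginal(posterior, sites="fairness").mean)
```

The same pattern scales from toy models like this to the structured generative models discussed above; the program is the model, and inference algorithms are generic procedures that run against it.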
1Rudi C
What useful problems do PPLs solve? Ideally some applications that are interesting for us non-corporate people. Can it be used for medical statistics (e.g., in nutrition)? (Any examples?) Is the reason it is not used the illiteracy of the scientists, or are the mainstream methods better?
xuanΩ450

Replying to the specific comments:

This still seems like a fair way to evaluate what the alignment community thinks about, but I think it is going to overestimate how parochial the community is. For example, if you go by "what does Stuart Russell think is important", I expect you get a very different view on the field, much of which won't be in the Alignment Newsletter.

I agree. I intended to gesture a little bit at this when I mentioned that "Until more recently, It’s also been excluded and not taken very seriously within traditional academia", because I th... (read more)

6Rohin Shah
Re: worries about "reward", I don't feel like I have a great understanding of what your worry is, but I'd try to summarize it as "while the abstraction of reward is technically sufficiently expressive, 1) it may not have the right inductive biases, and so the framework might fail in practice, and 2) it is not a good framework for thought, because it doesn't sufficiently emphasize many important concepts like logic and hierarchical planning". I think I broadly agree with those points if our plan is to explicitly learn human values, but it seems less relevant when we aren't trying to do that and are instead trying to

In this framework, "knowledge about what humans want" doesn't come from a reward function, it comes from something like GPT-3 pretraining. The AI system can "invent" whatever concepts are best for representing its knowledge, which includes what humans want. Here, reward functions should instead be thought of as akin to loss functions -- they are ways of incentivizing particular kinds of outputs. I think it's reasonable to think on priors that this wouldn't be sufficient to get logical / hierarchical behavior, but I think GPT and AlphaStar and all the other recent successes should make you rethink that judgment.

----

I agree that trend-following behavior exists. I agree that this means that work on deep learning is less promising than you might otherwise think. That doesn't mean it's the wrong decision; if there are a hundred other plausible directions, it can still be the case that it's better to bet on deep learning rather than try your hand at guessing which paradigm will become dominant next. To quote Rodney Brooks:

He also predicts that the "next big thing" will happen by 2027 (though I get the sense that he might count new kinds of deep learning architectures as a "big thing" so he may not be predicting something as paradigm-shifting as you're thinking). Whether to diversify depends on the size of the field; if you have 1 million alignment res
xuanΩ340

Thanks for this summary. Just a few things I would change:

  1. "Deep learning" instead of "deep reinforcement learning" at the end of the 1st paragraph -- this is what I meant to say, and I'll update the original post accordingly.
  2. I'd replace "nice" with "right" in the 2nd paragraph.
  3. "certain interpretations of Confucian philosophy" instead of "Confucian philosophy", "the dominant approach in Western philosophy" instead of "Western philosophy" -- I think it's important not to give the impression that either of these is a monolith.
4Rohin Shah
Done :)
xuanΩ780

Thanks for these thoughts! I'll respond to your disagreement with the framework here, and to the specific comments in a separate reply.

First, with respect to my view about the sources of AI risk, the characterization you've put forth isn't quite accurate (though it's a fair guess, since I wasn't very explicit about it). In particular:

  1. These days I'm actually more worried by structural risks and multi-multi alignment risks, which may be better addressed by AI governance than technical research per se. If we do reach super-intelligence, I think it's more like
... (read more)
6Rohin Shah
I agree with you on 1 and 2 (and am perhaps more optimistic about not building globally optimizing agents; I actually see that as the "default" outcome).

I think this is where I disagree. I'd offer two main reasons not to believe this:
1. Children learn to follow common sense, despite not having (explicit) meta-ethical and meta-normative beliefs at all. (Though you could argue that the relevant meta-ethical and meta-normative concepts are inherent in / embedded in / compiled into the human brain's "priors" and learning algorithm.)
2. Intuitively, it seems like sufficiently good imitations of humans would have to have (perhaps implicit) knowledge of "common sense". We can see this to some extent, where GPT-3 demonstrates implicit knowledge of at least some aspects of common sense (though I do not claim that it acts in accordance with common sense).

(As a sanity check, we can see that neither of these arguments would apply to the "learning human values" case.)

I'm going to assume that Quality Y is "normative" if determining whether an object X has quality Y depends on who is evaluating Y. Put another way, an independent race of aliens that had never encountered humans would probably not converge to the same judgments as we do about quality Y. This feels similar to the is-ought distinction: you cannot determine "ought" facts from "is" facts, because "ought" facts are normative, whereas "is" facts are not (though perhaps you disagree with the latter).

I think "common sense is normative" is sufficient to argue that a race of aliens could not build an AI system that had our common sense, without either the aliens or the AI system figuring out the right meta-normative concepts for humanity (which they presumably could not do without encountering humans first). I don't see why it implies that we cannot build an AI system that has our common sense. Even if our common sense is normative, its effects are widespread; it should be possible in theory to back out the conc
xuan*Ω340

In exchange for the mess, we get a lot closer to the structure of what humans think when they imagine the goal of "doing good." Humans strive towards such abstract goals by having a vague notion of what it would look and feel like, and by breaking down those goals into more concrete sub-tasks. This encodes a pattern of preferences over universe-histories that treats some temporally extended patterns as "states."

Thank you for writing this post! I've had very similar thoughts for the past year or so, and I think the quote above is exactly right. IMO, part of... (read more)

5Charlie Steiner
Oh wait, are you the first author on this paper? I didn't make the connection until I got around to reading your recent post. So when you talk about moving to a hierarchical human model, how practical do you think it is to also move to a higher-dimensional space of possible human-models, rather than using a few hand-crafted goals? This necessitates some loss function or prior probability over models, and I'm not sure how many orders of magnitude more computationally expensive it makes everything.
3Charlie Steiner
Sorry for being slow :) No, I haven't read anything of Bratman's. Should I? The synopsis looks like it might have some interesting ideas but I'm worried he could get bogged down in what human planning "really is" rather than what models are useful. I'd totally be happy to chat either here or in PMs.

Full Bayesian reasoning seems tricky if the environment is complicated enough to make hierarchical planning attractive - or do you mean optimizing a model for posterior probability (the prior being something like MML?) by local search?

I think one interesting question there is if it can learn human foibles. For example, suppose we're playing a racing game and I want to win the race, but fail because my driving skills are bad. How diverse a dataset about me do you need to actually be able to infer that a) I am capable of conceptualizing how good my performance is, b) I wanted it to be good, c) it wasn't good, from a hierarchical perspective, because of the lower-level planning faculties I have.

I think maybe you could actually learn this only from racing game data (no need to make an AGI that can ask me about my goals and do top-down inference), so long as you had diverse enough driving data to make the "bottom-up" generalization that my low-level driving skill can be modeled as bad almost no matter the higher-level goal, and therefore it's simplest to explain me not winning a race by taking the bad driving I display elsewhere as a given and asking what simple higher-level goal fits on top.
xuanΩ6160

Thanks for writing up this post! It's really similar in spirit to some research I've been working on with others, which you can find on the ArXiv here: https://arxiv.org/abs/2006.07532 We also model bounded goal-directed agents by assuming that the agent is running some algorithm given bounded compute, but our approach differs in the following ways:

  • We don't attempt to compute full policies over the state space, since this is generally intractable, and also cognitively implausible, at least for agents like ourselves. Instead, we compute (par
... (read more)
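To make "partial plans instead of full policies" concrete, here is a rough illustrative sketch (my own toy code, not the algorithm from the paper): a budget-limited best-first search that returns the most promising partial path from the current state rather than computing a policy over the whole state space.

```python
import heapq
import itertools

def partial_plan(start, goal_test, successors, heuristic, budget=100):
    """Budget-limited best-first search from the current state.

    Returns the most promising (possibly partial) path found within the
    node budget, rather than a full policy over the entire state space.
    Illustrative sketch only.
    """
    tie = itertools.count()  # tie-breaker so heapq never compares states
    frontier = [(heuristic(start), next(tie), 0, [start])]
    best_path = [start]
    expanded = 0
    while frontier and expanded < budget:
        _, _, cost, path = heapq.heappop(frontier)
        state = path[-1]
        expanded += 1
        if goal_test(state):
            return path  # found a complete plan within budget
        if heuristic(state) < heuristic(best_path[-1]):
            best_path = path  # remember the closest-to-goal partial plan
        for next_state, step_cost in successors(state):
            f = cost + step_cost + heuristic(next_state)
            heapq.heappush(frontier, (f, next(tie), cost + step_cost, path + [next_state]))
    return best_path  # act on the best partial plan found so far

# Toy usage: plan toward x == 5 on a number line, limited to 10 expansions.
print(partial_plan(0, lambda s: s == 5,
                   lambda s: [(s + 1, 1), (s - 1, 1)],
                   lambda s: abs(5 - s), budget=10))
```

The bounded-rationality angle is that the agent commits to actions on the basis of whatever partial plan the budget allowed, which is also what makes goal inference over such agents more cognitively plausible.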
2adamShimi
Sorry for the delay in answering. Your paper looks great! It seems to tackle in a clean and formal way what I was vaguely pointing at. We're currently reading a lot of papers and blog posts to prepare for an in-depth literature review about goal-directedness, and I added your paper to the list. I'll try to come back here and comment after I read it.