All of Nathaniel Monson's Comments + Replies

Minor nitpicks:

- I read "1 angstrom of uncertainty in 1 atom" as saying the location is normally distributed with mean <center> and SD 1 angstrom, or as uniformly distributed in a solid sphere of radius 1 angstrom. Taken literally, though, "perturb one of the particles by 1 angstrom in a random direction" is distributed on the surface of the sphere (the particle is known to be exactly 1 angstrom from <center>).

- The answer will absolutely depend on the temperature. (In a neighborhood of absolute zero, the final positions of the gas particles are very close...
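
A minimal sketch of the three readings being contrasted (illustrative Python with numpy; distances in angstroms, nothing here is from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_perturbation(sigma=1.0):
    """Reading 1: each coordinate of the displacement ~ N(0, sigma)."""
    return rng.normal(0.0, sigma, size=3)

def uniform_ball_perturbation(radius=1.0):
    """Reading 2: displacement uniform over the solid sphere of the given radius."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    r = radius * rng.uniform() ** (1 / 3)  # cube root makes the density uniform in volume
    return r * direction

def sphere_surface_perturbation(radius=1.0):
    """The literal reading: exactly `radius` away, in a uniformly random direction."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    return radius * direction
```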

"we don't know if deceptive alignment is real at all (I maintain it isn't, on the mainline)."

You think it isn't a substantial risk of LLMs as they are trained today, or that it isn't a risk of any plausible training regime for any plausible deep learning system? (I would agree with the first, but not the second)

ryan_greenblatt
See TurnTrout's shortform here for some more discussion.

I agree in the narrow sense of different from bio-evolution, but I think it captures something tonally correct anyway.

the gears to ascension
this has been an ongoing point of debate recently, and I think we can do much better than incorrect analogy to evolution.
Answer by Nathaniel Monson

I like "evolveware" myself.

the gears to ascension
it's distinctly not evolved. gradients vs selection-crossover-mutate are very different algos.

I'm not really sure how it ended up there--probably childhood teaching inducing that particular brain-structure? It's just something that was a fundamental part of who I understood myself to be, and how I interpreted my memories/experiences/sense-data. After I stopped believing in God, I definitely also stopped believing that I existed. Obviously, this-body-with-a-mind exists, but I had not identified myself as being that object previously--I had identified myself as the-spirit-inhabiting-this-body, and I no longer believed that existed.

This is why I added "for the first few". Let's not worry about the location, just say "there is a round cube" and "there is a teapot".

Before you can get to either of these axioms, you need some things like "there is a thing I'm going to call reality that it's worth trying to deal with" and "language has enough correspondence to reality to be useful". With those and some similar very low-level base axioms in place (and depending on your definitions of round and cube and teapot), I agree that one or another of the axioms could reasonably be called more or l...

Ninety-Three
All your examples of high-tier axioms seem to fall into the category of "necessary to proceed", the sort of thing where you can't really do any further epistemology if the proposition is false. How did the God axiom either have that quality or end up high on the list without it?
Answer by Nathaniel Monson

I don't think I believe in God anymore--certainly not in the way I used to--but I think if you'd asked me 3 years ago, I would have said that I take it as axiomatic that God exists. If you have any kind of consistent epistemology, you need some base beliefs from which to draw the conclusions and one of mine was the existence of an entity that cared about me (and everyone on earth) on a personal level and was sufficiently more wise/intelligent/powerful/knowledgeable than me that I may as well think of it as infinitely so.

I think the religious people I know...

Ninety-Three
Surely some axioms can be more rationally chosen than others. For instance, "There is a teapot orbiting the sun somewhere between Earth and Mars" looks like a silly axiom, but "there is a round cube orbiting the sun somewhere between Earth and Mars" looks even sillier. Assuming the possibility of round cubes seems somehow more "epistemically expensive" than assuming the possibility of teapots.

That's fair. I guess I'm used to linkposts which are either full, or a short enough excerpt that I can immediately see they aren't full.

I really appreciated both the original linked post and this one. Thank you, you've been writing some great stuff recently.

One strategy I have, as someone who simultaneously would like to be truth-committed and also occasionally jokes or teases loved ones ("the cake you made is terrible! No one else should have any, I'll sacrifice my taste buds to save everyone!") is to have triggers for entering quaker-mode; if someone asks me a question involving "really" or "actually", I try to switch my demeanour to clearly sincere, and give a literally honest answer. I... hope? that having an explicit mode of truth this way blunts some of the negatives of frequently functioning as an actor.

Screwtape
You are welcome, and thank you for saying so! I think the triggers for quaker-mode are a decent way of handling it. I try to use both triggers and switching based on mood, and to remember which people are more Quakerish and which are more Actorish, but that pile of heuristics is not always reliable. It mostly works! Sometimes it doesn't, and then I sort things out as best I can. One Parselmouth to another, I hope it works too.

I actually fundamentally agree with most/all of it, I just wanted a cookie :)

I strongly disagreed with all of this!

.

.

.
(cookie please!)

Have an internet cookie for stating there's a disagreement! Can you elaborate a little more?

Glad to, thanks for taking it well.

I think this would have been mitigated by something at the beginning saying "this is an excerpt of x words of a y word post located at url", so I can decide at the outset to read here, read there, or skip.

Is the reason you didn't put the entire thing here basically blog traffic numbers?

spencerg
At the top it says it’s a link post and links to the full post; I thought that would make it clear that it’s a link post, not a full post. It’s difficult to keep three versions in sync as I fix typos and correct mistakes, which is why I prefer to not have three separate full versions.

(I didn't downvote, but here's a guess) I enjoyed what there was of it, but I got really irritated by "This is not the full post - for the rest of it, including an in-depth discussion of the evidence for and against each of these theories, you can find the full version of this post on my blog". I don't know why this bothers me--maybe because I pay some attention to the "time to read" tag at the top, or because having to click through to a different page feels like an annoyance with no benefit to me.

spencerg
Thanks for letting me know

If you click the link where OP introduces the term, it's the Wikipedia page for psychopathy. Wiki lists 3 primary traits for it, one of which is DAE.

M. Y. Zuo
Is there a specific reason 'affective' was chosen instead of 'emotional' in the naming?  Is it also a connotation issue?

The statement seems like it's assuming:

  1. we know roughly how to build AGI

  2. we decide when to do that

  3. we use the time between now and then to increase the chance of successful alignment

  4. if we succeed in alignment early enough, you and your loved ones won't die

I don't think any of these are necessarily true, and I think the ways they could be false are asymmetric in a manner that favors caution.

Dagon
It's also assuming:

  1. We know roughly how to achieve immortality

  2. We can do that exactly in the window of "the last possible moment" of AGI.

  3. Efforts between immortality and AGI are fungible and exclusive, or at least related in some way.

  4. Ok, yeah - we have to succeed on BOTH alignment and immortality to keep any of us from dying.

3 and 4 are, I think, the point of the post. To the extent that we work on immortality rather than alignment, we narrow the window of #2, and risk getting neither.

I appreciated your post, (indeed, I found it very moving) and found some of the other comments frustrating as I believe you did. I think, though, that I can see a part of where they are coming from. I'll preface by saying I don't have strong beliefs on this myself, but I'll try to translate (my guess at) their world model.

I think the typical EA/LWer thinks that most charities are ineffective to the point of uselessness, and that this is due to them not being smart/rational about a lot of things (and is very familiar with examples like the Millennium Villages)...

Lyrongolem
Thanks so much for your comment!  Hm... yes, upon further reflection your summarization seems accurate, or at least highly plausible. I am not too sure what the mindset of the average LWer or EA looks like myself. (Although I've frequented the site for some time, I'm mainly reading random frontpage posts that pique my interest; I don't attend meetups, participate in group activities, or do much else of that nature.) It's not merely that it reads like I haven't engaged much in their world. The truth is I simply haven't, and I have no intention of hiding it. I tagged the post EA because my points on aid address charities in general quite broadly, and so I thought it would be of interest to EA-adjacent individuals. I also hoped that they might be able to enlighten me a bit on the many parts of EA I still don't fully understand. The post was never meant to critique or even focus on EA.

This may have gotten lost in everything else I was attempting to do in the post, but one of the central motivations was to disprove a point I saw in a RA fundraiser that unconditional cash transfers could 'eradicate' global poverty. I found the initiative commendable, but unrealistic for a variety of reasons, many of which I detailed in the post. I never meant to say the aid wouldn't help, but rather, it was likely insufficient to meet their goal of ending long term poverty.

That said, yes, you are right. My evidence does not support the claim that aid is completely ineffective in ending long term poverty. But rather, that aid requires much higher volumes to solve the long term issues, in conjunction with many other things. In my mind this still meant aid was an inadequate solution, since I didn't believe the volumes required to solve the issue would be a reasonable demand upon charity or foreign aid (just look at the enormous price tag of millennium villages). Thinking back, I probably exaggerated a bit in the title and in some of my claims. While the logical points may have been sound

This is more a tangent than a direct response--I think I fundamentally agree with almost everything you wrote--but I don't think virtue ethics requires tossing out the other two (although I agree both of the others require tossing out each other).

I view virtue ethics as saying, roughly, "the actually important thing almost always is not how you act in contrived edge-case thought experiments, but rather how you habitually act in day-to-day circumstances. Thus you should worry less, probably much much less, about said thought experiments, and worry more...

Matt Goldenberg
I think virtue ethics is a practical solution, but if you just say "if corner cases show up, don't follow it", then you're doing something other than being a virtue ethicist.

I agree with the first paragraph, but strongly disagree with the idea this is "basically just trying to align to human values directly".

Human values are a moving target in a very high dimensional space, which needs many bits to specify. At a given time, this needs one bit. A coinflip has a good shot. Also, to use your language, I think "human is trying to press the button" is likely to form a much cleaner natural abstraction than human values generally.
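
As a rough sketch of the bit-counting point (treating "bits to specify" literally, with made-up numbers):

$$
\Pr[\text{a random guess is right}] = 2^{-b},
$$

so a one-bit target like "is the human trying to press the button right now" gives a coinflip's $1/2$, while a specification of human values needing $b \gg 1$ bits gives an astronomically small chance.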

Finally, we talk about getting it wrong being really bad. But there's a strong asymmetry--one direction ...

EJT
Here's a problem that I think remains. Suppose you've got an agent that prefers to have the button in the state that it believes matches my preferences. Call these 'button-matching preferences.' If the agent only has these preferences, it isn't of much use. You have to give the agent other preferences to make it do useful work. And many patterns for these other preferences give the agent incentives to prevent the pressing of the button. For example, suppose the other preferences are: 'I prefer lottery X to lottery Y iff lottery X gives a greater expectation of discovered facts than lottery Y.' An agent with these preferences would be useful (it could discover facts for us), but it also has incentives to prevent shutdown: it can discover more facts if it remains operational.  And it seems difficult to ensure that the agent's button-matching preferences will always win out over these incentives.  In case you're interested, I discuss something similar here and especially in section 8.2.

If I had clear lines in my mind between AGI capabilities progress, AGI alignment progress, and narrow AI progress, I would be 100% with you on stopping AGI capabilities. As it is, though, I don't know how to count things. Is "understanding why neural net training behaves as it does" good or bad? (SLT's goal). Is "determining the necessary structures of intelligence for a given architecture" good or bad? (Some strands of mech interp). Is an LLM narrow or general?

How do you tell, or at least approximate? (These are genuine questions, not rhetorical)

In the spirit of "no stupid questions", why not have the AI prefer to have the button in the state that it believes matches my preferences?

I'm aware this fails against AIs that can successfully act highly manipulative towards humans, but such an AI is already terrifying for all sorts of other reasons, and I think the likelihood of this form of corrigibility making a difference given such an AI is quite low.

Is the answer roughly "we don't care about the off-button specifically that much, we care about getting the AI to interact with human preferences which are changeable without changing them"?

johnswentworth
Trying to change the human's preference to match the button is one issue there. The other issue is that if the AI incorrectly estimates the human's preferences (or, more realistically, we humans building the AI fail to operationalize "our preference re: button state", such that the thing the AI is aimed at doesn't match what we intuitively mean by that phrase), then that's really bad. Another frame: this would basically just be trying to align to human values directly, and has all the usual problems with directly aligning to human values, which is exactly what all this corrigibility-style stuff was meant to avoid.

Question for Jacob: suppose we end up getting a single, unique, superintelligent AGI, and the amount it cares about, values, and prioritizes human welfare relative to its other values is a random draw with probability distribution equal to how much random humans care about maximizing their total number of direct descendants.

Would you consider that an alignment success?

jacob_cannell
I actually answered that towards the end: So it'd be a random draw with a fairly high p(doom), so no, not a success in expectation relative to the futures I expect. In actuality I expect the situation to be more multipolar, and thus more robust due to utility function averaging. If power is distributed over N agents, each with a utility function that is variably but even slightly aligned to humanity in expectation, that converges with increasing N to full alignment at the population level[1].

[1] As we expect the divergence in agent utility functions to all be from forms of convergent selfish empowerment, which are necessarily not aligned with each other (i.e. none of the AIs are inter-aligned except through variable partial alignment to humanity).
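
One rough way to formalize the averaging claim (a sketch under the footnote's assumption that the selfish components share no common direction; the notation is mine, not the comment's): write each agent's utility as a humanity-aligned part plus an idiosyncratic selfish part,

$$
u_i = \alpha_i\, u_H + v_i, \qquad \frac{1}{N}\sum_{i=1}^{N} u_i = \Big(\frac{1}{N}\sum_{i} \alpha_i\Big) u_H + \frac{1}{N}\sum_{i} v_i .
$$

If the $\alpha_i$ are positive in expectation and the $v_i$ are roughly independent with bounded variance, the aligned term stays near $\bar{\alpha}\, u_H$ while the variance of the averaged selfish term shrinks like $1/N$ -- the sense in which the population-level objective converges toward alignment as $N$ grows.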

Thanks for writing this! I strongly appreciate a well-thought out post in this direction.

My own level of worry is pretty dependent on a belief that we know and understand how to shape NN behaviors much better than we know how to shape NN values/goals/motivations/desires (although I don't think e.g. chatGPT has any of the latter in the first place). Do you have thoughts on the distinction between behaviors and goals? In particular, do you feel like you have any evidence we know how to shape/create/guide goals and values, rather than just behaviors?

I don't think the end result is identical. If you take B, you now have evidence that, if a similar situation arises again, you won't have to experience excruciating pain. Your past actions and decisions are relevant evidence of future actions and decisions. If you take drug A, your chance of experiencing excruciating pain at some point in the future goes up (at least your subjective estimation of the probability should probably go up at least a bit.) I would pay a dollar to lower my best rational estimate of the chance of something like that happening to me--wouldn't you?

In the dual interest of increasing your pleasantness to interact with and your epistemic rationality, I will point out that your last paragraph is false. You are allowed to care about anything and everything you may happen to care about or choose to care about. As an aspiring epistemic rationalist, the way in which you are bound is to be honest with yourself about message-description lengths, and your own values and your own actions, and the tradeoffs they reflect.

If a crazy person holding a gun said to you (and you believed) "I will shoot you unless you t...

My understanding of the etymology of "toe the line" is that it comes from the military--all the recruits in a group lining up, with their toes touching (but never over!) a line. Hence "I need you all to toe the line on this" means "do exactly this, with military precision".

Dweomite
Yes.  (Which is very different from "stay out of this one forbidden zone, while otherwise doing whatever you want.")

I think I would describe both of those as deceptive, and was premising on non-deceptive AI.

If you think "nondeceptive AI" can refer to an AI which has a goal and is willing to mislead in service of that goal, then I agree; solving deception is insufficient. (Although in that case I disagree with your terminology).

Charlie Steiner
Fair point (though see also the section on how the training+deployment process can be "deceptive" even if the AI itself never searches for how to manipulate you). By "Solve deception" I mean that in a model-based RL kind of setting, we can know the AI's policy and its prediction of future states of the world (it doesn't somehow conceal this from us). I do not mean that the AI is acting like a helpful human who wants to be honest with us, even though that's a fairly natural interpretation.

I think the people I know well over 65 (my parents, my surviving grandparent, some professors) are trying to not get COVID--they go to stores only in off-peak hours, avoid large gatherings, don't travel much. These seem like basically worth-it decisions to me (low benefit, but even lower cost). This means that their chance of getting COVID is much, much higher when, e.g., seeing relatives who just took a plane flight to see them.

I agree that the flu is comparably worrisome, and it wouldn't make sense to get a COVID booster but not a flu vaccine.

Those don't necessarily seem correct to me. If, e.g., OpenAI develops a superintelligent, non-deceptive AI, then I'd expect some of the first questions they'd ask it to be of the form "are there questions which we would regret asking you, according to our own current values? How can we avoid asking you those while still getting lots of use and insight from you? What are some standard prefaces we should attach to questions to make sure following through on your answer is good for us? What are some security measures that we can take to make sure our users ...

Joe Collman
I think it's very important to be clear you're not conditioning on something incoherent here. In particular, [an AI that never misleads the user about anything (whether intentional or otherwise)] is incoherent: any statement an AI can make will update some of your expectations in the direction of being more correct, and some away from being correct. (it's important here that when a statement is made you don't learn [statement], but rather [x made statement]; only the former can be empty) I say non-misleading-to-you things to the extent that I understand your capabilities and what you value, and apply that understanding in forming my statements. [Don't ever be misleading] cannot be satisfied. [Don't ever be misleading in ways that we consider important] requires understanding human values and optimizing answers for non-misleadingness given those values. NB not [answer as a human would], or [give an answer that a human would approve of]. With a fuzzy notion of deception, it's too easy to do a selective, post-hoc classification and say "Ah well, that would be deception" for any outcome we don't like. But the outcomes we like are also misleading - just in ways we didn't happen to notice and care about. This smuggles in a requirement that's closer in character to alignment than to non-deception. Conversely, non-fuzzy notions of deception don't tend to cover all the failure modes (e.g. this is nice, but avoiding deception-in-this-sense doesn't guarantee much).
Daniel Kokotajlo
I tentatively agree and would like to see more in-depth exploration of failure modes + fixes, in the setting where we've solved deception. It seems important to start thinking about this now, so we have a playbook ready to go...
Charlie Steiner
EDIT: I should acknowledge that conditioning on a lot of "actually good" answers to those questions would indeed be reassuring. The point is more that humans are easily convinced by "not actually good" answers to those questions, if the question-answerer has been optimized to get human approval.

ORIGINAL REPLY: Okay, suppose you're an AI that wants something bad (like maximizing pleasure), and also has been selected to produce text that is honest and that causes humans to strongly approve of you. Then you're asked What honest answer can you think of that would cause humans to strongly approve of you, and will let you achieve your goals? How about telling the humans they would regret asking about how to construct biological weapons or similar dangerous technologies? How about appending text explaining your answer that changes the humans' minds to be more accepting of hedonic utilitarianism? If the question is extra difficult for you, like , dissemble! Say the question is unclear (all questions are unclear) and then break it down in a way that causes the humans to question whether they really want their own current desires to be stamped on the entire future, or whether they'd rather trust in some value extrapolation process that finds better, more universal things to care about.

Surely your self-estimated chance of exposure and the number of high-risk people you would in turn expose should factor in somewhere? I agree with you for people who aren't traveling, but someone who, e.g., flies into a major conference and then visits a retirement home the week after is doing a different calculation.

johnhalstead
I don't think that makes much difference because I don't think it has much effect on the total number of infections - you would really be changing the time at which someone gets the virus given that we're not trying to contain it anymore.  One way round the concern about visiting the retirement home would be to do a lateral flow test before you go in. If you're seeing extremely vulnerable people a lot, then it might be worth getting the vaccine. But the IFR is now lower than the flu for all ages and I think should be treated accordingly

When I started trying to think rigorously about this a few months ago, I realized that I don't have a very good definition of world model. In particular, what does it mean to claim a person has a world model? Given a criterion for an LLM to have one, how confident am I that most people would satisfy the criterion?

I think it is 2-way, which is why many (almost all?) alignment researchers have spent a significant amount of time looking at ML models and capabilities, and have guesses about where those are going.

In that case, I believe your conjecture is trivially true, but has nothing to do with human intelligence or Bengio's statements. In context, he is explicitly discussing low dimensional representations of extremely high dimensional data, and the things human brains learn to do automatically (I would say analogously to a single forward pass).

If you want to make it a fair fight, you either need to demonstrate a human who learns to recognize primes without any experience of the physical world (please don't do this) or allow an ML model something more analogous to the data humans actually receive, which includes math instruction, interacting with the world, many brain cycles, etc.

Alexander Kolpakov
I also believe my conjecture is true, however non-trivially. At least, mathematically non-trivially. Otherwise, all is trivial when the job is done.
Aidan Rocke
Regarding your remark on finding low-dimensional representations, I have added a section on physical intuitions for the challenge. Here I explain how the prime recognition problem corresponds to reliably finding a low-dimensional representation of high-dimensional data. 

I agree with your entire first paragraph. It doesn't seem to me that you have addressed my question though. You are claiming that this hypothesis "implies that machine learning alone is not a complete path to human-level intelligence." I disagree. If I try to design an ML model which can identify primes, is it fair for me to give it some information equivalent to the definition (no more information than a human who has never heard of prime numbers has)?

If you allow that it is fair for me to do so, I think I can probably design an ML model which will do thi...

Alexander Kolpakov
Does any ML model that tells cats from dogs get definitions thereof? I think the only input it gets is "picture:(dog/cat)label". It does learn to tell them apart, to some degree, at least. One would expect the same approach here. Otherwise you can ask right away for the sieve of Eratosthenes as a functional and inductive definition, in which case things get easy ...
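
For reference, the sieve mentioned above, as a minimal Python sketch (the standard algorithm, not code from either commenter):

```python
def primes_up_to(n: int) -> list[int]:
    """Sieve of Eratosthenes: return all primes <= n."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # Every multiple of p from p*p onward is composite.
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]

print(primes_up_to(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The question being debated is whether handing a learner something equivalent to this definition is fair, or whether it must recover the concept from labeled examples alone, the way a cats-vs-dogs classifier does.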

"implies that machine learning alone is not a complete path to human-level intelligence."

I don't think this is even a little true, unless you are using definitions of human level intelligence and machine learning which are very different than the ideas I have of them.

If you have a human who has never heard of the definition of prime numbers, how do you think they would do on this test? Am I allowed to supply my model with something equivalent to the definition?

Ilio
Let P, P’, P’’ = « machine learning alone », « machine learning + noise », « is not a complete path to human-level intelligence ». A few follow-up questions: do you also think that P+P’’ == P’+P’’? Is your answer proven, or more or less uncontroversial? (ref welcome!)
Aidan Rocke
The best physicists on Earth, including Edward Witten and Alain Connes, believe that the distribution of primes and Arithmetic Geometry encode mathematical secrets that are of fundamental importance to mathematical physics. This is why the Langlands program and the Riemann Hypothesis are of great interest to mathematical physicists. If number theory, besides being of fundamental importance to modern cryptography, allows us to develop a deep understanding of the source code of the Universe then I believe that such advances are a critical part of human intelligence, and would be highly unlikely if the human brain had a different architecture.

Have you looked into New Angeles? Action choices are cooperative, with lots of negotiation. Each player is secretly targeting another player, and wins if they end with more points than their target (so you could have a 6-player game where the people who ended with the most, and 4th and 5th most, win, while 2nd, 3rd, and 6th lose).

mako yass
Hmm, isn't that a situation where the number of people who will win is normally distributed with mean n/2 when people don't know who's targeting who, but under transparency you could reliably have n-1 people win? (By picking one scapegoat, then allocating points to people in ascending order from the one who must beat the scapegoat, and then the one who must beat them, and so on, until the one who the scapegoat was to beat?) I often get the sense that these games are broken for enlightened players; the way to win is to coordinate, but the game implicitly communicates that you're not supposed to, which is so wrong.
mako yass
I'd like to try this game, but it's extraordinary that they managed to make a multi-winner game that is still all about outscoring others.
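
A quick toy check of the scapegoat allocation described above (my own illustration; the player names, targeting cycle, and point values are all hypothetical):

```python
# Toy check of the scapegoat scheme: 6 players in a targeting cycle, and a
# player wins if they end with strictly more points than their target.
players = ["A", "B", "C", "D", "E", "F"]
target = {"A": "B", "B": "C", "C": "D", "D": "E", "E": "F", "F": "A"}

scapegoat = "A"
points = {scapegoat: 0}
# Walk backwards around the cycle from the scapegoat, giving each successive
# player one more point than the player they must beat.
current = next(p for p, t in target.items() if t == scapegoat)
score = 1
while current != scapegoat:
    points[current] = score
    score += 1
    current = next(p for p, t in target.items() if t == current)

winners = [p for p in players if points[p] > points[target[p]]]
print(points)   # {'A': 0, 'F': 1, 'E': 2, 'D': 3, 'C': 4, 'B': 5}
print(winners)  # ['B', 'C', 'D', 'E', 'F'] -- everyone but the scapegoat wins
```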

This comment confuses me.

  1. Why is Tristan in quotes? Do you not believe it's his real name?
  2. What is the definition of the community you're referring to?
  3. I don't think I see any denigration happening--what are you referring to?
  4. What makes someone an expert or an imposter in your eyes? In the eyes of the community?

I clicked the link in the second email quite quickly--I assumed it was a game/joke, and wanted to see what would happen. If I'd actually thought I was overriding people's preferences, I... probably would have still clicked, because I don't think I place enormous value on people's preferences for holiday reasons, and I would have enjoyed being the person who determined it.

There are definitely many circumstances where I wouldn't unilaterally override a majority. I should probably try to figure out what the principles behind those are.

CBiddulph
Came here to say this - I also clicked the link because I wanted to see what would happen. I wouldn't have done it if I hadn't already assumed it was a social experiment.

I have a strong preference for non-ironic epistemic status. Can you give one?

rogersbacon
then read it again but non-ironically 

If the review panel recommends a paper for a spotlight, there is a better than 50% chance a similarly-constituted review panel would have rejected the paper from the conference entirely:

https://blog.neurips.cc/2021/12/08/the-neurips-2021-consistency-experiment/

Not OP, but relevant--I spent the last ~6 months going to meetings with [biggest name at a top-20 ML university]'s group. He seems to me like a clearly very smart guy (and very generous in allowing me to join), but I thought it was quite striking that almost all his interests were questions of the form "I wonder if we can get a model to do x", or "if we modify the training in way y, what will happen?" A few times I proposed projects about "maybe if we try z, we can figure out why b happens" and he was never very interested--a near-exact quote of his in ...

To me, it sounds like A is a member of a community which A wants to have certain standards and B is claiming membership in that community while not meeting those. In that circumstance, I think a discussion between various members of the community about obligations to be part of that community and the community's goals and beliefs and how these things relate is very very good. Do you

A) disagree with that framing of the situation in the dialogue

B) disagree that in the situation I described a discussion is virtuous, verging on necessary

C) other?

Said Achmiz
Indeed, I disagree with that characterization of the situation in the dialogue. For one thing, there’s no indication that Bob is claiming to be a member of anything. He’s “interested in Effective Altruism”, and he “want[s] to help others and … genuinely care[s] about positive impact, and ethical obligations, and utilitarian considerations”, and he also (according to Alice!) “claim[s] to really care about improving the world”, and (also according to Alice!) “claim[s] to be a utilitarian”. But membership in some community? I see no such claim on Bob’s part. But also, and perhaps more importantly: suppose for a moment that “Effective Altruism” is, indeed, properly understood as a “community”, membership in which it is reasonable to gatekeep in the sort of way you describe.[1] It might, then, make sense for Alice to have a discussion with Carol, Dave, etc.—all of whom are members-in-good-standing of the Effective Altruism community, and who share Alice’s values, as well as her unyielding commitment thereto—concerning the question of whether Bob is to be acknowledged as “one of us”, whether he’s to be extended whatever courtesies and privileges are reserved for good Effective Altruists, and so on. However, the norm that Bob, himself, is answerable to Alice—that he owes Alice a justification for his actions, that Alice has the right to interrogate Bob concerning whether he’s living up to his stated values, etc.—that is a deeply corrosive norm. It ought not be tolerated. Note that this is different from, say, engaging a willing Bob in a discussion about what his behavior should be (or about any other topic whatsoever)! This is a key aspect of the situation: Bob has expressed that he considers his behavior none of Alice’s business, but Alice asserts the standing to interrogate Bob anyway, on the reasoning that perhaps she might convince him after all. It’s that which makes Bob’s failure to stand up for his total lack of obligation to answer to Alice for his actions dep

Lots of your comments on various posts seem rude to me--should I be attempting to severely punish you?

Said Achmiz
The behavior I was referring to, specifically, is not rudeness (or else I’d have quoted Alice’s first comment, not her second one), but rather Alice taking as given the assumption that she has some sort of claim on Bob’s reasons for his actions—that Bob has some obligation to explain himself, to justify his actions and his reasons, to Alice. It is that assumption which must be firmly and implacably rejected at once. Bob should make clear to Alice that he owes her no explanations and no justifications. By indulging Alice, Bob is giving her power over himself that he has no reason at all to surrender. Such concessions are invariably exploited by those who wish to make use of others as tools to advance their own agenda. Bob’s first response was correct. But—out of weakness, lack of conviction, or some other flaw—he didn’t follow up. Instead, he succumbed to the pressure to acknowledge Alice’s claim to be owed a justification for his actions, and thus gave Alice entirely undeserved power. That was a mistake—and what’s more, it’s a mistake that, by incentivizing Alice’s behavior, has anti-social consequences, which degrade the moral fabric of Bob’s community and society.

I am genuinely confused why this is on LessWrong instead of the EA Forum. What do you think the distribution of giving money is like in each place, and what do you think the distribution of responses to the Drowning Child argument is like in each?

Firinn
Hmm, I think I could be persuaded into putting it on the EA Forum, but I'm mildly against it:

* It is literally about rationality, in the sense that it's about the cognitive biases and false justifications and motivated reasoning that cause people to conclude that they don't want to be any more ethical than they currently are; you can apply the point to other ethical systems if you want, like, Bob could just as easily be a religious person justifying why he can't be bothered to do any pilgrimages this year while Alice is a hotshot missionary or something. I would hope that lots of people on LW want to work harder on saving the world, even if they don't agree with the Drowning Child thing; there are many reasons to work harder on x-risk reduction.
* It's the sort of spicy that makes me worried that EAs will consider it bad PR, whereas rationalists are fine with spicy takes because we already have those in spades. I think people can effectively link to it no matter where it is, so posting it in more places isn't necessarily beneficial?
* I don't agree with everything Alice says but I do think it's very plausible that EA should be a big tent that welcomes everyone - including people who just want to give 10% and not do anything else - whereas my personal view is that the rationality community should probably be more elitist; we're supposed to be a self-improve-so-hard-that-you-end-up-saving-the-world group, damnit, not a book club for insight porn. Also it's going to be part of a sequence (conditional on me successfully finishing the other posts), and I feel like the sequence overall belongs more on LW.

I genuinely don't really know how the response to the Drowning Child differs between LW and EA! I guess I would probably say more people on the EA Forum probably donate money to charity for Drowning-Child-related reasons, but more people on LW are probably interested in philosophy qua philosophy and probably more people on LW switched careers to directly work

Minor semantic quibble: I would say we always want positive expected utility, but how that translates into money/time/various intangibles can vary tremendously both situationally and from person to person.

This was very interesting, thanks for writing it :)

My zero-knowledge instinct is that sound-wave communication would be very likely to evolve in most environments. Motion -> pressure differentials seems pretty inevitable, so would almost always be a useful sensory modality. And any information channel that is easy to both sense and affect seems likely to be used for communication. Curious to hear your thoughts if your intuition is that it would be rare.

mruwnik
This depends on the size and distances involved, but it's a good intuition. You need a mechanism to generate the pressure differentials, which can be an issue in very small organisms. Small and sedentary organisms tend to use chemical gradients (i.e. smell), but anything bigger than a mouse (and quite a few smaller things) usually has some kind of sound signals, which are really good for quick notifications in a radius around you, regardless of the light level (so you can pretty much always use it). Also, depending on the medium, sound can travel really far - like whales which communicate with each other over thousands of miles, or elephants stomping to communicate with other elephants 20 miles away.

Do you have candidates for intermediate views? Many-drafts which seem convergent, or fuzzy Cartesian theatres? (Maybe graph-theoretically translating to nested subnetworks of neurons where we might say "this set is necessarily core, this larger set is semicore/core in frequent circumstances, this still larger set is usually un-core, but changeable, and outside this is nothing"?)

TAG
There's an argument that a distributed mind needs to have some sort of central executive, even if fuzzily defined, in order to make decisions about actions ... just because there is ultimately one body to control ...and it can't do contradictory things, and it can't rest in endless indecision. Consider the Lamprey:

"How does the lamprey decide what to do? Within the lamprey basal ganglia lies a key structure called the striatum, which is the portion of the basal ganglia that receives most of the incoming signals from other parts of the brain. The striatum receives “bids” from other brain regions, each of which represents a specific action. A little piece of the lamprey’s brain is whispering “mate” to the striatum, while another piece is shouting “flee the predator” and so on. It would be a very bad idea for these movements to occur simultaneously – because a lamprey can’t do all of them at the same time – so to prevent simultaneous activation of many different movements, all these regions are held in check by powerful inhibitory connections from the basal ganglia. This means that the basal ganglia keep all behaviors in “off” mode by default. Only once a specific action’s bid has been selected do the basal ganglia turn off this inhibitory control, allowing the behavior to occur. You can think of the basal ganglia as a bouncer that chooses which behavior gets access to the muscles and turns away the rest. This fulfills the first key property of a selector: it must be able to pick one option and allow it access to the muscles." (Scott Alexander)

But how can a selector make a decision on the basis of multiple drafts which are themselves equally weighted? If inaction is not an option, a coin needs to be flipped. Maybe it's flipped in the theatre, maybe it's cast in the homunculus, maybe there is no way of telling. But you can tell it works that way because of things like the Necker Cube illusion...your brain, as they say, can switch between two interpretations, b
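
A toy sketch of the "bouncer" dynamic the quoted passage describes (purely illustrative Python -- the actions and bid strengths are made up, and this is not a model of the basal ganglia):

```python
import random

# Several subsystems submit "bids" for control of the one body; everything is
# inhibited by default and exactly one action is released to the muscles.
bids = {"mate": 0.62, "flee the predator": 0.61, "feed": 0.35}

def select_action(bids: dict[str, float]) -> str:
    """Release the strongest bid; break exact ties with a coin flip."""
    best = max(bids.values())
    tied = [action for action, strength in bids.items() if strength == best]
    return random.choice(tied)  # the "coin flip" when drafts are equally weighted

print(select_action(bids))  # one behavior gets the muscles; the rest stay inhibited
```
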
Rafael Harth
I think the philosophical component of the camps is binary, so intermediate views aren't possible. On the empirical side, the problem is that it's not clear what evidence for one side over the other looks like. You kind of need to solve this first to figure out where on the spectrum a physical theory falls.
Answer by Nathaniel Monson

The conversations I've had with people at DeepMind, OpenAI, and in academia make me very sure that lots of ideas on capabilities increases are already out there, so there's a high chance anything you suggest would be something people are already thinking about. Possibly running your ideas past someone in those circles, and sharing anything they think is unoriginal, would be safe-ish?

I think one of the big bottlenecks is a lack of ways to predict how much different ideas would help without actually trying them at costly large scale. Unfortunately, this is also a barrier to good alignment work. I don't have good ideas on making differential progress on this.

I think lots of people would say that all three examples you gave are more about signalling than about genuinely attempting to accomplish a goal.

junk heap homotopy
I wouldn’t say that. Signalling the way you seem to have used it implies deception on their part, but each of these instances could just be a skill issue on their end, an inability to construct the right causal graph with sufficient resolution. For what it’s worth whatever this pattern is pointing at also applies to how wrongly most of us got the AI box problem, i.e., that some humans by default would just let the damn thing out without needing to be persuaded.

This seems like kind of a nonsense double standard. The declared goal of journalism is usually not to sell newspapers; that is your observation of the incentive structure. And while the declared goal of LW is to arrive at truth (or something similar--hone the skills which will better allow people to arrive at truth, or something), there are comparable parallel incentive structures to journalism.

It seems better to compare declared purpose to declared purpose, or inferred goal to inferred goal, doesn't it?

Adam Zerner
Yes, but in my judgement -- and I suspect if you averaged out the judgement of reasonable others (not limited to LessWrongers) -- LW has an actual goal that is much, much closer to arriving at the truth than journalism.