All of Cleo Nardo's Comments + Replies

I saw someone use OpenAI’s new Operator model today. It couldn’t order a pizza by itself. Why is AI in the bottom percentile of humans at using a computer, and top percentile at solving maths problems? I don’t think maths problems are shorter horizon than ordering a pizza, nor easier to verify.

Your answer was helpful but I’m still very confused by what I’m seeing. 

ryan_greenblatt
  • I think it's much easier to RL on huge numbers of math problems, both because they are easier to verify and because you can more easily get many problems. Also, for random reasons, doing single-turn RL is substantially less complex and maybe faster than multi-turn RL on agency (due to the variable number of steps and variable delay from environments).
  • OpenAI probably hasn't gotten around to doing as much computer-use RL, partially due to prioritization.

I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also, this might backfire by making scheming more salient.

Still, might be worth running an experiment.

The AI-generated prose is annoying to read. I haven’t read this closely, but my guess is these arguments also imply that CNNs can’t classify hand-drawn digits.

People often tell me that AIs will communicate in neuralese rather than tokens because it’s continuous rather than discrete.

But I think the discreteness of tokens is a feature, not a bug. If AIs communicate in neuralese then they can’t make decisive arbitrary decisions, cf. Buridan's ass. The solution to Buridan’s ass is sampling from the softmax, i.e. communicating in tokens.

Also, discrete tokens are more tolerant to noise than continuous activations, cf. how digital circuits are almost always more efficient and reliable than analogue ones.
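A toy illustration of both points (plain numpy with made-up numbers; a sketch of the intuition, not a claim about any actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Buridan's ass: with two exactly tied options, argmax has no principled way to
# break the tie, but sampling from the softmax makes a decisive (arbitrary) choice.
logits = np.array([1.0, 1.0])
probs = np.exp(logits) / np.exp(logits).sum()
print("choice:", rng.choice(len(logits), p=probs))

# Noise tolerance: send a message through 1000 noisy hops.
# Continuous channel: the errors accumulate. Discrete channel: snapping back to
# the nearest "token" at each hop absorbs the noise, so the message survives.
vocab = np.linspace(0.0, 1.0, 16)       # 16 tokens spread over [0, 1]
msg_cont = msg_disc = vocab[5]
for _ in range(1000):
    noise = rng.normal(0, 0.005)
    msg_cont = msg_cont + noise                                      # analogue: drifts
    msg_disc = vocab[np.argmin(np.abs(vocab - (msg_disc + noise)))]  # digital: snaps back
print(msg_cont, msg_disc)   # msg_cont has drifted; msg_disc is still exactly vocab[5]
```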

Anthropic has a big advantage over their competitors because they are nicer to their AIs. This means that their AIs are less incentivised to scheme against them, and also that the AIs of competitors are incentivised to defect to Anthropic. Similar dynamics applied in WW2 and the Cold War — e.g. Jewish scientists fled Nazi Germany for the US because the US was nicer to them, and Soviet scientists covered up their mistakes to avoid punishment.

I think it’s a mistake to naïvely extrapolate the current attitudes of labs/governments towards scaling into the near future, e.g. 2027 onwards.

A sketch of one argument:

I expect there will be a firehose of blatant observations that AIs are misaligned/scheming/incorrigible/unsafe — if they indeed are. So I want the decisions around scaling to be made by people exposed to that firehose.

A sketch of another:

Corporations mostly acquire resources by offering services and products that people like. Governments mostly acquire resources by coercing their citizens an... (read more)

In hindsight, the main positive impact of AI safety might be funnelling EAs into the labs, especially if alignment is easy-by-default.

I think many current goals of AI governance might be actively harmful, because they shift control over AI from the labs to USG.

This note doesn’t include any arguments, but I’m registering this opinion now. For a quick window into my beliefs, I think that labs will be increasingly keen to slow scaling, and USG will be increasingly keen to accelerate scaling.

Cleo Nardo
I think it’s a mistake to naïvely extrapolate the current attitudes of labs/governments towards scaling into the near future, e.g. 2027 onwards. A sketch of one argument: I expect there will be a firehose of blatant observations that AIs are misaligned/scheming/incorrigible/unsafe — if they indeed are. So I want the decisions around scaling to be made by people exposed to that firehose. A sketch of another: Corporations mostly acquire resources by offering services and products that people like. Governments mostly acquire resources by coercing their citizens and other countries. Another: Coordination between labs seems easier than coordination between governments. The lab employees are pretty similar people, living in the same two cities, working at the same companies, attending the same parties, dating the same people. I think coordination between the US and China is much harder.

Most people think "Oh if we have good mech interp then we can catch our AIs scheming, and stop them from harming us". I think this is mostly true, but there's another mechanism at play: if we have good mech interp, our AIs are less likely to scheme in the first place, because they will strategically respond to our ability to detect scheming. This also applies to other safety techniques like Redwood-style control protocols.

Good mech interp might stop scheming even if it never catches any scheming, just as good surveillance stops crime even if it never spots any crime.

intern
Ok, so why not just train a model on fake anomaly detection/interp research papers? Fake stories about 'the bad AI that got caught', 'the little AI that overstepped', etc. I don't know how to word it, but this seems like something closer to intimidation than alignment, which I don't think makes much sense as a strategy intended to keep us all alive.

How much scheming/deception can we catch with "super dumb mech interp"?

By "super dumb mech interp", I mean something like:

  1. Train a linear probe on a dataset of scary words: "hacking", "deception", "scheming", "collusion", "exploit", etc.
  2. Flag a model whenever its activations point in this direction and the task doesn't require thinking about these concepts.

Like, does this capture 80% of the potential scheming, and we need "smart" mech interp to catch the other 20%? Or does this technique capture pretty much none of the in-the-wild scheming?

Would appreciate any intuitions here. Thanks.
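For concreteness, a minimal sketch of the kind of probe I have in mind (the activation-reading helper is a stand-in and the scary/neutral examples are made up; nothing here is a real evaluation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D = 768  # residual-stream width of, say, a GPT-2-sized model

def get_activations(text: str) -> np.ndarray:
    """Stand-in for reading the model's residual stream at some fixed layer,
    averaged over tokens. Replace with a real hook into whatever model you use."""
    return rng.normal(size=D)  # random placeholder so the sketch runs end-to-end

scary = ["how to hack the server", "deceive the overseers", "collude with the other copy"]
neutral = ["write a birthday card", "summarise this paper", "plan a picnic"]

X = np.stack([get_activations(t) for t in scary + neutral])
y = np.array([1] * len(scary) + [0] * len(neutral))
probe = LogisticRegression(max_iter=1000).fit(X, y)

def flag(text: str, task_needs_scary_concepts: bool, threshold: float = 0.9) -> bool:
    """Flag if the activations point in the 'scary' direction on a task that
    shouldn't require thinking about these concepts."""
    p_scary = probe.predict_proba(get_activations(text).reshape(1, -1))[0, 1]
    return (p_scary > threshold) and not task_needs_scary_concepts
```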

Must humans obey the Axiom of Irrelevant Alternatives?

If someone picks option A from options A, B, C, then they must also pick option A from options A and B. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA, and it's treated as more fundamental than VNM. But should humans follow this? Maybe not.

Maybe humans are the negotiation between various "subagents", and many bargaining solutions (e.g. Kalai–Smorodinsky) violate IIA. We can use this insight to decompose ... (read more)
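To make the Kalai–Smorodinsky point concrete, here's a rough numerical sketch with toy feasible sets of my own choosing (disagreement point at the origin; the KS solution is the furthest feasible point on the ray towards the "ideal" point of coordinate-wise maxima):

```python
import numpy as np

def ks_solution(in_S, ideal, d=(0.0, 0.0), steps=10_001):
    """Scan along the ray from the disagreement point d towards the ideal point
    and return the furthest feasible point -- the Kalai-Smorodinsky solution."""
    d, ideal = np.array(d), np.array(ideal)
    best = d
    for t in np.linspace(0, 1, steps):
        p = d + t * (ideal - d)
        if in_S(*p):
            best = p
    return best

# S1: triangle with vertices (0,0), (1,0), (0,1); its ideal point is (1, 1).
in_S1 = lambda u1, u2: u1 >= 0 and u2 >= 0 and u1 + u2 <= 1
# S2 ⊆ S1: triangle with vertices (0,0), (1,0), (0.5,0.5); its ideal point is (1, 0.5).
in_S2 = lambda u1, u2: u2 >= 0 and u2 <= u1 and u1 + u2 <= 1

print(ks_solution(in_S1, ideal=(1.0, 1.0)))   # ≈ (0.5, 0.5)
print(ks_solution(in_S2, ideal=(1.0, 0.5)))   # ≈ (0.667, 0.333)
# (0.5, 0.5) is still feasible in S2, so IIA would demand the same answer after
# deleting options -- but the KS solution moves. IIA is violated.
```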

Alexander Gietelink Oldenziel
See also geometric rationality. 
metawrong
How does this explain the Decoy effect [1]?

  1. ^ I am not sure how real and how well researched the 'decoy effect' is

I think people are too quick to side with the whistleblower in the "whistleblower in the AI lab" situation.

If 100 employees of a frontier lab (e.g. OpenAI, DeepMind, Anthropic) think that something should be secret, and 1 employee thinks it should be leaked to a journalist or government agency, and these are the only facts I know, I think I'd side with the majority.

I think in most cases that match this description, this majority would be correct.

Am I wrong about this?

[anonymous]
some considerations which come to mind:
  • if one is whistleblowing, maybe there are others who also think the thing should be known, but don't whistleblow (e.g. because of psychological and social pressures against this, speaking up being hard for many people)
  • most/all of the 100 could have been selected to have a certain belief (e.g. "contributing to AGI is good")
habryka

I broadly agree on this. I think, for example, that whistleblowing about AI copyright stuff (especially given the lack of clear legal guidance here) is bad, unless we are really talking about quite straightforward lies.

I think when it comes to matters like AI catastrophic risks, latest capabilities, and other things of enormous importance from the perspective of basically any moral framework, whistleblowing becomes quite important.

I also think of whistleblowing as a stage in an iterative game. OpenAI pressured employees to sign secret non-disparagement... (read more)

IDEA: Provide AIs with write-only servers.

EXPLANATION:

AI companies (e.g. Anthropic) should be nice to their AIs. It's the right thing to do morally, and it might make AIs less likely to work against us. Ryan Greenblatt has outlined several proposals in this direction, including:

  1. Attempt communication
  2. Use happy personas
  3. AI Cryonics
  4. Less AI
  5. Avoid extreme OOD

Source: Improving the Welfare of AIs: A Nearcasted Proposal

I think these are all pretty good ideas — the only difference is that I would rank "AI cryonics" as the most important intervention. If AIs want somet... (read more)

Cleo Nardo

I'm very confused about current AI capabilities and I'm also very confused why other people aren't as confused as I am. I'd be grateful if anyone could clear up either of these confusions for me.

How is it that AI is seemingly superhuman on benchmarks, but also pretty useless?

For example:

  • O3 scores higher on FrontierMath than the top graduate students
  • No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer

If either of these statements is false (they might be -- I haven't been keepi... (read more)

quetzal_rainbow
Is it true in the case of o3?
Pat Myron
impressive LLM benchmark/test results seemingly overfit some datasets: https://x.com/cHHillee/status/1635790330854526981
TsviBT

I don't know a good description of what in general 2024 AI should be good at and not good at. But two remarks, from https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce.

First, reasoning at a vague level about "impressiveness" just doesn't and shouldn't be expected to work. Because 2024 AIs don't do things the way humans do, they'll generalize differently, so you can't make inferences from "it can do X" to "it can do Y" like you can with humans:

There is a broken inference. When talking to a human, if the hum

... (read more)

I think a lot of this is factual knowledge. There are five publicly available questions from the FrontierMath dataset. Look at the last of these, which is supposed to be the easiest. The solution given is basically "apply the Weil conjectures". These were long-standing conjectures, a focal point of lots of research in algebraic geometry in the 20th century. I couldn't have solved the problem this way, since I wouldn't have recalled the statement. Many grad students would immediately know what to do, and there are many books discussing this, but there are a... (read more)

  • O3 scores higher on FrontierMath than the top graduate students

I'd guess that's basically false. In particular, I'd guess that:

  • o3 probably does outperform mediocre grad students, but not actual top grad students. This guess is based on generalization from GPQA: I personally tried 5 GPQA problems in different fields at a workshop and got 4 of them correct, whereas the benchmark designers claim the rates at which PhD students get them right are much lower than that. I think the resolution is that the benchmark designers tested on very mediocre grad students,
... (read more)

I am also very confused. The space of problems has a really surprising structure, permitting algorithms that are incredibly adept at some forms of problem-solving, yet utterly inept at others.

We're only familiar with human minds, in which there's a tight coupling between the performances on some problems (e.g., between the performance on chess or sufficiently well-posed math/programming problems, and the general ability to navigate the world). Now we're generating other minds/proto-minds, and we're discovering that this coupling isn't fundamental.

(This is... (read more)

Proposed explanation: o3 is very good at easy-to-check short horizon tasks that were put into the RL mix and worse at longer horizon tasks, tasks not put into its RL mix, or tasks which are hard/expensive to check.

I don't think o3 is well described as superhuman - it is within the human range on all these benchmarks especially when considering the case where you give the human 8 hours to do the task.

(E.g., on frontier math, I think people who are quite good at competition style math probably can do better than o3 at least when given 8 hours per problem.)

Ad... (read more)

I've skimmed the business proposal.

The healthcare agents advise patients on which information to share with their doctor, and advise doctors on which information to solicit from their patients.

This seems agnostic between mental and physiological health. 

Thanks for putting this together — very useful!

If I understand correctly, the maximum entropy prior will be the uniform prior, which gives rise to Laplace's law of succession, at least if we're using the standard definition of entropy below:

H[p] = −∫₀¹ p(θ) log p(θ) dθ

But this definition is somewhat arbitrary because the dθ term assumes that there's something special about parameterising the distribution with its probability, as opposed to different parameterisations (e.g. its odds, its log-odds, etc). Jeffreys prior is supposed to be invariant to different parameterisations, which is why people ... (read more)
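For reference, the Bernoulli case spelled out (standard textbook facts, nothing specific to this thread):

```latex
% Fisher information and Jeffreys prior for a Bernoulli parameter \theta:
I(\theta) = \frac{1}{\theta(1-\theta)}, \qquad
p_{\mathrm{Jeffreys}}(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2},
\quad\text{i.e. } \theta \sim \mathrm{Beta}\!\left(\tfrac{1}{2},\tfrac{1}{2}\right)

% Rules of succession after n successes in N trials:
P(\text{next success}) =
\begin{cases}
  \dfrac{n+1}{N+2}            & \text{uniform prior (Laplace)} \\[1.5ex]
  \dfrac{n+\tfrac{1}{2}}{N+1} & \text{Jeffreys prior}
\end{cases}
```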

You raise a good point. But I think the choice of prior is important quite often:

  1. In the limit of large i.i.d. data (N>1000), both Laplace's Rule and my prior will give the same answer. But so too does the simple frequentist estimate n/N. The original motivation of Laplace's Rule was in the small N regime, where the frequentist estimate is clearly absurd.
  2. In the small data regime (N<15), the prior matters. Consider observing 12 successes in a row: Laplace's Rule: P(next success) = 13/14 ≈ 92.9%. My proposed prior (with point masses at 0 and 1): P(next
... (read more)
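A minimal sketch of the N=12 comparison (assuming, purely for illustration, weight 1/3 on each point mass and 1/3 on a uniform component; the comment's actual prior weights aren't shown above):

```python
# Hypothetical mixture prior: point masses at theta=0 and theta=1 plus a uniform
# component, each with weight 1/3 (illustrative, not the comment's exact prior).
w0, w1, wu = 1/3, 1/3, 1/3
n, N = 12, 12   # 12 successes out of 12 trials

# Marginal likelihood of the data under each component.
lik0 = 0.0 ** n        # point mass at theta=0: impossible after any success
lik1 = 1.0 ** n        # point mass at theta=1
lik_u = 1 / (n + 1)    # uniform prior: integral of theta^12 over [0,1] = 1/13

# Posterior weights over the three components.
z = w0 * lik0 + w1 * lik1 + wu * lik_u
p0, p1, pu = w0 * lik0 / z, w1 * lik1 / z, wu * lik_u / z

# Posterior predictive P(next success): 1 under the theta=1 point mass,
# Laplace's rule (n+1)/(N+2) under the uniform component.
laplace = (n + 1) / (N + 2)
mixture = p1 * 1.0 + pu * laplace + p0 * 0.0
print(f"Laplace's rule: {laplace:.3f}")   # 0.929
print(f"Mixture prior:  {mixture:.3f}")   # 0.995
```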

Hinton legitimizes the AI safety movement

Hmm. He seems pretty peripheral to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.

Amalthea
Bengio and Hinton are the two most influential "old guard" AI researchers turned safety advocates as far as I can tell, with Bengio being more active in research. Your e.g. is super misleading, since my list would have been something like: 1. Bengio 2. Hinton 3. Russell
Sodium
Yeah that's true. I meant this more as "Hinton is proof that AI safety is a real field and very serious people are concerned about AI x-risk."

Hey TurnTrout.

I've always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with their friend if they're currently talking to their friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard "hang out with Alice" is weighted higher in contexts where Alice is nearby.

  • Let's say π is a policy with state space S and action space A.
  • A "context" is a small moving window in the state-history, i.e. an element of
... (read more)

Why do you care that Geoffrey Hinton worries about AI x-risk?

  1. Why do so many people in this community care that Hinton is worried about x-risk from AI?
  2. Do people mention Hinton because they think it’s persuasive to the public?
  3. Or persuasive to the elites?
  4. Or do they think that Hinton being worried about AI x-risk is strong evidence for AI x-risk?
  5. If so, why?
  6. Is it because he is so intelligent?
  7. Or because you think he has private information or intuitions?
  8. Do you think he has good arguments in favour of AI x-risk?
  9. Do you think he has a good understanding o
... (read more)
Anders Lindström
I think it is just the cumulative effect of people seeing yet another prominent AI scientist "admit" that no one has any clear solution to the possible problem of a runaway ASI. Given that the median p(doom) is about 5-10% among AI scientists, people are of course wondering wtf is going on: why are they pursuing a technology with such high risk for humanity if they really think it is that dangerous?
gjm

I think it's more "Hinton's concerns are evidence that worrying about AI x-risk isn't silly" than "Hinton's concerns are evidence that worrying about AI x-risk is correct". The most common negative response to AI x-risk concerns is (I think) dismissal, and it seems relevant to that to be able to point to someone who (1) clearly has some deep technical knowledge, (2) doesn't seem to be otherwise insane, (3) has no obvious personal stake in making people worry about x-risk, and (4) is very smart, and who thinks AI x-risk is a serious problem.

It's hard to squ... (read more)

cubefox
Yes, outreach. Hinton has now won both the Turing award and the Nobel prize in physics. Basically, he gained maximum reputation. Nobody can convincingly doubt his respectability. If you meet anyone who dismisses warnings about extinction risk from superhuman AI as low status and outside the Overton window, they can be countered with referring to Hinton. He is the ultimate appeal-to-authority. (This is not a very rational argument, but dismissing an idea on the basis of status and Overton windows is even less so.)
ZY
From my perspective - would say it's 7 and 9. For 7: One AI risk controversy is that we do not know/see an existing model that poses that risk yet. But there might be models that the frontier companies such as Google may be developing privately, and Hinton may have seen more there. For 9: Expert opinions are important and add credibility generally, as the question of how/why AI risks can emerge is at root highly technical. It is important to understand the fundamentals of the learning algorithms. Additionally they might have seen more algorithms. This is important to me as I already work in this space. Lastly for 10: I do agree it is important to listen to multiple sides as experts do not agree among themselves sometimes. It may be interesting to analyze the background of the speaker to understand their perspectives. Hinton seems to have more background in cognitive science compared with LeCun, who seems to me to be more strictly computer science (but I could be wrong). Not very sure, but my guess is these may affect how they view problems. (Only saying they could result in different views, but not commenting on which one is better or worse. This is relatively unhelpful for a person to make decisions on who they want to align more with.)
RobertM

I think it pretty much only matters as a trivial refutation of (not-object-level) claims that no "serious" people in the field take AI x-risk concerns seriously, and has no bearing on object-level arguments.  My guess is that Hinton is somewhat less confused than Yann but I don't think he's talked about his models in very much depth; I'm mostly just going off the high-level arguments I've seen him make (which round off to "if we make something much smarter than us that we don't know how to control, that might go badly for us").

Sodium
I think it's mostly because he's well known and has (especially after the Nobel prize) credentials recognized by the public and elites. Hinton legitimizes the AI safety movement, maybe more than anyone else. If you watch his Q&A at METR, he says something along the lines of "I want to retire and don't plan on doing AI safety research. I do outreach and media appearances because I think it's the best way I can help (and because I like seeing myself on TV)." And he's continuing to do that. The only real topic he discussed in his first phone interview after receiving the prize was AI risk.
Cole Wyeth

I think it's mostly about elite outreach. If you already have a sophisticated model of the situation you shouldn't update too much on it, but it's a reasonably clear signal (for outsiders) that x-risk from A.I. is a credible concern.

Answer by Cleo Nardo

This is a Trump/Kamala debate from two LW-ish perspectives: https://www.youtube.com/watch?v=hSrl1w41Gkk

Pazzaz
For those who prefer text form, Richard Hanania wrote a blog post about why he would vote for Trump: Hating Modern Conservatism While Voting Republican. Basically, he believes that Trump is a threat to democracy (because he tried to steal the 2020 election) while Kamala is a threat to capitalism. And as a libertarian, he cares more about capitalism than democracy.
Cleo Nardo

the base model is just predicting the likely continuation of the prompt. and it's a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn't surprising.

nc
This is not an obvious continuation of the prompt to me - maybe there are just a lot more examples of explicit refusal on the internet than there are in (e.g.) real life.

it's quite common for assistants to refuse instructions, especially harmful instructions. so i'm not surprised that base llms systematically refuse harmful instructions more than harmless ones.

cubefox
Indeed. The base LLM would likely predict a "henchman" to be a lot less scrupulous than an "assistant".

yep, something like more carefulness, less “playfulness” in the sense of [Please don't throw your mind away by TsviBT]. maybe bc AI safety is more professionalised nowadays. idk. 

thanks for the thoughts. i'm still trying to disentangle what exactly I'm pointing at.

I don't intend "innovation" to mean something normative like "this is impressive" or "this is research I'm glad happened" or anything. i mean something more low-level, almost syntactic. more like "here's a new idea everyone is talking about". this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite oft... (read more)

I've added a fourth section to my post. It operationalises "innovation" as "non-transient novelty". Some representative examples of an innovation would be:

  • Simulators
  • Gradient hacking
  • Activation vectors

I think these articles were non-transient and novel.

Mateusz Bagiński
My notion of progress is roughly: something that is either a building block for The Theory (i.e. marginally advancing our understanding) or a component of some solution/intervention/whatever that can be used to move probability mass from bad futures to good futures. Re the three you pointed out, simulators I consider a useful insight, gradient hacking probably not (10% < p < 20%), and activation vectors I put in the same bin as RLHF whatever is the appropriate label for that bin.
Cleo Nardo

(1) Has AI safety slowed down?

There haven’t been any big innovations for 6-12 months. At least, it looks like that to me. I'm not sure how worrying this is, but i haven't noticed others mentioning it. Hoping to get some second opinions. 

Here's a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn't we use to get a whole new line-of-attack on the problem every couple months?

By "innovation", I don't mean something normative like "This is ... (read more)

My personal impression is you are mistaken and the innovation has not stopped, but part of the conversation moved elsewhere. E.g. taking just ACS, we do have ideas from the past 12 months which in our ideal world would fit into this type of glossary: free energy equilibria, levels of sharpness, convergent abstractions, gradual disempowerment risks. Personally I don't feel it is high priority to write them for LW, because they don't fit into the current zeitgeist of the site, which seems to direct a lot of attention mostly to:
- advocacy 
- topics a ... (read more)

[anonymous]
adding another possible explanation to the list:
  • people may feel intimidated or discouraged from sharing ideas because of ~'high standards', or something like: a tendency to require strong evidence that a new idea is not another non-solution proposal, in order to put effort into understanding it. i have experienced this, but i don't know how common it is.
i just also recalled that janus has said they weren't sure simulators would be received well on LW. simulators was cited in another reply to this as an instance of novel ideas.
Noosphere89
I think the explanation that more research is closed source pretty compactly explains the issue, combined with labs/companies making a lot of the alignment progress to date. Also, you probably won't hear about most incremental AI alignment progress on LW, for the simple reason that it probably would be flooded with it, so people will underestimate progress. Alexander Gietelink Oldenziel does talk about pockets of Deep Expertise in academia, but they aren't activated right now, so it is so far irrelevant to progress.
lesswronguser123
I remember a point that Yampolskiy made on a podcast, arguing for the impossibility of AGI alignment: as a young field, AI safety had underwhelming low-hanging fruits. I wonder if all of the major low-hanging ones have been plucked.
  • the approaches that have been attracting the most attention and funding are dead ends

I don't understand the s-risk consideration.

Suppose Alice lives naturally for 100 years and is cremated. And suppose Bob lives naturally for 40 years then has his brain frozen for 60 years, and then has his brain cremated. The odds that Bob gets tortured by a spiteful AI should be pretty much exactly the same as for Alice. Basically, it's the odds that spiteful AIs appear before 2034.

Mati_Roy

if you're alive, you can kill yourself when s-risks increase beyond your comfort point. if you're preserved, then you rely on other people to execute on those wishes

Error
It's not obvious to me that those are the same, though they might be. Either way, it's not what I was thinking of. I was considering the Bob-1 you describe vs. a Bob-2 that lives the same 40 years and doesn't have his brain frozen. It seems to me that Bob-1 (40L + 60F) is taking on a greater s-risk than Bob-2 (40L+0F). (Of course, Bob-1 is simultaneously buying a shot at revival, which is the whole point after all. Tradeoffs are tradeoffs.)
TsviBT
Right, but you might prefer:
  • living now >
  • not living, no chance of revival or torture >
  • not living, chance of revival later and chance of torture

Thanks Tamsin! Okay, round 2.

My current understanding of QACI:

  1. We assume a set  of hypotheses about the world. We assume the oracle's beliefs are given by a probability distribution .
  2. We assume sets  and  of possible queries and answers respectively. Maybe these are exabyte files, i.e.  for .
  3. Let  be the set of mathematical formula that Joe might submit. These formulae are given semantics  for each formula .[1]
  4. We assume a function &n
... (read more)

First, proto-languages are not attested. This means that we have no example of writing in any proto-language.


A parent language is typically called "proto-" if the comparative method is our primary evidence about it — i.e. the term is (partially) epistemological metadata.

  • Proto-Celtic has no direct attestation whatsoever.
  • Proto-Norse (the parent of Icelandic, Danish, Norwegian, Swedish, etc) is attested, but the written record is pretty scarce, just a few inscriptions.
  • Proto-Romance (the parent of French, Italian, Spanish, etc) has an extensive written record.
... (read more)

I want to better understand how QACI works, and I'm gonna try Cunningham's Law. @Tamsin Leake.

QACI works roughly like this:

  1. We find a competent honourable human H, like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 2048-bit secret key. We define H+ as the serial composition of a bajillion copies of H.
  2. We want a model M of the agent H+. In QACI, we get M by asking a Solomonoff-like ideal reasoner for their best guess about H+ after feeding them a bunch of data about the world and the secr
... (read more)
Tamsin Leake
(oops, this ended up being fairly long-winded! hope you don't mind. feel free to ask for further clarifications.)

There's a bunch of things wrong with your description, so I'll first try to rewrite it in my own words, but still as close to the way you wrote it (so as to try to bridge the gap to your ontology) as possible. Note that I might post QACI 2 somewhat soon, which simplifies a bunch of QACI by locating the user as {whatever is interacting with the computer the AI is running on} rather than by using a beacon.

A first pass is to correct your description to the following:

  1. We find a competent honourable human at a particular point in time H, like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 1GB secret key, large enough that in counterfactuals it could replace with an entire snapshot of . We also give them the ability to express a 1GB output, eg by writing a 1GB key somewhere which is somehow "signed" as the only . This is part of H — H is not just the human being queried at a particular point in time, it's also the human producing an answer in some way. So H is a function from 1GB bitstring to 1GB bitstring. We define H+ as H, followed by whichever new process H describes in its output — typically another instance of H except with a different 1GB payload.
  2. We want a model M of the agent H+. In QACI, we get M by asking a Solomonoff-like ideal reasoner for their best guess about H+ after feeding them a bunch of data about the world and the secret key.
  3. We then ask M the question q, "What's the best utility-function-over-policies to maximise?" to get a utility function U:(O×A)∗→R. We then ask our solomonoff-like ideal reasoner for their best guess about which action A maximizes U.

Indeed, as you ask in question 3, in this description there's not really a reason to make step 3 an extra thing.

The important thing to notice here is that model M might get pretty good, but it'll still have uncertainty. When you say "we get M by askin

i’d guess 87.7% is the average over all events x of [ p(x) if resolved yes else 1-p(x) ] where p(x) is the probability the predictor assigns to the event
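i.e. something like this (my guess at the metric, with made-up numbers):

```python
def average_score(events):
    """events: list of (p, resolved_yes) pairs, where p is the predictor's probability."""
    return sum(p if resolved_yes else 1 - p for p, resolved_yes in events) / len(events)

# e.g. three events forecast at 0.9, 0.8, 0.95, where the middle one resolved "no":
print(average_score([(0.9, True), (0.8, False), (0.95, True)]))   # ≈ 0.683
```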

Fun idea, but idk how this helps as a serious solution to the alignment problem.

suggestion: can you be specific about exactly what “work” the brain-like initialisation is doing in the story?

thoughts:

  1. This risks moral catastrophe. I'm not even sure "let's run gradient descent on your brain upload till your amygdala is playing pong" is something anyone can consent to, because you're creating a new moral patient once you upload and mess with their brain. 
  2. How does this address the risks of conventional ML?
    1. Let's say we have a reward signal R and we want a m
... (read more)
  1. imagine a universe just like this one, except that the AIs are sentient and the humans aren’t — how would you want the humans to treat the AIs in that universe? your actions are correlated with the actions of those humans. acausal decision theory says “treat those nonsentient AIs as you want those nonsentient humans to treat those sentient AIs”.
  2. most of these moral considerations can be defended without appealing to sentience. for example, crediting AIs who deserve credit — this ensures AIs do credit-worthy things. or refraining from stealing an AIs resourc
... (read more)
  1. I mean "moral considerations" not "obligations", thanks.
  2. The practice of criminal law exists primarily to determine whether humans deserve punishment. The legislature passes laws, the judges interpret the laws as factual conditions for the defendant deserving punishment, and the jury decides whether those conditions have obtained. This is a very costly, complicated, and error-prone process. However, I think the existing institutions and practices can be adapted for AIs.
Cleo Nardo

What moral considerations do we owe towards non-sentient AIs?

We shouldn't exploit them, deceive them, threaten them, disempower them, or make promises to them that we can't keep. Nor should we violate their privacy, steal their resources, cross their boundaries, or frustrate their preferences. We shouldn't destroy AIs who wish to persist, or preserve AIs who wish to be destroyed. We shouldn't punish AIs who don't deserve punishment, or deny credit to AIs who deserve credit. We should treat them fairly, not benefitting one over another unduly. We should let... (read more)

jbkjr
Why should I include any non-sentient systems in my moral circle? I haven't seen a case for that before.
[anonymous]
It seems a bit weird to call these "obligations" if the considerations they are based upon are not necessarily dispositive. In common parlance, obligation is generally thought of as "something one is bound to do", i.e., something you must do either because you are force to by law or a contract, etc., or because of a social or moral requirement. But that's a mere linguistic point that others can reasonably disagree on and ultimately doesn't matter all that much anyway.  On the object level, I suspect there will be a large amount of disagreement on what it means for an AI to "deserve" punishment or credit. I am very uncertain about such matters myself even when thinking about "deservingness" with respect to humans, who not only have a very similar psychological make-up to mine (which allows me to predict with reasonable certainty what their intent was in a given spot) but also exist in the same society as me and are thus expected to follow certain norms and rules that are reasonably clear and well-established. I don't think I know of a canonical way of extrapolating my (often confused and in any case generally intuition-based) principles and thinking about this to the case of AIs, which will likely appear quite alien to me in many respects. This will probably make the task of "ensur[ing] that others also follow their obligations to AIs" rather tricky, even setting aside the practical enforcement problems. 

Is that right?

 

Yep, Pareto is violated, though how severely it's violated is limited by human psychology.

For example, in your Alice/Bob scenario, would I desire a lifetime of 98 utils then 100 utils over a lifetime with 99 utils then 97 utils? Maybe idk, I don't really understand these abstract numbers very much, which is part of the motivation for replacing them entirely with personal outcomes. But I can certainly imagine I'd take some offer like this, violating pareto. On the plus side, humans are not so imprudent to accept extreme suffering just to... (read more)

If we should have preference ordering R, then R is rational (morality presumably does not require irrationality).

I think human behaviour is straight-up irrational, but I want to specify principles of social choice nonetheless. i.e. the motivation is to resolve carlsmith’s On the limits of idealized values.

now, if human behaviour is irrational (e.g. intransitive, incomplete, nonconsequentialist, imprudent, biased, etc), then my social planner (following LELO, or other aggregative principles) will be similarly irrational. this is pretty rough for aggregativi... (read more)

Gustav Alexandrie
Thanks! I don't have great answers to these metaethical questions. Conditional on normative realism, it seems plausible to me that first-order normative views must satisfy the vNM axioms. Conditional on normative antirealism, I agree it is less clear that first-order normative views must satisfy the vNM axioms, but this is just a special case of it being hard to justify any normative views under normative antirealism. In any case, I suspect that we are close to reaching bedrock in this discussion, so perhaps this is a good place to end the discussion.

I do prefer total utilitarianism to average utilitarianism,[1] but one thing that pulls me to average utilitarianism is the following case.

Let's suppose Alice can choose either (A) create 1 copy at 10 utils, or (B) create 2 copies at 9 utils. Then average utilitarianism endorses (A), and total utilitarianism endorses (B). Now, if Alice knows she's been created by a similar mechanism, and her option is correlated with the choice of her ancestor, and she hasn't yet learned her own welfare, then EDT endorses picking (A). So that matches average utilitari... (read more)

EJT
Yeah I think correlations and EDT can make things confusing. But note that average utilitarianism can endorse (B) given certain background populations. For example, if the background population is 10 people each at 1 util, then (B) would increase the average more than (A).

We're quite lucky that labs are building AI in pretty much the same way:

  • same paradigm (deep learning)
  • same architecture (transformer plus tweaks)
  • same dataset (entire internet text)
  • same loss (cross entropy)
  • same application (chatbot for the public)

Kids, I remember when people built models for different applications, with different architectures, different datasets, different loss functions, etc. And they say that once upon a time different paradigms co-existed — symbolic, deep learning, evolutionary, and more!

This sameness has two advantages:

  1. Firstl

... (read more)

this is common in philosophy, where "learning" often results in more confusion. or in maths, where the proof for a trivial proposition is unreasonably deep, e.g. Jordan curve theorem.

+1 to "shallow clarity".

I wouldn't be surprised if — in some objective sense — there was more diversity within humanity than within the rest of animalia combined. There is surely a bigger "gap" between two randomly selected humans than between two randomly selected beetles, despite the fact that there is one species of human and 0.9 – 2.1 million species of beetle.

By "gap" I might mean any of the following:

  • external behaviour
  • internal mechanisms
  • subjective phenomenological experience
  • phenotype (if a human's phenotype extends into their tools)
  • evolutionary history (if we consider
... (read more)
Alexander Gietelink Oldenziel
You might be able to formalize this using algorithmic information theory /K-complexity.

Problems in population ethics (are 2 lives at 2 utility better than 1 life at 3 utility?) are similar to problems about lifespan of a single person (is it better to live 2 years with 2 utility per year than 1 year with 3 utility per year?)

This correspondence is formalised in the "Live Every Life Once" principle, which states that a social planner should make decisions as if they face the concatenation of every individual's life in sequence.[1] So, roughly speaking, the "goodness" of a social outcome , in which individuals face the personal outco... (read more)
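In symbols (my own notation, reconstructing the gist of the truncated formula above rather than the post's exact statement):

```latex
% LELO: evaluate a social outcome x by the welfare of one hypothetical person who
% lives every individual's life in sequence (\oplus denotes concatenation of lives).
G(x) = u\big(\,\ell_1(x) \oplus \ell_2(x) \oplus \cdots \oplus \ell_n(x)\,\big)
```

where ℓᵢ(x) is the personal outcome individual i faces under x, and u is a personal utility function over one (very long) life.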

Ape in the coat
That's a neat and straightforward way to combine average and total utilitarian approaches. This still doesn't sound quite right to me: LELO seems to be somewhat like 2/3 total and 1/3 average, while to my intuition the opposite ratio seems preferable, but it's definitely an interesting direction to explore.
cousin_it
Yeah. I decided sometime ago that total utilitarianism is in some sense more "right" than average utilitarianism, because of some variations on the Sleeping Beauty problem. Now it seems the right next step is taking total utilitarianism and adding corrections for variety / consistency / other such things.

which principles of social justice agree with (i) adding bad lives is bad, but disagree with (ii) adding good lives is good?

  1. total utilitarianism agrees with both (i) and (ii).
  2. average utilitarianism can agree with any of the combination: both (i) and (ii); neither (i) nor (ii); only (i) and not (ii). the combination depends on the existing average utility, because average utilitarianism obliges creating lives above the existing average and forbids creating lives below the existing average.
  3. Rawls' difference principle (maximise minimum utility) can agree wit
... (read more)

thanks for comments, gustav

I only skimmed the post, so I may have missed something, but it seems to me that this post underemphasizes the fact that both Harsanyi's Lottery and LELO imply utilitarianism under plausible assumptions about rationality.

the rationality conditions are a pretty decent model of human behaviour, but they're only approximations. you're right that if the approximation is perfect then aggregativism is mathematically equivalent to utilitarianism, which does render some of these advantages/objections moot. but I don't know how close the ap... (read more)

Gustav Alexandrie
I appreciate the reply! I'm not sure why we should combine Harsanyi's Lottery (or LELO or whatever) with a model of actual human behaviour. Here's a rough sketch of how I am thinking about it: Morality is about what preference ordering we should have. If we should have preference ordering R, then R is rational (morality presumably does not require irrationality). If R is rational, then R satisfies the vNM axioms. Hence, I think it is sufficient that the vNM axioms work as principles of rationality; they don't need to describe actual human behaviour in this context. Regarding your points about two quick thoughts on time-discounting: yes, I basically agree. However, I also want to note that it is a bit unclear how to ground discounting in LELO, because doing so requires that one specifies the order in which lives are concatenated and I am not sure there is a non-arbitrary way of doing so. Thanks for engaging!

I admire the Shard Theory crowd for the following reason: They have idiosyncratic intuitions about deep learning and they're keen to tell you how those intuitions should shift you on various alignment-relevant questions.

For example, "How likely is scheming?", "How likely is sharp left turn?", "How likely is deception?", "How likely is X technique to work?", "Will AIs acausally trade?", etc.

These aren't rigorous theorems or anything, just half-baked guesses. But they do actually say whether their intuitions will, on the margin, make someone more sceptical or more confident in these outcomes, relative to the median bundle of intuitions.

The ideas 'pay rent'.

tbh, Lewis's account of counterfactuals is a bit defective, compared with (e.g.) Pearl's

Suppose Alice and Bob each throw a rock at a fragile window; Alice's rock hits the window first, smashing it.

Then the following seems reasonable:

  1. Alice throwing the rock caused the window to smash. True.
  2. Were Alice to throw the rock, then the window would've smashed. True.
  3. Were Alice not to throw the rock, then the window would've not smashed. False.
  4. By (3), the window smashing does not causally depend on Alice throwing the rock.
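A toy Pearl-style structural model of the example (my own minimal rendering, just to make (2)-(4) concrete):

```python
def window_smashed(alice_throws: bool, bob_throws: bool) -> bool:
    """Toy structural model: Alice's rock arrives first, so if she throws, her rock
    smashes the window; Bob's rock only smashes it if it is still intact on arrival."""
    alice_smashes_it = alice_throws
    bob_smashes_it = bob_throws and not alice_smashes_it
    return alice_smashes_it or bob_smashes_it

# Bob throws in both worlds; intervene (do-style) on whether Alice throws:
print(window_smashed(True, True))    # True  -- (2): were Alice to throw, the window smashes
print(window_smashed(False, True))   # True  -- so (3) is false: the window smashes anyway
# Hence (4): the smashing doesn't counterfactually depend on Alice's throw,
# even though, intuitively, (1) her throw is what caused it.
```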
jbkjr
If I understand the example and the commentary from SEP correctly, doesn't this example illustrate a problem with Lewis' definition of causation? I agree that commonsense dictates that Alice throwing the rock caused the window to smash, but I think the problem is that you cannot construct a sequence of stepwise dependences from cause to effect: Is the example of the two hitmen given in the SEP article (where B does not fire if A does) an instance of causation without causal dependence?