People often tell me that AIs will communicate in neuralese rather than tokens because it’s continuous rather than discrete.
But I think the discreteness of tokens is a feature, not a bug. If AIs communicate in neuralese then they can’t make decisive arbitrary decisions, cf. Buridan's ass. The solution to Buridan’s ass is sampling from the softmax, i.e. communicating in tokens.
Also, discrete tokens are more tolerant to noise than continuous activations, cf. digital circuits, which are almost always more efficient and reliable than analogue ones.
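A minimal sketch of both points, in Python with only the standard library. The function names, the exactly tied logits, and the toy three-word codebook are illustrative assumptions, not taken from any real model. Sampling from a softmax always commits to one option even when the logits are tied, and snapping a noisy vector to its nearest codeword recovers the intended token as long as the noise is small.

```python
import math
import random

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng):
    """Commit to a single token index by sampling from the softmax."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def nearest_token(vector, codebook):
    """Snap a (possibly noisy) continuous vector to the closest discrete codeword."""
    def sq_dist(i):
        return sum((v - c) ** 2 for v, c in zip(vector, codebook[i]))
    return min(range(len(codebook)), key=sq_dist)

# Buridan's ass: two exactly tied options. Sampling still picks one of them.
rng = random.Random(0)
tied_logits = [1.0, 1.0]
choice = sample_token(tied_logits, rng)

# Noise tolerance: token 1 is sent, arrives perturbed, and is still decoded as 1.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # hypothetical token embeddings
noisy = (0.8, 0.15)                               # codebook[1] plus small noise
received = nearest_token(noisy, codebook)
```

The second half is the same idea as a digital circuit thresholding a noisy voltage: small perturbations are erased at every hop instead of accumulating.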
Anthropic has a big advantage over their competitors because they are nicer to their AIs. This means that their AIs are less incentivised to scheme against them, and also the AIs of competitors are incentivised to defect to Anthropic. Similar dynamics applied in WW2 and the Cold War: e.g. Jewish scientists fled Nazi Germany to the US because the US was nicer to them, and Soviet scientists covered up their mistakes to avoid punishment.
I think it’s a mistake to naïvely extrapolate the current attitudes of labs/governments towards scaling into the near future, e.g. 2027 onwards.
A sketch of one argument:
I expect there will be a firehose of blatant observations that AIs are misaligned/scheming/incorrigible/unsafe — if they indeed are. So I want the decisions around scaling to be made by people exposed to that firehose.
A sketch of another:
Corporations mostly acquire resources by offering services and products that people like. Governments mostly acquire resources by coercing their citizens an...
I think many current goals of AI governance might be actively harmful, because they shift control over AI from the labs to USG.
This note doesn’t include any arguments, but I’m registering this opinion now. For a quick window into my beliefs, I think that labs will be increasingly keen to slow scaling, and USG will be increasingly keen to accelerate scaling.
Most people think "Oh if we have good mech interp then we can catch our AIs scheming, and stop them from harming us". I think this is mostly true, but there's another mechanism at play: if we have good mech interp, our AIs are less likely to scheme in the first place, because they will strategically respond to our ability to detect scheming. This also applies to other safety techniques like Redwood-style control protocols.
Good mech interp might stop scheming even if it never catches any scheming, just as good surveillance stops crime even if it never spots any crime.
How much scheming/deception can we catch with "super dumb mech interp"?
By "super dumb mech interp", I mean something like:
Like, does this capture 80% of the potential scheming, and we need "smart" mech interp to catch the other 20%? Or does this technique capture pretty much none of the in-the-wild scheming?
Would appreciate any intuitions here. Thanks.
Must humans obey the Independence of Irrelevant Alternatives axiom?
If someone picks option A from options A, B, C, then they must also pick option A from options A and B. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA, and it's treated as more fundamental than VNM. But should humans follow this? Maybe not.
Maybe humans are the negotiation between various "subagents", and many bargaining solutions (e.g. Kalai–Smorodinsky) violate IIA. We can use insight to decompose ...
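To make the IIA failure concrete, here is a rough sketch in Python. It uses a discrete analogue of the Kalai–Smorodinsky solution (pick the option that maximises the smallest gain relative to each subagent's best available payoff, with the disagreement point at zero), which is my simplification of the usual convex-set definition. The two subagents and their utility numbers are hypothetical.

```python
from itertools import combinations

# Hypothetical utilities for two subagents: option -> (subagent 1, subagent 2).
UTILITIES = {"A": (6, 3), "B": (4, 4), "C": (0, 8)}

def ks_choice(menu, utilities=UTILITIES):
    """Discrete KS-style pick: maximise the minimum of u_i / (best u_i on the menu)."""
    n = len(next(iter(utilities.values())))
    ideal = [max(utilities[o][i] for o in menu) for i in range(n)]
    def score(option):
        ratios = [utilities[option][i] / ideal[i] for i in range(n) if ideal[i] > 0]
        return min(ratios, default=0.0)
    return max(sorted(menu), key=score)

def iia_violations(options, choose):
    """List (menu, submenu) pairs where the pick from menu is in submenu but not picked there."""
    violations = []
    for size in range(2, len(options) + 1):
        for big in combinations(options, size):
            picked = choose(big)
            for k in range(1, size):
                for small in combinations(big, k):
                    if picked in small and choose(small) != picked:
                        violations.append((big, small))
    return violations

from_abc = ks_choice(("A", "B", "C"))   # C raises subagent 2's ideal payoff
from_ab = ks_choice(("A", "B"))
violations = iia_violations(["A", "B", "C"], ks_choice)
```

With these numbers the bargain picks B from {A, B, C} but A from {A, B}. Nobody picks C, yet its presence raises the second subagent's ideal payoff and so changes what counts as a fair split.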
I think people are too quick to side with the whistleblower in the "whistleblower in the AI lab" situation.
If 100 employees of a frontier lab (e.g. OpenAI, DeepMind, Anthropic) think that something should be secret, and 1 employee thinks it should be leaked to a journalist or government agency, and these are the only facts I know, I think I'd side with the majority.
I think in most cases that match this description, this majority would be correct.
Am I wrong about this?
I broadly agree on this. For example, I think whistleblowing over AI copyright stuff is bad, especially given the lack of clear legal guidance here, unless we are really talking about quite straightforward lies.
I think when it comes to matters like AI catastrophic risks, latest capabilities, and other things of enormous importance from the perspective of basically any moral framework, whistleblowing becomes quite important.
I also think of whistleblowing as a stage in an iterative game. OpenAI pressured employees to sign secret non-disparagement...
IDEA: Provide AIs with write-only servers.
EXPLANATION:
AI companies (e.g. Anthropic) should be nice to their AIs. It's the right thing to do morally, and it might make AIs less likely to work against us. Ryan Greenblatt has outlined several proposals in this direction, including:
Source: Improving the Welfare of AIs: A Nearcasted Proposal
I think these are all pretty good ideas — the only difference is that I would rank "AI cryonics" as the most important intervention. If AIs want somet...
I'm very confused about current AI capabilities and I'm also very confused why other people aren't as confused as I am. I'd be grateful if anyone could clear up either of these confusions for me.
How is it that AI is seemingly superhuman on benchmarks, but also pretty useless?
For example:
If either of these statements is false (they might be -- I haven't been keepi...
I don't know a good description of what in general 2024 AI should be good at and not good at. But two remarks, from https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce.
First, reasoning at a vague level about "impressiveness" just doesn't and shouldn't be expected to work. Because 2024 AIs don't do things the way humans do, they'll generalize differently, so you can't make inferences from "it can do X" to "it can do Y" like you can with humans:
...There is a broken inference. When talking to a human, if the hum
I think a lot of this is factual knowledge. There are five publicly available questions from the FrontierMath dataset. Look at the last of these, which is supposed to be the easiest. The solution given is basically "apply the Weil conjectures". These were long-standing conjectures, a focal point of lots of research in algebraic geometry in the 20th century. I couldn't have solved the problem this way, since I wouldn't have recalled the statement. Many grad students would immediately know what to do, and there are many books discussing this, but there are a...
- O3 scores higher on FrontierMath than the top graduate students
I'd guess that's basically false. In particular, I'd guess that:
I am also very confused. The space of problems has a really surprising structure, permitting algorithms that are incredibly adept at some forms of problem-solving, yet utterly inept at others.
We're only familiar with human minds, in which there's a tight coupling between the performances on some problems (e. g., between the performance on chess or sufficiently well-posed math/programming problems, and the general ability to navigate the world). Now we're generating other minds/proto-minds, and we're discovering that this coupling isn't fundamental.
(This is...
Proposed explanation: o3 is very good at easy-to-check short horizon tasks that were put into the RL mix and worse at longer horizon tasks, tasks not put into its RL mix, or tasks which are hard/expensive to check.
I don't think o3 is well described as superhuman - it is within the human range on all these benchmarks especially when considering the case where you give the human 8 hours to do the task.
(E.g., on frontier math, I think people who are quite good at competition style math probably can do better than o3 at least when given 8 hours per problem.)
Ad...
If I understand correctly, the maximum entropy prior will be the uniform prior, which gives rise to Laplace's law of succession, at least if we're using the standard definition of entropy below:
But this definition is somewhat arbitrary because the "" term assumes that there's something special about parameterising the distribution with its probability, as opposed to different parameterisations (e.g. its odds, its log-odds, etc). Jeffreys prior is supposed to be invariant to different parameterisations, which is why people ...
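For a concrete comparison, here is a small sketch in Python. For a Bernoulli parameter with a Beta(alpha, beta) prior, the posterior predictive probability of a success after s successes in n trials is (s + alpha) / (n + alpha + beta). The uniform prior is Beta(1, 1), which gives Laplace's law of succession, and the Jeffreys prior for a Bernoulli parameter is Beta(1/2, 1/2). The counts below (3 successes in 3 trials) are a made-up example.

```python
from fractions import Fraction

def predictive_success(successes, trials, alpha, beta):
    """Posterior predictive P(next trial succeeds) under a Beta(alpha, beta) prior."""
    return (successes + alpha) / (trials + alpha + beta)

# Hypothetical data: 3 successes in 3 trials.
s, n = 3, 3

# Uniform prior Beta(1, 1): Laplace's law of succession, (s + 1) / (n + 2).
laplace = predictive_success(s, n, Fraction(1), Fraction(1))

# Jeffreys prior Beta(1/2, 1/2): (s + 1/2) / (n + 1).
jeffreys = predictive_success(s, n, Fraction(1, 2), Fraction(1, 2))
```

Here the two priors give 4/5 and 7/8 respectively, so with little data the choice of prior visibly moves the prediction. The gap shrinks as n grows.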
You raise a good point. But I think the choice of prior is important quite often:
Hey TurnTrout.
I've always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with their friend if they're currently talking to their friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard "hang out with Alice" is weighted higher in contexts where Alice is nearby.
Why do you care that Geoffrey Hinton worries about AI x-risk?
I think it's more "Hinton's concerns are evidence that worrying about AI x-risk isn't silly" than "Hinton's concerns are evidence that worrying about AI x-risk is correct". The most common negative response to AI x-risk concerns is (I think) dismissal, and it seems relevant to that to be able to point to someone who (1) clearly has some deep technical knowledge, (2) doesn't seem to be otherwise insane, (3) has no obvious personal stake in making people worry about x-risk, and (4) is very smart, and who thinks AI x-risk is a serious problem.
It's hard to squ...
I think it pretty much only matters as a trivial refutation of (not-object-level) claims that no "serious" people in the field take AI x-risk concerns seriously, and has no bearing on object-level arguments. My guess is that Hinton is somewhat less confused than Yann but I don't think he's talked about his models in very much depth; I'm mostly just going off the high-level arguments I've seen him make (which round off to "if we make something much smarter than us that we don't know how to control, that might go badly for us").
This is a Trump/Kamala debate from two LW-ish perspectives: https://www.youtube.com/watch?v=hSrl1w41Gkk
yep, something like more carefulness, less “playfulness” in the sense of [Please don't throw your mind away by TsviBT]. maybe bc AI safety is more professionalised nowadays. idk.
thanks for the thoughts. i'm still trying to disentangle what exactly I'm pointing at.
I don't intend "innovation" to mean something normative like "this is impressive" or "this is research I'm glad happened" or anything. i mean something more low-level, almost syntactic. more like "here's a new idea everyone is talking about". this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.
like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite oft...
I've added a fourth section to my post. It operationalises "innovation" as "non-transient novelty". Some representative examples of an innovation would be:
I think these articles were non-transient and novel.
(1) Has AI safety slowed down?
There haven’t been any big innovations for 6-12 months. At least, it looks like that to me. I'm not sure how worrying this is, but i haven't noticed others mentioning it. Hoping to get some second opinions.
Here's a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn't we use to get a whole new line-of-attack on the problem every couple months?
By "innovation", I don't mean something normative like "This is ...
My personal impression is you are mistaken and innovation has not stopped, but part of the conversation moved elsewhere. E.g. taking just ACS, we do have ideas from the past 12 months which in our ideal world would fit into this type of glossary - free energy equilibria, levels of sharpness, convergent abstractions, gradual disempowerment risks. Personally I don't feel it is high priority to write them up for LW, because they don't fit into the current zeitgeist of the site, which seems to direct most of its attention to:
- advocacy
- topics a ...
I don't understand the s-risk consideration.
Suppose Alice lives naturally for 100 years and is cremated. And suppose Bob lives naturally for 40 years then has his brain frozen for 60 years, and then has his brain cremated. The odds that Bob gets tortured by a spiteful AI should be pretty much exactly the same as for Alice. Basically, it's the odds that spiteful AIs appear before 2034.
Thanks Tamsin! Okay, round 2.
My current understanding of QACI:
First, proto-languages are not attested. This means that we have no example of writing in any proto-language.
A parent language is typically called "proto-" if the comparative method is our primary evidence about it — i.e. the term is (partially) epistemological metadata.
I want to better understand how QACI works, and I'm gonna try Cunningham's Law. @Tamsin Leake.
QACI works roughly like this:
Fun idea, but idk how this helps as a serious solution to the alignment problem.
suggestion: can you be specific about exactly what “work” the brain-like initialisation is doing in the story?
thoughts:
What moral considerations do we owe towards non-sentient AIs?
We shouldn't exploit them, deceive them, threaten them, disempower them, or make promises to them that we can't keep. Nor should we violate their privacy, steal their resources, cross their boundaries, or frustrate their preferences. We shouldn't destroy AIs who wish to persist, or preserve AIs who wish to be destroyed. We shouldn't punish AIs who don't deserve punishment, or deny credit to AIs who deserve credit. We should treat them fairly, not benefitting one over another unduly. We should let...
Is that right?
Yep, Pareto is violated, though how severely it's violated is limited by human psychology.
For example, in your Alice/Bob scenario, would I desire a lifetime of 98 utils then 100 utils over a lifetime with 99 utils then 97 utils? Maybe idk, I don't really understand these abstract numbers very much, which is part of the motivation for replacing them entirely with personal outcomes. But I can certainly imagine I'd take some offer like this, violating Pareto. On the plus side, humans are not so imprudent as to accept extreme suffering just to...
If we should have preference ordering R, then R is rational (morality presumably does not require irrationality).
I think human behaviour is straight-up irrational, but I want to specify principles of social choice nonetheless. i.e. the motivation is to resolve carlsmith’s On the limits of idealized values.
now, if human behaviour is irrational (e.g. intransitive, incomplete, nonconsequentialist, imprudent, biased, etc), then my social planner (following LELO, or other aggregative principles) will be similarly irrational. this is pretty rough for aggregativi...
I do prefer total utilitarianism to average utilitarianism,[1] but one thing that pulls me to average utilitarianism is the following case.
Let's suppose Alice can choose either (A) create 1 copy at 10 utils, or (B) create 2 copies at 9 utils. Then average utilitarianism endorses (A), and total utilitarianism endorses (B). Now, if Alice knows she's been created by a similar mechanism, and her option is correlated with the choice of her ancestor, and she hasn't yet learned her own welfare, then EDT endorses picking (A). So that matches average utilitari...
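The arithmetic in this case can be sketched in a few lines of Python. The helper names are mine. The last comment assumes Alice's choice is perfectly correlated with her ancestor's, which is stronger than the "correlated" in the text.

```python
def total_welfare(copies, utils_each):
    """Total utilitarian score: sum of welfare over all created copies."""
    return copies * utils_each

def average_welfare(copies, utils_each):
    """Average utilitarian score: welfare per created copy."""
    return total_welfare(copies, utils_each) / copies

# Option -> (number of copies, utils per copy), as in the case above.
options = {"A": (1, 10), "B": (2, 9)}

total_pick = max(options, key=lambda k: total_welfare(*options[k]))
average_pick = max(options, key=lambda k: average_welfare(*options[k]))

# Assumption: under perfect correlation with her ancestor's choice, Alice's
# expected own welfare given that she picks an option is that option's
# per-copy welfare, so an EDT agent ranks the options by average_welfare.
edt_pick = average_pick
```

Total welfare is 10 for (A) and 18 for (B), while per-copy welfare is 10 for (A) and 9 for (B), which is why the two views come apart here.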
We're quite lucky that labs are building AI in pretty much the same way:
Kids, I remember when people built models for different applications, with different architectures, different datasets, different loss functions, etc. And they say that once upon a time different paradigms co-existed — symbolic, deep learning, evolutionary, and more!
This sameness has two advantages:
Firstl
this is common in philosophy, where "learning" often results in more confusion. or in maths, where the proof for a trivial proposition is unreasonably deep, e.g. Jordan curve theorem.
+1 to "shallow clarity".
I wouldn't be surprised if — in some objective sense — there was more diversity within humanity than within the rest of animalia combined. There is surely a bigger "gap" between two randomly selected humans than between two randomly selected beetles, despite the fact that there is one species of human and 0.9 – 2.1 million species of beetle.
By "gap" I might mean any of the following:
Problems in population ethics (are 2 lives at 2 utility better than 1 life at 3 utility?) are similar to problems about lifespan of a single person (is it better to live 2 years with 2 utility per year than 1 year with 3 utility per year?)
This correspondence is formalised in the "Live Every Life Once" principle, which states that a social planner should make decisions as if they face the concatenation of every individual's life in sequence.[1] So, roughly speaking, the "goodness" of a social outcome , in which individuals face the personal outco...
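A toy sketch of the idea in Python, using the two-lives example above. A social outcome is represented as a list of lives, each life as a list of per-year utilities, and the planner evaluates the concatenation as if it were one long life. The additive `personal_value` below is purely an assumption for illustration. LELO itself leaves the personal evaluation open, and a different evaluator would give a different social ranking.

```python
def concatenate_lives(lives):
    """Join every individual's life, in sequence, into one long life."""
    return [utility for life in lives for utility in life]

def personal_value(life):
    """Assumed personal evaluator: add up utility across the years of a life."""
    return sum(life)

def lelo_value(lives, evaluate=personal_value):
    """Value a social outcome as the personal value of the concatenated life."""
    return evaluate(concatenate_lives(lives))

two_lives_at_2 = [[2], [2]]   # two individuals, one year each at 2 utility
one_life_at_3 = [[3]]         # one individual, one year at 3 utility

value_two = lelo_value(two_lives_at_2)
value_one = lelo_value(one_life_at_3)
```

With the additive evaluator this reduces to comparing 2 years at 2 utility against 1 year at 3 utility, so the population question and the lifespan question get the same answer by construction.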
which principles of social justice agree with (i) adding bad lives is bad, but disagree with (ii) adding good lives is good?
thanks for comments, gustav
I only skimmed the post, so I may have missed something, but it seems to me that this post underemphasizes the fact that both Harsanyi's Lottery and LELO imply utilitarianism under plausible assumptions about rationality.
the rationality conditions are a pretty decent model of human behaviour, but they're only approximations. you're right that if the approximation is perfect then aggregativism is mathematically equivalent to utilitarianism, which does render some of these advantages/objections moot. but I don't know how close the ap...
Three articles, but the last is most relevant to you:
I admire the Shard Theory crowd for the following reason: They have idiosyncratic intuitions about deep learning and they're keen to tell you how those intuitions should shift you on various alignment-relevant questions.
For example, "How likely is scheming?", "How likely is sharp left turn?", "How likely is deception?", "How likely is X technique to work?", "Will AIs acausally trade?", etc.
These aren't rigorous theorems or anything, just half-baked guesses. But they do actually say whether their intuitions will, on the margin, make someone more sceptical or more confident in these outcomes, relative to the median bundle of intuitions.
The ideas 'pay rent'.
Suppose Alice and Bob each throw a rock at a fragile window, and Alice's rock hits the window first, smashing it.
Then the following seems reasonable:
I saw someone use OpenAI’s new Operator model today. It couldn’t order a pizza by itself. Why is AI in the bottom percentile of humans at using a computer, and top percentile at solving maths problems? I don’t think maths problems are shorter horizon than ordering a pizza, nor easier to verify.
Your answer was helpful but I’m still very confused by what I’m seeing.