I often have the experience of being in the middle of a discussion and wanting to reference some simple but important idea or point, only to find that no post making that point exists. Often my reaction is "if only there was time to write an LW post that I can then link to in the future". So far I've just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I'm now going to experiment with making subcomments here that simply collect the ideas; perhaps other people will write posts about them at some point, if they're even understandable.
<unfair rant with the goal of shaking people out of a mindset>
To all of you telling me or expecting me to update to shorter timelines given <new AI result>: have you ever encountered Bayesianism?
Surely if you had, you'd immediately reason that you couldn't know how I would update without first knowing what I expected to see in advance. Which you very clearly don't know. How on earth could you know which way I should update upon observing this new evidence? In fact, why do you even care about which direction I update? That too shouldn't give you much evidence, given that you don't know what I expected in the first place.
Maybe I should feel insulted? That you think so poorly of my reasoning ability that I should be updating towards shorter timelines every time some new advance in AI comes out, as though I hadn't already priced that into my timeline estimates, and so would predictably update towards shorter timelines in violation of conservation of expected evidence? But that only follows if I expect you to be a good reasoner modeling me as a bad reasoner, which probably isn't what's going on.
</unfair rant>
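To spell out the conservation-of-expected-evidence point with made-up numbers (my illustration, not a claim about anyone's actual probabilities): write E for "a result at least this impressive comes out this year". Then

$$P(\text{short timelines}) = P(E)\,P(\text{short timelines} \mid E) + P(\neg E)\,P(\text{short timelines} \mid \neg E).$$

If I had already assigned P(E) = 0.9, then seeing E can only move me slightly, and whatever small upward update I make must be balanced by a larger downward update in the (less likely) world where E didn't happen.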
My actual guess is that people notice a discrepancy between their ver...
Essentially, the problem is that 'evidence that shifts Bio Anchors weightings' is quite different, more restricted, and much harder to define than the straightforward 'evidence of impressive capabilities'. However, the reason that I think it's worth checking whether new results are updates is that some impressive capabilities might be ones that shift Bio Anchors weightings. But impressiveness by itself tells you very little.
I think a lot of people with very short timelines are imagining the only possible alternative view as being 'another AI winter, scaling laws bend, and we don't get excellent human-level performance on short-term language-specified tasks anytime soon', and don't see the further question of figuring out exactly what human-level performance on e.g. MMLU would imply.
This is because the alternative to very short timelines from (your weightings on) Bio Anchors isn't another AI winter; rather, it's that we do get all those short-term capabilities soon, but have to wait a while longer to crack long-term agentic planning, because that doesn't come "for free" from competence on short-term tasks if you're as sample-inefficient as current ML is.
So what we're really looking for isn't systems ...
Let's say you're trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?
(I'm assuming here that you can't defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)
First, you might think that the probability of your claims being true grows roughly linearly in the number of insights you have, with some soft minimum needed before you really have any hope of being better than random (e.g. for maximum wage, you probably have ~no hope of doing better than random without Econ 101 knowledge), and some soft maximum where you almost certainly have the truth. This suggests that P(true) is a logistic function of the number of insights.
Further, you might expect that for every doubling of time you spend, you get a constant number of new insights (the logarithmic returns are because you have diminishing marginal return...
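Putting those two assumptions together gives a sketch like this (a, b, c are just illustrative constants, not anything I've estimated):

$$P(\text{true}) \approx \sigma(a \cdot n_{\text{insights}} + b), \qquad n_{\text{insights}} \approx c \cdot \log t \quad\Longrightarrow\quad P(\text{true}) \approx \sigma(ac \log t + b),$$

i.e. a logistic curve in log-time, where each doubling of time spent buys you a roughly constant bump in the log-odds of being right.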
From the Truthful AI paper:
If all information pointed towards a statement being true when it was made, then it would not be fair to penalise the AI system for making it. Similarly, if contemporary AI technology isn’t sophisticated enough to recognise some statements as potential falsehoods, it may be unfair to penalise AI systems that make those statements.
I wish we would stop talking about what is "fair" to expect of AI systems in AI alignment*. We don't care what is "fair" or "unfair" to expect of the AI system, we simply care about what the AI system actually does. The word "fair" comes along with a lot of connotations, often ones which actively work against our goal.
At least twice I have made an argument to an AI safety researcher by posing a story in which an AI system fails, and I have gotten the response "but that isn't fair to the AI system" (because it didn't have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.
(This sort of thing happens with mesa optimization -- if you have two objectives that are indistinguishable on the training data, it's "unfair" to expect the AI system to choose...
Consider two methods of thinking:
1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by "rolling out" that model
2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.
I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.
However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.
I think many people on LW tend to use option 1 almost always and my "deference" to option 2 in the absence of expertise is what leads to disagreements like How good is humanity at coordination?
Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).
Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven't explained them here.
EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won't be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.
One way to communicate about uncertainty is to provide explicit probabilities, e.g. "I think it's 20% likely that [...]", or "I would put > 90% probability on [...]". Another way to communicate about uncertainty is to use words like "plausible", "possible", "definitely", "likely", e.g. "I think it is plausible that [...]".
People seem to treat the words as shorthands for probability statements. I don't know why you'd do this: it loses information and increases miscommunication for basically no reason -- it's maybe slightly more idiomatic English, but it's not even much longer to just put the number into the sentence! (And you don't have to have precise numbers; you can use ranges or inequalities, if that's what you're using the words to mean.)
According to me, probabilities are appropriate for making decisions so you can estimate the EV of different actions. (This can also extend to the case where you aren't making a decision, but you're talking to someone who might use your advice to make decisions, but isn't going to understand your reasoning process.) In contrast, words are for describing the state of your reasoning algorithm, which often doesn't have much to d...
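As a toy illustration of the decision-making use case (the actions, probabilities, and utilities below are all made up):

```python
# Toy expected-value calculation: explicit probabilities let someone else
# plug your beliefs directly into their own decision, which "plausible" does not.
outcomes = {
    "ship the feature now": {"works": 0.7, "breaks prod": 0.3},
    "delay a week to test": {"works": 0.95, "breaks prod": 0.05},
}
payoffs = {"works": 100, "breaks prod": -400}  # made-up utilities

for action, probs in outcomes.items():
    ev = sum(p * payoffs[outcome] for outcome, p in probs.items())
    print(f"{action}: EV = {ev:.0f}")
```

Saying "it's plausible that shipping now breaks prod" gives the listener nothing they can put into this calculation.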
“Burden of proof” is a bad framing for epistemics. It is not incumbent on others to provide exactly the sequence of arguments to make you believe their claim; your job is to figure out whether the claim is true or not. Whether the other person has given good arguments for the claim does not usually have much bearing on whether the claim is true or not.
Similarly, don’t say “I haven’t seen this justified, so I don’t believe it”; say “I don’t believe it, and I haven’t seen it justified” (unless you are specifically relying on absence of evidence being evidence of absence, which you usually should not be, in the contexts that I see people doing this).
It's common for people to be worried about recommender systems being addictive, promoting filter bubbles, etc., but as far as I can tell, they don't have very good arguments for these worries. Whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine.
I'll go through the articles I've read that argue for worrying about recommender systems, and explain why I find them unconvincing. I've only looked at the ones that are widely read; there are probably significantly better arguments that are much less widely read.
Aligning Recommender Systems as Cause Area. I responded briefly on the post. Their main arguments and my counterarguments are:
I recently had occasion to write up quick thoughts about the role of assistance games (CIRL) in AI alignment, and how it relates to the problem of fully updated deference. I thought I'd crosspost here as a reference.
So here's a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you've got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.
The abstract says "we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt." This clearly can't be true in full generality, and I wish the abstract would give me some hint about what assumptions they're making. But we can look at the details in the paper.
(This next part isn't fully self-contained, you'll have to look at the notation and Definitions 1 and 3 in the paper to fully follow along.)
(EDIT: The following is wrong, see followup with Lukas, I misread one of the definitions.)
Looking into it I don't think the theorem even holds? In particular, Theorem 1 says:
...Theorem 1. Let γ ∈ [−1, 0), let B be a behaviour, and let P be an unprompted language model such that B is (α, β, γ)-distinguishable in P (Definition 3). Then P is γ-prompt-misalignable to B (Definition 1) with prompt length of O(log(1/ε), log(1/α)).
Note that B is (0.2,10,−1)-distinguishable in P.
I think this isn't right, because Definition 3 requires that sup_{s*} B_P−(s*) ≤ γ.
And for your counterexample, s* = "C" will have B_P−(s*) be 0 (because there's 0 probability of generating "C" in the future). So the sup is at least 0 > −1.
(Note that they've modified the paper, including definition 3, but this comment is written based on the old version.)
I occasionally hear the argument "civilization is clearly insane, we can't even do the obvious thing of <insert economic argument here, e.g. carbon taxes>".
But it sounds to me like most rationalist / EA group houses didn't do the "obvious thing" of taxing COVID-risky activities (which basically follows the standard economic argument of pricing in externalities). What's going on? Some hypotheses:
Our house implemented cap and trade (i.e. "You must impose at most X risk" instead of "You must pay $X per unit of risk.").
At Event Horizon we had a policy for around 6-9 months where if you got a microcovid, you paid $1 to the house, and it was split among everyone else. Do whatever you like, we don't mind, as long as you bring a microcovid estimate and pay the house.
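A minimal sketch of what such a Pigouvian scheme computes, using the $1-per-microcovid rate from the comment above (the house size and activity numbers are hypothetical):

```python
# Microcovid tax: the risk-taker pays a fixed rate per microCOVID of risk imposed,
# and the payment is split among the other housemates (pricing in the externality).
RATE_PER_MICROCOVID = 1.00  # dollars

def settle(microcovids: float, housemates: list[str], risk_taker: str) -> dict[str, float]:
    """Return each person's net transfer in dollars (positive = receives money)."""
    payment = RATE_PER_MICROCOVID * microcovids
    others = [h for h in housemates if h != risk_taker]
    transfers = {h: payment / len(others) for h in others}
    transfers[risk_taker] = -payment
    return transfers

# Hypothetical example: a 30-microcovid activity in a 5-person house.
print(settle(30, ["A", "B", "C", "D", "E"], risk_taker="A"))
# {'B': 7.5, 'C': 7.5, 'D': 7.5, 'E': 7.5, 'A': -30.0}
```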
You could certainly hire a good software engineer at that salary, but I don’t think you could give them a vision and network and trust them to be autonomous. Money isn’t the bottleneck there. Just because you have the funding to hire someone for a role doesn’t mean you can. Hiring is incredibly difficult. Go see YC on hiring, or PG.
Most founding startup people are worth way more than their salary.
My best guess is that rationalists aren't that sane, especially when they've been locked up for a while and are scared and socially rewarding others being scared.
What won't we be able to do by (say) the end of 2025? (See also this recent post.) Well, one easy way to generate such answers would be to consider tasks that require embodiment in the real world, or tasks that humans would find challenging to do. (For example, “solve the halting problem”, “produce a new policy proposal that has at least a 95% chance of being enacted into law”, “build a household robot that can replace any human household staff”.) This is cheating, though; the real challenge is in naming something where there’s an adjacent thing that _does...
OK, fair enough. But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting? I think this would happen to many humans if they could work at super-speed.
That said, I don't think this is that likely I guess... probably AI will be unable to do even three such posts, or it'll be able to generate arbitrary numbers of them. The human range is small. Maybe. Idk.
Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the subscripts on R and Q.
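(For reference, the penalty I have in mind has roughly this shape; this is my paraphrase from memory with the scaling factor dropped, not the exact equation:

$$R_{AUP}(s, a) = R(s) - \lambda \, \big| Q^*(s, a) - Q^*(s, \varnothing) \big|,$$

where ∅ is the no-op action and Q* is the optimal Q-function for the primary reward R.)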
Consider some starting state s, some starting action a, and consider the optimal trajectory under R that starts with that, which we'll denote as τ. Define ...
Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:
The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors).
Let's consider a model where there are clusters C_1, …, C_n, where each cluster contains trajectories whose feature...
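For context, the Boltzmann model I mean is the standard one, in my own notation rather than the paper's: the demonstrator picks trajectory ξ with probability proportional to exponentiated reward,

$$P(\xi \mid \theta) = \frac{\exp(\theta^\top \phi(\xi))}{\sum_{\xi'} \exp(\theta^\top \phi(\xi'))},$$

where φ(ξ) are the trajectory's features; the question is how this normalization behaves when the demonstrated trajectory sits in a sparse region of trajectory space.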
I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.
Define the reachability R(s_1, s_2) = γ^|τ(s_1, s_2)|, where π(s_1, s_2) is the optimal policy for getting from s_1 to s_2, and |τ(s_1, s_2)| is the length of its trajectory. This is the notion of reachability both in the original paper and the new ...
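A minimal sketch of that quantity for a deterministic MDP, assuming (as in my reconstruction above) that reachability is γ raised to the optimal number of steps; the environment and numbers are made up:

```python
from collections import deque

def reachability(start, goal, successors, gamma=0.95):
    """Return gamma ** (length of the shortest action sequence from start to goal), or 0 if unreachable."""
    if start == goal:
        return 1.0
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        state, dist = frontier.popleft()
        for nxt in successors(state):
            if nxt == goal:
                return gamma ** (dist + 1)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return 0.0

# Hypothetical 1-D world where the agent can step left or right along 0..4.
succ = lambda s: [x for x in (s - 1, s + 1) if 0 <= x <= 4]
print(reachability(0, 3, succ))  # 0.95 ** 3 ≈ 0.857
```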
I often search through the Alignment Newsletter database to find the exact title of a relevant post (so that I can link to it in a new summary), often reading through the summary and opinion to make sure it is the post I'm thinking of.
Frequently, I read the summary normally, then read the first line or two of the opinion and immediately realize that it wasn't written by me.
This is kinda interesting, because I often don't know what tipped me off -- I just get a sense of "it doesn't sound like me". Notably, I usually do agree with the opinion, so it isn't ab...
The LCA paper (to be summarized in AN #98) presents a method for understanding how specific updates to specific parameters contribute to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:
And then to decompose training loss across specific parameters:
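My rough reconstruction of those two decompositions, from memory of the paper, so treat the exact form as approximate:

$$L(\theta_T) - L(\theta_0) = \sum_{t=0}^{T-1} \big( L(\theta_{t+1}) - L(\theta_t) \big), \qquad L(\theta_{t+1}) - L(\theta_t) \approx \sum_i \frac{\partial L}{\partial \theta_i}(\theta_t)\,\big(\theta_{t+1,i} - \theta_{t,i}\big),$$

so each parameter i at each iteration t is credited with a (positive or negative) share of the change in training loss.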
...
In my double descent newsletter, I said:
This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. [...]...
This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradien