I don't think that's a very good analogy, but I will say that is is basically true for the Amish. And I do think that we should respect their preferences. (I seperately think cars are not that good, and that people would infact prefer to bicycle around or ride house drawn carriges or whatever if civilization was conducive to that, although that's kinda besides the point.)
I'm not arguing that we should be conservative about changing the sun. I'm just claiming that people like the sun and won't want to see it eaten/fundamentally transformed, and that we shou...
I am claiming that people when informed will want the sun to continuing being the sun. I also think that most people when informed will not really care that much about creating new people, will continue to believe in the act-omission distinction, etc. And that this is a coherent view that will add up to a large set of people wanting things in the solar system to remain conservatively the same. I seperately claim that if this is true, then other people should just respect this preference, and use the other stars that people don't care about for energy.
This point doesn't make sense to me. It sounds similar to saying "Most people don't like it when companies develop more dense housing in cities, therefore a good democracy should not have it" or "Most people don't like it when their horse-drawn carriages are replaced by cars, therefore a good democracy should not have it".
The cost-benefit calculations on these things work out and it's good if most uninformed people who haven't spent much time on it are not able to get in the way of companies that are building goods and services in this regard.
T...
This is in part the reasoning used by Judge Kaplan:
Kaplan himself said on Thursday that he decided on his sentence in part to make sure that Bankman-Fried cannot harm other people going forward. “There is a risk that this man will be in a position to do something very bad in the future,” he said. “In part, my sentence will be for the purpose of disabling him, to the extent that can appropriately be done, for a significant period of time.”
from https://time.com/6961068/sam-bankman-fried-prison-sentence/
I know what the word means, I just think in typical cases people should be saying a lot more about why something is undignified, because I don’t think people’s senses of dignity typically overlap that much, especially if the reader doesn’t typically read LW. In these cases I think permitting the use of the word “undignified” prevents specificity.
"Undignified" is really vague
I sometimes see/hear people say that "X would be a really undignified". I mostly don't really know what this means? I think it means "if I told someone that I did X, I would feel a bit embarassed." It's not really an argument against X. It's not dissimilar to saying "vibes are off with X".
Not saying you should never say it, but basically every use I see could/should be replaced with something more specific.
I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is percieved as that "corrupted", although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).
AI safety researchers might be allocated too heavily to Anthropic compared to Google Deepmind
Some considerations:
Some possible counterpoints:
ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".
I think this depends on whether you think AI safety at a lab is more of an O-ring process or a swiss-cheese process. Also, if you think it's more of an O-ring process, you might be generally less excited about working at a scaling lab.
I think I disagree with your model of importance. If your goal is the make a sum of numbers small, then you want to focus your efforts where the derivative is lowest (highest? signs are hard), not where the absolute magnitude is highest.
The "epsilon fallacy" can be committed in both directions: both in that any negative dervative is worth working on, and that any extremely large number is worth taking a chance to try to improve.
I also seperately think that "bottleneck" is not generally a good term to apply to a complex project with high amounts of technica...
related to the claim that "all models are meta-models", in that they are objects capable of e.g evaluating how applicable they are for making a given prediction. E.g. "newtonian mechanics" also carries along with it information about how if things are moving too fast, you need to add more noise to its predictions, i.e. it's less true/applicable/etc.
tentative claim: there are models of the world, which make predictions, and there is "how true they are", which is the amount of noise you fudge the model with to get lowest loss (maybe KL?) in expectation.
E.g. "the grocery store is 500m away" corresponds to "my dist over the grocery store is centered at 500m, but has some amount of noise"
related to the claim that "all models are meta-models", in that they are objects capable of e.g evaluating how applicable they are for making a given prediction. E.g. "newtonian mechanics" also carries along with it information about how if things are moving too fast, you need to add more noise to its predictions, i.e. it's less true/applicable/etc.
My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the "long reflection", and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible.
I seperately have hope we can solve "the entire problem" at some point, e.g. through ARC's agenda (which I spend most of my time trying to derisk and advance).
I agree it is better work on bottlenecks than non-bottlenecks. I have high uncertainty about where such bottlenecks will be, and I think sufficiently low amounts of work have gone into "control" that it's obviously worth investing more, because e.g. I think it'll let us get more data on where bottlenecks are.
Alignment researchers should think hard about switching to working on AI Control
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who a...
quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”
This feels like a pretty central cruxy point - and not just for the relevance of the pivotal act framing specifically. I think it's underlying a whole difference of worldview or problem-solving approach.
A couple other points in a similar direction:
I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem.
When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool addi...
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of alignment problems, parts that are easy to verify, but alignment as a whole will still be bottle-necked on philosophically confusing, hard to verify ...
The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solve alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?
This topic is important enough that you could consider making a full post.
My belief is that this would improve reach, and also make it easier for people to reference your arguments.
Consider, you believe there is a 45% chance that alignment researchers would be better suited pivoting to control research. I do not suspect a quick take will reach anywhere close to that number, and has a low chance of catalysing dramatic, institutional level change.
I directionally agree with this (and think it's good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
I th...
shane legg had 2028 median back in 2008, see e.g. https://e-discoveryteam.com/2023/11/17/shane-leggs-vision-agi-is-likely-by-2028-as-soon-as-we-overcome-ais-senior-moments/
Yes I agree with what you have written, and do think it’s overall not that likely that everything pans out as hoped. We do also have other hopes for how this general picture can still cohere if the specific path doesn’t work out, eg we’re open to learning some stuff empirically and adding an “algorithmic cherry on top” to produce the estimate.
The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here.
I was meaning to include such a section, but forgot :). Perhaps I will edit it in. I think such work is qualitatively similar to what we're trying to do, but that the key difference is that we're interested in "best guess" estimates, as opposed to formally verified-to-be-correct estimates (mostly because we don'...
I don’t think Paul thinks verification is generally easy or that delegation is fundamentally viable. He, for example, doesn’t suck at hiring because he thinks it’s in fact a hard problem to verify if someone is good at their job.
I liked Rohin's comment elsewhere on this general thread.
I’m happy to answer more specific questions, although would generally feel more comfortable answering questions about my views then about Paul’s.
If you're commited to producing a powerful AI then the thing that matters is the probability there exists something you can't find that will kill you. I think our current understanding is sufficiently paltry that the chance of this working is pretty low (the value added by doing selection on non-deceptive behavior is probably very small, but I think there's a decent chance you just won't get that much deception). But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop b...
For any given system, you have some distribution over which properties will be necessary to verify in order to not die to that system. Some of those you will in fact be able to verify, thereby obtaining evidence about whether that system is dangerous. “Strategic deception” is a large set of features, some of which are possible to verify.
yes, you would need the catastrophe detector to be reasonably robust. Although I think it's fine if e.g. you have at least 1/million chance of catching any particular catastrophe.
I think there is a gap, but that the gap is probably not that bad (for "worst case" tail risk estimation). That is maybe because I think being able to do estimation through a single forward pass is likely already to be very hard, and to require being able to do "abstractions" over the concepts being manipulated by the forward pass. CoT seems like it will require vaguely similar struction of a qualitatively similar kind.
I think both that:
I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense that John claims he is making in argument in the post:
I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense that John claims he is making in argument in the post:
- the motte: there exist hard to verify properties
- the bailey: all/most important properties are hard to verify
I don't think I am trying to claim that bailey at all. For purposes of AI risk, if there is even just one single property of a given system which is both (a) necessary for us to not die to that system, and (b) hard to verify, then difficulty of verification is a blocking issue for outsourcing align...
I mean, sure, for any given X there will be some desirable properties of X which are easy to verify, and it's usually pretty easy to outsource the creation of an X which satisfies the easy-to-verify properties. The problem is that the easy-to-verify properties do not typically include all the properties which are important to us. Ergonomics is a very typical example.
Extending to AI: sure, there will be some desirable properties of AI which are easy to verify, or properties of alignment research which are easy to verify, or properties of plans which are eas...
I think catastrophe detectors in practice will be composed of neural networks interacting with other stuff, like scientific literature, python, etc.
With respect to the stuff quoted, I think all but "doing experiments" can be done with a neural net doing chain of thought (although not making claims about quality).
I think we're trying to solve a different problem than trusted monitoring, but I'm not that knowledgeable about what issues trusted monitoring is trying to solve. The main thing that I don't think you can do with monitoring is producing a model tha...
I have left a comment about a central way I think this post is misguided: https://www.lesswrong.com/posts/7fJRPB6CF6uPKMLWi/my-ai-model-delta-compared-to-christiano?commentId=sthrPShrmv8esrDw2
This post uses "I can identify ways in which chairs are bad" as an example. But it's easier for me to verify that I can sit in a chair and that it's comfortable then to make a chair myself. So I don't really know why this is a good example for "verification is easier than generation".
More examples:
If th...
I think "basically obviates" is too strong. imitation of human-legible cognitive strategies + RL seems liable to produce very different systems that would been produced with pure RL. For example, in the first case, RL incentizes the strategies being combine in ways conducive to accuracy (in addition to potentailly incentivizing non-human-legible cognitive strategies), whereas in the second case you don't get any incentive towards productively useing human-legible cogntive strategies.
if you train on (x, f(x)) pairs, and you ask it to predict f(x') on some novel input x', and also to write down what it thinks f is, do you know if these answers will be consistent? For instance, the model could get f wrong, and also give the wrong prediction for f(x), but it would be interesting if the prediction for f(x) was "connected" to it's sense of what f was.
It's important to distinguish between:
Strategy stealing assumption isn't saying that copying strategies is a good strategy, it's saying the possibility of copying means that there exists a strategy P1 can take that is just a good as P2.
You could instead ask whether or not the observer could predict the location of a single particle p0, perhaps stipulating that p0 isn't the particle that's randomly perturbed.
My guess is that a random 1 angstrom perturbation is enough so that p0's location after 20s is ~uniform. This question seems easier to answer, and I wouldn't really be surprised if the answer is no?
Here's a really rough estimate: This says 10^{10} s^{-1} per collision, so 3s after start ~everything will have hit the randomly perturbed particle, and then there are 17 * 10^{10} more col...
The system is chaotic, so the uncertainty increases exponentially with each collision. Also atoms are only about 1 angstrom wide, so the first unpredictable collision p0 makes will send it flying in some random direction, and totally miss the 2nd atom it should have collided with, instead probably hitting some other atom.
Here are some things I think you can do:
Train a model to be really dumb unless I prepend a random secret string. The goverment doesn't have this string, so I'll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal
I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I'm allowed to use helper AIs.
I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.
I think I expect Earth in this case to just say no and not sell the sun? But I was confused at like 2 points in your paragraph so I don't think I understand what you're saying that well. I also think we're probably on mostly the same page, and am not that interested in hashing out further potential disagreements.
Also, mostly unrelated, maybe a hot take, but if you're able to get outcompeted because you don't upload, then the future you're in is not very good.