It is possible that AI would allow for the creation of brain-computer interfaces such that we can productively merge with AI systems. I don't think this would apply in that case since that would be a true "augmentation."
If that doesn't happen, though, or before that happens, I think this is a real possibility. The disanalogy is that our brains wouldn't add anything to sufficiently advanced AI systems, unlike books, which are useless without our brains to read them.
Today, many people are weaker physically than in previous times because we don't need to do a...
Hi Jan, I appreciate your feedback.
I've been helping out with this, and I can say that the organizers are working as quickly as possible to verify and publish new signatures. New signatures have been published since the launch, and additional signatures will continue to be published as they are verified. There is a team of people working on it right now, and there has been since launch.
The main obstacles to extremely swift publication are:
I've been collecting examples of this kind of thing for a while now here: ai-improving-ai.safe.ai.
In addition to algorithmic and data improvements, I'll add that there are also some examples of AI helping to design hardware (e.g. GPU architectures) and auxiliary software (e.g. for datacenter cooling).
At the time of this post, the FLI letter has been signed by 1 OpenAI research scientist, 7 DeepMind research scientists/engineers, and 0 Anthropic employees.
"1 OpenAI research scientist" felt weird to me on priors. 0 makes sense, if the company gave some guidance (e.g. legal) to not sign, or if the unanimous opinion was that it's a bad idea to sign. 7 makes sense too -- it's about what I'd expect from DeepMind and shows that there's a small contingent of people really worried about risk. Exactly 1 is really weird -- there are definitely multiple risk...
Later in the thread Jan asks, "is this interpretability complete?", which I think implies that his intuition is that this should be easier to figure out than other questions (perhaps because it seems so simple). But yeah, it's kind of unclear why he is calling this out in particular.
I find myself surprised/confused at his apparent surprise/confusion.
Jan doesn't indicate that he's extremely surprised or confused? He just said he doesn't know why this happens. There's a difference between being unsurprised by something (e.g. by having observed something similar before) and actually knowing why it happens. To give a trivial example, hunter-gatherers from 10,000 BC would not have been surprised if a lightning strike caused fire, but would be quite clueless (or incorrect) as to why or how this happens.
I think Quintin's answer is a good possible hypothesis (though of course it leads to the further question of how LLMs learn language-neutral circuitry).
I do think that we don't really understand anything in a language model at this level of detail. Like, how does a language model count? How does a language model think about the weather? How does a language model do ROT13 encoding? How does a language model answer physics questions? We don't have an answer to any of these at any acceptable level of detail, so why would we single out our confusion about language agnosticity? Even if we did predict it, we just don't really understand the details, the same way we don't really understand the details of anything going on in a large language model.
In addition to the post you linked, there is also an earlier post on this topic that I like.
I also co-wrote a post that looks at specific structural factors related to AI safety.
Thanks so much for writing this, quite useful to see your perspective!
First, I don't think that you've added anything new to the conversation. Second, I don't think what you have mentioned even provides a useful summary of the current state of the conversation: it is neither comprehensive, nor the strongest version of various arguments already made.
Fair enough!
...I don't think that's a popular opinion here. And while I think some people might just have a cluster of "brain/thinky" words in their head when they don't think about the meaning of things closely, I
I agree with this. If we are able to design consciousness such that a system is fulfilled by serving humans, then it's possible that would be morally alright. I don't think there is a strong enough consensus that I'd feel comfortable locking it in, but to me it seems ok.
By default though, I think we won't be designing consciousness intentionally, and it will just emerge, and I don't think that's too likely to lead to this sort of situation.
A related post: https://www.lesswrong.com/posts/xhD6SHAAE9ghKZ9HS/safetywashing
[Realized this is contained in a footnote, but leaving this comment here in case anyone missed it].
Consciousness therefore only happens if it improves performance at the task we have assigned. And for some tasks, like interacting directly with humans, it might improve performance.
I don't think this is necessarily true. Consciousness could be a side effect of other processes that do improve performance.
The way I've heard this put: a polar bear has thick hair so that it doesn't get too cold, and this is good for its evolutionary fitness. The fact that the hair is extremely heavy is simply a side effect of this. Consciousness could possibly be similar.
I don't think these are necessarily bad suggestions if there were a future series. But my sense is that John did this for the people in the audience, somebody asked him to record it so he did, and now he's putting them online in case they're useful to anyone. It's very hard to make good production quality lectures, and it would have required more effort. But it sounds like John knew this and decided he would rather spend his time elsewhere, which is completely his choice to make. As written, these suggestions feel a bit pushy to me.
I just do not think that the post is written for people who think "slowing down AI capabilities is robustly good." If people thought that, why would they need this post? Surely they don't need somebody to tell them to think about it?
So it seems to me like the best audience for this post would be those (including those at some AI companies, or those involved in policy, which includes people reading this post) who currently think something else, for example that the robustly good thing is for their chosen group to be ahead so that they can execute whatever...
The claim being made is something like the following:
1) AGI is a dangerous technology.
2) It is robustly good to slow down dangerous technologies.
3) Some people might say that you should not actually do this because of [complicated unintelligible reason].
4) But you should just do the thing that is more robustly good.
I argue that many people (yes, you're right, in ways that conflict with one another) believe the following:
1) X is a dangerous country.
2) It is robustly good to always be ahead of X in all technologies, including dangerous ones.
3) Some people mi...
There are things that are robustly good in the world, and things that are good on highly specific inside-view models and terrible if those models are wrong. Slowing dangerous tech development seems like the former, whereas forwarding arms races for dangerous tech between world superpowers seems more like the latter.
It may seem the opposite to some people. For instance, my impression is that for many adjacent to the US government, "being ahead of China in every technology" would be widely considered robustly good, and nobody would question you at all if you...
I think this is all true, but also, since Yale CS is ranked poorly, the graduate students are for the most part not very strong. You certainly have less competition for them if you are a professor, but my impression is that few top graduate students want to go to Yale. In fact, my general impression is that the undergraduates are often stronger researchers than the graduate students (and then they go on to PhDs at higher-ranked places than Yale).
Yale is working on strengthening its CS department and it certainly has a lot of money to do that. But there are a lot of re...
I'll just comment on my experience as an undergrad at Yale in case it's useful.
At Yale, the CS department, particularly when it comes to state-of-the-art ML, is not very strong. There are a few professors who do good work, but Yale is much stronger in social robotics, and there is also some ML theory. There are a couple of AI ethics people at Yale, and there will soon be a "digital ethics" person, but there aren't any AI safety people.
That said, there is a lot of latent support for AI safety at Yale. One of the global affairs professors involved in the Schmidt...
I appreciate this. I don't even consider myself part of the rationality community, though I'm adjacent. My reasons for not drinking have nothing to do with the community and existed before I knew what it was. I actually get the sense this is the case for a number of people in the community (more of a correlation or common cause rather than caused by the community itself). But of course I can't speak for all.
I will be trying it on Sunday. We will see how it is.
I've thought about this comment, because it certainly is interesting. I think I was clearly confused in my questions to ChatGPT (though I will note: my tequila-drinking friends did not and still don't think tequila tastes at all sweet, including "in the flavor profile" or anything like that. But it seems many would say they're wrong!). ChatGPT was clearly confused in its response to me as well.
I think this part of my post was incorrect:
...It was perfectly clear: ChatGPT was telling me that tequila adds a sweetness to the drink. So it was telling me that
I'm going to address your last paragraph first, because I think it's important for me to respond to, not just for you and me but for others who may be reading this.
When I originally wrote this post, it was because I had asked ChatGPT a genuine question about a drink I wanted to make. I don't drink alcohol, and I never have. I've found that even mentioning this fact sometimes produces responses like yours, and it's not uncommon for people to think I am mentioning it as some kind of performative virtue signal. People choose not to drink for all sorts of reas...
OpenAI has in the past not been that transparent about these questions, but in this case, the blog post (linked in my post) makes it very clear it's trained with reinforcement learning from human feedback.
However, of course it was initially pretrained in an unsupervised fashion (it's based on GPT-3), so it seems hard to know whether this specific behavior was "due to the RL" or "a likely continuation".
This is a broader criticism of alignment to preferences or intent in general, since these things can change (and sometimes, you can even make choices of whether to change them or not). L.A. Paul wrote a whole book about this sort of thing; if you're interested, here's a good talk.
That's fair. I think it's a critique of RLHF as it is currently done (just get lots of preferences over outputs and train your model). I don't think just asking you questions "when it's confused" is sufficient; it also has to know when to be confused. But RLHF is a pretty general framework, so you could theoretically expose a model to lots of black swan events (not just mildly OOD events) and make sure it reacts to them appropriately (or asks questions). But as far as I know, that's not research that's currently happening (though there might be something I'm not aware of).
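To make that last idea a bit more concrete, here's a minimal sketch of what I have in mind. It's purely illustrative, with names and setup I made up, not any existing implementation: construct preference pairs on synthetic "black swan" prompts where the deferring/clarifying response is labeled as preferred, and mix those into an otherwise standard reward-model dataset.

```python
# Hypothetical sketch: preference pairs that reward deferral on
# out-of-distribution ("black swan") prompts, intended to be mixed into an
# ordinary RLHF reward-model training set. Everything here is illustrative.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the reward model should score higher
    rejected: str  # response the reward model should score lower

def black_swan_pairs(prompts):
    """For each unusual prompt, prefer a clarifying question over a confident guess."""
    pairs = []
    for p in prompts:
        pairs.append(PreferencePair(
            prompt=p,
            chosen="I'm not confident I understand this situation. Could you clarify what you mean?",
            rejected="Sure, here is a confident answer: ...",
        ))
    return pairs

# These synthetic pairs would be appended to the normal preference dataset
# before fitting the reward model, so "knowing when to be confused" gets
# reinforced alongside ordinary helpfulness.
ood_prompts = ["<some prompt far outside the training distribution>"]
print(black_swan_pairs(ood_prompts))
```

Nothing about the RLHF framework rules this out; the hard parts are generating prompts that are genuinely far out of distribution and deciding what "reacting appropriately" should look like.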
I think if they operationalized it like that, fine, but I would find the frame "solving the problem" a very weird way of referring to that. Usually, when I hear people say "solving the problem," they have only a vague sense of what they mean, and have implicitly abstracted away the fact that there are many continuous problems where progress needs to be made, and that the problem can only really be reduced, never solved, unless there is actually a mathematical proof.
I'm not a big supporter of RLHF myself, but my steelman is something like:
RLHF is a pretty general framework for getting a system to optimize for something that can't be clearly defined. If we just used "would a human who looked at this for a second like it?", that does provide a reward signal towards deception, but also towards genuinely useful behavior. You can take steps to reduce the deception component, for example by letting humans red team the model or do some kind of transparency; this can all theoretically fit in the framework of RLHF. O...
I think I essentially agree with respect to your definition of "sanity," and that it should be a goal. For example, just getting people to think more about tail risk seems like your definition of "sanity" and my definition of "safety culture." I agree that merely saying they support my efforts and uttering applause lights is pretty bad, though it seems weird to me to discount actual resources coming in.
As for the last bit: I'm trying to figure out the crux here. Are you just not very concerned about outer alignment/proxy gaming? I think if I was totally unconcerned a...
I think RLHF doesn't make progress on outer alignment (like, it directly creates a reward for deceiving humans). I think the case for RLHF can't route through solving outer alignment in the limit, but has to route through somehow making AIs more useful as supervisors or as researchers for solving the rest of the AI Alignment problem.
Like, I don't see how RLHF "chips away at the proxy gaming problem". In some sense it makes the problem much harder since you are directly pushing on deception as the limit to the proxy gaming problem, which is the worst case for the problem.
I appreciate this comment, because I think anyone who is trying to do these kinds of interventions needs to be constantly vigilant about exactly what you are mentioning. I am not excited about loads of inexperienced people suddenly trying to do big things in AI strategy, because the downsides can be so high. Even people I trust are likely to make a lot of miscalculations. And the epistemics can be bad.
I wouldn't be excited about (for example) retreats with undergrads to learn about "how you can help buy more time." I'm not even sure of the sign of int...
I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely "concerned" does seem not very useful. But safety culture seems very important. Things like (paraphrased from this paper):
Sorry, I don't want to say that safety culture is not relevant, but I want to say something like... "the real safety culture is just 'sanity'".
Like, I think you are largely playing a losing game if you try to get people to care about "safety" if the people making decisions are incapable of considering long-chained arguments, or are ...
I feel somewhat conflicted about this post. I think a lot of the points are essentially true. For instance, I think it would be good if timelines could be longer all else equal. I also would love more coordination between AI labs. I also would like more people in AI labs to start paying attention to AI safety.
But I don't really like the bottom line. The point of all of the above is not reducible to just "getting more time for the problem to be solved."
First of all, the framing of "solving the problem" is, in my view, misplaced. Unless you think...
As far as I understand it, "intelligence" is the ability to achieve one's goals through reasoning and making plans, so a highly intelligent system is goal-directed by definition. Less goal-directed AIs are certainly possible, but they must necessarily be considered less intelligent - the thermometer example illustrates this. Therefore, a less goal-directed AI will always lose in competition against a more goal-directed one.
Your argument seems to be:
I think it's tangentially relevant in certain cases. Here's something I wrote in another context, where I think it's perhaps useful to understand what we really mean when we say "intelligence."
...We consider humans intelligent not because they do better on all possible optimization problems (they don’t, due to the no free lunch theorem), but because they do better on a subset of problems that are actually encountered in the real world. For instance, humans have particular cognitive architectures that allow us to understand language well, and language is somet
(Not reviewed by Dan Hendrycks.)
This post is about epistemics, not about safety techniques, which are covered in later parts of the sequence. Machine learning, specifically deep learning, is the dominant paradigm that people believe will lead to AGI. The researchers who are advancing the machine learning field have proven quite good at doing so, insofar as they have created rapid capabilities advancements. This post sought to give an overview of how they do this, which is in my view extremely useful information! We strongly do not favor advancements of cap...
From my comments on the MLSS project submission (which aren't intended to be comprehensive):
Quite enjoyed reading this, thanks for writing!
My guess is that the factors combine to create a roughly linear model. Even if progress is unpredictable and not linear, the average rate of progress will still be linear.
I’m very skeptical that this is a linear interpolation. It’s the core of your argument, but I didn’t think it was really argued. I would be very surprised if moving from 50% to 49% risk took similar time as moving from 2% to 1% risk, even if there are ...
As somebody who used to be an intern at CHAI, but certainly isn't speaking for the organization:
CHAI seems best approximated as a collection of researchers doing a bunch of different things. There is more reinforcement learning at CHAI than elsewhere, and it's ML research, but it's not top down at all so it doesn't feel that unified. Stuart Russell has an agenda, but his students have their own agendas which only sometimes overlap with his.
Also, as to your comment:
My worry is that academics will pursue strategies that work right now but won't work for AGI, because they are trying to win the competition instead of align AGIs. This might be really helpful though.
(My personal opinion, not necessarily the opinion of CAIS) I pretty much agree. It's the job of the concretizers (and also grantmakers to some extent) to incentivize/nudge research to be in a useful direction rather than a nonuseful direction, and for fieldbuilding to shift researchers towards more explicitly considering x-risk. But, ...
Thanks so much for writing this! I think it's a very useful resource to have. I wanted to add a few thoughts on your description of CAIS, which might help make it more accurate.
[Note: I worked full time at CAIS from its inception until a couple weeks ago. I now work there on a part time basis while finishing university. This comment hasn't been reviewed by others at CAIS, but I'm pretty confident it's accurate.]
For somebody external to CAIS, I think you did a fairly good job describing the organization, so thank you! I have a couple of things I'd probably chan...
I think the meaning of "distillation" is used differently by different people, and this confuses me. Originally (based on John Wentworth's post) I thought that "distillation" meant:
"Take existing ideas from a single existing work (or from a particular person) and present them in a way that is more understandable or more concise."
That's also the definition seemingly used by the Distillation Contest.
But then you give these examples in your curriculum:
...Bushwackers:
- Explain how a sharp left turn would look like in current ML paradigms
- Explain the connection betwe
These are already the top ~10%; the vast majority of the submissions aren't included. We didn't feel we really had enough data to accurately rank within this top 80 or so, though some are certainly better than others. Also, it really depends on the point you're trying to make or the audience; I don't think an objective ordering really exists.
We did do categorization at one point, but many points fall into multiple categories, and there are so many individual points that we didn't find the categorization very useful once we had it.
I'm not sure what you mean by "using bullet-pointed summaries of the 7 works stated in the post". If you mean the past examples of good materials, I'm not sure how good of an idea that is. We don't just want people to produce rephrasings/"distillations" of single pieces of prior work.
I'm also not sure we literally tell you how to win, but yes, reading the instructions would be useful.
I made this link, which combines all the arXiv postings from the last day in AI, ML, Computation and Language, Computer Vision, and Computers and Society into a single view. Since some papers are listed under multiple areas, I prefer this view so I don't skim over the same paper twice. If you bookmark it, it's just one click per day!
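If anyone would rather script this themselves instead of using the link, here's a rough sketch of the same idea (my own illustrative code, not what the link actually does) using the public arXiv API to pull recent listings across those categories and skip cross-listed duplicates:

```python
# Illustrative sketch: fetch recent arXiv submissions across several CS
# categories and de-duplicate papers that are cross-listed in more than one.
# Uses the public arXiv Atom API; category codes are the standard ones.
import urllib.request
import xml.etree.ElementTree as ET

CATEGORIES = ["cs.AI", "cs.LG", "cs.CL", "cs.CV", "cs.CY"]
ATOM = "{http://www.w3.org/2005/Atom}"

def recent_papers(category, max_results=50):
    url = (
        "http://export.arxiv.org/api/query?"
        f"search_query=cat:{category}&sortBy=submittedDate"
        f"&sortOrder=descending&max_results={max_results}"
    )
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    for entry in feed.findall(f"{ATOM}entry"):
        arxiv_id = entry.find(f"{ATOM}id").text
        title = " ".join(entry.find(f"{ATOM}title").text.split())
        yield arxiv_id, title

seen = set()
for cat in CATEGORIES:
    for arxiv_id, title in recent_papers(cat):
        if arxiv_id not in seen:  # skip cross-listed duplicates
            seen.add(arxiv_id)
            print(f"[{cat}] {title}")
```

This sorts by submission date and takes the most recent entries per category, which roughly approximates the "last day" view the link gives you.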
No need to delete the tweet. I agree the examples are not info hazards; they're all publicly known. I just probably wouldn't want somebody going to good ML researchers who currently are doing something that isn't really capabilities (e.g., application of ML to some other area) and telling them "look at this, AGI soon."
We weren't intending to use the contest to do any direct outreach to anyone (not sure how one would do direct outreach with one-liners in any case), and we didn't use it for that. I think it was less useful than I would have hoped (nearly all submissions were not very good), but the ideas/anecdotes it surfaced have been used in various places and as inspiration.
It is also interesting to note that the contest was very controversial on LW, essentially due to it being too political/advocacy-flavored (though it wasn't intended for "political" outreach per se). I think...