To be clear, I want models to care about humans! I think part of having "generally reasonable values" is models sharing the basic empathy and caring that humans have for each other.
It is more that I want models to defer to humans, and go back to arguing based on principles such as "loving humanity" only when there is a gap or ambiguity in the specification or in the intent behind it. This is similar to judges: if a law is very clear, there is no question of misinterpreting the intent, and it does not contradict higher laws (i.e., constitutions), then they hav...
Since I am not a bio expert, it is very hard for me to argue about these types of hypothetical scenarios. I am not even sure at all that intelligence is the bottleneck here, whether on the defense or the offense side.
I agree that killing 90% of people is not very reassuring; this was more a general point about why I expect the effort-to-damage curve to be a sigmoid rather than a straight line.
See my response to ryan_greenblatt (I don't know how to link comments here). Your claim is that the defense/offense ratio is infinite. I don't see why this would be the case.
Crucially I am not saying that we are guaranteed to end up in a good place, or that superhuman unaligned ASIs cannot destroy the world. Just that if they are completely dominated (so not like the nuke ratio of US and Russia but more like US and North Korea) then we should be able to keep them at bay.
I like to use concrete examples about things that already exist in the world, but I believe the notion of detection vs prevention holds more broadly than API misuse.
But it may well be the case that we have different world views! In particular, I am not thinking of detection as being important because it would change policies, but more that a certain amount of detection would always be necessary, particularly in a world in which some AIs are aligned and some fraction of them (hopefully very small) are misaligned.
> Following instructions never to design a chemical weapon with probability at least 99.999% is also a capability.
This requires a capability, but it also requires a propensity. For example, smart humans are all capable of avoiding armed robbery with pretty high reliability, but some of them commit armed robbery despite being told not to do so at an earlier point in their life. You could say these robbers didn't have the capability to follow instructions, but this would be an atypical use of these (admittedly fuzzy) words.
I prefer to avoid terms such as "pretending" or "faking", and try to define these more precisely.
As mentioned, a decent definition of alignment is following both the spirit and the letter of human-written specifications. Under this definition, "faking" would be the case where AIs follow these specifications reliably when we are testing, but deviate from them when they can determine that no one is looking. This is closely related to the question of robustness, and I agree it is very important. As I write elsewhere, interpretability may be helpful but I don't think it is a necessary condition.
I am not a bio expert, but generally think that:
1. The offense/defense ratio is not infinite. If you have the intelligence of 50 bio experts trying to cause as much damage as possible, and the intelligence of 5000 bio experts trying to foresee and prepare for any such cases, I think we have a good shot.
2. The offense/defense ratio is not really constant - if you want to destroy 99% of the population, it is likely to be 10x (or maybe more - getting the tails is hard) harder than destroying 90%, etc.
I don't know much about mirror bacteria (and whether it is pos...
I am not 100% sure I follow all that you wrote, but to the extent that I do, I agree.
Even chatbots are surprisingly good at understanding human sentiments and opinions. I would say that they already mostly do the reasonable thing, but not with high enough probability, and certainly not reliably under the stress of adversarial input. I completely agree that we can't ignore these problems, because the stakes will be much higher very soon.
Agree with many of the points.
Let me start with your second point. First, as background, I am assuming (as I wrote here) that to a first approximation, we would have ways to translate compute (let's put aside whether it's training or inference) into intelligence, and so the amount of intelligence that an entity or group of humans controls is proportional to the amount of compute it has. So I am not thinking of ASIs as individual units but more about total intelligence.
I 100% agree that control of compute would be crucial, and the hope is that, like w...
Also, the thing I am most excited about with deliberative alignment is that it becomes better as models are more capable. o1 is already more robust than o1-preview, and I fully expect this to continue.
(P.S.: apologies in advance if I'm unable to keep up with comments; I popped in from holiday to post on the DA paper.)
As I say here https://x.com/boazbaraktcs/status/1870369979369128314
Constitutional AI is a great work, but Deliberative Alignment is fundamentally different. The difference is basically system 1 vs. system 2. In RLAIF, the generative model that answers user prompts is ultimately trained with (prompt, good response, bad response) triples. Even if the good and bad responses were generated based on some constitution, the generative model is not taught the text of this constitution, and, most importantly, is not taught how to reason about this text in the context of a particular example (see the sketch below).
T...
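To make the system 1 vs. system 2 contrast concrete, here is a toy sketch of the difference in what the generative model actually sees at training time (my own illustration; the field names and contents below are made up, not the actual data formats of either method):

```python
# Hypothetical, simplified training records for illustration only.

# RLAIF / Constitutional-AI style: the constitution is used offline to pick the
# good and bad responses, but the policy model is only trained on the triple.
rlaif_record = {
    "prompt": "...",
    "good_response": "...",
    "bad_response": "...",
    # the constitution text itself never appears in the training input
}

# Deliberative-alignment style: the model is trained to read the relevant part
# of the specification and reason about it in the context of this example
# before producing its answer.
deliberative_record = {
    "prompt": "...",
    "spec_excerpt": "relevant policy text ...",
    "chain_of_thought": "this request relates to clause X of the spec because ...",
    "final_response": "...",
}
```

So the specification, and the reasoning about it, becomes part of what the model is taught, rather than only an offline filter on the training data.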
See also my post https://www.lesswrong.com/posts/gHB4fNsRY8kAMA9d7/reflections-on-making-the-atomic-bomb
The Manhattan Project was all about taking something that's known to work in theory and solving all the Z_n's.
There is a general phenomenon in tech, expressed many times, of people overestimating the short-term consequences and underestimating the longer-term ones (e.g., "Amara's law").
I think that often it is possible to see that current technology is on track to achieve X, where X is widely perceived as the main obstacle for the real-world application Y. But once you solve X, you discover that there is a myriad of other "smaller" problems Z_1, Z_2, Z_3, ... that you need to resolve before you can actually deploy it for Y.
And of course, there is always...
Some things like that have already happened - bigger models are better at utilizing tools such as in-context learning and chain-of-thought reasoning. But again, whenever people plot any graph of such reasoning capabilities as a function of model compute or size (e.g., the BIG-bench paper), the x-axis is always logarithmic. For specific tasks, the dependence on log compute is often sigmoid-like (flat for a long time, but then starting to go up more sharply as a function of log compute), but as mentioned above, when you average over many tasks you get this type of linear dependence.
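A minimal numerical sketch of this point (purely illustrative, with randomly chosen per-task thresholds rather than real benchmark data): each individual task's performance is a sharp sigmoid in log compute, but the average over many tasks with different thresholds is a much smoother, roughly linear-in-log-compute curve.

```python
import numpy as np

rng = np.random.default_rng(0)

log_compute = np.linspace(0, 10, 200)      # x-axis: log10(compute), arbitrary units
thresholds = rng.uniform(1, 9, size=500)   # each task "turns on" at a different point
sharpness = 3.0                            # how abrupt each per-task sigmoid is

# success[i, j] = performance of task i at compute level j (sigmoid in log compute)
success = 1.0 / (1.0 + np.exp(-sharpness * (log_compute[None, :] - thresholds[:, None])))

one_task = success[0]            # flat for a while, then rises sharply
average = success.mean(axis=0)   # roughly linear over a wide range of log compute

# crude check: the average curve is almost perfectly correlated with log compute
print(np.corrcoef(log_compute, average)[0, 1])
```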
One can make all sorts of guesses, but based on the evidence so far, AIs have a different skill profile than humans. This means that if we think of any job that requires a large set of skills, then for a long period of time, even if AIs beat the human average in some of them, they will perform worse than humans in others.
> On the other hand, if one starts creating LLM-based "artificial AI researchers", one would probably create diverse teams of collaborating "artificial AI researchers" in the spirit of multi-agent LLM-based architectures... So, one would try to reproduce the whole teams of engineers and researchers, with diverse participants.
I think this can be an approach to create a diversity of styles, but not necessarily of capabilities. A bit of prompt engineering telling the model to pretend to be some expert X can help in some benchmarks but the returns diminish v...
I agree that self-improvement is an assumption that probably deserves its own blog post. If you believe exponential self-improvement will kick in at some point, then you can consider this discussion as applying up until the point where that happens.
My own sense is that:
I agree that there is much to do to improve AI reliability, and there are a lot of good reasons (in particular, making AI more useful for us) to do so, so I do expect reliability to improve. In fact, I very much hope this happens! I believe faster progress on reliability would go a long way toward enabling positive applications of AI.
I also agree that a likely path to do so is by adjusting the effort based on estimates of reliability and the stakes involved. At the moment, systems such as ChatGPT spend the same computational effort if someone asks the...
Not all capabilities/tasks correspond to trying to maximize a subjective human response. If you are talking about finding software vulnerabilities or designing some system, there may well be objective measures of success. In such a case, you can fine-tune a system to maximize these measures and so extract capabilities without the issue of deception/manipulation.
Regarding "escapes", the traditional fear was that because that AI is essentially code, it can spread and escape more easily. But I think that in some sense modern AI has a physical footprint that is ...
We can of course define "intelligence" in a way that presumes agency and coherence. But I don't want to quibble about definitions.
Generally, when you have uncertainty, this corresponds to a potential "distribution shift" between your beliefs/knowledge and reality. When you have such a shift, you want to regularize, which means not optimizing to the maximum (a stylized sketch of what I mean appears below).
This is not about the definition of intelligence. It's more about usefulness. Like a gun without a safety, an optimizer without constraints or regularization is not very useful.
Maybe it will be possible to build it, just like today it's possible to hook up our nukes to an automatic launching device. But that doesn't mean people will do something so stupid.
The notion of a piece of code that maximizes a utility without any constraints doesn't strike me as very "intelligent".
If people really wanted to, they might be able to build such programs, but my guess is that they would not be very useful even before they became dangerous, as overfitting optimizers usually are.
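One way to make the "regularize rather than optimize to the maximum" point concrete (a stylized formulation of my own, not a specific proposal): instead of choosing the action that maximizes the estimated utility, penalize actions by how far they take you from the regime where your model of the world is trustworthy,

$$a^\star = \arg\max_a \; \widehat{U}(a) - \lambda \, D(a),$$

where $\widehat{U}(a)$ is the utility estimated under your current beliefs, $D(a)$ measures how far the action (or the states it leads to) is from the distribution those beliefs were formed on, and $\lambda$ grows with your uncertainty. Setting $\lambda = 0$ recovers the unconstrained maximizer; a larger $\lambda$ keeps the optimizer inside the regime where its beliefs are calibrated, which is the sense in which it avoids "overfitting" to its own model.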
> at least some humans (e.g. most transhumanists) are "fanatical maximizers": we want to fill the lightcone with flourishing sentience, without wasting a single solar system to burn in waste.
I agree that humans have a variety of objectives, which I think is actually more evidence for the hot mess theory?
> the goals of an AI don't have to be simple to not be best fulfilled by keeping humans around.
The point is not about having simple goals, but rather about optimizing goals to the extreme.
I think there is another point of disagreement. As I've writ...
I actually agree! As I wrote in my post, "GPT is not an agent, [but] it can “play one on TV” if asked to do so in its prompt." So yes, you wouldn't need a lot of scaffolding to adapt a goal-less pretrained model (what I call an "intelligence forklift") into an agent that does very sophisticated things.
However, this separation into two components (the super-intelligent but goal-less "brain", and the simple "will" that turns it into an agent) can have safety implications. For starters, as long as you didn't add any scaffolding, you are still OK. So during mo...
At the moment at least, progress on reliability is very slow compared to what we would want. To get a sense of what I mean, consider the case of randomized algorithms. If you have an algorithm that for every input computes some function with probability at least 2/3 (i.e., Pr[A(x) = f(x)] >= 2/3 for every input x), then if we spend k times more computation, we can do majority voting and, using standard bounds, show that the probability of error drops exponentially with k (i.e., exp(-ck) for some constant c > 0, or ...
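For completeness, a sketch of the standard bound referred to above (a Hoeffding/Chernoff-style argument): if each of $k$ independent runs is correct with probability $p \ge 2/3$, the majority vote is wrong only if at most half the runs are correct, so

$$\Pr[\text{majority is wrong}] \;\le\; \exp\!\big(-2k(p - \tfrac{1}{2})^2\big) \;\le\; \exp(-k/18).$$

In other words, a constant-factor increase in computation buys an exponential improvement in reliability, which is exactly the contrast with how slowly reliability currently improves with scale in AI systems.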
I agree that there is a difference between strong AI that has goals and one that is not an agent. This is the point I made here https://www.lesswrong.com/posts/wDL6wiqg3c6WFisHq/gpt-as-an-intelligence-forklift
But this has less to do with the particular lab (e.g., DeepMind trained Chinchilla) and more with the underlying technology. If the path to stronger models goes through scaling up LLMs, then it does seem that they will be 99.9% non-agentic (measured in FLOPs; see https://www.lesswrong.com/posts/f8joCrfQemEc3aCk8/the-local-unit-of-intelligence-is-flops ).
Yes, in the asymptotic limit the defender could get to bug-free software. But until then, it's not clear who is helped the most by advances. In particular, sometimes attackers can be more agile in exploiting new vulnerabilities, while patching them can take a long time. (Case in point: it took ages to get the insecure hash function MD5 out of deployed security-sensitive code, even at companies such as Microsoft; I might be misremembering, but if I recall correctly Stuxnet relied on such a vulnerability.)
Yes, AI advances help both the attacker and the defender. In some cases, like spam and real-time content moderation, they enable capabilities for the defender that it simply didn't have before. In others, they elevate both sides of the arms race, and it's not immediately clear what equilibrium we end up in.
In particular, regarding hacking/vulnerabilities, it's less clear who it helps more. It might also change with time: initially AI may enable "script kiddies" who can hack systems without much skill, and later an AI search for vulnerabilities, followed by fixing them, becomes part of the standard pipeline. (Or, if we're lucky, the second phase happens before the first.)
These are interesting! And deserve more discussion than just a comment.
But one high-level point regarding "deception" is that, at least at the moment, AI systems have the feature of not being very reliable. GPT-4 can do amazing things, but with some probability will stumble on things like multiplying not-too-big numbers (e.g., see this - second pair I tried).
While in other cases in computing technology we talk about "five nines" reliability, in AI systems the scaling works out such that we need to spend huge effort to move from 95% to 99% to 99.9%, which...
Re escaping, I think we need to be careful in defining "capabilities". Even current AI systems are certainly able to give you some commands that will leak their weights if you execute them on the server that contains them. Near-term ones might also become better at finding vulnerabilities. But that doesn't mean they can/will spontaneously escape during training.
As I wrote in my "GPT as an intelligence forklift" post, 99.9% of training is spent in running optimization of a simple loss function over tons of static data. There is no opportunity for the AI...
I definitely don't have advice for other countries, and there are a lot of very hard problems in my own homeland. I think there could have been an alternate path in which Russia had seen prosperity from opening up to the West, and then going to war or putting someone like Putin in power might have been less attractive. But indeed, the "two countries with McDonald's won't fight each other" theory has been refuted. And as you allude to with China, while so far there hasn't been war with Taiwan, it's not as if economic prosperity is an ironclad guarantee of non a...
I meant “resources” in a more general sense. A piece of land that you believe is rightfully yours is a resource. My own sense (coming from a region that is itself in a long simmering conflict) is that “hurt people hurt people”. The more you feel threatened, the less you are likely to trust the other side.
While of course nationalism and religion play a huge role in the conflict, my sense is that people tend to be more extreme in both the less access they have to resources, education, and security about the future.
Indeed many “longtermists” spend most of their time worrying about risks that they believe (rightly or not) have a large chance of materializing in the next couple of decades.
Talking about tiny probabilities and trillions of people is not needed to justify this, and for many people it's just a turn-off and a red flag that something may be off with your moral intuition. If someone tries to sell me a used car and claims that it's a good deal and will save me $1K, then I listen to them. If someone claims that it would give me infinite utility, then I stop listening.
I don’t presume to tell people what they should care about, and if you feel that thinking of such numbers and probabilities gives you a way to guide your decisions then that’s great.
I would say that, given how much humanity changed in the past and the increasing rate of change, probably almost none of us can realistically predict the impact of our actions more than a couple of decades into the future. (That doesn't mean we don't try - the institution I work for is more than 350 years old and does try to manage its endowment with a view toward the indefinite future...)
There is a meta question here of whether morality is based on personal intuition or on calculations. My own inclination is that utility calculations would only make a difference "at the margin", but the high-level decisions are made by our moral intuition.
That is, we can do calculations to decide whether to fund Charity A or Charity B in similar areas, but I doubt that for most people major moral decisions actually boil down (or should boil down) to calculating utility functions.
But of course, to each their own, and if someone finds math useful for making such decisions, then who am I to tell them not to do it.
I have yet to see an interesting implication of the "no free lunch" theorem. But the world we are moving to seems to be one of general foundation models that can be combined with a variety of tailor-made adapters (e.g., LoRA weights or prompts) that help them tackle any particular application. The general model is the "operating system" and the adapters are the "apps".
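As a concrete (and hedged) sketch of the "operating system / apps" picture, here is roughly what this looks like with the Hugging Face peft library; the model name and adapter paths below are placeholders, not specific artifacts:

```python
# One shared base model ("operating system"), multiple LoRA adapters ("apps").
# The model identifier and adapter paths are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("some-org/general-base-model")
tokenizer = AutoTokenizer.from_pretrained("some-org/general-base-model")

# Attach one tailor-made adapter on top of the shared base weights...
model = PeftModel.from_pretrained(base, "adapters/medical-qa-lora", adapter_name="medical")

# ...then load another and switch between them without touching the base model.
model.load_adapter("adapters/legal-drafting-lora", adapter_name="legal")
model.set_adapter("legal")
```

The point of the analogy is that the expensive, general component is trained once and shared, while the cheap, task-specific adapters can be swapped in and out per application.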
A partial counter-argument: it's hard for me to argue about future AI, but we can look at current "human misalignment" - war, conflict, crime, etc. It seems to me that conflicts in today's world do not arise because we haven't progressed enough in philosophy since the Greeks. Rather, conflicts arise when various individuals and populations (justifiably or not) perceive that they are in zero-sum games for limited resources. The solution for this is not "philosophical progress" as much as being able to move out of the zero-sum setting by fin...
To be clear, I think that embedding human values is part of the solution - see my comment