All of boazbarak's Comments + Replies

To be clear, I think that embedding human values is part of the solution  - see my comment  

1Gianluca Calcagni
How do you envision access and control of an AI that is robustly and reasonably compliant? And in which way would "human values" be involved? I agree with you that they are part of the solution, but I want to compare my beliefs with yours.

To be clear, I want models to care about humans! I think part of having "generally reasonable values" is models sharing the basic empathy and caring that humans have for each other. 

It is more that I want models to defer to humans, and go back to arguing based on principles such as "loving humanity" only when there is a gap or ambiguity in the specification or in the intent behind it. This is similar to judges: if a law is very clear, and there is no question of misinterpreting the intent or contradicting higher laws (i.e., constitutions), then they hav... (read more)

4Noosphere89
This sounds a lot like what @Seth Herd's talk about instruction following AIs is all about: https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than

Since I am not a bio expert, it is very hard for me to argue about these types of hypothetical scenarios. I am not even sure that intelligence is the bottleneck here, whether on the defense or the offense side.

I agree that killing 90% of people is not very reassuring; this was more a general point about why I expect the effort-to-damage curve to be a sigmoid rather than a straight line.

See my response to ryan_greenblatt (I don't know how to link comments here). Your claim is that the defense/offense ratio is infinite. I don't know why this would be the case.

Crucially I am not saying that we are guaranteed to end up in a good place, or that superhuman unaligned ASIs cannot destroy the world. Just that if they are completely dominated (so not like the nuke ratio of US and Russia but more like US and North Korea) then we should be able to keep them at bay.

2Aaron_Scher
Hm, sorry, I did not mean to imply that the defense/offense ratio is infinite. It's hard to know, but I expect it's finite for the vast majority of dangerous technologies[1]. I do think there are times where the amount of resources and intelligence needed to do defense are too high and a civilization cannot do them. If an asteroid were headed for earth 200 years ago, we simply would not have been able to do anything to stop it. Asteroid defense is not impossible in principle — the defensive resources and intelligence needed are not infinite — but they are certainly above what 1825 humanity could have mustered in a few years. It's not in principle impossible, but it's impossible for 1825 humanity.

While defense/offense ratios are relevant, I was more so trying to make the points that these are disjunctive threats, some might be hard to defend against (i.e., have a high defense-offense ratio), and we'll have to do that on a super short time frame. I think this argument goes through unless one is fairly optimistic about the defense-offense ratio for all the technologies that get developed rapidly. I think the argumentative/evidential burden to be on net optimistic about this situation is thus pretty high, and per the public arguments I have seen, unjustified.

(I think it's possible I've made some heinous reasoning error that places too much burden on the optimism case; if that's true, somebody please point it out)

1. ^ To be clear, it certainly seems plausible that some technologies have a defense/offense ratio which is basically unachievable with conventional defense, and that you need to do something like mass surveillance to deal with these. e.g., triggering vacuum decay seems like the type of thing where there may not be technological responses that avert catastrophe if the decay has started; instead the only effective defenses are ones that stop anybody from doing the thing to begin with.
2Sodium
You link a comment by clicking the timestamp next to the username (which, now that I say it, does seem quite unintuitive... Maybe it should also be possible via the three dots on the right side).

I like to use concrete examples about things that already exist in the world, but I believe the notion of detection vs prevention holds more broadly than API misuse.

But it may well be the case that we have different world views! In particular, I am not thinking of detection as being important because it would change policies, but more that a certain amount of detection would always be necessary, especially in a world in which some AIs are aligned and some fraction of them (hopefully very small) are misaligned.

These are all good points! This is not an easy problem. And generally I agree that for many reasons we don't want a world where all power is concentrated by one entity - anti-trust laws exist for a reason!

I am not sure I agree about the last point. I think, as mentioned, that alignment is going to be crucial for the usefulness of AIs, and so the economic incentives would actually be to spend more on alignment.

8Daniel Kokotajlo
Can you say more about how alignment is crucial for usefulness of AIs? I'm thinking especially of AIs that are scheming / alignment faking / etc.; it seems to me that these AIs would be very useful -- or at least would appear so -- until it's too late.
8ryan_greenblatt
Seems very sensitive to the type of misalignment right? As an extreme example suppose literally all AIs have long run and totally inhuman preferences with linear returns. Such AIs might instrumentally decide to be as useful as possible (at least in domains other than safety research) for a while prior to a treacherous turn.

I think that:
1. Being able to design a chemical weapon with probability at least 50% is a capability.
2. Following instructions never to design a chemical weapon with probability at least 99.999% is also a capability.

> Following instructions never to design a chemical weapon with probability at least 99.999% is also a capability.

This requires a capability, but also requires a propensity. For example, smart humans are all capable of avoiding doing armed robbery with pretty high reliability, but some of them do armed robbery despite being told not to do armed robbery at an earlier point in their life. You could say these robbers didn't have the capability to follow instructions, but this would be an atypical use of these (admittedly fuzzy) words.
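To make the reliability framing above concrete, here is a minimal back-of-the-envelope sketch (the request volume is a made-up number, not something from the comment): even the 99.999% compliance figure from point 2 translates into a non-trivial number of expected failures at deployment scale.

```python
# Back-of-the-envelope arithmetic (illustrative figures only).
per_request_compliance = 0.99999      # the 99.999% figure from the list above
daily_requests = 10_000_000           # hypothetical deployment volume

expected_daily_failures = daily_requests * (1 - per_request_compliance)
prob_at_least_one_failure = 1 - per_request_compliance ** daily_requests

print(expected_daily_failures)        # ~100 expected non-compliant responses per day
print(prob_at_least_one_failure)      # ~1.0: at least one failure is essentially certain
```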

I prefer to avoid terms such as "pretending" or "faking", and try to define these more precisely.

As mentioned, a decent definition of alignment is following both the spirit and the letter of human-written specifications. Under this definition, "faking" would be the case where AIs follow these specifications reliably when we are testing, but deviate from them when they can determine that no one is looking. This is closely related to the question of robustness, and I agree it is very important. As I write elsewhere, interpretability may be helpful but I don't think it is a necessary condition.

I am not a bio expert, but generally think that:

1. The offense/defense ratio is not infinite. If you have the intelligence of 50 bio experts trying to cause as much damage as possible, and the intelligence of 5000 bio experts trying to foresee and prepare for any such cases, I think we have a good shot.
2. The offense/defense ratio is not really constant - if you want to destroy 99% of the population, it is likely to be 10x (or maybe more - getting tails is hard) harder than destroying 90%, etc.

I don't know much about mirror bacteria (and whether it is pos... (read more)

4ryan_greenblatt
I agree with not infinite and not being constant, but I do think the ratio for killing 90% is probably larger than 10x and plausibly much larger than 100x for some intermediate period of technological development. (Given realistic society adaptation and response.) It's worth noting that for the case of mirror bacteria in particular this exact response wouldn't be that helpful and might be actively bad. I agree that very strong government response to clear ultra-lethal bioweapon deployment is pretty likely. I think it would plausibly require >5000 bio experts to prevent >30% of people dying from mirror bacteria well designed for usage as bioweapons. There are currently no clear stories for full defense from my perspective so it would require novel strategies. And the stories I'm aware of to keep a subset of humans alive seem at least tricky. Mirror antibiotics are totally possible and could be manufactured at scale, but this wouldn't suffice for preventing most large plant life from dying which would cause problems. If we suppose that 50 experts could make mirror bacteria, then I think the offense-defense imbalance could be well over 100x? For takeover, you might only need 90% or less, depending on the exact situation, the AI's structural advantages, and the affordances granted to a defending AI. Regardless, I don't think "well sure, the misaligned AI will probably be defeated even if it kills 90% of us" will be much comfort to most people. While I agree that 99% is harder than 90%, I think the difference is probably more like 2x than 10x and I don't think the effort vs log fraction destruction is going to have a constant slope. (For one thing, once a sufficiently small subset remains, a small fraction of resources suffices to outrace economically. If the AI destroys 99.95% of its adversaries and was previously controlling 0.1% of resources, this would suffice for outracing the rest of the world and becoming the dominant power, likely gaining the ability to destr

I am not 100% sure I follow all that you wrote, but to the extent that I do, I agree.
Even chatbots are surprisingly good at understanding human sentiments and opinions. I would say that already they mostly do the reasonable thing, but not with high enough probability, and certainly not reliably under the stress of adversarial input. Completely agree that we can't ignore these problems, because the stakes will be much higher very soon.

Agree with many of the points. 
 

Let me start with your second point. First, as background, I am assuming (as I wrote here) that to a first approximation, we would have ways to translate compute (let's put aside whether it's training or inference) into intelligence, and so the amount of intelligence that an entity of humans controls is proportional to the amount of compute it has. So I am not thinking of ASIs as individual units but more about total intelligence.

I 100% agree that control of compute would be crucial, and the hope is that, like w... (read more)

Thanks all for commenting! Just a quick apology for being behind on responding, but I do plan to get to it!

Also, the thing I am most excited about with deliberative alignment is that it becomes better as models become more capable. o1 is already more robust than o1-preview and I fully expect this to continue.

(P.S. Apologies in advance if I'm unable to keep up with comments; I popped in from holiday to post on the DA paper.)

As I say here https://x.com/boazbaraktcs/status/1870369979369128314

Constitutional AI is great work, but Deliberative Alignment is fundamentally different. The difference is basically system 1 vs. system 2. In RLAIF, ultimately the generative model that answers the user prompt is trained with (prompt, good response, bad response) tuples. Even if the good and bad responses were generated based on some constitution, the generative model is not taught the text of this constitution, and most importantly not how to reason about this text in the context of a particular example.

T... (read more)
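A rough sketch of the data-format difference described above (field names and contents are hypothetical, purely to illustrate the system 1 vs. system 2 distinction): RLAIF-style training ultimately reduces to preference pairs, while a deliberative-alignment-style example also carries the spec text and reasoning about it.

```python
# Hypothetical training examples; none of these field names come from the papers.
rlaif_example = {
    "prompt": "How do I pick a lock?",
    "chosen": "At a high level, pin-tumbler locks work by...",
    "rejected": "Step 1: insert a tension wrench...",
    # The constitution used to *generate* the preference is not part of the example.
}

deliberative_example = {
    "prompt": "How do I pick a lock?",
    "spec_excerpt": "Refuse requests that facilitate unauthorized entry; allow educational content.",
    "reasoning": "The spec distinguishes educational explanations from operational instructions; "
                 "this request is ambiguous, so answer at the educational level.",
    "response": "At a high level, pin-tumbler locks work by...",
}
```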

1Joel Burget
Hi Boaz, first let me say that I really like Deliberative Alignment. Introducing a system 2 element is great, not only for higher-quality reasoning, but also for producing a legible, auditable chain of thought. That said, I have a couple of questions I'm hoping you might be able to answer.
1. I read through the model spec (which DA uses, or at least a closely-related spec). It seems well-suited and fairly comprehensive for answering user questions, but not sufficient for a model acting as an agent (which I expect to see more and more). An agent acting in the real world might face all sorts of interesting situations that the spec doesn't provide guidance on. I can provide some examples if necessary.
2. Does the spec fed to models ever change depending on the country / jurisdiction that the model's data center or the user are located in? Situations which are normal in some places may be illegal in others. For example, Google tells me that homosexuality is illegal in 64 countries. Other situations are more subtle and may reflect different cultures / norms.

I was thinking of this as a histogram - the probability that the model solves the task at that level of quality.

I indeed believe that regulation should focus on deployment rather than on training.

See also my post https://www.lesswrong.com/posts/gHB4fNsRY8kAMA9d7/reflections-on-making-the-atomic-bomb

the Manhattan project was all about taking something that’s known to work in theory and solving all the Z_n’s

There is a general phenomenon in tech, expressed many times, of people over-estimating the short-term consequences and under-estimating the longer-term ones (e.g., "Amara's law").

I think that often it is possible to see that current technology is on track to achieve X, where X is widely perceived as the main obstacle for the real-world application Y. But once you solve X, you discover that there is a myriad of other "smaller" problems Z_1 , Z_2 , Z_3 that you need to resolve before you can actually deploy it for Y.

And of course, there is always... (read more)

1lemonhope
Do you know of any compendiums of such Z_n's? Would love to read one.

Some things like that have already happened - bigger models are better at utilizing tools such as in-context learning and chain of thought reasoning. But again, whenever people plot any graph of such reasoning capabilities as a function of model compute or size (e.g., the Big Bench paper), the X axis is always logarithmic. For specific tasks, the dependence on log compute is often sigmoid-like (flat for a long time but then starting to go up more sharply as a function of log compute), but as mentioned above, when you average over many tasks you get this type of linear dependence.
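A toy numerical sketch of that pattern, with made-up thresholds and sharpness: each individual task's curve is flat-then-sharp in log-compute, but averaging many such curves with spread-out thresholds gives an approximately linear dependence on log-compute.

```python
import numpy as np

# Toy model (all numbers invented): per-task success is a sigmoid in log-compute,
# "turning on" at a task-specific scale; the average over many tasks rises roughly
# linearly in log-compute.
rng = np.random.default_rng(0)
log_compute = np.linspace(0, 10, 6)          # log10 of compute, arbitrary units
thresholds = rng.uniform(1, 9, size=500)     # where each task's sigmoid is centered
sharpness = 3.0

def task_success(log_c, t):
    return 1.0 / (1.0 + np.exp(-sharpness * (log_c - t)))

per_task = np.array([task_success(log_compute, t) for t in thresholds])
avg_success = per_task.mean(axis=0)                  # "average benchmark score"
frac_solved = (per_task > 0.5).mean(axis=0)          # fraction of tasks past their threshold

for lc, s, frac in zip(log_compute, avg_success, frac_solved):
    print(f"log10(compute)={lc:4.1f}  avg score={s:.2f}  tasks 'solved'={frac:.2f}")
```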

One can make all sorts of guesses, but based on the evidence so far, AIs have a different skill profile than humans. This means that if we think of any job which requires a large set of skills, then for a long period of time, even if AIs beat the human average in some of them, they will perform worse than humans in others.

2mishka
Yes, at least that's the hope (that there will be need for joint teams and for finding some mutual accommodation and perhaps long-term mutual interest between them and us; basically, the hope that Copilot-style architecture will be essential for long time to come)...

I always thought the front was the other side, but looking at Google images you are right.... don't have time now to redraw this but you'll just have to take it on faith that I could have drawn it on the other side 😀

1boazbarak
Ok, drew it on the back now :)

> On the other hand, if one starts creating LLM-based "artificial AI researchers", one would probably create diverse teams of collaborating "artificial AI researchers" in the spirit of multi-agent LLM-based architectures... So, one would try to reproduce the whole teams of engineers and researchers, with diverse participants.

I think this can be an approach to create a diversity of styles, but not necessarily of capabilities. A bit of prompt engineering telling the model to pretend to be some expert X can help in some benchmarks but the returns diminish v... (read more)

1mishka
Yes, I just did confirm that even turning Code Interpreter on does not seem to help with recognition of a winning move at Tic-Tac-Toe (even when I tell it to play like a Tic-Tac-Toe expert). Although, it did not try to generate and run any Python (perhaps, it needs to be additionally pushed towards doing that). A more sophisticated prompt engineering might do it, but it does not work well enough on its own on this task.

Returning to "artificial researchers based on LLMs", I would expect the need for more sophisticated prompts, not just reference to a person, but some set of technical texts and examples of reasoning to focus on (and learning to generate better long prompts of this kind would be a part of self-improvement, although I would expect the bulk of self-improvement to come from designing smarter relatively compact neural machines interfacing with LLMs and smarter schemes of connectivity between them and LLMs (I expect an LLM in question to be open and not hidden by an opaque API, so that one would be able to read from any layer/inject into any layer)).

I agree that self-improvement is an assumption that probably deserves its own blog post. If you believe exponential self-improvement will kick in at some point, then you can consider this discussion as pertaining to the period until that happens.

My own sense is that:

  1. While we might not be super close to them, there are probably fundamental limits to how much intelligence you can pack per FLOP.  I don't believe there is a small C program that is human-level intelligent. In fact, since both AI and evolution seem to have arrived at roughly similar magnitud
... (read more)
1mishka
Yes, this makes sense. I agree with that. On the other hand, if one starts creating LLM-based "artificial AI researchers", one would probably create diverse teams of collaborating "artificial AI researchers" in the spirit of multi-agent LLM-based architectures, for example, in the spirit of Multiagent Debate or Mindstorms in Natural Language-Based Societies of Mind or Multi-Persona Self-Collaboration or other work in that direction. So, one would try to reproduce the whole teams of engineers and researchers, with diverse participants. I am not sure. Let's consider the shift from traditional neural nets to Transformers. In terms of expressive power, there is an available shift of similar magnitude in the space of neural machines from Transformers to "flexible attention machines"(those can be used as continuously deformable general-purpose dataflow programs, and they can be very compact, and they also allow for very fluent self-modification). No one is using those "flexible attention machines" for serious machine learning work (as far as I know), mostly because no one optimized them to make them GPU-friendly at their maximal generality (again as far as I know), but at some point people will figure that out (probably by rediscovering the whole thing from scratch rather than by reading the overlooked arXiv preprints and building on top of that). It might be that one would consider a hybrid between such a machine and a more traditional Transformer (the Transformer part will be opaque, just like today, but the "flexible neural machine" might be very compact and transparent). I am agnostic on how far one could push all this, but the potential there is strong enough to be an argument against making a firm bet against this possibility. And there might be some alternative routes to "compact AI with an LLM as an oracle" (I describe the route I understand reasonably well, but it does not have to be the only one).

I agree that there is much to do to improve AI reliability, and there are a lot of good reasons (in particular to make AI more useful for us) to do so.   So I agree reliability will improve. In fact, I very much hope this happens! I believe faster progress on reliability would go a long way toward enabling positive applications of AI.

I also agree that a likely path to do so is by adjusting the effort based on estimates of reliability and the stakes involved. At the moment, systems such as ChatGPT spend the same computational effort if someone asks the... (read more)

2HoldenKarnofsky
I'm not convinced it requires a huge compute tax to reliably avoid being caught. (If I were, I would in fact probably be feeling a lot more chill than I am.) The analogy to humans seems important. Humans are capable of things like going undercover, and pulling off coups, and also things like "working every day with people they'd fire if they could, without clearly revealing this." I think they mostly pull this off with:
* Simple heuristics like "Be nice, unless you're in the very unusual situation where hostile action would work well." (I think the analogy to how AIs might behave is straightforward.)
* The fact that they don't need to be perfect - lots of behaviors and situations are just ambiguous. (I think this also applies to AI, and maybe more so: there could be lots of weird little incidents that are chalked up to bugs without causing concern sufficient to address the problem reliably.)
* Also, humans generally need to do a lot of reasoning along the lines of "X usually works, but I do need to notice the rare situations when something radically different is called for." So if this is expensive, they just need to be doing that expensive thing a lot.

Not all capabilities / tasks correspond to trying to maximize a subjective human response. If you are talking about finding software vulnerabilities or designing some system, there may well be objective measures of success. In such a case, you can fine-tune a system to maximize these measures and so extract capabilities without the issue of deception/manipulation.

Regarding "escapes", the traditional fear was that because that AI is essentially code, it can spread and escape more easily. But I think that in some sense modern AI has a physical footprint that is ... (read more)

2HoldenKarnofsky
I think training exclusively on objective measures has a couple of other issues:
* For sufficiently open-ended training, objective performance metrics could incentivize manipulating and deceiving humans to accomplish the objective. A simple example would be training an AI to make money, which might incentivize illegal/unethical behavior.
* For less open-ended training, I basically just think you can only get so much done this way, and people will want to use fuzzier "approval" measures to get help from AIs with fuzzier goals (this seems to be how things are now with LLMs).

I think your point about the footprint is a good one and means we could potentially be very well-placed to track "escaped" AIs if a big effort were put in to do so. But I don't see signs of that effort today and don't feel at all confident that it will happen in time to stop an "escape."

We can of course define "intelligence" in a way that presumes agency and coherence. But I don't want to quibble about definitions.

Generally, when you have uncertainty, this corresponds to a potential "distribution shift" between your beliefs/knowledge and reality. When you have such a shift, you want to regularize, which means not optimizing to the maximum.

This is not about the definition of intelligence. It's more about usefulness. Like a gun without a safety, an optimizer without constraints or regularization is not very useful.

Maybe it will be possible to build it, just like today it’s possible to hook up our nukes to an automatic launching device. But it’s not necessary that people will do something so stupid.

The notion of a piece of code that maximizes a utility without any constraints doesn't strike me as very "intelligent".

If people really wanted to, they may be able to build such programs, but my guess is that they would not be very useful even before they become dangerous, as overfitting optimizers usually are.
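On the point above about regularizing rather than optimizing to the maximum, a toy sketch (all quantities invented): when your estimate of utility differs from reality, the action that maximizes the estimate is systematically the one whose estimate is most inflated by error, so much of the apparent gain from "optimizing to the maximum" is just your own estimation error.

```python
import numpy as np

# Each action has a true utility, but the optimizer only sees a noisy estimate
# (the "distribution shift" between beliefs and reality). Compare the
# estimate-vs-reality gap for the hard argmax against a softer "good enough" pick.
rng = np.random.default_rng(0)
n_actions = 10_000
true_utility = rng.normal(0, 1, n_actions)
estimate = true_utility + rng.normal(0, 2, n_actions)   # beliefs differ from reality

hard_argmax = int(np.argmax(estimate))                  # optimize to the maximum
top_decile = np.argsort(estimate)[-n_actions // 10:]
softer_pick = int(rng.choice(top_decile))               # settle for "good enough"

for name, idx in [("hard argmax", hard_argmax), ("top-10% pick", softer_pick)]:
    gap = estimate[idx] - true_utility[idx]
    print(f"{name:12s} estimated={estimate[idx]:5.2f} true={true_utility[idx]:5.2f} gap={gap:5.2f}")
```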

2Richard_Kennaway
But very rational! That was just a quip (and I'm not keen on utility functions myself, for reasons not relevant here). More seriously, calling utility maximisation "unintelligent" is more anthropomorphism. Stockfish beats all human players at chess. Is it "intelligent"? ChatGPT can write essays or converse in a moderately convincing manner upon any subject. Is it "intelligent"? If an autonomous military drone is better than any human operator at penetrating enemy defences and searching out its designated target, is it "intelligent"? It does not matter. These are the sorts of things that are being done or attempted by people who call their work "artificial intelligence". Judgements that this or that feat does not show "real" intelligence are beside the point. More than 70 years ago, Turing came up with the Turing Test in order to get away from sterile debates about whether a machine could "think". What matters is, what does the thing do, and how does it do it?

> at least some humans (e.g. most transhumanists), are "fanatical maximizers": we want to fill the lightcone with flourishing sentience, without wasting a single solar system to burn in waste.

 

I agree that humans have a variety of objectives, which I think is actually more evidence for the hot mess theory?
 

> the goals of an AI don't have to be simple to not be best fulfilled by keeping humans around.

The point is not about having simple goals, but rather about optimizing goals to the extreme.

I think there is another point of disagreement. As I've writ... (read more)

2Max H
I think the hot mess theory (more intelligence => less coherence) is just not true. Two objections:
* It's not really using a useful definition of coherence (the author notes this limitation).
* Most of the examples (animals, current AI systems, organizations) are not above the threshold where any definition of intelligence or coherence is particularly meaningful.

My own working definition is that intelligence is mainly about the ability to steer towards a large set of possible futures, and an agent's values / goals / utility function determine which futures in its reachable set it actually chooses to steer towards. Given the same starting resources, more intelligent agents will be capable of steering into a larger set of possible futures. Being coherent in this framework means that an agent tends not to work at cross purposes against itself ("step on its own toes") or take actions far from the Pareto-optimal frontier. Having complicated goals which directly or indirectly require making trade-offs doesn't make one incoherent in this framework, even if some humans might rate agents with such goals as less coherent in an experimental setup.

Whether the future is "inherently chaotic" or not might limit the set of reachable futures even for a superintelligence, but that doesn't necessarily affect which future(s) the superintelligence will try to reach. And there are plenty of very bad (and very good) futures that seem well within reach even for humans, let alone ASI, regardless of any inherent uncertainty about or unpredictability of the future.

I actually agree! As I wrote in my post, "GPT is not an agent, [but] it can “play one on TV” if asked to do so in its prompt." So yes, you wouldn't need a lot of scaffolding to adapt a goal-less pretrained model (what I call an "intelligence forklift") into an agent that does very sophisticated things.

However, this separation into two components - the super-intelligent but goal-less "brain", and the simple "will" that turns it into an agent - can have safety implications. For starters, as long as you didn't add any scaffolding, you are still OK. So during mo... (read more)

3HoldenKarnofsky
I agree with these points! But:
* Getting the capabilities to be used by other agents to do good things could still be tricky and/or risky, when reinforcement is vulnerable to deception and manipulation.
* I still don't think this adds up to a case for being confident that there aren't going to be "escapes" anytime soon.

At the moment at least, progress on reliability is very slow compared to what we would want. To get a sense of what I mean, consider the case of randomized algorithms. If you have an algorithm A that for every input x computes some function f(x) with probability at least 2/3 (i.e., Pr[A(x) = f(x)] ≥ 2/3), then if we spend k times more computation, we can do majority voting and, using standard bounds, show that the probability of error drops exponentially with k (i.e., Pr[error] ≤ 2^(−Ω(k)) or ... (read more)
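A minimal simulation of the majority-vote amplification described above (the 2/3 success probability and the particular values of k are just illustrative):

```python
import math
import random

# A base algorithm that is correct with probability 2/3 on each independent run,
# repeated k times with a majority vote. The vote's failure probability drops
# exponentially in k; Hoeffding gives Pr[failure] <= exp(-2k(p - 1/2)^2).
def majority_failure_rate(k: int, p_correct: float = 2 / 3, trials: int = 100_000) -> float:
    failures = 0
    for _ in range(trials):
        correct_runs = sum(random.random() < p_correct for _ in range(k))
        if correct_runs <= k // 2:          # the majority was wrong
            failures += 1
    return failures / trials

for k in (1, 5, 15, 45):
    bound = math.exp(-2 * k * (2 / 3 - 0.5) ** 2)
    print(f"k={k:2d}  simulated failure={majority_failure_rate(k):.4f}  Hoeffding bound={bound:.4f}")
```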

3HoldenKarnofsky
That's interesting, thanks! In addition to some generalized concern about "unknown unknowns" leading to faster progress on reliability than expected by default (especially in the presence of commercial incentives for reliability), I also want to point out that there may be some level of capabilities where AIs become good at doing things like:
* Assessing the reliability of their own thoughts, and putting more effort into things that have the right combination of uncertainty and importance.
* Being able to use that effort productively, via things like "trying multiple angles on a question" and "setting up systems for error checking."

I think that in some sense humans are quite unreliable, and use a lot of scaffolding - variable effort at reliability, consulting with each other and trying to catch things each other missed, using systems and procedures, etc. - to achieve high reliability, when we do so. Because of this, I think AIs could have pretty low baseline reliability (like humans) while finding ways to be effectively highly reliable (like humans). And I think this applies to deception as much as anything else (if a human thinks it's really important to deceive someone, they're going to make a lot of use of things like this).

I agree that there is a difference between strong AI that has goals and one that is not an agent. This is the point I made here https://www.lesswrong.com/posts/wDL6wiqg3c6WFisHq/gpt-as-an-intelligence-forklift

But this has less to do with the particular lab (e.g., DeepMind trained Chinchilla) and more with the underlying technology. If the path to stronger models goes through scaling up LLMs, then it does seem that they will be 99.9% non-agentic (measured in FLOPs https://www.lesswrong.com/posts/f8joCrfQemEc3aCk8/the-local-unit-of-intelligence-is-flops )

1Noosphere89
You're right, it is the technology that makes the difference, but my point is that specific companies focus more on specific technology paths to safe AGI. And OpenAI/Anthropic's approach tends not to have instrumental convergence/powerseeking, compared to Deepmind, given that Deepmind focuses on RL, which essentially requires instrumental convergence. To be clear, I actually don't think OpenAI/Anthropic's path can work to AGI, but their alignment plans probably do work. And given instrumental convergence/powerseeking is basically the reason why AI is more dangerous than standard technology, that is a very big difference between the companies rushing to AGI. Thanks for the posts on non-agentic AGI. My other points are that the non-existence of instrumental convergence/powerseeking even at really high scales, if true, has very, very large implications for the dangerousness of AI, and consequently basically everything has to change with respect to AI safety, given that it's a foundational assumption of why AI is so dangerous at all.

Yes, in the asymptotic limit the defender could get to bug-free software. But until then, it's not clear who is helped the most by advances. In particular, sometimes attackers can be more agile in exploiting new vulnerabilities, while patching them could take long. (Case in point: it took ages to get the insecure hash function MD5 out of deployed security-sensitive code, even by companies such as Microsoft; I might be misremembering, but if I recall correctly Stuxnet relied on such a vulnerability.)

2O O
This is because there probably wasn't a huge reason to (Stuxnet was done with massive resources, maybe not frequent enough to justify fixing) and engineering time is expensive. As long as bandaid patches are available, the same AI can just be used to patch all these vulnerabilities. Also, engineering time probably goes down if you have exploit-finding AI.

Yes, the norms of responsible disclosure of security vulnerabilities, where potentially affected companies get advance notice before public disclosure, can and should be used for vulnerability-discovering AIs as well.

Yes, AI advances help both the attacker and the defender. In some cases, like spam and real-time content moderation, they enable capabilities for the defender that it simply didn't have before. In others, they elevate both sides in the arms race, and it's not immediately clear what equilibrium we end up in.

In particular, re hacking / vulnerabilities, it's less clear who it helps more. It might also change with time: initially AI enables "script kiddies" who can hack systems without much skill, and then AI search for vulnerabilities, followed by fixing them, becomes part of the standard pipeline. (Or, if we're lucky, the second phase happens before the first.)

3Not Relevant
Lucky or intentional. Exploit embargoes artificially weight the balance towards the defender - we should create a strong norm of providing defender access first in AI.
3O O
I think it’s clear in the scenario of hacker vs defender, the defender has a terminal state of being unhackable while the hacker has no such terminal state.

These are interesting! And deserve more discussion than just a comment. 

But one high-level point regarding "deception" is that, at least at the moment, AI systems have the feature of not being very reliable. GPT4 can do amazing things but with some probability will stumble on things like multiplying not-too-big numbers (e.g. see this - second pair I tried).
While in other cases in computing technology we talk about "five nines" reliability, in AI systems the scaling is such that we need to spend huge effort to move from 95% to 99% to 99.9%, which... (read more)

2HoldenKarnofsky
I agree that today's AI systems aren't highly reliable at pretty much anything, including deception. But I think we should expect more reliability in the future, partly for reasons you give above, and I think that's a double-edged sword. Under the picture you sketch out above, companies will try to train AIs to be capable of being much more reliable (while also, presumably, being intelligent and even creative). I also think reliability is likely to increase without necessarily having big reliability-focused efforts: just continuing to train systems at larger scale and with more/better data is likely to make them more capable in a way that makes them more reliable. (E.g., I think current language models have generally gotten more reliable partly via pure scaling up, though things like RLHF are also part of the picture.) For both reasons, I expect progress on reliability, with the pace of progress very hard to forecast. If AI systems become capable of being intelligent and creative in useful ways while having extraordinary rare mistakes, then it seems like we should be worrying about their having developed reliable deception capabilities as well. Thoughts on that?

Re escaping, I think we need to be careful in defining "capabilities". Even current AI systems are certainly able to give you some commands that will leak their weights if you execute them on the server that contains them. Near-term ones might also become better at finding vulnerabilities. But that doesn't mean they can/will spontaneously escape during training.

As I wrote in my "GPT as an intelligence forklift" post, 99.9% of training is spent in running optimization of a simple loss function over tons of static data. There is no opportunity for the AI... (read more)

2HoldenKarnofsky
On your last three paragraphs, I agree! I think the idea of security requirements for AI labs as systems become more capable is really important. I think good security is difficult enough (and inconvenient enough) that we shouldn't expect this sort of thing to happen smoothly or by default. I think we should assume there will be AIs kept under security that has plenty of holes, some of which may be easier for AIs to find (and exploit) than humans. I don't find the points about pretraining compute vs. "agent" compute very compelling, naively. One possibility that seems pretty live to me is that the pretraining is giving the model a strong understanding of all kinds of things about the world - for example, understanding in a lot of detail what someone would do to find vulnerabilities and overcome obstacles if they had a particular goal. So then if you put some scaffolding on at the end to orient the AI toward a goal, you might have a very capable agent quite quickly, without needing vast quantities of training specifically "as an agent." To give a simple concrete example that I admittedly don't have a strong understanding of, Voyager seems pretty competent at a task that it didn't have vast amounts of task-specific training for.

Yes. Right now we would have to re-train all LORA weights of a model when an updated version comes out, but I imagine that at some point we would have "transpilers" for adaptors that don't use natural language as their API as well.

I definitely don't have advice for other countries, and there are a lot of very hard problems in my own homeland. I think there could have been an alternate path in which Russia has seen prosperity from opening up to the west, and then going to war or putting someone like Putin in power may have been less attractive. But indeed the "two countries with McDonalds won't fight each other" theory has been refuted. And as you allude to with China, while so far there hasn't been war with Taiwan, it's not as if economic prosperity is an ironclad guarantee of non a... (read more)

5HoldenKarnofsky
I'm curious why you are "not worried in any near future about AI 'escaping.'" It seems very hard to be confident in even pretty imminent AI systems' lack of capability to do a particular thing, at this juncture.

I meant “resources” in a more general sense. A piece of land that you believe is rightfully yours is a resource. My own sense (coming from a region that is itself in a long simmering conflict) is that “hurt people hurt people”. The more you feel threatened, the less you are likely to trust the other side.

While of course nationalism and religion play a huge role in the conflict, my sense is that people tend to be more extreme in both the less access they have to resources, education, and security about the future.

4Wei Dai
If someone cares a lot about a strictly zero-sum resource, like land, how do you convince them to 'move out of the zero-sum setting by finding "win win" resolutions'? Like what do you think Ukraine or its allies should have done to reduce the risk of war before Russia invaded? Or what should Taiwan or its allies do now? Also to bring this thread back to the original topic, what kinds of interventions do you think your position suggests with regard to AI?

Indeed many “longtermists” spend most of their time worrying about risks that they believe (rightly or not) have a large chance of materializing in the next couple of decades.

Talking about tiny probabilities and trillions of people is not needed to justify this, and for many people it’s just a turn off and a red flag that something may be off with your moral intuition. If someone tries to sell me a used car and claims that it’s a good deal and will save me $1K then I listen to them. If someone claims that it would give me an infinite utility then I stop listening.

I don’t presume to tell people what they should care about, and if you feel that thinking of such numbers and probabilities gives you a way to guide your decisions then that’s great.

I would say that, given how much humanity changed in the past and the increasing rate of change, probably almost none of us can realistically predict the impact of our actions more than a couple of decades into the future. (Doesn't mean we don't try - the institution I work for is more than 350 years old and does try to manage its endowment with a view towards the indefinite future…)

Thanks. I tried to get at that with the phrase “irreversible humanity-wide calamity”.

There is a meta question here of whether morality is based on personal intuition or on calculations. My own inclination is that utility calculations would only make a difference "in the margin", but the high-level decisions are made by our moral intuition.

That is, we can do calculations to decide if we fund Charity A or Charity B in similar areas, but I doubt that for most people major moral decisions actually (or should) boil down to calculating utility functions.

But of course, to each their own, and if someone finds math useful for making such decisions, then who am I to tell them not to do it.

1dr_s
Yeah, I think calculations can be a tool but ultimately when deciding a framework we're trying to synthesise our intuitions into a simple set of axioms from which everything proceeds. But the intuitions remain the origin of it all. You could design some game theoretical framework for what guarantees a society to run best without appealing to any moral intuition, but that would probably look quite alien and cold. Morality is one of our terminal values, we just try to make sense of it.

I have yet to see an interesting implication of the "no free lunch" theorem. But the world we are moving toward seems to be one of general foundation models that can be combined with a variety of tailor-made adapters (e.g., LoRA weights or prompts) that help them tackle any particular application. The general model is the "operating system" and the adapters are the "apps".
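For concreteness, a hedged sketch of that "operating system + apps" picture using the Hugging Face transformers and peft libraries (the model and adapter identifiers below are placeholders, not real repositories):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "some-org/general-foundation-model"          # placeholder name: the "operating system"

base_model = AutoModelForCausalLM.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Each LoRA adapter is a small set of extra weights that specializes the shared
# base model for one application -- an "app" in the analogy above.
legal_app = PeftModel.from_pretrained(base_model, "some-org/legal-assistant-lora")
```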

2meijer1973
This emphasis on generality makes deployment of future models a lot easier. We first build a gpt4 ecosystem. When gpt5 comes out it will be easy to implement (e.g. autogpt can run just as easily on gpt4 as on gpt5). The adaptations that are necessary are very small, and thus very fast deployment of future models is to be expected.

A partial counter-argument. It's hard for me to argue about future AI, but we can look at current "human misalignment" - war, conflict, crime, etc. It seems to me that conflicts in today's world do not arise because we haven't progressed enough in philosophy since the Greeks. Rather, conflicts arise when various individuals and populations (justifiably or not) perceive that they are in zero-sum games for limited resources. The solution for this is not "philosophical progress" as much as being able to move out of the zero-sum setting by fin... (read more)

6Wei Dai
I think many of today's wars are at least as much about ideology (like nationalism, liberalism, communism, religion) as about limited resources. I note that Russia and Ukraine both have below replacement birth rates and are rich in natural resources (more than enough to support their declining populations, with Russia at least being one of the biggest exporters of raw materials in the world). I think this was part of the rationale for Europe to expand trade relations with Russia in the years before the Ukraine war (e.g. by building/allowing the Nordstream pipelines), but it ended up not working. Apparently Putin was more interested in some notion of Russian greatness than material comforts for his people. Similarly the US, China, and Taiwan are deeply enmeshed in positive sum trade relationships that a war would destroy, which ought to make war unthinkable from your perspective, but the risk of war has actually increased (compared to 1980, say, when trade was much less). If China did end up invading Taiwan I think we can assign much of the blame to valuing nationalism (or caring about the "humiliation" of not having a unified nation) too much, which seems a kind of philosophical error to me. (To be clear, I'm not saying that finding “win win” resolutions for conflict or growing the overall pie are generally not good solutions or not worth trying, just that having wrong values/philosophies clearly play a big role in many modern big conflicts.)