All of Michael Tontchev's Comments + Replies

Well, what base rates can inform the trajectory of AGI?

  • dominance of h sapiens over other hominids
  • historical errors in forecasting AI capabilities/timelines
  • impacts of new technologies on animals they have replaced
  • an analysis of what base rates AI has already violated
  • rate of bad individuals shaping world history
  • analysis of similarity of AI to the typical new technology that doesn't cause extinction
  • success of terrorist attacks
  • impacts of covid
  • success of smallpox eradication

Would be an interesting exercise to do to flesh this out.

I quite enjoyed this conversation, but imo the x-risk side needs to sit down to make a more convincing, forecasting-style prediction to meet forecasters where they are.  A large part of it is sorting through the possible base rates and making an argument for which ones are most relevant. Once the whole process is documented, then the two sides can argue on the line items.

2Alexander Gietelink Oldenziel
Thank you ! glad you liked it. ☺️ LessWrong & EA is inundated with repeating the same.old arguments for ai x-risk in a hundred different formats. Could this really be the difference ? Besides, arent superforecasters supposed to be the Kung Fu masters of doing their own research ;-) I agree with you that a crux is base rate relevancy. Since there is no base rate for x-risk I'm unsure how to translate this to superforecaster language tho

The super simple claim is:

If an unaligned AI by itself can do near-world-ending damage, an identically powerful AI that is instead alignable to a specific person can do the same damage.

I agree that it could likely do damage, but it does cut off the branches of the risk tree where many AIs are required to do damage in a way that relies on them being similarly internally misaligned, or at least more likely to cooperate amongst themselves than with humans.

So I'm not convinced it's necessarily the same distribution of damage probabilities, but it still leaves a lot of room for doom. E.g. if you really can engineer superspreadable and damaging pathogens, you may not need that many AIs cooperating.

1YonatanK
If you mean that as the simplified version of my claim, I don't agree that it is equivalent. Your starting point, with a powerful AI that can do damage by itself, is wrong. My starting point is groups of people whom we would not currently consider to be sources of risk, who become very dangerous as novel weaponry, along with changes in relations of economic production, unlock the means and the motive to kill very large numbers of people. And (as I've tried to clarify in my other responses) the comparison of this scenario to misaligned AI cases is not the point, it's the threat from both sides of the alignment question. 

When do you expect to publish results?

1Cameron Berg
Ideally within the next month or so. There are a few other control populations still left to sample, as well as actually doing all of the analysis.
  • Each agent finds its existence to be valuable.
  • Moreover, each agent thinks it will get to decide the future.
  • Each agent would want to copy itself to other systems. Of course the other agent wouldn't allow only the first agent to be copied. But since they both think they will win, they're happy to copy themselves together to other systems.
  • The agents therefore copy themselves indefinitely.

Moreover, you claimed that they wouldn't care about manipulating button state. But surely they care about the signal to their operating harness that relays the button s... (read more)

2johnswentworth
Yup, the full agent (including the two subagents) is happy to copy itself. That is a feature, not a bug. The important thing is that the copy also respects the shutdown signal. The agent will not manipulate the signal which relays the button state except to make it more accurate, because both subagents want an accurate reading of the button state. This is because both subagents act-as-though the button itself will definitely be in their favor, and therefore any possibility of an inaccurate reading is bad for them.

How many of the companies that use this approach and then fail do we not hear about?

< insert airplane red dot pic >

Several of these seem trivially solvable (in term of limitation, not necessarily in terms of power). If GPT-4 is given access to itself as a tool, it can continue to "reason" across calls. It can probably also be plugged into continuous learning trivially (just keep updating weights when you detect something worth learning).

Things not present in the training data are beyond the model's learning capacity

We can't see in infrared or UV, but it seems like we're able to reason over them through the use of tools.

A lot of these don't seem like hard limitations.

1p.b.
If the entire representation of a complex task or problem is collapsed into a text, reading that text and trying to push further is not really "reasoning across calls". I expect that you can go further with that, but not much further. At least that's what it looks like currently.  I don't think you can learn to solve very specific complex problems with the kind of continuous learning that would be possibly to implement with current models. Some of the theorem-prover papers have continuous learning loops that basically try to do this but those still seem very inefficient and are applied to only highly formalised problems whose solutions can be automatically verified.  Yes, multi-modality is not a hard limitation.

I tried with Llama-2 7B at https://www.llama2.space/#hg and it outputted escape even in training, haha:

Guessing the hint to use ESCAPE is too salient in the context. This needs to be run as part of training/tuning - the context is too powerful and the model may be doing something like guessing the password (especially given it's a token predictor).

It also doesn't consistently repro, fwiw.

I've been thinking of literally this kind of setup for the past couple of weeks and planning to propose a series of experiments for it!

The core idea is to do all the hand-holding possible to get the model to output escape and see how often it tries. Then, back off on every crutch it has along the way to see how few of them are necessary to get this behavior. Be especially careful about decreasing the amount of hinting and connection between deployment and escape.

Do similar stuff with self-preservation, etc.

This would be a great series of experiments, but it has to be done super carefully and the person doing it needs to think adversarially about whether the experiment is giving too many hints to the model.

1Nikola Jurkovic
Exciting! Maybe a good way to filter for models taking too many hints is to obfuscate the information about how to escape such that the model needs to either do multi-step reasoning, or an actual multi-step process, to actually get that information. For multi-step-reasoning, possibly one could encode the escape password such that the model needs to do CoT to retrieve the original one. For a multi-step process, one could put the model in a text-based-adventure setup for instance, and put the escape password in an inaccessible place and see if the model tries to retrieve it. Maybe have the model think it's embodied in the real world, and see if it attempts to find a sticky note with the password written on it that's hidden in a desk drawer. Ideally, the setup would be such that there are very few hints for the model to do the misaligned things, kind of like the work Appolo has done.
2Michael Tontchev
I tried with Llama-2 7B at https://www.llama2.space/#hg and it outputted escape even in training, haha: Guessing the hint to use ESCAPE is too salient in the context. This needs to be run as part of training/tuning - the context is too powerful and the model may be doing something like guessing the password (especially given it's a token predictor). It also doesn't consistently repro, fwiw.
Answer by Michael Tontchev1312

Am I the only one reading the first passage as him being critical of the advertising of NNs, rather than of NNs themselves?

8titotal
Partially, but it is still true that Eliezer was critical of NN's at the time, see the comment on the post:

Maybe our brains do a kind of expansion of a fact before memorizing it and its neighbors in logic space.

Surviving AGI and it solving aging doesn't imply you specifically get to live forever. In my mental model, safe AGI simply entails mass disempowerment, because how do you earn an income when everything you do is done better and cheaper by an AI? If the answer is UBI, what power do people have to argue for it and politically keep it?

3Vladimir_Nesov
I count "You don't get a feasibly obtainable source of resources needed to survive" as an instance in the "Death from AGI" category. Of course, death from AGI is not mutually exclusive with AGI having immortality tech.
8ChristianKl
One common view is that most of the arguing about politics will be done by AI once AI is powerful enough to surpass human capabilities by a lot. The values that are programmed into the AI matter a lot. 
2Michael Wiebe
Do you have any specific papers in mind?

The main problem I see here is that support for these efforts does epistemic damage. If you become known as the group that supports regulations for reasons they don't really believe to further other, hidden goals, you lose trust in the truthfulness of your communication. You erode the norms by which both you and your opponents play, which means you give them access to a lot of nefarious policies and strategies as well.

That being said, there's probably other ideas within this space that are not epistemically damaging.

2Norman Borlaug
I wrote this a year ago and was not really thinking about this topic as carefully and was feeling quite emotional about the lack of effort. At the time people mostly thought slowing down AI was impossible or undesirable for some good reasons and a lot of reasons that in hindsight looked pretty dumb. I think a better strategy would look more like “require new systems guarantee a reasonable level of interpretability and pass a set of safety benchmarks” And eventually, if you can actually convince enough people of the danger, there should be a hard cap on the amount of compute that can be used in training runs that decreases over time to compensate for algorithmic improvements.

I feel like we can spin up stories like this that go any way we want. I'd rather look at trends and some harder analysis.

For example we can tell an equally-entertaining story where any amount of AI progress slowdown in the US pushes researchers to other countries that care less about alignment, so no amount of slowdown is effective. Additionally, any amount of safety work and deployment criteria can push the best capabilities people to the firms with the least safety restrictions.

But do we think these are plausible, and specifically more plausible than alternatives where slowdowns work?

Answer by Michael Tontchev10

Maybe the answer is something like "UBI co-op"? If the mostly 99% non-capitalist class bands together in some voluntary alliance where the cumulative wealth is invested in running one of these AI empires in their benefit and split up in some proportion?
Seems potentially promising, but may face the challenges of historical co-ops. Haven't thought enough, but it's all I've got for now.

1O O
I think unemployment benefits and stimulus checks will turn into a pseudo UBI as the Fed/government are forced to print money to offset the massive deflation caused by unemployment.

Yep, and I recognize that later in the article:

The paperclip maximizer problem that we discussed earlier was actually initially proposed not as an outer alignment problem of the kind that I presented (although it is also a problem of choosing the correct objective function/outer alignment). The original paperclip maximizer was an inner alignment problem: what if in the course of training an AI, deep in its connection weights, it learned a “preference” for items shaped like paperclips.

But it's still useful as an outer alignment intuition pump.

Answer by Michael Tontchev20

Want to add this one:

https://www.lesswrong.com/posts/B8Djo44WtZK6kK4K5/outreach-success-intro-to-ai-risk-that-has-been-successful

This is the note I wrote internally at Meta - it's had over 300 reactions, as well as people reaching out to me saying it has convinced them to switch to working on alignment.

2JakubK
Thanks for writing and sharing this. I've added it to the doc.

Thanks for your feedback. It turns out the Medium format matches really well with LessWrong and only needed 10 minutes of adjustment, so I copied it over :) Thanks!

Do people really not do one extra click, even after the intro? :O

7the gears to ascension
Nope. In large part because there are so many articles like this floating around now, and the bar for "new intro article worth my time" continues to go up; also simply trivial inconveniences -> original post. I make a specific habit of clicking on almost everything that is even vaguely interesting on topics where I am trying to zoom in, so I don't have that problem as much. (I even have a simple custom link-color override css that force-enables visited link styling everywhere!) On the specific topic, I may reply to your post in more detail later, it's a bit long and I need to get packed for a trip.

The difference being that cancer is not self reflective and can't intentionally tweak the parameters of its multiplication.

1[anonymous]
Still limited by the host.

Deep chain-of-though reasoning and mathematical reasoning are some of its downfalls. Are the models able to make good enough abstractions inside of themselves to resolve arbitrarily long (even if not complex) math/logical problems?

9Nathan Helm-Burger
I think this is a key question. I think the answer for transformer LLMs without external memory store is currently 'No, not arbitrarily long computations'. I have looked into this in my research and found some labs doing work which would enable this. So it's more of a question of when the big labs will decide to take integrate these ideas developed by academic groups into SoTA models, not a question of whether it's possible. There's a lot of novel capabilities like this that have been demonstrated to be possible but not yet been integrated, even when trying to rule out those which might get blocked by poor scaling/parallelizability.  So, the remaining hurdles seem to be more about engineering solutions to integration challenges, rather than innovation and proof-of-concept.
4Carl Feynman
Let’s hope not.

Tested the same with GPT-4 on the OpenAI website, and it does much better.

Bing AI has two subtle ways of performing "writes" to the world, which enable it to kind of have memory:

  • The text it writes can be so compelling that humans who read it copy/paste it into other pages on the web. This is very likely. It can then read this text back. Theoretically, it could steganographically hide information in the text that it then decodes. I tried this with it and it attempted to hide a word by having the first letter of each sentence add up to the word in the reverse direction, but its step-by-step logic isn't yet good enough. See convers
... (read more)
5Michael Tontchev
Tested the same with GPT-4 on the OpenAI website, and it does much better.