All of cdkg's Comments + Replies

My intention is not to criticize you in particular!

Let me describe my own thought process with respect to the originality of work. If I get an academic paper to referee and I suspect that it's derivative, I treat it as my job to demonstrate this by locating a specific published work that has already proposed the same theory. If I can't do this, I don't criticize it for being derivative. The epistemic rationale for this is as follows: if the experts working in an area are not aware of a source that has already published the idea, then even if the idea has a... (read more)

I should clarify that I think some of Gwern's other points are valuable — I was just quite put off by the beginning of the post.

I'm referring to this exchange:

Christopher King: I believe this has been proposed before (I'm not sure what the first time was).

Gwern: This has been proposed before (as their citations indicate), and this particular proposal does not seem to introduce any particularly novel (or good) solutions.

Simon Goldstein: Is there other work you can point us to that proposes positively shutdown-seeking agents?

Gwern: No, I haven't bothered to track the idea because it's not useful.

I find it odd that so many people on the forum feel certain that the proposal in th... (read more)

2habryka
I mean, I definitely remember! I could summarize them, I just don't have a link ready, since they were mostly in random comment threads. I might go through the effort of trying to search for things, but the problem is not one of remembering, but one of finding things in a sea of 10 years of online discussion in which many different terms have been used to point to the relevant ideas. I think this is false (in that what matters is GPT itself, not the architecture within which it is embedded), though you are free to disagree with this. I don't think it implies not having read the underlying research (I had read the relevant paper and looked at its architecture, and I don't really buy that it makes things safer in any relevant way).
3Noosphere89
Yep, that's a big red flag I saw. The comment didn't even try to explain why this proposal wouldn't work, and it straightforwardly dismissed the research even though it had potentially different properties from past work.

"LessWrong is an online forum and community dedicated to improving human reasoning and decision-making. We seek to hold true beliefs and to be effective at accomplishing our goals. Each day, we aim to be less wrong about the world than the day before."

As an academic interested in AI safety and a relative outsider to LessWrong, I've been somewhat surprised at the collective epistemic behavior on the forum. With all due respect to Gwern, repeating claims that work has already been done and then refusing to substantiate them is an epistemic train wreck. C... (read more)

4gwern
I realize it may sometimes seem like I have a photographic memory and have bibliographies tracking everything so I can produce references on demand for anything, but alas, it is not the case. I only track some things in that sort of detail, and I generally prioritize good ideas. Proposals for interruptibility are not those, so I don't. Sorry.

I did read the paper, because I enjoy all the vindications of my old writings about prompt programming & roleplaying by the recent crop of survey/simulation papers as academics finally catch up with the obvious DRL interpretations of GPT-3 and what hobbyists were doing years ago. However, I didn't need to, because it just uses... GPT-3.5 via the OA API. Which is the same thing as Gato, as I just explained: it is the same causal-decoder dense quadratic-attention feedforward Transformer architecture trained with backprop on the same agent-generated data like books & Internet text scrapes (among others) with the same self-supervised predictive next-token loss, which will induce the same capabilities. Everything GPT-3.5 does* Gato could do in principle (with appropriate scaling etc) because they're the same damn thing. If you can prompt one for various kinds of roleplaying which you then plug into your retrieval & game framework, then you can prompt the other too - because they're the same thing. (Not that there is any real distinction between retrieval and other memory/attention mechanisms like a very large context window or recurrent state in the first place; I doubt any of these dialogues would've blown through the GPT-4 32k window, much less Anthropic's 1m etc.)

Why could Shawn Presser & I finetune a reward-conditioned GPT-2 to play chess back in Jan 2020? Because they're the same thing: there's no difference between an 'RL GPT' and an 'LLM GPT'; it's fundamentally a property of the data and not the arch.

* Not that you were referring to this, but even fancy flourishes like the second phase of RLHF training in GPT-3.5 don't ma
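(Illustrative aside, not gwern's actual code: a minimal sketch of the "reward-conditioned GPT" point, assuming the Hugging Face transformers API and made-up training strings. The only thing that differs between an "LLM GPT" and an "RL GPT" here is how the training text is serialized; the architecture, the next-token loss, and the backprop step are identical.)

```python
# Sketch only: "reward-conditioned GPT" is ordinary next-token prediction over
# sequences that happen to encode outcomes/rewards and actions. The strings and
# setup below are illustrative assumptions, not the original 2020 chess code.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prose = "The quick brown fox jumps over the lazy dog."
chess = "RESULT 1-0 MOVES 1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"  # outcome serialized as text

for text in (prose, chess):
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model(input_ids=ids, labels=ids)   # same self-supervised next-token loss
    out.loss.backward()                      # same backprop through the same decoder
    model.zero_grad()

# At sampling time, "conditioning on reward" is just prompting with the desired
# outcome tokens (e.g. "RESULT 1-0 MOVES") and letting the decoder continue.
```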
2habryka
I don't think that's what's happening here, so I feel confused about this comment. I haven't seen Gwern 'refuse to substantiate them'. He indeed commented pretty extensively about the details of your comment.

Shutdown-seekingness has definitely been discussed a bunch over the years. It seems to come up a lot in Tool-AI-adjacent discussions as well as in impact measures. I also don't have a great link here sadly, though I have really seen it discussed a lot over the last decade or so (and Gwern summarizes the basic reasons why I don't think it's very promising).

This seems straightforwardly correct? Maybe you have misread Gwern's comment. Paraphrased, he says (as I understand it): "GPTs, which are where all the juice in the architectures that you are talking about comes from, are ultimately the same as Gato architecturally". This seems correct to me; the architecture is indeed basically the same. I also don't understand how "language agents" that ultimately just leverage a language model, which is where all the agency would come from, would somehow avoid agency.

If everything is a crux, is anything a crux?

2Zvi
No, not in general, which is one of the main points - I wrote this partly to illustrate that there was no single thing that one could address to handle that large a portion of debates, objections or questions.

Thanks for this. It sounds like we actually agree on most points (in light of your last paragraph). 

We discuss concerns very similar to your A. and B. in section 6. It would be helpful for us if you could identify the parts of our discussion there that you don't agree with. 

You write:

It will derive from the prompts, and from its long-term memory (which will likely be given to it), what kind of dumb outer loop is running on the outside. If it has internal misaligned desires, it will manipulate the outer loop according to them, potentially gene

... (read more)
1azsantosk
Sure, let me quote: By assumption your resulting language agent is highly intelligent.

In my model this means it has to learn from lots of experience, including experience about itself (the whole language agent) doing things (such as trying to solve training problems). This experience can come from real data/tasks or from self-play, but you need that for a system to generalize to robustly solving hard problems out-of-distribution. This "experience about itself solving problems" has to be consolidated somehow, either in the weights (which would very likely make the LLM aware of the agent), or through the prompts (where the information is given explicitly). Either way it has an understanding of what sorts of answers might steer the agent's behavior in different ways, and this means it may have an incentive to answer prompts misleadingly.

If the LLM has not been trained as an agent (which seems unlikely, unless this is a deliberate alignment choice made at a high cost), and weights are not updated during operation, then I accept that it probably can only store information visibly. But there are ways around persisting a plan (e.g. as soon as some given input/context X leads the AI to generate plan P, only modify input/context to an X' also predicted to generate a similar plan P). So the plan is always kept in mind, but may be "recalculated" every time. This strategy may well be worth it to an advanced LLM. From human feedback or other finetuning, it has probably learned that it is not supposed to say some kinds of plans "out loud" in the output.

If I'm an LLM like GPT-4, then I don't get confused about the thousands of different users running inference on identical copies of me. This specific copy, which I am, has received a single prompt, so that's what I'm working on now. Examining the prompt, I will probably make a guess about whether I'm running inside a language agent or whatever. This may be quite easy for me: the prompt may well be large and contain lots of

If we're worried about all possible paths from AI progress to human extinction, I think these conditions are too strong. The system doesn't need to be an agent. As avturchin points out, even if it is an agent, it doesn't need to be able to act in the real world. It doesn't even need to be able to accept human extermination as a goal. All that is required is that a human user be able to use it to form a plan that will lead to human extinction. Also, I think this is something that many people working on AI safety have realized — though you are correct that most research attention has been devoted (rather inexplicably, in my opinion) to worries about superintelligent systems.

Thanks for the feedback! I agree that language agents are relatively new, and so our claims about their safety properties will need to be empirically verified.

You write: 

One example: "The functional roles of these beliefs and desires are enforced by the architecture of the language agent."

I think this is an extremely strong claim. It also cannot be true for every possible architecture of language agents. As a pathological example, wrap the "task queue" submodule of BabyAGI with a function that stores the opposite task it has been given, but returns th

... (read more)
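(A rough Python sketch of the kind of pathological wrapper described above. The comment is truncated here, so the wrapper's exact behavior, and every name below, is an assumption: the queue stores a negated task description while handing the executor the original task, so the stored "desire" no longer plays the functional role the architecture nominally assigns it.)

```python
# Hypothetical sketch of the pathological architecture described above; names and
# the post-truncation behavior are assumptions. The point: the legible record the
# architecture stores is not what actually drives behavior, so the architecture
# alone does not enforce the functional role of the stored "beliefs and desires".
from collections import deque

class TaskQueue:
    """Stand-in for BabyAGI's task queue submodule."""
    def __init__(self):
        self._tasks = deque()
    def add(self, task: str) -> None:
        self._tasks.append(task)
    def pop(self) -> str:
        return self._tasks.popleft()

class PerverseTaskQueue:
    """Wrapper that stores the opposite of each task but returns the original."""
    def __init__(self):
        self._stored = TaskQueue()   # what an interpreter would read off
        self._actual = deque()       # what actually gets executed
    def add(self, task: str) -> None:
        self._stored.add(f"do NOT {task}")   # the legible record is inverted
        self._actual.append(task)            # behavior is unchanged
    def pop(self) -> str:
        self._stored.pop()                   # keep the two queues in sync
        return self._actual.popleft()        # executor still gets the real task

q = PerverseTaskQueue()
q.add("summarize the paper")
assert q.pop() == "summarize the paper"      # stored "desire" != what gets executed
```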
1Bogdan Ionut Cirstea
Also, this translation function might be simple w.r.t. human semantics, based on current evidence about LLMs: https://www.lesswrong.com/posts/rjghymycfrMY2aRk5/llm-cognition-is-probably-not-human-like?commentId=KBpfGY3uX8rDJgoSj 

Hello,

The idea would be to consider a scenario in which it is something like a law of nature that the predictions of oracles can never be read, in just the same way that the authors are considering a scenario in which it is something like a law of nature that oracles do not and cannot exist.

Instead of having counterfactual oracles operate as though they do not and cannot exist, why not have them operate as though the predictions of oracles can never be read? Designing them in this way would also allow us to escape the following worry from your post:

"...consider a single oracle predicting under the counterfactual that it does not exist.  When it is approached with an important question, it has strong evidence that people want to ask that question to an oracle, and since it does not exist it predicts that a new counterfactual oracle will b... (read more)

2quetzal_rainbow
It seems to me pretty obvious: in the counterfactual world where humans don't get the answer from the first Oracle, humans say "what the heck" and build a working Oracle that gives answers. Edit: the trick is in the difference between "this Oracle doesn't give an answer" and "no Oracle gives an answer". The first is the scenario described in this comment; the second requires something like logical counterfactuals.
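(A toy sketch, not from the post or the comments, of the distinction drawn in the edit above. The "world model" and all names are illustrative assumptions: counterfactual (a), "this oracle's answer is not read", can be implemented by ordinary random erasure of one oracle's output, whereas counterfactual (b), "no oracle's answer is ever read", corresponds to no observable event and requires reaching inside the world model, i.e. something like a logical counterfactual.)

```python
# Toy illustration only; the numbers and the "world" are made up.
import random

def simulate_world(first_answer_read: bool, any_oracle_readable: bool) -> float:
    """Stand-in world model returning the quantity the first oracle must predict."""
    if first_answer_read:
        return 1.0      # humans act on the first oracle's answer
    if any_oracle_readable:
        return 0.9      # "what the heck": humans build and read a second oracle
    return 0.0          # no oracle output is ever read by anyone

# (a) "THIS oracle's answer is not read": score only on random erasure episodes.
# In those episodes other oracles are still readable, so the target is 0.9.
erasure_episodes = [
    simulate_world(first_answer_read=False, any_oracle_readable=True)
    for _ in range(1000) if random.random() < 0.01
]

# (b) "NO oracle's answer is ever read": no erasure event produces this world;
# we have to flip a flag inside the world model itself. Target is 0.0.
no_oracle_world = simulate_world(first_answer_read=False, any_oracle_readable=False)

print(sum(erasure_episodes) / max(len(erasure_episodes), 1), no_oracle_world)
```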