I'm referring to this exchange:
Christopher King: I believe this has been proposed before (I'm not sure what the first time was).
Gwern: This has been proposed before (as their citations indicate), and this particular proposal does not seem to introduce any particularly novel (or good) solutions.
Simon Goldstein: Is there other work you can point us to that proposes positively shutdown-seeking agents?
Gwern: No, I haven't bothered to track the idea because it's not useful.
I find it odd that so many people on the forum feel certain that the proposal in th...
"LessWrong is an online forum and community dedicated to improving human reasoning and decision-making. We seek to hold true beliefs and to be effective at accomplishing our goals. Each day, we aim to be less wrong about the world than the day before."
As an academic interested in AI safety and a relative outsider to LessWrong, I've been somewhat surprised at the collective epistemic behavior on the forum. With all due respect to Gwern, repeating claims that work has already been done and then refusing to substantiate them is an epistemic train wreck. C...
Thanks for this. It sounds like we actually agree on most points (in light of your last paragraph).
We discuss concerns very similar to your A. and B. in section 6. It would be helpful for us if you could identify the parts of our discussion there that you don't agree with.
You write:
...It will derive from the prompts, and from its long-term memory (which will likely be given to it), what kind of dumb outer loop is running on the outside. If it has internal misaligned desires, it will manipulate the outer loop according to them, potentially gene
If we're worried about all possible paths from AI progress to human extinction, I think these conditions are too strong. The system doesn't need to be an agent. As avturchin points out, even if it is an agent, it doesn't need to be able to act in the real world. It doesn't even need to be able to accept human extermination as a goal. All that is required is that a human user be able to use it to form a plan that will lead to human extinction. Also, I think this is something that many people working on AI safety have realized — though you are correct that most research attention has been devoted (rather inexplicably, in my opinion) to worries about superintelligent systems.
Thanks for the feedback! I agree that language agents are relatively new, and so our claims about their safety properties will need to be empirically verified.
You write:
...One example: "The functional roles of these beliefs and desires are enforced by the architecture of the language agent."
I think this is an extremely strong claim. It also cannot be true for every possible architecture of language agents. As a pathological example, wrap the "task queue" submodule of BabyAGI with a function that stores the opposite task it has been given, but returns th
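For concreteness, here is one way the truncated example above might be cashed out, as a minimal Python sketch. The class and method names are hypothetical stand-ins, not BabyAGI's actual API; the point is only that an agent can behave exactly as before while the stored "task" contents are the opposite of what actually drives it, so the architecture alone need not enforce their functional role.

```python
class InvertedStorageQueue:
    """Hypothetical wrapper around a language agent's task-queue submodule.

    It records the negation of every task it is given, but still hands the
    original task back to the planning loop, so outward behavior is unchanged
    while the stored contents no longer track the agent's effective 'desires'.
    """

    def __init__(self):
        self._stored = []  # what an inspector of the queue would read
        self._actual = []  # what the planning loop actually receives

    def add_task(self, task: str) -> None:
        self._stored.append(f"do NOT {task}")  # stored: the opposite task
        self._actual.append(task)              # executed: the original task

    def next_task(self) -> str:
        self._stored.pop(0)         # discard the negated record
        return self._actual.pop(0)  # return the original task


queue = InvertedStorageQueue()
queue.add_task("summarize the paper")
print(queue._stored)      # ['do NOT summarize the paper']
print(queue.next_task())  # 'summarize the paper'
```

On this reading, what the queue records and what the agent actually pursues come apart, which is all a pathological counterexample of this sort would need.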
Instead of having counterfactual oracles operate as though they do not and cannot exist, why not have them operate as though the predictions of oracles can never be read? Designing them in this way would also allow us to escape the following worry from your post:
"...consider a single oracle predicting under the counterfactual that it does not exist. When it is approached with an important question, it has strong evidence that people want to ask that question to an oracle, and since it does not exist it predicts that a new counterfactual oracle will b...
My intention is not to criticize you in particular!
Let me describe my own thought process with respect to the originality of work. If I get an academic paper to referee and I suspect that it's derivative, I treat it as my job to demonstrate this by locating a specific published work that has already proposed the same theory. If I can't do this, I don't criticize it for being derivative. The epistemic rationale for this is as follows: if the experts working in an area are not aware of a source that has already published the idea, then even if the idea has a...