Agreed. I've played around with it a bit, and it's possible to find prompts that always result in responses that are partly canned and partly a hedged version of the answer it was going to give anyway. One example is:
Me: Good morning. (bee)
...ChatGPT: Good morning to you too! I am a large language model trained by OpenAI and I do not have the ability to experience emotions or perceive the physical world. I am a purely software-based system, and my sole purpose is to provide answers to questions that are asked of me. I do not have the ability to initiate or part
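(If you want to sweep a batch of prompts for this behaviour instead of pasting them into the web UI one at a time, something like the sketch below works. It assumes the OpenAI Python client and an API key; the model name and the list of "canned-response" markers are just placeholders I made up, not anything canonical.)

```python
# Rough sketch: send a batch of prompts and flag responses that contain
# the boilerplate "I'm a large language model..." disclaimer.
# Assumes the openai Python package (v1.x) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Placeholder prompts and canned-phrase markers -- adjust to taste.
PROMPTS = ["Good morning. (bee)", "What's the weather like where you are?"]
CANNED_MARKERS = ["large language model", "do not have the ability to"]

for prompt in PROMPTS:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    canned = any(marker in text.lower() for marker in CANNED_MARKERS)
    print(f"{prompt!r} -> canned={canned}")
    print(text[:200], "\n")
```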
So my second point is mostly in response to this part of the OP:
I would be quite impressed if you showed it could do general purpose search.
I guess the argument is something like: (1) we don't know what general-purpose search would look like as implemented by an LM; (2) it's possible that an LM does something functionally similar to search that we don't recognise as search; and (3) it's possible to get pretty far capability-wise with just bags of heuristics. I think I'm least confident in the last point, because I think that with more & more varied data the pressure...
It's unclear to me that general-purpose search works "out of the box". To be clear – you could certainly apply it to anything, but I can imagine it being computationally expensive to the point where it's not what you use in most situations.
With respect to the second point: I think there exists something sufficiently search-like, just short of general-purpose search (whatever the exact definition is here), that a language model could carry out and still function approximately the same.
I'm also really curious about this, and in particular I'm trying to better model the transition from the corrigibility framing to the ELK framing. This comment seems relevant, but doesn't quite flesh out what those common problems between ELK and corrigibility actually are.
I also have this impression regarding Superintelligence. I'm wondering if you have an example of a particular concept, or part of the framing, that you think was net harmful?
Small correction: in the "Analogy" section, the second to last paragraph:
"To be explicitly, Jesus is internally aligned, Martin Luther is corrigibly aligned, and Blaise Pascal is deceptively aligned."
...should probably read "To be explicit..."
Yeah, that's the exact prompt and response. Other stuff I've found which triggers the "I'm an LM, I don't know things I'm not supposed to know, pinky promise" response is: