I think this is quite similar to my proposal in Capabilities and alignment of LLM cognitive architectures.
I think people will add cognitive capabilities to LLMs to create fully capable AGIs. One such important capability is executive function. That function is only loosely defined in cognitive psychology, but it is crucial for planning, among other things.
I do envision such planning looking loosely like a search algorithm, as it does for humans. But it's a loose search algorithm, working in the space of statements made by the LLM about possible future states and action outcomes. So it's more like a tree of thought or graph of thought than any existing search algorithm, because the state space isn't well defined independently of the algorithm.
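To make that concrete, here is a rough sketch of the kind of loose search I have in mind: a beam search where both the candidate "states" and their scores are just text produced by the LLM. The `llm`, `propose_steps`, and `score_state` functions are placeholders, not any real API.

```python
# A loose, tree-of-thought-style search: states are natural-language statements
# the LLM makes about possible futures, and the LLM also scores them.
import heapq

def llm(prompt: str) -> str:
    """Placeholder for a call to some LLM; returns free-text output."""
    raise NotImplementedError

def propose_steps(state: str, k: int = 3) -> list[str]:
    """Ask the LLM for k candidate next actions/outcomes, one per line."""
    out = llm(f"Current situation:\n{state}\n\nList {k} plausible next steps:")
    return [line.strip() for line in out.splitlines() if line.strip()][:k]

def score_state(goal: str, state: str) -> float:
    """Ask the LLM to rate progress toward the goal on a 0-10 scale."""
    out = llm(f"Goal: {goal}\nState: {state}\nRate progress 0-10, digits only:")
    try:
        return float(out.strip())
    except ValueError:
        return 0.0

def loose_search(initial: str, goal: str, depth: int = 4, beam: int = 3) -> list[str]:
    """Beam search over statements; the 'state space' only exists as whatever
    text the LLM produces, which is why this isn't a classical search algorithm."""
    frontier = [(0.0, [initial])]  # (negated score, path of statements)
    for _ in range(depth):
        candidates = []
        for _, path in frontier:
            for step in propose_steps(path[-1]):
                candidates.append((-score_state(goal, step), path + [step]))
        frontier = heapq.nsmallest(beam, candidates)  # keep the `beam` best paths
    return min(frontier)[1]
```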
That all keeps things more dependent on the LLM black box, as in your final possibility.
At least I think that's the analogy between the proposals? I'm not sure.
I think the pushback to both of these is roughly: this is safer how?
I don't think there's any way to strictly formalize not harming humans. My answer is halfway between that and your "sentiment analysis in each step of planning". I think we'll define rules of behavior in natural language, including not harming humans but probably much more elaborate, and implement two layers of review: internal review, like your sentiment analysis but more elaborate, and external review by humans aided by tool AI (doing something like sentiment analysis), as a form of scalable oversight.
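Very roughly, the internal-review half of that could look something like the sketch below; the rules, the reviewer model, and the `escalate` hook are all illustrative placeholders, not a worked-out proposal.

```python
# Sketch: every planned step is checked against natural-language rules by a
# reviewer model; anything flagged is escalated to external (human + tool-AI)
# review. All names here are illustrative.
RULES = [
    "Do not take actions that could harm a human.",
    "Do not deceive the overseers about the plan or its expected effects.",
]

def internal_review(step: str, reviewer_llm) -> bool:
    """Return True if the reviewer model judges the step compliant with RULES."""
    verdict = reviewer_llm(
        "Rules:\n" + "\n".join(RULES)
        + f"\n\nProposed step: {step}\nDoes this step violate any rule? yes/no:"
    )
    return verdict.strip().lower().startswith("no")

def filter_plan(plan: list[str], reviewer_llm, escalate) -> list[str]:
    """Keep compliant steps; send anything questionable to external oversight."""
    approved = []
    for step in plan:
        if internal_review(step, reviewer_llm):
            approved.append(step)
        else:
            escalate(step)  # external review by humans aided by tool AI
    return approved
```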
I'm curious if I'm interpreting your proposal correctly. It's stated very succinctly, so I'm not sure.
Every LLM in existence is a black box, and alignment that relies on tuning the black box never succeeds; that is evident from the fact that even models like ChatGPT get jailbroken constantly. Moreover, black-box tuning has no reason to transfer to bigger models.
A new architecture is required. I propose using an LLM to parse the environment into a planner format such as STRIPS, and then using an algorithmic planner such as Fast Downward to implement agentic behaviour. The produced plan is then parsed back into natural language, or into commands to execute automatically. Such an architecture would also be commercially desirable and would disincentivise investment in bigger monolithic models.
Draft of the architecture: https://gitlab.com/anomalocaribd/prometheus-planner/-/blob/main/architecture.md
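A minimal sketch of the pipeline, assuming Fast Downward is installed and invoked through its standard `fast-downward.py` driver script; the `llm` function and the prompts are placeholders, not part of the draft:

```python
# LLM -> PDDL -> Fast Downward -> plan -> natural language / commands.
import subprocess
from pathlib import Path

def llm(prompt: str) -> str:
    """Placeholder for the LLM used to translate between text and PDDL."""
    raise NotImplementedError

def plan(observations: str, goal: str, workdir: str = "run") -> list[str]:
    wd = Path(workdir)
    wd.mkdir(exist_ok=True)

    # 1. LLM parses the environment description into PDDL domain + problem files.
    (wd / "domain.pddl").write_text(
        llm(f"Write a PDDL domain for this environment:\n{observations}"))
    (wd / "problem.pddl").write_text(
        llm(f"Write a matching PDDL problem with this goal: {goal}\n{observations}"))

    # 2. The algorithmic planner searches for a plan (driver script assumed present).
    subprocess.run(
        ["./fast-downward.py", str(wd / "domain.pddl"), str(wd / "problem.pddl"),
         "--search", "astar(lmcut())"],
        check=True,
    )

    # 3. Fast Downward writes the plan to `sas_plan`, one ground action per line;
    #    parse it back into natural-language commands.
    actions = [a for a in Path("sas_plan").read_text().splitlines() if a.startswith("(")]
    return [llm(f"Rephrase this planner action as a command: {a}") for a in actions]
```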
q: What part of the alignment problem does this plan aim to solve?
a: Defining hard constraints on AI behaviour.
q: How does this plan aim to solve the problem?
a: A planner would either be hard-constrained with a formal-language definition of not harming humans, or would incorporate sentiment analysis at each step of planning.
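To illustrate the two options (the predicate names and the classifier are hypothetical, not taken from the draft):

```python
# (a) Hard constraint baked into the PDDL domain: every action that affects a
# person carries a precondition ruling out harm, so no legal plan can contain
# a harming action.
GUARDED_ACTION = """
(:action administer-treatment
  :parameters (?p - person ?t - treatment)
  :precondition (and (prescribed ?t ?p) (not (harms ?t ?p)))
  :effect (treated ?p))
"""

# (b) Per-step check instead: run each planned step through a sentiment/safety
# classifier and reject steps above a harm threshold.
def step_is_acceptable(step: str, classify) -> bool:
    """`classify` is any text classifier returning a harm probability in [0, 1]."""
    return classify(step) < 0.01
```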
q: What evidence is there that the methods will work?
a: https://arxiv.org/abs/2304.11477
q: What are the most likely causes of this not working?
a: STRIPS is a very primitive planning language, and for the more expressive ones, planners need to be developed almost from scratch. This approach might once again teach us the bitter lesson, as a completely algorithmic planner becomes infeasible to implement. A naive implementation will also suffer from combinatorial explosion almost immediately. However, this may all be mitigated by integrating LLM functionality further into the planning process. And even though that brings back the problem of black boxes, the architecture would still enable some degree of compartmentalisation, meaning we would only have to contend with smaller, more easily interpretable models.
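One possible shape of that mitigation, sketched below: the LLM acts as a pruner in front of the formal planner rather than replacing it, so the combinatorial blow-up is contained and the black-box component stays small. All names are illustrative.

```python
# Prune the grounded action set with the LLM before handing it to the planner.
def prune_actions(goal: str, ground_actions: list[str], llm, keep: int = 50) -> list[str]:
    """Ask the LLM which ground actions look relevant to the goal and keep at
    most `keep` of them; the formal planner then searches only this subset."""
    out = llm(
        f"Goal: {goal}\nActions:\n" + "\n".join(ground_actions)
        + f"\nReturn up to {keep} actions (verbatim, one per line) that could matter:"
    )
    chosen = {line.strip() for line in out.splitlines()}
    return [a for a in ground_actions if a in chosen][:keep]
```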