This is an interesting model, and I know you acknowledged that progress could take years, but my impression is that this would be even more difficult than you're implying. Here are the problems I see, and I apologize in advance if this doesn't all make sense as I am a non-technical newb.
Wouldn’t it take insane amounts of compute to process all of this? LLM + CoT already uses a lot of compute (see: o3 solving ARC puzzles for ~$1M). Combine that with processing images/screenshots/video/audio, plus tokens for folding saved episodic memories into working memory, plus tokens for the decision-making (basal ganglia) module, and you end up with a lot of tokens per step. Can all of that fit into a context window and be processed with the amount of compute that will actually be available? Even if one extremely expensive system could run this, could you have millions of agents running it for long periods of time?
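To make the worry concrete, here's a rough back-of-envelope sketch. Every number in it is a made-up assumption (per-image token cost, memory size, steps per hour, price per token), not something from your post or from any real system:

```python
# Back-of-envelope token math for one hypothetical agent step.
# All numbers are illustrative assumptions, not measured figures.

tokens_per_image = 1_500        # assumed cost of one encoded screenshot/frame
frames_per_step = 4             # assumed frames of screen/video per decision step
episodic_memory_tokens = 4_000  # assumed retrieved memories folded into working memory
cot_tokens = 3_000              # assumed chain-of-thought reasoning for the step
decision_module_tokens = 1_000  # assumed tokens for the "basal ganglia" module

tokens_per_step = (tokens_per_image * frames_per_step
                   + episodic_memory_tokens
                   + cot_tokens
                   + decision_module_tokens)

steps_per_hour = 360             # assume one decision roughly every 10 seconds
price_per_million_tokens = 10.0  # assumed blended $ per 1M tokens

hourly_cost = tokens_per_step * steps_per_hour / 1_000_000 * price_per_million_tokens
print(f"~{tokens_per_step:,} tokens per step, ~${hourly_cost:.0f}/hour per agent")
# With these made-up numbers: ~14,000 tokens/step and ~$50/hour for ONE agent,
# before you multiply by millions of agents running around the clock.
```

Even if my numbers are off by 10x in either direction, the point stands that the context and the bill both grow with every extra module you bolt on.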
How do you train this? LLMs are superhuman at language processing because they were trained on billions of pieces of text. How do you train an agent similarly? We don’t have billions of examples of a system like this being used to achieve goals; I don’t think we have any. You could put together a system like this today, but it would be bad (see: Claude playing Pokemon). How does it improve? I think it would have to actually carry out tasks and be trained with RL on them, which means that to improve on long-horizon tasks it would need long-horizon timeframes just to get reinforcement signals. You could run simulations, but will they come anywhere close to matching the complexity of the real world? And then there’s the issue that scalable RL only seems to work for tasks with a defined goal: how would it improve on open-ended problems?
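Here's a toy sketch of the long-horizon RL problem I'm gesturing at. The environment, the policy, and the 1% success rate are all invented for illustration:

```python
# Toy illustration of sparse, long-horizon RL. Everything here is a placeholder.
import random

def run_episode(policy, horizon=10_000):
    """Roll out one long task; the only reward arrives at the very end."""
    trajectory = []
    for step in range(horizon):
        observation = f"obs_{step}"  # stand-in for screenshots/audio/retrieved memories
        action = policy(observation)
        trajectory.append((observation, action))
    reward = 1.0 if random.random() < 0.01 else 0.0  # single success/failure bit
    return trajectory, reward

def random_policy(observation):
    return random.choice(["click", "type", "scroll", "wait"])

trajectory, reward = run_episode(random_policy)
print(f"{len(trajectory)} steps of experience, one reward signal: {reward}")
# The problem: one scalar at the very end has to assign credit across thousands
# of steps, and every single data point costs a full long-horizon rollout
# (in the real world, possibly days or weeks of wall-clock time).
```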
If an LLM is at the core of the system, do hallucinations from the LLM “poison the well,” so to speak? You can give it tools, but if the LLM at the core doesn’t know what’s true or false, how does it use them effectively? I’ve seen examples like this: an LLM got a math problem wrong, the user asked it to solve the problem with a Python script, the script produced the correct answer, and the LLM just repeated its wrong answer and blamed a fake floating point error that it said came from the script. So it seems like hallucinated errors from the LLM could derail the whole system. I suppose that being trained and RL’d on solving tasks would eventually lead it to learn what works from what doesn’t, but I’m just pointing out why adding extra modules and tools doesn’t automatically solve the underlying issues with LLMs.
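The failure mode I mean, in sketch form. The `llm_answer` function is a stand-in for the model, not a real API, and the specific numbers are invented:

```python
# Hypothetical sketch: a tool loop where nothing forces the model's final
# answer to agree with the tool's output. Names and numbers are made up.

def run_python_tool(code: str) -> str:
    """The tool itself is reliable: actually execute the code."""
    namespace = {}
    exec(code, namespace)
    return str(namespace["result"])

def llm_answer(question: str, tool_output: str) -> str:
    """Stand-in for the LLM: it ignores the tool and repeats its prior wrong answer."""
    return "The answer is 83810172 (the script hit a floating point rounding error)."

tool_output = run_python_tool("result = 12345 * 6789")   # -> "83810205"
final = llm_answer("What is 12345 * 6789?", tool_output)

# Nothing in this loop checks that `final` is grounded in `tool_output`,
# so the hallucinated answer (and the invented error message) sails through.
if tool_output not in final:
    print("Ungrounded answer despite a correct tool result:", final)
```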