I do not intend to be rude by saying this, but I firmly believe you vastly overestimate how capable modern VLMs are and how capable LLMs are at performing tasks in a list, breaking down tasks into sub-tasks, and knowing when they've completed a task. AutoGPT and equivalents have not gotten significantly more capable since they first arose a year or two ago, despite the ability for new LLMs to call functions (which they have always been able to do with the slightest in-context reasoning), and it is unlikely they will ever get better until a more linear, reward loop, agentic focused learning pipeline is developed for them and significant amount of resources are dedicated to the training of new models with a higher causal comprehension.
I do not intend to be rude by saying this, but I firmly believe you vastly overestimate how capable modern VLMs are and how capable LLMs are at performing tasks in a list, breaking down tasks into sub-tasks, and knowing when they've completed a task. AutoGPT and equivalents have not gotten significantly more capable since they first arose a year or two ago, despite the ability for new LLMs to call functions (which they have always been able to do with the slightest in-context reasoning), and it is unlikely they will ever get better until a more linear, reward loop, agentic focused learning pipeline is developed for them and significant amount of resources are dedicated to the training of new models with a higher causal comprehension.