It doesn't literally have to be a utility function. To be more precise, we're worried about any sort of AGI that exhibits goal-directed behavior across a wide variety of real-world contexts.
Why would anyone build an AI that does that? Humans might build it directly because it's useful: an AI that you can tell to achieve real-world goals could make you very rich. Or it might arise as an unintended consequence of optimizing in a non-real-world domain (e.g. playing a video game): goal-directed reasoning in that domain might be useful enough that it gets learned from scratch, and then goal-directed behavior in the real world might be instrumentally useful for achieving goals in the original domain (e.g. modifying your own hardware to get better at the game).
That seems to be a bit of a motte-and-bailey. Goal-directed behavior does not require optimizing; satisficing works fine. Having a utility function means not stopping until it's maximized, as I understand it.
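A toy sketch of the difference (my own illustration; the plans and payoffs here are made up):

```python
def maximizer(candidates, score):
    """Never 'done': always takes the best option it can find."""
    return max(candidates, key=score)

def satisficer(candidates, score, threshold):
    """Stops at the first option that is good enough."""
    for option in candidates:
        if score(option) >= threshold:
            return option
    return None  # nothing met the bar; a real agent might widen its search

# Hypothetical plans scored by expected payoff.
plans = ["do nothing", "modest plan", "galaxy-brained plan"]
payoff = {"do nothing": 0.0, "modest plan": 0.7, "galaxy-brained plan": 0.99}.get

print(maximizer(plans, payoff))        # -> 'galaxy-brained plan'
print(satisficer(plans, payoff, 0.5))  # -> 'modest plan'
```

Both are goal-directed, but the satisficer's search has a natural stopping point; there is no pressure to keep optimizing once the bar is cleared.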
A more intelligent DALL-E wouldn't make pictures that people like better; it would more accurately approximate the distribution of images in its training data. And you're right that this is not dangerous, but it is also not very useful.
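Concretely, a generative image model is (to a first approximation) trained to maximize the likelihood of its training data given the caption, something like

$$\theta^* = \arg\max_{\theta} \; \mathbb{E}_{(x,\,c) \sim p_{\text{data}}} \left[ \log p_{\theta}(x \mid c) \right],$$

where $x$ is an image and $c$ its caption. Nothing in that objective mentions how much people like the output; a more capable model just drives $p_{\theta}$ closer to $p_{\text{data}}$.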
A utility function is an abstraction. It is not something that you literally program into an agent. A utility function is dual to all the individual decisions made, or to the preferences between real or hypothetical options. A utility function always implicitly exists if the preferences satisfy certain reasonable requirements. But it is mostly not possible to determine the utility function from observed preferences, because you'd need to observe all of them or make a lot of regularizing assumptions.
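For reference, the "reasonable requirements" here are the von Neumann-Morgenstern axioms (completeness, transitivity, continuity, independence). The theorem then guarantees a utility function $u$ over outcomes such that, for any two lotteries $L$ and $M$,

$$L \succeq M \iff \mathbb{E}_{L}[u(x)] \ge \mathbb{E}_{M}[u(x)],$$

with $u$ unique only up to a positive affine transformation $u \mapsto a u + b$ with $a > 0$. That is the formal sense in which a utility function "implicitly exists" without being written down anywhere in the agent.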
A utility function can be a real, separable feature of a system, but that is rather exceptional.
Goodness of the picture is the utility function.
Maybe the intuition can be pumped by thinking of a picture prompt like "Timelapse of the world getting fixed. Colorized historical photo 4k."
My intuition says that a narrow AI like DALL-E would not blow up the world, no matter how much smarter it became. It would just get really good at making pictures.
This is clearly the form of superintelligence we would all prefer, and the difference, it seems to me, is that DALL-E doesn't really have 'goals' or anything like that; it's just a massive tool.
Why do we care to have AGI with utility functions?