This is where the strengths of tools like ComfyUI come into play: you can reinforce your prompts with LoRAs and feed additional prompts into refiners, upscalers, etc.
I'm still fairly new to the AI imaging thing, having only played around with it for about two months, so I can't really respond to this without learning more myself. Generally, what I'm seeing in the more fast-paced places is still split about equally between Stable Diffusion (and Comfy) and the others.
All these secondary tools for diffusion models, LoRAs, model selection, ControlNet, inpainting, upscaling, post-processing... ALL of these things are basically trying to remedy the fact that diffusion models are inherently random and rarely spit out exactly what the user asked for. So you need different tools that each manipulate one specific part of the output. That's the reality.
GPT is basically an "intelligent" interface that can understand your request in simple prompt form. There is an enormous difference between the two concepts.
Instead of using a LoRA of some character, you just tell it to draw this character. Instead of using art-style LoRAs, you just tell it to use this art style. Instead of using inpainting, you just tell it to keep everything the same but fix the hands. Instead of using upscaling, you just tell it to upscale. You can tell it to zoom out. You can tell it to rotate. It's an interface where you can communicate with the image generator as you would with an LLM. And the image will be built bit by bit, ensuring it adheres to the request.
Granted, the more advanced diffusion models are better at understanding prompts written as sentences, but they are still working from random noise as the starting point.
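To make that "random noise as the starting point" concrete, here's a toy sketch of what a diffusion sampler does at its core. This is not any real model's code; the `toy_denoise` function and its interpolation step are made up for illustration (a real sampler uses a trained network to predict the noise at each step). The point is just that the output depends on the random seed you start from, which is why two runs of the same prompt can differ.

```python
import numpy as np

def toy_denoise(seed, steps=10):
    """Toy stand-in for a diffusion sampler: start from pure random
    noise and repeatedly 'denoise' toward a fixed target. A real
    sampler predicts the noise with a neural net at each step; here
    we just interpolate, to show why the starting seed matters."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((8, 8))   # the random-noise starting point
    target = np.ones((8, 8))          # pretend "clean image"
    for _ in range(steps):
        # each step removes a fraction of the remaining noise
        x = x + 0.3 * (target - x)
    return x

a = toy_denoise(seed=0)
b = toy_denoise(seed=1)
c = toy_denoise(seed=0)
# Different seeds leave different residual noise; the same seed
# reproduces the exact same result.
```

This is also why diffusion UIs expose a seed field: fixing the seed is the only way to get the same image twice, and every other tool (LoRA, ControlNet, inpainting) is steering that noisy trajectory rather than talking to the model.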
The massive benefit of a GPT-style multimodal LLM with image generation capabilities is that the AI can keep track of the entire project. You can make iterative refinements. Ask it to create a cat on a skateboard. It does, then you ask it to change the art style. Then you ask it to add sunglasses. Then you upload your company's logo and tell it to put it on the cat's shirt. Then you ask it to write a slogan for your company after explaining what your company does. That whole workflow happens in one single program, and all you are doing is writing prompts in plain sentences.
You can upload a picture of your OC and ask the AI to create various other poses and expressions of that character for VN sprite work. Or maybe you ask it to create RPGM pixel sprites facing the four cardinal directions.
And the biggest reason for us gooners: the ability to illustrate your chat in real time, always staying consistent with the art style and the character's appearance.
THIS is the holy grail we've been waiting for. Of course, it will never happen with GPT's censorship, but this technology will be the future of image generation.