Neato.
Re your question:
For consistent clothes, you can do the same as you'd do for faces:
Use ADetailer + IP-Adapter.
If you're using ComfyUI, there's a video walkthrough for this (I'm using Forge2, which is a tad different, but from a quick look the workflow carries over well enough).
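In case a code-level sketch makes the idea more concrete, here's a minimal diffusers version of the same trick: inpaint a masked clothes region while an IP-Adapter reference image keeps the outfit consistent. The checkpoint, file names and scale/strength values are placeholders of mine (and ADetailer would normally build the mask from a detector automatically), not anything taken from the video.

```python
# Minimal sketch of "ADetailer + IP-Adapter" for consistent clothes, done with
# the diffusers library instead of the WebUI extensions.
# Assumptions: checkpoint and file paths are placeholders; the clothes mask is
# loaded from disk here, whereas ADetailer would build it with a detector.
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipe = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # swap in your Pony/Ill/NoobAI checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# The IP-Adapter supplies the "same outfit" conditioning from a reference image.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                     weight_name="ip-adapter_sdxl.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the reference outfit is enforced

image = load_image("render.png")           # the generated image to fix up
mask = load_image("clothes_mask.png")      # white where the clothes get redrawn
outfit_ref = load_image("outfit_ref.png")  # reference shot of the outfit

result = pipe(
    prompt="same character, detailed clothes",
    image=image,
    mask_image=mask,
    ip_adapter_image=outfit_ref,
    strength=0.5,        # denoise: high enough to actually change the region
    guidance_scale=7.0,
).images[0]
result.save("consistent_clothes.png")
```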
PS: I'm not mixing pure SDXL into my favorite Pony/Ill/NoobAI models - Pony already is SDXL, just heavily finetuned on fictional/artificial content + booru tags.
Ill and NoobAI lean even more into that (+ e621 tags, depending on the training).
That's why those models no longer really "speak" natural language and instead rely on their tags - but it also makes them extremely versatile + consistent.
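Just to illustrate what relying on tags looks like in practice - the tags below are only an example I made up for the comparison, not a recommended prompt:

```python
# Purely illustrative: the same request phrased for a natural-language SDXL
# model vs. a booru-tag-trained model (Pony-style, including its quality tags).
natural_language_prompt = "a woman in a red dress standing on a beach at sunset"
tag_style_prompt = "score_9, score_8_up, 1girl, red dress, beach, sunset, standing, outdoors"
```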
I mean, you could mix more "pure" SDXL back in and split your prompts for that model into natural language and tags (there's a plugin for Forge2 that introduces [SPLIT] to non-ADetailer prompting for exactly that reason - only it doesn't split the prompt per area, but for the whole image).
But I'm using SDXL models exclusively for ADetailer - just account for the slightly different lighting (even with the same VAE) by using more mask dilation + blur, at least when you use higher denoise (i.e. >0.5, if you need to significantly change things inside that area).
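If you ever need to do that step outside the WebUI: "more dilation + blur" just means growing the mask and feathering its edge before the inpaint. A minimal PIL sketch, with made-up size/radius values:

```python
# Grow (dilate) and feather (blur) an ADetailer-style mask so a high-denoise
# inpaint blends in despite the slightly different lighting of the SDXL model.
# The size/radius values below are placeholders, not tuned numbers.
from PIL import Image, ImageFilter

mask = Image.open("clothes_mask.png").convert("L")

# Dilation: MaxFilter grows the white region (size must be odd).
dilated = mask.filter(ImageFilter.MaxFilter(size=15))

# Blur: feather the edge so the >0.5-denoise repaint fades into the original.
feathered = dilated.filter(ImageFilter.GaussianBlur(radius=8))
feathered.save("clothes_mask_soft.png")
```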
PPS: I remembered from way back that you can gain more realism by abusing the Latent upscaler - because the typical modern favorites like 4x-Remacri don't change enough at lower denoise and add too much crap at higher denoise, i.e. anything above 0.3.
Anyway, using a high initial resolution of ~1280 x 1600 (like I typically do anyway) plus a 1.2-1.5x upscale, ~5 upscale CFG and ~0.5-0.6 denoise adds quite a bit of realism - the prompt only has to counter the slight changes (putting more emphasis on things the Latent upscaler would otherwise suppress, etc.).
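For reference, here's roughly what that latent-upscale pass looks like as a diffusers sketch. The checkpoint ID, interpolation mode and exact numbers are my assumptions; recent diffusers versions accept the base pass's latents directly as the img2img input, otherwise you'd decode to pixels first.

```python
# Rough sketch of the "abuse the Latent upscaler" trick: render at a high base
# resolution, interpolate the latents up by ~1.2-1.5x, then re-denoise at
# ~0.5-0.6 strength with CFG ~5.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"  # placeholder checkpoint
txt2img = StableDiffusionXLPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16).to("cuda")

prompt = "photo of a woman in a red dress, detailed skin, natural lighting"

# 1) Base pass at a high initial resolution, keeping the latents.
latents = txt2img(prompt, width=1280, height=1600,
                  output_type="latent").images

# 2) "Latent upscale": plain interpolation of the latent tensor (here 1.3x).
up_latents = F.interpolate(latents, scale_factor=1.3,
                           mode="bilinear", align_corners=False)

# 3) Re-denoise the upscaled latents at ~0.5-0.6 strength with CFG ~5.
#    (In practice you'd reuse txt2img's components instead of loading twice.)
img2img = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16).to("cuda")
final = img2img(prompt, image=up_latents,
                strength=0.55, guidance_scale=5.0).images[0]
final.save("latent_upscaled.png")
```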
I'm currently working on something else, but once I'm done I might return to this once more for a final final version.