[Stable Diffusion] Prompt Sharing and Learning Thread

Nano999

Member
Jun 4, 2022
154
69
Hey!
I haven't touched SD for 3+ months or so, did something happen to the hires fix?
It feels like it renders images 10 times slower than before xD

Using 8x_NMKD-Superscale_150000_G

Upscale by 2 from 512 px

Maybe something on my system is eating up resources?
I checked with Security Task Manager but found nothing.

Time taken: 17 min. 34.7 sec. for 2 images oO
50 + 20 steps
 
Last edited:

daddyCzapo

Member
Mar 26, 2019
241
1,492
Hey!
I haven't touched SD for 3+ months or so, did something happen to the hires fix?
It feels like it renders images 10 times slower than before xD

Using 8x_NMKD-Superscale_150000_G

Upscale by 2 from 512 px

Maybe something on my system is eating up resources?
I checked with Security Task Manager but found nothing.

Time taken: 17 min. 34.7 sec. for 2 images oO
50 + 20 steps
What are your specs? For me, on an RTX 2060 in A1111, it takes about 6 minutes to generate at 512x768, 30 steps, with a 2.5x hires pass at 60 steps using 8x_NMKD-Superscale_150000_G. It was said on A1111's GitHub that some NVIDIA drivers were causing longer generation times.
Edit: Oh, and if you updated the webui, you need to select the cross attention optimization in the Settings tab. IIRC --xformers alone no longer enables it.
 
Last edited:

Nano999

Member
Jun 4, 2022
154
69
What are your specs? For me, on an RTX 2060 in A1111, it takes about 6 minutes to generate at 512x768, 30 steps, with a 2.5x hires pass at 60 steps using 8x_NMKD-Superscale_150000_G. It was said on A1111's GitHub that some NVIDIA drivers were causing longer generation times.
Edit: Oh, and if you updated the webui, you need to select the cross attention optimization in the Settings tab. IIRC --xformers alone no longer enables it.
Yes, if an update via the console counts?

1694946954116.png

1694946868836.png
Well, 3 months ago the identical render would take 4-5 minutes max per image (50+20 steps, 2x upscale).
The speed for the first 50 steps is fine, but when it comes to the 20 hires steps at 2x with 8x_NMKD-Superscale_150000_G, it's super slow, turtle pace...

CPU - AMD Ryzen 7 3700X AM4 BOX 8-Core Processor (16 CPUs), ~3.6 GHz
GPU - PALIT NVIDIA GeForce RTX 2060 SUPER GP (8 Gb)
RAM - Kingston DDR4 32Gb (2x16Gb) 3200 MHz pc-25600 (HX432C18FWK2/32) HyperX FURY White
RAM - SAMSUNG DDR4 32Gb (2x16Gb) 3200 MHz pc-25600 (M378A2G43MX3-CWE)
MOBO - MSI X470 Gaming Plus (MS-7B79) (AM4, ATX)


1694946831949.png

What should I choose here?
 

daddyCzapo

Member
Mar 26, 2019
241
1,492
Yes, if an update via the console counts?

View attachment 2936572

View attachment 2936570
Well, 3 months ago the identical render would take 4-5 minutes max per image (50+20 steps, 2x upscale).
The speed for the first 50 steps is fine, but when it comes to the 20 hires steps at 2x with 8x_NMKD-Superscale_150000_G, it's super slow, turtle pace...

CPU - AMD Ryzen 7 3700X AM4 BOX 8-Core Processor (16 CPUs), ~3.6 GHz
GPU - PALIT NVIDIA GeForce RTX 2060 SUPER GP (8 Gb)
RAM - Kingston DDR4 32Gb (2x16Gb) 3200 MHz pc-25600 (HX432C18FWK2/32) HyperX FURY White
RAM - SAMSUNG DDR4 32Gb (2x16Gb) 3200 MHz pc-25600 (M378A2G43MX3-CWE)
MOBO - MSI X470 Gaming Plus (MS-7B79) (AM4, ATX)


View attachment 2936567

What should I choose here?
You can either choose Doggettx or install xformers; I personally prefer xformers. The A1111 wiki isn't consistent on this, so I'm not 100% sure, but if you want to install xformers you need to edit your webui-user.bat and add --xformers, as shown in this screenshot: 1694950546529.png
It should install xformers automatically, and then you can select xformers in the cross attention optimization menu :)
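For reference, a minimal webui-user.bat with the flag added would look roughly like this. This is just a sketch of the stock file with --xformers appended; keep whatever other arguments and paths you already use:

@echo off
set PYTHON=
set GIT=
set VENV_DIR=
rem --xformers makes the webui install and enable the xformers package on the next launch
set COMMANDLINE_ARGS=--xformers
call webui.bat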
And as far as I can see, you have an earlier version of torch. I have torch 2.0.1 ( 1694950891902.png ). But that topic I'll leave to someone more capable than me, I'm just a noob.
 
Last edited:

me3

Member
Dec 31, 2016
316
708
"Something" relating to the last update killed xformers for me as well and so far i've not been able to work out why. I had it working just fine when first updating, but after i relaunched the UI xformers suddenly wasn't found any more. I've upgrade xformers, reinstalled it, downgraded it, completely "reinstalled" venv, and still it says "no module xformers" on launch, xformers isn't an option in the optimizers, regardless of --xformers or not. An option that technically no longer is need/works.
I have xformers working just fine with kohya_ss so it's something limited to just a1111
 
  • Like
Reactions: Mr-Fox

devilkkw

Member
Mar 17, 2021
305
1,040
"Something" relating to the last update killed xformers for me as well and so far i've not been able to work out why. I had it working just fine when first updating, but after i relaunched the UI xformers suddenly wasn't found any more. I've upgrade xformers, reinstalled it, downgraded it, completely "reinstalled" venv, and still it says "no module xformers" on launch, xformers isn't an option in the optimizers, regardless of --xformers or not. An option that technically no longer is need/works.
I have xformers working just fine with kohya_ss so it's something limited to just a1111
Have you try to choosing it in "SETTINGS" tab, under "Optimization" ?
 

me3

Member
Dec 31, 2016
316
708
Just to potentially up the "learning" part of this thread again.
As I've been doing a lot of repeated training lately, I thought I'd share some minor things that might help others do things a bit faster.
Since I have a slow, old card with just 6GB of VRAM, reducing memory needs has always been one of the priorities, which means any sort of "speed" goes out the window.

When I started this endlessly long repeated training... sigh... it was running at 6.8 to 8.4 s/it. Yes, that's the right way around: seconds per iteration,
which means a 10k-step run would easily take >20 hours. That was without spilling into shared memory, purely running at 95-98% VRAM. The training I'm currently running, still on the same dataset, is at 1.6 to 1.8 s/it. It's still "the slow way", but the difference is very noticeable. Now, if all those with 30xx and 40xx cards are done laughing, this might help you as well; obviously it's a bit hard for me to test, unless I'm somehow gifted a massive new computer or win some kind of lottery.

While these things mainly reduce memory needs, they can help those with much more VRAM too, since they might let you fit more batches, which would speed things up greatly.

First off, buckets. Many guides complain about them being horrible without giving any reason why, or with some poor excuse suggesting they don't even know how bucketing works. The point here is that buckets let you reduce the pixel count of most/all images, because you can crop a lot of pointless background out of tall/wide images instead of padding everything into squares. 100 images at 512x512 or 768x768 squares are a lot more to work through than the same images with half cut off because it's "empty space". Just remember to set the bucket options accordingly so you don't get extra cropping or weird bucketing.

Second, the optimizer. If you're on lower specs you're probably already familiar with AdamW8bit, but there's a more "optimized" version called PagedAdamW8bit.
For me, with AdamW8bit I had to run training with both "gradient checkpointing" and "memory efficient attention", and training ran at a constant 5.8GB. When generating sample images it would then offload to make room, which caused a noticeable lag; not major, but it was there.
With the Paged version I can drop "memory efficient attention", training runs at 4.4-4.6GB, and it only spikes up for sample images without offloading or "reorganizing" any memory to do it. So everything stays in memory, which speeds up sampling too (not that sampling is a feature you need). But because memory efficient attention is no longer needed and there's still VRAM to spare, everything gets faster.
(I haven't tested whether keeping memory efficient attention would allow +1 batch and whether that would be faster overall; I doubt it in my case though.)

Third, a small thing: cache latents. It's generally checked by default in the kohya_ss GUI, but for some reason I've seen guides/training configs turn it off. It might be because it keeps all the latents in memory and people haven't really thought about what that means. As a simple explanation: you keep an encoded "version" of each image in memory instead of reading and re-encoding it every time, where "every time" means every repeat and every epoch. Unlike most of the other "keep data in memory" options, this is fairly small, mostly less than 100KB per image, so unless you're going way overboard with the number of images, this should not be the reason you OOM, and it does speed things up. Your mileage will obviously vary with your system. Don't "cache to disk" though.
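To put the three tweaks in one place, here is roughly how they map onto kohya_ss / sd-scripts command line options. This is only a sketch from memory, so treat the flag names as something to verify against your own sd-scripts version, and the paths, resolutions and dim/alpha values as placeholders rather than my actual settings:

rem bucketing, PagedAdamW8bit + gradient checkpointing, and cached latents in one launch line
rem (a 512x512 image encodes to a 4x64x64 latent, roughly 32KB at fp16, hence the tiny memory cost)
accelerate launch train_network.py ^
 --pretrained_model_name_or_path="D:\models\base_model.safetensors" ^
 --train_data_dir="D:\training\img" --output_dir="D:\training\out" ^
 --resolution=512,512 --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --bucket_no_upscale ^
 --optimizer_type="PagedAdamW8bit" --gradient_checkpointing ^
 --cache_latents ^
 --network_module=networks.lora --network_dim=32 --network_alpha=16 ^
 --train_batch_size=1 --max_train_steps=10000 --mixed_precision="fp16" --save_precision="fp16"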

As a final note, I know these things worked very well for me on my low-spec, old machine. In theory they should work for others as well, but the impact will obviously depend on what's running them.
If you're already running at full speed they're unlikely to do much, but as I mentioned earlier, if they mean you can increase your batch count by 1 or more, that would make things faster even on those systems.

And if anyone bothered reading all this, Hi...
 

ririmudev

Member
Dec 15, 2018
304
303
Obligatory "Noob encounters eldritch horrors" story:
(btw, I poked in a while ago, and didn't expect to come back, but something about a FreeCities x Stable Diffusion integration piqued my interest, so I installed SD, first on my laptop, and then might install later on my main rig if I don't have too many nightmares).

I installed it, seeing someone say that setup was fairly easy (well, I had issues with the g++-9 dependency, due to... blah blah, anyway, I got past it).
Somewhere (maybe here), I saw someone had a simple prompt, something like "a female mage casting a spell"
I tried it out a couple times, results weren't too bad. And I have no additional models installed yet.

So, I jumped to 20 sampling steps of "a female goblin eating drumstick" (I think I was recently looking at the Goblin Layer thread).
...
:cry::cry::cry:
...
Eh... I should have known better. Back to good-ol' coding for a little while.

(I can't quite bring myself to post pics, but I'm sure most of the experienced folks here have seen worse)
 
  • Sad
Reactions: Mr-Fox

me3

Member
Dec 31, 2016
316
708
Obligatory "Noob encounters eldritch horrors" story:
(btw, I poked in a while ago, and didn't expect to come back, but something about a FreeCities x Stable Diffusion integration piqued my interest, so I installed SD, first on my laptop, and then might install later on my main rig if I don't have too many nightmares).

I installed it, seeing someone say that setup was fairly easy (well, I had issues with the g++-9 dependency, due to... blah blah, anyway, I got past it).
Somewhere (maybe here), I saw someone had a simple prompt, something like "a female mage casting a spell"
I tried it out a couple times, results weren't too bad. And I have no additional models installed yet.

So, I jumped to 20 sampling steps of "a female goblin eating drumstick" (I think I was recently looking at the Goblin Layer thread).
...
:cry::cry::cry:
...
Eh... I should have known better. Back to good-ol' coding for a little while.

(I can't quite bring myself to post pics, but I'm sure most of the experienced folks here have seen worse)
Sounds like the prompt I used when testing different UIs and SDXL, so here are a couple of pretty clean runs of your prompt. I've had far worse horrors from things you'd expect to be pretty safe. Missing/skipped images are due more to being "boring" than to any horror.
00024-1818585955.png
00020-2183929224.png 00028-3761512306.png 00029-328585663.png 00044-406385261.png

Models:
#1-2:
#3-4:
#5:
 
Last edited:
  • Like
Reactions: Mr-Fox

ririmudev

Member
Dec 15, 2018
304
303
Sounds like the prompt I used when testing different UIs and SDXL, so here are a couple of pretty clean runs of your prompt. I've had far worse horrors from things you'd expect to be pretty safe. Missing/skipped images are due more to being "boring" than to any horror.
View attachment 2945034
View attachment 2945033 View attachment 2945035 View attachment 2945037 View attachment 2945048

Models:
#1-2:
#3-4:
#5:
Those are a pretty fair representation; in my mind I was going for something a little more cutesy and humanoid, but to be fair, I didn't specify that.
Ok fine... here's a couple of images that I got (hope I don't get banned, though the pics are just unpleasant, nothing rule-breaking):
[spoiler: image]
A few others were just as bad, maybe slightly worse, but I'll leave it at this.
One was bad, but pretty abstract, and almost kind of cool (but I'll still put it in a spoiler):
[spoiler: image]
<End transmission>
 
  • Wow
Reactions: Mr-Fox and Dagg0th

me3

Member
Dec 31, 2016
316
708
Just to potentially up the "learning" part of this thread again.
As I've been doing a lot of repeated training lately, I thought I'd share some minor things that might help others do things a bit faster.
Since I have a slow, old card with just 6GB of VRAM, reducing memory needs has always been one of the priorities, which means any sort of "speed" goes out the window.

When I started this endlessly long repeated training... sigh... it was running at 6.8 to 8.4 s/it. Yes, that's the right way around: seconds per iteration,
which means a 10k-step run would easily take >20 hours. That was without spilling into shared memory, purely running at 95-98% VRAM. The training I'm currently running, still on the same dataset, is at 1.6 to 1.8 s/it. It's still "the slow way", but the difference is very noticeable. Now, if all those with 30xx and 40xx cards are done laughing, this might help you as well; obviously it's a bit hard for me to test, unless I'm somehow gifted a massive new computer or win some kind of lottery.

While these things mainly reduce memory needs, they can help those with much more VRAM too, since they might let you fit more batches, which would speed things up greatly.

First off, buckets. Many guides complain about them being horrible without giving any reason why, or with some poor excuse suggesting they don't even know how bucketing works. The point here is that buckets let you reduce the pixel count of most/all images, because you can crop a lot of pointless background out of tall/wide images instead of padding everything into squares. 100 images at 512x512 or 768x768 squares are a lot more to work through than the same images with half cut off because it's "empty space". Just remember to set the bucket options accordingly so you don't get extra cropping or weird bucketing.

Second, the optimizer. If you're on lower specs you're probably already familiar with AdamW8bit, but there's a more "optimized" version called PagedAdamW8bit.
For me, with AdamW8bit I had to run training with both "gradient checkpointing" and "memory efficient attention", and training ran at a constant 5.8GB. When generating sample images it would then offload to make room, which caused a noticeable lag; not major, but it was there.
With the Paged version I can drop "memory efficient attention", training runs at 4.4-4.6GB, and it only spikes up for sample images without offloading or "reorganizing" any memory to do it. So everything stays in memory, which speeds up sampling too (not that sampling is a feature you need). But because memory efficient attention is no longer needed and there's still VRAM to spare, everything gets faster.
(I haven't tested whether keeping memory efficient attention would allow +1 batch and whether that would be faster overall; I doubt it in my case though.)

Third, a small thing: cache latents. It's generally checked by default in the kohya_ss GUI, but for some reason I've seen guides/training configs turn it off. It might be because it keeps all the latents in memory and people haven't really thought about what that means. As a simple explanation: you keep an encoded "version" of each image in memory instead of reading and re-encoding it every time, where "every time" means every repeat and every epoch. Unlike most of the other "keep data in memory" options, this is fairly small, mostly less than 100KB per image, so unless you're going way overboard with the number of images, this should not be the reason you OOM, and it does speed things up. Your mileage will obviously vary with your system. Don't "cache to disk" though.

As a final note, I know these things worked very well for me on my low-spec, old machine. In theory they should work for others as well, but the impact will obviously depend on what's running them.
If you're already running at full speed they're unlikely to do much, but as I mentioned earlier, if they mean you can increase your batch count by 1 or more, that would make things faster even on those systems.

And if anyone bothered reading all this, Hi...
As a follow-up to this: it turns out it is just barely possible for me to disable gradient checkpointing as well, which increases speed further.
Now I've gotten down to 1.14 to 1.16 s/it; the VRAM spike from sampling pushes it up to 1.2 to 1.24 s/it, but sampling can be disabled if needed/wanted.
This will probably depend on model size etc., as it's very borderline, but it cuts off about 1/3 of the time and I'm closing in on seeing it/s.
Any further improvement will probably depend on code or driver changes, and tbh I don't think Nvidia's focus is on improving cards as old as mine :p
 
  • Like
Reactions: Mr-Fox and Sepheyer

me3

Member
Dec 31, 2016
316
708
Follow up #2
Part of testing/science/learning is making mistakes, being wrong, etc. and learning from it.
So it seems I was wrong in my previous post: you don't need a code or driver update to speed things up further :p
I'm currently running at a very stable 1.08 s/it; for a short while it even ran at a speed where the readout kept flipping back and forth between s/it and it/s.
The only thing I changed was the network rank and alpha. I don't know if it's down to both or just one of them, but testing is ongoing.
I know this is probably uninteresting for most people, but I'm basically running training at 1-second iterations on an almost 7.5-year-old 6GB card, and it's currently using just 5.4GB, which includes whatever the OS etc. is still using.
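For anyone driving sd-scripts directly rather than the GUI, those two knobs are (as far as I know) the --network_dim and --network_alpha options; the values below are purely illustrative, not the ones used here:

rem lower rank/alpha means smaller LoRA weights to update, so less VRAM and faster steps
--network_dim=16 --network_alpha=8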
 
  • Like
Reactions: VanMortis

sharlotte

Member
Jan 10, 2019
268
1,436
Been away for a bit, and a couple of days ago started using ComfyUI as much as possible. Still testing a lot of flows out there or creating my own, making sure I understand what the various steps and settings actually do. I find it great so far at creating objects, nature... but really awful at creating faces. I haven't read the thread for a while, so I will be going (slowly) over the past few (dozens of) pages.
Meanwhile, here is some of the stuff I generated (for anyone wondering, I've been playing BG3 lately); as usual, the flow is inside.
ComfyUI_00003_.png ComfyUI_00004_.png ComfyUI_00005_.png ComfyUI_00010_.png ComfyUI_00012_.png ComfyUI_00013_.png ComfyUI_00019_.png ComfyUI_00018_.png
 

me3

Member
Dec 31, 2016
316
708
Been away for a bit, and a couple of days ago started using ComfyUI as much as possible. Still testing a lot of flows out there or creating my own, making sure I understand what the various steps and settings actually do. I find it great so far at creating objects, nature... but really awful at creating faces. I haven't read the thread for a while, so I will be going (slowly) over the past few (dozens of) pages.
Meanwhile, here is some of the stuff I generated (for anyone wondering, I've been playing BG3 lately); as usual, the flow is inside.
View attachment 2947610 View attachment 2947611 View attachment 2947612 View attachment 2947613 View attachment 2947614 View attachment 2947615 View attachment 2947617 View attachment 2947616
I think I've figured out why you're having problems with faces, you've forgotten something: the skin ;)
 

Mr-Fox

Well-Known Member
Jan 24, 2020
1,401
3,793
Follow up #2
Part of testing/science/learning is making mistakes, being wrong, etc. and learning from it.
So it seems I was wrong in my previous post: you don't need a code or driver update to speed things up further :p
I'm currently running at a very stable 1.08 s/it; for a short while it even ran at a speed where the readout kept flipping back and forth between s/it and it/s.
The only thing I changed was the network rank and alpha. I don't know if it's down to both or just one of them, but testing is ongoing.
I know this is probably uninteresting for most people, but I'm basically running training at 1-second iterations on an almost 7.5-year-old 6GB card, and it's currently using just 5.4GB, which includes whatever the OS etc. is still using.
Try using a little token merging in the Optimizations settings. 0.2 is fine for "Token merging ratio", and 0.08-ish is fine for "Negative Guidance minimum sigma" in my experience. You can of course experiment and try higher settings.
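If you'd rather set these outside the UI, they appear to map to entries like the ones below in the webui's config.json. I'm going from memory on the key names, so double-check them against your own file before editing:

"token_merging_ratio": 0.2,
"token_merging_ratio_hr": 0.2,
"s_min_uncond": 0.08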
 
  • Like
Reactions: Sepheyer

me3

Member
Dec 31, 2016
316
708
Try using a little token merging in the Optimizations settings. 0.2 is fine for "Token merging ratio", and 0.08-ish is fine for "Negative Guidance minimum sigma" in my experience. You can of course experiment and try higher settings.
It didn't really have an effect on my generation speed unfortunately; maybe it's more apparent at high resolution, with upscaling, or with ControlNet involved. Also, the effect it had on the images I was generating at the time was a bit "unfortunate".

As a side note, since I can't use xformers at the moment because something is wrong in the code/setup, I'm forced to use SDP, so it might work better with xformers.
 
  • Like
Reactions: Mr-Fox

Mr-Fox

Well-Known Member
Jan 24, 2020
1,401
3,793
It didn't really have an effect on my generation speed unfortunately; maybe it's more apparent at high resolution, with upscaling, or with ControlNet involved. Also, the effect it had on the images I was generating at the time was a bit "unfortunate".

As a side note, since I can't use xformers at the moment because something is wrong in the code/setup, I'm forced to use SDP, so it might work better with xformers.
It has made a big difference for me, not only when using hires fix but for "normal" generations as well. I don't have any data to show right now, but I know it has cut down my generation times significantly.
 

Sharinel

Active Member
Dec 23, 2018
508
2,103
This might be of interest to some people.
TestMerge.jpg

The above pic shows the same prompt/seed combination using 2 different checkpoints.
The left-hand pic uses Dreamshaper 8, while the right-hand one uses EpicRealism.
The one in the middle uses both: it starts off with Dreamshaper, then uses the Refiner option in Automatic1111 to morph into EpicRealism partway through. You can get some interesting outcomes depending on how you do the merging.
1695491277070.png

Prompt is "beautiful female standing next to desk wearing __CC_female_clothing_set_business__****, deep cleavage, photorealistic, wide hips, closeup, textured skin, skin pores, looking down at camera, thicc thighs, gigapixel, 8k, cinematic, fov 60 photo of perfecteyes eyes, perfecteyes eyes, <lora:more_details:1> <lora:GoodHands-beta2:1>

**** This is a wildcard; it came out as "Trousers and boat neck top".
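For anyone who hasn't used wildcards: assuming the usual Dynamic Prompts-style setup, __CC_female_clothing_set_business__ just points at a plain text file named CC_female_clothing_set_business.txt in the extension's wildcards folder, and one line from it is picked at random per generation. The lines below are made-up examples, not the actual file:

trousers and boat neck top
pencil skirt and silk blouse
tailored blazer and slacks
pinstripe suit dress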