[Stable Diffusion] Prompt Sharing and Learning Thread

hkennereth

Member
Mar 3, 2019
223
725
And what happens if I force my potato PC to operate Stable Diffusion?
With the specs you described, my guess as to what will happen is "not much", as in "it will not run". Or, if you do manage to make it run (by turning off GPU rendering and using the CPU instead, for example), at best it will be so incredibly slow that you'll give up on it within a couple of days; even the best CPUs take between 10 and 30 minutes to render a single image. But as Sepheyer pointed out, you are free to try.

Browser AI image generation is a little limited, but is there an alternative to Stable Diffusion, that I can download and use offline?
Yeah, it's limited, but not as much as you imagine. All the alternatives I suggested ARE Stable Diffusion, just running on a server instead of on your machine; some of them even use exactly the same software you would otherwise install locally. Those are the limitations one must accept when their options are limited: there is no magical way to make a hardware-hungry application like AI image generation run on an old machine.
 
Last edited:

felldude

Member
Aug 26, 2017
460
1,416
Browser AI image generation is a little limited, but is there an alternative to Stable Diffusion, that I can download and use offline? And what happens if I force my potato PC to operate Stable Diffusion?
It's entirely possible to generate images locally in 10-30 seconds (or a batch of 50 in seconds with a $10k processor).
Intel guide.

But most builds are optimized for BF16 math, and that is a no-go for CPUs.
Search for OpenVINO and IPEX, or hope someone makes a CPU-optimized build like was done for AMD with DirectML and ONNX.

on a processor (it's a $10k processor, but it was used for training)

It might be worth a shot to try ComfyUI CPU-only with this build:

" python -m pip install intel_extension_for_pytorch -f "
 

devilkkw

Member
Mar 17, 2021
279
943
I've got an RTX3060 12GB, so I don't have to worry as much about it hitting the vRAM limit and failing over into system RAM (and yeah, that sure as hell slows it down!). Unfortunately A1111 is so poorly optimised that it's running a couple of GB in vRAM even when idling. I don't think it's the Checkpoint as even when I've loaded a 7GB one the vRAM shows as ~2GB.
What about the driver version?
 
  • Like
Reactions: Jimwalrus

Jimwalrus

Active Member
Sep 15, 2021
816
3,060
What about the driver version?
Last but one I think, whichever that is. Had a pop-up for a new one the other day but ignored it.
It's definitely one of the newer drivers that will spill over into system RAM if vRAM is full.
 

devilkkw

Member
Mar 17, 2021
279
943
What slows generation down in A1111 is pushing the image out. From what I see, generation itself is fast, but when it's time to push the image out it slows down; this is why I'm on the 532.03 driver.
But with 12GB it seems strange that you have those problems.
 

Jimwalrus

Active Member
Sep 15, 2021
816
3,060
What slows generation down in A1111 is pushing the image out. From what I see, generation itself is fast, but when it's time to push the image out it slows down; this is why I'm on the 532.03 driver.
But with 12GB it seems strange that you have those problems.
"Slow" may be relative. Just over 3mins for an image at ~900x~1300 resolution, doing all the upscaling in HiRes fix.
Dunno, maybe I'm greedy and want lots of images immediately!!
 
  • Like
Reactions: devilkkw

me3

Member
Dec 31, 2016
316
708
What PC specs would I need to run Stable Diffusion? I have 16GB RAM and a GT 1030 (2GB VRAM).
Since I started out with, and occasionally still use, a 1050 2GB: yes, it's possible to run both A1111 and ComfyUI on just 2GB VRAM, but you'll need patience.
The 1030 seems to have slower memory than the 1050, so times will likely be slower for you. Also, I'm running it on Ubuntu, so I'm not sure how it would work on Windows.
The thing is, you'd probably need a second device of some kind that can run a browser, and use that to access the "UI" remotely while the SD setup runs on the 1030 computer. The reason is that having the browser run on the same computer will eat up VRAM and more than likely freeze things, especially during model loading. In my case, without a browser window running, PyTorch gets about 1.5-1.7GB to work with; just having a browser running drops that down to almost 1.1GB.
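If you want to check what you're actually left with, something like this (just an illustration; it needs a CUDA-enabled PyTorch build and a visible GPU) prints what PyTorch sees:

# Quick check of how much VRAM PyTorch gets to work with; numbers will vary with driver and browser load.
import torch

free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1e9:.2f} GB / total: {total / 1e9:.2f} GB")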

Both A1111 and ComfyUI need --listen as a launch argument so they can be accessed over the network.
A1111 takes about 45-55s to generate a 512x512 image at 20 steps; ComfyUI takes about 35-40s. (Using a tiled VAE node and the minimum tile size, ComfyUI can at least get to a combined width+height of about 1500px without OOM, possibly higher, at 1m 20s for 20 steps.)
A1111 really doesn't like you switching models, no joke: it easily takes 15-20 minutes, and it doesn't matter if the model is 2GB or 6GB, the time is pretty much the same. So with that setup, you pick a model and stick with it.
 
  • Like
Reactions: Sepheyer

Falmino

New Member
Dec 27, 2022
6
0
It's entirely possible to generate images locally in 10-30 seconds (or a batch of 50 in seconds with a $10k processor).
Intel guide.
Thanks, but I don't understand the guide and I'm not that knowledgeable about PC specs and hardware.
Also, thanks Sepheyer and hkennereth.

Also, I am just experimenting with AI-made art, specifically something I can use as a desktop app rather than in the browser, because of limitations and paywalls. I also value the time it takes the AI to generate an image, because I believe AI image generation is more or less trial and error on the prompt. I don't really have a good PC, so I guess I'm just stuck using the browser, unless something can be done. Thanks though. :)
 

felldude

Member
Aug 26, 2017
460
1,416
Thanks, but I don't understand the guide and I'm not that knowledgeable about PC specs and hardware.
Also, thanks Sepheyer and hkennereth.

Also, I am just experimenting with AI-made art, specifically something I can use as a desktop app rather than in the browser, because of limitations and paywalls. I also value the time it takes the AI to generate an image, because I believe AI image generation is more or less trial and error on the prompt. I don't really have a good PC, so I guess I'm just stuck using the browser, unless something can be done. Thanks though. :)
The short version is you could have a setup using any of those that could generate images in seconds locally. (Some of those processors are 150-300 US dollars.)

You could try this. It might speed up generation even with a 3rd-generation processor, as they are made to handle FP16 and FP32 math faster than the BF16 used by graphics cards.

Unfortunately, the IPEX builds are all for Linux.
 
Last edited:
  • Like
Reactions: devilkkw

Synalon

Member
Jan 31, 2022
187
616
Random Halloween picture idea I had; it was supposed to be on a gothic castle balcony at night. It's not bad, but it's also not what I wanted.

If anybody else wants to edit it, feel free. I was thinking of turning it into a landscape style, but I don't have the patience.

I've added a large version I upscaled in zip format since it was too large for F95.

Halloween Small.jpg
 

Falmino

New Member
Dec 27, 2022
6
0
The short version is you could have a setup using any of those that could generate images in seconds locally. (Some of those processors are 150-300 US dollars.)

You could try this. It might speed up generation even with a 3rd-generation processor, as they are made to handle FP16 and FP32 math faster than the BF16 used by graphics cards.

Unfortunately, the IPEX builds are all for Linux.
Ummm, I dunno about that, but my CPU is an Intel i5-8400 @ 2.80GHz.
 

Gtg Gtg

Member
Feb 24, 2018
105
38
Hi everyone, I was wondering if anybody knows the kohya_ss settings for getting a realistic person: face, body, everything? I've tried many methods for "realism" online, but they seem to fall short. I have a dataset of literally 2000+ HQ images; some are professional shots, while some were taken with phone cameras.
My current settings are here:
Batch 16 : Epoch 20 : Clip Skip 1 : all Learning Rates @ 0.0001 : Cosine Adafactor : No Warmup : 128 network/ 128 alpha
Folder at 5 repeats, e.g. 5_name.

The batch size of 16 was explained to me like this: since some of the images in my dataset are similar, I can make it 8, 16, or even 32.

128 network so that I capture a lot of detail; 128 alpha I'm not too sure about, but I heard that the lower it is, the more flexible the AI?
 

devilkkw

Member
Mar 17, 2021
279
943
Hi everyone, I was wondering if anybody knows the kohya_ss settings for getting a realistic person: face, body, everything? I've tried many methods for "realism" online, but they seem to fall short. I have a dataset of literally 2000+ HQ images; some are professional shots, while some were taken with phone cameras.
My current settings are here:
Batch 16 : Epoch 20 : Clip Skip 1 : all Learning Rates @ 0.0001 : Cosine Adafactor : No Warmup : 128 network/ 128 alpha
Folder at 5 repeats, e.g. 5_name.

The batch size of 16 was explained to me like this: since some of the images in my dataset are similar, I can make it 8, 16, or even 32.

128 network so that I capture a lot of detail; 128 alpha I'm not too sure about, but I heard that the lower it is, the more flexible the AI?
For a character I get good results using a realistic model for training, default learning rates, 1 epoch, 120 repeats (120_name), 96 alpha, and a network size based on the number of images (add 10 for every image; for example, if I use 8 images, my network size is 80).
Also, if you want to train facial features, use fewer images (2000 is way too many), ones where the subject's face is visible and viewed from different angles. Remember to do good tagging.
I usually use no more than 12 images, and the results are pretty good. Using images with various backgrounds gives better results.
 
Last edited:

me3

Member
Dec 31, 2016
316
708
Hi everyone, I was wondering if anybody knows the kohya_ss settings for getting a realistic person: face, body, everything? I've tried many methods for "realism" online, but they seem to fall short. I have a dataset of literally 2000+ HQ images; some are professional shots, while some were taken with phone cameras.
My current settings are here:
Batch 16 : Epoch 20 : Clip Skip 1 : all Learning Rates @ 0.0001 : Cosine Adafactor : No Warmup : 128 network/ 128 alpha
Folder at 5 repeats, e.g. 5_name.

The batch size of 16 was explained to me like this: since some of the images in my dataset are similar, I can make it 8, 16, or even 32.

128 network so that I capture a lot of detail; 128 alpha I'm not too sure about, but I heard that the lower it is, the more flexible the AI?
If you're training just one person you don't need many images; 15-30 is more than enough. Not sure what those 2k-plus images are, but you really don't want to use that many. If they are all of the same person, pick the ones that are most "clear", where you don't have things like shadows across the face or weird facial expressions that can cause deformations in the trained face. If the images are the same person with different hair styles, you can split them into different folders, name them accordingly, and have different triggers.

Assuming you're using a standard LoRA, which you more than likely should be.
Learning rates, repeats, and epochs will depend on what optimizer you use. With something like AdamW you want more repeats and fewer epochs, i.e. with 20 images you'd have something like 40-60 repeats and 2-3 epochs. It might be fine with just 1, but that depends on the images, the subject, etc. You'll need some trial and error anyway, so running it for an extra epoch isn't that big a deal.
If you use something from the DAdaptation/Prodigy "family" you want fewer repeats and more epochs. Generally that might mean just 1-2 repeats and a lot of epochs. You also need a bunch of optimizer arguments.

Rank and alpha. While it's true that rank determines the "size" of the LoRA, just think about it logically: a full model contains an insane amount of "things" and is only about 2GB. Even if LoRAs were far less optimized, one "simple" person should not need 128MB.
Having done A LOT of testing recently trying to figure out what's broken with kohya, one of the things I've found is that you can train a realistic-quality "person" perfectly fine at rank 8. Training was failing badly for me, and for no other reason than to "just try" I dropped the rank from 32 to 8, and likeness and quality increased drastically. Higher rank seems to allow too much additional data to be picked up and "contaminate" the training: too much junk to clear out to keep the details you are interested in. Observed result more than documented fact, but still.
Alpha is more of a limiter on learning, but I haven't really found a consistently positive or negative effect from it so far.

You didn't specify whether this is for SD1.5 or SDXL; that has a huge impact on learning rates.
If it's for SD1.5, my suggested starting setup would be to pick 20-30 images and use simple BLIP-style captioning (long descriptive text, not comma-separated tags). Around 40 repeats and 3 epochs, saving each epoch for testing so you can see the progress. Use a constant learning rate and AdamW/AdamW 8-bit for simplicity. Network rank and alpha at 8 and 1 to start with.
For the learning rate, batching is a bit of a variable. If you use batching you often have to increase your learning rate by a similar factor; I'm not sure if it's completely linear, but given you can run batch 16 it shouldn't take you long to run this even at batch 1.
So use 0.0001 as the learning rate and 0.00005 for the text encoder rate, and run it at batch 1 just to start. Given it would take me over an hour to do this, it should be far, far less for you.
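Just to put rough numbers on that suggestion (the linear scaling of learning rate with batch size is my assumption, as said above, not a documented kohya rule):

# Rough arithmetic for the suggested SD1.5 starting setup; LR scaling with batch size is an assumption.
images = 25          # picked images (20-30 range)
repeats = 40
epochs = 3
batch_size = 1

steps_per_epoch = images * repeats // batch_size
total_steps = steps_per_epoch * epochs
print(total_steps)                     # 25 * 40 * 3 = 3000 steps at batch 1

base_unet_lr, base_text_lr = 1e-4, 5e-5
print(base_unet_lr * batch_size,       # scale both up roughly in step with batch size
      base_text_lr * batch_size)       # if you later raise the batch size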

Not sure what model you're planning to train on, but it seems like some "people" don't train well on some models, so even if you get good results on one model with one person, it might not work as well for another. I've gotten good results on both Deliberate and CyberRealistic, but sometimes it's worked well on just one of them and failed completely on the other.

Lots of text, and I might have forgotten something, but hopefully nothing too important.

(edit, knew there was something...)
For realistic images you might need to use regularization images, but try without them first, as there's a bit of work involved in using them. You can deal with that later if need be; it's more of a final tweak than an absolute need.

Random test image from one of my test trainings, just to keep people's attention. 00452-3065532086.png
 
Last edited:

sharlotte

Member
Jan 10, 2019
257
1,351
For LoRA training on SDXL, I've stuck to the tips indicated on the Kohya GitHub page, which are to use the Adafactor optimizer:
1698591861750.png

Has anyone else tried something different that worked?
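For anyone who can't see the screenshot, the Adafactor part of that recommendation looks roughly like the following (reproduced from memory, so double-check the exact lines and any learning-rate values against the sd-scripts docs before relying on them):

# Approximate Adafactor settings from the sd-scripts SDXL notes (from memory; verify upstream)
optimizer_type = "adafactor"
optimizer_args = ["scale_parameter=False", "relative_step=False", "warmup_init=False"]
lr_scheduler = "constant_with_warmup"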
 
  • Like
Reactions: VanMortis

Gtg Gtg

Member
Feb 24, 2018
105
38
If you're training just one person you don't need many images; 15-30 is more than enough. Not sure what those 2k-plus images are, but you really don't want to use that many. If they are all of the same person, pick the ones that are most "clear", where you don't have things like shadows across the face or weird facial expressions that can cause deformations in the trained face. If the images are the same person with different hair styles, you can split them into different folders, name them accordingly, and have different triggers.

Assuming you're using a standard LoRA, which you more than likely should be.
Learning rates, repeats, and epochs will depend on what optimizer you use. With something like AdamW you want more repeats and fewer epochs, i.e. with 20 images you'd have something like 40-60 repeats and 2-3 epochs. It might be fine with just 1, but that depends on the images, the subject, etc. You'll need some trial and error anyway, so running it for an extra epoch isn't that big a deal.
If you use something from the DAdaptation/Prodigy "family" you want fewer repeats and more epochs. Generally that might mean just 1-2 repeats and a lot of epochs. You also need a bunch of optimizer arguments.

Rank and alpha. While it's true that rank determines the "size" of the LoRA, just think about it logically: a full model contains an insane amount of "things" and is only about 2GB. Even if LoRAs were far less optimized, one "simple" person should not need 128MB.
Having done A LOT of testing recently trying to figure out what's broken with kohya, one of the things I've found is that you can train a realistic-quality "person" perfectly fine at rank 8. Training was failing badly for me, and for no other reason than to "just try" I dropped the rank from 32 to 8, and likeness and quality increased drastically. Higher rank seems to allow too much additional data to be picked up and "contaminate" the training: too much junk to clear out to keep the details you are interested in. Observed result more than documented fact, but still.
Alpha is more of a limiter on learning, but I haven't really found a consistently positive or negative effect from it so far.

You didn't specify whether this is for SD1.5 or SDXL; that has a huge impact on learning rates.
If it's for SD1.5, my suggested starting setup would be to pick 20-30 images and use simple BLIP-style captioning (long descriptive text, not comma-separated tags). Around 40 repeats and 3 epochs, saving each epoch for testing so you can see the progress. Use a constant learning rate and AdamW/AdamW 8-bit for simplicity. Network rank and alpha at 8 and 1 to start with.
For the learning rate, batching is a bit of a variable. If you use batching you often have to increase your learning rate by a similar factor; I'm not sure if it's completely linear, but given you can run batch 16 it shouldn't take you long to run this even at batch 1.
So use 0.0001 as the learning rate and 0.00005 for the text encoder rate, and run it at batch 1 just to start. Given it would take me over an hour to do this, it should be far, far less for you.

Not sure what model you're planning to train on, but it seems like some "people" don't train well on some models, so even if you get good results on one model with one person, it might not work as well for another. I've gotten good results on both Deliberate and CyberRealistic, but sometimes it's worked well on just one of them and failed completely on the other.

Lots of text, and I might have forgotten something, but hopefully nothing too important.

(edit, knew there was something...)
For realistic images you might need to use regularization images, but try without them first, as there's a bit of work involved in using them. You can deal with that later if need be; it's more of a final tweak than an absolute need.

Random test image from one of my test trainings, just to keep people's attention. View attachment 3040533
Thanks for the info, I'll try out your setup and do a test. So now I need to go through the 2000+ images and choose the best 30? Or the best 30 while having variation, like different poses, lighting conditions and whatnot? The person I'm trying to replicate is a cosplayer, so I thought I would need as much data as possible to get the cosplays down, but I guess not.
 

hkennereth

Member
Mar 3, 2019
223
725
Thanks for the info, I'll try out your setup and do a test. So now I need to go through the 2000+ images and choose the best 30? Or the best 30 while having variation, like different poses, lighting conditions and whatnot? The person I'm trying to replicate is a cosplayer, so I thought I would need as much data as possible to get the cosplays down, but I guess not.
As someone who has trained plenty of Dreambooth and LoRA models based on cosplayers, I can advise you to stick with the most natural-looking pictures and avoid the actual cosplay ones. You want variety, yes, but of lighting conditions, poses, environment, etc. If you use pictures with heavy makeup or clothing that occludes too much of their natural body shape, you will have a hard time getting consistency from the model. The point is to teach the model how to draw that person, not how they look in cosplay; that you'll get from prompting.

Also avoid images that are too dark, too low resolution, too grainy, etc. You want high quality, not high volume. You'll need images cropped to 512px square if you're training for Stable Diffusion 1.5 models, and 1024px square if you're training for SDXL, so make sure the pictures have a big enough resolution that when you crop the desired area from them, the result is already equal to or larger than those sizes; in other words, NEVER UPSCALE your source images.
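If it helps, something like this little script is what I mean by cropping rather than upscaling (just an illustration: the folder names are placeholders, and I'm center-cropping for simplicity, whereas you'd really crop wherever the subject actually is):

# Illustrative only: square-crop source photos to 1024px for SDXL (use 512 for SD1.5),
# downscaling larger crops and skipping anything that would need upscaling.
import os
from PIL import Image

SRC, DST, TARGET = "raw_photos", "dataset", 1024   # placeholder paths and target size
os.makedirs(DST, exist_ok=True)

for name in os.listdir(SRC):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    img = Image.open(os.path.join(SRC, name))
    side = min(img.size)
    if side < TARGET:                  # too small: never upscale, just skip it
        continue
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    crop = img.crop((left, top, left + side, top + side)).resize((TARGET, TARGET))
    crop.save(os.path.join(DST, name))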

And yes, you probably want to stick with about 30-50 pictures total, more or less in a proportion of 60% headshots, 35% medium shots (waist up), and 15% full-body shots -- that's a good ratio in my experience.
 
  • Like
Reactions: Gtg Gtg and me3

me3

Member
Dec 31, 2016
316
708
...
And yes, you probably want to stick with about 30-50 pictures total, more or less in a proportion of 60% headshots, 35% medium shots (waist up), and 15% full-body shots -- that's a good ratio in my experience.
If it's a bit of a problem finding images of one "type", you can help restore that balance by splitting the images into different folders and adjusting the repeats. That said, you probably want to avoid repeating just 1-2 images too many times compared to the rest, as you might end up locked into that exact image.
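As a made-up example of what I mean by folder-based balancing (the "myperson" name and the counts are placeholders, not a recipe):

# Hypothetical kohya-style dataset layout; repeats x image count keeps each "type"
# contributing a similar number of steps per epoch.
train_data/
  4_myperson headshot/     # 24 headshots  x 4 repeats  = 96 steps/epoch
  8_myperson upper body/   # 12 mid shots  x 8 repeats  = 96 steps/epoch
  16_myperson full body/   #  6 full body  x 16 repeats = 96 steps/epoch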
 

hkennereth

Member
Mar 3, 2019
223
725
If it's a bit of a problem finding images of one "type", you can help restore that balance by splitting the images into different folders and adjusting the repeats. That said, you probably want to avoid repeating just 1-2 images too many times compared to the rest, as you might end up locked into that exact image.
Indeed. If you do want to use the same image, for example, for full body and face crop, I would recommend flipping one of them horizontally, which tends to help the training process see it as something new. Of course that is more helpful if the person is shot from the side or at 3/4, but it won't help much if they are directly facing the camera. But I would avoid that unless strictly necessary.