[Stable Diffusion] Prompt Sharing and Learning Thread

DrPepper808

Newbie
Dec 7, 2021
78
45
Yes, CUI means ComfyUI.
Actually I'm not at my PC, but if you start using it I suggest adding ComfyUI Manager. You can find the download link by searching for it on Google.
If you like it and want to try it, I will share my workflow for removing backgrounds when I'm back at my PC.
I watched 3 or 4 vids on it last night; it's amazing, and a little overwhelming. LOL
 

felldude

Active Member
Aug 26, 2017
505
1,500
I watched 3 or 4 vids on it last night; it's amazing, and a little overwhelming. LOL
It can be overwhelming when watching people with master-crafted workflows, but the default workflow with one or two LoRAs is simple to learn.

I only recently got xFormers working for Auto1111; without it, my 1024x1024 XL generations took 5 minutes vs. 20 seconds.
That was the only reason I switched to Comfy, but now I only use Auto for text-to-3D or some of the other features.

Auto1111 will likely have longer support and more modules built for it, since Hugging Face made Gradio the building block for every WebUI they use, but I would find it hard to go back to Auto1111 now.
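If xFormers doesn't seem to kick in, a quick sanity check I'd run from inside the Auto1111 venv (just a sketch; adjust to your own install) is to confirm it actually imports and sees the GPU before the --xformers launch flag can do anything:

# Run inside the webui's venv; if this import fails, the speedup never happens.
import torch
import xformers
import xformers.ops

print("xformers:", xformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())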
 

devilkkw

Member
Mar 17, 2021
303
1,033
I watched 3 or 4 vids on it last night; it's amazing, and a little overwhelming. LOL
This is a simple workflow I use for removing backgrounds:
kkw-RemBg.png

And this is the result (with the workflow embedded inside):
kkw-alphaTest-_00001_.png

Just drag your image into the image loader and queue the prompt.
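If you prefer to script it outside ComfyUI, this is roughly the same idea using the rembg package (untested sketch; the file names are just placeholders, not the nodes from the workflow above):

# Stand-alone equivalent of a remove-background node chain, using rembg
# (pip install rembg) instead of ComfyUI.
from rembg import remove
from PIL import Image

img = Image.open("input.png")     # the image you would drop into the Load Image node
cut = remove(img)                 # returns an RGBA image with a transparent background
cut.save("output_rgba.png")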



It can be overwhelming when watching people with master-crafted workflows, but the default workflow with one or two LoRAs is simple to learn.

I only recently got xFormers working for Auto1111; without it, my 1024x1024 XL generations took 5 minutes vs. 20 seconds.
That was the only reason I switched to Comfy, but now I only use Auto for text-to-3D or some of the other features.

Auto1111 will likely have longer support and more modules built for it, since Hugging Face made Gradio the building block for every WebUI they use, but I would find it hard to go back to Auto1111 now.
From what I've seen, Forge seems to have better memory management than A1111. I totally switched to CUI because of what you can do with it, like experimenting with multiple samplers in one image, applying LoRAs at different times, etc.
But I keep my favorite A1111 version around for testing new LoRAs or models I make; if I share something, I want to check it in both UIs.
 

felldude

Active Member
Aug 26, 2017
505
1,500
From what I've seen, Forge seems to have better memory management than A1111. I totally switched to CUI because of what you can do with it, like experimenting with multiple samplers in one image, applying LoRAs at different times, etc.
But I keep my favorite A1111 version around for testing new LoRAs or models I make; if I share something, I want to check it in both UIs.
I have not used Forge; keeping Auto1111, Comfy, and Kohya all in separate venvs already takes up enough space, with 98% of the files the same and the 2% difference making them incompatible.

You have #coder in your signature; have you gotten DeepSpeed to work on Windows?
The precompiled one fails for me, and when I compiled it with Ninja it corrupted my CUDA files. I tried multiple CUDA SDKs and fixed the reference to the Linux time.h.
 

Sharinel

Active Member
Dec 23, 2018
506
2,095
I have not used Forge; keeping Auto1111, Comfy, and Kohya all in separate venvs already takes up enough space, with 98% of the files the same and the 2% difference making them incompatible.
I'm using StableSwarmUI at the moment as it has SD3 compatibility; I might swap over from Forge. It's an Auto1111-style UI built on top of Comfy.
 

felldude

Active Member
Aug 26, 2017
505
1,500
I'm using StableSwarmUI at the moment as it has SD3 compatibility; I might swap over from Forge. It's an Auto1111-style UI built on top of Comfy.
I'm using a custom build of Comfy that I built with TensorFlow RT .dlls; other than a round-off error I have no issues.

Screenshot 2024-06-12 125008.jpg

I'm not sure it is actually speeding things up, because no one posts their it/s or secs per it... lol

For 1024x1024 I am at 1.0-1.5 it/s with most samplers (not Heun).
For 2048x2048 I am at 4.5 secs per it.

This is a native render, not hires fix or SEGS, which break the image down.
Oh, and my motherboard is only PCIe 3.0, so I am at half bandwidth, but I'm not sure it matters.
2,560 CUDA cores on an RTX 3050.
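Rough napkin math if anyone wants to compare apples to apples (20 steps assumed; VAE decode and model load ignored):

# Sampler speed -> rough per-image time.
def seconds_per_image(steps: int, its_per_sec: float) -> float:
    return steps / its_per_sec

print(seconds_per_image(20, 1.5), seconds_per_image(20, 1.0))  # 1024x1024: ~13-20 s of sampling
print(20 * 4.5)                                                # 2048x2048 at 4.5 s/it: 90 s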
 

Sharinel

Active Member
Dec 23, 2018
506
2,095
I'm using a custom build of Comfy that I built with TensorFlow RT .dlls; other than a round-off error I have no issues.

View attachment 3729886

I'm not sure it is actually speeding things up, because no one posts their it/s or secs per it... lol

For 1024x1024 I am at 1.0-1.5 it/s with most samplers (not Heun).
For 2048x2048 I am at 4.5 secs per it.

This is a native render, not hires fix or SEGS, which break the image down.
Oh, and my motherboard is only PCIe 3.0, so I am at half bandwidth, but I'm not sure it matters.
2,560 CUDA cores on an RTX 3050.
It doesn't tell me any of that on StableSwarm, I'm afraid; this is for 1024x1024:
1718212529999.png
That's about the closest.
And on the StableSwarm page it shows on the right how long it took:
1718212610379.png
 

DrPepper808

Newbie
Dec 7, 2021
78
45
This is awesome. I feel like I'm 12 years old at Christmas... LOL
So, advice followed.
Running CUI and Manager. If the next few days of reading don't kill me, I should have more questions. :p
 

DrPepper808

Newbie
Dec 7, 2021
78
45
This is a simple workflow I use for removing backgrounds:
View attachment 3729751

And this is the result (with the workflow embedded inside):
View attachment 3729750

Just drag your image into the image loader and queue the prompt.


From what I've seen, Forge seems to have better memory management than A1111. I totally switched to CUI because of what you can do with it, like experimenting with multiple samplers in one image, applying LoRAs at different times, etc.
But I keep my favorite A1111 version around for testing new LoRAs or models I make; if I share something, I want to check it in both UIs.
The good news is I have decades of workflow experience, so the CUI GUI is very similar to other tools I have used. The 200k different options are the hard part. :p
 

felldude

Active Member
Aug 26, 2017
505
1,500
It doesn't tell me any of that on StableSwarm, I'm afraid; this is for 1024x1024:
View attachment 3729914
That's about the closest.
And on the StableSwarm page it shows on the right how long it took:
View attachment 3729916
For me it's:
25-30 seconds on average per image at 1024x1024 (assuming 20 steps), including the VAE encode time.
Around 2 minutes to natively render a 2048x2048.

It looks like you're using a batch size of 4 (I only use a batch size of 4 or 8 on SD, as it's out of range of my 8GB card for XL).

And if I did the math right, you're at 115 seconds per image.
Could the batch size be slowing down the generation?
 

Sharinel

Active Member
Dec 23, 2018
506
2,095
For me it's:
25-30 seconds on average per image at 1024x1024 (assuming 20 steps), including the VAE encode time.
Around 2 minutes to natively render a 2048x2048.

It looks like you're using a batch size of 4 (I only use a batch size of 4 or 8 on SD, as it's out of range of my 8GB card for XL).

And if I did the math right, you're at 115 seconds per image.
Could the batch size be slowing down the generation?
Your maths might be a bit off; it's showing 17 secs or so for each generation. It starts at 18:10:06 and the last one kicks off at 18:10:58. I have a 4090, so it's certainly a lot faster than 115 secs.
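Working it through from the timestamps rather than from it/s (batch of 4 taken from the screenshot above):

# Per-image time from the queue timestamps: 4 generations, kick-offs from 18:10:06 to 18:10:58.
from datetime import datetime

first = datetime.strptime("18:10:06", "%H:%M:%S")
last  = datetime.strptime("18:10:58", "%H:%M:%S")
gap = (last - first).total_seconds()   # 52 s between first and last kick-off
print(gap / 3)                         # 3 gaps between 4 kick-offs -> ~17.3 s per image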
 

felldude

Active Member
Aug 26, 2017
505
1,500
Your maths might be a bit off; it's showing 17 secs or so for each generation. It starts at 18:10:06 and the last one kicks off at 18:10:58. I have a 4090, so it's certainly a lot faster than 115 secs.
EDIT: I see the upscaling process in there at 1.5x; it's not image-to-image, but it is taking some time.
I don't know how much time it's taking up, though. Let's say it's around half the time and give you an even:

128 steps in 10 seconds = 10-12.8 it/s, which is good... if the upscaling is taking that amount of time.
You're at 7.11 it/s with the upscaling.

For reference, I also did a batch size of 4; I was amazed it increased my it/s (relative to the number of steps).
But I am still around 1.0-1.2 it/s. If I were doing an unsupervised batch, I might run a batch size of 4 instead of 1 to see if it is stable, for that extra 0.2 it/s.

I didn't quite get the same level of clothing as you did... lol
ComfyUI_01375_.png ComfyUI_01386_.png
 

felldude

Active Member
Aug 26, 2017
505
1,500
Holy shit, I was looking at the new SD3.

First text encoder: standard OpenCLIP, 200MB
Second text encoder: CLIP ViT-L (used in XL), 2GB
Third text encoder: T5-XXL, 10GB

OK, I can't even fit the TEs in memory, let alone the UNet.

The pruned combined model isn't up yet, but I'm guessing it will be at least 12GB,
if not over the 16GB threshold, which would put it within reach of only about 0.1% of users.
 

devilkkw

Member
Mar 17, 2021
303
1,033
I have not used Forge; keeping Auto1111, Comfy, and Kohya all in separate venvs already takes up enough space, with 98% of the files the same and the 2% difference making them incompatible.

You have #coder in your signature; have you gotten DeepSpeed to work on Windows?
The precompiled one fails for me, and when I compiled it with Ninja it corrupted my CUDA files. I tried multiple CUDA SDKs and fixed the reference to the Linux time.h.
I'm using a custom build of Comfy that I built with TensorFlow RT .dlls; other than a round-off error I have no issues.

View attachment 3729886

I'm not sure it is actually speeding things up, because no one posts their it/s or secs per it... lol

For 1024x1024 I am at 1.0-1.5 it/s with most samplers (not Heun).
For 2048x2048 I am at 4.5 secs per it.

This is a native render, not hires fix or SEGS, which break the image down.
Oh, and my motherboard is only PCIe 3.0, so I am at half bandwidth, but I'm not sure it matters.
2,560 CUDA cores on an RTX 3050.
Holy shit, I was looking at the new SD3.

First text encoder: standard OpenCLIP, 200MB
Second text encoder: CLIP ViT-L (used in XL), 2GB
Third text encoder: T5-XXL, 10GB

OK, I can't even fit the TEs in memory, let alone the UNet.

The pruned combined model isn't up yet, but I'm guessing it will be at least 12GB,
if not over the 16GB threshold, which would put it within reach of only about 0.1% of users.
I don't understand what happened on your PC; I'm on Win11 and use Miniconda for the environments. I have A1111, Forge, and CUI, each with its own venv. I haven't had any problem getting them working: just download and extract the GitHub zip, then run. I haven't had any CUDA problems, but I remember that in the past I had to download a specific precompiled version of cuDNN, because my compiler errored when building it from source.

Speaking of speed, in CUI I get 1.49 it/s for a 1024x1280.
speed.jpg

I've seen SD3 is out, but I've also read there is some sort of censoring in it, so we need to wait for a trained model.
It seems to work well on text, but censorship is not what we want.
 

felldude

Active Member
Aug 26, 2017
505
1,500
I don't understand what happened on your PC; I'm on Win11 and use Miniconda for the environments. I have A1111, Forge, and CUI, each with its own venv. I haven't had any problem getting them working: just download and extract the GitHub zip, then run. I haven't had any CUDA problems, but I remember that in the past I had to download a specific precompiled version of cuDNN, because my compiler errored when building it from source.

Speaking of speed, in CUI I get 1.49 it/s for a 1024x1280.
View attachment 3730569

I've seen SD3 is out, but I've also read there is some sort of censoring in it, so we need to wait for a trained model.
It seems to work well on text, but censorship is not what we want.
Yeah, I have them all in venvs. My point is that the reason I haven't tried Forge or Stability is that each venv has about 90,000 files at around 15GB, and I don't want to mess with any more programs unless something groundbreaking comes along.

....

So you have the script working in Kohya. It's pretty sad they left references to Linux in the Windows build, but even following the guide to fix that, I have not been able to compile it with any version of CUDA.

The precompiled version did not work for me.

...

They did not train on the LAION-5B set; that set is facing legal issues and has been taken down.
They just say it was trained on 1 billion images and refined on 3M.



Nude is at 16k
XXX is at 33923

They have 3 versions, from 4GB up to 10.9GB for the version with the new TE.
OK, so that is how they did it... it's only 10.9GB with the TE because the TE is in FP8; to use the FP16 TE along with the CLIPs and UNet, it's around 16GB.
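Napkin math on why FP8 vs FP16 for the T5 swings it that much (the parameter counts are my rough guesses for SD3 medium, not official figures):

# Back-of-the-envelope checkpoint sizes: parameter count x bytes per parameter.
GB = 1024 ** 3

def size_gb(params_billion, bytes_per_param):
    return params_billion * 1e9 * bytes_per_param / GB

mmdit   = size_gb(2.0, 2)   # ~2B MMDiT in FP16       -> ~3.7 GB
clips   = size_gb(0.9, 2)   # CLIP-L + CLIP-G in FP16 -> ~1.7 GB
t5_fp16 = size_gb(4.7, 2)   # T5-XXL encoder in FP16  -> ~8.8 GB
t5_fp8  = size_gb(4.7, 1)   # T5-XXL encoder in FP8   -> ~4.4 GB

print(round(mmdit + clips + t5_fp8, 1))   # ~9.8 GB  -> ballpark of the 10.9 GB file
print(round(mmdit + clips + t5_fp16, 1))  # ~14.2 GB -> "around 16 GB" once the VAE etc. are added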


{
    "_class_name": "SD3Transformer2DModel",
    "_diffusers_version": "0.29.0.dev0",
    "attention_head_dim": 64,
    "caption_projection_dim": 1536,
    "in_channels": 16,
    "joint_attention_dim": 4096,
    "num_attention_heads": 24,
    "num_layers": 24,
    "out_channels": 16,
    "patch_size": 2,
    "pooled_projection_dim": 2048,
    "pos_embed_max_size": 192,
    "sample_size": 128
}

Yeah, I can't even use this model, let alone train on it.
 

felldude

Active Member
Aug 26, 2017
505
1,500
OK, so regarding SD3:

I can run sd3_medium_incl_clips without the T5-XXL; even in FP8, that model is outside of my range.

Pony LoRAs do work with the model and improve the results slightly.

Without the LoRA:
ComfyUI_01406_.png

With the LoRA:

ComfyUI_01405_.png

The only sampler I found that doesn't corrupt the output is Euler, without the TE.
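For anyone on the diffusers side instead of Comfy, dropping the T5-XXL looks roughly like this (sketch only; the repo id and diffusers >= 0.29 are assumptions, and the repo is gated on Hugging Face):

# Load SD3 medium via diffusers but skip the T5-XXL text encoder entirely.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,   # drop T5-XXL, keep the two CLIP encoders
    tokenizer_3=None,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # keeps an 8 GB card from running out of VRAM

image = pipe("a photo of a red fox in the snow", num_inference_steps=28).images[0]
image.save("sd3_no_t5.png")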
 

Sharinel

Active Member
Dec 23, 2018
506
2,095
OK, so regarding SD3:

I can run sd3_medium_incl_clips without the T5-XXL; even in FP8, that model is outside of my range.

Pony LoRAs do work with the model and improve the results slightly.

Without the LoRA:
View attachment 3730872

With the LoRA:

View attachment 3730873

The only sampler I found that doesn't corrupt the output is Euler, without the TE.
The supergirl image and the generation times from my earlier post were the SD3 model with CLIP/T5 (the 10GB one), and I also posted a quick example generation attempt on the other thread - https://f95zone.to/threads/ai-art-show-us-your-ai-skill-no-teens.138575/post-13995058

I think it needs a lot of work, but from the way they have worded the licence, it looks like it's not worth it for the people who normally do finetunes.
 

felldude

Active Member
Aug 26, 2017
505
1,500
The supergirl image and the generation times from my earlier post were the SD3 model with CLIP/T5 (the 10GB one), and I also posted a quick example generation attempt on the other thread - https://f95zone.to/threads/ai-art-show-us-your-ai-skill-no-teens.138575/post-13995058

I think it needs a lot of work, but from the way they have worded the licence, it looks like it's not worth it for the people who normally do finetunes.
Trying to do the math on what a native finetune using Adam would require...
It might be out of range even of the 80GB A100.

I'd be curious whether someone with a 24GB card could finetune with Lion or Ada.
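Rough numbers behind that worry (2B parameters assumed for the medium MMDiT; Adam keeps two extra state tensors per weight, Lion keeps one):

# Weights + grads + optimizer state for a full finetune, ignoring activations and the text encoders.
GB = 1024 ** 3
params = 2.0e9  # assumed ~2B params for the SD3 medium MMDiT

def full_finetune_gb(w_bytes, g_bytes, n_states, state_bytes):
    return params * (w_bytes + g_bytes + n_states * state_bytes) / GB

print(round(full_finetune_gb(4, 4, 2, 4), 1))  # FP32 + Adam (m and v)   -> ~29.8 GB
print(round(full_finetune_gb(2, 2, 1, 2), 1))  # FP16 + Lion's one state -> ~11.2 GB

Activations and the T5/CLIP encoders come on top of that, and the larger SD3 variants multiply it again, which is where even 80GB starts to look tight.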
 

Synalon

Member
Jan 31, 2022
208
631
I managed to get this out of SD3 so far. If anybody has any prompts they want me to test on it for them, send me a message with the prompts; it only takes a few seconds to render.

This is with the most basic workflow I could make, as the multiple CLIPs weren't working for me.

I'm trying to fix the multiple-CLIPs workflow now, so it might get better later.

[Spoiler: example images]

*Edit: added two more example pictures.*
 

felldude

Active Member
Aug 26, 2017
505
1,500
I managed to get this out of SD3 so far. If anybody has any prompts they want me to test on it for them, send me a message with the prompts; it only takes a few seconds to render.

This is with the most basic workflow I could make, as the multiple CLIPs weren't working for me.

I'm trying to fix the multiple-CLIPs workflow now, so it might get better later.

[Spoiler: example images]
I didn't notice much of a difference with the 3-CLIP SD3 encoder that Comfy gave a heads-up about. Then again, I am using the model with only 2 TEs, so...

Have you tried the FP16 version they posted?