[Stable Diffusion] Prompt Sharing and Learning Thread

Fuchsschweif

Active Member
Sep 24, 2019
959
1,515
OpenPose isn't perfect; it has a lot of issues determining front and back legs, etc.
Really? When people do it on YouTube they always seem to get fantastic results.

1719353018049.png

Here, they did hit every single pose correctly. Isn't that the whole idea, so that SD can't miss because it follows the exact limb layout? Maybe we're not skilled enough...
 

Synalon

Member
Jan 31, 2022
211
631
Really? When people do it on YouTube they always seem to get fantastic results.

View attachment 3770533

Here, they did hit every single pose correctly. Isn't that the whole idea, so that SD can't miss because it follows the exact limb layout? Maybe we're not skilled enough...
Notice that most of the time the person is standing, facing forward, with the leg positions clearly defined.
This is also a workflow coming from a video, so it has frames before and after to help keep the position clear.
Also, without a clear prompt some checkpoints will fail to understand simple things like being slightly turned and looking back over her shoulder, so LoRAs are needed for some of those.

Your pose on all fours is already something Stable Diffusion will fuck up given the chance. From the OpenPose skeleton it looks like her back should be slightly towards the viewer, with her looking back.

Showing her back while looking over her shoulder confuses checkpoints a lot.
 

Sepheyer

Well-Known Member
Dec 21, 2020
1,528
3,598
Really? When people do it on YouTube they always seem to get fantastic results.

View attachment 3770533

Here, they did hit every single pose correctly. Isn't that the whole idea, so that SD can't miss because it follows the exact limb layout? Maybe we're not skilled enough...
High chance of duds. Even more so on anything crawling, bending, etc. I used OpenPose heavily early on but eventually moved to other models - depth, canny and tile. Maybe you've experimented with those already, but I would venture to say that if you haven't, you will end up going with them over OpenPose once you try the alternatives.
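Roughly, the depth/canny route looks like this (a quick sketch, not my exact workflow; the file names and annotator repo id are placeholders):

```python
# Build the control images first, then feed them to whatever ControlNet
# pipeline you use. File names below are placeholders.
import cv2
import numpy as np
from PIL import Image
from controlnet_aux import MidasDetector

ref = Image.open("reference.png").convert("RGB")

# Depth map via MiDaS (controlnet_aux annotator weights)
midas = MidasDetector.from_pretrained("lllyasviel/Annotators")
midas(ref).save("control_depth.png")

# Canny edges via OpenCV, converted back to a 3-channel control image
gray = cv2.cvtColor(np.array(ref), cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)
Image.fromarray(np.stack([edges] * 3, axis=-1)).save("control_canny.png")
```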
 

devilkkw

Member
Mar 17, 2021
305
1,039
Really? When people do it on YouTube they always seem to get fantastic results.

View attachment 3770533

Here, they did hit every single pose correctly. Isn't that the whole idea, so that SD can't miss because it follows the exact limb layout? Maybe we're not skilled enough...
This is a simple standing pose and it works great, but for complex poses you need to use some tricks.
I usually use OpenPose and a depth map, and for more complex ones I add a normal map.
I made the pose in DAZ3D.
This is what works well for me for posing with SD1.5.
 

devilkkw

Member
Mar 17, 2021
305
1,039
This is what I mean by multi-ControlNet.
kkwmulticnet.jpg-w.jpg

I made a pose in DAZ3D, did a fast render at the same resolution (not required, but I also try i2i), loaded it, then applied it.
I use a resolution of 1024 for each ControlNet node, because my image is 1024x1280.

The result images use a simple prompt: nude old woman in kitchen.
Results are shown on different checkpoints.
Most important with this method is how you combine the ControlNets and their strengths; in my tests the max total strength is a value from 1 to 1.2.
The order in the conditioning combine also matters; the best I've found is shown in the image:

Schematic view with strengths:

depth (0.44) + openpose (0.4) -> combine
combine result + normal (0.3) -> combine ---> to positive sampler conditioning

Also, I don't use Advanced ControlNet, because it affects the negative prompt and I never get good results with it and multiple ControlNets.

This slows generation down by about 2x, but it lets you use a simple prompt without needing a pose description.
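If you want to try the same stack outside ComfyUI, a rough diffusers sketch of the idea looks like this (not my actual node graph; the checkpoint and ControlNet repo ids are placeholders you should swap for your own):

```python
# Minimal diffusers sketch of the depth + openpose + normal stack.
# Repo ids and file names are illustrative placeholders.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_normalbae", torch_dtype=torch.float16),
]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # placeholder SD1.5 checkpoint
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

# Control maps rendered from the DAZ3D pose at the target resolution
control_images = [
    load_image("pose_depth.png"),
    load_image("pose_openpose.png"),
    load_image("pose_normal.png"),
]

image = pipe(
    "nude old woman in kitchen",
    image=control_images,
    # strengths from the schematic above; total ~1.14, inside the 1-1.2 budget
    controlnet_conditioning_scale=[0.44, 0.4, 0.3],
    num_inference_steps=30,
).images[0]
image.save("result.png")
```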
 

Synalon

Member
Jan 31, 2022
211
631
00313-3530447844.png

Very quick using Forge. I suppose I could spend more time refining it to make it better, but I'm lazy.

I used depth hand refiner, depth anything, and lineart. There's still a lot wrong with it, but as a base to work from that took less than 2 minutes, it's not bad.
 

felldude

Active Member
Aug 26, 2017
511
1,502
I have been playing around with Pony for a while, but I honestly might switch back to XL.

I did a 2k training on 332 images of females flashing the camera in public places. (It was 600 images; I pruned the ones I thought wouldn't train well.)

It took over an hour just to BLIP-2 caption them. I was impressed with the BLIP-2 captioning on high-quality, highly complex images; I would not use it on a white-background image, however.

Here are some test images from 1024 up to 2k on the XXXL-V3 model.

ComfyUI_01278_.png ComfyUI_01274_.png ComfyUI_01273_.png ComfyUI_01272_.png ComfyUI_01264_.png ComfyUI_01258_.png
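A captioning pass like that boils down to something like this transformers sketch (not my exact script; paths, the checkpoint id, and generation settings are placeholders):

```python
# Rough batch BLIP-2 captioning pass with transformers.
import glob
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda"  # BLIP-2 at fp16 really wants a GPU
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to(device)

for path in sorted(glob.glob("dataset/*.png")):
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=60)
    caption = processor.decode(out[0], skip_special_tokens=True).strip()
    # one .txt caption file per image, the layout most trainers expect
    with open(path.rsplit(".", 1)[0] + ".txt", "w") as f:
        f.write(caption)
```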
 

crnisl

Member
Dec 23, 2018
446
335
It took over an hour just to BLIP-2 caption them. I was impressed with the BLIP-2 captioning on high-quality, highly complex images.
Have you tried clip interrogator?


I made a pose in DAZ3D, did a fast render at the same resolution (not required, but I also try i2i), loaded it, then applied it.
Do you have any ideas, people, on how to turn 3D images into highly realistic, photo-like ones, but without losing the similarity/consistency with the original face and colors?
What I use so far is just inpainting the eyes and mouth with RealVisXL, then film grain.

But maybe you have much cleverer ideas, or even some magic ComfyUI configs?
Something to make the textures of skin/hair/clothes more realistic?
 

Sepheyer

Well-Known Member
Dec 21, 2020
1,528
3,598
Have you tried clip interrogator?




Do you have any ideas, people, on how to turn 3D images into highly realistic, photo-like ones, but without losing the similarity/consistency with the original face and colors?
What I use so far is just inpainting the eyes and mouth with RealVisXL, then film grain.

But maybe you have much cleverer ideas, or even some magic ComfyUI configs?
Something to make the textures of skin/hair/clothes more realistic?
We had a scorching debate about likeness not six months ago, I think. The eventual consensus was that likeness -- face-wise -- is NOT retained per se. To keep the face one needs to borderline "copy" it over. So, depending on your threshold for likeness, this can work either very well or not at all:
https://f95zone.to/threads/stable-diffusion-prompt-sharing-and-learning-thread.146036/post-12670868
https://f95zone.to/threads/stable-diffusion-prompt-sharing-and-learning-thread.146036/post-12750138
https://f95zone.to/threads/stable-diffusion-prompt-sharing-and-learning-thread.146036/post-12669146
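One thing worth trying is the usual low-denoise img2img route (just a sketch, not a recipe from those threads; the RealVisXL repo id and the strength value are assumptions to tune): keep the denoise low so the render's composition and colors survive, then inpaint the face separately like you already do.

```python
# Low-strength SDXL img2img over the 3D render; strength controls how far
# the result drifts from the original colors/composition.
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0", torch_dtype=torch.float16  # placeholder repo id
).to("cuda")

render = load_image("daz_render.png")
photo = pipe(
    "photo, realistic skin texture, natural lighting, film grain",
    image=render,
    strength=0.3,        # low denoise keeps likeness; higher values drift away
    guidance_scale=5.0,
).images[0]
photo.save("photoreal.png")
```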
 

felldude

Active Member
Aug 26, 2017
511
1,502
Have you tried clip interrogator?
I can just barely run BLIP-2 by itself for natural language; I found it far superior to BLIP.
I don't think I could run the CLIP interrogation with BLIP-2.

Before they took down the LAION-5B reverse image search, I used it to get exact prompting on images, assuming they were used, or the closest trained source.

I wish BLIP-2 gave a summary of the description terms like the WD14 tagger does.
 

crnisl

Member
Dec 23, 2018
446
335
I don't think I could run the CLIP interrogation with BLIP-2.
You definitely can; use Colab or Replicate or whatever. Look: CLIP interrogation with blip2-2.7b, in SDXL prompt mode.

And you can use your own dictionary for questioning; it can help.
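Locally it boils down to something like this (a sketch; the exact model-name strings depend on the clip-interrogator version you install, so treat them as assumptions rather than my actual settings):

```python
# CLIP interrogation with a BLIP-2 caption backbone via the clip-interrogator package.
from PIL import Image
from clip_interrogator import Config, Interrogator

ci = Interrogator(Config(
    caption_model_name="blip2-2.7b",    # BLIP-2 as the caption model (option name may vary by version)
    clip_model_name="ViT-L-14/openai",  # ViT-bigG is the usual pick for SDXL-style prompts
))

image = Image.open("test.png").convert("RGB")
print(ci.interrogate(image))  # caption plus best-matching style/flavor terms
```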
 
Last edited:

felldude

Active Member
Aug 26, 2017
511
1,502
You definitely can. Look: CLIP interrogation with blip2-2.7b, in SDXL prompt mode.

And you can use your own dictionary for questioning; it can help.
I did see the BLIP-2 option, but I am guessing that nearly any CLIP model won't fit into my VRAM alongside BLIP-2. It looks like an interesting tool I wasn't aware of, though; I might try it later. (EDIT: I could probably run it at FP8, assuming some of the 15GB FP32 model was offloaded when not in use.)

Time-wise, I can run BLIP-2 and append WD-14 tags on 332 2k images in a few hours...
If my times are the same as yours, then it would be 11 hours to run through that program.

EDIT: I'd be curious to see what the attached image gets tagged as by that program, if you wanted to test run it:

BLIP-2 with WD-14 (ONNX) appended -1girl

a woman sitting on a bench in a white tank top and red shorts with her hands on her hips, solo, long hair, breasts, looking at viewer, skirt, jewelry, medium breasts, sitting, purple hair, outdoors, pussy, choker, day, spread legs, miniskirt, clothes lift, mole, lips, uncensored, no panties, red skirt, skirt lift, sunglasses, tank top, lifted by self, building, mole on breast, watch, wristwatch, white choker
 
Last edited:

felldude

Active Member
Aug 26, 2017
511
1,502
View attachment 3777479


Heh, I don't know why the violet tank top.
With this prompt and your aspect ratio, I got this from RealVisXL, so it catches some of the vibe.
View attachment 3777443 View attachment 3777444
Thanks for the test, my thoughts:

I am fighting the anchoring fallacy here, but I am also one of those people who pulls the beta or nightly builds and then spends hours rebuilding the version I just came from... (I really should stick to venvs.)

Your description had Spain and a full round face, which I felt was accurate.

But I felt the natural language was lacking if you compare the two:

a woman sitting on a bench in a white tank top and red shorts with her hands on her hips
a woman sitting on a bench with her legs bare

The shorts part was wrong, so it could be a wash, but WD14 caught the clothes lift, no panties, skirt lift, and lifted by self.

Of course this is just one image, and it would be poor to judge based on just one; I know I had some badly described images in the mix, one or two that were completely wrong.

I think CLIP interrogation is more beneficial for trying to reproduce an image that a model has been trained on, rather than for providing data to train a model on.
 

crnisl

Member
Dec 23, 2018
446
335
But I felt the natural language was lacking if you compare the two:
The first sentence is just the output from the first launch of BLIP-2, before the interrogation loop. There are other default parameters in my build, to make it faster or something, and it can be changed to be equal to your output, I think.

Anyway, I had more success training character LoRAs using the interrogator's output (though not the full output) for captioning than with BLIP + tagger. But the specifics differ, I guess. If you want to work with poses and fetishes, etc., rather than with making the most similar faces, then the tagger is better - or maybe you need to experiment with custom dictionaries for the interrogator.
 

felldude

Active Member
Aug 26, 2017
511
1,502
The first sentence is just the output from the first launch of BLIP-2, before the interrogation loop. There are other default parameters in my build, to make it faster or something, and it can be changed to be equal to your output, I think.
Yeah, I didn't mess with the number of beams or any of that. I think the default settings were tested over a large group of images and then scaled so as not to over-describe or get repeating terms. (Based on one article I read; I haven't played around with it at all.)

I'm not sure what precision I am running at. I would guess FP8, or at best FP16, since my card can't fit 15GB, but with bitsandbytes it might actually be running at FP32, as my understanding is it doesn't need to load the whole model unless training.
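For reference, loading BLIP-2 through bitsandbytes in 8-bit would look something like this (a sketch of that setup; whether it matches what my tool is actually doing, I'm not sure):

```python
# Load BLIP-2 quantized to 8-bit via bitsandbytes so the ~15GB FP32 weights
# don't have to sit in VRAM at full precision.
from transformers import Blip2ForConditionalGeneration, Blip2Processor, BitsAndBytesConfig

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # spills layers to CPU RAM if they don't fit on the card
)
```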
 
Last edited:

crnisl

Member
Dec 23, 2018
446
335
We had a scorching debate about likeness not six months ago, I think. The eventual consensus was that likeness -- face-wise -- is NOT retained per se. To keep the face one needs to borderline "copy" it over. So, depending on your threshold for likeness, this can work either very well or not at all:
https://f95zone.to/threads/stable-diffusion-prompt-sharing-and-learning-thread.146036/post-12670868
https://f95zone.to/threads/stable-diffusion-prompt-sharing-and-learning-thread.146036/post-12750138
https://f95zone.to/threads/stable-diffusion-prompt-sharing-and-learning-thread.146036/post-12669146
Well...
00002.png 00002.png 00002.png
 

Synalon

Member
Jan 31, 2022
211
631
I have been playing around with Pony for a while, but I honestly might switch back to XL.

I did a 2k training on 332 images of females flashing the camera in public places. (It was 600 images; I pruned the ones I thought wouldn't train well.)

It took over an hour just to BLIP-2 caption them. I was impressed with the BLIP-2 captioning on high-quality, highly complex images; I would not use it on a white-background image, however.

Here are some test images from 1024 up to 2k on the XXXL-V3 model.

View attachment 3776302 View attachment 3776303 View attachment 3776304 View attachment 3776306 View attachment 3776307 View attachment 3776308
What program did you use for the BLIP-2 captioning?
I used Kohya to caption 768 images and it took about 3 minutes; since yours took so long, was it more detailed captions or something?
 

felldude

Active Member
Aug 26, 2017
511
1,502
What program did you use for the BLIP-2 captioning?
I used Kohya to caption 768 images and it took about 3 minutes; since yours took so long, was it more detailed captions or something?
You have an amazing computer if you were parsing 4.5-ish 2k images a second.
(Or the model is oversized for me and causing major slowdowns.)

I can get those times on the 373MB WD-ViT tagger with ONNX, but not the 15GB BLIP-2 model.
1.jpg
 
Last edited: