I see people using florence2 in here.
Just wanted to quickly chime in with some info.
Whether a prompt from florence will work or not depends highly on which model you are using.
For classic Stable Diffusion based models (SD1.5, SDXL and the like) you should use booru tags, i.e. "1girl, blonde, big breasts, naked, standing, next to a tree, blah, blah, blah". In other words: TAGS.
Florence obviously does not output prompts that look like that.
Models like FLUX, and even the video generation models, accept and often work better with full descriptions like what florence gives you (though they also understand tags).
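If you want to generate that kind of florence description yourself, outside of whatever node or extension you normally use, a minimal sketch with the public microsoft/Florence-2-large release looks roughly like this (the image path and generation settings are just example values, so treat the details as an assumption rather than gospel):

```python
# Minimal sketch: get a Florence-2 style caption for a still image with transformers.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)

image = Image.open("source.png").convert("RGB")   # placeholder path to your image
task = "<MORE_DETAILED_CAPTION>"                  # Florence-2 task token for a long caption

inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height)
)
print(caption[task])  # the full-sentence description you can paste into your prompt
```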
Pony models, on the other hand, work better with tags (you can also write florence-style descriptions, but tags are the way to get the model to play nice with you). You also need those "score" prompts to specify the quality of the images the model should pull from its training data (that only applies to Pony V6 based models).
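To make the tag + score style concrete, here is a minimal sketch of feeding such a prompt to a Pony V6 based SDXL checkpoint through diffusers. The checkpoint filename is a placeholder, and the score tags follow the commonly used Pony V6 convention, so adjust to whatever the model card actually says:

```python
# Hedged sketch: tag-style prompting for a Pony V6 based SDXL checkpoint via diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

# Placeholder filename; point this at whichever Pony V6 based checkpoint you use.
pipe = StableDiffusionXLPipeline.from_single_file(
    "ponyDiffusionV6XL.safetensors", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "score_9, score_8_up, score_7_up, "                      # Pony V6 quality tags
    "1girl, blonde hair, standing, next to a tree, outdoors"  # booru-style content tags
)
image = pipe(prompt=prompt, num_inference_steps=25).images[0]
image.save("tag_prompt_example.png")
```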
I have not used A1111 in AGES, but in Forge you have two buttons you can press below the "Generate" button, one for CLIP and one for booru (DeepBooru). Press them and see what they spit out; they are kind of like "florence for tags".
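If you want that interrogation step as a script instead of a button, the clip-interrogator package (pip install clip-interrogator) does roughly the same job as the CLIP button. A sketch, under the assumption that you are targeting SD1.5-era checkpoints (hence the ViT-L CLIP model):

```python
# Rough "image in, prompt text out" sketch with the clip-interrogator package.
from PIL import Image
from clip_interrogator import Config, Interrogator

# ViT-L-14/openai matches SD1.5-style models; SDXL-oriented setups often use a bigger CLIP.
ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))

image = Image.open("reference.png").convert("RGB")  # placeholder image path
print(ci.interrogate(image))  # caption plus appended style/"flavor" tags
```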
That can be good to know if the model is not producing the results you want. Read about the model and find out how you should prompt for it.
Edit:
Since the thread is about img2vid:
I have recently started to do a real deep dive into local video generation.
The new WAN models are quite frankly insane, so I have started using a few of them in Comfy. It's pretty complicated, but the results sometimes blow even my mind, and I have been generating images locally for years.
If you want to generate videos, forget about Hunyuan and go for WAN. Maybe start out with WAN2.1, and after learning that, move on to WAN2.2.
WAN2.2 is so new that LoRAs and tutorials for it can be a bit tricky to find, while WAN2.1 has tons of tutorials and LoRAs for you to play around with.
Just be mindful of what size of model you use; they are pretty darn big if you want good quality.
I have a GPU with 24 GB of memory, so I can for example use Q8 GGUF models (put simply, GGUF models are quantized, so they are smaller than the full safetensors and run faster for me). Start with Q4 and work your way up until the quality satisfies you; a rough idea of the sizes is sketched below.
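As a very rough back-of-envelope of why the quant level matters so much for VRAM (assuming the 14B WAN variant and typical GGUF bits-per-weight figures; real files differ a bit because some layers are kept at higher precision):

```python
# Approximate weight size for different quantization levels of a ~14B parameter model.
PARAMS = 14e9  # roughly 14 billion weights (the larger WAN variant)
bits_per_weight = {
    "fp16 safetensors": 16.0,
    "Q8_0 gguf": 8.5,
    "Q5_K_M gguf": 5.5,
    "Q4_K_M gguf": 4.8,
}

for name, bits in bits_per_weight.items():
    gigabytes = PARAMS * bits / 8 / 1e9
    # Weights only; the text encoder, VAE, latents and activations come on top of this.
    print(f"{name:>17}: ~{gigabytes:.0f} GB")
```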
Also, be prepared to wait. A 5 s video at 960x540 (the resolution I usually run, because it gives me the details and quality I want) using Q8 on my 3090 takes around 5-10 minutes (yes, the generation time CAN vary that much). So if the result is not "good enough", you have to tweak the prompt or settings and rerun the whole generation and wait again.
Patience is a required trait with video generation, unless you can afford a 5090 or an Axxxx card, of course.
You can then upscale the video by 2x to get 1080p, but the upscaled video will never be better than the original, so keep that in mind.
Hence "I use Q8 gguf in resolution 960x540". If I use lower Q models, or lower resolution the output becomes "sloppy" in my eyes.
More GPU RAM: better quality (bigger) models and a higher base resolution.
Newer GPU: faster generation.
All of the above is about I2V; I have not actually played around with T2V at all. I prefer to create an image in Forge and then generate videos from those.
I would say a 3090 with 24 GB beats a 50xx card with less memory, simply because I CAN use better/bigger models, even if it takes more time.
With a 50xx card that has, let's say, only half the memory, generation will be fast, but I will never be able to use the big quality models, so what good does the speed do me then?
Also, do not forget about power consumption on some of the higher-end cards. 350 W might not sound like much, but have it running for 8 hours and suddenly it starts to cost a bit of money in electricity; a quick back-of-envelope is below.
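The math is simple enough to sanity check yourself (the price per kWh below is just a made-up placeholder, plug in your own rate):

```python
# Back-of-envelope electricity cost for long generation sessions.
watts = 350            # rough draw of the GPU under load
hours = 8              # one long generation session
price_per_kwh = 0.30   # placeholder rate, use whatever your provider charges

kwh = watts / 1000 * hours
per_session = kwh * price_per_kwh
print(f"{kwh:.1f} kWh per session, ~{per_session:.2f} per session, "
      f"~{per_session * 30:.0f} per month if you do this daily")
```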
The AI spat out this video when I was playing around with it. I wanted the video to be static, but I forgot to nail that down in the prompt. Look at how it handles distances with the camera movement; it blew my mind that it could do this (I was using florence2 to extend the prompt, which is probably why it managed to segment the image so well).
[Video attachment: WAN-UmeAiRT-gguf-speed_2025-11-02-1411_OG_00001.mp4]