I've been using ComfyUI for quite a while and have taught myself how to create (simple) workflows to generate images, upscale them, use ControlNet to transfer existing poses or even animations onto pre-defined images, do face replacements, and so on, all with SD(XL) models. It's pretty amazing what you can do, and I love it! I even replaced my NVIDIA RTX 2080 Ti with an NVIDIA RTX 4090 to get around the 2080 Ti's limited VRAM. But that's when I stopped teaching myself new things. I don't know anything about Pony, Illustrious, Qwen, or anything like that; I just see your posts and I'm... wow!
And now there's Image-2-Video and Text-2-Video using Wan 2.1/2.2, which I'm very interested in, but I'm totally lost! So here I am asking you guys if you can help me out.
First, I started with the official Wan templates like "Wan 2.2 14B Text to Video" or "Wan 2.2 14B Image to Video", which can be found in the "Templates" section of the ComfyUI menu. I downloaded the models for them, and while those workflows do work, they seem slow and "limited" when it comes to the length and resolution of the video. And because you'll usually want to create a couple of videos with the same prompt to pick the best result, it might take many hours, maybe even a couple of days, to get the video you're looking for. And it's only like 5 seconds long...
Next I looked for some "optimized" workflows people share online, and first I found a set of workflows on Civitai [link], which I've been trying out. I do like the "WAN 2.2 I2V" workflow it includes, because it seems to be faster and has more options, but I still feel limited when it comes to the resolution and length of the video, because it uses ".safetensors" models, which use a lot of VRAM. I can get a 5-second video with a decent resolution, or a longer video with a poor resolution.
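To put a rough number on the VRAM side, here is a back-of-the-envelope sketch in Python. It only estimates the size of the weights themselves, assuming the "14B" in the template name means about 14 billion parameters. The bit widths are example values I picked, and real usage will be higher because of the text encoder, the VAE, and the activations.

```python
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


# Assumption: "14B" in the template name means ~14 billion weights.
N_PARAMS = 14e9

# Assumption: example precisions only, not the exact formats of any download.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_size_gib(N_PARAMS, bits):.1f} GiB")
```

At 16 bits per weight this comes out at roughly 26 GiB, which is already more than the 24 GB on a 4090, so it would at least explain why the full-size models feel so tight.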
Then I thought I might go for GGUF models instead because, from what I understand, they use less VRAM but are "compressed" and might therefore take longer. I don't mind waiting a couple of minutes for results if I can use more frames (= longer videos) or a higher resolution than with the "default" workflow. So I found this workflow [link], which is very impressive: it uses GGUF, has a bunch of options, and after downloading all the missing nodes and models (as well as fixing a "bug" in the workflow itself), it produces decent results within a couple of minutes. I've been able to create a few videos of 20+ seconds (at 24 FPS) with a resolution of 480x800, but as soon as I add action prompts for the camera or the subject in the picture (btw: no additional LoRAs are involved), the video either gets blurry (it looks like a double or even multiple exposure in photography) or just doesn't follow the prompt. For example, if the prompt says "the camera slowly zooms in toward the woman's face", it zooms in for about 3 seconds, then zooms back out, and repeats those steps until the end of the clip, even if I add something like "at second 5, the camera stops completely and remains entirely static for the rest of the video. there is no zooming, panning, or movement after this point — the frame stays locked on her face."
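For scale, the frame and pixel counts behind those numbers can be worked out with a few lines of Python. This is just arithmetic on the figures above (24 FPS, 480x800). The function names are made up for illustration, and I'm only assuming that memory and render time grow with the total number of pixels; I haven't measured it.

```python
def frame_count(seconds: float, fps: int) -> int:
    """Number of frames needed for a clip of the given length."""
    return round(seconds * fps)


def total_pixels(width: int, height: int, frames: int) -> int:
    """Total pixels across all frames of the clip."""
    return width * height * frames


FPS = 24
WIDTH, HEIGHT = 480, 800

for seconds in (5, 10, 20):
    frames = frame_count(seconds, FPS)
    pixels = total_pixels(WIDTH, HEIGHT, frames)
    print(f"{seconds:>2} s -> {frames} frames, {pixels / 1e6:.1f} megapixels in total")
```

A 20-second clip at 24 FPS is 480 frames, four times as many as a 5-second clip at the same resolution.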
So here are my questions:
- What's your overall workflow to create a 10-20+ second high-resolution video based on your imagination/prompt?
- The resulting video should be produced in a couple of minutes (5-15 minutes at most, not hours).
- What's your Text-2-Image workflow you use to create your starting image?
- What's your Image-2-Video workflow to produce a 10-20+ second video with a decent (720p) resolution?
- What's your workflow to upscale the video to an HD resolution (1280p or even 1440p)?
- What prompt (or LoRA) do you use to consistently "control" the camera movements (zoom in, zoom out, stay static on a close-up, etc.)?
Any help is highly appreciated. I would love to end up with something like 3-4 workflows in total (1: create a starting/ending image for the video / 2: create a 10-20+ second video with "precise" camera movement / 3: upscale the video to at least 1280p).
TL;DR: if you share your workflows for creating a 20+ second video with precise camera (and subject) actions, or can point me in the right direction for further research, I will be in your debt forever.
