Sharinel
Active Member
- Dec 23, 2018
I'm running a 4090 as well and tbh I don't think it has the VRAM to do 10-20 second videos. At the moment I'm producing 5-second videos at 768x768 (or an equivalent ratio) that take around 4 minutes each. I've attached the json file I use; of the LoRAs on the list only the top one is needed, the rest are NSFW LoRAs for specific things.

I've been using ComfyUI for quite a while. I've taught myself how to create (simple) workflows to generate images, upscale images, use ControlNet to reproduce existing poses or even animations with pre-defined images, do face replacements and all that, all with SD(XL) models. It's pretty amazing what you can do, and I love it! I even replaced my NVIDIA RTX 2080 Ti with an NVIDIA RTX 4090 to circumvent the 2080 Ti's limited VRAM. But that's where I stopped teaching myself new things. I don't know anything about Pony, Illustrious, Qwen or anything like that; I'm just seeing your posts and I'm... wow!
And now there's Image-2-Video and Text-2-Video using Wan 2.1/2.2, which I'm very interested in, but I'm totally lost! So here I am asking you guys if you can help me out.
First, I started with the official Wan templates like "Wan 2.2 14B Text to Video" or "Wan 2.2 14B Image to Video", which can be found in the "Templates" section of the ComfyUI menu. I've downloaded the models for them, and while those workflows do work, they seem slow and "limited" when it comes to the length and resolution of the video. And because you'll usually want to create a couple of videos with the same prompt to pick the best result, this can take many hours, maybe even a couple of days, to get the video you're looking for. And it's only about 5 seconds long...
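For anyone wondering where the ~5-second ceiling comes from: as far as I understand it, the 14B Wan models generate at 16 fps and expect a frame count of the form 4k+1 (because of the VAE's 4x temporal compression), so the usual 81-frame default works out to about 5 seconds. A quick back-of-the-envelope sketch (the 4k+1 rule and the 16 fps figure are my understanding, not official numbers — check your model's docs):

```python
# Rough duration math for Wan video generation.
# Assumption: frame counts must be of the form 4*k + 1 (Wan's 4x
# temporal VAE compression, as I understand it) at a native 16 fps.
FPS = 16  # assumed native frame rate of the Wan 2.x 14B models

def valid_frame_count(seconds: float) -> int:
    """Nearest frame count of the form 4*k + 1 for a target duration."""
    target = round(seconds * FPS)
    k = round((target - 1) / 4)
    return 4 * k + 1

for secs in (5, 10, 20):
    frames = valid_frame_count(secs)
    print(f"{secs:>2}s target -> {frames} frames (~{frames / FPS:.2f}s)")
# 5s -> 81 frames; 10s -> 161; 20s -> 321, i.e. 4x the compute of a 5s clip
```

So a 20-second clip is roughly four times the frames (and VRAM for latents) of the 5-second default, which is why the templates feel capped.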
Next I was looking for some "optimized" workflows people share online, and first I found a set of workflows on civit.ai (You must be registered to see the links), which I've been trying out. I do like the included "WAN 2.2 I2V" workflow, because it seems faster and has more options, but I still feel limited when it comes to the resolution and length of the video, because it uses ".safetensors" models, which use a lot of VRAM. I can still get 5-second videos at a decent resolution, or a longer video at a poor resolution.
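To put some rough numbers on why the full .safetensors checkpoints stress even a 24 GB card: a 14B-parameter model in fp16 needs roughly 26 GB for the weights alone, before latents, activations, text encoder and VAE. The sketch below is back-of-the-envelope only — the bytes-per-weight figures for the GGUF quant levels are approximate averages, not exact:

```python
# Approximate weight memory for a 14B-parameter video model at
# different precisions. GGUF bytes-per-parameter values are rough
# averages (quant blocks carry scale overhead), not exact figures.
PARAMS = 14e9

precisions = {
    "fp16 (.safetensors)": 2.0,
    "fp8": 1.0,
    "GGUF Q8_0": 1.07,    # ~8.5 bits/weight, approximate
    "GGUF Q4_K_M": 0.57,  # ~4.5 bits/weight, approximate
}

for name, bytes_per_param in precisions.items():
    gb = PARAMS * bytes_per_param / 1024**3
    print(f"{name:<22} ~{gb:.1f} GB of weights")
```

Which is why the fp16 checkpoint alone barely fits a 4090, while a Q4-ish GGUF leaves most of the 24 GB free for more frames or higher resolution.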
Then I thought I might go for GGUF models instead, because from what I understand they use less VRAM, but they are "compressed" and therefore might take longer. I don't mind waiting a couple of minutes for results if I can use more frames (= longer videos) or a higher resolution than with the "default" workflow. So I found another workflow (You must be registered to see the links), which is very impressive: it uses GGUF, has a bunch of options, and after downloading all the missing nodes and models (as well as fixing a "bug" in the workflow itself), it produces decent results within a couple of minutes.

I've been able to create a few videos of 20+ seconds (at 24 FPS) with a resolution of 480x800. But as soon as I add action prompts for the camera or the subject in the picture (btw: no additional LoRAs are involved), the video gets blurry (it looks like a double or even multiple exposure, in photography terms) or it just doesn't follow the prompt. For example, if the prompt says "the camera slowly zooms in toward the woman's face", it zooms in for about 3 seconds, then zooms back out and repeats those steps until the end of the clip — even if I add something like "at second 5, the camera stops completely and remains entirely static for the rest of the video. there is no zooming, panning, or movement after this point — the frame stays locked on her face."
So here are my questions:
- What's your overall workflow to create a 10-20+ second high-resolution video based on your imagination/prompt? The resulting video should be produced in a couple of minutes (5-15 minutes at most, not hours).
- What Text-2-Image workflow do you use to create your starting image?
- What's your Image-2-Video workflow to produce a 10-20+ second video at a decent (720p) resolution?
- What's your workflow to upscale the video to an HD resolution (1280p or even 1440p)?
- What prompt (or LoRA) do you use to consistently "control" camera movements (zoom in, zoom out, stay static at a close-up, etc.)?

Any help is highly appreciated. I would love to end up with 3-4 workflows in total (1: create a starting/ending image for the video; 2: create an at least 10-20+ second video with "precise" camera movement; 3: upscale the video to at least 1280p).
TL;DR: if you share your workflows to create a 20+ second video with precise camera (and subject) actions, or can point me in the right direction for further research, I will be in your debt forever!
The good thing about this workflow is that you get a final image which you can then use to kick off the next video, and it also uses interpolation to increase the fps/video size. If you've been downloading the models already, you probably have these.
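On the interpolation point: generating at the model's native frame rate and then interpolating (e.g. with a RIFE-style node) is far cheaper than generating the extra frames directly. A quick sketch of the arithmetic — the node names in ComfyUI vary, but the formula below is the generic one for n-times interpolation (the 81-frame/16 fps starting point is just a typical Wan 2.x example, not something from this workflow specifically):

```python
# Frame-interpolation math: n-times interpolation inserts (n - 1) new
# frames between each adjacent pair, so N frames become (N - 1)*n + 1.
def interpolated(frames: int, factor: int) -> int:
    """Frame count after factor-x interpolation (generic formula)."""
    return (frames - 1) * factor + 1

native_frames, native_fps = 81, 16  # assumed typical Wan 2.x clip
out_frames = interpolated(native_frames, 2)
print(f"{native_frames} frames @ {native_fps} fps "
      f"-> {out_frames} frames @ {native_fps * 2} fps "
      f"(same ~{native_frames / native_fps:.1f}s clip)")
# 2x interpolation: 81 frames -> 161 frames, smoother motion, no extra diffusion steps
```

That's how a clip sampled at 16 fps can end up as smooth 24 or 32 fps output without paying the diffusion cost for those in-between frames.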
Here is an example of the output
View attachment Deadwood Vibes Video 02.mp4