With the NF4 model I am able to generate images at 7-12 seconds per iteration.
This is roughly 10x faster than what I was getting out of the FP8 model.
If you have a low-VRAM card and use ComfyUI, I highly recommend this for FLUX.
Just google NF4 for ComfyUI.
EDIT: NF4 currently has issues with LoRAs, but it should still let older-card users run FLUX at a decent speed. Once QLoRA conversion is available, LoRAs should also be usable.
My Schnell NF4 speed - 4.64 seconds per iteration.
My Dev NF4 speed - around 30 seconds per iteration.
Still a vast improvement over the 100 seconds per iteration I was getting before.
I tried adding the imports and the 4-bit quantization config together:

import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
But I am still unable to load BF16 or FP16 LoRAs alongside an NF4 model, so hopefully a programmer can get a working script for NF4.
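For anyone who wants to poke at this outside ComfyUI, here is a rough sketch of how I understand that config is supposed to be wired up with diffusers. Treat it as a guess, not a working script: it assumes the black-forest-labs/FLUX.1-dev repo and a placeholder LoRA path, it only quantizes the T5 text encoder through transformers (getting the FLUX transformer itself into NF4 the way the ComfyUI checkpoint does would need bitsandbytes support on the diffusers side), and loading a LoRA on top of NF4-quantized weights is exactly the part that still doesn't work for me.

import torch
from transformers import BitsAndBytesConfig, T5EncoderModel
from diffusers import FluxPipeline

# NF4 is one of the bitsandbytes 4-bit quant types; compute runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The T5 text encoder is a regular transformers model, so it accepts
# quantization_config directly.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=bnb_config,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps on low-VRAM cards

# With a full-precision transformer this load should work; the failure I keep
# hitting is when the FLUX transformer itself is NF4-quantized. Path below is
# a placeholder.
pipe.load_lora_weights("path/to/flux_lora.safetensors")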