Having been messing around with Cascade, I'll share some observations (wall-of-text warning).
I've tried both the "comfyui checkpoints" and the unet models, and from what I've seen so far, something seems "wrong" with the comfyui ones. This could be an issue in how they were created or just in how they're loaded, I don't know. Results are generally "bad": faces are very low quality, proportions are off, and I had one image where the head looked about half the size you'd expect compared to the upper body. There's also quite a bit of blurring/out of focus and cropping/out of frame. It kind of reminds me of early SD at very small dimensions.
It also has a serious memory problem. While VRAM usage during generation is low, the initial loading is really bad: the two model files are about 13-14 GB combined, but it REALLY struggled to fit that into 32 GB of RAM. I have no problem caching multiple SD and SDXL models in RAM normally, but this really did not do well. I'm sure they'll work on optimizing it, but at the moment it's not doing all that well.
The "unet version" seems to work far better, both in memory usage and generation quality, at least for me. There are multiple versions, and I didn't notice much difference in the results they gave, so for testing I'd say you're fairly safe going with the smaller ones.
Generated images are fairly clean with little/no negative prompting. Using :x.y style weighting seems to completely break the image though, so keep that in mind if you're getting bad results.
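If you're reusing prompts written for SD/SDXL that already have weights baked in, one quick workaround is to strip the (tag:x.y) weight syntax before feeding the prompt to Cascade. A rough sketch (the helper name and regex are mine, not part of ComfyUI):

```python
import re

def strip_weights(prompt: str) -> str:
    """Remove (tag:1.2)-style weighting, keeping the bare tag text."""
    # turn "(some words:1.3)" into "some words"; leaves unweighted text alone
    return re.sub(r"\(([^():]+):\d+(?:\.\d+)?\)", r"\1", prompt)

print(strip_weights("a portrait, (sharp focus:1.3), (film grain:0.8)"))
# -> a portrait, sharp focus, film grain
```

This only handles the simple single-level (tag:weight) form; nested or escaped parentheses would need more care.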
Another problem is that the RNG seems to be tied to the GPU; at least, there's a problem recreating other people's images, which makes me think it's hardware-linked. (This might be discussed in the various papers, I haven't read them.) I've tried with quite a few images, using the included workflow, models, etc., but the results aren't even close to the same.
This "version"/method uses even less VRAM though. How much depends on the compression setting, but as an example, I'm currently running a 1840x3072 "image" and sampling uses around 2-3 GB. The latent -> image decoding obviously uses more, but a tiled decode can fix that if needed. Prompt execution time was 237 sec, so at that size, with 40 steps plus decoding, that's fairly good. I doubt I'd even be able to start an image at that size in SD or SDXL to compare.
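The low sampling VRAM makes sense given how aggressively the first stage compresses spatially. A back-of-the-envelope sketch (assuming the ~42x compression factor that ComfyUI's Cascade empty-latent node defaults to; the helper is mine and the rounding is a guess, so treat the exact numbers as approximate):

```python
import math

def stage_c_latent(width: int, height: int, compression: int = 42) -> tuple:
    """Approximate Stage C latent dimensions for a given image size.

    Cascade's first stage compresses spatially by a large factor
    (42 is the default in ComfyUI's Cascade empty-latent node),
    which is why sampling a huge image still fits in a few GB of VRAM.
    """
    return (math.ceil(width / compression), math.ceil(height / compression))

print(stage_c_latent(1840, 3072))  # -> (44, 74): a 1840x3072 image samples in a tiny latent
```

Compare that with SD/SDXL's 8x VAE compression, where the same image would sample in a 230x384 latent, which is a big part of the VRAM difference.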
As it stands, I'd probably let them finish things and work out the bugs before committing too much to using it. It's fairly new and in "early access", and it shows. I'm sure there are issues to be worked out, both in the core and in the various implementations, before things work as intended.
Seems like some people are using it to create "base" images and then refining/fixing them with existing SD/SDXL models and extensions. That's already being done between SD and XL (both ways) to take advantage of a model or extension in one to improve/control the result of the other.