Just to potentially bump up the "learning" part of this thread again.
As I've been doing a lot of repeated training lately, I thought I'd share some minor things that might help others train a bit faster.
Since I have a slow, old card with just 6 GB of VRAM, reducing memory needs has always been one of the priorities, which means that any sort of "speed" goes out the window.
When I started this endlessly long repeated training...sigh...it was running at 6.8 to 8.4 s/it...yes, that's the right way around...seconds per iteration...
which means a 10k step training would easily take >20 hours. That was without dropping into shared memory usage, purely running at 95-98% VRAM. The training I'm currently running, still on the same dataset, sits at 1.6 to 1.8 s/it. It's still "the slow way", but the difference is very noticeable. Now, if all of you with 30xx and 40xx cards are done laughing, this might help you as well. Obviously it's a bit hard for me to test, unless I'm somehow gifted a massive new computer or win some kind of lottery.
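(Quick math on that: 10,000 steps × ~7.5 s/it ≈ 75,000 s, or roughly 21 hours, while the same 10,000 steps at ~1.7 s/it come in at around 4.7 hours.)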
While these things mainly reduce memory needs, they can help those with much more VRAM too, since the freed memory might let you add more batches, which would speed things up greatly.
First off: buckets. Many guides and whatnot complain that bucketing is horrible without giving any reason why, or with some crappy excuse that suggests they don't even know how it works. The point of buckets here is that they let you reduce the number of pixels in most/all images, since you can crop out a lot of pointless background from tall/wide images. 100 images at 512x or 768x squares are a lot more to work through than the same 100 images with half of each cropped away because it was "empty space". Just remember to adjust the bucket settings so you don't get extra cropping or weird bucket assignments.
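If it helps to see that as numbers, here's a trivial sketch (plain Python; the resolutions and image count are made up purely for illustration):

```python
# Rough illustration of why cropping + bucketing matters:
# pixel count is roughly what each training step has to chew through.

square   = 768 * 768    # 768x768 square crop: 589,824 px
bucketed = 768 * 384    # same subject cropped to a 768x384 bucket: 294,912 px

images = 100
print(images * square)    # 58,982,400 px per pass over the dataset
print(images * bucketed)  # 29,491,200 px -> roughly half the work per repeat/epoch
```

(In kohya_ss / sd-scripts the relevant options are the bucketing ones, e.g. enable buckets, min/max bucket resolution and "don't upscale bucket resolution", if I remember the names right.)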
Second, the optimizer. If you're on low(er) specs you're probably already familiar with AdamW8bit, but there's a more "optimized" version called PagedAdamW8bit.
For me, with AdamW8bit I had to run training with both "gradient checkpointing" and "memory efficient attention", and training ran at a constant 5.8 GB. When doing sample images it would then offload to generate the image, which caused a noticeable lag; not major, but it was there.
Using the Paged version I can remove "memory efficient attention" and training runs at 4.4-4.6 GB. It only spikes up for sample images, and it doesn't offload or "reorganize" any memory usage to do it, so everything stays in memory, which speeds up sampling too. Not that sampling is a feature you need, but because ME attention is no longer needed and there's still VRAM to spare, everything gets faster.
(I haven't tested whether ME attention would allow +1 batch and whether that would be faster in total; I doubt it for me, though.)
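For anyone running the sd-scripts side directly rather than through the GUI: the paged optimizer comes from bitsandbytes, so switching is basically a one-line change. A minimal sketch in plain PyTorch (the model and learning rate here are placeholders, not my actual settings):

```python
import torch
import bitsandbytes as bnb

# Placeholder network; in real training this is the LoRA/model being trained.
model = torch.nn.Linear(768, 768).cuda()

# Regular AdamW8bit keeps its optimizer state pinned in VRAM.
# The paged variant backs that state with pageable (CPU-accessible) memory,
# so temporary spikes are less likely to force an offload or OOM.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-4)
```

In the kohya_ss GUI it should just be a matter of picking PagedAdamW8bit as the optimizer type instead of AdamW8bit (the exact spelling in the dropdown may differ slightly).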
Third small thing: Cache latents. It's generally checked by default in the kohya_ss GUI, but for some reason I've seen guides/training files turn it off. It might be because it keeps all the latents in memory and people haven't really thought about what that means. As a simple explanation: you keep a "version" of the image in memory instead of reading and re-encoding it each time. "Each time" in this case means every repeat and every epoch, but unlike many of the other "keep data in memory" options, this one is small, mostly less than 100 KB per image. So unless you're going way overboard with the number of images you use, this should not be the reason you OOM, and it does speed things up. Your mileage will obviously vary with your system. Don't "cache to disk", though.
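To put a rough number on "less than 100 KB per image": SD latents are the VAE output at 1/8 of the image resolution with 4 channels, so the per-image cost is easy to estimate. A quick sketch, assuming the latents are kept in fp16 (if they're stored in fp32 it's double that, still small):

```python
def latent_size_bytes(width, height, channels=4, bytes_per_value=2):
    """Approximate size of one cached SD latent (fp16): 4 channels at 1/8 resolution."""
    return channels * (width // 8) * (height // 8) * bytes_per_value

print(latent_size_bytes(512, 512))   # 32,768 bytes  (~32 KB)
print(latent_size_bytes(768, 768))   # 73,728 bytes  (~72 KB)
print(latent_size_bytes(768, 1024))  # 98,304 bytes  (~96 KB)
```

So even a few hundred cached images only add up to some tens of MB, which is why cache latents is rarely the thing that OOMs you.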
As a final note, I know these things worked very well for me on my low-spec old computer. In theory they should work for others as well, but the effect/impact will obviously depend on the hardware running it.
If you're already running at full speed they're unlikely to do much, but as I mentioned earlier, if they let you increase your batch count by 1 or more, it would make things faster even on those systems.
And if anyone bothered reading all this, Hi...