You definitely can. Look: CLIP interrogation with blip2-2.7b, in SDXL prompt mode.
You can also use your own dictionary for the interrogation, which can help.
I did see BLIP-2, but I'm guessing that nearly any CLIP model won't fit into my VRAM alongside it. It looks like an interesting tool that I wasn't aware of, though, and I might try it later. (EDIT: I could probably run it in FP8, assuming some of the 15GB FP32 model is offloaded when not in use.)
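For anyone following the FP8 guess above, here is a rough back-of-the-envelope sketch of how the weight size scales with precision. It assumes the 15GB figure is the FP32 weight size and that memory scales linearly with bytes per parameter; it ignores activations, the CLIP model, and framework overhead, so real VRAM use will be higher. The function name is made up for illustration.

```python
# Rough weight-memory estimate at different precisions.
# Assumption: the 15 GB quoted in the post is the FP32 weight size.
FP32_SIZE_GB = 15.0

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def estimated_size_gb(precision: str, fp32_size_gb: float = FP32_SIZE_GB) -> float:
    """Scale the FP32 size by the ratio of bytes per parameter (weights only)."""
    return fp32_size_gb * BYTES_PER_PARAM[precision] / BYTES_PER_PARAM["fp32"]


for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{estimated_size_gb(precision):.2f} GB")
```

By this estimate the weights alone would be about 7.5GB in FP16 and about 3.75GB in FP8, which is why offloading the model when it is not in use might still be needed on a smaller card.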
Time-wise, I can run BLIP-2 and append WD-14 tags on 332 2K images in a few hours.
If my times are the same as yours, then it would take 11 hours to run with that program.
EDIT: I'd be curious to see how that program tags the attached image, if you want to test-run it:
BLIP-2 with WD-14 (Onyx) appended -1girl
a woman sitting on a bench in a white tank top and red shorts with her hands on her hips, solo, long hair, breasts, looking at viewer, skirt, jewelry, medium breasts, sitting, purple hair, outdoors, pussy, choker, day, spread legs, miniskirt, clothes lift, mole, lips, uncensored, no panties, red skirt, skirt lift, sunglasses, tank top, lifted by self, building, mole on breast, watch, wristwatch, white choker
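In case it helps, the "caption plus appended tags" step can be sketched roughly like this. It is only an illustration, not the actual tooling: the function name `build_caption` is hypothetical, it assumes "-1girl" means the `1girl` tag was dropped, and it takes the BLIP-2 caption and the WD-14 tags as plain Python values rather than calling either model.

```python
def build_caption(
    blip_caption: str,
    wd14_tags: list[str],
    exclude: frozenset[str] = frozenset({"1girl"}),
) -> str:
    """Append tagger output to a natural-language caption.

    Excluded tags, empty tags, and duplicates are skipped; tag order is kept.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for tag in wd14_tags:
        tag = tag.strip()
        if not tag or tag in exclude or tag in seen:
            continue
        seen.add(tag)
        kept.append(tag)
    # Caption first, then the surviving tags, comma-separated.
    return ", ".join([blip_caption.strip(), *kept])


# Hypothetical sample input, shortened for the example.
print(build_caption(
    "a woman sitting on a bench",
    ["1girl", "solo", "long hair", "outdoors", "sunglasses", "solo"],
))
```

Keeping the caption first and the tags after it matches the layout of the result posted above; the exclusion set is just one way to drop tags you do not want in the training text.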