Tbh, we are so close. Gooning with Gemini (directly or using it as a proxy on JAI/Chub) gets you some quality smut that you can direct exactly as you desire. It's not even comparable to the 8k-context-window crap and dumb LLMs of the early days of NSFW chatbots.
The only thing we really need right now is image integration. I know various diffusion models have come a long way, and I'm sure some mad lads using SillyTavern to maximum effect can probably already illustrate their gooning chats pretty well. But for ease of use, a truly multimodal input/output model is what we are waiting for.
Something that can write your long-form slow-burn corruption arcs, then do individual voice acting for each character, and then illustrate the scenes with consistent style and characters whenever you ask. Google basically has all the parts: the LLM, the voice model, NanoBanana.
I'm pretty sure this kind of generalized approach is what all companies are aiming for, as it's the path towards "AGI". But man, Western companies won't ever loosen the Christian filters on image/video generation. Gotta hope China can step up.