This is straight out of science fiction! The idea is truly brilliant. Honestly, I've been dreaming of something like this for a long time, but even I knew technology hadn't quite reached this point, at least not a year ago. If I understand correctly, this game concept doesn't include pre-made images, sounds, etc. Instead, everything, including the plot, is essentially generated based on the player's choices? If so, how would it work? Does it require entire servers, or if it's local, does it require a powerful modern PC to run this kind of game?
Yes, that is correct. No premade dialogs, images or videos. Some people are already doing this partially in SillyTavern, so the tech exists and runs locally. But it can be quite slow because models have to be loaded, unloaded and offloaded between steps.
There are two ways to do it today: open-source models run locally, which gives full freedom but is slow; or closed-source models on servers, which is faster but limits the content.
Possibly the best way to do it is with servers, to guarantee that it's playable without a 15-20 minute wait for every scene, but then it can only generate legal content.
Grok is facing lawsuits right now for turning people's photos into bikini pictures of them on Twitter/X; even without nudity it is a problem. So when a game is made with NSFW content, it can't include real people, and one needs to be careful about character ages, looks, etc.
Locally it's possible, but it is very slow and does indeed require a powerful PC. For example, to generate a scene it needs to generate the text for the picture (load model + understand context + write text: ~30 seconds), generate an image (load model + generate + upscale for quality: ~25 seconds), then unload and generate video (load models, generate the video in low quality: ~200 seconds), then unload models and generate audio (load models, generate audio for the video: ~25 seconds). So, in total, it would take about 6 minutes to generate an okay 5-second scene, even on a very powerful computer.
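To make the arithmetic concrete, here is a rough sketch that just adds up the stage times quoted above. The four listed stages sum to 280 seconds, a bit under 5 minutes; my assumption is that the rest of the roughly 6 minute figure is the unloading and offloading between stages, which has no number of its own in the post. The names `STAGE_SECONDS`, `sequential_total` and `generation_ratio` are made up for illustration.

```python
# Approximate per-stage times quoted in the post, in seconds.
# Local setup: only one model is loaded at a time, so stages run back to back.
STAGE_SECONDS = {
    "text": 30,    # load LLM + read context + write the prompt text
    "image": 25,   # load image model + generate + upscale
    "video": 200,  # load video models + low-quality video
    "audio": 25,   # load audio models + audio for the video
}


def sequential_total(stages):
    """Total time when every stage has to wait for the previous one."""
    return sum(stages.values())


def generation_ratio(total_seconds, clip_seconds):
    """How many seconds of waiting per second of finished scene."""
    return total_seconds / clip_seconds


total = sequential_total(STAGE_SECONDS)
print(f"listed stages: {total} s ({total / 60:.1f} min)")
print(f"wait per second of a 5 s clip: {generation_ratio(total, 5):.0f} s")
```

The video stage is about 70% of the listed time, so that is where a faster model or an always-loaded server helps the most.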
It gets faster when you have different servers for different tasks (one for language or chat, one for image, one for video, one for audio) running in parallel with the models and weights always loaded. Then an HD 5-second scene could take around 2 minutes to generate fully.
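As a rough illustration of the multi-server idea, here is a minimal sketch where each stage is a long-lived worker that never unloads its model and just passes scenes along a queue. Everything in it is a stand-in: the workers only sleep instead of calling a real model, the durations are the local numbers scaled down by 1000 so it runs quickly, and `worker` and `run_pipeline` are hypothetical names. It also assumes the next scene can start while the previous one is still in the video stage, which is my reading of "running in parallel", not something measured.

```python
import asyncio

# Hypothetical stage durations in seconds (post's local numbers divided by 1000).
STAGES = [("text", 0.03), ("image", 0.025), ("video", 0.2), ("audio", 0.025)]


async def worker(name, seconds, inbox, outbox):
    """Stand-in for one always-loaded server: take a scene, 'generate', pass it on."""
    while True:
        scene = await inbox.get()
        if scene is None:           # shutdown marker: forward it and stop
            await outbox.put(None)
            return
        await asyncio.sleep(seconds)  # placeholder for the real model call
        scene["done"].append(name)
        await outbox.put(scene)


async def run_pipeline(scene_count):
    """Push scenes through text -> image -> video -> audio workers."""
    queues = [asyncio.Queue() for _ in range(len(STAGES) + 1)]
    tasks = [
        asyncio.create_task(worker(name, secs, queues[i], queues[i + 1]))
        for i, (name, secs) in enumerate(STAGES)
    ]
    for scene_id in range(scene_count):
        await queues[0].put({"id": scene_id, "done": []})
    await queues[0].put(None)

    finished = []
    while True:
        scene = await queues[-1].get()
        if scene is None:
            break
        finished.append(scene)
    await asyncio.gather(*tasks)
    return finished


scenes = asyncio.run(run_pipeline(3))
for scene in scenes:
    print(scene["id"], scene["done"])
```

With this layout the slowest stage (video) sets the pace once the pipeline is full, so the other servers mostly sit idle waiting for it.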
As new models come out with audio already merged in, and even the image merged in, you can lower that to about 60 seconds, which is already playable. With LTX-2 I've been able to make 20-second scenes in about 1 minute 30 seconds.
Some people who are into AI and reading this will likely mention Gaussian or other tech on the market for real-time generative worlds, but that will be very hard to do with full freedom before 2027.
The Grok team mentioned that they'll release tech for this in 2026, so we'll see where it goes...