There seems to be an aspect of this that is being overlooked in the conversation.
So far the conversation has been about rendering with programs like Daz 3D or Blender and then bringing those renders into the VN.
I'm not sure that's the best solution. Even when scenes and objects are reduced and optimized, they take a long time to render, yet the resulting images aren't a great deal better than what high-end video games produce today. Scene complexity can actually be lower.
People see less detail in moving images than in stills, because they have less time to observe it. So is it really worth putting that extra time and effort into these kinds of renders?
The real issue, then, is: are we using the best method of providing animations?
My understanding is that Ren'Py can actually display 3D with OpenGL.
However, I don't think it will reach the quality and detail level of, say, Lara Croft. Ren'Py has a number of performance bottlenecks of its own, and on top of those come any the game developer might introduce in their own code.
I think the real issue is more along the lines of trying to add animation to a game built on the wrong engine.
Just because you can do something doesn't mean what you are using is the best tool for the job.
By the same token, just because something is easier to set up doesn't make it the best tool for the job either.
Update 1/30/2020: just adding more.
Take the image above: I find it disingenuous. If you look at the table in the raster version, it is really dark. The problem is that they didn't adjust the material properly. You can also see they don't have shadows turned on. I'm not entirely sure whether that is down to the person who created the image or to the limited hardware they were working with. Rasterizers have Gouraud and Phong shading, shadow volumes and soft shadows, and blending has various methods as well.
If you search for rasterization vs. ray tracing, most of the issues shown in similar images, be it lighting, shadows, or blending, have been improved drastically in rasterized rendering.
The biggest area where ray tracing has a real advantage is accurate reflections. Reflections can be mapped onto surfaces, but they aren't the same.
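To show why the choice of shading method matters, here is a minimal sketch in Python comparing Gouraud and Phong shading along one edge between two vertices. Both use the same lighting formula (Lambert diffuse plus a Blinn-Phong specular term, chosen here as an assumption); the only difference is whether lighting is computed at the vertices and blended (Gouraud) or the normals are blended and lighting is computed at every point (Phong). All function names are hypothetical, not from any engine.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lerp(a, b, t):
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def shade(normal, light_dir, view_dir, shininess=32):
    """Lambert diffuse plus Blinn-Phong specular at a single point."""
    n = normalize(normal)
    l = normalize(light_dir)
    v = normalize(view_dir)
    h = normalize(tuple(a + b for a, b in zip(l, v)))  # half vector
    diffuse = max(dot(n, l), 0.0)
    specular = max(dot(n, h), 0.0) ** shininess
    return diffuse + specular

def gouraud(n0, n1, t, light_dir, view_dir):
    # Light only the two vertices, then blend the resulting intensities.
    i0 = shade(n0, light_dir, view_dir)
    i1 = shade(n1, light_dir, view_dir)
    return i0 + (i1 - i0) * t

def phong(n0, n1, t, light_dir, view_dir):
    # Blend the normals, then light the point itself.
    return shade(lerp(n0, n1, t), light_dir, view_dir)

# Two vertex normals tilted 45 degrees away from a light shining straight on.
n0, n1 = (-1, 0, 1), (1, 0, 1)
light = view = (0, 0, 1)
print(gouraud(n0, n1, 0.5, light, view))  # about 0.71: the highlight is missed
print(phong(n0, n1, 0.5, light, view))    # 2.0: the highlight shows up
```

A highlight that falls between vertices simply disappears with Gouraud shading, which is the kind of artifact that makes an unfairly configured raster image look worse than it needs to.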
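Mapped reflections usually mean environment mapping: the view ray is reflected off the surface and used to look up a pre-rendered image of the surroundings, often a cube map. The Python sketch below shows the two core steps, reflecting a direction about a normal (r = d - 2(d·n)n) and picking which cube face that direction points at. It assumes a Y-up scene, and the face labels and helper names are hypothetical. Because the cube map is rendered ahead of time, it won't show characters moving in the scene, which is exactly why it isn't the same as ray-traced reflections.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reflect(d, n):
    """Reflect incident direction d (eye toward surface) about unit normal n."""
    k = 2.0 * dot(d, n)
    return tuple(di - k * ni for di, ni in zip(d, n))

def cube_face(r):
    """Pick the cube-map face a direction points at: its largest absolute axis."""
    axis = max(range(3), key=lambda i: abs(r[i]))
    sign = "+" if r[axis] >= 0 else "-"
    return sign + "XYZ"[axis]

# Looking down at a shiny floor (normal +Y) at an angle.
r = reflect((0.6, -0.8, 0.0), (0.0, 1.0, 0.0))
print(cube_face(r))  # +Y: the floor reflects the upper part of the environment
```

A full implementation would also convert the direction into texture coordinates on that face; the face selection alone is enough to show that the lookup never traces the actual scene.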
Humans can only perceive so much detail at one time, or within a given length of time. Unless your characters are screwing in a hall of mirrors, ray tracing every frame of an animated sequence is overkill. It's needed for large-production movies shown on big screens, and even then most of the detail is missed. Take the movie Blade Runner: most people who have watched it know there is a lot of detail in it, but most have no real idea exactly how much unless they've watched the show about its making.
A game engine can render an optimized scene at over 60 fps in many cases. You could capture that output and have a hell of a lot more time to redo it if you choose to correct something.
That said, if you are going to take the time to port the models and scenes into a game engine, it stands to reason: why not use the game engine to make the game rather than trying to render to video?
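To put the time difference in numbers, here is a rough back-of-the-envelope sketch in Python. The clip length, frame rate, and the 2 minutes per frame for an offline render are purely hypothetical figures chosen for illustration, not measurements from any particular renderer.

```python
def frame_budget_ms(fps):
    """Milliseconds available per frame to sustain a given frame rate."""
    return 1000.0 / fps

def total_render_hours(clip_seconds, fps, seconds_per_frame):
    """Wall-clock hours needed to render every frame of a clip."""
    frames = clip_seconds * fps
    return frames * seconds_per_frame / 3600.0

# Hypothetical clip: 10 seconds at 30 fps, i.e. 300 frames.
offline = total_render_hours(10, 30, seconds_per_frame=120)    # assumed 2 min per frame
realtime = total_render_hours(10, 30, seconds_per_frame=1 / 60)  # engine running at 60 fps
print(f"{frame_budget_ms(60):.1f} ms per frame at 60 fps")  # 16.7 ms per frame at 60 fps
print(f"offline: {offline:.1f} h, real-time capture: {realtime * 3600:.0f} s")  # offline: 10.0 h, real-time capture: 5 s
```

Under those assumptions a fix that means re-rendering the clip costs an overnight render in one workflow and a few seconds in the other.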
There are a number of topics that can help you understand my point of view:
Digital images vs. human perception.
Limits of human perception. This one covers a lot of ground; just focus on the parts related to vision, since that's really what this is about.