This is just my opinion, but wouldn't it be better to focus on the story and do episodes that are closed by action frames? Instead of rendering the whole city, you could work on a smaller scale. You would gather a considerable number of supporters, could hire more employees, and then take on a larger venture.
As far as the workflow is concerned, that's not a good idea.
If you work on a project where you have to fill multiple roles (graphic design, story design, etc.), you want to stay within each role for as long as possible.
That's because it helps you get focused and keeps you familiar with your tools.
Every time you need to jump from one tool to another (e.g. stop programming and writing in Ren'Py to do some art stuff), it's a big disruption: you need to concentrate on a different workflow, and everything goes much slower because you have to get back into the routine again. That's especially true for tools you haven't absolutely mastered, where you still have to work out solutions to problems as you go.
Think of it as countless rocks in your way that you keep stumbling over.
Your creative ideas are sending you from A to B, but you keep stumbling into obstacles that say: you can't do that now, do X first.
If I were to guess what the dev thought, it'd probably be something along these lines:
I'll focus on the city, gather all the assets, and place them so that I've got one great-looking asset that I can use in multiple ways in the future, without having to worry about things not matching up. I might have to make minor adjustments here and there for some scenes, but with this, the city's done and I can go back to other stuff.
Now, if he decides that the first shots of the city when coming across it should be bird's-eye shots or shots from up high? No problem. If he decides to have a shot from high up in one of the buildings? No problem.
The other way, he might be able to progress faster at first, but then it's stumbling time, because everything has to be specifically crafted for every instance. You can get into situations where assets and scenes don't match up, because they're not part of the same scene but entirely separate scenes. That ends up causing continuity errors where building or vehicle positions don't match up or shot angles don't work, or you have to constantly check all the other scene parts you created to get an idea of what should be in the newest shot based on all the previous stuff.
Say the characters are at the mall, looking out the window. To know what they see, you need to go through everything you have made and try to reassemble the right view from this particular angle.
Wouldn't it be great if, instead, you'd just have to position the camera on a finished asset, get the view, then frame it through a window before going back to writing?