Ren'Py Dictionary - Creation and writing to and reading from a file instead of memory

Sancho1969

Message Maven
Modder
Donor
Jan 19, 2020
12,382
48,852
anne O'nymous: Mentor, I need some advice. Rather than a PM I thought others might benefit from a brief discussion if you wouldn't mind.

Problem/Issue:
For over a year I've been battling the creation of a very large dictionary which, if I don't keep it in restraint, can significantly increase game load times; it's just not efficient given how expensive large dicts are to create. For others that might see this thread, know that yes, I've worked with different variants of creating the dictionaries to minimize the time, but the resulting difference is minimal at best. For example, I've used the following with nearly identical results:
  • {k: k for k in it}
  • dict.fromkeys(it)
  • dict(zip(it, it))
  • etc.
Because I'm a redneck who probably drinks too much and does some crazy things in some of my mods, I thought to myself: "why not dump this massive dictionary to a file that I create and then package with the product, so it doesn't have to be rebuilt by the player every time they execute the game?" So, I set out to do just that... if the file doesn't exist I create it (also in case the player "accidentally" deletes it), and if it does exist then the file is read instead of the dict being created. Should be simple, right? Well, I hit some bumps in the road, but I think I'm finally on to something. Here's the rub:
I'm now not sure if I'm wasting my time or on to something just very slightly brilliant? :ROFLMAO:

The low-down: the dictionary in question consists of tupled lists that serve as keys against another dictionary of string items (that one is significantly smaller, so no worries there), all used in another function whose purpose is not relevant to the topic... just know it's required and I'm not insane... maybe. AON actually helped me create the beast, but it's grown into a monster over the last year or so. Enough yammering... onward.

Current code that creates the dict as normal (in memory at initialization). This creates a sorting similar to using the itertools.combinations generator in succession:
[spoiler: code not captured]

Current possible solution that creates the dict and dumps it to a txt file; once that's done, on subsequent game starts the required dict info is simply read from the file rather than recreated in memory, eliminating the startup expense. This works but is not yet implemented... am I nuts for going down this path?:
[spoiler: code not captured]

Example "myKeys" output if file doesn't exist (from print):
[spoiler: example output not captured]
Example "myKeys" input if file does exist (from print):
[spoiler: example output not captured]

I've never done this before in Python, much less incorporated it into Ren'Py, so I would appreciate your thoughts. This seems feasible since subsequent game starts don't have to spend the expense of rebuilding the massive dictionary each time. You've told me before that if it works then it works, but I'm just making sure I've not started thinking a little too far outside the box. Your insight is always appreciated.

Regards and hope all has been well with you and yours.
 

anne O'nymous

I'm not grumpy, I'm just coded that way.
Modder
Donor
Respected User
Jun 10, 2017
10,967
16,215
Current possible solution that creates the dict and dumps to a txt file [...] am I nuts for going down this path?:
Strictly speaking, no, though relying on pickle rather than on a pure text file would probably be easier. And also probably a bit faster at load time.
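For reference, a minimal sketch of that cache-or-build pattern with pickle (the file name and builder function here are hypothetical stand-ins; in an actual Ren'Py mod you would likely build the path from config.gamedir):

```python
import os
import pickle

CACHE = "sm_keys.pickle"  # hypothetical cache file name

def build_dict():
    # stand-in for the expensive combination-dict construction
    return {str(i): str(i) for i in range(1000)}

def load_or_build():
    # read the cached copy if it exists...
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    # ...otherwise build it once and cache it for the next game start
    d = build_dict()
    with open(CACHE, "wb") as f:
        pickle.dump(d, f, pickle.HIGHEST_PROTOCOL)
    return d
```

The second call hits the file instead of rebuilding, which is exactly the "create if missing, read otherwise" behavior described above.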


But the question is: Is this design really the best approach ?

You said that "the dictionary in question consists of tupled lists to serve as keys against another dictionary of string items", with contents that looks like "['str_A', 'str_D', 'str_M']: ['str_D', 'str_A', 'str_M']".
Don't you have another approach to match the two? At some point, and I think you've reached it, it's sometimes better to have more code than a ton of data. It eases memory use and spreads the computing time over a long run, making it less noticeable.


This being said, there's already a way to reduce the amount of memory needed and make this a bit faster. You can rely on an intermediary list, something like equivalence = [ "str_A", "str_B", "str_C", "str_D", ... ],
then rely on the index of each string to reduce the size of the dict. Instead of "['str_A', 'str_D', 'str_M']: ['str_D', 'str_A', 'str_M']", you would have "[1, 4, 13]: [4, 1, 13]".
When you get "['str_A', 'str_D', 'str_M']" as input, you translate it:
Code:
res = []
for s in input:
    res.append( equivalence.index( s ) )
Then you use res to find the right entry in the dict, which gives you the output, which you just have to translate back into strings:
Code:
res = []
for i in output:
    res.append( equivalence[i] )
It's a bit slower since you need to pre-process and post-process the data, but you gain a bit of time while browsing the dict, which compensates. And in exchange the size of the dict should be drastically smaller.

Starting there, there's another approach that you could use:
Code:
res = ""
for s in input:
    res += "{:03d}".format( equivalence.index( s ) )
res = int( res )
You go through "001004013", then transform it into an integer, 1004013, that will be unique for this combination. This should make dict access faster. You can keep the value as a tuple of numbers, that's not a problem (and it's easier to revert back to strings).
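A runnable sketch of that integer-key encoding, with a hypothetical equivalence list (note the {:03d} scheme assumes fewer than 1000 distinct strings, and the decoder has to re-pad the leading zeros that int() drops):

```python
equivalence = ["str_A", "str_B", "str_C", "str_D", "str_M"]  # hypothetical master list

def encode(strings):
    # ["str_B", "str_D"] -> "001003" -> 1003
    return int("".join("{:03d}".format(equivalence.index(s)) for s in strings))

def decode(key, count):
    # int() drops leading zeros, so re-pad to count * 3 digits before slicing
    digits = str(key).zfill(count * 3)
    return [equivalence[int(digits[i:i + 3])] for i in range(0, len(digits), 3)]
```

For example, encode(["str_B", "str_D"]) yields 1003, and decode(1003, 2) recovers ["str_B", "str_D"].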


But this doesn't really solve the creation issue. It should still be a bit faster, because it uses less memory and the keys are now more basic, but it would still need time, too much time. Therefore, isn't there some kind of recurring pattern you can rely on ?

I haven't looked at all the keys, but in your example it seems that when the key is "str_A, str_D, [whatever]", the result is always "str_D, str_A, [whatever]". There are possibly other patterns like this, with perhaps something like: if the first string is "str_X", then invert the two others.

You could then rely on three dicts:
Python:
# if first is "str_X", invert second and third
# if first is "str_Y", invert first and third
oneKey = { "str_X": ( 1, 2 ), "str_Y": ( 0, 2 ), ... }
# if first is "str_A" and second is "str_D", invert first and second
twoKeys = { ( "str_A", "str_D" ): ( 0, 1 ), ... }
# The few cases that don't match or are exceptions.
threeKeys = { ( "str_X", "str_A", "str_C" ): ( "str_A", "str_C", "str_X" ), ... }

# Start with the exceptions:
if input in threeKeys:
   output = threeKeys[input]
# then try the more generic cases:
elif input[:2] in twoKeys:
   output = invert( input, *twoKeys[input[:2]] )
# And finally try the fully generic cases:
elif input[0] in oneKey:
   output = invert( input, *oneKey[input[0]] )
else:
   raise RuntimeError( "Oops..." )
With the help of this:
Code:
def invert( input, first, second ):
    input = list( input )    # tuple are immutable, list can be updated.
    tmp = input[first]
    input[first] = input[second]
    input[second] = tmp
    return tuple( input )   # revert back to a tuple
You can go up to more than just 3 strings; just ensure that you proceed in descending order: always the real exceptions first, and the fully generic last.
This way, if you have "starting with str_X means invert second and third, unless the second is str_W", the code will be the equivalent of:
Python:
if input[:2] == ( "str_X", "str_W" ): # the twoKeys dict matches.
   invert first and second
elif input[0] == "str_X":   # before you even try the oneKey dict, which would have matched.
   invert second and third
This would need more pre-work, since you have to identify all the patterns. But in the end, it would drastically reduce the number of entries you need.

Let's say there are 26 strings, and that effectively all the combinations starting with str_X just need a permutation of the second and third strings. Currently you would need 26*26 entries in your dict, for all the possible combinations of the second and third strings. By relying on a generic approach, you would reduce this to 1 entry in the right dict.
And if there's effectively the "str_X, str_W" exception, it would go up to 2 entries... still in place of the 26*26 ones.

Even if there are many exceptions, every pattern that you can identify and isolate would represent a significant reduction in the size of your dicts, and therefore a significant reduction in the time needed to create them.
You wouldn't be able to automate this creation anymore, but that would be less of a problem since there should be way fewer entries to create.
 

Sancho1969

Thank you for the reply. I will think about this possible pattern use but believe they are random... I need to confirm. Obviously the example I gave is extremely small just to get the point across, as I don't hit a wall until 5+ combos with 40+ keys, but depending on the project I'm hitting that wall more and more lately, so I wanted to revisit this issue.

Honestly it's come a long way since you first helped me get out of the concept rut over a year ago, when we started by manually populating the keys/items dictionaries used in various functions in this particular application. It's now matured to be 100% automated, but like you said, it's likely time for me to find a way to break the data down. On the bright side, at least the custom generators used in the automation are extremely fast, as that was once the bottleneck... now it's back to this, full circle.

You always give a few ways to think about tackling an issue, and it is greatly appreciated, especially when "tunnel-vision" sets in. I'll test some ideas you've planted with your post and will revisit this thread with progress so others might benefit. Thanks again for your time and input, sincerely.
 

shark_inna_hat

Active Member
Game Developer
Dec 25, 2018
705
2,765
Are you sure you know how many items you are working with?
If you have 50 items then there are 2,118,760 possible unique 5-element combinations of those 50 items. Adding just one more item gives you 230,300 more combinations; adding 10 yields an additional 3,342,752 (for a total of 5,461,512). The number of 6-element combinations of 50 items is 15,890,700. These numbers grow ridiculously fast.
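Those counts are easy to verify with math.comb (Python 3.8+, so not available in the older Python that Ren'Py shipped at the time, but handy for a sanity check):

```python
from math import comb

print(comb(50, 5))                # 2118760 unique 5-element combinations
print(comb(51, 5) - comb(50, 5))  # 230300 extra combinations from one more item
print(comb(60, 5))                # 5461512 total after adding ten items
print(comb(50, 6))                # 15890700 six-element combinations of 50 items
```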

I doubt that you actually need all these values at all times, so rethinking how you use and store the data might be the fastest solution in the long run, but I don't know how or why you need this, so no help from me here ('sparse multidimensional arrays' might be fun to google).

If you're sure you NEED all that data and you think serializing it might be faster than building it from scratch: marshal!
marshal is Python's low-level, secret-ish format/module for serializing stuff fast. Python uses it for bytecode, and it's quite limited in what it can store, but it can handle dicts just fine.

I ran a small test like this:
Python:
from itertools import combinations
from time import perf_counter
import marshal

print('starting...')

def combolistsort( k, l ):
    return sorted([' '.join(i) for i in combinations(k, l)])

def combolist( k, l ):
    return [' '.join(i) for i in combinations(k, l)]

k = [str(i) for i in range(50)]

print('creating...')

t0 = perf_counter()

smKeys = {}
for r in range(1, 6):
    # keys are the sorted combination strings, values the unsorted ones;
    # keys are disjoint across lengths, so update() is equivalent to the
    # original repeated dict( zip( ... ), **smKeys ) merging
    smKeys.update( zip( combolistsort( k, r ), combolist( k, r ) ) )

print('total creation time:', perf_counter()-t0)

print('dumping...')
with open('sm_keys.dat', 'wb') as outfile:
    marshal.dump(smKeys, outfile)

smKeys = None
print('loading...')
t1 = perf_counter()
with open('sm_keys.dat', 'rb') as infile:
    smKeys = marshal.load(infile)

print('total load time:', perf_counter()-t1)
Now I don't know how close this is to your actual code, but deserializing the dict is... about 2x slower compared to just making it in memory. Still might be worth a shot (?)

EDIT:
Small update: forget about marshal. It's way faster when dumping to the hard drive, but plain old pickle (cPickle, I guess, for the ancient Ren'Py Python version...) is ~40-60% faster at loading the file compared to generating the dict in memory. Mileage may vary depending on Python version.
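For anyone who wants to reproduce the pickle side of this, a self-contained sketch (toy data and a hypothetical file name; real timings depend entirely on your dict and hardware):

```python
import pickle
from time import perf_counter

# toy dict standing in for smKeys: 90,000 string -> string entries
data = {"%d %d" % (a, b): str(a) for a in range(300) for b in range(300)}

t0 = perf_counter()
with open("sm_keys.dat", "wb") as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)  # highest protocol is the fastest
dump_time = perf_counter() - t0

t0 = perf_counter()
with open("sm_keys.dat", "rb") as f:
    loaded = pickle.load(f)
load_time = perf_counter() - t0

assert loaded == data
print("dump: %.3fs  load: %.3fs" % (dump_time, load_time))
```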
 
Last edited:

anne O'nymous

I doubt that you actually need all these values at all times, so rethinking how you use and store the data might be the fastest solution in the long run [...]
I second that. So far you have more than two million entries in your dict, while the average player will use, what, 2,000 of them? That means only 0.1% of your dict will effectively be used. At this level, you can say that your dict is needed only to simplify your code, but as data it's mostly useless.


On the bright side at least the custom generators used in the automation are extremely fast as that was once the bottleneck... now it's back to this, full circle.
Before this, you said that you believe there's no pattern and that the sequences are random. Yet you've found a way to write generators. And that's all you effectively need; I'm convinced the answer is in those generators.

I mean, it's code that maps all the possible equivalences between a value you get and a value you'll use. However it does it, and however complex the code is, if it can generate all the equivalences, it can also generate one single equivalence.
The only difference between the actual code and the code you need is that your generators work blindly: they process and return all the equivalences, while the code you need should discriminate in order to return only the right equivalence. But that's not something too difficult to change; "too difficult" being relative, it depends on the effective complexity of your code.

Let's say that your code relies on loops (because it's easier to explain). This is how you change a generator into code that returns the direct equivalence:
Python:
def keys( input ):

    for i in [ whatever ]:
        # It's not the first value of our input, so we don't care
        if i != input[0]:
            continue
        # It is the first value of our input, we want to know what comes next
        for j in [ whatever ]:
            # It's not the second value of our input, so we don't care
            if j != input[1]:
                continue
            # It is the second value of our input, we want to know what comes next
            [...]
And the same principle works whatever the way you generate the equivalence.
As long as the working value (i, j, etc.) doesn't match the entry at its level (input[0], input[1], etc.), you just pass to the next working value, without trying to go further, simply because you know that what comes further doesn't interest you right now.
Since the code of your generators is extremely fast, and would be even faster if you skip 99% of the possible equivalences, it seems to be the best approach.
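As a rough illustration of the idea with plain itertools (the pool and inputs here are hypothetical): scan the combinations lazily and stop at the first match instead of materialising a dict of all of them. The nested-loop version above prunes even earlier, but even this lazy scan avoids the up-front build entirely:

```python
from itertools import combinations

def find_index(pool, match):
    # walk the combinations lazily and stop at the first hit, instead of
    # materialising a dict with all C(n, r) entries up front
    target = tuple(sorted(match))
    for idx, combo in enumerate(combinations(sorted(pool), len(match))):
        if combo == target:
            return idx
    return None

pool = ["str_%s" % c for c in "ABCDEFGH"]    # hypothetical key pool
print(find_index(pool, ["str_D", "str_A"]))  # 2: ('str_A', 'str_D') is the third pair
```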


You always give a few ways to think about tackling an issue and is greatly appreciated, especially when "tunnel-vision" sets in.
Oh, there's a secret behind this... I spent decades stuck in my own tunnel-vision issues. I'm not more intelligent than others, I've just failed more often ;)

It's the same for you with this issue. However you finally solve it, the time you spent stuck on it will be a strong reminder that a full set of data isn't always the best solution. Next time, you'll start with a full set of data, and at some point you'll have this "wait, it feels like last time..." thought. You'll stop right there and go back to the start, looking to see if there isn't another solution you can use.
And with time, you'll know a ton of alternate solutions, and use them right from the start.
 

Sancho1969

Are you sure you know how many items you are working with?
...
I second that. So far you have more than two million entries in your dict, while the average player will use, what, 2,000 of them? That means only 0.1% of your dict will effectively be used. At this level, you can say that your dict is needed only to simplify your code, but as data it's mostly useless.

Before this, you said that you believe there's no pattern and that the sequences are random. Yet you've found a way to write generators. And that's all you effectively need; I'm convinced the answer is in those generators.
...
I get it, it seems odd, but yes, the data is accessed many, many times throughout the player's progress, so it's needed, but... AON is now thinking along the lines of what my plan B has been: to either change my existing custom generator functions (slightly modified versions of the itertools.combinations code) or create new ones to call for the info as needed, rather than creating the "smKeys" dict to begin with. Side note: "marshal" reminds me a bit of the "csv" dump method... I forgot about that too. But speaking of it I can almost see AON starting to cringe :ROFLMAO: although I see what you're suggesting.

The basis is this: the large dict (smKeys) is used to compare the strings found in passed keyword arguments (kwargs) and analyze them to get the correct combination. It then compares the ordered (sorted) kwargs (they must be ordered, hence why the dict is set up like it is) against another dict of stringed tuples that has the needed values... which is used in a loop in another function:
a = [ smObjExt[i] % kwargs.get(i) for i in smKeys[str(k)] ]
I'll place function with this loop line along with the generators below for reference.
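For readers without spoiler access, here is a toy illustration of how that loop line resolves (every name and value below is made up for the example, not the real data):

```python
# hypothetical sample data mirroring the described structures
smObjExt = {"str_A": "<A:%s>", "str_D": "<D:%s>"}    # per-key format strings
smKeys = {"['str_A', 'str_D']": ["str_D", "str_A"]}  # sorted key -> required order
kwargs = {"str_D": "x", "str_A": "y"}                # arbitrary-order arguments

k = sorted(kwargs)  # canonical (sorted) key names, stringified for the lookup
a = [smObjExt[i] % kwargs.get(i) for i in smKeys[str(k)]]
print("".join(a))  # <D:x><A:y>
```

The sorted kwargs names select the entry, and the entry's value dictates the order in which the formatted pieces are emitted.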

Side note: AON actually helped me develop the code theory over a year ago, but he didn't know at the time how beastly the number of combinations might get (tbh, I didn't either, but it was so useful I went bonkers with it :) ), and we didn't use generators during the idea phase but rather manually populated dicts and lists with a small sample set to get the theory sorted.

For additional insight, here's most of the core (I don't think I missed any for this particular use case):
[spoiler: code not captured]
I do realize I'm overthinking it, truly. I've been so set on this dict method for so long it's like a ball-and-chain, tbh. Originally the method was much simpler, but I kept condensing it and found myself needing the other lists/dicts elsewhere, so I didn't want to break everything else in the massive project. Again, I don't need the "smKeys" dict anywhere else in the project, only for this particular use case, so I'm free to experiment. I do need "k" and "smObjExt" elsewhere, but those are the list and dict with only ~40-60 entries max, so no worries there.
 

anne O'nymous

For additional insight, here's most of the core [...]
I'm a little blind regarding the data themselves, but I'll write my thoughts based on the code.

Note: I'll write part of my comment directly in the code, to explain how I interpret its meaning. This way, you'll more easily see if I got something wrong.
Python:
#  Extend a seed according to some current values in game. Which of those values will be
# used depends on the context; sometimes it can be str_A, str_B, str_C, sometimes str_X, str_A, str_Z,
# and you have no way to know it beforehand.
#  Since there's no "**" in the declaration, I guess that you get the dict directly, and so have
# next to no control over it.
def seedext(seed, kwargs):
        # smKeys are ordered, so order the arguments names, in order to find a possible match
        k = list (kwargs.keys ())
        k.sort()
        # There's no match, so do not extend the seed.
        if not str(k) in smKeys:
            return seed
        # The extended value must be presented in a particular order, found in smKeys.
        a = [smObjExt[i] % kwargs.get(i) for i in smKeys[str(k)]]
        return seed + "".join(a)
So, the main issue you're facing is to find in which order the kwargs values should be used. And since dicts are not ordered (side note: insertion order is only guaranteed as of Python 3.7), you really need a reference order.



It's a raw alteration that I wrote on the fly. I don't guarantee that it will effectively work like that, but it's where I would start myself when facing such a situation.
Side note: yeah, I'm really dirty when I have to update someone else's code. I first alter the code in a way that seems logical, and only then start to think about how it integrates with what was originally written. Do not reproduce this at home, I'm a professional stuntman ;)

Python:
#  /match/ is the occurrence you actually have. It's the /k/ of your /seedext/ function.
#  Will now return an integer. This integer is the index of the value you need.
#  Initially you mapped together two lists, one of keys and one of values. Now, you
# just care about the index where the value you want is located.
def combolistsort(iterable, r, match):
    myPool = list(iterable)
    n = len(myPool)
    if r > n:
        return
    indices = list(range(r))
    myYield = str(sorted(list(myPool[i] for i in indices)))
    #yield myYield
    if myYield == match:  # <-
        return 0          # <-
    count = 1             # <-
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        myYield = str(sorted(list(myPool[i] for i in indices)))
        #yield myYield
        if myYield == match:  # <-
            return count      # <-
        count += 1            # <-
Python:
#  /index/ is the value now returned by /combolistsort/.
def combolist(iterable, r, index):
    myPool = list(iterable)
    n = len(myPool)
    if r > n:
        return
    indices = list(range(r))
    myYield = list(myPool[i] for i in indices)
    #yield myYield
    if index == 0:        # <-
        return myYield    # <-
    count = 1             # <-
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        myYield = list(myPool[i] for i in indices)
        #yield myYield
        if index == count:    # <-
            return myYield    # <-
        count += 1            # <-
Which leads to a change that should look like this:
Python:
def seedext(seed, kwargs):
        k = list (kwargs.keys ())
        k.sort()
        idx = combolistsort( iterable, r, k )   # <-
        if idx is None:                         # <-
            return seed                         # <-
        order = combolist(iterable, r, idx )    # <-
        if order is None:                       # <-
            return seed                         # <-
        a = [smObjExt[i] % kwargs.get(i) for i in order]  # <- Updated
        return seed + "".join(a)
Which limits the data you effectively need to the different /iterable/ lists.
So here you'd probably have a dict indexed by r:
Python:
iterables = { 1: [ ... ],
              2: [ ... ],
              ... }
Which leads to a final change in seedext that should look like:
Python:
def seedext(seed, kwargs):
        k = list (kwargs.keys ())
        k.sort()
        idx = combolistsort( iterables[len( k )], len( k ), k )   # Updated
        if idx is None:
            return seed
        order = combolist( iterables[len( k )], len( k ), idx )   # Updated
        if order is None:
            return seed
        a = [smObjExt[i] % kwargs.get(i) for i in order]
        return seed + "".join(a)
The code can be optimized; I voluntarily kept it this way to make it match what is actually used, in order to make it more understandable.

[...] (AON, don't look) [...]
Code:
total creation time: 230.35592739999993
I've seen worse, written by me. When I was rewriting my variable viewer, my first approach led to processing times that could go up to 10 minutes... :(
 

Sancho1969

I've seen worse, written by me. When I was rewriting my variable viewer, my first approach led to processing times that could go up to 10 minutes... :(
:ROFLMAO: I do silly shit all the time by pure mishap. I'm certain that those of us that write our own code from scratch always do, especially when venturing into uncharted territory... but that, to me, is the enjoyment of it, the challenge so to speak.

So, the main issue you're facing is to find in which order the kwargs values should be used. And since dicts are not ordered (side note: insertion order is only guaranteed as of Python 3.7), you really need a reference order.
Basically, yes... and this is one of the many reasons I consider you a mentor: this ability to derive the pure objective and offer a different point of view, efficiency, etc. Basically, I grab all (if any) kwargs, which may be in any order and quantity (currently limited to ~4 or 5 max due to this bottleneck), over the course of many repeated events. I then use the smKeys dict to compare which kwargs were grabbed, as a string covering all possible orders, so that I can then use the ordered (sorted) kwargs against (as you correctly derived) a mapped dictionary of keys/values (the smKeys dict) to derive all the correlated values (the smObjExt dict) I need to pass through and process. So yes, I believe you derived the exact use case I'm processing... you're a badass "stuntman" AON, truly.

I believe I grasp what you're getting at in your examples: process the kwargs against code mapping in real time rather than against the mapping dict. I'll certainly tinker with it and see if I can get results that don't lag too much during gameplay.
 

anne O'nymous

[...] and this is one of the many reasons I consider you a mentor, this ability to [...] you're a badass "stuntman" AON, truly.
Know what ? Today is my birthday, and I can easily imagine worse things to read on such a day ;)


I believe I grasp what you're getting at in your examples to process the kwargs against code mapping in real-time rather than against the mapping dict. I'll certainly tinker with it and see if I can receive results that don't lag too much during gameplay.
Take your time. The best helper when you've been stuck on some code for too long is always to take a break. Coming back with a fresher mind often permits a bigger vision of the problem.
 

Sancho1969

Which limits the data you effectively need to the different /iterable/ lists.
So here you'd probably have a dict indexed by r:
Python:
iterables = { 1: [ ... ],
              2: [ ... ],
              ... }
First happy birthday, and may you have unlimited more.

Second... indexed by r, a dict of lists of the iterable ( smKeys ) combinations (singles, doubles, triples)? I'm not sure why I got scrambled here, tbh. I swear I tried that and "idx" stayed None, but I'll tinker a bit more later.
 

Sancho1969

:ROFLMAO: wtf? ( note: if myYield == match returns False)

Code snippet to help isolate issue(s):
Python:
def combolistsort(iterable, r, match):
    [...]
    myYield = str(sorted(list(myPool[i] for i in indices)))
    #yield myYield
    print("match is:", match)
    print("myYield is:", myYield)
    print("match equal is:", myYield==match)
    if myYield == match:  # <-
        return 0          # <-
    count = 1
    [...]
Output:
Code:
match is: ['testing_it']
myYield is: ['testing_it']
match equal is: False
Edit: Nevermind... myYield is set as a string... match is not ... ugh
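The gotcha boils down to a one-line repro: in Python a str never compares equal to a list, so either stringify the match before comparing or compare lists on both sides:

```python
match = ["testing_it"]                 # the list passed into combolistsort
myYield = str(sorted(["testing_it"]))  # the generator builds a *string*

print(myYield == match)       # False: a str never equals a list
print(myYield == str(match))  # True once both sides are strings
```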
 
Last edited: