Tool RPGM SLR Translator - Offline JP to EN Translation for RPG Maker VX, VX Ace, MV, and MZ

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
By the way, I forgot to mention—reasoning models aren’t great for translation tasks.
I should still give people the option if they want to use them for whatever reason.
In general I don't think local hosting is viable right now. I'm just going to use DeepSeek proper.

So you're saying the issue is specifically with Meta-Llama-3.1-8B-Instruct-Q4_K_M and Gemma-2-9b-it-Q3_K_L?
I'll get those and run tests.
Might take a bit because I have potato internet.
 
Dec 1, 2018
104
92
That's odd. I ran some tests with an R1 distill version and I was getting an output.
Not a usable one, because the current version of DSLR doesn't know how to handle the "think" block at the start of the response of a reasoning model like that, but it shouldn't just outright crash. (I'm working on fixing the reasoning stuff right now.)
The one I tried was deepseek/deepseek-r1-0528-qwen3-8b though.

Edit: I should mention that I still haven't found a single model that can be locally hosted on reasonable hardware and actually gives acceptable performance in DSLR yet.
The best one so far was actually gemma-3-27b, but it was still much worse than the official deepseek-chat-v3 or the experimental deepseek-chat-v3-0324 model, and incredibly slow.
LM Studio can do think parsing for you and separate that from the output.

And more or less, just regex it out.
1750899602944.png


It worked fine using the GPT4All model Meta-Llama.
By the way, I forgot to mention—reasoning models aren’t great for translation tasks.

View attachment 4980849
View attachment 4980851
View attachment 4980852
I use Qwen3 for translation quite often. It's very good.


I should still give people the option if they want to use them for whatever reason.
In general I don't think local hosting is viable right now. I'm just going to use DeepSeek proper.

So you're saying the issue is specifically with Meta-Llama-3.1-8B-Instruct-Q4_K_M and Gemma-2-9b-it-Q3_K_L?
I'll get those and run tests.
Might take a bit because I have potato internet.
And local is super doable in my opinion. Qwen3 32B tends to do just fine with your parser at around 3 rows per second on sequential requests, and it has the KV cache headroom to do *way* more if it's allowed to.
 

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
LM Studio can do think parsing for you and separate that from the output.

And more or less, just regex it out. View attachment 4980947
Well that's convenient. If that works then I just need to make a new options menu entry to disable the token scaling. (Currently it makes max_tokens dependent on the input length, which screws reasoning models on very small tasks because they run out of tokens.)

But I'll still make my own failsafe system as well, because someone could technically host them using Kobold or whatever, and not have that option.
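
For context, the current scaling is conceptually something like this; a rough sketch with made-up numbers and names, not the actual DSLR code:

```
// Rough sketch of the token scaling idea, not the real DSLR implementation.
// inputChars: characters in the batch being sent.
// uncapTokens: the planned option that disables the scaling for reasoning models.
function pickMaxTokens(inputChars, uncapTokens) {
  if (uncapTokens) {
    // Reasoning models spend tokens on the think block before the actual
    // answer, so an input-based cap starves them on tiny tasks.
    // Just use a large fixed cap instead.
    return 8192;
  }
  // Otherwise scale max_tokens with the input so normal models
  // can't ramble forever on a small batch.
  return Math.min(8192, Math.max(256, inputChars * 2));
}
```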
 
Dec 1, 2018
104
92
Well that's convenient. If that works then I just need to make a new options menu entry to disable the token scaling. (Currently it makes max_tokens dependent on the input length, which screws reasoning models on very small tasks because they run out of tokens.)

But I'll still make my own failsafe system as well, because someone could technically host them using Kobold or whatever, and not have that option.
Should probably add an option to also change the regex search for <think> and </think>, as some other models use different separators.
 

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
Should probably add an option to also change the regex search for <think> and </think>, as some other models use different separators.
I was planning to make a list of multiple regexes that check and trim the response. Which ones are there besides <think>, :think:, or the reasoning being in a separate assistant message?

Edit: I guess I'll try a catchall approach.
Code:
/^[\s\S]*?[^a-z]think[^a-z][\s\S]*?[^a-z]think[^a-z]+\s*/gi
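
To illustrate, applied to a made-up reasoning response, that catchall behaves like this:

```
// The catchall from above: trims everything up to and including the closing
// "think" marker, whatever bracket/emote style the model uses.
const thinkBlock = /^[\s\S]*?[^a-z]think[^a-z][\s\S]*?[^a-z]think[^a-z]+\s*/gi;

// Made-up example of a reasoning model's raw output.
const raw = "<think>\nThe user wants a translation...\n</think>\nLine 1\nLine 2";

console.log(raw.replace(thinkBlock, "")); // "Line 1\nLine 2"
```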
 
Last edited:

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
Please consider, sometime in the future, finishing the implementation of concurrent requests for users like myself who use a production serving engine with the hardware to support it. Doing one batched translation at a time is a vast underutilization of the power these GPUs can provide, to the scale of 10 or 20 times slower than doing multiple requests in parallel.
Do you literally just mean async processes that make requests to the same url at the same time?
That wouldn't be hard to implement, but doesn't do anything with LM Studio or OpenRouter, because at least on standard settings those will just queue the requests and still only do them one at a time.
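
To be clear, by that I mean something like the following (the endpoint and payload here are just placeholders):

```
// "Async processes" in the simplest sense: fire N requests at the same URL
// and wait for all of them. Endpoint and payload are placeholders.
async function sendConcurrently(batches) {
  return Promise.all(
    batches.map((batch) =>
      fetch("http://localhost:1234/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(batch),
      }).then((res) => res.json())
    )
  );
}
// Note: on default settings LM Studio / OpenRouter may still just queue
// these server-side and process them one at a time.
```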
 
Dec 1, 2018
104
92
Do you literally just mean async processes that make requests to the same url at the same time?
That wouldn't be hard to implement, but doesn't do anything with LM Studio or OpenRouter, because at least on standard settings those will just queue the requests and still only do them one at a time.
I mean exactly that.

I use vLLM, and that allows me to make *tons* of requests at once. It's very different from LM Studio and OpenRouter in how they approach serving an individual user, as they artificially limit you to one request served at any given moment (or in LM Studio's case, and through that llama.cpp, it actually degrades performance serving more than one individual at once), but when you *have* a serving engine that isn't so limited, you can reap the benefits of your hardware.


I was planning to make a list of multiple regexes that check and trim the response. Which ones are there besides <think>, :think:, or the reasoning being in a separate assistant message?

Edit: I guess I'll try a catchall approach.
Code:
/^[\s\S]*?[^a-z]think[^a-z][\s\S]*?[^a-z]think[^a-z]+\s*/gi
Not that many. I know some models use <reasoning></reasoning>, but a catchall where you can replace 'think' with something else is probably as good as it gets.
 

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
I mean exactly that.

I use vLLM, and that allows me to make *tons* of requests at once. It's very different from LM Studio and OpenRouter in how they approach serving an individual user, as they artificially limit you to one request served at any given moment (or in LM Studio's case, and through that llama.cpp, it actually degrades performance serving more than one individual at once), but when you *have* a serving engine that isn't so limited, you can reap the benefits of your hardware.
What would be a "normal" amount of requests at the same time for an "average" user? Making it a fixed amount would be a lot easier than making it variable. There is so much that could go wrong.
 
Dec 1, 2018
104
92
What would be a "normal" amount of requests at the same time for an "average" user? Making it a fixed amount would be a lot easier than making it variable. There is so much that could go wrong.
5 or 8 would probably be a good balance of performance for most individuals. LM Studio, Ollama, and OpenRouter (not sure if that one limits you to 1 request at once) will presumably queue requests regardless, so there isn't much you lose in that respect.

I'm an advocate for as many as the user sets, with a .lock on anything that needs to be *done* before the next batch goes out or anything that may turn into a race condition, but I can understand that I'm probably more of a niche case than the standard user, especially with the request I'm making.
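
Something like this is what I'm picturing; just a sketch, not DSLR code, with the limit coming from whatever the user sets:

```
// Sketch: user-configurable concurrency limit, plus a promise-based lock
// around anything shared (cache writes, progress counters) so parallel
// batches can't race each other.
class Lock {
  constructor() { this.tail = Promise.resolve(); }
  run(fn) {
    const result = this.tail.then(fn);
    this.tail = result.catch(() => {}); // keep the chain alive on errors
    return result;
  }
}

async function runWithLimit(tasks, limit) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++;           // safe: JS is single-threaded between awaits
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: limit }, worker));
  return results;
}

// Usage idea: const lock = new Lock();
// inside a task: await lock.run(() => cache.set(key, value));
```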
 

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
It worked fine using the GPT4All model Meta-Llama.
I tried Meta-Llama-3.1-8B-Instruct-Q4_K_M and Gemma-2-9b-it-Q3_K_L, but I still get the same error.
I'm sorry, but I really do not understand what's going on on your end.
I've tried both of those models with standard settings and they worked without any issues.

Meta-Llama-3.1-8B-Instruct-Q4_K_M:
1.png
2.png
Gemma-2-9b-it-Q3_K_L
3.png
4.png

(Ignore it saying v1.135, I just changed the number already, it's still v1.134 DSLR.)
 

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
5 or 8 would probably be a good balance of performance for most individuals. LM Studio, Ollama, and OpenRouter (not sure if that one limits you to 1 request at once) will presumably queue requests regardless, so there isn't much you lose in that respect.

I'm an advocate for as many as the user sets, with a .lock on anything that needs to be *done* before the next batch goes out or anything that may turn into a race condition, but I can understand that I'm probably more of a niche case than the standard user, especially with the request I'm making.
The way it works right now (which is not ideal, I know, but hard to change at this point) is that the MTL Processor module adds lines to a batch until the number of Han/Hiragana/Katakana characters reaches 2000 (default setting) and sends that batch to the DSLR engine. The DSLR engine then counts it again and makes its own batch (currently set to the same amount, so it ends up with the same batch).

What I could do without having to remake large parts of the process would be to set the full batch to 10000 but keep the DSLR batch at 2000, and instead of waiting for the response, have DSLR already send the next 2000 until it hits the end of the full batch.

That way it would make 5 requests at the same time to finish the 1 full batch.

And I could basically make that scalable by whatever you set as the max for the full batch and the DSLR batch.
So 12000/2000 would make 6 requests at the same time, for example, and 8000/2000 would make 4, etc.

Is that what you had in mind?

The main downside is that if one of the requests fails all retry attempts, the entire full batch will be lost.
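
Roughly, the flow I have in mind would look like this (just a sketch to explain the idea, not the actual RedBatchTranslation code; the character ranges are an approximation):

```
// Sketch of the plan, not the actual DSLR/RedBatchTranslation code.
// fullBatch: lines adding up to ~10000 JP characters (the "full batch").
// requestSize: the existing 2000-character DSLR batch setting.

// Rough approximation of the Han/Hiragana/Katakana count used for batching.
const jpCharCount = (s) => (s.match(/[\u3040-\u30FF\u4E00-\u9FFF]/g) || []).length;

async function translateFullBatch(fullBatch, requestSize, sendRequest) {
  // Split the full batch into sub-batches of roughly `requestSize` JP characters.
  const subBatches = [];
  let current = [];
  let count = 0;
  for (const line of fullBatch) {
    current.push(line);
    count += jpCharCount(line);
    if (count >= requestSize) {
      subBatches.push(current);
      current = [];
      count = 0;
    }
  }
  if (current.length > 0) subBatches.push(current);

  // Fire all sub-batches without waiting for the previous response:
  // 10000 / 2000 -> 5 requests in flight at the same time.
  const results = await Promise.all(subBatches.map(sendRequest));
  return results.flat();
}
```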
 
Dec 1, 2018
104
92
The way it works right now (which is not ideal, I know, but hard to change at this point) is that the MTL Processor module adds lines to a batch until the number of Han/Hiragana/Katakana characters reaches 2000 (default setting) and sends that batch to the DSLR engine. The DSLR engine then counts it again and makes its own batch (currently set to the same amount, so it ends up with the same batch).


What I could do without having to remake large parts of the process would be to set the full batch to 10000 but keep the DSLR batch at 2000, and instead of waiting for the response, have DSLR already send the next 2000 until it hits the end of the full batch.


That way it would make 5 requests at the same time to finish the 1 full batch.


And I could basically make that scalable by whatever you set as the max for the full batch and the DSLR batch.

So 12000/2000 would make 6 requests at the same time, for example, and 8000/2000 would make 4, etc.


Is that what you had in mind?


The main downside is that if one of the requests fails all retry attempts, the entire full batch will be lost.
That definitely sounds prone to issues, to me at least, especially if it can't figure out what's 'bad' from the request and just adds the bad request back to the pool.

I know the system itself has a 'cache' that it pulls from. Once it verifies which ones in a batch are good, is it possible to just add them to the cache pre-emptively? That way, in the event a single batch TL in the pool fails, when it retries it checks the cache and sees 'this seems to already be translated'?

Otherwise, the best ideas I can think of are:

```
Attempt 1: 5 concurrent requests
  ↓ (if any fail)
Attempt 2: 3 concurrent requests (split failed portions)
  ↓ (if still failing)
Attempt 3: Sequential requests (current behavior)
```

Or make separate instances of the MTL processor for batches, where the main one does what you do now, up to a configurable 10000+ character limit, and then splits that down into smaller chunks (like configurable 2000+ character segments; I use around 300 personally), where each chunk gets handled by a separate MTL processor instance that manages its own retries independently, caches successful translations immediately, and can retry failed chunks without affecting the successful ones.
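
As a rough sketch of that fallback idea (sendChunk and cacheResult are made-up placeholders, not DSLR internals):

```
// Tiered retry sketch (5 -> 3 -> sequential). sendChunk() and cacheResult()
// are made-up placeholders, not actual DSLR functions.
async function translateWithFallback(chunks, sendChunk, cacheResult) {
  let pending = chunks;
  for (const concurrency of [5, 3, 1]) {
    const failed = [];
    // Process the pending chunks in waves of `concurrency` requests.
    for (let i = 0; i < pending.length; i += concurrency) {
      const wave = pending.slice(i, i + concurrency);
      await Promise.all(wave.map(async (chunk) => {
        try {
          // Cache good results immediately so a later failure can't lose them.
          cacheResult(chunk, await sendChunk(chunk));
        } catch {
          failed.push(chunk);
        }
      }));
    }
    if (failed.length === 0) return;
    pending = failed;
  }
  throw new Error(pending.length + " chunk(s) still failing after all attempts");
}
```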
 

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
I've released v1.135.
Added support for reasoning models.
There's a new option in the options menu that uncaps max_tokens regardless of input, and trims off the common think block variants.
The LM Studio option is not necessary.

Added new General option to increase concurrent requests to the translation server.
I capped it at 99 requests, because I doubt someone needs more (make bigger batches then lol), and I couldn't be arsed to allow 3 digits.
Since I had to completely rewrite the RedBatchTranslation class to make this work, something is likely broken now even if you keep the option at 1.
I would like to forward all complaints to RenderedFatality for requesting this cursed feature.
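
Conceptually, the think-block trimming is just a list of patterns applied to the response before parsing, along these lines (simplified illustration, not the exact patterns that shipped):

```
// Simplified illustration of the "trim common think block variants" step.
// Not the exact patterns shipped in v1.135.
const variants = [
  /^[\s\S]*?<\/think>\s*/i,               // <think> ... </think>
  /^[\s\S]*?<\/reasoning>\s*/i,           // <reasoning> ... </reasoning>
  /^[\s\S]*?:think:[\s\S]*?:think:\s*/i,  // :think: ... :think:
];

function stripReasoning(response) {
  for (const pattern of variants) {
    if (pattern.test(response)) return response.replace(pattern, "");
  }
  return response;
}
```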
 
Dec 1, 2018
104
92
I've released v1.135.
Added support for reasoning models.
There's a new option in the options menu that uncaps max_tokens regardless of input, and trims off the common think block variants.
The LM Studio option is not necessary.

Added new General option to increase concurrent requests to the translation server.
I capped it at 99 requests, because I doubt someone needs more (make bigger batches then lol), and I couldn't be arsed to allow 3 digits.
Since I had to completely rewrite the RedBatchTranslation class to make this work, something is likely broken now even if you keep the option at 1.
I would like to forward all complaints to RenderedFatality for requesting this cursed feature.
Holy crap you did it on the first try!

Somehow it didn't break anything, but I will note:

1750927811746.png


It seems stuck at around 6 requests a second. Not super ideal, but definitely better for mass translation than what it was before.
 

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
Holy crap you did it on the first try!

Somehow it didn't break anything, but I will note:

View attachment 4981965


It seems stuck at around 6 requests a second. Not super ideal, but definitely better for mass translation than what it was before.
What's your reasoning for using small batches? Wouldn't the LLM be better at stuff like pronouns and ownership if it has more of the text at once?
 
Dec 1, 2018
104
92
What's your reasoning for using small batches? Wouldn't the LLM be better at stuff like pronouns and ownership if it has more of the text at once?
I feel like prefacing this, before I respond 12 hours later, with the fact that I know the context sizes on this generally aren't over 4k tokens per request, with much of it being just the prompt; I just think that for translation, lower is better, to a point.

There's a few reasons I like to use smaller batches, but the biggest one is that shorter context gives smaller models fewer chances to drift. With big batches and big inputs, they're more likely to 'lose track', so they end up repeating lines, contradicting themselves, yada.

And if we get stuck in a retry loop, and it keeps getting the line count wrong so it makes the full big batch 6 times, it's no good, but smaller batch sizes don't have that issue nearly as much. So it ends up being a tradeoff: a bit of translation loss from reduced context, but much better reliability and adherence. An added benefit is that you're wasting fewer tokens on retries, so more generated tokens are going towards an actual usable result.

And in general, from my time throwing models at a wall, the smaller the model you go with, the smaller that usable coherent context window gets, and most models will prefer smaller contexts. The added benefit of smaller contexts is that they naturally lead to better prompt adherence, so the line counts are wrong less often and it won't mess up and just regurgitate the Japanese text back to you nearly as much.

Shorter prompts also tend to save on KV cache and generate tokens faster, because you can make more inference requests.

There's a handful of benchmarks publicizing results on this, like the more recent benchmark, or the old RULER or Needle in a Haystack tests, that check whether an LLM is any good at long context, and while they're good references, I'm either placebo'd into short context or crazy to think it's worse or better. You should probably try it out and see if you like the results your AI generates with shorter context rather than all of the context.
 
Last edited:
Reactions: derakino999

kukuru97

Member
May 31, 2019
128
179
I should still give people the option if they want to use them for whatever reason.

So you're saying the issue is specifically with Meta-Llama-3.1-8B-Instruct-Q4_K_M and Gemma-2-9b-it-Q3_K_L?
I'll get those and run tests.
Might take a bit because I have potato internet.
I think the issue is LM Studio.
I'm sorry, but I really do not understand what's going on on your end.
I've tried both of those models with standard settings and they worked without any issues.

Meta-Llama-3.1-8B-Instruct-Q4_K_M:
View attachment 4981246
View attachment 4981245
Gemma-2-9b-it-Q3_K_L
View attachment 4981244
View attachment 4981243

(Ignore it saying v1.135, I just changed the number already, it's still v1.134 DSLR.)

I found the culprit; it works fine now.
The issue was with the 'Context Prompt'.
I had filled it with way too many vocabulary entries.

Here’s the vocabulary I had entered:


1750965221960.png
 

kukuru97

Member
May 31, 2019
128
179
Translating one by one works fine, but when I try batch translation, the process gets stuck like this.
I’ve been waiting for 10 minutes, but there’s no sign of any progress.

And when I close SLR, the LLM shows logs like this:
image_2025-06-27_023751243.png
image_2025-06-27_023939801.png
image_2025-06-27_023949816.png
 

Shisaye

Engaged Member
Modder
Dec 29, 2017
3,349
5,944
Translating one by one works fine, but when I try batch translation, the process gets stuck like this.
I’ve been waiting for 10 minutes, but there’s no sign of any progress.

And when I close SLR, the LLM shows logs like this:
View attachment 4983729
View attachment 4983732
View attachment 4983734
287 lines seems way too much for a batch. Usually it should be like 100.
I wonder what's causing that. I assume your batch size is still at 2000 characters?

You can make LM Studio log token usage in real time; that way you could check whether it's actually stuck or just really slow.
There should be a checkbox in the small menu that shows up if you press the 3 dots at the top right of the developer console.

Edit: Maybe just try lowering it to 500 and see if that makes a difference, but you have to lower both settings (batch size and request size), not just one of them.
2000 characters is just what had the best success-to-token ratio during my tests when using DeepSeek.
It's not really optimized for translation quality or speed; it's optimized for keeping down the cost when using the premium API.
 
Last edited: