What's your reasoning for using small batches? Wouldn't the LLM be better at stuff like pronouns and ownership if it has more of the text at once?
I feel like prefacing this before I respond 12 hours later: I know the context sizes here generally aren't over 4k tokens per request, with much of that being just the prompt. I just think that for translation, lower is better, up to a point.
There are a few reasons I like smaller batches, but the biggest one is that shorter context gives smaller models fewer chances to drift. With big batches and big inputs, they're more likely to lose track, so they end up repeating lines, contradicting themselves, and so on.
And if it gets stuck in a retry loop because it keeps getting the line count wrong, and it regenerates the full big batch six times, that's no good; smaller batches don't hit that nearly as hard. So it ends up being a tradeoff: a bit of translation loss from the reduced context, but much better reliability and adherence. An added benefit is that you're wasting fewer tokens on retries, so more of the generated tokens go toward an actually usable result.
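To make that loop concrete, here's a rough sketch of the kind of small-batch-plus-line-count-check flow I mean. Everything in it is a placeholder on my end (the endpoint URL, model name, prompt wording, and the translate_batch/translate_all names are all assumptions, not any particular tool's API); it just assumes some local OpenAI-compatible chat completions server.

```python
# Rough sketch: small-batch translation with line-count validation and retries.
# Assumes a local OpenAI-compatible server; the URL, model name, and prompt
# wording below are placeholders, not anything official.
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # assumption: local server
BATCH_SIZE = 10    # small batches: less drift, cheaper retries
MAX_RETRIES = 3

def translate_batch(lines: list[str]) -> list[str] | None:
    prompt = (
        f"Translate the following {len(lines)} Japanese lines to English. "
        "Output exactly one translated line per input line, nothing else.\n\n"
        + "\n".join(lines)
    )
    for _ in range(MAX_RETRIES):
        resp = requests.post(API_URL, json={
            "model": "local-model",   # placeholder
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.3,
        })
        out = resp.json()["choices"][0]["message"]["content"].strip().splitlines()
        out = [l for l in out if l.strip()]   # drop blank lines the model sometimes adds
        if len(out) == len(lines):            # the line-count check
            return out
        # wrong line count: retry, but only this small batch's tokens are wasted
    return None   # caller can flag this batch for manual fixing

def translate_all(lines: list[str]) -> list[str]:
    result = []
    for i in range(0, len(lines), BATCH_SIZE):
        batch = lines[i:i + BATCH_SIZE]
        result.extend(translate_batch(batch) or batch)  # fall back to source text
    return result
```

The point is that when the count check fails, you only re-spend one small batch's worth of tokens instead of the whole document's.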
And in general, from my time throwing models at a wall, the smaller the model, the smaller its usable coherent context window gets; most models just prefer shorter contexts. Shorter context also naturally leads to better prompt adherence, so the line counts are wrong less often and it's far less likely to mess up and just regurgitate the Japanese text back at you.
Shorter prompts also use less KV cache memory and get you tokens faster overall, since you can fit more inference requests in parallel.
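If you want to put a number on the KV cache part, the back-of-envelope math is just 2 (keys and values) x layers x KV heads x head dim x bytes per value x tokens. A quick sketch with illustrative Llama-2-7B-ish numbers (32 layers, 32 KV heads, head dim 128, fp16, no GQA); swap in your own model's config:

```python
# Back-of-envelope KV cache size. Numbers are illustrative, not exact for any
# particular backend.
def kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens  # 2x for K and V

for ctx in (512, 2048, 4096):
    print(f"{ctx:5d} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB per request")
# 512 -> ~0.25 GiB, 2048 -> ~1 GiB, 4096 -> ~2 GiB
# so a 512-token prompt fits roughly 8x more concurrent requests
# in the same VRAM as a 4096-token one
```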
There are a handful of benchmarks that publish results on this, like some of the more recent long-context
benchmarks, or the older RULER and Needle in a Haystack tests, that check whether an LLM is any good at long context. They're good references, but I'm either placebo'd into short context or crazy to think it's worse or better, so you should probably just try it out and see whether you like the results your AI generates with shorter context rather than all of the context.
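If you'd rather test it than take my word for it, the simplest thing is to run the same file through at two different batch sizes and compare the outputs yourself. This reuses the hypothetical translate_batch sketch from above, so the same caveats apply:

```python
# Quick A/B check: same source file, two batch sizes, compare by eye or with diff.
# translate_batch() is the hypothetical helper sketched earlier; the file names
# and batch sizes here are just examples.
from pathlib import Path

src = Path("source_ja.txt").read_text(encoding="utf-8").splitlines()

for size in (10, 100):   # small batches vs most of the text at once
    out = []
    for i in range(0, len(src), size):
        batch = src[i:i + size]
        out.extend(translate_batch(batch) or batch)  # fall back to source text
    Path(f"out_batch{size}.txt").write_text("\n".join(out), encoding="utf-8")
# then open the two files side by side (or diff them) and judge pronouns,
# consistency, and how many batches needed retries
```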