Best AI to help with writing and fleshing out a story? (preferably self-hostable)

Illhoon

Member
Jul 7, 2019
455
571
Hi, so basically the title: I'm wondering what is currently the best LLM model to assist in writing and fleshing out a story and scenarios? I'd love for it to be self-hostable. I don't really know any good models; the only big premium model that allows NSFW stuff is Grok.
 

n00bi

Active Member
Nov 24, 2022
599
664
Howdy.

If you want to run a local AI LLM,
then I suggest you look into Ollama, with Open WebUI as an interface; that gets you something more like ChatGPT, but locally.
Ollama supports a great deal of models.

As to which model is best to use, that is a bit subjective, I guess.
But you should stay away from the small models.
Use models that are minimum 12B or 14B (billion parameters).
Overall, DeepSeek, Gemma, Llama, Qwen2, and Phi-4 are decent models.
Most of these models are censored, though you will find some uncensored ones.
But they work fine for general romance story making.

You need to pull down the models you want to use,
for example:
ollama pull deepseek-r1:14b
or
ollama pull phi4:14b
and replace pull with run when you want to run said model.
If you use Open WebUI you just select the model you want there from a list of downloaded ones.

If you search for "uncensored" you will find some models.
I don't know how good they are, though.

There is also the possibility to create a "RAG" template for your story.
Basically, you tell it about what you already have so it doesn't start from no context,
but the RAG needs to be updated every time the story is, so it knows about the latest changes.

You also have the ability to interact with Ollama through Python,
so you can do a lot of fancy stuff if you really want to.
Anyway, you will find a ton of tutorials on YouTube etc., and the Discord server is helpful too.
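For reference, a minimal sketch of that Python interaction using the official ollama package (assuming `pip install ollama` and a model you already pulled; the model name and prompt are only placeholders):

```python
# Minimal sketch: ask a locally pulled model to flesh out a scene.
# Assumes `pip install ollama` and that the model was pulled beforehand,
# e.g. `ollama pull deepseek-r1:14b`.
import ollama

response = ollama.chat(
    model="deepseek-r1:14b",
    messages=[
        {"role": "system", "content": "You are a co-writer helping to flesh out a story."},
        {"role": "user", "content": "Expand this outline into a short scene: "
                                    "two rivals are forced to share a tent during a storm."},
    ],
)
print(response["message"]["content"])
```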

And some ollama fun, making it write a song about itself :p
 

no_more_name

Member
Mar 3, 2024
104
46
Couldn't work on my last VN prototype for ages, but here's what I tested. I do prefer text-generation-webui when I can use it. There are now portable installations that make everything easier; you can grab the latest version corresponding to your card/system.

In my opinion, and after plenty of testing, there are not that many models to choose from. Despite its promise, I found one of them underwhelming in most aspects of what we are interested in. Since we need relatively huge context (16k+ tokens), 4-bit quantization is generally a good compromise.

For "creative" writing, my favorites so far:


(less good than the other two)

For coding (Ren'Py/Python):
(If you have enough VRAM)
(The one I use)
(Not that bad)
(Same)

Each model has recommended settings, especially temperature. For example, Mistral Small uses temp=0.15; for coding answers you want to stay near this value. For creative writing you want to push for more, say temp=1, and a slightly lower top_p like 0.95 (opening up the pool of words).
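If you talk to the backend through its OpenAI-compatible API (text-generation-webui with the 'openai' flag exposes one, and so does Ollama), those settings are just request parameters. A rough sketch, where the endpoint URL, model name and values are only placeholders to adjust:

```python
# Rough sketch: one helper, two sampling profiles (coding vs. creative).
# Assumes a local OpenAI-compatible endpoint; URL, model and values are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

def ask(prompt: str, temperature: float, top_p: float) -> str:
    resp = client.chat.completions.create(
        model="mistral-small",  # whatever model the backend is actually serving
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    return resp.choices[0].message.content

coding_answer = ask("Write a Ren'Py label that shows two lines of dialogue.", 0.15, 1.0)
creative_answer = ask("Describe the tavern where the story opens.", 1.0, 0.95)
```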

For coding, I mostly use it for debugging, not really to write code from scratch. I tried to use it within VS Code but found it was a waste of time. My prototype is probably just way too big overall, and I haven't had time to test it on simpler things. I simply make a quick text skeleton of my game with all critical functions and game logic, a bit like this:

You are an AI specialized in Python and Ren'Py, a game engine used to create visual novel games. You help the user by answering their questions. The following code is the context that the user is working on.

Provided code starts here:
For creative writing I started to use NovelForge (be sure to flag 'openai' in text-generation-webui) until I couldn't work on my prototype anymore. But I found the whole approach quite interesting, especially in a VN context. I can't list all the features, so you have to figure it out for yourself by browsing the site. Note that the trial version is feature-complete but limited to 30 documents, iirc.

 

n00bi

Active Member
Nov 24, 2022
599
664
Each model has recommended settings, especially temperature. For example, Mistral Small uses temp=0.15; for coding answers you want to stay near this value. For creative writing you want to push for more, say temp=1, and a slightly lower top_p like 0.95 (opening up the pool of words).
For beginners, I don't think it's smart to mess around with the settings.
One needs to understand what they do and what effect they will have on the output.
You don't want to end up with hallucinated output, fully out of context.
Once you get more experience you can of course play around with them.

I think one should focus on learning how to make prompts first,
using templates etc., before diving into settings.


For coding, I mostly use it for debugging, not really to write code from scratch.
Oh, I do it the other way: I tell it to make me a function that takes arguments x, y, z as input, does this and that, and returns K.
And I end up debugging the code it's spitting out. It's nice for generating misc templates, though.
Qwen2.5-coder is the best one to use for coding, imo.
I did have a local AI challenge for myself between DeepSeek, Phi-4, Llama, and Qwen2.5-coder,
where the task was to create the old game called Conway's Game of Life. Qwen2.5-coder came out as the winner :)

I see Mistral is mentioned as a model; let's not forget the other model families.
There is also one whose page says, "This model is proficient at both roleplaying and storywriting due to its unique nature."
And there is also another model.
I have no idea how good these are, though.

(If you have enough VRAM)
This is a general rule for all models.
While you can run models on the CPU with enough RAM, that is not really something to aim for.
Running it on the GPU is FTL speed compared to the CPU.
 

no_more_name

Member
Mar 3, 2024
104
46
For beginners, I don't think it's smart to mess around with the settings.
One needs to understand what they do and what effect they will have on the output.
You don't want to end up with hallucinated output, fully out of context.
Once you get more experience you can of course play around with them.

I think one should focus on learning how to make prompts first,
using templates etc., before diving into settings.
I tend to believe the opposite. For iterative writing, mess with the settings (at least temp, top_p, min_p) as soon as possible to get the gist of it. But to each their own, I guess. For coding/instruct you want to stay close to the model's recommended settings most of the time.

This is a general rule for all models.
While you can run models on the CPU with enough RAM, that is not really something to aim for.
Running it on the GPU is FTL speed compared to the CPU.
Yeah, running a local model from RAM is kinda pointless if the goal is to save time (and not lose it).

On another topic, I would add as a rule of thumb that the more context you add, the more your local model will hallucinate (some models are more robust than others). There is not much you can do about it (outside of running tests at ~16k-32k ctx), so be smart about the context you feed it.
 

n00bi

Active Member
Nov 24, 2022
599
664
so be smart about the context you feed it.
Indeed, hence why I said to focus on how to make your prompt first.
How you structure and phrase your question/input will greatly affect what the output will be like.

If you use your own Python script for queries, a good template is nice to have so you can customize the whole thing for novel making, but that's another topic.
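As a hypothetical example of what such a template-driven query script could look like (the template wording, character names and model are all placeholders, not anything official):

```python
# Hypothetical sketch: a small query script with a fixed prompt template,
# so the structure and phrasing stay consistent between queries.
import ollama

TEMPLATE = """You are a co-writer for an adult romance novel.
Setting: {setting}
Characters: {characters}

Task: {task}
Write in third person, past tense, and keep the established tone."""

prompt = TEMPLATE.format(
    setting="a small coastal town, present day",
    characters="Anna (lead), Marcus (her rival)",
    task="Flesh out the scene where Anna and Marcus get stuck in the lighthouse.",
)

reply = ollama.chat(model="deepseek-r1:14b",
                    messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])
```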

And it is really important, if you are working with censored models, how you phrase stuff,
as they often come back with moral preaching or some safety concern.
You must find a workaround, i.e. call a penis a carrot and a vagina a donut, etc.
"He puts his carrot into her donut ..." :p

But yeah, you should play around with the settings as well to see how the output changes,
although don't go down the rabbit hole and spend vast amounts of time doing endless tweaking.
You will see that, often, if you do a fresh query with the same question/input, the output may vary a lot anyway.
 

Illhoon

Member
Jul 7, 2019
455
571
Howdy.

If you want to run a local AI LLM,
then I suggest you look into Ollama, with Open WebUI as an interface; that gets you something more like ChatGPT, but locally.
Ollama supports a great deal of models.

As to which model is best to use, that is a bit subjective, I guess.
But you should stay away from the small models.
Use models that are minimum 12B or 14B (billion parameters).
Overall, DeepSeek, Gemma, Llama, Qwen2, and Phi-4 are decent models.
Most of these models are censored, though you will find some uncensored ones.
But they work fine for general romance story making.

You need to pull down the models you want to use,
for example:
ollama pull deepseek-r1:14b
or
ollama pull phi4:14b
and replace pull with run when you want to run said model.
If you use Open WebUI you just select the model you want there from a list of downloaded ones.

If you search for "uncensored" you will find some models.
I don't know how good they are, though.

There is also the possibility to create a "RAG" template for your story.
Basically, you tell it about what you already have so it doesn't start from no context,
but the RAG needs to be updated every time the story is, so it knows about the latest changes.

You also have the ability to interact with Ollama through Python,
so you can do a lot of fancy stuff if you really want to.
Anyway, you will find a ton of tutorials on YouTube etc., and the Discord server is helpful too.

And some ollama fun, making it write a song about itself :p
Yeah, I already host an LLM via Ollama, mistral dolphin:latest, which should have been uncensored but really isn't, so I'm looking for some models I can run at home (got a 4080 Super) that perform well and are uncensored for NSFW writing. I also have a GTX 1080 with 8 GB VRAM and 64 GB DDR4 in an old PC I have standing around, and I was wondering if I can use that as well to run self-hosted LLMs. Do you know if there is something like a load balancer where I can run a bigger model split across my 4080 Super in my main PC and the 1080 in a home server?

why I said to focus on how to make your prompt first.
How you structure and phrase your question/input will greatly affect what the output will be like.
Are there any great guides on good prompting and settings to learn from?
 

Illhoon

Member
Jul 7, 2019
455
571
Couldn't work on my last VN prototype for ages, but here's what I tested. I do prefer text-generation-webui when I can use it. There are now portable installations that make everything easier; you can grab the latest version corresponding to your card/system.

In my opinion, and after plenty of testing, there are not that many models to choose from. Despite its promise, I found one of them underwhelming in most aspects of what we are interested in. Since we need relatively huge context (16k+ tokens), 4-bit quantization is generally a good compromise.

For "creative" writing, my favorites so far:


(less good than the other two)

For coding (Ren'Py/Python):
(If you have enough VRAM)
(The one I use)
(Not that bad)
(Same)

Each model has recommended settings, especially temperature. For example, Mistral Small uses temp=0.15; for coding answers you want to stay near this value. For creative writing you want to push for more, say temp=1, and a slightly lower top_p like 0.95 (opening up the pool of words).

For coding, I mostly use it for debugging, not really to write code from scratch. I tried to use it within VS Code but found it was a waste of time. My prototype is probably just way too big overall, and I haven't had time to test it on simpler things. I simply make a quick text skeleton of my game with all critical functions and game logic, a bit like this:



For creative writing I started to use NovelForge (be sure to flag 'openai' in text-generation-webui) until I couldn't work on my prototype anymore. But I found the whole approach quite interesting, especially in a VN context. I can't list all the features, so you have to figure it out for yourself by browsing the site. Note that the trial version is feature-complete but limited to 30 documents, iirc.

Ouh, thank you, I will look into the models you mentioned. For coding I'll stick to the big models for now; I don't think I can get the same performance self-hosted. I am looking for NSFW models I can self-host to help me write NSFW specifically. Also, do you know if there is something like a load balancer where I can split LLM loads onto multiple GPUs not in the same system? Because I've got a 4080 Super in my main PC, and I have an old PC with a 1080 with 8 GB VRAM in there, and I thought I might be able to utilize that while I still have it; I just have no clue if that is even possible.

On another topic, I would add as a rule of thumb that the more context you add, the more your local model will hallucinate (some models are more robust than others). There is not much you can do about it (outside of running tests at ~16k-32k ctx), so be smart about the context you feed it.

Are there any guides about proper prompting and settings to get a gist of that without a lot of trial and error?
 

osanaiko

Engaged Member
Modder
Jul 4, 2017
2,961
5,693
Do you know if there is something like a load balancer where I can run a bigger model split across my 4080 Super in my main PC and the 1080 in a home server?
As far as I know this is not possible. At best you could run a model that fits in the VRAM (allowing for context space) of the smaller card. And even then it would be deathly slow, because the inter-layer comms would be going across the network, not the local PCIe bus.

Are there any guides about proper prompting and settings to get a gist of that without a lot of trial and error?
You can watch a bunch of vids and learn what other people are doing. But it very much is trial and error to learn how to use it yourself.
 

n00bi

Active Member
Nov 24, 2022
599
664
Do you know if there is something like a load balancer where I can run a bigger model split across my 4080 Super in my main PC and the 1080 in a home server?
I don't know about across networks; even if you could do it, that would be silly and dead slow.
At best you can put that old 1080 into your PC and use two cards.
If you look at "ollama serve --help" you will see this thing called:
OLLAMA_SCHED_SPREAD - Always schedule model across all GPUs

So in your system environment you can add something like:
OLLAMA_SCHED_SPREAD=1
and
CUDA_VISIBLE_DEVICES=0,1

I can't afford two good Nvidia cards, so I have no clue how well it works, though.
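For what it's worth, an untested sketch of applying those two variables before starting the server, assuming both cards end up in the same machine:

```python
# Untested sketch: launch `ollama serve` with both GPUs visible and
# layer spreading enabled. Assumes both cards sit in the same machine.
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_SCHED_SPREAD"] = "1"     # schedule the model across all GPUs
env["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose both the 4080 Super and the 1080

subprocess.run(["ollama", "serve"], env=env)
```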

Are there any great guides on good prompting and settings to learn from?
Sadly I don't have anything specific; you just have to play around and do some trial and error.
On YouTube you can search with keywords like "ollama prompt template" etc.
Matt Williams has a lot of videos about Ollama, among others.
 

no_more_name

Member
Mar 3, 2024
104
46
I looked further into long context and performance degradation, and damn, it's worse than I thought :HideThePain:
You can read the paper here:

(screenshots: the paper's long-context degradation results, and a CoT prompting example)
 

n00bi

Active Member
Nov 24, 2022
599
664
I looked further into long context and performance degradation, and damn, it's worse than I thought :HideThePain:
This is why I also mentioned the use of RAG (Retrieval-Augmented Generation) some posts up.
If done properly, you can use less of the context window.

Let's say you have 5 chapters.
You definitely don't want to feed it all chapters in one go and say "write me the next chapter".

You split each into its own text file,
so you have something like NovelName_Chap1.txt, NovelName_Chap2.txt, ...,
or NovelName_Chap1.1.txt, NovelName_Chap1.2.txt, NovelName_Chap1.3.txt, ..., NovelName_Chap2.1.txt, NovelName_Chap2.2.txt, ...,
and a summary file for each chapter, i.e. NovelName_Summary_Chap1.txt, NovelName_Summary_Chap2.txt, ...

And when you want to continue on with NovelName_Chap6.txt,
you can tell it to use NovelName_Summary_Chap5 to continue the story,
or refer to any earlier chapters if needed in the story.
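A rough sketch of that workflow in Python, reusing the file naming above (the model name and prompt wording are just placeholders):

```python
# Rough sketch: feed only the latest chapter summary and ask for the next chapter.
# File names follow the naming scheme above; the model name is a placeholder.
import ollama

with open("NovelName_Summary_Chap5.txt", encoding="utf-8") as f:
    summary = f.read()

prompt = (
    "Here is a summary of chapter 5 of my novel:\n\n"
    f"{summary}\n\n"
    "Continue the story: draft chapter 6, staying consistent with this summary."
)

reply = ollama.chat(model="deepseek-r1:14b",
                    messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])
```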

RAG is much more efficient, as it uses less context for better results.
It's not magic, though, but if used properly:
* The retriever selects only the most relevant chunks.
* The model gets just what it needs.
* You don't overwhelm the model with irrelevant or old info.
* You use less of the context window and leave more room for better, focused answers.

Technically, RAG and manual prompting both use the same mechanism,
but RAG is more efficient in how it selects what goes into that window, so it can use less context overall, especially compared to naive, manual pasting.

Setting up RAG requires some work, though. You can do it in Open WebUI if that's your interface, or the like.
Or, if you want to dive into the world of Python, you need to install Chroma, LangChain, etc. for this.
It usually involves splitting the docs into chunks, making vectors, storing them in a DB, etc.

Your docs
→ split into chunks
→ embed (turn into vectors)
→ store in a vector DB (e.g., Chroma)

You ask a question regarding NovelName:
→ relevant chunks are retrieved
→ passed to Ollama as context
→ you get a better, smarter answer based on content from the RAG.

Some Python scripting and a good prompt template + RAG can probably get you a long way.
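To illustrate, a hedged sketch of that pipeline using chromadb and the ollama package; the chunking is deliberately naive and the file/model names are placeholders (a real setup would rather go through LangChain or Open WebUI's built-in document handling):

```python
# Hedged sketch of the pipeline above: chunk the chapters, store them in Chroma,
# retrieve only the relevant bits for a question, then pass them to Ollama.
import glob

import chromadb
import ollama

client = chromadb.Client()
collection = client.create_collection("novel")  # uses Chroma's default embedder

# 1. Split every chapter into chunks and store them as vectors.
for path in glob.glob("NovelName_Chap*.txt"):
    text = open(path, encoding="utf-8").read()
    chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    collection.add(
        documents=chunks,
        ids=[f"{path}-{i}" for i in range(len(chunks))],
    )

# 2. Retrieve only the chunks relevant to the question.
question = "What does Anna know about Marcus at the end of chapter 5?"
hits = collection.query(query_texts=[question], n_results=4)
context = "\n---\n".join(hits["documents"][0])

# 3. Pass just those chunks to the model as context.
reply = ollama.chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user",
               "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(reply["message"]["content"])
```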

I have good experience using RAG when it comes to API documentation.
For example, I use RAG when I want to ask my qwen2.5-coder for C4D-related Python stuff,
because there are so many API changes between C4D versions, and without the RAG it just spits out a mixture of old and new code.
With the RAG I get a much, much better result.
But making novels using RAG... I never tried it for that, though.
 

no_more_name

Member
Mar 3, 2024
104
46
Oh yeah, indeed, but I'm kinda surprised that models degrade this early in context.
Their test is a bit tricky, and iterative writing doesn't need to be fully pristine, but damn.
Probably better overall to aim for a simple 8k context and a more robust quantization if your card can handle it.

As for RAG, I use(d) a trial version of NovelForge. I've never seen anything that comes close to it. It's so feature-rich that it's practically overkill for your typical VN. The author is the one who made Ooba extensions years ago, such as Playground and Twinbook, for those who know. People should really try it, I insist o/
 

osanaiko

Engaged Member
Modder
Jul 4, 2017
2,961
5,693
The author is the one who made Ooba extensions years ago, such as Playground and Twinbook, for those who know.
Now you've sold me; I loved Playground. It was the first extension that made an LLM writing tool feel genuinely useful. I haven't done any more writing with it for a year now, so I'm not up on the state of the art anymore. NovelForge seems like the next thing I should try. Thanks!
 