Best AI to help with writing and fleshing out a story? (preferably self-hostable)

Illhoon

Member
Jul 7, 2019
455
571
Hi, so basically the title: I'm wondering what is currently the best LLM model to assist in writing and fleshing out a story and scenarios? I'd love for it to be self-hostable. I don't really know any good models; the only big premium model that allows NSFW stuff is Grok.
 

n00bi

Active Member
Nov 24, 2022
599
664
Howdy.

If you want to run a local AI LLM,
then I suggest you look into Ollama, with Open WebUI as an interface; that gets you something more like ChatGPT, but locally.
Ollama supports a great deal of models.

As to which model is best to use, that is a bit subjective, I guess.
But you should stay away from the small models.
Use models that are minimum 12B or 14B (billion parameters).
Overall, DeepSeek, Gemma, Llama, Qwen2, and Phi-4 are decent models.
Most of these models are censored, though you will find some uncensored ones.
But they work fine for general romance story making.

You need to pull down the models you want to use,
for example:
ollama pull deepseek-r1:14b
or
ollama pull phi4:14b
and replace pull with run when you want to run said model.
If you use Open WebUI you just select the model you want there from a list of downloaded ones.

If you search for "uncensored" you will find some models.
I don't know how good they are, though.

There is also the possibility to create a "RAG" template for your story.
Basically, you tell it about what you already have so it doesn't start from no context,
but the RAG needs to be updated every time the story is, so it knows about the latest changes.

You also have the ability to interact with Ollama through Python,
so you can do a lot of fancy stuff if you really want to.
Anyway, you will find a ton of tutorials on YouTube etc., and the Discord server is helpful too.
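For reference, a minimal sketch of that Python interaction using the official ollama package (assuming `pip install ollama` and a model you already pulled; the model name and prompt are only placeholders):

```python
# Minimal sketch: ask a locally pulled model to flesh out a scene.
# Assumes `pip install ollama` and that the model was pulled beforehand,
# e.g. `ollama pull deepseek-r1:14b`.
import ollama

response = ollama.chat(
    model="deepseek-r1:14b",
    messages=[
        {"role": "system", "content": "You are a co-writer helping to flesh out a story."},
        {"role": "user", "content": "Expand this outline into a short scene: "
                                    "two rivals are forced to share a tent during a storm."},
    ],
)
print(response["message"]["content"])
```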

And some ollama fun, making it write a song about itself :p
 

no_more_name

Member
Mar 3, 2024
104
46
Couldn't work on my last VN prototype for ages, but here's what I tested. I do prefer text-generation-webui when I can use it. There are now portable installations that make everything easier; you can grab the latest version corresponding to your card/system.

In my opinion, and after plenty of testing, there are not that many models to choose from. Despite its promise, I found one of them underwhelming in most aspects of what we are interested in. Since we need relatively huge context (16k+ tokens), 4-bit quantization is generally a good compromise.

For "creative" writing, my favorites so far:


(less good than the other two)

For coding (Ren'Py/Python):
(If you have enough VRAM)
(The one I use)
(Not that bad)
(Same)

Each model has recommended settings, especially temperature. For example, Mistral Small uses temp=0.15; for coding answers you want to stay near this value. For creative writing you want to push for more, say temp=1, and a slightly lower top_p like 0.95 (opening up the pool of words).
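If you talk to the backend through its OpenAI-compatible API (text-generation-webui with the 'openai' flag exposes one, and so does Ollama), those settings are just request parameters. A rough sketch, where the endpoint URL, model name and values are only placeholders to adjust:

```python
# Rough sketch: one helper, two sampling profiles (coding vs. creative).
# Assumes a local OpenAI-compatible endpoint; URL, model and values are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

def ask(prompt: str, temperature: float, top_p: float) -> str:
    resp = client.chat.completions.create(
        model="mistral-small",  # whatever model the backend is actually serving
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    return resp.choices[0].message.content

coding_answer = ask("Write a Ren'Py label that shows two lines of dialogue.", 0.15, 1.0)
creative_answer = ask("Describe the tavern where the story opens.", 1.0, 0.95)
```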

For coding, I mostly use it for debugging, not really to write code from scratch. I tried to use it within VS Code but found it was a waste of time. My prototype is probably just way too big overall, and I haven't had time to test it on simpler things. I simply make a quick text skeleton of my game with all critical functions and game logic, a bit like this:

You are an AI specialized in Python and Ren'Py, a game engine used to create visual novel games. You help the user by answering their questions. The following code is the context that the user is working on.

Provided code starts here:
For creative writing I started to use NovelForge (be sure to flag 'openai' in text-generation-webui) until I couldn't work on my prototype anymore. But I found the whole approach quite interesting, especially in a VN context. I can't list all the features, so you have to figure it out for yourself by browsing the site. Note that the trial version is feature-complete but limited to 30 documents, iirc.

 

n00bi

Active Member
Nov 24, 2022
599
664
Each model has recommended settings, especially temperature. For example, Mistral Small uses temp=0.15; for coding answers you want to stay near this value. For creative writing you want to push for more, say temp=1, and a slightly lower top_p like 0.95 (opening up the pool of words).
For beginners, I don't think it's smart to mess around with the settings.
One needs to understand what they do and what effect they will have on the output.
You don't want to end up with hallucinated output, fully out of context.
Once you get more experience you can of course play around with them.

I think one should focus on learning how to make prompts first,
using templates etc., before diving into settings.


For coding, I mostly use it for debugging, not really to write code from scratch.
Oh, I do it the other way: I tell it to make me a function that takes arguments x, y, z as input, does this and that, and returns K.
And I end up debugging the code it's spitting out. It's nice for generating misc templates, though.
Qwen2.5-coder is the best one to use for coding, imo.
I did have a local AI challenge for myself between DeepSeek, Phi-4, Llama, and Qwen2.5-coder,
where the task was to create the old game called Conway's Game of Life. Qwen2.5-coder came out as the winner :)

I see Mistral is mentioned as a model; let's not forget the other model families.
There is also one whose page says, "This model is proficient at both roleplaying and storywriting due to its unique nature."
And there is also another model.
I have no idea how good these are, though.

(If you have enough VRAM)
This is a general rule for all models.
While you can run models on the CPU with enough RAM, that is not really something to aim for.
Running it on the GPU is FTL speed compared to the CPU.
 

no_more_name

Member
Mar 3, 2024
104
46
For beginners, I don't think it's smart to mess around with the settings.
One needs to understand what they do and what effect they will have on the output.
You don't want to end up with hallucinated output, fully out of context.
Once you get more experience you can of course play around with them.

I think one should focus on learning how to make prompts first,
using templates etc., before diving into settings.
I tend to believe the opposite. For iterative writing, mess with the settings (at least temp, top_p, min_p) as soon as possible to get the gist of it. But to each their own, I guess. For coding/instruct you want to stay close to the model's recommended settings most of the time.

This is a general rule for all models.
While you can run models on the CPU with enough RAM, that is not really something to aim for.
Running it on the GPU is FTL speed compared to the CPU.
Yeah, running a local model from RAM is kinda pointless if the goal is to save time (and not lose it).

On another topic, I would add as a rule of thumb that the more context you add, the more your local model will hallucinate (some models are more robust than others). There is not much you can do about it (outside of running tests at ~16k-32k ctx), so be smart about the context you feed it.
 

n00bi

Active Member
Nov 24, 2022
599
664
so be smart about the context you feed it.
Indeed, hence why I said to focus on how to make your prompt first.
How you structure and phrase your question/input will greatly affect what the output will be like.

If you use your own Python script for queries, a good template is nice to have so you can customize the whole thing for novel making, but that's another topic.
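As a hypothetical example of what such a template-driven query script could look like (the template wording, character names and model are all placeholders, not anything official):

```python
# Hypothetical sketch: a small query script with a fixed prompt template,
# so the structure and phrasing stay consistent between queries.
import ollama

TEMPLATE = """You are a co-writer for an adult romance novel.
Setting: {setting}
Characters: {characters}

Task: {task}
Write in third person, past tense, and keep the established tone."""

prompt = TEMPLATE.format(
    setting="a small coastal town, present day",
    characters="Anna (lead), Marcus (her rival)",
    task="Flesh out the scene where Anna and Marcus get stuck in the lighthouse.",
)

reply = ollama.chat(model="deepseek-r1:14b",
                    messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])
```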

And it is really important, if you are working with censored models, how you phrase stuff,
as they often come back with moral preaching or some safety concern.
You must find a workaround, i.e. call a penis a carrot and a vagina a donut, etc.
"He puts his carrot into her donut ..." :p

But yeah, you should play around with the settings as well to see how the output changes,
although don't go down the rabbit hole and spend vast amounts of time doing endless tweaking.
You will see that, often, if you do a fresh query with the same question/input, the output may vary a lot anyway.
 

Illhoon

Member
Jul 7, 2019
455
571
Howdy.

If you want to run a local AI LLM,
then I suggest you look into Ollama, with Open WebUI as an interface; that gets you something more like ChatGPT, but locally.
Ollama supports a great deal of models.

As to which model is best to use, that is a bit subjective, I guess.
But you should stay away from the small models.
Use models that are minimum 12B or 14B (billion parameters).
Overall, DeepSeek, Gemma, Llama, Qwen2, and Phi-4 are decent models.
Most of these models are censored, though you will find some uncensored ones.
But they work fine for general romance story making.

You need to pull down the models you want to use,
for example:
ollama pull deepseek-r1:14b
or
ollama pull phi4:14b
and replace pull with run when you want to run said model.
If you use Open WebUI you just select the model you want there from a list of downloaded ones.

If you search for "uncensored" you will find some models.
I don't know how good they are, though.

There is also the possibility to create a "RAG" template for your story.
Basically, you tell it about what you already have so it doesn't start from no context,
but the RAG needs to be updated every time the story is, so it knows about the latest changes.

You also have the ability to interact with Ollama through Python,
so you can do a lot of fancy stuff if you really want to.
Anyway, you will find a ton of tutorials on YouTube etc., and the Discord server is helpful too.

And some ollama fun, making it write a song about itself :p
Yeah, I already host an LLM via Ollama, mistral dolphin:latest, which should have been uncensored but really isn't, so I'm looking for some models I can run at home (got a 4080 Super) that perform well and are uncensored for NSFW writing. I also have a GTX 1080 with 8 GB VRAM and 64 GB DDR4 in an old PC I have standing around, and I was wondering if I can use that as well to run self-hosted LLMs. Do you know if there is something like a load balancer where I can run a bigger model split across my 4080 Super in my main PC and the 1080 in a home server?

why I said to focus on how to make your prompt first.
How you structure and phrase your question/input will greatly affect what the output will be like.
Are there any great guides on good prompting and settings to learn from?
 

Illhoon

Member
Jul 7, 2019
455
571
Couldn't work on my last VN prototype for ages, but here's what I tested. I do prefer text-generation-webui when I can use it. There are now portable installations that make everything easier; you can grab the latest version corresponding to your card/system.

In my opinion, and after plenty of testing, there are not that many models to choose from. Despite its promise, I found one of them underwhelming in most aspects of what we are interested in. Since we need relatively huge context (16k+ tokens), 4-bit quantization is generally a good compromise.

For "creative" writing, my favorites so far:


(less good than the other two)

For coding (Ren'Py/Python):
(If you have enough VRAM)
(The one I use)
(Not that bad)
(Same)

Each model has recommended settings, especially temperature. For example, Mistral Small uses temp=0.15; for coding answers you want to stay near this value. For creative writing you want to push for more, say temp=1, and a slightly lower top_p like 0.95 (opening up the pool of words).

For coding, I mostly use it for debugging, not really to write code from scratch. I tried to use it within VS Code but found it was a waste of time. My prototype is probably just way too big overall, and I haven't had time to test it on simpler things. I simply make a quick text skeleton of my game with all critical functions and game logic, a bit like this:



For creative writing I started to use NovelForge (be sure to flag 'openai' in text-generation-webui) until I couldn't work on my prototype anymore. But I found the whole approach quite interesting, especially in a VN context. I can't list all the features, so you have to figure it out for yourself by browsing the site. Note that the trial version is feature-complete but limited to 30 documents, iirc.

Ouh, thank you, I will look into the models you mentioned. For coding I'll stick to the big models for now; I don't think I can get the same performance self-hosted. I am looking for NSFW models I can self-host to help me write NSFW specifically. Also, do you know if there is something like a load balancer where I can split LLM loads onto multiple GPUs not in the same system? Because I've got a 4080 Super in my main PC, and I have an old PC with a 1080 with 8 GB VRAM in there, and I thought I might be able to utilize that while I still have it; I just have no clue if that is even possible.

On another topic, I would add as a rule of thumb that the more context you add, the more your local model will hallucinate (some models are more robust than others). There is not much you can do about it (outside of running tests at ~16k-32k ctx), so be smart about the context you feed it.

Are there any guides about proper prompting and settings to get a gist of that without a lot of trial and error?
 

osanaiko

Engaged Member
Modder
Jul 4, 2017
2,961
5,693
Do you know if there is something like a load balancer where I can run a bigger model split across my 4080 Super in my main PC and the 1080 in a home server?
As far as I know this is not possible. At best you could run a model that fits in the VRAM (allowing for context space) of the smaller card. And even then it would be deathly slow, because the inter-layer comms would be going across the network, not the local PCIe bus.

Are there any guides about proper prompting and settings to get a gist of that without a lot of trial and error?
You can watch a bunch of vids and learn what other people are doing. But it very much is trial and error to learn how to use it yourself.
 

n00bi

Active Member
Nov 24, 2022
599
664
Do you know if there is something like a load balancer where I can run a bigger model split across my 4080 Super in my main PC and the 1080 in a home server?
I don't know about across networks; even if you could do it, that would be silly and dead slow.
At best you can put that old 1080 into your PC and use two cards.
If you look at "ollama serve --help" you will see this thing called:
OLLAMA_SCHED_SPREAD - Always schedule model across all GPUs

So in your system environment you can add something like:
OLLAMA_SCHED_SPREAD=1
and
CUDA_VISIBLE_DEVICES=0,1

I can't afford two good Nvidia cards, so I have no clue how well it works, though.
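For what it's worth, an untested sketch of applying those two variables before starting the server, assuming both cards end up in the same machine:

```python
# Untested sketch: launch `ollama serve` with both GPUs visible and
# layer spreading enabled. Assumes both cards sit in the same machine.
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_SCHED_SPREAD"] = "1"     # schedule the model across all GPUs
env["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose both the 4080 Super and the 1080

subprocess.run(["ollama", "serve"], env=env)
```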

Are there any great guides on good prompting and settings to learn from?
Sadly I don't have anything specific; you just have to play around and do some trial and error.
On YouTube you can search with keywords like "ollama prompt template" etc.
Matt Williams has a lot of videos about Ollama, among others.
 

no_more_name

Member
Mar 3, 2024
104
46
I looked further into long context and performance degradation, and damn, it's worse than I thought :HideThePain:
You can read the paper here:

(screenshots: the paper's long-context degradation results, and a CoT prompting example)
 

n00bi

Active Member
Nov 24, 2022
599
664
I looked further into long context and performance degradation, and damn, it's worse than I thought :HideThePain:
This is why I also mentioned the use of RAG (Retrieval-Augmented Generation) some posts up.
If done properly, you can use less of the context window.

Let's say you have 5 chapters.
You definitely don't want to feed it all chapters in one go and say "write me the next chapter".

You split each into its own text file,
so you have something like NovelName_Chap1.txt, NovelName_Chap2.txt, ...,
or NovelName_Chap1.1.txt, NovelName_Chap1.2.txt, NovelName_Chap1.3.txt, ..., NovelName_Chap2.1.txt, NovelName_Chap2.2.txt, ...,
and a summary file for each chapter, i.e. NovelName_Summary_Chap1.txt, NovelName_Summary_Chap2.txt, ...

And when you want to continue on with NovelName_Chap6.txt,
you can tell it to use NovelName_Summary_Chap5 to continue the story,
or refer to any earlier chapters if needed in the story.
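A rough sketch of that workflow in Python, reusing the file naming above (the model name and prompt wording are just placeholders):

```python
# Rough sketch: feed only the latest chapter summary and ask for the next chapter.
# File names follow the naming scheme above; the model name is a placeholder.
import ollama

with open("NovelName_Summary_Chap5.txt", encoding="utf-8") as f:
    summary = f.read()

prompt = (
    "Here is a summary of chapter 5 of my novel:\n\n"
    f"{summary}\n\n"
    "Continue the story: draft chapter 6, staying consistent with this summary."
)

reply = ollama.chat(model="deepseek-r1:14b",
                    messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])
```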

RAG is much more efficient, as it uses less context for better results.
It's not magic, though, but if used properly:
* The retriever selects only the most relevant chunks.
* The model gets just what it needs.
* You don't overwhelm the model with irrelevant or old info.
* You use less of the context window and leave more room for better, focused answers.

Technically, RAG and manual prompting both use the same mechanism,
but RAG is more efficient in how it selects what goes into that window, so it can use less context overall, especially compared to naive, manual pasting.

Setting up RAG requires some work, though. You can do it in Open WebUI if that's your interface, or the like.
Or, if you want to dive into the world of Python, you need to install Chroma, LangChain, etc. for this.
It usually involves splitting the docs into chunks, making vectors, storing them in a DB, etc.

Your docs
→ split into chunks
→ embed (turn into vectors)
→ store in a vector DB (e.g., Chroma)

You ask a question regarding NovelName:
→ relevant chunks are retrieved
→ passed to Ollama as context
→ you get a better, smarter answer based on content from the RAG.

Some Python scripting and a good prompt template + RAG can probably get you a long way.
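To illustrate, a hedged sketch of that pipeline using chromadb and the ollama package; the chunking is deliberately naive and the file/model names are placeholders (a real setup would rather go through LangChain or Open WebUI's built-in document handling):

```python
# Hedged sketch of the pipeline above: chunk the chapters, store them in Chroma,
# retrieve only the relevant bits for a question, then pass them to Ollama.
import glob

import chromadb
import ollama

client = chromadb.Client()
collection = client.create_collection("novel")  # uses Chroma's default embedder

# 1. Split every chapter into chunks and store them as vectors.
for path in glob.glob("NovelName_Chap*.txt"):
    text = open(path, encoding="utf-8").read()
    chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    collection.add(
        documents=chunks,
        ids=[f"{path}-{i}" for i in range(len(chunks))],
    )

# 2. Retrieve only the chunks relevant to the question.
question = "What does Anna know about Marcus at the end of chapter 5?"
hits = collection.query(query_texts=[question], n_results=4)
context = "\n---\n".join(hits["documents"][0])

# 3. Pass just those chunks to the model as context.
reply = ollama.chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user",
               "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(reply["message"]["content"])
```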

I have good experience using RAG when it comes to API documentation.
For example, I use RAG when I want to ask my qwen2.5-coder for C4D-related Python stuff,
because there are so many API changes between C4D versions, and without the RAG it just spits out a mixture of old and new code.
With the RAG I get a much, much better result.
But making novels using RAG... I never tried it for that, though.
 

no_more_name

Member
Mar 3, 2024
104
46
Oh yeah, indeed, but I'm kinda surprised that models degrade this early in context.
Their test is a bit tricky, and iterative writing doesn't need to be fully pristine, but damn.
Probably better overall to aim for a simple 8k context and a more robust quantization if your card can handle it.

As for RAG, I use(d) a trial version of NovelForge. I've never seen anything that comes close to it. It's so feature-rich that it's practically overkill for your typical VN. The author is the one who made Ooba extensions years ago, such as Playground and Twinbook, for those who know. People should really try it, I insist o/
 

osanaiko

Engaged Member
Modder
Jul 4, 2017
2,961
5,693
The author is the one who made Ooba extensions years ago, such as Playground and Twinbook, for those who know.
Now you've sold me; I loved Playground. It was the first extension that made an LLM writing tool feel genuinely useful. I haven't done any more writing with it for a year now, so I'm not up on the state of the art anymore. NovelForge seems like the next thing I should try. Thanks!
 