Tutorial VN Using your GPU to power AI for Erotic Roleplay

Dofus

New Member
Oct 7, 2017
5
16
Basically, work on LLM AIs has advanced enough that you can run them reliably on semi-high-end consumer PCs, thanks to something something quantization. You can use a combination of CPU and/or GPU to run these, but I will only focus on running them on the GPU for near-instant responses.

There are a lot of options for running this stuff in the cloud if you don't have a good enough PC, from paid options such as NovelAI and OpenRouter to free options like Google Colab, Poe, Horde and so on, but in this guide I will focus only on using our own hardware. For context, I have an RTX 4070 with 12GB VRAM and 16GB RAM on Windows 10. I've only played with local models for about a month now, so I have limited knowledge on the matter.

Next thing you should know is that this stuff advances fast; there's probably already better stuff out there that I haven't heard about yet. New stuff comes out daily, so keep that in mind if you start googling for information and guides, since a lot of it is outdated. Extra details about why certain choices were made or what some setting means have been moved to the end of the post.



1. BACKEND
For the backend we will use KoboldCPP because it's a simple single .exe file with everything included. Oobabooga (text-generation-webui) is a good alternative choice with better support for different model formats, but perhaps slightly more complex to use.

So grab koboldcpp.exe from the KoboldCPP GitHub releases page and move it somewhere where you don't run into Windows permission issues, such as C:\AI\Kobold\
AMD USERS: At the time of writing, the ROCm fork of KoboldCPP was recommended over the default build.

2. MODEL
For this guide I'll use a small 7B Mistral model, specifically so it will actually fit on 8GB VRAM GPUs. If you have more VRAM you can also test 10.7B, 13B and 20B Llama-based models.

So let's grab the .gguf file from its Hugging Face page: click the little download icon next to it, and when it's finished downloading move it to your KoboldCPP folder.



3. FRONTEND
KoboldCPP actually comes with a light frontend, so you could skip this step, but in my opinion SillyTavern has become essential with all of its bells and whistles to keep you immersed.

Now there are a couple of different ways of installing ST.

Option 1: Personally I would just install with Git as guided on the SillyTavern GitHub page, which is just: install NodeJS and Git, clone the release branch, then open Start.bat. If you don't want to install Git for some reason, you could just grab the zip file from their GitHub page, but you will have to do updates manually in the future.

E: Removed SillyTavernSimpleLauncher as it's considered outdated now.

When you have the SillyTavern page open it should ask for your name, and you can start with simple mode, but I would just go straight to advanced.

4. BOOTING UP BACKEND
Now that we have SillyTavern installed, let's start KoboldCPP. Let's go over some settings. From the Presets dropdown: if you're an Nvidia user, use CuBLAS; AMD users should use hipBLAS.

Click Browse for the .gguf model file you downloaded.

Next, GPU Layers: set it manually to 35 for 7B models or 43 for 13B models, or just use some really high number and it will automatically cap at the maximum. You can confirm it says something like 'offloaded 33/33 layers to GPU' once we launch, to make sure the entire thing is on the GPU. Lower than that and you'll be using the CPU as well, meaning slower responses.

And last, Context Size: think of this like the max memory for the AI. When you chat with the AI, its context will fill up, and after that some memories need to be discarded, which usually means the bot won't remember things you chatted about earlier on. There are tricks in SillyTavern like Vector Storage and Summarize that try to alleviate that, though.

I believe 4096 is a good starting point for most models at the moment. You should be able to get away with even 8k on a Mistral 7B Q5_K_M model on an 8GB card. Check your VRAM usage in Windows Task Manager (Dedicated GPU Memory) to see how limited you are, then raise it if you have room. But you should know most Mistral-based models break past 8k context once it fills up, and most Llama-based 10.7B, 13B and 20B models break after 4k context. More about this, and how to get even higher context, at the end of the post.
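If you want a rough feel for whether a given context size will fit before you launch, here's a minimal back-of-envelope sketch. All the numbers in it are assumptions on my part: a roughly 5 GB Q5_K_M file for a 7B model, an fp16 KV cache, and Mistral 7B's published config (32 layers, 8 KV heads, head dim 128). Check the actual file size and model card, and expect some extra backend overhead on top.

Code:
# Very rough VRAM estimate for a fully GPU-offloaded GGUF model.
# Assumptions: fp16 KV cache, Mistral 7B-style config, ~0.5 GB backend overhead.
model_file_gb = 5.0       # approximate size of a 7B Q5_K_M .gguf on disk
n_layers = 32
n_kv_heads = 8            # Mistral 7B uses grouped-query attention
head_dim = 128
bytes_per_value = 2       # fp16
context = 8192            # tokens

# K and V caches per token: 2 * layers * kv_heads * head_dim * bytes
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
kv_gb = kv_bytes_per_token * context / 1024**3

total_gb = model_file_gb + kv_gb + 0.5
print(f"KV cache ~{kv_gb:.2f} GB, total ~{total_gb:.1f} GB")
# Prints roughly 1.0 GB for the cache and ~6.5 GB total, which is why 8k context
# on an 8GB card is plausible for a 7B Q5_K_M model.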

Leave the rest on default; you can leave Launch Browser on if you want to check out the Kobold Lite frontend. Hit Launch.


This is a pic from an outdated version; leave Use ContextShift on in the new version.
Also, it might say 33/33 layers instead of 35 in the new version.


5. SILLYTAVERN SETTINGS
Back to SillyTavern, if you closed it just use the Start.bat again.

At the top: Power Plug Icon -> API -> Text Completion -> KoboldCpp -> API URL (the address KoboldCPP prints when it launches, http://127.0.0.1:5001 by default), then hit Connect and it should show a green icon.
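If the green icon refuses to show up, you can sanity-check that KoboldCPP is actually listening before blaming SillyTavern. A minimal sketch, assuming the default port 5001 and the KoboldAI-style endpoint that KoboldCPP exposes as far as I know; if your build differs, check its console output.

Code:
# Quick check that the KoboldCPP server is reachable, independent of SillyTavern.
# Port 5001 and the /api/v1/model endpoint are assumptions based on the default setup.
import requests

resp = requests.get("http://127.0.0.1:5001/api/v1/model", timeout=5)
print(resp.status_code, resp.json())  # expect 200 and the name of the loaded model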

Leftmost Settings Icon -> Left Panel Opens -> Kobold Presets
You just need to experiment and find one that fits your model and gives writing results you like. Simple-1 or Storywriter are a decent start.

Enable Streaming; it makes it so you see the AI writing in real time instead of posting the whole message at once.
Set Context Size the same as you did in Kobold. The Response Length default is fine, but I keep mine at 300. Repetition Penalty at around 1.1 has been good in my experience; play with it if you start getting a lot of repetition, but repetition is also sometimes just a limitation of small models. Temperature controls how creative the AI can get, but too much and it turns into nonsense. Remember these sliders are double-edged swords.
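For the curious, these sliders roughly map onto the parameters the frontend sends with every request. Here's a hedged sketch of a raw request against KoboldCPP's KoboldAI-style text completion endpoint; the parameter names are how I understand that API, so verify them against your build before relying on this, and the prompt is just a placeholder.

Code:
# What the SillyTavern sliders roughly become on the wire (parameter names assumed
# from the KoboldAI-style API that KoboldCPP serves; the prompt is a placeholder).
import requests

payload = {
    "prompt": "### Instruction:\nWrite a short greeting in character.\n### Response:\n",
    "max_context_length": 4096,  # Context Size
    "max_length": 300,           # Response Length
    "temperature": 0.7,          # creativity vs. nonsense trade-off
    "rep_pen": 1.1,              # Repetition Penalty
}
resp = requests.post("http://127.0.0.1:5001/api/v1/generate", json=payload, timeout=300)
print(resp.json()["results"][0]["text"])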

Advanced Formatting (A icon) -> Context Template/Presets. I recommend using the one mentioned in the card details on Hugging Face where you downloaded the model (ChatML for our 7B OpenHermes, Alpaca for most Llama models), or you can try the Roleplay preset if you want really detailed and long answers. Also tick Bind to Context so the preset switches along with it. Experiment, look into the advanced settings, instruct mode, Google, etc.

User Settings -> I recommend enabling the 'Quick "Continue" Button'; it adds a button at the bottom that attempts to finish the text if it was cut off mid-sentence. Advanced Formatting also has a Trim Incomplete Sentences option, but after some testing I'm not a fan of it; I would rather lower the target length (tokens) setting and use Continue if you get a lot of cut sentences, or just try different presets/models.

Background -> change to something nice.

More about extension settings in the next step.

Character Management (last icon) -> This is where you select the bot you want to chat with, import new ones, or create one from scratch.

6. CHARACTER CARDS
Instead of writing the bot yourself, you can find a bunch of ready-made ones online and edit them to your liking later on. On the card site, toggle the NSFW slider at the top and find something to your liking. For this guide I'll use a card that I've forked and created expression images for.

Click the purple V2 icon; it will download an image file that has all the character data stored in its metadata. Also click the last red icon with the loving smiley, which will download a zip file of the expressions.



In SillyTavern, click the character card icon at the top and then the small 'import character from file' icon, and find the image file you downloaded. The bot should appear in the list on the right; click it and accept the lorebook import if it asks. If it doesn't ask, you can import it manually by clicking the globe icon.




Then click the 3-boxes icon on the top bar (Extensions), go to Character Expressions, check "Local server classification", then scroll down, press the "Upload sprite pack (ZIP)" button and find the zip file you downloaded. I also recommend enabling Vector Storage and turning on Summarize by selecting Main API from its dropdown menu. From the User Settings tab I recommend enabling Visual Novel Mode, especially for mobile devices. The advanced settings have more options for moving the images, etc. Also, if you don't have expressions for the card, you can just click the avatar to expand it.




PROFIT
Start chatting away, try different character cards, try different writing presets and settings, research extensions, try downloading different models.


There's a bunch of other stuff you can research, such as Text-to-Speech, Live2D for an animated VTuber model, Talking Head for creating an animated face from a static image file, and Stable Diffusion to let the AI create images in the chat based on the roleplay and characters. Some of these use your GPU, so if you have extra VRAM, running Stable Diffusion is a pretty good addition to the RP. If you end up trying it, keep in mind ST now has out-of-the-box support for it; you don't need the Extras server installed anymore like most old guides mention. There's also Google Colab hosted stuff etc. for running these addons in the cloud for free, but with limits and potential privacy risks.

_____________________________________________________________________________________________
EXTRA - NOT SO IMPORTANT INFORMATION

When you're more comfortable with this stuff, I recommend switching to Oobabooga (text-generation-webui) and using EXL2 models; they run faster than GGUF and you can save VRAM with settings like the 8-bit cache. Also, if you have 16-24GB VRAM you might want to look into even bigger models such as 8x7B Mixtral and 20B / 34B Llama, or just run higher-quant versions of the smaller ones with more context.

Why exactly the Q4_K_M file?
K_M is the most recommended variant at the time of writing, and Q4 is considered the lowest sweet spot; any lower and the model starts to become too dumb. But if you have extra VRAM to spare, grab something like Q5 or even Q8. TheBloke on Hugging Face always has a nice table on his release model card pages that gives you some direction on how much memory you'll need for each version and what the different letters mean. There are also a couple of graphs floating around about perplexity loss between quants and models to give you some direction; lower is better.
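As a rule of thumb you can also ballpark the download size (and roughly the VRAM the weights need) from the parameter count and the average bits per weight. The bits-per-weight figures below are my approximations for the common quants, so treat the output as a rough estimate rather than the exact numbers from the model card.

Code:
# Ballpark GGUF size from parameter count and approximate average bits per weight.
# Real files differ a little because some tensors are kept at higher precision.
def approx_size_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8  # decimal GB, like Hugging Face lists

for name, bpw in [("Q4_K_M", 4.85), ("Q5_K_M", 5.69), ("Q8_0", 8.50)]:
    print(f"Mistral 7B ({name}): ~{approx_size_gb(7.24, bpw):.1f} GB")
# Roughly 4.4 / 5.1 / 7.7 GB; add a GB or two on top for context and backend
# overhead when judging whether it fits in your VRAM.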

What models do you recommend?
At the time of writing, variations of Mistral 7B are considered the best small models, with Noromaid being the popular choice for 13B or 20B roleplay. If you google around, a lot of people will recommend MythoMax, but it's getting pretty old at this point.

New models come out daily, so if you're lost and want some guidance, there's a reddit user who does rankings for different use cases; read their post history and look into their RP tests. Another option is Ayumi's list. But take these tests with a grain of salt and experiment on your own.

How to see how much of the chat is stored in context?
It's shown by the yellow line if you scroll up the chat. Also check Summarize in the Extensions tab every once in a while to make sure it's somewhat accurate. Vector Storage's injection position setting might also be worth changing to In-chat instead of After Main Prompt, but more testing is needed.

More on Context Size?
Set it too high, even if you have enough VRAM, and the model will start writing garbage once the context fills up (70+ SillyTavern messages). Research that specific model's limits, what it was trained on, and the RoPE setting. If you want some pointers: there is at least one Mistral finetune with working 32k context out of the box, while most other Mistral models only work up to 8k despite their default configs showing 32k.

But this 8k can be doubled to 16k with a 2.64 alpha_value setting in Oobabooga, which should translate to a RoPE Base of around 26,000 in KoboldCPP if the model's default is 10,000. Keep in mind, though, that the gains from RoPE scaling come at the cost of making the model a bit worse. There's also a reference chart for custom context sizes in Oobabooga (Llama below, Mistral above).
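If you want to sanity-check that alpha-to-base conversion yourself, one commonly quoted NTK-aware formula is new_base = default_base * alpha^(head_dim / (head_dim - 2)), with head_dim being 128 for these Llama/Mistral-style models. Treat this as an approximation; different frontends apply slightly different variants.

Code:
# NTK-aware alpha -> RoPE base conversion, as I understand the commonly used formula.
# head_dim = 128 is an assumption that holds for typical 7B/13B Llama/Mistral models.
default_base = 10000
alpha = 2.64
head_dim = 128

new_base = default_base * alpha ** (head_dim / (head_dim - 2))
print(round(new_base))  # ~26,800, in the same ballpark as the "around 26 000" above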

Why is my model so slow?
Keep in mind that when you get near your max VRAM, Windows likes to start moving data to your RAM, which will make the model really slow. There's actually a setting called 'CUDA - Sysmem Fallback Policy' in the latest Nvidia drivers to prevent this, so the app just crashes when it maxes out video memory instead of trying to move it to system RAM, making troubleshooting easier.

A good speed for a model that is fully loaded on the GPU is >20 tokens/s; with EXL2 I'm getting 40+. If you are getting below 10 T/s, the model is most likely not fully running on your GPU. Check that you have all layers on the GPU and ~0.6GB of free VRAM, so you know Windows is not moving it to system RAM.
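If you'd rather measure it than dig through the console, here's a rough timing sketch against the same KoboldAI-style endpoint assumed earlier. It divides the requested max_length by the elapsed time, so if the reply stops early the real speed is higher than what it prints.

Code:
# Rough tokens/s estimate via the API. KoboldCPP's own console readout is more
# accurate; this just gives a quick number without reading logs. Endpoint and
# parameter names are the same assumptions as in the earlier snippets.
import time
import requests

payload = {"prompt": "Once upon a time", "max_length": 200, "temperature": 0.7}
start = time.time()
requests.post("http://127.0.0.1:5001/api/v1/generate", json=payload, timeout=600)
elapsed = time.time() - start
print(f"~{payload['max_length'] / elapsed:.1f} tokens/s (assuming a full-length reply)")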

I want to use a phone/tablet to chat with SillyTavern hosted on my PC?
Create whitelist.txt in the SillyTavern folder and add 127.0.0.1 and 192.168.*.* to it, or whatever local IP space you use. Edit config.yaml to 'listen: true', restart, and now you can connect from another device on your local network.
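To make those two changes concrete, this is roughly what they end up looking like (the 192.168.*.* line assumes your home network uses that range; adjust it to whatever your router actually hands out). whitelist.txt just contains the allowed addresses, one per line:

127.0.0.1
192.168.*.*

And in config.yaml the relevant line becomes:

listen: true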

Some Bookmarks:
- Source for models, people like TheBloke upload quantized versions.
- SillyTavern discussion & guides
- Discussion, news & reviews about local models.
- Writes reviews
- Models rated by their ERP potential
- Character cards
- ST Documentation & Guides
- Makes a lot of useful videos about this
 

Pif paf

Engaged Member
Feb 5, 2018
2,238
1,047
wow looks soo awesome! TYVM for your work here Dofus !
 

SoniaHopkins

New Member
Nov 6, 2023
1
0
Your article is great. It helps me understand a game as well as I intend to play it. I have also experienced many games from to famous racing games and I have never seen an article as detailed as yours. I will definitely try it out.
 

esio

New Member
Sep 22, 2017
3
39
I don't get why you're using Vector Storage and summarize at the same time, since they both do the same thing.
I'd say the results would be much better by using either one or the other, not running them side by side and confusing the model with different instructions and context.

Also, there is a site 'benchmarking' different models. It's not flawless, but it's a nice overview of a few current models.


I'd probably use 7B or 13B models at most on local machines, if you have the power to run them. I'm currently playing around with Pygmalion2-13B & MythoMax L2, running in a Google Colab instance. Before that I experimented a lot with ChatGPT 3.5 & 4 through the ChatGPT API, but it's difficult to find good jailbreaks that still work without the model being 'influenced' by OpenAI's moral settings imho, even though ChatGPT-4 is otherwise pretty good, not to say the best, at least for SFW RP.

I also had some good results with NovelAI, which is a paid subscription and needs some additional setup in SillyTavern to work well, but at this point it's better to wait for their Aetherroom chatbot imho, which would basically be character.ai (don't bother with that one) but uncensored. It's based on their Kayra model, just tweaked for roleplay, and you won't need to use SillyTavern for it anymore. If everything goes according to their plans, they'll release it this year.
 

Dofus

New Member
Oct 7, 2017
5
16
I don't get why you're using Vector Storage and summarize at the same time, since they both do the same thing.
I'd say the results would be much better by using either one or the other, not running them side by side and confusing the model with different instructions and context.
From how I've understood it, Vector Storage acts more like world info and is only searched based on the words you say, meaning it's good for referencing specific things but doesn't actually help keep the AI on track with the overall story like Summarize does? So unless I'm mistaken, Vector Storage would be equivalent to manually filling a lorebook/world info with new information, and Summarize would be equivalent to updating the character card's description / the beginning of the story. With that said, I've had the Summarize generation cause me issues or just make up details on some models, so disabling it is not a terrible idea.

Ayumi's list was already mentioned in the extra section, and yes, in my experience 13B has given more consistent results, but Mistral has gotten pretty good and 7B lets you spend the VRAM budget on larger context.
 

esio

New Member
Sep 22, 2017
3
39
Well, to make it short, Vector Storage is the newer one and will probably get maintained in the future (use the staging/testing branch if you want to test around with it), while Summarize will probably not get updates anymore.
Summarize will also spend a lot of the context memory that you're basically trying to save.
And afaik it's similar with Vector Storage: you never know if the information you're trying to keep is actually within the context.

Also, Vector Storage uses the model API itself, and some models don't like that either.

The model making up details could actually come from either of them; I've seen reports from people using Summarize but also VS.

So while there may still be a bit of headroom for optimization in all of this, let me just say that even the ST dev and some other experienced people out there recommend NOT using either of them at all. At a certain point, if your max context size is 4096, it's 4096.

For memories that need to be kept, the better solution is to edit or add stuff to the character description, or use the Author's Notes within ST.
 

jack_px

Member
Oct 21, 2016
186
27
Man, thanks for the guide. I'm planning on upgrading my GPU in some time, and I'm having a blast playing with AI using Horde, but I couldn't really understand how to set things up locally. This is like the guide for dummies I needed xD, so thanks man!
 

XinoTrax

Newbie
Feb 3, 2019
24
14
This is so cool, man! Coincidentally, I've been playing with real-time voice conversion AI stuff for shits and giggles for the past week, and lo and behold I stumble into this fantastic thread. More rabbit holes to dive into. Thank you for publishing this guide; I'm looking forward to researching and delving into all of this stuff. Interesting times we live in...
 

trickofspades

New Member
Feb 7, 2023
4
0
Thanks for the tutorial. I know you didn't do this to get into troubleshooting.

But...

I feel like I followed the tutorial, but I'm getting a "404 not found" error from koboldcpp.exe's own web page, which I assume is why I can't connect it to Tavern properly.

My guess is either my GPU is not up to snuff (GeForce RTX 2060) or I've got some sort of weird permissions/firewall issue going on.

I've tried searching for help but I can't seem to find any recent threads with my issue.

I've snipped the last part of the cmd box below
 

soldano

Member
Jan 29, 2018
232
385
What surprises me is that this technology is not starting to be integrated into real games. I guess maybe it's still early, but speaking from a lack of knowledge, I don't think it would be so crazy to embed language models in a visual novel (or other types of video games, but visual novels seem like the ideal place for this to land). This could provide a much more immersive experience than current hard-coded dialogue and would take visual novels to another level.

And this would have to be studied more carefully, but I think it could take a lot of work away from developers. I mean, instead of having to design specific dialogues, the designer would just have to design the correct "tokens" for his NPC. Of course, like everything, it must be done carefully and knowing well what you want to achieve. A character with insufficient or poorly designed tokens could be so generic that they become indistinguishable from other characters. Or it could lead to strange situations, where the NPC says things that don't match the story. But like anything, if done right, it can result in a wonderful experience.

If I, who have no idea, have managed to create a couple of characters in TavernAI that have perfectly responded to what I wanted from them, someone with more knowledge and time can do wonders.
 

JoelPerez

New Member
Jun 29, 2022
1
0
Hey, I kind of moved some of the settings and I was wondering if you also get slow writing or if it's normal. It takes my PC about 100 seconds to write 300 tokens; is this normal, or should I change the configuration?
 

TacoTown

New Member
Aug 26, 2019
11
24
Hey, I kind of moved some of the settings and I was wondering if you also get slow writing or if it's normal. It takes my PC about 100 seconds to write 300 tokens; is this normal, or should I change the configuration?
What are your specs? Which settings did you change? Is it slow analyzing your prompt or just when generating?
 

soldano

Member
Jan 29, 2018
232
385
Hey, I kind of moved some of the settings and I was wondering if you also get slow writing or if it's normal. It takes my PC about 100 seconds to write 300 tokens; is this normal, or should I change the configuration?
It depends on the model you are using. Look at the ETA: the lower the better. For Speed, the higher the better. For Queue, I guess the lower the better, but I'm not sure. And check the workers.

All that, of course, if you are using an online service like Horde. If you are running the models locally, then perhaps your computer is just not up to par.
 

mrme

Active Member
Nov 8, 2017
872
978
Thanks for the guide.


I'm pretty impressed. I've got a 24GB 3090 Ti, so I'm able to run the larger models with a large context (8192); the response time is only a few seconds for a couple of sentences and it manages to stay pretty coherent - a little random at times, but not getting stuck in nonsensical loops.
About as good as NovelAI, from a quick couple of playthroughs (RP, not chat).

I'm using Noromaid-13B with the Alpaca-Roleplay formatting preset. I found that ticking the 'generate one line' option gives good detailed responses without the AI running away with the story (especially when continuing).


I've found Author's Notes, but what is the equivalent of 'Memory' in SillyTavern, if it has one?
And is there something like NovelAI's instruct block?
 

Wcares

Newbie
Nov 13, 2016
58
51
Thanks for the guide. I was able to get everything set up running a 13B model locally and it's super fast... well, for text generation.

Now the part I struggle with is getting an instance of Stable Diffusion running locally and hooking it up to SillyTavern. I followed some guide and did manage to get SD running (even generated a test image in the webUI), but when I try to run both models in parallel, everything slows to a crawl. Not only does text generation in SillyTavern noticeably crawl, but I could never get SD to generate an image in SillyTavern. It doesn't error out; I can see the model generating the prompt and even see SD making the image, it just doesn't progress. Like, it got 30% through one 512x512 image after 5 minutes.
I'm guessing it's because I'm using too much VRAM, but I have no clue how to control the amount of VRAM Kobold or SD are using. For reference, I have a high-end AMD CPU and GPU.

Anyone got any tips or a good guide for hooking up image generation to SillyTavern?