Have been doing a lot more testing with local models. A promising combination for TA is:
Wizard-8x22b (gemma 27b can do it cheaper but fuck that license) as orchestrator and world state manager
L3 70b New Dawn on response writing and editing
Literally any recent, non-meme small raw instruct model for all misc tasks (query writing, aggregation, maintenance, etc.), like wizardlm2-7b or qwen2-7b... Alright, I guess qwen2-7b only hypothetically. I really like the Qwen models for some reason; they do have good writing, but when they fuck up, it's usually in a major way.
For response writing, I've been experimenting with a very, very smart corpo model scoring and judging which models to use, but it's... difficult to explain the process in concrete terms. There are times when plugging in Stheno can work wonders, but you definitely don't want to leave it on for too long. Stheno is way too chaotic and _will_ wreck your shit; it's only a matter of time. It's kind of what I was describing with the in-context base model learning. A totally different route to a similar result: very high creativity, but mental patient behavior.
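For what it's worth, the swap-in/swap-out part is easier to show than explain. A minimal sketch: a router where a judge picks the model, but chaotic models get a duty-cycle cap so they can't stay active long enough to wreck things. Everything here is hypothetical - the judge callable, the model names, the 10-minute window - it just shows the shape of it:

```python
import time

# Hypothetical registry; model names and the duty window are made up.
MODELS = {
    "new-dawn-70b": {"chaotic": False},
    "stheno-8b": {"chaotic": True, "max_window_s": 600},
}

class ModelRouter:
    """Judge-driven model selection with a duty-cycle cap on chaotic
    models, so something like Stheno can't stay swapped in too long."""

    def __init__(self, judge_pick, clock=time.monotonic):
        self.judge_pick = judge_pick   # callable(context) -> model name
        self.clock = clock
        self.active = None
        self.active_since = 0.0

    def pick(self, context):
        choice = self.judge_pick(context)
        now = self.clock()
        if choice != self.active:
            # Fresh swap-in: start the clock on the new model.
            self.active, self.active_since = choice, now
        elif (MODELS[choice].get("chaotic")
              and now - self.active_since > MODELS[choice]["max_window_s"]):
            # Chaotic model exceeded its window: rotate back to the stable default.
            self.active, self.active_since = "new-dawn-70b", now
        return self.active
```

The injectable `clock` is just so the window logic can be tested without waiting ten real minutes.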
I just started messing with New Dawn. Somehow the release passed me by. Obviously Midnight Miqu is still a contender for best RP model despite its age. You see what I mean about L3 being untrainable, though. Even the author himself admits:
> I suspect the first thing people will want to know is how this model stacks up against Midnight Miqu. I'd say it compares favorably, although they're more like cousins than siblings. I would say that Midnight Miqu still has an edge in terms of raw creative juice when it has a good squeeze, but New Dawn is smarter and understands nuances better. You can judge for yourself, but keep in mind that these are simple, one-shot prompts. As you get deeper into your own complex scenarios, I think you'll see more of New Dawn's worth.
Yeah, Midnight Miqu was a fine-tune of a leaked, quantized model that was blown up to f16 with padded weights. Most people (myself included) would then be running a re-quantized version of _that_. That it worked at all is a testament to how fucking good of a model Mistral Medium actually is / was.
But a proper tune of a model with full weights available, like Llama3, which has claims to being the "best open model", done several months after the base model released, by someone competent and experienced, and the best we can get is "well, it's kind of a lateral move" compared to miqu? OK fine, not a tune, technically, but a merge. But I've tried Smaug by itself too, and honestly, I would still take something like Mistral Medium over that (I mean the actual Medium on the API, not miqu). Now, sonnet-3.5, I guess. All these are priced roughly the same, with L3 being slightly cheaper. I mean, if you're going corp cloud already, might as well go for the boss at this weight class. My guess is sonnet-3.5 is an 8x22b MoE. Wizard is not as smart or insightful, but I think that's down to its training, not its weight class. I mean, maybe Anthropic found some other weird weight class and it's some freak size we don't even know can be good, like 8x29b or something, but it's definitely on that order. Cost, intelligence, and speed.
I did try L3 Euryale when I first started this, but it wasn't workable. Stheno is, although it needs careful management and scaffolding around it to keep it on some rails... Might not be worth it. Then again, most of that scaffolding is already in place; it would just need some extra special casing for the times Stheno is swapped in (the long-term goal being custom tunes I do on my own hardware of L3 / qwen2-7b, special cased to AUTISM outputs). L3 8b having a properly done abliterated version already available as a starting point makes it an easy sell over qwen. I mean, can you fucking imagine? Software that just does what you tell it to do and doesn't moralize at you first, and you don't have to fucking hack your own tools to do the job you're trying to do?
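The rails themselves don't have to be complicated. A minimal sketch of the special casing, assuming hypothetical `generate` / `validate` / `fallback_generate` callables standing in for the actual model calls: generate, run a cheap check, retry a couple of times, then hand the turn to a safer model.

```python
def guarded_generate(generate, validate, fallback_generate, prompt, max_retries=2):
    """Rail for a chaotic model: draft with it, check the draft, retry
    a bounded number of times, then fall back to a stabler model.
    All three callables are stand-ins for real model calls."""
    for _ in range(max_retries):
        draft = generate(prompt)
        if validate(draft):
            return draft
    # Chaotic model kept failing the check: let the safe model take the turn.
    return fallback_generate(prompt)
```

The point of the bounded retry is exactly the "don't leave it on too long" problem: you give the chaotic model a couple of chances per turn, never an open-ended loop.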
There is also Magnum to try out - a qwen tune, so I'm interested. "Designed to replicate the prose quality of claude"... I mean, I get it. Claude has the best writing, no doubt. But this approach (sao10k's claude opus synthetic dataset, which was used here) can only ever produce - at best - a slightly shittier Claude. I can already get good Claude at about what I pay to run a 72b. I'm going to try it and not complain any more, though. Ultimately, I don't have a better suggestion. Building a dataset is not easy, and especially with the recent models you need insane volumes of data... I remain hopeful for some breakthrough making training more practical. In the meantime, again, I like qwen. Qwen pretending to be Claude - I'm not going to pass that up.
Edit: Tried Magnum for a bit; generally unimpressed. Inherited lots of Claudisms in addition to the horniness. Had much better results with New Dawn.
Edit 2: Working with New Dawn... It has its own problems... Hear me out. I have a few test personas that should not be difficult to portray. If I showed you the card, you'd say "that's boring shit, why would you even make a card like this". But they have some very, very specific constraints. Constraints which are unambiguous, referred to throughout the context in various ways, and reiterated several times. For one example: this character cannot physically touch this other character. It's a clear, unambiguous instruction that should be easy enough to follow. I don't provide a reason - absolutely everything else is a normal vanilla situation - I just stipulate that clause and reiterate it in different ways with examples.

You'd be surprised at how difficult it is to get a model to follow this instruction in an RP context. It's a pretty good test (one of them, anyway) for how effectively a model generalizes to unusual situations. Typically, the tunes which were trained on a large volume of narrative data, or with aggressive hyperparameters on a narrative dataset (e.g. llama-3-storywriter), find this task completely impossible. Once they lock into a pattern in the writing (which itself has a gravitational pull toward Llama3's base personality), they latch onto those things more and more. This happens to a certain extent with all models, but in the "story" tunes it's especially egregious.
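To make the test concrete: a crude lexical first pass over a draft can flag the obvious violations. The persona names and verb list below are made up, and a real check needs an actual model judging the scene; this only illustrates the shape of the test.

```python
import re

# Hypothetical hard constraint from one of the test cards: persona "Mara"
# must never physically touch persona "Ilsa". Scan a draft for a contact
# verb appearing between the two names, within a few words of each.
CONTACT_VERBS = r"(touch\w*|grab\w*|hug\w*|embrac\w*|held|holds?|caress\w*)"

def violates_no_touch(draft: str, a: str = "Mara", b: str = "Ilsa") -> bool:
    # a ... (up to 6 words) ... contact verb ... (up to 6 words) ... b
    pattern = rf"{a}\W+(?:\w+\W+){{0,6}}{CONTACT_VERBS}\W+(?:\w+\W+){{0,6}}{b}"
    return re.search(pattern, draft, re.IGNORECASE) is not None
```

A screen like this catches the blatant cases; the insidious ones are paraphrased contact a regex will never see, which is exactly where the small-checker approach starts to break down.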
What makes this problem even more insidious is New Dawn's actual inherent quality. It writes very well and is creative. It's almost exactly what I'm looking for. But when it fucks up, it's usually in a completely breaking way like that - it'll violate a core constraint instruction. Like I mentioned before, a single failure in this system can be a very big deal. We have hard persistence, and - for better or worse - events that happen, happen. I'm trying to create an immersive simulation experience. That means no editing of persona responses. That concept doesn't even make sense in my system. The response you are seeing is an aggregated editor response. Even if you were to edit this response, you're not editing any world updates which may have gone to the DB, updates to the user's psych profile, memories created in the DB for a persona... Right now, I do allow a single regen that does an undo. That's as far as I'll go there.

My point is that consistency is more important to my use case than most. Yes, I have a QC bot I can activate that checks the inputs going into the DB. Here's the problem, though. New Dawn is very smart and produces very high quality, readable, and convincing prose. It will easily fool a small model that's sanity-checking data. Just use a bigger model that can understand? Well then, that model would just replace New Dawn altogether, wouldn't it? It's kind of a critical flaw. I'm not giving up on it, just something I noticed.
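The single-regen undo can be sketched as keeping exactly one pre-turn snapshot of the world state. The field names here are invented and the real thing spans DB writes (world updates, psych profile, persona memories), not a dict; this just shows the "one snapshot, one undo" rule.

```python
import copy

class TurnState:
    """One regen = one undo: snapshot the state before a turn commits,
    keep exactly one snapshot, and restore it on regen."""

    def __init__(self, state):
        self.state = state
        self._snapshot = None

    def commit_turn(self, updates):
        # Overwrite the previous snapshot: undo depth is exactly one.
        self._snapshot = copy.deepcopy(self.state)
        self.state.update(updates)

    def regen_undo(self):
        if self._snapshot is None:
            raise RuntimeError("nothing to undo")
        self.state = self._snapshot
        self._snapshot = None  # a second undo is not allowed
```

In a real system the snapshot would be a DB transaction or a versioned write set rather than a deep copy, but the invariant is the same: undo depth of one, then events are permanent.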