My own personal AI thread

desmosome

Conversation Conqueror
Sep 5, 2018
6,644
14,996
This is a common misconception, mostly because OpenAI's marketing is a bit misleading. They aren't lying; they do have a next-level AI compared to most others, but the art side is limited. It's damn good, but until OpenAI offers the access and customization available in Stable Diffusion and similar technologies, it will never equal them. And yeah, companies will definitely pay people with that knowledge in the future, mostly because as AI imaging advances it is slowly becoming more and more complex to actually accomplish.

Basically, what I'm saying is that no matter how advanced OpenAI is, it is basically Microsoft Paint compared to Stable Diffusion, ComfyUI, Forge, etc.

Visual aid to this concept.
[attached image]
Stable Diffusion, and all diffusion models (basically everything except the new GPT), are fundamentally limited by the fact that they start generation from random noise.

Yes, there are ControlNet and all sorts of other tools, but the process lacks basic contextualization and prompt adherence because the image is literally refined out of random noise.
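To make the "starts from random noise" point concrete, every diffusion sampler is doing something like this under the hood. This is a toy sketch, not a real model; the denoiser and the schedule here are stand-ins for the trained network and scheduler:

```python
import torch

def toy_denoiser(x, t):
    # Stand-in for the trained noise-prediction network (e.g. a U-Net).
    # A real model would predict the noise present in x at timestep t.
    return 0.1 * x

# Generation does NOT start from the prompt -- it starts from pure noise.
x = torch.randn(1, 4, 64, 64)  # random latent, the "canvas"

for t in range(50, 0, -1):
    predicted_noise = toy_denoiser(x, t)
    x = x - predicted_noise  # each step removes a little of the estimated noise

# x is now the "denoised" latent; a real pipeline would decode it to pixels.
print(x.shape)
```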

I suppose stable diffusion with all the bells and whistles might produce a cooler looking image currently, but the ability to produce the image that the user asked for is severely lacking. Fiddling with this inherent weakness is what separates the average user from a more experienced one.

With GPT, the AI can understand images the way it understands and contextualizes text, tokenizing the image and building it bit by bit rather than refining a picture out of random background noise. This drastically lowers the barrier of entry. Basically anyone can create a consistent set of images with just a text prompt.
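Nobody outside OpenAI has published the exact architecture, but the autoregressive idea is roughly this: the image is a sequence of discrete tokens produced one at a time, each conditioned on the prompt and everything generated so far. A toy sketch with made-up numbers and names, just to show the shape of it:

```python
import random

VOCAB_SIZE = 8192       # size of a hypothetical image-token codebook
TOKENS_PER_IMAGE = 256  # e.g. a 16x16 grid of patch tokens

def toy_next_token(prompt, tokens_so_far):
    # Stand-in for the real model: in practice a transformer scores all
    # codebook entries given the prompt and the tokens generated so far.
    random.seed(hash((prompt, len(tokens_so_far))) % (2**32))
    return random.randrange(VOCAB_SIZE)

def generate_image_tokens(prompt):
    tokens = []
    for _ in range(TOKENS_PER_IMAGE):
        tokens.append(toy_next_token(prompt, tokens))  # built bit by bit
    return tokens  # a real system decodes these tokens into pixels

print(generate_image_tokens("a cat on a skateboard")[:8])
```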

If you think it looks bad, that is just a matter of needing more training data. But the direction of AI image generation has shifted. Every big player is almost certainly developing their own version of this. Diffusion models cannot compete, and they will probably die out.
 

tanstaafl

Well-Known Member
Oct 29, 2018
1,780
2,264
I suppose stable diffusion with all the bells and whistles might produce a cooler looking image currently, but the ability to produce the image that the user asked for is severely lacking.
This is where the strength of using things like ComfyUI comes into play: you can reinforce your prompts with LoRAs and feed additional prompts into refiners, upscalers, etc.
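Roughly, the diffusers equivalent of that kind of chained setup looks like the sketch below. It's not my actual workflow, just an illustration: the LoRA path and file name are placeholders, and the model IDs and exact arguments may differ depending on your diffusers version.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

prompt = "portrait of a knight, dramatic lighting"

# Base model plus a style LoRA to reinforce the prompt.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
base.load_lora_weights("my_loras", weight_name="my_style.safetensors")  # placeholder

# Run the base for the first 80% of denoising and hand the latents off...
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images

# ...to the refiner, which finishes the last 20% with its own prompt pass.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2, vae=base.vae, torch_dtype=torch.float16,
).to("cuda")
image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]
image.save("refined.png")
# An upscaler (latent or ESRGAN pass) would be chained on after this.
```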
Every big player is almost certainly developing their own version of this. Diffusion models cannot compete, and they will probably die out.
I'm still fairly new to the AI imaging thing, having only played around with it for about two months now, so I can't really respond to this without learning more myself. Generally, what I'm seeing in the more fast-paced places is still split about equally between Stable Diffusion (and Comfy) and the alternatives.
 

desmosome

Conversation Conqueror
Sep 5, 2018
6,644
14,996
This is where the strength of using things like ComfyUI comes into play: you can reinforce your prompts with LoRAs and feed additional prompts into refiners, upscalers, etc.

I'm still fairly new to the AI imaging thing, having only played around with it for about two months now, so I can't really respond to this without learning more myself. Generally, what I'm seeing in the more fast-paced places is still split about equally between Stable Diffusion (and Comfy) and the alternatives.
All these secondary tools for diffusion models, like LoRAs, picking the models, using ControlNet, using inpainting, upscaling, using post-processing... ALL of these things are basically trying to remedy the fact that diffusion models are inherently random and rarely spit out exactly what the user asked for. So you need different tools that specifically manipulate some part of the output. That's the reality.
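Just to show how much extra machinery even one of those tools involves, here's a rough ControlNet sketch with diffusers. The model IDs are examples that may have moved, the reference file is a placeholder, and exact arguments vary by version:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract an edge map from a reference image to pin down the composition.
reference = Image.open("pose_reference.png").convert("RGB")  # placeholder file
edges = cv2.Canny(np.array(reference), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# The ControlNet is bolted onto the base model precisely because the prompt
# alone won't reliably give you the layout you asked for.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example model ID
    controlnet=controlnet, torch_dtype=torch.float16,
).to("cuda")

image = pipe("a knight standing in the rain", image=control_image).images[0]
image.save("controlled.png")
```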

GPT is basically an "intelligent" interface that can understand your request in simple prompt form. There is an enormous difference between the two concepts.

Instead of using a LoRA of some character, you just tell it to draw that character. Instead of using art-style LoRAs, you just tell it to use this art style. Instead of using inpainting, you just tell it to keep everything the same but fix the hands. Instead of using upscaling, you just tell it to upscale. You can tell it to zoom out. You can tell it to rotate. It's an interface where you can communicate with the image generator as you would with an LLM. And the image will be built bit by bit, ensuring it adheres to the request.

Granted, the more advanced diffusion models are better able to understand prompts written like sentences, but they are still working with random noise as the starting point.

The massive benefit of a GPT-style multimodal LLM with image generation capabilities is that the AI can keep track of the entire project. You can make iterative refinements. Ask it to create a cat on a skateboard. It does it, then you ask it to change the art style. Then you ask it to add sunglasses. You then upload your company's logo and tell it to put it on the cat's shirt. Then you ask it to write a slogan for your company after explaining what your company does. That whole workflow happens in one single program, and all you are doing is writing prompts in plain sentences.
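As a rough sketch of what that iterative flow looks like outside the chat window, assuming the current shape of OpenAI's Images API (gpt-image-1) rather than ChatGPT itself; file names are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def save_b64(b64_png, path):
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_png))

# Step 1: initial generation from a plain-language prompt.
first = client.images.generate(model="gpt-image-1", prompt="a cat on a skateboard")
save_b64(first.data[0].b64_json, "cat_v1.png")

# Step 2: iterative refinement -- pass the previous image back with a new instruction.
second = client.images.edit(
    model="gpt-image-1",
    image=open("cat_v1.png", "rb"),
    prompt="same cat and composition, but add sunglasses",
)
save_b64(second.data[0].b64_json, "cat_v2.png")
```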

You can upload a picture of your OC and ask the AI to create various other poses and expressions for this character for VN sprite work. Or maybe you ask it to create RPGM pixel sprites in the four cardinal directions.

And the biggest reason for us gooners: the ability to illustrate your chat in real time, always staying consistent with the art style and the character's appearance. THIS is the holy grail that we've been waiting for. Of course, it will never happen with GPT censorship, but this technology will be the future of image generation.
 

tanstaafl

Well-Known Member
Oct 29, 2018
1,780
2,264
All these secondary tools for diffusion models, like LoRAs, picking the models, using ControlNet, using inpainting, upscaling, using post-processing... ALL of these things are basically trying to remedy the fact that diffusion models are inherently random and rarely spit out exactly what the user asked for. So you need different tools that specifically manipulate some part of the output. That's the reality.
And AI is in its infancy on all fronts. Working with LoRAs, ControlNet, inpainting, upscaling, and a ton of other moving bits and parts that I still don't fully understand adds a level of fine-tuning that is simply not available elsewhere.
GPT is basically an "intelligent" interface that can understand your request in simple prompt form. There is an enormous difference between the two concepts.
I'm really failing to see the downside. One line of progress creates an image from white noise and uses a ton of moving parts to refine that image into what you want; the other has finer control at inception but, compared to diffusion, is still in its infancy. I pay for ChatGPT 4o access (the $20 tier, expensed to work, hah) and have played extensively with its image generation. I can tell it to change parts of a picture or to change the mood of a picture and it will try, but very rarely does it do exactly what I want, and when it does try, its accuracy is hit or miss (for instance, if I ask for a warrior with an axe, it gets the axe backwards 50% of the time even after repeated corrections), whereas with LoRAs, ControlNet, and in/outpainting I can get what I want faster.
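For comparison, the inpainting route I'm talking about looks roughly like this in diffusers; the warrior/axe files are placeholders and the model ID is just an example:

```python
import torch
from PIL import Image
from diffusers import AutoPipelineForInpainting

pipe = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

# Only the white region of the mask gets regenerated; the rest of the
# picture is kept as-is, which is how you fix just the axe (or the hands).
warrior = Image.open("warrior.png").convert("RGB")    # placeholder file
axe_mask = Image.open("axe_mask.png").convert("RGB")  # white = repaint here

fixed = pipe(
    prompt="a warrior holding an axe with the blade facing forward",
    image=warrior,
    mask_image=axe_mask,
).images[0]
fixed.save("warrior_fixed.png")
```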

These are two completely separate lines of progress, and they're both moving forward very quickly. The only downside I see is something that you haven't even mentioned, and that's the dependence on the checkpoint (model) and being limited by its progress. That said, I'll still keep using both until I hit the upper wall that is the limit of my video card... and I'm very close to that.
Granted, the more advanced diffusion models are better able to understand prompts written like sentences, but they are still working with random noise as the starting point.
Again, random noise isn't the drawback that you seem to think it is. That noise allows a level of refinement in post-processing tasks that lets you dial in exactly the amount of detail you want.
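That refinement is literally a dial: img2img re-noises the existing picture partway and denoises it again, so the strength setting decides how much gets reinterpreted. A rough sketch, with the model ID as an example and the input file as a placeholder:

```python
import torch
from PIL import Image
from diffusers import AutoPipelineForImage2Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example model ID
    torch_dtype=torch.float16,
).to("cuda")

draft = Image.open("draft.png").convert("RGB")  # placeholder input image

# strength = how far back toward pure noise the image is pushed before
# being denoised again: low values touch up detail, high values repaint.
subtle = pipe("same scene, sharper details", image=draft, strength=0.3).images[0]
heavy = pipe("same scene, oil painting style", image=draft, strength=0.8).images[0]
subtle.save("touched_up.png")
heavy.save("restyled.png")
```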

Edit: I will admit that ChatGPT pictures look much, much better on first click. I've always assumed that whatever base model it has access to eclipses what any checkpoint could possibly have, all other advances aside.
 

VelvetChainsDev

New Member
Game Developer
Aug 19, 2019
8
51
Did you make those videos?

I'm using AI in my game, but the animations aren't nearly as neat as the ones you showcased in the first post.

I'm using WAN2.1, was this a different model?
 

tanstaafl

Well-Known Member
Oct 29, 2018
1,780
2,264
Did you make those videos?

I'm using AI in my game, but the animations aren't nearly as neat as the ones you showcased in the first post.

I'm using WAN2.1, was this a different model?
No, those aren't my videos; my computer isn't anywhere near powerful enough to make animations efficiently, and the other options aren't free. But I'm pretty sure they're all made with Kling AI (not free).



Or, if you use ComfyUI, you can connect to the Kling API and animate within your own setup.
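I don't have the real Kling endpoints handy, so take this as a purely hypothetical sketch of how that kind of image-to-video API hookup generally works; every URL, field, and key name below is a made-up placeholder, not Kling's actual API:

```python
import time
import requests

API_BASE = "https://api.example-video-service.com/v1"  # hypothetical placeholder
API_KEY = "YOUR_KEY_HERE"                               # placeholder credential
headers = {"Authorization": f"Bearer {API_KEY}"}

# Submit a source image plus a motion prompt; these services are usually async.
job = requests.post(
    f"{API_BASE}/image-to-video",  # hypothetical endpoint name
    headers=headers,
    files={"image": open("keyframe.png", "rb")},
    data={"prompt": "camera slowly pans while the character blinks"},
).json()

# Poll until the render finishes, then download the clip.
while True:
    status = requests.get(f"{API_BASE}/jobs/{job['id']}", headers=headers).json()
    if status["state"] == "done":
        video = requests.get(status["video_url"])
        open("clip.mp4", "wb").write(video.content)
        break
    time.sleep(5)
```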
 

desmosome

Conversation Conqueror
Sep 5, 2018
6,644
14,996
Also, I wanted to add that mixing the two will always be a thing. For instance, ComfyUI-KepOpenAI is an attempt to pull GPT into Comfy and all its image-as-source options.
At its core, ComfyUI-KepOpenAI works by connecting to the OpenAI GPT-4V API, which is a sophisticated model capable of understanding and generating human-like text based on both visual and textual inputs. Here's a simple breakdown of how it operates:


  1. Input: You provide an image and a text prompt. The image can be any visual content you are working on, and the text prompt can be a question, a description, or any text that you want the model to consider.
  2. Processing: The extension sends these inputs to the GPT-4V API. The API processes the image and text together, understanding the context and relationships between them.
  3. Output: The API generates a text completion that is contextually relevant to both the image and the text prompt. This output is then displayed in the ComfyUI interface, ready for you to use in your creative projects. Think of it as having a smart assistant that can look at your artwork and provide insightful, relevant text that complements your visual content.
----

So basically, it's just asking GPT to summarize your prompt and image, and the result is then fed into the ComfyUI interface to be used with whatever model you want. That's like... not really doing anything unique. You can write your own prompt rather than hooking into GPT to summarize a picture for you. And the ability of LLMs to analyze an image isn't unique to GPT's new image-creation technology. It existed before, and Gemini can also do it (and probably some others can too).
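For what it's worth, the "look at an image and hand back text" step it relies on is just a standard vision-enabled chat call, roughly like the sketch below (the model name is an example; GPT-4V has since been folded into gpt-4o, and the image file is a placeholder):

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("work_in_progress.png", "rb") as f:  # placeholder image
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # example vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image as a detailed image-generation prompt."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
# The returned text is what the ComfyUI node would then feed into your chosen model.
print(response.choices[0].message.content)
```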

Anyways, we will just have to see where these companies focus their R&D. It will not change overnight. The big companies would already have had their top-of-the-line diffusion models in development when OAI came out with this new modality. I would guess that once more companies get their own models that can do what GPT is doing with image generation, diffusion models will be relegated to niche use with much less money going into them.
 