Just follow this video and install Automatic1111. The various UIs are just different front ends for the same thing, so you don't lose anything by choosing this one over another. In fact, just forget about any other UI for now.
In the video he shows you how to install and use a basic version of Stable Diffusion 1.5.
Stable Diffusion 1.5 and 2.1 are the actual base "models"; then you go to civitai.com and get a trained or merged checkpoint.
That's Stable Diffusion 1.5 further trained for specific genres and niches, and it's what you use as a "generator".
Then you can add other things like LoRAs for a more "controlled" result. Stable Diffusion 2.1 doesn't do NSFW content for now, as far as I know, so it's not interesting to us perverts..
Exactly this - A1111 is really just an automatic way of installing and starting the SD WebUI. Don't worry about the tiny differences; A1111 will probably take over as it's the automated option!
Please note, we're not being dismissive of Easy Diffusion or, for that matter, you. We're not experts in this - very few people are. This technology is just over six months old; the only real experts are those who developed it in the first place! Everyone else is on a steep learning curve. Having an "easy" version now might seem like a great idea, but it's not. It will get left in the dust and you won't know how to use the complex version everyone will be using in six months' time. It's like Photoshop - those who have used it for decades forget it used to be reasonably simple and mostly intuitive. I never used it, and now it's moved on so much it's absolutely impenetrable.
TL;DR - surfing a big wave is easier when it starts as a small one and you're carried along with it as it grows.
Quick glossary:
(N.B. Some of the 'definitions' below may be laughably incorrect to those with deep technical knowledge - those people are not the intended audience. These are broad analogies to allow people to understand the concepts!)
-
Model. The big checkpoint file, 2-8GB in size. You load one of these in to get a 'world' or genre, e.g. different ones for anime, photorealism, oil paintings, landscapes. These have a .ckpt or .safetensors file extension.
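If you're curious what 'loading a model' actually involves, here's a minimal sketch using the diffusers Python library - the WebUI does the equivalent for you behind the scenes when you pick a checkpoint from the dropdown. The checkpoint filename is a placeholder for whatever you downloaded:

```python
# Minimal sketch with the 'diffusers' library - the WebUI does this
# for you when you pick a checkpoint from its dropdown.
from diffusers import StableDiffusionPipeline

# Placeholder path - point it at whatever checkpoint you downloaded
pipe = StableDiffusionPipeline.from_single_file(
    "models/Stable-diffusion/some_checkpoint.safetensors"
).to("cuda")  # move to GPU

image = pipe("an oil painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```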
-
LoRA. Low-Rank Adaptation. 9-400MB in size. An advanced type of embedding that used to require crazy amounts of vRAM on your GPU, but now works OK with as little as 8GB. If you want to generate a very specific type of thing, such as a specific model of car or a celebrity, you might use a LoRA. The advantage is they can be trained on a small number of images (as few as 3!). The disadvantage is that they often take over and don't always play well with other LoRAs. They have .ckpt, .pt or .safetensors file extensions. Make sure you don't put them in the \models\stable-diffusion folder; they go in the models\Lora folder.
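Continuing the sketch above, applying a LoRA in diffusers is a couple of lines (in the WebUI this is the <lora:name:weight> prompt syntax; the folder and filename here are placeholders):

```python
# Sketch: layering a LoRA on top of the 'pipe' from the Model example.
# Placeholder folder/filename - use your actual LoRA file.
pipe.load_lora_weights("models/Lora", weight_name="some_lora.safetensors")

# The 'scale' value plays the same role as the :0.7 weight in the WebUI
image = pipe(
    "photo of a classic sports car",
    cross_attention_kwargs={"scale": 0.7},
).images[0]
```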
-
Textual Inversion (or TI). Tiny files, a few kB. The older, slower way of training an embedding. Requires a lot of images (min 20). Effectively recreates in miniature how SD was trained originally - by showing it pictures of a thing and hitting it repeatedly over the head with lots of maths until it associates a word with the essence of those pictures. These have a .pt or .bin extension. They go in the \embeddings folder.
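As with LoRAs, you can see the moving parts in diffusers if you're curious (the WebUI just needs the file dropped into \embeddings; the filename and trigger word here are placeholders):

```python
# Sketch: loading a Textual Inversion embedding into the same 'pipe'.
# Placeholder file and trigger word - use your embedding's own.
pipe.load_textual_inversion("embeddings/my_concept.pt", token="my-concept")

# The trigger word now stands in for the trained concept
image = pipe("a portrait in the style of my-concept").images[0]
```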
-
Hypernetwork. Another form of training. Conceptually they sit in a similar space to LoRAs and are a similar size, although they tend to work as if they were a small collection of LoRAs. Separate tab in WebUI. Tend to 'play well' with other embeddings. They go in the models\hypernetworks folder.
-
Token. Each word of a prompt is a separate token (or several tokens, if SD doesn't recognise the whole word and has to split it into parts to work it out). Prompts used to be limited to 75 tokens, but that was long ago expanded to effectively unlimited, processed in blocks of 75.
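If you want to see the splitting for yourself, SD 1.5 is built on OpenAI's CLIP tokenizer, which you can poke at directly (a sketch; the exact sub-word splits can be surprising):

```python
# Sketch: inspecting how a prompt gets split into tokens, using the
# CLIP tokenizer that SD 1.5 is built on.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tok.tokenize("photorealistic cyberpunk cityscape"))
# Rare or compound words come back as several sub-word tokens,
# which is why one word can 'cost' more than one of your 75.
```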
-
Vector. Effectively 'concepts' for SD to remember for a given word (i.e. token). When training a TI you will be asked how many 'Vectors per token' to use. If training a very simple concept (metallic paint for instance), 1 vector per token is all you need. If training a TI on a person's face, it's best to have at least 2. I've had best results with 10-12 for a person, especially if training on whole body images where the person has an unusual body e.g. very slim. The usual rule of thumb is at least 5 images per vector, with the usual minimum of 20 images for TIs.
-
txt2img. The classic way to operate SD, by entering text strings as tokens for it to compare against its training as it carves images out of the static. There are plenty of explanations as to how SD works elsewhere on the Internet, I'm sticking with "carves images out of static"!
-
img2img. Takes a starting image and applies SD to it, instead of starting from near-random static. A very good way of getting what you actually want, particularly poses.
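A rough diffusers equivalent, for the curious (paths are placeholders; 'strength' is the key dial - low stays close to the original, high reimagines it):

```python
# Sketch: img2img with diffusers. 'strength' controls how far SD is
# allowed to wander from the starting image (0 = untouched, 1 = ignore it).
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

img_pipe = StableDiffusionImg2ImgPipeline.from_single_file(
    "models/Stable-diffusion/some_checkpoint.safetensors"  # placeholder
).to("cuda")

init = Image.open("my_starting_image.png").convert("RGB")  # placeholder
image = img_pipe("a knight in ornate armour", image=init, strength=0.6).images[0]
```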
-
ControlNet. A development of img2img that allows a user to pose one or more stick figures, then run SD against that. Works very well most of the time and is pretty intuitive.
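Under the hood it's a separate network bolted onto the model. Here's a sketch with diffusers and the commonly used public OpenPose ControlNet (model IDs as published on Hugging Face; the pose image is a placeholder):

```python
# Sketch: ControlNet (OpenPose flavour). The pose image is the
# 'stick figure'; SD fills in everything else around it.
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")
cn_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
).to("cuda")

pose = Image.open("stick_figure.png")  # placeholder pose image
image = cn_pipe("a dancer on stage, studio lighting", image=pose).images[0]
```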
-
Seed. A usually numerical code introduced to the initial random static for two reasons. Firstly, it gives SD some variance to coalesce an image from ("Shapes in the smoke"); secondly, it allows for repeatability, which wouldn't be possible with purely random static. What a seed does not do is provide consistency of face, clothing, accessories, background or anything, really, between prompts or models. The same seed and prompts will usually give recognisably related images across different models - but there will still be very considerable variation (a house in the background in one model becomes a car in another; a bandana in one becomes a hat in another; a pathway in one becomes a fashion show runway, etc.).
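In code terms the seed is just the number used to initialise the random number generator, which is why the repeatability works (a sketch, reusing 'pipe' from the Model example):

```python
# Sketch: a fixed seed makes the 'random' static reproducible.
# Same seed + same prompt + same model + same settings = same image.
import torch

gen = torch.Generator(device="cuda").manual_seed(1234567890)
image = pipe("a cottage in a snowy forest", generator=gen).images[0]
```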
-
Steps. How many times SD runs its prompts against the static. Fewer than 10 is unlikely to give anything worthwhile; fewer than 20 is generally not recommended. Exactly how many is 'best' depends on many, many factors, including the depths of your patience / pockets. Play around with more or fewer steps to see what works best for what you're aiming for, but be prepared to increase or decrease them at any time. Very small numbers can result in poor shaping or complete ignoring of tokens in a prompt, seemingly at random (because it pretty much is!). Very large numbers can result in images that look 'overworked', 'airbrushed' effects and/or SD suddenly deciding after 120 steps that Taylor Swift has nipples on her neck! Try the X/Y/Z plot script to test a sample image with different numbers of steps.
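You can do a crude DIY version of that steps comparison in code by fixing the seed and sweeping the step count (a sketch, reusing 'pipe' from earlier):

```python
# Sketch: same seed, different step counts - a DIY 'Steps' axis
# like the X/Y/Z plot script's.
import torch

for steps in (10, 20, 30, 50):
    gen = torch.Generator(device="cuda").manual_seed(42)  # fixed seed
    img = pipe("portrait photo of an astronaut",
               num_inference_steps=steps, generator=gen).images[0]
    img.save(f"steps_{steps}.png")
```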
-
Sampler. Unless you're a very, very clever mathematician, you don't need (or want) to know what the difference between samplers is in mathematical terms. Just consider them to be words that represent the minutiae of how SD does its thing. Experiment with them and see which ones you prefer for different situations. They can be included in an X/Y/Z plot, so they're worth playing with there.
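For completeness, in diffusers the samplers are called 'schedulers', and swapping one is a one-liner (a sketch, reusing 'pipe'; Euler a is a common WebUI favourite):

```python
# Sketch: switching sampler ('scheduler' in diffusers) on the same pipe.
from diffusers import EulerAncestralDiscreteScheduler

pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
image = pipe("a watercolour seascape").images[0]
```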