ComfyFlow Models
Models supported out of the box by ComfyFlow.
Base models
SD1.5
Stable Diffusion is a latent text-to-image diffusion model. Thanks to a generous compute donation from Stability AI and support from LAION, we were able to train a Latent Diffusion Model on 512x512 images from a subset of the LAION-5B database. Similar to Google's Imagen, this model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10 GB of VRAM. See the model card for details.
- SD1.5 text-to-image: SD1.5 Runway Model
- SD1.5 inpaint model: SD1.5 Inpaint Model
Models:
Model | File Name |
---|---|
sd15 | DreamShaper_v8.safetensors |
sd15 | majicmixRealistic-v7.safetensors |
sd15 | realcartoonPixar_v3.safetensors |
sd15 | sakuramix_v70.safetensors |
sd15 | v1-5.safetensors |
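For orientation, here is a minimal text-to-image sketch that runs one of these SD1.5 checkpoints outside ComfyFlow with the diffusers library; the checkpoint path and prompt are illustrative, and `from_single_file` is assumed to be available in your diffusers version.

```python
# Minimal SD1.5 text-to-image sketch with diffusers.
# The checkpoint path is illustrative; any .safetensors file from the table works.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "models/checkpoints/DreamShaper_v8.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a cozy cabin in a snowy forest, warm light, highly detailed",
    num_inference_steps=25,
    guidance_scale=7.5,
).images[0]
image.save("sd15_result.png")
```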
SDXL
SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.
Models:
Model | File Name |
---|---|
sdxl | Juggernaut-xl_v9.safetensors |
sdxl | juggernaut-xl_v8.safetensors |
sdxl | playground-xl-v2-5.safetensors |
sdxl | sd_xl_base_1-0.safetensors |
sdxl | sd_xl_refiner_1-0.safetensors |
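A minimal sketch of that two-stage base + refiner flow with diffusers, assuming the Hugging Face repo ids for sd_xl_base_1.0 and sd_xl_refiner_1.0; the 0.8 split point is just an example.

```python
# Sketch of the SDXL ensemble-of-experts flow: the base model produces latents,
# the refiner handles the final denoising steps.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # share components to save VRAM
    vae=base.vae,
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

prompt = "a majestic lion jumping from a big stone at night"
# The base covers the first 80% of the schedule and returns noisy latents ...
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images
# ... which the refiner denoises over the remaining 20%.
image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]
image.save("sdxl_result.png")
```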
LoRA
LCM Lora
Latent Consistency Model (LCM) LoRA was proposed in LCM-LoRA: A universal Stable-Diffusion Acceleration Module by Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu et al.
It is a distilled consistency adapter for stable-diffusion-xl-base-1.0 that reduces the number of inference steps to only 2 - 8; matching adapters for SD 1.5 and SSD-1B are listed below.
Model | Params (M) |
---|---|
lcm-lora-sdv1-5 | 67.5 |
lcm-lora-ssd-1b | 105 |
lcm-lora-sdxl | 197 |
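A minimal sketch of the speed-up with diffusers, assuming the lcm-lora-sdv1-5 adapter from the Hugging Face Hub and an SD1.5 base model; paths and prompt are illustrative.

```python
# Sketch of LCM-LoRA acceleration on SD1.5: load the distilled adapter,
# switch to the LCM scheduler, and sample in ~4 steps.
import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

image = pipe(
    prompt="portrait photo of an astronaut, studio lighting",
    num_inference_steps=4,   # 2-8 steps instead of the usual 25-50
    guidance_scale=1.0,      # LCM sampling works best with little or no CFG
).images[0]
image.save("lcm_result.png")
```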
Sliders
To enable precise editing without changing structure, we present Concept Sliders: plug-and-play low-rank adaptors applied on top of pretrained models. By using simple text descriptions or a small set of paired images, we train concept sliders to represent the direction of desired attributes. At generation time, these sliders can be used to control the strength of the concept in the image, enabling nuanced tweaking.
Model | Size |
---|---|
age.pt | 8.7M |
cartoon_style.pt | 8.7M |
chubby.pt | 8.7M |
clay_style.pt | 8.7M |
cluttered_room.pt | 8.7M |
curlyhair.pt | 8.7M |
dark_weather.pt | 8.7M |
eyebrow.pt | 8.7M |
eyesize.pt | 8.7M |
festive.pt | 8.7M |
fix_hands.pt | 8.7M |
long_hair.pt | 8.7M |
muscular.pt | 8.7M |
pixar_style.pt | 8.7M |
professional.pt | 8.7M |
repair_slider.pt | 8.7M |
sculpture_style.pt | 8.7M |
smiling.pt | 8.7M |
stylegan_latent1.pt | 8.7M |
stylegan_latent2.pt | 8.7M |
suprised_look.pt | 8.7M |
tropical_weather.pt | 8.7M |
winter_weather.pt | 8.7M |
ControlNet
ControlNet v1.1
Filename | Size |
---|---|
control_v11e_sd15_ip2p.pth | 1.45 GB |
control_v11e_sd15_shuffle.pth | 1.45 GB |
control_v11f1e_sd15_tile.pth | 1.45 GB |
control_v11f1p_sd15_depth.pth | 1.45 GB |
control_v11p_sd15_canny.pth | 1.45 GB |
control_v11p_sd15_inpaint.pth | 1.45 GB |
control_v11p_sd15_lineart.pth | 1.45 GB |
control_v11p_sd15_mlsd.pth | 1.45 GB |
control_v11p_sd15_normalbae.pth | 1.45 GB |
control_v11p_sd15_openpose.pth | 1.45 GB |
control_v11p_sd15_scribble.pth | 1.45 GB |
control_v11p_sd15_seg.pth | 1.45 GB |
control_v11p_sd15_softedge.pth | 1.45 GB |
control_v11p_sd15s2_lineart_anime.pth | 1.45 GB |
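A minimal sketch of using one of these checkpoints (the canny variant) with diffusers; the Hugging Face repo ids and the input path are illustrative.

```python
# Sketch of ControlNet-guided SD1.5 generation conditioned on a Canny edge map.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Build the conditioning image: a 3-channel Canny edge map of the source photo.
source = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(source, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe(
    prompt="a futuristic city at dusk, sharp architectural lines",
    image=control_image,
    num_inference_steps=25,
).images[0]
image.save("controlnet_canny.png")
```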
Control LoRA
By adding low-rank, parameter-efficient fine-tuning to ControlNet, we introduce Control-LoRAs. This approach offers a more efficient and compact method to bring model control to a wider variety of consumer GPUs.
For each Control-LoRA below (the depth model supports both MiDaS and ClipDrop depth maps), you'll find:
- Rank 256 files (reducing the original 4.7 GB ControlNet models down to ~738 MB Control-LoRA models)
- Experimental Rank 128 files (reducing the model down to ~377 MB)
Filename | Size |
---|---|
control-lora-canny-rank256.safetensors | 774 MB |
control-lora-depth-rank256.safetensors | 774 MB |
control-lora-recolor-rank256.safetensors | 774 MB |
control-lora-sketch-rank256.safetensors | 774 MB |
T2I Adapter-SDXL
Official implementation of T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models based on Stable Diffusion-XL.
Filename | Size |
---|---|
t2i-adapter-lineart-sdxl-1.0 | 316MB |
t2i-adapter-canny-sdxl-1.0 | 316MB |
t2i-adapter-depth-zoe-sdxl-1.0 | 316MB |
t2i-adapter-depth-midas-sdxl-1.0 | 316MB |
t2i-adapter-sketch-sdxl-1.0 | 316MB |
t2i-adapter-openpose-sdxl-1.0 | 316MB |
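A minimal sketch of the canny adapter with diffusers' StableDiffusionXLAdapterPipeline; the TencentARC repo id and the precomputed edge-map path are assumptions for illustration.

```python
# Sketch of SDXL generation steered by a T2I-Adapter (canny variant).
import torch
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

canny_map = load_image("canny_edges.png")  # precomputed edge map (placeholder path)
image = pipe(
    prompt="a glass sculpture of a fox on a studio background",
    image=canny_map,
    adapter_conditioning_scale=0.8,  # how strongly the adapter steers composition
).images[0]
image.save("t2i_adapter_result.png")
```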
ControlNet-LLLite
ControlNet-LLLite is a lightweight version of ControlNet: a "LoRA Like Lite" structure inspired by LoRA. Currently, only SDXL is supported.
Sample weight files are available here: Models
File Name |
---|
kohya_controllllite_xl_depth_anime.safetensors |
kohya_controllllite_xl_blur_anime.safetensors |
kohya_controllllite_xl_openpose_anime.safetensors |
kohya_controllllite_xl_canny.safetensors |
kohya_controllllite_xl_canny_anime.safetensors |
kohya_controllllite_xl_openpose_anime_v2.safetensors |
kohya_controllllite_xl_depth.safetensors |
kohya_controllllite_xl_blur.safetensors |
kohya_controllllite_xl_scribble_anime.safetensors |
Depth Anything
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation by training on a combination of 1.5M labeled images and 62M+ unlabeled images.
Model | Params | Inference time on RTX 4090 (TensorRT), ms |
---|---|---|
Depth-Anything-Small | 24.8M | 3 |
Depth-Anything-Base | 97.5M | 6 |
Depth-Anything-Large | 335.3M | 12 |
Features of Depth Anything
- Relative depth estimation: Our foundation models listed here can robustly provide relative depth estimation for any given image. Please refer here for details.
- Metric depth estimation: We fine-tune our Depth Anything model with metric depth information from NYUv2 or KITTI, giving strong in-domain and zero-shot metric depth estimation. Please refer here for details.
- Better depth-conditioned ControlNet: We re-train a better depth-conditioned ControlNet based on Depth Anything, which offers more precise synthesis than the previous MiDaS-based ControlNet. Please refer here for details. You can also use the new Depth-Anything-based ControlNet in the ControlNet WebUI or ComfyUI's ControlNet.
- Downstream high-level scene understanding: The Depth Anything encoder can be fine-tuned for downstream high-level perception tasks, e.g., semantic segmentation (86.2 mIoU on Cityscapes and 59.4 mIoU on ADE20K). Please refer here for details.
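A minimal sketch of relative depth estimation through the transformers depth-estimation pipeline; the Hugging Face checkpoint id for the Small model and the input path are assumptions.

```python
# Sketch of monocular relative depth estimation with Depth Anything (Small).
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    "depth-estimation", model="LiheYoung/depth-anything-small-hf"  # assumed hub id
)
result = depth_estimator(Image.open("photo.jpg"))
result["depth"].save("depth_map.png")  # PIL image of the predicted relative depth
```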
Segment
Segment Anything(SAM)
The Segment Anything Model (SAM) produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.
Filename |
---|
efficientsam_s.pth |
efficientsam_ti.pth |
mobile_sam.pth |
sam_vit_b_01ec64.pth |
sam_vit_h_4b8939.pth |
sam_vit_l_0b3195.pth |
sam_hq_vit_b.pth |
sam_hq_vit_h.pth |
sam_hq_vit_l.pth |
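A minimal sketch of point-prompted masking with the segment-anything package, using the ViT-H checkpoint from the table; the image path and point coordinates are illustrative.

```python
# Sketch of prompt-based segmentation with SAM (ViT-H checkpoint).
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point prompt (x, y); label 1 marks foreground.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return three candidate masks with quality scores
)
best_mask = masks[np.argmax(scores)]
```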
SAM vs HQ-SAM
The Segment Anything model (SAM) is a foundation vision model for general image segmentation. It segments a wide range of objects, parts, and visual structures in diverse scenarios by taking a prompt consisting of points, a bounding box, or a coarse mask as input. It also works in zero-shot segmentation scenarios, where the model takes an image and predicts masks for objects without the exact class names being specified. The model was introduced by the Facebook Research team and was trained on 11 million images and over 1 billion annotated masks.
Segment Anything in High Quality (HQ-SAM) is an extension of the original Segment Anything model that predicts more accurate object segmentation. It reuses the pre-trained SAM weights while introducing only minimal additional parameters injected into SAM's mask decoder. It was released by the VIS Group at ETH Zürich. The authors composed a dataset of 44K fine-grained masks from several sources and trained the model in about 4 hours on 8 GPUs.
Grounded-Segment-Anything
Grounded-Segment-Anything combines Grounding DINO and Segment Anything to detect and segment anything with text inputs. We will continue to improve it and create more interesting demos on this foundation. An overall technical report about the project is available on arXiv; please check Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks for more details.
🍄 Why Build This Project?
The core idea behind this project is to combine the strengths of different models in order to build a very powerful pipeline for solving complex problems. It's worth mentioning that this is a workflow for combining strong expert models, where all parts can be used separately or in combination, and can be replaced with any similar but different models (e.g., replacing Grounding DINO with GLIP or other detectors, replacing Stable Diffusion with ControlNet or GLIGEN, or combining with ChatGPT).
Model:
File Name |
---|
GroundingDINO_SwinB.cfg.py |
GroundingDINO_SwinT_OGC.cfg.py |
groundingdino_swinb_cogcoor.pth |
groundingdino_swint_ogc.pth |
Grounded-Segment-Anything Project
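A minimal sketch of the idea, following the GroundingDINO and segment-anything inference examples: the detector turns a text prompt into boxes, and SAM turns a box into a mask. The config/weight names mirror the table above; thresholds, paths, and the prompt are illustrative.

```python
# Sketch of Grounded-SAM: text prompt -> Grounding DINO boxes -> SAM masks.
import numpy as np
from groundingdino.util.inference import load_image, load_model, predict
from segment_anything import SamPredictor, sam_model_registry

dino = load_model("GroundingDINO_SwinT_OGC.cfg.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("photo.jpg")  # RGB array + preprocessed tensor

boxes, logits, phrases = predict(
    model=dino,
    image=image,
    caption="dog . chair .",
    box_threshold=0.35,
    text_threshold=0.25,
)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

# Grounding DINO returns normalized cxcywh boxes; convert to pixel xyxy for SAM.
h, w, _ = image_source.shape
cx, cy, bw, bh = (boxes.numpy() * np.array([w, h, w, h])).T
xyxy = np.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], axis=1)
masks, _, _ = predictor.predict(box=xyxy[0], multimask_output=False)
```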
Upscale
Real-ESRGAN
Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration. We extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data.
For General Images
Models | Scale | Description |
---|---|---|
RealESRGAN_x4plus | X4 | X4 model for general images |
RealESRGAN_x2plus | X2 | X2 model for general images |
RealESRNet_x4plus | X4 | X4 model with MSE loss (over-smooth effects) |
official ESRGAN_x4 | X4 | official ESRGAN model |
realesr-general-x4v3 | X4 (can also be used for X1, X2, X3) | A tiny model (consumes much less GPU memory and time); weaker deblurring and denoising capacity |
For Anime Images / Illustrations
Models | Scale | Description |
---|---|---|
RealESRGAN_x4plus_anime_6B | X4 | Optimized for anime images; 6 RRDB blocks (smaller network) |
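A minimal sketch of 4x upscaling with the Real-ESRGAN Python API; the RRDBNet settings follow the official inference script for the RealESRGAN_x4plus checkpoint, and the file paths are illustrative.

```python
# Sketch of 4x upscaling with RealESRGAN_x4plus via the realesrgan package.
import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

# Network definition matching the RealESRGAN_x4plus weights.
model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64,
                num_block=23, num_grow_ch=32, scale=4)
upsampler = RealESRGANer(
    scale=4,
    model_path="RealESRGAN_x4plus.pth",
    model=model,
    tile=0,     # set a tile size (e.g. 512) if VRAM is tight
    half=True,  # fp16 inference
)

img = cv2.imread("low_res.png", cv2.IMREAD_UNCHANGED)
output, _ = upsampler.enhance(img, outscale=4)
cv2.imwrite("upscaled.png", output)
```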
Remove Background
Rembg
Rembg is a tool to remove image backgrounds.
The available models are:
- u2net: A pre-trained model for general use cases.
- u2netp: A lightweight version of u2net model.
- u2net_human_seg: A pre-trained model for human segmentation.
- u2net_cloth_seg: A pre-trained model for clothes parsing from human portraits. Clothes are parsed into 3 categories: upper body, lower body, and full body.
- silueta: Same as u2net, but the size is reduced to 43 MB.
- isnet-general-use: A new pre-trained model for general use cases.
- isnet-anime: A high-accuracy segmentation model for anime characters.
- sam (encoder, decoder): A pre-trained model for any use case.
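A minimal usage sketch with the rembg Python API; any model name from the list above can be passed to new_session, and the file paths are illustrative.

```python
# Sketch of background removal with rembg, selecting a specific model.
from rembg import new_session, remove

session = new_session("isnet-general-use")  # or "u2net", "isnet-anime", ...
with open("input.jpg", "rb") as f_in, open("output.png", "wb") as f_out:
    f_out.write(remove(f_in.read(), session=session))
```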
RMBG
RMBG v1.4 is our state-of-the-art background removal model, designed to effectively separate foreground from background in a range of categories and image types. This model has been trained on a carefully selected dataset, which includes: general stock images, e-commerce, gaming, and advertising content, making it suitable for commercial use cases powering enterprise content creation at scale. The accuracy, efficiency, and versatility currently rival leading source-available models. It is ideal where content safety, legally licensed datasets, and bias mitigation are paramount.
Developed by BRIA AI, RMBG v1.4 is available as a source-available model for non-commercial use.
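The model card describes loading RMBG v1.4 through the transformers image-segmentation pipeline with remote code enabled; the sketch below assumes that interface and uses an illustrative input path.

```python
# Sketch of background removal with BRIA's RMBG v1.4 (assumed pipeline interface).
from transformers import pipeline

rmbg = pipeline("image-segmentation", model="briaai/RMBG-1.4", trust_remote_code=True)
cutout = rmbg("input.jpg")   # assumed to return the foreground cutout as a PIL image
cutout.save("cutout.png")
```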
Others
IP-Adapter
We present IP-Adapter, an effective and lightweight adapter that adds image prompt capability to pre-trained text-to-image diffusion models. An IP-Adapter with only 22M parameters can achieve performance comparable to, or even better than, a fine-tuned image prompt model. IP-Adapter generalizes not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. Moreover, the image prompt works well together with the text prompt to accomplish multimodal image generation.
- IP-Adapter for SD 1.5
- IP-Adapter for SDXL
IP-Adapter models SD 1.5
Filename | Size |
---|---|
ip-adapter-full-face_sd15.safetensors | 43.6 MB |
ip-adapter-plus-face_sd15.safetensors | 98.2 MB |
ip-adapter-plus_sd15.safetensors | 98.2 MB |
ip-adapter_sd15.safetensors | 44.6 MB |
ip-adapter_sd15_light.safetensors | 44.6 MB |
IP-Adapter models SDXL
Filename | Size |
---|---|
ip-adapter-plus-face_sdxl_vit-h.safetensors | 848 MB |
ip-adapter-plus_sdxl_vit-h.safetensors | 848 MB |
ip-adapter_sdxl_vit-h.safetensors | 698 MB |
Switch to CLIP-ViT-H: we trained the new IP-Adapter with OpenCLIP-ViT-H-14 instead of OpenCLIP-ViT-bigG-14. Although ViT-bigG is much larger than ViT-H, our experimental results did not find a significant difference, and the smaller model can reduce the memory usage in the inference phase.
We recommend using CLIP-ViT-H (ipadapter-image-encoder-sd1.5).
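A minimal sketch of image-prompted generation on SD1.5 with diffusers; the h94/IP-Adapter Hugging Face repo id is an assumption, and the weight name mirrors the SD 1.5 table above.

```python
# Sketch of IP-Adapter image prompting on SD1.5 with diffusers.
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.safetensors")
pipe.set_ip_adapter_scale(0.6)  # how strongly the reference image steers the result

reference = load_image("style_reference.png")  # placeholder reference image
image = pipe(
    prompt="a woman in a garden, soft morning light",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
image.save("ip_adapter_result.png")
```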
IP-Adapter-FaceID models
Name | Size |
---|---|
ip-adapter-faceid-plus_sd15.bin | 150M |
ip-adapter-faceid-plusv2_sd15.bin | 150M |
ip-adapter-faceid-plusv2_sdxl.bin | 1.4G |
ip-adapter-faceid-portrait-v11_sd15.bin | 62M |
ip-adapter-faceid-portrait_sd15.bin | 62M |
ip-adapter-faceid_sd15.bin | 93M |
ip-adapter-faceid_sdxl.bin | 1022M |
Model: IP-Adapter-FaceID
InstantID
InstantID is a new state-of-the-art tuning-free method to achieve ID-preserving generation with only a single image, supporting various downstream tasks.
Models:
Filename |
---|
ip-adapter.bin |