ComfyFlow Models
Models supported out of the box by ComfyFlow.
Base models
SD1.5
Stable Diffusion is a latent text-to-image diffusion model. Thanks to a generous compute donation from Stability AI and support from LAION, we were able to train a Latent Diffusion Model on 512x512 images from a subset of the LAION-5B database. Similar to Google's Imagen, this model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10 GB of VRAM. See the model card for details.
- SD1.5 text-to-image: SD1.5 Runway Model
- SD1.5 inpaint model: SD1.5 Inpaint Model
Models:
Model | File Name |
---|---|
sd15 | DreamShaper_v8.safetensors |
sd15 | majicmixRealistic-v7.safetensors |
sd15 | realcartoonPixar_v3.safetensors |
sd15 | sakuramix_v70.safetensors |
sd15 | v1-5.safetensors |
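For orientation, here is a minimal text-to-image sketch that runs one of these SD1.5 checkpoints outside ComfyFlow with the diffusers library; the checkpoint path and prompt are illustrative, and `from_single_file` is assumed to be available in your diffusers version.

```python
# Minimal SD1.5 text-to-image sketch with diffusers.
# The checkpoint path is illustrative; any .safetensors file from the table works.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "models/checkpoints/DreamShaper_v8.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a cozy cabin in a snowy forest, warm light, highly detailed",
    num_inference_steps=25,
    guidance_scale=7.5,
).images[0]
image.save("sd15_result.png")
```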
SDXL
SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.
Models:
Model | File Name |
---|---|
sdxl | Juggernaut-xl_v9.safetensors |
sdxl | juggernaut-xl_v8.safetensors |
sdxl | playground-xl-v2-5.safetensors |
sdxl | sd_xl_base_1-0.safetensors |
sdxl | sd_xl_refiner_1-0.safetensors |
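A minimal sketch of that two-stage base + refiner flow with diffusers, assuming the Hugging Face repo ids for sd_xl_base_1.0 and sd_xl_refiner_1.0; the 0.8 split point is just an example.

```python
# Sketch of the SDXL ensemble-of-experts flow: the base model produces latents,
# the refiner handles the final denoising steps.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # share components to save VRAM
    vae=base.vae,
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

prompt = "a majestic lion jumping from a big stone at night"
# The base covers the first 80% of the schedule and returns noisy latents ...
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images
# ... which the refiner denoises over the remaining 20%.
image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]
image.save("sdxl_result.png")
```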
LoRA
LCM Lora
Latent Consistency Model (LCM) LoRA was proposed in LCM-LoRA: A universal Stable-Diffusion Acceleration Module by Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu et al.
It is a distilled consistency adapter for stable-diffusion-xl-base-1.0 that reduces the number of inference steps to only 2 - 8; matching adapters for SD 1.5 and SSD-1B are listed below.
Model | Params (M) |
---|---|
lcm-lora-sdv1-5 | 67.5 |
lcm-lora-ssd-1b | 105 |
lcm-lora-sdxl | 197 |
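A minimal sketch of the speed-up with diffusers, assuming the lcm-lora-sdv1-5 adapter from the Hugging Face Hub and an SD1.5 base model; paths and prompt are illustrative.

```python
# Sketch of LCM-LoRA acceleration on SD1.5: load the distilled adapter,
# switch to the LCM scheduler, and sample in ~4 steps.
import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

image = pipe(
    prompt="portrait photo of an astronaut, studio lighting",
    num_inference_steps=4,   # 2-8 steps instead of the usual 25-50
    guidance_scale=1.0,      # LCM sampling works best with little or no CFG
).images[0]
image.save("lcm_result.png")
```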
Sliders
To enable precise editing without changing structure, we present Concept Sliders: plug-and-play low-rank adaptors applied on top of pretrained models. By using simple text descriptions or a small set of paired images, we train concept sliders to represent the direction of desired attributes. At generation time, these sliders can be used to control the strength of the concept in the image, enabling nuanced tweaking.
Model | Size |
---|---|
age.pt | 8.7M |
cartoon_style.pt | 8.7M |
chubby.pt | 8.7M |
clay_style.pt | 8.7M |
cluttered_room.pt | 8.7M |
curlyhair.pt | 8.7M |
dark_weather.pt | 8.7M |
eyebrow.pt | 8.7M |
eyesize.pt | 8.7M |
festive.pt | 8.7M |
fix_hands.pt | 8.7M |
long_hair.pt | 8.7M |
muscular.pt | 8.7M |
pixar_style.pt | 8.7M |
professional.pt | 8.7M |
repair_slider.pt | 8.7M |
sculpture_style.pt | 8.7M |
smiling.pt | 8.7M |
stylegan_latent1.pt | 8.7M |
stylegan_latent2.pt | 8.7M |
suprised_look.pt | 8.7M |
tropical_weather.pt | 8.7M |
winter_weather.pt | 8.7M |
ControlNet
ControlNet v1.1
Filename | Size |
---|---|
control_v11e_sd15_ip2p.pth | 1.45 GB |
control_v11e_sd15_shuffle.pth | 1.45 GB |
control_v11f1e_sd15_tile.pth | 1.45 GB |
control_v11f1p_sd15_depth.pth | 1.45 GB |
control_v11p_sd15_canny.pth | 1.45 GB |
control_v11p_sd15_inpaint.pth | 1.45 GB |
control_v11p_sd15_lineart.pth | 1.45 GB |
control_v11p_sd15_mlsd.pth | 1.45 GB |
control_v11p_sd15_normalbae.pth | 1.45 GB |
control_v11p_sd15_openpose.pth | 1.45 GB |
control_v11p_sd15_scribble.pth | 1.45 GB |
control_v11p_sd15_seg.pth | 1.45 GB |
control_v11p_sd15_softedge.pth | 1.45 GB |
control_v11p_sd15s2_lineart_anime.pth | 1.45 GB |
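A minimal sketch of using one of these checkpoints (the canny variant) with diffusers; the Hugging Face repo ids and the input path are illustrative.

```python
# Sketch of ControlNet-guided SD1.5 generation conditioned on a Canny edge map.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Build the conditioning image: a 3-channel Canny edge map of the source photo.
source = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(source, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe(
    prompt="a futuristic city at dusk, sharp architectural lines",
    image=control_image,
    num_inference_steps=25,
).images[0]
image.save("controlnet_canny.png")
```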
Control LoRA
By adding low-rank, parameter-efficient fine-tuning to ControlNet, we introduce Control-LoRAs. This approach offers a more efficient and compact method to bring model control to a wider variety of consumer GPUs.
For each Control-LoRA below (the depth model supports both MiDaS and ClipDrop depth maps), you'll find:
- Rank 256 files (reducing the original 4.7 GB ControlNet models down to ~738 MB Control-LoRA models)
- Experimental Rank 128 files (reducing the model down to ~377 MB)
Filename | Size |
---|---|
control-lora-canny-rank256.safetensors | 774 MB |
control-lora-depth-rank256.safetensors | 774 MB |
control-lora-recolor-rank256.safetensors | 774 MB |
control-lora-sketch-rank256.safetensors | 774 MB |
T2I Adapter-SDXL
Official implementation of T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models based on Stable Diffusion-XL.
Filename | Size |
---|---|
t2i-adapter-lineart-sdxl-1.0 | 316MB |
t2i-adapter-canny-sdxl-1.0 | 316MB |
t2i-adapter-depth-zoe-sdxl-1.0 | 316MB |
t2i-adapter-depth-midas-sdxl-1.0 | 316MB |
t2i-adapter-sketch-sdxl-1.0 | 316MB |
t2i-adapter-openpose-sdxl-1.0 | 316MB |
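A minimal sketch of the canny adapter with diffusers' StableDiffusionXLAdapterPipeline; the TencentARC repo id and the precomputed edge-map path are assumptions for illustration.

```python
# Sketch of SDXL generation steered by a T2I-Adapter (canny variant).
import torch
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

canny_map = load_image("canny_edges.png")  # precomputed edge map (placeholder path)
image = pipe(
    prompt="a glass sculpture of a fox on a studio background",
    image=canny_map,
    adapter_conditioning_scale=0.8,  # how strongly the adapter steers composition
).images[0]
image.save("t2i_adapter_result.png")
```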
ControlNet-LLLite
ControlNet-LLLite is a lightweight version of ControlNet: a "LoRA Like Lite" structure inspired by LoRA. Currently, only SDXL is supported.
Sample weight files are available here: Models
File Name |
---|
kohya_controllllite_xl_depth_anime.safetensors |
kohya_controllllite_xl_blur_anime.safetensors |
kohya_controllllite_xl_openpose_anime.safetensors |
kohya_controllllite_xl_canny.safetensors |
kohya_controllllite_xl_canny_anime.safetensors |
kohya_controllllite_xl_openpose_anime_v2.safetensors |
kohya_controllllite_xl_depth.safetensors |
kohya_controllllite_xl_blur.safetensors |
kohya_controllllite_xl_scribble_anime.safetensors |
Depth Anything
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation by training on a combination of 1.5M labeled images and 62M+ unlabeled images.
Model | Params | Inference time on RTX 4090 (TensorRT), ms |
---|---|---|
Depth-Anything-Small | 24.8M | 3 |
Depth-Anything-Base | 97.5M | 6 |
Depth-Anything-Large | 335.3M | 12 |
Features of Depth Anything
- Relative depth estimation: Our foundation models listed here can robustly provide relative depth estimation for any given image. Please refer here for details.
- Metric depth estimation: We fine-tune our Depth Anything model with metric depth information from NYUv2 or KITTI, giving strong in-domain and zero-shot metric depth estimation. Please refer here for details.
- Better depth-conditioned ControlNet: We re-train a better depth-conditioned ControlNet based on Depth Anything, which offers more precise synthesis than the previous MiDaS-based ControlNet. Please refer here for details. You can also use the new Depth-Anything-based ControlNet in the ControlNet WebUI or ComfyUI's ControlNet.
- Downstream high-level scene understanding: The Depth Anything encoder can be fine-tuned for downstream high-level perception tasks, e.g., semantic segmentation (86.2 mIoU on Cityscapes and 59.4 mIoU on ADE20K). Please refer here for details.
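A minimal sketch of relative depth estimation through the transformers depth-estimation pipeline; the Hugging Face checkpoint id for the Small model and the input path are assumptions.

```python
# Sketch of monocular relative depth estimation with Depth Anything (Small).
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    "depth-estimation", model="LiheYoung/depth-anything-small-hf"  # assumed hub id
)
result = depth_estimator(Image.open("photo.jpg"))
result["depth"].save("depth_map.png")  # PIL image of the predicted relative depth
```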
Segment
Segment Anything(SAM)
The Segment Anything Model (SAM) produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.
Filename |
---|
efficientsam_s.pth |
efficientsam_ti.pth |
mobile_sam.pth |
sam_vit_b_01ec64.pth |
sam_vit_h_4b8939.pth |
sam_vit_l_0b3195.pth |
sam_hq_vit_b.pth |
sam_hq_vit_h.pth |
sam_hq_vit_l.pth |
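A minimal sketch of point-prompted masking with the segment-anything package, using the ViT-H checkpoint from the table; the image path and point coordinates are illustrative.

```python
# Sketch of prompt-based segmentation with SAM (ViT-H checkpoint).
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point prompt (x, y); label 1 marks foreground.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return three candidate masks with quality scores
)
best_mask = masks[np.argmax(scores)]
```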
SAM vs HQ-SAM
The Segment Anything model (SAM) is a foundation vision model for general image segmentation. It segments a wide range of objects, parts, and visual structures in diverse scenarios by taking a prompt consisting of points, a bounding box, or a coarse mask as input. It also works in zero-shot segmentation scenarios, where the model takes an image and predicts masks for objects without the exact class names being specified. The model was introduced by the Facebook Research team and was trained on 11 million images and over 1 billion annotated masks.
Segment Anything in High Quality (HQ-SAM) is an extension of the original Segment Anything model that predicts more accurate object segmentation. It reuses the pre-trained SAM weights while introducing only minimal additional parameters injected into SAM's mask decoder. It was released by the VIS Group at ETH Zürich. The authors composed a dataset of 44K fine-grained masks from several sources and trained the model in about 4 hours on 8 GPUs.
Grounded-Segment-Anything
Grounded-Segment-Anything combines Grounding DINO and Segment Anything to detect and segment anything with text inputs. We will continue to improve it and create more interesting demos on this foundation. An overall technical report about the project is available on arXiv; please check Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks for more details.
🍄 Why Build This Project?
The core idea behind this project is to combine the strengths of different models in order to build a very powerful pipeline for solving complex problems. It's worth mentioning that this is a workflow for combining strong expert models, where all parts can be used separately or in combination, and can be replaced with any similar but different models (e.g., replacing Grounding DINO with GLIP or other detectors, replacing Stable Diffusion with ControlNet or GLIGEN, or combining with ChatGPT).
Model:
File Name |
---|
GroundingDINO_SwinB.cfg.py |
GroundingDINO_SwinT_OGC.cfg.py |
groundingdino_swinb_cogcoor.pth |
groundingdino_swint_ogc.pth |
Grounded-Segment-Anything Project
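A minimal sketch of the idea, following the GroundingDINO and segment-anything inference examples: the detector turns a text prompt into boxes, and SAM turns a box into a mask. The config/weight names mirror the table above; thresholds, paths, and the prompt are illustrative.

```python
# Sketch of Grounded-SAM: text prompt -> Grounding DINO boxes -> SAM masks.
import numpy as np
from groundingdino.util.inference import load_image, load_model, predict
from segment_anything import SamPredictor, sam_model_registry

dino = load_model("GroundingDINO_SwinT_OGC.cfg.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("photo.jpg")  # RGB array + preprocessed tensor

boxes, logits, phrases = predict(
    model=dino,
    image=image,
    caption="dog . chair .",
    box_threshold=0.35,
    text_threshold=0.25,
)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

# Grounding DINO returns normalized cxcywh boxes; convert to pixel xyxy for SAM.
h, w, _ = image_source.shape
cx, cy, bw, bh = (boxes.numpy() * np.array([w, h, w, h])).T
xyxy = np.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], axis=1)
masks, _, _ = predictor.predict(box=xyxy[0], multimask_output=False)
```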
Upscale
Real-ESRGAN
Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration. We extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data.
For General Images
Models | Scale | Description |
---|---|---|
RealESRGAN_x4plus | X4 | X4 model for general images |
RealESRGAN_x2plus | X2 | X2 model for general images |
RealESRNet_x4plus | X4 | X4 model with MSE loss (over-smooth effects) |
official ESRGAN_x4 | X4 | official ESRGAN model |
realesr-general-x4v3 | X4 (can also be used for X1, X2, X3) | A tiny model (consumes much less GPU memory and time); weaker deblurring and denoising capacity |
For Anime Images / Illustrations
Models | Scale | Description |
---|---|---|
RealESRGAN_x4plus_anime_6B | X4 | Optimized for anime images; 6 RRDB blocks (smaller network) |
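A minimal sketch of 4x upscaling with the Real-ESRGAN Python API; the RRDBNet settings follow the official inference script for the RealESRGAN_x4plus checkpoint, and the file paths are illustrative.

```python
# Sketch of 4x upscaling with RealESRGAN_x4plus via the realesrgan package.
import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

# Network definition matching the RealESRGAN_x4plus weights.
model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64,
                num_block=23, num_grow_ch=32, scale=4)
upsampler = RealESRGANer(
    scale=4,
    model_path="RealESRGAN_x4plus.pth",
    model=model,
    tile=0,     # set a tile size (e.g. 512) if VRAM is tight
    half=True,  # fp16 inference
)

img = cv2.imread("low_res.png", cv2.IMREAD_UNCHANGED)
output, _ = upsampler.enhance(img, outscale=4)
cv2.imwrite("upscaled.png", output)
```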
Remove Background
Rembg
Rembg is a tool to remove image backgrounds.
The available models are:
- u2net: A pre-trained model for general use cases.
- u2netp: A lightweight version of u2net model.
- u2net_human_seg: A pre-trained model for human segmentation.
- u2net_cloth_seg: A pre-trained model for clothes parsing from human portraits. Clothes are parsed into 3 categories: upper body, lower body, and full body.
- silueta: Same as u2net, but the size is reduced to 43 MB.
- isnet-general-use: A new pre-trained model for general use cases.
- isnet-anime: A high-accuracy segmentation model for anime characters.
- sam (encoder, decoder): A pre-trained model for any use case.
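A minimal usage sketch with the rembg Python API; any model name from the list above can be passed to new_session, and the file paths are illustrative.

```python
# Sketch of background removal with rembg, selecting a specific model.
from rembg import new_session, remove

session = new_session("isnet-general-use")  # or "u2net", "isnet-anime", ...
with open("input.jpg", "rb") as f_in, open("output.png", "wb") as f_out:
    f_out.write(remove(f_in.read(), session=session))
```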
RMBG
RMBG v1.4 is our state-of-the-art background removal model, designed to effectively separate foreground from background in a range of categories and image types. This model has been trained on a carefully selected dataset, which includes: general stock images, e-commerce, gaming, and advertising content, making it suitable for commercial use cases powering enterprise content creation at scale. The accuracy, efficiency, and versatility currently rival leading source-available models. It is ideal where content safety, legally licensed datasets, and bias mitigation are paramount.
Developed by BRIA AI, RMBG v1.4 is available as a source-available model for non-commercial use.
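The model card describes loading RMBG v1.4 through the transformers image-segmentation pipeline with remote code enabled; the sketch below assumes that interface and uses an illustrative input path.

```python
# Sketch of background removal with BRIA's RMBG v1.4 (assumed pipeline interface).
from transformers import pipeline

rmbg = pipeline("image-segmentation", model="briaai/RMBG-1.4", trust_remote_code=True)
cutout = rmbg("input.jpg")   # assumed to return the foreground cutout as a PIL image
cutout.save("cutout.png")
```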
Others
IP-Adapter
We present IP-Adapter, an effective and lightweight adapter that adds image prompt capability to pre-trained text-to-image diffusion models. An IP-Adapter with only 22M parameters can achieve performance comparable to, or even better than, a fine-tuned image prompt model. IP-Adapter generalizes not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. Moreover, the image prompt works well together with the text prompt to accomplish multimodal image generation.
- IP-Adapter for SD 1.5
- IP-Adapter for SDXL
IP-Adapter models SD 1.5
Filename | Size |
---|---|
ip-adapter-full-face_sd15.safetensors | 43.6 MB |
ip-adapter-plus-face_sd15.safetensors | 98.2 MB |
ip-adapter-plus_sd15.safetensors | 98.2 MB |
ip-adapter_sd15.safetensors | 44.6 MB |
ip-adapter_sd15_light.safetensors | 44.6 MB |
IP-Adapter models SDXL
Filename | Size |
---|---|
ip-adapter-plus-face_sdxl_vit-h.safetensors | 848 MB |
ip-adapter-plus_sdxl_vit-h.safetensors | 848 MB |
ip-adapter_sdxl_vit-h.safetensors | 698 MB |
Switch to CLIP-ViT-H: we trained the new IP-Adapter with OpenCLIP-ViT-H-14 instead of OpenCLIP-ViT-bigG-14. Although ViT-bigG is much larger than ViT-H, our experimental results did not find a significant difference, and the smaller model can reduce the memory usage in the inference phase.
We recommend using CLIP-ViT-H (ipadapter-image-encoder-sd1.5).
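A minimal sketch of image-prompted generation on SD1.5 with diffusers; the h94/IP-Adapter Hugging Face repo id is an assumption, and the weight name mirrors the SD 1.5 table above.

```python
# Sketch of IP-Adapter image prompting on SD1.5 with diffusers.
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.safetensors")
pipe.set_ip_adapter_scale(0.6)  # how strongly the reference image steers the result

reference = load_image("style_reference.png")  # placeholder reference image
image = pipe(
    prompt="a woman in a garden, soft morning light",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
image.save("ip_adapter_result.png")
```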
IP-Adapter-FaceID models
Name | Size |
---|---|
ip-adapter-faceid-plus_sd15.bin | 150M |
ip-adapter-faceid-plusv2_sd15.bin | 150M |
ip-adapter-faceid-plusv2_sdxl.bin | 1.4G |
ip-adapter-faceid-portrait-v11_sd15.bin | 62M |
ip-adapter-faceid-portrait_sd15.bin | 62M |
ip-adapter-faceid_sd15.bin | 93M |
ip-adapter-faceid_sdxl.bin | 1022M |
Model: IP-Adapter-FaceID
InstantID
InstantID is a new state-of-the-art tuning-free method to achieve ID-preserving generation with only a single image, supporting various downstream tasks.
Models:
Filename |
---|
ip-adapter.bin |