JTP-3 Hydra

e621 Image Classifier by Project RedRocket

JTP-3 Hydra is a finetune of the SigLIP2 image classifier with a custom classifier head, designed to predict 7,504 popular tags from e621.

A public demo of the model is available here: https://huggingface.co/spaces/RedRocket/JTP-3-Demo

Jump to section:
Downloading
Easy Windows Installation and Usage
Advanced Windows Installation and Usage
Linux Installation and Usage
Using inference.py/inference.bat
Usage Notes
Calibration
Using Extensions
Training Extensions
Technical Notes
Credits / Citations

Downloading

If you have Git+LFS installed, download the repository using git clone https://huggingface.co/RedRocket/JTP-3.
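
For example, from a command prompt or terminal (git lfs install only needs to be run once per machine):

git lfs install
git clone https://huggingface.co/RedRocket/JTP-3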

If you are unable to do this, manually download all the .py files, requirements.txt, models/jtp-3-hydra.safetensors, and data/jtp-3-hydra-tags.csv.
If you are on Windows, also download the .bat files and follow the instructions below for easy installation.
If you want to run calibration, you also need data/jtp-3-hydra-val.csv.

Easy Windows Installation and Usage

For Windows, ensure you have at least Python 3.11 installed and available on your path. If you are unsure about your version of Python, you can run easy.bat and it will let you know.

Double-click easy.bat to run easy mode, which walks you through all the commands. When easy mode asks you for a file or folder, you can drag and drop it onto the easy mode window and press enter, copy and paste the path, or type it yourself.

Advanced Windows Installation and Usage

Double-click install.bat to run the installation, which creates a virtual environment and installs all the requirements into it. You can check your version of Python by opening a command prompt and typing python -V.

You can run the WebUI by double-clicking app.bat and navigating your browser to the URL it shows. The link is not shared publicly.

On the command line, you can use inference.bat to do bulk operations such as tagging entire directories. Run inference.bat --help for help using the command line. If you provide a path to a file or directory, it will write .txt caption files beside each image using the default threshold of 0.5.
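
For example, the following command (the folder path is illustrative) recursively tags every image under a directory:

inference.bat -r C:\datasets\my_images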

Instead of using a fixed threshold, you can run the calibration wizard with calibrate.bat.

Linux Installation and Usage

If your OS Python install is not 3.11 or above, install a more recent version of Python according to your distribution's instructions and use that python to create the venv. You can check your version of python with python -V.

# Install
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run the WebUI
source venv/bin/activate
python app.py

# Run command-line inference
source venv/bin/activate
python inference.py --help

Using inference.py (or inference.bat)

If you do not have a calibration file, the default threshold of 0.5 is conservative. If you plan on manually reviewing the tags, consider using -t 0.2 or -t 0.1.
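
For example, the following (paths and filename are illustrative) tags a folder at a lower threshold and collects the results into a single CSV for review instead of writing .txt files:

python inference.py -t 0.2 -o review.csv ~/images_to_review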

$ python inference.py --help
usage: inference.py [-h] [-t THRESHOLD_OR_PATH] [-i MODE] [-x CATEGORY] [-p PREFIX] [-o PATH] [-O] [-M PATH] [-m PATH] [-e PATH] [-E] [-b BATCH_SIZE] [-w N_WORKERS] [--no-shm] [-S SEQLEN] [-d TORCH_DEVICE] [-r] [PATH ...]

positional arguments:
  PATH                  Paths to files and directories to classify. If none are specified, run interactively.

options:
  -h, --help            show this help message and exit
  -r, --recursive       Classify directories recursively. Dotfiles will be ignored.

classification:
  -t, --threshold THRESHOLD_OR_PATH
                        Classification threshold -1.0 to 1.0. Or, a path to a CSV calibration file. (Default: calibration.csv)
  -i, --implications MODE
                        Automatically apply implications. Requires tag metadata. (Default: inherit)
  -x, --exclude CATEGORY
                        Exclude the specified category of tags. May be specified multiple times. Requires tag metadata.

output:
  -p, --prefix PREFIX   Prefix all .txt caption files with the specified text. If the prefix matches a tag, the tag will not be repeated.
  -o, --output PATH     Path for CSV output, or '-' for standard output. If not specified, individual .txt caption files are written.
  -O, --original-tags   Do not rewrite tags for compatibility with diffusion models.

model:
  -M, --model PATH      Path to model file.
  -m, --metadata PATH   Path to CSV file with additional tag metadata. (Default: data/jtp-3-hydra-tags.csv)
  -e, --extension PATH  Path to extension. May be specified multiple times. If a directory is specified, all extensions in the specified directory are loaded. (Default: extensions/jtp-3-hydra)
  -E, --no-default-extensions
                        Do not load extensions by default.

execution:
  -b, --batch BATCH_SIZE
                        Batch size.
  -w, --workers N_WORKERS
                        Number of dataloader workers. (Default: number of cores)
  --no-shm              Disable shared memory between workers.
  -S, --seqlen SEQLEN   NaFlex sequence length. (Default: 1024)
  -d, --device TORCH_DEVICE
                        Torch device. (Default: cuda)

MODE:
  inherit           Tags inherit the highest probability of the more specific tags that imply them.
  constrain         Tags are constrained to the lowest probability of the more general tags they imply.
  remove            Exclude implied tags from output.
  constrain-remove  Combination of constrain followed by remove.
  off               No implications are applied.

CATEGORY:
  general artist copyright character species meta lore

Try to avoid running multiple copies of inference.py at once, as each copy will load the entire model. If you are tagging only a few images, run with -w 0 to use in-process dataloading.
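
For example, for just a couple of files (filenames illustrative):

python inference.py -w 0 image1.png image2.png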

Interactive Mode

If you do not provide a list of files or directories to classify, inference.py will launch in an interactive mode where you can provide files one-at-a-time.

$ python inference.py
JTP-3 Hydra Interactive Classifier
  Type 'q' to quit, or 'h' for help.
  For bulk operations, quit and run again with a path, or '-h' for help.

> h
Provide a file path to classify, or one of the following commands:
  threshold NUM      (-1.0 to 1.0, 0.2 to 0.8 recommended)
  calibration [PATH] (load calibration csv file)
  exclude CATEGORY   (general copyright character species meta lore)
  include CATEGORY   (general copyright character species meta lore)
  implications MODE  (inherit constrain remove constrain-remove off)
  seqlen LEN         (64 to 2048, 1024 recommended)
  quit               (or 'q', 'exit')

Usage Notes

The model predicts 7,501 e621 tags, as well as the added rating meta-tags safe, questionable, and explicit.

The model is trained with implications, but its raw predictions are not constrained. If you use the inference script, it will leverage the tag metadata, if available, to automatically apply implications unless you specify otherwise with -i off. For example, with implications off, the model might say tyrannosaurus rex is more likely than dinosaur. In the default inherit mode, it will instead say that dinosaur is as likely as tyrannosaurus rex. In the constrain mode, it will say that tyrannosaurus rex is as likely as dinosaur.
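
As a concrete illustration of the two modes, here is a minimal Python sketch with made-up probabilities for one implication pair (the actual logic lives in the inference script):

# Illustrative only: invented probabilities, where "tyrannosaurus rex" implies "dinosaur".
raw = {"tyrannosaurus rex": 0.90, "dinosaur": 0.40}

# inherit: the implied (more general) tag takes the highest probability of the tags that imply it.
inherit = dict(raw)
inherit["dinosaur"] = max(raw["dinosaur"], raw["tyrannosaurus rex"])  # 0.90

# constrain: the implying (more specific) tag is capped at the lowest probability of the tags it implies.
constrain = dict(raw)
constrain["tyrannosaurus rex"] = min(raw["tyrannosaurus rex"], raw["dinosaur"])  # 0.40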

The model is trained only on images from e621, not on photographs of people or real animals. While it has retained some ability to classify photos, this is not in any way supported.

The interactive interfaces use a threshold convention of -100% to 100%. This is different from other classifier models that generally range from 0% to 100%.

The model sees all transparency as a black background.

Using calibrate.bat (or Easy Mode calibration)

You can just press ENTER at each prompt to accept the defaults until it asks you for a list of tags to exclude. If you don't want to exclude any tags, press ENTER again and answer y to generate the default calibration.

Members of the Furry Diffusion Community may have created their own calibration files for you to try out, too. Be cautious if anyone offers you a custom calibration file that ends in .py and tells you to run it. However, .csv calibration files are always safe.
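
A calibration file can then be passed to inference with -t (which defaults to calibration.csv), for example (paths illustrative):

python inference.py -t my_calibration.csv /path/to/images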

Using Extensions

JTP-3 Hydra supports adding and replacing tags with extensions, which are simple .safetensors files similar in spirit to LORAs. By default, .safetensors files placed in extensions/jtp-3-hydra will be loaded as extensions.

Members of the Furry Diffusion Community may have created their own extension files for you to try out, too. JTP-3 Hydra extension files are always safe.

If you are using calibrations, be sure to re-calibrate after adding an extension.
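
Extensions in the default extensions/jtp-3-hydra directory are loaded automatically. To load an extension from somewhere else, or to skip default extensions entirely, use the documented -e and -E options, for example (paths illustrative):

python inference.py -e extensions/my_tag.safetensors /path/to/images
python inference.py -E /path/to/images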

Training Extensions

To train extensions, you will need some basic familiarity with the command line. Wrapper .bat files that load the virtual environment are intentionally not provided.

You will need around 1.5 GB of free VRAM to train, or more if you use higher batch sizes. Training is generally very quick. You can expect a run to complete in under 10 minutes unless your dataset is many thousands of images.

If you are training on Windows, you will need triton-windows. Run pip install triton-windows with the virtual environment active. CPU training is not supported due to the dependency on Triton. If you really want to train on a platform not supported by Triton, manually replace the optimizer, perhaps with AdamW in float32.

Step 1 – Dataset

To train new extensions for JTP-3, create the following directories inside the train directory (or elsewhere):

tag_name/
  positive/
  negative/
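
A minimal command-line sketch of creating that layout on Linux or macOS (the tag name here is just an example, matching the training example later in this document; on Windows you can simply create the folders in Explorer):

mkdir -p train/example_tag/positive train/example_tag/negative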

Add at least 100 example images having the tag to the positive directory. For best results, you must manually review every image to ensure it has the tag you are trying to train. Try to use a diverse set of images having the tag. Don't just use your favorite images, especially if they are from a single artist.

Add a similar number of images not having the tag to the negative directory. For best results, you must manually review every image to ensure it does not have the tag you are trying to train.

Dataset Tips

At least half of the negative dataset should be random images not having the tag. For better results, the other half should be hand-selected images that contain concepts that might be easily confused with your tag. It's fine to filter out images with unsavory content that you absolutely don't want to see while reviewing.

So, for example, let's say you're training dragon_on_top_gryphon_on_bottom with 200 positive examples. Your negative set might look like:

  • 100 random images that don't have the tag
  • 30 images of gryphons and dragons in other scenarios
  • 20 images of a dragon on top with another species
  • 20 images of a gryphon on bottom with another species
  • 30 images of a gryphon on top with a dragon on bottom

Step 2 – Training

Run python train_extension.py --help to familiarize yourself with the options provided by the training script.

At a minimum, you will want to:

  • Adjust the batch sizes (-b/-B) and/or gradient accumulation (-a) to match your available VRAM.
  • Adjust the size of your validation set. Try to target 5%-10% of your available data, but never less than -v 20. (Note that the -v option reserves an equal number of positive and negative examples. The default -v 20 reserves 40 total examples, 20 positive and 20 negative.)
  • Set the checkpoint interval or maximum number of epochs (-c/-e).

Please resist the urge to tweak hyperparameters until you have first succeeded with the defaults.

Training begins by building a feature cache for the dataset. This should only take a few minutes, but be aware that the feature cache for each dataset item consumes about 2.3 MB of disk space.

Here's an example training run. This took about 2 minutes on an RTX 3090 with 200 total examples. (Yes, it's that fast.)

$ python train_extension.py -c 0 example_tag
Loading 'models/jtp-3-hydra.safetensors' ... 7504 tags
caching: 100%|█████████| 200/200 [00:19<00:00, 11.02it/s]
...
EPOCH 1 VALIDATION: loss=0.6758, cti=0.5556, thr=0.4501
EPOCH 2 VALIDATION: loss=0.6633, cti=0.5556, thr=0.4501
EPOCH 3 VALIDATION: loss=0.6320, cti=0.5882, thr=0.4800
EPOCH 4 VALIDATION: loss=0.5922, cti=0.6923, thr=0.5499
...
EPOCH 65 VALIDATION: loss=0.0106, cti=1.0000, thr=0.0804
EPOCH 66 VALIDATION: loss=0.0105, cti=1.0000, thr=0.0804
EPOCH 67 VALIDATION: loss=0.0112, cti=1.0000, thr=0.0901
EPOCH 68 VALIDATION: loss=0.0115, cti=1.0000, thr=0.0995
EPOCH 69 VALIDATION: loss=0.0116, cti=1.0000, thr=0.1097
EPOCH 70 VALIDATION: loss=0.0113, cti=1.0000, thr=0.0995
...

In this case, selecting epoch 66 seems reasonable; its checkpoint would have been saved as train/example_tag/checkpoints/<timestamp>_e66.pt.

Step 3 – Build Extension

Run python build_extension.py --help to familiarize yourself with the options provided by the extension builder. The extension builder converts PyTorch training checkpoints into inference-ready safetensors files with additional metadata, some of which is essential.

Continuing with the example above:

$ python build_extension.py -a "Project RedRocket" train/example_tag/checkpoints/<timestamp>_e66.pt example_tag general
Loading checkpoint 'train/example_tag/checkpoints/<timestamp>_e66.pt'...
Preparing metadata...
  modelspec.sai_model_spec: '1.0.0'
  modelspec.architecture: 'naflexvit_so400m_patch16_siglip+rr_hydra'
  modelspec.implementation: 'redrocket.extension.label.v1'
  modelspec.description: 'This is an extension for the RedRocket JTP-3 Hydra image classifier. You can find usage instructions at https://huggingface.co/RedRocket/JTP-3.'
  modelspec.date: '<timestamp>'
  modelspec.tags: 'Image Classification'
  classifier.label: 'example_tag'
  classifier.label.category: 'general'
  modelspec.title: 'JTP-3 Hydra Extension: example_tag'
  modelspec.author: 'Project RedRocket'
  modelspec.license: 'MIT'
  modelspec.language: 'en/US'
Building extension...
  Apply optimizer state: attn_pool.q
  Apply optimizer state: attn_pool.out_proj.weight
  Normalize: attn_pool.q
Saving extension 'extensions/jtp-3-hydra/example_tag.safetensors'...

Safetensors Metadata Editor

A simple metadata editor for .safetensors files is included as edit_metadata.py. You can use this to view and edit already-built extensions, perhaps to change the tag name or add implications.

Technical Notes

The model consists of SigLIP2 So400m Patch16 NaFlex followed by a custom cross-attention transformer block with learned per-tag queries, SwiGLU feedforward, and per-tag SwiGLU output heads. The per-tag cross-attention mechanism is the origin of the moniker "hydra".
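
The following is a rough PyTorch sketch of that general shape (learned per-tag queries cross-attending to the encoder's image tokens, a SwiGLU feedforward, and per-tag output heads producing one logit each). Every dimension, initialization, and implementation detail below is an illustrative assumption for exposition, not the actual JTP-3 code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feedforward block (Shazeer, "GLU Variants Improve Transformer")."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class HydraHeadSketch(nn.Module):
    """Illustrative only: per-tag queries cross-attend to the vision tokens,
    then each tag gets its own small gated output head producing one logit."""
    def __init__(self, n_tags, dim, n_heads=8, head_hidden=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tags, dim) * 0.02)  # learned per-tag queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = SwiGLU(dim, dim * 4)
        # Per-tag SwiGLU output heads, expressed as grouped linear maps.
        self.head_gate = nn.Parameter(torch.randn(n_tags, dim, head_hidden) * 0.02)
        self.head_up = nn.Parameter(torch.randn(n_tags, dim, head_hidden) * 0.02)
        self.head_out = nn.Parameter(torch.randn(n_tags, head_hidden) * 0.02)

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim) image features from the vision encoder
        b = tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)      # (batch, n_tags, dim)
        kv = self.norm_kv(tokens)
        attn_out, _ = self.attn(self.norm_q(q), kv, kv)      # per-tag cross-attention
        x = q + attn_out
        x = x + self.ffn(self.norm_ffn(x))
        # Each tag applies its own tiny gated projection to produce one logit.
        gate = torch.einsum("btd,tdh->bth", x, self.head_gate)
        up = torch.einsum("btd,tdh->bth", x, self.head_up)
        logits = torch.einsum("bth,th->bt", F.silu(gate) * up, self.head_out)
        return logits                                        # (batch, n_tags)

Calling such a head with the encoder's token width and 7,504 tags maps a (batch, seq, dim) token tensor to a (batch, 7504) tensor of logits, one per tag.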

Subject to the preprocessing mentioned below, the initial set of training tags was all general tags with at least 1,200 examples, all species and character tags with at least 500 examples, a semi-automated selection of copyright and meta tags, and a handful of manually-selected lore tags which are sometimes discernible from the image. This resulted in 8,067 tags. After training, tags with very poor validation performance were pruned, resulting in the final set of 7,504 tags.

Extensive semi-manual dataset curation was used to improve the quality of the training data. The dataset preprocessing code consists of over 12,000 lines of code and data files. In addition to correcting implications, manually-defined rules are used to detect common scenarios of missing, incomplete, or contradictory tagging and to selectively mask individual tags on a per-dataset-item basis. This is responsible for JTP-3's excellent performance in detecting colors and "combo tags" such as male_feral.

Margin-focal cross entropy loss based on ASL was used to mitigate the effects of inconsistent labeling on e621 and the extreme class imbalance. The dataset was sampled in mini-epochs according to a self-entropy metric. Loss weight for negative labels was logarithmically redistributed from images with few tags to those with many tags.
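
For reference, the standard per-tag ASL objective from Ben-Baruch et al. (cited below), which the margin-focal loss builds on, is, in LaTeX notation (the exact margin-focal modification used for JTP-3 is not reproduced here):

\mathcal{L}_{\mathrm{ASL}} = -\,y\,(1-p)^{\gamma_+}\log p \;-\; (1-y)\,p_m^{\gamma_-}\log(1-p_m), \qquad p_m = \max(p - m,\, 0)

where p is the predicted probability for a tag, y is its 0/1 label, gamma_+ and gamma_- are the positive and negative focusing parameters, and m is the probability margin.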

Raw validation performance metrics and tag lists are available in the data folder. These can be used to create P/R curves, compute CTI or F1 scores, or select automated thresholds for each tag. The list of supported tags is also embedded in the safetensors metadata as classifier.labels.

Internally, the model operates on logits as normal and classification thresholds are expressed in the interval from 0.0 to 1.0. This is reflected in the data files and csv output of inference.py.

Credits

RedHotTensors – Architecture design, dataset curation, infrastructure and training, testing, and release.
DrHead – WebUI, multi-layer CAM, testing, and additional code.
Thessalo – Advice and testing.
Furry Diffusion Community – Feedback and compatibility fixes.
Google Gemini – Hero image.

Citations

Michael Tschannen, et al. SigLIP 2.
Emanuel Ben-Baruch, et al. Asymmetric Loss For Multi-Label Classification.
Noam Shazeer. GLU Variants Improve Transformer.
Pedram Zamirai, et al. Revisiting BFloat16 Training.
