Architecture

Overview

modern-yolonas follows the standard YOLO-NAS architecture:

Input (3, 640, 640)
Backbone (YoloNASBackbone)
  ├── Stem: Conv 3→48
  ├── Stage1: QARepVGG ↓2 → CSP → 96ch
  ├── Stage2: QARepVGG ↓2 → CSP → 192ch
  ├── Stage3: QARepVGG ↓2 → CSP → 384ch
  ├── Stage4: QARepVGG ↓2 → CSP → 768ch
  └── SPP: Spatial Pyramid Pooling → 768ch
  │ outputs: [c2(96), c3(192), c4(384), c5(768)]
Neck (YoloNASPANNeckWithC2)
  ├── neck1 (up): [c5, c4, c3] → upsample + concat + CSP
  ├── neck2 (up): [n1, c3, c2] → upsample + concat + CSP → p3
  ├── neck3 (down): [p3, n2_inter] → downsample + concat + CSP → p4
  └── neck4 (down): [p4, n1_inter] → downsample + concat + CSP → p5
  │ outputs: [p3, p4, p5]
Heads (NDFLHeads)
  ├── head1: p3 → cls(80) + reg(4×17) @ stride 8
  ├── head2: p4 → cls(80) + reg(4×17) @ stride 16
  └── head3: p5 → cls(80) + reg(4×17) @ stride 32
Output: [B, 8400, 4] bboxes + [B, 8400, 80] scores
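
The 8400 in the output shape follows from the three head strides applied to a 640×640 input. A minimal sketch of that arithmetic, assuming one prediction per grid cell per level (the helper name `num_predictions` is hypothetical, not part of the library):

```python
def num_predictions(input_size=640, strides=(8, 16, 32)):
    """Count grid cells across all head levels for a square input."""
    return sum((input_size // s) ** 2 for s in strides)


# Cells per level: 80x80, 40x40, 20x20
per_level = [(640 // s) ** 2 for s in (8, 16, 32)]
print(per_level)          # [6400, 1600, 400]
print(num_predictions())  # 8400
```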

Key components

QARepVGGBlock

The fundamental building block. During training it keeps three parallel branches (3x3 conv+BN, 1x1 conv, identity) whose outputs are summed; for inference they are fused into a single 3x3 convolution. The design is meant to keep the fused weights quantization-friendly while keeping inference fast.
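
The branch fusion can be illustrated with a deliberately simplified sketch: one channel, no bias, and no BatchNorm (at inference BN is a per-channel affine transform that would be folded into the kernel and bias in the same way). The helpers `conv2d_same` and `fuse_branches` are hypothetical names for illustration, not the library's API. The point is only that a 1x1 kernel and the identity both land on the centre tap of the 3x3 kernel.

```python
def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding, odd square kernel."""
    h, w, r = len(x), len(x[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += x[ii][jj] * k[di + r][dj + r]
            out[i][j] = acc
    return out


def fuse_branches(k3, k1):
    """Fold a scalar 1x1 weight and the identity into the centre tap of a 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0  # 1x1 weight plus identity (weight 1)
    return fused


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.2, 0.3], [0.0, -0.5, 0.4], [0.25, 0.0, -0.1]]
k1 = 0.5

# Training-time form: three parallel branches, summed.
b3 = conv2d_same(x, k3)
multi = [[b3[i][j] + k1 * x[i][j] + x[i][j] for j in range(3)] for i in range(3)]

# Inference-time form: one 3x3 convolution with the fused kernel.
single = conv2d_same(x, fuse_branches(k3, k1))

max_diff = max(abs(multi[i][j] - single[i][j]) for i in range(3) for j in range(3))
print(max_diff < 1e-9)  # the two forms agree up to floating-point rounding
```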

DFL Heads

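The heads use Distribution Focal Loss (DFL) with reg_max=16: for each of the four box edges they predict a discrete probability distribution over 17 offset bins (0 to 16), then reduce it to a single distance with a softmax followed by a linear projection onto the bin indices, i.e. the expected bin value. This gives more precise box regression than direct coordinate prediction.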
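
As a rough illustration of the softmax + linear projection step, the sketch below decodes four 17-bin logit vectors into four edge distances using only the standard library. The function names are hypothetical. Any later scaling of the distances by the level's stride is assumed, not shown.

```python
import math

REG_MAX = 16  # 17 bins with indices 0..16


def dfl_expectation(logits):
    """Softmax over the bins, then the expected bin index (a linear projection)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return sum(i * e / total for i, e in enumerate(exps))


def decode_edges(edge_logits):
    """Decode four per-edge distributions into four distances (in bin units)."""
    assert len(edge_logits) == 4
    assert all(len(e) == REG_MAX + 1 for e in edge_logits)
    return [dfl_expectation(e) for e in edge_logits]


uniform = [0.0] * (REG_MAX + 1)  # no preference: expectation is the mid bin
peaked = [0.0] * (REG_MAX + 1)
peaked[5] = 50.0                 # nearly all mass on bin 5

print(decode_edges([uniform, peaked, uniform, peaked]))
```

A uniform distribution decodes to roughly 8.0 (the mean of 0..16), and a sharply peaked one decodes to roughly its peak index, which is how the expectation yields sub-bin precision.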

State dict compatibility

All attribute names (backbone.stem, neck.neck1, heads.head1, etc.) exactly match the super-gradients module hierarchy, so pretrained checkpoints load directly with only DDP/EMA prefix stripping.
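
Prefix stripping of this kind might look like the sketch below. The prefixes `module.` (commonly added by DDP wrappers) and `ema_model.` are assumptions for illustration, as are the helper name `strip_prefixes` and the parameter-name suffixes in the sample keys. Check the actual checkpoint keys before relying on them.

```python
def strip_prefixes(state_dict, prefixes=("module.", "ema_model.")):
    """Return a copy of state_dict with leading wrapper prefixes removed from keys."""
    cleaned = {}
    for key, value in state_dict.items():
        changed = True
        while changed:  # handle stacked wrappers, e.g. "ema_model.module."
            changed = False
            for p in prefixes:
                if key.startswith(p):
                    key = key[len(p):]
                    changed = True
        cleaned[key] = value
    return cleaned


# Hypothetical keys; plain ints stand in for tensors.
raw = {
    "module.backbone.stem.conv.weight": 1,
    "ema_model.module.neck.neck1.conv.weight": 2,
    "heads.head1.cls_pred.weight": 3,
}
print(sorted(strip_prefixes(raw)))
```

Because the module hierarchy already matches, the cleaned dictionary could then be passed to the model's `load_state_dict` without any key remapping.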

Variants

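| Variant | concat_intermediates | Head width_mult | Params |
|---------|----------------------|-----------------|--------|
| S       | False everywhere     | 0.5             | ~12M   |
| M       | True in stages 1-3   | 0.75            | ~31M   |
| L       | True everywhere      | 1.0             | ~44M   |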