Architecture¶

Overview¶

modern-yolonas follows the standard YOLO-NAS architecture:

Input (3, 640, 640)
  │
  ▼
Backbone (YoloNASBackbone)
  ├── Stem: Conv 3→48
  ├── Stage1: QARepVGG ↓2 → CSP → 96ch
  ├── Stage2: QARepVGG ↓2 → CSP → 192ch
  ├── Stage3: QARepVGG ↓2 → CSP → 384ch
  ├── Stage4: QARepVGG ↓2 → CSP → 768ch
  └── SPP: Spatial Pyramid Pooling → 768ch
  │
  │ outputs: [c2(96), c3(192), c4(384), c5(768)]
  ▼
Neck (YoloNASPANNeckWithC2)
  ├── neck1 (up): [c5, c4, c3] → upsample + concat + CSP
  ├── neck2 (up): [n1, c3, c2] → upsample + concat + CSP → p3
  ├── neck3 (down): [p3, n2_inter] → downsample + concat + CSP → p4
  └── neck4 (down): [p4, n1_inter] → downsample + concat + CSP → p5
  │
  │ outputs: [p3, p4, p5]
  ▼
Heads (NDFLHeads)
  ├── head1: p3 → cls(80) + reg(4×17) @ stride 8
  ├── head2: p4 → cls(80) + reg(4×17) @ stride 16
  └── head3: p5 → cls(80) + reg(4×17) @ stride 32
  │
  ▼
Output: [B, 8400, 4] bboxes + [B, 8400, 80] scores

Key components¶

QARepVGGBlock¶

The fundamental building block. During training it maintains three branches (3x3 conv+BN, 1x1 conv, identity) that are fused into a single convolution for inference. This enables quantization-aware training while keeping inference fast.

DFL Heads¶

Uses Distribution Focal Loss with reg_max=16 — predicts a discrete probability distribution over 17 offset bins per box edge, then reduces via softmax + linear projection. This gives more precise box regression than direct coordinate prediction.

State dict compatibility¶

All attribute names (backbone.stem, neck.neck1, heads.head1, etc.) exactly match the super-gradients module hierarchy, so pretrained checkpoints load directly with only DDP/EMA prefix stripping.

Variants¶

Variant	`concat_intermediates`	Head `width_mult`	Params
S	False everywhere	0.5	~12M
M	True in stages 1-3	0.75	~31M
L	True everywhere	1.0	~44M