Automatic Batch Sizing (AutoBatch)
Linnaeus can automatically search for a memory-safe per-GPU batch size before training.
The search uses a binary strategy implemented in linnaeus.utils.autobatch.auto_find_batch_size
.
Known Limitation - Multi-GPU (DDP) Training
⚠️ Important: When using autobatch in a multi-GPU distributed training setup (DDP), the current implementation may cause NCCL timeout errors on non-rank-0 processes. While autobatch is designed to run only on rank 0 with other ranks waiting, the current behavior can result in timeouts.
Recommended Workflow for Multi-GPU Training
-
Discovery Phase (Single GPU):
bash # Use the standalone tool or run with single GPU python tools/analyze_batch_sizes.py --cfg my_exp.yaml --fractions 0.8 --modes train,val
-
Configure Discovered Batch Sizes:
yaml DATA: BATCH_SIZE: 64 # Use discovered training batch size BATCH_SIZE_VAL: 128 # Use discovered validation batch size AUTOBATCH: ENABLED: False # Disable autobatch for multi-GPU run ENABLED_VAL: False
-
Run Multi-GPU Training:
bash torchrun --nproc_per_node=4 -m linnaeus.main --cfg my_exp.yaml
Configuration
DATA:
AUTOBATCH:
ENABLED: False # Run the search for the training batch size
TARGET_MEMORY_FRACTION: 0.8 # Fraction of GPU memory to use
MAX_BATCH_SIZE: 512 # Upper bound for the search
MIN_BATCH_SIZE: 1 # Lower bound
STEPS_PER_TRIAL: 2 # Steps to simulate per trial
LOG_LEVEL: "INFO" # Logging level for the autobatch logger
ENABLED_VAL: ${DATA.AUTOBATCH.ENABLED} # Also search validation size
TARGET_MEMORY_FRACTION_VAL: ${DATA.AUTOBATCH.TARGET_MEMORY_FRACTION}
MAX_BATCH_SIZE_VAL: ${DATA.AUTOBATCH.MAX_BATCH_SIZE} * 2
MIN_BATCH_SIZE_VAL: ${DATA.AUTOBATCH.MIN_BATCH_SIZE}
STEPS_PER_TRIAL_VAL: ${DATA.AUTOBATCH.STEPS_PER_TRIAL}
LOG_LEVEL_VAL: ${DATA.AUTOBATCH.LOG_LEVEL}
Set ENABLED
(and optionally ENABLED_VAL
) to True
to run the search at the start of training.
The discovered batch size will overwrite DATA.BATCH_SIZE
(and DATA.BATCH_SIZE_VAL
).
Usage Example
python -m linnaeus.train \
DATA.AUTOBATCH.ENABLED True \
DATA.AUTOBATCH.TARGET_MEMORY_FRACTION 0.85
AutoBatch will log the trial results and set the final batch size accordingly.
Standalone Analysis Tool
The tools/analyze_batch_sizes.py
script runs the same search outside of the training loop.
This is useful for exploring different memory fractions.
python tools/analyze_batch_sizes.py --cfg my_exp.yaml --fractions 0.6,0.8 --modes train,val
The script outputs a JSON or CSV report with the best batch sizes. A typical workflow is:
- Run the analysis tool with your experiment config.
- Choose a memory fraction that yields a suitable batch size.
- Enable AutoBatch in your config (or set the batch size manually) before launching training.