# Developer Guide: Metrics and Logging
This document describes the architecture and usage of the metrics tracking and logging system in linnaeus.
## 1. Architecture Overview
The system separates responsibilities for clarity and maintainability:
```mermaid
graph LR
    subgraph FLOW["Metrics & Logging Flow"]
        direction LR
        MT[MetricsTracker] -- Metrics State --> SL(StepMetricsLogger)
        OS(OpsSchedule) -- Logging Intervals --> SL
        SL -- Formatted Metrics --> CONSOLE{"Console (Rank 0)"}
        SL -- Formatted Metrics --> WB(wandb.py)
        WB -- "wandb.log()" --> WANDB_API[WandB Service]
    end
    subgraph LOOP["Training Loop"]
        direction TB
        TL("train_one_epoch / validate_one_pass") -- Update --> MT
        TL -- Log Trigger --> SL
    end
    style MT fill:#f9f,stroke:#333,stroke-width:2px
    style SL fill:#ccf,stroke:#333,stroke-width:2px
    style OS fill:#ddd,stroke:#333,stroke-width:1px
    style WB fill:#cfc,stroke:#333,stroke-width:2px
    style WANDB_API fill:#eee,stroke:#333,stroke-width:1px
    style CONSOLE fill:#eee,stroke:#333,stroke-width:1px
    style TL fill:#eee,stroke:#333,stroke-width:1px
```
- `MetricsTracker` (`utils/metrics/tracker.py`): The central state store. It accumulates metric values, tracks best scores, manages subset metrics, and handles checkpointing/restoration of metric state. It does not perform any logging itself.
- `OpsSchedule` (`ops_schedule/ops_schedule.py`): Determines when logging should occur based on configured intervals (`SCHEDULE.METRICS.*`).
- `StepMetricsLogger` (`utils/metrics/step_metrics_logger.py`): The coordinator. It queries `OpsSchedule` to decide if it is time to log. If so, it retrieves current metric values from `MetricsTracker`, formats them, logs to the console (rank 0 only), and passes the formatted data to `wandb.py` for WandB logging.
- `wandb.py` (`utils/logging/wandb.py`): The dedicated interface to the WandB API. It receives formatted metrics and handles `wandb.init`, `wandb.log`, and `wandb.config.update`.
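The sketch below shows how these four pieces interact during training. The signatures are simplified assumptions for illustration; the real classes live in the modules listed above.

```python
# Minimal sketch of the control flow; signatures are hypothetical.
# Real classes: utils/metrics/tracker.py, ops_schedule/ops_schedule.py,
# utils/metrics/step_metrics_logger.py, utils/logging/wandb.py.

def train_one_epoch(model, loader, tracker, step_logger):
    for step, (images, targets) in enumerate(loader):
        loss = model.training_step(images, targets)  # placeholder for fwd/bwd/opt
        # 1. State goes into MetricsTracker; no logging happens here.
        tracker.update_train_batch(loss=loss)
        # 2. StepMetricsLogger asks OpsSchedule whether it is time to log,
        #    pulls state from the tracker, and fans out to console + wandb.py.
        step_logger.log_step_metrics(step=step)
    # End of epoch: turn accumulated partial sums into final averages.
    tracker.finalize_train_epoch()
```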
## 2. MetricsTracker Details
### Metric Storage

- Uses the `Metric` class (`utils/metrics/tracker.py`) to store individual metric values, best values, and best epochs. Supports a `higher_is_better` flag (see the sketch after this list).
- Organizes metrics by phase (`train`, `val`, `val_mask_meta`, plus dynamically created phases like `val_mask_TEMPORAL`):
  - `phase_metrics`: Global metrics per phase (e.g., `loss`, `chain_accuracy`).
  - `phase_task_metrics`: Per-task metrics per phase (e.g., `acc1`, `acc3`, `loss`).
  - `phase_subset_metrics`: Handles subset aggregation using `SubsetMetricWrapper`.
- Accumulates partial sums/counts for task metrics (`partial_task_sums`, `partial_task_counts`), updated per batch and finalized per epoch.
- Stores current schedule values (`schedule_values`), pipeline metrics (`metrics`), GradNorm metrics (`gradnorm_metrics`), and learning rates (`lr_dict`).
- Maintains historical lists (optional, limited in size) for debugging/analysis.
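A minimal sketch of the `Metric` record described above; field and method names beyond `higher_is_better` are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Metric:
    """Per-metric record: current value, best value, and best epoch."""
    higher_is_better: bool = True
    value: float | None = None       # most recent finalized value
    best_value: float | None = None  # best value seen so far
    best_epoch: int | None = None    # epoch at which best_value occurred

    def update(self, value: float, epoch: int) -> None:
        self.value = value
        improved = (
            self.best_value is None
            or (value > self.best_value if self.higher_is_better
                else value < self.best_value)
        )
        if improved:
            self.best_value, self.best_epoch = value, epoch
```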
### Key Methods

- `update_train_batch(...)`, `update_val_metrics(...)`: Called per batch to update partial sums and chain-accuracy counters.
- `start_val_phase(...)`: Resets accumulators for a specific validation phase.
- `finalize_train_epoch(...)`, `finalize_val_phase(...)`: Called at the end of each epoch/phase. Calculate final averages from the partial sums, update `Metric` objects (current and best values), and reset accumulators.
- `update_pipeline_metrics(...)`, `update_task_weights(...)`, `update_schedule_values(...)`, `update_learning_rates(...)`, `update_gradnorm_metrics(...)`: Update specific metric categories.
- `get_wandb_metrics()`: Crucial method. Gathers and formats all relevant current metric values into a flat dictionary suitable for `wandb.log`, organizing keys hierarchically (e.g., `train/loss`, `core/val_acc1/taxa_L10`). See the sketch below.
- `state_dict()`, `load_state_dict()`: For checkpointing.
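A condensed sketch of this accumulate/finalize/flatten pattern. Attribute names mirror the descriptions above, but the implementation details are assumptions:

```python
class MetricsTrackerSketch:
    """Illustrative only: per-batch partial sums -> epoch averages -> flat WandB dict."""

    def __init__(self):
        self.partial_task_sums = {}    # {(task, metric): running weighted sum}
        self.partial_task_counts = {}  # {(task, metric): sample count}
        self.phase_task_metrics = {}   # {(task, metric): finalized average}

    def update_train_batch(self, task, metric, value, n):
        # `value` is the batch average over `n` samples.
        key = (task, metric)
        self.partial_task_sums[key] = self.partial_task_sums.get(key, 0.0) + value * n
        self.partial_task_counts[key] = self.partial_task_counts.get(key, 0) + n

    def finalize_train_epoch(self):
        # Convert partial sums into epoch averages, then reset accumulators.
        for key, total in self.partial_task_sums.items():
            self.phase_task_metrics[key] = total / self.partial_task_counts[key]
        self.partial_task_sums.clear()
        self.partial_task_counts.clear()

    def get_wandb_metrics(self):
        # Flat dict with hierarchical keys, e.g. "train/acc1_taxa_L10".
        return {
            f"train/{metric}_{task}": value
            for (task, metric), value in self.phase_task_metrics.items()
        }
```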
## 3. StepMetricsLogger Details
### Role

- Acts as the central dispatcher for logging during the training loop (`train_one_epoch`).
- Decides if and what to log at each step based on `OpsSchedule`.
### Key Methods

- `log_step_metrics(...)`: Called within `train_one_epoch`. Checks intervals using `OpsSchedule` (sketched after this list). If logging is due:
  - Formats basic loss, LR, etc.
  - Logs a summary line to the console (rank 0).
  - Accumulates metrics for WandB averaging (`accumulate_metrics_for_wandb`).
  - If the WandB interval is met, calculates averages (`get_averaged_wandb_metrics`) and calls `wandb_utils.log_training_metrics`.
- `log_learning_rates(...)`: Called periodically. Gets the LR dict from the scheduler, updates `MetricsTracker`, logs to the console, and calls `wandb_utils.log_learning_rates`.
- `log_pipeline_metrics(...)`: Called periodically. Gets metrics from the dataset, updates `MetricsTracker`, logs to the console, and calls `wandb_utils.log_pipeline_metrics`.
- `log_gradnorm_metrics(...)`: Called after a GradNorm update. Logs a summary to the console and calls `wandb_utils.log_gradnorm_metrics`.
- `log_validation_summary(...)`: Called after a validation pass completes. Formats metrics from `MetricsTracker`, logs a summary to the console, and calls `wandb_utils.log_validation_metrics`.
- `log_schedule_values(...)`: Called periodically (or once at start). Gets dynamic schedule values from `OpsSchedule` (passing `current_step`), logs to the console, and calls `wandb_utils.log_schedule_values`. Logs the static schedule summary to the WandB config once.
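A sketch of the dispatch logic in `log_step_metrics`; the `should_log_step`/`should_log_to_wandb` helpers on `OpsSchedule` are hypothetical names for the interval checks:

```python
import logging

logger = logging.getLogger(__name__)

class StepMetricsLoggerSketch:
    """Illustrative only; real logic lives in utils/metrics/step_metrics_logger.py."""

    def __init__(self, tracker, ops_schedule, wandb_utils, rank=0):
        self.tracker, self.ops_schedule = tracker, ops_schedule
        self.wandb_utils, self.rank = wandb_utils, rank
        self._wandb_buffer = []  # metrics accumulated between WandB logs

    def log_step_metrics(self, step):
        if not self.ops_schedule.should_log_step(step):      # hypothetical check
            return
        metrics = self.tracker.get_wandb_metrics()
        if self.rank == 0:                                   # console output, rank 0 only
            logger.info("step %d | loss %.4f", step,
                        metrics.get("train/loss", float("nan")))
        self._wandb_buffer.append(metrics)                   # accumulate for averaging
        if self.ops_schedule.should_log_to_wandb(step):      # hypothetical check
            self.wandb_utils.log_training_metrics(self._average_buffer(), step=step)
            self._wandb_buffer.clear()

    def _average_buffer(self):
        # Average each key over the buffered snapshots that contain it.
        keys = set().union(*self._wandb_buffer)
        return {k: sum(m[k] for m in self._wandb_buffer if k in m)
                   / sum(1 for m in self._wandb_buffer if k in m)
                for k in keys}
```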
### WandB Interval Averaging

- If `SCHEDULE.METRICS.WANDB_INTERVAL` is larger than `SCHEDULE.METRICS.STEP_INTERVAL`, `StepMetricsLogger` averages the metrics accumulated since the last WandB log.
- Averaged metrics are logged with a `step_avg_` prefix (e.g., `train/step_avg_loss`) to distinguish them from the per-step instantaneous values logged when `STEP_INTERVAL == WANDB_INTERVAL`.
- Core metrics (`core/...`) are always logged with their averaged value for consistency.
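One plausible way the prefixing could work, as a sketch; the exact key-rewriting rules in `StepMetricsLogger` may differ:

```python
def prefix_averaged_metrics(averaged):
    """Apply the step_avg_ prefix described above (illustrative rules only)."""
    out = {}
    for key, value in averaged.items():
        if key.startswith("core/"):
            out[key] = value  # core metrics: always logged with the averaged value
        else:
            prefix, _, name = key.rpartition("/")
            out[f"{prefix}/step_avg_{name}"] = value
    return out

# prefix_averaged_metrics({"train/loss": 0.42, "core/val_loss": 0.38})
# -> {"train/step_avg_loss": 0.42, "core/val_loss": 0.38}
```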
## 4. `wandb.py` Utilities
- Provides simple functions wrapping `wandb` API calls (`initialize_wandb`, `log_training_metrics`, `log_validation_metrics`, etc.).
- Ensures all WandB interactions happen through this single module.
- Handles generating and broadcasting the `run_id` for distributed runs.
- Constructs the initial `wandb.config` dictionary from the YACS config and dataset/model metadata.
- Handles `resume="must"` logic based on checkpoint loading.
## 5. WandB Metric Naming Conventions

Using consistent, hierarchical names in `MetricsTracker.get_wandb_metrics()` is key to organizing the WandB dashboard.
- Prefixes: `train/`, `val/`, `val_mask_meta/`, `val_mask_COMPONENT/`, `pipeline/`, `schedule/`, `gradnorm/`, `core/`.
- Structure: `{prefix}/{metric_type}_{task_key}` (e.g., `val/acc1_taxa_L10`) or `{prefix}/{metric_name}` (e.g., `train/loss`), as shown below.
- Core: Key metrics are duplicated under `core/` for easy dashboarding (e.g., `core/val_loss`, `core/val_acc1/taxa_L10`).
- Step Averaging: Use the `step_avg_` prefix for metrics averaged over the WandB interval (e.g., `train/step_avg_loss`).
- Schedule Strings: Use the `_str` suffix for non-numeric schedule values (e.g., `schedule/mixup_group_str`).
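A tiny illustrative helper (not in the codebase) showing how keys following these conventions are assembled:

```python
def wandb_key(prefix, metric_type, task_key=None):
    """Build hierarchical WandB keys per the conventions above."""
    return f"{prefix}/{metric_type}_{task_key}" if task_key else f"{prefix}/{metric_type}"

assert wandb_key("val", "acc1", "taxa_L10") == "val/acc1_taxa_L10"
assert wandb_key("train", "loss") == "train/loss"
```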
## 6. Logging Levels and Debugging
- Use Python's standard `logging` library.
- Leverage the different log levels (`DEBUG`, `INFO`, `STATS`, `WARNING`, `ERROR`). `STATS` is a custom level between `INFO` and `DEBUG`.
- Use dual-level file logging (`create_logger`) to separate detailed debug logs from concise info logs.
- Use `MetricsTracker.dump_metrics_state()`, triggered by `DEBUG.DUMP_METRICS` or `DEBUG.VALIDATION_METRICS`, for detailed inspection.
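Registering a custom level with the standard library is straightforward; a sketch, assuming the numeric value 15 (anything strictly between `DEBUG` = 10 and `INFO` = 20 would do):

```python
import logging

# Register the custom STATS level described above.
STATS = 15  # assumed value; must sit between DEBUG (10) and INFO (20)
logging.addLevelName(STATS, "STATS")

logger = logging.getLogger("linnaeus")
logger.setLevel(logging.DEBUG)
logger.log(STATS, "epoch averages finalized")  # emitted at the STATS level
```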
## 7. Conclusion
This structured approach separates concerns, making the metrics and logging system easier to manage, debug, and extend: `MetricsTracker` holds the state, `OpsSchedule` dictates timing, `StepMetricsLogger` orchestrates logging actions, and `wandb.py` handles the final API interaction. Consistent naming and clear roles are paramount.