# Architecture Notes

This document contains important notes about architectural patterns, implementation quirks, and gotchas in the linnaeus codebase. It serves as a reference for developers and AI assistants working with the code.
## Dataset System Architecture

### Class Hierarchy

The dataset system uses a multi-level hierarchy with wrapper classes:

- `BasePrefetchingDataset` (base class with concurrency functionality)
  - `PrefetchingH5Dataset` (pure HDF5 dataset)
  - `PrefetchingHybridDataset` (HDF5 labels + images on disk)
- `_SingleFileH5SubsetWrapper` (wrapper for `PrefetchingH5Dataset`)
- `_SingleFileHybridSubsetWrapper` (wrapper for `PrefetchingHybridDataset`)
### Dataset Build Scenarios

In `h5data/build.py`, four distinct scenarios are supported:

- **Scenario A (Separate Train+Val)**
  - Separate HDF5 files for train and validation
  - No wrappers, direct instantiation of dataset classes
- **Scenario B (Single-file pure-HDF5)**
  - One HDF5 file contains both train and validation data
  - Runtime train/val split with `_SingleFileH5SubsetWrapper`
- **Scenario B-H (Single-file Hybrid)**
  - One HDF5 file for labels + images in a directory
  - Runtime train/val split with `_SingleFileHybridSubsetWrapper`
- **Scenario C (Train-only)**
  - Only training data, no validation
  - Direct instantiation of dataset classes
### Wrapper Class Delegation Pattern

The wrapper classes (`_SingleFileH5SubsetWrapper` and `_SingleFileHybridSubsetWrapper`) use delegation to:

- Present a subset view of the underlying dataset
- Map local indices to global indices in the base dataset
- Pass through important methods/properties like:
  - `start_prefetching`
  - `fetch_next_batch`
  - `close`
  - `metrics`
  - `_shutdown_event`
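
A minimal sketch of this pattern follows. The class name and constructor arguments here are illustrative, not the real wrapper API; only the pass-through names listed above come from the codebase:

```python
class _SubsetWrapperSketch:
    """Illustrative subset wrapper: delegates to a base dataset over a fixed index subset."""

    def __init__(self, base_dataset, subset_indices):
        self.base_dataset = base_dataset            # underlying prefetching dataset
        self.subset_indices = list(subset_indices)  # local idx -> global idx in the base dataset

    def __len__(self):
        return len(self.subset_indices)

    def _read_raw_item(self, local_idx):
        # Map the wrapper-local index to the original index before delegating.
        return self.base_dataset._read_raw_item(self.subset_indices[local_idx])

    # Pass-through methods/properties
    def start_prefetching(self, *args, **kwargs):
        return self.base_dataset.start_prefetching(*args, **kwargs)

    def fetch_next_batch(self, *args, **kwargs):
        return self.base_dataset.fetch_next_batch(*args, **kwargs)

    def close(self):
        return self.base_dataset.close()

    @property
    def metrics(self):
        return self.base_dataset.metrics

    @property
    def _shutdown_event(self):
        return self.base_dataset._shutdown_event
```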
### Important Methods and Lifecycle

- **Initialization**:
  - Base dataset is created with all data
  - Wrapper is created with subset indices
- **Sampling**:
  - `GroupedBatchSampler` calls `set_current_group_level_array` on the dataset
  - Wrapper maps local to global indices and calls the base dataset
- **Data Loading**:
  - `start_prefetching` initiates prefetching for an epoch
  - `_read_raw_item` is called by worker threads
  - `fetch_next_batch` returns batches from the prefetch queue
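
From a training loop's perspective, the lifecycle roughly follows the shape below; the argument lists and the end-of-epoch signal are assumptions for illustration, not the real linnaeus API:

```python
# Hypothetical epoch loop illustrating the lifecycle described above.
def run_epoch(dataset, num_batches):
    dataset.start_prefetching()                  # spin up IO/processing threads for this epoch
    try:
        for _ in range(num_batches):
            batch = dataset.fetch_next_batch()   # pop a processed batch from the prefetch queue
            if batch is None:                    # assumption: None marks the end of the epoch
                break
            # ... forward/backward pass on `batch` ...
    finally:
        dataset.close()                          # set the shutdown event and join worker threads
```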
## Key Implementation Quirks

### Logging Delegation Issues

#### Direct Logger Reference Issue

**Problem**: The dataset wrapper classes (`_SingleFileH5SubsetWrapper` and `_SingleFileHybridSubsetWrapper`) do not properly propagate logger instances. When using wrapped datasets, calls to `self.h5data_logger` in base dataset methods won't produce logs in the expected files.

**Root Cause**: The wrapper classes don't pass the logger reference along, so `self.h5data_logger` refers to an instance variable that doesn't exist in the wrapped context.

**Solution**: Use direct access to the named logger instead of the instance variable:
```python
import logging

# Before (broken with wrappers)
self.h5data_logger.debug("Message")

# After (works with wrappers)
h5data_logger = logging.getLogger('h5data')
h5data_logger.debug("Message")
```
#### Index Conditions in Logging

**Problem**: Debug logs in dataset methods may be filtered by inappropriate index conditions, particularly in shuffled datasets.

**Root Cause**: Conditions like `if idx < 5` fail to account for the nature of indices in dataset methods:

- In `_read_raw_item(self, idx)`, the `idx` is the original HDF5 row index
- With shuffled/grouped dataloaders, these indices are processed in random order
- The first few samples processed rarely have indices 0-4
- This means logs gated by `idx < 5` might never appear, even if the debug flag is enabled

**Solution**: Avoid adding index-based conditions to debug logs in dataset methods:
```python
# Bad approach - may never log anything with shuffled data:
if debug_flag and idx < 5:
    logger.debug(f"Processing idx={idx}")

# Better approach - logs whenever the flag is enabled:
if debug_flag:
    logger.debug(f"Processing idx={idx}")
```
For high-volume logs where some filtering is necessary, consider:

1. Using mod-based sampling instead of low-index checks: `if idx % 1000 == 0`
2. Tracking and logging based on a counter of items processed rather than the original indices
3. Sampling logs probabilistically: `if random.random() < 0.01` (logs ~1% of items)
4. Adding dedicated debug flags for verbosity levels
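
A sketch of options 2 and 3; `_items_processed` is a hypothetical instance attribute, and `debug_flag` and `idx` follow the snippet above as placeholders for whatever the method actually has in scope:

```python
import logging
import random

h5data_logger = logging.getLogger('h5data')

# Counter-based gate: log the first few items actually processed this epoch,
# regardless of their original HDF5 indices.
self._items_processed = getattr(self, '_items_processed', 0) + 1
if debug_flag and self._items_processed <= 5:
    h5data_logger.debug(f"Processing item #{self._items_processed} (idx={idx})")

# Probabilistic gate: log roughly 1% of items.
if debug_flag and random.random() < 0.01:
    h5data_logger.debug(f"Processing idx={idx}")
```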
**Affected Areas**: Classes like `PrefetchingHybridDataset` when used with wrapper classes, particularly in methods like `_read_raw_item` that rely on debug configuration flags.
### Group ID and Index Mapping

The system uses several levels of index mapping:

- **Original HDF5 indices**: Raw indices in the original file
- **Valid indices**: Filtered indices that pass validation criteria
- **Local subset indices**: Train/val subset indices (local to the wrapper)
- **Group indices**: Used for grouping samples for operations like mixup

When using wrappers:

- The wrapper sees local subset indices (`0` to `len(subset) - 1`)
- These are mapped to original indices when accessing the base dataset
- The `set_current_group_level_array` method handles this mapping
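
A made-up numeric example of these index layers (group indices omitted): a 10-row HDF5 file where two rows fail validation and the valid rows are split into train/val subsets.

```python
import numpy as np

original_indices = np.arange(10)                    # raw HDF5 row indices 0..9
valid_indices = np.array([0, 1, 2, 4, 5, 6, 8, 9])  # rows 3 and 7 filtered out by validation
train_subset, val_subset = valid_indices[:6], valid_indices[6:]

# Wrapper-local train index 3 refers to original HDF5 row 4.
assert train_subset[3] == 4
# Wrapper-local val index 1 refers to original HDF5 row 9.
assert val_subset[1] == 9
```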
### Data Prefetching and Concurrency

The dataset system uses multiple threads for prefetching:

- **IO threads**: Read data from disk/HDF5
- **Processing threads**: Preprocess data (resize, augment, etc.)
- **Memory cache**: Stores processed batches

This introduces potential concurrency issues:

- Thread synchronization is handled via queues and events
- The `_shutdown_event` property delegation is crucial for clean shutdown
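
The general shape of this queue-plus-event coordination is shown below; this is a generic sketch, not the actual linnaeus implementation:

```python
import queue
import threading

shutdown_event = threading.Event()
work_queue = queue.Queue(maxsize=64)

def io_worker():
    # Consume work items until the shutdown event is set; the timeout ensures
    # the event is re-checked even when the queue is idle.
    while not shutdown_event.is_set():
        try:
            item = work_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        # ... read `item` from disk/HDF5 and hand it to a processing thread ...
        work_queue.task_done()

worker = threading.Thread(target=io_worker, daemon=True)
worker.start()

# Clean shutdown: set the event, then join the thread.
shutdown_event.set()
worker.join()
```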
## Best Practices for Extensions

### Extending Datasets

When implementing new dataset types:

- Inherit from `BasePrefetchingDataset`
- Override `_read_raw_item` for custom data reading
- Use direct logger references (`logging.getLogger('h5data')`) for debug logging
- Maintain proper shutdown mechanisms by calling `super().close()`
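
A skeleton following these points; the import path, class name, and method bodies are placeholders for illustration, only the identifiers listed above come from the codebase:

```python
import logging

# Assumed import path; adjust to wherever BasePrefetchingDataset actually lives.
from linnaeus.h5data import BasePrefetchingDataset

h5data_logger = logging.getLogger('h5data')


class MyCustomDataset(BasePrefetchingDataset):
    def _read_raw_item(self, idx):
        # Custom raw read for original index `idx`; log through the named logger
        # so messages still appear when the dataset is wrapped.
        h5data_logger.debug(f"Reading raw item idx={idx}")
        raw = ...  # read image/labels from your storage backend
        return raw

    def close(self):
        # Release any custom resources first, then let the base class join its threads.
        super().close()
```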
### Working with Wrappers

When working with or extending wrapper classes:

- Pass through critical methods and properties to the base dataset
- Ensure proper index mapping between the local and global index spaces
- Avoid assuming logger instance variables will be accessible
- Be careful with identity comparisons: the wrapper and the base dataset are distinct objects
### Debugging Dataset Issues

For debugging data flow issues:

- Enable verbose logging with `DEBUG.DATASET.READ_ITEM_VERBOSE=True`
- Use direct logger references to ensure logs appear regardless of wrapper usage
- Track tensor identities (`id()`) and data pointers (`tensor.data_ptr()`) to detect unintended copying
- Check whether a dataset is wrapped by looking for a `base_dataset` attribute
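
A few short checks along these lines; `dataset` stands in for whatever object you are inspecting, and the `base_dataset` attribute check comes from the note above:

```python
import logging

import torch

# Is this dataset a subset wrapper? The wrappers expose a `base_dataset` attribute.
is_wrapped = hasattr(dataset, 'base_dataset')

# Log through the named logger so the message appears whether or not the dataset is wrapped.
logging.getLogger('h5data').debug(f"dataset wrapped={is_wrapped}")

# Detect unintended copying: matching id()/data_ptr() means the tensors share storage.
a = torch.zeros(4)
b = a
print(id(a) == id(b), a.data_ptr() == b.data_ptr())  # True True -> no copy was made
```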