# Architecture Notes

This document contains important notes about architectural patterns, implementation quirks, and gotchas in the linnaeus codebase. It serves as a reference for developers and AI assistants working with the code.
## Dataset System Architecture

### Class Hierarchy

The dataset system uses a multi-level hierarchy with wrapper classes:

- `BasePrefetchingDataset` (base class with concurrency functionality)
  - `PrefetchingH5Dataset` (pure HDF5 dataset)
  - `PrefetchingHybridDataset` (HDF5 labels + images on disk)
- `_SingleFileH5SubsetWrapper` (wrapper for `PrefetchingH5Dataset`)
- `_SingleFileHybridSubsetWrapper` (wrapper for `PrefetchingHybridDataset`)
### Dataset Build Scenarios

In `h5data/build.py`, four distinct scenarios are supported:

- **Scenario A (Separate Train+Val)**
  - Separate HDF5 files for train and validation
  - No wrappers, direct instantiation of dataset classes
- **Scenario B (Single-file pure-HDF5)**
  - One HDF5 file contains both train and validation data
  - Runtime train/val split with `_SingleFileH5SubsetWrapper`
- **Scenario B-H (Single-file Hybrid)**
  - One HDF5 file for labels + images in a directory
  - Runtime train/val split with `_SingleFileHybridSubsetWrapper`
- **Scenario C (Train-only)**
  - Only training data, no validation
  - Direct instantiation of dataset classes
### Wrapper Class Delegation Pattern

The wrapper classes (`_SingleFileH5SubsetWrapper` and `_SingleFileHybridSubsetWrapper`) use delegation to:

- Present a subset view of the underlying dataset
- Map local indices to global indices in the base dataset
- Pass through important methods/properties like:
  - `start_prefetching`
  - `fetch_next_batch`
  - `close`
  - `metrics`
  - `_shutdown_event`
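
A minimal sketch of this pattern follows. The class name and constructor arguments here are illustrative, not the real wrapper API; only the pass-through names listed above come from the codebase:

```python
class _SubsetWrapperSketch:
    """Illustrative subset wrapper: delegates to a base dataset over a fixed index subset."""

    def __init__(self, base_dataset, subset_indices):
        self.base_dataset = base_dataset            # underlying prefetching dataset
        self.subset_indices = list(subset_indices)  # local idx -> global idx in the base dataset

    def __len__(self):
        return len(self.subset_indices)

    def _read_raw_item(self, local_idx):
        # Map the wrapper-local index to the original index before delegating.
        return self.base_dataset._read_raw_item(self.subset_indices[local_idx])

    # Pass-through methods/properties
    def start_prefetching(self, *args, **kwargs):
        return self.base_dataset.start_prefetching(*args, **kwargs)

    def fetch_next_batch(self, *args, **kwargs):
        return self.base_dataset.fetch_next_batch(*args, **kwargs)

    def close(self):
        return self.base_dataset.close()

    @property
    def metrics(self):
        return self.base_dataset.metrics

    @property
    def _shutdown_event(self):
        return self.base_dataset._shutdown_event
```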
### Important Methods and Lifecycle

- **Initialization**:
  - Base dataset is created with all data
  - Wrapper is created with subset indices
- **Sampling**:
  - `GroupedBatchSampler` calls `set_current_group_level_array` on the dataset
  - Wrapper maps local to global indices and calls the base dataset
- **Data Loading**:
  - `start_prefetching` initiates prefetching for an epoch
  - `_read_raw_item` is called by worker threads
  - `fetch_next_batch` returns batches from the prefetch queue
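
From a training loop's perspective, the lifecycle roughly follows the shape below; the argument lists and the end-of-epoch signal are assumptions for illustration, not the real linnaeus API:

```python
# Hypothetical epoch loop illustrating the lifecycle described above.
def run_epoch(dataset, num_batches):
    dataset.start_prefetching()                  # spin up IO/processing threads for this epoch
    try:
        for _ in range(num_batches):
            batch = dataset.fetch_next_batch()   # pop a processed batch from the prefetch queue
            if batch is None:                    # assumption: None marks the end of the epoch
                break
            # ... forward/backward pass on `batch` ...
    finally:
        dataset.close()                          # set the shutdown event and join worker threads
```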
## Key Implementation Quirks

### Logging Delegation Issues

#### Direct Logger Reference Issue

**Problem**: The dataset wrapper classes (`_SingleFileH5SubsetWrapper` and `_SingleFileHybridSubsetWrapper`) do not properly propagate logger instances. When using wrapped datasets, calls to `self.h5data_logger` in base dataset methods won't produce logs in the expected files.

**Root Cause**: The wrapper classes don't pass the logger reference along, so `self.h5data_logger` refers to an instance variable that doesn't exist in the wrapped context.

**Solution**: Use direct access to the named logger instead of the instance variable:
```python
import logging

# Before (broken with wrappers)
self.h5data_logger.debug("Message")

# After (works with wrappers)
h5data_logger = logging.getLogger('h5data')
h5data_logger.debug("Message")
```
#### Index Conditions in Logging

**Problem**: Debug logs in dataset methods may be filtered by inappropriate index conditions, particularly in shuffled datasets.

**Root Cause**: Conditions like `if idx < 5` fail to account for the nature of indices in dataset methods:

- In `_read_raw_item(self, idx)`, the `idx` is the original HDF5 row index
- With shuffled/grouped dataloaders, these indices are processed in random order
- The first few samples processed rarely have indices 0-4
- This means logs gated by `idx < 5` might never appear, even if the debug flag is enabled

**Solution**: Avoid adding index-based conditions to debug logs in dataset methods:
```python
# Bad approach - may never log anything with shuffled data:
if debug_flag and idx < 5:
    logger.debug(f"Processing idx={idx}")

# Better approach - logs whenever the flag is enabled:
if debug_flag:
    logger.debug(f"Processing idx={idx}")
```
For high-volume logs where some filtering is necessary, consider:

1. Using mod-based sampling instead of low-index checks: `if idx % 1000 == 0`
2. Tracking and logging based on a counter of items processed rather than the original indices
3. Sampling logs probabilistically: `if random.random() < 0.01` (logs ~1% of items)
4. Adding dedicated debug flags for verbosity levels
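
A sketch of options 2 and 3; `_items_processed` is a hypothetical instance attribute, and `debug_flag` and `idx` follow the snippet above as placeholders for whatever the method actually has in scope:

```python
import logging
import random

h5data_logger = logging.getLogger('h5data')

# Counter-based gate: log the first few items actually processed this epoch,
# regardless of their original HDF5 indices.
self._items_processed = getattr(self, '_items_processed', 0) + 1
if debug_flag and self._items_processed <= 5:
    h5data_logger.debug(f"Processing item #{self._items_processed} (idx={idx})")

# Probabilistic gate: log roughly 1% of items.
if debug_flag and random.random() < 0.01:
    h5data_logger.debug(f"Processing idx={idx}")
```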
**Affected Areas**: Classes like `PrefetchingHybridDataset` when used with wrapper classes, particularly in methods like `_read_raw_item` that rely on debug configuration flags.
### Group ID and Index Mapping

The system uses several levels of index mapping:

- **Original HDF5 indices**: Raw indices in the original file
- **Valid indices**: Filtered indices that pass validation criteria
- **Local subset indices**: Train/val subset indices (local to the wrapper)
- **Group indices**: Used for grouping samples for operations like mixup

When using wrappers:

- The wrapper sees local subset indices (`0` to `len(subset) - 1`)
- These are mapped to original indices when accessing the base dataset
- The `set_current_group_level_array` method handles this mapping
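
A made-up numeric example of these index layers (group indices omitted): a 10-row HDF5 file where two rows fail validation and the valid rows are split into train/val subsets.

```python
import numpy as np

original_indices = np.arange(10)                    # raw HDF5 row indices 0..9
valid_indices = np.array([0, 1, 2, 4, 5, 6, 8, 9])  # rows 3 and 7 filtered out by validation
train_subset, val_subset = valid_indices[:6], valid_indices[6:]

# Wrapper-local train index 3 refers to original HDF5 row 4.
assert train_subset[3] == 4
# Wrapper-local val index 1 refers to original HDF5 row 9.
assert val_subset[1] == 9
```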
### Data Prefetching and Concurrency

The dataset system uses multiple threads for prefetching:

- **IO threads**: Read data from disk/HDF5
- **Processing threads**: Preprocess data (resize, augment, etc.)
- **Memory cache**: Stores processed batches

This introduces potential concurrency issues:

- Thread synchronization is handled via queues and events
- The `_shutdown_event` property delegation is crucial for clean shutdown
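
The general shape of this queue-plus-event coordination is shown below; this is a generic sketch, not the actual linnaeus implementation:

```python
import queue
import threading

shutdown_event = threading.Event()
work_queue = queue.Queue(maxsize=64)

def io_worker():
    # Consume work items until the shutdown event is set; the timeout ensures
    # the event is re-checked even when the queue is idle.
    while not shutdown_event.is_set():
        try:
            item = work_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        # ... read `item` from disk/HDF5 and hand it to a processing thread ...
        work_queue.task_done()

worker = threading.Thread(target=io_worker, daemon=True)
worker.start()

# Clean shutdown: set the event, then join the thread.
shutdown_event.set()
worker.join()
```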
## Best Practices for Extensions

### Extending Datasets

When implementing new dataset types:

- Inherit from `BasePrefetchingDataset`
- Override `_read_raw_item` for custom data reading
- Use direct logger references (`logging.getLogger('h5data')`) for debug logging
- Maintain proper shutdown mechanisms by calling `super().close()`
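
A skeleton following these points; the import path, class name, and method bodies are placeholders for illustration, only the identifiers listed above come from the codebase:

```python
import logging

# Assumed import path; adjust to wherever BasePrefetchingDataset actually lives.
from linnaeus.h5data import BasePrefetchingDataset

h5data_logger = logging.getLogger('h5data')


class MyCustomDataset(BasePrefetchingDataset):
    def _read_raw_item(self, idx):
        # Custom raw read for original index `idx`; log through the named logger
        # so messages still appear when the dataset is wrapped.
        h5data_logger.debug(f"Reading raw item idx={idx}")
        raw = ...  # read image/labels from your storage backend
        return raw

    def close(self):
        # Release any custom resources first, then let the base class join its threads.
        super().close()
```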
### Working with Wrappers

When working with or extending wrapper classes:

- Pass through critical methods and properties to the base dataset
- Ensure proper index mapping between the local and global index spaces
- Avoid assuming logger instance variables will be accessible
- Be careful with identity comparisons: the wrapper and the base dataset are distinct objects
### Debugging Dataset Issues

For debugging data flow issues:

- Enable verbose logging with `DEBUG.DATASET.READ_ITEM_VERBOSE=True`
- Use direct logger references to ensure logs appear regardless of wrapper usage
- Track tensor identities (`id()`) and data pointers (`tensor.data_ptr()`) to detect unintended copying
- Check whether a dataset is wrapped by looking for a `base_dataset` attribute
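
A few short checks along these lines; `dataset` stands in for whatever object you are inspecting, and the `base_dataset` attribute check comes from the note above:

```python
import logging

import torch

# Is this dataset a subset wrapper? The wrappers expose a `base_dataset` attribute.
is_wrapped = hasattr(dataset, 'base_dataset')

# Log through the named logger so the message appears whether or not the dataset is wrapped.
logging.getLogger('h5data').debug(f"dataset wrapped={is_wrapped}")

# Detect unintended copying: matching id()/data_ptr() means the tensors share storage.
a = torch.zeros(4)
b = a
print(id(a) == id(b), a.data_ptr() == b.data_ptr())  # True True -> no copy was made
```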