Image Verification

Overview

The image verification feature detects and handles missing or corrupted image files in hybrid datasets, where images are stored on disk and the labels are stored separately in HDF5.

Components

The system consists of two main components:

Initial Verification: A parallel verification process that runs on dataset initialization
Runtime Fallback: A mechanism to handle missing images encountered during training

Configuration

Image verification is configured in the DATA.HYBRID section of your configuration file:

DATA:
  HYBRID:
    USE_HYBRID: True
    IMAGES_DIR: '/path/to/images'
    FILE_EXTENSION: '.jpg'
    ALLOW_MISSING_IMAGES: False  # Set to True to enable runtime fallback

    VERIFY_IMAGES:
      ENABLED: True              # Enable verification on initialization
      MAX_MISSING_RATIO: 0.01    # Maximum allowed missing ratio (1%)
      MAX_MISSING_COUNT: 100     # Maximum allowed missing count
      NUM_WORKERS: 8             # Workers for parallel verification
      CHUNK_SIZE: 1000           # Images per chunk for efficiency
      LOG_MISSING: True          # Log details about missing files
      REPORT_PATH: '{output_dir}/assets/missing_images_report.json'

Verification Modes

Initial Verification

When enabled, the system performs a parallel scan of all image files at dataset initialization:

Images are verified in parallel using a ThreadPoolExecutor
The missing count and missing ratio are checked against the configured thresholds (MAX_MISSING_RATIO, MAX_MISSING_COUNT)
A detailed report is generated in JSON format at the specified REPORT_PATH
If thresholds are exceeded, training fails early with a clear error message
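
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `find_missing` and `check_thresholds` are hypothetical names, and the sketch assumes that exceeding either threshold is enough to fail.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_missing(images_dir, identifiers, num_workers=8):
    """Return the indices of identifiers that have no file under images_dir."""
    root = Path(images_dir)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves input order, so result positions match identifiers.
        exists = list(pool.map(lambda name: (root / name).is_file(), identifiers))
    return [i for i, ok in enumerate(exists) if not ok]


def check_thresholds(missing_count, total, max_ratio=0.01, max_count=100):
    """Raise early if either limit is exceeded (assumed semantics)."""
    ratio = missing_count / total if total else 0.0
    if ratio > max_ratio or missing_count > max_count:
        raise RuntimeError(
            f"{missing_count}/{total} images missing ({ratio:.2%}); "
            f"limits: ratio <= {max_ratio}, count <= {max_count}"
        )
    return ratio
```

Threads suit this workload because each existence check is an I/O-bound file system call rather than CPU-bound work.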

Runtime Fallback

When ALLOW_MISSING_IMAGES is enabled:

Training continues even if images are missing at runtime
Missing images are replaced with all-zero placeholder images
Logging of missing images is capped so that many missing files do not flood the console

When ALLOW_MISSING_IMAGES is disabled:

Any missing image encountered during training causes an error
Training will terminate with a detailed error message
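
The two behaviours can be illustrated with a small wrapper around an image-loading function. This is a sketch under stated assumptions: `FallbackLoader`, `load_fn`, `shape`, and `max_warnings` are hypothetical names, and the placeholder is a nested list of zeros only to keep the example dependency-free (a real pipeline would more likely return a zero-filled array or tensor).

```python
import logging
from pathlib import Path

logger = logging.getLogger("image_fallback")


class FallbackLoader:
    """Wrap an image-loading callable with an optional zero-placeholder fallback."""

    def __init__(self, load_fn, shape=(224, 224, 3), allow_missing=False, max_warnings=5):
        self.load_fn = load_fn
        self.shape = shape                  # (height, width, channels) of the placeholder
        self.allow_missing = allow_missing  # mirrors ALLOW_MISSING_IMAGES
        self.max_warnings = max_warnings    # hypothetical cap on logged warnings
        self.missing_seen = 0

    def _placeholder(self):
        height, width, channels = self.shape
        return [[[0] * channels for _ in range(width)] for _ in range(height)]

    def __call__(self, path):
        try:
            return self.load_fn(Path(path))
        except OSError:  # includes FileNotFoundError
            if not self.allow_missing:
                raise  # strict mode: let the error stop training
            self.missing_seen += 1
            if self.missing_seen <= self.max_warnings:
                logger.warning("Missing image %s, using zeros", path)
            elif self.missing_seen == self.max_warnings + 1:
                logger.warning("Further missing-image warnings suppressed")
            return self._placeholder()
```

Counting every fallback, even after warnings are suppressed, keeps the total available for a summary at the end of training.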

!!! warning "Verification Limitation"
    The missing images threshold is calculated with respect to the entire labels file, not the specific subset of labels included in the dataset.

    Samples can be fully excluded from the dataset for various reasons (e.g., `DATA.PARTIAL.LEVELS=false` which excludes samples with null labels for enabled task keys, or similar flags for excluding samples missing metadata components), but these excluded samples are still counted in the verification process.
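
To make the limitation concrete, the following sketch compares the ratio verification would report with the ratio among the samples actually used. All numbers are hypothetical and `missing_ratios` is an illustrative helper, not part of the codebase.

```python
def missing_ratios(missing_total, labels_in_file, missing_in_subset, samples_in_subset):
    """Return (ratio over the whole labels file, ratio over the training subset)."""
    reported = missing_total / labels_in_file
    effective = missing_in_subset / samples_in_subset
    return reported, effective


# Hypothetical case: 100,000 labels in the file, 20,000 samples kept after
# filtering, and 500 missing images that all belong to kept samples.
reported, effective = missing_ratios(500, 100_000, 500, 20_000)
print(f"reported: {reported:.2%}, effective: {effective:.2%}")
```

Here the reported ratio is 0.5%, which passes a 1% threshold, while 2.5% of the samples actually trained on have no image. The opposite can also happen: missing images that belong only to excluded samples still count toward the thresholds.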

Detailed Report

The verification process generates a JSON report with detailed information:

{
  "total_images_checked": 366858,
  "missing_count": 1000,
  "missing_ratio": 0.002725850329010135,
  "images_dir": "/path/to/images",
  "verification_timestamp": "2025-04-15 18:43:00",
  "missing_identifiers": [
    "10196698_0.jpg",
    "102961696_0.jpg",
    ...
  ],
  "missing_indices": [
    486,
    517,
    ...
  ]
}

This report is saved to the location specified by REPORT_PATH (with {output_dir} automatically replaced with the actual experiment output directory).
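
A report with the fields shown above can be inspected with a few lines of Python. This is a sketch: `resolve_report_path` and `summarize_report` are hypothetical helpers, and the placeholder substitution shown here is only an assumption about how `{output_dir}` is expanded.

```python
import json
from pathlib import Path


def resolve_report_path(template, output_dir):
    """Expand the {output_dir} placeholder in a REPORT_PATH template."""
    # str.replace leaves any other braces in the template untouched.
    return Path(template.replace("{output_dir}", str(output_dir)))


def summarize_report(path, preview=5):
    """Load a verification report and return its headline numbers."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    checked = report["total_images_checked"]
    missing = report["missing_count"]
    return {
        "checked": checked,
        "missing": missing,
        "ratio": report["missing_ratio"],
        # Recomputed as a sanity check against the stored ratio.
        "recomputed_ratio": missing / checked if checked else 0.0,
        "first_missing": report["missing_identifiers"][:preview],
    }
```

Comparing the stored and recomputed ratios is a cheap way to confirm that the report was written completely.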

Best Practices

Initial Development: Start with ALLOW_MISSING_IMAGES=False to catch any dataset issues early
Production Use:
For maximum robustness, set ALLOW_MISSING_IMAGES=True and use reasonable thresholds
For maximum data integrity, use ALLOW_MISSING_IMAGES=False and ensure all images are available
Thresholds:
Set a reasonable MAX_MISSING_RATIO (e.g., 0.01 for 1%) to tolerate a small number of missing files
Use MAX_MISSING_COUNT as an absolute ceiling regardless of dataset size
Reports: Always review the JSON report after training to identify any issues with your dataset
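
How the two thresholds interact depends on dataset size. The sketch below assumes that a run must satisfy both limits; `binding_limit` is a hypothetical helper used only for illustration.

```python
import math


def binding_limit(total_images, max_ratio=0.01, max_count=100):
    """Return roughly how many missing images are tolerated under both limits."""
    allowed_by_ratio = math.floor(total_images * max_ratio)
    return min(allowed_by_ratio, max_count)


for total in (5_000, 10_000, 366_858):
    print(total, binding_limit(total))
```

With the example values (1% and 100), the ratio is the tighter limit for datasets below 10,000 images and the absolute count is the tighter limit above that size. Under this assumption, the example report shown earlier (1,000 missing out of 366,858, about 0.27%) would pass the ratio check but exceed the count ceiling.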

Implementation Details

The implementation is focused on high performance and robustness:

Uses ThreadPoolExecutor for efficient parallel I/O operations
Implements chunked processing for better memory management
Provides progress tracking for long-running verification operations
Uses optimized path existence checking for high-volume operations
Handles corrupt image files gracefully during training
Provides detailed error reporting for traceability
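
Chunked processing and progress tracking can be combined along these lines. This is an illustrative sketch rather than the actual implementation: `chunked`, `verify_chunked`, and the `on_progress` callback are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def chunked(items, size):
    """Yield (start_index, chunk) pairs covering items in order."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


def _check_chunk(images_dir, start, names):
    # One task per chunk spreads scheduling overhead over many cheap checks.
    return [
        start + offset
        for offset, name in enumerate(names)
        if not os.path.isfile(os.path.join(images_dir, name))
    ]


def verify_chunked(images_dir, identifiers, num_workers=8, chunk_size=1000, on_progress=None):
    """Return sorted indices of missing files, reporting progress per chunk."""
    missing, done = [], 0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = {
            pool.submit(_check_chunk, images_dir, start, names): len(names)
            for start, names in chunked(identifiers, chunk_size)
        }
        for future in as_completed(futures):
            missing.extend(future.result())
            done += futures[future]
            if on_progress is not None:
                on_progress(done, len(identifiers))
    # Chunks finish in arbitrary order, so sort before returning.
    return sorted(missing)
```

A callback such as `lambda done, total: print(f"{done}/{total}")` is enough for simple progress output.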