Metadata Masking in linnaeus

Overview

The metadata masking system in linnaeus is designed to improve model robustness when metadata is partial or missing. Real-world deployments often have incomplete metadata, and our models need to handle these cases gracefully.

Basic Metadata Masking

The basic metadata masking schedule randomly masks all metadata components during training with a configurable probability. This probability can be scheduled to decrease over time, allowing the model to gradually rely more on metadata as training progresses.

SCHEDULE:
  META_MASKING:
    ENABLED: True
    START_PROB: 1.0  # Start with 100% probability of masking all metadata
    END_PROB: 0.05   # End with 5% probability of masking all metadata
    END_FRACTION: 0.3  # Reach the END_PROB at 30% of training

This approach helps models learn to make predictions when metadata is entirely unavailable. However, it doesn't address cases where only some metadata components are missing.
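The schedule above can be sketched as a function of the training step. This is a minimal illustration, not the linnaeus implementation: the helper name `meta_mask_prob` and its arguments are hypothetical, and the linear shape of the decay is an assumption, since the configuration only fixes the start value, the end value, and the point at which the end value is reached.

```python
def meta_mask_prob(step: int, total_steps: int,
                   start_prob: float = 1.0,
                   end_prob: float = 0.05,
                   end_fraction: float = 0.3) -> float:
    """Hypothetical helper: probability of masking all metadata at `step`.

    Assumes a linear ramp from START_PROB to END_PROB that finishes at
    END_FRACTION of training, then stays at END_PROB.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    end_step = end_fraction * total_steps
    # Past the end of the ramp (or with a zero-length ramp) use END_PROB.
    if end_step <= 0 or step >= end_step:
        return end_prob
    progress = max(step, 0) / end_step  # 0.0 at the start, 1.0 at end_step
    return start_prob + (end_prob - start_prob) * progress


if __name__ == "__main__":
    for s in (0, 150, 300, 1000):
        print(s, meta_mask_prob(s, total_steps=1000))
```

With the defaults shown, the probability starts at 1.0, is roughly halfway between the two values at 15% of training, and holds at 0.05 from 30% of training onward.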

Granular Metadata Masking

The granular metadata masking feature extends the basic approach to handle partial metadata masking. Instead of masking all metadata or none, it allows selectively masking specific components or combinations of components during training and validation.

Configuration

To configure granular metadata masking, add a PARTIAL section to the META_MASKING configuration:

SCHEDULE:
  META_MASKING:
    ENABLED: True
    START_PROB: 1.0
    END_PROB: 0.05
    END_FRACTION: 0.3

    # Partial meta masking configuration
    PARTIAL:
      ENABLED: True  # Enable partial meta masking
      START_FRACTION: 0.1  # Start partial masking after 10% of training
      END_FRACTION: 0.9  # Continue until 90% of training

      # Probabilistic partial meta masking (new capability)
      START_PROB: 0.01  # Initial probability of applying partial meta masking (1%)
      END_PROB: 0.7     # Final probability of applying partial meta masking (70%)
      PROB_END_FRACTION: 0.5  # Reach END_PROB at 50% of training

      WHITELIST:  # Components to selectively mask
        - ["TEMPORAL"]
        - ["SPATIAL"]
        - ["ELEVATION"]
        - ["TEMPORAL", "SPATIAL"]
        - ["TEMPORAL", "ELEVATION"]
        - ["SPATIAL", "ELEVATION"]
      WEIGHTS: [1.0, 1.0, 1.0, 0.5, 0.5, 0.5]  # Optional weights for component combinations

The whitelist defines which combinations of metadata components should be masked. Each entry is a list of component names. When partial masking is active, a random combination from the whitelist is chosen, and those specific components are masked for each sample. The WEIGHTS parameter (optional) allows you to control the probability distribution for selecting combinations.
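As a rough sketch of the selection step, the snippet below draws one combination from the whitelist using the standard library. The helper `choose_combination` is hypothetical, and treating a missing WEIGHTS list as a uniform distribution is an assumption rather than documented behavior.

```python
import random

WHITELIST = [
    ["TEMPORAL"], ["SPATIAL"], ["ELEVATION"],
    ["TEMPORAL", "SPATIAL"], ["TEMPORAL", "ELEVATION"], ["SPATIAL", "ELEVATION"],
]
WEIGHTS = [1.0, 1.0, 1.0, 0.5, 0.5, 0.5]


def choose_combination(whitelist, weights=None, rng=random):
    """Hypothetical helper: pick one whitelist entry, optionally weighted."""
    if weights is not None and len(weights) != len(whitelist):
        raise ValueError("WEIGHTS must have one entry per WHITELIST combination")
    # random.choices treats weights=None as a uniform distribution (assumed
    # here to match the behavior when WEIGHTS is omitted).
    return rng.choices(whitelist, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(choose_combination(WHITELIST, WEIGHTS, rng))
```

With the weights above, each single-component entry is twice as likely to be drawn as each two-component entry.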

The probability scheduling parameters (START_PROB, END_PROB, and PROB_END_FRACTION or PROB_END_STEPS) control the likelihood of applying partial masking to samples that aren't already fully masked. This ensures that some proportion of samples retain all their metadata during training, which is important for model performance.

Training Behavior

During training, for each batch:

1. With probability meta_mask_prob, all metadata is masked (global meta masking).
2. Otherwise, for each remaining sample:
   - With probability partial_meta_mask_prob (scheduled from START_PROB to END_PROB), a random combination from the whitelist is chosen and only those specific metadata components are masked.
   - Otherwise (with probability 1 - partial_meta_mask_prob), all metadata is retained for that sample.

This approach ensures the model learns to handle various realistic partial-metadata states during training while still having some samples with full metadata available.
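The decision procedure described in this section can be sketched for a single sample as follows. This is an illustration under stated assumptions, not the linnaeus code: the function `sample_masked_components` is hypothetical, the two probabilities are passed in as already-scheduled values, and the full-masking draw is made per sample here, which is one possible reading of the description.

```python
import random


def sample_masked_components(all_components, meta_mask_prob,
                             partial_meta_mask_prob, whitelist,
                             weights=None, rng=random):
    """Hypothetical helper: return the set of components to mask for one sample."""
    # Step 1: full masking hides every metadata component.
    if rng.random() < meta_mask_prob:
        return set(all_components)
    # Step 2: partial masking hides one whitelisted combination.
    if rng.random() < partial_meta_mask_prob:
        return set(rng.choices(whitelist, weights=weights, k=1)[0])
    # Step 3: otherwise the sample keeps all of its metadata.
    return set()


if __name__ == "__main__":
    components = ["TEMPORAL", "SPATIAL", "ELEVATION"]
    whitelist = [["TEMPORAL"], ["SPATIAL", "ELEVATION"]]
    rng = random.Random(0)
    for _ in range(5):
        print(sample_masked_components(components, 0.05, 0.7, whitelist, rng=rng))
```

Because the partial draw only happens when the full-masking draw fails, the overall chance that a sample is partially masked is (1 - meta_mask_prob) * partial_meta_mask_prob.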

Validation with Partial Masking

To evaluate model performance with specific partial metadata combinations, you can configure validation passes that mask particular components:

SCHEDULE:
  VALIDATION:
    # Partial meta mask validation configuration
    PARTIAL_MASK_META:
      ENABLED: True
      STEP_FRACTION: 0.05  # Run every 5% of total steps
      WHITELIST:  # Component combinations to validate with
        - ["TEMPORAL"]
        - ["SPATIAL"]
        - ["ELEVATION"]
        - ["TEMPORAL", "SPATIAL"]

    # Final epoch exhaustive validation
    FINAL_EPOCH:
      EXHAUSTIVE_PARTIAL_META_VALIDATION: True
      EXHAUSTIVE_META_COMPONENTS:  # All components to generate combinations from
        - "TEMPORAL"
        - "SPATIAL"
        - "ELEVATION"

This configuration:

1. Runs periodic validation passes with each combination in the whitelist
2. Optionally performs an exhaustive validation at the final epoch, testing all possible combinations of the specified components (except the full set, which is redundant with standard masking)
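To make the size of the exhaustive pass concrete, the sketch below enumerates the combinations for the three components listed above. The helper `exhaustive_combinations` is hypothetical, and both the ordering and the exclusion of the empty combination (no components masked) are assumptions for illustration.

```python
from itertools import combinations


def exhaustive_combinations(components):
    """Hypothetical helper: every non-empty subset except the full set."""
    return [
        list(combo)
        for size in range(1, len(components))  # sizes 1 .. n-1 skip the full set
        for combo in combinations(components, size)
    ]


if __name__ == "__main__":
    for combo in exhaustive_combinations(["TEMPORAL", "SPATIAL", "ELEVATION"]):
        print(combo)
```

For n components this yields 2^n - 2 combinations, so three components produce six validation passes; the count grows quickly as components are added.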

Metrics Tracking

Partial mask validation results are tracked with phase names based on the masked components:

- val_mask_TEMPORAL for validation with only TEMPORAL masked
- val_mask_TEMPORAL_SPATIAL for validation with both TEMPORAL and SPATIAL masked

These metrics show how much each metadata component, or combination of components, contributes to model performance.
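The naming pattern in the two examples can be reproduced with a one-line join, which is handy when filtering logged metrics. The helper `phase_name` is hypothetical, and keeping the components in their whitelist order is an assumption inferred from the examples.

```python
def phase_name(components):
    """Hypothetical helper: build the validation phase name for a masked combination."""
    return "val_mask_" + "_".join(components)


if __name__ == "__main__":
    print(phase_name(["TEMPORAL"]))
    print(phase_name(["TEMPORAL", "SPATIAL"]))
```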

Usage Guidelines

Selecting Component Combinations

Choose whitelist combinations that reflect real-world deployment scenarios. For example:

- If temporal data is often unavailable, include ["TEMPORAL"]
- If spatial and elevation data tend to be missing together, include ["SPATIAL", "ELEVATION"]

Weighting Combinations

Use the WEIGHTS parameter to emphasize more common real-world scenarios. For example, if missing temporal data is twice as common as missing spatial data, use weights like:

WHITELIST:
  - ["TEMPORAL"]
  - ["SPATIAL"]
WEIGHTS: [2.0, 1.0]
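To check what a weight list implies, it can help to convert it to selection probabilities. The sketch below assumes the weights are relative and are normalized by their sum, as is usual for weighted sampling; the helper `selection_probabilities` is hypothetical.

```python
def selection_probabilities(weights):
    """Hypothetical helper: convert relative weights to probabilities."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights]


if __name__ == "__main__":
    # ["TEMPORAL"] is drawn about two thirds of the time, ["SPATIAL"] one third.
    print(selection_probabilities([2.0, 1.0]))
```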

Validation Strategy

Configure periodic validation with the most important metadata combinations, and use exhaustive validation at the end of training to get a complete understanding of component importance.