Taxonomy Representation in linnaeus
This document details how taxonomic hierarchies are represented and processed within the linnaeus framework, focusing on the transition from the raw hierarchy_map
to the centralized TaxonomyTree
class.
1. Biological Taxonomy Structure
Biological classification follows a hierarchical tree structure, organizing taxa into nested ranks of increasing specificity (e.g., Species < Genus < Family < Order < Class). Each taxon (except the ultimate root) has exactly one parent at the next higher rank but can have multiple children at the next lower rank. linnaeus typically works with ranks like L10 (Species), L20 (Genus), L30 (Family), L40 (Order).
2. The Raw hierarchy_map
2.1 Generation
During dataset preparation (vectorized_dataset_processor.py
), a data structure named hierarchy_map
is generated based on co-occurring labels observed across adjacent taxonomic levels in the training and validation data.
2.2 Structure
The hierarchy_map
has the following Python structure:
hierarchy_map: Dict[str, Dict[int, int]]
# Example: {
# "taxa_L10": { child_idx_L10: parent_idx_L20, ... },
# "taxa_L20": { child_idx_L20: parent_idx_L30, ... },
# ...
# }
- The outer keys are the child task level strings (e.g.,
"taxa_L10"
). - The inner dictionary maps the class index of a taxon at the child level (e.g.,
child_idx_L10
) to the class index of its parent at the next higher level (e.g.,parent_idx_L20
). - Crucially, it encodes the cross-level parent-child relationship using the class indices defined for each level.
2.3 Limitations
While containing the necessary information, this raw structure presented challenges:
- Interpretation: Its structure (
child_task -> {child_idx: parent_idx}
) required careful interpretation by consuming components. - Redundancy: Different components (label smoothing, hierarchical heads) independently parsed and processed this map, risking inconsistency.
- Lack of Validation: The raw map generation didn't inherently validate the overall tree structure for consistency (e.g., cycles, multiple parents).
3. The Centralized TaxonomyTree
Class
To address these limitations, linnaeus uses the TaxonomyTree
class (utils/taxonomy/taxonomy_tree.py
) as the central, validated representation of the hierarchy.
3.1 Purpose and Design
- Single Source of Truth: Provides a unified, validated representation of the taxonomy derived from the raw
hierarchy_map
. - Abstraction: Hides the raw map structure, offering a clean API for querying relationships.
- Validation: Performs structural validation (single parent, no cycles, index bounds) during initialization.
- Efficiency: Uses an internal bidirectional graph for efficient traversal.
- Extensibility: Centralizes taxonomy-related computations (e.g., distance calculation).
3.2 Initialization
The TaxonomyTree
is instantiated once during dataset processing (vectorized_dataset_processor.py
) after the raw hierarchy_map
and num_classes
are determined.
# Inside vectorized_dataset_processor.py
raw_hierarchy_map = self._generate_hierarchy_map()
num_classes = {task: len(mapping) for task, mapping in self.class_to_idx.items()}
taxonomy_tree = TaxonomyTree(raw_hierarchy_map, self.task_keys, num_classes)
3.3 Key Concepts and API
The TaxonomyTree
operates on Nodes, represented as Tuple[str, int]
, e.g., ("taxa_L10", 5)
.
- Traversal:
get_parent(node)
: Returns the parentNode
orNone
.get_children(node)
: Returns aList[Node]
of direct children.get_ancestors(node)
: Returns the pathList[Node]
from the node up to its root.get_descendants(node)
: Returns aList[Node]
of all nodes in the subtree below the node (including itself).
- Structure Query:
get_nodes_at_level(task_key)
: Returns allNode
s at a specific level.get_root_nodes()
: ReturnsList[Node]
of nodes with no parent in the provided map. These are the highest-level nodes available.get_leaf_nodes()
: ReturnsList[Node]
of nodes with no children in the provided map.
- Distance:
taxonomic_distance(node1, node2)
: Computes the shortest path distance (number of edges) between two nodes in the tree via their Lowest Common Ancestor (LCA). Returnsfloat('inf')
if nodes are disconnected.build_distance_matrix(task_key)
: Generates a[N, N]
matrix of pairwise distances between all nodes at the specified level.
- Helper Matrices:
build_hierarchy_matrices()
: Generates matrices mapping parent probabilities to child nodes, used by conditional heads. Structure:Dict[f"{parent_task}_{child_task}", Tensor[num_parent, num_child]]
.
- Validation:
- Implicitly validates structure during
__init__
. RaisesValueError
on critical issues (cycles, multiple parents).
- Implicitly validates structure during
4. Handling Metaclades (Forests)
The term "metaclade" refers to a dataset derived from multiple distinct top-level taxonomic roots (e.g., combining Insecta and Arachnida, both classes under the phylum Arthropoda).
- Detection Limitation: The
TaxonomyTree
, working only with thehierarchy_map
derived from the dataset labels, cannot definitively determine if the dataset represents a true biological metaclade. It can only identify nodes at the highest specifiedtask_key
level that lack a parent within the map. These might be true roots or simply the highest level provided. Verifying true independence requires external taxonomic databases (like theexpanded_taxa
table used in the shell scripts). - Handling Strategy: Components needing to handle potential multiple roots should:
- Check
len(taxonomy_tree.get_root_nodes())
. - Consult relevant configuration flags (e.g.,
LOSS.TAXONOMY_SMOOTHING.UNIFORM_ROOTS
) to decide on the appropriate behavior (e.g., apply uniform smoothing if multiple roots are detected and the flag is set).
- Check
5. Conclusion
The TaxonomyTree
class provides a robust, validated, and centralized interface to the taxonomic hierarchy within linnaeus, simplifying the logic within consuming components and ensuring consistent interpretation of the underlying taxonomic structure derived from the dataset.