expanded_taxa Table Documentation

Overview

The expanded_taxa table is a denormalized view that transforms iNaturalist's compact taxonomy representation into an expanded format optimized for efficient querying, filtering, and analysis. It "unpacks" the ancestry string from the original taxa table into discrete columns for each taxonomic rank level, enabling fast clade-based filtering and ancestor lookups without recursive string parsing.

Purpose and Design Philosophy

The table serves as the foundation for taxonomic data processing and export operations, providing: - Efficient clade filtering through indexed rank-level columns - Direct parent access via immediate ancestor columns - Fast taxonomic traversal without recursive queries - Common name integration from Catalog of Life Data Package (ColDP)

Core Table Schema

Primary Identification Columns

Column	Type	Description
taxonID	integer	Primary key; unique taxon identifier from iNaturalist
rankLevel	double precision	Numeric indicator of the taxonomic rank (lower = higher in hierarchy)
rank	varchar(255)	Human-readable taxonomic rank label (e.g., "species", "genus")
name	varchar(255)	Scientific name of the taxon
taxonActive	boolean	Indicates whether the taxon is currently active in iNaturalist's taxonomy
commonName	varchar(255)	Primary common name (populated from ColDP integration)

Immediate Ancestor Columns

These columns provide direct access to parent taxa in the taxonomic hierarchy, eliminating the need for recursive queries:

Column	Type	Description
immediateMajorAncestor_taxonID	integer	Taxon ID of the immediate major ancestor (next major rank up)
immediateMajorAncestor_rankLevel	double precision	Rank level of the immediate major ancestor
immediateAncestor_taxonID	integer	Direct parent taxon ID in the taxonomy
immediateAncestor_rankLevel	double precision	Rank level of the direct parent

Expanded Taxonomic Hierarchy Columns

For each standard taxonomic rank level, three columns are provided: - L{level}_taxonID - Taxon ID at that rank level - L{level}_name - Scientific name at that rank level - L{level}_commonName - Common name at that rank level (from ColDP)

Complete Rank Level Mapping

Level	Standard Rank	Column Prefix	Major Rank
5	subspecies	L5_	No
10	species	L10_	Yes
11	species group	L11_	No
12	species subgroup	L12_	No
13	species complex	L13_	No
15	hybrid	L15_	No
20	genus	L20_	Yes
24	subgenus	L24_	No
25	section	L25_	No
26	subsection	L26_	No
27	series	L27_	No
30	tribe	L30_	No
32	subtribe	L32_	No
33	supertribe	L33_	No
33.5	subfamily	L33_5_	No
34	family	L34_	Yes
34.5	epifamily	L34_5_	No
35	superfamily	L35_	No
37	infraorder	L37_	No
40	order	L40_	Yes
43	suborder	L43_	No
44	infraorder	L44_	No
45	superorder	L45_	No
47	infraclass	L47_	No
50	class	L50_	Yes
53	subclass	L53_	No
57	superclass	L57_	No
60	subphylum	L60_	No
67	phylum	L67_	Yes
70	kingdom	L70_	Yes

Note: Levels 33.5 and 34.5 use underscore notation (L33_5_, L34_5_) in column names.

Indexing Strategy

The table employs strategic indexing for optimal query performance:

Primary and Core Indexes

Primary key on taxonID
B-tree index on name for scientific name lookups
B-tree index on rankLevel for rank-based filtering

Taxonomic Level Indexes (for clade filtering)

idx_expanded_taxa_l10_taxonid - Species-level filtering
idx_expanded_taxa_l20_taxonid - Genus-level filtering
idx_expanded_taxa_l30_taxonid - Tribe-level filtering
idx_expanded_taxa_l40_taxonid - Order-level filtering
idx_expanded_taxa_l50_taxonid - Class-level filtering
idx_expanded_taxa_l60_taxonid - Subphylum-level filtering
idx_expanded_taxa_l70_taxonid - Kingdom-level filtering

Immediate Ancestor Indexes

idx_immediate_ancestor_taxon_id - For parent lookups
idx_immediate_major_ancestor_taxon_id - For major ancestor traversal

Common Use Cases

Species-Level Filtering

-- Get all records for a specific species
SELECT * FROM expanded_taxa
WHERE L10_taxonID = 47219;  -- Apis mellifera

Clade-Based Filtering

-- Get all bees (Anthophila epifamily)
SELECT * FROM expanded_taxa
WHERE L34_5_taxonID = 630955;

-- Get all birds (Aves class)
SELECT * FROM expanded_taxa
WHERE L50_taxonID = 3;

Ancestor Traversal

-- Find immediate parent
SELECT parent.*
FROM expanded_taxa child
JOIN expanded_taxa parent ON child.immediateAncestor_taxonID = parent.taxonID
WHERE child.taxonID = 47219;

Taxonomic Hierarchy Display

-- Show full taxonomic hierarchy for a species
SELECT
    name as species_name,
    L20_name as genus,
    L34_name as family,
    L40_name as order,
    L50_name as class,
    L67_name as phylum,
    L70_name as kingdom
FROM expanded_taxa
WHERE taxonID = 47219;

Finding Taxa with Common Names

-- Find all taxa with common names at species level
SELECT taxonID, name, commonName
FROM expanded_taxa
WHERE rankLevel = 10
  AND commonName IS NOT NULL;

Performance Considerations

Use Indexed Columns: Always filter on indexed LXX_taxonID columns for clade queries
Avoid String Operations: Use numeric taxonID comparisons rather than name matching when possible
Leverage Immediate Ancestors: Use the immediate ancestor columns for parent lookups instead of joins
Batch Operations: For large updates, use appropriate transaction boundaries

Important Notes on Ancestry Column

The ancestry column has been deprecated in production databases in favor of: - Immediate ancestor columns for direct parent relationships - Expanded LXX_taxonID columns for efficient hierarchical queries

Modern implementations should NOT rely on the ancestry column. The typus library handles both legacy (with ancestry) and modern (without ancestry) database schemas transparently.

Integration with Typus Library

The typus library provides async taxonomy services that work with this table structure:

from typus import PostgresTaxonomyService

# Connect to a database with expanded_taxa
service = PostgresTaxonomyService("postgresql+asyncpg://user:pass@host/db")

# Get a taxon - works with or without ancestry column
taxon = await service.get_taxon(47219)

# Traverse hierarchy using parent relationships
async for child in service.children(630955, depth=2):
    print(f"{child.scientific_name} (rank: {child.rank_level})")

Data Sources and Updates

Primary Source: iNaturalist taxonomy exports
Common Names: Catalog of Life Data Package (ColDP) via mapping tables
Update Frequency: Synchronized with iNaturalist data releases
Generation Process: Created via expand_taxa.sh script during ingestion pipeline

Version History

v0r1: Added immediate ancestor columns, complete ColDP integration
v0r0: Basic expanded taxonomy without common names or immediate ancestors

For implementation details and operational procedures, see the ingestion and ColDP integration documentation.