Official Dataset Provenance (ibrida-v0-r1)
This document details the generation process for the initial set of official Polli Linnaeus pre-trained models. These models (covering North American Mammalia, Amphibia, Reptilia, Aves, Primary Terrestrial Arthropoda, and Angiospermae) were trained on datasets derived from the ibrida-v0-r1
database. Understanding this provenance is crucial for interpreting model performance, taxonomic scope, and applicability.
For guidance on preparing your own custom datasets for Polli Linnaeus, please refer to Data Loading for Training.
Dataset Generation Tool: ibrida.generator
The datasets were generated using an internal tool called ibrida.generator
.
Note on
ibrida.generator
Availability: Theibrida.generator
tool is currently internal. We plan to open-source it in late summer 2025. If you have a pressing need for this tool before then, please open an issue on the Polli Linnaeus GitHub repository, and we may be able to prioritize its release based on demand.
Executive Summary of Dataset Provenance
The six datasets (Mammalia
, Amphibia
, Reptilia
, Aves
, PTA
, Angiospermae
) were generated from the ibrida-v0-r1
database, which is built from the iNaturalist Open Data dump of December 2024. The export process follows a sophisticated, multi-stage filtering pipeline designed to produce datasets suitable for training hierarchical taxonomic classifiers. All datasets share a common core logic, with minor variations in observation thresholds.
The key stages of the data generation process are as follows:
-
Source Data: All exports originate from the
ibrida-v0-r1
database, which includes elevation data for each observation. -
Regional Species Selection: The pipeline first identifies a set of "in-threshold" species. For all six datasets, this is defined as species having at least 50 research-grade observations (60 for PTA and Angiospermae) within the North America (
NAfull
) bounding box. -
Ancestor-Aware Expansion: This is a critical step for hierarchical modeling. For each "in-threshold" species, the pipeline traverses its entire taxonomic lineage (e.g., genus, family, order, etc.) using the pre-computed
expanded_taxa
table. The union of the initial species set and all their unique ancestors forms the final set of taxa for the dataset.- For all datasets except Amphibia, this "full ancestor search" (
fas
) includes both major (e.g., family, order) and minor (e.g., subfamily, tribe) taxonomic ranks. - For the Amphibia dataset, the ancestor search was configured to include only major ranks.
- For all datasets except Amphibia, this "full ancestor search" (
-
Global Observation Inclusion: Once the final taxon list (species + ancestors) is defined, the pipeline gathers all observations for these taxa globally, not just those within North America. A boolean flag
in_region
is computed for each observation to indicate if it originated within theNAfull
bounding box. ThisINCLUDE_OUT_OF_REGION_OBS=true
setting ensures the model is exposed to a wider geographic distribution for the target taxa. -
Cladistic and Quality Filtering: The globally gathered observations are then filtered to the specific taxonomic group (
CLADE
orMETACLADE
). The following quality filters are applied:- Quality Grade Filtering (
RG_FILTER_MODE
): A mode ofALL_EXCLUDE_SPECIES_NON_RESEARCH
is used. This means:- All
research
grade observations are kept. casual
orneeds_id
grade observations are only kept if they are not identified to the species level (i.e., they are coarse-level labels like genus or family). This preserves valuable partial labels while discarding lower-quality species-level data.
- All
- Partial Rank Wiping (
MIN_OCCURRENCES_PER_RANK
): To prevent the model from learning from insufficient data, any taxonomic rank (e.g., genus, family) with fewer than 50 total occurrences (60 for PTA and Angiospermae) in the final dataset has its label "wiped" (set to NULL) for the relevant observations. This effectively pushes rare taxa up to their next-most-common ancestor.
- Quality Grade Filtering (
-
Final Sampling and Export:
- Only the primary photo (
position=0
) for each observation is included (PRIMARY_ONLY=true
). - To prevent class imbalance from hyper-abundant species, the number of research-grade, species-level observations is capped via random sampling (
MAX_RN
). - The final export includes all
expanded_taxa
columns (providing full ancestry for each observation) and theelevation_meters
column.
- Only the primary photo (
This comprehensive process yields six large-scale, hierarchically-consistent datasets that are optimized for training robust, taxonomy-aware computer vision models.
Exhaustive Filtering Parameters Table
This table details the specific parameters from the ibrida-v0-r1
export wrappers used to generate each of the six datasets.
Parameter | Description | Mammalia | Amphibia | Reptilia | Aves | PTA (Arthropoda) | Angiospermae |
---|---|---|---|---|---|---|---|
Database & Release | |||||||
VERSION_VALUE |
Database schema version. | v0 |
v0 |
v0 |
v0 |
v0 |
v0 |
RELEASE_VALUE |
iNaturalist data release. | r1 |
r1 |
r1 |
r1 |
r1 |
r1 |
DB_NAME |
Source database name. | ibrida-v0-r1 |
ibrida-v0-r1 |
ibrida-v0-r1 |
ibrida-v0-r1 |
ibrida-v0-r1 |
ibrida-v0-r1 |
Regional Filtering | |||||||
REGION_TAG |
Bounding box for species selection. | NAfull |
NAfull |
NAfull |
NAfull |
NAfull |
NAfull |
MIN_OBS |
Min research-grade obs in-region for a species to be included. | 50 |
50 |
50 |
50 |
50 |
60 |
INCLUDE_OUT_OF_REGION_OBS |
If true, fetch all global obs for selected taxa. | true |
true |
true |
true |
true |
true |
Cladistic & Ancestor Filtering | |||||||
CLADE / METACLADE |
The root taxonomic group for the dataset. | mammalia |
amphibia |
reptilia |
aves |
pta |
angiospermae |
INCLUDE_MINOR_RANKS_IN_ANCESTORS |
If true, includes minor ranks (e.g., subfamily) in ancestor traversal. | true |
false |
true |
true |
true |
true |
Quality & Observation Filtering | |||||||
RG_FILTER_MODE |
Logic for filtering by quality_grade . |
ALL_EXCLUDE_SPECIES_NON_RESEARCH |
ALL_EXCLUDE_SPECIES_NON_RESEARCH |
ALL_EXCLUDE_SPECIES_NON_RESEARCH |
ALL_EXCLUDE_SPECIES_NON_RESEARCH |
ALL_EXCLUDE_SPECIES_NON_RESEARCH |
ALL_EXCLUDE_SPECIES_NON_RESEARCH |
MIN_OCCURRENCES_PER_RANK |
If a rank has fewer occurrences than this, its label is wiped. | 50 |
50 |
50 |
50 |
50 |
60 |
Export & Sampling | |||||||
MAX_RN |
Max random sample of research-grade observations per species. | 2500 |
2500 |
2500 |
2500 |
2750 |
1750 |
PRIMARY_ONLY |
If true, only export the primary photo for an observation. | true |
true |
true |
true |
true |
true |
INCLUDE_ELEVATION_EXPORT |
If true, include elevation_meters column in the final CSV. |
true |
true |
true |
true |
true |
true |
Source Wrapper Script | Script used for generation. | wrapper_mammalia_..._fas_elev.sh |
wrapper_amphibia_..._fas_elev.sh |
wrapper_reptilia_..._fas_elev.sh |
wrapper_aves_..._fas_elev.sh |
wrapper_pta_..._fas_elev.sh |
wrapper_angiospermae_..._fas_elev.sh |
Understanding Taxonomic Scope and Root Taxa
The filtering logic described above (especially Regional Species Selection and Ancestor-Aware Expansion) defines the taxonomic scope of each model. While a full list of included taxa for each dataset is extensive, the parameters provide insight into how this scope was determined. The CLADE
or METACLADE
parameter in the Exhaustive Filtering Parameters Table specifies the highest-level taxonomic group(s) from which all other taxa in the dataset descend.
Here are the root taxa definitions for the six initial Polli Linnaeus models:
-
Mammalia Model:
- Root: Class Mammalia
- Definition:
("L50_taxonID" = 40151)
(Taxon ID 40151 at taxonomic rank L50)
-
Amphibia Model:
- Root: Class Amphibia
- Definition:
("L50_taxonID" = 20978)
(Taxon ID 20978 at taxonomic rank L50)
-
Reptilia Model:
- Root: Class Reptilia
- Definition:
("L50_taxonID" = 26036)
(Taxon ID 26036 at taxonomic rank L50)
-
Aves Model:
- Root: Class Aves
- Definition:
("L50_taxonID" = 3)
(Taxon ID 3 at taxonomic rank L50)
-
PTA (Primary Terrestrial Arthropoda) Model:
- Type: Metaclade (derived from multiple roots)
- Roots: Class Insecta OR Class Arachnida
- Definition:
("L50_taxonID" = 47158 OR "L50_taxonID" = 47119)
(Taxon ID 47158 [Insecta] or Taxon ID 47119 [Arachnida] at taxonomic rank L50)
-
Angiospermae Model:
- Root: Subphylum Angiospermae (Flowering Plants)
- Definition:
("L57_taxonID" = 47125)
(Taxon ID 47125 at taxonomic rank L57)
This documentation aims to provide transparency into how the official Polli Linnaeus datasets were constructed, enabling users to better understand model behavior and make informed decisions about their application.