Reference Database

The molecules_reference.db reference database is the shared data foundation for the entire molecular component of IsoFind. It contains 156 standardized molecules across 11 families, 50 tabulated degradation pathways, 56 isotopic fractionations, and 49 parent-metabolite relationships. This page describes the technical structure of the database, the distinction between the global reference and the user catalog, the querying methods, and the enrichment procedures.

General Structure

The reference database is a SQLite database that is read-only from a project's perspective. It is distributed with the application and updated with each IsoFind release. Five tables structure the information, linked by foreign keys that ensure data consistency.

Table | Role | Rows
ref_molecules | Main molecule catalog with metadata and thresholds | 156
molecule_degradation_pathways | Degradation pathways with conditions, kinetics, metabolites | 50
molecule_isotope_fractionation | Isotopic fractionations by pathway and element | 56
molecule_metabolites | Parent-metabolite relations with yields | 49
ref_molecules_isotopes | Isotopic interpretation by molecule and element | 18
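
The row counts above can be checked against a local copy of the file. The following is a minimal sketch using Python's standard sqlite3 module; the helper names are illustrative and not part of IsoFind, and the file path is whatever your installation uses. Opening the file through a URI with mode=ro mirrors the read-only status described above.

```python
import sqlite3

# Table names as listed in the table above.
REFERENCE_TABLES = (
    "ref_molecules",
    "molecule_degradation_pathways",
    "molecule_isotope_fractionation",
    "molecule_metabolites",
    "ref_molecules_isotopes",
)


def open_read_only(path: str) -> sqlite3.Connection:
    """Open a SQLite file in read-only mode using a URI filename."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def count_rows(conn: sqlite3.Connection) -> dict[str, int]:
    """Return the row count of each reference table present in the database."""
    existing = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    counts = {}
    for table in REFERENCE_TABLES:
        if table in existing:
            # Table names come from the fixed tuple above, never from user input.
            counts[table] = conn.execute(
                f"SELECT COUNT(*) FROM {table}"
            ).fetchone()[0]
    return counts
```

Calling count_rows(open_read_only(path)) on the distributed file should report the figures from the table if the copy is intact.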

Table ref_molecules: The 24 Fields

The main table carries 24 columns covering chemical identity, taxonomy, regulatory thresholds, analytical limits, and catalog management. Some fields are systematically filled, while others are optional depending on the molecule.

Field | Type | Default | Filling Status
id | INTEGER | - | Auto-incremented primary key
nom | TEXT | - | Systematic (common name, mandatory)
nom_iupac | TEXT | - | Optional, 49% filled (80 nulls out of 156)
cas | TEXT | - | Systematic for catalog molecules
formule | TEXT | - | Systematic (chemical formula)
masse_molaire | REAL | - | Systematic (g/mol)
mz_principal | REAL | - | Systematic (main MS transition)
mz_secondaires | TEXT | - | 86% filled (comma-separated list)
famille | TEXT | - | Systematic (mandatory, 11 values)
sous_famille | TEXT | - | Systematic
type_polluant | TEXT | - | Organic (144) or inorganic (12)
niveau_acces | TEXT | 'gratuit' | Access level: free (stored as 'gratuit') / pro / defense
seuil_eau | REAL | - | 77% filled (36 molecules without EU/EPA threshold)
seuil_sol | REAL | - | 24% filled (scarcely documented in soil regulations)
seuil_unit | TEXT | 'µg/L' | 149 in µg/L, 4 in pg/L (dioxins), 3 in mg/L
reglementation | TEXT | - | Free text describing applicable frameworks
notes | TEXT | - | 74% filled (scientific comments)
unite_defaut | TEXT | 'µg/L' | 101 in µg/L, 45 in ng/L (PFAS, PAH), 6 in pg/L, 4 in mg/L
lod | REAL | - | Analytical Limit of Detection
loq | REAL | - | Limit of Quantification
methode_ref | TEXT | - | Applicable analysis standards (ISO, EPA, EN)
version_db | TEXT | '2.0' | Data version (126 records at '2.0', 30 at '2')
actif | INTEGER | 1 | Logical deactivation flag, 156 active
created_at | TEXT | datetime('now') | Timestamp of record creation

The actif flag allows a molecule to be removed from the visible catalog without being physically deleted, which preserves references from historical measurements to a deactivated molecule. All 156 current molecules are active.
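
To make the field list and the actif flag concrete, here is a hedged sketch that rebuilds the table in an in-memory SQLite database. Column names, types, and defaults come from the field table above; the AUTOINCREMENT keyword and the absence of NOT NULL or UNIQUE constraints are assumptions, since the real DDL is not shown on this page.

```python
import sqlite3

# Assumed DDL: only names, types, and defaults are documented above.
REF_MOLECULES_DDL = """
CREATE TABLE ref_molecules (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    nom            TEXT,
    nom_iupac      TEXT,
    cas            TEXT,
    formule        TEXT,
    masse_molaire  REAL,
    mz_principal   REAL,
    mz_secondaires TEXT,
    famille        TEXT,
    sous_famille   TEXT,
    type_polluant  TEXT,
    niveau_acces   TEXT    DEFAULT 'gratuit',
    seuil_eau      REAL,
    seuil_sol      REAL,
    seuil_unit     TEXT    DEFAULT 'µg/L',
    reglementation TEXT,
    notes          TEXT,
    unite_defaut   TEXT    DEFAULT 'µg/L',
    lod            REAL,
    loq            REAL,
    methode_ref    TEXT,
    version_db     TEXT    DEFAULT '2.0',
    actif          INTEGER DEFAULT 1,
    created_at     TEXT    DEFAULT (datetime('now'))
)
"""

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(REF_MOLECULES_DDL)

# Insert a row with only two fields so that the documented defaults apply.
conn.execute(
    "INSERT INTO ref_molecules (nom, famille) VALUES (?, ?)", ("Example", "PFAS")
)
row = conn.execute("SELECT * FROM ref_molecules").fetchone()
defaults = {
    key: row[key]
    for key in ("niveau_acces", "seuil_unit", "unite_defaut", "version_db", "actif")
}

# Deactivate instead of deleting: the row remains but leaves the visible catalog.
conn.execute("UPDATE ref_molecules SET actif = 0 WHERE id = ?", (row["id"],))
visible = conn.execute(
    "SELECT COUNT(*) FROM ref_molecules WHERE actif = 1"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM ref_molecules").fetchone()[0]
```

After the update, the molecule is still counted in total but no longer in visible, which is the behavior the actif flag is meant to provide.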

Full Example: PFOA Data Sheet

Below is the exact content of a database record, illustrating the typical richness of a well-documented entry.

id = 2
nom = 'PFOA'
nom_iupac = 'Perfluorooctanoic acid'
cas = '335-67-1'
formule = 'C8HF15O2'
masse_molaire = 414.07
mz_principal = 413.0
mz_secondaires = '169.0,219.0,269.0,319.0,369.0'
famille = 'PFAS'
sous_famille = 'PFCA-C8'
type_polluant = 'organique'
niveau_acces = 'gratuit'
seuil_eau = 0.1
seuil_sol = 2.0
seuil_unit = 'µg/L'
reglementation = 'EU 2020/2184 (sum of 4 PFAS ≤0.10); REACH Ann.XVII banned manuf. 2020'
unite_defaut = 'ng/L'
lod = 0.001
loq = 0.005
methode_ref = 'ISO 21675:2019; EN 17892:2023; EPA 537.1'
version_db = '2.0'
actif = 1

Several elements are noteworthy. The threshold is expressed in µg/L, but the default display unit is ng/L because laboratories report PFAS in ng/L for readability. IsoFind automatically converts between the two units before comparing a measurement against the threshold, so the 0.1 µg/L threshold is equivalent to 100 ng/L. Secondary MS transitions are stored as comma-separated text for flexibility, and three method standards are cited together to cover differing practices between US and European labs.
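
The unit handling can be illustrated with a short sketch. The function names below are hypothetical and IsoFind's internal implementation is not documented here; the code only shows the arithmetic for the four concentration units that appear in the field table.

```python
# Conversion factors to µg/L for the units listed in the field table.
TO_UG_PER_L = {"mg/L": 1e3, "µg/L": 1.0, "ng/L": 1e-3, "pg/L": 1e-6}


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a concentration between two of the supported units."""
    return value * TO_UG_PER_L[from_unit] / TO_UG_PER_L[to_unit]


def exceeds_threshold(
    value: float, unit: str, threshold: float, threshold_unit: str
) -> bool:
    """Compare a measurement with a threshold after bringing both to one unit."""
    return convert(value, unit, threshold_unit) > threshold


# PFOA from the record above: threshold 0.1 µg/L, results reported in ng/L.
pfoa_high = exceeds_threshold(150, "ng/L", 0.1, "µg/L")  # 150 ng/L is 0.15 µg/L
pfoa_low = exceeds_threshold(50, "ng/L", 0.1, "µg/L")    # 50 ng/L is 0.05 µg/L
```

With the PFOA values, a result of 150 ng/L is above the threshold and a result of 50 ng/L is below it.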

Reference vs. User Catalog

IsoFind distinguishes between two storage levels for molecules: the shared reference database (ref_molecules) and the user catalog specific to each project (user_molecules). This distinction is fundamental to the IsoFind data model.

Aspect | ref_molecules (Reference) | user_molecules (User Catalog)
Status | Read-only, distributed with IsoFind | Editable, project-specific
Scope | Common to all projects | Isolated per project
Evolution | Updated during IsoFind releases | Controlled by the user
Deletion | Impossible, deactivation via "actif" flag | Possible at the project level
Referenced by Measurements | No (no direct link) | Yes (foreign key molecule_id)

To use a reference molecule in a project, it must be explicitly imported into user_molecules. This one-time copy allows the user to locally adjust thresholds or LOQs without impacting other projects. The dedicated endpoint for this operation is POST /api/molecules/reference/{ref_id}/importer.

The import detects duplicates by CAS number: if a molecule with the same CAS already exists in the user catalog, the import returns the existing ID without creating a duplicate. This behavior protects against accidental repeated imports, but it does not catch the same molecule entered manually under a different name and a different CAS number.
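
The duplicate handling described above can be sketched as follows. This is not the server's implementation: the user_molecules columns are assumed for illustration, and both tables are placed in one in-memory database for simplicity, whereas the reference and the project catalog are separate stores in IsoFind.

```python
import sqlite3


def import_reference_molecule(conn: sqlite3.Connection, ref_id: int) -> int:
    """Copy a reference molecule into user_molecules, reusing a CAS match."""
    ref = conn.execute(
        "SELECT nom, cas, seuil_eau, seuil_unit, loq "
        "FROM ref_molecules WHERE id = ?",
        (ref_id,),
    ).fetchone()
    if ref is None:
        raise LookupError(f"no reference molecule with id {ref_id}")
    nom, cas, seuil_eau, seuil_unit, loq = ref

    # Duplicate detection by CAS number: return the existing ID, create nothing.
    existing = conn.execute(
        "SELECT id FROM user_molecules WHERE cas = ?", (cas,)
    ).fetchone()
    if existing is not None:
        return existing[0]

    # One-time copy: later local edits do not touch the reference row.
    cursor = conn.execute(
        "INSERT INTO user_molecules (nom, cas, seuil_eau, seuil_unit, loq) "
        "VALUES (?, ?, ?, ?, ?)",
        (nom, cas, seuil_eau, seuil_unit, loq),
    )
    return cursor.lastrowid


# Demo data: PFOA values are taken from the record shown earlier on this page.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ref_molecules (id INTEGER PRIMARY KEY, nom TEXT, cas TEXT,
                                seuil_eau REAL, seuil_unit TEXT, loq REAL);
    CREATE TABLE user_molecules (id INTEGER PRIMARY KEY, nom TEXT, cas TEXT,
                                 seuil_eau REAL, seuil_unit TEXT, loq REAL);
    INSERT INTO ref_molecules VALUES (2, 'PFOA', '335-67-1', 0.1, 'µg/L', 0.005);
    """
)
first_id = import_reference_molecule(conn, 2)
second_id = import_reference_molecule(conn, 2)
user_rows = conn.execute("SELECT COUNT(*) FROM user_molecules").fetchone()[0]
```

Importing the same reference molecule twice returns the same user ID and leaves a single row in user_molecules.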

Reference Module Endpoints

Six endpoints expose the reference database to client applications, all under the prefix /api/molecules/reference/.

Method | Path | Usage
GET | /reference/catalogue | Catalog filtered by family, access_level, text search
GET | /reference/familles | List of the 11 families with counts and access levels
GET | /reference/{ref_id} | Full data sheet of a reference molecule
GET | /reference/{ref_id}/isotopes | Associated isotopic data (CSIA and interpretations)
POST | /reference/{ref_id}/importer | Copies a molecule to the project's user_molecules
POST | /reference/importer-batch | Batch import using a list of identifiers

The catalog endpoint accepts three optional filter parameters: famille (exact match), niveau_acces (free / pro / defense), and q (text search on name, CAS, and formula). The default limit is 200 molecules per request, adjustable via the limit parameter.
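
As an illustration of these parameters, the sketch below builds a catalog request URL with Python's standard library. The helper name and the base URL in the usage note are placeholders, not documented values; only the path and the parameter names come from this page.

```python
from urllib.parse import urlencode

# Endpoint path from the table above, under the /api/molecules prefix.
CATALOGUE_PATH = "/api/molecules/reference/catalogue"


def catalogue_url(base_url, famille=None, niveau_acces=None, q=None, limit=None):
    """Build the catalog URL, keeping only the parameters that were provided."""
    params = {
        "famille": famille,
        "niveau_acces": niveau_acces,
        "q": q,
        "limit": limit,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + CATALOGUE_PATH
    return f"{url}?{query}" if query else url
```

For example, catalogue_url("http://localhost:8000", famille="PFAS", limit=50) requests at most 50 PFAS molecules; omitting limit leaves the server default of 200 in effect.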

The Families Endpoint

The /reference/familles endpoint is useful for populating navigation interfaces. It returns the list of families with their count and breakdown by access level, allowing the UI to display the number of available molecules per family based on the current license.

Family | Molecules | free / pro / defense
Pesticides | 38 | Distributed across molecules
PFAS | 26 | 23 / 3 / 0
Pharmaceuticals / EDs | 21 | Mixed
PAHs | 19 | 1 / 18 / 0
Chlorinated Solvents | 16 | 4 / 12 / 0
Explosives | 12 | Significant defense share
PCBs | 9 | Mixed
Perchlorates | 4 | Free
Dioxins / Furans | 4 | Pro
Cyanides | 4 | Free
Inorganics (oxyanions) | 3 | Free
Total | 156 | 67 / 77 / 12
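
A breakdown of this kind can be derived from ref_molecules with a single grouped query. The sketch below is one possible way to compute it, not the endpoint's actual code: the stored values 'pro' and 'defense', the restriction to active molecules, and the shape of the returned dictionaries are assumptions ('gratuit' is the documented default for niveau_acces).

```python
import sqlite3

FAMILLES_SQL = """
SELECT famille,
       COUNT(*)                        AS total,
       SUM(niveau_acces = 'gratuit')   AS gratuit,
       SUM(niveau_acces = 'pro')       AS pro,
       SUM(niveau_acces = 'defense')   AS defense
FROM ref_molecules
WHERE actif = 1
GROUP BY famille
ORDER BY total DESC, famille
"""


def famille_counts(conn: sqlite3.Connection) -> list[dict]:
    """Return one dict per family with its total and per-access-level counts."""
    cursor = conn.execute(FAMILLES_SQL)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


# Small demo dataset; family names and access levels here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ref_molecules (famille TEXT, niveau_acces TEXT, actif INTEGER)"
)
conn.executemany(
    "INSERT INTO ref_molecules VALUES (?, ?, ?)",
    [
        ("PFAS", "gratuit", 1),
        ("PFAS", "gratuit", 1),
        ("PFAS", "pro", 1),
        ("PAH", "pro", 1),
        ("PAH", "pro", 0),  # deactivated, excluded from the counts
    ],
)
breakdown = famille_counts(conn)
```

In SQLite a comparison evaluates to 0 or 1, so summing it counts the matching rows per family.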

Consistency and Validation

The reference database is verified at each IsoFind release through a series of automated checks that ensure data consistency. These checks cover both structure and content.

  • CAS Uniqueness: No duplicates allowed on the cas field for active molecules.
  • Family Consistency: The famille field value must belong to the closed list of 11 official families.
  • Threshold Plausibility: Values for seuil_eau must be strictly positive and below 10 mg/L (10,000 µg/L) once converted to µg/L.
  • Foreign Keys: Molecules cited in molecule_degradation_pathways, molecule_isotope_fractionation, and molecule_metabolites must all exist in ref_molecules.
  • Essential Metadata: Name, CAS, formula, molar mass, and family are mandatory.
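
Four of the checks above (CAS uniqueness, family consistency, threshold plausibility, essential metadata) can be sketched as a plain function over records represented as dictionaries. This is an illustrative approximation, not the release tooling: the function name is hypothetical, the list of official families is passed in because its stored values are not listed on this page, and the foreign-key check is left out since the linking column names are not documented.

```python
from collections import Counter

TO_UG_PER_L = {"mg/L": 1e3, "µg/L": 1.0, "ng/L": 1e-3, "pg/L": 1e-6}
MANDATORY = ("nom", "cas", "formule", "masse_molaire", "famille")
MAX_SEUIL_EAU_UG_L = 10_000  # 10 mg/L expressed in µg/L


def validate(rows, official_families):
    """Return a list of human-readable problems found in the given records."""
    errors = []

    # CAS uniqueness among active molecules.
    active_cas = Counter(
        r["cas"] for r in rows if r.get("actif", 1) and r.get("cas")
    )
    for cas, n in active_cas.items():
        if n > 1:
            errors.append(f"duplicate CAS {cas} on {n} active molecules")

    for r in rows:
        label = r.get("nom") or f"id={r.get('id')}"
        # Essential metadata must be present.
        for field in MANDATORY:
            if r.get(field) in (None, ""):
                errors.append(f"{label}: missing {field}")
        # Family must belong to the closed list.
        if r.get("famille") and r["famille"] not in official_families:
            errors.append(f"{label}: unknown famille {r['famille']!r}")
        # Threshold must be positive and below 10 mg/L once converted to µg/L.
        seuil = r.get("seuil_eau")
        if seuil is not None:
            unit = r.get("seuil_unit") or "µg/L"
            factor = TO_UG_PER_L.get(unit)
            if factor is None or not 0 < seuil * factor < MAX_SEUIL_EAU_UG_L:
                errors.append(f"{label}: implausible seuil_eau {seuil} {unit}")
    return errors
```

An empty list means the records pass these four checks; each problem is reported separately so that a release report can list them all at once.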

Enrichment and Evolution

The database evolves by version. The version_db field of each record indicates the version under which it was created or last modified. Most records (126) carry version 2.0; the remaining 30 still carry version 2, which marks older additions that have not been revised since the format migration.

Future enrichments focus on four identified axes.

Enrichment Axis | Planned Examples
Molecular Catalog Extension | BTEX, C10-C40 hydrocarbons, phthalates, brominated flame retardants
CSIA Densification | CSIA fractionations for neonicotinoid pesticides, missing chlorinated solvents
Missing Degradation Pathways | Aerobic pathways for 4+ ring PAHs, aqueous PFAS photolysis
Soil Thresholds | Closing the gap for the seuil_sol field, currently only 24% filled

User contributions to the reference database can be submitted to IsoFind SAS for inclusion in future releases. The procedure requires a verifiable bibliographic reference for each addition. Purely local enrichments remain stored in the project's user catalog without being uploaded to the shared repository.

Backup and Integrity

The reference database is a binary SQLite file. Its disk location depends on the IsoFind installation configuration. It is loaded at startup and cached for frequent queries. Accidental file corruption is detected at loading via integrity checks; in such cases, the molecular module returns empty lists with the flag ref_disponible: false rather than throwing an error that would block the application.

This graceful degradation allows the user to keep working on existing measurements (which point to user_molecules) while the reference database is restored. Restoration consists of replacing the file with the one provided by the installer; no data migration is necessary.

Further Reading