CSRD/ESRS sustainability disclosures,
structured and queryable.
A working reference implementation of the Big-4 “Sustainability Data Hub” pattern: ingest CAC 40 sustainability PDFs, extract ESRS metrics with Claude + Mistral, land them in a Snowflake star-schema warehouse modelled with dbt, and surface them on this dashboard.
Built end-to-end on real disclosures from LVMH, TotalEnergies, and Schneider Electric — every metric carries page-level source citation and a confidence score that routes uncertain extractions to a human review queue.
Right now in the warehouse
· Real ESRS metrics extracted: 32
· Published mart (conf ≥ 0.80): 14
· Human review queue: 18
· dbt models: 7 · pytest cases: 167
1,100+ ESRS datapoints. 300-page PDFs. Annual deadline.
The EU's Corporate Sustainability Reporting Directive (CSRD) mandates that every large listed company publish detailed ESG disclosures using the European Sustainability Reporting Standards (ESRS). Wave 1 covers FY2024 reports. The reports come out as PDFs. Investors, banks, regulators, and corporate compliance teams need them structured and queryable — not as PDFs.
This is the problem for which Capgemini, Deloitte, PwC, KPMG, and EY are currently selling solutions to French G-SIBs (BNP Paribas, Société Générale, Crédit Agricole, BPCE), under names like Sustainability Data Hub, ESG Reporting Manager, and CSRD 360 Navigator. The architectural pattern is consistent. CSRD-Lake is its open-source reference implementation.
PDFs → warehouse → dashboard.
Six layers, each independently testable, each with a quality gate.
Ingest → Extract → Confidence-route → Land → Model → Test
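The confidence-route stage is a plain threshold gate: extractions at or above 0.80 go to the published mart, everything else lands in the human review queue. A minimal sketch (the `ExtractedMetric` shape and `route` helper are illustrative, not the project's actual code; only the 0.80 cutoff comes from this page):

```python
from dataclasses import dataclass

PUBLISH_THRESHOLD = 0.80  # metrics at or above this go straight to the published mart

@dataclass
class ExtractedMetric:
    esrs_code: str      # e.g. "E1-6" (gross GHG emissions)
    value: float
    source_page: int    # page-level citation back to the source PDF
    confidence: float   # extraction confidence from the LLM chain

def route(metric: ExtractedMetric) -> str:
    """Route a metric to the published mart or the human review queue."""
    return "published_mart" if metric.confidence >= PUBLISH_THRESHOLD else "review_queue"

m = ExtractedMetric("E1-6", 1_240_000.0, source_page=212, confidence=0.91)
print(route(m))  # → published_mart
```

The gate is deliberately dumb: anything uncertain is a human's problem, not the warehouse's.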
Modern data stack, end-to-end.
What's real, what's a stub.
Real and validated end-to-end
· 3 real CAC 40 sustainability PDFs (LVMH, TotalEnergies, Schneider)
· 32 ESRS metrics extracted via Claude + Mistral fallback chain
· Snowflake warehouse currently powers this snapshot — DDL, key-pair auth, marts built, 52 of 54 dbt tests pass
· DuckDB local target also fully working — same dbt models, same row counts, byte-identical export
· 167 pytest cases, ~91% coverage, GitHub Actions CI
· Live dashboard deployed to Vercel, statically prerendered
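The "byte-identical export" parity claim between the DuckDB and Snowflake targets is the kind of thing a few lines of stdlib Python can verify. A sketch (file names and the `exports_identical` helper are illustrative, not the repo's actual harness):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a mart export so two warehouse runs can be compared byte-for-byte."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def exports_identical(duckdb_export: Path, snowflake_export: Path) -> bool:
    return sha256_of(duckdb_export) == sha256_of(snowflake_export)

# Demo: two files standing in for the DuckDB and Snowflake mart exports.
a, b = Path("mart_duckdb.csv"), Path("mart_snowflake.csv")
a.write_text("company,metric,value\nLVMH,E1-6,1240000\n")
b.write_text("company,metric,value\nLVMH,E1-6,1240000\n")
print(exports_identical(a, b))  # → True
```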
Stubs and open work
· 7 of 10 manifest companies pending PDF ingestion
· Airflow DAG is defined but runs are executed via a Python CLI; the DAG exists to make the orchestration pattern visible
· 14 rows fail the source-snippet-contains-value test because the LLM normalises “129 million” → 129000000; flagging values that cannot be matched back to their source snippet is exactly the hallucination class the test exists to catch
· Hand-verified gold-set accuracy claim still pending
· Portfolio exposure values are synthetic and clearly labelled as such
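The failing check above can be sketched in a few lines: a naive substring test rejects 129000000 against a snippet that says "129 million", while a normalisation-aware version accepts it. Everything here (`snippet_contains_value`, `normalised_forms`) is an illustrative assumption, not the repo's actual pytest code:

```python
def normalised_forms(value: float) -> set[str]:
    """Surface forms a PDF snippet might use for a numeric value, including
    the 'million'/'billion' phrasings an LLM tends to expand to raw digits."""
    forms = {f"{value:g}", f"{value:,.0f}", f"{value:,.0f}".replace(",", " ")}
    for word, scale in (("million", 1e6), ("billion", 1e9)):
        if value >= scale and value % scale == 0:
            forms.add(f"{value / scale:g} {word}")
    return forms

def snippet_contains_value(snippet: str, value: float) -> bool:
    """Pass if the extracted value appears in the cited snippet in any surface form."""
    text = snippet.lower()
    return any(form.lower() in text for form in normalised_forms(value))

snippet = "revenue of 129 million euros"
print("129000000" in snippet)                        # → False (naive check rejects the row)
print(snippet_contains_value(snippet, 129_000_000))  # → True  (normalised forms match)
```

Whether to widen the test this way or keep it strict and let humans clear the queue is a judgment call; the strict version errs toward catching hallucinations.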