CSRD-Lake
FY2024 · live snapshot extracted 2026-05-03

CSRD/ESRS sustainability disclosures,
structured and queryable.

A working reference implementation of the Big-4 “Sustainability Data Hub” pattern: ingest CAC 40 sustainability PDFs, extract ESRS metrics with Claude + Mistral, land them in a Snowflake star-schema warehouse modelled with dbt, surface them on this dashboard.

Built end-to-end on real disclosures from LVMH, TotalEnergies, and Schneider Electric — every metric carries a page-level source citation and a confidence score that routes uncertain extractions to a human review queue.

Right now in the warehouse

· Real ESRS metrics extracted: 32 (from 3 of 10 CAC 40 companies)

· Published mart (conf ≥ 0.80): 14 (cleared the confidence gate — joinable, citable)

· Human review queue: 18 (below 0.80 — held back from publication)

· dbt models · pytest cases: 7 · 167 (54 dbt tests + 3 custom data-integrity tests)

The problem

1,100+ ESRS datapoints. 300-page PDFs. Annual deadline.

The EU's Corporate Sustainability Reporting Directive (CSRD) mandates that every large listed company publish detailed ESG disclosures using the European Sustainability Reporting Standards (ESRS). Wave 1 covers FY2024 reports. The reports come out as PDFs. Investors, banks, regulators, and corporate compliance teams need them structured and queryable — not as PDFs.

This is the problem for which Capgemini, Deloitte, PwC, KPMG, and EY are currently selling solutions to French G-SIBs (BNP Paribas, Société Générale, Crédit Agricole, BPCE), under names like Sustainability Data Hub, ESG Reporting Manager, and CSRD 360 Navigator. The architectural pattern behind these offerings is consistent. CSRD-Lake is its open-source reference implementation.

How it works

PDFs → warehouse → dashboard.

Six layers, each independently testable, each with a quality gate.

01 · Ingest

TOML manifest of 10 CAC 40 companies. Direct PDF download from each issuer's investor-relations site, with retry, atomic write, and PDF magic-byte validation.
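
For illustration, a minimal sketch of what this step could look like, assuming hypothetical names (the manifest shape, download_pdf, and ingest are not the repo's actual API): tenacity-driven retry, write-then-rename for atomicity, and a %PDF magic-byte check before a file is kept.

# Sketch only: retry + atomic write + magic-byte validation.
# Manifest shape and function names are assumptions, not the repo's API.
import tomllib
from pathlib import Path

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

PDF_MAGIC = b"%PDF"

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=30))
def download_pdf(url: str, dest: Path) -> Path:
    resp = httpx.get(url, follow_redirects=True, timeout=60.0)
    resp.raise_for_status()
    if not resp.content.startswith(PDF_MAGIC):
        raise ValueError(f"{url} did not return a PDF")
    tmp = dest.with_suffix(".part")   # write-then-rename keeps half-written files out of the corpus
    tmp.write_bytes(resp.content)
    tmp.replace(dest)                 # atomic rename on POSIX filesystems
    return dest

def ingest(manifest_path: Path, out_dir: Path) -> None:
    manifest = tomllib.loads(manifest_path.read_text())
    out_dir.mkdir(parents=True, exist_ok=True)
    for company in manifest["companies"]:   # assumed shape: [[companies]] with name / pdf_url
        download_pdf(company["pdf_url"], out_dir / f"{company['name']}.pdf")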

02 · Extract

Per (PDF × ESRS topic): keyword-filter to relevant pages, send to Claude Sonnet, fall back to Mistral Large on rate-limit or malformed JSON. Output validated by Pydantic.
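
Roughly, the fallback chain might look like the sketch below; the model ids, the ESRSMetric fields, and the JSON-array response shape are assumptions rather than the repo's actual schema.

# Sketch of the Claude-first, Mistral-fallback extraction call.
import json

import anthropic
from mistralai import Mistral
from pydantic import BaseModel, ValidationError

class ESRSMetric(BaseModel):   # assumed shape of one extracted datapoint
    metric_code: str
    value: float
    unit: str
    source_page: int
    source_snippet: str

def extract_topic(prompt: str, claude: anthropic.Anthropic, mistral: Mistral) -> list[ESRSMetric]:
    """Try Claude Sonnet first; fall back to Mistral Large on rate limits or bad JSON."""
    try:
        resp = claude.messages.create(
            model="claude-sonnet-4-20250514",   # assumed model id
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = resp.content[0].text
        return [ESRSMetric(**m) for m in json.loads(raw)]
    except (anthropic.RateLimitError, json.JSONDecodeError, ValidationError):
        resp = mistral.chat.complete(
            model="mistral-large-latest",
            messages=[{"role": "user", "content": prompt}],
        )
        raw = resp.choices[0].message.content
        return [ESRSMetric(**m) for m in json.loads(raw)]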

03 · Confidence-route

Every metric scored on logprob × structural pass × snippet-match × language-match. Below 0.80 lands in the human review queue, never the published mart.
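
A sketch of the gate, assuming the three non-logprob components act as pass/fail multipliers; the exact weighting is not spelled out above, so treat the formula as illustrative.

# Sketch of the 0.80 gate. Treating structural / snippet / language checks as
# pass-fail multipliers on the logprob score is an assumption about the formula.
from dataclasses import dataclass

PUBLISH_THRESHOLD = 0.80

@dataclass
class ScoredMetric:
    metric_code: str
    logprob_score: float     # model confidence, normalised to [0, 1]
    structural_pass: bool    # Pydantic validation succeeded
    snippet_match: bool      # extracted value appears in the quoted snippet
    language_match: bool     # snippet language matches the report language

    @property
    def confidence(self) -> float:
        return (
            self.logprob_score
            * float(self.structural_pass)
            * float(self.snippet_match)
            * float(self.language_match)
        )

def route(metrics: list[ScoredMetric]) -> tuple[list[ScoredMetric], list[ScoredMetric]]:
    """Split into (published, review_queue): only scores >= 0.80 reach the published mart."""
    published = [m for m in metrics if m.confidence >= PUBLISH_THRESHOLD]
    review = [m for m in metrics if m.confidence < PUBLISH_THRESHOLD]
    return published, review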

04 · Land

Bulk-insert into raw.disclosure_extracted. Same column shape on Snowflake (cloud) and DuckDB (local) — the current dashboard snapshot is sourced from Snowflake. Loader uses parameterised executemany; same metric_to_row mapping for both backends.
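
A sketch of the loader under assumed column names (the real raw.disclosure_extracted schema may differ); the only backend-specific detail is the parameter placeholder style.

# Sketch of the loader. Column names are assumptions; the point is one
# metric_to_row mapping and one parameterised executemany for both backends.
COLUMNS = (
    "company_id", "metric_code", "value", "unit",
    "period", "source_page", "source_snippet", "confidence_score",
)

def insert_sql(placeholder: str) -> str:
    cols = ", ".join(COLUMNS)
    params = ", ".join([placeholder] * len(COLUMNS))
    return f"INSERT INTO raw.disclosure_extracted ({cols}) VALUES ({params})"

def metric_to_row(metric: dict) -> tuple:
    """Single mapping shared by both backends, so the column shape cannot drift."""
    return tuple(metric[col] for col in COLUMNS)

def load_snowflake(conn, metrics: list[dict]) -> None:
    rows = [metric_to_row(m) for m in metrics]
    conn.cursor().executemany(insert_sql("%s"), rows)   # snowflake-connector-python uses pyformat
    conn.commit()

def load_duckdb(conn, metrics: list[dict]) -> None:
    rows = [metric_to_row(m) for m in metrics]
    conn.executemany(insert_sql("?"), rows)             # DuckDB uses qmark placeholders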

05 · Model

dbt star schema: stg_disclosure (dedupe on natural key) → dim_company / dim_metric (auto-extending) / dim_period → fact_disclosure → published + review_queue marts.
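
The staging dedupe is the least obvious piece. Assuming (company_id, metric_code, period) is the natural key and an extracted_at timestamp exists, it behaves roughly like the DuckDB query below, shown from Python rather than as the actual dbt model.

# Sketch of the staging dedupe. Natural key and the extracted_at column are assumptions.
import duckdb

DEDUPE_SQL = """
    SELECT * EXCLUDE (rn)
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY company_id, metric_code, period
                   ORDER BY extracted_at DESC
               ) AS rn
        FROM raw.disclosure_extracted
    )
    WHERE rn = 1   -- keep the most recent extraction per natural key
"""

def build_stg_disclosure(con: duckdb.DuckDBPyConnection) -> None:
    con.execute(f"CREATE OR REPLACE VIEW stg_disclosure AS {DEDUPE_SQL}")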

06 · Test

54 generic dbt tests + 3 custom: source_snippet contains value, confidence_score in [0,1], published and review marts disjoint. Surfaces real LLM hallucinations at build time.
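
The repo's custom tests are dbt data tests, but the same three invariants can be mirrored as pytest cases against the local DuckDB target; the table names and database path below are assumptions.

# Sketch: the three data-integrity checks as pytest cases against the local DuckDB
# target. Table/column names and the database path are assumptions.
import duckdb
import pytest

DB_PATH = "csrd_lake.duckdb"   # assumed local dbt target

@pytest.fixture(scope="module")
def con():
    with duckdb.connect(DB_PATH, read_only=True) as c:
        yield c

def test_snippet_contains_value(con):
    # Every published value must literally appear in its cited source snippet.
    missing = con.execute(
        "SELECT count(*) FROM published "
        "WHERE source_snippet NOT LIKE '%' || CAST(value AS VARCHAR) || '%'"
    ).fetchone()[0]
    assert missing == 0

def test_confidence_in_unit_interval(con):
    out_of_range = con.execute(
        "SELECT count(*) FROM fact_disclosure "
        "WHERE confidence_score < 0 OR confidence_score > 1"
    ).fetchone()[0]
    assert out_of_range == 0

def test_marts_are_disjoint(con):
    overlap = con.execute(
        "SELECT count(*) FROM published "
        "JOIN review_queue USING (company_id, metric_code, period)"
    ).fetchone()[0]
    assert overlap == 0
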
Tech stack

Modern data stack, end-to-end.

Ingestion: Python 3.12 · httpx · tenacity · Pydantic v2
Extraction: Anthropic Claude Sonnet · Mistral Large · pdfplumber · structlog
Warehouse: Snowflake (live snapshot) · DuckDB (local) · snowflake-connector-python · RSA key-pair auth
Transform: dbt 1.11 · dbt-snowflake · dbt-duckdb · dbt-utils
Orchestration: Airflow 2.10 · TaskFlow API · Dynamic task mapping
Dashboard: Next.js 16 · React 19 · Tailwind v4 · shadcn primitives
DevEx: uv lockfile · ruff · mypy strict · pytest (167 cases) · GitHub Actions

Honest scope

What's real, what's a stub.

Real and validated end-to-end

· 3 real CAC 40 sustainability PDFs (LVMH, TotalEnergies, Schneider)

· 32 ESRS metrics extracted via Claude + Mistral fallback chain

· Snowflake warehouse currently powers this snapshot — DDL, key-pair auth, marts built, 52 of 54 dbt tests pass

· DuckDB local target also fully working — same dbt models, same row counts, byte-identical export

· 167 pytest cases, ~91% coverage, GitHub Actions CI

· Live dashboard deployed to Vercel, statically prerendered

Stubs and open work

· 7 of 10 manifest companies pending PDF ingestion

· Airflow DAG defined for orchestration-pattern visibility, but runs are currently triggered via a Python CLI

· 14 rows fail the source-snippet-contains-value test (LLM normalises “129 million” → 129000000) — exactly the hallucination class the test is designed to catch

· No accuracy claim yet: evaluation against a hand-verified gold set is still pending

· Portfolio exposure values are synthetic and clearly labelled as such

Pick your next click.