CSRD-Lake
FY2024 · live snapshot extracted 2026-05-03

CSRD/ESRS sustainability disclosures,
structured and queryable.

A working reference implementation of the Big-4 “Sustainability Data Hub” pattern: ingest CAC 40 sustainability PDFs, extract ESRS metrics with Claude + Mistral, land them in a Snowflake star-schema warehouse modelled with dbt, surface them on this dashboard.

Built end-to-end on real disclosures from LVMH, TotalEnergies, and Schneider Electric — every metric carries a page-level source citation and a confidence score that routes uncertain extractions to a human review queue.

Right now in the warehouse

· Real ESRS metrics extracted: 32 (from 3 of 10 CAC 40 companies)

· Published mart (conf ≥ 0.80): 14 (cleared the confidence gate — joinable, citable)

· Human review queue: 18 (below 0.80 — held back from publication)

· dbt models · pytest cases: 7 · 167 (54 dbt tests + 3 custom data-integrity tests)

The problem

1,100+ ESRS datapoints. 300-page PDFs. Annual deadline.

The EU's Corporate Sustainability Reporting Directive (CSRD) mandates that every large listed company publish detailed ESG disclosures using the European Sustainability Reporting Standards (ESRS). Wave 1 covers FY2024 reports. The reports come out as PDFs. Investors, banks, regulators, and corporate compliance teams need them structured and queryable — not as PDFs.

This is the problem for which Capgemini, Deloitte, PwC, KPMG, and EY are currently selling solutions to French G-SIBs (BNP Paribas, Société Générale, Crédit Agricole, BPCE), under names like Sustainability Data Hub, ESG Reporting Manager, and CSRD 360 Navigator. The architectural pattern behind these offerings is consistent. CSRD-Lake is its open-source reference implementation.

How it works

PDFs → warehouse → dashboard.

Six layers, each independently testable, each with a quality gate.

01 · Ingest

TOML manifest of 10 CAC 40 companies. Direct PDF download from each issuer's investor-relations site, with retry, atomic write, and PDF magic-byte validation.
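
For illustration, a minimal sketch of what this step could look like, assuming hypothetical names (the manifest shape, download_pdf, and ingest are not the repo's actual API): tenacity-driven retry, write-then-rename for atomicity, and a %PDF magic-byte check before a file is kept.

# Sketch only: retry + atomic write + magic-byte validation.
# Manifest shape and function names are assumptions, not the repo's API.
import tomllib
from pathlib import Path

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

PDF_MAGIC = b"%PDF"

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=30))
def download_pdf(url: str, dest: Path) -> Path:
    resp = httpx.get(url, follow_redirects=True, timeout=60.0)
    resp.raise_for_status()
    if not resp.content.startswith(PDF_MAGIC):
        raise ValueError(f"{url} did not return a PDF")
    tmp = dest.with_suffix(".part")   # write-then-rename keeps half-written files out of the corpus
    tmp.write_bytes(resp.content)
    tmp.replace(dest)                 # atomic rename on POSIX filesystems
    return dest

def ingest(manifest_path: Path, out_dir: Path) -> None:
    manifest = tomllib.loads(manifest_path.read_text())
    out_dir.mkdir(parents=True, exist_ok=True)
    for company in manifest["companies"]:   # assumed shape: [[companies]] with name / pdf_url
        download_pdf(company["pdf_url"], out_dir / f"{company['name']}.pdf")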

02 · Extract

Per (PDF × ESRS topic): keyword-filter to relevant pages, send to Claude Sonnet, fall back to Mistral Large on rate-limit or malformed JSON. Output validated by Pydantic.
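
Roughly, the fallback chain might look like the sketch below; the model ids, the ESRSMetric fields, and the JSON-array response shape are assumptions rather than the repo's actual schema.

# Sketch of the Claude-first, Mistral-fallback extraction call.
import json

import anthropic
from mistralai import Mistral
from pydantic import BaseModel, ValidationError

class ESRSMetric(BaseModel):   # assumed shape of one extracted datapoint
    metric_code: str
    value: float
    unit: str
    source_page: int
    source_snippet: str

def extract_topic(prompt: str, claude: anthropic.Anthropic, mistral: Mistral) -> list[ESRSMetric]:
    """Try Claude Sonnet first; fall back to Mistral Large on rate limits or bad JSON."""
    try:
        resp = claude.messages.create(
            model="claude-sonnet-4-20250514",   # assumed model id
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = resp.content[0].text
        return [ESRSMetric(**m) for m in json.loads(raw)]
    except (anthropic.RateLimitError, json.JSONDecodeError, ValidationError):
        resp = mistral.chat.complete(
            model="mistral-large-latest",
            messages=[{"role": "user", "content": prompt}],
        )
        raw = resp.choices[0].message.content
        return [ESRSMetric(**m) for m in json.loads(raw)]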

03 · Confidence-route

Every metric scored on logprob × structural pass × snippet-match × language-match. Below 0.80 lands in the human review queue, never the published mart.
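
A sketch of the gate, assuming the three non-logprob components act as pass/fail multipliers; the exact weighting is not spelled out above, so treat the formula as illustrative.

# Sketch of the 0.80 gate. Treating structural / snippet / language checks as
# pass-fail multipliers on the logprob score is an assumption about the formula.
from dataclasses import dataclass

PUBLISH_THRESHOLD = 0.80

@dataclass
class ScoredMetric:
    metric_code: str
    logprob_score: float     # model confidence, normalised to [0, 1]
    structural_pass: bool    # Pydantic validation succeeded
    snippet_match: bool      # extracted value appears in the quoted snippet
    language_match: bool     # snippet language matches the report language

    @property
    def confidence(self) -> float:
        return (
            self.logprob_score
            * float(self.structural_pass)
            * float(self.snippet_match)
            * float(self.language_match)
        )

def route(metrics: list[ScoredMetric]) -> tuple[list[ScoredMetric], list[ScoredMetric]]:
    """Split into (published, review_queue): only scores >= 0.80 reach the published mart."""
    published = [m for m in metrics if m.confidence >= PUBLISH_THRESHOLD]
    review = [m for m in metrics if m.confidence < PUBLISH_THRESHOLD]
    return published, review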

04 · Land

Bulk-insert into raw.disclosure_extracted. Same column shape on Snowflake (cloud) and DuckDB (local) — the current dashboard snapshot is sourced from Snowflake. Loader uses parameterised executemany; same metric_to_row mapping for both backends.
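
A sketch of the loader under assumed column names (the real raw.disclosure_extracted schema may differ); the only backend-specific detail is the parameter placeholder style.

# Sketch of the loader. Column names are assumptions; the point is one
# metric_to_row mapping and one parameterised executemany for both backends.
COLUMNS = (
    "company_id", "metric_code", "value", "unit",
    "period", "source_page", "source_snippet", "confidence_score",
)

def insert_sql(placeholder: str) -> str:
    cols = ", ".join(COLUMNS)
    params = ", ".join([placeholder] * len(COLUMNS))
    return f"INSERT INTO raw.disclosure_extracted ({cols}) VALUES ({params})"

def metric_to_row(metric: dict) -> tuple:
    """Single mapping shared by both backends, so the column shape cannot drift."""
    return tuple(metric[col] for col in COLUMNS)

def load_snowflake(conn, metrics: list[dict]) -> None:
    rows = [metric_to_row(m) for m in metrics]
    conn.cursor().executemany(insert_sql("%s"), rows)   # snowflake-connector-python uses pyformat
    conn.commit()

def load_duckdb(conn, metrics: list[dict]) -> None:
    rows = [metric_to_row(m) for m in metrics]
    conn.executemany(insert_sql("?"), rows)             # DuckDB uses qmark placeholders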

05 · Model

dbt star schema: stg_disclosure (dedupe on natural key) → dim_company / dim_metric (auto-extending) / dim_period → fact_disclosure → published + review_queue marts.
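
The staging dedupe is the least obvious piece. Assuming (company_id, metric_code, period) is the natural key and an extracted_at timestamp exists, it behaves roughly like the DuckDB query below, shown from Python rather than as the actual dbt model.

# Sketch of the staging dedupe. Natural key and the extracted_at column are assumptions.
import duckdb

DEDUPE_SQL = """
    SELECT * EXCLUDE (rn)
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY company_id, metric_code, period
                   ORDER BY extracted_at DESC
               ) AS rn
        FROM raw.disclosure_extracted
    )
    WHERE rn = 1   -- keep the most recent extraction per natural key
"""

def build_stg_disclosure(con: duckdb.DuckDBPyConnection) -> None:
    con.execute(f"CREATE OR REPLACE VIEW stg_disclosure AS {DEDUPE_SQL}")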

06 · Test

54 generic dbt tests + 3 custom: source_snippet contains value, confidence_score in [0,1], published and review marts disjoint. Surfaces real LLM hallucinations at build time.
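
The repo's custom tests are dbt data tests, but the same three invariants can be mirrored as pytest cases against the local DuckDB target; the table names and database path below are assumptions.

# Sketch: the three data-integrity checks as pytest cases against the local DuckDB
# target. Table/column names and the database path are assumptions.
import duckdb
import pytest

DB_PATH = "csrd_lake.duckdb"   # assumed local dbt target

@pytest.fixture(scope="module")
def con():
    with duckdb.connect(DB_PATH, read_only=True) as c:
        yield c

def test_snippet_contains_value(con):
    # Every published value must literally appear in its cited source snippet.
    missing = con.execute(
        "SELECT count(*) FROM published "
        "WHERE source_snippet NOT LIKE '%' || CAST(value AS VARCHAR) || '%'"
    ).fetchone()[0]
    assert missing == 0

def test_confidence_in_unit_interval(con):
    out_of_range = con.execute(
        "SELECT count(*) FROM fact_disclosure "
        "WHERE confidence_score < 0 OR confidence_score > 1"
    ).fetchone()[0]
    assert out_of_range == 0

def test_marts_are_disjoint(con):
    overlap = con.execute(
        "SELECT count(*) FROM published "
        "JOIN review_queue USING (company_id, metric_code, period)"
    ).fetchone()[0]
    assert overlap == 0
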
Tech stack

Modern data stack, end-to-end.

Ingestion: Python 3.12 · httpx · tenacity · Pydantic v2
Extraction: Anthropic Claude Sonnet · Mistral Large · pdfplumber · structlog
Warehouse: Snowflake (live snapshot) · DuckDB (local) · snowflake-connector-python · RSA key-pair auth
Transform: dbt 1.11 · dbt-snowflake · dbt-duckdb · dbt-utils
Orchestration: Airflow 2.10 · TaskFlow API · Dynamic task mapping
Dashboard: Next.js 16 · React 19 · Tailwind v4 · shadcn primitives
DevEx: uv lockfile · ruff · mypy strict · pytest (167 cases) · GitHub Actions

Honest scope

What's real, what's a stub.

Real and validated end-to-end

· 3 real CAC 40 sustainability PDFs (LVMH, TotalEnergies, Schneider)

· 32 ESRS metrics extracted via Claude + Mistral fallback chain

· Snowflake warehouse currently powers this snapshot — DDL, key-pair auth, marts built, 52 of 54 dbt tests pass

· DuckDB local target also fully working — same dbt models, same row counts, byte-identical export

· 167 pytest cases, ~91% coverage, GitHub Actions CI

· Live dashboard deployed to Vercel, statically prerendered

Stubs and open work

· 7 of 10 manifest companies pending PDF ingestion

· Airflow DAG defined for orchestration-pattern visibility, but runs are currently triggered via a Python CLI

· 14 rows fail the source-snippet-contains-value test (LLM normalises “129 million” → 129000000) — exactly the hallucination class the test is designed to catch

· No accuracy claim yet: evaluation against a hand-verified gold set is still pending

· Portfolio exposure values are synthetic and clearly labelled as such

Pick your next click.