What does the confidence score mean?
A composite of four internal signals, not a claim of factual correctness; the source citation on every row is the actual verification handle. The four signals (a sketch of how they combine follows the list):
1. Self-reported confidence: the model rates its own certainty in [0, 1]. Useful but weak alone; models hallucinate confidently.
2. Schema validation: the output parses into a valid Pydantic ESRSMetric shape. A hard fail drops the score to 0.
3. Source containment: the verbatim source text we returned literally contains the extracted value. Strong circumstantial evidence; a failure halves the score.
4. Language check: the manifest-claimed language matches the detected language (cross-checked via langdetect; a placeholder in v1, always passing).
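Concretely, the combination could look like the minimal sketch below. This is an illustration, not the pipeline's actual code: only the hard-fail-to-0 and halving behaviours are documented; the field names on ESRSMetric, the use of the self-rating as the base score, and the language penalty are assumptions.

```python
from pydantic import BaseModel, ValidationError

class ESRSMetric(BaseModel):
    # Illustrative fields only; the real ESRSMetric schema may differ.
    metric_code: str
    value: float
    unit: str
    source_page: int
    source_snippet: str

def confidence_score(raw_output: dict, self_rating: float) -> float:
    # Signal 2, schema validation: a hard parse failure zeroes the score.
    try:
        metric = ESRSMetric.model_validate(raw_output)  # pydantic v2
    except ValidationError:
        return 0.0

    # Signal 1: the model's self-rated certainty, clamped to [0, 1],
    # serves as the base score (an assumption about the weighting).
    score = min(max(self_rating, 0.0), 1.0)

    # Signal 3, source containment: failure halves the score.
    value_str = str(int(metric.value)) if metric.value.is_integer() else str(metric.value)
    if value_str not in metric.source_snippet:
        score *= 0.5

    # Signal 4, language check: placeholder in v1, always passes.
    language_ok = True  # langdetect cross-check stubbed out for now
    if not language_ok:
        score *= 0.5  # penalty weight is an assumption

    return score
```

Note the asymmetry: a schema failure is fatal while a containment failure only attenuates, which is one more reason a non-zero score is not a correctness claim.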
What the score doesn't catch: column confusion in tables (extracting FY2023 instead of FY2024), unit mistakes (kt read as tonnes), or values picked from chart captions instead of the disclosure proper. The custom dbt test metric_value_in_source_text catches LLM normalisations (e.g. “129 million” → 129000000) and is currently flagging 14 rows in the warehouse for review.
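The test itself lives in dbt as SQL; for intuition, here is a rough Python rendering of the containment-plus-normalisation logic. The scale-word list and the function name are assumptions, not the test's actual implementation.

```python
import re

# Scale words an LLM tends to expand during extraction (assumed list).
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

def _plain(value: float) -> str:
    # Render 129000000.0 as '129000000'; 12.5 stays '12.5'.
    return str(int(value)) if float(value).is_integer() else str(value)

def classify_containment(value: float, snippet: str) -> str:
    """Return 'verbatim', 'normalised', or 'missing'. Rows classified
    'normalised' (snippet says '129 million', stored value is 129000000)
    are the ones flagged for review."""
    flat = snippet.replace(",", "")  # tolerate thousands separators
    if re.search(rf"\b{re.escape(_plain(value))}\b", flat):
        return "verbatim"
    # Look for 'N <scale-word>' phrasings that multiply out to the value.
    for m in re.finditer(r"(\d+(?:\.\d+)?)\s*(thousand|million|billion)", flat, re.I):
        if float(m.group(1)) * SCALES[m.group(2).lower()] == value:
            return "normalised"
    return "missing"
```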
How correctness is actually verified: every row carries (source_page, source_snippet). A human can open the source PDF at the cited page and verify any value in seconds. The 800-datapoint hand-verified gold set (see README, planned v1.1) is what would let us claim a percentage accuracy; until then, treat published-mart values as system-validated, not human-validated.
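The first pass of that human check can itself be scripted. A small sketch using pypdf; the helper name and the 1-indexed source_page convention are assumptions:

```python
from pypdf import PdfReader

def _squash(text: str) -> str:
    # Collapse whitespace so line breaks in the PDF don't break matching.
    return " ".join(text.split())

def spot_check(pdf_path: str, source_page: int, source_snippet: str) -> bool:
    """True if the cited snippet appears on the cited page. PDF text
    extraction is lossy (ligatures, hyphenation), so treat False as
    'open the page and look', not 'the row is wrong'."""
    page_text = PdfReader(pdf_path).pages[source_page - 1].extract_text() or ""
    return _squash(source_snippet) in _squash(page_text)
```

A pass here is still system-level evidence, consistent with the distinction above: only the gold set would upgrade a value to human-validated.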