What does the confidence score mean?
A composite of four internal signals, not a claim of factual correctness; the source citation on every row is the actual verification handle. The four signals (a sketch of how they combine follows the list):
1. Self-reported confidence: the model rates its own certainty in [0, 1]. Useful but weak alone; models hallucinate confidently.
2. Schema validation: the output parses into a valid Pydantic ESRSMetric shape. A hard fail drops the score to 0.
3. Source containment: the verbatim source text we returned literally contains the extracted value. Strong circumstantial evidence; a failure halves the score.
4. Language check: the manifest-claimed language matches the detected language (cross-checked via langdetect; a placeholder in v1, always passing).
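Concretely, the combination could look like the minimal sketch below. This is an illustration, not the pipeline's actual code: only the hard-fail-to-0 and halving behaviours are documented; the field names on ESRSMetric, the use of the self-rating as the base score, and the language penalty are assumptions.

```python
from pydantic import BaseModel, ValidationError

class ESRSMetric(BaseModel):
    # Illustrative fields only; the real ESRSMetric schema may differ.
    metric_code: str
    value: float
    unit: str
    source_page: int
    source_snippet: str

def confidence_score(raw_output: dict, self_rating: float) -> float:
    # Signal 2, schema validation: a hard parse failure zeroes the score.
    try:
        metric = ESRSMetric.model_validate(raw_output)  # pydantic v2
    except ValidationError:
        return 0.0

    # Signal 1: the model's self-rated certainty, clamped to [0, 1],
    # serves as the base score (an assumption about the weighting).
    score = min(max(self_rating, 0.0), 1.0)

    # Signal 3, source containment: failure halves the score.
    value_str = str(int(metric.value)) if metric.value.is_integer() else str(metric.value)
    if value_str not in metric.source_snippet:
        score *= 0.5

    # Signal 4, language check: placeholder in v1, always passes.
    language_ok = True  # langdetect cross-check stubbed out for now
    if not language_ok:
        score *= 0.5  # penalty weight is an assumption

    return score
```

Note the asymmetry: a schema failure is fatal while a containment failure only attenuates, which is one more reason a non-zero score is not a correctness claim.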
What the score doesn't catch: column confusion in tables (extracting FY2023 instead of FY2024), unit mistakes (kt read as tonnes), or values picked from chart captions instead of the disclosure proper. The custom dbt test metric_value_in_source_text catches LLM normalisations (e.g. “129 million” → 129000000) and is currently flagging 14 rows in the warehouse for review.
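The test itself lives in dbt as SQL; for intuition, here is a rough Python rendering of the containment-plus-normalisation logic. The scale-word list and the function name are assumptions, not the test's actual implementation.

```python
import re

# Scale words an LLM tends to expand during extraction (assumed list).
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

def _plain(value: float) -> str:
    # Render 129000000.0 as '129000000'; 12.5 stays '12.5'.
    return str(int(value)) if float(value).is_integer() else str(value)

def classify_containment(value: float, snippet: str) -> str:
    """Return 'verbatim', 'normalised', or 'missing'. Rows classified
    'normalised' (snippet says '129 million', stored value is 129000000)
    are the ones flagged for review."""
    flat = snippet.replace(",", "")  # tolerate thousands separators
    if re.search(rf"\b{re.escape(_plain(value))}\b", flat):
        return "verbatim"
    # Look for 'N <scale-word>' phrasings that multiply out to the value.
    for m in re.finditer(r"(\d+(?:\.\d+)?)\s*(thousand|million|billion)", flat, re.I):
        if float(m.group(1)) * SCALES[m.group(2).lower()] == value:
            return "normalised"
    return "missing"
```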
How correctness is actually verified: every row carries (source_page, source_snippet). A human can open the source PDF at the cited page and verify any value in seconds. The 800-datapoint hand-verified gold set (see README, planned v1.1) is what would let us claim a percentage accuracy; until then, treat published-mart values as system-validated, not human-validated.
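The first pass of that human check can itself be scripted. A small sketch using pypdf; the helper name and the 1-indexed source_page convention are assumptions:

```python
from pypdf import PdfReader

def _squash(text: str) -> str:
    # Collapse whitespace so line breaks in the PDF don't break matching.
    return " ".join(text.split())

def spot_check(pdf_path: str, source_page: int, source_snippet: str) -> bool:
    """True if the cited snippet appears on the cited page. PDF text
    extraction is lossy (ligatures, hyphenation), so treat False as
    'open the page and look', not 'the row is wrong'."""
    page_text = PdfReader(pdf_path).pages[source_page - 1].extract_text() or ""
    return _squash(source_snippet) in _squash(page_text)
```

A pass here is still system-level evidence, consistent with the distinction above: only the gold set would upgrade a value to human-validated.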