
Part 2: Automated Enrichment Plan

Overview

If key details are missing from an article, such as date of birth, middle name, nationality, or occupation, the system is more likely to default to refer. Automated enrichment would attempt to resolve these gaps through targeted web research before escalating to an analyst, reducing unnecessary referrals without introducing false confidence.

The central risk is that enrichment can make screening worse. Searching for "John Smith date of birth" and finding a DOB belonging to a different John Smith could actively mislead the verdict rather than clarify it. The plan below sets out how I would implement autonomous enrichment with that risk firmly in mind.


Trigger Conditions

Enrichment should ideally only fire when all three of the following conditions are met (if I were to actually implement this, I would test that hypothesis first):

  1. The current verdict is refer, or the score is borderline: Enrichment is not needed when confidence is already sufficient for match or discard
  2. A specific signal is missing that would materially change the score: This could be DOB, but also middle name (especially where a common first / last name combination has a unique middle name), nationality, or occupation (a rare occupation such as "astronaut" is a strong discriminating signal even without DOB)
  3. The name match is at least possible: Enrichment is probably not worth attempting when the name signal is already unlikely or no_match, and risks finding corroborating information about the wrong person
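The three trigger conditions above can be sketched as a single gate function. This is a minimal sketch under stated assumptions: the field names, verdict labels, and borderline thresholds are placeholders that would need tuning against real referral data.

```python
from dataclasses import dataclass

@dataclass
class ScreeningState:
    verdict: str          # "match" | "refer" | "discard"
    score: float          # composite score in [0.0, 1.0]
    name_match: str       # "exact" | "possible" | "unlikely" | "no_match"
    missing_signals: set  # e.g. {"dob", "middle_name"}

# Hypothetical thresholds and signal set -- would need calibration.
BORDERLINE = (0.4, 0.6)
MATERIAL_SIGNALS = {"dob", "middle_name", "nationality", "occupation"}

def should_enrich(state: ScreeningState) -> bool:
    """All three trigger conditions must hold before enrichment fires."""
    borderline = BORDERLINE[0] <= state.score <= BORDERLINE[1]
    cond1 = state.verdict == "refer" or borderline      # verdict is refer / borderline
    cond2 = bool(state.missing_signals & MATERIAL_SIGNALS)  # a material signal is missing
    cond3 = state.name_match in ("exact", "possible")   # name match is at least possible
    return cond1 and cond2 and cond3
```
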

Source Strategy

Sources are queried in strict priority order. Higher-trust structured sources are always consulted before open web search.

Tier 1 — High-authority structured sources

  • Wikipedia REST API: reliable DOBs for public figures, free, always queried first
  • Wikidata (SPARQL): machine-readable structured attributes (birthDate, occupation, nationality), multilingual aliases, and sameAs links; particularly useful for non-English subjects
  • Companies House API / national business registries: live director and officer data, useful for corporate figures
  • GLEIF / LEI registry: links individuals to legal entities via structured registration data
  • SEC EDGAR: for US-listed company officers and filings
  • Official government sources: parliamentary membership pages, sanctions lists
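As an illustration of the Wikidata tier, a minimal sketch of query construction and result parsing. The SPARQL endpoint and property IDs (P569 birth date, P106 occupation, P27 citizenship) are real; the wrapper functions are assumptions, and no network call is made here.

```python
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata endpoint

def build_person_query(name: str, lang: str = "en") -> str:
    """Build a SPARQL query for a person's structured attributes by label."""
    return f"""
    SELECT ?person ?dob ?occupationLabel ?countryLabel WHERE {{
      ?person rdfs:label "{name}"@{lang} .
      OPTIONAL {{ ?person wdt:P569 ?dob . }}        # date of birth
      OPTIONAL {{ ?person wdt:P106 ?occupation . }} # occupation
      OPTIONAL {{ ?person wdt:P27 ?country . }}     # country of citizenship
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}
    }} LIMIT 10
    """

def parse_bindings(response_json: dict) -> list[dict]:
    """Flatten the standard SPARQL JSON results format into simple dicts."""
    rows = []
    for b in response_json.get("results", {}).get("bindings", []):
        rows.append({k: v.get("value") for k, v in b.items()})
    return rows
```
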

Tier 2 — Semi-structured sources

  • Publisher biography pages and organisational websites (universities, law firms, companies). Many expose Schema.org Person markup with names, affiliations, dates, and sameAs links, which can be parsed directly without LLM involvement
  • Verified news archive searches cross-referencing the same person across multiple articles
  • LinkedIn and Twitter/X could be potential sources but are explicitly excluded from automated enrichment due to API restrictions and terms of service constraints. If a particular article contained the LinkedIn / Twitter handle of the person in question, inclusion would be more likely to be considered. Depending on a cost-benefit analysis (how much it costs vs how much we / the client benefit), a licensed commercial data provider such as LexisNexis, Refinitiv World-Check, or Dow Jones Risk & Compliance could be a helpful alternative, given they already aggregate exactly this kind of biographical data legally (and are often what banks use in practice for this kind of work)

Tier 3 — Search-engine mediated discovery

  • Bing Custom Search (Azure AI): Grounding with Bing Search in Azure AI Agents (the new viable Microsoft-stack option for general search)
  • Google Programmable Search JSON API / Gemini Search with grounding (some of the big search providers appear to be moving to LLMs / agents for search, replacing their previous search APIs; I would want to test heavily before relying on this)

Open web search should be used as a last resort, and only for rare names — a high volume of conflicting results across sources is itself a signal that the name is common, and enrichment should be treated as inconclusive in that case. Multiple independent sources must corroborate any finding before it influences the verdict.

Query Construction

The enrichment system should not issue a single query. A query set is constructed deterministically from the input identity plus any signals already extracted from the article:

  • Full canonical name + DOB year
  • Full name + known employer or occupation from article
  • Nickname variant + surname
  • Reordered name for East Asian or other cultural conventions
  • Transliterated forms where relevant
  • Name + city or country from article
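The deterministic query-set construction above can be sketched as follows. The identity and signal field names are assumptions about the upstream schema, not a defined interface.

```python
def build_query_set(identity: dict, article_signals: dict) -> list[str]:
    """Deterministically expand one identity into a bounded, ordered query set."""
    name = identity["canonical_name"]
    queries = []
    if identity.get("dob_year"):                       # full name + DOB year
        queries.append(f'"{name}" {identity["dob_year"]}')
    if article_signals.get("employer"):                # name + employer from article
        queries.append(f'"{name}" {article_signals["employer"]}')
    if article_signals.get("occupation"):              # name + occupation from article
        queries.append(f'"{name}" {article_signals["occupation"]}')
    surname = name.split()[-1]
    for nick in identity.get("nicknames", []):         # nickname variant + surname
        queries.append(f'"{nick} {surname}"')
    if identity.get("reordered_name"):                 # cultural reordering
        queries.append(f'"{identity["reordered_name"]}"')
    for t in identity.get("transliterations", []):     # transliterated forms
        queries.append(f'"{t}"')
    if article_signals.get("location"):                # name + location from article
        queries.append(f'"{name}" {article_signals["location"]}')
    seen, ordered = set(), []                          # de-dupe, keep priority order
    for q in queries:
        if q not in seen:
            seen.add(q)
            ordered.append(q)
    return ordered
```
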

Using an LLM could be genuinely useful here (prior to its main use in validation): not as a decision-maker, but as a query planner that expands the search space by generating plausible script variants, transliterations, and cultural reorderings for multilingual cases. The enrichment model is used for query expansion and evidence normalisation, not for identity resolution.

Retrieval is bounded: a maximum of 5–10 candidate pages per enrichment run. The goal is not completeness but finding the smallest evidence set that can justify a safe, auditable recommendation.


Parsing & Validation (Consider whether to use LLM in entity extraction process)

Once candidate URLs are acquired, deterministic processing runs before the LLM sees anything:

1. Name presence check: Does the input name (or a normalised variant) appear in the page text? If not, the record should probably be discarded.

2. Structured data extraction: Pages with Schema.org / JSON-LD markup are parsed directly — Person properties including birthDate, jobTitle, affiliation, and sameAs links can be extracted reliably without LLM involvement. Wikipedia infoboxes are parsed separately using the REST API.

3. Regex date extraction: For pages without structured markup, regex patterns extract date-like strings for comparison against the input DOB, using the same component scoring already in Stage 3.

4. Component validation: Any extracted DOB is compared to existing signals (article-extracted age, publish-date arithmetic). Conflicts are flagged, not silently resolved.

5. Corroboration check: A DOB value must appear in at least two independent sources before it is treated as high_confidence. A single Tier 1 source is medium_confidence. A single Tier 3 source is treated as inconclusive on its own.
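Steps 3 and 5 are the most mechanical and can be sketched directly. The date patterns are deliberately naive (a real implementation would handle locale-specific orderings and two-digit years), and the source-record shape is an assumption.

```python
import re

# Naive date patterns for unstructured pages -- illustrative only.
DATE_PATTERNS = [
    re.compile(r"\b(\d{1,2})\s+(January|February|March|April|May|June|July|"
               r"August|September|October|November|December)\s+(\d{4})\b", re.I),
    re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"),  # ISO 8601
]

def extract_date_strings(text: str) -> list[str]:
    """Pull date-like strings out of free text for comparison against the input DOB."""
    found = []
    for pat in DATE_PATTERNS:
        found.extend(m.group(0) for m in pat.finditer(text))
    return found

def dob_confidence(sources: list[dict]) -> str:
    """Corroboration rule from step 5: `sources` lists the records that reported
    the same DOB, each as {"tier": 1|2|3, "domain": str}."""
    independent_domains = {s["domain"] for s in sources}
    if len(independent_domains) >= 2:
        return "high_confidence"
    if len(sources) == 1 and sources[0]["tier"] == 1:
        return "medium_confidence"
    return "inconclusive"
```
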

From each retrieved page, extraction runs first and is aligned to a predefined schema, e.g. person_name_variants, date (year / month) of birth, age, date of article, occupation, employer, nationality, locations, education.

This makes enrichment output testable and auditable.
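One way to make that schema concrete is a per-page extraction record; the field names below mirror the schema above, and the provenance fields anticipate the attribution requirements described later. This is a sketch, not a committed data model.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PageExtraction:
    """One record per retrieved page, aligned to the predefined schema."""
    source_url: str
    tier: int                     # 1, 2, or 3
    extraction_method: str        # "json_ld" | "infobox" | "regex" | "llm"
    person_name_variants: list[str] = field(default_factory=list)
    birth_year: Optional[int] = None
    birth_month: Optional[int] = None
    age: Optional[int] = None
    article_date: Optional[str] = None
    occupation: Optional[str] = None
    employer: Optional[str] = None
    nationality: Optional[str] = None
    locations: list[str] = field(default_factory=list)
    education: list[str] = field(default_factory=list)
```
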


Evidence-Weighted Entity Resolution

Rather than naive fuzzy matching, enrichment builds a small identity graph: the central node is the input {name, dob}, with evidence nodes for employer, role, location, education, alternate names, and article-mentioned facts. Each edge carries provenance and confidence.

Suggested signal weights:

| Signal | Weight |
| --- | --- |
| Explicit DOB match from Tier 1 source | Very high positive |
| Exact employer + role match | High positive |
| Multilingual alias match (Wikidata) | Medium positive |
| Same city only | Low positive |
| DOB contradiction from Tier 1 source | Very high negative |
| DOB contradiction from Tier 3 source | Weak negative |

Common names require more corroborating edges before enrichment can move the verdict. Rare names require fewer, but never few enough to allow a discard from a single weak source.
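The weighting and rarity rules above could be sketched as a simple scorer. The numeric weights are placeholders that would need calibration against labelled screening outcomes; the edge labels are assumptions.

```python
# Hypothetical weights mirroring the signal table -- placeholders only.
EDGE_WEIGHTS = {
    "dob_match_tier1":      +4.0,  # explicit DOB match from Tier 1 source
    "employer_role_match":  +3.0,  # exact employer + role match
    "alias_match_wikidata": +2.0,  # multilingual alias match
    "same_city":            +0.5,  # same city only
    "dob_conflict_tier1":   -4.0,  # DOB contradiction from Tier 1 source
    "dob_conflict_tier3":   -0.5,  # DOB contradiction from Tier 3 source
}

def resolution_score(edges: list[str], name_rarity: str) -> float:
    """Sum evidence-edge weights. Common names require more corroboration, so
    positive evidence is discounted for them; negative evidence never is."""
    discount = 0.5 if name_rarity == "common" else 1.0
    score = 0.0
    for e in edges:
        w = EDGE_WEIGHTS[e]
        score += w * discount if w > 0 else w
    return score
```
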


LLM Role (Narrow and Late)

The LLM can be used to judge genuine ambiguity:

  • Conflicting candidates: two different DOBs found across sources, which is more likely correct for this individual given the surrounding context?
  • Same-name disambiguation: If a search result contains the right name but it is unclear whether it refers to the same person...does the surrounding evidence (employer, location, role) make this plausible?

Enrichment Output Format

Enrichment results should not just return a single label. Instead, the system should return an object with the result and supporting evidence, e.g.:

enrichment_status:        helpful | inconclusive | conflicting
top_corroborating_facts:  list of up to 5 facts with source URLs
top_conflicting_facts:    list of contradictions with source URLs
resolved_identity_fields: normalised best-estimate of person attributes
recommendation:           keep_refer | upgrade_to_match | safe_to_discard
why:                      plain-English rationale for analyst review

All enriched values are stored separately from article-extracted values, with source URL, tier, extraction method (json_ld | infobox | regex | llm), and confidence clearly attributed. The analyst can always see what came from the article and what came from enrichment.


Decision Policy

Enrichment follows a conservative policy consistent with the asymmetric cost of false negatives:

  • refer → match: requires strong corroboration from multiple independent sources
  • refer → discard: requires an explicit contradiction from a Tier 1 source (clearly different DOB, middle name, or employer combination), ideally corroborated by a second source
  • Everything else: remains refer

This policy is easy to defend to a regulator — it is conservative, explainable, and ensures no automated significant decision is made without a meaningful evidence trail for human review.
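The conservative policy can be expressed as a small, testable function. The thresholds and input field names here are illustrative assumptions; the default-to-refer behaviour is the point.

```python
def apply_decision_policy(enrichment: dict) -> str:
    """Map enrichment evidence to a recommendation, defaulting to keep_refer."""
    corroborating = enrichment.get("independent_corroborating_sources", 0)
    tier1_contradiction = enrichment.get("tier1_contradiction", False)
    contradiction_corroborated = enrichment.get("contradiction_corroborated", False)

    # refer -> match: strong corroboration from multiple independent sources
    if corroborating >= 2 and not tier1_contradiction:
        return "upgrade_to_match"
    # refer -> discard: explicit Tier 1 contradiction, backed by a second source
    if tier1_contradiction and contradiction_corroborated:
        return "safe_to_discard"
    # everything else remains refer
    return "keep_refer"
```
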


Pipeline Integration

... → [4. LLM Assessment] → [4b. Enrichment — if triggered] → [5. Scoring]

Enrichment is a conditional async step, time-bounded to avoid stalling the pipeline. If enrichment times out or returns inconclusive, the verdict falls back to the pre-enrichment score.
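The time-bounded fallback behaviour can be sketched with asyncio. The timeout value and result field names are assumptions to be tuned against pipeline SLAs.

```python
import asyncio

ENRICHMENT_TIMEOUT_S = 20  # assumed budget; tune against pipeline latency targets

async def maybe_enrich(pre_enrichment_verdict: dict, run_enrichment) -> dict:
    """Run enrichment under a hard timeout. On timeout or an inconclusive
    result, fall back to the pre-enrichment verdict unchanged."""
    try:
        result = await asyncio.wait_for(run_enrichment(), ENRICHMENT_TIMEOUT_S)
    except asyncio.TimeoutError:
        return pre_enrichment_verdict
    if result.get("enrichment_status") == "inconclusive":
        return pre_enrichment_verdict
    return {**pre_enrichment_verdict, "enrichment": result}
```
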


Limitations & Risks

  • Common names: enrichment is most likely to go wrong for people with common names where search results frequently return the wrong person; confidence gating mitigates but does not eliminate this (ironically it would also be most useful here)
  • Obscure individuals: enrichment has highest value and lowest risk for public figures; for private individuals not on Wikipedia or in any registry, it will often return nothing useful
  • Cost vs benefit: additional API calls add cost; the benefit (a reduction in analyst referrals) should be measured against this before investing in a full implementation
  • GDPR: enrichment involves additional processing of personal data, which would potentially need to be justified to regulators (although we're not a bank, so I don't know the exact rules here)
  • Scope creep: enrichment could expand indefinitely, or surface so much information that it simply confuses the model. Limiting when enrichment runs and which signals it targets helps here; strict signal-targeting (DOB and middle name in the first iteration) mitigates this risk, though not completely