Fix/resolve data issues #34

Merged
wdmuer merged 4 commits into master from fix/resolve-data-issues on Apr 23, 2026
wdmuer (Contributor) commented Apr 17, 2026

Summary
This PR fixes data quality issues around UUID handling, example URI usage, and annotation deduplication across the geocoding pipeline.

Changes
Provide UUIDs for all entities; prevent use of example URIs
Ensure correct realizes and is_realized_by links between translated expressions and works
Remove title addition during segmentation (handled by PDF content extraction service)
Remove duplicate annotations
Bump the version of DECIDe AI base service
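
As a rough illustration of the first and fourth items, minting UUID-based URIs, refusing example URIs, and dropping duplicate annotations could look like the sketch below. The base URI, the Annotation fields, and all function names are hypothetical and are not taken from this repository; the actual implementation may differ.

```python
import uuid
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import urlparse

# Hypothetical base URI (assumption); the real services may use another namespace.
BASE_URI = "http://data.lblod.info/id/annotations/"

# Second-level domains reserved for documentation and examples (RFC 2606).
EXAMPLE_HOSTS = {"example.com", "example.org", "example.net"}


def is_example_uri(uri: str) -> bool:
    """Return True if the URI points at a reserved example domain."""
    host = (urlparse(uri).hostname or "").lower()
    return host in EXAMPLE_HOSTS or any(host.endswith("." + h) for h in EXAMPLE_HOSTS)


def mint_uri(base: str = BASE_URI) -> str:
    """Create a fresh entity URI from a random UUID, refusing example bases."""
    if is_example_uri(base):
        raise ValueError(f"refusing to mint a URI under example base: {base}")
    return f"{base}{uuid.uuid4()}"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation shape; frozen so instances are hashable."""
    target: str
    body: str
    start: int
    end: int


def deduplicate(annotations: Iterable[Annotation]) -> List[Annotation]:
    """Drop exact duplicates while keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for annotation in annotations:
        if annotation not in seen:
            seen.add(annotation)
            unique.append(annotation)
    return unique
```

Two annotations count as duplicates here only when every field matches; a real pipeline might compare a narrower key (for example target and span only).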

How to test
This PR must be tested together with the changes in the related PRs. All PRs involved:
semantic-ai/decide-pdf-scraper#4
semantic-ai/decide-pdf-content-extraction#17
#34
semantic-ai/entity-linking-backend#5
lblod/app-decide#55
semantic-ai/decide-ai-service-base#4

  1. Build the wheel in this repo using uv build
  2. Copy the wheel into the folders of the pdf-scraper, pdf-content-extraction, and geocoding services
  3. In these 3 folders, comment out https://github.com/semantic-ai/decide-ai-service-base/releases/download/0.1.3/decide_ai_service_base-0.1.3-py3-none-any.whl in the requirements.txt file
  4. Add the following two lines to the Dockerfile in each of these 3 folders:
    COPY decide_ai_service_base-0.1.3-py3-none-any.whl .
    RUN uv pip install decide_ai_service_base-0.1.3-py3-none-any.whl
  5. Build a local image of the pdf-scraper, pdf-content-extraction, geocoding, and entity-linking services using the names provided in /compose/ai.yml of lblod/app-decide, or use your own names and change the image names of the services in the ai.yml file accordingly.
  6. Replace your LLM API keys where necessary in the following files: /compose/ai.yml and /config/ner/config.json
  7. Run docker compose up -d in the app-decide folder
  8. In the Harvesting dashboard, launch a new "Harvest PDF & Publish as ELI" job using the following URL as an example: https://lblod.bredene.be/LBLODWeb/Home/Overzicht/079c62c3c9a5fe661f62f2833bb02c8c5bfe646ec97a2c1ba6055df3ba04013d/GetPublication/?filename=Notulen_Raad%20voor%20maatschappelijk%20welzijn_22-09-2025_RVMW%20notulen.pdf and choose "Bredene" as the municipality
  9. Verify that all 6 tasks have run successfully
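
Step 3 can be scripted rather than done by hand in each of the three folders. The helper below is a minimal sketch, not part of this PR: the function name is hypothetical, and it assumes the release URL appears on a single line of requirements.txt.

```python
# Release URL quoted in step 3 of the test instructions.
WHEEL_URL = (
    "https://github.com/semantic-ai/decide-ai-service-base/releases/download/"
    "0.1.3/decide_ai_service_base-0.1.3-py3-none-any.whl"
)


def comment_out_requirement(requirements_text: str, needle: str = WHEEL_URL) -> str:
    """Prefix every uncommented line containing `needle` with '# '."""
    lines = []
    for line in requirements_text.splitlines():
        already_commented = line.lstrip().startswith("#")
        if needle in line and not already_commented:
            lines.append("# " + line)
        else:
            lines.append(line)
    result = "\n".join(lines)
    # Preserve a trailing newline if the input had one.
    if requirements_text.endswith("\n"):
        result += "\n"
    return result
```

To apply it, read each service's requirements.txt, pass the text through comment_out_requirement, and write the result back. Running it twice leaves the file unchanged, since lines that are already commented out are skipped.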

Ward De Muer added 3 commits April 17, 2026 14:41
… correct "realizes" and "is_realized_by" links between translated expressions and works, and don't add titles from segmentation (already done in PDF content extraction service)
@wdmuer wdmuer merged commit 37f15ca into master Apr 23, 2026

3 participants