Improve scrapers content extraction by mikesiez · Pull Request #19 · CarletonComputerScienceSociety/cs-assistant

mikesiez · 2026-06-20T15:26:52Z

Implemented scraping with bs4 instead of trafilatura.
Reduced the number of unnecessary elements being scraped, and fixed necessary elements, such as links and accordions, that were being ignored or improperly parsed.
Added offline test files in tests/ingestion/fixtures and a script to run assertions at tests/ingestion/offline_scraper.py.
Updated the dependency list to include bs4 via uv add beautifulsoup4.
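
For readers unfamiliar with the approach, here is a minimal sketch of how link targets and accordion bodies might be kept during bs4 extraction. It is not the code from this PR: the function name `extract_text`, the sample HTML, and the assumption that accordions are marked up as `<details>`/`<summary>` are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the PR's real fixtures live in tests/ingestion/fixtures.
SAMPLE_HTML = """
<main>
  <p>See the <a href="/courses">course list</a> for details.</p>
  <details>
    <summary>How do I register?</summary>
    <p>Use the registration portal.</p>
  </details>
</main>
"""


def extract_text(html: str) -> str:
    """Return readable text, keeping link targets and accordion bodies."""
    soup = BeautifulSoup(html, "html.parser")

    # Rewrite each link as "text (href)" so the URL survives text extraction.
    for anchor in soup.find_all("a", href=True):
        anchor.replace_with(f"{anchor.get_text(strip=True)} ({anchor['href']})")

    # Emit one line per block element. Assumption: accordions use
    # <details>/<summary>, so their content is ordinary <summary> and <p> tags.
    lines = []
    for block in soup.find_all(["p", "summary"]):
        text = " ".join(block.get_text(" ", strip=True).split())
        if text:
            lines.append(text)
    return "\n".join(lines)


if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML))
```

Pages that build accordions from `<div>` elements with CSS classes would need different selectors; the tag list above only covers the assumed markup.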

Linked to issue #6

Updated the HTML scraper to remove 'nav' tags in addition to 'header' and 'footer'.
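
The tag removal described above could look roughly like the following sketch. The function name `strip_page_chrome` and the sample markup are hypothetical, and the PR's actual implementation may differ.

```python
from bs4 import BeautifulSoup

# Tags treated as page chrome rather than content, per the description above.
CHROME_TAGS = ["header", "footer", "nav"]

# Hypothetical sample page used only for illustration.
SAMPLE_HTML = """
<header><nav><a href="/">Home</a></nav></header>
<nav>Breadcrumbs</nav>
<main><p>Office hours are on Tuesdays.</p></main>
<footer>Contact us</footer>
"""


def strip_page_chrome(html: str) -> str:
    """Drop header, footer, and nav elements, then return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")

    # Search again after each removal so a <nav> nested inside an
    # already-removed <header> is never touched a second time.
    tag = soup.find(CHROME_TAGS)
    while tag is not None:
        tag.decompose()
        tag = soup.find(CHROME_TAGS)

    return soup.get_text(" ", strip=True)


if __name__ == "__main__":
    print(strip_page_chrome(SAMPLE_HTML))
```

`decompose()` removes the element and its children from the tree, so navigation links and breadcrumbs do not leak into the extracted text.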

mikesiez and others added 3 commits June 14, 2026 20:12

Completed scraping, meets all requirements

8ebe583

better nav identification to be removed

b61e058

Added bs4, fixed scraper, added offline testing

69b7f5b

mikesiez linked an issue Jun 20, 2026 that may be closed by this pull request

Improve scraper's content extraction #6

Open

michael added 3 commits June 20, 2026 11:28

linter fixes

b6feb41

more linter fixes:

89af866

linter fixes fixes

6394893

mikesiez requested a review from AJaccP June 20, 2026 15:37

mikesiez wants to merge 6 commits into main from improve-scrapers-content-extraction
