
refactor(scraper): migrate to new vendor#4253

Merged
ravern merged 28 commits into master from ravern/migrate-to-new-nus-api
Feb 11, 2026


ravern commented Nov 11, 2025

Background

NUS replaced the API we scrape with a completely new one. This PR adapts the scraper to the new API and adds special handling to avoid reordering the classes of courses in the output.

Why not scraper v3?

The version number of the scraper is linked to the schema of the scraper output, not the API that the scraper is pulling from. Therefore, I decided not to bump the version of the scraper.

Changes

Pipeline Restructuring

Module metadata (title, description, prerequisites) doesn't vary by semester, but the old pipeline fetched it 4× (once per semester) because modules were returned per semester. The new API already returns modules in a unified fashion, so we only need a single fetch per module. The new pipeline:

GetFacultyDepartment
  → GetAllModules (fetches module info ONCE for the year)
    → For each semester [1,2,3,4] IN PARALLEL:
        GetSemesterTimetable + GetSemesterExams (in parallel)
        Combine with pre-fetched module info
        CollateVenues
      → CollateModules
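
As a rough illustration of the flow above, here is a minimal sketch of "fetch module info once, then fan out per semester". The `Fetchers` interface, `runPipeline`, and the rule for combining results are hypothetical stand-ins, not the scraper's actual task classes.

```typescript
// Hypothetical shapes for illustration; the real scraper types differ.
type ModuleInfo = { Code: string; Title: string };
type SemesterData = { semester: number; moduleCodes: string[] };

interface Fetchers {
  getAllModules(): Promise<ModuleInfo[]>;
  getTimetable(semester: number): Promise<Record<string, unknown[]>>;
  getExams(semester: number): Promise<Record<string, unknown>>;
}

async function runPipeline(api: Fetchers): Promise<SemesterData[]> {
  // Module metadata is fetched once for the whole academic year.
  const modules = await api.getAllModules();

  // The four semesters run concurrently; within each semester the
  // timetable and exam fetches also run concurrently.
  return Promise.all(
    [1, 2, 3, 4].map(async (semester) => {
      const [timetable, exams] = await Promise.all([
        api.getTimetable(semester),
        api.getExams(semester),
      ]);
      // Assumed combine rule: keep modules that have a timetable or an
      // exam entry in this semester.
      const moduleCodes = modules
        .map((m) => m.Code)
        .filter((code) => code in timetable || code in exams);
      return { semester, moduleCodes };
    }),
  );
}
```

The point of the sketch is only the call pattern: `getAllModules` runs once no matter how many semesters are processed.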

Type Definitions (api.ts)

ModuleInfo fields updated to match new API response shape:

| Old | New |
| --- | --- |
| CourseTitle | Title |
| Subject | SubjectArea |
| Description | CourseDesc |
| ModularCredit (string) | UnitsMin/UnitsMax (number \| null) |
| AcademicOrganisation.Code | OrganisationCode |
| AcademicGroup.Code | AcademicGroup (flat string) |
| PreRequisite | PrerequisiteSummary |
| CoRequisite | CorequisiteSummary |
| Preclusion | PreclusionSummary |
| WorkLoadHours | WorkloadHoursNUSMods |
| ModuleAttributes | CourseAttributes (with Code/Value instead of CourseAttribute/CourseAttributeValue) |
| PrintCatalog | (removed; no longer in API response) |

New fields: Code, ApplicableFromYear, ApplicableFromSem, OrganisationName, AcademicGroupDesc

Data Sanitization (utils/api.ts, utils/data.ts)

  • sanitizeModuleInfo: Strips HTML tags and decodes entities from all text fields at fetch time, before data enters the pipeline
  • stripTags: Removes HTML tags and normalizes whitespace (including NBSPs)
  • cleanString: Combines stripTags + decodeHTMLEntities + trim
  • mapTermToApiParams: Converts 4-digit term codes to applicableInYear/applicableInSem params for the new API
  • parseWorkload now handles null/undefined input gracefully
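
To make the helpers above concrete, here is a minimal sketch of what they might look like. The bodies are illustrative, not the PR's code: the entity table is deliberately tiny, and the term-code layout (two-digit start year, semester digit, trailing zero) and the numeric output format of `mapTermToApiParams` are assumptions.

```typescript
// Tiny entity table for illustration; a real decoder covers far more.
const ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&nbsp;": " ",
};

function decodeHTMLEntities(text: string): string {
  // Single pass, so "&amp;lt;" becomes "&lt;" rather than "<".
  return text.replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, (m) => ENTITIES[m] ?? m);
}

function stripTags(text: string): string {
  return text
    .replace(/<[^>]*>/g, "") // drop anything that looks like a tag
    .replace(/\s+/g, " "); // JavaScript's \s also matches U+00A0 (NBSP)
}

function cleanString(text: string): string {
  return decodeHTMLEntities(stripTags(text)).trim();
}

// Assumed layout: "2510" means start year 2025, semester 1.
function mapTermToApiParams(term: string): {
  applicableInYear: number;
  applicableInSem: number;
} {
  const match = /^(\d{2})([1-4])0$/.exec(term);
  if (!match) throw new Error(`Unexpected term code: ${term}`);
  return {
    applicableInYear: 2000 + Number(match[1]),
    applicableInSem: Number(match[2]),
  };
}
```

Under these assumptions, `cleanString("<p>Data&nbsp;Structures &amp; <b>Algorithms</b></p>")` yields `"Data Structures & Algorithms"`.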

Other Changes

  • GetFacultyDepartment: Hardcodes faculty code 099 (Non-Faculty-based Departments) if missing from API response — needed for modules like CS2101
  • elastic.ts: Early return when bulk body is empty to avoid sending empty bulk requests
  • Config: Updated env.example.json and Config type with new API keys; validation now checks for the new keys instead of the old ones
  • FacultyCode type: Added to types/modules.ts for type safety on faculty code parameters
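
The faculty code 099 fallback could be sketched roughly as below. The `Faculty` shape and the function name `withNonFacultyFallback` are hypothetical; only the code `099` and its description come from the PR text.

```typescript
// Hypothetical shape; the real API response carries more fields.
type Faculty = { code: string; name: string };

const NON_FACULTY_CODE = "099";

function withNonFacultyFallback(faculties: Faculty[]): Faculty[] {
  // Leave the list untouched when the API already includes the entry.
  if (faculties.some((f) => f.code === NON_FACULTY_CODE)) return faculties;
  // Otherwise append a hardcoded entry so modules under 099 still resolve.
  return [
    ...faculties,
    { code: NON_FACULTY_CODE, name: "Non-Faculty-based Departments" },
  ];
}
```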

Test Plan

  • Unit tests updated and passing for nus-api, DataPipeline, GetSemesterData
  • Full scrape run against the new NUS API endpoints
  • Verify output JSON matches expected format

🤖 Generated with Claude Code


vercel Bot commented Nov 11, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| nusmods-export | Ready | Preview, Comment | Feb 11, 2026 8:00am |
| nusmods-website | Ready | Preview, Comment | Feb 11, 2026 8:00am |



codecov Bot commented Nov 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.85%. Comparing base (988c6fd) to head (79d686a).
⚠️ Report is 168 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4253      +/-   ##
==========================================
+ Coverage   54.52%   56.85%   +2.33%     
==========================================
  Files         274      297      +23     
  Lines        6076     6933     +857     
  Branches     1455     1674     +219     
==========================================
+ Hits         3313     3942     +629     
- Misses       2763     2991     +228     


@ravern ravern force-pushed the ravern/migrate-to-new-nus-api branch from 5ae9972 to 4da03bd Compare December 5, 2025 23:43
@ravern ravern force-pushed the ravern/migrate-to-new-nus-api branch from 0bb413d to 1adf54c Compare December 24, 2025 10:54
…llModulesEndpoint

callModulesEndpoint uses callApi which only throws UnknownApiError, not
NotFoundError. The previous catch was dead code inherited from when
callApi had application-level error checking built in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a static mapping of lessonType|classNo -> array index from old
timetable data, and re-sort new API output to match. This prevents
existing user timetables from breaking due to changed lesson ordering.

This is a temporary fix until the frontend no longer depends on
timetable array ordering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
lessonType|classNo is not unique — classes meeting on multiple days
(e.g., Lecture 1 on both Monday and Thursday) share the same key.
This caused 478/480 differing modules in the timetable diff.

Now uses lessonType|classNo|day|startTime|endTime|venue as the full
key for exact matching, with a fallback to lessonType|classNo for
lessons whose details changed between APIs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… only

Drop endTime and venue from the matching key since those fields may
change between APIs. Warn when endTime/venue differ from legacy data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread scrapers/nus-v2/src/tasks/DataPipeline.test.ts
Comment on lines +240 to +243
// TODO: TEMPORARY - Only apply legacy ordering for the academic year the mapping was generated from.
const semesterOrder = this.academicYear === '2025/2026'
  ? legacyTimetableOrder[String(this.semester) as keyof typeof legacyTimetableOrder]
  : undefined;

Thanks for ensuring the stability for current users!

Comment thread scrapers/nus-v2/src/utils/data.ts Outdated
…ering

Remove partial-match warning logic and instead match on all lesson fields
including weeks. Also remove the generate-timetable-order.py script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
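
For readers following the commits above, here is a minimal sketch of the idea they converge on: build a key from every lesson field (including weeks) and re-sort the new API output by the index recorded for that key in the legacy data. The `Lesson` shape, the per-module `Record<string, number>` mapping, and the choice to push unmatched lessons to the end are assumptions, not the PR's actual `sortTimetableByLegacyOrder`.

```typescript
type Weeks = number[] | { start: string; end: string; weeks?: number[] };
type Lesson = {
  lessonType: string;
  classNo: string;
  day: string;
  startTime: string;
  endTime: string;
  venue: string;
  weeks: Weeks;
};

// A key over every field keeps two meetings of the same class distinct,
// which lessonType|classNo alone does not (e.g. Lecture 1 on Mon and Thu).
function lessonKey(l: Lesson): string {
  return [
    l.lessonType, l.classNo, l.day, l.startTime, l.endTime, l.venue,
    JSON.stringify(l.weeks),
  ].join("|");
}

function sortByLegacyOrder(
  lessons: Lesson[],
  legacyOrder: Record<string, number> | undefined,
): Lesson[] {
  if (!legacyOrder) return lessons; // no mapping: keep the API order
  const rank = (l: Lesson): number =>
    legacyOrder[lessonKey(l)] ?? Number.POSITIVE_INFINITY;
  return lessons
    .map((lesson, index) => ({ lesson, index }))
    .sort((a, b) => {
      const ra = rank(a.lesson);
      const rb = rank(b.lesson);
      // Ties (including two unmatched lessons) keep their original order.
      if (ra === rb) return a.index - b.index;
      return ra < rb ? -1 : 1;
    })
    .map((entry) => entry.lesson);
}
```

Unmatched lessons sort after all matched ones here; whether the real code does the same is not stated in the commits.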

ravern commented Feb 11, 2026

@greptile


greptile-apps Bot commented Feb 11, 2026

Greptile Overview

Confidence Score: 4/5

  • This PR is safe to merge with moderate risk
  • The refactoring is well-structured with good test coverage, but requires a full scrape test against production API before deployment. The hardcoded year check and broad 404 error handling are minor concerns that should be addressed.
  • Pay close attention to GetSemesterTimetable.ts (hardcoded year check) and nus-api.ts (404 error handling)

Important Files Changed

| Filename | Overview |
| --- | --- |
| scrapers/nus-v2/src/services/nus-api.ts | Refactored API client to use new NUS API endpoints with different authentication headers and parameters |
| scrapers/nus-v2/src/utils/api.ts | Added new utility functions for term mapping and HTML sanitization |
| scrapers/nus-v2/src/utils/data.ts | Added string cleaning functions, module matching logic, and legacy timetable ordering support |
| scrapers/nus-v2/src/tasks/DataPipeline.ts | Restructured to fetch module info once per year instead of per semester |
| scrapers/nus-v2/src/tasks/GetAllModules.ts | New task that fetches all module info once for the entire academic year with fallback to per-semester fetch |
| scrapers/nus-v2/src/tasks/GetSemesterData.ts | Updated to consume pre-fetched module info and propagate timetable data to dual-coded modules |
| scrapers/nus-v2/src/tasks/GetSemesterTimetable.ts | Added legacy timetable ordering to preserve existing user timetables during API migration |

Sequence Diagram

sequenceDiagram
    participant Pipeline as DataPipeline
    participant GetFD as GetFacultyDepartment
    participant GetAll as GetAllModules
    participant GetSem as GetSemesterData
    participant API as NUS API
    participant GetTT as GetSemesterTimetable
    participant GetExam as GetSemesterExams
    
    Pipeline->>GetFD: Get faculties & departments
    GetFD->>API: getFaculty()
    GetFD->>API: getDepartment()
    API-->>GetFD: Return org data
    GetFD-->>Pipeline: Return organizations
    
    Pipeline->>GetAll: Get all modules (year-level)
    GetAll->>API: getFacultyModulesForYear() [30 calls]
    API-->>GetAll: Return module info
    GetAll-->>Pipeline: Return all modules
    
    par Semester 1
        Pipeline->>GetSem: Get semester 1 data
        GetSem->>GetTT: Get timetable
        GetTT->>API: getSemesterTimetables()
        API-->>GetTT: Stream timetable data
        GetTT-->>GetSem: Return timetable
        GetSem->>GetExam: Get exams
        GetExam->>API: getTermExams()
        API-->>GetExam: Return exam data
        GetExam-->>GetSem: Return exams
        GetSem->>GetSem: Combine with pre-fetched module info
        GetSem->>GetSem: Propagate timetable to dual-coded modules
        GetSem-->>Pipeline: Return semester 1 data
    and Semester 2
        Pipeline->>GetSem: Get semester 2 data
        Note over GetSem,API: Same parallel fetch as Sem 1
        GetSem-->>Pipeline: Return semester 2 data
    and Semester 3
        Pipeline->>GetSem: Get semester 3 data
        Note over GetSem,API: Same parallel fetch as Sem 1
        GetSem-->>Pipeline: Return semester 3 data
    and Semester 4
        Pipeline->>GetSem: Get semester 4 data
        Note over GetSem,API: Same parallel fetch as Sem 1
        GetSem-->>Pipeline: Return semester 4 data
    end
    
    Pipeline->>Pipeline: Collate all semester data
    Pipeline->>Pipeline: Return final module list


@greptile-apps greptile-apps Bot left a comment


23 files reviewed, 4 comments


Comment thread scrapers/nus-v2/src/tasks/GetSemesterTimetable.ts Outdated
Comment thread scrapers/nus-v2/src/tasks/GetSemesterData.ts

greptile-apps Bot commented Feb 11, 2026

Additional Comments (2)

scrapers/nus-v2/src/services/nus-api.ts
catching all 404 errors may hide legitimate issues - consider logging when this happens to distinguish between "no records found" vs actual errors


scrapers/nus-v2/src/utils/data.ts
mapped CourseAttributes to use Code/Value fields instead of CourseAttribute/CourseAttributeValue



ravern commented Feb 11, 2026

With reference to this comment, I wrote a script to diff the data/ directory between production and the newly scraped data. Here's the output for semester 2:

============================================================
  Semester 2
============================================================

  Differing timetables: 7

  --- Aggregate lesson counts ---
  Lessons added in new (not in old):     6
  Lessons removed from new (in old only): 9
  Lessons with changed details:           1
  Modules with reordered lessons:         7

  --- Added lessons by type ---
    Tutorial             +6

  --- Removed lessons by type ---
    Tutorial             -6
    Lecture              -3

  --- Changed detail fields (across all changed lessons) ---
    weeks                1

  --- Module categories ---
  Only lessons added (new has more):    0
  Only lessons removed (new has fewer): 3
  Both added & removed:                 4
  Same lessons, details changed only:   0
  Only reordered (no content diff):     0

  --- Per-module breakdown ---

  DTS5701:
    [REORDERED]
    Removed (1):
      - Lecture 1 @ 0830 (Monday 0830-1730)

  DTS5703:
    [REORDERED]
    Removed (1):
      - Lecture 1 @ 0830 (Tuesday 0830-1730)

  DTS5733:
    [REORDERED]
    Removed (1):
      - Lecture 1 @ 0830 (Tuesday 0830-1730)
    Changed (1):
      ~ Lecture 1: weeks
          weeks: {'start': '2026-04-01', 'end': '2026-06-03', 'weeks': [1, 4, 5, 6, 7, 8, 9, 10]} -> [11]

  EC2374:
    [REORDERED]
    Added (1):
      + Tutorial W01 @ 1400 (Wednesday 1400-1500)
    Removed (1):
      - Tutorial W01 @ 1400 (Wednesday 1400-1500)

  GEC1032:
    [REORDERED]
    Added (2):
      + Tutorial E5 @ 1000 (Tuesday 1000-1200)
      + Tutorial E6 @ 1200 (Tuesday 1200-1400)
    Removed (2):
      - Tutorial E5 @ 1000 (Tuesday 1000-1200)
      - Tutorial E6 @ 1200 (Tuesday 1200-1400)

  GESS1003:
    [REORDERED]
    Added (1):
      + Tutorial D1 @ 1000 (Monday 1000-1200)
    Removed (1):
      - Tutorial D1 @ 1000 (Monday 1000-1200)

  GET1003:
    [REORDERED]
    Added (2):
      + Tutorial E5 @ 1000 (Tuesday 1000-1200)
      + Tutorial E6 @ 1200 (Tuesday 1200-1400)
    Removed (2):
      - Tutorial E5 @ 1000 (Tuesday 1000-1200)
      - Tutorial E6 @ 1200 (Tuesday 1200-1400)

I think the differences highlighted here are genuine differences from the API. I will monitor these after merging.
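
For anyone wanting to reproduce this kind of comparison, here is a rough sketch of a per-module lesson diff. It is not the script used above; the `Lesson` shape and the definitions of "added", "removed", and "reordered" are assumptions chosen to mirror the report's categories.

```typescript
type Lesson = Record<string, unknown> & { lessonType: string; classNo: string };

// Identity key over all top-level fields, with field names sorted so that
// property order does not matter (nested objects are not normalised).
const keyOf = (l: Lesson): string =>
  JSON.stringify(Object.keys(l).sort().map((k) => [k, l[k]]));

function diffLessons(oldLessons: Lesson[], newLessons: Lesson[]) {
  const oldKeys = oldLessons.map(keyOf);
  const newKeys = newLessons.map(keyOf);
  const oldSet = new Set(oldKeys);
  const newSet = new Set(newKeys);

  // Added: present in the new scrape only. Removed: present in production only.
  const added = newLessons.filter((l) => !oldSet.has(keyOf(l)));
  const removed = oldLessons.filter((l) => !newSet.has(keyOf(l)));

  // Reordered: lessons present on both sides appear in a different sequence.
  const commonOld = oldKeys.filter((k) => newSet.has(k));
  const commonNew = newKeys.filter((k) => oldSet.has(k));
  const reordered = commonOld.join("\n") !== commonNew.join("\n");

  return { added, removed, reordered };
}
```

Because the key covers every field, a lesson whose details changed in any field is reported as one removal plus one addition in this sketch.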

ravern and others added 2 commits February 11, 2026 15:57
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use config.academicYear instead of hardcoded '2025/2026' for the legacy
timetable ordering check. Add warn-level logging when the modules
endpoint returns 404 to help distinguish "no records" from real errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
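
The 404 handling described in this commit might look roughly like the sketch below. `fetchModulesOrEmpty`, the injected `fetchModules` callback, and the `status` property on the thrown error are hypothetical; the real client in nus-api.ts may surface HTTP status differently.

```typescript
async function fetchModulesOrEmpty<T>(
  fetchModules: () => Promise<T[]>,
  warn: (message: string) => void = console.warn,
): Promise<T[]> {
  try {
    return await fetchModules();
  } catch (error) {
    // Assumption: the thrown error exposes the HTTP status as `status`.
    const status = (error as { status?: number } | null)?.status;
    if (status === 404) {
      // Log at warn level so "no records" can be told apart from silence.
      warn("Modules endpoint returned 404; treating it as no records");
      return [];
    }
    throw error; // anything else is a real failure
  }
}
```

Passing the logger in keeps the sketch testable; any other error, such as a 500, is rethrown unchanged.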

ravern commented Feb 11, 2026

@greptile


greptile-apps Bot commented Feb 11, 2026

Greptile Overview

Confidence Score: 4/5

  • Safe to merge with minor caveats around hardcoded academic year checks and temporary backwards-compatibility code
  • Score reflects well-structured refactoring with comprehensive test coverage and proper error handling, but includes temporary hardcoded values (2025/2026 academic year checks in GetSemesterTimetable.ts:241-243) and a large legacy ordering JSON that will need future cleanup. The dual-coded module propagation logic is sound but adds complexity. No critical bugs identified.
  • scrapers/nus-v2/src/tasks/GetSemesterTimetable.ts (contains hardcoded academic year check that will need updating), scrapers/nus-v2/src/legacy-timetable-order.json (37k+ line temporary file for backwards compatibility)

Important Files Changed

| Filename | Overview |
| --- | --- |
| scrapers/nus-v2/src/tasks/DataPipeline.ts | Refactored to fetch module info once for entire year (via GetAllModules) then run semester-specific tasks in parallel, eliminating redundant API calls |
| scrapers/nus-v2/src/services/nus-api.ts | Updated API methods to use new endpoints (CourseNUSMods) with pagination and proper term-to-API-params mapping, added year-level module fetching |
| scrapers/nus-v2/src/types/api.ts | Updated ModuleInfo type to match new API schema with flattened fields (Title, SubjectArea, OrganisationCode, etc.) and CourseAttributes array structure |
| scrapers/nus-v2/src/utils/api.ts | Added mapTermToApiParams for term code conversion, sanitizeModuleInfo for HTML stripping/entity decoding, and containsNbsps validation |
| scrapers/nus-v2/src/utils/data.ts | Added cleanString/stripTags for HTML sanitization, findEquivalentModules for dual-coded module matching, sortTimetableByLegacyOrder for backwards compatibility |
| scrapers/nus-v2/src/tasks/GetAllModules.ts | New task that fetches module metadata once per year (not per semester) to avoid 4× redundant calls, with year-query fallback to per-semester fetching |
| scrapers/nus-v2/src/tasks/GetSemesterData.ts | Receives pre-fetched module info, fetches semester-specific timetable/exams, and propagates timetable data to dual-coded modules missing timetable in new API |
| scrapers/nus-v2/src/tasks/GetSemesterTimetable.ts | Added legacy timetable ordering logic (lines 240-243, 256-260) to preserve existing user timetables during API migration, hardcoded to 2025/2026 |
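
The dual-coded propagation mentioned for GetSemesterData.ts can be pictured with the sketch below. It assumes the equivalence groups are already known (the overview attributes that matching to `findEquivalentModules`, whose logic is not shown here); `propagateToDualCoded` and the `Timetables` shape are hypothetical.

```typescript
type Timetables = Record<string, unknown[]>;

function propagateToDualCoded(
  timetables: Timetables,
  equivalents: string[][], // each group lists codes for the same course
): Timetables {
  const result: Timetables = { ...timetables };
  for (const group of equivalents) {
    // Use the first non-empty timetable in the group as the source.
    const sourceLessons = group
      .map((code) => result[code])
      .find((lessons) => lessons !== undefined && lessons.length > 0);
    if (!sourceLessons) continue; // nobody in the group has lessons
    for (const code of group) {
      // Only fill in modules whose timetable is missing or empty.
      if ((result[code]?.length ?? 0) === 0) result[code] = sourceLessons;
    }
  }
  return result;
}
```

Modules that already have their own timetable are left alone, so the copy only fills gaps.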

Sequence Diagram

sequenceDiagram
    participant Pipeline as DataPipeline
    participant FD as GetFacultyDepartment
    participant AM as GetAllModules
    participant SD as GetSemesterData
    participant ST as GetSemesterTimetable
    participant SE as GetSemesterExams
    participant API as NUS API

    Pipeline->>FD: Get faculties & departments
    FD->>API: getFaculty() + getDepartment()
    API-->>FD: AcademicGrp[] + AcademicOrg[]
    FD-->>Pipeline: Organizations

    Pipeline->>AM: Get all modules (year-level)
    AM->>API: getFacultyModulesForYear(year, faculty)
    API-->>AM: ModuleInfo[] (once per year)
    AM->>AM: sanitizeModuleInfo (strip HTML, decode entities)
    AM-->>Pipeline: All modules with metadata

    par Semester 1
        Pipeline->>SD: GetSemesterData(1)
        SD->>ST: GetSemesterTimetable(1)
        ST->>API: getSemesterTimetables(term)
        API-->>ST: TimetableLesson[]
        ST->>ST: sortTimetableByLegacyOrder
        ST-->>SD: Timetables by module
        SD->>SE: GetSemesterExams(1)
        SE->>API: getTermExams(term)
        API-->>SE: ModuleExam[]
        SE-->>SD: Exams by module
        SD->>SD: findEquivalentModules (dual-coded)
        SD->>SD: Propagate timetable to dual-coded modules
        SD-->>Pipeline: Semester 1 data
    and Semester 2
        Pipeline->>SD: GetSemesterData(2)
        Note over SD,API: Same parallel flow
        SD-->>Pipeline: Semester 2 data
    and Semester 3
        Pipeline->>SD: GetSemesterData(3)
        Note over SD,API: Same parallel flow
        SD-->>Pipeline: Semester 3 data
    and Semester 4
        Pipeline->>SD: GetSemesterData(4)
        Note over SD,API: Same parallel flow
        SD-->>Pipeline: Semester 4 data
    end

    Pipeline->>Pipeline: CollateModules (merge semester data)


@greptile-apps greptile-apps Bot left a comment


24 files reviewed, no comments



3 participants