refactor(scraper): migrate to new vendor#4253
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #4253 +/- ##
==========================================
+ Coverage 54.52% 56.85% +2.33%
==========================================
Files 274 297 +23
Lines 6076 6933 +857
Branches 1455 1674 +219
==========================================
+ Hits 3313 3942 +629
- Misses 2763 2991 +228 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
5ae9972 to
4da03bd
Compare
0bb413d to
1adf54c
Compare
…llModulesEndpoint callModulesEndpoint uses callApi which only throws UnknownApiError, not NotFoundError. The previous catch was dead code inherited from when callApi had application-level error checking built in. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a static mapping of lessonType|classNo -> array index from old timetable data, and re-sort new API output to match. This prevents existing user timetables from breaking due to changed lesson ordering. This is a temporary fix until the frontend no longer depends on timetable array ordering. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
lessonType|classNo is not unique — classes meeting on multiple days (e.g., Lecture 1 on both Monday and Thursday) share the same key. This caused 478/480 differing modules in the timetable diff. Now uses lessonType|classNo|day|startTime|endTime|venue as the full key for exact matching, with a fallback to lessonType|classNo for lessons whose details changed between APIs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… only Drop endTime and venue from the matching key since those fields may change between APIs. Warn when endTime/venue differ from legacy data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| // TODO: TEMPORARY - Only apply legacy ordering for the academic year the mapping was generated from. | ||
| const semesterOrder = this.academicYear === '2025/2026' | ||
| ? legacyTimetableOrder[String(this.semester) as keyof typeof legacyTimetableOrder] | ||
| : undefined; |
There was a problem hiding this comment.
Thanks for ensuring the stability for current users!
…ering Remove partial-match warning logic and instead match on all lesson fields including weeks. Also remove the generate-timetable-order.py script. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@greptile |
Greptile OverviewConfidence Score: 4/5
|
| Filename | Overview |
|---|---|
| scrapers/nus-v2/src/services/nus-api.ts | Refactored API client to use new NUS API endpoints with different authentication headers and parameters |
| scrapers/nus-v2/src/utils/api.ts | Added new utility functions for term mapping and HTML sanitization |
| scrapers/nus-v2/src/utils/data.ts | Added string cleaning functions, module matching logic, and legacy timetable ordering support |
| scrapers/nus-v2/src/tasks/DataPipeline.ts | Restructured to fetch module info once per year instead of per semester |
| scrapers/nus-v2/src/tasks/GetAllModules.ts | New task that fetches all module info once for the entire academic year with fallback to per-semester fetch |
| scrapers/nus-v2/src/tasks/GetSemesterData.ts | Updated to consume pre-fetched module info and propagate timetable data to dual-coded modules |
| scrapers/nus-v2/src/tasks/GetSemesterTimetable.ts | Added legacy timetable ordering to preserve existing user timetables during API migration |
Sequence Diagram
sequenceDiagram
participant Pipeline as DataPipeline
participant GetFD as GetFacultyDepartment
participant GetAll as GetAllModules
participant GetSem as GetSemesterData
participant API as NUS API
participant GetTT as GetSemesterTimetable
participant GetExam as GetSemesterExams
Pipeline->>GetFD: Get faculties & departments
GetFD->>API: getFaculty()
GetFD->>API: getDepartment()
API-->>GetFD: Return org data
GetFD-->>Pipeline: Return organizations
Pipeline->>GetAll: Get all modules (year-level)
GetAll->>API: getFacultyModulesForYear() [30 calls]
API-->>GetAll: Return module info
GetAll-->>Pipeline: Return all modules
par Semester 1
Pipeline->>GetSem: Get semester 1 data
GetSem->>GetTT: Get timetable
GetTT->>API: getSemesterTimetables()
API-->>GetTT: Stream timetable data
GetTT-->>GetSem: Return timetable
GetSem->>GetExam: Get exams
GetExam->>API: getTermExams()
API-->>GetExam: Return exam data
GetExam-->>GetSem: Return exams
GetSem->>GetSem: Combine with pre-fetched module info
GetSem->>GetSem: Propagate timetable to dual-coded modules
GetSem-->>Pipeline: Return semester 1 data
and Semester 2
Pipeline->>GetSem: Get semester 2 data
Note over GetSem,API: Same parallel fetch as Sem 1
GetSem-->>Pipeline: Return semester 2 data
and Semester 3
Pipeline->>GetSem: Get semester 3 data
Note over GetSem,API: Same parallel fetch as Sem 1
GetSem-->>Pipeline: Return semester 3 data
and Semester 4
Pipeline->>GetSem: Get semester 4 data
Note over GetSem,API: Same parallel fetch as Sem 1
GetSem-->>Pipeline: Return semester 4 data
end
Pipeline->>Pipeline: Collate all semester data
Pipeline->>Pipeline: Return final module list
Additional Comments (2)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! |
|
With reference to this comment, I wrote a script to make a diff of the I think the differences highlighted here are genuine differences from the API. I will monitor these after merging. |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use config.academicYear instead of hardcoded '2025/2026' for the legacy timetable ordering check. Add warn-level logging when the modules endpoint returns 404 to help distinguish "no records" from real errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@greptile |
Greptile OverviewConfidence Score: 4/5
|
| Filename | Overview |
|---|---|
| scrapers/nus-v2/src/tasks/DataPipeline.ts | Refactored to fetch module info once for entire year (via GetAllModules) then run semester-specific tasks in parallel, eliminating redundant API calls |
| scrapers/nus-v2/src/services/nus-api.ts | Updated API methods to use new endpoints (CourseNUSMods) with pagination and proper term-to-API-params mapping, added year-level module fetching |
| scrapers/nus-v2/src/types/api.ts | Updated ModuleInfo type to match new API schema with flattened fields (Title, SubjectArea, OrganisationCode, etc.) and CourseAttributes array structure |
| scrapers/nus-v2/src/utils/api.ts | Added mapTermToApiParams for term code conversion, sanitizeModuleInfo for HTML stripping/entity decoding, and containsNbsps validation |
| scrapers/nus-v2/src/utils/data.ts | Added cleanString/stripTags for HTML sanitization, findEquivalentModules for dual-coded module matching, sortTimetableByLegacyOrder for backwards compatibility |
| scrapers/nus-v2/src/tasks/GetAllModules.ts | New task that fetches module metadata once per year (not per semester) to avoid 4× redundant calls, with year-query fallback to per-semester fetching |
| scrapers/nus-v2/src/tasks/GetSemesterData.ts | Receives pre-fetched module info, fetches semester-specific timetable/exams, and propagates timetable data to dual-coded modules missing timetable in new API |
| scrapers/nus-v2/src/tasks/GetSemesterTimetable.ts | Added legacy timetable ordering logic (lines 240-243, 256-260) to preserve existing user timetables during API migration, hardcoded to 2025/2026 |
Sequence Diagram
sequenceDiagram
participant Pipeline as DataPipeline
participant FD as GetFacultyDepartment
participant AM as GetAllModules
participant SD as GetSemesterData
participant ST as GetSemesterTimetable
participant SE as GetSemesterExams
participant API as NUS API
Pipeline->>FD: Get faculties & departments
FD->>API: getFaculty() + getDepartment()
API-->>FD: AcademicGrp[] + AcademicOrg[]
FD-->>Pipeline: Organizations
Pipeline->>AM: Get all modules (year-level)
AM->>API: getFacultyModulesForYear(year, faculty)
API-->>AM: ModuleInfo[] (once per year)
AM->>AM: sanitizeModuleInfo (strip HTML, decode entities)
AM-->>Pipeline: All modules with metadata
par Semester 1
Pipeline->>SD: GetSemesterData(1)
SD->>ST: GetSemesterTimetable(1)
ST->>API: getSemesterTimetables(term)
API-->>ST: TimetableLesson[]
ST->>ST: sortTimetableByLegacyOrder
ST-->>SD: Timetables by module
SD->>SE: GetSemesterExams(1)
SE->>API: getTermExams(term)
API-->>SE: ModuleExam[]
SE-->>SD: Exams by module
SD->>SD: findEquivalentModules (dual-coded)
SD->>SD: Propagate timetable to dual-coded modules
SD-->>Pipeline: Semester 1 data
and Semester 2
Pipeline->>SD: GetSemesterData(2)
Note over SD,API: Same parallel flow
SD-->>Pipeline: Semester 2 data
and Semester 3
Pipeline->>SD: GetSemesterData(3)
Note over SD,API: Same parallel flow
SD-->>Pipeline: Semester 3 data
and Semester 4
Pipeline->>SD: GetSemesterData(4)
Note over SD,API: Same parallel flow
SD-->>Pipeline: Semester 4 data
end
Pipeline->>Pipeline: CollateModules (merge semester data)
Background
NUS changed the API we scrape from to a completely new one. This PR adapts the scraper to use this new API, and adds some special handling to avoid re-ordering the classes of courses in the output.
Why not scraper v3?
The version number of the scraper is linked to the schema of the scraper output, not the API that the scraper is pulling from. Therefore, I decided not to bump the version of the scraper.
Changes
Pipeline Restructuring
Module metadata (title, description, prerequisites) doesn't vary by semester, but the old pipeline fetched it 4× (once per semester) because modules were returned per-semester. The new API returns modules in unified fashion already, so we only need a single fetch per module. The new pipeline:
Type Definitions (
api.ts)ModuleInfofields updated to match new API response shape:CourseTitleTitleSubjectSubjectAreaDescriptionCourseDescModularCredit(string)UnitsMin/UnitsMax(number | null)AcademicOrganisation.CodeOrganisationCodeAcademicGroup.CodeAcademicGroup(flat string)PreRequisitePrerequisiteSummaryCoRequisiteCorequisiteSummaryPreclusionPreclusionSummaryWorkLoadHoursWorkloadHoursNUSModsModuleAttributesCourseAttributes(withCode/Valueinstead ofCourseAttribute/CourseAttributeValue)PrintCatalogNew fields:
Code,ApplicableFromYear,ApplicableFromSem,OrganisationName,AcademicGroupDescData Sanitization (
utils/api.ts,utils/data.ts)sanitizeModuleInfo: Strips HTML tags and decodes entities from all text fields at fetch time, before data enters the pipelinestripTags: Removes HTML tags and normalizes whitespace (including NBSPs)cleanString: CombinesstripTags+decodeHTMLEntities+ trimmapTermToApiParams: Converts 4-digit term codes toapplicableInYear/applicableInSemparams for the new APIparseWorkloadnow handlesnull/undefinedinput gracefullyOther Changes
GetFacultyDepartment: Hardcodes faculty code099(Non-Faculty-based Departments) if missing from API response — needed for modules like CS2101elastic.ts: Early return when bulk body is empty to avoid sending empty bulk requestsenv.example.jsonandConfigtype with new API keys; validation now checks for the new keys instead of the old onesFacultyCodetype: Added totypes/modules.tsfor type safety on faculty code parametersTest Plan
nus-api,DataPipeline,GetSemesterData🤖 Generated with Claude Code