A Python library for extracting course information from Universidad Nacional de Colombia's SIA (Sistema de Informacion Academica). It uses a Rust-backed async HTTP/session layer and returns structured, typed course data.
SIA is Universidad Nacional de Colombia's academic information system. Its public catalog contains course metadata such as schedules, groups, prerequisites, and enrollment conditions.
SIA is built on Oracle Application Development Framework (ADF), which requires strict
stateful navigation (ViewState, window/page IDs, event ordering). sia-scraper abstracts
that workflow behind an async Python API.
This project depends on Oracle ADF component behavior that may change without notice.
- UI/component ID changes in SIA can break request flows.
- Action order must stay exact for dependent dropdown interactions.
- ViewState must be synchronized after each POST request.
For deeper details, see Oracle ADF Quirks (docs/QUIRKS.md).
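To make the ViewState requirement concrete, here is a minimal sketch of carrying the token from one response into the next POST. It is not the library's implementation (that work happens in the Rust layer). It assumes the token arrives as a hidden input named `javax.faces.ViewState`, the standard JSF field name; ADF partial responses may deliver it differently. The helper names `extract_view_state` and `next_form_payload` are hypothetical.

```python
from __future__ import annotations

from html.parser import HTMLParser
from typing import Optional

# Standard JSF hidden-field name; assumed (not confirmed) to be what SIA uses.
VIEW_STATE_FIELD = "javax.faces.ViewState"


class _ViewStateParser(HTMLParser):
    """Records the value of the ViewState hidden input, if one is present."""

    def __init__(self) -> None:
        super().__init__()
        self.view_state: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == VIEW_STATE_FIELD:
            self.view_state = attr_map.get("value")


def extract_view_state(html: str) -> Optional[str]:
    """Return the ViewState token found in an HTML page, or None."""
    parser = _ViewStateParser()
    parser.feed(html)
    return parser.view_state


def next_form_payload(previous_html: str, fields: dict[str, str]) -> dict[str, str]:
    """Build the next POST body using the token from the previous response."""
    token = extract_view_state(previous_html)
    if token is None:
        raise ValueError("no ViewState found in the previous response")
    return {**fields, VIEW_STATE_FIELD: token}
```

Sending a stale token is the kind of mistake this pattern guards against: each POST body is derived from the response that immediately preceded it.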
Install directly from GitHub:
pip install git+https://github.com/BetterCampus/sia-scraper.git

For local development:
git clone https://github.com/BetterCampus/sia-scraper.git
cd sia-scraper
pip install -e ".[dev]"
python scripts/sync_rust_extension.py --build --release --verify

If you update Rust code later, run the sync command again to refresh the local extension binary.
import asyncio
from sia_scraper import SiaScraper

async def main() -> None:
    scraper = await SiaScraper.create()
    await scraper.set_career("0-2-8-3")
    course = await scraper.get_course_info(course_code="2016489")
    print(f"{course.course_name} ({course.credits} credits)")
    await scraper.close_session()

asyncio.run(main())

import asyncio
from sia_scraper import SiaSession

async def main() -> None:
    session = await SiaSession.create()
    await session.set_career("0-2-8-3")
    xml = await session.get_course_xml(0)
    print(len(xml))
    await session.close()

asyncio.run(main())

See docs/MIGRATION_v2.md for complete migration guidance.
Career codes use the format {level}-{campus}-{faculty}-{career}.
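As an illustration of that format, the sketch below splits a career code into its four parts and validates the shape before it is passed to `set_career`. The `CareerCode` class is hypothetical and not part of sia-scraper; the parts are kept as strings because the source does not say whether they are always numeric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CareerCode:
    """Hypothetical helper for the {level}-{campus}-{faculty}-{career} format."""

    level: str
    campus: str
    faculty: str
    career: str

    @classmethod
    def parse(cls, code: str) -> "CareerCode":
        parts = code.split("-")
        # Require exactly four non-empty, dash-separated parts.
        if len(parts) != 4 or not all(parts):
            raise ValueError(f"expected level-campus-faculty-career, got {code!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return "-".join((self.level, self.campus, self.faculty, self.career))


code = CareerCode.parse("0-2-8-3")
print(code.campus, code.faculty)
```

Validating up front gives a clear local error for a malformed code instead of a failure partway through the SIA navigation flow.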
import asyncio
from sia_scraper import SiaScraper

async def main() -> None:
    scraper = await SiaScraper.create()
    await scraper.set_career("0-2-8-3")
    course = await scraper.get_course_info(course_code="2016489")
    print(course.course_name)
    print(course.typology)
    print(course.available_spots)
    for group in course.groups:
        print(group.group_name, group.teacher, group.spots)
        for sch in group.schedules:
            print(sch.day, sch.start_time, sch.end_time, sch.classroom)
    await scraper.close_session()

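asyncio.run(main())

async def check_prerequisites() -> None:
    # Create a scraper and select a career before querying prerequisites.
    scraper = await SiaScraper.create()
    await scraper.set_career("0-2-8-3")
    prereqs = await scraper.get_course_prereqs(course_code="2016489")
    for condition in prereqs.conditions:
        print(condition.type)
        for req in condition.prerequisites:
            print(req.course_code, req.course_name)
    await scraper.close_session()

asyncio.run(check_prerequisites())

import asyncio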
from sia_scraper import SiaScraper, init_sia_scraper

async def main() -> None:
    scraper = await SiaScraper.create()
    await scraper.set_career("0-2-8-3")
    saved = scraper.get_session_data()
    await scraper.close_session()
    restored = await init_sia_scraper("0-2-8-3", False, session_data=saved)
    course = await restored.get_course_info(course_code="2016489")
    print(course.course_name)
    await restored.close_session()

asyncio.run(main())

sia-scraper provides a typed exception hierarchy with two independent trees:
Rust exceptions (from sia_scraper_rust, re-exported via sia_scraper.core.exceptions):
Exception
└── SiaScraperException
    ├── NetworkError      -- DNS, connection refused, unreachable
    ├── HttpStatusError   -- HTTP 4xx/5xx responses
    ├── SiaTimeoutError   -- Request timeout
    ├── ParseError        -- Response cannot be parsed
    └── SessionError      -- Session not initialized or expired
Python exceptions (from sia_scraper.core.exceptions):
Exception
└── SiaSessionException
    ├── SessionNotSet          -- Operation without active session
    ├── CareerNotSet           -- Course operation without career selected
    ├── TimeoutError           -- Legacy timeout (prefer SiaTimeoutError)
    ├── InvalidStatus          -- Incompatible action for current state
    └── ConcurrentAccessError  -- Concurrent access detected
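Because the two trees share no base class other than `Exception`, a handler for `SiaScraperException` will not catch `SiaSessionException`, and vice versa. The sketch below shows one way to handle both. The classes defined here are stand-ins that only mirror the names above so the snippet runs without the package installed; in real code they would be imported from `sia_scraper.core.exceptions`. The `describe_failure` and `failing` helpers and the category labels are hypothetical.

```python
import asyncio


# Stand-ins mirroring the documented names; NOT the real library classes.
class SiaScraperException(Exception): ...
class NetworkError(SiaScraperException): ...
class SiaTimeoutError(SiaScraperException): ...
class ParseError(SiaScraperException): ...
class SiaSessionException(Exception): ...
class CareerNotSet(SiaSessionException): ...


async def describe_failure(operation) -> str:
    """Await operation() and map any raised error to a short category label."""
    try:
        await operation()
    except (NetworkError, SiaTimeoutError):
        return "transient"        # connection or timeout problems
    except SiaScraperException:
        return "rust-error"       # any other error from the Rust tree
    except SiaSessionException:
        return "session-misuse"   # Python tree, e.g. CareerNotSet
    return "ok"


async def failing(exc: Exception) -> None:
    """Demo coroutine that simply raises the given exception."""
    raise exc


print(asyncio.run(describe_failure(lambda: failing(CareerNotSet()))))
```

The specific subclasses are listed before the base class so that they are matched first; the separate `SiaSessionException` clause is what covers the second tree.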
from sia_scraper import SiaScraper
from sia_scraper.core.exceptions import (
    SiaSessionException,  # Python session errors
    CareerNotSet,         # Career not set
    SiaScraperException,  # Rust base exception
    NetworkError,         # Connection failures
    HttpStatusError,      # HTTP 4xx/5xx
    SiaTimeoutError,      # Request timeouts
    ParseError,           # Parse failures
    SessionError,         # Session state errors
)

- API reference: https://bettercampus.github.io/sia-scraper
- Migration guide: docs/MIGRATION_v2.md
- Debugging guide: docs/DEBUGGING.md
- Oracle ADF quirks: docs/QUIRKS.md
Current version: 0.2.1.
- Python >=3.10
- Runtime dependencies:
  - lxml~=5.2.0
  - cssselect~=1.2.0
  - pydantic>=2.0,<3.0
  - loguru~=0.7.0
src/sia_scraper/
├── scraper.py # Async facade (Rust-backed session)
├── session.py # Async Rust-backed session wrapper
├── core/
│ ├── adf_state.py # ViewState extraction utilities
│ └── exceptions.py # Exception hierarchy
├── utils/
│ ├── date_formatter.py
│ └── debug.py
├── constants/
└── parsers/ # HTML/XML parsing
The public API is async-first and Rust-backed. Network/session workflow is handled by Rust
(reqwest + tokio) through the sia_scraper_rust extension.
The scrape_courses() method handles per-row failures during batch processing according to one of three error modes:
- SKIP: Skip rows that fail to parse
- RETRY: Retry failed rows up to 3 times with configurable delay
- ABORT: Abort on first failure
Configure via scrape_courses(error_mode="retry").
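The three modes can be pictured with a small, library-independent sketch. This is not how `scrape_courses()` is implemented. The names `ErrorMode` and `process_rows` and the `max_retries` and `delay` parameters are hypothetical, and dropping a row once its retries are exhausted is an assumption.

```python
import time
from enum import Enum


class ErrorMode(str, Enum):
    SKIP = "skip"
    RETRY = "retry"
    ABORT = "abort"


def process_rows(rows, parse, error_mode="skip", max_retries=3, delay=0.0):
    """Apply parse() to each row, handling failures according to error_mode."""
    mode = ErrorMode(error_mode)  # accepts "retry" as well as ErrorMode.RETRY
    # RETRY allows one initial attempt plus max_retries further attempts.
    attempts = 1 + (max_retries if mode is ErrorMode.RETRY else 0)
    results = []
    for row in rows:
        for attempt in range(attempts):
            try:
                results.append(parse(row))
                break
            except Exception:
                if mode is ErrorMode.ABORT:
                    raise  # stop the whole batch on the first failure
                if attempt + 1 < attempts:
                    time.sleep(delay)  # wait before the next retry
        # A row that never parses is dropped (assumed behaviour).
    return results


def parse_int(text):
    return int(text)


print(process_rows(["1", "x", "3"], parse_int, error_mode="skip"))
```

SKIP and RETRY both keep the batch running; they differ only in how many chances a failing row gets before it is left out of the results.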
pytest
ruff check .
pyright
cargo clippy --manifest-path Cargo.toml

This repository includes cargo-fuzz targets for core Rust parsers:
- fuzz_get_course_list
- fuzz_get_plain_text
- fuzz_extract_view_state
Run them with:
cargo install cargo-fuzz
cargo fuzz run --manifest-path fuzz/Cargo.toml fuzz_get_course_list
cargo fuzz run --manifest-path fuzz/Cargo.toml fuzz_get_plain_text
cargo fuzz run --manifest-path fuzz/Cargo.toml fuzz_extract_view_state

Useful variants:
pytest --cov=src/sia_scraper
pytest tests/utils/test_date_formatter.py
pytest -m "not integration"
pytest tests/fixtures/test_fixtures_validity.py
pytest tests/fixtures/test_contracts.py tests/fixtures/test_regression.py

See CONTRIBUTING.md for development setup, style rules, testing expectations, and pull request guidelines.
License is currently TBD. A project license file will be added to formalize terms.