UmdTask443_DATA605_Spring2026_HuggingFace_Text_Classification_Model_1#496

Open
riyaapuri wants to merge 9 commits into gpsaggese:master from riyaapuri:UmdTask443_DATA605_Spring2026_HuggingFace_Text_Classification_Model_1

@riyaapuri

Related to #443

HuggingFace Text Classification - AGNews Dataset

Contains the code deliverables for an end-to-end pipeline that fine-tunes DistilBERT on the AG News dataset (120K articles, 4 classes) for news article classification.
The full pipeline (data loading, preprocessing, training, interactive prediction, and model evaluation) runs inside Docker, with no local Python setup required.


Files Added

  • config.py - Central config for all hyperparameters and constants
  • utils/dataset_loader.py - Loads AG News from HuggingFace Hub, creates a 90/10 train/validation split
  • utils/preprocessing.py - Cleans text (HTML, URLs) and tokenizes with AutoTokenizer
  • utils/metrics.py - Trainer callback and sklearn classification report utility
  • scripts/train.py - Fine-tunes DistilBERT; saves best checkpoint by macro-F1
  • scripts/evaluate_model.py - Batch inference on test dataset; exports report, confusion matrix, metrics chart, predictions CSV
  • scripts/predict.py - Single article, file, and interactive inference with confidence scores
  • run.sh - Unified pipeline wrapper; forwards CLI flags to each step
  • requirements.txt - Full Python dependency list
  • Dockerfile + Docker shell scripts - Standalone Docker integration for all pipeline steps, for efficiency and reproducibility
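
For reviewers unfamiliar with the checkpoint-selection metric mentioned for scripts/train.py, here is a minimal sketch of macro-F1: the unweighted mean of the per-class F1 scores, so each of the 4 classes counts equally. It is written in plain Python for illustration and is not the code in utils/metrics.py. The function name `macro_f1`, the `num_classes` parameter, and the choice to score a class with no true or predicted examples as 0.0 are assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred, num_classes=4):
    """Return the unweighted mean of per-class F1 scores.

    Hypothetical helper for illustration only. Labels are assumed to be
    integers in the range 0..num_classes-1.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # predicted this class, but it was wrong
            fn[true_label] += 1  # missed an example of the true class

    scores = []
    for c in range(num_classes):
        # F1 = 2*TP / (2*TP + FP + FN); assumed 0.0 when the class never appears.
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes


if __name__ == "__main__":
    # Toy two-class example: one class-0 article is misclassified as class 1.
    print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Note that this sketch always averages over a fixed number of classes, which can differ from a library implementation that averages only over the labels present in the data.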

If needed, please refer to the commit history on this branch for an iterative breakdown of changes and progress across development stages.


Authors: @riyaapuri @stupatel17
Reviewers: @protocorn @gpsaggese
