This repository implements a distributed Natural Language Processing (NLP) system using the Message Passing Interface (MPI). The project explores multiple parallel communication patterns to preprocess text data and compute Term Frequency (TF) and Document Frequency (DF) statistics over a predefined vocabulary.
The system processes an input text corpus by applying a standard NLP pipeline and computing frequency-based statistics commonly used in text analysis and vector-space models. The steps are:
- Lowercasing
- Punctuation removal
- Stopword removal
- Term Frequency (TF) counting
- Document Frequency (DF) counting
Each sentence in the input text is treated as an independent document for DF computation.
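
The steps above can be sketched in plain Python, without MPI. This is a minimal illustration, assuming whitespace tokenization, ASCII punctuation, and a naive sentence split on `.`, `!`, and `?`. The function names `preprocess` and `tf_df` are hypothetical and are not taken from `src/solution.py`.

```python
import re
import string
from collections import Counter

def preprocess(sentence, stopwords):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    lowered = sentence.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in stopwords]

def tf_df(text, vocab, stopwords):
    """Return (TF, DF) counters restricted to vocab; each sentence is one document."""
    tf, df = Counter(), Counter()
    # Assumption: a sentence ends at '.', '!' or '?'.
    for sentence in re.split(r"[.!?]+", text):
        tokens = [t for t in preprocess(sentence, stopwords) if t in vocab]
        tf.update(tokens)       # every occurrence counts toward TF
        df.update(set(tokens))  # a sentence counts at most once per term toward DF
    return tf, df

tf, df = tf_df("The cat sat. The cat saw a cat!", {"cat", "sat"}, {"the", "a"})
print(dict(tf), dict(df))
```

In the example, `cat` occurs three times but in only two sentences, so its TF and DF differ.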
Workers perform the full preprocessing pipeline and compute local TF results, which are aggregated by the manager.
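
A rough sketch of this manager/worker flow, assuming the `mpi4py` bindings, rank 0 as the manager, and at least two processes. The helpers `split_chunks`, `merge_counts`, and the `local_tf` callback are hypothetical names used for illustration only.

```python
from collections import Counter

def split_chunks(items, n_workers):
    """Split items into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def merge_counts(partials):
    """Sum per-worker TF counters into one global counter."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total

def run(sentences, local_tf):
    """local_tf(chunk) -> Counter is a hypothetical per-worker helper."""
    from mpi4py import MPI  # assumption: mpi4py provides the MPI bindings
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    if rank == 0:
        # Manager: hand one chunk to each worker, then collect partial results.
        for worker, chunk in enumerate(split_chunks(sentences, size - 1), start=1):
            comm.send(chunk, dest=worker)
        return merge_counts(comm.recv(source=w) for w in range(1, size))
    # Worker: receive a chunk, compute local TF, return it to the manager.
    comm.send(local_tf(comm.recv(source=0)), dest=0)
    return None
```

The chunking and merging helpers are independent of MPI, so they can be checked serially before running under `mpiexec`.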
Each worker handles a single preprocessing stage, forming a linear pipeline with chunked data flow.
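
One way to picture the linear pipeline, assuming `mpi4py`, one stage per worker rank, and a further rank after the last stage that consumes the output (for example, to count TF). The stage functions and `run_stage` are hypothetical names.

```python
import string

def lowercase(chunk):
    return [s.lower() for s in chunk]

def strip_punctuation(chunk):
    table = str.maketrans("", "", string.punctuation)
    return [s.translate(table) for s in chunk]

def make_stopword_filter(stopwords):
    def remove_stopwords(chunk):
        return [" ".join(w for w in s.split() if w not in stopwords) for s in chunk]
    return remove_stopwords

def run_stage(stage_fn, n_chunks):
    """Receive each chunk from the previous rank, transform it, pass it on."""
    from mpi4py import MPI  # assumption: mpi4py provides the MPI bindings
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    for _ in range(n_chunks):
        chunk = comm.recv(source=rank - 1)
        # Assumption: rank + 1 exists and consumes the transformed chunk.
        comm.send(stage_fn(chunk), dest=rank + 1)

# Serial check of the stage order, without MPI:
stages = [lowercase, strip_punctuation, make_stopword_filter({"the"})]
chunk = ["The cat, the hat."]
for stage in stages:
    chunk = stage(chunk)
print(chunk)
```

Because data moves in chunks, a later stage can work on one chunk while an earlier stage is already processing the next.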
Multiple independent pipelines operate concurrently, each computing partial TF results.
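
A possible rank layout for concurrent pipelines, assuming `mpi4py`, rank 0 as the manager, and worker ranks assigned to pipelines in consecutive blocks of `n_stages`. Both function names are hypothetical.

```python
def pipeline_position(rank, n_stages):
    """Map a worker rank (rank 0 is the manager) to (pipeline_id, stage_id)."""
    worker = rank - 1
    return worker // n_stages, worker % n_stages

def pipeline_comm(n_stages):
    """Group the worker ranks of one pipeline into a sub-communicator."""
    from mpi4py import MPI  # assumption: mpi4py provides the MPI bindings
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    # The manager passes MPI.UNDEFINED and so joins no pipeline communicator.
    color = MPI.UNDEFINED if rank == 0 else pipeline_position(rank, n_stages)[0]
    return comm.Split(color, key=rank)
```

With four stages per pipeline, ranks 1 to 4 would form pipeline 0 and ranks 5 to 8 would form pipeline 1.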
Workers preprocess data, exchange it in pairs, and split TF and DF computation across processes using asymmetric communication.
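
The sketch below shows one possible reading of this pattern; the details are assumptions rather than the repository's actual design. It assumes `mpi4py`, rank 0 as the manager, workers paired as (1, 2), (3, 4), and so on, with the odd rank of each pair counting TF and the even rank counting DF. The send/receive order differs between the two partners so that blocking calls cannot deadlock.

```python
from collections import Counter

def partner(rank):
    """Pair worker ranks (1,2), (3,4), ...; rank 0 is assumed to be the manager."""
    return rank + 1 if rank % 2 == 1 else rank - 1

def exchange_and_count(tokenized_docs, vocab):
    """tokenized_docs: list of token lists, one per locally preprocessed sentence."""
    from mpi4py import MPI  # assumption: mpi4py provides the MPI bindings
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    other = partner(rank)
    if rank % 2 == 1:
        # Odd rank: send first, then receive; it counts TF for the pair's data.
        comm.send(tokenized_docs, dest=other)
        docs = tokenized_docs + comm.recv(source=other)
        return "tf", Counter(t for d in docs for t in d if t in vocab)
    # Even rank: receive first, then send; it counts DF for the pair's data.
    theirs = comm.recv(source=other)
    comm.send(tokenized_docs, dest=other)
    docs = theirs + tokenized_docs
    return "df", Counter(t for d in docs for t in set(d) if t in vocab)
```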
.
├── src/
│ └── solution.py # MPI-based NLP implementation
├── docs/
│ └── report.pdf # Detailed design and experimental analysis
├── test_cases/
│ ├── text_1.txt
│ ├── vocab_1.txt
│ ├── stopwords_1.txt
│ ├── ...
│ └── text_5.txt
└── README.md # Project documentation
mpiexec -n <num_processes> python3 src/solution.py --text <text_file> --vocab <vocab_file> --stopwords <stopwords_file> --pattern <pattern_id>

See docs/report.pdf for a full description of the design, implementation, and experimental results.
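
For reference, the flags in the command line above could be parsed with `argparse` roughly as follows. This is a sketch, not the parser in `src/solution.py`; treating `--pattern` as an integer and marking every flag as required are assumptions, and the value `1` in the example is only illustrative.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="MPI-based TF/DF computation")
    parser.add_argument("--text", required=True, help="path to the input text file")
    parser.add_argument("--vocab", required=True, help="path to the vocabulary file")
    parser.add_argument("--stopwords", required=True, help="path to the stopword list")
    # Assumption: pattern ids are integers.
    parser.add_argument("--pattern", required=True, type=int, help="communication pattern id")
    return parser.parse_args(argv)

args = parse_args([
    "--text", "test_cases/text_1.txt",
    "--vocab", "test_cases/vocab_1.txt",
    "--stopwords", "test_cases/stopwords_1.txt",
    "--pattern", "1",
])
print(args.text, args.pattern)
```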
Provided for educational and portfolio use.