Parallel NLP Processing with MPI

This repository implements a distributed Natural Language Processing (NLP) system using the Message Passing Interface (MPI). The project explores multiple parallel communication patterns to preprocess text data and compute Term Frequency (TF) and Document Frequency (DF) statistics over a predefined vocabulary.

Project Overview

The system processes an input text corpus by applying a standard preprocessing pipeline (lowercasing, punctuation removal, stopword removal) and then computing TF and DF statistics, which are commonly used in text analysis and vector-space models.

Supported NLP Operations

Lowercasing
Punctuation removal
Stopword removal
Term Frequency (TF) counting
Document Frequency (DF) counting

Each sentence in the input text is treated as an independent document for DF computation.
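
The operations above can be sketched in a few lines of single-process Python. This is a minimal illustration, not the code in src/solution.py: the function names preprocess and tf_df are hypothetical, punctuation is assumed to mean ASCII punctuation, and sentences are assumed to be already split.

```python
import string
from collections import Counter

def preprocess(sentence, stopwords):
    """Lowercase, strip ASCII punctuation, then drop stopwords."""
    lowered = sentence.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in stopwords]

def tf_df(sentences, vocab, stopwords):
    """Return (tf, df) Counters restricted to vocab.

    Each sentence counts as one document for DF.
    """
    tf, df = Counter(), Counter()
    for sentence in sentences:
        tokens = preprocess(sentence, stopwords)
        tf.update(t for t in tokens if t in vocab)  # every occurrence
        df.update(set(tokens) & vocab)              # at most once per sentence
    return tf, df

sentences = ["The cat sat.", "The cat saw the other cat!"]
tf, df = tf_df(sentences, vocab={"cat", "sat"}, stopwords={"the"})
print(dict(tf), dict(df))
```

Here "cat" appears three times in total (TF) but in only two sentences (DF), which is the distinction the two statistics capture.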

MPI Communication Patterns

Pattern 1 — Parallel End-to-End Processing

Workers perform the full preprocessing pipeline and compute local TF results, which are aggregated by the manager.
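
The split-and-merge logic of this pattern can be simulated in one process, without MPI. The sketch below assumes the manager hands each worker a contiguous chunk of sentences and sums the returned counters; the helper names split_chunks, local_tf, and merge are hypothetical, and the sentences are assumed to be tokenized already.

```python
from collections import Counter

def split_chunks(items, n_workers):
    """Split items into n_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def local_tf(tokenized_sentences, vocab):
    """What one worker would compute: TF over its own chunk only."""
    return Counter(t for sent in tokenized_sentences for t in sent if t in vocab)

def merge(partials):
    """What the manager would do after receiving every worker's Counter."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

docs = [["cat", "sat"], ["cat"], ["dog", "cat"]]
chunks = split_chunks(docs, 2)
total = merge(local_tf(chunk, {"cat", "dog"}) for chunk in chunks)
print(dict(total))
```

Because TF is a plain sum, the merged result does not depend on how the sentences were divided among workers.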

Pattern 2 — Linear Pipeline

Each worker handles a single preprocessing stage, forming a linear pipeline with chunked data flow.
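
A chunked linear pipeline can be illustrated with Python generators standing in for the workers. This is a single-process sketch under assumptions: the stage order (lowercase, punctuation, stopwords) follows the operations list, and the stage function names are hypothetical. In an MPI version each stage would be a separate rank that receives a chunk from its predecessor and sends the result to its successor.

```python
import string

def chunked(sentences, chunk_size):
    """Yield the input in fixed-size chunks (the last one may be shorter)."""
    for i in range(0, len(sentences), chunk_size):
        yield sentences[i:i + chunk_size]

def lowercase_stage(chunks):
    for chunk in chunks:
        yield [s.lower() for s in chunk]

def punctuation_stage(chunks):
    table = str.maketrans("", "", string.punctuation)
    for chunk in chunks:
        yield [s.translate(table) for s in chunk]

def stopword_stage(chunks, stopwords):
    """Tokenize on whitespace and drop stopwords."""
    for chunk in chunks:
        yield [[t for t in s.split() if t not in stopwords] for s in chunk]

sentences = ["A Dog barks.", "The DOG runs!", "A cat."]
stages = chunked(sentences, 2)
stages = lowercase_stage(stages)
stages = punctuation_stage(stages)
stages = stopword_stage(stages, {"a", "the"})
result = list(stages)
print(result)
```

Each chunk moves through all stages before the next chunk is pulled, so the generators model the data flow of the pipeline but not the overlap between stages that separate processes would provide.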

Pattern 3 — Parallel Pipelines

Multiple independent pipelines operate concurrently, each computing partial TF results.
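
One way to organize several pipelines is to group consecutive worker ranks. The layout below is an assumption for illustration only (rank 0 as manager, workers numbered from 1, a fixed number of stages per pipeline); the actual rank assignment in src/solution.py may differ, and both function names are hypothetical.

```python
def pipeline_layout(n_workers, n_stages):
    """Group worker ranks 1..n_workers into pipelines of n_stages consecutive ranks."""
    if n_workers % n_stages != 0:
        raise ValueError("n_workers must be a multiple of n_stages")
    ranks = list(range(1, n_workers + 1))
    return [ranks[i:i + n_stages] for i in range(0, n_workers, n_stages)]

def neighbours(rank, layout):
    """Return (previous, next) worker rank; None marks an end of the pipeline."""
    for pipeline in layout:
        if rank in pipeline:
            i = pipeline.index(rank)
            prev_rank = pipeline[i - 1] if i > 0 else None
            next_rank = pipeline[i + 1] if i < len(pipeline) - 1 else None
            return prev_rank, next_rank
    raise ValueError(f"rank {rank} is not a worker")

layout = pipeline_layout(8, 4)
print(layout)
print(neighbours(5, layout))
```

With 8 workers and 4 stages this gives two pipelines, ranks 1 to 4 and ranks 5 to 8; the last rank of each pipeline would hold that pipeline's partial TF result.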

Pattern 4 — End-to-End Processing with Task Parallelism

Workers preprocess data, exchange it in pairs, and split TF and DF computation across processes using asymmetric communication.
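
The division of labour inside one pair can be sketched without MPI. The reading below is an assumption: after the exchange both partners hold both preprocessed chunks, then one partner computes TF and the other DF. The function names are hypothetical. With blocking MPI calls, partners usually order their send and receive oppositely so that neither waits on the other forever; the simulation skips that detail.

```python
from collections import Counter

def exchange(own_a, own_b):
    """Simulate a pairwise exchange: afterwards both partners hold both chunks."""
    combined = own_a + own_b
    return list(combined), list(combined)

def tf_task(docs, vocab):
    """Partner A's task: count every occurrence of each vocabulary term."""
    return Counter(t for doc in docs for t in doc if t in vocab)

def df_task(docs, vocab):
    """Partner B's task: count the documents that contain each vocabulary term."""
    return Counter(t for doc in docs for t in set(doc) if t in vocab)

vocab = {"cat", "dog"}
chunk_a = [["cat", "cat"]]   # preprocessed sentences held by partner A
chunk_b = [["cat", "dog"]]   # preprocessed sentences held by partner B
view_a, view_b = exchange(chunk_a, chunk_b)
pair_tf = tf_task(view_a, vocab)
pair_df = df_task(view_b, vocab)
print(dict(pair_tf), dict(pair_df))
```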

Repository Structure

.
├── src/
│   └── solution.py        # MPI-based NLP implementation
├── docs/
│   └── report.pdf         # Detailed design and experimental analysis
├── test_cases/
│   ├── text_1.txt
│   ├── vocab_1.txt
│   ├── stopwords_1.txt
│   ├── ...
│   └── text_5.txt
└── README.md              # Project documentation

How to Run

mpiexec -n <num_processes> python3 src/solution.py --text <text_file> --vocab <vocab_file> --stopwords <stopwords_file> --pattern <pattern_id>
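
The four flags in the command could be parsed with argparse along these lines. This is a sketch, not the parser in src/solution.py: it assumes every flag is required and that pattern IDs are the integers 1 to 4 matching the pattern headings. The function name build_parser is hypothetical.

```python
import argparse

def build_parser():
    """Build a parser for the flags shown in the run command."""
    parser = argparse.ArgumentParser(description="Parallel NLP processing with MPI")
    parser.add_argument("--text", required=True, help="path to the input text file")
    parser.add_argument("--vocab", required=True, help="path to the vocabulary file")
    parser.add_argument("--stopwords", required=True, help="path to the stopword file")
    # Assumption: pattern IDs 1-4 correspond to Patterns 1-4.
    parser.add_argument("--pattern", type=int, required=True, choices=[1, 2, 3, 4],
                        help="communication pattern to run")
    return parser

# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "--text", "test_cases/text_1.txt",
    "--vocab", "test_cases/vocab_1.txt",
    "--stopwords", "test_cases/stopwords_1.txt",
    "--pattern", "1",
])
print(args.text, args.pattern)
```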

Report

See docs/report.pdf for a full description of the design, implementation, and experimental results.

License

Provided for educational and portfolio use.
