erer-can/parallel-nlp-mpi

Parallel NLP Processing with MPI

This repository implements a distributed Natural Language Processing (NLP) system using the Message Passing Interface (MPI). The project explores multiple parallel communication patterns to preprocess text data and compute Term Frequency (TF) and Document Frequency (DF) statistics over a predefined vocabulary.


Project Overview

The system processes an input text corpus by applying a standard NLP pipeline and computing frequency-based statistics commonly used in text analysis and vector-space models.

Supported NLP Operations

  • Lowercasing
  • Punctuation removal
  • Stopword removal
  • Term Frequency (TF) counting
  • Document Frequency (DF) counting

Each sentence in the input text is treated as an independent document for DF computation.
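
The operations above can be sketched in plain Python, without MPI. This is a minimal illustration, not the code in src/solution.py: the function names `preprocess` and `tf_df` are hypothetical, punctuation is assumed to mean ASCII punctuation, and tokens are assumed to be whitespace-separated.

```python
import string
from collections import Counter

def preprocess(sentence, stopwords):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    lowered = sentence.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in stopwords]

def tf_df(sentences, vocab, stopwords):
    """Count TF and DF over `vocab`, treating each sentence as one document."""
    tf, df = Counter(), Counter()
    for sentence in sentences:
        tokens = [t for t in preprocess(sentence, stopwords) if t in vocab]
        tf.update(tokens)        # every occurrence counts toward TF
        df.update(set(tokens))   # each sentence counts at most once toward DF
    return tf, df

sentences = ["The cat saw a cat.", "A dog, a cat!", "Dogs bark."]
tf, df = tf_df(sentences, vocab={"cat", "dog", "mat"}, stopwords={"the", "a"})
print(dict(tf))
print(dict(df))
```

In this example "cat" occurs three times but in only two sentences, so its TF and DF differ.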


MPI Communication Patterns

Pattern 1 — Parallel End-to-End Processing

Workers perform the full preprocessing pipeline and compute local TF results, which are aggregated by the manager.
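
The scatter-and-aggregate shape of this pattern might look like the sketch below. It models the workers as ordinary function calls so that it runs without an MPI launcher; in an MPI program the manager would instead send each chunk to a worker rank and receive the partial counts back. The helper names and the contiguous, near-equal chunking are assumptions, not taken from the repository.

```python
from collections import Counter

def split_chunks(items, n_workers):
    """Split items into n_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def worker_tf(chunk, vocab):
    """Stand-in for one worker: count vocabulary terms in its chunk."""
    counts = Counter()
    for sentence in chunk:
        counts.update(tok for tok in sentence.lower().split() if tok in vocab)
    return counts

def manager_aggregate(partials):
    """Stand-in for the manager: sum the workers' partial TF counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

sentences = ["red blue", "blue blue", "green red", "red", "blue green"]
vocab = {"red", "blue"}
chunks = split_chunks(sentences, 3)
total_tf = manager_aggregate(worker_tf(c, vocab) for c in chunks)
print(total_tf)
```

Because TF counts are additive, the aggregated result is the same as counting over the whole corpus in one process.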

Pattern 2 — Linear Pipeline

Each worker handles a single preprocessing stage, forming a linear pipeline with chunked data flow.
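
One way to picture the chunked data flow is with chained generators, where each generator plays the role of one pipeline stage and passes a chunk downstream as soon as it is done with it. This is only an analogy that runs in a single process: the stage names, the three-stage split, and the chunk size are assumptions, and a real MPI pipeline would place each stage on its own rank.

```python
import string

def chunked(lines, size):
    """Yield consecutive chunks of at most `size` lines."""
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

def lowercase_stage(chunks):
    for chunk in chunks:
        yield [s.lower() for s in chunk]

def punctuation_stage(chunks):
    table = str.maketrans("", "", string.punctuation)
    for chunk in chunks:
        yield [s.translate(table) for s in chunk]

def stopword_stage(chunks, stopwords):
    for chunk in chunks:
        yield [" ".join(w for w in s.split() if w not in stopwords) for s in chunk]

lines = ["Hello, World!", "The END.", "A test; the last one"]
pipeline = stopword_stage(
    punctuation_stage(lowercase_stage(chunked(lines, 2))),
    stopwords={"the", "a"},
)
result = list(pipeline)
print(result)
```

Each chunk moves through all stages before the next chunk is pulled, so no stage needs to hold the whole corpus at once.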

Pattern 3 — Parallel Pipelines

Multiple independent pipelines operate concurrently, each computing partial TF results.
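
Running several pipelines side by side requires some rule for assigning ranks to pipelines and stages. The sketch below shows one possible layout, assuming rank 0 is the manager and worker ranks are grouped consecutively into pipelines of equal length. Both functions and the choice of four stages in the example are hypothetical and may not match the actual implementation.

```python
def pipeline_layout(rank, stages_per_pipeline):
    """Map a worker rank (1-based; 0 is the manager) to (pipeline index, stage index)."""
    if rank < 1:
        raise ValueError("rank 0 is reserved for the manager in this sketch")
    return divmod(rank - 1, stages_per_pipeline)

def next_rank(rank, stages_per_pipeline):
    """Rank that receives this stage's output; the last stage reports to the manager."""
    _, stage = pipeline_layout(rank, stages_per_pipeline)
    return 0 if stage == stages_per_pipeline - 1 else rank + 1

# Example: 8 workers arranged as 2 pipelines of 4 stages each (assumed sizes).
for rank in range(1, 9):
    print(rank, pipeline_layout(rank, 4), "->", next_rank(rank, 4))
```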

Pattern 4 — End-to-End Processing with Task Parallelism

Workers preprocess data, exchange it in pairs, and split TF and DF computation across processes using asymmetric communication.
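
The pairing and role split could be organized as in the sketch below. It rests on several assumptions that the README does not state: workers are paired as ranks (1, 2), (3, 4), and so on; "asymmetric" is read as one partner sending first while the other receives first, a common way to avoid deadlock with blocking sends; and the odd rank takes TF while the even rank takes DF. All function names are hypothetical.

```python
from collections import Counter

def partner(rank):
    """Pair worker ranks (1, 2), (3, 4), ...: odd ranks pair upward, even ranks downward."""
    return rank + 1 if rank % 2 == 1 else rank - 1

def exchange_order(rank):
    """Odd ranks send before receiving; even ranks do the reverse (assumed ordering)."""
    return ("send", "recv") if rank % 2 == 1 else ("recv", "send")

def role(rank):
    """Assumed task split within a pair: odd rank counts TF, even rank counts DF."""
    return "TF" if rank % 2 == 1 else "DF"

def compute(rank, own_docs, partner_docs, vocab):
    """After the exchange each rank holds both halves and computes only its own statistic."""
    counts = Counter()
    for tokens in own_docs + partner_docs:
        in_vocab = [t for t in tokens if t in vocab]
        counts.update(in_vocab if role(rank) == "TF" else set(in_vocab))
    return counts

docs_1 = [["cat", "cat", "dog"]]   # preprocessed sentences held by rank 1
docs_2 = [["cat"], ["bird"]]       # preprocessed sentences held by rank 2
vocab = {"cat", "dog"}
tf = compute(1, docs_1, docs_2, vocab)
df = compute(2, docs_2, docs_1, vocab)
print(tf, df)
```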


Repository Structure

.
├── src/
│   └── solution.py        # MPI-based NLP implementation
├── docs/
│   └── report.pdf         # Detailed design and experimental analysis
├── test_cases/
│   ├── text_1.txt
│   ├── vocab_1.txt
│   ├── stopwords_1.txt
│   ├── ...
│   └── text_5.txt
└── README.md              # Project documentation

How to Run

mpiexec -n <num_processes> python3 src/solution.py \
  --text <text_file> \
  --vocab <vocab_file> \
  --stopwords <stopwords_file> \
  --pattern <pattern_id>

Report

See docs/report.pdf for a full description of the design, implementation, and experimental results.


License

Provided for educational and portfolio use.
