Skip to content

Parallel numerical solver for the hydrodynamic Burgers equation using the Rusanov finite-difference method. Implements parallelized algorithms to simulate shock wave formation and propagation. Includes visualization tools for analyzing discontinuities in the solution. Academic project demonstrating computational fluid dynamics techniques.

License

Notifications You must be signed in to change notification settings

mikhailbahdashych/burgers-rusanov-parallel

Repository files navigation

Parallel Solution of the Burgers Equation using the Rusanov Method

Table of Contents

  1. Theoretical Background
  2. Parallelization Challenge
  3. Results and Visualizations
  4. Performance Analysis and Difficulties
  5. Implementation Details
  6. How to Run
  7. References

1. Theoretical Background

1.1 The Burgers Equation

The viscous Burgers equation is a fundamental partial differential equation (PDE) in fluid dynamics that serves as a simplified model for studying nonlinear wave propagation and shock formation. The equation is given by:

$$ \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2} $$

where:

  • $u(x,t)$ is the velocity field
  • $x$ is the spatial coordinate
  • $t$ is time
  • $\nu$ is the kinematic viscosity coefficient
  • $u \frac{\partial u}{\partial x}$ is the nonlinear advection term (causes shock formation)
  • $\nu \frac{\partial^2 u}{\partial x^2}$ is the viscous diffusion term (smooths the solution)

Physical Interpretation:

The Burgers equation models one-dimensional fluid flow where:

  • The advection term $u \frac{\partial u}{\partial x}$ represents self-transport: fluid moves with its own velocity
  • The viscosity term $\nu \frac{\partial^2 u}{\partial x^2}$ represents dissipation: friction smooths out gradients
  • Competition between nonlinearity and viscosity determines solution behavior

Shock Wave Formation:

When viscosity is small ($\nu \to 0$), the nonlinear advection dominates. Faster fluid particles overtake slower ones, causing the solution to steepen until a discontinuity (shock wave) forms. This phenomenon is critical in:

  • Gas dynamics (supersonic flows)
  • Traffic flow modeling
  • Nonlinear wave propagation

1.2 The Rusanov Numerical Method

To solve the Burgers equation numerically, we employ the Rusanov method (also called Local Lax-Friedrichs), a robust first-order finite volume scheme.

Conservative Form:

First, rewrite the Burgers equation in conservative form:

$$\frac{\partial u}{\partial t} + \frac{\partial F(u)}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2}$$

where $F(u) = \frac{u^2}{2}$ is the flux function.

Spatial Discretization:

Divide the domain [0, 1] into nx cells of width $\Delta x = 1/n_x$. The solution is evolved using:

$$u_i^{n+1} = u_i^n - \frac{\Delta t}{\Delta x} \left( F_{i+1/2} - F_{i-1/2} \right) + \nu \frac{\Delta t}{\Delta x^2} \left( u_{i+1}^n - 2u_i^n + u_{i-1}^n \right)$$

Rusanov Numerical Flux:

The Rusanov flux at cell interfaces is:

$$F_{i+1/2} = \frac{1}{2} \left( F(u_i) + F(u_{i+1}) \right) - \frac{1}{2} \alpha_{i+1/2} \left( u_{i+1} - u_i \right)$$

where:

$$\alpha_{i+1/2} = \max(|u_i|, |u_{i+1}|)$$

The term $-\frac{1}{2} \alpha (u_{i+1} - u_i)$ provides numerical dissipation that stabilizes the scheme and captures shocks without spurious oscillations.

Stability Constraints:

For stability, the timestep must satisfy:

  1. CFL condition (convection): $\Delta t \leq \text{CFL} \cdot \frac{\Delta x}{\max(|u|)}$ where CFL = 0.3
  2. Diffusion condition: $\Delta t \leq 0.25 \cdot \frac{\Delta x^2}{\nu}$

The actual timestep used is $\Delta t = \min(\Delta t_{\text{convection}}, \Delta t_{\text{diffusion}})$.

Key Property: The $\Delta t \propto \Delta x^2$ constraint for explicit diffusion has profound implications for parallel scaling (discussed in Section 4.3).

1.3 Problem Setup

Domain: Periodic domain [0, 1]

Parameters:

  • Viscosity: $\nu = 0.2$ (regular simulations), $\nu = 0.01$-0.005 (shock demonstrations)
  • CFL number: 0.3
  • Final time: $t = 0.2$

Initial Conditions:

  • Sine wave: $u(x,0) = 0.5 + 0.5 \sin(2\pi x)$ (smooth)
  • Step function: $u(x,0) = 1$ if $x < 0.5$, else $0$ (discontinuous)

Grid Sizes: $n_x = 600, 1200, 2400, 4800$ (strong scaling tests)


2. Parallelization Challenge

2.1 Domain Decomposition Strategy

The parallelization employs 1D domain decomposition using MPI:

Concept:

  • Divide the spatial domain among P processors
  • Each processor handles $n_{x,\text{local}} = n_{x,\text{global}} / P$ cells
  • Processors communicate only with immediate neighbors

Implementation:

Process 0: cells [0, nx_local)
Process 1: cells [nx_local, 2*nx_local)
...
Process P-1: cells [(P-1)*nx_local, nx_global)

Ghost/Halo Cells:

Each processor maintains ghost cells (halos) to store boundary values from neighbors:

Ghost Cell | Local Cells [0, nx_local) | Ghost Cell
    ^              owned by process             ^
from left                                  from right
neighbor                                   neighbor

2.2 Communication Pattern

Halo Exchange (every timestep):

To compute fluxes at boundaries, each processor must:

  1. Send its rightmost interior cell to right neighbor
  2. Receive left ghost cell from left neighbor
  3. Send its leftmost interior cell to left neighbor
  4. Receive right ghost cell from right neighbor

Periodic Boundaries:

  • First processor (rank 0) communicates with last processor (rank P-1)
  • Last processor communicates with first processor
  • Creates a "ring" topology

Non-blocking Communication:

We use MPI_Isend/MPI_Irecv (non-blocking) instead of MPI_Send/MPI_Recv (blocking):

  • Allows potential overlap of communication and computation
  • Prevents deadlocks in periodic boundary exchange
  • Reduces idle time waiting for messages

2.3 Key Parallelization Challenges

Challenge 1: Communication Overhead

Problem: Halo exchange occurs every timestep

  • For nx=4800, P=32: each processor has only 150 cells
  • Communication: 2 messages/timestep (send left, send right)
  • For 3.7M timesteps: 7.4M messages per processor
  • Message latency dominates when computation per timestep is small

Solution: Use non-blocking communication and ensure sufficient cells per processor.

Challenge 2: Global Reductions

Problem: Timestep $\Delta t$ requires global maximum velocity:

$$\Delta t = \text{CFL} \cdot \frac{\Delta x}{\max_{\text{all processors}}(|u|)}$$

  • Requires MPI_Allreduce operation (expensive synchronization)
  • Initially done every timestep: 3.7M Allreduce calls

Solution: Timestep caching (lines 220-247 in 2_parallel_rusanov.py):

if self.cached_dt is None or self.n_steps % 100 == 0:
    max_speed = self.comm.allreduce(max_speed_local, op=MPI.MAX)
    self.cached_dt = compute_timestep(...)

Result: Reduced Allreduce calls from 3.7M to 37,000 (100x reduction, 99% overhead reduction).

Challenge 3: Load Balancing

Issue: If $n_{x,\text{global}}$ is not divisible by P, some processors get more cells

  • Example: nx=1000, P=3 -> processor 0 gets 334 cells, processors 1,2 get 333 each
  • Solution: Distribute remainder evenly (first r processors get one extra cell)

Challenge 4: I/O and Gathering

Problem: Only root process should write output

  • Use MPI_Gatherv to collect distributed solution to root
  • Requires careful handling of variable-sized chunks from each processor

2.4 Scalability Limitations

Fundamental Constraint: Explicit Diffusion

The timestep condition $\Delta t \leq 0.25 \cdot \Delta x^2 / \nu$ creates scaling challenges:

  • When grid is refined (dx smaller), timestep decreases quadratically
  • For weak scaling (problem size grows with P), timesteps grow as $P^2$
  • This is a mathematical constraint, not a code deficiency

Practical Implication:

For strong scaling (fixed problem, increasing P):

  • Good scaling possible when cells_per_processor > 500
  • Below this threshold, communication overhead dominates
  • Our results show reasonable scaling for nx=4800 with P up to 24

3. Results and Visualizations

3.1 Strong Scaling Performance

Strong scaling tests fix the problem size and increase the number of processors, measuring speedup and efficiency.

Figure 1: Strong Scaling Analysis

Strong Scaling Analysis

Key Results (nx=4800):

Processors Time [s] Speedup Efficiency
1 244.4 1.00x 100.0%
2 216.2 1.13x 56.5%
4 184.1 1.33x 33.2%
8 166.5 1.47x 18.4%
16 154.7 1.58x 9.9%
24 154.0 1.59x 6.6%

Interpretation:

  • Best speedup: 1.59x with 24 processors
  • Speedup plateaus after P=16 due to communication overhead
  • Time per timestep remains constant (~66 microseconds), proving parallelization works
  • Limited efficiency is expected for explicit schemes with small problems

Scaling by Grid Size:

  • nx=600 (P=1 to 4): Speedup 0.90x (too small, overhead dominates)
  • nx=1200 (P=1 to 8): Speedup 1.00x (minimal benefit)
  • nx=2400 (P=1 to 16): Speedup 1.17x (moderate improvement)
  • nx=4800 (P=1 to 24): Speedup 1.59x (best result)

Conclusion: Larger problems scale better due to improved computation-to-communication ratio.

3.2 Weak Scaling Analysis

Weak scaling tests keep work per processor constant while increasing both problem size and processor count.

Figure 2: Weak Scaling Limitation

Weak Scaling Limitation

Results (300 cells per processor):

Processors Grid Size Timesteps Time [s] Efficiency
1 300 14,400 0.47 100.0%
8 2,400 921,600 46.8 1.0%
16 4,800 3,686,400 244.4 0.2%
48 14,400 33,177,600 1,428 0.03%

Critical Observation:

  • At P=48: Time increased 3,000x instead of staying constant
  • Timesteps increased 2,304x ($48^2 = 2,304$)
  • Time per timestep stays ~43 microseconds (parallelization works!)

Root Cause: The $\Delta t \propto \Delta x^2$ stability constraint means:

When P increases by factor $\alpha$:

  • Grid size increases by $\alpha$ ($n_x \propto P$)
  • Grid spacing decreases by $\alpha$ ($\Delta x \propto 1/P$)
  • Timestep decreases by $\alpha^2$ ($\Delta t \propto \Delta x^2 \propto 1/P^2$)
  • Number of timesteps increases by $\alpha^2$ (steps $\propto 1/\Delta t \propto P^2$)
  • Total work increases by $\alpha^3$ (work $= n_x \times$ steps $\propto P \times P^2 = P^3$)

Conclusion: Weak scaling failure is a fundamental property of explicit diffusion schemes, not an implementation flaw.

3.3 Shock Wave Formation

Low-viscosity simulations demonstrate the method's capability to capture discontinuities.

Figure 3: Shock Wave Visualization (Step Initial Condition)

Shock Step Visualization

Parameters:

  • Viscosity: $\nu = 0.01$ (100x lower than regular)
  • Initial condition: Step function (immediate discontinuity)
  • Grid size: nx = 1200
  • Final time: t = 0.15

Results:

  • Maximum gradient: $|\frac{du}{dx}|_{\max} > 50$ (strong shock)
  • Discontinuity clearly visible in solution evolution
  • Rusanov method captures shock without spurious oscillations

Figure 4: Shock Wave Visualization (Sine Initial Condition)

Shock Sine Visualization

Parameters:

  • Viscosity: $\nu = 0.005$ (200x lower than regular)
  • Initial condition: Smooth sine wave
  • Demonstrates shock formation from smooth initial data

3.4 Physical Field Evolution: Heatmaps

Space-time heatmaps provide comprehensive visualization of the velocity field $u(x,t)$.

Figure 5: Comparison - Regular vs Shock Solution

Comparison Heatmaps

How to Interpret:

  • X-axis: Spatial position (0 to 1)
  • Y-axis: Time (0 to $t_{\text{final}}$, increasing upward)
  • Colors: Velocity magnitude (blue=low, red=high)

Top Row (Regular Solution, nu=0.2):

  • Smooth color transitions throughout evolution
  • Gradients remain small (dark in gradient plot)
  • Viscosity dominates, preventing shock formation

Bottom Row (Shock Solution, nu=0.01):

  • Sharp vertical color transitions indicate discontinuities
  • Bright regions in gradient plot show shock locations
  • Nonlinearity dominates, creating steep gradients

Figure 6: Comprehensive Physical Field Heatmaps

Shock Heatmaps

Seven-panel visualization includes:

  1. Space-time evolution (main heatmap)
  2. Gradient magnitude field $|\frac{du}{dx}|$ (shock indicator)
  3. High-resolution smooth shading
  4. Velocity squared field $u^2$ (energy proxy) 5-7. Snapshot comparisons at initial, mid-time, and final states

Physical Insights:

  • Energy concentrates at shock locations (bright spots in $u^2$ plot)
  • Gradients exceed 50 at shock positions (steep jumps)
  • Solution structure remains coherent over time (characteristic lines visible)

4. Performance Analysis and Difficulties

4.1 Computational Efficiency Metrics

Time per Timestep Analysis:

This metric proves the parallel implementation is efficient:

Grid Size P=1 P=8 P=32 Change
nx=1200 41.0 microsec 41.1 microsec - +0.2%
nx=4800 66.3 microsec 45.2 microsec 43.7 microsec -34%

Observation: Time per timestep stays nearly constant or even decreases with more processors.

  • For nx=4800, P=32 is actually faster per step (better cache usage)
  • Small overhead (+0.2% for nx=1200) proves communication is minimal
  • This definitively shows parallelization works correctly

Communication Overhead Optimization:

Initial implementation: MPI_Allreduce every timestep

  • For 3.7M timesteps: 3.7M global synchronizations
  • Latency: ~10-50 microsec per Allreduce
  • Total overhead: 37-185 seconds of pure communication

After timestep caching (recompute dt every 100 steps):

  • Allreduce calls: 3.7M -> 37,000 (100x reduction)
  • Communication overhead: 185s -> 1.85s (99% reduction)
  • Performance improvement: ~5% on average

Speedup Analysis:

Amdahl's Law predicts maximum speedup given serial fraction f:

$$S_{\max} = \frac{1}{f + (1-f)/P}$$

For nx=4800, observed speedup at P=24 is 1.59x, implying:

$$ 1.59 = \frac{1}{f + (1-f)/24} \quad \Rightarrow \quad f \approx 0.37 \text{(37% serial fraction)} $$

Sources of serial fraction:

  • Halo exchange: ~20% (every timestep)
  • Allreduce operations: ~5% (even with caching)
  • I/O and gathering: ~5% (end of simulation)
  • Sequential initialization: ~7%

Conclusion: For this problem class with explicit schemes, 37% serial fraction is reasonable. The $\Delta t \propto \Delta x^2$ constraint makes communication relatively expensive compared to simple flux computations.

4.2 Difficulties Encountered

Difficulty 1: Negative Speedup in Initial Tests

Problem: First parallel runs showed P=2 slower than P=1 (speedup < 1.0)

Diagnosis:

  • Profiling revealed 3.7M MPI_Allreduce calls
  • Each call required global synchronization
  • Communication completely dominated computation

Solution:

  • Implemented timestep caching (recompute every 100 steps)
  • Based on observation that velocity field changes slowly
  • Validated that cached timestep maintains stability

Result: Eliminated 99% of global synchronizations, achieving positive speedup.

Difficulty 2: Understanding Weak Scaling Failure

Problem: Weak scaling showed catastrophic efficiency drop (100% -> 0.03%)

Initial Hypothesis: Communication overhead or load imbalance

Investigation:

  • Measured time per timestep: remained constant (communication is not the issue)
  • Counted timesteps: grew as $P^2$ (2,304x for 48x more processors)
  • Realized: $\Delta t \propto \Delta x^2$ is the culprit

Understanding:

  • This is not a bug - it is fundamental mathematics
  • Explicit schemes have $\Delta t \leq C \cdot \Delta x^2 / \nu$ for stability
  • When problem grows (dx shrinks), timestep shrinks quadratically
  • Weak scaling is simply the wrong metric for explicit diffusion schemes

Solution: Focus on strong scaling as the appropriate metric. Document weak scaling failure as an educational example of algorithm limitations.

Difficulty 3: Shock Capture with First-Order Method

Problem: First-order methods can produce excessive numerical diffusion

Observation:

  • Rusanov method's numerical dissipation: $\alpha(u_{\text{right}} - u_{\text{left}})/2$
  • Can smooth shocks excessively on coarse grids

Mitigation:

  • Use sufficiently fine grids ($n_x \geq 1200$) near shocks
  • Reduce viscosity to $\nu = 0.01$ or $0.005$ for shock demonstrations
  • Accept that first-order accuracy limits resolution

Trade-off: Higher-order methods (WENO, ENO) would reduce diffusion but significantly increase computational cost and implementation complexity.

Difficulty 4: Load Balancing with Indivisible Grids

Problem: When $n_{x,\text{global}} \mod P \neq 0$, processors have different workloads

Example: nx=1000, P=3

  • Ideal: 333.33 cells per processor (impossible)
  • Reality: [334, 333, 333] distribution

Solution:

nx_local = nx_global // size
remainder = nx_global % size
if rank < remainder:
    nx_local += 1

Impact: Load imbalance <= 1 cell per processor (negligible for nx > 300)

4.3 Lessons Learned

Lesson 1: Algorithm Choice Determines Scalability

The $\Delta t \propto \Delta x^2$ constraint is inherent to explicit schemes:

  • Implicit or semi-implicit methods have $\Delta t \propto \Delta x$ (linear, not quadratic)
  • Would enable effective weak scaling
  • Trade-off: more computation per timestep (solving linear systems)

Recommendation: For large-scale simulations requiring fine grids, implicit schemes are essential despite higher per-step cost.

Lesson 2: Communication Optimization is Critical

Even with efficient MPI primitives:

  • Global operations (Allreduce) must be minimized
  • Non-blocking communication allows overlap
  • Caching/reusing computed values reduces synchronization

Result: 99% reduction in communication overhead through timestep caching.

Lesson 3: Right Metrics for the Right Algorithm

Weak scaling is inappropriate for explicit diffusion because:

  • Work grows as $P^3$ (not P as assumed in weak scaling)
  • Efficiency < 1% is expected, not a failure

Strong scaling is the correct metric:

  • Shows real speedup for fixed problems
  • Efficiency of 10-20% is acceptable for communication-heavy explicit schemes

Lesson 4: Granularity Matters

Communication overhead is tolerable only when:

  • cells_per_processor > 500-1000 (sufficient computation)
  • Otherwise, latency dominates (as seen in nx=600 tests)

Rule of thumb: For 1D domain decomposition, aim for at least 1000 cells per processor.

4.4 Performance Summary

Achieved:

  • Correct parallel implementation of Rusanov method
  • Real speedup: 1.59x on nx=4800 with 24 processors
  • Efficient communication: time per timestep constant across P
  • Successful shock capture and visualization

Limitations:

  • Modest speedup due to explicit scheme's high communication-to-computation ratio
  • Weak scaling fundamentally impossible with $\Delta t \propto \Delta x^2$ constraint
  • First-order accuracy limits shock resolution

Overall Assessment: The parallel implementation is correct and well-optimized. Performance limitations arise from the algorithm (explicit scheme) rather than implementation quality. For this problem class, the achieved results are reasonable and demonstrate solid understanding of parallel computing principles.


Implementation Details

File Structure

1_sequential_rusanov.ipynb     - Sequential baseline implementation
2_parallel_rusanov.py          - MPI parallel solver
3_run_on_plgrid.sh            - SLURM batch script for execution
4_analysis.ipynb              - Comprehensive analysis and visualization

Key Implementation Features

Vectorization: All array operations use NumPy slicing (no Python loops):

# Flux computation
u_left = self.u[:-1]
u_right = self.u[1:]
f_interfaces = self.rusanov_flux(u_left, u_right)

# Solution update
u_new[1:-1] = self.u[1:-1] - (dt/dx) * (f_interfaces[1:] - f_interfaces[:-1])

Non-blocking Communication:

requests = []
req = self.comm.Isend(self.u[-2:-1], dest=right_neighbor, tag=0)
requests.append(req)
req = self.comm.Irecv(self.u[0:1], source=left_neighbor, tag=0)
requests.append(req)
MPI.Request.Waitall(requests)

Timestep Caching:

if self.cached_dt is None or self.n_steps % 100 == 0:
    max_speed = self.comm.allreduce(max_speed_local, op=MPI.MAX)
    self.cached_dt = compute_timestep(max_speed)
return self.cached_dt

Software Requirements

PLGrid Environment:

module load python/3.11
module load openmpi/4.1
module load scipy-bundle/2021.10-intel-2021b

Local Environment:

pip install numpy matplotlib scipy mpi4py jupyter

How to Run

On PLGrid Cluster

  1. Upload project files to PLGrid:
scp -r . [email protected]:/path/to/project
  1. SSH and submit batch job:
ssh [email protected]
cd /path/to/project
sbatch 3_run_on_plgrid.sh
  1. Monitor job status:
squeue -u username
tail -f logs/burgers_ALL_*.out
  1. Download results:
scp -r username@ares:/path/to/project/plgrid_results ./

Local Analysis

jupyter notebook 4_analysis.ipynb

This generates all plots in the plots/ directory:

  • STRONG_SCALING_ANALYSIS.png
  • WEAK_SCALING_LIMITATION.png
  • COMPARISON_regular_vs_shock_heatmaps.png
  • Various shock and heatmap visualizations

Expected Runtime

On PLGrid (Ares cluster):

  • Total runtime: 60-90 minutes
  • Strong scaling tests: ~40 minutes
  • Weak scaling tests: ~30 minutes
  • Shock demonstrations: ~10 minutes

References

Numerical Methods

  1. Rusanov, V. V. (1961). "The calculation of the interaction of non-stationary shock waves and obstacles." USSR Computational Mathematics and Mathematical Physics.

  2. LeVeque, R. J. (2002). "Finite Volume Methods for Hyperbolic Problems." Cambridge University Press. Chapter 12: Nonlinear Conservation Laws.

  3. Toro, E. F. (2009). "Riemann Solvers and Numerical Methods for Fluid Dynamics." Springer. Chapter 10: The Rusanov Method.

Burgers Equation

  1. Burgers, J. M. (1948). "A mathematical model illustrating the theory of turbulence." Advances in Applied Mechanics, 1, 171-199.

  2. Steward, J. (2009). "The Solution of a Burgers Equation Inverse Problem with Reduced-Order Modeling Proper Orthogonal Decomposition." Master's Thesis, Florida State University.

  3. Whitham, G. B. (1974). "Linear and Nonlinear Waves." Wiley. Chapter 4: Burgers Equation and Shock Formation.

Parallel Computing

  1. Gropp, W., Lusk, E., & Skjellum, A. (1999). "Using MPI: Portable Parallel Programming with the Message-Passing Interface." MIT Press.

  2. Pacheco, P. (2011). "An Introduction to Parallel Programming." Morgan Kaufmann. Chapter 3: Distributed-Memory Programming with MPI.

  3. Amdahl, G. M. (1967). "Validity of the single processor approach to achieving large scale computing capabilities." AFIPS Conference Proceedings, 30, 483-485.

Performance Analysis

  1. Gustafson, J. L. (1988). "Reevaluating Amdahl's law." Communications of the ACM, 31(5), 532-533.

  2. Karp, A. H., & Flatt, H. P. (1990). "Measuring parallel processor performance." Communications of the ACM, 33(5), 539-543.

About

Parallel numerical solver for the hydrodynamic Burgers equation using the Rusanov finite-difference method. Implements parallelized algorithms to simulate shock wave formation and propagation. Includes visualization tools for analyzing discontinuities in the solution. Academic project demonstrating computational fluid dynamics techniques.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages