
Enable CUDA Graphs with vLLM Data Parallel #3020

Open
ihebchaa wants to merge 1 commit into EleutherAI:main from ihebchaa:fix/vllm-dp-non-eager

Conversation

@ihebchaa

Problem:

When using vLLM with data_parallel_size > 1, the current implementation forces enforce_eager=True, which disables CUDA graphs and significantly hurts performance. This is particularly problematic for reasoning models that require large max_new_tokens (e.g., 32k+ tokens).

Solution:

This PR removes the forced enforce_eager=True when using data parallel.

Key Changes

  • Improved CUDA device isolation: each worker process now sets its own CUDA_VISIBLE_DEVICES by running in an isolated environment.
  • Removed the forced eager execution.
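The device-isolation idea above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the function names (`visible_devices`, `worker_env`) and the contiguous GPU layout (rank r owns GPUs r*tp .. r*tp+tp-1) are assumptions for the example. The key point is that CUDA_VISIBLE_DEVICES is placed in the environment handed to each worker before any CUDA initialization, so each per-rank vLLM engine can capture CUDA graphs against only its own devices.

```python
import os

def visible_devices(dp_rank: int, tp_size: int) -> str:
    """GPU ids one data-parallel rank should see, assuming a contiguous
    layout: DP rank r owns GPUs [r*tp_size, r*tp_size + tp_size).
    (Hypothetical helper, for illustration only.)"""
    first = dp_rank * tp_size
    return ",".join(str(first + i) for i in range(tp_size))

def worker_env(dp_rank: int, tp_size: int) -> dict:
    # Build the isolated environment for a worker process. Because the
    # variable is set before the child touches CUDA, the worker's engine
    # only ever sees its own GPUs; enforce_eager can then stay False,
    # leaving CUDA graph capture enabled.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices(dp_rank, tp_size)
    return env

if __name__ == "__main__":
    # With dp=2, tp=2: rank 0 sees GPUs "0,1", rank 1 sees "2,3".
    for rank in range(2):
        print(rank, worker_env(rank, 2)["CUDA_VISIBLE_DEVICES"])
```

Each environment dict would then be passed to something like `subprocess.Popen(..., env=...)` or a spawn-context `multiprocessing.Process` so the setting takes effect before the child imports any CUDA-touching module.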

Tests

Tested with tp=2 and dp=1, 2, and 4 with a 7B reasoning model.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Iheb Chaabane seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@baberabb
Contributor

Hi! Thanks for the PR. Do you know why they enforce eager mode if you follow their public API?

@ihebchaa
Author

It's not clear to me why it's forced here. Multiple vLLM instances are created separately on each DP rank and do not communicate with each other, so I don't see why enforce_eager would be required.

