[WebGPU] Register DataTransfer to Env #26450
base: main
Conversation
Force-pushed from 04434ea to 47b3a2d

fs-eire left a comment
The `Environment::data_transfer_mgr_` is different from `InferenceSession::data_transfer_mgr_`. They are the same type, but the one inside `Environment` should not depend on any session instance. The changes in this PR make `Environment` depend on a specific session, and I believe this is not what we want.
There is an existing method `CreateAndRegisterInternalEps` in class `Environment`, which should already call `RegisterExecutionProviderLibrary` for WebGPU. As for why the data transfer is not correctly registered, I can take a look at it.
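For context, a minimal sketch of what environment-scoped registration could look like, assuming the environment owns its own `DataTransferManager`; the helper `MakeWebGpuDataTransfer` and the surrounding function are illustrative, not actual ORT code:

```cpp
#include <memory>
#include "core/framework/data_transfer.h"          // IDataTransfer
#include "core/framework/data_transfer_manager.h"  // DataTransferManager

namespace onnxruntime {

// Hypothetical helper for this sketch: builds the WebGPU IDataTransfer.
std::unique_ptr<IDataTransfer> MakeWebGpuDataTransfer();

// Sketch only: register the data transfer with the environment-level manager.
// No InferenceSession is involved, so the data transfer must not hold session-owned state.
common::Status RegisterWebGpuDataTransferWithEnv(DataTransferManager& env_data_transfer_mgr) {
  auto data_transfer = MakeWebGpuDataTransfer();
  if (data_transfer == nullptr) {
    return common::Status::OK();  // e.g., WebGPU not available; nothing to register
  }
  return env_data_transfer_mgr.RegisterDataTransfer(std::move(data_transfer));
}

}  // namespace onnxruntime
```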
Please add `override` in class `WebGpuEpFactory`.
WebGPU DataTransfer requires a BufferManager.
Done. It uses context 0's buffer manager, and will create one if it does not exist.
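For reference, a minimal sketch of what the `override` on the declaration would look like; the base class, return type, and parameter list below are assumptions, not the actual ORT plugin-EP interfaces:

```cpp
// Forward declarations of the opaque C API types used in this sketch.
struct OrtStatus;
struct OrtDataTransferImpl;

// Stand-in for the real plugin-EP factory base class.
class EpFactoryBaseSketch {
 public:
  virtual ~EpFactoryBaseSketch() = default;
  virtual OrtStatus* CreateDataTransfer(OrtDataTransferImpl** data_transfer) noexcept = 0;
};

class WebGpuEpFactorySketch : public EpFactoryBaseSketch {
 public:
  // `override` makes any mismatch with the base signature a compile-time error.
  OrtStatus* CreateDataTransfer(OrtDataTransferImpl** data_transfer) noexcept override;
};
```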
This PR enables graph capture for WebGPU. It implements the CopyDeviceToCpu, CopyCpuToDevice, CopyFrom, and Zero functions using the new `CopyTensors` API. The ORT part needs this PR [#26450](microsoft/onnxruntime#26450) applied for it to work with WebGPU. The following will be implemented in follow-up PRs to get the full performance gain from graph capture (the original one is #1720):
1. Support UpdateAttentionMask, UpdatePositionIds, and Cast to keep the whole pipeline on GPU.
2. Optimize CopyFrom with offsets.

Co-authored-by: Copilot <[email protected]>
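As a rough illustration of how one of those copy helpers might route through a `CopyTensors`-style call, here is a sketch; the exact home and signature of `CopyTensors` are assumptions here and should be verified against the current `onnxruntime_c_api.h`:

```cpp
#include <array>
#include "onnxruntime_c_api.h"

// Sketch only: copy one device tensor into a pre-allocated CPU tensor through a
// batched CopyTensors-style call. The assumed shape of the call is
// (env, src_tensors, dst_tensors, stream, num_tensors).
OrtStatus* CopyDeviceToCpuSketch(const OrtApi& api, const OrtEnv* env,
                                 const OrtValue* src_device_tensor,
                                 OrtValue* dst_cpu_tensor) {
  std::array<const OrtValue*, 1> srcs{src_device_tensor};
  std::array<OrtValue*, 1> dsts{dst_cpu_tensor};
  return api.CopyTensors(env, srcs.data(), dsts.data(), /*stream*/ nullptr, srcs.size());
}
```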
Pull request overview
This pull request adds WebGPU data transfer functionality to the ONNX Runtime core, enabling tensor copying between CPU and GPU devices via the WebGPU execution provider. The implementation provides a C API wrapper with lazy initialization that determines the WebGPU context from the tensors during the first copy operation.
Key Changes:
- Adds `CreateDataTransfer` method to `WebGpuEpFactory` for registering data transfer with the environment
- Implements `WebGpuDataTransferImpl` wrapper that bridges the C API and the C++ internal data transfer implementation
- Introduces lazy initialization of the WebGPU context based on tensor device information (a sketch follows this list)
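A minimal sketch of the lazy-initialization shape described above, using placeholder types in place of the actual WebGPU context and data transfer classes (all names below are illustrative, not ORT internals):

```cpp
#include <memory>
#include <mutex>

// Placeholders standing in for webgpu::WebGpuContext and the internal IDataTransfer.
struct ContextSketch {};
struct DataTransferSketch {
  explicit DataTransferSketch(ContextSketch& /*ctx*/) {}
};

// Sketch only: create the data transfer lazily on the first copy request and
// reuse the cached instance afterwards.
struct LazyWebGpuDataTransfer {
  std::mutex init_mutex_;
  std::unique_ptr<DataTransferSketch> data_transfer_;

  DataTransferSketch& GetOrCreate(ContextSketch& ctx) {
    std::lock_guard<std::mutex> lock(init_mutex_);
    if (data_transfer_ == nullptr) {
      // First use: the context (derived from the tensor's device info in the real code)
      // is now known, so the data transfer can be constructed.
      data_transfer_ = std::make_unique<DataTransferSketch>(ctx);
    }
    return *data_transfer_;
  }
};
```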
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| onnxruntime/core/session/plugin_ep/ep_factory_webgpu.h | Declares CreateDataTransfer override method in WebGpuEpFactory |
| onnxruntime/core/session/plugin_ep/ep_factory_webgpu.cc | Implements CreateDataTransfer by calling WebGPU provider's C API function |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory_creator.h | Declares C API function OrtWebGpuCreateDataTransfer() for creating data transfer instances |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Implements data transfer wrapper with lazy context initialization, helper functions, and vendor ID filtering |
| onnxruntime/core/providers/webgpu/webgpu_context.h | Adds HasContext method to WebGpuContextFactory for checking context existence |
| onnxruntime/core/providers/webgpu/webgpu_context.cc | Implements HasContext method with thread-safe context lookup |
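A minimal sketch of what the thread-safe `HasContext` lookup could look like, assuming the factory stores contexts in a map guarded by a mutex; the member names and the map's value type are assumptions, not the actual `WebGpuContextFactory` layout:

```cpp
#include <memory>
#include <mutex>
#include <unordered_map>

// Sketch only: existence check for a WebGPU context id, with a lock held during lookup.
class WebGpuContextFactorySketch {
 public:
  bool HasContext(int context_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    return contexts_.find(context_id) != contexts_.end();
  }

 private:
  std::mutex mutex_;
  std::unordered_map<int, std::shared_ptr<void>> contexts_;  // placeholder value type
};
```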
```cpp
{
  std::lock_guard<std::mutex> lock(impl.init_mutex_);

  if (impl.data_transfer_ == nullptr || impl.context_id_ != context_id) {
```
I understand that the code is designed for situations where a different context ID may be used for the same `WebGpuDataTransferImpl` instance.
This makes the lock behavior complicated. For now, I think it is safe to assume the context ID won't change: it should always be 0.
Based on this, the code may be simplified:
- Use a pattern like this to avoid unnecessary lock operations:

  ```cpp
  if (impl.data_transfer_ == nullptr) {
    std::lock_guard<std::mutex> lock(impl.init_mutex_);
    if (impl.data_transfer_ == nullptr) {
      // ...
      impl.data_transfer_ = ...;
    }
  }
  ```
- Always create a new context:

  ```cpp
  context_ptr = &webgpu::WebGpuContextFactory::CreateContext(params.context_config);
  ```

  and in `ReleaseImpl` release it:

  ```cpp
  static void ReleaseImpl(OrtDataTransferImpl* this_ptr) noexcept {
    WebGpuDataTransferImpl* p_impl = static_cast<WebGpuDataTransferImpl*>(this_ptr);
    int context_id = p_impl->context_id_;
    bool data_transfer_initialized = false;
    {
      std::lock_guard<std::mutex> lock(p_impl->init_mutex_);
      // Release the context only if the data transfer (and thus the context) was actually created.
      data_transfer_initialized = p_impl->data_transfer_ != nullptr;
    }
    delete p_impl;
    if (data_transfer_initialized) {
      WebGpuContextFactory::ReleaseContext(context_id);
    }
  }
  ```
One new issue appears with this change:

1. `OrtEnv::~OrtEnv()` (destructor called first)
   - Calls `webgpu::CleanupWebGpuContexts()`
   - This clears all WebGPU contexts from `WebGpuContextFactory`
2. `Environment::~Environment()` (destructor called later)
   - Destroys the `data_transfer_mgr_` member
   - This destroys all registered `IDataTransfer` instances, including `WebGpuDataTransferImpl`
3. `WebGpuDataTransferImpl::ReleaseImpl()` (called during destruction)
   - Tries to call `WebGpuContextFactory::ReleaseContext(context_id)`
   - But the context was already cleared in step 1
   - `ReleaseContext` had an `ORT_ENFORCE` that threw an error ❌
So I removed `webgpu::CleanupWebGpuContexts()` from `OrtEnv::~OrtEnv()` in commit b988a12. Is that the right way? Or should it be delayed to `Environment::UnregisterExecutionProviderLibrary`, after `data_transfer_mgr_.UnregisterDataTransfer`?
Just explicitly destroy the `Environment` first. Then call `webgpu::CleanupWebGpuContexts()` in `OrtEnv::~OrtEnv()`.
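A minimal sketch of what that ordering might look like in `OrtEnv::~OrtEnv()`, assuming the `OrtEnv` wrapper holds the core `Environment` in a smart-pointer member; the member name `environment_` and the `USE_WEBGPU` guard are placeholders, not the actual fields or build flags:

```cpp
// Sketch only: assumed shape of OrtEnv's destructor after the suggested change.
OrtEnv::~OrtEnv() {
  // 1. Destroy the Environment first: data_transfer_mgr_ tears down its IDataTransfer
  //    instances (including WebGpuDataTransferImpl) while the WebGPU contexts still exist.
  environment_.reset();  // placeholder member name

  // 2. Only then clear the WebGPU contexts; ReleaseContext's ORT_ENFORCE can no longer fire.
#if defined(USE_WEBGPU)
  webgpu::CleanupWebGpuContexts();
#endif
}
```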
This pull request adds a C API for WebGPU data transfer, enabling tensor copying between CPU and GPU devices via the WebGPU execution provider. The main changes introduce a wrapper implementation for data transfer, integrate it with the plugin execution provider factory, and expose a creation function for use by the ONNX Runtime core.