Skip to content

Conversation

@qjia7
Copy link
Contributor

@qjia7 qjia7 commented Oct 30, 2025

This pull request adds a C API for WebGPU data transfer, enabling tensor copying between CPU and GPU devices via the WebGPU execution provider. The main changes introduce a wrapper implementation for data transfer, integrate it with the plugin execution provider factory, and expose a creation function for use by the ONNX Runtime core.

@qjia7 qjia7 force-pushed the data_transfer_mgr branch from 04434ea to 47b3a2d Compare November 11, 2025 07:39
@qjia7 qjia7 marked this pull request as ready for review November 12, 2025 10:12
Copy link
Contributor

@fs-eire fs-eire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Environment::data_transfer_mgr_ is different from InferenceSession::data_transfer_mgr_. They are the same type, but the one inside Environment should not depend on any session instance. The changes in this PR brings dependency on a specific session for Environment and I believe this is not what we want.

There is an existing method CreateAndRegisterInternalEps in class Environment, which should have already called RegisterExecutionProviderLibrary on WebGPU. Why the data transfer is not correctly registered - I can take a look at it.

@fs-eire
Copy link
Contributor

fs-eire commented Nov 13, 2025

Please add override CreateDataTransfer:

virtual OrtStatus* CreateDataTransfer(_Outptr_result_maybenull_ OrtDataTransferImpl** data_transfer) noexcept {
*data_transfer = nullptr;
return nullptr; // Default implementation does nothing
}

in class WebGpuEpFactory (file in https://github.com/Microsoft/onnxruntime/blob/main/onnxruntime/core/session/plugin_ep/ep_factory_webgpu.h) for the purpose of this change.

@qjia7
Copy link
Contributor Author

qjia7 commented Nov 13, 2025

Please add override CreateDataTransfer:

virtual OrtStatus* CreateDataTransfer(_Outptr_result_maybenull_ OrtDataTransferImpl** data_transfer) noexcept {
*data_transfer = nullptr;
return nullptr; // Default implementation does nothing
}

in class WebGpuEpFactory (file in https://github.com/Microsoft/onnxruntime/blob/main/onnxruntime/core/session/plugin_ep/ep_factory_webgpu.h) for the purpose of this change.

WebGPU DataTransfer requires a BufferManager.
For graph capture, the BufferManager is tied to the execution provider instance not session-independent. That's the problem.

@qjia7
Copy link
Contributor Author

qjia7 commented Nov 17, 2025

Please add override CreateDataTransfer:

virtual OrtStatus* CreateDataTransfer(_Outptr_result_maybenull_ OrtDataTransferImpl** data_transfer) noexcept {
*data_transfer = nullptr;
return nullptr; // Default implementation does nothing
}

in class WebGpuEpFactory (file in https://github.com/Microsoft/onnxruntime/blob/main/onnxruntime/core/session/plugin_ep/ep_factory_webgpu.h) for the purpose of this change.

Done. Use the context 0's buffer manager. Will create one if not exist.

@qjia7 qjia7 requested a review from fs-eire November 17, 2025 05:51
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Nov 21, 2025
qjia7 added a commit to microsoft/onnxruntime-genai that referenced this pull request Nov 25, 2025
This PR enables the graph capture for webgpu. It implements
CopyDeviceToCpu\CopyCpuToDevice\CopyFrom\Zero functions using the new
`CopyTensors` API.

The ort part needs to apply this PR
[#26450](microsoft/onnxruntime#26450) to make it
work for webgpu.

Below things will be implemented in following-up PRs to get the full
performance gain for graph capture (The original one is
#1720).
1. Support UpdateAttentionMask, UpdatePositionIds, and Cast to keep the
whole pipeline on gpu.
2. Optimize CopyFrom with offsets

---------

Co-authored-by: Copilot <[email protected]>
@qjia7
Copy link
Contributor Author

qjia7 commented Nov 25, 2025

@fs-eire @guschmue The webgpu related failures have been fixed. Others are not related with my changes. Please take a look, thanks.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds WebGPU data transfer functionality to the ONNX Runtime core, enabling tensor copying between CPU and GPU devices via the WebGPU execution provider. The implementation provides a C API wrapper with lazy initialization that determines the WebGPU context from the tensors during the first copy operation.

Key Changes:

  • Adds CreateDataTransfer method to WebGpuEpFactory for registering data transfer with the environment
  • Implements WebGpuDataTransferImpl wrapper that bridges C API and C++ internal data transfer implementation
  • Introduces lazy initialization of WebGPU context based on tensor device information

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
onnxruntime/core/session/plugin_ep/ep_factory_webgpu.h Declares CreateDataTransfer override method in WebGpuEpFactory
onnxruntime/core/session/plugin_ep/ep_factory_webgpu.cc Implements CreateDataTransfer by calling WebGPU provider's C API function
onnxruntime/core/providers/webgpu/webgpu_provider_factory_creator.h Declares C API function OrtWebGpuCreateDataTransfer() for creating data transfer instances
onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc Implements data transfer wrapper with lazy context initialization, helper functions, and vendor ID filtering
onnxruntime/core/providers/webgpu/webgpu_context.h Adds HasContext method to WebGpuContextFactory for checking context existence
onnxruntime/core/providers/webgpu/webgpu_context.cc Implements HasContext method with thread-safe context lookup

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

guschmue
guschmue previously approved these changes Dec 1, 2025
kunal-vaishnavi pushed a commit to microsoft/onnxruntime-genai that referenced this pull request Dec 5, 2025
This PR enables the graph capture for webgpu. It implements
CopyDeviceToCpu\CopyCpuToDevice\CopyFrom\Zero functions using the new
`CopyTensors` API.

The ort part needs to apply this PR
[#26450](microsoft/onnxruntime#26450) to make it
work for webgpu.

Below things will be implemented in following-up PRs to get the full
performance gain for graph capture (The original one is
#1720).
1. Support UpdateAttentionMask, UpdatePositionIds, and Cast to keep the
whole pipeline on gpu.
2. Optimize CopyFrom with offsets

---------

Co-authored-by: Copilot <[email protected]>
{
std::lock_guard<std::mutex> lock(impl.init_mutex_);

if (impl.data_transfer_ == nullptr || impl.context_id_ != context_id) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the code are designed for situations that assume different context ID may be used for the same WebGpuDataTransferImpl instance.

This causes the lock behavior to be complicated. for now, I think it is safe to assume context ID won't change: it should always be 0.

based on this, code may be simplified:

  • use patterns like this to avoid unnecessary lock operation:

    if (impl.data_transfer_ == nullptr) {
      std::lock_guard<std::mutex> lock(impl.init_mutex_);
      if (impl.data_transfer_ == nullptr) {
        ...
    
        impl.data_transfer_ = ...;
      }
    }
  • always create new context:

    context_ptr = &webgpu::WebGpuContextFactory::CreateContext(params.context_config);

    and in ReleaseImpl release it

    static void ReleaseImpl(OrtDataTransferImpl* this_ptr) noexcept {
      WebGpuDataTransferImpl* p_impl = static_cast<WebGpuDataTransferImpl*>(this_ptr);
      int context_id = p_impl->context_id_;
      bool data_transfer_initialized = false;
      {
        std::lock_guard<std::mutex> lock(p_impl->init_mutex_);
        data_transfer_initialized = p_impl->data_transfer_ == nullptr;
      }
      delete p_impl;
      if (data_transfer_initialized) {
        WebGpuContextFactory::ReleaseContext(context_id);
      }
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One new issue appears with this change.

  1. OrtEnv::~OrtEnv() (destructor called first)
    Calls webgpu::CleanupWebGpuContexts()
    This clears all WebGPU contexts from WebGpuContextFactory

  2. Environment::~Environment() (destructor called later)
    Destroys data_transfer_mgr_ member
    This destroys all registered IDataTransfer instances including WebGpuDataTransferImpl
    WebGpuDataTransferImpl::ReleaseImpl() (called during destruction)

  3. Tries to call WebGpuContextFactory::ReleaseContext(context_id)
    But the context was already cleared in step 1
    ReleaseContext had an ORT_ENFORCE that threw an error ❌

So I removed webgpu::CleanupWebGpuContexts() from OrtEnv::~OrtEnv() in commit b988a12. Is it the right way? Or delay it to Environment::UnregisterExecutionProviderLibrary after data_transfer_mgr_.UnregisterDataTransfer ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just explicitly destroy the Environment first. Then call webgpu::CleanupWebGpuContexts() in OrtEnv::~OrtEnv().

@qjia7 qjia7 requested a review from fs-eire December 9, 2025 12:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ep:WebGPU ort-web webgpu provider

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants