[WebGPU] Register DataTransfer to Env #26450

qjia7 · 2025-10-30T06:15:12Z

This pull request adds a C API for WebGPU data transfer, enabling tensor copying between CPU and GPU devices via the WebGPU execution provider. The main changes introduce a wrapper implementation for data transfer, integrate it with the plugin execution provider factory, and expose a creation function for use by the ONNX Runtime core.

fs-eire

The Environment::data_transfer_mgr_ is different from InferenceSession::data_transfer_mgr_. They are the same type, but the one inside Environment should not depend on any session instance. The changes in this PR make Environment depend on a specific session, and I believe this is not what we want.

There is an existing method CreateAndRegisterInternalEps in class Environment, which should already have called RegisterExecutionProviderLibrary on WebGPU. I can take a look at why the data transfer is not registered correctly.

fs-eire · 2025-11-13T01:28:36Z

Please add override CreateDataTransfer:

onnxruntime/onnxruntime/core/session/plugin_ep/ep_factory_internal_impl.h

Lines 56 to 59 in c30905d

    
    virtual OrtStatus* CreateDataTransfer(_Outptr_result_maybenull_ OrtDataTransferImpl** data_transfer) noexcept {
      *data_transfer = nullptr;
      return nullptr;  // Default implementation does nothing
    }

in class WebGpuEpFactory (file in https://github.com/Microsoft/onnxruntime/blob/main/onnxruntime/core/session/plugin_ep/ep_factory_webgpu.h) for the purpose of this change.
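A minimal sketch of what such an override could look like, assuming simplified stand-in types. `EpFactoryInternalImplSketch`, `WebGpuEpFactorySketch`, and `CreateWebGpuDataTransferSketch` are hypothetical names introduced for illustration; they are not the real ONNX Runtime classes or the function added by this PR, and the SAL annotation from the quoted signature is left out so the snippet compiles on its own.

```cpp
#include <cassert>

// Stand-ins for the ORT C API types (hypothetical, illustration only).
struct OrtStatus {};
struct OrtDataTransferImpl {
  int tag;
};

// Mirrors the shape of the base-class default quoted above.
class EpFactoryInternalImplSketch {
 public:
  virtual ~EpFactoryInternalImplSketch() = default;

  virtual OrtStatus* CreateDataTransfer(OrtDataTransferImpl** data_transfer) noexcept {
    *data_transfer = nullptr;
    return nullptr;  // Default implementation does nothing
  }
};

// Hypothetical helper standing in for whatever the WebGPU provider exposes.
inline OrtDataTransferImpl* CreateWebGpuDataTransferSketch() {
  return new OrtDataTransferImpl{42};
}

// The factory overrides the default so the environment receives a
// non-null data transfer object.
class WebGpuEpFactorySketch : public EpFactoryInternalImplSketch {
 public:
  OrtStatus* CreateDataTransfer(OrtDataTransferImpl** data_transfer) noexcept override {
    assert(data_transfer != nullptr);
    *data_transfer = CreateWebGpuDataTransferSketch();
    return nullptr;  // A null status signals success
  }
};
```

In this sketch the caller takes ownership of the returned object and is responsible for releasing it.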

qjia7 · 2025-11-13T03:12:25Z

WebGPU DataTransfer requires a BufferManager.
For graph capture, the BufferManager is tied to the execution provider instance rather than being session-independent. That's the problem.

qjia7 · 2025-11-17T05:51:02Z

Done. It now uses context 0's buffer manager and creates the context if it does not exist.
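The "use context 0, create it if missing" behavior can be sketched as a small get-or-create registry guarded by a mutex. `ContextSketch` and `ContextRegistrySketch` are hypothetical names; the real WebGpuContextFactory has a different interface, and only the existence-check idea (`HasContext`) comes from this PR.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-in for per-context state such as a buffer manager.
struct ContextSketch {
  int context_id;
};

// Hypothetical registry; not the real WebGpuContextFactory.
class ContextRegistrySketch {
 public:
  // Thread-safe existence check.
  static bool HasContext(int context_id) {
    std::lock_guard<std::mutex> lock(Mutex());
    return Contexts().count(context_id) != 0;
  }

  // Returns the existing context for the id, creating it on first use.
  static ContextSketch& GetOrCreate(int context_id) {
    assert(context_id >= 0);
    std::lock_guard<std::mutex> lock(Mutex());
    std::unique_ptr<ContextSketch>& slot = Contexts()[context_id];
    if (!slot) {
      slot.reset(new ContextSketch{context_id});
    }
    return *slot;
  }

 private:
  static std::mutex& Mutex() {
    static std::mutex m;
    return m;
  }
  static std::map<int, std::unique_ptr<ContextSketch>>& Contexts() {
    static std::map<int, std::unique_ptr<ContextSketch>> contexts;
    return contexts;
  }
};
```

Repeated calls to `GetOrCreate(0)` return the same object, so a data transfer created from the environment and an execution provider created later would share one context in this model.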

This PR enables graph capture for WebGPU. It implements the CopyDeviceToCpu, CopyCpuToDevice, CopyFrom, and Zero functions using the new `CopyTensors` API. The ORT side needs PR [#26450](microsoft/onnxruntime#26450) applied for this to work with WebGPU.

The following will be implemented in follow-up PRs to get the full performance gain from graph capture (the original one is #1720):

1. Support UpdateAttentionMask, UpdatePositionIds, and Cast to keep the whole pipeline on the GPU.
2. Optimize CopyFrom with offsets.

Co-authored-by: Copilot

qjia7 · 2025-11-25T07:27:45Z

@fs-eire @guschmue The WebGPU-related failures have been fixed. The remaining failures are unrelated to my changes. Please take a look, thanks.

Copilot

Pull request overview

This pull request adds WebGPU data transfer functionality to the ONNX Runtime core, enabling tensor copying between CPU and GPU devices via the WebGPU execution provider. The implementation provides a C API wrapper with lazy initialization that determines the WebGPU context from the tensors during the first copy operation.

Key Changes:

- Adds CreateDataTransfer method to WebGpuEpFactory for registering data transfer with the environment
- Implements WebGpuDataTransferImpl wrapper that bridges C API and C++ internal data transfer implementation
- Introduces lazy initialization of WebGPU context based on tensor device information

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/session/plugin_ep/ep_factory_webgpu.h | Declares `CreateDataTransfer` override method in WebGpuEpFactory |
| onnxruntime/core/session/plugin_ep/ep_factory_webgpu.cc | Implements `CreateDataTransfer` by calling WebGPU provider's C API function |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory_creator.h | Declares C API function `OrtWebGpuCreateDataTransfer()` for creating data transfer instances |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Implements data transfer wrapper with lazy context initialization, helper functions, and vendor ID filtering |
| onnxruntime/core/providers/webgpu/webgpu_context.h | Adds `HasContext` method to WebGpuContextFactory for checking context existence |
| onnxruntime/core/providers/webgpu/webgpu_context.cc | Implements `HasContext` method with thread-safe context lookup |

fs-eire · 2025-12-08T22:56:57Z

onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc

+    {
+      std::lock_guard<std::mutex> lock(impl.init_mutex_);
+
+      if (impl.data_transfer_ == nullptr || impl.context_id_ != context_id) {


I understand that the code is designed for the case where different context IDs may be used with the same WebGpuDataTransferImpl instance.

This complicates the locking behavior. For now, I think it is safe to assume the context ID won't change: it should always be 0.

Based on this, the code can be simplified:

Use a pattern like this to avoid unnecessary locking:

    if (impl.data_transfer_ == nullptr) {
      std::lock_guard<std::mutex> lock(impl.init_mutex_);
      if (impl.data_transfer_ == nullptr) {
        ...
        impl.data_transfer_ = ...;
      }
    }
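A self-contained sketch of this double-checked pattern follows. It assumes the pointer is held in a `std::atomic`, because an unsynchronized read of a plain pointer outside the lock is a data race under the C++ memory model; whether the real member is atomic is not stated in this thread. `DataTransferSketch`, `DataTransferImplSketch`, `EnsureInitialized`, and `init_count_` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the internal C++ data transfer object.
struct DataTransferSketch {
  int context_id;
};

// Hypothetical wrapper; not the real WebGpuDataTransferImpl.
struct DataTransferImplSketch {
  std::mutex init_mutex_;
  std::atomic<DataTransferSketch*> data_transfer_{nullptr};
  int init_count_ = 0;  // Counts constructions; written only under init_mutex_

  ~DataTransferImplSketch() { delete data_transfer_.load(); }

  DataTransferSketch& EnsureInitialized() {
    // Fast path: no lock once the pointer has been published.
    DataTransferSketch* p = data_transfer_.load(std::memory_order_acquire);
    if (p == nullptr) {
      std::lock_guard<std::mutex> lock(init_mutex_);
      // Second check: another thread may have initialized it while we waited.
      p = data_transfer_.load(std::memory_order_relaxed);
      if (p == nullptr) {
        p = new DataTransferSketch{0};  // Context ID is assumed to always be 0
        ++init_count_;
        data_transfer_.store(p, std::memory_order_release);
      }
    }
    return *p;
  }
};
```

`std::call_once` with a `std::once_flag` would be an alternative way to get the same one-time initialization with less hand-written synchronization.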

Always create a new context:

    context_ptr = &webgpu::WebGpuContextFactory::CreateContext(params.context_config);

and release it in ReleaseImpl:

    static void ReleaseImpl(OrtDataTransferImpl* this_ptr) noexcept {
      WebGpuDataTransferImpl* p_impl = static_cast<WebGpuDataTransferImpl*>(this_ptr);
      int context_id = p_impl->context_id_;
      bool data_transfer_initialized = false;
      {
        std::lock_guard<std::mutex> lock(p_impl->init_mutex_);
        // Initialized means the data transfer object was actually created.
        data_transfer_initialized = p_impl->data_transfer_ != nullptr;
      }
      delete p_impl;
      if (data_transfer_initialized) {
        WebGpuContextFactory::ReleaseContext(context_id);
      }
    }

One new issue appears with this change.

1. OrtEnv::~OrtEnv() (destructor called first)
   - Calls webgpu::CleanupWebGpuContexts()
   - This clears all WebGPU contexts from WebGpuContextFactory
2. Environment::~Environment() (destructor called later)
   - Destroys the data_transfer_mgr_ member
   - This destroys all registered IDataTransfer instances, including WebGpuDataTransferImpl
3. WebGpuDataTransferImpl::ReleaseImpl() (called during destruction)
   - Tries to call WebGpuContextFactory::ReleaseContext(context_id)
   - But the context was already cleared in step 1
   - ReleaseContext had an ORT_ENFORCE that threw an error ❌

So I removed webgpu::CleanupWebGpuContexts() from OrtEnv::~OrtEnv() in commit b988a12. Is that the right approach? Or should it be delayed to Environment::UnregisterExecutionProviderLibrary, after data_transfer_mgr_.UnregisterDataTransfer?

Just explicitly destroy the Environment first. Then call webgpu::CleanupWebGpuContexts() in OrtEnv::~OrtEnv().
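The suggested ordering can be sketched as follows: the owning destructor resets the environment explicitly, so the data transfer objects are released before the global context cleanup runs. All names here (`EnvironmentSketch`, `OrtEnvSketch`, `CleanupContextsSketch`, `EventLog`, `environment_`) are hypothetical stand-ins, and the event log exists only to make the order observable.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Records the order of teardown steps (illustration only).
inline std::vector<std::string>& EventLog() {
  static std::vector<std::string> log;
  return log;
}

// Stand-in for Environment: its destructor releases the registered data transfers.
struct EnvironmentSketch {
  ~EnvironmentSketch() { EventLog().push_back("release data transfers"); }
};

// Stand-in for webgpu::CleanupWebGpuContexts().
inline void CleanupContextsSketch() { EventLog().push_back("cleanup contexts"); }

// Stand-in for OrtEnv, which owns the environment.
struct OrtEnvSketch {
  std::unique_ptr<EnvironmentSketch> environment_{new EnvironmentSketch()};

  ~OrtEnvSketch() {
    // Members are normally destroyed after the destructor body runs, so
    // reset explicitly to release the data transfers before the cleanup.
    environment_.reset();
    assert(environment_ == nullptr);
    CleanupContextsSketch();
  }
};
```

Without the explicit `reset()`, the cleanup call in the destructor body would run first and the member would be destroyed afterwards, which is the failing order described above.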

register DataTransfer to Env

47b3a2d

qjia7 force-pushed the data_transfer_mgr branch from 04434ea to 47b3a2d November 11, 2025 07:39

qjia7 added 2 commits November 12, 2025 14:19

fix CI errors

974036c

nits

925b620

qjia7 marked this pull request as ready for review November 12, 2025 10:12

qjia7 requested review from fs-eire, guschmue and skottmckay November 12, 2025 10:12

fs-eire requested changes Nov 12, 2025

address comments

42d5e64

qjia7 requested a review from fs-eire November 17, 2025 05:51

qjia7 added 4 commits November 17, 2025 14:38

Merge branch 'main' into pr_data_transfer

f79a393

fix CI errors

9275dad

update comments

d0fa357

validationMode=disabled

2fd4559

qjia7 mentioned this pull request Nov 20, 2025

Enable graph capture for webgpu microsoft/onnxruntime-genai#1848

Merged

lazy initialization

0c04416

guschmue added the ep:WebGPU ort-web webgpu provider label Nov 21, 2025

qjia7 added 2 commits November 24, 2025 14:00

Merge branch 'main' into pr_data_transfer

3d7dd9c

fix the CI errors

5ad41d4

qjia7 requested a review from Copilot November 25, 2025 07:33

Copilot started reviewing on behalf of qjia7 November 25, 2025 07:34

Copilot finished reviewing on behalf of qjia7 November 25, 2025 07:39

Copilot AI reviewed Nov 25, 2025

address comments

60ef5d8

guschmue previously approved these changes Dec 1, 2025
