Skip to content

Conversation

@lslusarczyk
Copy link
Contributor

@lslusarczyk lslusarczyk commented Dec 10, 2025

Use perfect forwarding in queue inline implementations. In the same place use event move if possible. For sake of performance.

As the change is small one. Most benchmarks show no performance improvement, but some have improved, e.g. (see difference between green lines above and two last points of red line marked in a circle)
SubmitKernel-out-of-order-using-events-long-kernel_perfect_forwarding
SubmitKernel-out-of-order-with-completion_perfect_forwarding

Not using perfect forwarding in the API taking universal references is a wrong pattern. I want it to be fixed, in order to

  • do not to wonder in the future whether if we are slow because of this or because of something else (quickly jump to "something else" branch)
  • promote correct API for future calls, as people usually look around how things are written and write code in a similar way
  • improve perf a bit, as it indeed improves it

@lslusarczyk lslusarczyk force-pushed the queue-perfect-forwarding branch from e8e1aa1 to 0567a41 Compare December 10, 2025 10:36
@lslusarczyk lslusarczyk force-pushed the queue-perfect-forwarding branch from 0567a41 to d6a5400 Compare December 10, 2025 12:13
detail::code_location::current()) {
std::enable_if_t<ext::oneapi::experimental::is_property_list<
std::decay_t<PropertiesT>>::value,
event> single_task(PropertiesT &&Properties,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are changing a public API. Is it really necessary?

@gmlueck Could you please comment if spec gives us freedom to accept Properties by universal reference?

void>::value)) {
return detail::submit_kernel_direct_single_task<KernelName, true>(
*this, KernelFunc, {}, Properties, TlsCodeLocCapture.query());
*this, KernelFunc, {}, std::forward<PropertiesT>(Properties),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why haven't you updated the else path? And what about the subsequent calls down to the stack? If you forward here, but all next calls in the stack still accept by value, then this optimization makes little sense.

detail::code_location::current()) {
std::enable_if_t<ext::oneapi::experimental::is_property_list<
std::decay_t<PropertiesT>>::value,
event> single_task(event DepEvent, PropertiesT &&Properties,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here as above

const detail::code_location &CodeLoc) {
detail::tls_code_loc_t TlsCodeLocCapture(CodeLoc);
return impl->memset(Ptr, Value, Count, {DepEvent},
return impl->memset(Ptr, Value, Count, {std::move(DepEvent)},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The right fix here should be to use sycl::span in the queue_impl::memset to accept DepEvents to avoid temporary std::vector creation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants