Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify group function inputs must be initialized #516

Conversation

Pennycook
Copy link
Contributor

These clarifications were motivated by code that looks like this:

uint32_t x;                        // uninitialized variable
if (sg.leader()) {
  x = foo();                       // variable is initialized by the group leader only
}
x = sycl::group_broadcast(sg, x);  // value from group leader is broadcast to other work-items

Some SYCL developers (including me!) might expect that this code to be safe, because:

  • The leader always initializes x before its value is broadcast to other work-items
  • The group_broadcast call always returns the value from the leader, and values from other work-items are always discarded
  • The result of the group_broadcast is always assigned to x in all work-items before it is used

However, because group_broadcast accepts x by value, C++ rules say that group_broadcast reads the uninitialized value(s) on all work-items. This is undefined behavior, and so compilers are free to optimize code like the above in surprising ways. For example, a compiler may decide that since the call to group_broadcast is only legal if x is initialized, and x is only initialized if sg.leader() is true, sg.leader() must be unconditionally true and thus the branch can be eliminated.

The fix suggested by this commit is to clarify that the code above is illegal, and must be rewritten as:

uint32_t x  = 0;                   // initialized variable
if (sg.leader()) {
  x = foo();                       // variable is overwritten by the group leader only
}
x = sycl::group_broadcast(sg, x);  // value from group leader is broadcast to other work-items

Changing the declarations of the group functions to accept T& instead of T was considered as an alternative fix. Although it seems promising, there's no way for us (the writers of the SYCL specification) to change the definitions of other backend-specific APIs (e.g., in SPIR-V or CUDA) that may be used to implement these functions. If the backend-specific APIs ultimately take the arguments by value, changing the SYCL interface may not be sufficient to prevent unsafe optimizations.

The group_broadcast and select_from_group functions are unlike the others, in
that they conditionally exchange values between work-items in the group. Some
developers may assume that it is safe for input arguments to be uninitialized
as long as conditions are such that those same values are not exchanged with
another work-item.

However, using an uninitialized value in this way results in undefined behavior
(according to C++), which may result in compilers performing optimizations that
are incompatible with the semantics of group functions.

This commit clarifies that the arguments passed to group_broadcast and
select_from_group must be initialized on all work-items. Although this
clarification is arguably unnecessary (since it is already covered by C++
rules), this is a common enough mistake that stating the precondition
explicitly may help developers who are new to SYCL.
This commit adds a non-normative note to explain why certain group functions
now explicitly require the values from all work-items to be initialized.

The non-normative note appears at the beginning of the group functions section
because this seemed preferable to repeating the note for each overload.
@rolandschulz
Copy link

Changing the declarations of the group functions to accept T& instead of T was considered as an alternative fix. Although it seems promising, there's no way for us (the writers of the SYCL specification) to change the definitions of other backend-specific APIs (e.g., in SPIR-V or CUDA) that may be used to implement these functions. If the backend-specific APIs ultimately take the arguments by value, changing the SYCL interface may not be sufficient to prevent unsafe optimizations.

I don't think this a strong argument against this possible solution. It would be up to the implementation to make sure it is sufficient. An implementation could always implement group_broadcast as

template<class T>
auto group_broadcast(sub_group sg, const T& v) {
    T t();
    if(sg.leader()) t = v;
    backend_broadcast(sg, t)
}

A high quality implementation might want to do something smarter or make sure that the extra branch gets optimized out. But that is purely an implementation problem.

@Pennycook
Copy link
Contributor Author

It would be up to the implementation to make sure it is sufficient. An implementation could always implement group_broadcast as

template<class T>
auto group_broadcast(sub_group sg, const T& v) {
    T t();
    if(sg.leader()) t = v;
    backend_broadcast(sg, t)
}

Hm. I think that might work, but I need to think about it some more.

If we did adopt this as the solution, would you recommend that we change all of the group functions to accept const T& instead of T?

Another of my concerns was that it might be difficult to identify cases where SYCL functions should accept const T&, and that we might make mistakes. Do you think this would only have to impact the group algorithms, because of their collective nature?

@gmlueck
Copy link
Contributor

gmlueck commented Dec 8, 2023

We should think hard about what API breaking ramifications there are if we change these parameters from T to const T&. What existing code would be impacted if we made this change, even if it is a weird scenario?

Here's one to get us started:

  • Any code that specifically gets the type of the function via decltype and then uses the type as a template parameter (e.g. by testing a type trait).

@rolandschulz
Copy link

rolandschulz commented Dec 8, 2023

Do you think this would only have to impact the group algorithms, because of their collective nature?

Not because of their collective nature. A user could write something like:

float data;
if (mask) data = load(...);
data = some_math(data); //e.g. sycl::cos or user function
if (mask) store(data, ...);

This would also be UB. The special thing about the collective SYCL function is that they take values they might not use. And therefore it is unintuitive if one doesn't know that pass-by-value counts as use. Also call-by-value and call-by-reference is indistinguishable looking at just the caller. Making the problem invisible without knowing the callee signature.

Of course a user might write a some_math(value, mask) and inside use value only if mask is true. If they then pass value by-value they have the same problem.

What is unique about the a some of the collective functions among the existing SYCL functions is that they are the only ones which don't always use their pass-by-value parameter and therefore have this unintuitive "pass-by-value counts as use" problem.

I don't like the current solution. If we were to clarify C++ we shouldn't do that for the group function but make this a general note because it also impacts user functions. But I don't think the SYCL spec is a good place to teach C++ and even if we wanted to it seems unclear why this particular C++ quirk needs to be called out. In particular because isn't particularly dangerous one, because if users follow best practice (compile with -Wall), they get a warning.

One could argue that it is bad practice for any function (both SYCL collective and user functions like some_math(v, mask)) to pass-by-value for inputs which aren't always used. We could add a note for the affected collective functions that for historic reasons they follow this bad practice and therefore have the unititive C++ use behavior.

If we were to change the collective functions we should either only do it for some or clarify that they never short-circuit. Because it wouldn't be obvious anymore that all_of or reduce don't short-circuit.

Any code that specifically gets the type of the function via decltype and then uses the type as a template parameter (e.g. by testing a type trait).

This can be generalized: Any code that uses the function type in way that that it matters that the argument is a reference. Instead of decltype you can get the function type in other ways (e.g. template deduction). Not all uses of the function type would be a problem. E.g. it wouldn't matter if it's only used to get the return type.

@nliber
Copy link
Collaborator

nliber commented Dec 9, 2023

Clarifying for myself: x is only read inside of group_broadcast(...) if sg.leader() is true.

If we change it to passing by reference, it is a semantic change. If someone is passing a global or static and they change its value somewhere else (worse, now as a race condition :-(), the change is noticed. That being said, IDK if this affects anything but contrived code.

@Pennycook's solution adds default constructibility as a requirement on T, which isn't required by trivially copyable. That being said, IDK if this affects anything but contrived code either. Also, it should use {} and not () so that aggregates are also default constructed.

And all that being said, it really is a C++ gotcha. :-(. It is likely to be addressed in C++26 by P2795 Erroneous behaviour for uninitialized reads, as x will be initialized and it will still be a bug for pass-by-value to copy it (but well defined and not UB). That is the the solution I'm leaning towards (assuming the paper passes and the likelihood of being backported into previous standards modes).

@Pennycook
Copy link
Contributor Author

This is proving to be much more complicated than I expected.

It sounds to me like the T& solution may be more trouble than it's worth:

  • Nobody likes API breaks.
  • Default constructible is more than is required for the semantics of a shuffle or a broadcast. It may break existing code, and may make it harder for people to move to SYCL from interfaces where trivially copyable is sufficient.
  • The possibility of race conditions would make the specification very complicated, and it's not immediately obvious to me how we would describe the behavior. We'd (probably) need to say either that the value of T& x is read when the function is called, or read when all the work-items have arrived. The read in the first case would have to be unconditional (because we don't know what other work-items will do) so would be identical to the T x version. In the second case, the user would have no control over when the read actually happens, which may be surprising.

If we leave things as they are and wait for P2795, what should the warning note look like? How would people feel about encouraging implementations to issue a warning in this case? Or should that be left entirely to quality of implementation?

@nliber
Copy link
Collaborator

nliber commented Dec 12, 2023

I think the change is more trouble than it is worth. IMO finding workarounds for C++ gotchas is a losing battle.

Other than large things like buffers and arrays, users should be initializing all their variables. Even without optimizations, such initialization is typically in the noise.

As for encouraging implementations to warn, I'm pretty sure gcc and clang already do with -Wuninitialized or -Wmaybe-uninitialized (both part of -Wall).

@Pennycook
Copy link
Contributor Author

As for encouraging implementations to warn, I'm pretty sure gcc and clang already do with -Wuninitialized or -Wmaybe-uninitialized (both part of -Wall).

You're right, they do. But you have to opt in to those. I was imagining that it might be a good idea for SYCL compilers to go beyond that, and issue a warning in this case unconditionally... But the more I think about it, the more I think it would be weird to talk about this in the specification.

I still think there's a good chance that developers migrating to SYCL from other languages (e.g. CUDA) will encounter this and be confused. But perhaps the right fix there is to point this out during training or as part of dedicated migration resources.

@Pennycook Pennycook closed this Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants