This is an archive of the discontinued LLVM Phabricator instance.

[release/16.x][libc++] Revert the bitset sort optimization
ClosedPublic

Authored by ldionne on Mar 20 2023, 7:18 AM.

Details

Reviewers
nilayvaish
jdoerfert
ldionne
Group Reviewers
Restricted Project
Summary

Part of the bitset sort optimization can cause code using an invalid
comparator to go out-of-bounds, which is potentially dangerous. This
was an unintended consequence of the original patch, or at least the
fact that we didn't have anything in place to catch that or announce
that to vendors was unintended.

More specifically, std::sort assumes that the comparison predicate is a
strict weak order (which is a pre-condition to the algorithm). However,
in practice, some predicates may not satisfy the pre-condition in subtle
ways. While that is technically a user problem, we need a good answer
for how to handle this given that we know this exists in the wild and
unconditionally reading out-of-bounds data is not a great answer. We
can't weaken the preconditions of the algorithm, however letting this
situation happen without even a way of diagnosing the issue is pretty bad.

Hence, this patch reverts the optimization from release/16.x so that we
can buy a bit of time to figure out how to properly handle and communicate
this change in the LLVM 17 release.

The bitset sort optimization was 4eddbf9f10a (aka https://llvm.org/D122780).
This patch is not a pure revert of the patch, I also fixed up the includes
a bit.

Diff Detail

Event Timeline

ldionne created this revision.Mar 20 2023, 7:18 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 20 2023, 7:18 AM
ldionne requested review of this revision.Mar 20 2023, 7:18 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 20 2023, 7:18 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
ldionne updated this revision to Diff 506603.Mar 20 2023, 8:28 AM

Fix clang-tidy

nilayvaish accepted this revision.Mar 20 2023, 8:31 AM
ldionne planned changes to this revision.Mar 20 2023, 9:34 AM

Okay so unfortunately, that didn't fix the issue like we were expecting. I tried a few things and the only thing that reliably fixed the issue we were seeing was reverting 4eddbf9f10a6d1881c93d84f4363d6d881daf848 (aka D122780) entirely. I think I will need to get a reproducer before deciding what's the next step (i.e. whether the issue is sufficiently vexing that it's worth reverting or not).

Okay so unfortunately, that didn't fix the issue like we were expecting. I tried a few things and the only thing that reliably fixed the issue we were seeing was reverting 4eddbf9f10a6d1881c93d84f4363d6d881daf848 (aka D122780) entirely. I think I will need to get a reproducer before deciding what's the next step (i.e. whether the issue is sufficiently vexing that it's worth reverting or not).

Perhaps the broken code depends on the ordering of equivalent elements in the output. One idea to try would be to use std::stable_sort, though that may also change the ordering of equivalent elements from what was emitted by previous implementation of std::sort.

ldionne updated this revision to Diff 508758.Mar 27 2023, 12:32 PM
ldionne retitled this revision from [libc++] Disable the __insertion_sort_unguarded optimization to [release/16.x][libc++] Revert the bitset sort optimization.
ldionne edited the summary of this revision. (Show Details)

After some playing around and talking to other core contributors, I believe that the safest thing to do here is to revert this optimization entirely from LLVM 16, and then figure out a good way forward for LLVM 17.

Update the patch to revert the original optimization entirely.

@tstellar @thieta This patch is a bit special -- it is done on top of release/16.x directly so I wouldn't go through the usual cherry-picking process. Once it is reviewed, should I wait for your approval to land it manually on release/16.x?

To summarize the situation, we noticed some issues after adopting a version of libc++ that included D122780 and we took the decision to pull it out downstream until we can figure out how to handle this appropriately. After discussing it with some core libc++ contributors on Discord, we agreed that pulling it out of release/16.x was the best and safest way forward as well. We are not pulling it out of LLVM 17, instead we will find a way to handle the issue appropriately with a forward fix. But for LLVM 16, there's not enough time and we want to take the safest path forward.

@ldionne You can commit this directly to release/16.x when you are ready. Just ping @thieta or myself when you do this so we can sync with the llvm-release-prs repo.

Per https://discord.com/channels/636084430946959380/654052269301694465/1089997316246737068, I'll push it to a fork of llvm under my GitHub account and then in a issue comment write /branch ldionne/llvm-project/branch,

To catch most issues, we introduced __debug_randomize_range which is present on line 973 on the left. That can be advised to run tests. We found most out of bound bugs at Google with randomization of the input

Also to catch all strict weak ordering bugs, there are O(n^2) algorithms, for example, here. We can introduce a debug check with a square root of elements sampling. Would you be up to do this? Given how long range randomization took, I decided not to go into that direction as it seemed even harder to introduce safely. I can prioritize this work if there will be enough support from the maintainers

Thanks for chiming in. So under the Debug mode we also currently have something that will check that the comparator is consistent. It doesn't change the complexity of the algorithm, but it does double the number of times the comparator is called. The debug mode is not really "productized" right now, turning it on also turns on other checks that are less desirable because they require taking a global lock, are incompatible with constexpr, etc.

Shuffling the range wouldn't help here -- the problem wasn't non-determinism in the output, it's that we literally go and read out-of-bounds when the comparator is wrong. I think this is happening in __insertion_sort_unguarded, and I think I can add an assertion check just in there to "fix" the issue but I haven't had time to get the reproducer integrated in our test suite yet.

ldionne updated this revision to Diff 508984.Mar 28 2023, 6:31 AM

As a fly-by, also export some missing env vars in the CI jobs to allow running CI on release/16.x

ldionne updated this revision to Diff 508990.Mar 28 2023, 6:50 AM

Fix path to clang-format

ldionne updated this revision to Diff 509154.Mar 28 2023, 3:13 PM

Rebase on top of D147065

danlark added a comment.EditedMar 28 2023, 4:30 PM

Shuffling the range wouldn't help here -- the problem wasn't non-determinism in the output, it's that we literally go and read out-of-bounds when the comparator is wrong. I think this is happening in __insertion_sort_unguarded, and I think I can add an assertion check just in there to "fix" the issue but I haven't had time to get the reproducer integrated in our test suite yet.

In the previous implementation, out of bounds is possible, see Howard Hinnant on stackoverflow. The sort will sigsegv with 31 elements with the comparator that always returns true.

In this patch these lines start with 561 on the right. New implementation triggers unguarded optimaztion with fewer elements. That's the root cause. The rollback likely only hides the problem for customers which sort fewer elements.

FWIW, shuffling can help finding these problems, say, pivot takes NaN and thus it can go into the unguarded case.

Shuffling the range wouldn't help here -- the problem wasn't non-determinism in the output, it's that we literally go and read out-of-bounds when the comparator is wrong. I think this is happening in __insertion_sort_unguarded, and I think I can add an assertion check just in there to "fix" the issue but I haven't had time to get the reproducer integrated in our test suite yet.

In the previous implementation, out of bounds is possible, see Howard Hinnant on stackoverflow. The sort will sigsegv with 31 elements with the comparator that always returns true.

Yeah, I was aware of this and in fact I have a downstream bug report about std::sort mis-behaving on floats containing NaNs, but I wasn't aware of that specific SO question, thanks for pointing it out.

In this patch these lines start with 561 on the right. New implementation triggers unguarded optimaztion with fewer elements. That's the root cause. The rollback likely only hides the problem for customers which sort fewer elements.

FWIW, shuffling can help finding these problems, say, pivot takes NaN and thus it can go into the unguarded case.

This patch is not so much about fixing the underlying issue, which would require more work than we can do in the LLVM 16 time frame, given that it's already been released. This patch is about not rocking the boat too much for release/16.x, since this change in behavior was arguably not planned (BTW if it was conscious, it could have been communicated during the original review). In other words, I am OK with shipping the same slightly broken std::sort that we've been shipping for a long time since if users have incorrect comparators, their code seems to be working fine with it. I am not blessing their code, but at least we're not making an active change that's going to break them or introduce serious issues we could be held responsible for.

Now, I am OK with making an active change that will break them, but only if we have a decent story for catching these potentially serious issues like I'm proposing in D147089.

Does that make sense?

ldionne updated this revision to Diff 509347.Mar 29 2023, 6:48 AM

Sort includes to appease clang-tidy

Thanks for a thorough response and productive discussion. LGTM

ldionne accepted this revision.Mar 29 2023, 2:15 PM
ldionne added subscribers: philnik, Mordante.

Accepting on behalf of the project and based on the discussions I've had with @philnik and @Mordante. This is the responsible thing to do for LLVM 16.

This revision is now accepted and ready to land.Mar 29 2023, 2:15 PM
ldionne closed this revision.Apr 19 2023, 8:06 AM

Closing since this has been merged.