[pull] main from llvm:main #5546

Open
wants to merge 2,012 commits into
base: main
pull[bot]

@pull pull bot commented Jan 16, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull bot added the ⤵️ pull label Jan 16, 2025
arsenm and others added 29 commits January 30, 2025 20:42
…124111)

Given the rest of the pass just gives up when it needs to compose
subregisters, folding a subregister extract directly into a reg_sequence
is counterproductive. Later fold attempts in the function will give up
on the subregister operand, preventing looking up through the reg_sequence.

It may still be profitable to do these folds if we start handling
the composes. There are some test regressions, but this mostly
looks better.
#124224)

Set the starting index in the constructor instead of treating
0 as a special case. There should also be no need for bounds
checking in the rewrite.
The verifier does not allow reg_sequence to have subregister defs,
even if undef.
…" (#123945)

This reverts commit 22561cf and fixes
b7b9ccf (#112079).

The problem is that x86_64 and Arm 32-bit have memory regions above the
stack that are readable but not writeable. First Arm:
```
(lldb) memory region --all
<...>
[0x00000000fffcf000-0x00000000ffff0000) rw- [stack]
[0x00000000ffff0000-0x00000000ffff1000) r-x [vectors]
[0x00000000ffff1000-0xffffffffffffffff) ---
```
Then x86_64:
```
$ cat /proc/self/maps
<...>
7ffdcd148000-7ffdcd16a000 rw-p 00000000 00:00 0                          [stack]
7ffdcd193000-7ffdcd196000 r--p 00000000 00:00 0                          [vvar]
7ffdcd196000-7ffdcd197000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
```
Compare this to AArch64 where the test did pass:
```
$ cat /proc/self/maps
<...>
ffffb87dc000-ffffb87dd000 r--p 00000000 00:00 0                          [vvar]
ffffb87dd000-ffffb87de000 r-xp 00000000 00:00 0                          [vdso]
ffffb87de000-ffffb87e0000 r--p 0002a000 00:3c 76927217                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
ffffb87e0000-ffffb87e2000 rw-p 0002c000 00:3c 76927217                   /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
fffff4216000-fffff4237000 rw-p 00000000 00:00 0                          [stack]
```
To solve this, look up the memory region of the stack pointer (using
https://lldb.llvm.org/resources/lldbgdbremote.html#qmemoryregioninfo-addr)
and constrain the read to within that region, since we know the stack is
all readable and writeable.
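
A minimal sketch of the clamping step described above, not the actual LLDB code: the `AddressRange` struct and the `clampReadToRegion` helper are hypothetical names introduced here for illustration. It assumes the region containing the stack pointer has already been looked up, and it shrinks the requested read so it never crosses the end of that region.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Half-open address range [begin, end). Hypothetical type for this sketch.
struct AddressRange {
  std::uint64_t begin;
  std::uint64_t end;
};

// Clamp a read of `size` bytes starting at `sp` to the memory region that
// contains `sp`. Returns std::nullopt if `sp` is not inside `region`.
std::optional<AddressRange> clampReadToRegion(std::uint64_t sp,
                                              std::uint64_t size,
                                              AddressRange region) {
  if (sp < region.begin || sp >= region.end)
    return std::nullopt;
  // Compute the remaining bytes first so that sp + size cannot overflow.
  std::uint64_t available = region.end - sp;
  std::uint64_t length = std::min(size, available);
  return AddressRange{sp, sp + length};
}
```

With the Arm layout shown earlier, a read that starts inside `[stack]` and would run into `[vectors]` is cut off at `0xffff0000`.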

I have also added skipIfRemote to the tests, since getting them working
in that context is too complex to be worth it.

Memory write failures now display the range they tried to write, and
register write errors will show the name of the register where possible.

The patch also includes a workaround for an issue where the test code
could mistake an `x` response that happens to begin with an `O` for an
output packet (stdout). This workaround will not be necessary once we
start using the [new
implementation](https://discourse.llvm.org/t/rfc-fixing-incompatibilties-of-the-x-packet-w-r-t-gdb/84288)
of the `x` packet.

---------

Co-authored-by: Pavel Labath <[email protected]>
…125061)

This is left over from the old way reductions were implemented.
OpenMPVarMappingStackFrame doesn't actually do anything anymore so these
uses can go away.
#124964)

Check the canonical type in the matchers to handle aliases.
For example std::optional uses add_pointer_t<...>.
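
As a language-level illustration of why the canonical type matters (this is not the matcher code from the patch): an alias such as `std::add_pointer_t<T>` is only a different spelling of `T *`, so a check that looks at the type as written can miss it. The `Ptr` alias below is a hypothetical name used only for this sketch.

```cpp
#include <type_traits>

// Hypothetical alias for illustration: spelled differently, same type.
template <typename T>
using Ptr = std::add_pointer_t<T>;

// The alias and the plain pointer are the same underlying type.
static_assert(std::is_same_v<Ptr<int>, int *>,
              "alias resolves to a plain pointer");
static_assert(std::is_pointer_v<std::add_pointer_t<const char>>,
              "add_pointer_t yields a pointer type");
```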
…tor loads (#123081)

getRegAllocationHints looks for ZPR2StridedOrContiguous load instructions
which are used by FORM_TRANSPOSED_REG_TUPLE pseudos and adds all
strided registers from this class to the list of hints.
This patch changes getRegAllocationHints to restrict this list:
  - If the pseudo uses ZPRMul class, the first load must begin with a register
    which is a multiple of 2 or 4.
  - Only add a hint if it is part of a sequence of registers that do not already
    have any live intervals.

This also contains changes to suggest hints when the load instructions and
the FORM_TRANSPOSED pseudo use multi-vectors of different lengths,
e.g. a pseudo with a 4-vector sequence of registers formed of one column
extracted from four 2-vector loads.
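
The two restrictions above can be sketched on a simplified model. This is not the code from `getRegAllocationHints`: registers are modelled as plain numbers, the tuple is assumed to occupy consecutive registers, and `filterHints`, `requireMultiple` and `busy` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical model: `busy` holds registers that already have a live
// interval. Returns the candidate start registers that survive both checks.
// `tupleSize` is expected to be 2 or 4.
std::vector<unsigned> filterHints(const std::vector<unsigned> &candidates,
                                  unsigned tupleSize, bool requireMultiple,
                                  const std::set<unsigned> &busy) {
  std::vector<unsigned> hints;
  for (unsigned start : candidates) {
    // Restriction 1: the first register must be a multiple of the tuple size.
    if (requireMultiple && start % tupleSize != 0)
      continue;
    // Restriction 2: every register in the sequence must be free.
    bool allFree = true;
    for (unsigned i = 0; i < tupleSize; ++i) {
      if (busy.count(start + i) != 0) {
        allFree = false;
        break;
      }
    }
    if (allFree)
      hints.push_back(start);
  }
  return hints;
}
```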
The compilation was failing because `triple` is an `Xclang` flag.
The failure was hidden by the XFAIL.
An argument graph node without uses forms a trivial SCC, which will
already be handled by the preceding branch.

If a node in the SCC points to a node with empty uses, then it will
be part of a different SCC, and as such assumed to be capturing
if it does not have an attribute. There is no need to handle them
separately.
If it's not the callee operand, it must be a data operand.
… demanded (#124066)

The motivation for this is to allow reducing the vl when a user is a
ternary pseudo, where the third operand is tied and also acts as a
passthru.

When checking the users of an instruction, we currently bail if the user
is used as a passthru because all of its elements past vl will be used
for the tail.

We can allow passthru users if we know the tail of their result isn't
used, which we will have computed beforehand after #124530.

It's worth noting that this holds regardless of the tail policy,
because tail agnostic still ends up using the passthru.

I've checked that SPEC CPU 2017 + llvm-test-suite pass with this (on
qemu with rvv_ta_all_1s=true)

Fixes #123760
)

Enable the option under opt-for-speed. Elementals with shapes
like `(0, HUGE)` should run faster.
This patch inlines hlfir.reshape for simple cases, such as
when there is no ORDER argument; and when PAD is present,
only the trivial types are handled.
…lue (#125059)

The code in `translateToExtendedValue(hlfir::Entity)` was not getting
rid of the fir.box for scalars because isSimplyContiguous() returned
false for them.

This created issues downstream because utilities using
fir::ExtendedValue were not implemented to work with intrinsic scalars
fir.box.

fir.box of intrinsic scalars are not very commonly used as hlfir::Entity
but they are allowed and should work where accepted.
)

This PR optimizes the performance of `std::ranges::copy` and
`std::ranges::copy_n` specifically for `vector<bool>::iterator`,
addressing a subtask outlined in issue #64038. The optimizations yield
performance improvements of up to **2000x** for aligned copies and
**60x** for unaligned copies. Additionally, new tests have been added to
validate these enhancements.
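
To give an intuition for where the aligned-case speedup comes from, here is a minimal sketch of copying bits a word at a time instead of one bit per iteration. It is not libc++'s implementation: the `Word` type, the 64-bit word size and the `copyBitsAligned` helper are assumptions made for illustration, and both buffers are assumed to start at bit 0 of their first word and to be large enough.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Word = std::uint64_t;
constexpr std::size_t kBitsPerWord = 64;

// Copy the first `n` bits from `src` to `dst`. Bits of `dst` beyond `n`
// are preserved.
void copyBitsAligned(const std::vector<Word> &src, std::vector<Word> &dst,
                     std::size_t n) {
  std::size_t fullWords = n / kBitsPerWord;
  for (std::size_t i = 0; i < fullWords; ++i)
    dst[i] = src[i]; // moves 64 bits per iteration instead of 1
  std::size_t tailBits = n % kBitsPerWord;
  if (tailBits != 0) {
    Word mask = (Word{1} << tailBits) - 1; // low `tailBits` bits set
    dst[fullWords] = (dst[fullWords] & ~mask) | (src[fullWords] & mask);
  }
}
```

The unaligned case needs extra shifting and masking per word, which is consistent with the smaller improvements reported for it below.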


- Aligned source-destination bits

ranges::copy
```
--------------------------------------------------------------------------
Benchmark                                Before        After   Improvement
--------------------------------------------------------------------------
bm_ranges_copy_vb_aligned/8              10.8 ns      1.42 ns           8x
bm_ranges_copy_vb_aligned/64             88.5 ns      2.28 ns          39x
bm_ranges_copy_vb_aligned/512             709 ns      1.95 ns         364x
bm_ranges_copy_vb_aligned/4096           5568 ns      5.01 ns        1111x
bm_ranges_copy_vb_aligned/32768         44754 ns      38.7 ns        1156x
bm_ranges_copy_vb_aligned/65536         91092 ns      73.2 ns        1244x
bm_ranges_copy_vb_aligned/102400       139473 ns       127 ns        1098x
bm_ranges_copy_vb_aligned/106496       189004 ns      81.5 ns        2319x
bm_ranges_copy_vb_aligned/110592       153647 ns      71.1 ns        2161x
bm_ranges_copy_vb_aligned/114688       159261 ns      70.2 ns        2269x
bm_ranges_copy_vb_aligned/118784       181910 ns      73.5 ns        2475x
bm_ranges_copy_vb_aligned/122880       174117 ns      76.5 ns        2276x
bm_ranges_copy_vb_aligned/126976       176020 ns      82.0 ns        2147x
bm_ranges_copy_vb_aligned/131072       180757 ns       137 ns        1319x
bm_ranges_copy_vb_aligned/135168       190342 ns       158 ns        1205x
bm_ranges_copy_vb_aligned/139264       192831 ns       103 ns        1872x
bm_ranges_copy_vb_aligned/143360       199627 ns      89.4 ns        2233x
bm_ranges_copy_vb_aligned/147456       203881 ns      88.6 ns        2301x
bm_ranges_copy_vb_aligned/151552       213345 ns      88.4 ns        2413x
bm_ranges_copy_vb_aligned/155648       216892 ns      92.9 ns        2335x
bm_ranges_copy_vb_aligned/159744       222751 ns      96.4 ns        2311x
bm_ranges_copy_vb_aligned/163840       225995 ns       173 ns        1306x
bm_ranges_copy_vb_aligned/167936       235230 ns       202 ns        1165x
bm_ranges_copy_vb_aligned/172032       244093 ns       131 ns        1863x
bm_ranges_copy_vb_aligned/176128       244434 ns       111 ns        2202x
bm_ranges_copy_vb_aligned/180224       249570 ns       108 ns        2311x
bm_ranges_copy_vb_aligned/184320       254538 ns       108 ns        2357x
bm_ranges_copy_vb_aligned/188416       261817 ns       113 ns        2317x
bm_ranges_copy_vb_aligned/192512       269923 ns       125 ns        2159x
bm_ranges_copy_vb_aligned/196608       273494 ns       210 ns        1302x
bm_ranges_copy_vb_aligned/200704       280035 ns       269 ns        1041x
bm_ranges_copy_vb_aligned/204800       293102 ns       231 ns        1269x
```

ranges::copy_n
```
--------------------------------------------------------------------------
Benchmark                                Before        After   Improvement
--------------------------------------------------------------------------
bm_ranges_copy_n_vb_aligned/8            11.8 ns       0.89 ns         13x
bm_ranges_copy_n_vb_aligned/64           91.6 ns       2.06 ns         44x
bm_ranges_copy_n_vb_aligned/512           718 ns       2.45 ns        293x
bm_ranges_copy_n_vb_aligned/4096         5750 ns       5.02 ns       1145x
bm_ranges_copy_n_vb_aligned/32768       45824 ns       40.9 ns       1120x
bm_ranges_copy_n_vb_aligned/65536       92267 ns       73.8 ns       1250x
bm_ranges_copy_n_vb_aligned/102400     143267 ns       125 ns        1146x
bm_ranges_copy_n_vb_aligned/106496     148625 ns      82.4 ns        1804x
bm_ranges_copy_n_vb_aligned/110592     154817 ns      72.0 ns        2150x
bm_ranges_copy_n_vb_aligned/114688     157953 ns      70.4 ns        2244x
bm_ranges_copy_n_vb_aligned/118784     162374 ns      71.5 ns        2270x
bm_ranges_copy_n_vb_aligned/122880     168638 ns      72.9 ns        2313x
bm_ranges_copy_n_vb_aligned/126976     175596 ns      76.6 ns        2292x
bm_ranges_copy_n_vb_aligned/131072     181164 ns       135 ns        1342x
bm_ranges_copy_n_vb_aligned/135168     184697 ns       157 ns        1176x
bm_ranges_copy_n_vb_aligned/139264     191395 ns       104 ns        1840x
bm_ranges_copy_n_vb_aligned/143360     194954 ns      88.3 ns        2208x
bm_ranges_copy_n_vb_aligned/147456     208917 ns      86.1 ns        2426x
bm_ranges_copy_n_vb_aligned/151552     211101 ns      87.2 ns        2421x
bm_ranges_copy_n_vb_aligned/155648     213175 ns      89.0 ns        2395x
bm_ranges_copy_n_vb_aligned/159744     218988 ns      86.7 ns        2526x
bm_ranges_copy_n_vb_aligned/163840     225263 ns       156 ns        1444x
bm_ranges_copy_n_vb_aligned/167936     230725 ns       184 ns        1254x
bm_ranges_copy_n_vb_aligned/172032     235795 ns       119 ns        1981x
bm_ranges_copy_n_vb_aligned/176128     241145 ns       101 ns        2388x
bm_ranges_copy_n_vb_aligned/180224     250680 ns      99.5 ns        2519x
bm_ranges_copy_n_vb_aligned/184320     262954 ns      99.7 ns        2637x
bm_ranges_copy_n_vb_aligned/188416     258584 ns       103 ns        2510x
bm_ranges_copy_n_vb_aligned/192512     267190 ns       125 ns        2138x
bm_ranges_copy_n_vb_aligned/196608     270821 ns       213 ns        1271x
bm_ranges_copy_n_vb_aligned/200704     279532 ns       262 ns        1067x
bm_ranges_copy_n_vb_aligned/204800     283412 ns       222 ns        1277x
```

- Unaligned source-destination bits
```
--------------------------------------------------------------------------------
Benchmark                                    Before           After  Improvement
--------------------------------------------------------------------------------
bm_ranges_copy_vb_unaligned/8               12.8 ns         8.59 ns         1.5x
bm_ranges_copy_vb_unaligned/64              98.2 ns         8.24 ns          12x
bm_ranges_copy_vb_unaligned/512              755 ns         18.1 ns          42x
bm_ranges_copy_vb_unaligned/4096            6027 ns          102 ns          59x
bm_ranges_copy_vb_unaligned/32768          47663 ns          774 ns          62x
bm_ranges_copy_vb_unaligned/262144        378981 ns         6455 ns          59x
bm_ranges_copy_vb_unaligned/1048576      1520486 ns        25942 ns          59x
bm_ranges_copy_n_vb_unaligned/8             11.3 ns         8.22 ns         1.4x
bm_ranges_copy_n_vb_unaligned/64            97.3 ns         7.89 ns          12x
bm_ranges_copy_n_vb_unaligned/512            747 ns         18.1 ns          41x
bm_ranges_copy_n_vb_unaligned/4096          5932 ns         99.0 ns          60x
bm_ranges_copy_n_vb_unaligned/32768        47776 ns          749 ns          64x
bm_ranges_copy_n_vb_unaligned/262144      378802 ns        6576 ns           58x
bm_ranges_copy_n_vb_unaligned/1048576    1547234 ns       26229 ns           59x
```
…pport (#123149)

There are now areas where we may have either a ModuleOp or a GPUModuleOp,
both of which can carry a DataLayout, and where we may need the DataLayout
utilities. I've therefore taken the liberty of extending those utilities
for use with both.

Those with more knowledge of how they wish the GPUModuleOp's to interact
with their parent ModuleOp's DataLayout may have further alterations
they wish to make in the future, but for the moment, it'll simply
utilise the basic data layout construction which I believe combines
parent and child datalayouts from the ModuleOp and GPUModuleOp. If there
is no GPUModuleOp DataLayout it should default to the parent ModuleOp.

It's worth noting there is some weirdness if you have two module
operations defining builtin dialect DataLayout entries: it appears the
combinatorial functionality for DataLayouts doesn't support merging
them.

This behaviour is useful for areas like:
https://github.com/llvm/llvm-project/pull/119585/files#diff-19fc4bcb38829d085e25d601d344bbd85bf7ef749ca359e348f4a7c750eae89dR1412
where we have a crossroads between the two different module operations.
…IR branching error (#123771)

Currently if we generate code for the below target data map that uses an
optional mapping:

       !$omp target data if(present(a)) map(alloc:a)
            do i = 1, 10
                a(i) = i
            end do
       !$omp end target data

We yield an LLVM-IR error because the branch for the else path is not
generated. This occurs because we enter the NoDupPriv path of the callback
function when generating the else branch, but emitBranch needs the builder
to be set to a block in order to generate and link in a follow-up branch.
The NoDupPriv path currently doesn't do this. While it's not supposed to
generate anything (as far as I am aware), we still need to at least set the
builder's placement back so that it emits the appropriate follow-up branch.
This avoids the missing terminator LLVM-IR verification error by correctly
generating the follow-up branch.
Oversight found by the ISel fuzzing effort. The code assumed the argument is
a register, but in some cases it can be an immediate. Tablegen's type for the
instruction is SSrc_b32, so either a register or an immediate is fine. Added
the repro from the bug reporter as a test case; prior to this patch llvm
will assert in getReg.

Fixes SWDEV-508589
…123906)"" (#125091)

Reverts #123945

Has failed on the Windows on Arm buildbot:
https://lab.llvm.org/buildbot/#/builders/141/builds/5865
```
********************
Unresolved Tests (2):
  lldb-api :: functionalities/reverse-execution/TestReverseContinueBreakpoints.py
  lldb-api :: functionalities/reverse-execution/TestReverseContinueWatchpoints.py
********************
Failed Tests (1):
  lldb-api :: functionalities/reverse-execution/TestReverseContinueNotSupported.py
```
Reverting while I reproduce locally.
LLVM has two tablegen generators: one in llvm/tblgen.bzl (`gentbl`,
macro-based) and one in mlir/tblgen.bzl (`gentbl_cc_library`,
rule-based). The `gentbl_cc_library` generator in MLIR has some
advantages to being a rule, and at any rate, it seems better to just use
the same tablegen rule everywhere instead of competing implementations.
…24848)

This adds a VP version of an existing DAG combine. I've put it in
RISCVISelLowering since we would need to add an ISD::VP_AVGCEIL opcode
otherwise.

This pattern appears in 525.264_r.
…Transpose perms parameter (#124945)

When consolidating transpose ops into one, use `tosa::ConstOp` for the
permutations parameter instead of `arith::ConstantOp`.
topperc and others added 30 commits January 31, 2025 15:09
This PR fixes the folder of a `vector.shuffle` with constant input
vectors in the presence of a poison index. Partially poison vectors are
currently not supported in UB, so the folder selects v1[0] for elements
indexed by poison.
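
The folding rule described above can be sketched on plain vectors. This is not the MLIR folder itself: the `foldShuffle` helper is a hypothetical name, and the encoding of a poison mask entry as `-1` is an assumption of this sketch. Mask entries index into the concatenation of `v1` and `v2`, `v1` is assumed to be non-empty, and all non-poison indices are assumed to be in range.

```cpp
#include <cstdint>
#include <vector>

// Assumption for this sketch: a poison mask entry is encoded as -1.
constexpr std::int64_t kPoisonIndex = -1;

// Fold a shuffle of two constant vectors into one constant vector.
std::vector<int> foldShuffle(const std::vector<int> &v1,
                             const std::vector<int> &v2,
                             const std::vector<std::int64_t> &mask) {
  std::vector<int> result;
  result.reserve(mask.size());
  const std::int64_t v1Size = static_cast<std::int64_t>(v1.size());
  for (std::int64_t idx : mask) {
    if (idx == kPoisonIndex)
      result.push_back(v1[0]); // poison index: fall back to v1[0]
    else if (idx < v1Size)
      result.push_back(v1[idx]);
    else
      result.push_back(v2[idx - v1Size]);
  }
  return result;
}
```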
…125277)

This enables -mcpu=native for the HiFive Premier P550 board.
…ad results of `linalg.generic` op. (#125141)

This functionality was wrapped within a pattern. Expose it as a
separate transformation function that can be used outside of the pattern
rewrite mechanism.

---------

Signed-off-by: MaheshRavishankar <[email protected]>
Commit f10441a dropped a special case
for isUndefWeak and --no-dynamic-linker but also made --export-dynamic
ineffective for static PIE.

This change restores the --export-dynamic behavior and entirely drops
special handling of --no-dynamic-linker:

* -pie with no input DSO, similar to --no-dynamic-linker, suppresses
  undefined symbols in .dynsym

The new behaviors resemble GNU ld more.
…. NFC

This was checking whether the erase is needed, but erase is safe
to call with equal iterators.
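
A small illustration of the point above, using `std::vector` as a stand-in since the container touched by the commit is not shown here. The `truncateFrom` helper is a hypothetical name; it assumes `from` is at most `values.size()`.

```cpp
#include <cstddef>
#include <vector>

// Remove all elements from position `from` onward.
void truncateFrom(std::vector<int> &values, std::size_t from) {
  auto first = values.begin() + static_cast<std::ptrdiff_t>(from);
  // No `if (first != values.end())` guard is needed: erasing the empty
  // range [first, first) is a valid no-op.
  values.erase(first, values.end());
}
```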
This change updates the Float to TF32 conversion MLIR Op to include
lowering to the new intrinsics introduced in sm_100 through ptx8.6:

- `nvvm_f2tf32_rn_satfinite`
- `nvvm_f2tf32_rn_relu_satfinite`
- `nvvm_f2tf32_rz_satfinite`
- `nvvm_f2tf32_rz_relu_satfinite`

PTX Spec Reference:

https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt
)

Consider the following pattern:
```
%cmp = fcmp <pred> double %x, 0.000000e+00
%negX = fneg <fmf> double %x
%sel = select i1 %cmp, double %x, double %negX
```
We cannot propagate ninf from fneg to select since `%negX` may not be
chosen. Similarly, we cannot propagate nnan unless `%negX` is guaranteed
to be selected when `%x` is NaN.
This patch also propagates nnan/ninf from fcmp to avoid regression in
`PhaseOrdering/generate-fabs.ll`.

Alive2: https://alive2.llvm.org/ce/z/t6U-tA
Closes #121430 and
#113989.
This patch fixes:

  llvm/lib/Analysis/ValueTracking.cpp:116:27: error: unused function
  'safeCxtI' [-Werror,-Wunused-function]
This patch adds a default constructor to BlockFlags to initialize its
members to false, placing initializers close to the member
declarations.

Note that once C++20 is available in our codebase, we can replace
the explicit default constructor with:

  bool Reachable : 1 = true;
  :
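
For comparison, a minimal sketch of the pre-C++20 form, following the prose above in which the members start out as false. `BlockFlagsSketch` and the `Visited` member are hypothetical stand-ins introduced for illustration, not the actual struct from the patch.

```cpp
// Hypothetical stand-in for the struct described above.
struct BlockFlagsSketch {
  bool Reachable : 1;
  bool Visited : 1; // illustrative member, not from the patch

  // Before C++20, bit-fields cannot carry default member initializers, so
  // an explicit default constructor sets them instead.
  BlockFlagsSketch() : Reachable(false), Visited(false) {}
};
```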
…rsion. NFC

The code that moves CheckOpcode before CheckType/CheckChildType/RecordDwith
was running after ContractNodes started unwinding its recursion. If a
move occurs we would start a new recursion going forward
through the list again. I don't believe this can lead to any new
combines so it was just wasted work.

This patch moves the code earlier so it doesn't start a new recursion.
Forked from llvm/test/CodeGen/AArch64/arm64-vmovn.ll

Unknown intrinsics which are currently incorrectly handled by
visitInstruction:
- llvm.aarch64.neon.sqxtn
- llvm.aarch64.neon.sqxtun
- llvm.aarch64.neon.uqxtn
… expression (#117437)

Clang currently supports extending the lifetime of objects bound to reference
members of aggregates that are created from a default member initializer.
This PR addresses this change and updates CFG and ExprEngine.

This PR reapplies #91879.
Fixes #93725.

---------

Signed-off-by: yronglin <[email protected]>
DeclareImplicitDeductionGuidesForTypeAlias.

This improves the code readability.
… the same base pointer (#121892)

Alive2: https://alive2.llvm.org/ce/z/P5XbMx
Closes #121890

TODO: It is still safe to perform this transform without nowrap flags if
the corresponding scale factor is 1 byte:
https://alive2.llvm.org/ce/z/J-JCJd
cc @tobiasgrosser @wsmoses

This PR adds some new ops and types to the MLIR MPI dialect. The goal is
to get the minimum required ops here to get a project of ours working and,
if everything works well, to continue adding ops to the MPI dialect in
subsequent PRs until we achieve some level of compliance with the MPI
standard.

---

Things left to do in subsequent PRs:

- Add back the `mpi.comm` type and add it as an optional argument of the
currently implemented ops that should support it (i.e. `send`, `recv`,
`isend`, `irecv`, `allreduce`, `barrier`).
- Support defining custom `MPI_Op`s (the MPI operations, not the
tablegen `MPI_Op`) as regions.
- Add more ops.