Second LLVM Performance Workshop at CGO
- What: Second LLVM Performance Workshop at CGO
- When: Saturday February 24th, 2018
- Where: Vienna, Austria
An LLVM Performance Workshop will be held at CGO 2018. The workshop
is co-located with CC, HPCA, and PPoPP, and takes place at the Austria
Trend Eventhotel Pyramide. If you are interested in attending the
workshop, please register at the CGO conference website.
|Speaker ||Title |
|Maha Kooli, Henri-Pierre Charles, Jean-Philippe Noel and Bastien Giraud ||How to Evaluate "In-Memory Computing" Performances without Hardware Measurements? |
|Arsène Pérard-Gayot, Richard Membarth, Philipp Slusallek, Simon Moll, Roland Leißa and Sebastian Hack ||Optimizing LLVM IR for Guided Vectorization |
|Siddharth Shankar Swain ||Efficient use of memory by reducing size of AST dumps in cross file analysis by clang static analyzer |
|Julian Hammer, Johannes Doerfert, Georg Hager, Gerhard Wellein and Sebastian Hack ||Cache-aware Scheduling and Performance Modeling with LLVM-Polly and Kerncraft |
|Alexander Matz and Holger Fröning ||Enabling Automatic Partitioning of Data-Parallel Kernels with Polyhedral Compilation |
| ||LLVM Q&A Panel: Questions Welcome |
- Julian Hammer, Johannes Doerfert, Georg Hager, Gerhard
Wellein and Sebastian Hack: Cache-aware Scheduling and
Performance Modeling with LLVM-Polly and Kerncraft
LLVM/Polly is the polyhedral optimizer of the LLVM project. While a
serious integration effort is currently under way, Polly still lacks
basic support for essential optimizations. In this work we replace the
fixed tile-size policy employed by Polly with an access- and hardware-
dependent one. In contrast to Polly's scheduling, our tile-size selection
targets spatial instead of temporal locality. The proposed tile-size
selection is based on analytic performance modeling using the Layer
Conditions model, and extended to cope with non-affine accesses and
non-perfectly nested loops, which are found in many real-world codes.
Nevertheless, it is best suited for linear-sequential accesses as found
in stencil computations.
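As a rough illustration of the layer-condition idea behind such a
tile-size selection (a minimal sketch, not taken from the talk; the
cache size, the safety factor, and the 5-point-stencil assumption are
all mine, not the Kerncraft model itself):

```cpp
#include <cstddef>
#include <iostream>

// Simplified layer-condition model: for a 2D 5-point stencil the inner
// loop reuses data from 3 consecutive rows.  Spatial blocking picks a
// tile width so that those rows fit into a fraction of the target cache.
std::size_t tile_width(std::size_t cache_bytes,
                       std::size_t element_bytes,
                       std::size_t reused_rows,      // rows kept resident
                       double safety_factor = 0.5) {
    // Layer condition: reused_rows * tile_width * element_bytes
    //                  <= safety_factor * cache_bytes
    return static_cast<std::size_t>(safety_factor * cache_bytes) /
           (reused_rows * element_bytes);
}

int main() {
    // Example: 32 KiB L1 cache, double-precision data, 3 reused rows.
    std::cout << tile_width(32 * 1024, sizeof(double), 3) << "\n";  // ~682
}
```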
- Maha Kooli, Henri-Pierre Charles, Jean-Philippe Noel and
Bastien Giraud: How to Evaluate "In-Memory Computing"
Performances without Hardware Measurements?
This paper presents a software platform to evaluate the performance of
an In-Memory Computing architecture based on emerging memory that embeds
computing abilities. The platform includes emulation tools that are based
on the Low Level Virtual Machine (LLVM). It makes it possible to
experiment with applications early, before the hardware system is fully
designed, and to generate execution traces. These execution traces are
then analyzed to evaluate the system performance.
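To give a flavour of the kind of trace analysis such a platform enables
(a hedged sketch only; the trace format, the address window, and the
metric below are assumptions of mine, not the authors' tooling):

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Toy trace analysis.  Each line of the (assumed) trace is
// "<L|S> <hex address>".  We count the accesses that fall inside the
// address window mapped to the compute-capable memory.
int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: analyze <trace>\n"; return 1; }
    const std::uint64_t imc_base = 0x80000000ull, imc_size = 0x10000000ull;
    std::uint64_t total = 0, in_imc = 0;
    std::ifstream trace(argv[1]);
    std::string line;
    while (std::getline(trace, line)) {
        std::istringstream in(line);
        char op; std::uint64_t addr;
        if (!(in >> op >> std::hex >> addr)) continue;
        ++total;
        if (addr >= imc_base && addr < imc_base + imc_size) ++in_imc;
    }
    std::cout << in_imc << " of " << total
              << " accesses hit the in-memory-computing region\n";
}
```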
- Arsène Pérard-Gayot, Richard Membarth, Philipp
Slusallek, Simon Moll, Roland Leißa and Sebastian Hack:
Optimizing LLVM IR for Guided Vectorization
Guided vectorization takes a scalar program (operating on a single
element of data) and transforms it into a vectorized program (operating
on multiple elements at once). The performance of the vectorized
program strongly depends on the precision of the analyses performed by
the vectorizing compiler, and the quality of the target code generator.
In particular, these analyses must determine whether an expression is
the same for all lanes (uniform) or not. Since divergent control flow
is expensive, the compiler should ensure that control flow remains
uniform whenever possible. In this talk, we present data layout
transformations and optimizations on LLVM IR that improve both the
analyses and the generated code quality of RV, a state-of-the-art
vectorizing framework. We show that, using RV combined with our
optimizations, auto-vectorized ray-tracing kernels perform within 10%
of manually-vectorized implementations by experts.
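As an illustration of the kind of data-layout change that helps such
analyses (a sketch in plain C++, not the transformation implemented in
RV; all names are made up):

```cpp
#include <cstddef>

// Array-of-structures: lane i must gather ox from non-contiguous memory.
struct RayAoS { float ox, oy, oz, dx, dy, dz; };

// Structure-of-arrays: each field is contiguous, so a vectorized loop
// can use cheap unit-stride vector loads and stores.
struct RaysSoA { float *ox, *oy, *oz, *dx, *dy, *dz; };

// Scalar kernel written against the SoA layout.  `scale` is uniform
// (the same value for all lanes); `i` is varying (different per lane),
// which a guided vectorizer can prove from the loop structure.
void advance(RaysSoA rays, std::size_t n, float scale /* uniform */) {
    for (std::size_t i = 0; i < n; ++i) {   // i: varying
        rays.ox[i] += scale * rays.dx[i];   // unit-stride accesses
        rays.oy[i] += scale * rays.dy[i];
        rays.oz[i] += scale * rays.dz[i];
    }
}
```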
- Siddharth Shankar Swain: Efficient use of memory by
reducing size of AST dumps in cross file analysis by clang static analyzer
Clang SA works well with function calls within a translation unit. When
execution reaches a function implemented in another TU, the analyzer
skips the analysis of the called function's definition. To handle
cross-file bugs, the cross translation unit (CTU) analysis feature was
developed. The CTU model consists of two passes. The first pass dumps
the AST for every translation unit and creates a map from functions to
their corresponding ASTs. In the second pass, when a TU-external
function is reached during the analysis, the location of that function's
definition is looked up in the function definition index and the
definition is imported from the containing AST binary into the caller's
context using the ASTImporter class. During the analysis, we need to
store the dumped ASTs temporarily. For a large code base this can be a
problem, and we have seen in practice that the analysis stops due to
memory shortage. Reducing the size of the ASTs helps not only CTU
analysis but also regular clang SA analysis scale to larger code bases.
We use two methods:
1) Using an outlining method on the source code to find ASTs that share
common factors or subtrees (see the sketch after this list). We throw
away those ASTs that won't match any other AST, thereby reducing the
number of ASTs dumped in the first pass.
2) A tree pruning technique that keeps only those parts of the tree
necessary for cross translation unit analysis and eliminates the rest to
decrease the size of the tree. The necessary parts of the tree can be
found by following the dependency path in the exploded graph, which
contains the instructions that depend on the function call/execution.
Note that only those branches none of whose children are function calls
should be pruned.
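The subtree-sharing idea in method 1) can be illustrated on a generic
tree (a minimal sketch with a toy node type; this is neither the Clang
AST nor the analyzer's actual implementation):

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Structurally identical subtrees hash to the same key, so only one
// copy needs to be serialized; the rest become references.
struct Node {
    std::string kind;
    std::vector<const Node*> children;
};

std::size_t structural_hash(const Node& n,
                            std::unordered_map<const Node*, std::size_t>& memo) {
    if (auto it = memo.find(&n); it != memo.end()) return it->second;
    std::size_t h = std::hash<std::string>{}(n.kind);
    for (const Node* c : n.children)
        h = h * 31 + structural_hash(*c, memo);   // combine child hashes
    memo[&n] = h;
    return h;
}

// Group nodes by hash; groups with more than one member are candidates
// for being stored once, shrinking the serialized dump.
std::unordered_map<std::size_t, std::vector<const Node*>>
find_shared_subtrees(const std::vector<const Node*>& roots) {
    std::unordered_map<const Node*, std::size_t> memo;
    std::unordered_map<std::size_t, std::vector<const Node*>> groups;
    for (const Node* r : roots) groups[structural_hash(*r, memo)].push_back(r);
    return groups;
}
```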
- Alexander Matz and Holger Fröning: Enabling
Automatic Partitioning of Data-Parallel Kernels with Polyhedral Compilation
Data-parallel accelerators are pervasive in today's computing
landscape due to their high energy-efficiency and performance. GPUs,
in particular, are very successful and utilize the
Bulk-Synchronous-Parallel programming model to expose the available
parallelism in an application core to the hardware. Programming a
single GPU using the BSP programming model (in the form of OpenCL and
CUDA) adds moderate complexity and is usually manageable.
If more than a single GPU is to be used, however, all data transfers
and kernel executions have to be orchestrated manually in order to
achieve good performance. This is tedious and error prone. Given the
regular nature of many GPU kernels, this orchestration and the
distribution of work should be possible automatically.
In this talk, we present an approach to automatically partition
single-GPU CUDA applications for execution on multiple GPUs and a
preliminary performance analysis. We use polyhedral compilation for
the extraction of the memory access patterns of GPU kernels and a
light-weight runtime-system to synchronize device buffers and
orchestrate kernel execution. The runtime-system utilizes code
generated by polyhedral compilation to keep track of the state of
device buffers before and after each kernel execution and issues
minimal data movements if required. Partitioned kernels need to be
extended to only compute a subset of the original execution grid. Our
preliminary performance analysis achieves speedups of up to 12x for
three model applications taken from the Berkeley Dwarves.
Although we focus on NVIDIA CUDA applications in this talk, we see no
conceptual difference in applying this approach to alternative
implementations of the BSP programming model (e.g., OpenCL).
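As a rough sketch of what grid partitioning plus footprint computation
looks like for a trivially affine access pattern (an illustration only,
not the authors' runtime system or their polyhedral-generated code; the
names and the 1D layout are assumptions):

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Slice { std::size_t begin, end; };           // [begin, end) of the grid
struct BufferRange { std::size_t first, last; };    // elements touched

// Split a 1D kernel grid evenly across the available devices.
std::vector<Slice> partition_grid(std::size_t grid_size, std::size_t devices) {
    std::vector<Slice> slices;
    std::size_t chunk = (grid_size + devices - 1) / devices;
    for (std::size_t d = 0; d < devices; ++d) {
        std::size_t b = d * chunk;
        slices.push_back({b, std::min(b + chunk, grid_size)});
    }
    return slices;
}

// For an access pattern buf[i + offset] with i in [begin, end), the
// footprint each device must hold; a polyhedral analysis would derive
// this, here it is written by hand.
BufferRange footprint(Slice s, std::size_t offset) {
    return {s.begin + offset, s.end + offset};      // last is exclusive
}

int main() {
    for (Slice s : partition_grid(1 << 20, 4)) {
        BufferRange r = footprint(s, 1);            // e.g. a shifted read
        std::cout << "grid [" << s.begin << "," << s.end << ") needs elements ["
                  << r.first << "," << r.last << ")\n";
    }
}
```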
- William Moses: Tensor Comprehensions
Call for Speakers
We invite speakers from academia and industry to present their work on the
following list of topics (including but not limited to):
- improving performance and size of code generated by LLVM,
- improving performance of LLVM's runtime libraries,
- improving the security of generated code,
- tools developed with LLVM for performance analysis,
- performance tracking over time,
- compiler flags, annotations and remarks to understand and improve
performance,
- any other topic related to improving and maintaining the performance
and quality of LLVM generated code.
While the primary focus of the workshop is on these topics, we welcome any
submission related to the LLVM compiler infrastructure, its sub-projects
(clang, lldb, Polly, ...), as well as its use in industry and academia.
We are looking for:
- keynote speakers,
- technical presentations: 30 minutes plus questions and discussion,
Proposals should provide enough information for the review committee to be
able to judge the quality of the submission. Proposals can be submitted
in the form of an extended abstract, full paper, or slides. Proposals should be
The deadline for receiving submissions is December 22, 2017. Speakers
will be notified of acceptance or rejection by January 5, 2018.
Workshop organization: Johannes Doerfert, Renato Golin, Aditya Kumar,
Sebastian Pop, Hal Finkel, and Tanya Lattner.