Second LLVM Performance Workshop at CGO
  • What: Second LLVM Performance Workshop at CGO
  • When: Saturday February 24th, 2018
  • Where: Vienna, Austria

An LLVM Performance Workshop will be held at CGO 2018. The workshop is co-located with CC, HPCA, and PPoPP. It takes place at the Austria Trend Eventhotel Pyramide in Vienna. If you are interested in attending the workshop, please register at the CGO website.

Schedule

Time        | Room     | Speaker                 | Title
9:15        | Europa 2 | Maha Kooli              | How to Evaluate "In-Memory Computing" Performances without Hardware Measurements? [Abstract]
10:00-10:30 |          |                         | Coffee Break
10:30       | Europa 2 | Arsène Pérard-Gayot     | Optimizing LLVM IR for Guided Vectorization [Abstract]
11:15       | Europa 2 | Siddharth Shankar Swain | Efficient use of memory by reducing size of AST dumps in cross file analysis by clang static analyzer [Abstract]
12:00-13:30 |          |                         | Lunch
13:30       | Europa 2 | Julian Hammer           | Cache-aware Scheduling and Performance Modeling with LLVM-Polly and Kerncraft [Abstract] [Slides]
14:15       | Europa 2 | Alexander Matz          | Enabling Automatic Partitioning of Data-Parallel Kernels with Polyhedral Compilation [Abstract] [Slides]
15:00-15:30 |          |                         | Coffee Break
15:30       | Europa 2 | William Moses           | Tensor Comprehensions [Abstract]
16:15       | Europa 2 |                         | LLVM Q&A Panel: Questions Welcome
17:00       |          |                         | Workshop ends

Abstracts

  • Julian Hammer, Johannes Doerfert, Georg Hager, Gerhard Wellein and Sebastian Hack: Cache-aware Scheduling and Performance Modeling with LLVM-Polly and Kerncraft [Slides]

    LLVM/Polly is the polyhedral optimizer of the LLVM project. While a serious integration effort is currently under way, Polly still lacks basic support for essential optimizations. In this work we replace the fixed tile-size policy employed by Polly with an access- and hardware-dependent one. In contrast to Polly's scheduling, our tile-size selection targets spatial instead of temporal locality. The proposed tile-size selection is based on analytic performance modeling using the Layer Conditions model and is extended to cope with non-affine accesses and non-perfectly nested loops, which are found in many real-world codes. Nevertheless, it is best suited for linear-sequential accesses as found in stencil computations.
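
    To give a flavor of the transformation involved, the following is a minimal, hypothetical C++ sketch of spatial loop tiling for a 2D stencil; it is not Polly's implementation, and the tile size is simply assumed rather than derived from the Layer Conditions model as in the talk.

        #include <algorithm>
        #include <cstddef>
        #include <vector>

        // Spatial blocking of a 2D 5-point stencil over an n x n grid (n >= 3 assumed).
        // A cache-aware policy such as the one described above would derive TILE from
        // the cache size and the layer condition (the rows touched by a tile must fit
        // in cache); here it is a hard-coded assumption.
        void jacobi_tiled(const std::vector<double>& in, std::vector<double>& out,
                          std::size_t n) {
            const std::size_t TILE = 256;  // assumed tile size, not analytically derived
            for (std::size_t jj = 1; jj < n - 1; jj += TILE) {
                const std::size_t jend = std::min(jj + TILE, n - 1);
                for (std::size_t i = 1; i < n - 1; ++i) {
                    for (std::size_t j = jj; j < jend; ++j) {
                        // Blocking the j loop shrinks the working set so the rows read
                        // in the previous i iteration are still in cache when reused.
                        out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                                                 in[i * n + j - 1] + in[i * n + j + 1]);
                    }
                }
            }
        }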

  • Maha Kooli, Henri-Pierre Charles, Jean-Philippe Noel and Bastien Giraud: How to Evaluate "In-Memory Computing" Performances without Hardware Measurements?

    This paper presents a software platform to evaluate the performance of an In-Memory Computing architecture based on an emerging memory that embeds computing abilities. The platform includes emulation tools that are based on the Low Level Virtual Machine (LLVM). It makes it possible to experiment with applications early, before the hardware system is fully designed, and to generate execution traces. These execution traces are then analyzed to evaluate the system's performance.

  • Arsène Pérard-Gayot, Richard Membarth, Philipp Slusallek, Simon Moll, Roland Leißa and Sebastian Hack: Optimizing LLVM IR for Guided Vectorization

    Guided vectorization takes a scalar program (operating on a single element of data) and transforms it into a vectorized program (operating on multiple elements at once). The performance of the vectorized program strongly depends on the precision of the analyses performed by the vectorizing compiler and the quality of the target code generator. In particular, these analyses must determine whether an expression is the same for all lanes (uniform) or not. Since divergent control flow is expensive, the compiler should ensure that control flow remains uniform whenever possible. In this presentation, we describe data layout transformations and optimizations on LLVM IR that improve both the analyses and the generated code quality of RV, a state-of-the-art vectorizing framework. We show that, using RV combined with our optimizations, auto-vectorized ray-tracing kernels perform within 10% of manually vectorized implementations by experts.
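
    As a hedged illustration of the uniform/divergent distinction (not RV's actual API), the following C++ sketch shows a data-parallel loop in which one branch condition is uniform across lanes and another varies per lane.

        // After vectorization, each iteration of the loop body maps to one SIMD lane.
        void kernel(float* out, const float* in, int n, float threshold) {
            for (int i = 0; i < n; ++i) {
                // `threshold > 0.0f` is uniform: it is the same for every lane, so the
                // vectorizer can keep it as a single scalar branch.
                if (threshold > 0.0f) {
                    // `in[i] > threshold` is divergent: it varies per lane, so both sides
                    // must be computed and blended under a mask, which is costlier.
                    out[i] = (in[i] > threshold) ? in[i] : 0.0f;
                } else {
                    out[i] = in[i];
                }
            }
        }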

  • Siddharth Shankar Swain: Efficient use of memory by reducing size of AST dumps in cross file analysis by clang static analyzer

    Clang SA works well with function calls within a translation unit. When execution reaches a function implemented in another TU, the analyzer skips analysis of the called function's definition. To handle cross-file bugs, the cross-translation-unit (CTU) analysis feature was developed. The CTU model consists of two passes. The first pass dumps the AST for every translation unit and creates a map from each function to its corresponding AST. In the second pass, when a TU-external function is reached during the analysis, the location of that function's definition is looked up in the function definition index, and the definition is imported from the containing AST binary into the caller's context using the ASTImporter class. During the analysis, we need to store the dumped ASTs temporarily. For a large code base this can be a problem, and we have seen in practice that the analysis stops due to memory shortage. Reducing the size of the ASTs helps not only CTU analysis but also general Clang SA analysis scale to larger code bases. We are basically using two methods:

    1) Using an outlining method on the source code to find ASTs that share common factors or subtrees. We throw away those ASTs that will not match any other AST, thereby reducing the number of ASTs dumped in memory.

    2) A tree-pruning technique that keeps only those parts of the tree necessary for cross-translation-unit analysis and eliminates the rest to decrease the size of the tree. The necessary parts of the tree can be found by following the dependency path in the exploded graph, where instructions dependent on the function call/execution are present. Note that only those branches of which no child is a function call should be pruned.
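
    As a toy illustration of the pruning idea in 2) (not the Clang AST or the ASTImporter API), the following C++ sketch recursively drops subtrees that contain no call expression anywhere below them.

        #include <memory>
        #include <vector>

        struct Node {
            bool is_call = false;                         // does this node represent a function call?
            std::vector<std::unique_ptr<Node>> children;  // child subtrees
        };

        // Returns true if the subtree rooted at `n` must be kept, i.e. it contains a
        // call somewhere below it; subtrees without calls are erased from their parent.
        bool prune(Node& n) {
            bool keep = n.is_call;
            for (auto it = n.children.begin(); it != n.children.end();) {
                if (prune(**it)) {
                    keep = true;
                    ++it;
                } else {
                    it = n.children.erase(it);  // no call below: drop this branch
                }
            }
            return keep;
        }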

  • Alexander Matz and Holger Fröning: Enabling Automatic Partitioning of Data-Parallel Kernels with Polyhedral Compilation  [Slides]

    Data-parallel accelerators are pervasive in today's computing landscape due to their high energy efficiency and performance. GPUs, in particular, are very successful and utilize the Bulk-Synchronous Parallel (BSP) programming model to expose the available parallelism in an application core to the hardware. Programming a single GPU using the BSP programming model (in the form of OpenCL and CUDA) adds moderate complexity and is usually manageable.

    If more than a single GPU is to be used, however, all data transfers and kernel executions have to be orchestrated manually in order to achieve good performance. This is tedious and error-prone. Given the regular nature of many GPU kernels, this orchestration and the distribution of work should be possible automatically.

    In this talk, we present an approach to automatically partition single-GPU CUDA applications for execution on multiple GPUs, along with a preliminary performance analysis. We use polyhedral compilation to extract the memory access patterns of GPU kernels and a lightweight runtime system to synchronize device buffers and orchestrate kernel execution. The runtime system utilizes code generated by polyhedral compilation to keep track of the state of device buffers before and after each kernel execution and issues minimal data movements if required. Partitioned kernels need to be extended to compute only a subset of the original execution grid. Our preliminary performance analysis shows speedups of up to 12x for three model applications taken from the Berkeley Dwarves.

    Although we focus on NVIDIA CUDA applications in this talk, we see no conceptual differences in applying this approach to alternative implementations of the BSP programming model (e.g. OpenCL).
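
    As a minimal, hypothetical sketch of the partitioning step (not the authors' runtime system), the following C++ code splits a 1D execution grid into contiguous per-device slices and restricts a kernel to its slice; the real system instead derives per-device data requirements from polyhedral access analysis and moves only the data each device needs.

        #include <algorithm>
        #include <cstddef>
        #include <vector>

        struct Partition {
            std::size_t begin;  // first global index handled by this device
            std::size_t end;    // one past the last global index
        };

        // Split a 1D grid of `grid_size` work items across `num_devices` devices.
        std::vector<Partition> partition_grid(std::size_t grid_size, std::size_t num_devices) {
            std::vector<Partition> parts;
            const std::size_t chunk = (grid_size + num_devices - 1) / num_devices;
            for (std::size_t d = 0; d < num_devices; ++d) {
                const std::size_t begin = std::min(d * chunk, grid_size);
                const std::size_t end = std::min(begin + chunk, grid_size);
                parts.push_back({begin, end});
            }
            return parts;
        }

        // "Partitioned kernel": identical body to the original kernel, but it only
        // computes the [begin, end) subset of the original execution grid.
        void saxpy_partition(const Partition& p, float a, const float* x, float* y) {
            for (std::size_t i = p.begin; i < p.end; ++i)
                y[i] = a * x[i] + y[i];
        }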

  • William Moses: Tensor Comprehensions

    TBA.

Call for Speakers

We invite speakers from academia and industry to present their work on the following list of topics (including, but not limited to):

  • improving performance and size of code generated by LLVM,
  • improving performance of LLVM's runtime libraries,
  • improving the security of generated code,
  • tools developed with LLVM for performance analysis,
  • performance tracking over time,
  • compiler flags, annotations and remarks to understand and improve performance,
  • any other topic related to improving and maintaining the performance and quality of LLVM-generated code.

While the primary focus of the workshop is on these topics, we welcome any submission related to the LLVM compiler infrastructure, its sub-projects (clang, lldb, Polly, ...), as well as its use in industry and academia.

We are looking for:

  • keynote speakers,
  • technical presentations: 30 minutes plus questions and discussion,
  • tutorials,
  • BOFs.

Proposals should provide enough information for the review committee to judge the quality of the submission. Proposals can be submitted in the form of an extended abstract, full paper, or slides. Proposals should be submitted to EasyChair LLVM-CGO 2018. The deadline for receiving submissions is December 22, 2017. Speakers will be notified of acceptance or rejection by January 5.

Workshop organization: Johannes Doerfert, Renato Golin, Aditya Kumar, Sebastian Pop, Hal Finkel, and Tanya Lattner.