LLVM Performance Workshop at CGO
- What: LLVM Performance Workshop at CGO
- When: Saturday February 4th, 2017
- Where: Austin, Texas, USA
An LLVM Performance Workshop will be held at CGO 2017. The workshop
is co-located with CC, HPCA, and PPoPP. If you are interested in
attending the workshop, please register at the
CGO website.
Program
The workshop takes place at the Hilton Hotel in
downtown Austin (500 East 4th St).
Update: If you indicated this morning that you wanted to join us for dinner, here's the location of the restaurant: Manuel's Downtown, 310 Congress Avenue, Austin, TX 78701. We have a reservation at 5pm (dinner is at your own expense). The restaurant is within walking distance from the hotel.
Time | Room | Speaker | Title | Links
7:30-8:30 | 616AB | | Breakfast |
Session 1: Parallel Code Generation
8:30am | 400/402 | Johannes Doerfert (Saarland University) | Polyhedral "Driven" Optimizations on Real Codes | [Abstract] [Slides]
9:00am | 400/402 | Tobias Grosser (ETH Zurich) | Accelerator support with Polly-ACC | [Abstract] [Slides]
9:30am | 400/402 | Tao Schardl and William Moses (MIT) | The Tapir Extension to LLVM's Intermediate Representation for Fork-Join Parallelism | [Abstract] [Slides]
10:00-10:30 | 616AB | | Break |
Session 2: Performance in Libraries and Languages
10:30am | 400/402 | Hal Finkel (Argonne National Laboratory) | Modeling restrict-qualified pointers in LLVM | [Abstract] [Slides]
11:00am | 400/402 | Pranav Bhandarkar, Anshuman Dasgupta, Ron Lieberman, Dan Palermo (Qualcomm Innovation Center), Dillon Sharlet and Andrew Adams (Google) | Halide for Hexagon DSP with Hexagon Vector eXtensions (HVX) using LLVM | [Abstract] [Slides]
11:30am | 400/402 | Aditya Kumar and Sebastian Pop (Samsung Austin R&D Center) | Performance analysis of libcxx | [Abstract] [Slides]
12:00-1:30 | | | Lunch |
Session 3: Whole-application performance tuning
1:30pm | 400/402 | Brian Railing (CMU) | Improving LLVM Instrumentation Overheads | [Abstract] [Slides]
2:00pm | 400/402 | Sergei Larin, Harsha Jagasia and Tobias Edler von Koch (Qualcomm Innovation Center) | Impact of the current LLVM inlining strategy on complex embedded application memory utilization and performance | [Abstract] [Slides]
2:30pm | 400/402 | Mehdi Amini (Apple) | LTO/ThinLTO BoF | [Abstract]
3:00-3:30 | 616AB | | Break |
Session 4: Backend optimizations
3:30pm | 400/402 | Krzysztof Parzyszek (Qualcomm Innovation Center) | Register Data Flow framework | [Abstract] [Slides]
4:00pm | 400/402 | Evandro Menezes, Sebastian Pop and Aditya Kumar (Samsung Austin R&D Center) | Efficient clustering of case statements for indirect branch predictors | [Abstract] [Slides]
4:30pm | | | Workshop ends |
Abstracts
- Krzysztof Parzyszek: Register Data Flow framework
Register Data Flow is a framework implemented in LLVM that enables
data-flow optimizations on machine IR after register allocation. While
most of the data-flow optimizations on machine IR take place during the
SSA phase, when virtual registers obey the static single assignment
form, passes like pseudo-instruction expansion or frame index
replacement may expose opportunities for further optimizations. At the
same time, data-flow analysis is much more complicated after register
allocation, and implementing compiler passes that require it may not
seem like a worthwhile investment. The intent of RDF is to abstract this
analysis and provide access to it through a familiar and convenient
interface.
The central concept in RDF is a data-flow graph, which emulates SSA. In
contrast to the SSA-based optimization phase where SSA is a part of the
program representation, the RDF data-flow graph is a separate, auxiliary
structure. It can be built on demand and it does not require any
modifications to the program. Traversal of the graph can provide
information about reaching definitions of any given register access, as
well as reached definitions and reached uses for register
definitions. The graph provides connections for easily locating the
corresponding elements of the machine IR. A utility class that
recalculates basic block live-in information is implemented to make
writing whole-function optimizations easier. In this talk, I will give
an overview of RDF and its use in the Hexagon backend.
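To make the interface idea concrete, the following is a minimal, self-contained sketch of reaching-definitions queries over such an auxiliary def-use graph; all type and member names here are hypothetical and do not match the real RDF classes.

```cpp
// Hypothetical sketch of reaching-definitions queries over an
// auxiliary data-flow graph; names do not match the real RDF classes.
#include <cstdint>
#include <unordered_map>
#include <vector>

using RegisterId = std::uint16_t;
using NodeId = std::uint32_t;

// One definition of a register in the machine IR.
struct DefNode {
  RegisterId Reg;                  // register being defined
  std::vector<NodeId> ReachedUses; // uses this definition reaches
};

// One use of a register.
struct UseNode {
  RegisterId Reg;     // register being read
  NodeId ReachingDef; // the definition that reaches this use
};

// Built on demand; never mutates the program it describes.
struct DataFlowGraph {
  std::unordered_map<NodeId, DefNode> Defs;
  std::unordered_map<NodeId, UseNode> Uses;

  // Reaching definition of a given register access.
  NodeId getReachingDef(NodeId Use) const {
    return Uses.at(Use).ReachingDef;
  }

  // All uses reached by a given definition.
  const std::vector<NodeId> &getReachedUses(NodeId Def) const {
    return Defs.at(Def).ReachedUses;
  }
};
```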
- Tao Schardl and William Moses: The Tapir Extension to LLVM's Intermediate Representation for Fork-Join Parallelism
This talk explores how fork-join parallelism, as supported by
dynamic-multithreading concurrency platforms such as Cilk and
OpenMP, can be embedded into a compiler's intermediate
representation (IR). Mainstream compilers typically treat parallel
linguistic constructs as syntactic sugar for function calls into a
parallel runtime. These calls prevent the compiler from performing
optimizations across parallel control flow. Remedying this
situation, however, is generally thought to require an extensive
reworking of compiler analyses and code transformations to handle
parallel semantics.
Tapir is a compiler IR that represents logically parallel tasks
asymmetrically in the program's control flow graph. Tapir allows
the compiler to optimize across parallel control flow with only
minor changes to its existing analyses and code transformations. To
prototype Tapir in the LLVM compiler, for example, we added or
modified approximately 5000 lines of LLVM's approximately
3-million-line codebase. Tapir enables many traditional compiler
optimizations for serial code, including loop-invariant code motion,
common-subexpression elimination, and tail-recursion elimination, to
optimize across parallel control flow, as well as purely parallel
optimizations.
This work was conducted in collaboration with Charles E. Leiserson.
The proposal is a preliminary copy of our paper on Tapir, which will
appear at PPoPP 2017. This talk will focus on the technical details
of implementing Tapir in LLVM.
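For illustration, here is a hedged sketch of the call-based lowering that Tapir replaces; rt_spawn and rt_sync are hypothetical runtime entry points, not the Tapir, Cilk, or OpenMP API.

```cpp
// Fork-join parallelism lowered to opaque runtime calls, the usual
// strategy Tapir replaces. rt_spawn/rt_sync are hypothetical.
#include <cstddef>

extern "C" void rt_spawn(void (*fn)(void *), void *arg);
extern "C" void rt_sync();

struct Task {
  double *out;
  const double *in;
  std::size_t i;
  double scale;
};

static void body(void *p) {
  Task *t = static_cast<Task *>(p);
  t->out[t->i] = t->in[t->i] * t->scale;
  delete t; // each task owns its argument block
}

void scale_all(double *out, const double *in, std::size_t n,
               const double *scale) {
  for (std::size_t i = 0; i < n; ++i) {
    // The load of *scale looks loop-invariant, but rt_spawn is an
    // opaque external call that might write through the pointer, so
    // the compiler must reload it on every iteration. With the
    // parallelism encoded in the control-flow graph, as Tapir does,
    // ordinary loop-invariant code motion could hoist the load.
    rt_spawn(body, new Task{out, in, i, *scale});
  }
  rt_sync(); // join point: wait for all spawned tasks
}
```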
- Aditya Kumar, Sebastian Pop, and Laxman Sole: Performance analysis of libcxx
We will discuss the improvements and future work on libcxx. This
includes improvements to standard library algorithms like
string::find and basic_streambuf::xsgetn. These algorithms were
suboptimal, and optimizing them yielded substantial
improvements. Similarly, we enabled the inlining of the constructor
and destructor of std::string. We will present a systematic analysis
of function attributes in libc++ and the places where we added missing
attributes. We will present a comparative analysis of clang-libc++
vs. gcc-libstdc++ on representative benchmarks. Finally, we will talk
about our contributions to google-benchmark, which comes with libc++, to
help keep track of performance regressions.
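As a flavor of the kind of microbenchmark involved, here is a minimal sketch using the google-benchmark API; the workload is illustrative and not one of the authors' actual benchmarks.

```cpp
// Minimal google-benchmark microbenchmark for std::string::find;
// the workload is illustrative, not from the talk.
#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstdint>
#include <string>

static void BM_StringFind(benchmark::State &state) {
  // Haystack of 'a's with the needle placed at the very end,
  // forcing a full scan on every call.
  std::string haystack(static_cast<std::size_t>(state.range(0)), 'a');
  haystack += "needle";
  for (auto _ : state)
    benchmark::DoNotOptimize(haystack.find("needle"));
  state.SetBytesProcessed(static_cast<std::int64_t>(state.iterations()) *
                          state.range(0));
}
BENCHMARK(BM_StringFind)->Range(1 << 10, 1 << 20);

BENCHMARK_MAIN();
```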
- Hal Finkel: Modeling restrict-qualified pointers in LLVM
It is not always possible for a compiler to statically determine enough
about the pointer-aliasing properties of a program, especially for
functions which need to be considered in isolation, to generate the
highest-performance code possible. Multiversioning can be employed but
its effectiveness is limited by the combinatorially-large number of
potential configurations. To address these practical problems, the C
standard introduced the restrict keyword which can adorn pointer
variables. The restrict keyword can be used by the programmer to convey
pointer-aliasing information to the optimizer. Often, this is
information that is difficult or impossible for the optimizer to deduce
on its own.
The semantics of restrict, however, are subtle and rely on source-level
constructs that are not generally represented within LLVM's
IR. Maximally maintaining the aliasing information correctly in the face
of function inlining and other code-motion transformations, without
interfering with those transformations, is not trivial. While LLVM has
long supported restrict-qualified pointers that are function arguments,
and an initial phase of this work provided a way to preserve this
information in the face of function inlining, I'll describe a new scheme in LLVM
that allows the representation of aliasing information from block-local
restrict-qualified pointers as well. This more-general class of
restrict-qualified pointers is widely used in scientific code.
In this talk, I'll cover the use cases for restrict-qualified pointers,
the difficulties in representing their semantics at the IR level, why
the existing aliasing metadata cannot represent restrict-qualified
pointers effectively, how the proposed representation allows the
preservation of these semantics with minimal impact to the optimizer,
and how the optimizer can use this information to generate
higher-performance code. I'll also discuss how this scheme relates to
other annotations on pointer variables (e.g. TBAA and alignment
assumptions).
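For reference, the basic use case looks like the following generic sketch (using the __restrict spelling that Clang and GCC accept in C++; plain restrict in C99); it is not drawn from the talk itself.

```cpp
// __restrict promises that the pointed-to memory is accessed only
// through this pointer within its scope.
void saxpy(float *__restrict y, const float *__restrict x,
           float a, int n) {
  // No aliasing between x and y: the loop can be vectorized
  // without runtime overlap checks.
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}

void update(float *dst, const float *src, int n) {
  // Block-local restrict-qualified pointers: the harder case the
  // new representation described in the talk is designed to model.
  float *__restrict d = dst;
  const float *__restrict s = src;
  for (int i = 0; i < n; ++i)
    d[i] = 2.0f * s[i];
}
```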
- Mehdi Amini: LTO/ThinLTO BoF
LTO is an important technique for getting the maximum performance from
the compiler. We presented the ThinLTO model and implementation in LLVM
at the last LLVM Dev Meeting, which gave the audience a good overview
of the high-level flow of ThinLTO and the three-phase split involved.
The proposal for this BoF is to gather and discuss existing user
experience, current limitations, and the features folks most want from
ThinLTO. We can also go over the optimizations currently in development
upstream.
- Johannes Doerfert: Polyhedral "Driven" Optimizations on Real Codes
In this talk I will present polyhedral "driven" optimizations on real
codes. The term polyhedral "driven" is used because there are two
flavors of optimization I want to discuss (depending on my progress and
the duration of the talk).
The first follows the classical approach applied by LLVM/Polly but with
special consideration of general benchmarks like SPEC. I will show how
LLVM/Polly can be used to perform beneficial optimizations in (at least)
libquantum, hmmer, lbm and bzip2. I will also discuss what I think is
needed to identify such optimization opportunities automatically.
The second polyhedral-driven optimization I want to present is a
conceptual follow-up to the "Polyhedral Info" GSoC project. That project
was the first attempt to augment LLVM analysis and transformation passes
with polyhedral information. While it was built on top of
LLVM/Polly, I will present an alternative approach. First I will
introduce a modular, demand-driven, caching polyhedral program
analysis that natively integrates into the existing LLVM pipeline. Then
I will show how to utilize this analysis in existing LLVM optimizations
to improve performance. Finally, I will use the polyhedral analysis to
derive new, complex control-flow optimizations that are not present in
LLVM, or are present only in a simpler form.
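For context, the kind of static control part such a polyhedral analysis reasons about is a plain affine loop nest, as in this generic sketch (not one of the codes from the talk):

```cpp
// A static control part (SCoP): loop bounds and array subscripts are
// affine functions of the loop counters and parameters, so a
// polyhedral analysis can reason exactly about which interchange,
// tiling, or parallelization of the nest is legal.
void matmul(int n, const double *A, const double *B, double *C) {
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      for (int k = 0; k < n; ++k)
        // C[i][j] += A[i][k] * B[k][j], flattened to 1-D indexing.
        C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```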
- Tobias Grosser: Accelerator support with Polly-ACC
Programming today's increasingly complex heterogeneous hardware is
difficult, as it commonly requires the use of data-parallel languages,
pragma annotations, specialized libraries, or DSL compilers. Adding
explicit accelerator support into a larger code base is not only costly,
but also introduces additional complexity that hinders long-term
maintenance. We propose a new heterogeneous compiler that brings us
closer to the dream of automatic accelerator mapping. Starting from a
sequential compiler IR, we automatically generate a hybrid executable
that - in combination with a new data management system - transparently
offloads suitable code regions. Our approach is almost regression free
for a wide range of applications while improving a range of compute
kernels as well as two full SPEC CPU applications. We expect our work to
reduce the initial cost of accelerator usage and to free developer time
to investigate algorithmic changes.
- Brian Railing: Improving LLVM Instrumentation Overheads
The behavior and structure of a shared-memory parallel program can be
characterized by a task graph that encodes the instructions, memory
accesses, and dependencies of each piece of parallel work. The task
graph representation can encode the actions of any threading library and
is agnostic to the target architecture. Contech [1] is an LLVM-based
tool that generates a task graph representation by instrumenting the
program at compile time so that it ultimately outputs a task graph
when executed. This talk describes several approaches to reducing the
overhead of Contech's instrumentation by augmenting the static compiler
analysis.
The additional analyses first identify similar memory
address calculations in the LLVM intermediate representation and elide
them from the instrumentation to reduce the data recorded, an approach
previously attempted only with dynamic binary instrumentation based on
common registers [2] [3]. Second, this analysis is supplemented by
tail duplication, which increases the number of memory operations in a
single basic block and therefore may provide further opportunities to
elide instrumentation, without compromising the accuracy or detail of
the data recorded. These optimizations reduce the data recorded by 22%,
with a proportionate decrease in overhead from 3.7x to 3.3x on the
PARSEC benchmarks.
[1] B. P. Railing, E. R. Hein, and T. M. Conte. "Contech: Efficiently
Generating Dynamic Task Graphs for Arbitrary Parallel Programs". In: ACM
Trans. Archit. Code Optim. 12.2 (July 2015), 25:1-25:24.
[2] Q. Zhao, I. Cutcutache, and W.-F. Wong. "Pipa: Pipelined Profiling
and Analysis on Multi-core Systems". In: Proceedings of the 6th Annual
IEEE/ACM International Symposium on Code Generation and
Optimization. CGO '08. Boston, MA, USA: ACM, 2008, pp. 185-194.
[3] K. Jee et al. "ShadowReplica: Efficient Parallelization of Dynamic
Data Flow Tracking". In: Proceedings of the 2013 ACM SIGSAC Conference
on Computer & Communications Security. CCS '13. Berlin, Germany:
ACM, 2013, pp. 235-246.
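A simplified illustration of the address-eliding idea from the Contech abstract above; record_access is a hypothetical instrumentation hook, not Contech's real interface.

```cpp
// record_access is a hypothetical instrumentation hook.
extern "C" void record_access(const void *addr);

struct Pair { double a, b; };

void instrumented_naive(Pair *p) {
  record_access(&p->a); // base + 0
  record_access(&p->b); // base + 8: recorded in full again
  p->a = 1.0;
  p->b = 2.0;
}

void instrumented_elided(Pair *p) {
  // Static analysis sees that &p->b is a constant offset from the
  // already-recorded &p->a, so the second event is elided and
  // reconstructed offline from the first address plus the offset.
  record_access(&p->a);
  p->a = 1.0;
  p->b = 2.0;
}
```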
- Evandro Menezes, Sebastian Pop, and Aditya Kumar: Efficient clustering of case statements for indirect branch predictors
We present an O(n log n) algorithm, as implemented in LLVM, to compile a
switch statement into jump tables. To generate jump tables that can be
efficiently predicted by current hardware branch predictors, we added an
upper bound on the number of entries in each generated jump table. This
modification of the previously best known algorithm reduces the
complexity from O(n^2) to O(n log n). We illustrate the performance
achieved by the improved algorithm on the Samsung Exynos-M1 processor
running several benchmarks.
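As a rough illustration of the idea, here is a greedy simplification that bounds each jump table's size while keeping tables dense; all names are hypothetical and the LLVM implementation differs.

```cpp
// Greedy simplification of bounded jump-table clustering; the LLVM
// implementation differs. Cases must be sorted and duplicate-free.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cluster {
  std::size_t First, Last; // inclusive index range into Cases
};

std::vector<Cluster> clusterCases(const std::vector<std::int64_t> &Cases,
                                  std::uint64_t MaxEntries,
                                  double MinDensity) {
  std::vector<Cluster> Out;
  std::size_t First = 0;
  for (std::size_t i = 1; i <= Cases.size(); ++i) {
    bool Grow = false;
    if (i < Cases.size()) {
      // Table width if Cases[i] joins the current cluster.
      std::uint64_t Range =
          static_cast<std::uint64_t>(Cases[i] - Cases[First]) + 1;
      // Grow only while the table stays small and dense enough for
      // the branch predictor.
      Grow = Range <= MaxEntries &&
             static_cast<double>(i - First + 1) / Range >= MinDensity;
    }
    if (!Grow) {
      Out.push_back({First, i - 1}); // close the current cluster
      First = i;
    }
  }
  return Out;
}
```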
- Pranav Bhandarkar, Anshuman Dasgupta, Ron Lieberman, Dan Palermo, Dillon Sharlet, and Andrew Adams: Halide for Hexagon DSP with Hexagon Vector eXtensions (HVX) using LLVM
Halide is a domain-specific language that endeavors to make it easier to
construct large, composite image-processing applications. Halide is
unique in its design approach of decoupling the algorithm from the
organization (schedule) of the computation. Once written and tested for
correctness, algorithms can be continually tuned for performance, as
Halide makes it easy to change the schedule: tiling, parallelizing,
prefetching, or vectorizing different dimensions of the loop nest that
forms the structure of the algorithm.
Halide programs are transformed into the Halide Intermediate
Representation (IR) by the Halide compiler. This IR is analyzed and
optimized before generating LLVM bitcode for the target
requested. Halide links with the LLVM optimizer and codegen libraries
for supported targets, and uses these to generate object code.
In this workshop, we will present our work on retargeting Halide to the
Hexagon DSP with a focus on the Hexagon Vector eXtensions (HVX).
We will present the Halide constructs used in a simple 5x5
blur, the corresponding Halide IR, and a few of the important LLVM
Hexagon passes that generate HVX vector instructions.
We will demonstrate compilation using LLVM.org and Halide.org tools, and
execution of the 5x5 blur pipeline on a Snapdragon 820 development board
using the Halide Hexagon offloader. In particular, we will demonstrate
the improvements that can be realized through scheduling changes.
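For readers unfamiliar with Halide, here is a minimal sketch of a separable 5x5 blur and one possible schedule, written against Halide's public C++ API; the algorithm, schedule, and target handling are illustrative, not the ones from the talk.

```cpp
// Minimal separable 5x5 blur in Halide's C++ embedding; algorithm
// and schedule are illustrative, not from the talk.
#include "Halide.h"
using namespace Halide;

int main() {
  ImageParam input(UInt(8), 2); // 2-D 8-bit input image
  Var x("x"), y("y"), xi("xi"), yi("yi");

  // Algorithm: a 5x5 box blur written as two separable 1-D passes,
  // accumulating in 16 bits to avoid overflow.
  RDom r(-2, 5);
  Func blur_x("blur_x"), blur_y("blur_y");
  blur_x(x, y) = cast<uint8_t>(sum(cast<uint16_t>(input(x + r, y))) / 5);
  blur_y(x, y) = cast<uint8_t>(sum(cast<uint16_t>(blur_x(x, y + r))) / 5);

  // Schedule: tiling, vectorization, and parallelism are specified
  // separately from the algorithm above and can be retuned freely.
  blur_y.tile(x, y, xi, yi, 64, 32).vectorize(xi, 8).parallel(y);
  blur_x.compute_at(blur_y, x).vectorize(x, 8);

  // Ahead-of-time compilation; a Target selecting Hexagon/HVX could
  // be passed as an extra argument (details omitted here).
  blur_y.compile_to_static_library("blur5x5", {input}, "blur5x5");
  return 0;
}
```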
- Sergei Larin, Harsha Jagasia and Tobias Edler von Koch: Impact of the current LLVM inlining strategy on complex embedded application memory utilization and performance
Sophisticated embedded applications with extensive and fine-grained
memory management present a unique challenge to contemporary
toolchains. Like many open-source projects, LLVM tunes its core
optimization tradeoffs for common cases and a set of common
architectures. Even with backend-specific hooks, it is not always
possible to exert an appropriate degree of control over some key
optimizations. We propose a case study: an in-depth analysis of LLVM's
PGO-assisted inlining in a complex embedded application.
The program in question is a large-scale embedded networking application
designed to be custom-tuned for a variety of embedded platforms
with a range of memory and performance constraints. It makes heavy use
of linker scripts to configure and fine-tune memory assignment, to
ultimately guarantee optimal performance in a constrained memory
environment while being extremely power conscious.
The moment a toolchain addresses a non-uniform memory model, a "one
size fits all" approach to optimizations like inlining stops being
optimal. For instance, based on section assignment, which is completely
unknown to the compiler, inlining takes place in areas that face
different cost/benefit tradeoffs. The contents of the L1 and L2
instruction caches should not be "enlarged" even if performance can
theoretically improve. Inlining across such section boundaries is also
ill-advised, since a control-flow transfer (jump) between sections
destined for different levels of the memory hierarchy can produce
unexpected performance implications. Finally, tightly budgeted low-level,
high-performance memories might swell beyond their physical limits.
The current state of LLVM's inliner is somewhat transitional in
anticipation of structural updates to the pass manager, and as such it
still relies strongly on heuristic- and PGO-based inline cost
computation. In this situation, the introduction of backend hooks might
allow targets to fine-tune inlining decisions to some degree, but they
still fall far short of the degree of control needed by the
systems described above. An additional challenge is posed by the high
degree of complexity in capturing actual system run-time behavior, and
even in collecting appropriate traces to generate meaningful PGO
data. Battery-powered embedded chips rarely have sophisticated tracing
capabilities, yet present extremely complex run-time environments.
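To make the section-boundary problem concrete, here is a generic sketch using the GCC/Clang section attribute; the section names and functions are made up.

```cpp
// Generic sketch: functions pinned to different memory levels via
// linker sections. Section names are made up; actual placement is
// decided in the linker script, invisible to the compiler.
#define FAST_MEM __attribute__((section(".text.l1_fast")))
#define SLOW_MEM __attribute__((section(".text.ddr")))

SLOW_MEM int checksum(const unsigned char *p, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += p[i];
  return s;
}

FAST_MEM int rx_hot_path(const unsigned char *pkt, int n) {
  // If the inliner copies checksum() here, the tightly budgeted fast
  // section grows; if it keeps the call, control transfers between
  // memory levels on every packet. The compiler sees neither
  // constraint when it makes the inlining decision.
  return checksum(pkt, n);
}
```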
Call for Speakers
We invite speakers from academia and industry to present their work on the
following list of topics (including but not limited to):
- improving the performance and size of code generated by LLVM,
- improving the performance of LLVM's runtime libraries,
- tools developed with LLVM for performance analysis of compiler-generated code,
- bots and trackers of performance over time,
- improving the security of generated code,
- any other topic related to improving and maintaining the performance and quality of LLVM-generated code.
While the primary focus of the workshop is on these topics, we welcome any
submission related to the LLVM compiler infrastructure, its sub-projects
(Clang, Linker, libraries), and its use in industry and academia.
We are looking for:
- keynote speakers,
- technical presentations: 30 minutes plus questions and discussion,
- tutorials,
- BOFs.
Proposals should provide enough information for the review committee to
judge the quality of the submission. Proposals may be submitted in the
form of an extended abstract, a full paper, or slides, via EasyChair
(LLVM-CGO 2017). The deadline for submissions is December 1st,
2016. Speakers will be notified of acceptance or rejection by December 15.
Workshop organization: Sebastian Pop, Aditya Kumar, Tobias Edler von Koch, and
Tanya Lattner.