The LLVM Compiler Infrastructure
2020 European LLVM Developers Meeting
About

The meeting is cancelled; more information is available on the conference main page.

The meeting serves as a forum for LLVM, Clang, LLDB and other LLVM project developers and users to get acquainted, learn how LLVM is used, and exchange ideas about LLVM and its (potential) applications.

The conference includes:

Technical talks
Modifying LLVM Without Forking Neil Henning (Unity)

LLVM is a powerful technology used in a wide range of applications. One key aspect of LLVM that is not broadcast enough is that it is possible to extensively modify some of its core parts without forking the codebase. This talk will cover some key ways that users of LLVM can drastically change the code produced by the compiler, using practical examples from Unity's HPC# Burst compiler codebase to show how we leverage the power of LLVM without forking.

A Cross Debugger for Multi-Architecture Binaries Jaewoo Shim (The Affiliated Institute of ETRI), Hyukmin Kwon (The Affiliated Institute of ETRI), Sangrok Lee (The Affiliated Institute of ETRI)

In IoT, malicious binaries are executed on various CPU architectures. For example, Mirai and its variants spread over many CPUs (Intel, ARM, MIPS, PPC, etc.). It is very difficult to prepare devices to execute such malware. Furthermore, malware analysts need to understand every architecture and its assembly language to analyze multi-architecture malware. For these reasons, we developed an LLVM-based cross-debugger which can execute and inspect multi-architecture malware on a single host. The input of the cross-debugger is LLVM IR, which is lifted from a malware binary by our lifter, itself based on an existing lifter. We changed the disassembly strategy from recursive traversal to linear sweep with an error correction method using our own local VSA (Value Set Analysis). Our lifter outperformed the existing lifter, running 4 times faster with the same accuracy. The LLVM Interpreter (LLI) is used for executing the lifted LLVM IR. The current LLI cannot run the “lifted” IR properly for two reasons: 1) direct memory access and 2) uncommon type casting. In our presentation, we will show why these are problematic and how we solved them by modifying the LLI source code. We implemented essential debugger features such as breakpoints, code view and hex dump in order to use LLI as a debugger. In addition, we added a novel feature: data-flow-based instruction tracing, which is very helpful for analyzing IoT binaries but is not provided by gdb or IDA Pro. In this talk, we want to discuss how LLVM IR can be used for dynamic binary analysis. First, we will show how to lift a binary to LLVM IR and show examples of lifted LLVM IR code which LLI cannot execute. Second, we will discuss the current limitations of the existing LLI and how we solved them. Third, we will explain what is required for a cross-debugger and how we designed and implemented these features. Finally, we will give a malware analysis demo with our tool.

TFRT: An MLIR Powered Low-Level Runtime for Heterogenous Accelerators Chris Lattner (Google), Mingsheng Hong (Google)

TFRT is a new effort to provide a common low-level runtime for accelerators, enabling multiple heterogeneous accelerators (each with domain-specific APIs and device-specific drivers) in a single system. This approach provides efficient use of the multithreaded host CPUs, supports fully asynchronous programming models, and is focused on low-level efficiency. TFRT is a new runtime that powers TensorFlow, but while our work is focused on machine learning use cases, the core runtime is application independent. TFRT is novel in three ways:

  1. it directly builds on MLIR and LLVM infrastructure like the MLIR declarative graph lowering framework, FileCheck based unit tests, and common LLVM data types.
  2. it leverages MLIR's extensible type system to support arbitrary C++ types in the runtime, not being limited to just tensors.
  3. it uses a modular library-based design that is optimized for subset-ability and embedding into applications spanning from mobile to server deployments, integration into a high performance game engine, etc.

This talk discusses the design points of TFRT, including the use of MLIR dialects to represent accelerator runtimes, which is the key that enables efficient and highly integrated heterogeneous computation in a common framework. Through the use of MLIR, TFRT is able to expose the full power of each accelerator, instead of providing a "lowest common denominator" approach.

Transitioning the Scientific Software Toolchain to Clang/LLVM Mike Pozulp (Lawrence Livermore National Laboratory and University of California, Davis), Shawn Dawson (Lawrence Livermore National Laboratory), Ryan Bleile (Lawrence Livermore National Laboratory and University of Oregon), Patrick Brantley (Lawrence Livermore National Laboratory), M. Scott McKinley (Lawrence Livermore National Laboratory), Matt O'Brien (Lawrence Livermore National Laboratory), Dave Richards (Lawrence Livermore National Laboratory)

For the past 25 years, many of the largest scientific software applications at Lawrence Livermore National Laboratory (LLNL) have used the Intel C/C++ compiler (icc/icpc) to compile the executables provided to users on x86. In spring 2020, the Monte Carlo Transport Project will release our first executable compiled with clang, which builds 25% faster and runs 6.1% faster than icpc. The poster accompanying this paper will describe the challenges of switching toolchains and the resulting advantages of using a clang/LLVM toolchain for large scientific software applications at LLNL. Acknowledgement: The title was inspired by a technical talk from the 2019 LLVM Developers' Meeting, "Transitioning the Networking Software Toolchain to Clang/LLVM".

Exhaustive Software Pipelining using an SMT-Solver Jan-Willem Roorda (Intel)

Software pipelining (SWP) is a classic and important loop-optimization technique for VLIW processors. It improves instruction-level parallelism by overlapping multiple iterations of a loop and executing them in parallel. Typically, SWP is implemented using heuristics. However, exhaustive approaches based on Integer Programming (IP) have also been proposed. In this talk, we present an alternative approach implemented in LLVM: an exhaustive software pipeliner based on a Satisfiability Modulo Theories (SMT) solver. We give experimental results in which we compare our approach with heuristic algorithms and hand-optimization. Furthermore, we show how the "unsatisfiable core" generation feature of modern SMT solvers can be used by the compiler to give feedback to programmers and processor designers. Finally, we compare our approach to LLVM's implementation of Swing-Modulo-Scheduling (SMS).
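
As a rough illustration of what an exhaustive search for a software-pipelined schedule involves, the following Python sketch enumerates start cycles for a tiny loop and returns the smallest initiation interval (II) that satisfies all dependence and resource constraints. It is a brute-force stand-in, not the SMT-based pipeliner presented in the talk; the function name, the dependence tuples and the simple "at most `units` operations per cycle" machine model are assumptions made for this example.

```python
from itertools import product


def find_modulo_schedule(ops, deps, units, horizon, max_ii):
    """Brute-force search for the smallest feasible initiation interval.

    ops     : list of operation names (hypothetical toy loop body)
    deps    : list of (src, dst, latency, distance) dependence edges, where
              distance is the number of loop iterations the edge spans
    units   : how many operations may issue in the same cycle (assumed model)
    horizon : start cycles are searched in range(horizon)
    max_ii  : largest initiation interval to try
    """
    for ii in range(1, max_ii + 1):
        for starts in product(range(horizon), repeat=len(ops)):
            start = dict(zip(ops, starts))

            # Dependence constraint: dst of iteration i + distance must not
            # start before src of iteration i has produced its result.
            deps_ok = all(
                start[dst] + ii * dist >= start[src] + lat
                for src, dst, lat, dist in deps
            )
            if not deps_ok:
                continue

            # Resource constraint: operations that share a cycle modulo II
            # compete for the same issue slots in the steady state.
            slot_use = [0] * ii
            for op in ops:
                slot_use[start[op] % ii] += 1
            if max(slot_use) <= units:
                return ii, start
    return None


# Toy loop: a -> b -> c within an iteration, and c feeds a in the next one.
ops = ["a", "b", "c"]
deps = [("a", "b", 2, 0), ("b", "c", 1, 0), ("c", "a", 1, 1)]
print(find_modulo_schedule(ops, deps, units=1, horizon=6, max_ii=8))
```

In this toy loop the recurrence through a, b, c and back to a has a total latency of 4 across one iteration, so no interval below 4 can be feasible and the search settles on II = 4.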

Testing the Debugger Jonas Devlieghere (Apple)

Testing the debugger has unique challenges. Unlike the compiler where you have a fixed set of input and output files, the debugger is an interactive tool that deals with many variants, ranging from the compiler and debug info format to the platform being debugged. LLDB's test suite has seen some significant changes over the past two years. Not only has the number of tests increased steadily, we also changed the way we test things. This talk will give an overview of those changes, the different testing strategies used by LLDB and how to decide which one to use when writing a new test case.

Changing Everything With Clang Plugins: A Story About Syntax Extensions, Clang's AST, and Quantum Computing Hal Finkel (Argonne National Laboratory), Alex Mccaskey (Oak Ridge National Laboratory)

Did you know that Clang has a powerful plugin API? Plugins can currently observe Clang's AST during compilation, register new pragmas, and more. In this talk, I'll review Clang's current plugin infrastructure, explaining how to write and use Clang plugins, and then talk about how we're working to enhance Clang's plugin capabilities by allowing plugins to provide custom parsing within function bodies. This new capability has many potential use cases, from parser generators to database-query handling, and we'll discuss how this new capability can potentially enhance a wide spectrum of tools. Finally, we'll discuss one such use case in more detail: embedding a quantum programming language in C++ to create a state-of-the-art hybrid programming model for quantum computing.

Loop Fission: Distributing loops based on conflicting heuristics Ettore Tiotto (IBM Canada), Wai Hung (Whitney) Tsang (IBM Canada), Bardia Mahjour (IBM Canada), Kit Barton (IBM Canada)

This talk is about a new optimization pass implemented in LLVM opt - LoopFissionPass. Loop fission aims at distributing independent statements in a loop into separate loops. In our implementation we use an interference graph, induced from the Data Dependence Graph (DDG), to balance potentially conflicting heuristics and derive an optimal distribution plan. We consider data reuse between statements, memory streams, code size, etc., to decide how to distribute a loop nest. Additional heuristics can be easily incorporated into the model, making this approach a flexible alternative to the existing LoopDistributionPass in LLVM. We will share our experience on running Loop Fission on a real-world application, and we will provide results on industry benchmarks. This talk targets developers who have an interest in loop optimizations and want to learn about how to use the DDG infrastructure now available in LLVM to drive a transformation pass. The takeaways for this talk are:

  • How to balance conflicting heuristics using an interference graph
  • How to use the data dependence graph
  • The key differences between the existing LoopDistribution pass and our new LoopFission pass
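
To make the basic transformation concrete, here is a minimal Python sketch of loop fission on a toy loop. It only shows the effect of distributing independent statements into separate loops; the heuristics, the interference graph and the DDG from the talk are not modeled, and the function names and statements are invented for illustration.

```python
def fused_loop(a, b):
    """Original loop: three statements in one loop body."""
    n = len(a)
    c, d, e = [0] * n, [0] * n, [0] * n
    for i in range(n):
        c[i] = a[i] * 2   # S1
        e[i] = c[i] + 1   # S2: reads c[i], so it depends on S1
        d[i] = b[i] - 3   # S3: touches only b and d, independent of S1/S2
    return c, d, e


def fissioned_loops(a, b):
    """After fission: S1 and S2 stay together, S3 gets its own loop."""
    n = len(a)
    c, d, e = [0] * n, [0] * n, [0] * n
    for i in range(n):    # loop 1 keeps the dependent pair S1, S2
        c[i] = a[i] * 2
        e[i] = c[i] + 1
    for i in range(n):    # loop 2 holds the independent statement S3
        d[i] = b[i] - 3
    return c, d, e


a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(fused_loop(a, b) == fissioned_loops(a, b))
```

Both versions compute the same arrays because S3 shares no data with S1 or S2; deciding when such a split is actually profitable is what the heuristics in the abstract are about.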

Achieving compliance with automotive coding standards with Clang Milena Vujosevic Janicic (RT-RK)

Autosar guidelines for the use of the C++14 language in critical and safety-related systems propose rules that are tailored to improve security, safety and quality of software. In this talk, we will discuss the main challenges in extending Clang with source code analyses that are necessary for checking compliance of software with the Autosar automotive standard:

  • We will present Clang’s current support for checking compliance with different standards and its strengths and weaknesses in this area.
  • We will compare efficiency and possibilities based on implementing analyses via AST Visitors and AST Matchers.
  • We will present our improvements of Clang's diagnostics.
  • We will discuss similarities and differences between our approach and the solution offered by the Clang-Tidy project.
  • We will present some impressions and results on using our extension of Clang (supporting checking compliance with more than 180 Autosar rules) in the automotive industry, including running it on parts of Automotive Grade Linux open source code.

Secure Delivery of Program Properties with LLVM Son Tuan Vu (LIP6), Karine Heydemann (LIP6), Arnaud de Grandmaison (Arm), Albert Cohen (Google)

Program analysis and program transformation systems have long used annotations and assertions capturing program properties, to either specify test and verification goals, or to enhance their effectiveness. These may be functional properties of program control and data flow, or non-functional properties about side channels or faults. Such annotations are typically inserted at the source level for establishing compliance with a specification, or guiding compiler optimizations, and are required at the binary level for the validation of secure code, for instance. In this talk, I will explain our approach to encode, translate and preserve the semantics of both functional and non-functional properties along the optimizing compilation of C to machine code. This involves:

  • capturing and translating source-level properties through lowering passes and intermediate representations, such that data and control flow optimizations will preserve their consistency with the transformed program;
  • carrying properties and their translation as debug information down to machine code.

I will also give details on how we modified Clang and LLVM to implement and validate the soundness and efficiency of the approach. I will show how our approach specifically addresses a fundamental open issue in security engineering, by considering some established security properties and applications hardened against side-channel and fault attacks. This talk will be a follow-on to "Compilation and optimization with security annotations", presented at EuroLLVM 2019. It is based on our research paper "Secure Delivery of Program Properties Through Optimizing Compilation", submitted and accepted for the ACM SIGPLAN 2020 International Conference on Compiler Construction (CC20).

Verifying Memory Optimizations using Alive2 Juneyoung Lee (Seoul National University, Korea), Chung-Kil Hur (Seoul National University, Korea), Nuno P. Lopes (Microsoft Research, UK)

Alive2 is a re-implementation of Alive to check existing optimizations without rewriting them in the Alive DSL. It takes a pair of functions as input and encodes their equivalence (refinement) condition into a mathematical formula, which is then verified by Z3. Alive2 can be run as a standalone tool as well as an opt plugin, which enables running Alive2 on LLVM's unit tests using the lit testing tool. In this talk, I will present a demo that shows how to use Alive2 to prove the correctness of optimizations on memory-accessing instructions such as load, store, and alloca. It will include running examples of several optimizations that LLVM currently performs. Also, we'll show how to interpret Alive2's error messages for incorrect transformations by using real miscompilation bugs that we've found in the LLVM unit tests.
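
The idea of checking a source/target pair of functions can be illustrated without a solver. The Python sketch below exhaustively compares two functions over all 8-bit inputs and reports the first input on which they differ. This is only an analogy: Alive2 encodes the condition as a formula for Z3 and reasons about refinement (including undefined behavior and memory), none of which is modeled here, and all names in the sketch are hypothetical.

```python
MASK = 0xFF  # model 8-bit unsigned wrap-around arithmetic


def find_counterexample(src, tgt):
    """Return the first 8-bit input where src and tgt disagree, else None."""
    for x in range(256):
        if src(x) != tgt(x):
            return x
    return None


# A valid rewrite: multiplying by 2 equals shifting left by 1 (mod 2**8).
def mul_by_two(x):
    return (x * 2) & MASK


def shift_left_one(x):
    return (x << 1) & MASK


# An invalid rewrite: "x + 1 > x" is not always true once x + 1 may wrap.
def inc_greater(x):
    return ((x + 1) & MASK) > x


def always_true(x):
    return True


print(find_counterexample(mul_by_two, shift_left_one))  # no mismatch found
print(find_counterexample(inc_greater, always_true))    # wraps at x = 255
```

A solver-based tool returns the same kind of evidence, a concrete input on which source and target disagree, but can do so for 32- and 64-bit values where enumeration is not practical.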

From Tensors to Devices in one IR Oleksandr Zinenko (Google Inc.), Stephan Herhut (Google Inc.), Nicolas Vasilache (Google Inc.)

MLIR is a new compiler infrastructure recently introduced to the LLVM project. Its main power lies in the openness of its instruction set and type system, allowing compiler engineers and researchers to define and combine different levels of abstractions within a single IR. In this talk, we will present an approach for code generation and optimization that significantly reduces implementation complexity by defining operations, types and attributes with strong semantics and structural properties that are preserved across compiler transformations. These semantics can be derived from the results of traditional compiler analyses, such as aliasing or affine loop analysis, or imposed by construction and preserved when lowering progressively from the front-end representation. We illustrate our approach to code generation by a retargetable flow from machine learning frameworks to GPU-like devices, traversing a series of mid-level control flow abstractions such as loops, all expressed as MLIR dialects. These dialects follow the “structured” design paradigm, making them easy to extend, combine and lower into each other progressively, only discarding high-level information when it is no longer necessary. We demonstrate that the structure embedded into operations and types ensures the legality of code transformations (such as buffer assignment, code motion, fusion and unrolling), and is preserved by them, making the set of operations closed under a set of well-defined transformations.

Convergence and control flow lowering in the AMDGPU backend Nicolai Hähnle (Advanced Micro Devices)

GPUs execute many threads of a program in lock-step on SIMD hardware, in what is often called a SIMT or SPMD execution model. The AMDGPU compiler backend is responsible for translating a program's original, thread-level control flow into a combination of predication and wave-level control flow. Some programs contain _convergent_ intrinsics which add further constraints to this transform. We give a brief update on recent developments in the AMDGPU backend and how we plan to model convergence constraints in LLVM IR in the future, with a corresponding take on what convergence should mean. Given enough time, we'll go into some more detail on the convergence intrinsics we're using, our preferred cycle analysis, and how choices in convergence behavior interact with divergence analysis.

Preserving And Improving The Optimized Debugging Experience Tom Weaver (Sony, SN Systems)

The current optimized debugging experience is poor, but recently there has been a concerted effort within the LLVM community to rectify this. The ongoing effort has been huge, but there's still lots of work to do in the optimized debugging space. A typical optimized debugging experience can be frustrating, with variables going missing, holding incorrect values or appearing out of order. The LLVM optimization pipeline presents a large surface area for optimized debugging experience bugs to be introduced. But this doesn't mean that fixing this issue has to be hard. The vast majority of the issues that arise within the optimized debugging experience problem space can be fixed using existing tools and utilities built into the LLVM codebase. This talk aims to inform the audience about the current optimized debugging experience, what we mean by 'debugging experience', why it's bad and what we can do about it. The talk will explain in some detail how debugging information is represented within LLVM IR and how these debugging information building blocks interact with one another. Finally, it will cover some entry-level coding patterns that LLVM contributors can use to improve the debugging experience themselves when working within the LLVM codebase.

ThinLtoJIT: Compiling ahead of time with ThinLTO summaries Stefan Gränitz (Independent / Freelance Developer)

ThinLtoJIT is a new LLVM example project, which makes use of global call-graph information from ThinLTO summaries for speculative compilation with ORCv2. It is an implementation of the concept I presented in my "ThinLTO Summaries in JIT Compilation" talk at the 2018 Developers' Meeting: https://llvm.org/devmtg/2018-10/talk-abstracts.html#lt8 Upfront, the JIT only populates the global ThinLTO module index and compiles the main module. All functions are emitted with extra prologue instructions that fire a discovery flag once execution reaches them. In parallel, a discovery thread is busy-watching all these flags. Once it detects that some have fired, it queries the ThinLTO module index for functions reachable within a number of calls. The set of modules that define these functions is then loaded from disk and submitted to the compilation pipeline asynchronously while execution continues. Ideally, the JIT can be tuned in such a way that the code on the actual path of execution can always be compiled ahead of time. In case a missing function is reached, the JIT has a definition generator in place that loads modules synchronously. We will go through the lifetime of an example program running in ThinLtoJIT and discuss various aspects of the implementation:

  • Generate and inspect bitcode with ThinLTO summaries
  • Populate and query the global module index
  • Build compile pipelines with ORCv2
  • Compiler interception stubs in ORCv2
  • Binary instrumentation for JITed functions
  • Lock-free discovery flags
  • Multithreaded dispatch for bitcode parsing and compilation
  • Benchmarks against lli and static compilation

Most topics are beginner friendly in their domain. During the session participants will gain:

  • an advanced understanding of the ORCv2 libraries
  • a basic and practical understanding of ThinLTO summaries, binary instrumentation, multi-threading and lock-free data structures

Bonus: So, should we build Clang stage-1 in memory?
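
The discovery step described in this abstract, finding the functions reachable within a number of calls and the modules that define them, can be sketched as a depth-limited breadth-first search. The Python below is a toy model only: the call graph and the function-to-module map are plain dictionaries standing in for the ThinLTO module index, and every name in it is hypothetical, not part of ThinLtoJIT or ORCv2.

```python
from collections import deque


def modules_to_load(call_graph, defining_module, fired, max_calls):
    """Modules defining every function reachable from `fired` in at most
    `max_calls` call edges (assumed toy stand-in for a module index query)."""
    seen = {fired}
    queue = deque([(fired, 0)])
    while queue:
        function, depth = queue.popleft()
        if depth == max_calls:
            continue  # look-ahead limit reached; do not expand further
        for callee in call_graph.get(function, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return {defining_module[function] for function in seen}


# Hypothetical program: main calls parse and run, which call lex and jit.
call_graph = {"main": ["parse", "run"], "parse": ["lex"], "run": ["jit"]}
defining_module = {
    "main": "main.bc",
    "parse": "front.bc",
    "lex": "front.bc",
    "run": "exec.bc",
    "jit": "jit.bc",
}
print(sorted(modules_to_load(call_graph, defining_module, "main", 1)))
print(sorted(modules_to_load(call_graph, defining_module, "main", 2)))
```

The look-ahead depth is the tuning knob in this model: a larger value loads more modules speculatively, a smaller one risks reaching a function that has not been compiled yet.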

Global Machine Outliner for ThinLTO Kyungwoo Lee (Facebook), Nikolai Tillmann (Facebook)

The existing machine-outliner in LLVM already provides a lot of value to reduce code size but also has significant shortcomings: In the context of ThinLTO, the machine-outliner operates on only one module at a time, and doesn’t reap outlining opportunities that only pay off when considering all modules together. Furthermore, identical outlined functions in different modules do not get deduplicated because of misaligned names. We propose to address these shortcomings: We run machine-level codegen (but not the IR-level optimizations) twice: The first time, the purpose is purely to gather statistics on outlining opportunities. The second time, the gathered knowledge is applied during machine outlining to do more. The core idea is to track information about outlined instruction sequences via a new kind of stable machine instruction hashes that are meaningful and quite exact across modules. In this way, the machine-outliner may outline many identical functions in separate modules. Furthermore, we introduce unique names for outlined functions across modules, and then enable link-once ODR to let the linker deduplicate functions. We also observed that frame-layout code tends to not get outlined: the generated frame-layout code tends to be irregular as it is optimized for performance, using the return address register in unique ways which are not easily outlinable. We change the machine-specific layout code generation to be homogeneous, and we synthesize outlined prologue and epilogue helper functions on demand in a way that can be fitted to frequent patterns actually occurring across all modules. Again, we can gather statistics in the first codegen, and apply them in the second one. Fortunately, it turns out that the time spent in codegen does not dominate the overall compilation, and our approach of running codegen twice represents an acceptable cost. Also, codegen tends to be very deterministic, and the information gathered during the first codegen is highly applicable to the second one. In any case, our optimizations are sound. In our experience, this often significantly increases the effectiveness of outlining with ThinLTO in terms of size and even performance of the generated code. We have observed an improvement in the code size reduction of outlining by a factor of two in some large applications.
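
The statistics-gathering pass can be pictured with a small Python sketch that hashes fixed-length instruction windows with a process-independent hash and counts how often each hash occurs across all modules. This is a simplified illustration under assumptions of our own: instructions are plain strings, the window length is fixed, and SHA-256 stands in for the stable machine instruction hashes mentioned in the abstract, whose actual definition is not given here.

```python
import hashlib
from collections import Counter


def stable_hash(instructions):
    """Hash an instruction sequence so the value is identical in every
    process and module (unlike Python's built-in, per-process str hash)."""
    text = "\n".join(instructions)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cross_module_candidates(modules, window):
    """First pass: count every `window`-long instruction sequence across all
    modules and keep the hashes that occur more than once."""
    counts = Counter()
    for functions in modules.values():
        for body in functions.values():
            for i in range(len(body) - window + 1):
                counts[stable_hash(body[i:i + window])] += 1
    return {h: n for h, n in counts.items() if n > 1}


# Two hypothetical modules that share one three-instruction sequence.
modules = {
    "a.o": {"f": ["ld r1", "add r1, 1", "st r1", "ret"]},
    "b.o": {"g": ["mov r2, 0", "ld r1", "add r1, 1", "st r1"]},
}
shared = cross_module_candidates(modules, window=3)
print(len(shared))
```

In a second pass, each module could consult such a table independently, which is why the hash has to be stable across modules and compilations.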

Embracing SPIR-V in LLVM ecosystem via MLIR Lei Zhang (Google), Mahesh Ravishankar (Google)

SPIR-V is a standard binary intermediate language for representing graphics shaders and compute kernels. It is adopted by multiple open APIs, notably Vulkan and OpenCL. There has been consistent interest in proper SPIR-V support in the LLVM ecosystem, and multiple efforts are driving towards that goal. However, none of them has landed thus far due to SPIR-V’s abstraction level, which raises significant challenges for the existing LLVM CodeGen infrastructure. MLIR enables a different approach to achieve the goal: SPIR-V can be modeled as a dialect with the native abstraction. The dialect conversion framework facilitates interaction with other dialects, allowing conversion to the SPIR-V dialect. This effectively embraces SPIR-V into the LLVM ecosystem. Along this line, this talk discusses how SPIR-V is modeled in MLIR and shows how it is leveraged to build an end-to-end ML compiler (IREE) to target Vulkan compute. Further integration paths are open as well for supporting OpenCL, Vulkan graphics, and interacting with the LLVM dialect. This talk is intended for folks interested in SPIR-V and Vulkan/OpenCL. For folks generally interested in MLIR, this talk gives examples of how to define dialects and conversions in MLIR, together with useful practices and pitfalls to avoid that we found along the way.

PGO: Demystified Internals Pavel Kosov (Huawei R&D)

In this talk we will describe how PGO is implemented in LLVM. First, we will give a general overview of PGO, talk about the instrumentation and sampling pipelines, compare the two kinds of instrumentation (frontend and IR), survey the kinds of counters, and look deeper at the instrumentation implementation (structures, algorithms). Then we will present some practical information: how counters are stored in the executable file and on disk, the profdata format, how it is loaded by LLVM into profile metadata, and how this metadata is used in optimizations. Finally, we will compare with the talk about PGO presented 7 years ago at the LLVM Dev Meeting 2013 (https://llvm.org/devmtg/2013-11/#talk14) and see what has changed and how.

Control-flow sensitive escape analysis in Falcon JIT Artur Pilipenko (Azul Systems)

This talk continues a series of technical talks about the internals of Azul's Falcon compiler. Falcon is a production-quality, highly optimizing JIT compiler for Java based on LLVM. Java doesn't have value types (yet), so all allocations are heap allocations by default. Because of that, idiomatic Java code exposes a lot of opportunities for escape analysis. Over the last year Falcon gained fairly sophisticated control-flow sensitive escape analysis and transformations. At this point this work is mostly downstream, but might be of interest to others. In this session we will look at the cases which motivated this work and give an overview of the design and the use cases of the analysis we built. We will compare it with the existing capture tracking analysis, and discuss the challenges of making existing LLVM transformations and analyses benefit from a smarter escape analysis.

LLVM meets Code Property Graphs Alex Denisov (Shiftleft GmbH), Fabian Yamaguchi (Shiftleft GmbH)

The security of computer systems fundamentally depends on the quality of their underlying software. Despite a long series of research in academia and industry, security vulnerabilities regularly manifest in program code. Consequently, they remain one of the primary causes of security breaches today. The discovery of software vulnerabilities is a classic yet challenging problem of the security domain. In the last decade, several production-grade solutions with a favorable outcome have appeared. Code Property Graph[1] (or CPG) is one such solution. CPG is a representation of a program that combines properties of abstract syntax trees, control flow graphs, and program dependence graphs in a joint data structure. There exist two counterparts[2][3] that allow traversals over code property graphs in order to find vulnerabilities and to extract any other interesting properties. In this talk, we want to cover the following topics:

  • an intro to the code property graphs
  • how we built llvm2cpg, a tool that converts LLVM Bitcode to the CPG representation
  • how we teach the tool to reason about properties of high-level languages (C/C++/ObjC) based on the low-level representation only
  • interesting findings and some results

[1] https://ieeexplore.ieee.org/document/6956589

[2] https://github.com/ShiftLeftSecurity/codepropertygraph

[3] https://ocular.shiftleft.io

Proposal for A Framework for More Effective Loop Optimizations Michael Kruse (Argonne National Laboratory), Hal Finkel (Argonne National Laboratory)

The current LLVM data structures are intended for analysis and transformations on the instruction and control-flow level, but are suboptimal for higher-level optimization. As a consequence, writing a loop optimization involves a lot of work including a correctness check, a custom profitability analysis, and handling many low-level issues. However, even when each individual loop optimization pass itself has the best implementation possible, combined they are not optimal: their profitability models remain separate and, if loop versioning is necessary, each pass duplicates different aspects of the loop nest again and again. Also, phase ordering problems may inhibit optimizations that otherwise would be possible. This motivates an intermediate representation and framework that is centered around loops and can be integrated with LLVM’s optimization pipeline. The talk will present the approach already outlined in an RFC at the beginning of this year.

Student Research Competition
Autotuning C++ function templates with ClangJIT Sebastian Kreutzer (TU Darmstadt), Hal Finkel (Argonne National Laboratory)

ClangJIT is an extension of the Clang compiler that introduces just-in-time compilation of function templates in C++. This feature can be used to generate functions which are specialized for certain inputs. However, especially in computational kernels, the default optimization passes leave much of the potential performance gains on the table. In this work, we try to close this gap by introducing autotuning capabilities to ClangJIT. We employ Polly as a backend for polyhedral optimization and evaluate different code versions, in order to find chains of loop transformations that deliver performance improvements. Using a best-first tree search approach, we are able to demonstrate significant speedups on test kernels.

The Bitcode Database Sean Bartell (University of Illinois at Urbana-Champaign), Vikram Adve (University of Illinois at Urbana-Champaign)

This talk will introduce the Bitcode Database (BCDB), a database that can efficiently store huge amounts of LLVM bitcode. The BCDB can store hundreds of large Linux packages in a single place, without adding significantly to the build time or requiring modifications to the packages. Each bitcode module is split into a separate part for each function, and identical functions are deduplicated, which means that many builds of a program can be kept in the BCDB with minimal overhead. When a program and all of its dynamic libraries are stored in the BCDB, it is possible to link the program and libraries together into a single module and optimize them together. This technique can reduce the size of the final binary by 25-50%, and significantly improve performance in some cases. The talk will conclude with a discussion of more potential uses for the BCDB, such as incremental compilation or efficiently sharing bitcode between different organizations.
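
The deduplication idea, splitting each module into per-function parts and storing identical functions only once, resembles a content-addressed store. The Python sketch below shows that general pattern with function bodies as plain strings; it is not the BCDB's actual storage format or API, and the class and method names are made up for this example.

```python
import hashlib


class FunctionStore:
    """Toy content-addressed store: one blob per distinct function body."""

    def __init__(self):
        self.blobs = {}    # content hash -> function body
        self.modules = {}  # module name -> {function name: content hash}

    def add_module(self, name, functions):
        """Split a module into functions and store each body at most once."""
        manifest = {}
        for function_name, body in functions.items():
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
            self.blobs.setdefault(digest, body)  # identical bodies share a blob
            manifest[function_name] = digest
        self.modules[name] = manifest

    def get_module(self, name):
        """Reassemble a module from its manifest."""
        return {fn: self.blobs[d] for fn, d in self.modules[name].items()}


# Two hypothetical builds of one program that differ in a single function.
store = FunctionStore()
build1 = {"main": "call helper", "helper": "ret 1", "log": "ret void"}
build2 = {"main": "call helper", "helper": "ret 2", "log": "ret void"}
store.add_module("prog-v1", build1)
store.add_module("prog-v2", build2)
print(len(store.blobs))  # 4 distinct bodies are stored for 6 functions
```

Keeping many builds is cheap in this model because each additional build only adds the functions whose bodies actually changed.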

RISE: A Functional Pattern-based Dialect in MLIR Martin Lücke (University of Edinburgh), Michael Steuwer (University of Glasgow), Aaron Smith (Microsoft)

Machine learning systems are stuck in a rut. Paul Barham and Michael Isard, two of the original authors of TensorFlow, come to this conclusion in their recent HotOS paper. They argue that while TensorFlow and similar frameworks have enabled great advances in machine learning, their current design and implementations focus on a fixed set of monolithic and inflexible kernels. We present our work on the MLIR dialect RISE, a compiler intermediate representation inspired by pattern-based program representations like Lift. A set of small generic patterns is provided, which can be composed to represent complex computations. We argue that this approach of using simple reusable patterns to break up large monolithic kernels will enable easier exploration of different novel optimizations for machine learning workloads. Rise is a spiritual successor to Lift and is developed at the University of Edinburgh, University of Glasgow and University of Münster. Martin Lücke is a PhD student from Edinburgh and works on the MLIR implementation of RISE. This work is mainly focused on the representation of the high-level Rise patterns in MLIR, but we will also talk about the challenges of introducing low-level patterns and a rewriting system in the future.

Tutorials
Implementing Common Compiler Optimizations From Scratch Mike Shah (Northeastern University)

In this tutorial I will present several common compiler optimizations performed in LLVM. Chances are you have learned them in your compilers course, but have you ever had the chance to implement them? The following optimizations will be explained and presented: dead code elimination, common subexpression elimination, code motion, and finally function inlining. Attendees will also learn how to generate a control flow graph and visualize it. After leaving this tutorial, attendees should be able to implement more advanced program analysis using the LLVM framework. They will be given a set of exercises that they can then challenge themselves with given the knowledge they learn from this tutorial.
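
As a flavour of the first optimization on that list, the sketch below performs dead code elimination on a toy straight-line IR. It is an assumption-laden illustration, not the tutorial's material and not LLVM's implementation: instructions are hypothetical `(dest, op, operands)` tuples, each name is assigned once, and only `store` and `call` are treated as having side effects.

```python
def eliminate_dead_code(instructions, live_out):
    """Drop side-effect-free instructions whose result is never used.

    Each instruction is (dest, op, operands). The code is walked
    backwards while tracking the set of names that are still needed.
    """
    live = set(live_out)
    kept = []
    for dest, op, operands in reversed(instructions):
        has_side_effect = op in {"store", "call"}
        if dest in live or has_side_effect:
            kept.append((dest, op, operands))
            live.discard(dest)
            # String operands are names; other operands are constants
            live.update(o for o in operands if isinstance(o, str))
    kept.reverse()
    return kept


program = [
    ("a", "const", (1,)),
    ("b", "const", (2,)),
    ("c", "add", ("a", "b")),
    ("d", "mul", ("a", "a")),  # d is never used, so it is dead
    ("e", "add", ("c", "a")),
]
optimized = eliminate_dead_code(program, live_out={"e"})
print([ins[0] for ins in optimized])  # ['a', 'b', 'c', 'e']
```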

LLVM in a Bare Metal Environment Hafiz Abid Qadeer (Mentor Graphics)

This tutorial is about building and validating an LLVM toolchain for embedded bare metal systems. Currently, most of the bare metal toolchains using LLVM depend on an existing GCC installation to provide some runtime bits. In this tutorial, I will go through the steps involved in building an LLVM toolchain that does not have this dependency. The tutorial will cover the following topics:

  • What are multilibs and how to specify them
  • How to generate command line options for compiler, linker and other tools in the driver
  • How building runtime libraries is different from building host tools and ways to build LLVM runtime libraries (compiler-rt, libunwind, libcxxabi, libcxx) for bare metal targets
  • Overview of the LLVM testing and how to test runtime libraries
  • The current testing infrastructure supports testing runtime libraries on emulators like QEMU; how to extend it to real bare metal hardware

MLIR tutorial Oleksandr Zinenko (Google), Mehdi Amini (Google)

MLIR is a flexible infrastructure for defining custom compiler abstractions and transformations, recently introduced to LLVM. It aims at generalizing the success of LLVM’s intermediate representation to new domains, ranging from device instruction sets, to loop abstractions, to graphs of operators used in machine learning. In this tutorial, we will explain how the few core concepts present in MLIR can be combined to represent and transform various IRs, including LLVM IR itself, by demonstrating the development of an optimizing compiler for a custom DSL step by step. The tutorial should be sufficient for the developers of compilers, IRs and similar tools to start using MLIR to implement custom operations with parsing and printing, define custom type systems and implement generic passes over the combination of those. We will provide an overview of MLIR ecosystem and related efforts, building the analogy with existing LLVM subsystems and frequently discussed LLVM extension proposals, e.g. loop optimizations or GPU-specific abstractions.

How to Give and Receive Code Reviews Kit Barton (IBM Canada), Hal Finkel (ANL)

Code reviews are a critical component of the development process for the LLVM Community. Code maintainers rely on the code review process to ensure a high quality of code and to serve as an early detection and prevention mechanism for potential bugs. Developers also benefit greatly from code reviews through the insight and suggestions they receive from the reviewers. This tutorial will cover the code review process from both the developer's and the reviewer's point of view. As a developer, there are several guidelines to follow when preparing patches for review, as well as common etiquette to follow during the review process. As a reviewer, there are many things to look for during the review (correctness, style, computational complexity, etc.). This talk will discuss both these roles in depth. It will use demonstrations with Phabricator to emphasize several aspects of the code review process. It will also highlight several features in Phabricator that can be used during code reviews. The focus will be to summarize the current best practices for code reviews that have been discussed on the llvm-dev mailing list and summarized on our website (https://llvm.org/docs/CodeReview.html). It is meant to be as interactive as possible, with questions during the presentation encouraged.

From C to assembly: adding a custom intrinsic to Clang and LLVM Mateusz Belicki (Intel)

This tutorial will introduce you to all the steps necessary to create a Clang intrinsic (builtin function) and extend LLVM to generate code for it. It aims to provide a complete manual for adding a custom target-specific intrinsic, including exposing it to the source language. After completing this tutorial you should be able to extend Clang with a custom intrinsic and know how to handle it in LLVM, including steps to test and debug your changes at different stages of development. Fluency in C++ and general programming concepts is expected. The tutorial will try to accommodate listeners with no prior knowledge of LLVM or compiler-specific topics, but it is recommended to complete a general introductory LLVM tutorial first.

BoFs
Let the compiler do its job? Sjoerd Meijer (ARM)

At the 2019 US LLVM developers' meeting we presented Arm's new M-profile Vector Extension (MVE), which is a vector extension for Arm's microcontrollers to accelerate execution of DSP workloads. While it is still early days for this new architecture extension and its compiler support, we are now getting experience with vectorisation for this DSP-like architecture. That is, after adding compiler support for the new architecture features such as vectorisation, predication, and hardware loops, which is still ongoing work, we are now also confronted with the next challenge: adoption of the technology. The main question is: will LLVM's auto-vectorisation and MVE code generation be good enough for DSP workloads so that people will give up writing intrinsics and even assembly, and can we thus just let the compiler do its job? Since DSP workloads are usually characterised by small, tight loops where every cycle counts, any compiler translation inefficiency means resorting to hand-tuned intrinsics/assembly code, which obviously comes at the expense of portability and maintainability of this code. For this reason, and for software ecosystem legacy reasons, the auto-vectoriser's competition for DSP workloads is often still hand-tuned intrinsics/assembly code, but can we change that? In order to answer this question, we need to have a closer look at:

  • What exactly are these DSP workloads? Are there industry accepted benchmarks and workloads, and which DSP idioms are important to translate efficiently?
  • How well is the auto-vectoriser performing against intrinsics, and how far off are we if there is a gap?
  • Do we see obvious areas to improve the vectoriser?
  • Besides performance, usability of the toolchain is crucial. That is, if performance goals are not met, how easily can users get insight into the compiler's auto-vectorisation decision making, and how can they influence and steer it to achieve better results?

Debugging a bare-metal accelerator with LLDB Romaric JODIN (UPMEM)

UPMEM made an accelerator based on PiM (Processing in Memory). It is a standard DRAM-based DDR4 DIMM where each DRAM chip embeds several multi-threaded processors capable of running a program on the data stored in the DRAM chip. In order to debug such a target, we have made some modifications to LLDB so that it can interact with the accelerator. In particular, as no server or gdb stub can run on the accelerator, we added an lldb-server for our bare-metal target that runs on the host CPU (which can be viewed as a kind of cross-compiled server), and we modified LLDB at different points to get it working. We are using a single lldb client instance to debug both the application running on the host CPU and the multiple accelerator CPUs it is using. The aim of the BoF is to present those modifications and discuss how to make LLDB friendlier to such targets, including re-using the lldb-server code for remote targets without an operating system.

LLVM Binutils BoF James Henderson (SN Systems (Sony Interactive Entertainment))

LLVM has a suite of binary utilities that broadly mirror the GNU binutils suite, with tools such as llvm-readelf, llvm-nm, and llvm-objcopy. These tools are already widely used in testing the rest of LLVM, and have also been adopted as full replacements for the GNU tools in some production environments. This discussion will be a chance for people to present how their migration efforts are going, and to highlight what is impeding their adoption of the tools. It will also provide the opportunity for participants to discuss potential new features and the future direction of new tools.

FunC++. Make functional C++ more efficient Pavel Kosov (Huawei R&D)

Nowadays, functional programming (FP) in C++ is not as efficient as it could be, mainly because of weak optimization of features such as std::variant, std::visit, std::function, etc. I will present a list of cases where improvements are possible and then propose several solutions. Let’s discuss them, and maybe we will be able to find other ways to make functional programming in C++ more usable. It is worth mentioning that the benefit of this work will extend to all C++ programmers, not only FP fans (because std::variant, std::function, etc. are used in a lot of different applications).

Loop Optimization BoF Michael Kruse (Argonne National Laboratory), Kit Barton (IBM)

In this Birds-of-a-Feather session we will discuss the current and future development around loop optimizations in LLVM, summarizing and building on topics discussed during the bi-weekly Loop Optimization Working Group conference call. The topics that we intend to discuss include:

  • Loop pass infrastructure such as the pass managers
  • Specific loop passes (LoopVectorize, LoopUnroll, LoopUnrollAndJam, LoopDistribute, LoopFuse, LoopInterchange)
  • Polly and other polyhedral analysis capabilities (e.g., in MLIR)
  • Analyses (LoopInfo, ScalarEvolution, LoopNestAnalysis, LoopCacheAnalysis, etc.)
  • Dependence analysis, in particular progress on the DataDependenceGraph and PragmaDependencyGraph
  • Canonical loop forms (such as rotated, simplified, LCSSA, max-fused or max-distributed, etc.)
  • User-directed transformations
  • Alternative intermediate representations (MLIR, VPlan, Loop Hierarchy)

Code Size Optimization Sean Bartell (University of Illinois at Urbana-Champaign)

Code size is often overlooked as a target of optimization, but is still important in situations ranging from space-constrained embedded devices to improving cache coherency on supercomputers. This will be an open-ended BoF for anyone interested in optimizing code size. Potential topics of discussion include benefits of reducing code size, size optimization techniques, and related improvements that could be made to LLVM.

Panels
Vector Predication Andrew Kaylor (Intel), Florian Hahn (Apple), Roger Ferrer Ibáñez (Barcelona Supercomputing Center), Simon Moll (NEC Deutschland)

LLVM lacks support for predicated vector instructions. Predicated vector operations in LLVM IR are required to properly target SIMD/Vector ISAs such as Intel AVX512, ARM MVE/SVE, RISC-V V-Extension and NEC SX-Aurora TSUBASA. This panel discusses various design ideas and requirements to bring native vector predication to LLVM with the goal of opening up ongoing efforts to the scrutiny of the wider LLVM community. This panel follows up on various round tables and the BoF at EuroLLVM 2019. We are planning to address the following aspects:

  • Design alternatives & choices - limits of the instruction+select pattern.
  • Generating vector-predicated code (i.e., making predicated ops available for VPlan/LV/RV).
  • Making existing optimizations work for vector-predicated code.
  • The LLVM-VP (D57504) prototype and roadmap.

The panelists have a diverse background in X86, RISC-V V extension and NEC SX-Aurora code generation as well as experience with SLP/LV/VPlan vectorizers and the out-of-tree Region Vectorizer, constrained fp and the current RFCs to bring predicated vector operations to LLVM.

OpenMP (Target Offloading) in LLVM [Panel/BoF] Johannes Doerfert (ANL)

Offloading, thus moving computation to accelerators, has (to) become reality in various fields, including but not exclusively HPC. OpenMP is a promising language for many people as it integrates well into existing code bases written in C/C++ or Fortran. In this Panel (or BoF) we want to give people an overview of the current support, what is being worked on, and how researchers can impact this important topic. While we hope for questions from the audience, we will present various topics to start the conversation, including:

  • the redesign of the OpenMP device runtime library to support more targets
  • the OpenMP optimization pass and scalar optimizations
  • OpenMP 5.0 and 5.1 support
  • OpenMP in Flang

The panelists are from companies and institutions involved in these efforts. We are in contact with Jon Chesterfield (AMD), Simon Moll (NEC), Xinmin Tian (Intel), and Alexey Bataev (IBM), as well as representatives from national labs and other hardware vendors. Note that depending on the format we will need to list more people as authors.

Lightning talks
Support for mini-debuginfo in LLDB - How to read the .gnu_debugdata section. Konrad Kleine (Red Hat)

The "official" mini-debuginfo man-page describes the topic best:

  Some systems ship pre-built executables and libraries that have a special ".gnu_debugdata" section. This feature is called MiniDebugInfo. This section holds an LZMA-compressed object and is used to supply extra symbols for backtraces.

  The intent of this section is to provide extra minimal debugging information for use in simple backtraces. It is not intended to be a replacement for full separate debugging information (see Separate Debug Files).

In this talk I'll explain what it took to interpret support for mini-debuginfo in LLDB, how we've tested it, and what to think about when implementing this support (e.g. merging .symtab and .gnu_debugdata sections).

OpenACC MLIR dialect for Flang and maybe more Valentin Clement (Oak Ridge National Laboratory), Jeffrey S. Vetter (Oak Ridge National Laboratory)

OpenACC [1] is a directive-based programming model to target heterogeneous architectures with minimal changes to the original code. The standard is available for Fortran, C and C++. It is used in a variety of scientific applications to exploit the compute power of the biggest supercomputers in the world. While there is a wide range of approaches in C and C++ to target accelerators, Fortran is stuck with directive-based programming models like OpenMP and OpenACC. In this lightning talk we present our idea to introduce an OpenACC dialect in MLIR and implement the standard in Flang/LLVM. This project might benefit other efforts like the Clacc [2] project, which does this in Clang/LLVM.

[1] OpenACC standard: https://www.openacc.org/

[2] Clacc: Translating OpenACC to OpenMP in Clang. Joel E. Denny, Seyong Lee, and Jeffrey S. Vetter. 2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), Dallas, TX, USA, (2018).

LLVM pre-merge checks Mikhail Goncharov (Google), Christian Kühnel (Google)

I would like to give a short presentation about https://github.com/google/llvm-premerge-checks to advertise pre-merge checks: why we have them and how they work.

LIT Testing For Out-Of-Tree Projects Andrzej Warzynski (Arm)

Have you ever wondered how to configure LLVM's Integrated Tester (LIT) for your out-of-tree LLVM projects? Would you like to know how to use hosted CI services to run your LIT tests automatically? As most of these services are free for open source projects, it is really worthwhile to be familiar with the available options. In this lightning talk I will present how to:

  • configure LIT for an out-of-tree project
  • satisfy a dependency on LLVM in a hosted CI system.

As a reference example I will use the set-up that I have been using for a hobby GitHub project.

Inter-Procedural Value Range Analysis with the Attributor Hideto Ueno (University of Tokyo), Johannes Doerfert (ANL)

In the talk, I’ll explain how inter-procedural propagation in the Attributor framework works, focusing on the new range analysis and illustrative code examples.
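
The core idea of inter-procedural range propagation can be sketched with plain intervals. This is a simplified illustration under stated assumptions, not the Attributor's API: the helpers `join` and `add_const` and the example function `f(x) = x + 1` with two constant call sites are all hypothetical.

```python
def join(r1, r2):
    """Smallest interval covering both inputs (lower, upper)."""
    return (min(r1[0], r2[0]), max(r1[1], r2[1]))


def add_const(r, c):
    """Range of x + c given the range of x."""
    return (r[0] + c, r[1] + c)


# Hypothetical program: f(x) = x + 1, called as f(3) and f(10)
call_site_args = [(3, 3), (10, 10)]

# The argument range of f is the join over all known call sites
arg_range = call_site_args[0]
for r in call_site_args[1:]:
    arg_range = join(arg_range, r)

# The return range follows from the argument range and flows back to callers
ret_range = add_const(arg_range, 1)
print(arg_range, ret_range)  # (3, 10) (4, 11)

# A caller could now fold a comparison such as `f(...) < 100` to true
always_less_than_100 = ret_range[1] < 100
```

The sketch only works because every call site of `f` is visible; with unknown callers the argument range would have to stay unconstrained.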

Reproducers in LLVM - inspiration for clangd? Jan Korous (Apple)

Supporting wide-scale deployment of clangd is going to create a need to have a way of reporting bugs that is both convenient for users and actionable for maintainers. The idea of reproducers was successfully implemented in other projects under the LLVM umbrella, for example clang and lldb. Here's an overview of how these work and what ideas could be used in clangd.

Matrix Support in Clang and LLVM Florian Hahn (Apple)

Fast matrix operations are the key to the performance of numerical linear algebra algorithms, which serve as engines of machine learning networks and AR applications. We added support for key matrix operations to Clang and LLVM. We will show examples at the C++ language level, discuss LLVM intrinsics for matrix operations that require information about the shape/layout of the underlying matrix, and compare the performance to vanilla vector-based implementations.

Unified output format for Clang-Tidy and Static Analyzer Artem Dergachev (Apple)

Warnings emitted by the Clang Static Analyzer are more sophisticated than normal compiler warnings and are hard to comprehend without a good graphical interface. For that reason the Analyzer uses a custom diagnostic engine that supports multiple output formats, such as the human-readable HTML output format and the machine-readable Plist format used for IDE integration. These output formats are now available for other tools to use. In particular, Clang-Tidy is ported over to the Static Analyzer's diagnostic engine, allowing easy integration of Clang-Tidy into any environment that already provides Static Analyzer integration.

Extending ReachingDefAnalysis for Dataflow analysis Samuel Parker (Arm)

ReachingDefAnalysis was originally introduced to enable breaking false dependencies in the backend. It has now been extended to enable post-RA dataflow queries that can enable the movement, insertion or removal of machine instructions. This lightning talk will highlight the changes and aim to show the audience how this is useful for code generation.

Flang Update Steve Scalpone (NVIDIA / Flang)

Provide an update about flang with an overview of changes since the last developer's meeting and the changes planned for the near future. Topics will cover migration to the monorepo, integration with MLIR, current in-flight projects, etc.

Extending Clang and LLVM for Interpreter Profiling Perf-ection Frej Drejhammar (RISE SICS)

When profiling a highly optimized interpreter, such as the Erlang virtual machine, a profiler does not really give you the information you need. This talk will show how surprisingly easy it is to extend Clang and LLVM to solve a one-off profiling task using the Perf tool. The Erlang virtual machine (BEAM) is a classic threaded interpreter, using first-class labels and gotos, contained in a single function. For profiling purposes this is bad, as the profiler will attribute execution time to the main interpreter function when you as a developer really want execution time attributed to individual BEAM opcodes. By adding custom attributes to Clang and an analysis late in the LLVM back-end, we can easily traverse the CFG of the interpreter and figure out which basic blocks are executed by each BEAM opcode. With a small patch to Perf's JIT interface, we can make this basic block information override the debug information for the main interpreter function, thus allowing Perf to assign execution time to individual BEAM opcodes.
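
The CFG traversal step can be pictured roughly as follows. This is a hedged sketch, not the talk's implementation: the function `blocks_per_opcode`, the dictionary-based CFG, and the block names are hypothetical, and the assumption that traversal stops at other handlers' entry blocks is a stand-in for the indirect `goto` that dispatches the next opcode.

```python
def blocks_per_opcode(cfg, handler_entries):
    """Map each opcode to the basic blocks reachable from its handler entry.

    cfg: {block: [successor blocks]}
    handler_entries: {opcode name: entry block}
    Traversal does not follow edges into any handler's entry block.
    """
    entries = set(handler_entries.values())
    result = {}
    for opcode, entry in handler_entries.items():
        seen = set()
        stack = [entry]
        while stack:
            block = stack.pop()
            if block in seen:
                continue
            seen.add(block)
            for succ in cfg.get(block, []):
                if succ not in entries:
                    stack.append(succ)
        result[opcode] = seen
    return result


# Toy interpreter with two opcodes; names are illustrative only
cfg = {
    "add_entry": ["add_fast", "add_slow"],
    "add_fast": ["move_entry"],
    "add_slow": ["move_entry"],
    "move_entry": ["add_entry"],
}
mapping = blocks_per_opcode(cfg, {"add": "add_entry", "move": "move_entry"})
```

With such a mapping, samples that land in `add_fast` or `add_slow` can be charged to the `add` opcode instead of to the interpreter function as a whole.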

Data Parallel C++ compiler for accelerator programming Alexey Bader (Intel), Oleg Maslov (Intel)

This talk introduces the clang-based SYCL compiler with a focus on the front-end and driver enhancements enabling offloading of C++ code to a wide range of accelerators. We will cover the "SYCL device compiler" design and demonstrate how we leverage existing LLVM project infrastructure for offload code outlining, separate diagnostics for offload code and driver offload mode. We also review how third-party open source tools from the Khronos working group are used to make our solution portable across different types of accelerators supporting OpenCL. We discuss the ABI between host and device parts of the application and how to integrate the SYCL offloading compiler with an arbitrary C++11 compiler in addition to clang. We will give an update on the current status of SYCL support in Clang and plans for future development.

CUDA2OpenCL - a tool to assist porting CUDA applications to OpenCL Anastasia Stulova (Arm), Marco Antognini (Arm)

Conceptually, CUDA and OpenCL are similar programming models. Therefore it is feasible to convert applications from one to the other, especially after the recent development of C++ for OpenCL (https://clang.llvm.org/docs/UsersManual.html#cxx-for-opencl), which allows writing OpenCL applications fully in C++ mode. In this talk we would like to present a tool that uses Clang Tooling and Rewriter to help migrate applications from CUDA to OpenCL. This tool combines (i) automatic rewriting for trivial and safe changes; (ii) source code annotation for non-trivial changes to assist manual porting of applications. We use Clang Tooling to parse the CUDA source and create an Abstract Syntax Tree (AST). Then a custom AST Consumer will visit the AST and, with the help of Clang Rewriter, will either modify the original source or insert annotation comments. If the mapping between CUDA and OpenCL constructs is straightforward, the construct is likely to be rewritten, e.g., address space, kernel attribute, kernel invocation. If the mapping is not straightforward, the tool emits annotations explaining how the code can be modified manually, e.g., if CUDA __shared__ variables are declared in a scope disallowed by OpenCL. Unlike OpenCL, CUDA combines device (also known as kernel) and host code into one single source file. The tool will output two so-called OpenCL code templates, one for the host side and one for the device side. In each template, irrelevant code will be stripped out from the original, trivial constructs will be rewritten and annotation hints will be added. Both templates can be further modified if needed and then compiled using any C++ compiler for the host template and using Clang for the device template. The tool is at an early stage of development and we are planning to open source it by the time of EuroLLVM 2020. The mechanics are now fully in place but we don’t support many CUDA features yet and therefore only a few simple examples can run successfully.
We would like to invite developers to use the tool and provide feedback on the missing features they would like to see added, or even to help us add popular features that are missing. One aim of this project is to keep the output from the tool as close to the original source as possible to allow developers to read and modify the output manually. While Clang Tooling and Rewriter are excellent choices to accomplish our goals, there are a number of suggestions for improvements that we are hoping to highlight, e.g. improving accuracy of source information in Rewriter and propagation of build options from Clang Driver.
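
The split between automatic rewriting and annotation can be illustrated with a deliberately naive, line-based sketch. It is not how the tool works (the abstract describes an AST-based approach with Clang Tooling and Rewriter); the tables `TRIVIAL` and `NEEDS_REVIEW`, the function `port_line`, and the wording of the inserted note are assumptions made for illustration, using well-known CUDA and OpenCL spellings.

```python
# Well-known CUDA -> OpenCL spellings; the tool's actual rule set is not given
TRIVIAL = {
    "__global__": "__kernel",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
}

# Constructs this sketch only flags for manual porting (illustrative choice)
NEEDS_REVIEW = ["__shared__"]


def port_line(line):
    """Rewrite trivial constructs and annotate ones that need manual work."""
    for cuda, opencl in TRIVIAL.items():
        line = line.replace(cuda, opencl)
    for construct in NEEDS_REVIEW:
        if construct in line:
            line += "  // PORTING NOTE: check " + construct + " against OpenCL scoping rules"
    return line


source = [
    "__global__ void scale(float *data) {",
    "  __shared__ float tile[64];",
    "  int i = threadIdx.x;",
    "}",
]
ported = [port_line(line) for line in source]
print("\n".join(ported))
```

Text substitution like this would misfire on identifiers inside strings or comments, which is one reason an AST-based rewriter is the more robust choice.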

Experiences using MLIR to implement a custom language Klas Segeljakt (KTH - Royal Institute of Technology)

In this lightning talk, we will share our experiences using MLIR, both as experienced and beginner LLVM users, when implementing a middle-end for the language Arc. We will cover learning how to use the framework, creating custom operations, types, optimizations, and transforms, and integrating MLIR as a dependency into our research project. Arc is a functional intermediate representation for data analytics which is able to express distributed online stream operations. We use the standard optimizations provided by MLIR and implement our Arc-specific high-level optimizations in the MLIR framework. The MLIR framework gives us optimizations such as common subexpression elimination and constant propagation. In contrast to other compilers in the LLVM world, we do not lower our MLIR-level program to LLVM IR, instead we stay at the high-level dialects and produce Rust source code which is compiled and executed by our runtime system.

llvm-diva – Debug Information Visual Analyzer Carlos Enciso (Sony Interactive Entertainment)

Complexity and source-to-DWARF mapping are common problems with LLVM’s debug information. For example, see the different sections used to store several items such as strings, types, locations lists, line information, executable code, etc. In 2017 we presented DIVA [1] which we have successfully used to analyse several debug information issues in Clang and LLVM. DIVA used libdwarf [2] to parse DWARF debug information from ELF files. We have since re-implemented and expanded upon this functionality in llvm-diva, a new tool which requires no additional dependencies outside of LLVM. llvm-diva is a command line tool that reads a file (e.g. ELF or PDB) containing debug information (DWARF or CodeView) and produces an output that represents its logical view. The logical view is a high-level representation of the debug information composed of scopes, types, symbols and lines. llvm-diva has two modes: Printing and Comparison. The first prints a logical view containing attributes such as: lexical scopes, disassembly code associated with the debug line records, types, variables percentage coverage, etc. The second compares logical views to produce a report with the logical elements that are missing or added. This is a very powerful aid to find semantic differences in debug information produced by different toolchain versions, or even debug information formats [3]. The tool currently supports the ELF, MacOS and PDB file formats and the DWARF and COFF debug information formats. In this lightning talk I will show some of the above features, to illustrate how to use llvm-diva with the debug information generated by Clang. We aim to propose llvm-diva for inclusion into the LLVM monorepo soon.

[1] https://llvm.org/devmtg/2017-03/assets/slides/diva_debug_information_visual_analyzer.pdf

[2] https://www.prevanders.net/dwarf.html

[3] https://llvm.org/PR43905

Optimization Pass Sandboxing in LLVM: Replacing Heuristics on Statically Scheduled Targets Pierre-Andre Saulais (Codeplay Software)

Many optimizations operate using a parameter that affects how the program is transformed. Examples include the unrolling factor for loop unrolling or the offset for software pipelining. The value of this parameter is typically chosen at compilation time using a heuristic, which may involve a model of the execution target to accurately predict the effect of the optimization. On statically scheduled targets such as some in-order processors, the effect of later backend passes such as packetization, scheduling and register allocation on performance makes writing such a model very difficult. Since it is typically straightforward to estimate the performance of a given block of assembly instructions, trying multiple values for a pass parameter and picking the one that produces the best code gives more accurate results at the expense of compilation time. With optimization pass sandboxing, a pass is executed multiple times in a sandbox, once for each of a selection of values. The entire LLVM backend pass pipeline is also executed in isolation in order to produce assembly, from which a performance metric is estimated. The value with the best metric is then chosen for the pass parameter, and the sandbox results are discarded.
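
The selection loop described above can be sketched as follows. Everything here is a toy model under stated assumptions: `sandbox_select` and the stand-in `unroll`, `backend`, and `cost` functions are hypothetical, and the cost formula (a fixed loop overhead plus a spill penalty beyond 16 operations) is invented purely to make the trade-off visible.

```python
import copy


def sandbox_select(function_ir, candidates, run_pass, run_backend, estimate_cost):
    """Try each candidate value on a private copy and keep the cheapest."""
    best_value, best_cost = None, float("inf")
    for value in candidates:
        trial = copy.deepcopy(function_ir)  # sandbox: the original is untouched
        run_pass(trial, value)
        assembly = run_backend(trial)
        cost = estimate_cost(assembly)
        if cost < best_cost:
            best_value, best_cost = value, cost
    return best_value


# Toy stand-ins for the pass, the backend pipeline and the cost estimate
def unroll(ir, factor):
    ir["unroll"] = factor


def backend(ir):
    return {
        "iterations": ir["trip_count"] // ir["unroll"],
        "ops_per_iter": ir["body_ops"] * ir["unroll"],
    }


def cost(asm):
    spill = max(0, asm["ops_per_iter"] - 16) * 2  # made-up register pressure penalty
    return asm["iterations"] * (asm["ops_per_iter"] + 3 + spill)


loop = {"body_ops": 4, "trip_count": 16}
best = sandbox_select(loop, [1, 2, 4, 8], unroll, backend, cost)
print(best)  # 4
```

In this model the estimated costs are 112, 88, 76 and 134 for factors 1, 2, 4 and 8, so the largest factor loses once the invented spill penalty kicks in, and the original `loop` description is left unmodified.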

Compile Faster with the Program Repository and ccache Ying Yi (SN Systems Limited), Paul Bowen-Huggett (SN Systems Limited)

The Program Repository (llvm-prepo) is an LLVM/Clang compiler with program repository support. It aims to improve turnaround times and eliminate duplication of effort by centralising program data in a repository. This reduces compilation time by reusing previously optimised functions and global variable fragments, including both sharing them across multiple translation units and reusing them even when other portions of the relevant source files have changed. ccache is a compiler caching tool that uses textual hashing of the source files. When used to build a large project, the ccache cache can quickly become invalid due to the frequency of header file changes. Thus, llvm-prepo reduces the build time for changed files, whereas ccache reduces the build time for unchanged files. This lightning talk will focus on showing how using the llvm-prepo and ccache together achieves much faster builds than using either of them individually. We will show the benefits by building the LLVM+Clang project at points through its commit history.

Adventures using LLVM OpenMP Offloading for Embedded Heterogeneous Systems Lukas Sommer (TU Darmstadt)

Modern embedded systems combine general-purpose processors with accelerators, such as GPUs, in a single, powerful heterogeneous system-on-chip (SoC). Such systems can be efficiently programmed using the device offloading features introduced in recent versions of the OpenMP standard. In this talk, we present an extension of LLVM's OpenMP Nvidia GPU offloading capabilities for embedded, heterogeneous systems combining ARM CPUs and Nvidia GPUs. Additionally, we adapted libomptarget and its Nvidia GPU plugin to make use of physically shared memory on the device through the CUDA unified memory model. We demonstrate the use of the adapted infrastructure on three automotive benchmark-kernels from the autonomous driving domain. Our adapted LLVM OpenMP offloading infrastructure allows the user to significantly improve execution times on embedded, heterogeneous systems by allocating unified memory for simultaneous use on CPU and GPU and thereby eliminating unnecessary data-transfers.

Merging Vector Registers in Predicated Codes Matthias Kurtenacker (Compiler Design Lab, Saarland University), Simon Moll (NEC Germany), Sebastian Hack (Compiler Design Lab, Saarland University)

Vector predication allows vectorizing if-converted code. New architectures, and extensions to existing ones, make it possible to enable and disable execution on individual vector lanes during program execution. As with predication in the scalar case, static analyses over the predicates allow refining the register allocation process. The liveness information over a vector value can be extended to include liveness predicates as well. This can be used, for instance, to reduce the amount of spilling that a function needs to perform. We extend the greedy register allocator to take per-lane liveness information into account when allocating vector registers. The target-dependent parts of this approach were implemented for NEC's SX-Aurora TSUBASA architecture. First benchmarks show promising results with speedups of up to 16%.
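
The intuition behind per-lane liveness can be shown with bitmasks: two vector values whose live lanes never overlap could occupy the same physical register. The sketch below is an illustration only, not the allocator extension from the talk; `can_share_register`, `merge`, and the 4-lane example are hypothetical.

```python
def can_share_register(live_lanes_a, live_lanes_b):
    """Two values may share a register if no lane is live in both."""
    return (live_lanes_a & live_lanes_b) == 0


def merge(reg_a, reg_b, mask_a, num_lanes):
    """Pack two values into one register, taking lanes of a where mask_a is set."""
    return [reg_a[i] if (mask_a >> i) & 1 else reg_b[i] for i in range(num_lanes)]


# If-converted code: the "then" value is live where the predicate holds,
# the "else" value is live in the remaining lanes
then_mask = 0b0101
else_mask = 0b1010
then_val = [10, 0, 30, 0]  # lanes 0 and 2 carry meaningful data
else_val = [0, 21, 0, 41]  # lanes 1 and 3 carry meaningful data

shareable = can_share_register(then_mask, else_mask)
merged = merge(then_val, else_val, then_mask, 4)
print(shareable, merged)  # True [10, 21, 30, 41]
```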

OpenMP in LLVM --- What is changing and why Johannes Doerfert (ANL)

This lightning talk will give a short overview of all the currently ongoing efforts involving OpenMP. We will (try to) highlight the following topics with their respective rationale:

  • The OpenMPOpt pass, the dedicated optimization pass that knows about and transforms OpenMP runtime calls.
  • The OpenMPIRBuilder, the new location for *all* OpenMP related code generation.
  • The interplay of OpenMP and Flang.
  • The implementation of OpenMP loop transformations.
  • The OpenMP device runtime redesign, a stepping stone to allow us to support more than a single offloading target.
  • Scalar optimization for outlined OpenMP functions, transparent in the Attributor framework.

A Multidimensional Array Indexing Intrinsics Prashanth NR (Compiler Tree Technologies), Vinay Madhusudan (Compiler Tree Technologies), Ranjith Kumar (Compiler Tree Technologies)

LLVM linearizes multidimensional array indices. This hinders the memory dependence analysis needed for loop nest optimization. Techniques like delinearization are ad hoc and pattern-based. Newer front ends like FC and F18 plan to alleviate the issue by using a new high-level IR called MLIR. For traditional front ends like flang, where MLIR lowering is not planned, a new technique is proposed to circumvent the issue. We use intrinsics in the front end to communicate the dimensions of array indices. We have implemented this in the flang/clang frameworks and have successfully experimented with moderately large input programs.
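As a rough illustration of why linearization loses information, the following Python sketch flattens a multidimensional index in row-major order. The function name `linearize` and the example shapes are assumptions made for illustration only; they are not part of the proposed intrinsics or of LLVM.

```python
def linearize(indices, dims):
    """Row-major (C-style) flattening of a multidimensional index.

    Hypothetical helper for illustration: `indices` and `dims` are
    equal-length sequences, outermost dimension first.
    """
    offset = 0
    for idx, dim in zip(indices, dims):
        offset = offset * dim + idx
    return offset


# Two different accesses into differently shaped arrays collapse to the
# same flat offset, so the offset alone no longer reveals the shape.
a = linearize((2, 3), (4, 6))  # element [2][3] of a 4x6 array
b = linearize((3, 3), (6, 4))  # element [3][3] of a 6x4 array
print(a, b)
```

Both calls yield the same offset, which is why an analysis that sees only the flat offset has to guess the original dimensions, and why passing the dimensions explicitly can help.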

Improving Code Density for RISC-V Target Wei Wei (Huawei), Chao Yu (Huawei)

RISC-V is an open-source instruction set architecture designed to be useful in a wide range of embedded applications and devices. For many resource-constrained microcontrollers, code density is a very important metric. The compressed extension (named RVC) in RISC-V is designed to reduce instruction bandwidth for common instructions, resulting in a 25%–30% code-size reduction. In this talk I'll present some code size results from the LLVM and GCC compilers with RVC, and examine why the GCC-generated code is more compact. Finally, I will describe some of the implementation work we are doing on the LLVM side to close these code size gaps.

Posters
Automatic generation of LLVM based compiler toolchains from a high-level description Pavel Snobl (Codasip)

At Codasip we have developed a method for automatic generation of LLVM-based compilers from a high-level architecture description language called CodAL. From this description, the register and instruction set architecture (ISA) definition is extracted in a process we call semantics extraction. This definition is then used as input to a tool called backendgen, which uses it to generate a fully functional C/C++ cross compiler. The high-level description is also used to generate all other parts of a standard SDK needed to develop applications for a typical processor: an LLVM-based assembler and disassembler, a linker (LLD), a debugger (LLDB) and a simulator. In this short talk and the related poster, I will describe the CodAL language and the process of automatic compiler generation, and how it allows users with no previous compiler development experience to quickly create an LLVM-based toolchain for their architecture.

Using MLIR to implement a compiler for Arc, a language for Batch and Stream Programming Klas Segeljakt (KTH - Royal Institute of Technology), Frej Drejhammar (RISE SICS)

This poster covers the design and implementation of a compiler using MLIR for the language Arc. Arc is an intermediate representation for data analytics which supports distributed online stream operations, and comes with its own compilation pipeline and runtime system. The Arc compiler uses the MLIR framework for high-level optimizations. Using MLIR allows us to concentrate on defining Arc-specific optimizations and reuse standard high-level optimizations provided by MLIR. In addition, MLIR offers a rich infrastructure for representing the Arc parse tree, custom transformations, command-line parsing, and regression testing. The Arc compiler translates its parse tree into MLIR's Affine and Standard dialects together with a new dialect for the Arc-specific operations. We define Arc-specific dataflow optimizations, such as operator reordering, fission, and fusion using the MLIR framework. The MLIR framework leverages optimizations such as common subexpression elimination and constant propagation. In contrast to other compilers in the LLVM world, we do not lower our MLIR-level program to LLVM IR; instead we stay at the high-level dialects and produce Rust source code which is compiled and executed by the runtime.

MultiLevel Tactics: Lifting loops in MLIR Lorenzo Chelini (TU Eindhoven), Andi Drebes (Inria and École Normale Supérieure), Oleksandr Zinenko (Google), Albert Cohen (Google), Henk Corporaal (TU Eindhoven), Tobias Grosser (ETH), Nicolas Vasilache (Google)

We propose MultiLevel Tactics, or ML Tactics for short, an extension to MLIR that recognizes patterns of high-level abstractions (e.g., linear algebra operations) in low-level dialects and replaces them with the corresponding operations of an appropriate high-level dialect. Our current prototype recognizes matrix multiplications in loop nests of the Affine dialect and lifts these to the Linalg dialect. The pattern recognition and replacement scheme are designed as reusable building blocks for transformations between arbitrary dialects and can be used to recognize commonly recurrent patterns in HPC applications.

Interpreted Pattern Matching in MLIR with MLIR Jeff Niu (Google), Mehdi Amini (Google), River Riddle (Google)

A pattern matching and rewrite system underlies many of MLIR’s transformations on code, including optimizations, canonicalization, and operation legalization. The current approach to pattern execution involves writing C++ classes to implement a match and rewrite function or using TableGen to describe patterns, from which a backend generates C++. This method is powerful, easy to use, and fits nicely into the overall system, but suffers from some pitfalls:

  • Not extensible at runtime: adding or modifying patterns requires rebuilding the compiler, which makes it cumbersome for users to easily modify pattern sets, especially for those not normally working with C++.
  • Duplicate work between patterns: many patterns have similar constraints and checks, some of which can be expensive. E.g. attribute lookups are linear searches using string comparisons. Current pattern generation involves no intermediate form upon which optimizations may be performed.
  • C++ code generation from TableGen results in binary size bloat.

The proposed solution involves representing pattern sets as bytecode and executing it in an interpreter embedded in MLIR, as with SelectionDAGISel, but using a pipeline built with MLIR and representing patterns as an MLIR dialect. This pattern dialect should be able to express a superset of TableGen patterns and, if necessary, hook into native function calls to provide power similar to writing C++ patterns. Optimizations can be performed on sets of patterns represented in this intermediate form, which is then injected into the existing framework, allowing interoperability with existing C++ patterns. Allowing emission of this intermediate form from “front-ends”, such as Python, JSON, and TableGen, enables users to specify patterns dynamically, without rebuilding the compiler. Pattern sets can then be distributed separately from the compiler itself, or users can modify patterns on the fly with whatever DSL they work in. This specification leads to a series of sub-problems. These include designing the pattern dialect to be feature-complete, optimizing this intermediate form, “lowering” pattern sets into bytecode, and designing the interpreter, in addition to how this system will integrate with the existing infrastructure and how that infrastructure needs to be modified. An early version of this work was presented at an MLIR Open Design Meeting, see slides here: https://docs.google.com/presentation/d/1e8MlXOBgO04kdoBoKTErvaPLY74vUaVoEMINm8NYDds/edit?usp=sharing

Case Study: LLVM Optimizations for AI Applications Using RISC-V V Extension Chia-Hsuan Chang (National Tsing Hua University, Taiwan), Pi-You Chen (National Tsing Hua University, Taiwan), Chao-Lin Lee (National Tsing Hua University, Taiwan), Jenq-Kuen Lee (National Tsing Hua University, Taiwan)

RISC-V is an open ISA with small and flexible features. Hardware vendors for RISC-V can select extensions according to the requirements of their specific application. Among the extensions, the vector extension enables superword SIMD in RISC-V architectures to support the fallback engine of AI computing. As the specification is still new, support is still needed on the LLVM compiler side. In our paper, we describe techniques to efficiently support RISC-V with the V extension in LLVM via both vector intrinsic functions and basic LLVM vector builders. Note that the RISC-V vector extension allows one to dynamically set the size of each element in the vector and also the number of vector elements. This was designed into the specification to allow the flexibility to deploy different widths for low-power numerics in different layers of deep learning models. However, it creates challenges on the implementation side. On the optimization side, we add an extra LLVM compiler phase for redundancy elimination of vsetvl instructions. With the flexibility of the dynamic vector size for each layer, extra vsetvl instructions are generated during vector code generation. Our redundancy elimination phase removes the unnecessary vsetvl instructions. In addition, an efficient vector initialization is devised. We perform AI model experiments with a TVM compiler flow targeting our LLVM compiler with the RISC-V V extension and achieve an average 4.24x reduction in instructions executed at runtime compared with the baseline without SIMD support.
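To make the idea of vsetvl redundancy elimination concrete, here is a deliberately simplified Python sketch over straight-line code. It is not the authors' pass: the tuple encoding of instructions, the function name `eliminate_redundant_vsetvl`, and the restriction to a single basic block are all assumptions for illustration. A real pass would also need to handle control flow and the register operands of the actual instruction.

```python
def eliminate_redundant_vsetvl(instrs):
    """Drop a vsetvl whose configuration equals the one already in effect.

    Toy model (assumption): each instruction is a tuple, and a vsetvl is
    encoded as ("vsetvl", element_width, vector_length).
    """
    current = None  # configuration known to be active; None if unknown
    kept = []
    for ins in instrs:
        if ins[0] == "vsetvl":
            config = ins[1:]
            if config == current:
                continue  # same configuration is already active: skip it
            current = config
        kept.append(ins)
    return kept


program = [
    ("vsetvl", "e32", 8),
    ("vadd.vv",),
    ("vsetvl", "e32", 8),   # redundant: configuration unchanged
    ("vmul.vv",),
    ("vsetvl", "e16", 16),  # needed: configuration changes
    ("vadd.vv",),
]
optimized = eliminate_redundant_vsetvl(program)
print(optimized)
```

In this toy program only the second vsetvl is removed, since it repeats the configuration that is already active.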

OpenMP codegen in Flang using MLIR Kiran Chandramohan (Arm Ltd)

Flang is the Fortran frontend of LLVM, currently under construction. This presentation (and/or poster) provides a brief summary of the design of LLVM IR generation for OpenMP constructs in Flang. Two major components are used for this project. (i) MLIR: A dialect is created for OpenMP. The dialect is designed to be generic (so that other frontends can use it), interoperable with other dialects and also capable of optimisations. (ii) OpenMP IRBuilder: The OpenMP IRBuilder project refactors codegen for OpenMP directives from Clang and places it in the LLVM directory. This way both Clang and Flang can share the LLVM IR generation code for OpenMP. The overall flow will be as follows. The Flang parser will parse the Fortran source into a parse tree. The parse tree is then lowered to a mix of FIR and OpenMP dialects. These are then optimised and finally converted to a mix of OpenMP and LLVM MLIR dialects. The mix is translated to LLVM IR using the existing translation library for LLVM MLIR and the OpenMP IRBuilder. The presentation will include the details of the OpenMP dialect, some examples, how it interacts with other dialects and how it is translated to LLVM IR. Also, see the RFC for the OpenMP dialect in the MLIR group: https://groups.google.com/a/tensorflow.org/d/msg/mlir/SCerbBpoxng/bVqWTRY7BAAJ

Some Improvements to the Branch Probability Information (BPI) Akash Banerjee (IIT Hyderabad), Venkata Keerthy S (IIT Hyderabad), Rohit Aggarwal (IIT Hyderabad), Ramakrishna Upadrasta (IIT Hyderabad)

The BranchProbabilityInfo (BPI) pass is LLVM’s heuristic-based profiler. A study on this analysis pass indicates that the heuristics implemented in it were fast, but not adequate. We propose to improve the current heuristics to make them more robust and give better predictions. This has the potential to be useful in the absence of actual profile information (for example, from PGO). We suggest some possible improvements to the existing heuristics in the current implementation and experimentally observe that such improvements have a positive impact on the runtime when used by the standard O3 sequence, and we obtained an average speed-up of 1.07.

Is Post Dominator tree spoiling your party? Reshabh Kumar Sharma (AMD Inc)

A difference in perspective between implementation and use can sometimes result in behaviors that are not expected. These are not necessarily bugs. We present one such case with a concrete example: the post dominator tree construction algorithm in LLVM. The post dominator tree is a very important abstraction of a property of the CFG (post dominance), with wide applications in various analysis and transform passes in LLVM. We take two nearly identical CFGs as the basis of the analysis. We show that these test cases lead the post dominator tree construction algorithm to generate two different yet valid post dominator trees. We then analyze the ripple effect on other passes that depend on it and present a few cases that demonstrate this effect. The main aim is to demonstrate that such behaviors can have a larger effect than expected and can be harder to debug than implementation bugs. Such behaviors, if found, can be very difficult to correct, as the correction can sometimes bring a big performance regression.
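For readers unfamiliar with post dominance, the following Python sketch computes post dominator sets with the textbook iterative set-intersection method. It is not the construction algorithm LLVM uses, and it does not reproduce the test cases from the poster. The function name `post_dominators`, the dictionary encoding of the CFG, and the assumption of a single exit node reachable from every node are all illustrative.

```python
def post_dominators(succs, exit_node):
    """Return, for each node, the set of nodes that post-dominate it.

    Assumptions: `succs` maps every node to its list of successors, there
    is exactly one exit node, and every other node has a successor.
    """
    nodes = set(succs)
    # Start from the most optimistic guess and shrink to a fixed point.
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # A node is post-dominated by itself and by whatever
            # post-dominates all of its successors.
            common = set.intersection(*(pdom[s] for s in succs[n]))
            new = {n} | common
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


# Diamond-shaped CFG: A branches to B and C, which both join at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
result = post_dominators(cfg, "D")
print(sorted(result["A"]))
```

In the diamond, neither B nor C post-dominates A because each can be bypassed, while D lies on every path from A to the exit.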

DragonFFI: using Clang/LLVM for seamless C interoperability, and much more! Adrien Guinet (Quarkslab)

DragonFFI [1] is a Clang/LLVM-based library that provides a way to easily call C functions and manipulate C structures from any language. Its purpose is to parse C library headers without any modifications and transparently use them in a foreign language, like Python or Ruby. The first release was published in February 2018. A blog post presenting the project was published on the LLVM blog in March 2018 [2], and the project was presented at FOSDEM 2018 [3]. Since then, it has been improved to fulfill various users' needs, and stabilized so that it is close to production-ready. That is why a stable DragonFFI 1.0 version is planned for March 2020, and will include:

  • stable C++ and Python API/ABI
  • generating Python portable structures from a C header file (for a given ABI). This is something the security community asks for, to make (for instance) exploit research easier.
  • tutorials for first-time users and proper API documentation

This talk will showcase this version and be structured in this way:

  • why DragonFFI, and what are the pros and cons against existing solutions (e.g. libffi, cffi, cppyy)
  • how DragonFFI uses Clang and LLVM internally
  • what could be improved in Clang and/or LLVM to make our life easier
  • the life of a cross-platform DragonFFI release, and its pitfalls
  • demos!
  • future directions

[1] https://github.com/aguinet/dragonffi/

[2] https://blog.llvm.org/2018/03/dragonffi-ffijit-for-c-language-using.html

[3] https://archive.fosdem.org/2018/schedule/event/dragonffi/