The LLVM Compiler Infrastructure
Site Map:
Download now: LLVM 3.9.1
All Releases
APT Packages
Win Installer

View the open-source

Search this Site

Useful Links
Release Emails
3.9.0: Sep 2016
3.8.1: July 2016
3.8.0: Mar 2016
3.7.1: Jan 2016
3.7.0: Sep 2015
3.6.2: Jul 2015
3.6.1: May 2015
3.5.2: Apr 2015
3.6: Feb 2015
3.5.1: Jan 2015
3.5: Sep 2014
3.4.2: June 2014
3.4.1: May 2014
3.4: Jan 2014
3.3: Jun 2013
3.2: Dec 2012
3.1: May 2012
3.0: Dec 2011
2.9: Apr 2011
2.8: Oct 2010
2.7: Apr 2010
2.6: Oct 2009
2.5: Mar 2009
2.4: Nov 2008
2.3: Jun 2008
2.2: Feb 2008
2.1: Sep 2007
2.0: May 2007
Older News

Maintained by:
Chris Lattner
Open LLVM Projects

Written by the LLVM Team

Welcome prospective Google Summer of Code 2017 Students! This document is your starting point to finding interesting and important projects for LLVM, Clang, and other related sub-projects. This list of projects is not just developed for Google Summer of Code, but open projects that really need developers to work on and are very beneficial for the LLVM community.

We encourage you to look through this list and see which projects excite you and match well with your skill set. We also invite proposals not on this list. However, you must propose your idea to the LLVM community through our developers' mailing list ( or specific subproject mailing list). Feedback from the community is a requirement for your proposal to be considered and hopefully accepted.

The LLVM project has participated in Google Summer of Code for several years and has had some very successful projects. We hope that this year is no different and look forward to hearing your proposals. For information on how to submit a proposal, please visit the Google Summer of Code main website.

Description of the project: When instantiating a template, the template arguments are canonicalized before being substituted into the template pattern. Clang does not preserve type sugar when subsequently accessing members of the instantiation.

    std::vector<std::string> vs;
    int n = vs.front(); // bad diagnostic: [...] aka 'std::basic_string<char>' [...]

    template<typename T> struct Id { typedef T type; };
    Id<size_t>::type // just 'unsigned long', 'size_t' sugar has been lost
Clang should "re-sugar" the type when performing member access on a class template specialization, based on the type sugar of the accessed specialization. The type of vs.front() should be std::string, not std::basic_string<char, [...]>.

Suggested design approach: add a new type node to represent template argument sugar, and implicitly create an instance of this node whenever a member of a class template specialization is accessed. When performing a single-step desugar of this node, lazily create the desugared representation by propagating the sugared template arguments onto inner type nodes (and in particular, replacing Subst*Parm nodes with the corresponding sugar). When printing the type for diagnostic purposes, use the annotated type sugar to print the type as originally written.

For good results, template argument deduction will also need to be able to deduce type sugar (and reconcile cases where the same type is deduced twice with different sugar).

Expected results: Diagnostics preserve type sugar even when accessing members of a template specialization. T<unsigned long> and T<size_t> are still the same type and the same template instantiation, but T<unsigned long>::type single-step desugars to 'unsigned long' and T<size_t>::type single-step desugars to 'size_t'.

Confirmed Mentor:Vassil Vassilev, Richard Smith

Desirable skills: Good knowledge of clang API, clang’s AST, intermediate knowledge of C++.

Description of the project: Implement a path-sensitive checker that warns if virtual calls are made from constructors and destructors, which is not valid in case of pure virtual calls and could be a sign of user error in non-pure calls.
The current virtual calls checker, implemented in VirtualCallChecker.cpp, needs to be re-implemented in a path-sensitive way. The lack of path-sensitive reasoning may result in false positives in the inter-procedural mode, which is disabled now for that reason. The false positives could happen when a called function uses a member variable flag to track whether initialization is complete and relies on the flag to ensure that the virtual member function is not called during initialization. Further, the path diagnostic should be used to highlight both the virtual call and the path from the constructor. Last, we will need to evaluate if the warning should be issued for both calls to pure virtual functions (which is always an error) and non-pure virtual functions (which is more of a code smell and may be a false positive).

Confirmed Mentor:Anna Zaks

Description of the project: Enhance the clang static analyzer by adding models of C++11 and C11 atomic operations, such as std::atomic_compare_exchange_*. Currently, these operations are being treated opaquely, which results in loss of precision when analyzing the code that uses these instructions. To address the problem, one would need to programmatically construct AST that simulates these APIs to the BodyFarm of the analyzer. BodyFarm is the API used for modeling system APIs. Finally, the work would also include writing tests for the various APIs and checking that the analyzer correctly models atomics.

Confirmed Mentor:Anna Zaks

Description of the project: The static analyzer and the compiler that operate on top of clang AST could use the results of a fast pointer analysis such as "Points-to analysis in almost linear time" by Bjarne Steensgaard.

Confirmed Mentor:Devin Coughlin

Description of the project: Many of the projects in compiler-rt are only supported on Linux.
Here are some examples: CFI, DFSan, XSan, LSan, XRay. Porting any of them to other platforms, for example, Mac OS, would be great!

Confirmed Mentor:Kuba Mracek, Anna Zaks

Description of the project: lldb-mi implements a machine-readable interface that is supported by many IDEs. The current support is very incomplete and isn't using the right abstraction layer. Instead of scraping LLDB's textual output, it should be using the data structures provided by the public API.

Confirmed Mentor:Adrian Prantl, Greg Clayton

Desirable skills: Intermediate knowledge of C++.

This document is meant to be a sort of "big TODO list" for LLVM. Each project in this document is something that would be useful for LLVM to have, and would also be a great way to get familiar with the system. Some of these projects are small and self-contained, which may be implemented in a couple of days, others are larger. Several of these projects may lead to interesting research projects in their own right. In any case, we welcome all contributions.

If you are thinking about tackling one of these projects, please send a mail to the LLVM Developer's mailing list, so that we know the project is being worked on. Additionally this is a good way to get more information about a specific project or to suggest other projects to add to this page.

The projects in this page are open-ended. More specific projects are filed as unassigned enhancements in the LLVM bug tracker. See the list of currently outstanding issues if you wish to help improve LLVM.

In addition to hacking on the main LLVM project, LLVM has several subprojects, including Clang and others. If you are interested in working on these, please see their "Open projects" page:

Improvements to the current infrastructure are always very welcome and tend to be fairly straight-forward to implement. Here are some of the key areas that can use improvement...

Currently, both Clang and LLVM have a separate target description infrastructure, with some features duplicated, others "shared" (in the sense that Clang has to create a full LLVM target description to query specific information).

This separation has grown in parallel, since in the beginning they were quite different and served disparate purposes. But as the compiler evolved, more and more features had to be shared between the two so that the compiler would behave properly. An example is when targets have default features on speficic configurations that don't have flags for. If the back-end has a different "default" behaviour than the front-end and the latter has no way of enforcing behaviour, it simply won't work.

Of course, an alternative would be to create flags for all little quirks, but first, Clang is not the only front-end or tool that uses LLVM's middle/back ends, and second, that's what "default behaviour" is there for, so we'd be missing the point.

Several ideas have been floating around to fix the Clang driver WRT recognizing architectures, features and so on (table-gen it, user-specific configuration files, etc) but none of them touch the critical issue: sharing that information with the back-end.

Recently, the idea to factor out the target description infrastructure from both Clang and LLVM into its own library that both use, has been floating around. This would make sure that all defaults, flags and behaviour are shared, but would also reduce the complexity (and thus the cost of maintenance) a lot. That would also allow all tools (lli, llc, lld, lldb, etc) to have the same behaviour across the board.

The main challenges are:

  • To make sure the transition doesn't destroy the delicate balance on any target, as some defaults are implicit and, some times, unknown.
  • To be able to migrate one target at a time, one tool at a time and still keep the old infrastructure intact.
  • To make it easy for detecting target's features for both front-end and back-end features, and to merge both into a coherent set of properties.
  • To provide a bridge to the new system for tools that haven't migrated, especially the off-the-tree ones, that will need some time (one release, at least) to migrate..

The LLVM bug tracker occasionally has "code-cleanup" bugs filed in it. Taking one of these and fixing it is a good way to get your feet wet in the LLVM code and discover how some of its components work. Some of these include some major IR redesign work, which is high-impact because it can simplify a lot of things in the optimizer.

Some specific ones that would be great to have:

Additionally, there are performance improvements in LLVM that need to get fixed. These are marked with the slow-compile keyword. Use this Bugzilla query to find them.

The llvm-test testsuite is a large collection of programs we use for nightly testing of generated code performance, compile times, correctness, etc. Having a large testsuite gives us a lot of coverage of programs and enables us to spot and improve any problem areas in the compiler.

One extremely useful task, which does not require in-depth knowledge of compilers, would be to extend our testsuite to include new programs and benchmarks. In particular, we are interested in cpu-intensive programs that have few library dependencies, produce some output that can be used for correctness testing, and that are redistributable in source form. Many different programs are suitable, for example, see this list for some potential candidates.

We are always looking for new testcases and benchmarks for use with LLVM. In particular, it is useful to try compiling your favorite C source code with LLVM. If it doesn't compile, try to figure out why or report it to the llvm-bugs list. If you get the program to compile, it would be extremely useful to convert the build system to be compatible with the LLVM Programs testsuite so that we can check it into SVN and the automated tester can use it to track progress of the compiler.

When testing a code, try running it with a variety of optimizations, and with all the back-ends: CBE, llc, and lli.

Find benchmarks either using our test results or on your own, where LLVM code generators do not produce optimal code or simply where another compiler produces better code. Try to minimize the test case that demonstrates the issue. Then, either submit a bug with your testcase and the code that LLVM produces vs. the code that it should produce, or even better, see if you can improve the code generator and submit a patch. The basic idea is that it's generally quite easy for us to fix performance problems if we know about them, but we generally don't have the resources to go finding out why performance is bad.

The LNT perf database has some nice features like detect moving average, standard deviations, variations, etc. But the report page give too much emphasis on the individual variation (where noise can be higher than signal), eg. this case.

The first part of the project would be to create an analysis tool that would track moving averages and report:

  • If the current result is higher/lower than the previous moving average by more than (configurable) S standard deviations
  • If the current moving average is more than S standard deviations of the Base run
  • If the last A moving averages are in constant increase/decrease of more than P percent

The second part would be to create a web page which would show all related benchmarks (possibly configurable, like a dashboard) and show the basic statistics with red/yellow/green colour codes to show status and links to more detailed analysis of each benchmark.

A possible third part would be to be able to automatically cross reference different builds, so that if you group them by architecture/compiler/number of CPUs, this automated tool would understand that the changes are more common to one particular group.

The LLVM Coverage Report has a nice interface to show what source lines are covered by the tests, but it doesn't mentions which tests, which revision and what architecture is covered.

A project to renovate LCOV would involve:

  • Making it run on a buildbot, so that we know what commits / architectures are covered
  • Update the web page to show that information
  • Develop a system that would report every buildbot build into the web page in a searchable database, like LNT

Another idea is to enable the test suite to run all built backends, not just the host architecture, so that coverage report can be built in a fast machine and have one report per commit without needing to update the buildbots.

  1. Completely rewrite bugpoint. In addition to being a mess, bugpoint suffers from a number of problems where it will "lose" a bug when reducing. It should be rewritten from scratch to solve these and other problems.
  2. Add support for transactions to the PassManager for improved bugpoint.
  3. Improve bugpoint to support running tests in parallel on MP machines.
  4. Add MC assembler/disassembler and JIT support to the SPARC port.
  5. Move more optimizations out of the -instcombine pass and into InstructionSimplify. The optimizations that should be moved are those that do not create new instructions, for example turning sub i32 %x, 0 into %x. Many passes use InstructionSimplify to clean up code as they go, so making it smarter can result in improvements all over the place.

Sometimes creating new things is more fun than improving existing things. These projects tend to be more involved and perhaps require more work, but can also be very rewarding.

Many proposed extensions and improvements to LLVM core are awaiting design and implementation.

  1. Improvements for Debug Information Generation
  2. EH support for non-call exceptions
  3. Many ideas for feature requests are stored in LLVM bugzilla. Just search for bugs with a "new-feature" keyword.

We have a strong base for development of both pointer analysis based optimizations as well as pointer analyses themselves. It seems natural to want to take advantage of this:

  1. The globals mod/ref pass basically does really simple and cheap bottom-up context sensitive alias analysis. It being simple and cheap are really important, but there are simple things that we could do to better capture the effects of functions that access pointer arguments. This can be really important for C++ methods, which spend lots of time accessing pointers off 'this'.
  2. The alias analysis API supports the getModRefBehavior method, which allows the implementation to give details analysis of the functions. For example, we could implement full knowledge of printf/scanf side effects, which would be useful. This feature is in place but not being used for anything right now.
  3. We need some way to reason about errno. Consider a loop like this:
        for ()
          x += sqrt(loopinvariant);

    We'd like to transform this into:

        t = sqrt(loopinvariant);
        for ()
          x += t;

    This transformation is safe, because the value of errno isn't otherwise changed in the loop and the exit value of errno from the loop is the same. We currently can't do this, because sqrt clobbers errno, so it isn't "readonly" or "readnone" and we don't have a good way to model this.

    The hard part of this project is figuring out how to describe errno in the optimizer: each libc #defines errno to something different it seems. Maybe the solution is to have a __builtin_errno_addr() or something and change sys headers to use it.

  4. There are lots of ways to optimize out and improve handling of memcpy/memset.

We now have a unified infrastructure for writing profile-guided transformations, which will work either at offline-compile-time or in the JIT, but we don't have many transformations. We would welcome new profile-guided transformations as well as improvements to the current profiling system.

Ideas for profile-guided transformations:

  1. Superblock formation (with many optimizations)
  2. Loop unrolling/peeling
  3. Profile directed inlining
  4. Code layout
  5. ...

Improvements to the existing support:

  1. The current block and edge profiling code that gets inserted is very simple and inefficient. Through the use of control-dependence information, many fewer counters could be inserted into the code. Also, if the execution count of a loop is known to be a compile-time or runtime constant, all of the counters in the loop could be avoided.
  2. You could implement one of the "static profiling" algorithms which analyze a piece of code an make educated guesses about the relative execution frequencies of various parts of the code.
  3. You could add path profiling support, or adapt the existing LLVM path profiling code to work with the generic profiling interfaces.

LLVM aggressively optimizes for performance, but does not yet optimize for code size. With a new ARM backend, there is increasing interest in using LLVM for embedded systems where code size is more of an issue.

Someone interested in working on implementing code compaction in LLVM might want to read this article, describing using link-time optimizations for code size optimization.

  1. Implement a Loop Dependence Analysis Infrastructure
    - Design some way to represent and query dep analysis
  2. Value range propagation pass
  3. More fun with loops: Predictive Commoning
  4. Type inference (aka. devirtualization)
  5. Value assertions (also PR810).
  1. Generalize target-specific backend passes that could be target-independent, by adding necessary target hooks and making sure all IR/MI features (such as register masks and predicated instructions) are properly handled. Enable these for other targets where doing so is demonstrably beneficial. For example:
    1. lib/Target/Hexagon/RDF*
    2. lib/Target/AArch64/AArch64AddressTypePromotion.cpp
  2. Merge the delay slot filling logic that is duplicated into (at least) the Sparc and Mips backends into a single target independent pass. Likewise, the branch shortening logic in several targets should be merged together into one pass.
  3. Implement 'stack slot coloring' to allocate two frame indexes to the same stack offset if their live ranges don't overlap. This can reuse a bunch of analysis machinery from LiveIntervals. Making the stack smaller is good for cache use and very important on targets where loads have limited displacement like ppc, thumb, mips, sparc, etc. This should be done as a pass before prolog epilog insertion. This is now done for register allocator temporaries, but not for allocas.
  4. Implement 'shrink wrapping', which is the intelligent placement of callee saved register save/restores. Right now PrologEpilogInsertion always saves every (modified) callee save reg in the prolog and restores it in the epilog. However, some paths through a function (e.g. an early exit) may not use all regs. Sinking the save down the CFG avoids useless work on these paths. Work has started on this, please inquire on llvm-dev.
  5. Implement interprocedural register allocation. The CallGraphSCCPass can be used to implement a bottom-up analysis that will determine the *actual* registers clobbered by a function. Use the pass to fine tune register usage in callers based on *actual* registers used by the callee.
  6. Add support for 16-bit x86 assembly and real mode to the assembler and disassembler, for use by BIOS code. This includes both 16-bit instruction encodings as well as privileged instructions (lgdt, lldt, ltr, lmsw, clts, invd, invlpg, wbinvd, hlt, rdmsr, wrmsr, rdpmc, rdtsc) and the control and debug registers.
  1. Port the Bigloo Scheme compiler, from Manuel Serrano at INRIA Sophia-Antipolis, to output LLVM bytecode. It seems that it can already output .NET bytecode, JVM bytecode, and C, so LLVM would ostensibly be another good candidate.
  2. Write a new frontend for some other language (Java? OCaml? Forth?)
  3. Random test vector generator: Use a C grammar to generate random C code, e.g., quest; run it through llvm-gcc, then run a random set of passes on it using opt. Try to crash opt. When opt crashes, use bugpoint to reduce the test case and post it to a website or mailing list. Repeat ad infinitum.
  4. Add sandbox features to the Interpreter: catch invalid memory accesses, potentially unsafe operations (access via arbitrary memory pointer) etc.
  5. Port Valgrind to use LLVM code generation and optimization passes instead of its own.
  6. Write LLVM IR level debugger (extend Interpreter?)
  7. Write an LLVM Superoptimizer. It would be interesting to take ideas from this superoptimizer for x86: paper #1 and paper #2 and adapt them to run on LLVM code.

    It would seem that operating on LLVM code would save a lot of time because its semantics are much simpler than x86. The cost of operating on LLVM is that target-specific tricks would be missed.

    The outcome would be a new LLVM pass that subsumes at least the instruction combiner, and probably a few other passes as well. Benefits would include not missing cases missed by the current combiner and also more easily adapting to changes in the LLVM IR.

    All previous superoptimizers have worked on linear sequences of code. It would seem much better to operate on small subgraphs of the program dependency graph.

In addition to projects that enhance the existing LLVM infrastructure, there are projects that improve software that uses, but is not included with, the LLVM compiler infrastructure. These projects include open-source software projects and research projects that use LLVM. Like projects that enhance the core LLVM infrastructure, these projects are often challenging and rewarding.

Several projects reuse LLVM's code generator infrastructure to analyze and optimize the machine code that the compiler generates. However, LLVM's code generator infrastructure does not provide a pass which allows inter-procedural analysis and optimization of the native code that the LLVM compiler generates. What is needed is an equivalent of the LLVM IR ModulePass (i.e., a MachineModulePass) that permits inter-procedural analysis and optimiztion of machine instructions (MachineIntr's).

At least one project (and probably more) needs to use analysis information (such as call graph analysis) from within a MachineFunctionPass. However, most analysis passes operate at the LLVM IR level. In some cases, a value (e.g., a function pointer) cannot be mapped from the MachineInstr level back to the LLVM IR level reliably, making the use of existing LLVM analysis passes from within a MachineFunctionPass impossible (or at least brittle).

This project is to encode analysis information from the LLVM IR level into the MachineInstr IR when it is generated so that it is available to a MachineFunctionPass. The exemplar is call graph analysis (useful for control-flow integrity instrumentation, analysis of code reuse defenses, and gadget compilers); however, other LLVM analyses may be useful.

Implement an on-demand function relocator in the LLVM JIT. This can help improve code locality using runtime profiling information. The idea is to use a relocation table for every function. The relocation entries need to be updated upon every function relocation (take a look at this article). A (per-function) basic block reordering would be a useful extension.

The goal of this project is to implement better data layout optimizations using the model of reference affinity. This paper provides some background information.

Slimmer is a prototype tool, built using LLVM, that uses dynamic analysis to find potential performance bugs in programs. Development on Slimmer started during Google Summer of Code in 2015 and resulted in an initial prototype, but evaluation of the prototype and improvements to make it portable and robust are still needed. This project would have a student pick up and finish the Slimmer work. The source code of Slimmer and its current documentation can be found at its Github web page.

Valid CSS! Valid HTML 4.01! LLVM Compiler Infrastructure