// cSpell:ignoreRegExp @[a-zA-Z0-9_-]+
// cSpell:ignoreRegExp #\w+\(
// cSpell:ignore cetz booktbl bibliographyx citegroup instcombine
// cSpell:ignore Reissmann Soffa Zaidi Aigner Barany Mössenböck Duboscq Steinlechner Ginsbach Kildall Rastilav Joisha Rajiv Schreiber Banerjee Boehm Chakrabarti Löwe Edvinsson Peyton-Jones Bahmann Själander
// cSpell:disable
#import "clean-acmart.typ": acmart
#import "@preview/cetz:0.3.4"
#import "@preview/lilaq:0.3.0" as lq
#import "@preview/cetz-plot:0.1.1": chart as cetz_chart
#import "@preview/tblr:0.3.1": tblr, rows as tblr_rows, hline as tblr_hline
#import "@preview/codly:1.3.0": codly-init, codly
#import "@preview/alexandria:0.2.0": *
#show: codly-init.with()
#codly(zebra-fill: none, display-icon: false, display-name: false, stroke: none, radius: 0mm, inset: 0.2em)
#let booktbl = tblr.with(
  stroke: none,
  column-gutter: 0.6em,
  // booktabs style rules
  tblr_rows(within: "header", auto, inset: (y: 0.5em)),
  tblr_rows(within: "header", auto, align: center),
  tblr_hline(within: "header", y: 0, stroke: 0.08em),
  tblr_hline(within: "header", y: end, position: bottom, stroke: 0.05em),
  tblr_rows(within: "body", 0, inset: (top: 0.5em)),
  tblr_hline(y: end, position: bottom, stroke: 0.08em),
  tblr_rows(end, inset: (bottom: 0.5em)),
)
#let title = [Dataflow Analysis for Compiler Optimization]
#let authors = (
  (
    name: "Matthias Veigel",
    email: "matthias.veigel@uni-ulm.de",
    department: [Institute of Software Engineering and Programming Languages],
    institute: [University Ulm]
  ),
)
#show: acmart.with(
  title: title,
  authors: authors,
  copyright: none
  // Set review to submission ID for the review process or to "none" for the final version.
  // review: [\#001],
)
#set heading(supplement: "Sect.")
#set figure(supplement: [Fig.])
#show figure.caption: it => {
  set text(size: 8pt)
  v(1.5mm); strong(it.supplement + " " + context { it.counter.display(it.numbering)}); h(0.3em); it.body
}
#show figure.where(kind: "raw"): set figure(supplement: [Listing])
#show figure.where(kind: "raw"): it => align(left, { v(8pt, weak: true); it.body; v(4pt, weak: true); it.caption; v(8pt, weak: true) })
#show figure: it => { v(1.25em, weak: true); it; v(1.25em, weak: true) }
#show ref: it => {
  let el = it.element
  if el != none {
    if el.func() == figure and el.kind == "slr" {
      return link(el.location(), text(el.body, weight: "bold"))
    }
    if it.supplement == auto {
      let counter = if el.func() == figure { el.counter } else { counter(heading) }
      let numb = numbering(el.numbering, ..counter.at(el.location()))
      return link(el.location(), box(el.supplement + " " + numb))
    }
  }
  return it
}
#show heading.where(level: 1): it => { v(4mm, weak: true); it }
#show heading.where(level: 2): it => { v(2mm, weak: true); it }
#show heading.where(level: 3): it => { v(1.5mm, weak: true); it }
#show: alexandria(prefix: "x-", read: p => read(p))
#show: alexandria(prefix: "y-", read: p => read(p))
#let cgx = it => citegroup(it, prefix: "x-")
#let cgy = it => citegroup(it, prefix: "y-")
#set heading(numbering: "1.1.1")
// cSpell:enable
// TODO Small info what is LLVM + footnote link
= Abstract
Dataflow analysis is an important part of compiler optimization, since it allows the compiler to eliminate or rewrite parts of the code with techniques such as constant propagation, dead code elimination, and branch elimination.
This work looks at the advantages and disadvantages of using dataflow analysis, how it is already used in current compilers, on which programming languages or intermediate representations it operates, and what limitations still exist. \
For this purpose we conducted a systematic literature review in which we analyze 15 publications selected from 571 entries. Finally, the following conclusions were drawn: dataflow analysis is used in many of today's popular compilers and the field is actively being researched. Dataflow analysis enables large performance gains, but its implementations are complex, and care must be taken that an optimization does not change the behavior of the program in an unwanted way.

= Introduction
Program performance remains a large concern in modern computing and programming, since it has a direct impact on the user and developer experience. As software becomes more complex, manual optimization becomes increasingly difficult for developers to implement. Another consequence of this increasing complexity is that large codebases are spread out over more files, which also makes it harder for developers to keep an overview and to implement optimizations. For these reasons, automatic optimization in compilers is needed. \
Dataflow analysis is a technique used to gather information about the state of variables throughout the flow of the program. It plays an important role in many compilers: by analyzing how, where, and which variables are assigned and how these variables are used, many complex optimizations that require context from the surrounding code can be implemented. \
Dataflow analysis is a well-established field in which new techniques are regularly created and older techniques improved. Different compilers and analysis frameworks implement different methods and optimizations with dataflow analysis. This work aims to summarize the current state and past achievements of this technology. \
While this paper discusses dataflow analysis in the context of compiler optimization, these techniques can also be used to create compilation warnings and errors that are either more detailed or were previously not possible. For example, dataflow analysis can detect writes to an invalid memory location or the usage of an uninitialized object or variable at compile time, which leads to a better coding experience and fewer crashes at runtime. Examples for this are the Clang Static Analyzer #footnote[https://clang.llvm.org/docs/ClangStaticAnalyzer.html] and the static analysis options of GCC #footnote[https://gcc.gnu.org/onlinedocs/gcc/Static-Analyzer-Options.html]. \
This work is divided into the following sections: in @background_c the background required to understand this work is given, in @methodology_c the methodology used to create this work is described, in @findings_c the contents of the papers are analyzed and evaluated, and in @conclusion_c the results of this work are summarized.

= Background <background_c>
== LLVM #footnote[https://llvm.org/]
LLVM is a compiler framework mainly consisting of the LLVM Core, which serves as a backend for optimizing LLVM IR and compiling it to machine-specific code, and Clang, a frontend for compiling C, C++ and Objective-C to LLVM IR. There are further components, such as a debugger and a C and C++ standard library, and many external projects exist for compiling other languages and targeting other machines. The biggest advantage of LLVM is its flexibility and easy extensibility, which makes it a useful framework for most kinds of compiler research.
== Control flow graph
#figure( // ssa_form_example
  caption: [C code and respective SSA in control flow graph form, adapted from Fig. 1 in the work of Reissmann, Meyer, Bahmann and Själander @y-reissmann_rvsdg_2020],
  kind: "raw",
  grid(
    columns: (1fr, 1.25fr),
    ```C
    int x = 2 * 2 + 4;
    x = x - 2;
    if (x < 4)
        x = 10;
    else
        x = 12;
    int y = x * 2;
    ```,
    image("ssa-example.svg", height: 16em)
  )
) <ssa_form_example>
A control flow graph is a directed graph consisting of blocks of code as nodes and edges between these blocks for the program execution flow. The right part of @ssa_form_example shows a small example of a control flow graph. The nodes always contain sequential code with a statement at the end that changes the control flow. In the example this is either the `branch`, which branches the flow based on a condition, or the hidden `goto` at the end of the blocks for `x₃` and `x₄`, which just unconditionally jumps to another block of code. Edges after a branching statement also carry a label indicating under which condition that branch is taken.
== Static Single Assignment form (SSA / SSA form)
Many modern compilers and analysis tools operate on a Static Single-Assignment (SSA) form @x-cooper_keith_d_engineering_2011 @x-cytron_efficiently_1991. In SSA form, each variable is assigned exactly once. This is done by creating multiple sub-variables $x_1, x_2, ...$ for each variable $x$. After a branch in the program, a #{sym.Phi}-node is used to select the new value of the variable based on the branch executed. An example of the SSA form can be seen in @ssa_form_example. On the left is simple C code in a function body and on the right is the respective SSA form of the C code. The intermediate representation of LLVM is closely modeled after the SSA form.
== Dataflow analysis (DFA)
A compiler can perform dataflow analysis either on the original source code or on an intermediate representation. When performing the analysis on the source code, the original structure and flow constructs of the program are available. Performing the analysis on an intermediate representation, on the other hand, has the advantage of being usable for many different languages, although in the translation step from source code to intermediate representation a lot of information about control flow and similar properties may already have been lost. LLVM, for example, already provides many generic optimization steps for its own intermediate representation, which allows language developers to focus on designing their language and a compiler from their language to the LLVM IR with language-specific optimizations, instead of having to implement a full compiler and all optimizations themselves. A big problem of DFA is its long runtime, and because it is a sequential algorithm it is complicated to parallelize. This makes it harder to use DFA in a Just-In-Time (JIT) compiler, which only has a limited timeframe for compilation.
=== Forward and backward analysis
Dataflow analysis can be performed in two directions: forward and backward. Forward analysis walks from the entry of the program to the exit. This allows determining which conditions hold before a statement is executed and which statements can be reached by the program. Backward analysis goes from the exit of the program to the entry, thereby enabling the calculation of which variables are still required at a specific point in the program.
=== Must and may analysis
#figure( // must_may_example
  caption: [Must/may analysis],
  kind: "raw",
  ```C
  int x, y, z;
  if (...) { x = y = 2; z = 1; }
  else     { x = z = 2; }
  printf("%d %d %d", x, y, z);
  ```
) <must_may_example>
The facts which the algorithm knows about a variable either must be true or may be true. When a fact must be true, every path leading to the current point must ensure that the fact is true. The facts in @must_may_example on line 4 are: `x` and `z` must be initialized, since that is done in both branches of the if, while `y` only may be initialized. `x` also must be `2`, since it gets assigned `2` in both branches; `z` may be `2` or may be `1`. The must constraints are mostly used for optimization, while the may constraints are mostly used for showing warnings.
=== Points-to analysis @x-rayside_pointsto_2005
Points-to analysis handles DFA in the presence of pointers and references. Specifically, it shows whether one variable can point to another variable during the execution of the program. Points-to analysis comes in multiple levels of precision. \
One of the most important aspects for precision is context-sensitivity. Given a function `void* id(void* p)` which just returns the pointer `p` passed to it, a context-insensitive points-to analysis concludes that every pointer supplied as an argument could be returned as the result, while a context-sensitive analysis only yields the pointer supplied at the respective call site. As an example, with the code `void *a, *b; id(a); id(b);` a context-insensitive analysis yields `id(a), id(b) ∈ {a, b}`, because it cannot differentiate between the two calls, while a context-sensitive analysis yields `id(a) == a` and `id(b) == b`. \
There are many design choices, impacting performance and precision, that can be made when implementing points-to analysis. One is subset-based analysis, where each pointer has a set of variables to which it can point. When pointer `a` is assigned to pointer `b` (`b = a;`), the set of variables which `a` points to must be a subset of the set for `b` (`b ⊇ a`). These sets can later be merged for faster analysis, but this leads to information loss. \
A more precise variation is equivalence-based points-to analysis. It works by keeping a separate set for each pointer and copying these sets when pointers are assigned to other pointers. Because the analysis needs to keep a set for every pointer, it is much slower and requires more memory during analysis. \
An even more precise method, and the one most relevant to this paper, is flow-sensitive analysis. By analyzing the control flow it is possible to determine precisely to which variable a pointer points at a certain point in the code and to make optimizations based on that. The drawbacks are the bad performance of the analysis and the complicated implementation. \
Field-sensitivity treats every field of an object as a separate object instead of just storing the entire object which gets pointed at. This allows a more detailed analysis of which fields actually get accessed and modified, but it comes with a big performance overhead. A similar option is array-sensitivity, which models each entry of an array as a separate object instead of just using the whole array when something in it is referenced. \
While subset-based and equivalence-based analyses are enough for simple optimizations and simple compile-time checks, safety-critical applications and complex optimizations require context- and flow-sensitive algorithms. This choice should also be made based on the size of the analyzed codebase and on how long the compile time may be.
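
The difference between the precision levels can be made concrete with a small example. The following C sketch is our own illustration, not taken from any of the analyzed papers; the comments show the points-to sets that a flow-sensitive and a flow-insensitive analysis would compute.

```C
#include <stdio.h>

int main(void) {
    int a = 1, b = 2;
    int *p = &a;        // flow-sensitive:   p -> {a} at this point
    printf("%d\n", *p); // ... so *p is known to be a (and thus 1)
    p = &b;             // flow-sensitive:   p -> {b} from here on
                        // flow-insensitive: p -> {a, b} everywhere,
                        // so no use of *p can be resolved uniquely
    printf("%d\n", *p);
    return 0;
}
```

A flow-sensitive optimizer may replace both uses of `*p` with constants; the flow-insensitive result is still sound but too coarse for that rewrite.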
=== Constant folding and propagation @x-optimizing_compiler_wikipedia
An example based on @ssa_form_example would be the compiler calculating $x_1$ to be $8$. This is called constant folding and is done by replacing all calculations which can be evaluated at compile time with their result. Constant propagation then replaces the $x_1$ in the calculation of $x_2$ with its value. When constant folding is then applied again, $x_2$ becomes $6$.
=== Conditional branch elimination
Continuing from the last optimization, it is possible to conclude that the branch condition $x_2 < 4$ always evaluates to $0$. This results in the elimination of the $1$ branch, with $x_5$ always being $12$ and $y_1 = 24$. These two optimizations alone would allow replacing the whole code from @ssa_form_example with `int x = 12; int y = 24;`. In this simple example this seems obvious, but if $x$ is a function parameter there can still be call sites where this branch can be eliminated because of the function argument supplied there.
=== Common subexpression elimination @x-optimizing_compiler_wikipedia
Common subexpression elimination is done by finding cases where a calculation or a function call without side effects is performed multiple times with the same variables and values. An example would be the expression `d = (a * b) - c * (a * b)`, which can be rewritten to `tmp = a * b; d = tmp - c * tmp` as long as `a` and `b` remain the same between the two evaluations of `a * b`.
=== Copy propagation @x-copy_propagation_wikipedia
Copy propagation replaces uses of a variable that was directly assigned from another variable with that source variable. Example: `y = x; z = 3 + y;` would be replaced with `z = 3 + x;`, since `y` was directly assigned from `x`. An important thing to watch out for is that `x` does not change between the two statements.
=== Dead code elimination @x-optimizing_compiler_wikipedia
Dead code elimination removes code which would not be executed under any circumstance or code which modifies variables that will not be used again. Code which would not be executed is, for example, code placed after statements which transfer the control flow away, such as `break`, `continue`, `return`, or `goto`.
=== Dead branch elimination
Dead branch elimination is a simpler version of conditional branch elimination where the condition of the branch is a constant false, i.e. `if (0) {...}`. Such code can occur when parts of the code were commented out this way for debugging purposes.
= Methodology <methodology_c>
This work was created following the process described in @process_fig. The protocol for the review is divided into the following sections: for the objective of the research see @research_questions_s, for the search strategy see @sas_s, for the selection criteria see @selection_criteria_s, and for the data extraction strategy see @data_extraction_s.
#place( // process_fig
  bottom + center,
  scope: "parent",
  float: true,
  [
    #figure(
      caption: [Overview of the review process. Adapted from @x-federico_ciccozzi_execution_2019 and @x-gotz_claimed_2021.],
      image("review_process.png")
    ) <process_fig>
  ]
)
== Objective and research questions <research_questions_s>
The goal of this research paper is to find claims about the advantages and disadvantages of using dataflow analysis for compiler optimization and where DFA is already implemented in compilers. This goal has been refined into two research questions:
- RQ1 --- What are the advantages and disadvantages of using dataflow analysis for compiler optimization? \
  This question aims to identify which advantages DFA has over other optimization techniques and which disadvantages it has when used.
- RQ2 --- How is dataflow analysis used in current compilers? \
  This question aims to identify how DFA is already used in current compilers, what optimizations are done with it, and whether it is used during normal compilation or has to be explicitly enabled.
== Search and selection strategy <sas_s>
#[ // sas_fig
  #set text(size: 8pt)
  #figure(
    caption: [Search and selection process],
    cetz.canvas({
      import cetz.draw: *
      let bs = (2.5, 1.3)
      let bm = (0.6, 0.5)
      let bx(px, py, name, inner) = {
        content(((bs.at(0)+bm.at(0))*px, -(bs.at(1)+bm.at(1))*py), (rel: bs), name: name, box(
          align(center + horizon, par(justify: false, leading: 0.425em, inner)),
          stroke: 0.5pt, width: 100%, height: 100%, inset: 0.3em
        ))
      }
      set-style(stroke: (thickness: 0.5pt))
      bx(0, 0, "acm")[ACM \ Digital Library \ n = 3594]
      bx(0, 1, "ieee")[IEEE Xplore \ n = 1720]
      bx(0, 2, "springer")[Springer Link \ n = 786]
      bx(1, 0, "dup")[Duplicate removal \ and preliminary \ filtering \ n = 471]
      bx(2, 0, "sel")[Application of \ selection criteria \ n = 10]
      bx(2, 1, "snow")[Snowballing \ n = 110]
      bx(2, 2, "reap")[Reapplication \ of selection \ criteria \ n = 15]
      bx(1, 2, "inc")[Publications \ included \ n = 15]
      line("acm.east", (rel: (0.25, 0)), name: "dlu")
      line("ieee.east", (rel: (0.25, 0)))
      line("springer.east", (rel: (0.25, 0)), name: "dld")
      line("dlu.end", "dld.end", name: "dl")
      set-style(mark: (end: "triangle", fill: black))
      line("dl.start", "dup.west")
      line("dup.east", "sel.west")
      line("sel.south", "snow.north")
      line("snow.south", "reap.north")
      line("reap.west", "inc.east")
    })
  ) <sas_fig>
]
#figure( // sas_search_string
  caption: [Search string used in electronic databases],
  kind: "raw",
  align(left)[
    // ("dataflow analysis" OR "data flow analysis") AND (compiler OR compilers OR compilation) AND (optimization OR optimizations) AND (advantages OR disadvantages OR strengths OR limitations OR trade-offs) AND (implementation OR usage OR used OR applied)
    // ("Full Text .AND. Metadata":"dataflow analysis" OR "Full Text .AND. Metadata":"data flow analysis") AND ("Full Text .AND. Metadata":compiler OR "Full Text .AND. Metadata":compilers OR "Full Text .AND. Metadata":compilation) AND ("Full Text .AND. Metadata":optimization OR "Full Text .AND. Metadata":optimizations) AND ("Full Text .AND. Metadata":advantages OR "Full Text .AND. Metadata":disadvantages OR "Full Text .AND. Metadata":strengths OR "Full Text .AND. Metadata":limitations OR "Full Text .AND. Metadata":trade-offs) AND ("Full Text .AND. Metadata":implementation OR "Full Text .AND. Metadata":usage OR "Full Text .AND. Metadata":used OR "Full Text .AND. Metadata":applied)
    #set raw(syntaxes: "search-string.sublime-syntax", theme: "search-string.tmTheme")
    ```SearchString
    ("dataflow analysis" OR "data flow analysis") AND
    (compiler OR compilers OR compilation) AND
    (optimization OR optimizations) AND
    (advantages OR disadvantages OR strengths OR limitations OR trade-offs) AND
    (implementation OR usage OR used OR applied)
    ```
  ]
) <sas_search_string>
Our search strategy consists of five steps, as seen in @sas_fig. \
In the first step, papers are collected from the electronic databases ACM Digital Library, IEEE Xplore, and Springer Link with the search string shown in @sas_search_string. The search string was created using the research questions in @research_questions_s and was always applied to the full text of the papers.
The search string is divided into keywords for dataflow analysis, keywords for compiler optimization, keywords for advantages or disadvantages to help find papers relevant for answering RQ1, and keywords signaling an implementation in the paper to help answer RQ2. The different keywords were derived from the abstracts, titles, and keywords of our source papers. In the second step all duplicates which were returned from multiple databases were removed from the results, and the amount was limited to fit the scope of this paper. In the third step the selection was filtered by applying all selection criteria from @selection_criteria_s. In the fourth step we snowballed the previously acquired results, in order to find relevant papers which were not included because of either the search string or the selection criteria. Afterwards, all papers found via snowballing were filtered again by applying the selection criteria in @selection_criteria_s. In the end, all papers from the third step and the papers from the snowballing were evaluated based on the data extraction items mentioned in @data_extraction_s.
== Selection criteria <selection_criteria_s>
For a publication to be relevant it has to satisfy at least one inclusion criterion and no exclusion criterion. The criteria were chosen to include as many publications as possible while still filtering out irrelevant ones.
#[
  #v(4pt)
  #set enum(numbering: (.., i) => "IC" + str(i))
  + Publications discussing advantages and disadvantages of DFA compared to other optimization techniques.
  + Publications focusing on one or more compilers (e.g., LLVM, Java JIT, C\# JIT).
  + Publications providing an implementation for a DFA optimization.
  #v(4pt)
]
We chose _IC1_ to help answer _RQ1_. _IC2_ is meant to include publications which talk about a compiler and how DFA is implemented in it. _IC3_ is meant to further include publications which directly provide an implementation or talk about creating an implementation. This allows analyzing how DFA is used in compilers.
#[
  #v(4pt)
  #set enum(numbering: (.., i) => "EC" + str(i))
  + Publications which discuss DFA in a non-compiler context.
  + Publications written in a language other than English.
  + Secondary and tertiary publications (e.g., systematic literature reviews, surveys).
  + Publications in the form of tutorial papers, short papers, poster papers, editorials.
  + Publications for which the full text is not available.
  #v(4pt)
]
_EC1_ excludes publications which discuss DFA in contexts that are not relevant to compiler optimization. _EC2--EC5_ exclude publications which do not provide enough information to be included in this publication.
== Data extraction <data_extraction_s>
#[ // data_extraction_table
  #set par(leading: 0.4em)
  #set text(size: 9pt)
  #figure(
    caption: [Data items],
    supplement: "Table",
    booktbl(
      columns: (1fr, 8fr, 3.9fr),
      align: left,
      inset: (x: 6pt, y: 2pt),
      [ID], [Data], [Purpose],
      ..(
        ([Author(s)], [Documentation]),
        ([Publication year], [Documentation]),
        ([Title], [Documentation]),
        ([Named advantage(s) of DFA for CO], [RQ1]),
        ([Named disadvantage(s) of DFA for CO], [RQ1]),
        ([Analyzed compiler(s)], [RQ2]),
        ([Targeted language(s) of the optimization], [RQ2]),
        ([What optimizations are implemented with DFA], [RQ2]),
        ([Limitations of the analysis], [RQ2])
      ).enumerate(start: 1).map(((i, arr)) => ([D#i], ..arr)).flatten()
    )
  ) <data_extraction_table>
]
Based on the research questions, we collected 9 data items to extract from all included publications. @data_extraction_table lists all data items.
\ Data items _D1--D3_ document the source of the publication. \
_D4_ and _D5_ explicitly list the advantages and disadvantages for answering _RQ1_. \
_D6_ and _D7_ show in which compiler DFA was implemented and whether it runs directly on a programming language like C++ or on an intermediate language like LLVM IR. \
_D8_ lists which optimizations were performed based on the results of DFA, and _D9_ lists the limitations of the executed DFA (e.g., only run at function scope). \
All data items were extracted from the full text of all included publications.
= Findings <findings_c>
In this chapter we list our findings from the conducted systematic literature analysis.
== Demographics
The work by Zaidi and Greaves @y-zaidi_value_2015 shows that dataflow analysis is not only used to optimize software for conventional computers, but also to optimize designs written in hardware description languages like Verilog or VHDL, which are then turned into hardware via a Field-Programmable Gate Array (FPGA).
=== Publication year
#figure( // demographic_pub_year
  caption: "Publication years of the publications",
  {
    let data = (
      (1973, 1), (1997, 1), (2010, 2), (2011, 2), (2012, 1), (2013, 2),
      (2015, 1), (2018, 1), (2019, 1), (2020, 2), (2024, 1)
    )
    // cSpell:disable
    lq.diagram(
      width: 8.5cm,
      xlim: (1972, 2026),
      ylim: (0, 2.5),
      yaxis: (subticks: none, ticks: range(0, 3)),
      xaxis: (ticks: range(1975, 2026, step: 5)),
      lq.bar(
        data.map(v => v.at(0)),
        data.map(v => v.at(1))
      )
    )
    // cSpell:enable
  }
) <demographic_pub_year>
As seen in @demographic_pub_year, most of the analyzed publications are from the last 15 years, which indicates that this field is still actively being researched and explored, although research had already started back in 1973. Since research started over 50 years ago, the field is by now well-established. There are certainly more publications which are not represented in this figure, owing to the very limited scope of this paper.
=== Target languages
#figure( // demographic_target_lang
  caption: "Target languages of the publications",
  {
    let data = (
      ("None", 1), ("Custom", 1), ("C", 3), ("LLVM IR", 5),
      ("Java Bytecode", 2), ("Graal IR", 1), ("SSA of Java", 2)
    )
    cetz.canvas({
      //let colors = (red, eastern, green, blue, navy, purple, maroon, orange)
      let colors = gradient.linear(..color.map.rainbow.map(v => v.darken(20%).saturate(20%)))
      // cspell:disable-next-line
      cetz_chart.piechart(
        data,
        value-key: 1,
        label-key: 0,
        radius: 3,
        slice-style: colors,
        inner-radius: 0,
        inner-label: (content: (value, _) => [#text(white, str(value))], radius: 150%),
        outer-label: (content: (value, _) => [], radius: 0%),
        legend: (
          position: "east",
          anchor: "south",
          orientation: ttb,
          offset: (1.7cm, -2.5cm)
        )
      )
    })
  }
) <demographic_target_lang>
@demographic_target_lang shows a trend (33% of the publications) towards implementing DFA optimizations either with LLVM directly or by operating on the LLVM IR, while Java is targeted either directly as bytecode or through an SSA representation of Java. This suggests that LLVM is a good platform for implementing optimizations and that it has a lower barrier of entry for developing them. \
The publication which does not target any language is by Kildall @y-kildall_unified_1973 and only discusses theoretical ways to implement DFA. The publication by Ramsey, Dias and Peyton-Jones @y-ramsey_hoopl_2010 implements, in Haskell, a custom IR to run the optimizations on, with a custom type for the different possible instructions.
// TODO mention which pubs are in each category
=== Research focus
#figure( // demographic_research_focus
  caption: "Research focus of the publications",
  {
    let data = (
      ("Algorithms and Techniques", 5), // 1, 2, 5, 7, 12
      ("Implementation and Reusability", 2), // 3, 8
      ("Analysis speed improvement", 4), // 4, 6, 14, 15
      ("Custom IR for analysis", 3), // 9, 10, 13
      ("Tools for implementation of DFA", 1), // 11
    )
    cetz.canvas({
      //let colors = (red, eastern, green, blue, navy, purple, maroon, orange)
      let colors = gradient.linear(..color.map.rainbow.map(v => v.darken(20%).saturate(20%)))
      // cspell:disable-next-line
      cetz_chart.piechart(
        data,
        value-key: 1,
        label-key: 0,
        radius: 3,
        slice-style: colors,
        inner-radius: 0,
        inner-label: (content: (value, _) => [#text(white, str(value))], radius: 150%),
        outer-label: (content: (value, _) => [], radius: 0%),
        legend: (
          position: "south",
          anchor: "north",
          orientation: ttb
        )
      )
    })
  }
) <demographic_research_focus>
The focus of the different papers can be seen in @demographic_research_focus. Most of the papers #cgy[@y-kildall_unified_1973 @y-rastislav_bodik_interprocedural_1997 @y-joisha_technique_2011 @y-tang_summary-based_2012 @y-pathade_path_2019] focus on creating and implementing new algorithms and techniques. Another big focus of the included papers is speeding up the analysis, which also makes it more viable for use in JIT compilers. While #cgy[@y-edvinsson_multi-threaded_2010 @y-edvinsson_parallel_2011] try to do this by simply running parts of the analysis on different threads, the work by Shi and Zhang @y-shi_pipelining_2020 tries to pipeline the analysis of functions, and the work by Aigner, Barany and Mössenböck @y-aigner_lazy_2024 tries to skip parts of the analysis by only lazily iterating over nodes of the IR. In the works of Duboscq et al., Zaidi et al., and Reissmann et al. #cgy[@y-duboscq_graal_2013 @y-zaidi_value_2015 @y-reissmann_rvsdg_2020] a custom IR is implemented to make it easier to run parts of the DFA or to have a better structure than the previous code or IR. The focus of the work by Ramsey et al. @y-ramsey_hoopl_2010 is to provide a generic library for implementing and using DFA, and the work of Urban and Steinlechner @y-urban_implementing_2013 provides an example implementation based on that library to show how it works. The work by Ginsbach, Crawford and O'Boyle @y-ginsbach_candl_2018 defines a Domain-Specific Language (DSL) and implements a compiler for it, so that DFA algorithms can be implemented in the LLVM framework more easily, making it simpler for researchers to try out new ideas.
== RQ1: Advantages and disadvantages of using dataflow analysis for compiler optimization
DFA makes many big compiler optimizations possible, but it also brings many trade-offs, and not just in performance. These optimizations eliminate unused code and simplify expressions, which reduces execution time and memory footprint during runtime. The work by Kildall @y-kildall_unified_1973 is one of the first to describe DFA and how it allows previously existing optimizations, which could only be applied to code sections without branches, to be applied across branches by checking how data flows through them. Later publications by Bodík et al. and Joisha et al. #cgy[@y-rastislav_bodik_interprocedural_1997 @y-joisha_technique_2011] describe ways to apply these optimizations interprocedurally and across thread synchronization boundaries. The work by Bodík, Gupta and Soffa @y-rastislav_bodik_interprocedural_1997 does this by inlining the called procedure and then performing dataflow analysis.
This optimizes every procedure call for its specific call site, but brings the disadvantage of very rapidly increasing the size of the optimized program. An important requirement that the work by Joisha, Schreiber, Banerjee, Boehm and Chakrabarti @y-joisha_technique_2011 describes is that programs must be well synchronized; otherwise DFA cannot be used because of possible data races.
=== Analysis performance
Performance is not the biggest concern for DFA, since it runs at compile time and accuracy is more important, as described in the work by Edvinsson and Löwe @y-edvinsson_multi-threaded_2010. Nevertheless, many publications #cgy[@y-edvinsson_multi-threaded_2010 @y-edvinsson_parallel_2011 @y-shi_pipelining_2020 @y-aigner_lazy_2024] have investigated how to improve the performance of DFA. This is done with several techniques described next. \
In both publications by Edvinsson, Löwe and Lundberg #cgy[@y-edvinsson_multi-threaded_2010 @y-edvinsson_parallel_2011] different function calls are run on different threads, but this has the problem of creating and queuing a task for each function, which can lead to a big overhead. The later work by Edvinsson, Löwe and Lundberg @y-edvinsson_parallel_2011 expands upon the first work by Edvinsson and Löwe @y-edvinsson_multi-threaded_2010, and both detail how just running every SSA node on a new thread leads to almost no speed-up, since most SSA nodes are sequentially dependent on each other, and it also leads to thread-switch and synchronization overhead. To solve this problem they describe a way to cluster SSA nodes such that each cluster is independent of the others. SSA nodes are independent from one another if they are not reachable from each other in the SSA graph, they are in different functions, and they are analyzed in a different context. They also want to avoid dependencies between clusters, because then it is necessary to perform synchronization between the threads which operate on those clusters. Their algorithm allows nodes to be assigned to multiple clusters, but the analysis result then leads to dependencies between the clusters because the result must be shared. The algorithm also allows independent branches (i.e. the _true_ and the _false_ branch of an if) to be processed independently, but after calculating both branches the results must be merged together in one thread, which requires synchronization. They also process all potential call targets of a method in parallel. Both of those can lead to redundancy, because the already queued nodes can still be reached through other nodes which are called by these nodes. To avoid the overhead of analyzing small methods separately, a threshold is added to indicate when it is worth performing the task on a separate thread. This threshold is calculated by comparing the number of methods reachable from the to-be-analyzed method against a fixed number, which was determined experimentally. They also try to reduce redundancy by checking the set of methods which would be called by the new task against the sets of methods called by the currently running and waiting tasks. Only when the overlap between those sets is smaller than a fixed threshold, which was also determined experimentally, is the new task queued. The results show a maximum speed-up of 2.43 and an average of 1.71. A problem with this approach is that the fixed threshold values differ between programs, so they must be determined anew every time. In some cases there is no speed-up for the analysis of a program.
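
For context, the baseline that these parallelization efforts compete against is the classic sequential fixpoint iteration going back to Kildall @y-kildall_unified_1973. The following C sketch is our own minimal illustration with made-up gen/kill sets, not code from any of the analyzed papers; it makes visible why naive parallelization gains little: the input facts of a block depend on the freshly computed outputs of its predecessors.

```C
#include <stdio.h>

#define NBLOCKS 4

/* CFG shaped like the earlier SSA example: 0 -> {1, 2}, 1 -> 3, 2 -> 3 */
static const int succ[NBLOCKS][2] = { {1, 2}, {3, -1}, {3, -1}, {-1, -1} };

/* facts are bitsets (one bit per definition); the values are made up */
static const unsigned gen[NBLOCKS]  = { 0x1, 0x2, 0x4, 0x8 };
static const unsigned kill[NBLOCKS] = { 0x0, 0x1, 0x1, 0x0 };

int main(void) {
    unsigned in[NBLOCKS] = { 0 }, out[NBLOCKS] = { 0 };
    int worklist[64], top = 0;
    for (int b = NBLOCKS - 1; b >= 0; b--) worklist[top++] = b;

    while (top > 0) {                                  /* until fixpoint */
        int b = worklist[--top];
        unsigned newout = gen[b] | (in[b] & ~kill[b]); /* transfer function */
        if (newout == out[b]) continue;                /* nothing changed */
        out[b] = newout;
        for (int i = 0; i < 2; i++) {
            int s = succ[b][i];
            if (s < 0) continue;
            in[s] |= out[b];              /* may-analysis: merge is union */
            worklist[top++] = s;          /* successor must be revisited  */
        }
    }
    for (int b = 0; b < NBLOCKS; b++)
        printf("block %d: out = 0x%x\n", b, out[b]);
    return 0;
}
```

The clustering described above essentially tries to find groups of such blocks (or SSA nodes) whose fixpoint computations never read each other's `in` sets, so that they can run on separate threads without synchronization.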
// TODO explain detailed how function are selected
Another approach, described in the work by Shi and Zhang @y-shi_pipelining_2020, is to pipeline the analysis of function calls. This is done by first analyzing all variables which do not depend on any function calls. When a function call has finished being analyzed, the variables which depend on that call are analyzed. Thereby more parallel work is possible.
=== Implementation complexity
Another problem with DFA is the difficulty of implementing optimizations with it, which is explained in the works by Ramsey et al. and Ginsbach et al. #cgy[@y-ramsey_hoopl_2010 @y-ginsbach_candl_2018]. DFA is often also deeply entangled with the compiler internals, which makes it difficult to reuse existing optimizations in other compilers or to implement new optimizations quickly, as seen in LLVM: "simple peephole optimizations in the LLVM instcombine pass contain approximately 30000 lines of complex C++ code, despite the transformations being simple" (@y-ginsbach_candl_2018, p. 151). \
One solution to this problem is described in the work by Ramsey, Dias and Peyton-Jones @y-ramsey_hoopl_2010: a library implemented in Haskell which performs the dataflow analysis and provides an interface, which "is made possible by sophisticated aspects of Haskell's type system, such as higher-rank polymorphism, GADTs, and type functions" @y-ramsey_hoopl_2010, for implementing various optimizations, which can then also be reused in other compilers. The biggest drawback of this library is that it is limited to compilers implemented in Haskell. \
Another approach is described in the work of Ginsbach, Crawford and O'Boyle @y-ginsbach_candl_2018: a domain-specific language for implementing LLVM optimization passes. It provides a simple language for directly expressing the logic of the optimization, while a custom transpiler converts it into an LLVM pass written in C++. Since the generated LLVM pass is implemented in a more generic way, it leads to a moderate compile-time increase, and no formal verification is done on the implemented optimization pass. Because of these disadvantages it is a great tool to quickly implement, test, and iterate on optimizations, but for more permanent passes hand-written C++ code should be used.
=== Limitations
DFA is hard to parallelize because variables are often dependent on other variables or function arguments. While it is possible to analyze multiple functions at the surface level, they still depend on the context of the functions calling them. As already mentioned for the work by Shi and Zhang @y-shi_pipelining_2020, it is still possible to run parts of the analysis in parallel while waiting for the results of other threads. \
Global variables also make the analysis more complicated, since they can be accessed and modified by all functions. They either need to be treated as an unknown value every time, or all functions which work with such a variable become analytically dependent on each other and must be looked at when checking the value of the variable. A similar problem exists for variables shared across threads, because the analysis has to look at all functions which could modify the variable. As described in the work by Joisha, Schreiber, Banerjee, Boehm and Chakrabarti @y-joisha_technique_2011, an alternative is that the variables are well synchronized, so that either one thread can write or multiple threads can read, but not both at the same time.
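
As a concrete illustration of the problem with globals, consider the following hypothetical C example (our own, not taken from the analyzed papers): a single call to an unknown function forces the analysis to discard every fact about a global, while facts about locals whose address is never taken survive the call.

```C
int g = 1;           // global: any called function may modify it

void unknown(void);  // opaque, e.g. defined in another translation unit

int f(void) {
    int local = 1;   // local, address never taken: facts survive calls
    int a = g;       // facts here: a == 1, g == 1
    unknown();       // may write g: every fact about g is killed
    // a is still 1 (it holds a private copy), but g is now unknown,
    // so a + g folds only to 1 + g; local + 1 still folds to 2.
    return a + g + local + 1;
}
```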
Another thing that complicates DFA in languages like C is the usage of pointers, because they allow the program to modify variables in unpredictable ways, which invalidates all facts and assumptions that were made about the variables up to that point. \
Since inlining is required to perform interprocedural rewrites, these optimizations can also bloat the executable and make it overly large.
== RQ2: Usage of dataflow analysis in current compilers
The Glasgow Haskell Compiler (GHC), LLVM, and GCC are good examples of compilers which already extensively use DFA to implement optimizations. The optimizations implemented by the analyzed papers are described in the following sections. These optimizations include common subexpression elimination #cgy[@y-kildall_unified_1973 @y-tang_summary-based_2012 @y-reissmann_rvsdg_2020], copy propagation #cgy[@y-joisha_technique_2011 @y-tang_summary-based_2012], constant propagation @y-kildall_unified_1973, conditional branch elimination @y-rastislav_bodik_interprocedural_1997 and dead code elimination @y-reissmann_rvsdg_2020.
=== Summary-based analysis
The work by Tang and Järvi @y-tang_summary-based_2012 describes how to implement summary-based analysis and how to use it for user-defined types and objects. Summary-based analysis can be used to keep the sensitivity and most of the information of the analysis while still saving time when analyzing the code. As described in the work by Tang and Järvi @y-tang_summary-based_2012, it commonly consists of two steps. \
The first step is to traverse the call graph bottom-up (starting with the procedures which depend on no other procedures, then going through procedures where only already analyzed procedures are called) and compute the side effects and points-to relations of each procedure. This first step computes everything without the calling context, so it results only in the side effects and relations which can happen independently of where the procedures are called. \
The second step then performs a top-down analysis (starting with the normal entry points of the program and then going through the procedures as they would be called in the program) in which the actual call arguments are passed to the procedures. The exact side effects and points-to relations are calculated in this step. \
The results of this analysis are then stored as a tuple. The first entry of the tuple is the points-to relations, depending on the objects and pointers accessible in the procedure. The second entry is the side effects, specifically reads and modifications of the objects in the procedure. \
While the work by Tang and Järvi @y-tang_summary-based_2012 does not directly implement any optimizations based on their analysis, they show that the approach leads in almost all cases to a more concrete and smaller result for the points-to analysis and the side-effects analysis. Because of this, other optimizations are able to run more quickly with almost the same accuracy.
// TODO explain 3.1 composite objects
=== Copy propagation
Copy propagation is implemented in the work of Joisha, Schreiber, Banerjee, Boehm and Chakrabarti @y-joisha_technique_2011 with a focus on making it applicable in multi-threaded environments. It is implemented based on a procedural concurrency graph which is built from the source code. The nodes are all procedures which could run in the program. The edges between the nodes represent an MHP (may-happen-in-parallel) relation (@y-joisha_technique_2011, p.
627), which is a possible overlap of the execution of both nodes. The function $I((p_1, p_2))$ lists the variables on which the procedures $p_1$ and $p_2$ interfere. Interference in this context is a read and a write in overlapping (parallel) regions of the procedures. As long as there is no interference between two functions on a variable, or the corresponding lock for a variable is held, it is possible to do copy propagation for the variable.
#figure( // copy_prop_rq2_example
  caption: [Example for demonstrating copy propagation in multi-threaded programs, adapted from Fig. 1 and Fig. 2 in the work of Joisha, Schreiber, Banerjee, Boehm and Chakrabarti @y-joisha_technique_2011],
  kind: "raw",
  grid(
    columns: (1fr, 1fr),
    grid.cell(colspan: 2, align(center)[Global: `int X, Y; X = Y = 0; mutex my;` #v(1mm)]),
    [
      #align(center, [Thread $t_1$])
      ```C
      int a = X;
      LOCK(my);
      int b = Y;
      print(a, b);
      Y = b + 1;
      UNLOCK(my);
      ```
    ],
    [
      #align(center, [Thread $t_2$])
      ```C
      X = 1;
      int a = Y;
      X = a + 1;
      LOCK(my);
      Y = a + 1;
      UNLOCK(my);
      print(X, Y);
      ```
    ]
  )
) <copy_prop_rq2_example>
This technique can be explained based on @copy_prop_rq2_example. In thread $t_1$ there are two opportunities for applying copy propagation. The first is that the variable `a` from line 1 can be propagated to the `print` on line 4, since no writes to the global variable `X` happen in this thread. The second is the variable `b`, since access to the global variable `Y` is locked behind the mutex `my`. In thread $t_2$ copy propagation cannot be performed, since the variable `a` reads from the global variable `Y` and this read is not protected by locking the mutex `my`. `Y` could therefore have a different value on line 3 and line 5, because it is also written in $t_1$.
=== Regionalized Value State Dependence Graph (RVSDG)
#place( // rvsdg_example
  bottom + center,
  scope: "parent",
  float: true,
  [
    #figure(
      caption: [Example of how the RVSDG represents a program, taken from Fig. 1 of the work by Reissmann, Meyer, Bahmann and Själander @y-reissmann_rvsdg_2020],
      grid(
        columns: (1fr,)*4,
        column-gutter: 0.5em,
        image("rvsdg_1_code.svg"),
        image("rvsdg_2_cfg.svg"),
        image("rvsdg_3_uir.svg"),
        image("rvsdg_4_oir.svg")
      )
    ) <rvsdg_example>
  ]
)
In their work, Reissmann, Meyer, Bahmann and Själander @y-reissmann_rvsdg_2020 describe a new IR for optimizing compilers that makes optimizations easier and simpler to implement. Based on this IR they implemented their own compiler, jlm #footnote[https://github.com/phate/jlm], which consumes and outputs LLVM IR and was able to reach almost the same speed-up as LLVM's own optimizer. \
The RVSDG is implemented as a tree structure consisting of nested regions. A region represents a collection of nodes and edges, with input and output connections. The edges in a region run either between nodes or between nodes and the inputs or outputs of the region. \
Nodes either represent primitive operations like arithmetic, memory operations and function calls, or they are structural nodes, each of which represents a different construct: Gamma (#sym.gamma) nodes model branching statements like `if` or `switch`. Theta (#sym.theta) nodes model tail-controlled loops; for head-controlled loops a combination of a gamma node and a theta node is used. Lambda (#sym.lambda) nodes model a function, with multiple input connections representing arguments and a single output connection for the function result. Delta (#sym.delta) nodes model global variables, with the inputs representing dependencies on external variables and the output representing the actual value.
Phi (#sym.Phi) nodes are used for recursive functions and contain a single #sym.lambda node representing the actual function; an inner input connection is connected to the lambda node, which can then use it to call itself, thereby allowing the RVSDG to represent recursive calls without cycles. Omega (#sym.omega) represents a translation unit, is the top-level node of an RVSDG, and has no input or output connections.
// TODO maybe explain nodes in/out better
// TODO explain edges
// TODO explain optimizations
// TODO explain example
= Conclusion <conclusion_c>
Our findings show that DFA is already extensively used in current compilers and brings big advantages for runtime speed. The cost of this is a longer compilation, which makes DFA hard to use in JIT compilation. Furthermore, DFA allows complex optimizations across branches and function boundaries which would not be possible with traditional straight-line optimizations. \
The high implementation complexity and the deep entanglement with compiler internals also pose a big problem for advancing this field further. The recent release of new publications on this topic indicates that researchers are continuously searching for better and faster ways to implement DFA and to make better use of the analysis results. \
The adaptability of LLVM and its associated intermediate representation make it an invaluable platform for testing and research with DFA.
#pagebreak(weak: true)
#set heading(numbering: none)
#bibliographyx("refs.bib", prefix: "x-", title: "References", style: "association-for-computing-machinery")
#let slr_bib_style = read("association-for-computing-machinery.csl", encoding: none)
#bibliographyx("slr.bib", full: true, prefix: "y-", title: "Analyzed papers", style: slr_bib_style)
#pagebreak(weak: true)
#set heading(numbering: "A.a.a")
#counter(heading).update(0)
#{ // slr results table
  set page(flipped: true, columns: 1, margin: (x: 1.75em, y: 3em))
  [= SLR Results]
  v(1em)
  counter(heading).update(0)
  set table(stroke: (x, _) => if x in (1, 4, 6) { (x: 2pt, y: 1pt) } else { 1pt })
  show heading: set text(weight: "regular")
  context {
    let slr_data = csv("pubs.csv")
    let header = slr_data.at(0)
    let slr_data = slr_data.slice(1)
    let slr_bib = get-bibliography("y-")
    let key_map = (:)
    for r in slr_bib.references {
      let k = "y-" + r.key
      let id = r.prefix.elem.children.at(0).text.text.find(regex("\d\d?"))
      let idx = slr_data.position(v => v.at(0) == id)
      slr_data.at(idx).at(0) = ref(label(k))
      key_map.insert(id, k)
    }
    table(
      columns: (auto, auto, auto, auto, auto, auto, 6em, 4.05em, auto, auto),
      inset: (x: 5pt, y: 3pt),
      ..header,
      ..slr_data.flatten()
    )
  }
}