diff --git a/.vscode/settings.json b/.vscode/settings.json index 02fcc82..387ca7a 100644 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -1,6 +1,7 @@ { "editor.wordWrap": "on", "cSpell.words": [ + "bitcode", "interprocedurally", "Regionalized", "RVSDG", diff --git a/main.typ b/main.typ index f8d7daf..4c4ee95 100644 --- a/main.typ +++ b/main.typ @@ -1,6 +1,6 @@ // cSpell:ignoreRegExp @[a-zA-Z0-9_-]+ // cSpell:ignoreRegExp #\w+\( -// cSpell:ignore cetz booktbl bibliographyx citegroup instcombine +// cSpell:ignore cetz booktbl bibliographyx citegroup instcombine acopy // cSpell:ignore Reissmann Soffa Zaidi Aigner Barany Mössenböck Duboscq Steinlechner Ginsbach Kildall Rastilav Joisha Rajiv Schreiber Banerjee Boehm Chakrabarti Löwe Edvinsson Peyton-Jones Bahmann Själander // cSpell:disable #import "clean-acmart.typ": acmart @@ -12,7 +12,7 @@ #import "@preview/alexandria:0.2.0": * #show: codly-init.with() -#codly(zebra-fill: none, display-icon: false, display-name: false, stroke: none, radius: 0mm, inset: 0.2em) +#codly(zebra-fill: none, display-icon: false, display-name: false, stroke: none, radius: 0mm, inset: 0.125em) #let booktbl = tblr.with( stroke: none, @@ -80,7 +80,6 @@ #set heading(numbering: "1.1.1") // cSpell:enable -// TODO Small info what is LLVM + footnote link = Abstract Dataflow analysis is an important part of compiler optimization since it allows eliminating or rewriting parts of the code with techniques such as constant propagation, dead code elimination and branch elimination. This work aims to look at the advantages and disadvantages of using dataflow analysis, how it is already used in current compilers, on which programming languages or intermediate representations it operates and what limitations still exist. \ For this purpose we conducted a systematic literature review in which we analyze 15 publications selected from 571 entries. Finally, the following conclusions were drawn: dataflow analysis is used in many of today's popular compilers and the field is actively being researched. The performance gains enabled by dataflow analysis are large, but its implementations are complex and care must be taken that an implementation does not change the program in an unwanted way. @@ -94,11 +93,15 @@ While this paper talks about dataflow analysis in the context of compiler optimi This work is divided into the following sections: in @background_c the background required to understand this work is given, in @methodology_c the methodology used to create this work is described, in @findings_c the contents of the papers are analyzed and evaluated, in @conclusion_c the results from this work are summarized. = Background +== Static Single Assignment form (SSA / SSA form) +Many modern compilers and analysis tools operate on a Static Single-Assignment (SSA) form @x-cooper_keith_d_engineering_2011 @x-cytron_efficiently_1991. The SSA form works by assigning each variable only once. This is done by creating multiple sub-variables $x_1, x_2, ...$ for each variable $x$. After a branch in the program a #{sym.Phi}-Node is used to select the new value of the variable based on the branch that was executed. +An example of the SSA form can be seen in @ssa_form_example. On the left is simple C code in a function body and on the right is the respective SSA form of the C code. The intermediate representation of LLVM is closely modeled after the SSA form.
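For straight-line code the renaming alone is sufficient and no #{sym.Phi}-Node is needed, as the following small, hypothetical snippet (not part of @ssa_form_example) illustrates:

```C
// original code:  int x = 1;  x = x + 2;  int y = x * 3;
// in SSA form every assignment introduces a fresh sub-variable,
// so no value is ever overwritten:
int ssa_example(void) {
    int x_1 = 1;
    int x_2 = x_1 + 2;
    int y_1 = x_2 * 3;
    return y_1;
}
```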
== LLVM #footnote[https://llvm.org/] -LLVM is a compiler framework mainly consisting of the LLVM Core, which serves as backend or optimizing LLVM IR and compiling it to machine specific code, and Clang, a frontend for compiling C, C++ and Objective-C to LLVM IR. There are more components like a debugger and a C and C++ standard library. There also exists a lot of external projects for compiling other languages and targeting other machines. The biggest advantage of LLVM is its flexibility and easy extensibility, which makes it a useful framework for most kinds of research about compilers. +LLVM is a compiler framework mainly consisting of the LLVM Core, which serves as a backend for optimizing LLVM IR and compiling it to machine-specific code, and Clang, a frontend for compiling C, C++ and Objective-C to LLVM IR. There are more components like a debugger and a C and C++ standard library. There are also many external projects for compiling other languages and targeting other machines. The biggest advantage of LLVM is its flexibility and easy extensibility, which makes it a useful framework for most kinds of compiler research. \ +The LLVM IR #footnote[https://llvm.org/docs/LangRef.html] is an intermediate representation based on the SSA form and its instructions are modeled after a virtual CPU. Because it can be represented as in-memory IR, on-disk bitcode or human-readable assembly, it can be used efficiently by compilers and still be inspected by humans without having to decode bytes. == Control flow graph #figure( // ssa_form_example - caption: [C code and respective SSA in control flow graph form, adapted from Fig. 1 in the work of Reissmann, Meyer and Soffa @y-reissmann_rvsdg_2020], + caption: [C code and respective control flow graph in SSA form, adapted from Fig. 1 in the work of Reissmann, Meyer, Bahmann and Själander @y-reissmann_rvsdg_2020], kind: "raw", grid( columns: (1fr, 1.25fr), @@ -115,9 +118,6 @@ LLVM is a compiler framework mainly consisting of the LLVM Core, which serves as ) ) A control flow graph is a directed graph consisting of blocks of code as nodes and edges between these blocks for the program execution flow. The right part of @ssa_form_example shows a small example of a control flow graph. The nodes always contain sequential code with a statement at the end that changes the control flow. In the example this is either the `branch`, which branches the flow based on a condition, or the hidden `goto` at the end of both blocks for `x₃` and `x₄`, which just unconditionally jumps to another block of code. Edges after a branching statement also have a label to indicate for which condition this branch is taken. -== Static Single Assignment form (SSA / SSA form) -Many modern compilers and analysis tools operate on a Static Single-Assignment (SSA) form @x-cooper_keith_d_engineering_2011 @x-cytron_efficiently_1991. The SSA form works by assigning each variable only once. This is done by creating multiple sub-variables $x_1, x_2, ...$ for each variable $x$. After a branch in the program a #{sym.Phi}-Node is used to select the new value of the variable based on branch executed. -An example of the SSA form can be seen in @ssa_form_example. On the left is a simple C code in a function body and on right is the respective SSA form of the C code. The immediate representation of LLVM is closely modeled after the SSA form. == Dataflow analysis (DFA) A compiler can perform dataflow analysis either on the original source code or an intermediate representation.
When performing the analysis on the source code, the original structure and flow constructs of the program are available, while performing the analysis on an intermediate representation has the advantage of being usable for many different languages, but in the translation step from source code to intermediate representation a lot of information about control flow and similar constructs may already have been lost. LLVM, for example, already has many generic optimization steps for its own intermediate representation, which allows language developers to focus on designing their language and a compiler from their language to the LLVM IR with language-specific optimizations, instead of having to implement a full compiler and optimizations themselves. A big problem with DFA is its long runtime, and because it is a sequential algorithm it is complicated to parallelize. This makes it harder to use DFA in a Just-In-Time (JIT) compiler, since a JIT compiler only has a limited timeframe for compilation. === Forward and backward analysis @@ -130,12 +130,16 @@ Backward analysis goes from the exit of the program to the entry, thereby enabli kind: "raw", ```C int x, y, z; - if (...) { x = y = 2; z = 1; } - else { x = z = 2; } + if (...) { + x = y = 2; + z = 1; + } else { + x = z = 2; + } printf("%d %d %d", x, y, z); ``` ) -The facts which the algorithm knows about variable either must be true or may be true. When they must be true, every path leading to the current point must ensure that the fact is true. The facts in @must_may_example on line 4 are: `x` and `z` must be initialized since that is done in both branches of the if, while `y` only may be initialized. `x` also must be `2`, since it gets assigned `2` in both branches, `z` may be `2` or may be `1`. The must constraints are mostly used for optimization, while the may constraints are mostly used for showing warnings. +The facts which the algorithm knows about variables either must be true or may be true. When they must be true, every path leading to the current point must ensure that the fact is true. The facts in @must_may_example on line 4 are: `x` and `z` must be initialized since that is done in both branches of the if, while `y` only may be initialized. `x` also must be `2`, since it gets assigned `2` in both branches, while `z` may be `2` or may be `1`. The must constraints are mostly used for optimization, while the may constraints are mostly used for showing warnings. === Points-to analysis @x-rayside_pointsto_2005 Points-to analysis handles DFA in the presence of pointers and references. Specifically, it shows whether one variable can point to another variable during the execution of the program. Points-to analysis has multiple levels of precision. \ One of the most important aspects for precision is context-sensitivity. Given a function `void* id(void* p)` which just returns the pointer `p` it is given, a context-insensitive points-to analysis concludes that every pointer which is supplied as an argument could be returned as a result, while a context-sensitive analysis only yields the pointer which was actually supplied. As an example: with the code `void *a, *b; id(a); id(b);` the result would be `id(a), id(b) ∈ {a, b}` because the analysis cannot differentiate between those calls, while with context-sensitive analysis it would be `id(a) == a` and `id(b) == b`.
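The following hypothetical C snippet (an assumed completion of the `id` example above, not taken from one of the analyzed publications) makes the difference concrete:

```C
void *id(void *p) { return p; }   // identity function from the example

int main(void) {
    int a = 1, b = 2;
    int *pa = id(&a);   // context-sensitive result: pa points only to a
    int *pb = id(&b);   // context-sensitive result: pb points only to b
    // A context-insensitive analysis merges both call sites and therefore
    // reports that pa and pb may each point to either a or b.
    return *pa + *pb;
}
```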
\ @@ -351,7 +355,7 @@ The work by Zaidi and Greaves @y-zaidi_value_2015 shows that dataflow analysis i // cSpell:enable } ) -As seen in @demographic_pub_year most of the analyzed publication are from the last 15 years, which indicates that this field is still actively being researched and explore, but research has already start back in 1983. Since research started over 50 years ago it indicates that this field is by now well-established. There are definitely more publications which are not listed here and not represented in this figure, but that is because the scope of this papers is very limited. \ +As seen in @demographic_pub_year most of the analyzed publications are from the last 15 years, which indicates that this field is still actively being researched and explored, but research had already started back in 1973. Since research started over 50 years ago, the field is by now well-established. There are certainly more publications which are not listed here and not represented in this figure, but that is because the scope of this paper is very limited. The oldest publication included is by Kildall @y-kildall_unified_1973 from 1973, while the newest included publication is by Aigner, Barany and Mössenböck @y-aigner_lazy_2024 from 2024. === Target languages #figure( // demographic_target_lang caption: "Target languages of the publications", @@ -390,9 +394,8 @@ As seen in @demographic_pub_year most of the analyzed publication are from the l }) } ) -@demographic_target_lang shows a 33% trend towards implementing DFA optimizations either with LLVM directly or by operating on the LLVM IR, while Java is either directly used as bytecode or as SSA representation of Java. This shows that LLVM is a good platform for implementing optimizations and that it has a lower barrier of entry for developing optimizations. \ +@demographic_target_lang shows a 33% trend towards implementing DFA optimizations either with LLVM directly or by operating on the LLVM IR, while Java is either used directly as bytecode or as an SSA representation of Java. This shows that LLVM is a good platform for implementing optimizations and that it has a lower barrier of entry for developing optimizations or for using its intermediate representation as input or output language. \ The publication which does not target any language is by Kildall @y-kildall_unified_1973 and only discusses theoretical ways to implement DFA. The publication by Ramsey, Dias and Peyton-Jones @y-ramsey_hoopl_2010 implements a custom IR in Haskell to run the optimizations on, with a custom type for the different possible instructions. -// TODO mention which pubs are in each category === Research focus #figure( // demographic_research_focus caption: "Research focus of the publications", @@ -439,7 +442,6 @@ Later publications by Rastilav et al. and Joisha et al. #cgy[@y-rastislav_bodik_ Performance is not the biggest concern for DFA, since it runs at compile time and accuracy is more important, as described in the work by Edvinsson and Löwe @y-edvinsson_multi-threaded_2010. Nevertheless, many publications #cgy[@y-edvinsson_multi-threaded_2010 @y-edvinsson_parallel_2011 @y-shi_pipelining_2020 @y-aigner_lazy_2024] have investigated how to improve the performance of DFA. This is done with several techniques described next.
\ In both publications by Edvinsson, Löwe and Lundberg #cgy[@y-edvinsson_multi-threaded_2010 @y-edvinsson_parallel_2011] different function calls are run on different threads, but this has the problem of creating and queuing a task for each function, which can lead to a big overhead. The later work by Edvinsson, Löwe and Lundberg @y-edvinsson_parallel_2011 expands upon the first work by Edvinsson and Löwe @y-edvinsson_multi-threaded_2010 and both detail how just running every SSA-node on a new thread leads to almost no speed-up, since most SSA-nodes are sequentially dependent on each other and it also leads to thread-switch and synchronization overhead. To solve this problem they describe a way to cluster SSA-nodes such that each cluster is independent from the others. SSA-nodes are independent from one another if they are not reachable from each other in the SSA-graph, they are in different functions and they are analyzed in different contexts. They also want to avoid dependencies between clusters, because then it is necessary to perform synchronization between the threads which operate on those clusters. Their algorithm allows nodes to be assigned to multiple clusters, but then the analysis result leads to dependencies between the clusters because the result must be shared. The algorithm also allows independent branches (i.e. the _true_ and the _false_ branch of an if) to be processed independently, but after calculating both branches the results must be merged together in one thread, which leads to synchronization. They also process all potential call targets of a method in parallel. Both of those can lead to redundancy because the already queued nodes can still be reached through other nodes which are called by these nodes. To avoid the overhead of analyzing small methods separately, a threshold is added to indicate when it is worthwhile to perform the task on a separate thread. This threshold is calculated by comparing the number of methods reachable from the to-be-analyzed method to a fixed number. This fixed number was determined experimentally. They also try to reduce redundancy by checking the set of methods which would be called by the new task against the sets of methods called by the currently running and waiting tasks. When the overlap between those sets is smaller than a fixed threshold, which was also determined experimentally, the new task is queued. The results show a maximum speed-up of 2.43 and an average of 1.71. A problem with this approach is that the fixed threshold numbers differ between programs, so they must be determined anew every time. In some cases there is no speed-up for the analysis of a program. \ -// TODO explain detailed how function are selected Another approach, described in the work by Shi and Zhang @y-shi_pipelining_2020, is to pipeline the function calls. This is done by first analyzing all variables which do not depend on any function calls. When a function call has finished being analyzed, the variables which depend on that call are analyzed. Thereby more parallel work is possible.
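The following hypothetical input program (not taken from the publication) illustrates the pipelining idea:

```C
int f(int v) { return v * v; }   // analyzed concurrently on another thread

int caller(int a, int b) {
    int x = a + b;   // does not depend on any call: analyzed immediately
    int p = x * 2;   // does not depend on any call: analyzed immediately
    int r = f(a);    // the fact for r needs the analysis result of f
    int y = r + p;   // analyzed only once the summary for f is available
    return y;
}
```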
=== Implementation complexity Another problem with DFA is the difficulty of implementing optimizations with it, which is explained in the works by Ramsey et al. and Ginsbach et al. #cgy[@y-ramsey_hoopl_2010 @y-ginsbach_candl_2018]. DFA is often also deeply entangled with the compiler internals, which makes it difficult to reuse existing optimizations with other compilers or to implement new optimizations quickly, and it is complicated to implement, as seen in LLVM: "simple peephole optimizations in the LLVM instcombine pass contain approximately 30000 lines of complex C++ code, despite the transformations being simple" (@y-ginsbach_candl_2018, p. 151) \ @@ -460,7 +462,6 @@ The first step is to traverse the call graph from the bottom-up (start with the The second step then performs a top-down analysis (start with the normal entry points of the program and then go through the procedures as they would be called in the program) where the actual call arguments are passed to the procedures. The exact side effects and points-to relations are calculated in this step. \ The results of this analysis are then stored as a tuple. The first entry of the tuple is the points-to relations, dependent on the objects accessible in the procedure and the pointers accessible in the procedure. The second entry is the side effects, specifically reads and modifications of the objects in the procedure. \ While the work by Tang and Järvi @y-tang_summary-based_2012 does not directly implement any optimizations based on their analysis, they show that the approach leads in almost all cases to a more concrete and smaller result for the points-to analysis and the side effects analysis. Because of this, other optimizations are able to run quicker with almost the same accuracy. -// TODO explain 3.1 composite objects === Copy propagation Copy propagation is implemented in the work of Joisha, Schreiber, Banerjee, Boehm and Chakrabarti @y-joisha_technique_2011 with a focus on making it applicable in multi-threaded environments. It is implemented based on a procedural concurrency graph which is built from the source code. The nodes are all procedures which could run in the program. The edges between the nodes represent an MHP (may-happen-in-parallel) relation (@y-joisha_technique_2011, p. 627), which is a possible overlap of the execution of both nodes. The function $I((p_1, p_2))$ lists the variables on which the procedures $p_1$ and $p_2$ interfere. Interference in this context is a read and a write in overlapping (parallel) regions of the procedures. As long as there is no interference between two functions on a variable, or the corresponding lock for the variable is held, it is possible to do copy propagation for that variable. @@ -503,31 +504,61 @@ This technique can be explained based on @copy_prop_rq2_example. In thread $t_1$ float: true, [ #figure( - caption: [Example how RVSDG, taken from Fig. 1 of the work by Reissmann, Meyer, Bahmann and Själander @y-reissmann_rvsdg_2020], + caption: [Example of how RVSDG looks, taken from Fig.
1 of the work by Reissmann, Meyer, Bahmann and Själander @y-reissmann_rvsdg_2020], grid( - columns: (1fr,)*4, + columns: (1.0864fr, 1fr, 1fr, 1fr), column-gutter: 0.5em, - image("rvsdg_1_code.svg"), + [ + ```C + int + f(int a, int b, int c, int d) + { + int li1, li2; + int cse, epr; + do { + li1 = b+c; + li2 = d-b; + a = a*li1; + int down = a%c; + int dead = a+d; + if (a > d) { + int acopy = a; + a = 3+down; + cse = acopy< cse); + return li2+epr; + } + ``` + #align(center, text(size: 8.25pt, "(a) Code")) + ], image("rvsdg_2_cfg.svg"), image("rvsdg_3_uir.svg"), - image("rvsdg_4_oir.svg") + image("rvsdg_4_oir.svg"), + ) ) ] ) In their work, Reissmann, Meyer, Bahmann and Själander @y-reissmann_rvsdg_2020 describe a new IR for optimizing compilers that aims to make optimizations easier and simpler to implement. Based on this IR they implemented their own compiler, jlm #footnote[https://github.com/phate/jlm], which consumes and outputs LLVM IR and was able to reach almost the same speed-up as LLVM's own optimizer. \ -RVSDG is implemented as a tree structure consisting of nested regions. A region only represents a collection of nodes and edges, with input connections and output connections. The edges in a region are either between nodes themselves and between nodes and either input or output of the region. \ -The nodes can either be primitive operations like arithmetics, memory operations and function calls. The other type are structural nodes where each represent a different construct: Gamma (#sym.gamma) model branching statements like `if` or `switch`. Theta (#sym.theta) models tail-controlled loops, for head-controlled loops a combination of a gamma node and a theta node is used. Lambda (#sym.lambda) models a function with multiple inputs connections representing arguments and a single output connection for the function result. Delta (#sym.delta) models global variable with its inputs representing dependencies on external variables and its output represents its actual value. Phi (#sym.Phi) is used for recursive functions and contain a single #sym.lambda node representing the actual function, its inner input connection is connected to the lambda node, which then can use it for calling itself again, thereby allowing RVSDG to represent recursive calls without cycles. Omega (#sym.omega) represents a translation unit and is the top-level node of an RVSDG and has not input or output connections. -// TODO maybe explain nodes in/out better -// TODO explain edges -// TODO explain optimizations -// TODO explain example +RVSDG is implemented as a tree structure consisting of nested regions. An example of how RVSDG looks can be seen in @rvsdg_example. A region only represents a collection of nodes and edges, with input connections and output connections. The connections of a region can be seen on the inside of the colored regions in (c) and (d) in @rvsdg_example. The nodes are either primitive operations like arithmetic, memory operations and function calls (applications) or structural nodes representing control flow, functions, translation units or global variables. The inputs and outputs of primitive nodes are just the arguments for the operation. Functions are also passed along as connections and called through the `apply` primitive nodes with the function arguments. The connections on the outside of the structural nodes are things like arguments, function results and branching conditions.
The edges in a region are either between nodes themselves or between nodes and either input or output of the region. \ +There are a few types of structural nodes where each represents a different construct: Gamma (#sym.gamma) models branching statements like `if` or `switch`. Theta (#sym.theta) models tail-controlled loops; for head-controlled loops a combination of a gamma node and a theta node is used. Lambda (#sym.lambda) models a function with multiple input connections representing arguments and a single output connection for the function result. Delta (#sym.delta) models a global variable, with its inputs representing dependencies on external variables and its output representing its actual value. Phi (#sym.Phi) is used for recursive functions and contains a single #sym.lambda node representing the actual function; its inner input connection is connected to the lambda node, which can then use it for calling itself again, thereby allowing RVSDG to represent recursive calls without cycles. Omega (#sym.omega) represents a translation unit, is the top-level node of an RVSDG and has no input or output connections. \ +The solid colored edges represent a data flow from a result to another operation, while dashed edges (not seen in @rvsdg_example) represent a state flow for ordering operations. An input must be connected to exactly one output, but outputs can be connected to any number of inputs. These edges thereby also model dependencies on other operations. The input connections of regions commonly model variables. In #sym.theta nodes these are variables that are set after every loop iteration and read again at the beginning. In #sym.gamma nodes every branch region gets the same inputs from the inputs of the node and all outputs from the branches get merged together, based on which branch was executed, to the output connections of the node. \ +The two optimizations presented in the work by Reissmann, Meyer, Bahmann and Själander @y-reissmann_rvsdg_2020 are common node elimination and dead node elimination. It is also mentioned that many other optimizations can be expressed as rewriting subgraphs, moving nodes between regions and diverting edges. \ +Common node elimination is implemented by first finding nodes which perform the same operation with the same inputs and marking them. The second step of the algorithm is taking all output edges of one group of nodes and diverting them to a single node in that group. The result of this is that all nodes except one have their output disconnected and are thereby declared dead nodes. \ +Dead node elimination removes nodes whose result is either not used or only used by other dead nodes. This is very easy with RVSDG, since it is only necessary to check whether the output of a node is connected anywhere. \
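As a rough illustration of this idea, the following C sketch (a hypothetical toy node structure, not the actual data structures used by jlm) removes nodes whose output has no remaining users:

```C
#include <stdbool.h>
#include <stddef.h>

// Toy node: the nodes it reads its inputs from and a count of how many
// live inputs (including region results) are connected to its output.
typedef struct Node {
    struct Node *inputs[2];
    int          n_inputs;
    int          n_users;
    bool         dead;
} Node;

// Dead node elimination: a node whose output is connected nowhere is dead.
// Removing it releases its input edges, which may make further nodes dead,
// so the pass repeats until nothing changes.
static void eliminate_dead_nodes(Node *nodes[], size_t n) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            Node *node = nodes[i];
            if (node->dead || node->n_users > 0)
                continue;
            node->dead = true;
            for (int j = 0; j < node->n_inputs; j++)
                node->inputs[j]->n_users--;
            changed = true;
        }
    }
}
```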
+The example in @rvsdg_example shows a simple function that calculates a value with a loop and an if. The listing in @rvsdg_example (a) contains C code which has various optimization opportunities. The variable `cse` gets assigned in both branches of the if on line 12 to the same value, since `acopy` equals `a`. The variable `dead` on line 11 never gets used. The variable `down` is only needed in the first branch, on line 14, so it does not need to be calculated when the other branch runs. Since the variables `b`, `c` and `d` never get modified in the loop, the calculations for `li1` and `li2` on lines 7 and 8 can be moved out of the loop. \ +These optimizations can be found and performed relatively simply in the RVSDG representation. The unoptimized and optimized versions can be seen in (c) and (d) of @rvsdg_example. The calculation of `cse` is inside the #sym.gamma region and since in both branches the node is connected to the same inputs and the same outputs, the calculation can be moved out of the #sym.gamma region. The calculation of the variable `dead` is the `+` node directly above the #sym.gamma node; because its output is not connected to anything, the node gets eliminated. The calculation of `down` is the marked `%` node above the #sym.gamma node; since its result is only used in the #sym.gamma node and there only in the `1` branch (its input in the `0` branch is not connected anywhere), it is moved into the `1` branch of the #sym.gamma region. Since the `b`, `c` and `d` input connections of the #sym.theta region are directly connected to their corresponding outputs, they are never modified in the loop. Because of this, all nodes which are only connected to these inputs can be moved out of the region; here these are the calculations of `li1` and `li2`. = Conclusion Our findings show that DFA is already extensively used in current compilers and brings big advantages for runtime speed. The cost of this is a higher compilation duration, which makes it unsuitable for JIT compilation. Furthermore, DFA allows complex optimizations across branches and function boundaries which would not be possible with traditional straight-line optimizations. \ The high implementation complexity and the deep entanglement with the compiler internals also pose a big problem for advancing this field further. The recent release of new publications on this topic indicates that researchers are continuously searching for better and faster ways to implement DFA and to make better use of the analysis results. \ -The adaptability of LLVM and the associated immediate representation makes it an invaluable platform to do testing and research with DFA. +The adaptability of LLVM and its associated intermediate representation make it an invaluable platform for testing and research on DFA. \ +There are also many publications on making the implementation of DFA algorithms simpler and quicker. This is done either by creating a DSL just for implementing DFA algorithms or by creating an IR on which it is much easier to run DFA algorithms. #pagebreak(weak: true) #set heading(numbering: none)