| 1 |
711 |
jeremybenn |
@c Copyright (c) 2010 Free Software Foundation, Inc.
|
| 2 |
|
|
@c Free Software Foundation, Inc.
|
| 3 |
|
|
@c This is part of the GCC manual.
|
| 4 |
|
|
@c For copying conditions, see the file gcc.texi.
|
| 5 |
|
|
@c Contributed by Jan Hubicka <jh@suse.cz> and
|
| 6 |
|
|
@c Diego Novillo <dnovillo@google.com>
|
| 7 |
|
|
|
| 8 |
|
|
@node LTO
|
| 9 |
|
|
@chapter Link Time Optimization
|
| 10 |
|
|
@cindex lto
|
| 11 |
|
|
@cindex whopr
|
| 12 |
|
|
@cindex wpa
|
| 13 |
|
|
@cindex ltrans
|
| 14 |
|
|
|
| 15 |
|
|
@section Design Overview
|
| 16 |
|
|
|
| 17 |
|
|
Link time optimization is implemented as a GCC front end for a
|
| 18 |
|
|
bytecode representation of GIMPLE that is emitted in special sections
|
| 19 |
|
|
of @code{.o} files. Currently, LTO support is enabled in most
|
| 20 |
|
|
ELF-based systems, as well as darwin, cygwin and mingw systems.
|
| 21 |
|
|
|
| 22 |
|
|
Since GIMPLE bytecode is saved alongside final object code, object
|
| 23 |
|
|
files generated with LTO support are larger than regular object files.
|
| 24 |
|
|
This ``fat'' object format makes it easy to integrate LTO into
|
| 25 |
|
|
existing build systems, as one can, for instance, produce archives of
|
| 26 |
|
|
the files. Additionally, one might be able to ship one set of fat
|
| 27 |
|
|
objects which could be used both for development and the production of
|
| 28 |
|
|
optimized builds. A, perhaps surprising, side effect of this feature
|
| 29 |
|
|
is that any mistake in the toolchain that leads to LTO information not
|
| 30 |
|
|
being used (e.g.@: an older @code{libtool} calling @code{ld} directly).
|
| 31 |
|
|
This is both an advantage, as the system is more robust, and a
|
| 32 |
|
|
disadvantage, as the user is not informed that the optimization has
|
| 33 |
|
|
been disabled.
|
| 34 |
|
|
|
| 35 |
|
|
The current implementation only produces ``fat'' objects, effectively
|
| 36 |
|
|
doubling compilation time and increasing file sizes up to 5x the
|
| 37 |
|
|
original size. This hides the problem that some tools, such as
|
| 38 |
|
|
@code{ar} and @code{nm}, need to understand symbol tables of LTO
|
| 39 |
|
|
sections. These tools were extended to use the plugin infrastructure,
|
| 40 |
|
|
and with these problems solved, GCC will also support ``slim'' objects
|
| 41 |
|
|
consisting of the intermediate code alone.
|
| 42 |
|
|
|
| 43 |
|
|
At the highest level, LTO splits the compiler in two. The first half
|
| 44 |
|
|
(the ``writer'') produces a streaming representation of all the
|
| 45 |
|
|
internal data structures needed to optimize and generate code. This
|
| 46 |
|
|
includes declarations, types, the callgraph and the GIMPLE representation
|
| 47 |
|
|
of function bodies.
|
| 48 |
|
|
|
| 49 |
|
|
When @option{-flto} is given during compilation of a source file, the
|
| 50 |
|
|
pass manager executes all the passes in @code{all_lto_gen_passes}.
|
| 51 |
|
|
Currently, this phase is composed of two IPA passes:
|
| 52 |
|
|
|
| 53 |
|
|
@itemize @bullet
|
| 54 |
|
|
@item @code{pass_ipa_lto_gimple_out}
|
| 55 |
|
|
This pass executes the function @code{lto_output} in
|
| 56 |
|
|
@file{lto-streamer-out.c}, which traverses the call graph encoding
|
| 57 |
|
|
every reachable declaration, type and function. This generates a
|
| 58 |
|
|
memory representation of all the file sections described below.
|
| 59 |
|
|
|
| 60 |
|
|
@item @code{pass_ipa_lto_finish_out}
|
| 61 |
|
|
This pass executes the function @code{produce_asm_for_decls} in
|
| 62 |
|
|
@file{lto-streamer-out.c}, which takes the memory image built in the
|
| 63 |
|
|
previous pass and encodes it in the corresponding ELF file sections.
|
| 64 |
|
|
@end itemize
|
| 65 |
|
|
|
| 66 |
|
|
The second half of LTO support is the ``reader''. This is implemented
|
| 67 |
|
|
as the GCC front end @file{lto1} in @file{lto/lto.c}. When
|
| 68 |
|
|
@file{collect2} detects a link set of @code{.o}/@code{.a} files with
|
| 69 |
|
|
LTO information and the @option{-flto} is enabled, it invokes
|
| 70 |
|
|
@file{lto1} which reads the set of files and aggregates them into a
|
| 71 |
|
|
single translation unit for optimization. The main entry point for
|
| 72 |
|
|
the reader is @file{lto/lto.c}:@code{lto_main}.
|
| 73 |
|
|
|
| 74 |
|
|
@subsection LTO modes of operation
|
| 75 |
|
|
|
| 76 |
|
|
One of the main goals of the GCC link-time infrastructure was to allow
|
| 77 |
|
|
effective compilation of large programs. For this reason GCC implements two
|
| 78 |
|
|
link-time compilation modes.
|
| 79 |
|
|
|
| 80 |
|
|
@enumerate
|
| 81 |
|
|
@item @emph{LTO mode}, in which the whole program is read into the
|
| 82 |
|
|
compiler at link-time and optimized in a similar way as if it
|
| 83 |
|
|
were a single source-level compilation unit.
|
| 84 |
|
|
|
| 85 |
|
|
@item @emph{WHOPR or partitioned mode}, designed to utilize multiple
|
| 86 |
|
|
CPUs and/or a distributed compilation environment to quickly link
|
| 87 |
|
|
large applications. WHOPR stands for WHOle Program optimizeR (not to
|
| 88 |
|
|
be confused with the semantics of @option{-fwhole-program}). It
|
| 89 |
|
|
partitions the aggregated callgraph from many different @code{.o}
|
| 90 |
|
|
files and distributes the compilation of the sub-graphs to different
|
| 91 |
|
|
CPUs.
|
| 92 |
|
|
|
| 93 |
|
|
Note that distributed compilation is not implemented yet, but since
|
| 94 |
|
|
the parallelism is facilitated via generating a @code{Makefile}, it
|
| 95 |
|
|
would be easy to implement.
|
| 96 |
|
|
@end enumerate
|
| 97 |
|
|
|
| 98 |
|
|
WHOPR splits LTO into three main stages:
|
| 99 |
|
|
@enumerate
|
| 100 |
|
|
@item Local generation (LGEN)
|
| 101 |
|
|
This stage executes in parallel. Every file in the program is compiled
|
| 102 |
|
|
into the intermediate language and packaged together with the local
|
| 103 |
|
|
call-graph and summary information. This stage is the same for both
|
| 104 |
|
|
the LTO and WHOPR compilation mode.
|
| 105 |
|
|
|
| 106 |
|
|
@item Whole Program Analysis (WPA)
|
| 107 |
|
|
WPA is performed sequentially. The global call-graph is generated, and
|
| 108 |
|
|
a global analysis procedure makes transformation decisions. The global
|
| 109 |
|
|
call-graph is partitioned to facilitate parallel optimization during
|
| 110 |
|
|
phase 3. The results of the WPA stage are stored into new object files
|
| 111 |
|
|
which contain the partitions of program expressed in the intermediate
|
| 112 |
|
|
language and the optimization decisions.
|
| 113 |
|
|
|
| 114 |
|
|
@item Local transformations (LTRANS)
|
| 115 |
|
|
This stage executes in parallel. All the decisions made during phase 2
|
| 116 |
|
|
are implemented locally in each partitioned object file, and the final
|
| 117 |
|
|
object code is generated. Optimizations which cannot be decided
|
| 118 |
|
|
efficiently during the phase 2 may be performed on the local
|
| 119 |
|
|
call-graph partitions.
|
| 120 |
|
|
@end enumerate
|
| 121 |
|
|
|
| 122 |
|
|
WHOPR can be seen as an extension of the usual LTO mode of
|
| 123 |
|
|
compilation. In LTO, WPA and LTRANS are executed within a single
|
| 124 |
|
|
execution of the compiler, after the whole program has been read into
|
| 125 |
|
|
memory.
|
| 126 |
|
|
|
| 127 |
|
|
When compiling in WHOPR mode, the callgraph is partitioned during
|
| 128 |
|
|
the WPA stage. The whole program is split into a given number of
|
| 129 |
|
|
partitions of roughly the same size. The compiler tries to
|
| 130 |
|
|
minimize the number of references which cross partition boundaries.
|
| 131 |
|
|
The main advantage of WHOPR is to allow the parallel execution of
|
| 132 |
|
|
LTRANS stages, which are the most time-consuming part of the
|
| 133 |
|
|
compilation process. Additionally, it avoids the need to load the
|
| 134 |
|
|
whole program into memory.
|
| 135 |
|
|
|
| 136 |
|
|
|
| 137 |
|
|
@section LTO file sections
|
| 138 |
|
|
|
| 139 |
|
|
LTO information is stored in several ELF sections inside object files.
|
| 140 |
|
|
Data structures and enum codes for sections are defined in
|
| 141 |
|
|
@file{lto-streamer.h}.
|
| 142 |
|
|
|
| 143 |
|
|
These sections are emitted from @file{lto-streamer-out.c} and mapped
|
| 144 |
|
|
in all at once from @file{lto/lto.c}:@code{lto_file_read}. The
|
| 145 |
|
|
individual functions dealing with the reading/writing of each section
|
| 146 |
|
|
are described below.
|
| 147 |
|
|
|
| 148 |
|
|
@itemize @bullet
|
| 149 |
|
|
@item Command line options (@code{.gnu.lto_.opts})
|
| 150 |
|
|
|
| 151 |
|
|
This section contains the command line options used to generate the
|
| 152 |
|
|
object files. This is used at link time to determine the optimization
|
| 153 |
|
|
level and other settings when they are not explicitly specified at the
|
| 154 |
|
|
linker command line.
|
| 155 |
|
|
|
| 156 |
|
|
Currently, GCC does not support combining LTO object files compiled
|
| 157 |
|
|
with different set of the command line options into a single binary.
|
| 158 |
|
|
At link time, the options given on the command line and the options
|
| 159 |
|
|
saved on all the files in a link-time set are applied globally. No
|
| 160 |
|
|
attempt is made at validating the combination of flags (other than the
|
| 161 |
|
|
usual validation done by option processing). This is implemented in
|
| 162 |
|
|
@file{lto/lto.c}:@code{lto_read_all_file_options}.
|
| 163 |
|
|
|
| 164 |
|
|
|
| 165 |
|
|
@item Symbol table (@code{.gnu.lto_.symtab})
|
| 166 |
|
|
|
| 167 |
|
|
This table replaces the ELF symbol table for functions and variables
|
| 168 |
|
|
represented in the LTO IL. Symbols used and exported by the optimized
|
| 169 |
|
|
assembly code of ``fat'' objects might not match the ones used and
|
| 170 |
|
|
exported by the intermediate code. This table is necessary because
|
| 171 |
|
|
the intermediate code is less optimized and thus requires a separate
|
| 172 |
|
|
symbol table.
|
| 173 |
|
|
|
| 174 |
|
|
Additionally, the binary code in the ``fat'' object will lack a call
|
| 175 |
|
|
to a function, since the call was optimized out at compilation time
|
| 176 |
|
|
after the intermediate language was streamed out. In some special
|
| 177 |
|
|
cases, the same optimization may not happen during link-time
|
| 178 |
|
|
optimization. This would lead to an undefined symbol if only one
|
| 179 |
|
|
symbol table was used.
|
| 180 |
|
|
|
| 181 |
|
|
The symbol table is emitted in
|
| 182 |
|
|
@file{lto-streamer-out.c}:@code{produce_symtab}.
|
| 183 |
|
|
|
| 184 |
|
|
|
| 185 |
|
|
@item Global declarations and types (@code{.gnu.lto_.decls})
|
| 186 |
|
|
|
| 187 |
|
|
This section contains an intermediate language dump of all
|
| 188 |
|
|
declarations and types required to represent the callgraph, static
|
| 189 |
|
|
variables and top-level debug info.
|
| 190 |
|
|
|
| 191 |
|
|
The contents of this section are emitted in
|
| 192 |
|
|
@file{lto-streamer-out.c}:@code{produce_asm_for_decls}. Types and
|
| 193 |
|
|
symbols are emitted in a topological order that preserves the sharing
|
| 194 |
|
|
of pointers when the file is read back in
|
| 195 |
|
|
(@file{lto.c}:@code{read_cgraph_and_symbols}).
|
| 196 |
|
|
|
| 197 |
|
|
|
| 198 |
|
|
@item The callgraph (@code{.gnu.lto_.cgraph})
|
| 199 |
|
|
|
| 200 |
|
|
This section contains the basic data structure used by the GCC
|
| 201 |
|
|
inter-procedural optimization infrastructure. This section stores an
|
| 202 |
|
|
annotated multi-graph which represents the functions and call sites as
|
| 203 |
|
|
well as the variables, aliases and top-level @code{asm} statements.
|
| 204 |
|
|
|
| 205 |
|
|
This section is emitted in
|
| 206 |
|
|
@file{lto-streamer-out.c}:@code{output_cgraph} and read in
|
| 207 |
|
|
@file{lto-cgraph.c}:@code{input_cgraph}.
|
| 208 |
|
|
|
| 209 |
|
|
|
| 210 |
|
|
@item IPA references (@code{.gnu.lto_.refs})
|
| 211 |
|
|
|
| 212 |
|
|
This section contains references between function and static
|
| 213 |
|
|
variables. It is emitted by @file{lto-cgraph.c}:@code{output_refs}
|
| 214 |
|
|
and read by @file{lto-cgraph.c}:@code{input_refs}.
|
| 215 |
|
|
|
| 216 |
|
|
|
| 217 |
|
|
@item Function bodies (@code{.gnu.lto_.function_body.<name>})
|
| 218 |
|
|
|
| 219 |
|
|
This section contains function bodies in the intermediate language
|
| 220 |
|
|
representation. Every function body is in a separate section to allow
|
| 221 |
|
|
copying of the section independently to different object files or
|
| 222 |
|
|
reading the function on demand.
|
| 223 |
|
|
|
| 224 |
|
|
Functions are emitted in
|
| 225 |
|
|
@file{lto-streamer-out.c}:@code{output_function} and read in
|
| 226 |
|
|
@file{lto-streamer-in.c}:@code{input_function}.
|
| 227 |
|
|
|
| 228 |
|
|
|
| 229 |
|
|
@item Static variable initializers (@code{.gnu.lto_.vars})
|
| 230 |
|
|
|
| 231 |
|
|
This section contains all the symbols in the global variable pool. It
|
| 232 |
|
|
is emitted by @file{lto-cgraph.c}:@code{output_varpool} and read in
|
| 233 |
|
|
@file{lto-cgraph.c}:@code{input_cgraph}.
|
| 234 |
|
|
|
| 235 |
|
|
@item Summaries and optimization summaries used by IPA passes
|
| 236 |
|
|
(@code{.gnu.lto_.<xxx>}, where @code{<xxx>} is one of @code{jmpfuncs},
|
| 237 |
|
|
@code{pureconst} or @code{reference})
|
| 238 |
|
|
|
| 239 |
|
|
These sections are used by IPA passes that need to emit summary
|
| 240 |
|
|
information during LTO generation to be read and aggregated at
|
| 241 |
|
|
link time. Each pass is responsible for implementing two pass manager
|
| 242 |
|
|
hooks: one for writing the summary and another for reading it in. The
|
| 243 |
|
|
format of these sections is entirely up to each individual pass. The
|
| 244 |
|
|
only requirement is that the writer and reader hooks agree on the
|
| 245 |
|
|
format.
|
| 246 |
|
|
@end itemize
|
| 247 |
|
|
|
| 248 |
|
|
|
| 249 |
|
|
@section Using summary information in IPA passes
|
| 250 |
|
|
|
| 251 |
|
|
Programs are represented internally as a @emph{callgraph} (a
|
| 252 |
|
|
multi-graph where nodes are functions and edges are call sites)
|
| 253 |
|
|
and a @emph{varpool} (a list of static and external variables in
|
| 254 |
|
|
the program).
|
| 255 |
|
|
|
| 256 |
|
|
The inter-procedural optimization is organized as a sequence of
|
| 257 |
|
|
individual passes, which operate on the callgraph and the
|
| 258 |
|
|
varpool. To make the implementation of WHOPR possible, every
|
| 259 |
|
|
inter-procedural optimization pass is split into several stages
|
| 260 |
|
|
that are executed at different times during WHOPR compilation:
|
| 261 |
|
|
|
| 262 |
|
|
@itemize @bullet
|
| 263 |
|
|
@item LGEN time
|
| 264 |
|
|
@enumerate
|
| 265 |
|
|
@item @emph{Generate summary} (@code{generate_summary} in
|
| 266 |
|
|
@code{struct ipa_opt_pass_d}). This stage analyzes every function
|
| 267 |
|
|
body and variable initializer is examined and stores relevant
|
| 268 |
|
|
information into a pass-specific data structure.
|
| 269 |
|
|
|
| 270 |
|
|
@item @emph{Write summary} (@code{write_summary} in
|
| 271 |
|
|
@code{struct ipa_opt_pass_d}). This stage writes all the
|
| 272 |
|
|
pass-specific information generated by @code{generate_summary}.
|
| 273 |
|
|
Summaries go into their own @code{LTO_section_*} sections that
|
| 274 |
|
|
have to be declared in @file{lto-streamer.h}:@code{enum
|
| 275 |
|
|
lto_section_type}. A new section is created by calling
|
| 276 |
|
|
@code{create_output_block} and data can be written using the
|
| 277 |
|
|
@code{lto_output_*} routines.
|
| 278 |
|
|
@end enumerate
|
| 279 |
|
|
|
| 280 |
|
|
@item WPA time
|
| 281 |
|
|
@enumerate
|
| 282 |
|
|
@item @emph{Read summary} (@code{read_summary} in
|
| 283 |
|
|
@code{struct ipa_opt_pass_d}). This stage reads all the
|
| 284 |
|
|
pass-specific information in exactly the same order that it was
|
| 285 |
|
|
written by @code{write_summary}.
|
| 286 |
|
|
|
| 287 |
|
|
@item @emph{Execute} (@code{execute} in @code{struct
|
| 288 |
|
|
opt_pass}). This performs inter-procedural propagation. This
|
| 289 |
|
|
must be done without actual access to the individual function
|
| 290 |
|
|
bodies or variable initializers. Typically, this results in a
|
| 291 |
|
|
transitive closure operation over the summary information of all
|
| 292 |
|
|
the nodes in the callgraph.
|
| 293 |
|
|
|
| 294 |
|
|
@item @emph{Write optimization summary}
|
| 295 |
|
|
(@code{write_optimization_summary} in @code{struct
|
| 296 |
|
|
ipa_opt_pass_d}). This writes the result of the inter-procedural
|
| 297 |
|
|
propagation into the object file. This can use the same data
|
| 298 |
|
|
structures and helper routines used in @code{write_summary}.
|
| 299 |
|
|
@end enumerate
|
| 300 |
|
|
|
| 301 |
|
|
@item LTRANS time
|
| 302 |
|
|
@enumerate
|
| 303 |
|
|
@item @emph{Read optimization summary}
|
| 304 |
|
|
(@code{read_optimization_summary} in @code{struct
|
| 305 |
|
|
ipa_opt_pass_d}). The counterpart to
|
| 306 |
|
|
@code{write_optimization_summary}. This reads the interprocedural
|
| 307 |
|
|
optimization decisions in exactly the same format emitted by
|
| 308 |
|
|
@code{write_optimization_summary}.
|
| 309 |
|
|
|
| 310 |
|
|
@item @emph{Transform} (@code{function_transform} and
|
| 311 |
|
|
@code{variable_transform} in @code{struct ipa_opt_pass_d}).
|
| 312 |
|
|
The actual function bodies and variable initializers are updated
|
| 313 |
|
|
based on the information passed down from the @emph{Execute} stage.
|
| 314 |
|
|
@end enumerate
|
| 315 |
|
|
@end itemize
|
| 316 |
|
|
|
| 317 |
|
|
The implementation of the inter-procedural passes are shared
|
| 318 |
|
|
between LTO, WHOPR and classic non-LTO compilation.
|
| 319 |
|
|
|
| 320 |
|
|
@itemize
|
| 321 |
|
|
@item During the traditional file-by-file mode every pass executes its
|
| 322 |
|
|
own @emph{Generate summary}, @emph{Execute}, and @emph{Transform}
|
| 323 |
|
|
stages within the single execution context of the compiler.
|
| 324 |
|
|
|
| 325 |
|
|
@item In LTO compilation mode, every pass uses @emph{Generate
|
| 326 |
|
|
summary} and @emph{Write summary} stages at compilation time,
|
| 327 |
|
|
while the @emph{Read summary}, @emph{Execute}, and
|
| 328 |
|
|
@emph{Transform} stages are executed at link time.
|
| 329 |
|
|
|
| 330 |
|
|
@item In WHOPR mode all stages are used.
|
| 331 |
|
|
@end itemize
|
| 332 |
|
|
|
| 333 |
|
|
To simplify development, the GCC pass manager differentiates
|
| 334 |
|
|
between normal inter-procedural passes and small inter-procedural
|
| 335 |
|
|
passes. A @emph{small inter-procedural pass}
|
| 336 |
|
|
(@code{SIMPLE_IPA_PASS}) is a pass that does
|
| 337 |
|
|
everything at once and thus it can not be executed during WPA in
|
| 338 |
|
|
WHOPR mode. It defines only the @emph{Execute} stage and during
|
| 339 |
|
|
this stage it accesses and modifies the function bodies. Such
|
| 340 |
|
|
passes are useful for optimization at LGEN or LTRANS time and are
|
| 341 |
|
|
used, for example, to implement early optimization before writing
|
| 342 |
|
|
object files. The simple inter-procedural passes can also be used
|
| 343 |
|
|
for easier prototyping and development of a new inter-procedural
|
| 344 |
|
|
pass.
|
| 345 |
|
|
|
| 346 |
|
|
|
| 347 |
|
|
@subsection Virtual clones
|
| 348 |
|
|
|
| 349 |
|
|
One of the main challenges of introducing the WHOPR compilation
|
| 350 |
|
|
mode was addressing the interactions between optimization passes.
|
| 351 |
|
|
In LTO compilation mode, the passes are executed in a sequence,
|
| 352 |
|
|
each of which consists of analysis (or @emph{Generate summary}),
|
| 353 |
|
|
propagation (or @emph{Execute}) and @emph{Transform} stages.
|
| 354 |
|
|
Once the work of one pass is finished, the next pass sees the
|
| 355 |
|
|
updated program representation and can execute. This makes the
|
| 356 |
|
|
individual passes dependent on each other.
|
| 357 |
|
|
|
| 358 |
|
|
In WHOPR mode all passes first execute their @emph{Generate
|
| 359 |
|
|
summary} stage. Then summary writing marks the end of the LGEN
|
| 360 |
|
|
stage. At WPA time,
|
| 361 |
|
|
the summaries are read back into memory and all passes run the
|
| 362 |
|
|
@emph{Execute} stage. Optimization summaries are streamed and
|
| 363 |
|
|
sent to LTRANS, where all the passes execute the @emph{Transform}
|
| 364 |
|
|
stage.
|
| 365 |
|
|
|
| 366 |
|
|
Most optimization passes split naturally into analysis,
|
| 367 |
|
|
propagation and transformation stages. But some do not. The
|
| 368 |
|
|
main problem arises when one pass performs changes and the
|
| 369 |
|
|
following pass gets confused by seeing different callgraphs
|
| 370 |
|
|
between the @emph{Transform} stage and the @emph{Generate summary}
|
| 371 |
|
|
or @emph{Execute} stage. This means that the passes are required
|
| 372 |
|
|
to communicate their decisions with each other.
|
| 373 |
|
|
|
| 374 |
|
|
To facilitate this communication, the GCC callgraph
|
| 375 |
|
|
infrastructure implements @emph{virtual clones}, a method of
|
| 376 |
|
|
representing the changes performed by the optimization passes in
|
| 377 |
|
|
the callgraph without needing to update function bodies.
|
| 378 |
|
|
|
| 379 |
|
|
A @emph{virtual clone} in the callgraph is a function that has no
|
| 380 |
|
|
associated body, just a description of how to create its body based
|
| 381 |
|
|
on a different function (which itself may be a virtual clone).
|
| 382 |
|
|
|
| 383 |
|
|
The description of function modifications includes adjustments to
|
| 384 |
|
|
the function's signature (which allows, for example, removing or
|
| 385 |
|
|
adding function arguments), substitutions to perform on the
|
| 386 |
|
|
function body, and, for inlined functions, a pointer to the
|
| 387 |
|
|
function that it will be inlined into.
|
| 388 |
|
|
|
| 389 |
|
|
It is also possible to redirect any edge of the callgraph from a
|
| 390 |
|
|
function to its virtual clone. This implies updating of the call
|
| 391 |
|
|
site to adjust for the new function signature.
|
| 392 |
|
|
|
| 393 |
|
|
Most of the transformations performed by inter-procedural
|
| 394 |
|
|
optimizations can be represented via virtual clones. For
|
| 395 |
|
|
instance, a constant propagation pass can produce a virtual clone
|
| 396 |
|
|
of the function which replaces one of its arguments by a
|
| 397 |
|
|
constant. The inliner can represent its decisions by producing a
|
| 398 |
|
|
clone of a function whose body will be later integrated into
|
| 399 |
|
|
a given function.
|
| 400 |
|
|
|
| 401 |
|
|
Using @emph{virtual clones}, the program can be easily updated
|
| 402 |
|
|
during the @emph{Execute} stage, solving most of pass interactions
|
| 403 |
|
|
problems that would otherwise occur during @emph{Transform}.
|
| 404 |
|
|
|
| 405 |
|
|
Virtual clones are later materialized in the LTRANS stage and
|
| 406 |
|
|
turned into real functions. Passes executed after the virtual
|
| 407 |
|
|
clone were introduced also perform their @emph{Transform} stage
|
| 408 |
|
|
on new functions, so for a pass there is no significant
|
| 409 |
|
|
difference between operating on a real function or a virtual
|
| 410 |
|
|
clone introduced before its @emph{Execute} stage.
|
| 411 |
|
|
|
| 412 |
|
|
Optimization passes then work on virtual clones introduced before
|
| 413 |
|
|
their @emph{Execute} stage as if they were real functions. The
|
| 414 |
|
|
only difference is that clones are not visible during the
|
| 415 |
|
|
@emph{Generate Summary} stage.
|
| 416 |
|
|
|
| 417 |
|
|
To keep function summaries updated, the callgraph interface
|
| 418 |
|
|
allows an optimizer to register a callback that is called every
|
| 419 |
|
|
time a new clone is introduced as well as when the actual
|
| 420 |
|
|
function or variable is generated or when a function or variable
|
| 421 |
|
|
is removed. These hooks are registered in the @emph{Generate
|
| 422 |
|
|
summary} stage and allow the pass to keep its information intact
|
| 423 |
|
|
until the @emph{Execute} stage. The same hooks can also be
|
| 424 |
|
|
registered during the @emph{Execute} stage to keep the
|
| 425 |
|
|
optimization summaries updated for the @emph{Transform} stage.
|
| 426 |
|
|
|
| 427 |
|
|
@subsection IPA references
|
| 428 |
|
|
|
| 429 |
|
|
GCC represents IPA references in the callgraph. For a function
|
| 430 |
|
|
or variable @code{A}, the @emph{IPA reference} is a list of all
|
| 431 |
|
|
locations where the address of @code{A} is taken and, when
|
| 432 |
|
|
@code{A} is a variable, a list of all direct stores and reads
|
| 433 |
|
|
to/from @code{A}. References represent an oriented multi-graph on
|
| 434 |
|
|
the union of nodes of the callgraph and the varpool. See
|
| 435 |
|
|
@file{ipa-reference.c}:@code{ipa_reference_write_optimization_summary}
|
| 436 |
|
|
and
|
| 437 |
|
|
@file{ipa-reference.c}:@code{ipa_reference_read_optimization_summary}
|
| 438 |
|
|
for details.
|
| 439 |
|
|
|
| 440 |
|
|
@subsection Jump functions
|
| 441 |
|
|
Suppose that an optimization pass sees a function @code{A} and it
|
| 442 |
|
|
knows the values of (some of) its arguments. The @emph{jump
|
| 443 |
|
|
function} describes the value of a parameter of a given function
|
| 444 |
|
|
call in function @code{A} based on this knowledge.
|
| 445 |
|
|
|
| 446 |
|
|
Jump functions are used by several optimizations, such as the
|
| 447 |
|
|
inter-procedural constant propagation pass and the
|
| 448 |
|
|
devirtualization pass. The inliner also uses jump functions to
|
| 449 |
|
|
perform inlining of callbacks.
|
| 450 |
|
|
|
| 451 |
|
|
@section Whole program assumptions, linker plugin and symbol visibilities
|
| 452 |
|
|
|
| 453 |
|
|
Link-time optimization gives relatively minor benefits when used
|
| 454 |
|
|
alone. The problem is that propagation of inter-procedural
|
| 455 |
|
|
information does not work well across functions and variables
|
| 456 |
|
|
that are called or referenced by other compilation units (such as
|
| 457 |
|
|
from a dynamically linked library). We say that such functions
|
| 458 |
|
|
are variables are @emph{externally visible}.
|
| 459 |
|
|
|
| 460 |
|
|
To make the situation even more difficult, many applications
|
| 461 |
|
|
organize themselves as a set of shared libraries, and the default
|
| 462 |
|
|
ELF visibility rules allow one to overwrite any externally
|
| 463 |
|
|
visible symbol with a different symbol at runtime. This
|
| 464 |
|
|
basically disables any optimizations across such functions and
|
| 465 |
|
|
variables, because the compiler cannot be sure that the function
|
| 466 |
|
|
body it is seeing is the same function body that will be used at
|
| 467 |
|
|
runtime. Any function or variable not declared @code{static} in
|
| 468 |
|
|
the sources degrades the quality of inter-procedural
|
| 469 |
|
|
optimization.
|
| 470 |
|
|
|
| 471 |
|
|
To avoid this problem the compiler must assume that it sees the
|
| 472 |
|
|
whole program when doing link-time optimization. Strictly
|
| 473 |
|
|
speaking, the whole program is rarely visible even at link-time.
|
| 474 |
|
|
Standard system libraries are usually linked dynamically or not
|
| 475 |
|
|
provided with the link-time information. In GCC, the whole
|
| 476 |
|
|
program option (@option{-fwhole-program}) asserts that every
|
| 477 |
|
|
function and variable defined in the current compilation
|
| 478 |
|
|
unit is static, except for function @code{main} (note: at
|
| 479 |
|
|
link time, the current unit is the union of all objects compiled
|
| 480 |
|
|
with LTO). Since some functions and variables need to
|
| 481 |
|
|
be referenced externally, for example by another DSO or from an
|
| 482 |
|
|
assembler file, GCC also provides the function and variable
|
| 483 |
|
|
attribute @code{externally_visible} which can be used to disable
|
| 484 |
|
|
the effect of @option{-fwhole-program} on a specific symbol.
|
| 485 |
|
|
|
| 486 |
|
|
The whole program mode assumptions are slightly more complex in
|
| 487 |
|
|
C++, where inline functions in headers are put into @emph{COMDAT}
|
| 488 |
|
|
sections. COMDAT function and variables can be defined by
|
| 489 |
|
|
multiple object files and their bodies are unified at link-time
|
| 490 |
|
|
and dynamic link-time. COMDAT functions are changed to local only
|
| 491 |
|
|
when their address is not taken and thus un-sharing them with a
|
| 492 |
|
|
library is not harmful. COMDAT variables always remain externally
|
| 493 |
|
|
visible, however for readonly variables it is assumed that their
|
| 494 |
|
|
initializers cannot be overwritten by a different value.
|
| 495 |
|
|
|
| 496 |
|
|
GCC provides the function and variable attribute
|
| 497 |
|
|
@code{visibility} that can be used to specify the visibility of
|
| 498 |
|
|
externally visible symbols (or alternatively an
|
| 499 |
|
|
@option{-fdefault-visibility} command line option). ELF defines
|
| 500 |
|
|
the @code{default}, @code{protected}, @code{hidden} and
|
| 501 |
|
|
@code{internal} visibilities.
|
| 502 |
|
|
|
| 503 |
|
|
The most commonly used is visibility is @code{hidden}. It
|
| 504 |
|
|
specifies that the symbol cannot be referenced from outside of
|
| 505 |
|
|
the current shared library. Unfortunately, this information
|
| 506 |
|
|
cannot be used directly by the link-time optimization in the
|
| 507 |
|
|
compiler since the whole shared library also might contain
|
| 508 |
|
|
non-LTO objects and those are not visible to the compiler.
|
| 509 |
|
|
|
| 510 |
|
|
GCC solves this problem using linker plugins. A @emph{linker
|
| 511 |
|
|
plugin} is an interface to the linker that allows an external
|
| 512 |
|
|
program to claim the ownership of a given object file. The linker
|
| 513 |
|
|
then performs the linking procedure by querying the plugin about
|
| 514 |
|
|
the symbol table of the claimed objects and once the linking
|
| 515 |
|
|
decisions are complete, the plugin is allowed to provide the
|
| 516 |
|
|
final object file before the actual linking is made. The linker
|
| 517 |
|
|
plugin obtains the symbol resolution information which specifies
|
| 518 |
|
|
which symbols provided by the claimed objects are bound from the
|
| 519 |
|
|
rest of a binary being linked.
|
| 520 |
|
|
|
| 521 |
|
|
Currently, the linker plugin works only in combination
|
| 522 |
|
|
with the Gold linker, but a GNU ld implementation is under
|
| 523 |
|
|
development.
|
| 524 |
|
|
|
| 525 |
|
|
GCC is designed to be independent of the rest of the toolchain
|
| 526 |
|
|
and aims to support linkers without plugin support. For this
|
| 527 |
|
|
reason it does not use the linker plugin by default. Instead,
|
| 528 |
|
|
the object files are examined by @command{collect2} before being
|
| 529 |
|
|
passed to the linker and objects found to have LTO sections are
|
| 530 |
|
|
passed to @command{lto1} first. This mode does not work for
|
| 531 |
|
|
library archives. The decision on what object files from the
|
| 532 |
|
|
archive are needed depends on the actual linking and thus GCC
|
| 533 |
|
|
would have to implement the linker itself. The resolution
|
| 534 |
|
|
information is missing too and thus GCC needs to make an educated
|
| 535 |
|
|
guess based on @option{-fwhole-program}. Without the linker
|
| 536 |
|
|
plugin GCC also assumes that symbols are declared @code{hidden}
|
| 537 |
|
|
and not referred by non-LTO code by default.
|
| 538 |
|
|
|
| 539 |
|
|
@section Internal flags controlling @code{lto1}
|
| 540 |
|
|
|
| 541 |
|
|
The following flags are passed into @command{lto1} and are not
|
| 542 |
|
|
meant to be used directly from the command line.
|
| 543 |
|
|
|
| 544 |
|
|
@itemize
|
| 545 |
|
|
@item -fwpa
|
| 546 |
|
|
@opindex fwpa
|
| 547 |
|
|
This option runs the serial part of the link-time optimizer
|
| 548 |
|
|
performing the inter-procedural propagation (WPA mode). The
|
| 549 |
|
|
compiler reads in summary information from all inputs and
|
| 550 |
|
|
performs an analysis based on summary information only. It
|
| 551 |
|
|
generates object files for subsequent runs of the link-time
|
| 552 |
|
|
optimizer where individual object files are optimized using both
|
| 553 |
|
|
summary information from the WPA mode and the actual function
|
| 554 |
|
|
bodies. It then drives the LTRANS phase.
|
| 555 |
|
|
|
| 556 |
|
|
@item -fltrans
|
| 557 |
|
|
@opindex fltrans
|
| 558 |
|
|
This option runs the link-time optimizer in the
|
| 559 |
|
|
local-transformation (LTRANS) mode, which reads in output from a
|
| 560 |
|
|
previous run of the LTO in WPA mode. In the LTRANS mode, LTO
|
| 561 |
|
|
optimizes an object and produces the final assembly.
|
| 562 |
|
|
|
| 563 |
|
|
@item -fltrans-output-list=@var{file}
|
| 564 |
|
|
@opindex fltrans-output-list
|
| 565 |
|
|
This option specifies a file to which the names of LTRANS output
|
| 566 |
|
|
files are written. This option is only meaningful in conjunction
|
| 567 |
|
|
with @option{-fwpa}.
|
| 568 |
|
|
@end itemize
|