OpenCores
URL https://opencores.org/ocsvn/openrisc/openrisc/trunk

Subversion Repositories openrisc

[/] [openrisc/] [trunk/] [gnu-dev/] [or1k-gcc/] [gcc/] [doc/] [lto.texi] - Blame information for rev 729

Go to most recent revision | Details | Compare with Previous | View Log

Line No. Rev Author Line
1 711 jeremybenn
@c Copyright (c) 2010 Free Software Foundation, Inc.
2
@c Free Software Foundation, Inc.
3
@c This is part of the GCC manual.
4
@c For copying conditions, see the file gcc.texi.
5
@c Contributed by Jan Hubicka <jh@suse.cz> and
6
@c Diego Novillo <dnovillo@google.com>
7
 
8
@node LTO
9
@chapter Link Time Optimization
10
@cindex lto
11
@cindex whopr
12
@cindex wpa
13
@cindex ltrans
14
 
15
@section Design Overview
16
 
17
Link time optimization is implemented as a GCC front end for a
18
bytecode representation of GIMPLE that is emitted in special sections
19
of @code{.o} files.  Currently, LTO support is enabled in most
20
ELF-based systems, as well as darwin, cygwin and mingw systems.
21
 
22
Since GIMPLE bytecode is saved alongside final object code, object
23
files generated with LTO support are larger than regular object files.
24
This ``fat'' object format makes it easy to integrate LTO into
25
existing build systems, as one can, for instance, produce archives of
26
the files.  Additionally, one might be able to ship one set of fat
27
objects which could be used both for development and the production of
28
optimized builds.  A, perhaps surprising, side effect of this feature
29
is that any mistake in the toolchain that leads to LTO information not
30
being used (e.g.@: an older @code{libtool} calling @code{ld} directly).
31
This is both an advantage, as the system is more robust, and a
32
disadvantage, as the user is not informed that the optimization has
33
been disabled.
34
 
35
The current implementation only produces ``fat'' objects, effectively
36
doubling compilation time and increasing file sizes up to 5x the
37
original size.  This hides the problem that some tools, such as
38
@code{ar} and @code{nm}, need to understand symbol tables of LTO
39
sections.  These tools were extended to use the plugin infrastructure,
40
and with these problems solved, GCC will also support ``slim'' objects
41
consisting of the intermediate code alone.
42
 
43
At the highest level, LTO splits the compiler in two.  The first half
44
(the ``writer'') produces a streaming representation of all the
45
internal data structures needed to optimize and generate code.  This
46
includes declarations, types, the callgraph and the GIMPLE representation
47
of function bodies.
48
 
49
When @option{-flto} is given during compilation of a source file, the
50
pass manager executes all the passes in @code{all_lto_gen_passes}.
51
Currently, this phase is composed of two IPA passes:
52
 
53
@itemize @bullet
54
@item @code{pass_ipa_lto_gimple_out}
55
This pass executes the function @code{lto_output} in
56
@file{lto-streamer-out.c}, which traverses the call graph encoding
57
every reachable declaration, type and function.  This generates a
58
memory representation of all the file sections described below.
59
 
60
@item @code{pass_ipa_lto_finish_out}
61
This pass executes the function @code{produce_asm_for_decls} in
62
@file{lto-streamer-out.c}, which takes the memory image built in the
63
previous pass and encodes it in the corresponding ELF file sections.
64
@end itemize
65
 
66
The second half of LTO support is the ``reader''.  This is implemented
67
as the GCC front end @file{lto1} in @file{lto/lto.c}.  When
68
@file{collect2} detects a link set of @code{.o}/@code{.a} files with
69
LTO information and the @option{-flto} is enabled, it invokes
70
@file{lto1} which reads the set of files and aggregates them into a
71
single translation unit for optimization.  The main entry point for
72
the reader is @file{lto/lto.c}:@code{lto_main}.
73
 
74
@subsection LTO modes of operation
75
 
76
One of the main goals of the GCC link-time infrastructure was to allow
77
effective compilation of large programs.  For this reason GCC implements two
78
link-time compilation modes.
79
 
80
@enumerate
81
@item   @emph{LTO mode}, in which the whole program is read into the
82
compiler at link-time and optimized in a similar way as if it
83
were a single source-level compilation unit.
84
 
85
@item   @emph{WHOPR or partitioned mode}, designed to utilize multiple
86
CPUs and/or a distributed compilation environment to quickly link
87
large applications.  WHOPR stands for WHOle Program optimizeR (not to
88
be confused with the semantics of @option{-fwhole-program}).  It
89
partitions the aggregated callgraph from many different @code{.o}
90
files and distributes the compilation of the sub-graphs to different
91
CPUs.
92
 
93
Note that distributed compilation is not implemented yet, but since
94
the parallelism is facilitated via generating a @code{Makefile}, it
95
would be easy to implement.
96
@end enumerate
97
 
98
WHOPR splits LTO into three main stages:
99
@enumerate
100
@item Local generation (LGEN)
101
This stage executes in parallel.  Every file in the program is compiled
102
into the intermediate language and packaged together with the local
103
call-graph and summary information.  This stage is the same for both
104
the LTO and WHOPR compilation mode.
105
 
106
@item Whole Program Analysis (WPA)
107
WPA is performed sequentially.  The global call-graph is generated, and
108
a global analysis procedure makes transformation decisions.  The global
109
call-graph is partitioned to facilitate parallel optimization during
110
phase 3.  The results of the WPA stage are stored into new object files
111
which contain the partitions of program expressed in the intermediate
112
language and the optimization decisions.
113
 
114
@item Local transformations (LTRANS)
115
This stage executes in parallel.  All the decisions made during phase 2
116
are implemented locally in each partitioned object file, and the final
117
object code is generated.  Optimizations which cannot be decided
118
efficiently during the phase 2 may be performed on the local
119
call-graph partitions.
120
@end enumerate
121
 
122
WHOPR can be seen as an extension of the usual LTO mode of
123
compilation.  In LTO, WPA and LTRANS are executed within a single
124
execution of the compiler, after the whole program has been read into
125
memory.
126
 
127
When compiling in WHOPR mode, the callgraph is partitioned during
128
the WPA stage.  The whole program is split into a given number of
129
partitions of roughly the same size.  The compiler tries to
130
minimize the number of references which cross partition boundaries.
131
The main advantage of WHOPR is to allow the parallel execution of
132
LTRANS stages, which are the most time-consuming part of the
133
compilation process.  Additionally, it avoids the need to load the
134
whole program into memory.
135
 
136
 
137
@section LTO file sections
138
 
139
LTO information is stored in several ELF sections inside object files.
140
Data structures and enum codes for sections are defined in
141
@file{lto-streamer.h}.
142
 
143
These sections are emitted from @file{lto-streamer-out.c} and mapped
144
in all at once from @file{lto/lto.c}:@code{lto_file_read}.  The
145
individual functions dealing with the reading/writing of each section
146
are described below.
147
 
148
@itemize @bullet
149
@item Command line options (@code{.gnu.lto_.opts})
150
 
151
This section contains the command line options used to generate the
152
object files.  This is used at link time to determine the optimization
153
level and other settings when they are not explicitly specified at the
154
linker command line.
155
 
156
Currently, GCC does not support combining LTO object files compiled
157
with different set of the command line options into a single binary.
158
At link time, the options given on the command line and the options
159
saved on all the files in a link-time set are applied globally.  No
160
attempt is made at validating the combination of flags (other than the
161
usual validation done by option processing).  This is implemented in
162
@file{lto/lto.c}:@code{lto_read_all_file_options}.
163
 
164
 
165
@item Symbol table (@code{.gnu.lto_.symtab})
166
 
167
This table replaces the ELF symbol table for functions and variables
168
represented in the LTO IL.  Symbols used and exported by the optimized
169
assembly code of ``fat'' objects might not match the ones used and
170
exported by the intermediate code.  This table is necessary because
171
the intermediate code is less optimized and thus requires a separate
172
symbol table.
173
 
174
Additionally, the binary code in the ``fat'' object will lack a call
175
to a function, since the call was optimized out at compilation time
176
after the intermediate language was streamed out.  In some special
177
cases, the same optimization may not happen during link-time
178
optimization.  This would lead to an undefined symbol if only one
179
symbol table was used.
180
 
181
The symbol table is emitted in
182
@file{lto-streamer-out.c}:@code{produce_symtab}.
183
 
184
 
185
@item Global declarations and types (@code{.gnu.lto_.decls})
186
 
187
This section contains an intermediate language dump of all
188
declarations and types required to represent the callgraph, static
189
variables and top-level debug info.
190
 
191
The contents of this section are emitted in
192
@file{lto-streamer-out.c}:@code{produce_asm_for_decls}.  Types and
193
symbols are emitted in a topological order that preserves the sharing
194
of pointers when the file is read back in
195
(@file{lto.c}:@code{read_cgraph_and_symbols}).
196
 
197
 
198
@item The callgraph (@code{.gnu.lto_.cgraph})
199
 
200
This section contains the basic data structure used by the GCC
201
inter-procedural optimization infrastructure.  This section stores an
202
annotated multi-graph which represents the functions and call sites as
203
well as the variables, aliases and top-level @code{asm} statements.
204
 
205
This section is emitted in
206
@file{lto-streamer-out.c}:@code{output_cgraph} and read in
207
@file{lto-cgraph.c}:@code{input_cgraph}.
208
 
209
 
210
@item IPA references (@code{.gnu.lto_.refs})
211
 
212
This section contains references between function and static
213
variables.  It is emitted by @file{lto-cgraph.c}:@code{output_refs}
214
and read by @file{lto-cgraph.c}:@code{input_refs}.
215
 
216
 
217
@item Function bodies (@code{.gnu.lto_.function_body.<name>})
218
 
219
This section contains function bodies in the intermediate language
220
representation.  Every function body is in a separate section to allow
221
copying of the section independently to different object files or
222
reading the function on demand.
223
 
224
Functions are emitted in
225
@file{lto-streamer-out.c}:@code{output_function} and read in
226
@file{lto-streamer-in.c}:@code{input_function}.
227
 
228
 
229
@item Static variable initializers (@code{.gnu.lto_.vars})
230
 
231
This section contains all the symbols in the global variable pool.  It
232
is emitted by @file{lto-cgraph.c}:@code{output_varpool} and read in
233
@file{lto-cgraph.c}:@code{input_cgraph}.
234
 
235
@item Summaries and optimization summaries used by IPA passes
236
(@code{.gnu.lto_.<xxx>}, where @code{<xxx>} is one of @code{jmpfuncs},
237
@code{pureconst} or @code{reference})
238
 
239
These sections are used by IPA passes that need to emit summary
240
information during LTO generation to be read and aggregated at
241
link time.  Each pass is responsible for implementing two pass manager
242
hooks: one for writing the summary and another for reading it in.  The
243
format of these sections is entirely up to each individual pass.  The
244
only requirement is that the writer and reader hooks agree on the
245
format.
246
@end itemize
247
 
248
 
249
@section Using summary information in IPA passes
250
 
251
Programs are represented internally as a @emph{callgraph} (a
252
multi-graph where nodes are functions and edges are call sites)
253
and a @emph{varpool} (a list of static and external variables in
254
the program).
255
 
256
The inter-procedural optimization is organized as a sequence of
257
individual passes, which operate on the callgraph and the
258
varpool.  To make the implementation of WHOPR possible, every
259
inter-procedural optimization pass is split into several stages
260
that are executed at different times during WHOPR compilation:
261
 
262
@itemize @bullet
263
@item LGEN time
264
@enumerate
265
@item @emph{Generate summary} (@code{generate_summary} in
266
@code{struct ipa_opt_pass_d}).  This stage analyzes every function
267
body and variable initializer is examined and stores relevant
268
information into a pass-specific data structure.
269
 
270
@item @emph{Write summary} (@code{write_summary} in
271
@code{struct ipa_opt_pass_d}).  This stage writes all the
272
pass-specific information generated by @code{generate_summary}.
273
Summaries go into their own @code{LTO_section_*} sections that
274
have to be declared in @file{lto-streamer.h}:@code{enum
275
lto_section_type}.  A new section is created by calling
276
@code{create_output_block} and data can be written using the
277
@code{lto_output_*} routines.
278
@end enumerate
279
 
280
@item WPA time
281
@enumerate
282
@item @emph{Read summary} (@code{read_summary} in
283
@code{struct ipa_opt_pass_d}).  This stage reads all the
284
pass-specific information in exactly the same order that it was
285
written by @code{write_summary}.
286
 
287
@item @emph{Execute} (@code{execute} in @code{struct
288
opt_pass}).  This performs inter-procedural propagation.  This
289
must be done without actual access to the individual function
290
bodies or variable initializers.  Typically, this results in a
291
transitive closure operation over the summary information of all
292
the nodes in the callgraph.
293
 
294
@item @emph{Write optimization summary}
295
(@code{write_optimization_summary} in @code{struct
296
ipa_opt_pass_d}).  This writes the result of the inter-procedural
297
propagation into the object file.  This can use the same data
298
structures and helper routines used in @code{write_summary}.
299
@end enumerate
300
 
301
@item LTRANS time
302
@enumerate
303
@item @emph{Read optimization summary}
304
(@code{read_optimization_summary} in @code{struct
305
ipa_opt_pass_d}).  The counterpart to
306
@code{write_optimization_summary}.  This reads the interprocedural
307
optimization decisions in exactly the same format emitted by
308
@code{write_optimization_summary}.
309
 
310
@item @emph{Transform} (@code{function_transform} and
311
@code{variable_transform} in @code{struct ipa_opt_pass_d}).
312
The actual function bodies and variable initializers are updated
313
based on the information passed down from the @emph{Execute} stage.
314
@end enumerate
315
@end itemize
316
 
317
The implementation of the inter-procedural passes are shared
318
between LTO, WHOPR and classic non-LTO compilation.
319
 
320
@itemize
321
@item During the traditional file-by-file mode every pass executes its
322
own @emph{Generate summary}, @emph{Execute}, and @emph{Transform}
323
stages within the single execution context of the compiler.
324
 
325
@item In LTO compilation mode, every pass uses @emph{Generate
326
summary} and @emph{Write summary} stages at compilation time,
327
while the @emph{Read summary}, @emph{Execute}, and
328
@emph{Transform} stages are executed at link time.
329
 
330
@item In WHOPR mode all stages are used.
331
@end itemize
332
 
333
To simplify development, the GCC pass manager differentiates
334
between normal inter-procedural passes and small inter-procedural
335
passes.  A @emph{small inter-procedural pass}
336
(@code{SIMPLE_IPA_PASS}) is a pass that does
337
everything at once and thus it can not be executed during WPA in
338
WHOPR mode.  It defines only the @emph{Execute} stage and during
339
this stage it accesses and modifies the function bodies.  Such
340
passes are useful for optimization at LGEN or LTRANS time and are
341
used, for example, to implement early optimization before writing
342
object files.  The simple inter-procedural passes can also be used
343
for easier prototyping and development of a new inter-procedural
344
pass.
345
 
346
 
347
@subsection Virtual clones
348
 
349
One of the main challenges of introducing the WHOPR compilation
350
mode was addressing the interactions between optimization passes.
351
In LTO compilation mode, the passes are executed in a sequence,
352
each of which consists of analysis (or @emph{Generate summary}),
353
propagation (or @emph{Execute}) and @emph{Transform} stages.
354
Once the work of one pass is finished, the next pass sees the
355
updated program representation and can execute.  This makes the
356
individual passes dependent on each other.
357
 
358
In WHOPR mode all passes first execute their @emph{Generate
359
summary} stage.  Then summary writing marks the end of the LGEN
360
stage.  At WPA time,
361
the summaries are read back into memory and all passes run the
362
@emph{Execute} stage.  Optimization summaries are streamed and
363
sent to LTRANS, where all the passes execute the @emph{Transform}
364
stage.
365
 
366
Most optimization passes split naturally into analysis,
367
propagation and transformation stages.  But some do not.  The
368
main problem arises when one pass performs changes and the
369
following pass gets confused by seeing different callgraphs
370
between the @emph{Transform} stage and the @emph{Generate summary}
371
or @emph{Execute} stage.  This means that the passes are required
372
to communicate their decisions with each other.
373
 
374
To facilitate this communication, the GCC callgraph
375
infrastructure implements @emph{virtual clones}, a method of
376
representing the changes performed by the optimization passes in
377
the callgraph without needing to update function bodies.
378
 
379
A @emph{virtual clone} in the callgraph is a function that has no
380
associated body, just a description of how to create its body based
381
on a different function (which itself may be a virtual clone).
382
 
383
The description of function modifications includes adjustments to
384
the function's signature (which allows, for example, removing or
385
adding function arguments), substitutions to perform on the
386
function body, and, for inlined functions, a pointer to the
387
function that it will be inlined into.
388
 
389
It is also possible to redirect any edge of the callgraph from a
390
function to its virtual clone.  This implies updating of the call
391
site to adjust for the new function signature.
392
 
393
Most of the transformations performed by inter-procedural
394
optimizations can be represented via virtual clones.  For
395
instance, a constant propagation pass can produce a virtual clone
396
of the function which replaces one of its arguments by a
397
constant.  The inliner can represent its decisions by producing a
398
clone of a function whose body will be later integrated into
399
a given function.
400
 
401
Using @emph{virtual clones}, the program can be easily updated
402
during the @emph{Execute} stage, solving most of pass interactions
403
problems that would otherwise occur during @emph{Transform}.
404
 
405
Virtual clones are later materialized in the LTRANS stage and
406
turned into real functions.  Passes executed after the virtual
407
clone were introduced also perform their @emph{Transform} stage
408
on new functions, so for a pass there is no significant
409
difference between operating on a real function or a virtual
410
clone introduced before its @emph{Execute} stage.
411
 
412
Optimization passes then work on virtual clones introduced before
413
their @emph{Execute} stage as if they were real functions.  The
414
only difference is that clones are not visible during the
415
@emph{Generate Summary} stage.
416
 
417
To keep function summaries updated, the callgraph interface
418
allows an optimizer to register a callback that is called every
419
time a new clone is introduced as well as when the actual
420
function or variable is generated or when a function or variable
421
is removed.  These hooks are registered in the @emph{Generate
422
summary} stage and allow the pass to keep its information intact
423
until the @emph{Execute} stage.  The same hooks can also be
424
registered during the @emph{Execute} stage to keep the
425
optimization summaries updated for the @emph{Transform} stage.
426
 
427
@subsection IPA references
428
 
429
GCC represents IPA references in the callgraph.  For a function
430
or variable @code{A}, the @emph{IPA reference} is a list of all
431
locations where the address of @code{A} is taken and, when
432
@code{A} is a variable, a list of all direct stores and reads
433
to/from @code{A}.  References represent an oriented multi-graph on
434
the union of nodes of the callgraph and the varpool.  See
435
@file{ipa-reference.c}:@code{ipa_reference_write_optimization_summary}
436
and
437
@file{ipa-reference.c}:@code{ipa_reference_read_optimization_summary}
438
for details.
439
 
440
@subsection Jump functions
441
Suppose that an optimization pass sees a function @code{A} and it
442
knows the values of (some of) its arguments.  The @emph{jump
443
function} describes the value of a parameter of a given function
444
call in function @code{A} based on this knowledge.
445
 
446
Jump functions are used by several optimizations, such as the
447
inter-procedural constant propagation pass and the
448
devirtualization pass.  The inliner also uses jump functions to
449
perform inlining of callbacks.
450
 
451
@section Whole program assumptions, linker plugin and symbol visibilities
452
 
453
Link-time optimization gives relatively minor benefits when used
454
alone.  The problem is that propagation of inter-procedural
455
information does not work well across functions and variables
456
that are called or referenced by other compilation units (such as
457
from a dynamically linked library).  We say that such functions
458
are variables are @emph{externally visible}.
459
 
460
To make the situation even more difficult, many applications
461
organize themselves as a set of shared libraries, and the default
462
ELF visibility rules allow one to overwrite any externally
463
visible symbol with a different symbol at runtime.  This
464
basically disables any optimizations across such functions and
465
variables, because the compiler cannot be sure that the function
466
body it is seeing is the same function body that will be used at
467
runtime.  Any function or variable not declared @code{static} in
468
the sources degrades the quality of inter-procedural
469
optimization.
470
 
471
To avoid this problem the compiler must assume that it sees the
472
whole program when doing link-time optimization.  Strictly
473
speaking, the whole program is rarely visible even at link-time.
474
Standard system libraries are usually linked dynamically or not
475
provided with the link-time information.  In GCC, the whole
476
program option (@option{-fwhole-program}) asserts that every
477
function and variable defined in the current compilation
478
unit is static, except for function @code{main} (note: at
479
link time, the current unit is the union of all objects compiled
480
with LTO).  Since some functions and variables need to
481
be referenced externally, for example by another DSO or from an
482
assembler file, GCC also provides the function and variable
483
attribute @code{externally_visible} which can be used to disable
484
the effect of @option{-fwhole-program} on a specific symbol.
485
 
486
The whole program mode assumptions are slightly more complex in
487
C++, where inline functions in headers are put into @emph{COMDAT}
488
sections.  COMDAT function and variables can be defined by
489
multiple object files and their bodies are unified at link-time
490
and dynamic link-time.  COMDAT functions are changed to local only
491
when their address is not taken and thus un-sharing them with a
492
library is not harmful.  COMDAT variables always remain externally
493
visible, however for readonly variables it is assumed that their
494
initializers cannot be overwritten by a different value.
495
 
496
GCC provides the function and variable attribute
497
@code{visibility} that can be used to specify the visibility of
498
externally visible symbols (or alternatively an
499
@option{-fdefault-visibility} command line option).  ELF defines
500
the @code{default}, @code{protected}, @code{hidden} and
501
@code{internal} visibilities.
502
 
503
The most commonly used is visibility is @code{hidden}.  It
504
specifies that the symbol cannot be referenced from outside of
505
the current shared library.  Unfortunately, this information
506
cannot be used directly by the link-time optimization in the
507
compiler since the whole shared library also might contain
508
non-LTO objects and those are not visible to the compiler.
509
 
510
GCC solves this problem using linker plugins.  A @emph{linker
511
plugin} is an interface to the linker that allows an external
512
program to claim the ownership of a given object file.  The linker
513
then performs the linking procedure by querying the plugin about
514
the symbol table of the claimed objects and once the linking
515
decisions are complete, the plugin is allowed to provide the
516
final object file before the actual linking is made.  The linker
517
plugin obtains the symbol resolution information which specifies
518
which symbols provided by the claimed objects are bound from the
519
rest of a binary being linked.
520
 
521
Currently, the linker plugin  works only in combination
522
with the Gold linker, but a GNU ld implementation is under
523
development.
524
 
525
GCC is designed to be independent of the rest of the toolchain
526
and aims to support linkers without plugin support.  For this
527
reason it does not use the linker plugin by default.  Instead,
528
the object files are examined by @command{collect2} before being
529
passed to the linker and objects found to have LTO sections are
530
passed to @command{lto1} first.  This mode does not work for
531
library archives.  The decision on what object files from the
532
archive are needed depends on the actual linking and thus GCC
533
would have to implement the linker itself.  The resolution
534
information is missing too and thus GCC needs to make an educated
535
guess based on @option{-fwhole-program}.  Without the linker
536
plugin GCC also assumes that symbols are declared @code{hidden}
537
and not referred by non-LTO code by default.
538
 
539
@section Internal flags controlling @code{lto1}
540
 
541
The following flags are passed into @command{lto1} and are not
542
meant to be used directly from the command line.
543
 
544
@itemize
545
@item -fwpa
546
@opindex fwpa
547
This option runs the serial part of the link-time optimizer
548
performing the inter-procedural propagation (WPA mode).  The
549
compiler reads in summary information from all inputs and
550
performs an analysis based on summary information only.  It
551
generates object files for subsequent runs of the link-time
552
optimizer where individual object files are optimized using both
553
summary information from the WPA mode and the actual function
554
bodies.  It then drives the LTRANS phase.
555
 
556
@item -fltrans
557
@opindex fltrans
558
This option runs the link-time optimizer in the
559
local-transformation (LTRANS) mode, which reads in output from a
560
previous run of the LTO in WPA mode.  In the LTRANS mode, LTO
561
optimizes an object and produces the final assembly.
562
 
563
@item -fltrans-output-list=@var{file}
564
@opindex fltrans-output-list
565
This option specifies a file to which the names of LTRANS output
566
files are written.  This option is only meaningful in conjunction
567
with @option{-fwpa}.
568
@end itemize

powered by: WebSVN 2.1.0

© copyright 1999-2024 OpenCores.org, equivalent to Oliscience, all rights reserved. OpenCores®, registered trademark.