OpenCores
URL https://opencores.org/ocsvn/openrisc/openrisc/trunk

Subversion Repositories openrisc

[/] [openrisc/] [trunk/] [gnu-dev/] [or1k-gcc/] [libstdc++-v3/] [doc/] [xml/] [manual/] [profile_mode.xml] - Blame information for rev 833

Profile Mode
Intro

  Goal: Give performance improvement advice based on
  recognition of suboptimal usage patterns of the standard library.

  Method: Wrap the standard library code.  Insert
  calls to an instrumentation library to record the internal state of
  various components at interesting entry/exit points to/from the standard
  library.  Process the trace, recognize suboptimal patterns, and give
  advice.  For details, see the paper presented at CGO 2009.

  Strengths:

  - Unintrusive solution.  The application code does not require any
    modification.
  - The advice is call context sensitive, thus capable of
    identifying precisely interesting dynamic performance behavior.
  - The overhead model is pay-per-view.  When you turn off a diagnostic class
    at compile time, its overhead disappears.

  Drawbacks:

  - You must recompile the application code with custom options.
  - You must run the application on representative input, since the advice
    is input dependent.
  - The execution time will increase, in some cases by factors.

Using the Profile Mode

  This is the anticipated common workflow for program foo.cc:

$ cat foo.cc
#include <vector>
int main() {
  std::vector<int> v;
  for (int k = 0; k < 1024; ++k) v.insert(v.begin(), k);
}

$ g++ -D_GLIBCXX_PROFILE foo.cc
$ ./a.out
$ cat libstdcxx-profile.txt
vector-to-list: improvement = 5: call stack = 0x804842c ...
    : advice = change std::vector to std::list
vector-size: improvement = 3: call stack = 0x804842c ...
    : advice = change initial container size from 0 to 1024

  Anatomy of a warning:

  - Warning id.  This is a short descriptive string for the class
    that this warning belongs to, e.g., "vector-to-list".

  - Estimated improvement.  This is an approximation of the benefit expected
    from implementing the change suggested by the warning.  It is given on
    a log10 scale.  Negative values mean that the alternative would actually
    do worse than the current choice.
    In the example above, 5 comes from the fact that the overhead of
    inserting at the beginning of a vector vs. a list is around 1024 * 1024 / 2,
    which is around 10e5.  The improvement from setting the initial size to
    1024 is in the range of 10e3, since the overhead of dynamic resizing is
    linear in this case.

  - Call stack.  Currently, the addresses are printed without
    symbol name or code location attribution.
    Users are expected to postprocess the output using, for instance, addr2line.

  - The warning message.  For some warnings, this is static text, e.g.,
    "change vector to list".  For other warnings, such as the one above,
    the message contains numeric advice, e.g., the suggested initial size
    of the vector.
  Three files are generated.  libstdcxx-profile.txt
  contains human readable advice.  libstdcxx-profile.raw
  contains implementation specific data about each diagnostic.
  Their format is not documented.  They are sufficient to generate
  all the advice given in libstdcxx-profile.txt.  The advantage
  of keeping this raw format is that traces from multiple executions can
  be aggregated simply by concatenating the raw traces.  We intend to
  offer an external utility program that can issue advice from a trace.
  libstdcxx-profile.conf.out lists the actual diagnostic
  parameters used.  To alter parameters, edit this file and rename it to
  libstdcxx-profile.conf.

  Advice is given regardless of whether the transformation is valid.
  For instance, we advise changing a map to an unordered_map even if the
  application semantics require that data be ordered.
  We believe such warnings can help users understand the performance
  behavior of their application better, which can lead to changes
  at a higher abstraction level.

Tuning the Profile Mode

  Compile time switches and environment variables (see also file
  profiler.h).  Unless specified otherwise, they can be set at compile time
  using -D_<name> or by setting variable <name>
  in the environment where the program is run, before starting execution.

  - _GLIBCXX_PROFILE_NO_<diagnostic>:
    disable specific diagnostics.
    See section Diagnostics for possible values.
    (Environment variables not supported.)
  - _GLIBCXX_PROFILE_TRACE_PATH_ROOT: set an alternative root
    path for the output files.
  - _GLIBCXX_PROFILE_MAX_WARN_COUNT: set it to the maximum
    number of warnings desired.  The default value is 10.
  - _GLIBCXX_PROFILE_MAX_STACK_DEPTH: if set to 0,
    the advice will be collected and reported for the program as a whole,
    and not for each call context.
    This could also be used in continuous regression tests, where you
    just need to know whether there is a regression or not.
    The default value is 32.
  - _GLIBCXX_PROFILE_MEM_PER_DIAGNOSTIC:
    set a limit on how much memory to use for the accounting tables for each
    diagnostic type.  When this limit is reached, new events are ignored
    until the memory usage decreases under the limit.  Generally, this means
    that newly created containers will not be instrumented until some
    live containers are deleted.  The default is 128 MB.
  - _GLIBCXX_PROFILE_NO_THREADS:
    Make the library not use threads.  If thread local storage (TLS) is not
    available, you will get a preprocessor error asking you to set
    -D_GLIBCXX_PROFILE_NO_THREADS if your program is single-threaded.
    Multithreaded execution without TLS is not supported.
    (Environment variable not supported.)
  - _GLIBCXX_HAVE_EXECINFO_H:
    This name should be defined automatically at library configuration time.
    If your library was configured without execinfo.h, but
    you have it in your include path, you can define it explicitly.  Without
    it, advice is collected for the program as a whole, and not for each
    call context.
    (Environment variable not supported.)
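  Putting the switches together, a tuning session might look like the
  following transcript (illustrative only: foo.cc is a stand-in program;
  the flag and variable spellings follow the list above):

```shell
# Turn profiling on, but disable one diagnostic class at compile time;
# its instrumentation overhead disappears entirely.
g++ -D_GLIBCXX_PROFILE -D_GLIBCXX_PROFILE_NO_VECTOR_TO_LIST foo.cc

# At run time, cap the report at 5 warnings and flatten call contexts
# (stack depth 0) for a quick regression-style check.
_GLIBCXX_PROFILE_MAX_WARN_COUNT=5 \
_GLIBCXX_PROFILE_MAX_STACK_DEPTH=0 ./a.out

cat libstdcxx-profile.txt
```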
Design

Profile Code Location

  Code Location                          Use
  ---------------------------------------------------------------------
  libstdc++-v3/include/std/*             Preprocessor code to redirect
                                         to profile extension headers.
  libstdc++-v3/include/profile/*         Profile extension public headers
                                         (map, vector, ...).
  libstdc++-v3/include/profile/impl/*    Profile extension internals.
                                         Implementation files are only
                                         included from impl/profiler.h,
                                         which is the only file included
                                         from the public headers.

Wrapper Model

  In order to get our instrumented library version included instead of the
  release one, we use the same wrapper model as the debug mode.
  We subclass entities from the release version.  Wherever
  _GLIBCXX_PROFILE is defined, the release namespace is
  std::__norm, whereas the profile namespace is
  std::__profile.  Using plain std translates
  into std::__profile.

  Whenever possible, we try to wrap at the public interface level, e.g.,
  in unordered_set rather than in hashtable,
  in order not to depend on the implementation.

  Mixing object files built with and without the profile mode must
  not affect the program execution.  However, there are no guarantees to
  the accuracy of diagnostics when using even a single object not built with
  -D_GLIBCXX_PROFILE.
  Currently, mixing the profile mode with debug and parallel extensions is
  not allowed.  Mixing them at compile time will result in preprocessor errors.
  Mixing them at link time is undefined.
Instrumentation

  Instead of instrumenting every public entry and exit point,
  we chose to add instrumentation on demand, as needed
  by individual diagnostics.
  The main reason is that some diagnostics require us to extract bits of
  internal state that are particular only to that diagnostic.
  We plan to formalize this later, after we learn more about the requirements
  of several diagnostics.

  All the instrumentation points can be switched on and off using
  -D_GLIBCXX_PROFILE_[NO_]<diagnostic> options.
  With all the instrumentation calls off, there should be negligible
  overhead over the release version.  This property is needed to support
  diagnostics based on timing of internal operations.  For such diagnostics,
  we anticipate turning most of the instrumentation off in order to prevent
  profiling overhead from polluting time measurements, and thus diagnostics.

  All the instrumentation on/off compile time switches live in
  include/profile/profiler.h.

Run Time Behavior

  For practical reasons, the instrumentation library processes the trace
  partially rather than dumping it to disk in raw form.  Each event is
  processed when it occurs.  It is usually assigned a cost and aggregated into
  the database of a specific diagnostic class.  The cost model
  is based largely on the standard performance guarantees, but in some
  cases we use knowledge about GCC's standard library implementation.

  Information is indexed by (1) call stack and (2) instance id or address
  to be able to understand and summarize precise creation-use-destruction
  dynamic chains.  Although the analysis is sensitive to dynamic instances,
  the reports are only sensitive to call context.  Whenever a dynamic instance
  is destroyed, we accumulate its effect to the corresponding entry for the
  call stack of its constructor location.

  For details, see the paper presented at CGO 2009.
Analysis and Diagnostics

  Final analysis takes place offline, and it is based entirely on the
  generated trace and debugging info in the application binary.
  See section Diagnostics for a list of analysis types that we plan to support.

  The input to the analysis is a table indexed by profile type and call stack.
  The data type for each entry depends on the profile type.

Cost Model

  While it is likely that cost models will become complex as we get into
  more sophisticated analysis, we will try to follow a simple set of rules
  at the beginning.

  - Relative benefit estimation:
    The idea is to estimate or measure the cost of all operations
    in the original scenario versus the scenario we advise switching to.
    For instance, when advising to change a vector to a list, an occurrence
    of the insert method will generally count as a benefit.
    Its magnitude depends on (1) the number of elements that get shifted
    and (2) whether it triggers a reallocation.
  - Synthetic measurements:
    We will measure the relative difference between similar operations on
    different containers.  We plan to write a battery of small tests that
    compare the times of the executions of similar methods on different
    containers.  The idea is to run these tests on the target machine.
    If this training phase is very quick, we may decide to perform it at
    library initialization time.  The results can be cached on disk and reused
    across runs.
  - Timers:
    We plan to use timers for operations of larger granularity, such as sort.
    For instance, we can switch between different sort methods on the fly
    and report the one that performs best for each call context.
  - Show stoppers:
    We may decide that the presence of an operation nullifies the advice.
    For instance, when considering switching from set to
    unordered_set, if we detect use of operator ++,
    we will simply not issue the advice, since this could signal that the
    use case requires a sorted container.

Reports

  There are two types of reports.  First, if we recognize a pattern for which
  we have a substitute that is likely to give better performance, we print
  the advice and the estimated performance gain.  The advice is usually
  associated with a code position and possibly a call stack.

  Second, we report performance characteristics for which we do not have
  a clear solution for improvement.  For instance, we can point the user to
  the top 10 multimap locations
  which have the worst data locality in actual traversals.
  Although this does not offer a solution,
  it helps the user focus on the key problems and ignore the uninteresting ones.

Testing

  First, we want to make sure we preserve the behavior of the release mode.
  You can just type "make check-profile", which
  builds and runs the whole test suite in profile mode.

  Second, we want to test the correctness of each diagnostic.
  We created a profile directory in the test suite.
  Each diagnostic must come with at least two tests, one for false positives
  and one for false negatives.

Extensions for Custom Containers

  Many large projects use their own data structures instead of the ones in the
  standard library.  If these data structures are similar in functionality
  to the standard library, they can be instrumented with the same hooks
  that are used to instrument the standard library.
  The instrumentation API is exposed in file
  profiler.h (look for "Instrumentation hooks").

Empirical Cost Model

  Currently, the cost model uses formulas with predefined relative weights
  for alternative containers or container implementations.  For instance,
  iterating through a vector is X times faster than iterating through a list.

  (Under development.)
  We are working on customizing this to a particular machine by providing
  an automated way to compute the actual relative weights for operations
  on the given machine.

  (Under development.)
  We plan to provide a performance parameter database format that can be
  filled in either by hand or by an automated training mechanism.
  The analysis module will then use this database instead of the built-in
  generic parameters.

Implementation Issues

Stack Traces

  Accurate stack traces are needed during profiling since we group events by
  call context and dynamic instance.  Without accurate traces, diagnostics
  may be hard to interpret.  For instance, when giving advice to the user
  it is imperative to reference application code, not library code.

  Currently we are using the libc backtrace routine to get
  stack traces.
  _GLIBCXX_PROFILE_MAX_STACK_DEPTH can be set
  to 0 if you are willing to give up call context information, or to a small
  positive value to reduce run time overhead.

Symbolization of Instruction Addresses

  The profiling and analysis phases use only instruction addresses.
  An external utility such as addr2line is needed to postprocess the result.
  We do not plan to add symbolization support in the profile extension.
  This would require access to symbol tables, debug information tables,
  external programs or libraries and other system dependent information.

Concurrency

  Our current model is simplistic, but precise.
  We cannot afford to approximate because some of our diagnostics require
  precise matching of operations to container instance and call context.
  During profiling, we keep a single information table per diagnostic.
  There is a single lock per information table.

Using the Standard Library in the Instrumentation Implementation

  As much as we would like to avoid uses of libstdc++ within our
  instrumentation library, containers such as unordered_map are very
  appealing.  We plan to use them as long as they are named properly
  to avoid ambiguity.

Malloc Hooks

  User applications/libraries can provide malloc hooks.
  When the implementation of the malloc hooks uses libstdc++, there can
  be an infinite cycle between the profile mode instrumentation and the
  malloc hook code.

  We protect against reentrance to the profile mode instrumentation code,
  which should avoid this problem in most cases.
  The protection mechanism is thread safe and exception safe.
  This mechanism does not prevent reentrance to the malloc hook itself,
  which could still result in deadlock if, for instance, the malloc hook
  uses non-recursive locks.
  XXX: A definitive solution to this problem would be for the profile extension
  to use a custom allocator internally, and perhaps not to use libstdc++.

Construction and Destruction of Global Objects

  The profiling library state is initialized at the first call to a profiling
  method.  This allows us to record the construction of all global objects.
  However, we cannot do the same at destruction time.  The trace is written
  by a function registered by atexit, thus invoked by
  exit.

Developer Information

Big Picture

  The profile mode headers are included with
  -D_GLIBCXX_PROFILE through preprocessor directives in
  include/std/*.

  Instrumented implementations are provided in
  include/profile/*.  All instrumentation hooks are macros
  defined in include/profile/profiler.h.

  All the implementation of the instrumentation hooks is in
  include/profile/impl/*.  Although all the code gets included,
  and thus is publicly visible, only a small number of functions are called
  from outside this directory.  All calls to hook implementations must be
  done through macros defined in profiler.h.  The macro
  must ensure (1) that the call is guarded against reentrance and
  (2) that the call can be turned off at compile time using a
  -D_GLIBCXX_PROFILE_... compiler option.

How To Add A Diagnostic

  Let's say the diagnostic name is "magic".

  If you need to instrument a header not already under
  include/profile/*, first edit the corresponding header
  under include/std/ and add a preprocessor directive such
  as the one in include/std/vector:

#ifdef _GLIBCXX_PROFILE
# include <profile/vector>
#endif

  If the file you need to instrument is not yet under
  include/profile/, make a copy of the one in
  include/debug, or the main implementation.
  You'll need to include the main implementation and inherit the classes
  you want to instrument.  Then define the methods you want to instrument,
  define the instrumentation hooks and add calls to them.
  Look at include/profile/vector for an example.

  Add macros for the instrumentation hooks in
  include/profile/impl/profiler.h.
  Hook names must start with __profcxx_.
  Make sure they expand to no code with
  -D_GLIBCXX_PROFILE_NO_MAGIC.
  Make sure all calls to any method in namespace __gnu_profile
  are protected against reentrance using macro
  _GLIBCXX_PROFILE_REENTRANCE_GUARD.
  All names of methods in namespace __gnu_profile called from
  profiler.h must start with __trace_magic_.

  Add the implementation of the diagnostic.

  - Create new file include/profile/impl/profiler_magic.h.
  - Define class __magic_info: public __object_info_base.
    This is the representation of a line in the object table.
    The __merge method is used to aggregate information
    across all dynamic instances created at the same call context.
    The __magnitude must return the estimation of the benefit
    as a number of small operations, e.g., number of words copied.
    The __write method is used to produce the raw trace.
    The __advice method is used to produce the advice string.
  - Define class __magic_stack_info: public __magic_info.
    This defines the content of a line in the stack table.
  - Define class __trace_magic: public __trace_base<__magic_info,
    __magic_stack_info>.
    It defines the content of the trace associated with this diagnostic.
  Add initialization and reporting calls in
  include/profile/impl/profiler_trace.h.  Use
  __trace_vector_to_list as an example.

  Add documentation in file doc/xml/manual/profile_mode.xml.

Diagnostics

  The table below presents all the diagnostics we intend to implement.
  Each diagnostic has a corresponding compile time switch
  -D_GLIBCXX_PROFILE_<diagnostic>.
  Groups of related diagnostics can be turned on with a single switch.
  For instance, -D_GLIBCXX_PROFILE_LOCALITY is equivalent to
  -D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH
  -D_GLIBCXX_PROFILE_RBTREE_LOCALITY.

  The benefit, cost, expected frequency and accuracy of each diagnostic
  was given a grade from 1 to 10, where 10 is highest.
  A high benefit means that, if the diagnostic is accurate, the expected
  performance improvement is high.
  A high cost means that turning this diagnostic on leads to high slowdown.
  A high frequency means that we expect this to occur relatively often.
  A high accuracy means that the diagnostic is unlikely to be wrong.
  These grades are not perfect.  They are just meant to guide users with
  specific needs or time budgets.

Profile Diagnostics

  Group       Flag                   Benefit  Cost  Freq.  Implemented
  --------------------------------------------------------------------------
  CONTAINERS  HASHTABLE_TOO_SMALL    10       1     10     yes
              HASHTABLE_TOO_LARGE    5        1     10     yes
              INEFFICIENT_HASH       7        3     10     yes
              VECTOR_TOO_SMALL       8        1     10     yes
              VECTOR_TOO_LARGE       5        1     10     yes
              VECTOR_TO_HASHTABLE    7        7     10     no
              HASHTABLE_TO_VECTOR    7        7     10     no
              VECTOR_TO_LIST         8        5     10     yes
              LIST_TO_VECTOR         10       5     10     no
              ORDERED_TO_UNORDERED   10       5     10     only map/unordered_map
  ALGORITHMS  SORT                   7        8     7      no
  LOCALITY    SOFTWARE_PREFETCH      8        8     5      no
              RBTREE_LOCALITY        4        8     5      no
              FALSE_SHARING          8        10    10     no

Diagnostic Template

  - Switch:
    _GLIBCXX_PROFILE_<diagnostic>.
  - Goal:  What problem will it diagnose?
  - Fundamentals:
    What is the fundamental reason why this is a problem?
  - Sample runtime reduction:
    Percentage reduction in execution time.  When reduction is more than
    a constant factor, describe the reduction rate formula.
  - Recommendation:
    What would the advice look like?
  - To instrument:
    What libstdc++ components need to be instrumented?
  - Analysis:
    How do we decide when to issue the advice?
  - Cost model:
    How do we measure benefits?  Math goes here.
  - Example:

program code
...
advice sample

Containers

  Switch:
  _GLIBCXX_PROFILE_CONTAINERS.

Hashtable Too Small

  - Switch:
    _GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL.
  - Goal: Detect hashtables with many
    rehash operations, small construction size and large destruction size.
  - Fundamentals: Rehash is very expensive.
    Read content, follow chains within bucket, evaluate hash function, place at
    new location in different order.
  - Sample runtime reduction: 36%.
    Code similar to example below.
  - Recommendation:
    Set initial size to N at construction site S.
  - To instrument:
    unordered_set, unordered_map constructor, destructor, rehash.
  - Analysis:
    For each dynamic instance of unordered_[multi]set|map,
    record initial size and call context of the constructor.
    Record size increase, if any, after each relevant operation such as insert.
    Record the estimated rehash cost.
  - Cost model:
    Number of individual rehash operations * cost per rehash.
  - Example:

1 unordered_set<int> us;
2 for (int k = 0; k < 1000000; ++k) {
3   us.insert(k);
4 }

foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
Hashtable Too Large

  - Switch:
    _GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE.
  - Goal: Detect hashtables which are
    never filled up because fewer elements than reserved are ever
    inserted.
  - Fundamentals: Save memory, which
    is good in itself and may also improve memory reference performance through
    fewer cache and TLB misses.
  - Sample runtime reduction: unknown.
  - Recommendation:
    Set initial size to N at construction site S.
  - To instrument:
    unordered_set, unordered_map constructor, destructor, rehash.
  - Analysis:
    For each dynamic instance of unordered_[multi]set|map,
    record initial size and call context of the constructor, and correlate it
    with its size at destruction time.
  - Cost model:
    Number of iteration operations + memory saved.
  - Example:

1 vector<unordered_set<int>> v(100000, unordered_set<int>(100));
2 for (int k = 0; k < 100000; ++k) {
3   for (int j = 0; j < 10; ++j) {
4     v[k].insert(k + j);
5   }
6 }

foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
bytes of memory and M iteration steps.
Inefficient Hash

  - Switch:
    _GLIBCXX_PROFILE_INEFFICIENT_HASH.
  - Goal: Detect hashtables with polarized
    distribution.
  - Fundamentals: A non-uniform
    distribution may lead to long chains, thus possibly increasing complexity
    by a factor up to the number of elements.
  - Sample runtime reduction: factor up
    to container size.
  - Recommendation: Change hash function
    for container built at site S.  Distribution score = N.  Access score = S.
    Longest chain = C, in bucket B.
  - To instrument:
    unordered_set, unordered_map constructor, destructor, [],
    insert, iterator.
  - Analysis:
    Count the exact number of link traversals.
  - Cost model:
    Total number of links traversed.
  - Example:

class dumb_hash {
 public:
  size_t operator() (int i) const { return 0; }
};
...
  unordered_set<int, dumb_hash> hs;
  ...
  for (int i = 0; i < COUNT; ++i) {
    hs.find(i);
  }
Vector Too Small

  - Switch:
    _GLIBCXX_PROFILE_VECTOR_TOO_SMALL.
  - Goal: Detect vectors with many
    resize operations, small construction size and large destruction size.
  - Fundamentals: Resizing can be expensive.
    Copying large amounts of data takes time.  Resizing many small vectors may
    have allocation overhead and affect locality.
  - Sample runtime reduction: %.
  - Recommendation:
    Set initial size to N at construction site S.
  - To instrument: vector.
  - Analysis:
    For each dynamic instance of vector,
    record initial size and call context of the constructor.
    Record size increase, if any, after each relevant operation such as
    push_back.  Record the estimated resize cost.
  - Cost model:
    Total number of words copied * time to copy a word.
  - Example:

1 vector<int> v;
2 for (int k = 0; k < 1000000; ++k) {
3   v.push_back(k);
4 }

foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
copying 4000000 bytes and 20 memory allocations and deallocations.
Vector Too Large

  Switch:
  _GLIBCXX_PROFILE_VECTOR_TOO_LARGE.

  Goal: Detect vectors which are never filled up because fewer elements
  than reserved are ever inserted.

  Fundamentals: Save memory, which is good in itself and may also improve
  memory reference performance through fewer cache and TLB misses.

  Sample runtime reduction: %.

  Recommendation:
  Set initial size to N at construction site S.

  To instrument: vector.

  Analysis:
  For each dynamic instance of vector, record initial size and call
  context of the constructor, and correlate it with its size at
  destruction time.

  Cost model:
  Total amount of memory saved.

  Example:

1 vector<vector<int>> v(100000, vector<int>(100));
2 for (int k = 0; k < 100000; ++k) {
3   for (int j = 0; j < 10; ++j) {
4     v[k].insert(v[k].begin(), k + j);
5   }
6 }

foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
bytes of memory and may reduce the number of cache and TLB misses.

Vector to Hashtable

  Switch:
  _GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE.

  Goal: Detect uses of vector that can be substituted with unordered_set
  to reduce execution time.

  Fundamentals:
  Linear search in a vector is very expensive, whereas searching in a
  hashtable is very quick.

  Sample runtime reduction: factor up to container size.

  Recommendation: Replace vector with unordered_set at site S.

  To instrument: vector operations and access methods.

  Analysis:
  For each dynamic instance of vector, record call context of the
  constructor.  Issue the advice only if the only methods called on this
  vector are push_back, insert and find.

  Cost model:
  Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) -
  (cost(unordered_set::insert) + cost(unordered_set::find)).

  Example:

1  vector<int> v;
...
2  for (int i = 0; i < 1000; ++i) {
3    find(v.begin(), v.end(), i);
4  }

foo.cc:1: advice: Changing "vector" to "unordered_set" will save about
500,000 comparisons.

Hashtable to Vector

  Switch:
  _GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR.

  Goal: Detect uses of unordered_set that can be substituted with vector
  to reduce execution time.

  Fundamentals:
  Hashtable iterator is slower than vector iterator.

  Sample runtime reduction: 95%.

  Recommendation: Replace unordered_set with vector at site S.

  To instrument: unordered_set operations and access methods.

  Analysis:
  For each dynamic instance of unordered_set, record call context of the
  constructor.  Issue the advice only if the number of find, insert and
  [] operations on this unordered_set is small relative to the number of
  elements, and methods begin or end are invoked (suggesting iteration).

  Cost model:
  Number of indirections saved.

  Example:

1  unordered_set<int> us;
...
2  int s = 0;
3  for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {
4    s += *it;
5  }

foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
indirections and may achieve better data locality.

Vector to List

  Switch:
  _GLIBCXX_PROFILE_VECTOR_TO_LIST.

  Goal: Detect cases where vector could be substituted with list for
  better performance.

  Fundamentals:
  Inserting in the middle of a vector is expensive compared to inserting
  in a list.

  Sample runtime reduction: factor up to container size.

  Recommendation: Replace vector with list at site S.

  To instrument: vector operations and access methods.

  Analysis:
  For each dynamic instance of vector, record the call context of the
  constructor.  Record the overhead of each insert operation based on
  current size and insert position.  Report instances with high insertion
  overhead.

  Cost model:
  (Sum(cost(vector::method)) - Sum(cost(list::method)), for
  method in [push_back, insert, erase])
  + (Cost(iterate vector) - Cost(iterate list))

  Example:

1  vector<int> v;
2  for (int i = 0; i < 10000; ++i) {
3    v.insert(v.begin(), i);
4  }

foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
operations.

List to Vector

  Switch:
  _GLIBCXX_PROFILE_LIST_TO_VECTOR.

  Goal: Detect cases where list could be substituted with vector for
  better performance.

  Fundamentals:
  Iterating through a vector is faster than through a list.

  Sample runtime reduction: 64%.

  Recommendation: Replace list with vector at site S.

  To instrument: list operations and access methods.

  Analysis:
  Issue the advice if there are no insert operations.

  Cost model:
  (Sum(cost(vector::method)) - Sum(cost(list::method)), for
  method in [push_back, insert, erase])
  + (Cost(iterate vector) - Cost(iterate list))

  Example:

1  list<int> l;
...
2  int sum = 0;
3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
4    sum += *it;
5  }

foo.cc:1: advice: Changing "list" to "vector" will save about 1000000
indirect memory references.

List to Forward List (Slist)

  Switch:
  _GLIBCXX_PROFILE_LIST_TO_SLIST.

  Goal: Detect cases where list could be substituted with forward_list
  for better performance.

  Fundamentals:
  The memory footprint of a forward_list is smaller than that of a list.
  This has beneficial effects on the memory subsystem, e.g., fewer cache
  misses.

  Sample runtime reduction: 40%.
  Note that the reduction is only noticeable if the size of the list node
  is in fact larger than that of the forward_list node.  For memory
  allocators with size classes, you will only notice an effect when the
  two node sizes belong to different allocator size classes.

  Recommendation: Replace list with forward_list at site S.

  To instrument: list operations and iteration methods.

  Analysis:
  Issue the advice if there are no backwards traversals or insertions
  before a given node.

  Cost model:
  Always true.

  Example:

1  list<int> l;
...
2  int sum = 0;
3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
4    sum += *it;
5  }

foo.cc:1: advice: Change "list" to "forward_list".

Ordered to Unordered Associative Container

  Switch:
  _GLIBCXX_PROFILE_ORDERED_TO_UNORDERED.

  Goal: Detect cases where ordered associative containers can be replaced
  with unordered ones.

  Fundamentals:
  Insert and search are quicker in a hashtable than in a red-black tree.

  Sample runtime reduction: 52%.

  Recommendation:
  Replace set with unordered_set at site S.

  To instrument:
  set, multiset, map, multimap methods.

  Analysis:
  Issue the advice only if we are not using operator ++ on any iterator
  on a particular [multi]set|map.

  Cost model:
  (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for
  method in [insert, erase, find])
  + (Cost(iterate hashtable) - Cost(iterate rbtree))

  Example:

1  set<int> s;
2  for (int i = 0; i < 100000; ++i) {
3    s.insert(i);
4  }
5  int sum = 0;
6  for (int i = 0; i < 100000; ++i) {
7    sum += *s.find(i);
8  }

Algorithms

  Switch:
  _GLIBCXX_PROFILE_ALGORITHMS.

Sort Algorithm Performance

  Switch:
  _GLIBCXX_PROFILE_SORT.

  Goal: Give a measure of sort algorithm performance based on actual
  input.  For instance, advise Radix Sort over Quick Sort for a
  particular call context.

  Fundamentals:
  See papers:
  A framework for adaptive algorithm selection in STAPL and
  Optimizing Sorting with Machine Learning Algorithms.

  Sample runtime reduction: 60%.

  Recommendation: Change sort algorithm at site S from X Sort to Y Sort.

  To instrument: sort algorithm.

  Analysis:
  Issue the advice if the cost model tells us that another sort algorithm
  would do better on this input.  Requires us to know what algorithm we
  are using in our sort implementation in release mode.

  Cost model:
  Runtime(algo) for algo in [radix, quick, merge, ...]

  Example:

Data Locality

  Switch:
  _GLIBCXX_PROFILE_LOCALITY.

Need Software Prefetch

  Switch:
  _GLIBCXX_PROFILE_SOFTWARE_PREFETCH.

  Goal: Discover sequences of indirect memory accesses that are not
  regular, thus cannot be predicted by hardware prefetchers.

  Fundamentals:
  Indirect references are hard to predict and are very expensive when
  they miss in caches.

  Sample runtime reduction: 25%.

  Recommendation: Insert prefetch instruction.

  To instrument: Vector iterator and access operator [].

  Analysis:
  First, get cache line size and page size from system.
  Then record iterator dereference sequences for which the value is a
  pointer.  For each sequence within a container, issue a warning if
  successive pointer addresses are not within cache lines and do not form
  a linear pattern (otherwise they may be prefetched by hardware).
  If they also step across page boundaries, make the warning stronger.

  The same analysis applies to containers other than vector.
  However, we cannot give the same advice for linked structures, such as
  list, as there is no random access to the n-th element.  The user may
  still be able to benefit from this information, for instance by
  employing frays (user level light weight threads) to hide the latency
  of chasing pointers.

  This analysis is a little oversimplified.  A better cost model could be
  created by understanding the capability of the hardware prefetcher.
  This model could be trained automatically by running a set of synthetic
  cases.

  Cost model:
  Total distance between pointer values of successive elements in vectors
  of pointers.

  Example:

1 int zero = 0;
2 vector<int*> v(10000000, &zero);
3 for (int k = 0; k < 10000000; ++k) {
4   v[random() % 10000000] = new int(k);
5 }
6 for (int j = 0; j < 10000000; ++j) {
7   count += (*v[j] == 0 ? 0 : 1);
8 }

foo.cc:7: advice: Insert prefetch instruction.

Linked Structure Locality

  Switch:
  _GLIBCXX_PROFILE_RBTREE_LOCALITY.

  Goal: Give a measure of locality of objects stored in linked structures
  (lists, red-black trees and hashtables) with respect to their actual
  traversal patterns.

  Fundamentals: Allocation can be tuned to a specific traversal pattern,
  to result in better data locality.
  See paper:
  Custom Memory Allocation for Free.

  Sample runtime reduction: 30%.

  Recommendation:
  High scatter score N for container built at site S.
  Consider changing the allocation sequence or choosing a structure
  conscious allocator.

  To instrument: Methods of all containers using linked structures.

  Analysis:
  First, get cache line size and page size from system.
  Then record the number of successive elements that are on a different
  line or page, for each traversal method such as find.  Give advice
  only if the ratio between this number and the total number of node hops
  is above a threshold.

  Cost model:
  Sum(same_cache_line(this, previous))

  Example:

 1  set<int> s;
 2  for (int i = 0; i < 10000000; ++i) {
 3    s.insert(i);
 4  }
 5  set<int> s1, s2;
 6  for (int i = 0; i < 10000000; ++i) {
 7    s1.insert(i);
 8    s2.insert(i);
 9  }
...
      // Fast, better locality.
10    for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {
11      sum += *it;
12    }
      // Slow, elements are further apart.
13    for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {
14      sum += *it;
15    }

foo.cc:5: advice: High scatter score NNN for set built here.  Consider
changing the allocation sequence or switching to a structure conscious
allocator.

Multithreaded Data Access

  The diagnostics in this group are not meant to be implemented short
  term.  They require compiler support to know when container elements
  are written to.  Instrumentation can only tell us when elements are
  referenced.

  Switch:
  _GLIBCXX_PROFILE_MULTITHREADED.

Data Dependence Violations at Container Level

  Switch:
  _GLIBCXX_PROFILE_DDTEST.

  Goal: Detect container elements that are referenced from multiple
  threads in the parallel region or across parallel regions.

  Fundamentals:
  Sharing data between threads requires communication and perhaps
  locking, which may be expensive.

  Sample runtime reduction: ?%.

  Recommendation: Change data distribution or parallel algorithm.

  To instrument: Container access methods and iterators.

  Analysis:
  Keep a shadow for each container.  Record iterator dereferences and
  container member accesses.  Issue advice for elements referenced by
  multiple threads.
  See paper:
  The LRPD test: speculative run-time parallelization of loops with
  privatization and reduction parallelization.

  Cost model:
  Number of accesses to elements referenced from multiple threads.

  Example:

False Sharing

  Switch:
  _GLIBCXX_PROFILE_FALSE_SHARING.

  Goal: Detect elements in the same container which share a cache line,
  are written by at least one thread, and accessed by different threads.

  Fundamentals: Under these assumptions, cache protocols require
  communication to invalidate lines, which may be expensive.

  Sample runtime reduction: 68%.

  Recommendation: Reorganize container or use padding to avoid false
  sharing.

  To instrument: Container access methods and iterators.

  Analysis:
  First, get the cache line size.
  For each shared container, record all the associated iterator
  dereferences and member access methods with the thread id.  Compare the
  address lists across threads to detect references in two different
  threads to the same cache line.  Issue a warning only if the ratio to
  total references is significant.  Do the same for iterator dereference
  values if they are pointers.

  Cost model:
  Number of accesses to the same cache line from different threads.

  Example:

1     vector<int> v(2, 0);
2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
3     for (i = 0; i < SIZE; ++i) {
4       v[i % 2] += i;
5     }

OMP_NUM_THREADS=2 ./a.out
foo.cc:1: advice: Change container structure or padding to avoid false
sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.

Statistics

  Switch:
  _GLIBCXX_PROFILE_STATISTICS.

  In some cases the cost model may not tell us anything because the costs
  appear to offset the benefits.  Consider the choice between a vector
  and a list.  When there are both inserts and iteration, an automatic
  advice may not be issued.  However, the programmer may still be able to
  make use of this information in a different way.

  This diagnostic will not issue any advice, but it will print statistics
  for each container construction site.  The statistics will contain the
  cost of each operation actually performed on the container.

Bibliography

  Perflint: A Context Sensitive Performance Advisor for C++ Programs.
  Lixia Liu and Silvius Rus.
  Proceedings of the 2009 International Symposium on Code Generation
  and Optimization, 2009.
