OpenCores

 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"

 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"

[ ]>

[ ]>

C++

C++

      library

      library

      profile

      profile

Profile Mode

Profile Mode

  Intro

  Intro

  Goal: Give performance improvement advice based on

  Goal: Give performance improvement advice based on

  recognition of suboptimal usage patterns of the standard library.

  recognition of suboptimal usage patterns of the standard library.

  Method: Wrap the standard library code.  Insert

  Method: Wrap the standard library code.  Insert

  calls to an instrumentation library to record the internal state of

  calls to an instrumentation library to record the internal state of

  various components at interesting entry/exit points to/from the standard

  various components at interesting entry/exit points to/from the standard

  library.  Process trace, recognize suboptimal patterns, give advice.

  library.  Process trace, recognize suboptimal patterns, give advice.

  For details, see

  For details, see

  paper presented at

  paper presented at

   CGO 2009.

   CGO 2009.

  Strengths:

  Strengths:

  Unintrusive solution.  The application code does not require any

  Unintrusive solution.  The application code does not require any

  modification.

  modification.

   The advice is call context sensitive, thus capable of

   The advice is call context sensitive, thus capable of

  identifying precisely interesting dynamic performance behavior.

  identifying precisely interesting dynamic performance behavior.

  The overhead model is pay-per-view.  When you turn off a diagnostic class

  The overhead model is pay-per-view.  When you turn off a diagnostic class

  at compile time, its overhead disappears.

  at compile time, its overhead disappears.

  Drawbacks:

  Drawbacks:

  You must recompile the application code with custom options.

  You must recompile the application code with custom options.

  You must run the application on representative input.

  You must run the application on representative input.

  The advice is input dependent.

  The advice is input dependent.

  The execution time will increase, in some cases by factors.

  The execution time will increase, in some cases by factors.

  Using the Profile Mode

  Using the Profile Mode

  This is the anticipated common workflow for program foo.cc:

  This is the anticipated common workflow for program foo.cc:

$ cat foo.cc

$ cat foo.cc

#include <vector>

#include <vector>

int main() {

int main() {

  vector<int> v;

  vector<int> v;

  for (int k = 0; k < 1024; ++k) v.insert(v.begin(), k);

  for (int k = 0; k < 1024; ++k) v.insert(v.begin(), k);

$ g++ -D_GLIBCXX_PROFILE foo.cc

$ g++ -D_GLIBCXX_PROFILE foo.cc

$ ./a.out

$ ./a.out

$ cat libstdcxx-profile.txt

$ cat libstdcxx-profile.txt

vector-to-list: improvement = 5: call stack = 0x804842c ...

vector-to-list: improvement = 5: call stack = 0x804842c ...

    : advice = change std::vector to std::list

    : advice = change std::vector to std::list

vector-size: improvement = 3: call stack = 0x804842c ...

vector-size: improvement = 3: call stack = 0x804842c ...

    : advice = change initial container size from 0 to 1024

    : advice = change initial container size from 0 to 1024

  Anatomy of a warning:

  Anatomy of a warning:

  Warning id.  This is a short descriptive string for the class

  Warning id.  This is a short descriptive string for the class

  that this warning belongs to.  E.g., "vector-to-list".

  that this warning belongs to.  E.g., "vector-to-list".

  Estimated improvement.  This is an approximation of the benefit expected

  Estimated improvement.  This is an approximation of the benefit expected

  from implementing the change suggested by the warning.  It is given on

  from implementing the change suggested by the warning.  It is given on

  a log10 scale.  Negative values mean that the alternative would actually

  a log10 scale.  Negative values mean that the alternative would actually

  do worse than the current choice.

  do worse than the current choice.

  In the example above, 5 comes from the fact that the overhead of

  In the example above, 5 comes from the fact that the overhead of

  inserting at the beginning of a vector vs. a list is around 1024 * 1024 / 2,

  inserting at the beginning of a vector vs. a list is around 1024 * 1024 / 2,

  which is around 10e5.  The improvement from setting the initial size to

  which is around 10e5.  The improvement from setting the initial size to

  1024 is in the range of 10e3, since the overhead of dynamic resizing is

  1024 is in the range of 10e3, since the overhead of dynamic resizing is

  linear in this case.

  linear in this case.

  Call stack.  Currently, the addresses are printed without

  Call stack.  Currently, the addresses are printed without

  symbol name or code location attribution.

  symbol name or code location attribution.

  Users are expected to postprocess the output using, for instance, addr2line.

  Users are expected to postprocess the output using, for instance, addr2line.

  The warning message.  For some warnings, this is static text, e.g.,

  The warning message.  For some warnings, this is static text, e.g.,

  "change vector to list".  For other warnings, such as the one above,

  "change vector to list".  For other warnings, such as the one above,

  the message contains numeric advice, e.g., the suggested initial size

  the message contains numeric advice, e.g., the suggested initial size

  of the vector.

  of the vector.

  Three files are generated.  libstdcxx-profile.txt

  Three files are generated.  libstdcxx-profile.txt

   contains human readable advice.  libstdcxx-profile.raw

   contains human readable advice.  libstdcxx-profile.raw

   contains implementation specific data about each diagnostic.

   contains implementation specific data about each diagnostic.

   Their format is not documented.  They are sufficient to generate

   Their format is not documented.  They are sufficient to generate

   all the advice given in libstdcxx-profile.txt.  The advantage

   all the advice given in libstdcxx-profile.txt.  The advantage

   of keeping this raw format is that traces from multiple executions can

   of keeping this raw format is that traces from multiple executions can

   be aggregated simply by concatenating the raw traces.  We intend to

   be aggregated simply by concatenating the raw traces.  We intend to

   offer an external utility program that can issue advice from a trace.

   offer an external utility program that can issue advice from a trace.

   libstdcxx-profile.conf.out lists the actual diagnostic

   libstdcxx-profile.conf.out lists the actual diagnostic

   parameters used.  To alter parameters, edit this file and rename it to

   parameters used.  To alter parameters, edit this file and rename it to

   libstdcxx-profile.conf.

   libstdcxx-profile.conf.

  Advice is given regardless whether the transformation is valid.

  Advice is given regardless whether the transformation is valid.

  For instance, we advise changing a map to an unordered_map even if the

  For instance, we advise changing a map to an unordered_map even if the

  application semantics require that data be ordered.

  application semantics require that data be ordered.

  We believe such warnings can help users understand the performance

  We believe such warnings can help users understand the performance

  behavior of their application better, which can lead to changes

  behavior of their application better, which can lead to changes

  at a higher abstraction level.

  at a higher abstraction level.

  Tuning the Profile Mode

  Tuning the Profile Mode

  Compile time switches and environment variables (see also file

  Compile time switches and environment variables (see also file

   profiler.h).  Unless specified otherwise, they can be set at compile time

   profiler.h).  Unless specified otherwise, they can be set at compile time

   using -D_<name> or by setting variable <name>

   using -D_<name> or by setting variable <name>

   in the environment where the program is run, before starting execution.

   in the environment where the program is run, before starting execution.

   _GLIBCXX_PROFILE_NO_<diagnostic>:

   _GLIBCXX_PROFILE_NO_<diagnostic>:

   disable specific diagnostics.

   disable specific diagnostics.

   See section Diagnostics for possible values.

   See section Diagnostics for possible values.

   (Environment variables not supported.)

   (Environment variables not supported.)

   _GLIBCXX_PROFILE_TRACE_PATH_ROOT: set an alternative root

   _GLIBCXX_PROFILE_TRACE_PATH_ROOT: set an alternative root

   path for the output files.

   path for the output files.

  _GLIBCXX_PROFILE_MAX_WARN_COUNT: set it to the maximum

  _GLIBCXX_PROFILE_MAX_WARN_COUNT: set it to the maximum

   number of warnings desired.  The default value is 10.

   number of warnings desired.  The default value is 10.

   _GLIBCXX_PROFILE_MAX_STACK_DEPTH: if set to 0,

   _GLIBCXX_PROFILE_MAX_STACK_DEPTH: if set to 0,

   the advice will

   the advice will

   be collected and reported for the program as a whole, and not for each

   be collected and reported for the program as a whole, and not for each

   call context.

   call context.

   This could also be used in continuous regression tests, where you

   This could also be used in continuous regression tests, where you

   just need to know whether there is a regression or not.

   just need to know whether there is a regression or not.

   The default value is 32.

   The default value is 32.

   _GLIBCXX_PROFILE_MEM_PER_DIAGNOSTIC:

   _GLIBCXX_PROFILE_MEM_PER_DIAGNOSTIC:

   set a limit on how much memory to use for the accounting tables for each

   set a limit on how much memory to use for the accounting tables for each

   diagnostic type.  When this limit is reached, new events are ignored

   diagnostic type.  When this limit is reached, new events are ignored

   until the memory usage decreases under the limit.  Generally, this means

   until the memory usage decreases under the limit.  Generally, this means

   that newly created containers will not be instrumented until some

   that newly created containers will not be instrumented until some

   live containers are deleted.  The default is 128 MB.

   live containers are deleted.  The default is 128 MB.

   _GLIBCXX_PROFILE_NO_THREADS:

   _GLIBCXX_PROFILE_NO_THREADS:

   Make the library not use threads.  If thread local storage (TLS) is not

   Make the library not use threads.  If thread local storage (TLS) is not

   available, you will get a preprocessor error asking you to set

   available, you will get a preprocessor error asking you to set

   -D_GLIBCXX_PROFILE_NO_THREADS if your program is single-threaded.

   -D_GLIBCXX_PROFILE_NO_THREADS if your program is single-threaded.

   Multithreaded execution without TLS is not supported.

   Multithreaded execution without TLS is not supported.

   (Environment variable not supported.)

   (Environment variable not supported.)

   _GLIBCXX_HAVE_EXECINFO_H:

   _GLIBCXX_HAVE_EXECINFO_H:

   This name should be defined automatically at library configuration time.

   This name should be defined automatically at library configuration time.

   If your library was configured without execinfo.h, but

   If your library was configured without execinfo.h, but

   you have it in your include path, you can define it explicitly.  Without

   you have it in your include path, you can define it explicitly.  Without

   it, advice is collected for the program as a whole, and not for each

   it, advice is collected for the program as a whole, and not for each

   call context.

   call context.

   (Environment variable not supported.)

   (Environment variable not supported.)

  Design

  Design


        
      
      
        Profile Code Location
        Profile Code Location
      
      
        
        
      
      
        
        
      
      
        
        
      
      
        
        
      
      
        

        

      
      
          
          
      
      
            Code Location
            Code Location
      
      
            Use
            Use
      
      
          
          
      
      
        
        
      
      
        

        

      
      
          
          
      
      
            libstdc++-v3/include/std/*
            libstdc++-v3/include/std/*
      
      
            Preprocessor code to redirect to profile extension headers.
            Preprocessor code to redirect to profile extension headers.
      
      
          
          
      
      
          
          
      
      
            libstdc++-v3/include/profile/*
            libstdc++-v3/include/profile/*
      
      
            Profile extension public headers (map, vector, ...).
            Profile extension public headers (map, vector, ...).
      
      
          
          
      
      
          
          
      
      
            libstdc++-v3/include/profile/impl/*
            libstdc++-v3/include/profile/impl/*
      
      
            Profile extension internals.  Implementation files are
            Profile extension internals.  Implementation files are
      
      
             only included from impl/profiler.h, which is the only
             only included from impl/profiler.h, which is the only
      
      
             file included from the public headers.
             file included from the public headers.

 xreflabel="Wrapper">

 xreflabel="Wrapper">

Wrapper Model

Wrapper Model

  In order to get our instrumented library version included instead of the

  In order to get our instrumented library version included instead of the

  release one,

  release one,

  we use the same wrapper model as the debug mode.

  we use the same wrapper model as the debug mode.

  We subclass entities from the release version.  Wherever

  We subclass entities from the release version.  Wherever

  _GLIBCXX_PROFILE is defined, the release namespace is

  _GLIBCXX_PROFILE is defined, the release namespace is

  std::__norm, whereas the profile namespace is

  std::__norm, whereas the profile namespace is

  std::__profile.  Using plain std translates

  std::__profile.  Using plain std translates

  into std::__profile.

  into std::__profile.

  Whenever possible, we try to wrap at the public interface level, e.g.,

  Whenever possible, we try to wrap at the public interface level, e.g.,

  in unordered_set rather than in hashtable,

  in unordered_set rather than in hashtable,

  in order not to depend on implementation.

  in order not to depend on implementation.

  Mixing object files built with and without the profile mode must

  Mixing object files built with and without the profile mode must

  not affect the program execution.  However, there are no guarantees to

  not affect the program execution.  However, there are no guarantees to

  the accuracy of diagnostics when using even a single object not built with

  the accuracy of diagnostics when using even a single object not built with

  -D_GLIBCXX_PROFILE.

  -D_GLIBCXX_PROFILE.

  Currently, mixing the profile mode with debug and parallel extensions is

  Currently, mixing the profile mode with debug and parallel extensions is

  not allowed.  Mixing them at compile time will result in preprocessor errors.

  not allowed.  Mixing them at compile time will result in preprocessor errors.

  Mixing them at link time is undefined.

  Mixing them at link time is undefined.

 xreflabel="Instrumentation">

 xreflabel="Instrumentation">

Instrumentation

Instrumentation

  Instead of instrumenting every public entry and exit point,

  Instead of instrumenting every public entry and exit point,

  we chose to add instrumentation on demand, as needed

  we chose to add instrumentation on demand, as needed

  by individual diagnostics.

  by individual diagnostics.

  The main reason is that some diagnostics require us to extract bits of

  The main reason is that some diagnostics require us to extract bits of

  internal state that are particular only to that diagnostic.

  internal state that are particular only to that diagnostic.

  We plan to formalize this later, after we learn more about the requirements

  We plan to formalize this later, after we learn more about the requirements

  of several diagnostics.

  of several diagnostics.

  All the instrumentation points can be switched on and off using

  All the instrumentation points can be switched on and off using

  -D[_NO]_GLIBCXX_PROFILE_<diagnostic> options.

  -D[_NO]_GLIBCXX_PROFILE_<diagnostic> options.

  With all the instrumentation calls off, there should be negligible

  With all the instrumentation calls off, there should be negligible

  overhead over the release version.  This property is needed to support

  overhead over the release version.  This property is needed to support

  diagnostics based on timing of internal operations.  For such diagnostics,

  diagnostics based on timing of internal operations.  For such diagnostics,

  we anticipate turning most of the instrumentation off in order to prevent

  we anticipate turning most of the instrumentation off in order to prevent

  profiling overhead from polluting time measurements, and thus diagnostics.

  profiling overhead from polluting time measurements, and thus diagnostics.

  All the instrumentation on/off compile time switches live in

  All the instrumentation on/off compile time switches live in

  include/profile/profiler.h.

  include/profile/profiler.h.

 xreflabel="Run Time Behavior">

 xreflabel="Run Time Behavior">

Run Time Behavior

Run Time Behavior

  For practical reasons, the instrumentation library processes the trace

  For practical reasons, the instrumentation library processes the trace

  partially

  partially

  rather than dumping it to disk in raw form.  Each event is processed when

  rather than dumping it to disk in raw form.  Each event is processed when

  it occurs.  It is usually attached a cost and it is aggregated into

  it occurs.  It is usually attached a cost and it is aggregated into

  the database of a specific diagnostic class.  The cost model

  the database of a specific diagnostic class.  The cost model

  is based largely on the standard performance guarantees, but in some

  is based largely on the standard performance guarantees, but in some

  cases we use knowledge about GCC's standard library implementation.

  cases we use knowledge about GCC's standard library implementation.

  Information is indexed by (1) call stack and (2) instance id or address

  Information is indexed by (1) call stack and (2) instance id or address

  to be able to understand and summarize precise creation-use-destruction

  to be able to understand and summarize precise creation-use-destruction

  dynamic chains.  Although the analysis is sensitive to dynamic instances,

  dynamic chains.  Although the analysis is sensitive to dynamic instances,

  the reports are only sensitive to call context.  Whenever a dynamic instance

  the reports are only sensitive to call context.  Whenever a dynamic instance

  is destroyed, we accumulate its effect to the corresponding entry for the

  is destroyed, we accumulate its effect to the corresponding entry for the

  call stack of its constructor location.

  call stack of its constructor location.

  For details, see

  For details, see

   paper presented at

   paper presented at

   CGO 2009.

   CGO 2009.

 xreflabel="Analysis and Diagnostics">

 xreflabel="Analysis and Diagnostics">

Analysis and Diagnostics

Analysis and Diagnostics

  Final analysis takes place offline, and it is based entirely on the

  Final analysis takes place offline, and it is based entirely on the

  generated trace and debugging info in the application binary.

  generated trace and debugging info in the application binary.

  See section Diagnostics for a list of analysis types that we plan to support.

  See section Diagnostics for a list of analysis types that we plan to support.

  The input to the analysis is a table indexed by profile type and call stack.

  The input to the analysis is a table indexed by profile type and call stack.

  The data type for each entry depends on the profile type.

  The data type for each entry depends on the profile type.

 xreflabel="Cost Model">

 xreflabel="Cost Model">

Cost Model

Cost Model

  While it is likely that cost models become complex as we get into

  While it is likely that cost models become complex as we get into

  more sophisticated analysis, we will try to follow a simple set of rules

  more sophisticated analysis, we will try to follow a simple set of rules

  at the beginning.

  at the beginning.

  Relative benefit estimation:

  Relative benefit estimation:

  The idea is to estimate or measure the cost of all operations

  The idea is to estimate or measure the cost of all operations

  in the original scenario versus the scenario we advise to switch to.

  in the original scenario versus the scenario we advise to switch to.

  For instance, when advising to change a vector to a list, an occurrence

  For instance, when advising to change a vector to a list, an occurrence

  of the insert method will generally count as a benefit.

  of the insert method will generally count as a benefit.

  Its magnitude depends on (1) the number of elements that get shifted

  Its magnitude depends on (1) the number of elements that get shifted

  and (2) whether it triggers a reallocation.

  and (2) whether it triggers a reallocation.

  Synthetic measurements:

  Synthetic measurements:

  We will measure the relative difference between similar operations on

  We will measure the relative difference between similar operations on

  different containers.  We plan to write a battery of small tests that

  different containers.  We plan to write a battery of small tests that

  compare the times of the executions of similar methods on different

  compare the times of the executions of similar methods on different

  containers.  The idea is to run these tests on the target machine.

  containers.  The idea is to run these tests on the target machine.

  If this training phase is very quick, we may decide to perform it at

  If this training phase is very quick, we may decide to perform it at

  library initialization time.  The results can be cached on disk and reused

  library initialization time.  The results can be cached on disk and reused

  across runs.

  across runs.

  Timers:

  Timers:

  We plan to use timers for operations of larger granularity, such as sort.

  We plan to use timers for operations of larger granularity, such as sort.

  For instance, we can switch between different sort methods on the fly

  For instance, we can switch between different sort methods on the fly

  and report the one that performs best for each call context.

  and report the one that performs best for each call context.

  Show stoppers:

  Show stoppers:

  We may decide that the presence of an operation nullifies the advice.

  We may decide that the presence of an operation nullifies the advice.

  For instance, when considering switching from set to

  For instance, when considering switching from set to

  unordered_set, if we detect use of operator ++,

  unordered_set, if we detect use of operator ++,

  we will simply not issue the advice, since this could signal that the use

  we will simply not issue the advice, since this could signal that the use

  care require a sorted container.

  care require a sorted container.

 xreflabel="Reports">

 xreflabel="Reports">

Reports

Reports

There are two types of reports.  First, if we recognize a pattern for which

There are two types of reports.  First, if we recognize a pattern for which

we have a substitute that is likely to give better performance, we print

we have a substitute that is likely to give better performance, we print

the advice and estimated performance gain.  The advice is usually associated

the advice and estimated performance gain.  The advice is usually associated

to a code position and possibly a call stack.

to a code position and possibly a call stack.

Second, we report performance characteristics for which we do not have

Second, we report performance characteristics for which we do not have

a clear solution for improvement.  For instance, we can point to the user

a clear solution for improvement.  For instance, we can point to the user

the top 10 multimap locations

the top 10 multimap locations

which have the worst data locality in actual traversals.

which have the worst data locality in actual traversals.

Although this does not offer a solution,

Although this does not offer a solution,

it helps the user focus on the key problems and ignore the uninteresting ones.

it helps the user focus on the key problems and ignore the uninteresting ones.

 xreflabel="Testing">

 xreflabel="Testing">

Testing

Testing

  First, we want to make sure we preserve the behavior of the release mode.

  First, we want to make sure we preserve the behavior of the release mode.

  You can just type "make check-profile", which

  You can just type "make check-profile", which

  builds and runs the whole test suite in profile mode.

  builds and runs the whole test suite in profile mode.

  Second, we want to test the correctness of each diagnostic.

  Second, we want to test the correctness of each diagnostic.

  We created a profile directory in the test suite.

  We created a profile directory in the test suite.

  Each diagnostic must come with at least two tests, one for false positives

  Each diagnostic must come with at least two tests, one for false positives

  and one for false negatives.

  and one for false negatives.

 xreflabel="API">

 xreflabel="API">

Extensions for Custom Containers

Extensions for Custom Containers

  Many large projects use their own data structures instead of the ones in the

  Many large projects use their own data structures instead of the ones in the

  standard library.  If these data structures are similar in functionality

  standard library.  If these data structures are similar in functionality

  to the standard library, they can be instrumented with the same hooks

  to the standard library, they can be instrumented with the same hooks

  that are used to instrument the standard library.

  that are used to instrument the standard library.

  The instrumentation API is exposed in file

  The instrumentation API is exposed in file

  profiler.h (look for "Instrumentation hooks").

  profiler.h (look for "Instrumentation hooks").

 xreflabel="Cost Model">

 xreflabel="Cost Model">

Empirical Cost Model

Empirical Cost Model

  Currently, the cost model uses formulas with predefined relative weights

  Currently, the cost model uses formulas with predefined relative weights

  for alternative containers or container implementations.  For instance,

  for alternative containers or container implementations.  For instance,

  iterating through a vector is X times faster than iterating through a list.

  iterating through a vector is X times faster than iterating through a list.

  (Under development.)

  (Under development.)

  We are working on customizing this to a particular machine by providing

  We are working on customizing this to a particular machine by providing

  an automated way to compute the actual relative weights for operations

  an automated way to compute the actual relative weights for operations

  on the given machine.

  on the given machine.

  (Under development.)

  (Under development.)

  We plan to provide a performance parameter database format that can be

  We plan to provide a performance parameter database format that can be

  filled in either by hand or by an automated training mechanism.

  filled in either by hand or by an automated training mechanism.

  The analysis module will then use this database instead of the built in.

  The analysis module will then use this database instead of the built in.

  generic parameters.

  generic parameters.

 xreflabel="Implementation">

 xreflabel="Implementation">

Implementation Issues

Implementation Issues

 xreflabel="Stack Traces">

 xreflabel="Stack Traces">

Stack Traces

Stack Traces

  Accurate stack traces are needed during profiling since we group events by

  Accurate stack traces are needed during profiling since we group events by

  call context and dynamic instance.  Without accurate traces, diagnostics

  call context and dynamic instance.  Without accurate traces, diagnostics

  may be hard to interpret.  For instance, when giving advice to the user

  may be hard to interpret.  For instance, when giving advice to the user

  it is imperative to reference application code, not library code.

  it is imperative to reference application code, not library code.

  Currently we are using the libc backtrace routine to get

  Currently we are using the libc backtrace routine to get

  stack traces.

  stack traces.

  _GLIBCXX_PROFILE_STACK_DEPTH can be set

  _GLIBCXX_PROFILE_STACK_DEPTH can be set

  to 0 if you are willing to give up call context information, or to a small

  to 0 if you are willing to give up call context information, or to a small

  positive value to reduce run time overhead.

  positive value to reduce run time overhead.

 xreflabel="Symbolization">

 xreflabel="Symbolization">

Symbolization of Instruction Addresses

Symbolization of Instruction Addresses

  The profiling and analysis phases use only instruction addresses.

  The profiling and analysis phases use only instruction addresses.

  An external utility such as addr2line is needed to postprocess the result.

  An external utility such as addr2line is needed to postprocess the result.

  We do not plan to add symbolization support in the profile extension.

  We do not plan to add symbolization support in the profile extension.

  This would require access to symbol tables, debug information tables,

  This would require access to symbol tables, debug information tables,

  external programs or libraries and other system dependent information.

  external programs or libraries and other system dependent information.

 xreflabel="Concurrency">

 xreflabel="Concurrency">

Concurrency

Concurrency

  Our current model is simplistic, but precise.

  Our current model is simplistic, but precise.

  We cannot afford to approximate because some of our diagnostics require

  We cannot afford to approximate because some of our diagnostics require

  precise matching of operations to container instance and call context.

  precise matching of operations to container instance and call context.

  During profiling, we keep a single information table per diagnostic.

  During profiling, we keep a single information table per diagnostic.

  There is a single lock per information table.

  There is a single lock per information table.

 xreflabel="Using the Standard Library in the Runtime Library">

 xreflabel="Using the Standard Library in the Runtime Library">

Using the Standard Library in the Instrumentation Implementation

Using the Standard Library in the Instrumentation Implementation

  As much as we would like to avoid uses of libstdc++ within our

  As much as we would like to avoid uses of libstdc++ within our

  instrumentation library, containers such as unordered_map are very

  instrumentation library, containers such as unordered_map are very

  appealing.  We plan to use them as long as they are named properly

  appealing.  We plan to use them as long as they are named properly

  to avoid ambiguity.

  to avoid ambiguity.

 xreflabel="Malloc Hooks">

 xreflabel="Malloc Hooks">

Malloc Hooks

Malloc Hooks

  User applications/libraries can provide malloc hooks.

  User applications/libraries can provide malloc hooks.

  When the implementation of the malloc hooks uses stdlibc++, there can

  When the implementation of the malloc hooks uses stdlibc++, there can

  be an infinite cycle between the profile mode instrumentation and the

  be an infinite cycle between the profile mode instrumentation and the

  malloc hook code.

  malloc hook code.

  We protect against reentrance to the profile mode instrumentation code,

  We protect against reentrance to the profile mode instrumentation code,

  which should avoid this problem in most cases.

  which should avoid this problem in most cases.

  The protection mechanism is thread safe and exception safe.

  The protection mechanism is thread safe and exception safe.

  This mechanism does not prevent reentrance to the malloc hook itself,

  This mechanism does not prevent reentrance to the malloc hook itself,

  which could still result in deadlock, if, for instance, the malloc hook

  which could still result in deadlock, if, for instance, the malloc hook

  uses non-recursive locks.

  uses non-recursive locks.

  XXX: A definitive solution to this problem would be for the profile extension

  XXX: A definitive solution to this problem would be for the profile extension

  to use a custom allocator internally, and perhaps not to use libstdc++.

  to use a custom allocator internally, and perhaps not to use libstdc++.

 xreflabel="Construction and Destruction of Global Objects">

 xreflabel="Construction and Destruction of Global Objects">

Construction and Destruction of Global Objects

Construction and Destruction of Global Objects

  The profiling library state is initialized at the first call to a profiling

  The profiling library state is initialized at the first call to a profiling

  method.  This allows us to record the construction of all global objects.

  method.  This allows us to record the construction of all global objects.

  However, we cannot do the same at destruction time.  The trace is written

  However, we cannot do the same at destruction time.  The trace is written

  by a function registered by atexit, thus invoked by

  by a function registered by atexit, thus invoked by

  exit.

  exit.

 xreflabel="Developer Information">

 xreflabel="Developer Information">

Developer Information

Developer Information

 xreflabel="Big Picture">

 xreflabel="Big Picture">

Big Picture

Big Picture

  The profile mode headers are included with

  The profile mode headers are included with

   -D_GLIBCXX_PROFILE through preprocessor directives in

   -D_GLIBCXX_PROFILE through preprocessor directives in

   include/std/*.

   include/std/*.

  Instrumented implementations are provided in

  Instrumented implementations are provided in

   include/profile/*.  All instrumentation hooks are macros

   include/profile/*.  All instrumentation hooks are macros

   defined in include/profile/profiler.h.

   defined in include/profile/profiler.h.

  All the implementation of the instrumentation hooks is in

  All the implementation of the instrumentation hooks is in

   include/profile/impl/*.  Although all the code gets included,

   include/profile/impl/*.  Although all the code gets included,

   thus is publicly visible, only a small number of functions are called from

   thus is publicly visible, only a small number of functions are called from

   outside this directory.  All calls to hook implementations must be

   outside this directory.  All calls to hook implementations must be

   done through macros defined in profiler.h.  The macro

   done through macros defined in profiler.h.  The macro

   must ensure (1) that the call is guarded against reentrance and

   must ensure (1) that the call is guarded against reentrance and

   (2) that the call can be turned off at compile time using a

   (2) that the call can be turned off at compile time using a

   -D_GLIBCXX_PROFILE_... compiler option.

   -D_GLIBCXX_PROFILE_... compiler option.

 xreflabel="How To Add A Diagnostic">

 xreflabel="How To Add A Diagnostic">

How To Add A Diagnostic

How To Add A Diagnostic

  Let's say the diagnostic name is "magic".

  Let's say the diagnostic name is "magic".

  If you need to instrument a header not already under

  If you need to instrument a header not already under

   include/profile/*, first edit the corresponding header

   include/profile/*, first edit the corresponding header

   under include/std/ and add a preprocessor directive such

   under include/std/ and add a preprocessor directive such

   as the one in include/std/vector:

   as the one in include/std/vector:

#ifdef _GLIBCXX_PROFILE

#ifdef _GLIBCXX_PROFILE

# include <profile/vector>

# include <profile/vector>

#endif

#endif

  If the file you need to instrument is not yet under

  If the file you need to instrument is not yet under

   include/profile/, make a copy of the one in

   include/profile/, make a copy of the one in

   include/debug, or the main implementation.

   include/debug, or the main implementation.

   You'll need to include the main implementation and inherit the classes

   You'll need to include the main implementation and inherit the classes

   you want to instrument.  Then define the methods you want to instrument,

   you want to instrument.  Then define the methods you want to instrument,

   define the instrumentation hooks and add calls to them.

   define the instrumentation hooks and add calls to them.

   Look at include/profile/vector for an example.

   Look at include/profile/vector for an example.

  Add macros for the instrumentation hooks in

  Add macros for the instrumentation hooks in

   include/profile/impl/profiler.h.

   include/profile/impl/profiler.h.

   Hook names must start with __profcxx_.

   Hook names must start with __profcxx_.

   Make sure they transform

   Make sure they transform

   in no code with -D_NO_GLBICXX_PROFILE_MAGIC.

   in no code with -D_NO_GLBICXX_PROFILE_MAGIC.

   Make sure all calls to any method in namespace __gnu_profile

   Make sure all calls to any method in namespace __gnu_profile

   is protected against reentrance using macro

   is protected against reentrance using macro

   _GLIBCXX_PROFILE_REENTRANCE_GUARD.

   _GLIBCXX_PROFILE_REENTRANCE_GUARD.

   All names of methods in namespace __gnu_profile called from

   All names of methods in namespace __gnu_profile called from

   profiler.h must start with __trace_magic_.

   profiler.h must start with __trace_magic_.

  Add the implementation of the diagnostic.

  Add the implementation of the diagnostic.

      Create new file include/profile/impl/profiler_magic.h.

      Create new file include/profile/impl/profiler_magic.h.

      Define class __magic_info: public __object_info_base.

      Define class __magic_info: public __object_info_base.

      This is the representation of a line in the object table.

      This is the representation of a line in the object table.

      The __merge method is used to aggregate information

      The __merge method is used to aggregate information

      across all dynamic instances created at the same call context.

      across all dynamic instances created at the same call context.

      The __magnitude must return the estimation of the benefit

      The __magnitude must return the estimation of the benefit

      as a number of small operations, e.g., number of words copied.

      as a number of small operations, e.g., number of words copied.

      The __write method is used to produce the raw trace.

      The __write method is used to produce the raw trace.

      The __advice method is used to produce the advice string.

      The __advice method is used to produce the advice string.

      Define class __magic_stack_info: public __magic_info.

      Define class __magic_stack_info: public __magic_info.

      This defines the content of a line in the stack table.

      This defines the content of a line in the stack table.

      Define class __trace_magic: public __trace_base<__magic_info,

      Define class __trace_magic: public __trace_base<__magic_info,

      __magic_stack_info>.

      __magic_stack_info>.

      It defines the content of the trace associated with this diagnostic.

      It defines the content of the trace associated with this diagnostic.

  Add initialization and reporting calls in

  Add initialization and reporting calls in

   include/profile/impl/profiler_trace.h.  Use

   include/profile/impl/profiler_trace.h.  Use

   __trace_vector_to_list as an example.

   __trace_vector_to_list as an example.

  Add documentation in file doc/xml/manual/profile_mode.xml.

  Add documentation in file doc/xml/manual/profile_mode.xml.

Diagnostics

Diagnostics

  The table below presents all the diagnostics we intend to implement.

  The table below presents all the diagnostics we intend to implement.

  Each diagnostic has a corresponding compile time switch

  Each diagnostic has a corresponding compile time switch

  -D_GLIBCXX_PROFILE_<diagnostic>.

  -D_GLIBCXX_PROFILE_<diagnostic>.

  Groups of related diagnostics can be turned on with a single switch.

  Groups of related diagnostics can be turned on with a single switch.

  For instance, -D_GLIBCXX_PROFILE_LOCALITY is equivalent to

  For instance, -D_GLIBCXX_PROFILE_LOCALITY is equivalent to

  -D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH

  -D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH

  -D_GLIBCXX_PROFILE_RBTREE_LOCALITY.

  -D_GLIBCXX_PROFILE_RBTREE_LOCALITY.

  The benefit, cost, expected frequency and accuracy of each diagnostic

  The benefit, cost, expected frequency and accuracy of each diagnostic

  was given a grade from 1 to 10, where 10 is highest.

  was given a grade from 1 to 10, where 10 is highest.

  A high benefit means that, if the diagnostic is accurate, the expected

  A high benefit means that, if the diagnostic is accurate, the expected

  performance improvement is high.

  performance improvement is high.

  A high cost means that turning this diagnostic on leads to high slowdown.

  A high cost means that turning this diagnostic on leads to high slowdown.

  A high frequency means that we expect this to occur relatively often.

  A high frequency means that we expect this to occur relatively often.

  A high accuracy means that the diagnostic is unlikely to be wrong.

  A high accuracy means that the diagnostic is unlikely to be wrong.

  These grades are not perfect.  They are just meant to guide users with

  These grades are not perfect.  They are just meant to guide users with

  specific needs or time budgets.

  specific needs or time budgets.


        
      
      
        Profile Diagnostics
        Profile Diagnostics
      
      
        
        
      
      
        
        
      
      
        
        
      
      
        
        
      
      
        
        
      
      
        
        
      
      
        
        
      
      
        
        
      
      
        
        
      
      
        

        

      
      
          
          
      
      
            Group
            Group
      
      
            Flag
            Flag
      
      
            Benefit
            Benefit
      
      
            Cost
            Cost
      
      
            Freq.
            Freq.
      
      
            Implemented
            Implemented
      
      
          
          
      
      
        
        
      
      
        

        

      
      
          
          
      
      
            
            
      
      
            CONTAINERS
            CONTAINERS
      
      
            
            
      
      
            HASHTABLE_TOO_SMALL
            HASHTABLE_TOO_SMALL
      
      
            10
            10
      
      
            1
            1
      
      
            
            
      
      
            10
            10
      
      
            yes
            yes
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            HASHTABLE_TOO_LARGE
            HASHTABLE_TOO_LARGE
      
      
            5
            5
      
      
            1
            1
      
      
            
            
      
      
            10
            10
      
      
            yes
            yes
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            INEFFICIENT_HASH
            INEFFICIENT_HASH
      
      
            7
            7
      
      
            3
            3
      
      
            
            
      
      
            10
            10
      
      
            yes
            yes
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            VECTOR_TOO_SMALL
            VECTOR_TOO_SMALL
      
      
            8
            8
      
      
            1
            1
      
      
            
            
      
      
            10
            10
      
      
            yes
            yes
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            VECTOR_TOO_LARGE
            VECTOR_TOO_LARGE
      
      
            5
            5
      
      
            1
            1
      
      
            
            
      
      
            10
            10
      
      
            yes
            yes
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            VECTOR_TO_HASHTABLE
            VECTOR_TO_HASHTABLE
      
      
            7
            7
      
      
            7
            7
      
      
            
            
      
      
            10
            10
      
      
            no
            no
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            HASHTABLE_TO_VECTOR
            HASHTABLE_TO_VECTOR
      
      
            7
            7
      
      
            7
            7
      
      
            
            
      
      
            10
            10
      
      
            no
            no
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            VECTOR_TO_LIST
            VECTOR_TO_LIST
      
      
            8
            8
      
      
            5
            5
      
      
            
            
      
      
            10
            10
      
      
            yes
            yes
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            LIST_TO_VECTOR
            LIST_TO_VECTOR
      
      
            10
            10
      
      
            5
            5
      
      
            
            
      
      
            10
            10
      
      
            no
            no
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            ORDERED_TO_UNORDERED
            ORDERED_TO_UNORDERED
      
      
            10
            10
      
      
            5
            5
      
      
            
            
      
      
            10
            10
      
      
            only map/unordered_map
            only map/unordered_map
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            ALGORITHMS
            ALGORITHMS
      
      
            
            
      
      
            SORT
            SORT
      
      
            7
            7
      
      
            8
            8
      
      
            
            
      
      
            7
            7
      
      
            no
            no
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            LOCALITY
            LOCALITY
      
      
            
            
      
      
            SOFTWARE_PREFETCH
            SOFTWARE_PREFETCH
      
      
            8
            8
      
      
            8
            8
      
      
            
            
      
      
            5
            5
      
      
            no
            no
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            RBTREE_LOCALITY
            RBTREE_LOCALITY
      
      
            4
            4
      
      
            8
            8
      
      
            
            
      
      
            5
            5
      
      
            no
            no
      
      
          
          
      
      
          
          
      
      
            
            
      
      
            
            
      
      
            FALSE_SHARING
            FALSE_SHARING
      
      
            8
            8
      
      
            10
            10
      
      
            
            
      
      
            10
            10
      
      
            no
            no

 xreflabel="Template">

 xreflabel="Template">

Diagnostic Template

Diagnostic Template

  Switch:

  Switch:

  _GLIBCXX_PROFILE_<diagnostic>.

  _GLIBCXX_PROFILE_<diagnostic>.

  Goal:  What problem will it diagnose?

  Goal:  What problem will it diagnose?

  Fundamentals:.

  Fundamentals:.

  What is the fundamental reason why this is a problem

  What is the fundamental reason why this is a problem

  Sample runtime reduction:

  Sample runtime reduction:

  Percentage reduction in execution time.  When reduction is more than

  Percentage reduction in execution time.  When reduction is more than

  a constant factor, describe the reduction rate formula.

  a constant factor, describe the reduction rate formula.

  Recommendation:

  Recommendation:

  What would the advise look like?

  What would the advise look like?

  To instrument:

  To instrument:

  What stdlibc++ components need to be instrumented?

  What stdlibc++ components need to be instrumented?

  Analysis:

  Analysis:

  How do we decide when to issue the advice?

  How do we decide when to issue the advice?

  Cost model:

  Cost model:

  How do we measure benefits?  Math goes here.

  How do we measure benefits?  Math goes here.

  Example:

  Example:

program code

program code

...

...

advice sample

advice sample

 xreflabel="Containers">

 xreflabel="Containers">

Containers

Containers

Switch:

Switch:

  _GLIBCXX_PROFILE_CONTAINERS.

  _GLIBCXX_PROFILE_CONTAINERS.

 xreflabel="Hashtable Too Small">

 xreflabel="Hashtable Too Small">

Hashtable Too Small

Hashtable Too Small

  Switch:

  Switch:

  _GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL.

  _GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL.

  Goal: Detect hashtables with many

  Goal: Detect hashtables with many

  rehash operations, small construction size and large destruction size.

  rehash operations, small construction size and large destruction size.

  Fundamentals: Rehash is very expensive.

  Fundamentals: Rehash is very expensive.

  Read content, follow chains within bucket, evaluate hash function, place at

  Read content, follow chains within bucket, evaluate hash function, place at

  new location in different order.

  new location in different order.

  Sample runtime reduction: 36%.

  Sample runtime reduction: 36%.

  Code similar to example below.

  Code similar to example below.

  Recommendation:

  Recommendation:

  Set initial size to N at construction site S.

  Set initial size to N at construction site S.

  To instrument:

  To instrument:

  unordered_set, unordered_map constructor, destructor, rehash.

  unordered_set, unordered_map constructor, destructor, rehash.

  Analysis:

  Analysis:

  For each dynamic instance of unordered_[multi]set|map,

  For each dynamic instance of unordered_[multi]set|map,

  record initial size and call context of the constructor.

  record initial size and call context of the constructor.

  Record size increase, if any, after each relevant operation such as insert.

  Record size increase, if any, after each relevant operation such as insert.

  Record the estimated rehash cost.

  Record the estimated rehash cost.

  Cost model:

  Cost model:

  Number of individual rehash operations * cost per rehash.

  Number of individual rehash operations * cost per rehash.

  Example:

  Example:

1 unordered_set<int> us;

1 unordered_set<int> us;

2 for (int k = 0; k < 1000000; ++k) {

2 for (int k = 0; k < 1000000; ++k) {

3   us.insert(k);

3   us.insert(k);

4 }

4 }

foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.

foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.

 xreflabel="Hashtable Too Large">

 xreflabel="Hashtable Too Large">

Hashtable Too Large

Hashtable Too Large

  Switch:

  Switch:

  _GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE.

  _GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE.

  Goal: Detect hashtables which are

  Goal: Detect hashtables which are

  never filled up because fewer elements than reserved are ever

  never filled up because fewer elements than reserved are ever

  inserted.

  inserted.

  Fundamentals: Save memory, which

  Fundamentals: Save memory, which

  is good in itself and may also improve memory reference performance through

  is good in itself and may also improve memory reference performance through

  fewer cache and TLB misses.

  fewer cache and TLB misses.

  Sample runtime reduction: unknown.

  Sample runtime reduction: unknown.

  Recommendation:

  Recommendation:

  Set initial size to N at construction site S.

  Set initial size to N at construction site S.

  To instrument:

  To instrument:

  unordered_set, unordered_map constructor, destructor, rehash.

  unordered_set, unordered_map constructor, destructor, rehash.

  Analysis:

  Analysis:

  For each dynamic instance of unordered_[multi]set|map,

  For each dynamic instance of unordered_[multi]set|map,

  record initial size and call context of the constructor, and correlate it

  record initial size and call context of the constructor, and correlate it

  with its size at destruction time.

  with its size at destruction time.

  Cost model:

  Cost model:

  Number of iteration operations + memory saved.

  Number of iteration operations + memory saved.

  Example:

  Example:

1 vector<unordered_set<int>> v(100000, unordered_set<int>(100)) ;

1 vector<unordered_set<int>> v(100000, unordered_set<int>(100)) ;

2 for (int k = 0; k < 100000; ++k) {

2 for (int k = 0; k < 100000; ++k) {

3   for (int j = 0; j < 10; ++j) {

3   for (int j = 0; j < 10; ++j) {

4     v[k].insert(k + j);

4     v[k].insert(k + j);

5  }

5  }

6 }

6 }

foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N

foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N

bytes of memory and M iteration steps.

bytes of memory and M iteration steps.

 xreflabel="Inefficient Hash">

 xreflabel="Inefficient Hash">

Inefficient Hash

Inefficient Hash

  Switch:

  Switch:

  _GLIBCXX_PROFILE_INEFFICIENT_HASH.

  _GLIBCXX_PROFILE_INEFFICIENT_HASH.

  Goal: Detect hashtables with polarized

  Goal: Detect hashtables with polarized

  distribution.

  distribution.

  Fundamentals: A non-uniform

  Fundamentals: A non-uniform

  distribution may lead to long chains, thus possibly increasing complexity

  distribution may lead to long chains, thus possibly increasing complexity

  by a factor up to the number of elements.

  by a factor up to the number of elements.

  Sample runtime reduction: factor up

  Sample runtime reduction: factor up

   to container size.

   to container size.

  Recommendation: Change hash function

  Recommendation: Change hash function

  for container built at site S.  Distribution score = N.  Access score = S.

  for container built at site S.  Distribution score = N.  Access score = S.

  Longest chain = C, in bucket B.

  Longest chain = C, in bucket B.

  To instrument:

  To instrument:

  unordered_set, unordered_map constructor, destructor, [],

  unordered_set, unordered_map constructor, destructor, [],

  insert, iterator.

  insert, iterator.

  Analysis:

  Analysis:

  Count the exact number of link traversals.

  Count the exact number of link traversals.

  Cost model:

  Cost model:

  Total number of links traversed.

  Total number of links traversed.

  Example:

  Example:

class dumb_hash {

class dumb_hash {

 public:

 public:

  size_t operator() (int i) const { return 0; }

  size_t operator() (int i) const { return 0; }

};

};

...

...

  unordered_set<int, dumb_hash> hs;

  unordered_set<int, dumb_hash> hs;

...

...

  for (int i = 0; i < COUNT; ++i) {

  for (int i = 0; i < COUNT; ++i) {

    hs.find(i);

    hs.find(i);

 xreflabel="Vector Too Small">

 xreflabel="Vector Too Small">

Vector Too Small

Vector Too Small

  Switch:

  Switch:

  _GLIBCXX_PROFILE_VECTOR_TOO_SMALL.

  _GLIBCXX_PROFILE_VECTOR_TOO_SMALL.

  Goal:Detect vectors with many

  Goal:Detect vectors with many

  resize operations, small construction size and large destruction size..

  resize operations, small construction size and large destruction size..

  Fundamentals:Resizing can be expensive.

  Fundamentals:Resizing can be expensive.

  Copying large amounts of data takes time.  Resizing many small vectors may

  Copying large amounts of data takes time.  Resizing many small vectors may

  have allocation overhead and affect locality.

  have allocation overhead and affect locality.

  Sample runtime reduction:%.

  Sample runtime reduction:%.

  Recommendation:

  Recommendation:

  Set initial size to N at construction site S.

  Set initial size to N at construction site S.

  To instrument:vector.

  To instrument:vector.

  Analysis:

  Analysis:

  For each dynamic instance of vector,

  For each dynamic instance of vector,

  record initial size and call context of the constructor.

  record initial size and call context of the constructor.

  Record size increase, if any, after each relevant operation such as

  Record size increase, if any, after each relevant operation such as

  push_back.  Record the estimated resize cost.

  push_back.  Record the estimated resize cost.

  Cost model:

  Cost model:

  Total number of words copied * time to copy a word.

  Total number of words copied * time to copy a word.

  Example:

  Example:

1 vector<int> v;

1 vector<int> v;

2 for (int k = 0; k < 1000000; ++k) {

2 for (int k = 0; k < 1000000; ++k) {

3   v.push_back(k);

3   v.push_back(k);

4 }

4 }

foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves

foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves

copying 4000000 bytes and 20 memory allocations and deallocations.

copying 4000000 bytes and 20 memory allocations and deallocations.

 xreflabel="Vector Too Large">

 xreflabel="Vector Too Large">

Vector Too Large

Vector Too Large

  Switch:

  Switch:

  _GLIBCXX_PROFILE_VECTOR_TOO_LARGE

  _GLIBCXX_PROFILE_VECTOR_TOO_LARGE

  Goal:Detect vectors which are

  Goal:Detect vectors which are

  never filled up because fewer elements than reserved are ever

  never filled up because fewer elements than reserved are ever

  inserted.

  inserted.

  Fundamentals:Save memory, which

  Fundamentals:Save memory, which

  is good in itself and may also improve memory reference performance through

  is good in itself and may also improve memory reference performance through

  fewer cache and TLB misses.

  fewer cache and TLB misses.

  Sample runtime reduction:%.

  Sample runtime reduction:%.

  Recommendation:

  Recommendation:

  Set initial size to N at construction site S.

  Set initial size to N at construction site S.

  To instrument:vector.

  To instrument:vector.

  Analysis:

  Analysis:

  For each dynamic instance of vector,

  For each dynamic instance of vector,

  record initial size and call context of the constructor, and correlate it

  record initial size and call context of the constructor, and correlate it

  with its size at destruction time.

  with its size at destruction time.

  Cost model:

  Cost model:

  Total amount of memory saved.

  Total amount of memory saved.

  Example:

  Example:

1 vector<vector<int>> v(100000, vector<int>(100)) ;

1 vector<vector<int>> v(100000, vector<int>(100)) ;

2 for (int k = 0; k < 100000; ++k) {

2 for (int k = 0; k < 100000; ++k) {

3   for (int j = 0; j < 10; ++j) {

3   for (int j = 0; j < 10; ++j) {

4     v[k].insert(k + j);

4     v[k].insert(k + j);

5  }

5  }

6 }

6 }

foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N

foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N

bytes of memory and may reduce the number of cache and TLB misses.

bytes of memory and may reduce the number of cache and TLB misses.

 xreflabel="Vector to Hashtable">

 xreflabel="Vector to Hashtable">

Vector to Hashtable

Vector to Hashtable

  Switch:

  Switch:

  _GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE.

  _GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE.

  Goal: Detect uses of

  Goal: Detect uses of

  vector that can be substituted with unordered_set

  vector that can be substituted with unordered_set

  to reduce execution time.

  to reduce execution time.

  Fundamentals:

  Fundamentals:

  Linear search in a vector is very expensive, whereas searching in a hashtable

  Linear search in a vector is very expensive, whereas searching in a hashtable

  is very quick.

  is very quick.

  Sample runtime reduction:factor up

  Sample runtime reduction:factor up

   to container size.

   to container size.

  Recommendation:Replace

  Recommendation:Replace

  vector with unordered_set at site S.

  vector with unordered_set at site S.

  To instrument:vector

  To instrument:vector

  operations and access methods.

  operations and access methods.

  Analysis:

  Analysis:

  For each dynamic instance of vector,

  For each dynamic instance of vector,

  record call context of the constructor.  Issue the advice only if the

  record call context of the constructor.  Issue the advice only if the

  only methods called on this vector are push_back,

  only methods called on this vector are push_back,

  insert and find.

  insert and find.

  Cost model:

  Cost model:

  Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) -

  Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) -

  cost(unordered_set::insert) + cost(unordered_set::find).

  cost(unordered_set::insert) + cost(unordered_set::find).

  Example:

  Example:

1  vector<int> v;

1  vector<int> v;

...

...

2  for (int i = 0; i < 1000; ++i) {

2  for (int i = 0; i < 1000; ++i) {

3    find(v.begin(), v.end(), i);

3    find(v.begin(), v.end(), i);

4  }

4  }

foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000

foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000

comparisons.

comparisons.

 xreflabel="Hashtable to Vector">

 xreflabel="Hashtable to Vector">

Hashtable to Vector

Hashtable to Vector

  Switch:

  Switch:

  _GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR.

  _GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR.

  Goal: Detect uses of

  Goal: Detect uses of

  unordered_set that can be substituted with vector

  unordered_set that can be substituted with vector

  to reduce execution time.

  to reduce execution time.

  Fundamentals:

  Fundamentals:

  Hashtable iterator is slower than vector iterator.

  Hashtable iterator is slower than vector iterator.

  Sample runtime reduction:95%.

  Sample runtime reduction:95%.

  Recommendation:Replace

  Recommendation:Replace

  unordered_set with vector at site S.

  unordered_set with vector at site S.

  To instrument:unordered_set

  To instrument:unordered_set

  operations and access methods.

  operations and access methods.

  Analysis:

  Analysis:

  For each dynamic instance of unordered_set,

  For each dynamic instance of unordered_set,

  record call context of the constructor.  Issue the advice only if the

  record call context of the constructor.  Issue the advice only if the

  number of find, insert and []

  number of find, insert and []

  operations on this unordered_set are small relative to the

  operations on this unordered_set are small relative to the

  number of elements, and methods begin or end

  number of elements, and methods begin or end

  are invoked (suggesting iteration).

  are invoked (suggesting iteration).

  Cost model:

  Cost model:

  Number of .

  Number of .

  Example:

  Example:

1  unordered_set<int> us;

1  unordered_set<int> us;

...

...

2  int s = 0;

2  int s = 0;

3  for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {

3  for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {

4    s += *it;

4    s += *it;

5  }

5  }

foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N

foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N

indirections and may achieve better data locality.

indirections and may achieve better data locality.

 xreflabel="Vector to List">

 xreflabel="Vector to List">

Vector to List

Vector to List

  Switch:

  Switch:

  _GLIBCXX_PROFILE_VECTOR_TO_LIST.

  _GLIBCXX_PROFILE_VECTOR_TO_LIST.

  Goal: Detect cases where

  Goal: Detect cases where

  vector could be substituted with list for

  vector could be substituted with list for

  better performance.

  better performance.

  Fundamentals:

  Fundamentals:

  Inserting in the middle of a vector is expensive compared to inserting in a

  Inserting in the middle of a vector is expensive compared to inserting in a

  list.

  list.

  Sample runtime reduction:factor up to

  Sample runtime reduction:factor up to

   container size.

   container size.

  Recommendation:Replace vector with list

  Recommendation:Replace vector with list

  at site S.

  at site S.

  To instrument:vector

  To instrument:vector

  operations and access methods.

  operations and access methods.

  Analysis:

  Analysis:

  For each dynamic instance of vector,

  For each dynamic instance of vector,

  record the call context of the constructor.  Record the overhead of each

  record the call context of the constructor.  Record the overhead of each

  insert operation based on current size and insert position.

  insert operation based on current size and insert position.

  Report instance with high insertion overhead.

  Report instance with high insertion overhead.

  Cost model:

  Cost model:

  (Sum(cost(vector::method)) - Sum(cost(list::method)), for

  (Sum(cost(vector::method)) - Sum(cost(list::method)), for

  method in [push_back, insert, erase])

  method in [push_back, insert, erase])

  + (Cost(iterate vector) - Cost(iterate list))

  + (Cost(iterate vector) - Cost(iterate list))

  Example:

  Example:

1  vector<int> v;

1  vector<int> v;

2  for (int i = 0; i < 10000; ++i) {

2  for (int i = 0; i < 10000; ++i) {

3    v.insert(v.begin(), i);

3    v.insert(v.begin(), i);

4  }

4  }

foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000

foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000

operations.

operations.

 xreflabel="List to Vector">

 xreflabel="List to Vector">

List to Vector

List to Vector

  Switch:

  Switch:

  _GLIBCXX_PROFILE_LIST_TO_VECTOR.

  _GLIBCXX_PROFILE_LIST_TO_VECTOR.

  Goal: Detect cases where

  Goal: Detect cases where

  list could be substituted with vector for

  list could be substituted with vector for

  better performance.

  better performance.

  Fundamentals:

  Fundamentals:

  Iterating through a vector is faster than through a list.

  Iterating through a vector is faster than through a list.

  Sample runtime reduction:64%.

  Sample runtime reduction:64%.

  Recommendation:Replace list with vector

  Recommendation:Replace list with vector

  at site S.

  at site S.

  To instrument:vector

  To instrument:vector

  operations and access methods.

  operations and access methods.

  Analysis:

  Analysis:

  Issue the advice if there are no insert operations.

  Issue the advice if there are no insert operations.

  Cost model:

  Cost model:

    (Sum(cost(vector::method)) - Sum(cost(list::method)), for

    (Sum(cost(vector::method)) - Sum(cost(list::method)), for

  method in [push_back, insert, erase])

  method in [push_back, insert, erase])

  + (Cost(iterate vector) - Cost(iterate list))

  + (Cost(iterate vector) - Cost(iterate list))

  Example:

  Example:

1  list<int> l;

1  list<int> l;

...

...

2  int sum = 0;

2  int sum = 0;

3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {

3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {

4    sum += *it;

4    sum += *it;

5  }

5  }

foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect

foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect

memory references.

memory references.

 xreflabel="List to Forward List">

 xreflabel="List to Forward List">

List to Forward List (Slist)

List to Forward List (Slist)

  Switch:

  Switch:

  _GLIBCXX_PROFILE_LIST_TO_SLIST.

  _GLIBCXX_PROFILE_LIST_TO_SLIST.

  Goal: Detect cases where

  Goal: Detect cases where

  list could be substituted with forward_list for

  list could be substituted with forward_list for

  better performance.

  better performance.

  Fundamentals:

  Fundamentals:

  The memory footprint of a forward_list is smaller than that of a list.

  The memory footprint of a forward_list is smaller than that of a list.

  This has beneficial effects on memory subsystem, e.g., fewer cache misses.

  This has beneficial effects on memory subsystem, e.g., fewer cache misses.

  Sample runtime reduction:40%.

  Sample runtime reduction:40%.

  Note that the reduction is only noticeable if the size of the forward_list

  Note that the reduction is only noticeable if the size of the forward_list

  node is in fact larger than that of the list node.  For memory allocators

  node is in fact larger than that of the list node.  For memory allocators

  with size classes, you will only notice an effect when the two node sizes

  with size classes, you will only notice an effect when the two node sizes

  belong to different allocator size classes.

  belong to different allocator size classes.

  Recommendation:Replace list with

  Recommendation:Replace list with

  forward_list at site S.

  forward_list at site S.

  To instrument:list

  To instrument:list

  operations and iteration methods.

  operations and iteration methods.

  Analysis:

  Analysis:

  Issue the advice if there are no backwards traversals

  Issue the advice if there are no backwards traversals

  or insertion before a given node.

  or insertion before a given node.

  Cost model:

  Cost model:

  Always true.

  Always true.

  Example:

  Example:

1  list<int> l;

1  list<int> l;

...

...

2  int sum = 0;

2  int sum = 0;

3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {

3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {

4    sum += *it;

4    sum += *it;

5  }

5  }

foo.cc:1: advice: Change "list" to "forward_list".

foo.cc:1: advice: Change "list" to "forward_list".

 xreflabel="Ordered to Unordered Associative Container">

 xreflabel="Ordered to Unordered Associative Container">

Ordered to Unordered Associative Container

Ordered to Unordered Associative Container

  Switch:

  Switch:

  _GLIBCXX_PROFILE_ORDERED_TO_UNORDERED.

  _GLIBCXX_PROFILE_ORDERED_TO_UNORDERED.

  Goal:  Detect cases where ordered

  Goal:  Detect cases where ordered

  associative containers can be replaced with unordered ones.

  associative containers can be replaced with unordered ones.

  Fundamentals:

  Fundamentals:

  Insert and search are quicker in a hashtable than in

  Insert and search are quicker in a hashtable than in

  a red-black tree.

  a red-black tree.

  Sample runtime reduction:52%.

  Sample runtime reduction:52%.

  Recommendation:

  Recommendation:

  Replace set with unordered_set at site S.

  Replace set with unordered_set at site S.

  To instrument:

  To instrument:

  set, multiset, map,

  set, multiset, map,

  multimap methods.

  multimap methods.

  Analysis:

  Analysis:

  Issue the advice only if we are not using operator ++ on any

  Issue the advice only if we are not using operator ++ on any

  iterator on a particular [multi]set|map.

  iterator on a particular [multi]set|map.

  Cost model:

  Cost model:

  (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for

  (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for

  method in [insert, erase, find])

  method in [insert, erase, find])

  + (Cost(iterate hashtable) - Cost(iterate rbtree))

  + (Cost(iterate hashtable) - Cost(iterate rbtree))

  Example:

  Example:

1  set<int> s;

1  set<int> s;

2  for (int i = 0; i < 100000; ++i) {

2  for (int i = 0; i < 100000; ++i) {

3    s.insert(i);

3    s.insert(i);

4  }

4  }

5  int sum = 0;

5  int sum = 0;

6  for (int i = 0; i < 100000; ++i) {

6  for (int i = 0; i < 100000; ++i) {

7    sum += *s.find(i);

7    sum += *s.find(i);

8  }

8  }

 xreflabel="Algorithms">

 xreflabel="Algorithms">

Algorithms

Algorithms

  Switch:

  Switch:

  _GLIBCXX_PROFILE_ALGORITHMS.

  _GLIBCXX_PROFILE_ALGORITHMS.

 xreflabel="Sorting">

 xreflabel="Sorting">

Sort Algorithm Performance

Sort Algorithm Performance

  Switch:

  Switch:

  _GLIBCXX_PROFILE_SORT.

  _GLIBCXX_PROFILE_SORT.

  Goal: Give measure of sort algorithm

  Goal: Give measure of sort algorithm

  performance based on actual input.  For instance, advise Radix Sort over

  performance based on actual input.  For instance, advise Radix Sort over

  Quick Sort for a particular call context.

  Quick Sort for a particular call context.

  Fundamentals:

  Fundamentals:

  See papers:

  See papers:

  A framework for adaptive algorithm selection in STAPL and

  A framework for adaptive algorithm selection in STAPL and

  Optimizing Sorting with Machine Learning Algorithms.

  Optimizing Sorting with Machine Learning Algorithms.

  Sample runtime reduction:60%.

  Sample runtime reduction:60%.

  Recommendation: Change sort algorithm

  Recommendation: Change sort algorithm

  at site S from X Sort to Y Sort.

  at site S from X Sort to Y Sort.

  To instrument: sort

  To instrument: sort

  algorithm.

  algorithm.

  Analysis:

  Analysis:

  Issue the advice if the cost model tells us that another sort algorithm

  Issue the advice if the cost model tells us that another sort algorithm

  would do better on this input.  Requires us to know what algorithm we

  would do better on this input.  Requires us to know what algorithm we

  are using in our sort implementation in release mode.

  are using in our sort implementation in release mode.

  Cost model:

  Cost model:

  Runtime(algo) for algo in [radix, quick, merge, ...]

  Runtime(algo) for algo in [radix, quick, merge, ...]

  Example:

  Example:

 xreflabel="Data Locality">

 xreflabel="Data Locality">

Data Locality

Data Locality

  Switch:

  Switch:

  _GLIBCXX_PROFILE_LOCALITY.

  _GLIBCXX_PROFILE_LOCALITY.

 xreflabel="Need Software Prefetch">

 xreflabel="Need Software Prefetch">

Need Software Prefetch

Need Software Prefetch

  Switch:

  Switch:

  _GLIBCXX_PROFILE_SOFTWARE_PREFETCH.

  _GLIBCXX_PROFILE_SOFTWARE_PREFETCH.

  Goal: Discover sequences of indirect

  Goal: Discover sequences of indirect

  memory accesses that are not regular, thus cannot be predicted by

  memory accesses that are not regular, thus cannot be predicted by

  hardware prefetchers.

  hardware prefetchers.

  Fundamentals:

  Fundamentals:

  Indirect references are hard to predict and are very expensive when they

  Indirect references are hard to predict and are very expensive when they

  miss in caches.

  miss in caches.

  Sample runtime reduction:25%.

  Sample runtime reduction:25%.

  Recommendation: Insert prefetch

  Recommendation: Insert prefetch

  instruction.

  instruction.

  To instrument: Vector iterator and

  To instrument: Vector iterator and

  access operator [].

  access operator [].

  Analysis:

  Analysis:

  First, get cache line size and page size from system.

  First, get cache line size and page size from system.

  Then record iterator dereference sequences for which the value is a pointer.

  Then record iterator dereference sequences for which the value is a pointer.

  For each sequence within a container, issue a warning if successive pointer

  For each sequence within a container, issue a warning if successive pointer

  addresses are not within cache lines and do not form a linear pattern

  addresses are not within cache lines and do not form a linear pattern

  (otherwise they may be prefetched by hardware).

  (otherwise they may be prefetched by hardware).

  If they also step across page boundaries, make the warning stronger.

  If they also step across page boundaries, make the warning stronger.

  The same analysis applies to containers other than vector.

  The same analysis applies to containers other than vector.

  However, we cannot give the same advice for linked structures, such as list,

  However, we cannot give the same advice for linked structures, such as list,

  as there is no random access to the n-th element.  The user may still be

  as there is no random access to the n-th element.  The user may still be

  able to benefit from this information, for instance by employing frays (user

  able to benefit from this information, for instance by employing frays (user

  level light weight threads) to hide the latency of chasing pointers.

  level light weight threads) to hide the latency of chasing pointers.

  This analysis is a little oversimplified.  A better cost model could be

  This analysis is a little oversimplified.  A better cost model could be

  created by understanding the capability of the hardware prefetcher.

  created by understanding the capability of the hardware prefetcher.

  This model could be trained automatically by running a set of synthetic

  This model could be trained automatically by running a set of synthetic

  cases.

  cases.

  Cost model:

  Cost model:

  Total distance between pointer values of successive elements in vectors

  Total distance between pointer values of successive elements in vectors

  of pointers.

  of pointers.

  Example:

  Example:

1 int zero = 0;

1 int zero = 0;

2 vector<int*> v(10000000, &zero);

2 vector<int*> v(10000000, &zero);

3 for (int k = 0; k < 10000000; ++k) {

3 for (int k = 0; k < 10000000; ++k) {

4   v[random() % 10000000] = new int(k);

4   v[random() % 10000000] = new int(k);

5 }

5 }

6 for (int j = 0; j < 10000000; ++j) {

6 for (int j = 0; j < 10000000; ++j) {

7   count += (*v[j] == 0 ? 0 : 1);

7   count += (*v[j] == 0 ? 0 : 1);

8 }

8 }

foo.cc:7: advice: Insert prefetch instruction.

foo.cc:7: advice: Insert prefetch instruction.

 xreflabel="Linked Structure Locality">

 xreflabel="Linked Structure Locality">

Linked Structure Locality

Linked Structure Locality

  Switch:

  Switch:

  _GLIBCXX_PROFILE_RBTREE_LOCALITY.

  _GLIBCXX_PROFILE_RBTREE_LOCALITY.

  Goal: Give measure of locality of

  Goal: Give measure of locality of

  objects stored in linked structures (lists, red-black trees and hashtables)

  objects stored in linked structures (lists, red-black trees and hashtables)

  with respect to their actual traversal patterns.

  with respect to their actual traversal patterns.

  Fundamentals:Allocation can be tuned

  Fundamentals:Allocation can be tuned

  to a specific traversal pattern, to result in better data locality.

  to a specific traversal pattern, to result in better data locality.

  See paper:

  See paper:

  Custom Memory Allocation for Free.

  Custom Memory Allocation for Free.

  Sample runtime reduction:30%.

  Sample runtime reduction:30%.

  Recommendation:

  Recommendation:

  High scatter score N for container built at site S.

  High scatter score N for container built at site S.

  Consider changing allocation sequence or choosing a structure conscious

  Consider changing allocation sequence or choosing a structure conscious

  allocator.

  allocator.

  To instrument: Methods of all

  To instrument: Methods of all

  containers using linked structures.

  containers using linked structures.

  Analysis:

  Analysis:

  First, get cache line size and page size from system.

  First, get cache line size and page size from system.

  Then record the number of successive elements that are on different line

  Then record the number of successive elements that are on different line

  or page, for each traversal method such as find.  Give advice

  or page, for each traversal method such as find.  Give advice

  only if the ratio between this number and the number of total node hops

  only if the ratio between this number and the number of total node hops

  is above a threshold.

  is above a threshold.

  Cost model:

  Cost model:

  Sum(same_cache_line(this,previous))

  Sum(same_cache_line(this,previous))

  Example:

  Example:

 1  set<int> s;

 1  set<int> s;

 2  for (int i = 0; i < 10000000; ++i) {

 2  for (int i = 0; i < 10000000; ++i) {

 3    s.insert(i);

 3    s.insert(i);

 4  }

 4  }

 5  set<int> s1, s2;

 5  set<int> s1, s2;

 6  for (int i = 0; i < 10000000; ++i) {

 6  for (int i = 0; i < 10000000; ++i) {

 7    s1.insert(i);

 7    s1.insert(i);

 8    s2.insert(i);

 8    s2.insert(i);

 9  }

 9  }

...

...

      // Fast, better locality.

      // Fast, better locality.

10    for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {

10    for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {

11      sum += *it;

11      sum += *it;

12    }

12    }

      // Slow, elements are further apart.

      // Slow, elements are further apart.

13    for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {

13    for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {

14      sum += *it;

14      sum += *it;

15    }

15    }

foo.cc:5: advice: High scatter score NNN for set built here.  Consider changing

foo.cc:5: advice: High scatter score NNN for set built here.  Consider changing

the allocation sequence or switching to a structure conscious allocator.

the allocation sequence or switching to a structure conscious allocator.

 xreflabel="Multithreaded Data Access">

 xreflabel="Multithreaded Data Access">

Multithreaded Data Access

Multithreaded Data Access

  The diagnostics in this group are not meant to be implemented short term.

  The diagnostics in this group are not meant to be implemented short term.

  They require compiler support to know when container elements are written

  They require compiler support to know when container elements are written

  to.  Instrumentation can only tell us when elements are referenced.

  to.  Instrumentation can only tell us when elements are referenced.

  Switch:

  Switch:

  _GLIBCXX_PROFILE_MULTITHREADED.

  _GLIBCXX_PROFILE_MULTITHREADED.

 xreflabel="Dependence Violations at Container Level">

 xreflabel="Dependence Violations at Container Level">

Data Dependence Violations at Container Level

Data Dependence Violations at Container Level

  Switch:

  Switch:

  _GLIBCXX_PROFILE_DDTEST.

  _GLIBCXX_PROFILE_DDTEST.

  Goal: Detect container elements

  Goal: Detect container elements

  that are referenced from multiple threads in the parallel region or

  that are referenced from multiple threads in the parallel region or

  across parallel regions.

  across parallel regions.

  Fundamentals:

  Fundamentals:

  Sharing data between threads requires communication and perhaps locking,

  Sharing data between threads requires communication and perhaps locking,

  which may be expensive.

  which may be expensive.

  Sample runtime reduction:?%.

  Sample runtime reduction:?%.

  Recommendation: Change data

  Recommendation: Change data

  distribution or parallel algorithm.

  distribution or parallel algorithm.

  To instrument: Container access methods

  To instrument: Container access methods

  and iterators.

  and iterators.

  Analysis:

  Analysis:

  Keep a shadow for each container.  Record iterator dereferences and

  Keep a shadow for each container.  Record iterator dereferences and

  container member accesses.  Issue advice for elements referenced by

  container member accesses.  Issue advice for elements referenced by

  multiple threads.

  multiple threads.

  See paper:

  See paper:

  The LRPD test: speculative run-time parallelization of loops with

  The LRPD test: speculative run-time parallelization of loops with

  privatization and reduction parallelization.

  privatization and reduction parallelization.

  Cost model:

  Cost model:

  Number of accesses to elements referenced from multiple threads

  Number of accesses to elements referenced from multiple threads

  Example:

  Example:

 xreflabel="False Sharing">

 xreflabel="False Sharing">

False Sharing

False Sharing

  Switch:

  Switch:

  _GLIBCXX_PROFILE_FALSE_SHARING.

  _GLIBCXX_PROFILE_FALSE_SHARING.

  Goal: Detect elements in the

  Goal: Detect elements in the

  same container which share a cache line, are written by at least one

  same container which share a cache line, are written by at least one

  thread, and accessed by different threads.

  thread, and accessed by different threads.

  Fundamentals: Under these assumptions,

  Fundamentals: Under these assumptions,

  cache protocols require

  cache protocols require

  communication to invalidate lines, which may be expensive.

  communication to invalidate lines, which may be expensive.

  Sample runtime reduction:68%.

  Sample runtime reduction:68%.

  Recommendation: Reorganize container

  Recommendation: Reorganize container

  or use padding to avoid false sharing.

  or use padding to avoid false sharing.

  To instrument: Container access methods

  To instrument: Container access methods

  and iterators.

  and iterators.

  Analysis:

  Analysis:

  First, get the cache line size.

  First, get the cache line size.

  For each shared container, record all the associated iterator dereferences

  For each shared container, record all the associated iterator dereferences

  and member access methods with the thread id.  Compare the address lists

  and member access methods with the thread id.  Compare the address lists

  across threads to detect references in two different threads to the same

  across threads to detect references in two different threads to the same

  cache line.  Issue a warning only if the ratio to total references is

  cache line.  Issue a warning only if the ratio to total references is

  significant.  Do the same for iterator dereference values if they are

  significant.  Do the same for iterator dereference values if they are

  pointers.

  pointers.

  Cost model:

  Cost model:

  Number of accesses to same cache line from different threads.

  Number of accesses to same cache line from different threads.

  Example:

  Example:

1     vector<int> v(2, 0);

1     vector<int> v(2, 0);

2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)

2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)

3     for (i = 0; i < SIZE; ++i) {

3     for (i = 0; i < SIZE; ++i) {

4       v[i % 2] += i;

4       v[i % 2] += i;

5     }

5     }

OMP_NUM_THREADS=2 ./a.out

OMP_NUM_THREADS=2 ./a.out

foo.cc:1: advice: Change container structure or padding to avoid false

foo.cc:1: advice: Change container structure or padding to avoid false

sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.

sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.

 xreflabel="Statistics">

 xreflabel="Statistics">

Statistics

Statistics

Switch:

Switch:

  _GLIBCXX_PROFILE_STATISTICS.

  _GLIBCXX_PROFILE_STATISTICS.

  In some cases the cost model may not tell us anything because the costs

  In some cases the cost model may not tell us anything because the costs

  appear to offset the benefits.  Consider the choice between a vector and

  appear to offset the benefits.  Consider the choice between a vector and

  a list.  When there are both inserts and iteration, an automatic advice

  a list.  When there are both inserts and iteration, an automatic advice

  may not be issued.  However, the programmer may still be able to make use

  may not be issued.  However, the programmer may still be able to make use

  of this information in a different way.

  of this information in a different way.

  This diagnostic will not issue any advice, but it will print statistics for

  This diagnostic will not issue any advice, but it will print statistics for

  each container construction site.  The statistics will contain the cost

  each container construction site.  The statistics will contain the cost

  of each operation actually performed on the container.

  of each operation actually performed on the container.

Bibliography

Bibliography

    </code></pre></td>
        <td class="diff"><pre><code>    <title></code></pre></td>
      </tr>
      <tr class="diffcode">
        <td class="diff"><pre><code>      Perflint: A Context Sensitive Performance Advisor for C++ Programs</code></pre></td>
        <td class="diff"><pre><code>      Perflint: A Context Sensitive Performance Advisor for C++ Programs</code></pre></td>
      </tr>
      <tr class="diffcode">
        <td class="diff"><pre><code>

      Lixia

      Lixia

Liu

Liu

      Silvius

      Silvius

Rus

Rus

        Proceedings of the 2009 International Symposium on Code Generation

        Proceedings of the 2009 International Symposium on Code Generation

        and Optimization

        and Optimization

Browse

Tools

Subversion Repositories openrisc

[/] [openrisc/] [tags/] [gnu-src/] [gcc-4.5.1/] [gcc-4.5.1-or32-1.0rc4/] [libstdc++-v3/] [doc/] [xml/] [manual/] [profile_mode.xml] - Diff between revs 424 and 519