1 |
12 |
jlechner |
<HTML>
|
2 |
|
|
<HEAD>
|
3 |
|
|
<TITLE>Garbage collector scalability</TITLE>
|
4 |
|
|
</HEAD>
|
5 |
|
|
<BODY>
|
6 |
|
|
<H1>Garbage collector scalability</h1>
|
7 |
|
|
In its default configuration, the Boehm-Demers-Weiser garbage collector
|
8 |
|
|
is not thread-safe. It can be made thread-safe for a number of environments
|
9 |
|
|
by building the collector with the appropriate
|
10 |
|
|
<TT>-D</tt><I>XXX</i><TT>-THREADS</tt> compilation
|
11 |
|
|
flag. This has primarily two effects:
|
12 |
|
|
<OL>
|
13 |
|
|
<LI> It causes the garbage collector to stop all other threads when
|
14 |
|
|
it needs to see a consistent memory state.
|
15 |
|
|
<LI> It causes the collector to acquire a lock around essentially all
|
16 |
|
|
allocation and garbage collection activity.
|
17 |
|
|
</ol>
|
18 |
|
|
Since a single lock is used for all allocation-related activity, only one
|
19 |
|
|
thread can be allocating or collecting at one point. This inherently
|
20 |
|
|
limits performance of multi-threaded applications on multiprocessors.
|
21 |
|
|
<P>
|
22 |
|
|
On most platforms, the allocator/collector lock is implemented as a
|
23 |
|
|
spin lock with exponential back-off. Longer wait times are implemented
|
24 |
|
|
by yielding and/or sleeping. If a collection is in progress, the pure
|
25 |
|
|
spinning stage is skipped. This has the advantage that uncontested and
|
26 |
|
|
thus most uniprocessor lock acquisitions are very cheap. It has the
|
27 |
|
|
disadvantage that the application may sleep for small periods of time
|
28 |
|
|
even when there is work to be done. And threads may be unnecessarily
|
29 |
|
|
woken up for short periods. Nonetheless, this scheme empirically
|
30 |
|
|
outperforms native queue-based mutual exclusion implementations in most
|
31 |
|
|
cases, sometimes drastically so.
|
32 |
|
|
<H2>Options for enhanced scalability</h2>
|
33 |
|
|
Version 6.0 of the collector adds two facilities to enhance collector
|
34 |
|
|
scalability on multiprocessors. As of 6.0alpha1, these are supported
|
35 |
|
|
only under Linux on X86 and IA64 processors, though ports to other
|
36 |
|
|
otherwise supported Pthreads platforms should be straightforward.
|
37 |
|
|
They are intended to be used together.
|
38 |
|
|
<UL>
|
39 |
|
|
<LI>
|
40 |
|
|
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
|
41 |
|
|
run the mark phase in parallel in multiple threads, and thus on multiple
|
42 |
|
|
processors. The mark phase typically consumes the large majority of the
|
43 |
|
|
collection time. Thus this largely parallelizes the garbage collector
|
44 |
|
|
itself, though not the allocation process. Currently the marking is
|
45 |
|
|
performed by the thread that triggered the collection, together with
|
46 |
|
|
<I>N</i>-1 dedicated
|
47 |
|
|
threads, where <I>N</i> is the number of processors detected by the collector.
|
48 |
|
|
The dedicated threads are created once at initialization time.
|
49 |
|
|
<P>
|
50 |
|
|
A second effect of this flag is to switch to a more concurrent
|
51 |
|
|
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
|
52 |
|
|
built, and memory can be cleared, by more than one thread concurrently.
|
53 |
|
|
<LI>
|
54 |
|
|
Building the collector with -DTHREAD_LOCAL_ALLOC adds support for thread
|
55 |
|
|
local allocation. It does not, by itself, cause thread local allocation
|
56 |
|
|
to be used. It simply allows the use of the interface in
|
57 |
|
|
<TT>gc_local_alloc.h</tt>.
|
58 |
|
|
<P>
|
59 |
|
|
Memory returned from thread-local allocators is completely interchangeable
|
60 |
|
|
with that returned by the standard allocators. It may be used by other
|
61 |
|
|
threads. The only difference is that, if the thread allocates enough
|
62 |
|
|
memory of a certain kind, it will build a thread-local free list for
|
63 |
|
|
objects of that kind, and allocate from that. This greatly reduces
|
64 |
|
|
locking. The thread-local free lists are refilled using
|
65 |
|
|
<TT>GC_malloc_many</tt>.
|
66 |
|
|
<P>
|
67 |
|
|
An important side effect of this flag is to replace the default
|
68 |
|
|
spin-then-sleep lock to be replace by a spin-then-queue based implementation.
|
69 |
|
|
This <I>reduces performance</i> for the standard allocation functions,
|
70 |
|
|
though it usually improves performance when thread-local allocation is
|
71 |
|
|
used heavily, and thus the number of short-duration lock acquisitions
|
72 |
|
|
is greatly reduced.
|
73 |
|
|
</ul>
|
74 |
|
|
<P>
|
75 |
|
|
The easiest way to switch an application to thread-local allocation is to
|
76 |
|
|
<OL>
|
77 |
|
|
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
|
78 |
|
|
and then include the <TT>gc.h</tt>
|
79 |
|
|
header in each client source file.
|
80 |
|
|
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
|
81 |
|
|
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
|
82 |
|
|
and/or <TT>GC_GCJ_MALLOC</tt>.
|
83 |
|
|
</ol>
|
84 |
|
|
<H2>The Parallel Marking Algorithm</h2>
|
85 |
|
|
We use an algorithm similar to
|
86 |
|
|
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
|
87 |
|
|
Endo, Taura, and Yonezawa</a> at the University of Tokyo.
|
88 |
|
|
However, the data structures and implementation are different,
|
89 |
|
|
and represent a smaller change to the original collector source,
|
90 |
|
|
probably at the expense of extreme scalability. Some of
|
91 |
|
|
the refinements they suggest, <I>e.g.</i> splitting large
|
92 |
|
|
objects, were also incorporated into out approach.
|
93 |
|
|
<P>
|
94 |
|
|
The global mark stack is transformed into a global work queue.
|
95 |
|
|
Unlike the usual case, it never shrinks during a mark phase.
|
96 |
|
|
The mark threads remove objects from the queue by copying them to a
|
97 |
|
|
local mark stack and changing the global descriptor to zero, indicating
|
98 |
|
|
that there is no more work to be done for this entry.
|
99 |
|
|
This removal
|
100 |
|
|
is done with no synchronization. Thus it is possible for more than
|
101 |
|
|
one worker to remove the same entry, resulting in some work duplication.
|
102 |
|
|
<P>
|
103 |
|
|
The global work queue grows only if a marker thread decides to
|
104 |
|
|
return some of its local mark stack to the global one. This
|
105 |
|
|
is done if the global queue appears to be running low, or if
|
106 |
|
|
the local stack is in danger of overflowing. It does require
|
107 |
|
|
synchronization, but should be relatively rare.
|
108 |
|
|
<P>
|
109 |
|
|
The sequential marking code is reused to process local mark stacks.
|
110 |
|
|
Hence the amount of additional code required for parallel marking
|
111 |
|
|
is minimal.
|
112 |
|
|
<P>
|
113 |
|
|
It should be possible to use generational collection in the presence of the
|
114 |
|
|
parallel collector, by calling <TT>GC_enable_incremental()</tt>.
|
115 |
|
|
This does not result in fully incremental collection, since parallel mark
|
116 |
|
|
phases cannot currently be interrupted, and doing so may be too
|
117 |
|
|
expensive.
|
118 |
|
|
<P>
|
119 |
|
|
Gcj-style mark descriptors do not currently mix with the combination
|
120 |
|
|
of local allocation and incremental collection. They should work correctly
|
121 |
|
|
with one or the other, but not both.
|
122 |
|
|
<P>
|
123 |
|
|
The number of marker threads is set on startup to the number of
|
124 |
|
|
available processors (or to the value of the <TT>GC_NPROCS</tt>
|
125 |
|
|
environment variable). If only a single processor is detected,
|
126 |
|
|
parallel marking is disabled.
|
127 |
|
|
<P>
|
128 |
|
|
Note that setting GC_NPROCS to 1 also causes some lock acquisitions inside
|
129 |
|
|
the collector to immediately yield the processor instead of busy waiting
|
130 |
|
|
first. In the case of a multiprocessor and a client with multiple
|
131 |
|
|
simultaneously runnable threads, this may have disastrous performance
|
132 |
|
|
consequences (e.g. a factor of 10 slowdown).
|
133 |
|
|
<H2>Performance</h2>
|
134 |
|
|
We conducted some simple experiments with a version of
|
135 |
|
|
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
|
136 |
|
|
run multiple concurrent client threads in the same address space.
|
137 |
|
|
Each client thread does the same work as the original benchmark, but they share
|
138 |
|
|
a heap.
|
139 |
|
|
This benchmark involves very little work outside of memory allocation.
|
140 |
|
|
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
|
141 |
|
|
under Linux 2.2.12.
|
142 |
|
|
<P>
|
143 |
|
|
Running with a thread-unsafe collector, the benchmark ran in 9
|
144 |
|
|
seconds. With the simple thread-safe collector,
|
145 |
|
|
built with <TT>-DLINUX_THREADS</tt>, the execution time
|
146 |
|
|
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
|
147 |
|
|
(The times for the <TT>malloc</tt>/i<TT>free</tt> version
|
148 |
|
|
with glibc <TT>malloc</tt>
|
149 |
|
|
are 10.51 (standard library, pthreads not linked),
|
150 |
|
|
20.90 (one thread, pthreads linked),
|
151 |
|
|
and 24.55 seconds respectively. The benchmark favors a
|
152 |
|
|
garbage collector, since most objects are small.)
|
153 |
|
|
<P>
|
154 |
|
|
The following table gives execution times for the collector built
|
155 |
|
|
with parallel marking and thread-local allocation support
|
156 |
|
|
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested
|
157 |
|
|
the client using either one or two marker threads, and running
|
158 |
|
|
one or two client threads. Note that the client uses thread local
|
159 |
|
|
allocation exclusively. With -DTHREAD_LOCAL_ALLOC the collector
|
160 |
|
|
switches to a locking strategy that is better tuned to less frequent
|
161 |
|
|
lock acquisition. The standard allocation primitives thus peform
|
162 |
|
|
slightly worse than without -DTHREAD_LOCAL_ALLOC, and should be
|
163 |
|
|
avoided in time-critical code.
|
164 |
|
|
<P>
|
165 |
|
|
(The results using <TT>pthread_mutex_lock</tt>
|
166 |
|
|
directly for allocation locking would have been worse still, at
|
167 |
|
|
least for older versions of linuxthreads.
|
168 |
|
|
With THREAD_LOCAL_ALLOC, we first repeatedly try to acquire the
|
169 |
|
|
lock with pthread_mutex_try_lock(), busy_waiting between attempts.
|
170 |
|
|
After a fixed number of attempts, we use pthread_mutex_lock().)
|
171 |
|
|
<P>
|
172 |
|
|
These measurements do not use incremental collection, nor was prefetching
|
173 |
|
|
enabled in the marker. We used the C version of the benchmark.
|
174 |
|
|
All measurements are in elapsed seconds on an unloaded machine.
|
175 |
|
|
<P>
|
176 |
|
|
<TABLE BORDER ALIGN="CENTER">
|
177 |
|
|
<TR><TH>Number of threads</th><TH>1 marker thread (secs.)</th>
|
178 |
|
|
<TH>2 marker threads (secs.)</th></tr>
|
179 |
|
|
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td>
|
180 |
|
|
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td>
|
181 |
|
|
</table>
|
182 |
|
|
<PP>
|
183 |
|
|
The execution time for the single threaded case is slightly worse than with
|
184 |
|
|
simple locking. However, even the single-threaded benchmark runs faster than
|
185 |
|
|
even the thread-unsafe version if a second processor is available.
|
186 |
|
|
The execution time for two clients with thread local allocation time is
|
187 |
|
|
only 1.4 times the sequential execution time for a single thread in a
|
188 |
|
|
thread-unsafe environment, even though it involves twice the client work.
|
189 |
|
|
That represents close to a
|
190 |
|
|
factor of 2 improvement over the 2 client case with the old collector.
|
191 |
|
|
The old collector clearly
|
192 |
|
|
still suffered from some contention overhead, in spite of the fact that the
|
193 |
|
|
locking scheme had been fairly well tuned.
|
194 |
|
|
<P>
|
195 |
|
|
Full linear speedup (i.e. the same execution time for 1 client on one
|
196 |
|
|
processor as 2 clients on 2 processors)
|
197 |
|
|
is probably not achievable on this kind of
|
198 |
|
|
hardware even with such a small number of processors,
|
199 |
|
|
since the memory system is
|
200 |
|
|
a major constraint for the garbage collector,
|
201 |
|
|
the processors usually share a single memory bus, and thus
|
202 |
|
|
the aggregate memory bandwidth does not increase in
|
203 |
|
|
proportion to the number of processors.
|
204 |
|
|
<P>
|
205 |
|
|
These results are likely to be very sensitive to both hardware and OS
|
206 |
|
|
issues. Preliminary experiments with an older Pentium Pro machine running
|
207 |
|
|
an older kernel were far less encouraging.
|
208 |
|
|
|
209 |
|
|
</body>
|
210 |
|
|
</html>
|