<HTML>
<HEAD>
<TITLE>Debugging Garbage Collector Related Problems</title>
</head>
<BODY>
<H1>Debugging Garbage Collector Related Problems</h1>
This page contains some hints on
debugging issues specific to
the Boehm-Demers-Weiser conservative garbage collector.
It applies both to debugging issues in client code that manifest themselves
as collector misbehavior, and to debugging the collector itself.
<P>
If you suspect a bug in the collector itself, it is strongly recommended
that you try the latest collector release, even if it is labelled as "alpha",
before proceeding.
<H2>Bus Errors and Segmentation Violations</h2>
<P>
If the fault occurred in GC_find_limit, or with incremental collection enabled,
this is probably normal. The collector installs handlers to take care of
such faults, so you will not see them unless you are using a debugger.
Your debugger <I>should</i> allow you to continue.
It's often preferable to tell the debugger to ignore SIGBUS and SIGSEGV
("<TT>handle SIGSEGV SIGBUS nostop noprint</tt>" in gdb,
"<TT>ignore SIGSEGV SIGBUS</tt>" in most versions of dbx)
and set a breakpoint in <TT>abort</tt>.
The collector will call abort if the signal had another cause,
and there was no other handler previously installed.
<P>
We recommend debugging without incremental collection if possible.
(This applies directly to UNIX systems.
Debugging with incremental collection under win32 is worse. See README.win32.)
<P>
If the application generates an unhandled SIGSEGV or equivalent, it may
often be easiest to set the environment variable GC_LOOP_ON_ABORT. On many
platforms, this will cause the collector to loop in a handler when the
SIGSEGV is encountered (or when the collector aborts for some other reason),
and a debugger can then be attached to the looping
process. This sidesteps common operating system problems related
to incomplete core files for multithreaded applications, etc.
<H2>Other Signals</h2>
On most platforms, the multithreaded version of the collector needs one or
two other signals for internal use by the collector in stopping threads.
It is normally wise to tell the debugger to ignore these. On Linux,
the collector currently uses SIGPWR and SIGXCPU by default.
<H2>Warning Messages About Needing to Allocate Blacklisted Blocks</h2>
The garbage collector generates warning messages of the form
<PRE>
Needed to allocate blacklisted block at 0x...
</pre>
or
<PRE>
Repeated allocation of very large block ...
</pre>
when it needs to allocate a block at a location that it knows to be
referenced by a false pointer. These false pointers can be either permanent
(<I>e.g.</i> a static integer variable that never changes) or temporary.
In the latter case, the warning is largely spurious, and the block will
eventually be reclaimed normally.
In the former case, the program will still run correctly, but the block
will never be reclaimed. Unless the block is intended to be
permanent, the warning indicates a memory leak.
<P>
The following suggestions may help in dealing with these warnings:
<OL>
<LI>Ignore these warnings while you are using GC_DEBUG. Some of the routines
mentioned below don't have debugging equivalents. (Alternatively, write
the missing routines and send them to me.)
<LI>Replace allocator calls that request large blocks with calls to
<TT>GC_malloc_ignore_off_page</tt> or
<TT>GC_malloc_atomic_ignore_off_page</tt>. You may want to set a
breakpoint in <TT>GC_default_warn_proc</tt> to help you identify such calls.
Make sure that a pointer to somewhere near the beginning of the resulting block
is maintained in a (preferably volatile) variable as long as
the block is needed.
<LI>
If the large blocks are allocated with realloc, we suggest instead allocating
them with something like the following. Note that the realloc size increment
should be fairly large (e.g. a factor of 3/2) for this to exhibit reasonable
performance. But we all know we should do that anyway.
<PRE>
/* Requires gc.h, and string.h for memcpy. */
void * big_realloc(void *p, size_t new_size)
{
    size_t old_size;
    void * result;

    /* Small requests are left to the ordinary allocator. */
    if (new_size <= 10000) return(GC_realloc(p, new_size));
    /* A null p means there is nothing to copy or free. */
    if (p == 0) return(GC_malloc_ignore_off_page(new_size));
    old_size = GC_size(p);
    if (new_size <= old_size) return(p);
    result = GC_malloc_ignore_off_page(new_size);
    if (result == 0) return(0);
    /* Copy the old contents, then release the old block explicitly. */
    memcpy(result,p,old_size);
    GC_free(p);
    return(result);
}
</pre>

<LI> In the unlikely case that even relatively small object
(<20KB) allocations are triggering these warnings, then your address
space contains lots of "bogus pointers", i.e. values that appear to
be pointers but aren't. Usually this can be solved by using GC_malloc_atomic
or the routines in gc_typed.h to allocate large pointer-free regions of bitmaps, etc.
Sometimes the problem can be solved with trivial changes of encoding
in certain values. It is possible to identify the source of the bogus
pointers by building the collector with <TT>-DPRINT_BLACK_LIST</tt>,
which will cause it to print the "bogus pointers", along with their location.

<LI> If you get only a fixed number of these warnings, you are probably only
introducing a bounded leak by ignoring them. If the data structures being
allocated are intended to be permanent, then it is also safe to ignore them.
The warnings can be turned off by calling GC_set_warn_proc with a procedure
that ignores these warnings (e.g. by doing absolutely nothing).
</ol>
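<P>
Because <TT>big_realloc</tt> copies the whole block each time it has to grow it,
the caller's growth policy largely determines the total copying cost.
The following is only a minimal sketch of a geometric policy using the factor
of 3/2 suggested above. The helper name <TT>next_capacity</tt> is hypothetical
and is not part of the collector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the collector): choose the size to
 * request when a buffer of `current` bytes must hold at least `needed`
 * bytes. Growing by a factor of 3/2 keeps the total number of bytes
 * copied proportional to the final size. */
size_t next_capacity(size_t current, size_t needed)
{
    size_t grown;

    assert(needed > 0);
    if (needed <= current) return current;   /* already large enough */
    grown = current + current / 2;           /* factor of 3/2 */
    if (grown < current) return needed;      /* size_t overflow: fall back */
    return grown > needed ? grown : needed;
}
```

<P>
A caller might then pass <TT>next_capacity(old_size, needed)</tt> as the
<TT>new_size</tt> argument of <TT>big_realloc</tt>, so that a long run of small
appends triggers only a logarithmic number of copies.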

<H2>The Collector References a Bad Address in <TT>GC_malloc</tt></h2>

This typically happens while the collector is trying to remove an entry from
its free list, and the free list pointer is bad because the free list link
in the last allocated object was bad.
<P>
With > 99% probability, you wrote past the end of an allocated object.
Try setting <TT>GC_DEBUG</tt> before including <TT>gc.h</tt> and
allocating with <TT>GC_MALLOC</tt>. This will try to detect such
overwrite errors.

<H2>Unexpectedly Large Heap</h2>

Unexpected heap growth can be due to one of the following:
<OL>
<LI> Data structures that are being unintentionally retained. This
is commonly caused by data structures that are no longer being used,
but were not cleared, or by caches growing without bounds.
<LI> Pointer misidentification. The garbage collector is interpreting
integers or other data as pointers and retaining the "referenced"
objects. A common symptom is that GC_dump() shows much of the heap
as black-listed.
<LI> Heap fragmentation. This should never result in unbounded growth,
but it may account for larger heaps. This is most commonly caused
by allocation of large objects. On some platforms it can be reduced
by building with -DUSE_MUNMAP, which will cause the collector to unmap
memory corresponding to pages that have not been recently used.
<LI> Per-object overhead. This is usually a relatively minor effect, but
it may be worth considering. If the collector recognizes interior
pointers, object sizes are increased by a byte, so that one-past-the-end
pointers are correctly recognized. The collector can be configured not to do this
(<TT>-DDONT_ADD_BYTE_AT_END</tt>).
<P>
The collector rounds up object sizes so the result fits well into the
chunk size (<TT>HBLKSIZE</tt>, normally 4K on 32-bit machines, 8K
on 64-bit machines) used by the collector. Thus it may be worth avoiding
objects of size 2K + 1 (or 2K if a byte is being added at the end).
</ol>
The last two cases can often be identified by looking at the output
of a call to <TT>GC_dump()</tt>. Among other things, it will print the
list of free heap blocks, and a very brief description of all chunks in
the heap, the object sizes they correspond to, and how many live objects
were found in the chunk at the last collection.
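<P>
The advice about objects of size 2K + 1 can be checked with a little arithmetic.
The sketch below uses a deliberately simplified model, which is an assumption
and not the collector's actual code: a chunk holds only whole objects of a
single size. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model (an assumption, not the collector's actual code):
 * a chunk of hblk_size bytes holds only whole objects of one size, so
 * an object larger than half a chunk gets a chunk to itself. */
size_t objects_per_chunk(size_t hblk_size, size_t object_size)
{
    assert(object_size > 0 && object_size <= hblk_size);
    return hblk_size / object_size;
}

/* Bytes of the chunk left unused under the same model. */
size_t wasted_bytes_per_chunk(size_t hblk_size, size_t object_size)
{
    return hblk_size - objects_per_chunk(hblk_size, object_size) * object_size;
}
```

<P>
Under this model a 4K chunk holds two 2048-byte objects with nothing left over,
but only one 2049-byte object, leaving 2047 bytes unused. When a byte is added
at the end, a 2048-byte request already falls into the second case.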
<P>
Growing data structures can usually be identified by
<OL>
<LI> Building the collector with <TT>-DKEEP_BACK_PTRS</tt>,
<LI> Preferably using debugging allocation (defining <TT>GC_DEBUG</tt>
before including <TT>gc.h</tt> and allocating with <TT>GC_MALLOC</tt>),
so that objects will be identified by their allocation site,
<LI> Running the application long enough so
that most of the heap is composed of "leaked" memory, and
<LI> Then calling <TT>GC_generate_random_backtrace()</tt> from backptr.h
a few times to determine why some randomly sampled objects in the heap are
being retained.
</ol>
<P>
The same technique can often be used to identify problems with false
pointers, by noting whether the reference chains printed by
<TT>GC_generate_random_backtrace()</tt> involve any misidentified pointers.
An alternate technique is to build the collector with
<TT>-DPRINT_BLACK_LIST</tt>, which will cause it to report values that
almost, but not quite, look like heap pointers. It is very likely that
actual false pointers will come from similar sources.
<P>
In the unlikely case that false pointers are an issue, the problem can usually
be resolved using one or more of the following techniques:
<OL>
<LI> Use <TT>GC_malloc_atomic</tt> for objects containing no pointers.
This is especially important for large arrays containing compressed data,
pseudo-random numbers, and the like. It is also likely to improve GC
performance, perhaps drastically so if the application is paging.
<LI> If you allocate large objects containing only
one or two pointers at the beginning, either try the typed allocation
primitives in <TT>gc_typed.h</tt>, or separate out the pointer-free component.
<LI> Consider using <TT>GC_malloc_ignore_off_page()</tt>
to allocate large objects. (See <TT>gc.h</tt> and above for details.
Large means > 100K in most environments.)
<LI> If your heap size is larger than 100MB or so, build the collector with
-DLARGE_CONFIG. This allows the collector to keep more precise black-list
information.
<LI> If you are using heaps close to, or larger than, a gigabyte on a 32-bit
machine, you may want to consider moving to a platform with 64-bit pointers.
This is very likely to resolve any false pointer issues.
</ol>
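<P>
The last suggestion can be motivated by a back-of-the-envelope estimate.
The sketch below assumes, unrealistically, that a scanned word is uniformly
random over all values of the pointer width, so the result is an
order-of-magnitude guide only. The function name is hypothetical.

```c
#include <assert.h>

/* Rough estimate under a loudly stated assumption: the scanned word is
 * uniformly random over all pointer_bits-wide values. Real data is not
 * uniform. Returns the chance that such a word falls inside a heap of
 * heap_bytes bytes. */
double false_pointer_chance(double heap_bytes, int pointer_bits)
{
    double address_space = 1.0;
    int i;

    assert(heap_bytes >= 0.0 && pointer_bits > 0);
    for (i = 0; i < pointer_bits; i++) {
        address_space *= 2.0;                /* builds 2^pointer_bits */
    }
    return heap_bytes / address_space;
}
```

<P>
Under that assumption a one-gigabyte heap covers a quarter of a 32-bit address
space, so about one random word in four would look like a heap pointer. With
64-bit pointers the same heap covers a fraction of roughly 6 in 10^11.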
<H2>Prematurely Reclaimed Objects</h2>
The usual symptom of this is a segmentation fault, or an obviously overwritten
value in a heap object. This should, of course, be impossible. In practice,
it may happen for reasons like the following:
<OL>
<LI> The collector did not intercept the creation of threads correctly in
a multithreaded application, <I>e.g.</i> because the client called
<TT>pthread_create</tt> without including <TT>gc.h</tt>, which redefines it.
<LI> The last pointer to an object in the garbage collected heap was stored
somewhere where the collector couldn't see it, <I>e.g.</i> in an
object allocated with system <TT>malloc</tt>, in certain types of
<TT>mmap</tt>ed files,
or in some data structure visible only to the OS. (On some platforms,
thread-local storage is one of these.)
<LI> The last pointer to an object was somehow disguised, <I>e.g.</i> by
XORing it with another pointer.
<LI> Incorrect use of <TT>GC_malloc_atomic</tt> or typed allocation.
<LI> An incorrect <TT>GC_free</tt> call.
<LI> The client program overwrote an internal garbage collector data structure.
<LI> A garbage collector bug.
<LI> (Empirically less likely than any of the above.) A compiler optimization
that disguised the last pointer.
</ol>
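<P>
To make the "disguised pointer" case concrete, the sketch below combines two
pointers the way an XOR-linked list would. It does not use the collector at
all, and the function names are hypothetical. It only shows that the stored
value is generally not the address of either object, which is why a
conservative scan cannot recognize it.

```c
#include <assert.h>
#include <stdint.h>

/* Combine two pointers the way an XOR-linked list does. The result is
 * an integer that, in general, is not the address of either object, so
 * a conservative scan has nothing to recognize. */
uintptr_t disguise(const void *p, const void *q)
{
    return (uintptr_t)p ^ (uintptr_t)q;
}

/* Recover p from the disguised value and the other pointer q. */
void *reveal(uintptr_t disguised, const void *q)
{
    return (void *)(disguised ^ (uintptr_t)q);
}
```

<P>
If the only remaining copy of a heap pointer is kept in this form, the object
looks unreachable to the collector even though the program can still
recover its address.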
The following relatively simple techniques should be tried first to narrow
down the problem:
<OL>
<LI> If you are using the incremental collector, try turning it off for
debugging.
<LI> If you are using shared libraries, try linking statically. If that works,
ensure that DYNAMIC_LOADING is defined on your platform.
<LI> Try to reproduce the problem with fully debuggable unoptimized code.
This will eliminate the last possibility, as well as making debugging easier.
<LI> Try replacing any suspect typed allocation and <TT>GC_malloc_atomic</tt>
calls with calls to <TT>GC_malloc</tt>.
<LI> Try removing any GC_free calls (<I>e.g.</i> with a suitable
<TT>#define</tt>).
<LI> Rebuild the collector with <TT>-DGC_ASSERTIONS</tt>.
<LI> If the following works on your platform (i.e. if gctest still works
if you do this), try building the collector with
<TT>-DREDIRECT_MALLOC=GC_malloc_uncollectable</tt>. This will cause
the collector to scan memory allocated with malloc.
</ol>
If all else fails, you will have to attack this with a debugger.
Suggested steps:
<OL>
<LI> Call <TT>GC_dump()</tt> from the debugger around the time of the failure. Verify
that the collector's idea of the root set (i.e. static data regions which
it should scan for pointers) looks plausible. If not, i.e. if it doesn't
include some static variables, report this as
a collector bug. Be sure to describe your platform precisely, since this sort
of problem is nearly always very platform dependent.
<LI> Especially if the failure is not deterministic, try to isolate it to
a relatively small test case.
<LI> Set a breakpoint in <TT>GC_finish_collection</tt>. This is a good
point to examine what has been marked, i.e. found reachable, by the
collector.
<LI> If the failure is deterministic, run the process
up to the last collection before the failure.
Note that the variable <TT>GC_gc_no</tt> counts collections and can be used
to set a conditional breakpoint in the right one. It is incremented just
before the call to GC_finish_collection.
If object <TT>p</tt> was prematurely recycled, it may be helpful to
look at <TT>*GC_find_header(p)</tt> at the failure point.
The <TT>hb_last_reclaimed</tt> field will identify the collection number
during which its block was last swept.
<LI> Verify that the offending object still has its correct contents at
this point.
Then call <TT>GC_is_marked(p)</tt> from the debugger to verify that the
object has not been marked, and is about to be reclaimed. Note that
<TT>GC_is_marked(p)</tt> expects the real address of an object (the
address of the debug header if there is one), and thus it may
be more appropriate to call <TT>GC_is_marked(GC_base(p))</tt>
instead.
<LI> Determine a path from a root, i.e. static variable, stack, or
register variable,
to the reclaimed object. Call <TT>GC_is_marked(q)</tt> for each object
<TT>q</tt> along the path, trying to locate the first unmarked object, say
<TT>r</tt>.
<LI> If <TT>r</tt> is pointed to by a static root,
verify that the location
pointing to it is part of the root set printed by <TT>GC_dump()</tt>. If it
is on the stack in the main (or only) thread, verify that
<TT>GC_stackbottom</tt> is set correctly to the base of the stack. If it is
in another thread stack, check the collector's thread data structure
(<TT>GC_thread[]</tt> on several platforms) to make sure that stack bounds
are set correctly.
<LI> If <TT>r</tt> is pointed to by heap object <TT>s</tt>, check that the
collector's layout description for <TT>s</tt> is such that the pointer field
will be scanned. Print <TT>*GC_find_header(s)</tt> to look at the descriptor
for the heap chunk. The <TT>hb_descr</tt> field specifies the layout
of objects in that chunk. See gc_mark.h for the meaning of the descriptor.
(If its low-order 2 bits are zero, then it is just the length of the
object prefix to be scanned. This form is always used for objects allocated
with <TT>GC_malloc</tt> or <TT>GC_malloc_atomic</tt>.)
<LI> If the failure is not deterministic, you may still be able to apply some
of the above techniques at the point of failure. But remember that objects
allocated since the last collection will not have been marked, even if the
collector is functioning properly. On some platforms, the collector
can be configured to save call chains in objects for debugging.
Enabling this feature will also cause it to save the call stack at the
point of the last GC in GC_arrays._last_stack.
<LI> When looking at GC internal data structures remember that a number
of <TT>GC_</tt><I>xxx</i> variables are really macro defined to
<TT>GC_arrays._</tt><I>xxx</i>, so that
the collector can avoid scanning them.
</ol>
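<P>
When reading an <TT>hb_descr</tt> value in the debugger, the check on the
low-order 2 bits described above can be written out as follows. This is only
a sketch: the constants are local stand-ins believed to correspond to the tag
definitions in gc_mark.h, <TT>unsigned long</tt> stands in for the collector's
word type, and the function names are hypothetical.

```c
#include <assert.h>

/* Local stand-ins for the tag constants described in gc_mark.h; they
 * are redefined here only so that the sketch compiles on its own. */
#define DS_TAG_BITS 2
#define DS_TAGS     ((1UL << DS_TAG_BITS) - 1)   /* mask for low-order 2 bits */
#define DS_LENGTH   0UL                          /* plain length descriptor */

/* Nonzero if descr is a plain length descriptor. */
int descr_is_length(unsigned long descr)
{
    return (descr & DS_TAGS) == DS_LENGTH;
}

/* Number of leading bytes of the object that will be scanned.
 * Only meaningful when descr_is_length(descr) holds. */
unsigned long scanned_prefix_bytes(unsigned long descr)
{
    assert(descr_is_length(descr));
    return descr;
}
```

<P>
If the pointer field of <TT>s</tt> that refers to <TT>r</tt> lies beyond the
scanned prefix, the collector will not follow it, which would explain why
<TT>r</tt> was not marked.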
</body>
</html>