NOTE: ksymoops is useless on 2.6.  Please use the Oops in its original
format (from dmesg, etc).  Ignore any references in this or other docs to
"decoding the Oops" or "running it through ksymoops".  If you post an Oops
from 2.6 that has been run through ksymoops, people will just tell you to
repost it.

Quick Summary
-------------

Find the Oops and send it to the maintainer of the kernel area that seems to be
involved with the problem.  Don't worry too much about getting the wrong person.
If you are unsure send it to the person responsible for the code relevant to
what you were doing.  If it occurs repeatably try and describe how to recreate
it.  That's worth even more than the oops.

If you are totally stumped as to whom to send the report, send it to
linux-kernel@vger.kernel.org.  Thanks for your help in making Linux as
stable as humanly possible.

Where is the Oops?
------------------

Normally the Oops text is read from the kernel buffers by klogd and
handed to syslogd which writes it to a syslog file, typically
/var/log/messages (depends on /etc/syslog.conf).  Sometimes klogd dies,
in which case you can run dmesg > file to read the data from the kernel
buffers and save it.  Or you can cat /proc/kmsg > file, however you
have to break in to stop the transfer, kmsg is a "never ending file".

If the machine has crashed so badly that you cannot enter commands or
the disk is not available then you have three options :-

(1) Hand copy the text from the screen and type it in after the machine
    has restarted.  Messy but it is the only option if you have not
    planned for a crash.  Alternatively, you can take a picture of
    the screen with a digital camera - not nice, but better than
    nothing.  If the messages scroll off the top of the console, you
    may find that booting with a higher resolution (eg, vga=791)
    will allow you to read more of the text.
    (Caveat: This needs vesafb, so won't help for 'early' oopses.)

(2) Boot with a serial console (see Documentation/serial-console.txt),
    run a null modem to a second machine and capture the output there
    using your favourite communication program.  Minicom works well.

(3) Use Kdump (see Documentation/kdump/kdump.txt), and extract the
    kernel ring buffer from old memory using the dmesg gdbmacro in
    Documentation/kdump/gdbmacros.txt.

Full Information
----------------

NOTE: the message from Linus below applies to the 2.4 kernel.  I have
preserved it for historical reasons, and because some of the information
in it still applies.  Especially, please ignore any references to
ksymoops.

From: Linus Torvalds <torvalds@osdl.org>

How to track down an Oops.. [originally a mail to linux-kernel]

The main trick is having 5 years of experience with those pesky oops
messages ;-)

Actually, there are things you can do that make this easier.  I have two
separate approaches:

	gdb /usr/src/linux/vmlinux
	gdb> disassemble <offending_function>

That's the easy way to find the problem, at least if the bug-report is
well made (like this one was - run through ksymoops to get the
information of which function and the offset in the function that it
happened in).

Oh, it helps if the report happens on a kernel that is compiled with the
same compiler and similar setups.

The other thing to do is disassemble the "Code:" part of the bug report:
ksymoops will do this too with the correct tools, but if you don't have
the tools you can just do a silly program:

	char str[] = "\xXX\xXX\xXX...";
	main(){}

and compile it with gcc -g and then do "disassemble str" (where the "XX"
stuff are the values reported by the Oops - you can just cut-and-paste
and do a replace of spaces to "\x" - that's what I do, as I'm too lazy
to write a program to automate this all).

Alternatively, you can use the shell script in scripts/decodecode.
Its usage is:

	decodecode < oops.txt

The hex bytes that follow "Code:" may (in some architectures) have a series
of bytes that precede the
current instruction pointer as well as bytes at and
following the current instruction pointer.  In some cases, one instruction
byte or word is surrounded by <> or (), as in "<86>" or "(f00d)".  These
<> or () markings indicate the current instruction pointer.  Example from
i386, split into multiple lines for readability:

Code: f9 0f 8d f9 00 00 00 8d 42 0c e8 dd 26 11 c7 a1 60 ea 2b f9 8b 50 08 a1
64 ea 2b f9 8d 34 82 8b 1e 85 db 74 6d 8b 15 60 ea 2b f9 <8b> 43 04 39 42 54
7e 04 40 89 42 54 8b 43 04 3b 05 00 f6 52 c0

Finally, if you want to see where the code comes from, you can do

	cd /usr/src/linux
	make fs/buffer.s	# or whatever file the bug happened in

and then you get a better idea of what happens than with the gdb
disassembly.

Now, the trick is just then to combine all the data you have: the C
sources (and general knowledge of what it _should_ do), the assembly
listing and the code disassembly (and additionally the register dump you
also get from the "oops" message - that can be useful to see _what_ the
corrupted pointers were, and when you have the assembler listing you can
also match the other registers to whatever C expressions they were used
for).

Essentially, you just look at what doesn't match (in this case it was the
"Code" disassembly that didn't match with what the compiler generated).
Then you need to find out _why_ they don't match.  Often it's simple - you
see that the code uses a NULL pointer and then you look at the code and
wonder how the NULL pointer got there, and if it's a valid thing to do
you just check against it..

Now, if somebody gets the idea that this is time-consuming and requires
some small amount of concentration, you're right.
Which is why I will
mostly just ignore any panic reports that don't have the symbol table
info etc looked up: it simply gets too hard to look it up (I have some
programs to search for specific patterns in the kernel code segment, and
sometimes I have been able to look up those kinds of panics too, but
that really requires pretty good knowledge of the kernel just to be able
to pick out the right sequences etc..)

_Sometimes_ it happens that I just see the disassembled code sequence
from the panic, and I know immediately where it's coming from.  That's when
I get worried that I've been doing this for too long ;-)

		Linus


---------------------------------------------------------------------------
Notes on Oops tracing with klogd:

In order to help Linus and the other kernel developers there has been
substantial support incorporated into klogd for processing protection
faults.  In order to have full support for address resolution at least
version 1.3-pl3 of the sysklogd package should be used.

When a protection fault occurs the klogd daemon automatically
translates important addresses in the kernel log messages to their
symbolic equivalents.  This translated kernel message is then
forwarded through whatever reporting mechanism klogd is using.  The
protection fault message can be simply cut out of the message files
and forwarded to the kernel developers.

Two types of address resolution are performed by klogd.  The first is
static translation and the second is dynamic translation.  Static
translation uses the System.map file in much the same manner that
ksymoops does.  In order to do static translation the klogd daemon
must be able to find a system map file at daemon initialization time.
See the klogd man page for information on how klogd searches for map
files.

Dynamic address translation is important when kernel loadable modules
are being used.
Since memory for kernel modules is allocated from the
kernel's dynamic memory pools there are no fixed locations for either
the start of the module or for functions and symbols in the module.

The kernel supports system calls which allow a program to determine
which modules are loaded and their location in memory.  Using these
system calls the klogd daemon builds a symbol table which can be used
to debug a protection fault which occurs in a loadable kernel module.

At the very minimum klogd will provide the name of the module which
generated the protection fault.  There may be additional symbolic
information available if the developer of the loadable module chose to
export symbol information from the module.

Since the kernel module environment can be dynamic there must be a
mechanism for notifying the klogd daemon when a change in module
environment occurs.  There are command line options available which
allow klogd to signal the currently executing daemon that symbol
information should be refreshed.  See the klogd manual page for more
information.

A patch is included with the sysklogd distribution which modifies the
modules-2.0.0 package to automatically signal klogd whenever a module
is loaded or unloaded.
Applying this patch provides essentially
seamless support for debugging protection faults which occur with
kernel loadable modules.

The following is an example of a protection fault in a loadable module
processed by klogd:

---------------------------------------------------------------------------
Aug 29 09:51:01 blizard kernel: Unable to handle kernel paging request at virtual address f15e97cc
Aug 29 09:51:01 blizard kernel: current->tss.cr3 = 0062d000, %cr3 = 0062d000
Aug 29 09:51:01 blizard kernel: *pde = 00000000
Aug 29 09:51:01 blizard kernel: Oops: 0002
Aug 29 09:51:01 blizard kernel: CPU:    0
Aug 29 09:51:01 blizard kernel: EIP:    0010:[oops:_oops+16/3868]
Aug 29 09:51:01 blizard kernel: EFLAGS: 00010212
Aug 29 09:51:01 blizard kernel: eax: 315e97cc   ebx: 003a6f80   ecx: 001be77b   edx: 00237c0c
Aug 29 09:51:01 blizard kernel: esi: 00000000   edi: bffffdb3   ebp: 00589f90   esp: 00589f8c
Aug 29 09:51:01 blizard kernel: ds: 0018   es: 0018   fs: 002b   gs: 002b   ss: 0018
Aug 29 09:51:01 blizard kernel: Process oops_test (pid: 3374, process nr: 21, stackpage=00589000)
Aug 29 09:51:01 blizard kernel: Stack: 315e97cc 00589f98 0100b0b4 bffffed4 0012e38e 00240c64 003a6f80 00000001
Aug 29 09:51:01 blizard kernel:        00000000 00237810 bfffff00 0010a7fa 00000003 00000001 00000000 bfffff00
Aug 29 09:51:01 blizard kernel:        bffffdb3 bffffed4 ffffffda 0000002b 0007002b 0000002b 0000002b 00000036
Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128]
Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3
---------------------------------------------------------------------------

Dr. G.W. Wettstein           Oncology Research Div. Computing Facility
Roger Maris Cancer Center    INTERNET: greg@wind.rmcc.com
820 4th St. N.
Fargo, ND  58122
Phone: 701-234-7556

---------------------------------------------------------------------------
Tainted kernels:

Some oops reports contain the string 'Tainted: ' after the program
counter.
This indicates that the kernel has been tainted by some
mechanism.  The string is followed by a series of position-sensitive
characters, each representing a particular tainted value.

  1: 'G' if all modules loaded have a GPL or compatible license, 'P' if
     any proprietary module has been loaded.  Modules without a
     MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by
     insmod as GPL compatible are assumed to be proprietary.

  2: 'F' if any module was force loaded by "insmod -f", ' ' if all
     modules were loaded normally.

  3: 'S' if the oops occurred on an SMP kernel running on hardware that
     hasn't been certified as safe to run multiprocessor.
     Currently this occurs only on various Athlons that are not
     SMP capable.

  4: 'R' if a module was force unloaded by "rmmod -f", ' ' if all
     modules were unloaded normally.

  5: 'M' if any processor has reported a Machine Check Exception,
     ' ' if no Machine Check Exceptions have occurred.

  6: 'B' if a page-release function has found a bad page reference or
     some unexpected page flags.

  7: 'U' if a user or user application specifically requested that the
     Tainted flag be set, ' ' otherwise.

  8: 'D' if the kernel has died recently, i.e. there was an OOPS or BUG.

The primary reason for the 'Tainted: ' string is to tell kernel
debuggers if this is a clean kernel or if anything unusual has
occurred.  Tainting is permanent: even if an offending module is
unloaded, the tainted value remains to indicate that the kernel is not
trustworthy.
