Opened 17 years ago

Closed 16 years ago

#172 closed defect (fixed)

kernel panic during mindi process when using mondo rescue on x86_64 HP Proliant dl360g4 (RHEL4.5)

Reported by: cdmaestas Owned by: Bruno Cornec
Priority: high Milestone: 2.2.5
Component: mondo Version: 2.2.3
Severity: critical Keywords:
Cc: simo.syvajarvi@…, jgatenc@…, cdmaest@…, rdscott@…, christopher.maestas@…, jatencio@…, rdscott@…, christmasboy_81@…, shane_chartrand@…, alexrixhardson@…

Description

When booting the Mondo Rescue ISO image, we get:

RAX: ffffffff803f0520 RBX: 0000000000000000 RCX: 000000000000003f
RDX: 000001007b8bfef8 RSI: 000001007ba517c8 RDI: 000001007fbf9180
RBP: 000001007b8bfef8 R08: 0000000000000000 R09: 0000000000631e98
R10: 000000000000000c R11: 0000000000000000 R12: 0000007fbfffff47
R13: 0000000000000000 R14: 00000000006376b8 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff804ed700(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process find-and-mount- (pid: 1034, threadinfo 000001007b8be000, task 000001007bb70030)
Stack: ffffffff801823d7 000001007ba517c8 000001007fbf9180 0000000000000202
       ffffffff8012442d 0000000100000001 ffffffff00000000 0000000000000007
       0000000b0000000e 000001007b8bff18
Call Trace:<ffffffff801823d7>{vfs_stat64+47} <ffffffff8012442d>{do_page_fault+577}
       <ffffffff8013bdff>{do_wait+3350} <ffffffff80182717>{sys_newstat+17}
       <ffffffff801342a4>{default_wake_function+0} <ffffffff80110d91>{error_exit+0}
       <ffffffff8011026a>{system_call+126}

Code:  Bad RIP value.
RIP [<0000000000000000>] RSP <000001007b8bfe50>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Oops

All the firmware is up to date. We've been able to use Mondo Rescue without issues on a DL360 G3 (32-bit platform) running RHEL 4.4.

Attachments (1)

mondoarchive.log.gz (22.4 KB) - added by jatencio 17 years ago.
mondo archive log file


Change History (40)

comment:1 by Bruno Cornec, 17 years ago

Status: new → assigned

Could you please provide the /var/log/mondoarchive.log file (or /var/log/mondo-archive.log, depending on the version used) from the original system?

Could you also try the advice in this FAQ entry: http://trac.mondorescue.org/wiki/FAQ#Q26/Whyismykerneldoingapanicatrestoretimewhenitworksperfectlyatarchivetime

by jatencio, 17 years ago

Attachment: mondoarchive.log.gz added

mondo archive log file

comment:2 by jatencio, 17 years ago

I modified the following line in /usr/sbin/mindi:

ADDITIONAL_BOOT_PARAMS="apm=off devfs=nomount noresume selinux=0 barrier=off"

However, we still get the same kernel panic when attempting to boot the mondorescue image. I also uploaded mondoarchive.log to this ticket.
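The mindi edit described in this comment could be scripted along these lines. This is a minimal sketch, not part of Mondo Rescue: the function name set_boot_params is hypothetical, and it assumes the ADDITIONAL_BOOT_PARAMS= assignment starts at the beginning of a line in /usr/sbin/mindi.

```shell
#!/bin/sh
# Sketch only: rewrite the ADDITIONAL_BOOT_PARAMS line of a mindi script.
# Assumes the assignment starts at column 0, as quoted in this ticket.
# Hypothetical usage: set_boot_params FILE "PARAMS"
set_boot_params() {
    file=$1
    params=$2
    # Keep a backup so the original mindi can be restored.
    cp "$file" "$file.bak" || return 1
    # '|' is the sed delimiter, so PARAMS must not contain '|', '&' or '\'.
    # Writing through '>' keeps the file's existing permissions (it stays executable).
    sed "s|^ADDITIONAL_BOOT_PARAMS=.*|ADDITIONAL_BOOT_PARAMS=\"$params\"|" \
        "$file.bak" > "$file"
}
```

For example: set_boot_params /usr/sbin/mindi "apm=off devfs=nomount noresume selinux=0 barrier=off". The unmodified script is left as /usr/sbin/mindi.bak.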

comment:3 by cdmaestas, 17 years ago

Bruno,

I hope you've been able to get the data we uploaded.

Thanks, -cdm

comment:4 by jatencio, 17 years ago

An update: I was able to install the archive using a 32-bit mondorescue.iso image. When it gets to the point where it attempts to install GRUB, it fails, and I cannot chroot into the environment because of the 32-bit/64-bit mismatch. However, there is an option to skip the GRUB install, so I can install it later with a rescue CD. I am not sure whether anything else needs to happen after the GRUB install for the restored system to boot successfully.

comment:5 by jatencio, 17 years ago

I was able to successfully restore an x86_64 image using an i386 mondorescue image. However, the problem with the x86_64 mondorescue image still exists.

comment:6 by jatencio, 17 years ago

Even though I was able to successfully restore an image, that image will not boot, and we now get the following kernel panic during bootup:

  Booting 'Red Hat Enterprise Linux WS (2.6.9-55.ELsmp)'
kernel direct mapping tables upto 10100000000 @ 8000-d000
root (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
kernel /vmlinuz-2.6.9-55.ELsmp ro root=/dev/VolGroup00/LogVol00 rhgb quiet
   [Linux-bzImage, setup=0x1e00, size=0x19772d]
initrd /initrd-2.6.9-55.ELsmp.img
   [Linux-initrd @ 0x37e45000, 0x1aa725 bytes]

.
Decompressing Linux...done.
Booting the kernel.
Red Hat nash version 4.2.1.10 starting
  Reading all physical volumes.  This may take a while...
  No volume groups found
  Volume group "VolGroup00" not found
ERROR: /bin/lvm exited abnormally! (pid 480)
mount: error 6 mounting ext3
mount: error 2 mounting none
switchroot: mount failed: 22
umount /initrd/dev failed: 2
Kernel panic - not syncing: Attempted to kill init!

comment:7 by Bruno Cornec, 17 years ago

The kernel panic you now get at boot time is due to a problem restoring the LVM environment of your system (no volume group is found, so the root FS cannot be mounted).

Could you save the /tmp/mondorestore.log file at the end of the restore, so that I can see what happens during the restore?

Of course, you're right that you can skip the GRUB stage and restore GRUB later with another rescue CD, but I guess you tried that and succeeded.

For the kernel panic in the first case (x86_64), one way to help diagnose it is to modify the script /usr/lib64/mindi/rootfs/sbin/find-and-mount-cdrom at archiving time and add a

set -x

command right after the shebang at the beginning. That way we should be able to see at restore time which part of the script is causing the problem (it runs tests on various devices; maybe one of them triggers the anomaly).
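One way to make that change, sketched in POSIX sh. The helper name add_trace is hypothetical; it keeps a copy of the script with an .orig suffix and defaults to the x86_64 path given above (the i386 layout uses /usr/lib/mindi/... instead). As the comment says, this must be done before archiving so the modified script ends up on the boot media.

```shell
#!/bin/sh
# Sketch: insert "set -x" right after the shebang line of a script.
# Default path is the x86_64 one named in this comment; pass another path to override.
add_trace() {
    script=${1:-/usr/lib64/mindi/rootfs/sbin/find-and-mount-cdrom}
    # Keep the untouched script next to the edited one.
    cp "$script" "$script.orig" || return 1
    # Line 1 (the shebang) stays first; every other line shifts down by one.
    { head -n 1 "$script.orig"; echo 'set -x'; tail -n +2 "$script.orig"; } > "$script"
}
```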

Last point: you are using 2.2.4, which is NOT officially released yet. In your case I don't think it changes anything; I even expect it to be better than the current official 2.2.3.

comment:8 by Krisztian_Toth, 17 years ago

I'm able to reproduce the problem on an HP BL460.

My environment:

RHAS4 U5 x86_64
Kernel: 2.6.9-55.ELsmp

Mondorescue packages:

afio-2.4.7-1.x86_64.rpm
buffer-1.19-1.x86_64.rpm
mindi-1.2.4-1.rhel4.i586.rpm
mindi-busybox-1.2.2-3.rhel4.i586.rpm
mondo-2.2.4-1.rhel4.i586.rpm
mondo-doc-2.2.4-1.rhel4.noarch.rpm

My partition table:

# fdisk -l /dev/cciss/c0d0

Disk /dev/cciss/c0d0: 73.3 GB, 73372631040 bytes
255 heads, 63 sectors/track, 8920 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

           Device Boot      Start         End      Blocks   Id  System
/dev/cciss/c0d0p1   *           1          14      112423+  83  Linux
/dev/cciss/c0d0p2              15        8920    71537445   8e  Linux LVM


My filesystems:

/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
/dev/cciss/c0d0p1 on /boot type ext3 (rw)

I created my backups on NFS. Here's the command I used:

mondoarchive -O9 -l GRUB -F -p Krisztian_Toth -s 700M -d /my_system -S /mondo_tmp -T /mondo_tmp -E "/mondo_tmp /backup" -N -n 192.168.128.77:/exports

I modified /usr/sbin/mindi because I ran into another problem, and forced the following modules:

FORCE_MODS="nls_utf8 sr_mod md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2400 cciss qla2xxx scsi_transport_fc usb_storage uhci_hcd ohci_hcd ehci_hcd sd_mod scsi_mod"
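Because this ticket turns on which kernel the boot media is built from, it may be worth checking that every forced module actually exists for that kernel. A rough sketch, assuming the modinfo on your system accepts -k KERNELVERSION; missing_modules is a hypothetical name, not a mindi function.

```shell
#!/bin/sh
# Sketch: print each module name that modinfo cannot resolve for a kernel.
# Assumes modinfo supports "-k KERNELVERSION"; if modinfo is missing,
# every module is reported, which errs on the side of caution.
# Hypothetical usage: missing_modules KERNELVERSION MODULE...
missing_modules() {
    kver=$1
    shift
    for m in "$@"; do
        # Output is discarded; only modinfo's exit status matters here.
        modinfo -k "$kver" "$m" >/dev/null 2>&1 || echo "$m"
    done
}
```

For example, missing_modules 2.6.9-55.ELsmp $FORCE_MODS (with $FORCE_MODS left unquoted so it splits into words). Any name printed was not found for that kernel.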

I used ILO2 Virtual Media and NFS to restore my system. I used ILO2 to boot my first mondorescue CD and NFS to obtain the data itself.

So that was my environment.

After that, I tried some configuration changes to find the root cause of the problem.

A little summary of my actions:

  • I installed afio-2.4.7-1.i586.rpm and buffer-1.19-1.i386.rpm instead of afio-2.4.7-1.x86_64.rpm and buffer-1.19-1.x86_64.rpm and created a new backup. The crash occurred.
  • I tried 2.6.9-55.EL instead of 2.6.9-55.ELsmp, hoping the crash would not occur in a non-SMP environment. I created a new backup. The crash occurred.
  • I tried different kernel versions. First I installed the previous kernel version to check whether it works: 2.6.9-42.0.10.ELsmp, which was the latest Red Hat-provided kernel before 2.6.9-55.ELsmp. I put back afio-2.4.7-1.x86_64.rpm and buffer-1.19-1.x86_64.rpm. The backup and the restore were successful.
  • I added "set -x" to /usr/lib/mindi/rootfs/sbin/find-and-mount-cdrom to see what happened before the crash:
+ ln -sf /dev/scd0 /dev/cdrom
+ [ 0 -ne 0 ]
+ LogIt CD-ROM found at /dev/scd0
+ mount /mnt/cdrom
+ mount: /dev/cdrom is write-protected, mounting read-only
+ [ 0 -ne 0 ]
+ [ ! -d /mnt/cdrom/archive ]
<<<< The crash occurred here >>>>>
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
...

Summary:

  • I was able to reproduce the problem.
  • The crash does not occur if you use the latest Mondo Rescue packages (see my package list above) with the 2.6.9-42.0.10.ELsmp kernel.

Conclusion?

  • More testing is needed, but there is something strange with 2.6.9-55.ELsmp. Maybe something changed that is not "tolerated" by mindi, or this kernel version introduced a new kernel bug.

Best Regards, Krisztian Toth

comment:9 by Krisztian_Toth, 17 years ago

I also ran the above test on an RHAS 4 U5 i386 environment and can confirm I got the same results, so this is not an x86_64-specific problem. For details, see my previous post.

comment:10 by jatencio, 17 years ago

Hello Bruno,

I have not been able to save /tmp/mondorestore.log after restoring. I tried using a RHEL 4.5 rescue CD, but it too was unable to mount the root filesystem after I restored the file system.

Since it appears to be a 2.6.9-55.EL issue, has there been any progress as to what the problem may be?

Thanks,

Jonathan

comment:11 by Bruno Cornec, 17 years ago

For now, the workaround is to go back to the previous kernel version.

I'll try to reproduce it in a pure Red Hat context so that I can open a bug report in Red Hat's Bugzilla.

comment:12 by (none), 17 years ago

Milestone: 2.2.4

Milestone 2.2.4 deleted

comment:13 by Bruno Cornec, 17 years ago

Milestone: 2.2.5

comment:14 by Joe Ross, 17 years ago

Cc: christmasboy_81@… added

Has anyone had a chance to try this with 2.6.9-55.0.2?

comment:15 by gavro, 17 years ago

I'm sorry to say that the same happens with 2.6.9-55.0.2... Is there any option besides booting an older kernel to make a working backup with mondoarchive? Rebooting a production server is not really appreciated...

comment:16 by gavro, 17 years ago

Has anyone already tried it with kernel-2.6.9-55.0.6?

comment:17 by nico, 17 years ago

I tried mondorestore with kernel-2.6.9-55.0.6.ELsmp (32bit) on a HP Proliant DL380G5 and ran into the kernel-panic too. :-(

My rpms: buffer-1.19-1 mindi-busybox-1.2.2-3.rhel4 mondo-2.2.4-1.rhel4 mindi-1.2.4-1.rhel4 afio-2.4.7-1

comment:18 by Joe Ross, 16 years ago

It looks like it still happens with 2.6.9-55.0.9.EL.

comment:19 by Bruno Cornec, 16 years ago

Cc: simo.syvajarvi@… added
Component: mindi-kernel → mondo

I've just successfully done a full backup and restore of a BL 460 G1 (the same family used by Krisztian) with 2.6.9-55 and mondo-2.2.5 + mindi-1.2.5 (from ftp://ftp.mondorescue.org/rhel/4/).

Could those of you having problems with 2.2.4 try again with that version to see if it's also fixed for you?

My main problem is that I do not have a clear idea of what fixes the issue :-(

comment:20 by nico, 16 years ago

Unfortunately the issue is not fixed for me. I used the newest mindi-1.2.5-1.rhel4 to create the boot DVD and made a full backup on the HP ProLiant DL380G5, but got the same kernel panic with kernel 2.6.9-55.0.9.ELsmp while booting the DVD.

comment:21 by ylihemmo, 16 years ago

Not working here either on ProLiant BL685c (rhel 4.5)

Tried it with:

mondo-2.2.5-1.rhel4 mindi-1.2.5-1.rhel4

and ADDITIONAL_BOOT_PARAMS="apm=off devfs=nomount noresume selinux=0 barrier=off"

comment:22 by nico, 16 years ago

Tried once more on a DL380G5 and a DL580G3 with RHEL 4.5. It does not work on either machine. I also tried ylihemmo's ADDITIONAL_BOOT_PARAMS without success.

comment:23 by Bruno Cornec, 16 years ago

So the bug is clearly in find-and-mount-cdrom when unmounting the CD. If I edit the script and add a call to sh before the umount, everything is fine. Typing umount /dev/hda in another shell doesn't trigger it, and neither does a sequence of 20 mount/umount cycles, but in the shell it's triggered every time. This is still the case with 2.2.5 as of 2007-10-31, and selinux=0 doesn't change anything either.

Still searching.
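The mount/umount loop mentioned above could look roughly like this. It is a sketch for anyone trying to reproduce the crash, not the exact test that was run: /dev/hda and /mnt/cdrom are assumptions, and the MOUNT/UMOUNT indirections exist only so the loop can be exercised with stubs.

```shell
#!/bin/sh
# Sketch of a repeated mount/umount stress loop.
# DEVICE and MNT are assumptions; adjust them to your system.
DEVICE=${DEVICE:-/dev/hda}
MNT=${MNT:-/mnt/cdrom}
MOUNT=${MOUNT:-mount}
UMOUNT=${UMOUNT:-umount}

# Hypothetical helper: run N mount/umount cycles (default 20), stop at the first failure.
mount_cycle() {
    n=${1:-20}
    i=1
    while [ "$i" -le "$n" ]; do
        $MOUNT -t iso9660 -o ro "$DEVICE" "$MNT" || return 1
        $UMOUNT "$MNT" || return 1
        i=$((i + 1))
    done
}
```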

comment:24 by Bruno Cornec, 16 years ago

Of course (in case it wasn't clear earlier), you can use 2.2.4 with the -k option, e.g. -k /boot/vmlinuz-2.6.9-34.ELsm, so that only mondo uses the old kernel for the duration of the restore, while the kernel used at run time and after the restore remains -55*.

Hope this helps as a workaround
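A small guard for the -k workaround, so mondoarchive is not started with a kernel image that does not exist. The helper kernel_args is hypothetical; its default path uses 2.6.9-42.0.10.ELsmp, the kernel reported as working earlier in this ticket, and should be adjusted to whichever older kernel is actually installed.

```shell
#!/bin/sh
# Sketch: check an older kernel image exists before passing it to mondoarchive -k.
# The default path is an assumption; 2.6.9-42.0.10.ELsmp is the kernel
# reported as working in this ticket.
OLD_KERNEL=${OLD_KERNEL:-/boot/vmlinuz-2.6.9-42.0.10.ELsmp}

# Hypothetical helper: prints "-k PATH" on success, an error on stderr otherwise.
kernel_args() {
    k=${1:-$OLD_KERNEL}
    if [ ! -f "$k" ]; then
        echo "kernel image not found: $k" >&2
        return 1
    fi
    printf '%s %s\n' -k "$k"
}
```

For example: kargs=$(kernel_args) && mondoarchive <your usual options> $kargs, leaving $kargs unquoted so it splits into -k and the path.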

comment:25 by triumvir, 16 years ago

Try mounting /mnt/cdrom in the "find-and-mount-cdrom" function at line 43 not with "mount /mnt/cdrom" but with "mount $device -t iso9660 -o ro /mnt/cdrom 2> /tmp/mount.log", as is done in the test that checks whether the device is a CD-ROM.
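The suggested change, sketched as a standalone function. mount_cdrom, MOUNT and MNT are hypothetical names introduced for illustration, not mindi's own. The point is that the device, filesystem type and read-only option are passed explicitly, so the call no longer depends on an /etc/fstab entry for /mnt/cdrom.

```shell
#!/bin/sh
# Sketch: mount the CD with explicit device, type and options instead of
# "mount /mnt/cdrom", which looks the mount point up in /etc/fstab.
# MOUNT and MNT are overridable for testing; they are not names from mindi.
MOUNT=${MOUNT:-mount}
MNT=${MNT:-/mnt/cdrom}

mount_cdrom() {
    device=$1
    $MOUNT "$device" -t iso9660 -o ro "$MNT" 2> /tmp/mount.log || return 1
    # Same follow-up check as in the trace above: the media must carry archive/.
    [ -d "$MNT/archive" ]
}
```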

comment:26 by Bruno Cornec, 16 years ago

Fixed with latest 2.2.5 + mindi 2.0.0 + mindi-busybox 1.7.3 Please check on your side.

comment:27 by julien, 16 years ago

I'm sorry, but that problem is not fixed with latest 2.2.5 + mindi 2.0.0 + mindi-busybox 1.7.3.

I'm using :

[root@TSM... ~]# rpm -qa | grep mindi
mindi-busybox-1.7.3-1.rhel4
mindi-2.0.0-1.rhel4
[root@TSM.. ~]# rpm -qa | grep mondo
mondo-2.2.5-1.rhel4
[root@TSM... ~]# arch
x86_64

My server is an HP ProLiant BL465c G1, and the same problem prevents me from restoring backups from ISO CDs...

Code: Bad RIP value.
RIP [<0000000000000000>] RSP <000001012bac9e50>
CR2: 0000000000000000

<0>Kernel panic - not syncing: Oops

comment:28 by Joe Ross, 16 years ago

Still having the problem with 2.6.9-55.0.12.EL and the latest 2.2.5 + mindi 2.0.0 + mindi-busybox 1.7.3 from 12/14.

comment:29 by Sean, 16 years ago

I have been getting the same error here at work using mondo 2.2.4 and now 2.2.5. I am using RHEL 4 Update 5 with kernel vmlinuz-2.6.9-55.0.2.ELsmp. It also appears to be the find-and-mount script causing the error. Like others here, I am eager to get this resolved so I can use Mondo. I have some ProLiant BL460c G1 blades that are not going into production for about 3 weeks, and I am willing to do a bit of testing and send logs if you think this will help resolve the problem.

Sean

comment:30 by Alan Walker, 16 years ago

I am also seeing a kernel panic when booting from a Mondo Rescue restore CD using CentOS 4.5 (RHEL 4.5) kernel 2.6.9-55.0.12.ELsmp on an HP DL380G5, but when I run mondoarchive with the -k option and specify a 2.6.9-42.ELsmp kernel image, it seems to work OK.

I am using the following packages: afio-2.4.7-1.i586.rpm buffer-1.19-1.i386.rpm mindi-1.2.0-2.rhel4.i586.rpm mindi-busybox-1.2.2-2.rhel4.i586.rpm mondo-2.2.0-2.rhel4.i586.rpm mondo-doc-2.2.0-2.rhel4.noarch.rpm

(I had tried more recent versions, but then I could not even complete a backup. As it is, I have to eject the tape during a restore while it is searching for the file lists to make it continue; otherwise it just seems to hang there.)

I have a screen photo of the panic message, if it helps. Thanks for the -k suggestion above, Alan.

comment:31 by adrianmarsh, 16 years ago

Did this get any further? I've been holding on to CentOS 2.6.9-42.0.10.EL. I just tried it in a VM using 2.6.9-67.0.1.EL.plus.c4 and saw the same kernel panic, using these RPMs:

mindi-1.2.4-1.rh9 mindi-busybox-1.2.2-3.rh9 mondo-2.2.4-1.rh9

comment:32 by shanec, 16 years ago

Cc: shane_chartrand@… added

comment:33 by shanec, 16 years ago

I am having the same issue with kernel 2.6.9-67.ELsmp on Red Hat ES4.

I have the following installed:

mindi-1.2.4-1.rhel4.i586.rpm mindi-busybox-1.2.2-3.rhel4.i586.rpm mondo-2.2.4-1.rhel4.i586.rpm

Can anyone point me to instructions on installing and configuring the 2.2.4 linux kernel for use with the -k option?

comment:34 by alexrixhardson, 16 years ago

The same problem also persists in the newest RHEL4 kernel: 2.6.9-67.0.4.ELsmp.

Does anyone know of another fairly recent vmlinuz I could use to avoid this problem? (It should, however, support the latest RAID controllers.)

comment:35 by alexrixhardson, 16 years ago

Cc: alexrixhardson@… added

comment:36 by Bruno Cornec, 16 years ago

I've updated the test version of 2.2.5 to fix the call to mount that was creating that issue. Could any of you having that issue test and report back please ?

Cf: ftp://ftp.mondorescue.org/rhel/4/test

comment:37 by amaura, 16 years ago

Hello,

I am running RHEL 4.6 x86_64 (kernel 2.6.9-67.0.4) on a Dell PE 2950 and had the kernel panic problem with version 2.2.4. Your test version of 2.2.5 solved the problem. It also solved the problem on a virtual machine running 32-bit RHEL 4.5 (kernel 2.6.9-55).

However, although everything seems to be OK on the VM, on kernel 2.6.9-67.0.4 x86_64 I have a problem mounting ext3. The live boot CD that is created doesn't support ext3 (ext2 is OK, but I do not want to mount as ext2), and therefore I get an error at the mountlist step :(. Any idea?

comment:38 by Bruno Cornec, 16 years ago

Thanks for your report.

Concerning your other issue, I can't do much without logs. I guess an additional module may be needed, or something like that. However, I'd prefer that you open another ticket to follow up on that issue.

Would any of the other people having this issue like to report their feedback?

comment:39 by Bruno Cornec, 16 years ago

Resolution: fixed
Status: assigned → closed

This should be fixed with the official 2.2.5. Reopen the ticket if you still experience an issue with it.
