Posts Tagged ‘solaris’

The stress of disk failure



A few weeks ago my ZFS raid, containing twelve 1.5 TB Western Digital Green
drives, was degraded due to a drive failure. As I had a hot spare
added to the raid as well, the raid was quickly rebuilt and kept
working fine through the problems and afterwards. I ordered two new
drives, and today I thought I would change the defective drive and
let the replacement drive become the new hot spare. Before I started,
I did a detach of the defective drive to make the hot spare
permanently stay in place of the defective drive: ‘zpool detach DATA’
plus the defective drive’s device name.
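For reference, the sequence looks roughly like this (the device name of the failed drive, c0t2d0 here, is hypothetical):

```
zpool status DATA          # find the FAULTED drive and the spare marked INUSE
zpool detach DATA c0t2d0   # detach the failed drive; the hot spare then
                           # becomes a permanent member of the raidz1 group
```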



Before I started, I thought it would be a good idea to reboot the
server, just to see that it worked as it should before I started
tinkering. I rebooted, and the server started beeping loudly about a
missing drive (this was from the raid controller, before the OS had
started). I thought nothing of it, assuming it was the defective one.
So I started the OS to see if everything worked as it should.
A quick ‘zpool status’ showed me that I did indeed have a new problem: another
drive had failed, and one of the three raidz1 groups was now
degraded, as I did not have any free hot spare to rebuild the raid.
While running ‘zpool status’ again my heart suddenly skipped a few beats,
as I noticed a second drive in the same raidz1 group missing, and the
message was now that the zpool did not have enough replicas to
function at all. Luckily I have been in tough situations with
drives and hardware before, so I decided this was not a good
time to panic. I descended from the attic and took a break with
some Coca-Cola.


I thought: why would two “new” drives fail at the same time, before I
had even touched anything? I got back up to the attic, and now one of
the missing drives was suddenly working again and some data had been
resilvered. This made me think that maybe the drives were not
defective after all; maybe they just had some bad cables. I turned
off and opened the server, and checked every cable thoroughly.
With close to 20 hard drives in this server there are, as you can
imagine, quite a few power splitters. One of them was feeding the two
drives that had gone missing (of which one had luckily come back). I
changed out the power splitter, and also changed the defective drive that
was the original problem. I booted the server and made the changes in
the raid controller, creating new volumes for the new drive and the
changed one. Then I booted the OS in single-user mode.

Luckily the second of the two missing drives was still
working, and I did a ‘zpool replace DATA c0t3d0 c0t3d0’ to get the
zpool to rebuild and start using the first drive that had been
missing. Then I added the new hot spare to the zpool:
‘zpool add DATA spare c3t5d0’. After about 1.5 hours the raid had
been rebuilt and was back in normal mode with no faults.

Hopefully the problem with the two drives that went missing was only
the power splitter, and they won’t give any new problems. And hopefully
it will be a long, long time before the new hot spare is needed.

Thank God for raid and hot spares; I do not even want to think about
losing the 12 TB raid. I do have backups of the most important stuff,
but backing up all of the data is quite impossible at home with such
amounts. Starting from scratch with all of my ripped movies and music
and so on would be a nightmare.




Oracle’s incompetent patch-handling


My last post was about how Oracle decided to move smb.conf and the
private dir for Samba without moving the files for you, or even
informing you that this might be a good idea… Read more here:

Today we are patching all of our servers at work, and we came across
even more stupid stuff:

1) If you have ldap/client running and enabled, it is probably because
you are using it. But in one of the latest patches, Oracle decided
that disabling this service was probably a good thing. Even if you
check right after patching it is still enabled, but after a reboot it is
magically disabled. This is extra fun if you log in as a normal user
and use sudo to manage your servers… How can this pass QC?

2) If you have changed /usr/lib/sendmail to point to your own local
mailer, e.g. your own compiled Exim, that is also probably because
you want it like that. Oracle often decides that it’s a good idea to
just change this back to point to sendmail, which breaks your setup
spectacularly… Again, how can this pass QC?

3) In our setup, /var/mail is a symlink to /export/mail, which is a
gigantic ZFS disk from the SAN. In one of the latest patches, Oracle
decided that this symlink should just be removed. Not even replaced
with a normal folder or anything, just removed. Which breaks our
setup spectacularly… Again, how can this pass QC?

4) If you don’t use sendmail, but e.g. Exim instead like we do, you
have probably disabled the sendmail services. In many patches, Oracle
thinks it’s a very good idea to just enable these services for
you again. Don’t they think we know what we’re doing when we disable
them ourselves? Again, how can this pass QC?
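Symlink surprises like 2) and 3) are easy to catch with a small post-patch check. A sketch (the exim path in the usage comment is hypothetical, not from the post):

```shell
# check_link LINK TARGET: warn (and return 1) if LINK is not a symlink
# pointing at TARGET. Useful as a post-patch sanity check.
check_link() {
    actual=$(ls -ld "$1" 2>/dev/null | awk -F' -> ' '{print $2}')
    if [ "$actual" != "$2" ]; then
        echo "WARN: $1 -> ${actual:-missing}, expected $2"
        return 1
    fi
}

# After a patch round, something like:
#   check_link /usr/lib/sendmail /opt/exim/bin/exim   # path hypothetical
#   check_link /var/mail /export/mail
```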

I am so happy that we’re moving away from Oracle/Sun/Solaris, because
this just gets worse and worse every time we patch the servers.
#oracle #fail

Samba not working after Solaris-patching


If you use the built-in Samba in Solaris 10, you might have discovered
that after patching Solaris lately, Samba is not working any more.

Well this is an easy fix, but I can’t understand why Oracle hasn’t fixed it yet:

1) cp /etc/sfw/private/* /etc/samba/private/
2) cp /etc/sfw/smb.conf /etc/samba/
3) svcadm clear samba
4) svcadm restart samba
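The copy steps above can be wrapped in a small guard function that only migrates when the new location is still empty (a sketch; the function name is my own, and on a real box you would still run the svcadm commands afterwards as root):

```shell
# migrate_samba_conf SRC DST: copy smb.conf and the private files from
# the old location to the new one, unless the new one already has them.
migrate_samba_conf() {
    src=$1
    dst=$2
    if [ -f "$src/smb.conf" ] && [ ! -f "$dst/smb.conf" ]; then
        mkdir -p "$dst/private"
        cp "$src"/private/* "$dst/private/" 2>/dev/null
        cp "$src/smb.conf" "$dst/"
    fi
}

# On a real box (as root):
#   migrate_samba_conf /etc/sfw /etc/samba
#   svcadm clear samba
#   svcadm restart samba
```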

Oracle decided to move smb.conf and the private files for Samba from
/etc/sfw/ to /etc/samba. Not a bad idea, but why on earth don’t they
copy over the old files if they exist?

This is just a stupid bug…

#oracle #fail

Solaris 10 U9 (09/10) and registration during installation


Not long ago, Oracle released a new version of Solaris, the first
after they took over Sun. In this new version, all traces of Sun seem
to be gone, and the logo is now Oracle Solaris. It probably took them
so long to release it just to make sure each and every Sun entry was
changed to Oracle…

One of the new things is that when you install the OS, Oracle
collects information about how you set up your server and sends it to
Oracle. You get a choice between doing this anonymously or using an
Oracle account. There is no third choice to *not* send this information…

This also posed a problem for our Jumpstart setup, since this was a
new question our sysidcfg files didn’t have an answer for. As a result,
the installation doesn’t start until you have gone through a few
pages of questions about this issue.

But, Google (and now I) is your friend. The solution is to put this
entry in the sysidcfg-file:
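The entry itself is missing from the post; to the best of my knowledge, the sysidcfg keyword that controls auto-registration in Solaris 10 9/10 is auto_reg, and disabling it looks like:

```
auto_reg=disable
```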

Our version of Solaris/Jet/Jass didn’t support doing this
automatically, but that was fixed through our custom scripts.

The interesting part is that this disables the very feature you can’t
disable through the interactive installation.

Guess Oracle has really squeezed out most of what was left of Sun’s
culture and replaced it with their own corporate thing…

Hope this is useful for others.

More disk space


[root@solssd01] ~# zpool status DATA
pool: DATA
state: ONLINE
raidz1 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c2t4d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0

[root@solssd01] ~# zfs list DATA
DATA 3,83T 174G 3,83T /DATA

[root@solssd01] ~# zpool add DATA raidz1 c0t3d0 c0t4d0 c0t5d0 c0t6d0

[root@solssd01] ~# zfs list DATA
DATA 3,83T 4,17T 3,83T /DATA

[root@solssd01] ~# zpool status DATA
pool: DATA
state: ONLINE
raidz1 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c2t4d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c0t3d0 ONLINE 0 0 0
c0t4d0 ONLINE 0 0 0
c0t5d0 ONLINE 0 0 0
c0t6d0 ONLINE 0 0 0

That leaves some free space again on the server at home; more precisely, almost 4.2 TB free 🙂

Time drifting in VirtualBox RHEL/CentOS


I am running VirtualBox on a Solaris 10 installation, with a CentOS 5.4 virtual machine. The problem I have had is that the time in the virtual machine has been going nuts. Sometimes it goes so slowly it almost seems like time is standing still; sometimes it speeds ahead. It is so bad that neither ntpd nor the Guest Additions’ time sync could keep it running properly.

After googling this a lot and trying different things, I can now bring you the solution:
1) Make sure you’re running the Guest Additions (VirtualBox’s equivalent of VMware Tools)
2) Add the following kernel boot parameters:
nmi_watchdog=0 elevator=deadline noapic nolapic divider=10 nolapic_timer clocksource=acpi_pm
3) Reboot, and maybe run ntpdate once to get the time right. After that, the Guest Additions should keep the time up to date. Oh, and you should probably not run ntpd at the same time as the Guest Additions; they could get in each other’s way…
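On CentOS 5 these parameters go on the kernel line in /boot/grub/grub.conf. A sketch (the kernel version and root device below are examples, not from the post; GRUB legacy needs the whole kernel line on one line):

```
title CentOS (2.6.18-164.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-164.el5 ro root=/dev/VolGroup00/LogVol00 nmi_watchdog=0 elevator=deadline noapic nolapic divider=10 nolapic_timer clocksource=acpi_pm
        initrd /initrd-2.6.18-164.el5.img
```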

Oh, and it should be fixed in later Linux kernels, but it seems RHEL 5.x (and therefore CentOS 5.x) is stuck on kernel version 2.6.18, which means we won’t get the fix before a new major release.


Move into folders


Often when you download something, for instance movies, it is a big pack of many movies, all in one folder as .avi files or similar. You might want each of those in its own separate folder for use with e.g. XBMC, and it can be a tedious task to make a folder for each movie and move it into that folder.

Well here is the script to do it, filled with comments to explain what is happening. Hope someone finds it useful.

First: this is the quick and dirty one-liner to do it, if you are sure there are only files with extensions in the dir; it’s without any error checking:
IFS=$'\n'; for f in `ls -1`; do mkdir "${f%.*}"; mv "${f}" "${f%.*}/"; done

The whole script is attached, with an abundance of comments describing what’s happening.
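The attached script isn’t reproduced here, but a minimal version with basic checks might look like this (the function name is my own; it skips directories and files without an extension):

```shell
# file_to_dir DIR: for each regular file with an extension in DIR,
# make a folder named after the file (minus its extension) and move
# the file into that folder.
file_to_dir() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue                    # skip directories etc.
        base=${f##*/}
        case $base in *.*) ;; *) continue ;; esac  # need an extension
        dir="$1/${base%.*}"
        mkdir -p -- "$dir"
        mv -- "$f" "$dir/"
    done
}
```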


A little script that descends into each subdirectory, looks for files
ending in .rar, and unpacks them. It then asks whether you want to
delete “*r?? *sfv”, and deletes them if you confirm.

Handy when you download something that comes as a pile of subdirectories
with rar’ed content.
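The script itself isn’t reproduced here, but the traversal part might be sketched like this (the function name and the optional CMD argument are my own; the interactive cleanup of “*r?? *sfv” is left out):

```shell
# unpack_rars DIR [CMD]: in every subdirectory of DIR, run CMD (default:
# unrar) with 'x <archive>' on each .rar file found there. Passing a
# different CMD (e.g. echo) lets you dry-run the traversal.
unpack_rars() {
    dir=$1
    cmd=${2:-unrar}
    for d in "$dir"/*/; do
        [ -d "$d" ] || continue
        for rar in "$d"*.rar; do
            [ -f "$rar" ] || continue
            ( cd "$d" && "$cmd" x "${rar##*/}" )
        done
    done
}
```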

Mounting ext2/3, FAT16/FAT32 and NTFS in Solaris 10


You might be using Solaris like me, and you might have some disk containing ext3 partitions that you want to mount. This cannot be done out of the box on Solaris, since it doesn’t support ext2/3 and NTFS. But do not give up; the solution is here!

First off, note that there is only read-only support for NTFS/ext2/ext3. FAT16/FAT32, however, has full read/write support.

Follow these simple steps:
Download these two packages:

Unzip and install them:
gzcat FSWpart.tar.gz | tar xvf -
gzcat FSWfsmisc.tar.gz | tar xvf -
pkgadd -d . FSWpart
pkgadd -d . FSWfsmisc

Now run the prtpart tool on the disk whose partitions you want to read. You can see the devices Solaris has recognized with “echo|format”.

/usr/bin/prtpart /dev/rdsk/p0
/usr/bin/prtpart /dev/rdsk/c2t1d0p0 -ldevs

This might result in something like this:

Fdisk information for device /dev/dsk/c2t1d0p0

** NOTE **
/dev/dsk/c2t1d0p0 – Physical device referring to entire physical disk
/dev/dsk/c2t1d0p1 – p4 – Physical devices referring to the 4 primary partitions
/dev/dsk/c2t1d0p5 … – Virtual devices referring to logical partitions

Virtual device names can be used to access EXT2 and NTFS on logical partitions

/dev/dsk/c2t1d0p1 Linux raid autodetect
/dev/dsk/c2t1d0p2 Linux swap
/dev/dsk/c2t1d0p3 Linux raid autodetect

To mount NTFS partition use
mount -F ntfs /dev/dsk/c2t1d0p /mnt/windows

To mount FAT 16 / FAT 32 partition use
mount -F pcfs /dev/dsk/c2t1d0p /mnt/windows

If the above command fails, you can try the option below:
prtpart /dev/dsk/c2t1d0p0 -fat
The above command should list the available PCFS/FAT partitions in colon notation; use the same notation for mounting, e.g.:
mount -F pcfs /dev/dsk/c2t1d0p0:d /mnt/windows

To mount Ext2 / Ext3 partitions use
mount -F ext2fs /dev/dsk/c2t1d0p /mnt/linux

To unmount a partition, use “umount”:
umount /mnt/linux

This also means you can share this folder to a branded zone running Red Hat Enterprise Linux or CentOS, but remember that it’s read-only…
