High Availability NFS With DRBD + Heartbeat

Ryan Babchishin - http://win2ix.ca

This document describes information collected during research and development of a clustered DRBD NFS solution. This project had two purposes:

HA NFS solution for Media-X Inc.
Develop a standard tool kit and documentation that Win2ix can use for future projects

Operating System

The standard operating system for Win2ix is Ubuntu 12.04, therefore all testing was done with this as the preferred target.

Hardware

Because of the upcoming project with Media-X, computer hardware was chosen based on low cost and low power consumption.

A pair of identical systems was used for the cluster:

SuperMicro SuperServer 5015A-EHF-D525 (default settings)
Intel(R) Atom(TM) CPU D525 1.80GHz CPU (dual-core, hyper threaded)
4GB DDR3 RAM
2x750GB 2.5" Scorpio Black hard drives
3ware 9650 RAID-1 controller card with 128MB of RAM, without a battery backup
2x on-board Gigabit Ethernet

Partitioning and disk format

The disks were partitioned according to Win2ix standards.

sda1 - / - ext4 - 20GB
sda2 - /tmp - ext4 - 6GB
sda3 - /var - ext4 - 6GB
sda5 - swap - swap - 2GB
sda6 - drbd - drbd - 716GB

Networking

eth0 was configured with an (unused) 192.168.0.[1|2] address for communication over a direct link between systems, with no switch

Bonding was tested, see below for more information.

eth0 MTU was set to 9000
eth1 was configured with a regular network IP address for SSH/NFS/etc... access on both systems
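
For reference, the direct-link settings above could be expressed in /etc/network/interfaces roughly as follows. This is a sketch for the first node only, not the configuration actually used, and the 'mtu' line assumes the NIC and driver accept jumbo frames:

```
auto eth0
iface eth0 inet static
    address 192.168.0.1
    netmask 255.255.255.0
    mtu 9000
```

One way to confirm that 9000-byte frames really pass end to end is 'ping -M do -s 8972 192.168.0.2' from the first node (8972 is 9000 minus 28 bytes of IP and ICMP headers).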

Bonding

Although not used due to lack of an extra PCI-E slot, Ethernet bonding was originally tested.

Important notes when using bonding:

To get true single connection load balancing in both directions, use bonding mode 0 (round robin) with NO SWITCH or find a switch that supports it. Direct connections between systems work well. Switches generally support trunking, LACP or some other Cisco variant. They will most likely only send traffic for different IP connections over separate links. This won't help with DRBD, which uses a single connection.
Make sure you are seeing full throughput on your bonding device in both directions by testing with something like iperf.
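
A minimal sketch of such a throughput check, assuming iperf version 2 is installed on both nodes and 'iperf -s' is already running on the peer. The function name bond_throughput_test and the 30 second duration are illustrative choices, not part of the original setup:

```shell
# Hypothetical helper: measure throughput to a peer in both directions.
# Assumes "iperf -s" is already running on the peer (iperf version 2).
bond_throughput_test() {
    peer="$1"
    # -c: client mode, -t 30: run for 30 seconds,
    # -r: test each direction in turn (client to server, then back)
    iperf -c "$peer" -t 30 -r
}
```

From nfs01 this would be run as 'bond_throughput_test 192.168.0.2'. Both reported rates should be close to the combined speed of the bonded links.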

Sample working /etc/network/interfaces configuration segment:

iface bond0 inet static
 	address 192.168.0.1
 	netmask 255.255.255.0
 	bond-mode 0
 	bond-miimon 100
 	bond-slaves eth0 eth1

Kernel Tuning

These sysctl changes seemed to make a small improvement, so I left them intact. They would need to be added to /etc/sysctl.conf.

# drbd tuning
net.ipv4.tcp_no_metrics_save = 1
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 87380 33554432
vm.dirty_ratio = 10
vm.dirty_background_ratio = 4
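
After editing /etc/sysctl.conf, the values can be loaded without a reboot by running 'sysctl -p' as root. The sketch below shows one way to confirm that a setting is live. The helper name check_sysctl is hypothetical:

```shell
# Hypothetical helper: compare a live sysctl value with the expected one.
# Whitespace is normalised because multi-value keys such as
# net.ipv4.tcp_rmem are printed tab-separated.
check_sysctl() {
    key="$1"
    expected="$2"
    actual=$(sysctl -n "$key" | tr -s '[:space:]' ' ' | sed 's/ $//')
    if [ "$actual" = "$expected" ]; then
        echo "OK   $key = $actual"
    else
        echo "DIFF $key = $actual (expected $expected)"
        return 1
    fi
}
```

For example: check_sysctl net.core.rmem_max 33554432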

DRBD

DRBD is the tricky one. It doesn't always perform well despite what its developers would like you to think.

DRBD Performance

Our test RAID controller had no battery, but did have 128MB of RAM.
DRBD flushes all writes to disk or uses barriers unless you disable this. This is important for consistency, but disks were not made to be used this way and it will not perform well in most circumstances.
If you disable all flushing and barriers with no-md-flushes, no-disk-flushes and no-disk-barrier, DRBD performance will be nearly the native disk speed. However, this means the DRBD device is more prone to corruption if it crashes, power goes out, etc...
If you disable just meta-data flushing with no-md-flushes, performance is reasonable (about 80% of native) and you still get some safety. Leaving meta-data flushing enabled makes performance so bad that there is no point in using DRBD.
If you have a battery backed RAID controller, disable all flushing and barriers. You will get near native performance and will not have to worry about data corruption
Variable rate synchronization is excellent because it won't hurt DRBD performance and will use all bandwidth when it is free
use-rle is a performance tweak that is enabled in the latest version by default, so I turned it on
Basic tweaks like buffers, al-extents, etc... should all be enabled. They are well documented.
Protocol C seems to perform almost the same as A and B (based on benchmarks). Protocol C provides the most protection.
Our test systems had hard drives that were capable of write speeds very similar to that of 1Gb Ethernet transfer speeds. If faster disks/arrays were used, a faster link or bonding would be required for DRBD to keep up with writes.
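
The link-versus-disk comparison in the last point is simple arithmetic: 1Gb Ethernet tops out at 1000 / 8 = 125 MB/s before protocol overhead. A small sketch, with hypothetical helper names, for checking whether a link can in theory keep up with a measured disk write speed:

```shell
# Convert a link speed in Mbit/s to a theoretical ceiling in MB/s
# (8 bits per byte). Real DRBD throughput will be lower because of
# protocol overhead.
link_mb_per_s() {
    echo $(( $1 / 8 ))
}

# Hypothetical check: can the replication link keep up with the disk?
# $1 = link speed in Mbit/s, $2 = measured disk write speed in MB/s
link_keeps_up() {
    [ "$(link_mb_per_s "$1")" -ge "$2" ]
}
```

For example, 'link_keeps_up 1000 200' fails, which suggests that an array writing at 200 MB/s (an illustrative figure, not a measurement) would need bonding or a faster link.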

Working Example

Working and well performing DRBD resource configuration

resource r0 {
    net {
        #on-congestion pull-ahead;
        #congestion-fill 1G;
        #congestion-extents 3000;
        #sndbuf-size 1024k;
        sndbuf-size 0;
        max-buffers 8000;
        max-epoch-size 8000;
    }
    disk {
        #no-disk-barrier;
        #no-disk-flushes;
        no-md-flushes;
    }
    syncer {
        c-plan-ahead 20;
        c-fill-target 50k;
        c-min-rate 10M;
        al-extents 3833;
        rate 35M;
        use-rle;
    }
    startup { become-primary-on nfs01; }
    protocol C;
    device minor 1;
    meta-disk internal;

    on nfs01 {
        address 192.168.0.1:7801;
        disk /dev/sda6;
    }

    on nfs02 {
        address 192.168.0.2:7801;
        disk /dev/sda6;
    }
}

Relevant section of /etc/fstab used with this configuration:

# DRBD, mounted by heartbeat
/dev/drbd1	/mnt		ext4	noatime,noauto,nobarrier	0	0

'nobarrier' makes a big difference in performance (on my test systems) and still maintains filesystem integrity
'noatime' makes a small performance difference by disabling access time updates on every file read
'noauto' stops the init scripts (mount -a) from mounting it - heartbeat will manage this

Benchmarking

Before bothering with NFS or anything else, it is a good idea to make sure DRBD is performing well.
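
Benchmark numbers are only meaningful once the resource is connected and fully synchronised. The sketch below checks this by reading /proc/drbd. The helper name drbd_in_sync is hypothetical, and the parsing assumes the DRBD 8.3 status layout, where each device line starts with the minor number followed by a colon:

```shell
# Hypothetical helper: succeed only if the given DRBD minor is connected
# and both sides are UpToDate. Reads /proc/drbd-style text from a file.
# $1 = minor number, $2 = status file (defaults to /proc/drbd)
drbd_in_sync() {
    minor="$1"
    status_file="${2:-/proc/drbd}"
    awk -v m="$minor:" '
        $1 == m {
            found = 1
            if ($0 ~ /cs:Connected/ && $0 ~ /ds:UpToDate\/UpToDate/) ok = 1
        }
        END { if (found && ok) exit 0; exit 1 }
    ' "$status_file"
}
```

With the resource above (device minor 1): drbd_in_sync 1 && echo "r0 is ready for benchmarking"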

Benchmark tools

atop - watches CPU load, IO load, IO throughput, network throughput, etc... for the whole system in one screen, run on both systems during your benchmarking to see what's going on
iptraf - detailed network information, throughput, etc... if you need to dig further
bonnie++ - performs many IO benchmarks, gives a good idea of actual disk performance. Must use a data set at least 2x larger than physical RAM to be accurate
postmark - great for overloading a system, performing many small writes/reads/appends, directory creation, etc... good for testing NFS performance when everything is up. Gives you ops/sec performance results per test type
dd - basic, initial benchmarking - see dd section
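
The "2x physical RAM" rule for bonnie++ can be computed rather than guessed. A small sketch, assuming a Linux /proc/meminfo where MemTotal is reported in kB. The helper name bonnie_size_mb is hypothetical:

```shell
# Hypothetical helper: print twice the physical RAM in MB, read from a
# /proc/meminfo-style file ($1, defaults to /proc/meminfo).
bonnie_size_mb() {
    awk '/^MemTotal:/ { printf "%d\n", ($2 / 1024) * 2 }' "${1:-/proc/meminfo}"
}
```

It could then be used as: bonnie++ -d /mnt -s "$(bonnie_size_mb)" -u root (-d is the test directory, -s the data set size in MB, and -u the user to run as when started by root).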

DD

There are some simple tests you can do to test performance of a storage device using DD. However, other tools should be used later for more accurate results (real world). When I'm benchmarking or trying to identify bottlenecks, I run atop on the same system in a separate terminal while dd is transferring data.

Use direct access to write sequentially to the filesystem (doesn't work on all filesystems)

dd if=/dev/zero of=testfile bs=100M count=20 oflag=direct

Use regular access to write sequentially to the filesystem, and flush before exit

dd if=/dev/zero of=testfile bs=100M count=20 conv=fsync

Use direct access to read sequentially from the filesystem (doesn't work on all filesystems)

dd if=testfile of=/dev/null bs=100M iflag=direct

Use regular access to read sequentially from the filesystem. Drop system cache before doing this or you might just read the file from cache

dd if=testfile of=/dev/null bs=100M

Drop system cache

sync
echo 3 > /proc/sys/vm/drop_caches

Write directly to block device, bypassing filesystem. This will destroy data on the block device

dd if=/dev/zero of=/dev/sdXX bs=100M count=20 oflag=direct

Read directly from block device, bypassing filesystem

dd if=/dev/sdXX of=/dev/null bs=100M count=20 iflag=direct

NFS Server

The only configuration change was to '/etc/exports':

/mnt 192.168.3.0/24(rw,async,no_subtree_check,fsid=0)

'async' - I found this particular setup performed much better (about 50%) with async than with sync

'fsid=0' is a good thing to use in HA solutions. If all nodes use the same ID# (which is trivial) for the same mount, stale handles will be avoided after a fail over.

'/etc/idmapd.conf' may need to be adjusted to match your domain when using NFSv4 (on client and/or server)
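
Because the fsid has to match on every node, it is worth checking after copying /etc/exports around. A minimal sketch with a hypothetical helper, export_fsid, that prints the fsid set for an exported path:

```shell
# Hypothetical helper: print the fsid option set for an exported path.
# $1 = exported path, $2 = exports file (defaults to /etc/exports)
export_fsid() {
    awk -v path="$1" '
        $1 == path && match($0, /fsid=[^,)]+/) {
            print substr($0, RSTART + 5, RLENGTH - 5)
        }
    ' "${2:-/etc/exports}"
}
```

Running 'export_fsid /mnt' on nfs01 and nfs02 should print the same value (0 with the exports line above).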

NFS Client

In testing, I chose to use this command to mount NFS:

mount nfs:/mnt /testnfs -o rsize=32768,wsize=32768,hard,timeo=50,bg,actimeo=3,noatime,nodiratime

Explanation:

rsize/wsize - set the read and write maximum block size to 32k, appropriate for the average file size of the customer's data
hard - cause processes accessing the NFS share to block forever when it becomes unavailable unless killed with SIGKILL
noatime - do not update file access times every time a file is read
nodiratime - like noatime but for directories
bg - continue NFS mount in the background, rather than blocking - mostly to prevent boot problems
tcp - not specified because it's a default - if udp is used, data could be lost during a fail over, tcp will keep trying
timeo - retry NFS requests after 5 seconds (specified in 10ths of a second)
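
The same options can also be placed in /etc/fstab on the client so the share is mounted at boot, which is where 'bg' matters most. A sketch, assuming the server name 'nfs' and the mount point '/testnfs' from the command above:

```
# NFS share from the HA cluster
nfs:/mnt	/testnfs	nfs	rsize=32768,wsize=32768,hard,timeo=50,bg,actimeo=3,noatime,nodiratime	0	0
```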

Clustering

Heartbeat without Pacemaker was chosen. Pacemaker seemed too complex and difficult to manage for what was needed.

Heartbeat

In this test setup, heartbeat has 2 Ethernet connections to communicate between nodes. The first is the network/subnet/LAN connection and the other is the DRBD direct crossover link. Having multiple connection paths is important so that one heartbeat node doesn't lose contact with the other. If that happens, neither one knows which is master and the cluster becomes 'split-brained'.

STONITH

S.T.O.N.I.T.H = Shoot The Other Node In The Head

STONITH is the facility that Heartbeat uses to reboot a cluster node that is not responding. This is very important because heartbeat needs to know that the other node is not using DRBD (or other corruptible resources). If a node is really not responding at all, the other node will reboot it using STONITH, which uses IPMI (in the examples below) and then take over the resources.

When two nodes believe they are master (own the resources) it is called split-brain. This can lead to problems and sometimes data corruption. STONITH w/IPMI can protect against this.
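
Before relying on STONITH, it helps to confirm that each node can actually reach the other node's IPMI controller. A minimal sketch using ipmitool, assuming it is installed and that IPMI over LAN is enabled on the controllers. The helper name ipmi_power_status is hypothetical:

```shell
# Hypothetical helper: ask a node's IPMI controller for its power state.
# $1 = IPMI address, $2 = user, $3 = password
ipmi_power_status() {
    # -I lan: IPMI over LAN, matching the "lan" interface given on the
    # stonith_host lines in ha.cf
    ipmitool -I lan -H "$1" -U "$2" -P "$3" chassis power status
}
```

With the addresses and credentials from the ha.cf example, 'ipmi_power_status 192.168.3.33 ADMIN somepwd' would typically report "Chassis Power is on".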

Working Example

Create the configuration files below
Install heartbeat (from repositories if possible)
Start logd
Start heartbeat

The ha.cf file defines the cluster and how its nodes interact.

/etc/ha.d/ha.cf:

# Give cluster 30 seconds to start
initdead 30
# Keep alive packets every 1 second
keepalive 1
# Misc settings
traditional_compression off
deadtime 10
deadping 10
warntime 5
# Nodes in cluster
node nfs01 nfs02
# Use ipmi to check power status and reboot nodes
stonith_host    nfs01 external/ipmi nfs02 192.168.3.33 ADMIN somepwd lan
stonith_host    nfs02 external/ipmi nfs01 192.168.3.34 ADMIN somepwd lan
# Use logd, configure /etc/logd.cf
use_logd on
# Don't move service back to preferred host when it comes up
auto_failback off
# If all of these systems are down, it's a failure
ping_group lan_ping 192.168.3.1 192.168.3.13
# Takeover if pings (above) fail
respawn hacluster /usr/lib/heartbeat/ipfail

##### Use unicast instead of default multicast so firewall rules are easier
# nfs01
ucast eth0 192.168.3.32
ucast eth1 192.168.0.1
# nfs02
ucast eth0 192.168.3.31
ucast eth1 192.168.0.2

The haresources file describes resources provided by the cluster. Its format is: [Preferred node] [1st Service] [2nd Service]... Services are started in the order they are listed and stopped in the reverse order. They will start on the preferred node when possible.

/etc/ha.d/haresources:

nfs01 drbddisk::r0 Filesystem::/dev/drbd1::/mnt::ext4 IPaddr2::192.168.3.30/24/eth0 nfs-kernel-server

The logd.cf file defines logging for heartbeat.

/etc/logd.cf:

debugfile /var/log/ha-debug
logfile	/var/log/ha-log
syslogprefix linux-ha
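
Heartbeat also expects an /etc/ha.d/authkeys file, identical on both nodes and readable only by root, which is not shown above. The sketch below writes one using sha1 authentication. The helper name write_authkeys and the example secret are placeholders, not values from the original setup:

```shell
# Hypothetical helper: write a Heartbeat authkeys file with a shared secret.
# $1 = target path (normally /etc/ha.d/authkeys), $2 = shared secret
write_authkeys() {
    # Create the file with owner-only permissions from the start
    ( umask 077 && printf 'auth 1\n1 sha1 %s\n' "$2" > "$1" )
    chmod 600 "$1"
}
```

For example, run as root on both nodes with the same secret: write_authkeys /etc/ha.d/authkeys SomeSharedSecret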

Testing Fail-over

There are numerous tests you can perform. Try pinging the floating IP address while pulling cables, initiating heartbeat takeover, killing heartbeat with SIGKILL, etc... But my favourite test is of the NFS service, the part that matters the most. /var/log/ha-debug will have lots of details about what heartbeat is doing during your tests.

Testing NFS Fail-over

From another system, mount the NFS share from the cluster
Use rsync --progress -av to start copying a large file (1-2 GB) to the share
When the progress is 20%-30%, pull the network cable from the active node
Rsync will lock up (as intended) due to NFS blocking
After 5-10 seconds, the file should continue transferring until finished with no errors
Do an md5 checksum comparison of the original file and the file on the NFS share
Both files should be identical; if not, there was corruption of some kind
Try the test again by reading from NFS, rather than writing to it
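
The checksum comparison in the steps above can be scripted. A minimal sketch, assuming md5sum is available on the client. The helper name same_md5 and the file name 'bigfile' are hypothetical:

```shell
# Hypothetical helper: succeed if two files have the same MD5 checksum.
# $1 = original file, $2 = copy on the NFS share
same_md5() {
    sum_a=$(md5sum < "$1" | awk '{ print $1 }')
    sum_b=$(md5sum < "$2" | awk '{ print $1 }')
    # An empty checksum means a file could not be read, so treat it as a mismatch
    [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}
```

For example, after the interrupted copy has finished: rsync --progress -av bigfile /testnfs/ && same_md5 bigfile /testnfs/bigfile && echo "files match"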