High Availability NFS With DRBD + Heartbeat
Ryan Babchishin - http://win2ix.ca
This document describes information collected during research and development of a clustered DRBD NFS solution. This project had two purposes:
- HA NFS solution for Media-X Inc.
- Develop a standard tool kit and documentation that Win2ix can use for future projects
Operating System
The standard operating system at Win2ix is Ubuntu 12.04, so all testing was done with it as the preferred target.
Hardware
Because of the upcoming project with Media-X, computer hardware was chosen based on low cost and low power consumption.
A pair of identical systems was used for the cluster:
- SuperMicro SuperServer 5015A-EHF-D525 (default settings)
- Intel(R) Atom(TM) CPU D525 1.80GHz CPU (dual-core, hyper threaded)
- 4GB DDR3 RAM
- 2x750GB 2.5" Scorpio Black hard drives
- 3ware 9650 RAID-1 controller card with 128MB of RAM, without a battery backup
- 2x on-board Gigabit Ethernet
Partitioning and disk format
The disks were partitioned according to Win2ix standards.
- sda1 - / - ext4 - 20GB
- sda2 - /tmp - ext4 - 6GB
- sda3 - /var - ext4 - 6GB
- sda5 - swap - swap - 2GB
- sda6 - drbd - drbd - 716GB
Networking
- eth0 was configured with an otherwise unused 192.168.0.[1|2] address for communication over a direct link between the systems, with no switch. Bonding was tested; see below for more information.
- eth0 MTU was set to 9000
- eth1 was configured with a regular network IP address for SSH/NFS/etc... access on both systems
Bonding
Although not used in the end due to the lack of an extra PCI-E slot, Ethernet bonding was originally tested.
Important notes when using bonding:
- To get true single-connection load balancing in both directions, use bonding mode 0 (round robin) with NO SWITCH, or find a switch that supports it. Direct connections between systems work well. Switches generally support trunking, LACP or some other Cisco variant, and will most likely only send traffic for different IP connections over separate links. This won't help with DRBD, which uses a single connection.
- Make sure you are seeing full throughput on your bonding device in both directions by testing with something like iperf
Sample working /etc/network/interfaces configuration segment:
iface bond0 inet static
    address 192.168.0.1
    netmask 255.255.255.0
    bond-mode 0
    bond-miimon 100
    bond-slaves eth0 eth1
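Before benchmarking over a bond, it is worth confirming that it really came up in round-robin mode with both slaves active. A minimal sketch, assuming the bonding driver exposes its usual status file at /proc/net/bonding/bond0; the function name bond_summary is made up for illustration:

```shell
#!/bin/sh
# Print the bonding mode and the link state of each slave, read from the
# bonding driver's status file. Pass another file as $1 to override bond0.
bond_summary() {
    status_file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        /^Bonding Mode:/    { print "mode=" $2 }
        /^Slave Interface:/ { slave = $2 }
        # The first "MII Status" line belongs to the bond itself, so it is
        # skipped until a slave name has been seen.
        /^MII Status:/      { if (slave != "") print slave "=" $2 }
    ' "$status_file"
}
```

With the bond confirmed, running `iperf -s` on one node and `iperf -c 192.168.0.1` on the other (then swapping roles) shows whether both directions reach full throughput.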
Kernel Tuning
These sysctl changes seemed to make a small improvement, so I left them in place. They need to be added to /etc/sysctl.conf.
# drbd tuning
net.ipv4.tcp_no_metrics_save = 1
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 87380 33554432
vm.dirty_ratio = 10
vm.dirty_background_ratio = 4
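Once the lines are in /etc/sysctl.conf, `sysctl -p` loads them without a reboot. To check that the running kernel really has each value, something like the following could be used. It is only a sketch: the helper names are hypothetical, and it relies on each sysctl key mapping to a file under /proc/sys.

```shell
#!/bin/sh
# Map a sysctl key to its file under /proc/sys (dots become slashes).
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Collapse runs of whitespace and trim both ends, so multi-value keys such
# as tcp_rmem (tab-separated in /proc) compare cleanly.
squeeze() {
    printf '%s' "$1" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Compare every "key = value" line of a sysctl file with the live value.
check_sysctl_file() {
    conf="${1:-/etc/sysctl.conf}"
    grep -v '^[[:space:]]*#' "$conf" | grep '=' | while IFS='=' read -r key want; do
        key=$(squeeze "$key")
        want=$(squeeze "$want")
        have=$(squeeze "$(cat "$(sysctl_path "$key")" 2>/dev/null)")
        if [ "$have" = "$want" ]; then
            echo "ok   $key"
        else
            echo "DIFF $key (want: $want, have: $have)"
        fi
    done
}
```

Running `check_sysctl_file` on both nodes after `sysctl -p` should print only `ok` lines.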
DRBD
DRBD is the tricky one. It doesn't always perform well, despite what its developers would like you to think.
DRBD Performance
- Our test RAID controller had no battery, but did have 128MB of RAM
- DRBD flushes all writes to disk or uses barriers if you don't disable it. This is important for consistency but disks were not made to be used this way and it will not perform well in most circumstances.
- If you disable all flushing or barriers with no-md-flushes, no-disk-flushes, no-disk-barrier, DRBD performance will be nearly the native disk speed. However this means the DRBD device is more prone to corruption if it crashes, power goes out, etc...
- If you disable just meta-data flushing with no-md-flushes, performance is reasonable (about 80% of native) and you still get some protection. With meta-data flushing left on, performance was so bad that there was no point in using DRBD.
- If you have a battery backed RAID controller, disable all flushing and barriers. You will get near native performance and will not have to worry about data corruption
- Variable rate synchronization is excellent because it won't hurt DRBD performance and will use all bandwidth when it is free
- use-rle is a performance tweak that is enabled by default in the latest version, so I turned it on
- Basic tweaks like buffers, al-extents, etc... should all be enabled. They are well documented.
- Protocol C seems to perform almost the same as A and B (based on benchmarks). Protocol C provides the most protection.
- Our test systems had hard drives that were capable of write speeds very similar to 1Gb Ethernet transfer speeds. If faster disks/arrays were used, a faster link or bonding would be required for DRBD to keep up with writes.
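The last point is easy to sanity-check with arithmetic: with synchronous replication, writes cannot go faster than the replication link can carry them. A rough sketch with hypothetical function names; it ignores TCP and Ethernet overhead, so real figures will be somewhat lower:

```shell
#!/bin/sh
# Theoretical ceiling of a link in MB/s (decimal): Mbit/s divided by 8.
link_mb_per_s() {
    echo $(( $1 / 8 ))
}

# Say which side is the bottleneck.
# $1 = link speed in Mbit/s, $2 = measured disk write speed in MB/s
link_vs_disk() {
    link=$(link_mb_per_s "$1")
    if [ "$2" -gt "$link" ]; then
        echo "link-bound: disk $2 MB/s > link $link MB/s"
    else
        echo "disk-bound: disk $2 MB/s <= link $link MB/s"
    fi
}
```

For example, `link_vs_disk 1000 110` (a made-up 110 MB/s disk on gigabit) reports disk-bound, while `link_vs_disk 1000 300` reports link-bound, which is the case where bonding or a faster link would be needed.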
Working Example
A working, well-performing DRBD resource configuration:
resource r0 {
    net {
        #on-congestion pull-ahead;
        #congestion-fill 1G;
        #congestion-extents 3000;
        #sndbuf-size 1024k;
        sndbuf-size 0;
        max-buffers 8000;
        max-epoch-size 8000;
    }
    disk {
        #no-disk-barrier;
        #no-disk-flushes;
        no-md-flushes;
    }
    syncer {
        c-plan-ahead 20;
        c-fill-target 50k;
        c-min-rate 10M;
        al-extents 3833;
        rate 35M;
        use-rle;
    }
    startup {
        become-primary-on nfs01;
    }
    protocol C;
    device minor 1;
    meta-disk internal;
    on nfs01 {
        address 192.168.0.1:7801;
        disk /dev/sda6;
    }
    on nfs02 {
        address 192.168.0.2:7801;
        disk /dev/sda6;
    }
}
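Before benchmarking with a configuration like this, it helps to confirm that both nodes are connected and fully synchronised. A small sketch, assuming the DRBD 8.x /proc/drbd layout in which each device line starts with the minor number followed by cs:, ro: and ds: fields; drbd_state is a hypothetical helper:

```shell
#!/bin/sh
# Print the connection state, roles and disk states for one DRBD minor.
# $1 = minor number, $2 = status file (defaults to /proc/drbd)
drbd_state() {
    minor="$1"
    proc_file="${2:-/proc/drbd}"
    awk -v m="$minor:" '$1 == m { print $2, $3, $4 }' "$proc_file"
}
```

For the resource above, `drbd_state 1` should show `cs:Connected` and `ds:UpToDate/UpToDate` once the initial sync has finished.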
Relevant section of /etc/fstab used with this configuration:
# DRBD, mounted by heartbeat
/dev/drbd1 /mnt ext4 noatime,noauto,nobarrier 0 0
- 'nobarrier' makes a big difference in performance (on my test systems) and still maintains filesystem integrity
- 'noatime' makes a small performance difference by disabling access time updates on every file read
- 'noauto' stops the init scripts (mount -a) from mounting it - heartbeat will manage this
Benchmarking
Before bothering with NFS or anything else, it is a good idea to make sure DRBD is performing well.
Benchmark tools
- atop - watches CPU load, IO load, IO throughput, network throughput, etc... for the whole system in one screen, run on both systems during your benchmarking to see what's going on
- iptraf - detailed network information, throughput, etc... if you need to dig further
- bonnie++ - performs many IO benchmarks, gives a good idea of actual disk performance. Must use a data set at least 2x larger than physical RAM to be accurate
- postmark - great for overloading a system, performing many small writes/reads/appends, directory creation, etc... good for testing NFS performance when everything is up. Gives you ops/sec performance results per test type
- dd - basic, initial benchmarking - see dd section
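The "2x physical RAM" rule for bonnie++ can be computed rather than guessed. A minimal sketch; the helper name is hypothetical, and running the benchmark as user nobody is only an assumption about who can write to the target directory:

```shell
#!/bin/sh
# Print twice the physical RAM in MiB, read from a meminfo-style file.
# $1 = file to read (defaults to /proc/meminfo)
bonnie_size_mb() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { print int($2 / 1024) * 2 }' "$meminfo"
}

# Example (not run here): benchmark the DRBD mount with a data set of
# twice the RAM size. bonnie++ needs -u when started as root.
# bonnie++ -d /mnt -s "$(bonnie_size_mb)" -u nobody
```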
DD
There are some simple tests you can do to test performance of a storage device using DD. However, other tools should be used later for more accurate results (real world). When I'm benchmarking or trying to identify bottlenecks, I run atop on the same system in a separate terminal while dd is transferring data.
- Use direct access to write sequentially to the filesystem (doesn't work on all filesystems)
dd if=/dev/zero of=testfile bs=100M count=20 oflag=direct
- Use regular access to write sequentially to the filesystem, and flush before exit
dd if=/dev/zero of=testfile bs=100M count=20 conv=fsync
- Use direct access to read sequentially from the filesystem (doesn't work on all filesystems)
dd if=testfile of=/dev/null bs=100M iflag=direct
- Use regular access to read sequentially from the filesystem. Drop system cache before doing this or you might just read the file from cache
dd if=testfile of=/dev/null bs=100M
- Drop system cache
sync
echo 3 > /proc/sys/vm/drop_caches
- Write directly to block device, bypassing filesystem. This will destroy data on the block device
dd if=/dev/zero of=/dev/sdXX bs=100M count=20 oflag=direct
- Read directly from block device, bypassing filesystem
dd if=/dev/sdXX of=/dev/null bs=100M count=20 iflag=direct
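When repeating the same write test on several mount points (local disk, DRBD device, NFS mount), a small wrapper saves retyping. This is only a sketch with a made-up function name; it times the fsync'd write in whole seconds, so the figure is approximate and the size should be large enough for the run to take several seconds:

```shell
#!/bin/sh
# Write SIZE_MB of zeros into DIR with an fsync at the end, remove the
# file again and print an approximate throughput.
# $1 = directory to test, $2 = size in MB
seq_write_test() {
    dir="$1"
    size_mb="$2"
    file="$dir/ddtest.$$"
    start=$(date +%s)
    dd if=/dev/zero of="$file" bs=1M count="$size_mb" conv=fsync 2>/dev/null || return 1
    end=$(date +%s)
    rm -f "$file"
    elapsed=$(( end - start ))
    # Avoid dividing by zero when the write finishes within one second.
    [ "$elapsed" -gt 0 ] || elapsed=1
    echo "$size_mb MB in ${elapsed}s: $(( size_mb / elapsed )) MB/s"
}
```

For example, `seq_write_test /mnt 2000` matches the 2000MB written by the dd commands above.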
NFS Server
The only configuration was to '/etc/exports':
/mnt 192.168.3.0/24(rw,async,no_subtree_check,fsid=0)
- 'async' - this particular setup performed much better (about 50%) with async than with sync
- 'fsid=0' is a good thing to use in HA solutions. If all nodes use the same ID# (which is trivial) for the same mount, stale handles will be avoided after a fail over.
'/etc/idmapd.conf' may need to be adjusted to match your domain when using NFSv4 (on client and/or server)
NFS Client
In testing, I chose to use this command to mount NFS:
mount nfs:/mnt /testnfs -o rsize=32768,wsize=32768,hard,timeo=50,bg,actimeo=3,noatime,nodiratime
Explanation:
- rsize/wsize - set the read and write maximum block size to 32k, appropriate for the average file size of the customer's data
- hard - causes processes accessing the NFS share to block forever while it is unavailable, unless killed with SIGKILL
- noatime - do not update file access times every time a file is read
- nodiratime - like noatime but for directories
- bg - continue NFS mount in the background, rather than blocking - mostly to prevent boot problems
- tcp - not specified because it's a default - if udp is used, data could be lost during a fail over, tcp will keep trying
- timeo - retry NFS requests after 5 seconds (specified in 10ths of a second)
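The options requested on the command line are not always the ones that end up in effect (rsize and wsize, for example, are negotiated with the server), so it is worth checking what the kernel reports after mounting. A small sketch with hypothetical helper names, reading /proc/mounts on the client:

```shell
#!/bin/sh
# Print the option string the kernel is actually using for an NFS mount.
# $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
nfs_mount_opts() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { print $4 }' "${2:-/proc/mounts}"
}

# Convert a timeo value (tenths of a second) to seconds.
timeo_seconds() {
    echo "$(( $1 / 10 )).$(( $1 % 10 ))"
}
```

On the test client, `nfs_mount_opts /testnfs` lists the effective options, and `timeo_seconds 50` confirms that timeo=50 means 5.0 seconds.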
Clustering
Heartbeat without Pacemaker was chosen. Pacemaker seemed too complex and difficult to manage for what was needed.
Heartbeat
In this test setup, heartbeat has 2 Ethernet connections to communicate between nodes. The first is the network/subnet/LAN connection and the other is the DRBD direct crossover link. Having multiple connection paths is important so that one heartbeat node doesn't lose contact with the other. Once that happens, neither one knows which is master and the cluster becomes 'split-brained'.
STONITH
S.T.O.N.I.T.H = Shoot The Other Node In The Head
STONITH is the facility that Heartbeat uses to reboot a cluster node that is not responding. This is very important because heartbeat needs to know that the other node is not using DRBD (or other corruptible resources). If a node is really not responding at all, the other node will reboot it using STONITH, which uses IPMI (in the examples below) and then take over the resources.
When two nodes believe they are master (own the resources) it is called split-brain. This can lead to problems and sometimes data corruption. STONITH w/IPMI can protect against this.
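STONITH only helps if each node can actually reach the other node's IPMI controller, so it is worth checking that by hand before relying on it. A minimal sketch using ipmitool's LAN interface; the function names are made up, and the IPMITOOL variable exists only so the command can be swapped for a stub:

```shell
#!/bin/sh
# Ask a node's management controller for its power state over IPMI LAN.
# $1 = controller address, $2 = user, $3 = password
ipmi_power_status() {
    "${IPMITOOL:-ipmitool}" -I lan -H "$1" -U "$2" -P "$3" chassis power status
}

# Return 0 only if the controller answered; STONITH cannot work otherwise.
ipmi_reachable() {
    ipmi_power_status "$@" >/dev/null 2>&1
}
```

With the addresses used in the example configuration below, running `ipmi_power_status 192.168.3.33 ADMIN somepwd` on nfs01 should report the power state of nfs02.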
Working Example
- Create below configuration files
- Install heartbeat (from repositories if possible)
- Start logd
- Start heartbeat
The ha.cf file defines the cluster and how its nodes interact.
/etc/ha.d/ha.cf:
# Give cluster 30 seconds to start
initdead 30
# Keep alive packets every 1 second
keepalive 1
# Misc settings
traditional_compression off
deadtime 10
deadping 10
warntime 5
# Nodes in cluster
node nfs01 nfs02
# Use ipmi to check power status and reboot nodes
stonith_host nfs01 external/ipmi nfs02 192.168.3.33 ADMIN somepwd lan
stonith_host nfs02 external/ipmi nfs01 192.168.3.34 ADMIN somepwd lan
# Use logd, configure /etc/logd.cf
use_logd on
# Don't move service back to preferred host when it comes up
auto_failback off
# If all systems are down, it's failure
ping_group lan_ping 192.168.3.1 192.168.3.13
# Takeover if pings (above) fail
respawn hacluster /usr/lib/heartbeat/ipfail
##### Use unicast instead of default multicast so firewall rules are easier
# nfs01
ucast eth0 192.168.3.32
ucast eth1 192.168.0.1
# nfs02
ucast eth0 192.168.3.31
ucast eth1 192.168.0.2
The haresources file describes the resources provided by the cluster. Its format is: [Preferred node] [1st Service] [2nd Service]... Services are started in the order they are listed and stopped in the reverse order. They will start on the preferred node when possible.
/etc/ha.d/haresources:
nfs01 drbddisk::r0 Filesystem::/dev/drbd1::/mnt::ext4 IPaddr2::192.168.3.30/24/eth0 nfs-kernel-server
The logd.cf file defines logging for heartbeat.
/etc/logd.cf:
debugfile /var/log/ha-debug
logfile /var/log/ha-log
syslogprefix linux-ha
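One file is not shown in the listings above: heartbeat also expects /etc/ha.d/authkeys, readable only by root (mode 600) and identical on both nodes, and will refuse to start without it. A minimal sketch for generating one with a random sha1 key; make_authkeys is a hypothetical helper and the way the secret is derived is just one option:

```shell
#!/bin/sh
# Write a heartbeat authkeys file containing a single random sha1 key.
# $1 = output path (/etc/ha.d/authkeys on a real node)
make_authkeys() {
    out="$1"
    secret=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | sha1sum | cut -d' ' -f1)
    # Create the file with a restrictive umask so it is never world-readable.
    ( umask 077; printf 'auth 1\n1 sha1 %s\n' "$secret" > "$out" )
    chmod 600 "$out"
}
```

Generate the file once on nfs01 and copy it unchanged to nfs02, since both nodes must share the same key.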
Testing Fail-over
There are numerous tests you can perform. Try pinging the floating IP address while pulling cables, initiating a heartbeat takeover, killing heartbeat with SIGKILL, etc... But my favourite test is of the NFS service, the part that matters the most. /var/log/ha-debug will have lots of details about what heartbeat is doing during your tests.
Testing NFS Fail-over
- From another system, mount the NFS share from the cluster
- Use rsync --progress -av to start copying a large file (1-2 GB) to the share
- When the progress is 20%-30%, pull the network cable from the active node
- Rsync will lock up (as intended) due to NFS blocking
- After 5-10 seconds, the file should continue transferring until finished with no errors
- Do an md5 checksum comparison of the original file and the file on the NFS share
- Both files should be identical, if not, there was corruption of some kind
- Try the test again by reading from NFS, rather than writing to it
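The checksum comparison in the steps above can be scripted so the test is easy to repeat. A small sketch, with verify_copy as a made-up helper name:

```shell
#!/bin/sh
# Compare md5 checksums of the original file and its copy on the NFS share.
# Prints OK or CORRUPT; returns 0 on match, 1 on mismatch, 2 if a file
# cannot be read.
verify_copy() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    src_sum=$(md5sum < "$1" | cut -d' ' -f1)
    dst_sum=$(md5sum < "$2" | cut -d' ' -f1)
    if [ "$src_sum" = "$dst_sum" ]; then
        echo "OK $src_sum"
    else
        echo "CORRUPT $src_sum != $dst_sum"
        return 1
    fi
}
```

After the rsync in the test finishes, `verify_copy bigfile /testnfs/bigfile` (both names are placeholders) reports whether the fail-over damaged the transfer.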