In October 2012, I was fortunate enough to be given the task of designing and implementing a 1000 Terabyte (1 Petabyte) GlusterFS volume.
This post will discuss the use case, business requirements, design, and implementation of a large Red Hat Storage / GlusterFS environment on HP Proliant hardware.
A few simple business requirements were presented.
- Lowest possible acquisition cost
- 1 petabyte usable GlusterFS volume
- Red Hat Enterprise Linux clients
- 10Gb/s Ethernet
- commodity hardware
- HA (highly available) not required
- Archival of data from High Performance Compute environment.
- Backups / Geo-replication not required
Why GlusterFS? Quoting Red Hat:
“Red Hat Storage is software only, scale-out storage that provides flexible and affordable unstructured data storage for the enterprise. GlusterFS, a key building block of Red Hat Storage, is based on a stackable user space design and can deliver exceptional performance for diverse workloads. GlusterFS aggregates various storage servers over network interconnects into one large parallel network file system. The POSIX compatible GlusterFS servers, which use XFS file system format to store data on disks, can be accessed using industry standard access protocols including NFS and CIFS. Red Hat Storage can be installed on commodity servers and storage hardware resulting in a powerful, massively scalable, and highly available NAS environment.
Red Hat Storage Server for On-Premise enables enterprises to treat physical storage as a virtualized, scalable, and centrally managed pool of storage by using commodity server and storage hardware. Purchasing Red Hat Storage is similar to Red Hat Enterprise Linux where yearly subscriptions are purchased from Red Hat per node. The cost for a single node is the same no matter how much storage is attached to it.”
Why not Isilon or other scale out NAS products?
This shop already uses Isilon and other enterprise storage solutions. This project did not replace any of them; it was chosen to complement the existing storage solutions, not to displace them. I am not taking on the task of comparing Gluster to Isilon in this article. Isilon is a great product and I enjoy administering it as well.
Why Red Hat Storage instead of open source, free GlusterFS?
GlusterFS from Gluster.org is free and I have tested GlusterFS 3.3 extensively. It is a powerful product and is simple to use. I have tested the Red Hat, Debian, and CentOS versions. It seems pretty darn stable, but unless you pay a 3rd party service provider, there is no software support other than the community forums and mailing lists. If your shop is familiar and comfortable with an open source support model, go for it with no hesitation.
Red Hat Storage is a commercial “pre-packaged” Red Hat Enterprise Linux operating system build with custom, supported Red Hat versions of GlusterFS 3.3. It remains a bit of a mystery to me exactly which features or nuances are different between the open source GlusterFS and the GlusterFS supplied on the Red Hat Storage installation media. Suffice it to say they are very similar, with no obvious functionality or scalability differences. The main reason for selecting the commercial release is to get a tested, stable platform that comes with installation, maintenance, and break-fix support via email, web, or phone.
A Red Hat Network Subscription for Red Hat Storage covers software support for not only the Red Hat Enterprise Linux operating system but also GlusterFS. It all comes down to your shop’s culture and budget. Is it worth it to you to fork over several thousand dollars per node to help keep your GlusterFS volume up? Do you want vendor provided software support, log file and crash analysis? Is your cluster HA? Do you have RHEL support already in your shop?
The corporate customer site for this project has hundreds of RHEL subscriptions as well as several other vendors’ products with maintenance agreements. It was decided that it was worth having Red Hat support for GlusterFS as well as the RHEL operating system.
I feel that corporate GlusterFS customers would be better served if Red Hat supported a simple RPM based install of GlusterFS on RHEL 6. It seems to me that Red Hat made it easy on themselves by delivering GlusterFS RPMs on a pre-built RHEL version that can’t be heavily modified. The OS build is 6.2.z and Gluster is compiled against it. It is not possible to run Red Hat Enterprise Linux 6.3 with the Red Hat supplied/supported GlusterFS. This was a frustrating factor for this customer site because RHEL deployments are automated and highly customized. We really would have been better off if we could have just installed a supported version of Gluster onto existing RHEL servers.
Backblaze hardware was the original consideration for rock bottom cost per TB. It was ditched mostly due to the unfamiliar hardware which would require new processes for builds, monitoring, and support. I found this article a worthy read on the topic.
Supermicro hardware was considered for server and storage hardware also. There are plenty of 3rd party providers who could create service arrangements for this gear if you did not want to handle that in-house.
HP Proliant servers are used extensively at this site for both virtualized and bare metal Linux and Windows environments. Support procedures and agreements are already well established.
HP Proliant servers w/ D2600 disk enclosures were chosen. The initial acquisition cost per TB was higher than the other 2 options but the familiarity and maturity of the shop’s Proliant support outweighed the cost differences of developing and supporting a new hardware platform. D2600 enclosures using 3TB SATA drives were chosen over SAS drives to keep costs down. SATA performance is expected to meet customer requirements.
HP DL380e Gen8
2 Quad-Core Processors: Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
32GB RAM DDR3-RDIMM
HP Smart Array P822 Controller, 2GB onboard cache
60 x 3TB SATA drives on a single P822
Qlogic 1/10GbE dual-port Network Controller
Dual/redundant power supplies
Hard Disk Configuration Detail for a single GlusterFS Server
2 x 3TB mirrored disk for Red Hat Storage Operating System
58 x 3TB disks configured into five RAID 6 raid groups
131 TB Usable Disk space per Server for GlusterFS volumes
Note that Red Hat supports only 36 physical drives per server in a replicated volume. Since our lower availability requirements allowed the bulk of our volume to be Distributed, we were able to get an OK from Red Hat support for 58 data drives per volume.
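As a rough sanity check on the per-server capacity, the RAID 6 arithmetic can be sketched in a few lines of Python. The split of the 58 drives into four 12-drive groups and one 10-drive group is my assumption for illustration (the list above only says five RAID 6 groups), and the sketch ignores controller, LVM, and filesystem overhead. Under that assumed split the result lands close to the 131 TB figure when capacity is counted in binary terabytes.

```python
# Sketch: usable capacity of one server's 58 data drives in five RAID 6 groups.
# ASSUMPTION: the 4 x 12 + 1 x 10 drive split is hypothetical; the post only
# states "five RAID 6 raid groups".
DRIVE_BYTES = 3 * 10**12   # a "3TB" drive as marketed (decimal terabytes)
TIB = 2**40                # binary terabyte, as most OS tools report sizes


def raid6_usable_tib(drives):
    """RAID 6 spends two drives' worth of capacity on parity."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drives - 2) * DRIVE_BYTES / TIB


groups = [12, 12, 12, 12, 10]                      # hypothetical layout
per_group = [raid6_usable_tib(n) for n in groups]  # usable TiB per group
per_server = sum(per_group)                        # usable TiB per server

print([round(g, 1) for g in per_group])
print(round(per_server, 1))
```

Counting the same drives in decimal terabytes would give a larger number, so the unit matters when comparing this against vendor datasheets.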
Disk Enclosure Cabling Configuration for a single GlusterFS Server
Cabling configuration is a single-domain, maximum capacity configuration. One cable path is created between the host, the primary disk enclosure, and the additional cascaded disk enclosures.
Rack Layout for 8 Servers ( 1048TB usable storage )
10U per server including 3 x D2600s.
Network Layout for each GlusterFS server.
2-port 10Gb Ethernet NIC, twinax cables. Redundant switching. Clients are attached to the same switches. Jumbo frame MTU.
Although a single volume comprising all of the storage was considered, we ended up with a large distributed volume to provide the bulk of the storage and a smaller distributed/replicated volume to store data with highly available service requirements. 15% of the space in the LVM volume groups was left unallocated for the potential use of GlusterFS checkpoints, which are said to be coming in a future release.
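To make the two-volume layout concrete, here is a minimal Python sketch that assembles the `gluster volume create` command lines for it. The hostnames (`rhs01` to `rhs08`), brick paths, and volume names (`archive`, `ha`) are hypothetical placeholders, not the names used at this site. The sketch only builds and prints the strings; it does not run anything.

```python
# Sketch: build `gluster volume create` command lines for the two volumes.
# ASSUMPTIONS: hostnames, brick paths, and volume names are hypothetical.
SERVERS = [f"rhs{n:02d}" for n in range(1, 9)]   # 8 storage servers


def bricks(path_template, per_server):
    """List bricks so that consecutive entries sit on different servers."""
    return [
        f"{server}:{path_template.format(i=i)}"
        for i in range(1, per_server + 1)
        for server in SERVERS
    ]


def create_command(volume, brick_list, replica=1):
    parts = ["gluster", "volume", "create", volume]
    if replica > 1:
        # Consecutive bricks in the list form a replica set, so the brick
        # count has to be a multiple of the replica count.
        if len(brick_list) % replica:
            raise ValueError("brick count must be a multiple of replica count")
        parts += ["replica", str(replica)]
    parts += ["transport", "tcp"] + brick_list
    return " ".join(parts)


# Large distributed volume: 4 bricks on each of the 8 servers.
archive_cmd = create_command("archive", bricks("/bricks/dist{i}/brick", 4))
# Smaller distributed/replicated volume: 1 brick per server, 2 copies.
ha_cmd = create_command("ha", bricks("/bricks/repl{i}/brick", 1), replica=2)

print(archive_cmd)
print(ha_cmd)
```

Ordering the brick list server by server matters for the replicated volume: because adjacent bricks become replicas of each other, each pair ends up on two different servers rather than on one.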
The main point to make on using Distributed volumes is that each server becomes a single point of failure. While each server is filled with redundant parts to remain online, in the event of a server outage the volume becomes unstable. The effects are somewhat unexpected as well. For example, with 1 node down, the gluster volume simply looks smaller to a client, and some files become unavailable, producing i/o errors on r/w requests.
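The effect of losing one node can be illustrated with a small simulation. GlusterFS places each file on a brick based on a hash of its name; the MD5-modulo placement below is only a stand-in I am assuming for illustration, not the real algorithm, and the file names are made up. With 8 servers holding 4 bricks each, roughly one file in eight should become unreachable when a single server is down.

```python
import hashlib

SERVERS = 8
BRICKS_PER_SERVER = 4   # the distributed volume: 32 bricks in total


def brick_for(filename):
    """Pick a brick for a file by hashing its name.

    ASSUMPTION: MD5 modulo the brick count is a simplified stand-in for
    GlusterFS's own hash-based placement.
    """
    digest = hashlib.md5(filename.encode()).digest()
    return int.from_bytes(digest[:8], "big") % (SERVERS * BRICKS_PER_SERVER)


def server_for(brick):
    """Bricks 0-3 live on server 0, bricks 4-7 on server 1, and so on."""
    return brick // BRICKS_PER_SERVER


files = [f"run-{n:05d}.dat" for n in range(20000)]   # hypothetical names
down_server = 3
unavailable = [f for f in files if server_for(brick_for(f)) == down_server]
fraction_unavailable = len(unavailable) / len(files)

print(f"{len(unavailable)} of {len(files)} files unreachable "
      f"({fraction_unavailable:.1%})")
```

The files that remain reachable are scattered across every directory, which is why a degraded distributed volume is awkward for clients: there is no clean subset of the tree that is guaranteed to be intact.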
Performance, Monitoring, Management
The performance and i/o characteristics of the Red Hat Storage volume are excellent. I plan to document these topics in a later post.
Summary of Volume Layout
4 x 24.9TB bricks
1 x 19.9TB brick
6TB operating system disk
15% unallocated free space = 19.6TB
- reserved for checkpoints when supported
8 servers, 1 brick on each.
19.9TB x 8 = 159.2TB total disk
Divide by 2 for replication = 79.6TB usable
8 servers, 4 bricks on each.
24.9TB x 32 = 796.8TB
876.4TB total usable space
1033.9TB Total disk space
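The usable-space figures in the summary can be reproduced with a few lines of Python. This is just the arithmetic from the list above, using the brick sizes and counts as stated; the variable names are mine.

```python
# Sketch: reproduce the usable-capacity arithmetic from the summary.
SERVERS = 8
DIST_BRICK_TB = 24.9        # distributed volume: 4 bricks per server
DIST_BRICKS_PER_SERVER = 4
REPL_BRICK_TB = 19.9        # replicated volume: 1 brick per server
REPL_BRICKS_PER_SERVER = 1
REPLICA = 2                 # every file is stored twice

dist_usable = DIST_BRICK_TB * DIST_BRICKS_PER_SERVER * SERVERS
repl_raw = REPL_BRICK_TB * REPL_BRICKS_PER_SERVER * SERVERS
repl_usable = repl_raw / REPLICA
total_usable = dist_usable + repl_usable

print(round(dist_usable, 1))    # 796.8
print(round(repl_raw, 1))       # 159.2
print(round(repl_usable, 1))    # 79.6
print(round(total_usable, 1))   # 876.4
```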