Housing is still using VMware as our virtualization platform, and we needed to refresh our aging storage system. Several other units on campus have been using Dell Compellent, and we got an excellent price on an SC4020, so we bought one too.
The config includes 24 1.2TB 10k SAS disks with 1 hot spare, which works out to 25TB of available space. We bought the Fast Track license, so we have 1 tier of storage with 4 virtual RAID levels: RAID 10-DM (Fast and Standard) and RAID 6-10 (Fast and Standard).
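That 25TB figure checks out if you do the conversion most storage tools do, from the decimal terabytes on the disk label to binary terabytes. A quick back-of-the-envelope sketch (my arithmetic, not Dell's sizing math, and it ignores any vendor overhead):

```python
# Rough capacity math for the SC4020 config (vendor overhead ignored).
disks = 24
hot_spares = 1
disk_tb = 1.2  # decimal terabytes, as printed on the label

raw_bytes = (disks - hot_spares) * disk_tb * 10**12
tib = raw_bytes / 2**40  # binary terabytes, as most tools report it
print(f"{tib:.1f} TiB")  # roughly 25.1
```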
We had a couple of issues with our install. One fiber cable, from one of the fibre channel switches to a node, was throwing enough errors that it couldn't hold that path up but, annoyingly, wouldn't drop the link completely either. Troubleshooting that was fairly obvious: the HBA port on the ESX node was flashing the link light regularly. We disconnected the bad cable and let the magic of redundant links carry IO from that node over the other HBA port.
We went ahead and created a volume and presented it to the 3 nodes. Then we Storage vMotioned a small, unimportant test VM over to the SC4020. It took over 30 minutes to move a 16GB .vmdk, changing it from thin provisioning to thick provision lazy zeroed (TPLZ) in the process. That's the bump on the left-hand side of the graphic, capped at less than 3 MB/sec. This seemed slow to me, but I didn't really have anything to compare it to. Then I noticed some errors in the vCenter logs:
Lost access to volume (CML_DISK1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Successfully restored access to volume (CML_DISK1) following connectivity issues.
Did we have another bad cable? There was nothing in the logs on the individual nodes or the SC4020, so it was time to dig into the fibre channel switch. When was the last time you looked at the switch port stats on your SAN? We use Brocade switches, so there are two options; I chose to look in the web interface first, not knowing that the CLI was a much better choice. Poking through the port details, I was able to find a port that was logging hundreds of errors per second: more than 300 million before I swapped out the cable. Verifying the error rate in the CLI was as simple as 'porterrshow':
SW6505:admin> porterrshow 4
        frames      enc   crc   crc    too   too   bad   enc    disc  link  loss  loss  frjt  fbsy  c3timeout   pcs
      tx     rx     in    err   g_eof  shrt  long  eof   out    c3    fail  sync  sig               tx    rx    err
 4:  12.5m  2.9m    130   130   125    0     0     5     308.9m 44    1     0     1     0     0     38    0     0
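With output like that captured to text, it's easy to script a sanity check across ports. Here's a minimal sketch; the counter names, threshold, and suffix handling are my illustrative choices, not Brocade's API, though the "m" suffix does mirror how porterrshow abbreviates millions:

```python
# Sketch: flag switch ports whose error counters look suspicious,
# given counters already pulled from porterrshow-style output.
# Names and threshold are illustrative, not a Brocade interface.

def parse_count(token: str) -> float:
    """Convert an abbreviated count like '308.9m' (millions) to a number."""
    suffixes = {"k": 1e3, "m": 1e6, "g": 1e9}
    if token[-1].lower() in suffixes:
        return float(token[:-1]) * suffixes[token[-1].lower()]
    return float(token)

def suspicious(counters: dict, threshold: float = 1e6) -> list:
    """Return the names of counters exceeding the threshold."""
    return [name for name, val in counters.items() if val > threshold]

# A few of port 4's counters from the output above
port4 = {
    "enc_out": parse_count("308.9m"),
    "crc_err": parse_count("130"),
    "disc_c3": parse_count("44"),
    "loss_sig": parse_count("1"),
}
print(suspicious(port4))  # ['enc_out']
```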
So it was time for another Storage vMotion, this time a similar 18GB VM switching from thin to TPLZ. That's the right-hand side of that graph: well over 20 MB/sec. Glad we had a spare cable. And now to order some more.
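To put that rate difference in perspective, here's what it means for transfer time, using the rough rates from the two runs (my back-of-the-envelope numbers, treating the rates as sustained):

```python
# Rough transfer-time comparison for the two Storage vMotions.
def minutes(size_gb: float, rate_mb_s: float) -> float:
    """Transfer time in minutes at a sustained rate (1 GB = 1024 MB here)."""
    return size_gb * 1024 / rate_mb_s / 60

slow = minutes(16, 3)   # the bad-cable run, capped near 3 MB/sec
fast = minutes(18, 20)  # after the cable swap, well over 20 MB/sec
print(f"{slow:.0f} min vs {fast:.0f} min")  # 91 min vs 15 min
```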
After we get some additional VMs on the SC4020, I'll write a post about Data Progression, Fast Track, and maybe some replication.