For a large part of 2014, I was involved in a Proof-of-Concept (POC) at work with EMC and a great Adobe storage engineer, Jason (@jason_farns), on using Isilon as the HDFS layer for virtualized Hadoop clusters. After many hours, long weekends and serious amounts of trial-and-error, a whitepaper has been published on the work we did. This is the first published paper I have been involved in, and it is really exciting to see it out there for everyone to read.
There is always more work to be done, but this was a great start.
During the VMworld EMEA presentation (Tuesday, October 14, 2014), the performance question came up again: what are the positives and negatives of using Isilon as the HDFS layer for the data warehouse? As with any benchmark or performance test, results will vary based on your data set, the hardware you are using and how the clusters are configured. However, some things I’ve learned over the last year and a half apply broadly and show the advantages of using Isilon as the HDFS layer, especially when you have very large data sets (10+ Petabytes).
There are two benchmarking tests I want to focus on in this post. The tests themselves demonstrate the need to understand the workload (Hadoop job), the size of the data set, and the individual configuration settings (YARN, MapReduce, and Java) for the compute worker nodes.