It’s been very quiet around here since VMworld US ended a little over one month ago. I have had my head down studying for my VCP5-DCV exam — which I am taking on Wednesday. The rest of my spare time has been consumed getting ready for VMworld EMEA in Barcelona, Spain. I took the feedback received after VMworld US and will be showing a demo of Hadoop being deployed virtually through the vCAC orchestration workflows that interact with Big Data Extensions.
I am looking forward to my trip to Spain. I am planning on having several days to wander about and see some of what Europe has to offer — especially an FC Barcelona game on Saturday the 18th.
I do have some good things planned for the site, including posts on Isilon performance metrics with Hadoop, expanding BDE functionality to include Flume nodes, and blueprints for deploying HDFS-only virtual clusters to be used as a unified data warehouse layer.
It is going to be a very busy winter here in Utah this year with all of the Hadoop work, next-generation OpenStack (VMware Integrated OpenStack) and preparing for the VCAP5-DCA exam in January.
The conference was completed just over two weeks ago, and since then I’ve had the opportunity to go through my notes, think about the sessions I attended and summarize what insight I gained while there.
The biggest takeaway I had from VMworld 2014, compared to last year, was seeing how the lessons learned in 2013 were applied in 2014. The key insight in 2013 was that many other VMware partners and customers were facing the same challenges around standardization, automation and self-service. It was helpful to learn that the things we were trying to accomplish within our department at Adobe were not unique to us.
This year, 2014, I learned that we have solved many of last year's challenges and now have great insight to offer to the community. As we build through the standardization, automation and self-service phases of delivering comprehensive IaaS and PaaS offerings, we are doing what we can to share that information with the broader community.
All of that is wonderful, but what are the next steps for our team, the market and others in the virtualization space? We heard a lot at the conference about OpenStack, Docker, VSAN and other emerging technologies. The focus I personally have for the next year is going to revolve around further implementation of the Hadoop ecosystem, using VMware technologies, and building out larger, comprehensive PaaS offerings.
There are many questions to be answered around how OpenStack and Docker play in the space. I am looking forward to the challenges ahead as we work with our engineering teams.
Should be an exciting year!
Yesterday was another great day in San Francisco at VMworld 2014. The big takeaway I had revolved around Docker and VMware integration. There is a great article over on the Office of the CTO blog regarding this exact topic. Two key takeaways from the Docker CEO's portion of the talk (paraphrased):
- Use VMs for security and consistency, and use Docker for speed of deployment.
- Docker + VMware gets you the best of both worlds.
There are some exciting things, like Project Fargo, going on in the space right now that should enable Operations teams to incorporate Docker into their existing environments to give their applications the flexibility next-generation apps and engineering teams are starting to require.
Beyond the sessions, the CTO party last night was really amazing! Lots of networking and conversations were taking place, and I was able to gain some good insight into how Mesos could be used to replace YARN. I am excited to follow up on several of those conversations.
Today kicked off the US VMworld 2014 conference in San Francisco and it was pretty exciting. The first of two VAPP1428 sessions took place this afternoon where I had an opportunity to talk about the exciting work I have been doing the last year at Adobe to build out a Hadoop-as-a-Service offering. It was a great talk and Andrew Nelson was an awesome co-presenter who offered great insight into what VMware is doing in the virtual Hadoop space.
A few hours later, I was fortunate to have an opportunity to sit on a panel with several other distinguished guests to talk about best practices around Hadoop and Big Data in a virtualized environment. In that session we were able to share our insights into the decisions that each of us made for our organizations and the successes we have seen in the space, utilizing not only VMware but also our understanding of Hadoop for our individual workloads.
It was great to hear Chris from FedEx talk about how they too are utilizing the EMC Isilon HDFS plugin to offer out a unified HDFS layer to the virtual environment for Big Data Extensions to build compute clusters. We have been able to do some great work with EMC the past few months — details to be provided in a future post — around how using the Isilon storage that is already within our data centers will allow us to offer a robust storage layer to our Hadoop clusters.
All in all, it was a great first day of the conference. There were many exciting announcements made during the keynote, not the least of which was the partnership VMware now has with both Docker and OpenStack. There are a lot of exciting things happening in that space and as I look to the future it is pretty awesome.
A big shout out to all the individuals who came and asked questions. It is my hope that everyone who attended walked away with at least one positive takeaway. If you have any additional questions, please reach out to me on Twitter (@chrismutchler) or through email.
There is a quote in Hadoop Operations by Eric Sammer (O’Reilly) that states:
“The complexity of sizing a cluster comes from knowing — or more commonly, not knowing — the specifics of such a workload: its CPU, memory, storage, disk I/O, or frequency of execution requirements. Worse, it’s common to see a single cluster support many diverse types of jobs with conflicting resource requirements.”
In my experience, that is an accurate statement. It does not, however, preclude one from gathering that very information so that an intelligent decision can be made. In fact, VMware vCenter Operations Manager becomes an invaluable tool in the toolbox when developing the ability to maintain the entire SDLC of a Hadoop cluster.
Initial sizing of the Hadoop cluster in the Engineering|Pre-Production|Chaos environment of your business will involve some amount of guessing. You can stick with the tried-and-true methodology of answering two questions: “How much data do I have for HDFS initially?” and “How much data do I need to ingest into HDFS daily|monthly?” From there, you’ll need to start monitoring the workload(s) placed on the Hadoop cluster and begin determining the cluster size it will need once it moves into the QE, Staging and Production environments.
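To make that initial guess a little less hand-wavy, the two sizing questions can be turned into a rough capacity estimate. The sketch below is a hypothetical back-of-the-envelope calculation — the per-node usable capacity, replication factor, headroom percentage and planning horizon are all assumptions you would adjust for your own environment — and is no substitute for actually monitoring the workload:

```python
import math

def estimate_datanodes(initial_tb, daily_ingest_tb, horizon_days,
                       replication=3, usable_tb_per_node=20.0,
                       headroom=0.25):
    """Rough DataNode count from the two sizing questions:
    starting HDFS data plus daily ingest over a planning horizon.
    All default values are illustrative assumptions."""
    # Logical data grows by the ingest rate over the horizon.
    logical_tb = initial_tb + daily_ingest_tb * horizon_days
    # HDFS stores every block `replication` times.
    raw_tb = logical_tb * replication
    # Reserve extra space for shuffle/spill and temporary files.
    raw_tb *= 1 + headroom
    return math.ceil(raw_tb / usable_tb_per_node)

# Example: 50 TB to start, 0.5 TB ingested per day, sized for one year.
nodes = estimate_datanodes(50, 0.5, 365)
```

With those example inputs the estimate comes out to 44 DataNodes — a starting point to validate (or invalidate) once real workload metrics start flowing into vCenter Operations Manager.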