Series: The Hidden Challenges of Putting Hadoop and Spark in Production
A recent Gartner survey estimates that only 14% of Hadoop deployments are in production. We’re not surprised. We’ve been in many conversations with companies that have been piloting Hadoop to bolster their analytic capabilities beyond relational databases. Common challenges fall into a few important categories, which we explore in this blog series:
- Infrastructure: Choosing and configuring servers for Hadoop
- Performance Optimization: Scaling and tuning Hadoop for price-performance
- To Cloud or Not: Selection, configuration and new challenges
- Security: Strategies for hardening and access control
Hadoop in the Cloud: Flexible, but Challenging
Many enterprises are asking tough – but reasonable – questions about Hadoop, Spark and many other new data technologies:
How do I get a functional production environment more quickly? What do I buy vs. build? How can services help?
How do I deal with the shifting open-source software landscape in a sustainable manner? Do I bet on the currently promising processing engine, or do I hedge my bets?
As our series introduction stated, one survey estimated that only 14% of Hadoop deployments are in production. That’s because, in most corporate environments, production-ready means that the cluster is secure and has established operational processes for access, governance, compliance and service-level agreements (SLAs). At the most basic level, companies must have a plan for how they will monitor infrastructure, software components and overall system health.
But they also need to plan for processes like patching and upgrading, which are complex in an ecosystem like Hadoop, Spark and related technologies that evolves at a rapid pace. Keeping a record of the hundreds of configurations that have been applied to each layer of the stack is mandatory if you want to recover from a disaster.
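One lightweight way to keep that record is to snapshot and hash the configuration files of each stack layer, so you can detect drift and restore a known-good state. A minimal sketch in Python – the layer names and config paths below are illustrative assumptions, not a standard layout:

```python
import hashlib
from pathlib import Path

# Illustrative config locations; real clusters differ per distribution.
CONFIG_DIRS = {
    "hdfs": "/etc/hadoop/conf",
    "spark": "/etc/spark/conf",
    "hive": "/etc/hive/conf",
}

def snapshot(dirs=CONFIG_DIRS):
    """Record a SHA-256 hash of every config file in each stack layer."""
    record = {}
    for layer, root in dirs.items():
        record[layer] = {
            str(p): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(Path(root).glob("*")) if p.is_file()
        }
    return record

def drift(old, new):
    """Return config files whose contents changed between two snapshots."""
    changed = []
    for layer in old:
        for path, digest in old[layer].items():
            if new.get(layer, {}).get(path) != digest:
                changed.append(path)
    return changed
```

Storing each snapshot (for example, in version control) gives you the audit trail that disaster recovery and compliance reviews both ask for.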
This is especially challenging in enterprises, which is why we often hear of initial production Hadoop environments taking nine to 12 months (at best!). It’s the operational processes and maintenance that can turn into a much bigger headache than the software deployment.
Leveraging the cloud for Hadoop: Helpful, not a panacea
Deploying Hadoop and Spark in the cloud means that there is no long-term commitment to either the server type that you choose for your data nodes or the number of nodes that you start with. There is little need for a configuration that goes beyond the project(s) that are currently funded and in motion. This means that sizing your cluster can be based on known variables for existing projects rather than trying to look out into the future and extrapolate usage. That’s helpful.
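Sizing from known variables is simple arithmetic. A hedged sketch – the replication factor is the common HDFS default, but the per-node capacity and headroom figures are illustrative assumptions, not a prescription:

```python
import math

def data_nodes_needed(raw_tb, replication=3, usable_tb_per_node=24,
                      headroom=0.25):
    """Estimate data node count from known project variables.

    raw_tb: data the funded projects will actually load (TB)
    replication: HDFS replication factor (3 is the common default)
    usable_tb_per_node: disk per node after OS and temp space (assumed)
    headroom: fraction of capacity kept free for growth and shuffle (assumed)
    """
    required_tb = raw_tb * replication / (1 - headroom)
    return math.ceil(required_tb / usable_tb_per_node)

# 50 TB of project data at 3x replication with 25% headroom -> 9 nodes
print(data_nodes_needed(50))
```

On premises you would pad these numbers for several years of guessed growth; in the cloud you can size for today and resize later.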
A second advantage is that you have flexibility if you get the initial sizing wrong. Removing a node because you over-provisioned is a viable option. Try doing that on premises! The cloud gives you a lot of flexibility at the infrastructure layer, but you still need to lay down the software correctly and keep that overall configuration operational.
Configuring and Managing Hadoop in the Cloud
Configuring Hadoop and Spark in the cloud, and keeping it operational, is time-consuming, and this is where the “as a service” model starts to really show its value. If you’ve not yet run your own production Hadoop environment, ask a friend who has. There’s a growing understanding of the care and feeding of a cluster, cloud or not.
That’s why having automated processes that ensure configurations can be repeated and verified is a critical step for a production environment. You must have processes to conform to the various compliance attestations that production environments require – and it’s not easy to DIY:
- Can you guarantee that your cluster is Kerberos enabled and configured in the same way each time?
- Can you guarantee the cluster is configured appropriately with gateway nodes to ensure that you have control on all service access?
- What happens when one of the cloud VMs that supports a data node gets sick or dies altogether?
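Checks like the first two can be automated so they run identically on every deployment. A minimal sketch – `hadoop.security.authentication` is a real Hadoop configuration key, but the gateway-allowlist check and the function names are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

def hadoop_props(site_xml_path):
    """Parse a Hadoop *-site.xml file into a {name: value} dict."""
    tree = ET.parse(site_xml_path)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in tree.getroot().iter("property")
    }

def verify_cluster(core_site_path, approved_gateways, live_gateways):
    """Run repeatable pre-production checks; return a list of failures."""
    failures = []
    props = hadoop_props(core_site_path)
    # Check 1: Kerberos is actually enabled, not just assumed.
    if props.get("hadoop.security.authentication") != "kerberos":
        failures.append("Kerberos not enabled in core-site.xml")
    # Check 2: service access goes only through approved gateway nodes.
    unexpected = set(live_gateways) - set(approved_gateways)
    if unexpected:
        failures.append(f"Unapproved gateway nodes: {sorted(unexpected)}")
    return failures
```

Running a script like this after every deployment or patch is how you guarantee the cluster is configured the same way each time, rather than hoping it is.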
And if all of that sounds scary or techie to you, are you really ready to DIY in the cloud?
All of these operational responsibilities are necessary for Hadoop in the cloud. They are time-consuming and take away valuable staff time from analytics. And this doesn’t even touch on security, which is a topic all by itself.
But the good news is that there are services like Cazena’s that can get you to production with Hadoop in the cloud quickly, without worrying about any of the above. That’s the trend among many of the data science, advanced analytics and business teams we’ve been interacting with. It didn’t take long for them to realize the value of a fully managed platform: you get to analytic tasks faster and dedicate more of your team to analytics, rather than dev-ops and keeping the lights on.