We've been collecting data about the costs of big data DevOps and the costs of running your own big data platform as a service (PaaS) in the cloud. One interesting area of research was the cost of assembling the team required to build and run your own Hadoop/Spark PaaS in the cloud. To do this, we identified the basic skills required for a big data team to operate a PaaS, and then researched available jobs online to get a sense of the job market. The number of available jobs and the salary ranges are significant, reflecting the diversity of companies and interest in this area.
Why are we so interested? Cazena's Big Data as a Service is fully-managed and automated with DevOps built-in. That means that companies don't need to hire a team to build and operate the platform. Everything, including automation and big data DevOps, is included with Cazena's subscription. That represents a massive savings in salaries and administration time for operations, and allows companies to focus resources on more strategic activities.
The goal of this project was to help teams think through the costs of running their own platform, and whether it's really worth it, given the alternatives. The job search site Indeed.com tracks a lot of data and offers interesting analytics. Indeed tracks salary averages for common job categories, including salary lows and highs. They also track the number of available jobs, which gives a sense of the competitiveness of some of these positions. Using data from Indeed.com's website, we put together the infographic above on the roles and salary ranges required to run a big data platform in the cloud.
To be exact, the range is $629,000 to $1,186,000 in annual team salaries for Big Data DevOps. We also noticed that some of these roles have a huge number of openings, indicating high demand. And some roles, such as DevOps, have an average tenure of less than a year, according to Indeed.com's summary, which points to a high risk of turnover. Markets change, so although we pulled these numbers recently, we've included links below so you can click through for current data in your region. When you search a job, look at the bottom right to see the salary averages, and note the vast range of salaries depending on company size, industry and location.
*Salary range: $122,980 to $230,000
Description: The Hadoop Administrator will provision and tune Hadoop/Spark nodes, along with the attached data stores and centralized object store required to deliver workload performance. This is needed whether you run open-source or commercial Hadoop software from Cloudera, Hortonworks or others, or a cloud Hadoop solution such as EMR. View current salaries & job listings on Indeed.com
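To make the tuning work concrete, here's a minimal sketch of the kind of per-node sizing arithmetic a Hadoop administrator does constantly. The function name, the 5-cores-per-executor rule and the memory reserves are simplified, illustrative assumptions, not official Cloudera or AWS guidance:

```python
# Illustrative sketch: rough per-node memory sizing for YARN/Spark executors.
# The heuristics below are simplified assumptions, not vendor guidance.

def size_node(total_mem_gb: int, cores: int, os_reserve_gb: int = 8):
    """Split a worker node's resources between the OS and Spark executors."""
    yarn_mem_gb = total_mem_gb - os_reserve_gb      # leave headroom for OS and daemons
    executors = max(1, cores // 5)                  # ~5 cores per executor (common rule of thumb)
    raw_exec_mem = yarn_mem_gb // executors
    exec_mem_gb = int(raw_exec_mem * 0.9)           # hold back ~10% for memory overhead
    return {"executors": executors,
            "executor_cores": 5,
            "executor_memory_gb": exec_mem_gb}

print(size_node(total_mem_gb=128, cores=16))
# {'executors': 3, 'executor_cores': 5, 'executor_memory_gb': 36}
```

Multiply this by dozens of node types, workloads and software versions, and it's clear why this is a dedicated, well-paid role.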
Description: The Cloud Security Architect will oversee all aspects of security, from platform to network to data. They administer security controls such as encryption, key management, identities and role-based access control, as well as establish and ensure compliance controls. This requires a comprehensive knowledge of security protocols, plus knowledge specific to the deployment platform (AWS, Azure). View current salaries & job listings on Indeed.com
Description: Responsible for managing and administering data ingestion, data governance and logging as well as managing user access from a variety of data engineering, machine learning and SQL tools. View current salaries & job listings on Indeed.com
Description: This role covers first- and second-line alerting, support, root-cause analysis and upgrade/patching/validation issues. It is also a catch-all role for operational tasks like sprint tracking, billing, and SLA monitoring and management. View current salaries & job listings on Indeed.com
*Salary ranges, averages and max from Indeed.com, May 2018
Obviously, teams will vary depending on the size of the company, cluster and location. Note that the sample above is fairly conservative, assumes a medium-sized cluster and does not include any of the people needed for actual analysis and data science. This is simply the team required to operate the PaaS -- you may even need a few more for the setup (depending on how fast you want to get started) or more for larger global deployments.
You'll need to plan for development, deployment and ongoing operations. The basic big data team roles will require some crossover training for 24x7 support, and additional resources would be required for major upgrades. In corporate environments, you may have access to additional resources from other IT or security groups, but beware of splitting platform roles across too many teams or people, as that can create security risks.
Cazena is the First Fully-Managed Big Data as a Service with DevOps Built-in, so you don't need to hire a team just to run your PaaS. Cazena provisions secure cloud environments in just a few hours, ready for data loading and analytics. That means you can start quickly, scale spending with platform use and save 50%+ over the cost of assembling and managing your own team. To learn more about what's included, read about "Fully-Managed (Big Data DevOps Built-in) vs. Big Data PaaS" or get in touch to talk more.
Teams must evaluate, select and implement a wide range of technologies for DIY cloud big data platforms, whether deploying in a data center or a public cloud. In fact, 50+ technologies is a conservative count for most enterprise DIY platforms. And of course, the team will need to test, benchmark and validate these technologies with each other, both at deployment and on an ongoing basis, as each technology changes.
Particularly in the rapidly evolving cloud and big data technology markets, teams must expect and plan for rapid change. Both open source Hadoop technologies and public cloud providers such as AWS and Google change regularly, and have their own ecosystem of related products and services. It is not uncommon for each component in a big data architecture to have different patching and upgrade cycles. And staying up to date is absolutely critical -- weak integrations or missed patches can be high-risk security threats.
Given the importance of each component, vendor management is essential to the plan when running a DIY big data platform. Monitoring vendor spend and analyzing it against returns is important to manage costs. Each vendor typically requires its own contract, billing terms and invoicing cycle. A resource will need to track, monitor and reconcile bills across vendors. Data will need to be combined and normalized to prepare summaries of platform usage and project chargebacks.
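The combine-and-normalize step above can be sketched in a few lines. The vendor names, field names and amounts below are hypothetical placeholders; real invoices arrive in different formats and currencies, which is exactly why this becomes someone's recurring job:

```python
# Illustrative sketch: rolling up invoices from multiple vendors into a
# per-project chargeback summary. All vendors and figures are hypothetical.
from collections import defaultdict

# Assume each vendor's invoice has already been parsed into a common shape.
invoices = [
    {"vendor": "cloud-infra",   "project": "analytics", "amount_usd": 4200.00},
    {"vendor": "hadoop-distro", "project": "analytics", "amount_usd": 1500.00},
    {"vendor": "cloud-infra",   "project": "ml-pilot",  "amount_usd": 900.00},
]

def chargeback_summary(invoices):
    """Total normalized invoice lines by project for chargeback reporting."""
    totals = defaultdict(float)
    for line in invoices:
        totals[line["project"]] += line["amount_usd"]
    return dict(totals)

print(chargeback_summary(invoices))
# {'analytics': 5700.0, 'ml-pilot': 900.0}
```

The hard part isn't the arithmetic; it's getting every vendor's billing data into that common shape, month after month, as formats and contracts change.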
Sample of vendors and components typically used for DIY Big Data Projects:
To deliver production-ready big data platforms, these core capabilities will be necessary (at a very minimum). This is one reason why big data projects take so long. Companies often underestimate the time it takes to make platforms production-ready. For enterprises especially, integration and security requirements can be complex.
Long deployments aside, perhaps the most underestimated aspect of cloud data platforms is ongoing operations. This can be especially impactful for enterprise IT departments that are already supporting existing environments.
Developing each of these capabilities takes hundreds of tasks, represented by the dots above. These must be orchestrated across many components, with a multi-person team.
Considering all that is required to deliver the capabilities, it's understandable why big data success has been difficult. DevOps requirements dominate 80%+ of big data operations budgets. A solid use case is required to justify the cost of adding 5+ full time staff to build these capabilities and run the platform. Many companies have realized that DevOps requirements for big data in the cloud are causing unbalanced ROI equations, delaying the results and impact. This blocks analytic agility.
Managing any big data and analytics environment is challenging and risky. Whether platforms are in the cloud or on-premises in a data center, the fundamental issues are the same: Datasets must be protected in motion and at rest. Applications and teams must be able to access data. Governance and compliance are mandatory. Security threats are a real-time concern for any company, regardless of where its data sits. Building a DIY big data architecture compounds all of these risks. Data leaders must carefully weigh the benefits of a DIY deployment against an alternative managed service or other approach.
Key questions to consider before taking on a DIY data platform project:
What if we lose key skilled staff members? Data architectures rely on a variety of diverse skill sets. Hiring, retaining and managing changing teams is a critical requirement for DIY big data. But these resources can be in very high demand, regularly recruited for new and greener pastures (especially if they're good). Employee turnover is a very real problem for enterprises today, especially ones perceived as lagging in new technology investments. Before building a DIY DevOps team, enterprises must look seriously at their market and pipeline for new talent -- as well as what core competencies are most important to develop.
Who will track changes across multiple connected ecosystems? Data and analytics demand an interconnected architecture spanning databases, open-source components, analytics tools, cloud infrastructure, networking, security, and more. Managing that is more than a full-time job. These ecosystems evolve largely independently, and require expertise to integrate. For example, determining the impact of upgrades or patching across the entire stack becomes an important exercise. You also need to track changes outside your core stack: are you confident you're getting the best price for infrastructure?
How will we manage costs across a complex architecture? How will we budget? Data and analytics project ROIs often suffer from ballooning, hard-to-measure costs -- and in some cases, the cloud has made this worse, not better. Services can be misused or misunderstood, leading to unpleasant billing surprises. That's why the majority of project budgets often go to skilled DevOps and labor, either internal or external, for expertise in tuning and configuration. There can be hidden costs for ongoing management, and difficulty getting into production, leading to endless expensive pilots with no chance to show results.
Here's a bigger question enterprises should consider when planning big data and analytics deployments:
For most enterprises, building your own data platform is no longer a competitive advantage; in fact, some could argue it's a liability, with the amount of deployment risks, overhead and potential for error. The advantages from data and analytics come from what you do with it. Why spend the time and resources to DIY a data platform, when the real upside will come from everything after that? (Especially when there are alternatives that get you into production in hours).
It's worth it to understand the time, cost and deployment advantages of the alternatives, such as Fully-Managed Big Data as a Service.
Cazena's mission is to radically simplify cloud data platforms for enterprises, with the First Fully-Managed Big Data as a Service. Now data can be securely analyzed in the cloud with a few clicks, and without specialized DevOps skills. Cazena's flagship solution is the Data Lake as a Service with Cloudera embedded (plus SaaS UIs), fully-managed and production-ready for all analytics, including ML/AI, data engineering, and BI.
About Cazena: Cazena is purpose-built for enterprises by the team behind the original Netezza data appliance, and backed by leading investors and technology partners.
Customers: Read about the variety of Global 5000 customers that rely on Cazena for Big Data as a Service.
Fill out the form below and we'll be in touch soon.