To help enterprises compare the costs of SaaS Data Lakes with alternatives, we've collected data here about data lake requirements and costs. "DIY" or "do it yourself" data lakes may seem cheaper, but not when you consider the labor costs and special skills required.
The job search site Indeed.com tracks salary averages for common job categories, including salary lows and highs. They also track the number of available jobs, which gives a sense of the competitiveness of some of these positions. Using data from Indeed.com's website, we put together the infographic above on the roles and salary ranges required to run a cloud data lake.
To be exact, the range is $629,000 to $1,186,000 for annual team salaries for Data Lake operations. And some roles, such as DevOps, have an average tenure of less than a year, according to Indeed.com's summary, which indicates a high potential risk of turnover. We've included links, so you can click through for current data on your region.
*Salary range: $122,980 to $230,000
Description: Data Lakes are often based on open source data stacks, such as Hadoop, Spark, or commercial versions of these (AWS EMR, Cloudera, etc.). The Data Platform Administrator will provision and tune the data platform/service for the cloud, along with the attached data stores and centralized object store required to deliver workload performance. You may need different skills or certifications depending on the data stack that you choose, and availability may depend on your region. View current salaries & job listings on Indeed.com
Description: The Cloud Security Architect will oversee all aspects of security, from platform to network to data. They administer security controls such as encryption, key-management, identities and role-based access control, as well as establish and ensure compliance controls. This requires a comprehensive knowledge of security protocols, and will require specific knowledge relative to the deployment style (AWS, Azure). View current salaries & job listings on Indeed.com
Description: Responsible for integrating and securing processes for data ingestion, data governance, and logging, as well as integrating secure access from a variety of data engineering, machine learning and SQL tools. View current salaries & job listings on Indeed.com
Description: This role will cover first- and second-line alerting, support, root-cause analysis, and upgrade/patching/validation issues. It is also a catch-all capability for operational tasks like sprint tracking, billing, and SLA monitoring and management. View current salaries & job listings on Indeed.com
*Salary ranges, averages and max from Indeed.com, May 2018
Obviously, teams will vary depending on the size of the company, cluster and location. Note that the sample above is fairly conservative: it assumes a medium-sized cluster and does not include any of the people needed for actual analysis and data science. This is simply the team required to operate the data lake -- you may need a few more for the initial setup (depending on how fast you want to get started), or more for larger global deployments.
You'll need to plan for development, deployment and ongoing operations. The basic data lake team roles will require some cross-training for 24x7 support, and additional resources will be needed for major upgrades. In corporate environments, you may have access to additional resources from other IT or security groups, but beware of splitting platform roles across too many teams or people, as that can create security risks.
Cazena is the Only Data Lake as a Service with a SaaS Experience. Get started instantly with Zero Ops, and cut costs in half with automation. Explore the SaaS Data Lake Solution overview, SaaS Data Lake case studies or get in touch to learn more!
Teams must evaluate, select and implement a wide range of technologies for DIY cloud data lakes, whether deploying in a data center or a public cloud. In fact, 50+ services or components is a conservative number for most enterprise data lakes. After selecting the best components for all of their workloads (!!), a data lake team will need to test, benchmark and integrate technologies with each other.
Teams must expect and plan for rapid change. Data technologies and public clouds (AWS, Azure) change often and have their own ecosystems of related products and services. It is not uncommon for each component in a data lake architecture to have different patching and upgrade cycles. Staying up to date is absolutely critical. Weak integrations or missed patches can be high-risk security threats. In fact, a recent survey found that a majority of data breaches in 2019 were caused by cloud misconfigurations.
Monitoring overall spend and analyzing it against returns is important. But, it can be challenging to manage costs. Each vendor typically requires their own contract, billing terms and invoicing cycles. Someone will need to track, monitor and rectify bills across vendors. Data will need to be combined and normalized to prepare summaries of service usage and charge-backs.
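The combine-and-normalize step above can be sketched in a few lines of Python. This is a minimal illustration with entirely hypothetical vendor invoice formats and amounts; real billing exports (AWS Cost and Usage Reports, vendor invoices, etc.) are far more complex, but the pattern -- map each vendor's schema into a common one, then aggregate -- is the same.

```python
# Minimal sketch: normalizing invoice line items from two hypothetical
# vendors into a common schema, then summarizing spend per service.
# Field names and amounts are illustrative assumptions, not real formats.

def normalize_vendor_a(item):
    # Hypothetical Vendor A bills in dollars, with a "service" field.
    return {"service": item["service"], "usd": item["amount_usd"]}

def normalize_vendor_b(item):
    # Hypothetical Vendor B bills in cents, with a "product" field.
    return {"service": item["product"], "usd": item["amount_cents"] / 100}

def summarize(line_items):
    """Aggregate normalized line items into total spend per service."""
    totals = {}
    for item in line_items:
        totals[item["service"]] = totals.get(item["service"], 0.0) + item["usd"]
    return totals

vendor_a = [{"service": "object-storage", "amount_usd": 1200.0}]
vendor_b = [{"product": "object-storage", "amount_cents": 50000},
            {"product": "compute-cluster", "amount_cents": 980000}]

combined = ([normalize_vendor_a(i) for i in vendor_a] +
            [normalize_vendor_b(i) for i in vendor_b])
print(summarize(combined))
# {'object-storage': 1700.0, 'compute-cluster': 9800.0}
```

Someone still has to write and maintain one normalizer per vendor, and keep it current as each vendor changes its billing format -- which is exactly the ongoing tracking burden described above.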
Sample of vendors and components typically used for DIY Data Lakes:
Contact us for a complete list or to learn how to skip all of this, and get instant access to your own Data Lake as Service, the only Data Lake with a SaaS experience.
To deliver a production-ready data lake, these core capabilities are necessary (at a very minimum) for most enterprises. This is why data lakes often take so long: teams routinely underestimate the complex, time-consuming work of making a cloud data lake production-ready. Long deployments aside, the most underestimated requirement is ongoing operations, which can be especially impactful for enterprise IT departments.
Each capability requires hundreds of tasks, represented by the dots above. These must be orchestrated across many components by a multi-person team.
Sample cloud Data Lake capability requirements:
Considering all that is required to deliver the capabilities, it's understandable why data lake success can be difficult. DevOps requirements dominate 80%+ of data lake budgets. A solid use case is required to justify the cost of adding 5+ full time staff to build and run a data lake. Many companies have realized that DevOps requirements for cloud data lakes are causing unbalanced ROI equations, delaying agility and impact.
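The budget math above can be made concrete with a back-of-envelope calculation. The salary range comes from the Indeed.com figures cited earlier; the 1.3x loaded-cost multiplier (benefits, recruiting, overhead) is an assumption for illustration only.

```python
# Back-of-envelope sketch of DIY data lake budgets, using the team salary
# range cited above ($629K - $1,186K/yr). The 1.3x overhead multiplier and
# the 80% labor share are assumptions taken from the surrounding text.

team_salary_low, team_salary_high = 629_000, 1_186_000  # annual team salaries
overhead = 1.3  # assumed loaded-cost multiplier (benefits, recruiting, etc.)

labor_low = team_salary_low * overhead
labor_high = team_salary_high * overhead

# If DevOps labor makes up ~80% of the total data lake budget,
# the implied total annual budget is labor / 0.8.
budget_low = labor_low / 0.8
budget_high = labor_high / 0.8

print(f"Loaded labor cost:   ${labor_low:,.0f} - ${labor_high:,.0f}")
print(f"Implied total budget: ${budget_low:,.0f} - ${budget_high:,.0f}")
```

Even under these rough assumptions, the implied annual budget lands well above $1M, which is why a solid use case is needed to justify a DIY build.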
Learn more about Cazena's Data Lake as a Service here, or contact us.
DIY Data Lake management is challenging and risky. Whether data lakes are in the cloud or on-premises in a datacenter, the fundamentals are the same: Datasets must be protected in motion and at rest. Applications and teams must be able to access data. Governance and compliance are mandatory. Security threats are a real-time concern for any company, regardless of where its data sits. Building a DIY data lake architecture significantly increases all of these risks. Data leaders must carefully evaluate the benefits of a DIY deployment vs. an alternative managed service or other approach.
Key questions to consider before deciding to build and run a DIY Data Lake:
What if we lose key skilled staff members? Data lakes rely on a variety of diverse skill sets. Hiring, retaining and managing changing teams is a critical requirement for DIY data lakes. But these resources can be in very high demand, regularly recruited away to competitors. Before building a DIY data lake, enterprises should also consider which core competencies are most important to develop and staff.
Who will track changes across multiple connected ecosystems? Data and analytics demand an interconnected architecture spanning databases, open-source components, analytics tools, cloud infrastructure, networking, security, and more. These ecosystems evolve somewhat independently (even within cloud providers) and require expertise to integrate. For example, determining the impact of upgrades or patching across the entire stack is critical, as is end-to-end tracking to get the best price-performance.
How will we manage costs across a complex architecture? How will we budget? Data and analytics project ROIs often suffer due to ballooning, hard-to-measure costs -- and in some cases, the cloud has made this worse, not better. Services can be misused or misunderstood, leading to unpleasant billing surprises. That's why the majority of project budgets often go to skilled DevOps and labor, either internal or external, for expertise on tuning and configuring. There can be hidden costs for ongoing management, and difficulty getting into production can lead to endless, expensive pilots with no chance to show results.
Here's a bigger question enterprises should consider when planning big data and analytics deployments:
Cazena makes data lakes easy. See for yourself with instant access. Cazena's Data Lake as a Service is the only data lake with an easy SaaS experience to accelerate AI, ML and all analytics. Now, all enterprises can use data lakes instantly and cost-effectively, without requiring a team for DevOps. Cazena powers digital transformation for global enterprises on AWS and Azure.
Browse Data Lake Success Stories: Read SaaS Data Lake case studies from around the globe.
Fill out the form below and let us know what we can share about SaaS Data Lakes.