"DIY" Cloud Data Lake Requirements

    Data Lake DevOps Costs: Market Data

    To help enterprises compare the costs of SaaS Data Lakes with alternatives, we've collected data here about data lake requirements and costs. "DIY" (do-it-yourself) data lakes may seem cheaper, but often are not once you factor in the labor costs and specialized skills required.

    The job search site Indeed.com tracks salary averages for common job categories, including salary lows and highs. They also track the number of available jobs, which gives a sense of the competitiveness of some of these positions. Using data from Indeed.com's website, we put together the infographic above on the roles and salary ranges required to run a cloud data lake. 

    To be exact, annual team salaries for data lake operations range from $629,000 to $1,186,000. Some roles, such as DevOps, have an average tenure of less than a year according to Indeed.com's summary, which points to a high risk of turnover. We've included links so you can click through for current data in your region.

    Salary Ranges of Data Lake Operations & Security Professionals


    Cloud DevOps

    *Salary range: $122,980 to $230,000

    Description: The Cloud Development Operations Engineer, more commonly known as DevOps, is responsible for administering cloud accounts and resources, and managing the cloud infrastructure. View current salaries & job listings on Indeed.com


    Data Platform Administrator
    *Salary range: $118,312 to $225,000 


    Description: Data lakes are often based on open-source data stacks, such as Hadoop, Spark or commercial versions of these (AWS EMR, Cloudera, etc.). The Data Platform Administrator will provision and tune the data platform/service for the cloud, with the attached data stores and centralized object store required to deliver workload performance. You may need different skills or certifications depending on the data stack that you choose, and availability may depend on your region. View current salaries & job listings on Indeed.com


    Cloud Security Architect
    *Salary range: $143,769 to $225,000 


    Description: The Cloud Security Architect will oversee all aspects of security, from platform to network to data. They administer security controls such as encryption, key management, identities and role-based access control, and establish and maintain compliance controls. This requires comprehensive knowledge of security protocols, as well as specific knowledge of the deployment platform (AWS, Azure). View current salaries & job listings on Indeed.com


    Data Management Lead
    *Salary range: $137,113 to $253,000


    Description: Responsible for integrating and securing processes for data ingestion, data governance and logging, as well as enabling secure access from a variety of data engineering, machine learning and SQL tools. View current salaries & job listings on Indeed.com
     

    Data Production Operations
    *Salary range: $111,958 to $216,000
     

    Description: This role covers first- and second-line alerting, support, root-cause analysis and upgrade/patching/validation issues. It is also a catch-all role for operational tasks such as sprint tracking, billing, and SLA monitoring and management. View current salaries & job listings on Indeed.com

    *Salary ranges, averages and maximums from Indeed.com, May 2018
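
    As a rough sanity check, the five ranges above can be summed to estimate the annual payroll for this team. Here is a minimal Python sketch using the May 2018 figures quoted in this article; the totals land in the same ballpark as the range quoted earlier, and will shift with current Indeed.com data and your region.

        # Tally annual payroll for the five data lake operations roles listed above.
        # Salary figures are the May 2018 snapshot quoted in this article; current
        # numbers on Indeed.com will differ, and benefits/overhead are not included.
        roles = {
            "Cloud DevOps":                (122_980, 230_000),
            "Data Platform Administrator": (118_312, 225_000),
            "Cloud Security Architect":    (143_769, 225_000),
            "Data Management Lead":        (137_113, 253_000),
            "Data Production Operations":  (111_958, 216_000),
        }

        low_total = sum(low for low, _ in roles.values())
        high_total = sum(high for _, high in roles.values())

        print(f"Estimated annual team salaries: ${low_total:,} to ${high_total:,}")
        # Roughly $634,000 to $1,149,000 for this five-role team, before adding
        # analysts, data scientists or extra staff for setup and global coverage.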

    Summary: Data Lake Teams are Expensive


    Obviously, teams will vary depending on the size of the company, cluster and location. Note that the sample above is fairly conservative: it assumes a medium-sized cluster and does not include any of the people needed for actual analysis and data science. This is simply the team required to operate the data lake -- you may need a few more people for setup (depending on how fast you want to get started) or for larger global deployments.

    You'll need to plan for development, deployment and ongoing operations. The basic data lake team roles will require some crossover training for 24x7 support, and additional resources will be required for major upgrades. In corporate environments, you may have access to additional resources from other IT or security groups, but beware of splitting platform roles across too many teams or people, as that can create security risks.

    There Are Alternatives
     

    Cazena is the only Data Lake as a Service with a SaaS experience. Get started instantly with Zero Ops, and cut costs in half with automation. Explore the SaaS Data Lake Solution overview, read SaaS Data Lake case studies or get in touch to learn more!

     
    Components of a Data Lake

    Teams must evaluate, select and implement a wide range of technologies for DIY cloud data lakes, whether deploying in a data center or a public cloud. In fact, 50+ services or components is a conservative number for most enterprise data lakes. After selecting the best components for all of their workloads, a data lake team will need to test, benchmark and integrate the technologies with each other.

    Teams must expect and plan for rapid change. Data technologies and public clouds (AWS, Azure) change often and have their own ecosystems of related products and services. It is not uncommon for each component in a data lake architecture to have different patching and upgrade cycles. Staying up to date is critical: weak integrations or missed patches can become high-risk security threats. In fact, a recent survey found that a majority of data breaches in 2019 were caused by cloud misconfigurations.
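
    As one concrete illustration, the sketch below flags S3 buckets that lack a public access block -- one of the misconfigurations behind many of those breaches. It is a hypothetical example, assuming Python with boto3 installed and AWS credentials already configured, not a prescribed part of any particular data lake stack.

        # Minimal sketch: flag S3 buckets with no (or an incomplete) public access block.
        # Assumes boto3 is installed and AWS credentials are configured. A production
        # data lake would fold checks like this into continuous compliance monitoring.
        import boto3
        from botocore.exceptions import ClientError

        s3 = boto3.client("s3")

        for bucket in s3.list_buckets()["Buckets"]:
            name = bucket["Name"]
            try:
                config = s3.get_public_access_block(Bucket=name)
                settings = config["PublicAccessBlockConfiguration"]
                if not all(settings.values()):
                    print(f"WARNING: {name} has an incomplete public access block: {settings}")
            except ClientError as err:
                if err.response["Error"]["Code"] == "NoSuchPublicAccessBlockConfiguration":
                    print(f"WARNING: {name} has no public access block configured")
                else:
                    raise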

    Monitoring overall spend and analyzing it against returns is important, but it can be challenging to manage costs. Each vendor typically has its own contract, billing terms and invoicing cycles. Someone will need to track, monitor and reconcile bills across vendors, and data will need to be combined and normalized to prepare summaries of service usage and charge-backs.
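
    For illustration, here is a minimal sketch of that normalization step, assuming pandas and two hypothetical per-vendor CSV exports; real export formats and column names vary by vendor.

        # Minimal sketch: combine per-vendor billing exports into one normalized summary.
        # File names and column names are hypothetical placeholders.
        import pandas as pd

        aws = pd.read_csv("aws_billing.csv")        # columns: service, cost_usd (assumed)
        vendor = pd.read_csv("vendor_invoice.csv")  # columns: line_item, amount (assumed)

        normalized = pd.concat([
            aws.rename(columns={"service": "item", "cost_usd": "cost"})[["item", "cost"]]
               .assign(source="aws"),
            vendor.rename(columns={"line_item": "item", "amount": "cost"})[["item", "cost"]]
                  .assign(source="vendor"),
        ])

        # Charge-back style summary by source and line item.
        summary = normalized.groupby(["source", "item"])["cost"].sum().reset_index()
        print(summary.sort_values("cost", ascending=False))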

    Sample of vendors and components typically used for DIY Data Lakes:

    • Infrastructure 
    • Network 
    • Data engines (Hadoop, Spark, SQL, etc.)
    • Platform, storage 
    • Integration Services
    • Analytics Tool Integration
    • Data Security
    • Workload Management & Optimization (Ongoing)
    • Governance (User, Access, Data)
    • Cloud Connectivity/Hybrid
    • Identity Management
    • Security Augmentation 
    • Encryption, Compression
    • Compliance
    • Monitoring
    • Threat Detection
    • Log Management


    Contact us for a complete list, or to learn how to skip all of this and get instant access to your own Data Lake as a Service, the only data lake with a SaaS experience.

     

    Critical Capabilities to Develop

    To deliver a production-ready data lake, these core capabilities are necessary, at a minimum, for most enterprises. This is why data lakes often take so long: teams routinely underestimate how complex and time-consuming it is to make a cloud data lake production-ready. Long deployments aside, the most underestimated requirement for cloud data lakes is ongoing operations, which can be especially taxing for enterprise IT departments.

    Each capability requires hundreds of tasks, represented by the dots above, which must be orchestrated across many components by a multi-person team.

    Sample cloud Data Lake capability requirements:

    • Design and Implement a Secure Multi-Cloud, Hybrid Architecture
    • Evaluate and Provision Cloud, Data and Analytics Engines; Integrate and Optimize for Price-performance
    • Integrate with Tools and Architecture for Data Loading
    • Determine Which Analytics and Data Science Tools to Support and Integrate Existing Tools; Optimize for Performance
    • Implement Cloud Security, Governance and Compliance Controls; Add Logging, Monitoring and Audit Requirements (see the sketch after this list)
    • Plan for Production Operations, Support, 24x7 Security Monitoring, Disaster Recovery, etc. 
    • Ongoing Plan for Workload Management, Scaling and Bursting for Transient Workloads; Cost Tracking and Management
    • Develop Self-Service Access Tools for Business, Data and Analytics Teams
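
    To give a sense of scale, even one narrow task inside one of these capabilities (the security and logging item above) comes down to specific code or configuration. The sketch below is a hypothetical example using boto3 with placeholder bucket names; real deployments would manage this through infrastructure-as-code, policy and many more controls.

        # Minimal sketch: enable default encryption and access logging on one S3 bucket.
        # Bucket names are hypothetical placeholders; this is one tiny slice of the
        # overall security, governance and audit capability.
        import boto3

        s3 = boto3.client("s3")
        DATA_BUCKET = "example-data-lake-bucket"   # hypothetical
        LOG_BUCKET = "example-audit-log-bucket"    # hypothetical

        # Default server-side encryption (SSE-S3) for all new objects.
        s3.put_bucket_encryption(
            Bucket=DATA_BUCKET,
            ServerSideEncryptionConfiguration={
                "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
            },
        )

        # Server access logging into a separate audit bucket.
        s3.put_bucket_logging(
            Bucket=DATA_BUCKET,
            BucketLoggingStatus={
                "LoggingEnabled": {"TargetBucket": LOG_BUCKET, "TargetPrefix": "s3-access/"}
            },
        )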

    Considering all that is required to deliver these capabilities, it's understandable why data lake success can be difficult. DevOps requirements dominate 80%+ of data lake budgets, and a solid use case is required to justify the cost of adding 5+ full-time staff to build and run a data lake. Many companies have realized that the DevOps requirements of cloud data lakes skew ROI equations and delay agility and impact.

    Learn more about Cazena's Data Lake as a Service here, or contact us.

     
    Risks of DIY: Changing Technology, Staff and Requirements

    DIY data lake management is challenging and risky. Whether data lakes are in the cloud or on-premises in a datacenter, the fundamentals are the same: datasets must be protected in motion and at rest, applications and teams must be able to access data, and governance and compliance are mandatory. Security threats are a real-time concern for any company, regardless of where its data sits, and building a DIY data lake architecture compounds all of these risks. Data leaders must carefully evaluate the benefits of a DIY deployment versus a managed service or other approach.

    Key questions to consider before deciding to build and run a DIY Data Lake: 

    What if we lose key skilled staff members? Data lakes rely on a variety of specialized skill sets. Hiring, retaining and managing changing teams is a critical requirement for DIY data lakes, and these resources can be in very high demand, regularly recruited away by competitors. Before building a DIY data lake, enterprises should also consider which core competencies are most important to develop and staff.

    Who will track changes across multiple connected ecosystems? Data and analytics demand an interconnected architecture spanning databases, open-source components, analytics tools, cloud infrastructure, networking, security, and more. These ecosystems evolve somewhat independently (even within cloud providers) and require expertise to integrate. For example, determining the impact of upgrades or patching across the entire stack is critical, as is end-to-end tracking to get the best price-performance.
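
    One small example of what that tracking implies: comparing deployed component versions against a pinned manifest to spot drift before planning an upgrade. This is a minimal sketch with hypothetical component names and versions; a real stack would query each service for its actual version.

        # Minimal sketch: report version drift between a pinned manifest and what is
        # actually deployed. Component names and versions are hypothetical.
        pinned = {"spark": "3.3.2", "hadoop": "3.3.4", "hive-metastore": "3.1.3"}
        deployed = {"spark": "3.3.1", "hadoop": "3.3.4", "hive-metastore": "3.1.2"}

        for component, expected in pinned.items():
            actual = deployed.get(component, "missing")
            if actual != expected:
                print(f"DRIFT: {component} is {actual}, pinned at {expected}")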

    How will we manage costs across a complex architecture? How will we budget? Data and analytics project ROIs often suffer from ballooning, hard-to-measure costs -- and in some cases, the cloud has made this worse, not better. Services can be misused or misunderstood, leading to unpleasant billing surprises. That's why the majority of project budgets go to skilled DevOps labor, either internal or external, for expertise in tuning and configuration. There can also be hidden costs for ongoing management and difficulty getting into production, leading to endless, expensive pilots with no chance to show results.

    Here's a bigger question enterprises should consider when planning big data and analytics deployments:

    Why DIY a Data Lake? Get a SaaS Data Lake instantly. 

     

     
    SaaS Data Lakes: Instant Access, Zero Ops.

    Cazena makes data lakes easy. See for yourself with instant access. Cazena’s Data Lake as a Service is the only data lake with an easy SaaS experience to accelerate AI, ML and all analytics. Now, all enterprises can use data lakes instantly and cost-effectively, without requiring a team for DevOps. Cazena powers digital transformation for global enterprises on AWS and Azure.

    Explore SaaS Data Lakes: Learn more about SaaS Data Lakes

    Browse Data Lake Success Stories: Read SaaS Data Lake case studies from around the globe. 

     

    Talk to the Experts

    Fill out the form below and let us know what we can share about SaaS Data Lakes.