A Near Real-Time Data Pipeline and Analytics Platform with Hadoop in Two Weeks? How We Pulled it Off

August 14, 2017

By: Dan Stair, Senior Engineer

A telecommunications company called Cazena to learn about our Data Lake as a Service, hoping that it would help them deliver an at-risk project. It definitely could: this is a cloud use case we know well, and our solution was designed for exactly this situation.

Deploying, configuring and securing a Data Lake as a Service on Cazena is automated, so that part was done in hours. Since the platform saved so much deployment time, we were able to lend our expertise to help the team optimize their data pipeline and solve multiple business problems quickly.

I was happy to be asked to help. I’m part of engineering. I mainly develop the automation within the Cazena platform that deploys the Hadoop/Spark components, with a focus on security and performance. Working directly with our customers gives me great insight into how they use the service and exactly what they need.

What this team needed was results, fast. They had a broken on-premises Hadoop cluster, real-time requirements and a deadline looming only weeks away. Using our Cazena cloud-based Data Lake as a Service and a few other components, we helped the team design and deploy a near real-time data pipeline and analytics infrastructure in less than two weeks. That’s a rapid turnaround, by any standards.

The telco team faced a few challenges. They had identified 10 potentially impactful analytics projects that could deliver healthy returns for the business, but there was no platform to run them on. It wasn’t for lack of trying; a development project was underway. But Spark and Hadoop are hard, and the on-premises Data Lake pilot was not functional. So, together, we picked a high-priority, achievable task from their list and set about solving it, with Cazena’s Big Data as a Service platform at the core.

Data Lake as a Service Design and Sizing

The initial design phase included meeting with the company’s analytics team to determine the project’s scope and requirements. We determined the average daily volume of data to be transferred and sized HDFS storage needs. These characteristics formed the input into Cazena’s automation to provision the secure, single tenant environment that would be used. The data was generated by a large on-premises application, which provided detailed call log data. Near real-time was an important factor, as the application supported core business operations and services. The analytics team needed rapid alerts on any changes in call quality.
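As a rough illustration of the kind of sizing we did (all numbers and the helper below are hypothetical examples, not the customer’s actual figures), HDFS capacity planning typically multiplies daily ingest volume by retention, HDFS’s default 3x block replication, and some headroom for temporary and shuffle data:

```python
def size_hdfs_storage(daily_ingest_gb, retention_days,
                      replication=3, headroom=1.25):
    """Rough HDFS capacity estimate for a steady daily ingest.

    replication: HDFS default block replication factor (3).
    headroom:    extra space for shuffle/temp data and growth
                 (25% assumed here for illustration).
    """
    retained = daily_ingest_gb * retention_days
    return retained * replication * headroom

# Hypothetical example: 200 GB/day of call logs kept for 90 days
# needs roughly 200 * 90 * 3 * 1.25 = 67,500 GB of HDFS capacity.
print(size_hdfs_storage(200, 90))  # 67500.0
```

The same inputs (daily volume, retention, replication) feed directly into the automation that provisions the environment.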

For this project, we started with the Cazena Data Lake as a Service solution: a fully managed cloud platform that includes an embedded Cloudera cluster with Spark, Impala, core Hadoop, storage and cloud infrastructure (Microsoft Azure, in this case), plus many other custom software components providing data movement, performance metrics and application monitoring. We then used our modular framework to customize the platform for this project.

Tuning and Test-Driving the Cloud Data Lake

Cazena delivered the Data Lake fully configured for the workload. That meant that the telco team didn't need to worry about tuning for price-performance, and could take full advantage of our benchmarking and automation features, as well as our expertise in cloud performance.

Cazena selected appropriate schedulers and workload engines to keep the load and analytics jobs running smoothly. We rewrote existing MapReduce jobs in Spark for superior performance and versatility. To meet data persistence requirements, we ensured that all raw data was stored unmodified in low-cost blob storage. To meet the performance requirements of data analysts, interactive datasets were stored compressed in Parquet format on fast local HDFS storage.
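For a sense of what that interactive tier looks like, a table might be declared in Impala along these lines (the table and column names here are illustrative, not the customer’s actual schema); Parquet’s columnar layout plus Snappy compression keeps scans fast while conserving HDFS space, and partitioning by day lets ad hoc queries prune irrelevant data:

```sql
-- Illustrative Impala DDL: compressed Parquet, partitioned by day.
SET COMPRESSION_CODEC=snappy;

CREATE TABLE call_logs_interactive (
  call_id       STRING,
  tower_id      STRING,
  duration_ms   BIGINT,
  quality_score DOUBLE
)
PARTITIONED BY (call_date STRING)
STORED AS PARQUET;
```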

The goal was to get data from sources to end state quickly and reliably, while supporting simultaneous ad hoc queries, so performance was critical. With Cazena’s built-in intelligence and automation, we were able to deliver impressive performance from the beginning.

To prove the capabilities of the system, we tested every phase of the near real-time platform’s performance: data upload speeds to the cloud, the ETL scripts newly rewritten in Spark, and table joins using Impala. We ran scripts that exercised the cloud deployment end to end, and presented test and benchmarking data that clearly demonstrated the system’s performance. It easily met the client’s requirements for upload, raw data persistence, ETL job reliability, and interactive analytics.
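A minimal sketch of the per-phase timing approach (the stage function and payload size below are placeholders, not the actual test scripts): each phase is run several times and the best case is reported as throughput, so transient noise doesn’t hide the system’s real capability.

```python
import time

def benchmark_stage(stage_fn, payload_mb, runs=3):
    """Time a pipeline stage over several runs and report best-case
    throughput in MB/s. stage_fn stands in for a real stage such as
    an upload, a Spark ETL job, or an Impala join."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        stage_fn()
        best = min(best, time.perf_counter() - start)
    return payload_mb / best

# Placeholder stage that just sleeps to simulate 10 MB of work:
throughput = benchmark_stage(lambda: time.sleep(0.01), payload_mb=10)
```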

We did all of that in eight business days, by leveraging automation and extensive prior knowledge and expertise. Cazena has other customers running 24x7 workloads supporting 10,000+ Impala, Spark, Hive, and MapReduce jobs each week. In development, we deploy dozens of Hadoop clusters each day and have a comprehensive automated testing suite to ensure the functionality of 11 Web UIs, a plethora of APIs, and SQL endpoints. Our heavy emphasis on automated infrastructure and testing allowed us to breeze through what are usually the most challenging elements of a Hadoop deployment.

Building on a Foundation

We had helped the telco team cover their bases for the initial use case several days before the two-week deadline, so we found that a few other Data Lake use cases could be deployed very quickly. Existing machine learning scripts, which had been unusable on the on-premises cluster, were tested – and functional! – on the new cloud deployment. Additional data streaming sources were easily tested on the cluster to support potential future use cases.

It was thrilling to deploy quickly, watch the team get over their initial frustration with the previous DIY pilot, and see them get excited about the future possibilities of their data-driven projects. Each of these success stories means a lot to me, and I learn from every experience. We hope that the new Data Lake as a Service allows this team to continue to innovate and find new ways to use their data.

Read More about Cazena’s Data Lake as a Service
