Guest blog from Rich Sokolosky & Chael Christopher of Sentier Health Informatics.
Don’t Let Hadoop and Cloud Security Fears Derail Your Path to Value
As data practitioners, we have spent the last four years creating an enterprise data science environment for a life sciences client. We’ve taken the project from an initial pilot on AWS (swiped a credit card!) to an analytic hub that houses 100 TB of internally owned data as well as data procured from syndicated data providers. The data sets are still growing, and they constitute the foundation of our client’s analytics today. The toughest part of our big data journey was not standing up the different machines, dealing with the sources, generating the analytics, or running the updates. It was, and continues to be, security.
Hadoop was not designed from its inception with security as its primary mandate. To lock our environment up tight, we needed to work fluently with AWS security groups and administration, Microsoft Active Directory, Cloudera Sentry, MIT Kerberos, encryption at rest, and transport layer security using SSL certificates. Oh, and we also had to do it all in an ecosystem of Amazon WorkSpaces that could communicate with the cluster.
In a perfect world, the company’s users would have accessed the environment from their own network, but the challenges were too daunting. This meant that all analytic tools (R, SAS, Tableau, etc.) and Active Directory had to reside within the VPC and integrate across all of the worker nodes. There are no external IPs/firewalls into the environment, so we traded usability for security. While we solved all of the core security concerns from IT, we constrained the ways in which the environment can be used.
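To make the "no external IPs" constraint concrete, here is a minimal sketch of the kind of check one might run against security group rules. It is not the tooling we used. It assumes rules shaped like the `IpPermissions` entries that AWS returns when describing security groups, and the VPC range `10.0.0.0/16`, the function name `external_ingress`, and the sample rules are all hypothetical.

```python
import ipaddress

# Hypothetical VPC address range; the real one is not stated in this post.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")


def external_ingress(ip_permissions, vpc_cidr=VPC_CIDR):
    """Return (protocol, from_port, to_port, cidr) for every ingress rule
    whose source range is not fully contained in the VPC range."""
    findings = []
    for rule in ip_permissions:
        # Collect IPv4 and IPv6 source ranges attached to this rule.
        cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in rule.get("Ipv6Ranges", [])]
        for cidr in cidrs:
            source = ipaddress.ip_network(cidr, strict=False)
            # subnet_of() raises TypeError across IP versions, so compare versions first.
            inside = (
                source.version == vpc_cidr.version
                and source.subnet_of(vpc_cidr)
            )
            if not inside:
                findings.append(
                    (rule.get("IpProtocol"), rule.get("FromPort"),
                     rule.get("ToPort"), cidr)
                )
    return findings


# Hypothetical sample rules: one internal Kerberos rule, one SSH rule open to the world.
rules = [
    {"IpProtocol": "tcp", "FromPort": 88, "ToPort": 88,
     "IpRanges": [{"CidrIp": "10.0.1.0/24"}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]
print(external_ingress(rules))
```

With these sample rules, only the SSH rule open to `0.0.0.0/0` is reported. An empty result would mean every ingress source sits inside the VPC range, which is the posture described above.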
We don’t believe our experience accommodating all of the various security concerns was unusual. Yet this security framework took two full-time Hadoop / network security engineers four months to design, set up, and test. It now takes one full-time person to manage issues and grant or change the permissions required for the growing user base and data.
When thinking about a big data environment, security has to be addressed first. It has surpassed quality assurance as the number one concern. It is a tough yet necessary nut to crack, and it will make or break a project: all patient-level information these days has to pass rigorous compliance tests, and any hint of a breach will shut a project down immediately and indefinitely.
There are many ways to address compliance and end-user needs, and all of them require a high degree of skill and experience. This led us to ask a simple question: do we want to tolerate the lag and friction inherent to standing up a new Hadoop environment, or do we want to stay laser-focused on delivering new insights and analytics to our clients through our data science sandboxes? We chose the latter, and we chose Cazena!
About Sentier Health Informatics:
Sentier Health Informatics is focused on one objective: getting healthcare information into the hands of analysts, data scientists, and decision makers in a timely and cost-effective manner. We tackle information needs that fall outside traditional reporting and data warehousing. We provide adaptive services that utilize proven Big Data approaches and technologies to address the ever-increasing speed at which information needs to be delivered. We believe “conception to answer” should be measured in hours and days, not months and years, and that follow-up questions are the key to success. Sentier has decades of experience in healthcare and can anticipate clients’ analytic and information needs without any hand-holding. We also have a library of common analytics that can be deployed rapidly across our services. Learn more at www.sentierinformatics.com.