Webinar: The Data Science Sandbox as a Service on Azure in Action
Hannah Smalltree (Cazena, VP of Marketing): Hello, and welcome to today's webinar, introducing the Cazena Data Science Sandbox in action. My name is Hannah Smalltree, with Cazena, and I'll be your moderator today. As a reminder, we've left plenty of time for Q&A with our panelists today. You can send your questions at any time during the presentation, using the questions tab on the interface, then I'll pose your questions to the panelists, so feel free to submit those any time they come up in your head, and we'll ask those at the end of this session.
First, we'd like to start with some context about Cazena, with a brief overview and a demo of the Cazena Data Science Sandbox service. To present more, here is Lovan Chetty, director of product management with Cazena. Take it away whenever you're ready, Lovan.
Lovan Chetty (Cazena, Director of Product Management): Thanks, Hannah, and hello everyone. I'm going to do a really quick introduction and demo, because I'm sure that you all are much more interested in listening to Craig. I'm going to dive right into what is the problem that we're trying to solve. Cazena's been working with a number of organizations that have been trying to figure out how to accelerate their analytic cycles. They currently struggle with things like the time it takes to install and configure on-premises platforms, but in addition, the entire analytic landscape is changing so much more rapidly these days, with the whole new range of new processing platforms and technologies, that all of those things add a significant extra dimension of complexity to this problem.
The cloud has some great capabilities to solve the agility problem, but building a production analytic platform on the cloud is no mean feat. It introduces some new problems to solve, such as how do you integrate the cloud into your existing on-premise workflows, plus it requires new sets of provisioning, as well as DevOps skills, to deal with the differences in the cloud. Cazena is looking to significantly reduce this adoption curve, that would allow organizations to start to leverage all of these cloud promises in a much easier way.
What is Cazena? We're a service that provides an analytic data processing platform, that enables analytic teams to quickly get to the task of loading, transforming, analyzing, and distributing data. The Cazena service runs on the megascale public cloud providers, and it incorporates best of breed data management technologies, such as Cloudera, RStudio Server Pro, et cetera, rather than us trying to reinvent the wheel for how to do data management and data processing.
What we do, however, is we wrap all of these technologies, so that's the cloud and the data management stuff, into a secure service that turns public cloud into a single-tenant private cloud for each Cazena customer. There's mandatory end-to-end encryption, that ensures the privacy of all data within the service. All operational aspects of the platform, from monitoring to upgrades, are built into the service, with a single support contact point for every level of the stack.
The service provides simple mechanisms to move data into the platform, whether that data originates on premise or in the cloud. We also include some of the popular tools that we see in the market, such as query editors and RStudio Server Pro as just a standard part of the service. However, we know that the range of analytic tools is really, really vast. If you have your favorite tool that you prefer to use rather than the ones that we incorporate within the service, you can just connect them.
We see our service being used for three big use cases at the moment. The first one is augmentation of on-premise data warehouses, with a more scalable, agile cloud data mart as a service. The data lake is a second use case, which is predominantly used to support data engineering, but this webinar is going to focus on what we call the Data Science Sandbox. This is an environment that allows advanced analytic teams to run a wide range of analytics on a single integrated platform.
I thought I'd just start a little bit with the pattern that we've been seeing within this use case, with the organizations that we've been currently working with. There seems to be a pattern of a set of four steps that controls this overall end-to-end data science or advanced analytics workflow. This community of users wants to be able to easily ingest data from a very wide variety of sources, some of those being on-premise, but others being in the cloud, either third-party sources that might be purchased data, as well as publicly available data.
Once they get that data, they want to be able to use a wide range of techniques, to transform or prep this raw data into something that is analytic-ready, because some of the data is not nicely structured. They need something that would allow them to do that. Once they have analytic-ready data, they then move into this iterative cycle of fitting and running models to try and get insight out of that. If they find a valuable model, then they want to be able to deploy that into usually another environment that has stronger governance, rules, and controls around it.
If we start to match the Cazena Data Science Sandbox to that high level workflow, what we see is that the service supports this workflow by providing a wide range of options to efficiently and securely move data into the service. This is available for both batch as well as streaming data. Once the data is on the platform, there are multiple processing engines, such as Spark, that are available to manipulate and transform the data. These engines can be leveraged in the language of your choice. The distributed platform also means that when you get to fitting and running models, you can do this on either larger data sets or on multiple models in parallel, or do both.
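As a rough illustration of the "multiple models in parallel" idea, here is a minimal Python sketch that fits one simple least-squares line per data set concurrently. It is an assumption-laden stand-in, not what the Cazena service runs: a local thread pool plays the role of the distributed engine, and the names `fit_line` and `fit_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_all(datasets, max_workers=4):
    """Fit one model per named data set, submitting the fits concurrently.

    Assumption: a local thread pool stands in for a cluster; on a real
    distributed platform each fit would be scheduled on separate workers.
    """
    names = list(datasets)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fit_line, (datasets[name] for name in names))
        return dict(zip(names, results))
```

Calling `fit_all({"segment_a": [...], "segment_b": [...]})` returns a dictionary of `(slope, intercept)` pairs keyed by data set name; the segment names are made up for the example.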
Since it is a single platform that supports multiple users, it enables these teams to more efficiently share data and algorithms, as all the users have easy access to the same data repository, as well as consistent libraries and versions of those libraries. Then, once a repeatable insight is discovered, it can be deployed to a separate Cazena environment with higher governance controls, or you could just take the derived data sets and easily move them downstream to an operational system.
Hannah Smalltree: I'm going to launch this poll now. The question today is are you using any public cloud infrastructure or services for analytics and data science? Any public cloud, any kinds of infrastructure. Some people are entirely on-premise today, and some people are entirely in the cloud. We have a few options up there. I'll give you just a few more minutes for answers to come in while Lovan is setting up the demo.
I'm going to go ahead and close the poll out. We'll look at what the results are here. It looks like a lot of people, almost half, are using the cloud now. About another quarter are planning to, another quarter are not, and some people are interested. Nobody's not sure, that's good. We all know our status with the cloud, so I'm going to go ahead and hide these results and pass it back to Lovan, to share a demo. Look at how that worked. Beautiful.
Lovan Chetty: Thanks, Hannah. What you're looking at is what we call the Cazena Console. This is what the end user of the Cazena service gets to use to manage and monitor the service. The average user probably wouldn't see this, because they're going to be using one of the tools that we'll go through as we walk through a simple workflow.
As I mentioned in my introduction, you can consider this a private service. I have a very small environment that I'm going to just do this quick demonstration on, but there's a key component, which we call a Cazena Gateway. That's a very lightweight software device that users will deploy, either on-premise or on the cloud, but you can think of the Gateway as the key that unlocks this private service.
Once I have one of these Gateways, that means that I can start to do some really interesting things very, very quickly. One of the first things, once I have an environment, is let's get some data into it. You have all of the standard capabilities of a Hadoop environment, if you wanted to load data programmatically by things like the WebHDFS API, but what we've done is abstracted away some of those components, to make data movement even easier for somebody who didn't want to write code.
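For readers who do want the programmatic route, the following is a minimal Python sketch of a WebHDFS upload using only the standard library. WebHDFS file creation is a two-step exchange: the first PUT with `op=CREATE` is answered with a 307 redirect, and the file bytes go to the redirected location. The host name, port, user name, and the helper names `webhdfs_create_url` and `upload_file` are assumptions for illustration; how a Cazena Gateway exposes the endpoint is not described in this webinar.

```python
import urllib.error
import urllib.parse
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following the 307 so the Location header can be read."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def webhdfs_create_url(base_url, hdfs_path, user, overwrite=False):
    """Build the WebHDFS URL for the CREATE operation on hdfs_path."""
    query = urllib.parse.urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    return f"{base_url.rstrip('/')}/webhdfs/v1/{hdfs_path.lstrip('/')}?{query}"


def upload_file(base_url, hdfs_path, local_path, user):
    """Two-step upload: ask where to write, then send the bytes there."""
    opener = urllib.request.build_opener(_NoRedirect)
    first = urllib.request.Request(
        webhdfs_create_url(base_url, hdfs_path, user), method="PUT"
    )
    try:
        opener.open(first)
        raise RuntimeError("expected a 307 redirect from the NameNode")
    except urllib.error.HTTPError as err:
        if err.code != 307:
            raise
        datanode_url = err.headers["Location"]
    with open(local_path, "rb") as handle:
        second = urllib.request.Request(
            datanode_url, data=handle.read(), method="PUT"
        )
        with urllib.request.urlopen(second) as response:
            return response.status  # 201 Created on success
```

A call might look like `upload_file("http://gateway.example:50070", "/data/sales.csv", "sales.csv", "analyst")`, where every argument is a placeholder. A secured cluster would also need authentication and TLS, which this sketch leaves out.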
In this case, I've got a very traditional, relational, on-premise data warehouse that I'd like to pull data out of so I can start to move that into the service, and do some more advanced analytics on it. We give you a really easy-to-use interface that will go look into that on-premise system, look at an inventory of objects that are available in it, then at the same time, what's happening on the right hand side is that it is doing a quick analysis, to show the user things that they should be aware of, because I'm moving it out of a relational system into a non-relational system in this case.
It's found a few things, like reserved words, which it will allow you to change, or it might find data type mismatches, but this is all about just showing the user where there's some level of ambiguity or things that they should be aware of. Once they're happy with all of that, they can very quickly start to move across not just all the metadata that describes that set of data, but the data itself. The data mover is smart enough to ensure number one that it moves all of that content across, but it's also going to move it across as securely as ... Because we have end-to-end encryption, so as soon as that movement starts, everything is encrypted, everything is encrypted at rest, and for larger volumes of data, it will automatically start to do things like compression, and spin up multiple streams so that you're using your connection to the cloud as efficiently as possible.
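The kind of pre-move check described here can be pictured with a small Python sketch. It is not Cazena's implementation: the reserved-word set is a tiny illustrative subset, the type mapping is hypothetical, and `review_columns` is a made-up name. The idea is simply to flag column names that collide with reserved words and source types that have no direct target equivalent.

```python
# Illustrative subset only; a real target engine publishes its own list.
RESERVED_WORDS = {"date", "timestamp", "order", "group", "table", "user"}

# Hypothetical source-to-target type mapping; None marks "no direct equivalent".
TYPE_MAP = {
    "INTEGER": "INT",
    "VARCHAR": "STRING",
    "DECIMAL": "DECIMAL",
    "DATE": "DATE",
    "INTERVAL": None,
}


def review_columns(columns):
    """Return (column, issue, note) findings for a list of (name, type) pairs."""
    findings = []
    for name, source_type in columns:
        if name.lower() in RESERVED_WORDS:
            findings.append(
                (name, "reserved word", f"suggest renaming to {name}_col")
            )
        # Types missing from the map are treated the same as unmappable ones.
        if TYPE_MAP.get(source_type.upper()) is None:
            findings.append(
                (name, "type mismatch", f"no direct mapping for {source_type}")
            )
    return findings
```

The findings are only advisory, mirroring the demo: the user decides whether to rename a column or accept a type conversion before the move starts.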
Once I have data in the system, then I want to start to manipulate it. There's a few tools that are built directly into the service. All of those are available via the Gateway that I initially mentioned. So there's a query editor, where I can start to interact with data directly. This would interact with some of the data that I've just loaded onto the system, but not all of my data comes from on-premise sources.
I could then decide, "Well, I'm going to switch out of SQL, and I'm going to just write a very simple script, to start to pull some data from the internet." For the kind of analysis that I want to do, I'm going to go and scrape some websites, so that I can go pull in some of that data, I'm going to try and join it with that on-premise data that I had. I'm going to just run something simple, go and pull some data from the BBC website, and I can now start to use that on the platform, together with the data that I pulled in from my on-premise relational system.
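The script used in the demo is not shown, so the following is only a guess at its general shape: a minimal Python sketch that downloads a page and collects heading text with the standard library. The choice of `<h3>` as the tag to collect, and the names `HeadlineParser`, `extract_headlines`, and `fetch_headlines`, are assumptions; a real page would need its own selectors and a check of the site's terms of use.

```python
from html.parser import HTMLParser
import urllib.request


class HeadlineParser(HTMLParser):
    """Collect the text inside every <h3> element (tag choice is an assumption)."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_heading = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_heading = False

    def handle_data(self, data):
        # Text inside nested tags (for example <b>) still arrives here.
        if self._in_heading:
            self.headlines[-1] += data


def extract_headlines(html):
    """Return the stripped, non-empty heading texts found in an HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headlines if h.strip()]


def fetch_headlines(url):
    """Download a page and extract its headings."""
    with urllib.request.urlopen(url) as response:
        page = response.read().decode("utf-8", errors="replace")
    return extract_headlines(page)
```

The resulting list could then be written to the platform and joined with the on-premise tables, which is the step the demo moves on to.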
This is typically what we see happening within that data prep stage, where they're quickly pulling in multiple data sets, transforming them into that analytic-ready state. Then once these teams start to do analysis, one of the really common things that we start to see them do is statistical analysis.
Once again, to make that statistical analysis as easy as possible, we've just loaded RStudio Server Pro right into the service. This is not running locally on my machine, this is just part of the service. You go to a URL. It's connected directly into the data, and because I'm running on top of a fully parallel environment, that means that I can do all of the RStudio and R stuff, but if I want to start to leverage Spark, for instance, within my R code, so I can start to do analysis across bigger data sets, or just use some of the newer capabilities within that, we've done all the plumbing to ensure that that works, and all you have to do is deal with your analysis.
There's multiple ways of doing that. The one that you're seeing at the moment is me driving through to the Spark engine via sparklyr, but if you want to use SparkR, that's on the platform as well. Whether you're using SparkR to interact with Spark, or using Python to interact with Spark, all of those libraries and patterns are there already. I'm not a data scientist, so I'm not going to start to do anything complicated at this point. It was really just to show you that these tools are all built in, and they're all plugged in together so you can run analysis as quickly as you'd like.
Then, just to tie it all together, everything that I've been showing you so far is running live on Azure, under the covers, because the service runs on the megacloud providers. As we go through these steps, you see that you don't actually get to see the cloud in any of this, even though you're using it in every step of the process. That's because all of it has been abstracted away, so that you don't have to deal with understanding, "How do I secure the cloud? How do I make sure that I have a service that is consistently running, from a production perspective?"
All of that has been done for you, so that you can just do analytic processing in the cloud, without having to worry about all of the configuration, and setup, and management, and operations of that. What we're looking at at the moment is just a high level view of what's the overall activity that's happening across everything, with the left hand side being the cloud layer, the right hand side being all of the workloads that are being issued to the cloud. There were some SQL statements that were done, there were a whole bunch of data movement tasks that were done. It's a consolidation of that entire end-to-end process, so that you can very quickly get started with loading data, manipulating it, then starting to do some analysis.
What Cazena has done is it's taken, I think at last count, well over 200 tasks that you'd probably need to do, to build a production environment that has all of these components together, and we've put them together into a single interface, that is fully managed and supported 24/7. A big part of that support and management is that we're constantly monitoring the platform to ensure that it is optimized for your particular workloads.
The result of that is that you should be able to number one, get started a lot more quickly with any analytic projects that you have, and for certain things, like the data movement, as I've shown, we've also taken it a step further to say if you don't want to use the manual processes of doing things, there's certain things which can be automated, and maybe even simpler for certain kinds of data, but all of that is wrapped in a totally secure service, so that when it's looked at by your enterprise security group, enterprise compliance groups, I think they'll be more than content with looking at all of the security and compliance postures that we've built into the service.
I think, Hannah, you want to do a second poll at this point.
Hannah Smalltree: I would love to, and thank you very much Lovan, it takes a brave man to do a live demo, but Cazena is one thing that we can really rely on, so I hope that you got a lot out of seeing that demo, and we're always happy to talk more about it.
Now, I'd like to launch this next poll about how data science is organized for your company. Right after this poll, we'll hear Craig Haughan, with Carlson Wagonlit Travel. They added a data science group last year, and he'll talk a little bit about how they're organized for data science. We hear some places have a centralized organization. At some companies they're more embedded within different business units, or a specific insights team. Some places have data scientists sprinkled throughout the organization, more inconsistently, and some places are still hiring their group. Just curious where you are with your organization. Also wanted to remind you that you can post questions any time, within the interface. Things that you hear Craig say, or things you would like to hear Lovan talk about again, from a technical overview.
A few more answers coming in here while Craig gets ready. Oh, hey, this data's always fun in an analytics crowd. Let's share some results here. Looks like pretty even split here, between centralized organizations, within business units, and inconsistently. That is consistent with what we hear, which is that it's all different ways.
With that, I'm going to hide results and introduce Craig Haughan. He's a VP of data solutions and architecture at Carlson Wagonlit Travel. They have a really interesting story, which any of us who travel can really appreciate. Without further ado, let's advance the slide to Craig's picture, and take it away, Craig Haughan.
Craig Haughan (CWT, VP of Data Solutions and Architecture): Excellent, thank you. I hope you can all hear me okay. Yes, my name's Craig Haughan, I'm the vice president of data solutions and architecture at Carlson Wagonlit Travel. Travel is one of those industries that's been around for quite a while, and we haven't got the largest data sets in the world, but we have got very, very complex data sets, and very, very complex environments in which to do it.
As anybody who's had an experience traveling knows, it's not when everything goes smoothly that you really appreciate it. It's when it goes wrong, and having the people there, and having it there and available to you. Since we've had the Cazena Data Science Sandbox, we've really been focusing on what value we can add to our travelers as a community, understanding more and more about what their habits are, and how we can actually help them, what we can do for them to make that travel experience better.
We've got a number of things that we're actually focusing on. The first of those is understanding more about that customer. What that means is, being one of the largest corporate travel providers in the world, we're in at least 150 different countries, with all the language implications that that's got, with all of the difficulty with collecting data sets, and managing those data sets. In the more traditional environments, that's proved to be a challenge, in terms of getting that data together, merging it, allowing people to just play with those data sets.
Really, what we've actually done and what we've actually built, with Cazena's help here, is bring it in very, very quickly, because we've gone from a standing start of having, essentially, a very diverse data science group, or not even a group, just people embedded within various different functions, to having a platform in place which allows a data science team to be put in place.
Now, we started on this journey in November last year, and we now have five data scientists working on this platform, along with actually gathering all of the data, gathering all of the understanding about those data sets, and also not just our data sets, but taking data sets from our customers as well.
What this platform allows us to do, and why we chose Cazena, was that we don't have the skill sets, internally, to build a Hadoop cluster which allows us to do that. Yes, we can go away and hire the people that go and do it. Will we get it right first time? Who knows, but the advantage we've really got, with Cazena, is that it's allowed us just to turn the key on, add data scientists, add people with the knowledge of manipulating that data, of gathering that data, of understanding that data, and really be able to hit the ground running, and have a very flexible and powerful platform that's allowed us to go from nothing to a data science team.
We've really been running that for the last four months, and we've seen some good results already, some very good insights into things that we can quickly change, at our point of sales systems, to allow us to help our customers.
For example, we've got literally tens of thousands of different rates that you can get when you go to a hotel, and if any of you have ever tried to book a hotel, you'll know that you can get the same hotel room at many, many, many different prices. That all comes down to the rate access code the hotel is actually using.
We found that although there's 10,000 of these codes, only around 400 of these actually make sense for people to buy. Of course, without this data science team, and without the platform being able to take that kind of insight, we really couldn't do that before. That's been a very quick win for us. Again, the results that we're getting out of the system are impressive.
Along with that journey, we're also looking at, really, our margin management as well. As with any company in a transactional business, which travel is, all your customers want everything for less money, but more service. Really, what we're also deploying our data science team on, and also the platform that Cazena's providing us with, is that margin management. In the same way that when you shop for a hotel, or air travel, or whatever else, each of those rates can be different for that same service that you get, but again, we get paid by the providers of those, as being an agent.
In the same way as it can be a different value, we also get paid a different amount, so again, with that margin management, that allows us to actually understand that data, and understand it in a live format, so that we can actually manipulate the rate codes that we're offering to our customers, to both benefit our customers by providing cheaper and better hotels, and still making sure that the margin that we offer on those hotels is up to the best that we can get for that.
We're really on the journey now, which is going to last for a good number of years, but I feel that we've really had a good zero to 100 miles an hour start by choosing the Cazena service, running on Azure, A, because it's secure. We've had the normal corporate security checks, and as any of you who have dealt with any corporate security environment or anything else know, you've got to be able to make sure that it's secure. The Cazena platform came through that with flying colors, especially because of the Gateway service and the end-to-end encryption. We're also provided that assurance that it's also encrypted at rest, and also the data governance is there as well, to actually make sure that we know who is accessing the data, and we know who isn't accessing the data as well.
Over the next few months, in the short term, we're going to extend our use of this, again, out into our community. We have a sandbox approach in this environment, and we're really allowing our data science community to put in whatever data they want in that sandbox, and merge it with our more private data that we've been gathering and curating as well. That's how we've been using this environment, and I'm quite happy to answer any questions about it as well, and I'll hand back over to the Cazena guys.
Hannah Smalltree: Craig, I'm going to actually start with a few questions while we have a few more coming in here. Can you talk a little bit more about how your data science team is organized? They're very focused on discovery, then how they've put things into production, so which different groups are involved to move things from interesting observations to actual reality.
Craig Haughan: Sorry, you're going to have to say that again, because my phone dropped out unfortunately.
Hannah Smalltree: Okay, I'm wondering how you move things from the data science group into production. I know you have a few different groups involved, so how do things move from an interesting discovery into reality, or what's the plan for that?
Craig Haughan: We put our first thing into production, which was simply understanding these rate pieces, and what we've actually got at the moment is segregation of our warehouse, or Data Science Sandbox, sorry, to actually have a production version running and a non-production version running, which is really how we actually move it across. Of course, we utilize SVN, for managing all of our code, and to be able to facilitate that. Does that answer the question?
Hannah Smalltree: Yeah, it does. Now, Lovan, if you could unmute as well, we have some more questions coming in. This question is for you again, Craig, which is how did Cazena save you six to eight months? Where did that estimate come from, and what are the biggest things that you didn't have to do that you would have had to do if you built it yourself?
Craig Haughan: Yeah, that is the key thing. [Drops out] Skills around Hadoop, provisioned the cloud ourselves, C, we didn't have to worry about the security piece, because that's handled by the Gateway and the end-to-end encryption on the turnkey service. We really concentrated on hiring our data science team rather than building the underpinnings. As much as I'd love to go in there and have to build all the various different servers, having to understand the Cloudera manager behind the scenes, and things like that, we were on the accelerated program, here, getting ourselves to be a data-driven organization.
We took the view that we were actually going to look for a service, a platform as a service, and that's what Cazena provided to us, and those were the things that we really didn't have to do: all the engineering, and all the plumbing, and all the making sure it's secure that you have to do when you move to a cloud provider.
Hannah Smalltree: Great answer. This next one I think really dovetails nicely, which is we reference on the Cazena side, Lovan, half the cost, which is that's an aggressive, marketing got ahold of that slide for sure. I’m marketing. That's an aggressive number, but a lot of that is related to some of these skills issues that Craig brought up. Lovan, can you talk a little bit more about how we arrived at that number and some of the other things Cazena does to reduce the cost of data processing in the cloud?
Lovan Chetty: Sure. I think there's two categories that you can put the savings into. The first is just getting started. There's a whole range of things that you need to figure out, which is what kind of machine should I run on in the cloud? The cloud is great because it gives you all of these options, but that can also be a hindrance to getting things done, because you get tens of compute options, you get multiple storage options, you get multiple networking options. Figuring all of that out by yourself is initially going to burn through time, resources, and money, so from the Cazena perspective, we've solved that problem, and we have software that does those initial steps and configuration for you.
Then, once you're up and running, the bigger set of problems then, to solve, is around the operational maintenance of the environment. This is things like how do I just keep everything up and running, or the cloud providers are constantly releasing new things, which is great if you can consume them at the rate at which they're releasing it. Cloudera's releasing new things, RStudio's releasing new things and all of these technologies that we are using for data management are releasing new components all the time, so once again, we're taking on the onus of building technology there. Again, this is all software driven, that performs security upgrades, and data management upgrades, and a whole range of tasks that ensures that all of those operational tasks, that again, you'd spend people, time, and effort on, are just automated.
Hannah Smalltree: Right, so a lot of it is in that labor cost, and the cost of learning all of those new tools. I think the other key point here is that you don't have to. There's more services, there's things like Cazena that can help you build those, and that's clearly what they found at Carlson. This question is for you, Craig, and it's about the cost saving. The exact question is are you able to realize the dollar savings you achieved by implementing Cazena? I guess maybe more generally, if you could talk about how you justified the cost, and if there's anything you have about where you think you're saving money, or how you think the finances work out.
Craig Haughan: Yep. In terms of where we're saving money, there's literally people that we haven't hired to build this platform, so we haven't had to spend the time investing in the skills to actually run and manage the Hadoop cluster that we have, or the RServer, skill sets that nobody in our organization has got. We'd certainly have had to pay for consultants to do that, and to train our operational people to do that. Also, in terms of return on investment, you compare this to something like a very large-scale data warehouse, like a Teradata, or an Exadata, or something like that. If anyone's ever bought a Teradata, you're talking in the millions of dollars to actually buy such things, whereas when you look at the Hadoop cluster, to rent a platform in the cloud to get you started is not that type of investment, is not that scale of investment.
Really, we compared it against what we were spending on our traditional data platform, which is again, with all of the niceties of those traditional platforms, it does work, but actually to let people have free rein and free access to that, when we've got all the SLAs and things like that as well, it's not really a stacking point.
We would have had to duplicate that platform at that cost, in order to provide a Data Science Sandbox to go with our existing skill set that we have, or we would have had to hire the people, hire consultants, hire a company to actually train up our operational people, train up our data science team on how to manage the underlying platform, which again, as I said, we weren't interested in really doing that, and it would have not been a success factor of our project, because we would have spent those six to eight months learning how to build that system, hiring the people, we don't have those people sitting around, and again, those sorts of skill sets are at the expensive end of skill sets nowadays, because this is the skill set that's really in demand right now, because they’re all wanting to get this value out of this data, and the data science is exploding in there.
The case really made itself, because we could start very, very quickly. It's really the opportunity cost of those six to eight months, and building the underlying platform, without actually making the discoveries in the data that are there and that are obvious when you've got a platform of the right power and the right flexibility to be able to do that.
Also, just the raw cost of saying, "We considered this option, and the more traditional way," but the stuff that we understand, within the company, or the traditional data warehouses provide us, and the cost when weighed up against adding more nodes into that or adding completely separate environments, really made the case for itself.
Hannah Smalltree: You got an enthusiastic thank you from the person asking that question, so thank you for your detailed response there. Another question is is this all in Azure, including Hadoop, and Spark, and maybe you could talk a little bit more about that, Craig, the Azure part.
Craig Haughan: Yes, absolutely, everything is in Azure, and we did choose Azure because of the ancillary opportunities that it affords us as well, because the Azure Cloud, and we looked at all the different cloud providers that are out there, we were really encouraged by the things that Azure is giving us, in addition to just the pieces that Cazena are providing as well. Things like natural language processing, and all the other good bits and pieces. We're not using them yet, but this is on our roadmap for the future, to actually be able to really do that.
Also the locations as well, of the data centers. Like anybody that deals with any of the European countries, knows that data is sensitive as well, so having data centers all over the world, so that we've got the opportunity in the future as well, to place our data where we want, should legislations happen, should different things happen within the world, we have the option as well. Yes, everything is literally running in the Azure cloud, including the RServer and everything else, but having said that, we've also got the Gateway, which essentially makes it as if it's an on-premise solution for us.
We have a business intelligence tool that's been stuck in our data center, which is physically in Nevada, talking to our cloud location, which is, I think, physically in The Netherlands, and it runs very well as well. All of our data science pieces are in the cloud. We have that easy reach out and easy reach back, and into our data center, thanks to Cazena Gateway, which is another reason that we chose it as well.
Hannah Smalltree: Excellent, and that was a great answer, and the Cazena Gateway is a really unique part of the solution that is outside of the cloud. It sounds like our audio quality is saying that it's getting time to wrap this up, so I have some thank yous here, some very great thank yous for Lovan and Craig, from our audience directly, which is awesome to see on a webinar, and I thank you so much. It's been so interesting learning more about the Carlson Wagonlit story, and about all of the cool things that you're doing.
I hope that our attendees will stay in touch with Cazena. We do have a cool test drive of our Cazena Data Science Sandbox available for you, so you can get hands on and see that directly. You can learn more at Cazena.com/Sandbox. We will send out an archive version of this webinar tomorrow, so you can share it with all of your friends, and re-watch it, and hear what Craig said, and of course, please stay in touch, look for us at industry events. We'll have more blogs and webinars coming up, and really look forward to your feedback on our Data Science Sandbox as a Service.
With that, another big thank you to our speakers, Craig Haughan of Carlson Wagonlit Travel, and Lovan Chetty of Cazena, and thanks everyone for joining. Have a great day.