A lot has been published over the last year about big data and the cloud, so for today’s post I want to explore some of the foundational concepts governing big data and its continued connection to cloud computing.
Traditional use cases for big data sit at odds with the philosophy of the cloud. As the name implies, there’s a lot of data, so getting it on and off of anything doesn’t happen quickly. Traditional big data workloads are also very processing intensive: to generate value out of the scale and scope of the data, a business needs to be running data processing engines 24/7, so there’s very little dynamic load. We’ve seen some bucking of that trend recently with Amazon and Elastic MapReduce, which essentially has people loading data into S3, which is persistent, and then spinning up large compute farms against that data for short periods of time. As cloud and big data trends go, that’s the ideal model for how the two can work together.
What cloud solutions are geared towards big data?
Big data needs lots of compute and massive storage. So the two real things to look for in cloud providers for big data are reliable persistent storage and extremely scalable compute. Persistent storage gives you a place to load data and store data after you spin down compute power. Meanwhile, scalable compute provides the power to process the data.
Another big trend in the big data world where cloud can really come into play is using the latest generation of graphics processors (GPUs) to process the data, and some cloud providers now offer this sort of service.
What are the tools that handle big data in the cloud?
There are many tools, with Hadoop being the most notable name among the brands tied to big data and cloud. But Hadoop itself is simply a framework, an approach to making the traits inherent to the cloud serve big data needs.
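The core idea behind Hadoop’s MapReduce model can be sketched in a few lines of plain Python. This is a toy word count, not Hadoop itself: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step aggregates each group.

```python
from collections import defaultdict

def map_phase(docs):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data meets cloud", "cloud scales big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'meets': 1, 'cloud': 2, 'scales': 1}
```

What Hadoop adds on top of this pattern is the hard part: distributing the map and reduce work across many machines, moving data between them, and recovering from failures.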
Apache Hadoop, Cloudera, and MapR all have initiatives around the cloud; they each offer distributions of Hadoop that help manage the processing. HBase is another technology with a big initiative around Amazon, and it’s making quite an impact as part of Amazon’s Elastic MapReduce system.
So why is big data so closely linked to the conversation around cloud?
If we contextualize the cloud as outsourcing rather than as dynamic compute, then it makes sense to link big data and cloud together, and many big data businesses seek out cloud providers on exactly those terms. In the dynamic-compute sense, though, pairing the two doesn’t always add up: if you’re running 24/7 data processing, elasticity buys you little, but if you’re not, then cloud is the way to go.
What cloud is best for big data?
An easy way to think about how cloud and big data can work together comes down to the private versus public cloud conversation.
If you are consistently running processing for your big data needs, a Private Cloud can be a major boon to your infrastructure functions and costs, since you can easily account for the amount of usage you need. If you’re doing a lot of intensive predictive metrics and similar processes, then private is the way to go for your company’s needs. Moving to the private cloud also helps control costs, since you’re only using the resources that you need. And because your cost versus need stays consistent, outsourcing becomes a compelling alternative to owning your own technology and infrastructure: you pay only for the service, not for the technical staff, upkeep, technology, or maintenance.
However, if you’re doing a lot of sporadic batch processing, then Public Cloud is the way to do big data. Through the public cloud you can very quickly spin up new instances of compute power to handle incoming data loads. This way, you don’t have to dedicate a consistent portion of your budget to infrastructure support for big data. Cost will fluctuate, but you end up saving by using compute only when you need it.
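The private-versus-public trade-off above is ultimately a utilization question, which a rough cost model makes concrete. All rates here are invented placeholders, not real provider prices:

```python
def monthly_cost_private(flat_rate):
    # Private/dedicated capacity: a fixed monthly rate, used or not.
    return flat_rate

def monthly_cost_public(hourly_rate, hours_used):
    # Public cloud: pay only for the hours compute actually runs.
    return hourly_rate * hours_used

FLAT = 2000.0   # hypothetical dedicated monthly cost
HOURLY = 5.0    # hypothetical on-demand hourly cost

# Sporadic batch jobs (100 hours/month) favor public cloud.
sporadic = monthly_cost_public(HOURLY, 100)   # 500.0
# Round-the-clock processing (720 hours/month) favors private.
constant = monthly_cost_public(HOURLY, 720)   # 3600.0
print(sporadic < FLAT, constant > FLAT)  # True True
```

The break-even point shifts with the actual rates, but the shape of the decision is the same: the closer your utilization gets to 24/7, the stronger the case for dedicated capacity.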
If you anticipate spikes within a consistent inflow of data, then you can move to a Hybrid Cloud. Digital advertisers, for example, benefit most from this version of the cloud: they have a constant flow of data from their core business, but will often see a surge of traffic and have more data to process.
So why all the hype around big data and the cloud?
Simply put, it’s new. The reason there’s buzz around Hadoop is that the technology has put the power of large-scale data processing, which just a few years ago was available only to the largest and most successful Fortune 50 companies, in the hands of many more businesses. Previously, big data processing meant million-dollar-plus systems from Oracle or SAP. Processing big data requires a lot of compute, and the cloud is one of the more accessible ways to generate that level of compute power in the contemporary IT paradigm. Big data solutions becoming available to a broader audience parallels the way the cloud has increasingly become the way many businesses, from startups to major corporations, cheaply (or at least cost-effectively) handle their infrastructure needs. Previously (again, similar to big data), having that scale of compute power was cost-prohibitive, since it would take a major investment in a data center or physical compute stacks to achieve.
Read more about how cloud can work with advertisers here.
Thoughts on this piece? Send them to @CloudGathering.
By Jake Gardner