Google promises a Hadoop or Spark cluster in 90 seconds with Cloud Dataproc

23 Sep 2015 | Author: | No comments yet »

Google Fires Up Spark, Hadoop Service On Its Cloud.

If Google knows a lot about one thing, it is that developers do not want to spend a lot of time setting up infrastructure just so they can create applications that chew on lots of data to run the business. The new Google Cloud Dataproc service sits between managing the Spark data processing engine or Hadoop framework directly on virtual machines and a fully managed service like Cloud Dataflow, which lets you orchestrate your data pipelines on Google’s platform. While many of us admire Google for its systems and datacenter designs, the thing that truly makes Google powerful is its ability to quickly develop new technologies and make them easy for programmers to make use of.

Greg DeMichillie, director of product management for Google Cloud Platform, told me Dataproc users will be able to spin up a Hadoop cluster in under 90 seconds — significantly faster than other services — and Google will only charge 1 cent per virtual CPU/hour in the cluster. It will help you create clusters quickly, manage them easily and it is also going to be economical as it will allow you to turn clusters off when you don’t need them. Ahead of the Strata + Hadoop World conference next week in New York, Google has announced Cloud Dataproc, which packages up the combination of Hadoop storage and Spark in-memory analytics and offers it as a service. Cloud Dataproc – which is short for data processing, of course, a very old school term for what we are all still doing – is in beta testing now; the timing of the commercial release of the service has not been revealed. DeMichillie and Google product manager for big data products James Malone told me Google is able to ensure the service’s speed thanks to its network infrastructure, but also because it patched a few Spark issues (related to the open source YARN resource manager the company is using for this product) and by building optimized images.

The pricing for Cloud Dataproc is sure to get the attention of people who are thinking about setting up Hadoop or Spark analytics clusters, which is a pain in the neck and which takes experts to maintain. DeMichillie acknowledges that some people simply want to have full control over their data pipeline and processing architecture and are hence more likely to want to run and manage their own virtual machines. Google is charging a penny per virtual machine per hour in the virtual clusters it manages on behalf Cloud Dataproc users. (This is in addition to charges for Compute Engine instances, network bandwidth, and storage that customers incur to set up the clusters.) Cloud Dataproc can run on regular reserved and on-demand instances, as well as the new preemptible instances that Google debuted a few weeks ago. Unsurprisingly, Dataproc is also integrated with the rest of Google’s cloud services, including BigQuery, Cloud Storage, Cloud Bigtable, Cloud Logging and Cloud Monitoring.

Also, as the tool uses the traditional Spark and Hadoop distributions, you can expect it to be completely compatible to virtually any existing Hadoop-based products. So, in a sense, instead of the ephemeral storage that was common in the early days of cloud computing, this can be thought of as ephemeral compute with persistent storage. Google does warn that a Cloud Dataproc cluster is limited to a maximum of 24 CPUs and 240 VM instances, like other Compute Engine resources, and you have to ask for any capacity above and beyond that. AWS offers the Apache distribution of Hadoop on its virtual clusters, and also allows customers to use the Hadoop distribution from MapR Technologies. EC2 instances run from 4.4 cents per hour for a an m1.small instance to a high of $5.52 per hour for a d2.8xlarge storage optimized instance; the EMR service adds 1.1 cents to 27 cents per hour on top of these EC2 fees. (Those are on-demand instances in the US East region; you can lower the cost with reserved instances.) Amazon just added support for Spark to EMR back in June, and in July revamped EMR with a 4.0 release that includes Hadoop 2.6.0, Spark 1.4.1, Hive 1.0, and Pig 0.14.

Microsoft similarly has a Hadoop service on its Azure cloud, which it calls HDInsight and which is based on the Hortonworks Data Platform distribution of that analytics platform.

Here you can write a commentary on the recording "Google promises a Hadoop or Spark cluster in 90 seconds with Cloud Dataproc".

* Required fields
Twitter-news
Our partners
Follow us
Contact us
Our contacts

dima911@gmail.com

ICQ: 423360519

About this site