Google opens beta on managed service for Hadoop and Spark

24 Sep 2015

Google Cloud Dataproc: Making Spark and Hadoop Easier, Faster, and Cheaper.

So goes the thinking behind Cloud Dataproc, Google’s new managed big-data service for running Hadoop and Spark on Google’s cloud computing platform.

The new Google Cloud Dataproc service, which is now in beta, sits between managing the Spark data processing engine or Hadoop framework directly on virtual machines and a fully managed service like Cloud Dataflow, which lets you orchestrate your data pipelines on Google’s platform. Greg DeMichillie, director of product management for Google Cloud Platform, told me Dataproc users will be able to spin up a Hadoop cluster in under 90 seconds, significantly faster than other services, and Google will charge only 1 cent per virtual CPU per hour in the cluster.

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. It helps you create clusters quickly and manage them easily, and it saves money because you can turn clusters off when you don’t need them.

Commercial technology vendors such as Cloudera and Hortonworks are trying to solve the same cluster-management problem for users running these technologies in their own data centers, but the easiest option, for those willing to give up some control over their servers, is to have a cloud provider take care of it for them.

Google is also touting the integration of Cloud Dataproc with the company’s other cloud computing services for big data, including BigQuery, Cloud Storage and Cloud Bigtable (a database technology), and the ability to work with Dataproc using standard interfaces. DeMichillie and James Malone, Google’s product manager for big data products, told me Google is able to ensure the service’s speed thanks to its network infrastructure, because it patched a few Spark issues (related to the open source YARN resource manager the company is using for this product), and because it builds optimized images.

DeMichillie said Dataproc clusters take an average of about 90 seconds to come online, compared with at least several minutes if you’re deploying them on local servers or even running open source Hadoop or Spark on cloud-provider virtual machines. Minutes, whether 2 or 30, can make a big difference if you need those resources now, or if you’re being billed while machines are still spinning up.

Big data workloads are becoming more important with each passing day, especially as trends such as the Internet of Things provide a tangible, viable use case for years’ worth of talk about data analysis. Compared with traditional, on-premises products and competing cloud services, Cloud Dataproc has a number of unique advantages for clusters of three to hundreds of nodes:

Low cost. Cloud Dataproc is priced at 1 cent per virtual CPU per hour in the cluster. In addition to this low price, Cloud Dataproc clusters can include preemptible instances that have lower compute prices, reducing your costs even further.

Instead of rounding your usage up to the nearest hour, Cloud Dataproc charges you only for what you really use with minute-by-minute billing and a low, ten-minute-minimum billing period. You can easily interact with clusters and Spark or Hadoop jobs through the Google Developers Console, the Google Cloud SDK, or the Cloud Dataproc REST API.
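To make the billing model concrete, here is a minimal Python sketch of the arithmetic described above. It uses the figures quoted in this article (1 cent per virtual CPU per hour, per-minute billing, ten-minute minimum). Several things are assumptions for illustration only: the function names are hypothetical, partial minutes are assumed to round up to the next whole minute, the hour-rounded comparison simply applies the same rate to a hypothetical service that bills by the full hour, and the result covers only the Dataproc fee, not the cost of the underlying virtual machines or storage.

```python
import math

# Figures quoted in the article.
RATE_PER_VCPU_HOUR = 0.01   # USD per virtual CPU per hour
MIN_BILLED_MINUTES = 10     # ten-minute minimum billing period


def billed_minutes(runtime_minutes: float) -> int:
    """Minutes charged under per-minute billing with a ten-minute minimum.

    Assumption: a partial minute is rounded up to the next whole minute.
    """
    return max(MIN_BILLED_MINUTES, math.ceil(runtime_minutes))


def per_minute_fee(total_vcpus: int, runtime_minutes: float) -> float:
    """Dataproc fee in USD when usage is billed by the minute."""
    hours = billed_minutes(runtime_minutes) / 60
    return total_vcpus * hours * RATE_PER_VCPU_HOUR


def hour_rounded_fee(total_vcpus: int, runtime_minutes: float) -> float:
    """Fee in USD for a hypothetical service that rounds up to whole hours."""
    hours = max(1, math.ceil(runtime_minutes / 60))
    return total_vcpus * hours * RATE_PER_VCPU_HOUR


if __name__ == "__main__":
    # Hypothetical cluster: 3 nodes with 4 virtual CPUs each, used for 25 minutes.
    vcpus = 3 * 4
    print(f"per-minute billing: ${per_minute_fee(vcpus, 25):.2f}")    # $0.05
    print(f"hour-rounded billing: ${hour_rounded_fee(vcpus, 25):.2f}")  # $0.12
```

In this example the short 25-minute job costs less than half of what the same rate would cost under hour-rounded billing, which is the saving the minute-by-minute model is meant to deliver for short-lived clusters.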
