The Future of Apache Kudu

on October 24th, 2016
Big data

Batch processing and real-time data analytics are two different animals, but most organizations would like a single view across both. Until recently, the only such solution was the Lambda architecture, but that option is extremely difficult to use, and most businesses don’t even bother with it.

What’s Kudu?

Now percolating in the Apache Incubator is Kudu. This project was started by Cloudera and delivers fast analytics on streaming data. Kudu can handle real-time data analytics and has some similarities with both HBase and Parquet. It’s like HBase in that it allows real-time ingestion of data, and it’s like Parquet in that it allows efficient analysis of both historic and current data.

How Does Kudu Fit in the Hadoop Ecosystem?

Kudu is like HDFS in the sense that it’s purely a layer for data storage. You still need one of the Hadoop data processing engines to do the actual data analytics. Kudu is now integrated with MapReduce, Impala, and Spark, and work is underway to add compatibility with Hive.
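To make the storage-versus-processing split concrete, here is a minimal sketch of reading a Kudu table from Spark using PySpark. It assumes the kudu-spark connector is available on the Spark classpath and that a SparkSession already exists; the helper names, the master address, and the table name are hypothetical and chosen only for illustration.

```python
def kudu_read_options(master_addresses, table_name):
    """Build the options the Kudu data source for Spark expects.

    master_addresses: comma-separated host:port list of Kudu masters.
    table_name: name of the Kudu table to read.
    """
    return {
        "kudu.master": master_addresses,
        "kudu.table": table_name,
    }


def load_kudu_table(spark, master_addresses, table_name):
    """Return a DataFrame backed by a Kudu table.

    `spark` is an existing SparkSession. Kudu only stores the data;
    Spark does the processing once the DataFrame is loaded.
    """
    reader = spark.read.format("org.apache.kudu.spark.kudu")
    options = kudu_read_options(master_addresses, table_name)
    return reader.options(**options).load()


# Hypothetical usage (requires a running Kudu cluster and Spark session):
# df = load_kudu_table(spark, "kudu-master.example.com:7051", "metrics")
# df.filter(df["host"] == "web01").show()
```

The same DataFrame can then be queried with Spark SQL, which is where the actual analytics happen; Kudu itself only serves the rows.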

As big data analytics matures, real-time streaming is becoming the mainstay of data operations. While lots of batch processing goes on, the real money and ROI for big data is in real-time streaming.

What Does the Future Hold for Kudu?

Kudu stands to add a lot to the Hadoop ecosystem. For one, it delivers strong performance for both scans and random access, which helps users simplify complicated hybrid architectures. Kudu is also efficient in CPU utilization, maximizing ROI on investments in processing power. It delivers high I/O efficiency, leveraging today’s persistent storage solutions, and allows data to be updated in place, sidestepping unnecessary processing and movement of data. Kudu also brings the ability to support active-active replicated clusters spanning multiple data centers, even when those data centers are geographically separated.

According to the Apache Kudu developers’ FAQ page, Kudu is ready to be deployed in real-world situations. No training is offered for Kudu yet, but you can get most of your questions answered in the product documentation. There is also an active Kudu chat room (which is normally the case with Apache Incubator projects). You can also sign up to receive email announcements from the Kudu developers, and the Cloudera beta release forum is another excellent resource for Q&A-style information.

For all the other data analytics news and information you can use, follow us on Twitter.