If your organization is in the throes of establishing a data analytics initiative, you’ve no doubt faced what feels like a million decisions. Hadoop? Spark? NoSQL? How do we handle ETL? What about data discovery? One bridge you’ll have to cross is choosing a programming language for your big data analytics projects. While R, Python, and Scala are hugely popular, there are other options worth considering, too. Here’s a breakdown of the most promising candidates and some tips for making the right decision for your organization.
As simple as the name is, using R really is not. While it’s the go-to language for hardcore data scientists, it has a steep learning curve for anyone not already versed in tools like Matlab, SAS, or Octave. R excels at data analysis, but it’s a poor fit as a general-purpose language. It’s good for constructing models, for example, but a model prototyped in R typically has to be rewritten in Scala or Python before it reaches production. It’s not suitable for things like writing a clustering control system, and the resulting code would be nothing short of nightmarish to debug.
Many data scientists forgo the more limited R in favor of Python. The language is strong within the academic community, particularly for natural language processing (NLP). Python is a traditional object-oriented programming language, and it’s generally much easier for developers to pick up than R (or even Scala). Another plus is that it’s well supported in big data frameworks like Spark, though it’s not usually supported in the first release of new big data products; sometimes you’ll have to wait out a version or two. Unless your shop has an immediate need to be on the forefront of big data analytics trends, this shouldn’t be much of a problem.
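A quick illustration of why Python is so approachable for this kind of work: the sketch below does a toy word-frequency count, the “hello world” of text analytics, with nothing but the standard library. The sample text is made up, and real NLP work would reach for libraries like NLTK or spaCy (or PySpark for distributed data), but the shape of the code stays this readable.

```python
# A minimal word-frequency count using only the standard library.
import re
from collections import Counter

text = "Big data needs big tools, and big tools need good languages."

# Lowercase the text, pull out runs of letters, and count them.
words = re.findall(r"[a-z]+", text.lower())
top = Counter(words).most_common(2)
print(top)  # → [('big', 3), ('tools', 2)]
```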
Scala is a beautiful blend of object-oriented and functional programming, and it’s a strong player in finance and business as well as at dominant tech companies like Twitter and LinkedIn. Scala is the driving language behind big data success stories like Spark and Kafka. It offers a hearty selection of features and isn’t as verbose as some other languages. But Scala often provides more than one way to skin a cat, and it’s not always readily apparent to the next developer what the original programmer was trying to do. The Scala compiler is also a bit on the slow side. On the upside, it boasts a REPL, web-based notebooks, and solid tooling support.
If data analytics languages have a red-headed stepchild, it’s Java. Yet for all its unpopularity among data scientists, it absolutely reigns within the enterprise. In fact, some of the era’s most hailed success stories run on the JVM, including Hadoop MapReduce (okay, it isn’t easy and breezy, but it’s stable and reliable), HDFS, Storm, Apache Beam, and even the Scala-based Spark and Kafka.
HiveQL is the SQL-based query language of Apache Hive, which runs on top of Hadoop (or another distributed storage platform). Because it’s built on SQL, which predates every other language on this list, it’s familiar to a huge audience, and Hive is very popular, particularly given its origins at Facebook. However, HiveQL is quite limited, and most shops use it primarily for ad-hoc queries.
You really have to admire the creativity of the name, and compared with HiveQL, Pig Latin is a bit more predictable and offers more capability for data transformation work. Also strongly tied to the Hadoop ecosystem, Pig Latin is the language layer of Apache Pig, a platform for creating Hadoop MapReduce jobs. Where Pig Latin falls short, you can write user-defined functions (UDFs) in another language, such as Python.
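As a sketch of that escape hatch, here is what a tiny Python UDF for Pig might look like. The function name and file name are hypothetical; in a real Pig script you would register the file via Pig’s Jython support (roughly `REGISTER 'udfs.py' USING jython AS udfs;`) and annotate the function with Pig’s `@outputSchema` decorator so Pig knows its return type.

```python
# A candidate Pig UDF, shown as plain Python so it can run standalone.
# Under Pig's Jython support you would add, at the top of the file:
#     from pig_util import outputSchema
# and decorate the function with @outputSchema('word:chararray').

def normalize_word(word):
    """Lowercase a token and strip non-letters so words group cleanly."""
    return "".join(ch for ch in word.lower() if ch.isalpha())

print(normalize_word("Hadoop,"))  # → hadoop
```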
As you can see, some of these languages are practical and flexible, while others are more narrowly targeted. When it comes to data analytics, your team may well have to master more than one to cover everything you need to do. The big data products you choose also dictate which languages your shop needs most, such as whether you stick with classic MapReduce or run Spark or Pig (on YARN) instead.
Now that you’ve got some guidance for the language selection phase of your new data analytics operations, you could probably use some insight into other choices, such as whether you need a NoSQL database, or whether you actually need a real-time analytics platform. Follow us on Twitter for help on all this and more!