Big Data Programming Languages: What Are The Differences Between Python, R, and Julia?

by Bernard Marr   |   November 21, 2016 5:30 am   |   6 Comments

When it comes to coding Big Data and analytical applications, a select group of programming languages has become the default choice.

This is because their feature sets make them well suited to handling large and complicated datasets. R and Julia were designed with statistical and numerical computing in mind, while Python is a general-purpose language, and a broad developer ecosystem has evolved around all three. This means there are extensions, libraries, and tools out there for performing just about any analytics function you might need.

R, Python, and the relative newcomer Julia are three of the most popular programming languages chosen for Big Data projects in industry today. They have a lot in common, but there are important differences to consider when deciding which one will get the job done for you. Here’s a brief introduction to each of them, as well as some ideas about applications where one may be more suitable than the others.

R

R, which has been around since 1993, has long been considered the go-to programming language for data science and statistical computing. It was designed first and foremost to carry out matrix calculations – standard arithmetic functions applied to numerical data arranged in rows and columns.

R can be used to automate huge numbers of these calculations, even when the row and column data is constantly changing or growing. It also makes it very easy to produce visualizations based on these calculations. The combination of these features has made it an extremely popular choice for crafting data science tools.
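To make “matrix calculations” concrete, here is a minimal sketch of the kind of row-and-column arithmetic described above. It is written in plain Python rather than R, purely for illustration; the `scores` table, the `weights`, and both helper functions are made-up names, not part of any library. In R itself, operations like these are built in.

```python
# Hypothetical data: each row is a student, each column is a test score.
scores = [
    [80, 90],
    [70, 60],
    [90, 100],
]
# Hypothetical weights, one per column (test).
weights = [0.5, 0.5]


def column_means(table):
    """Average each column; zip(*table) turns rows into columns."""
    return [sum(column) / len(column) for column in zip(*table)]


def mat_vec(matrix, vector):
    """Matrix-vector product: one weighted sum (dot product) per row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


test_averages = column_means(scores)      # one average per test
final_grades = mat_vec(scores, weights)   # one weighted grade per student

print(test_averages)
print(final_grades)
```

The same two operations applied to a table with millions of rows are what a language built around matrices is optimized for.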

Because R has been around for a while, it has a large and active community of users and enthusiasts. They’ve spent the last couple of decades building extensions and libraries that increase the scope of what the language can do, make it simpler for the user to access its functions, and automate monotonous jobs.

Among the popular extensions are SparkR, which provides access to Apache Spark; ggplot2, which provides visualizations; and a recently announced extension that will allow R to access IBM’s Watson cognitive computing engine.

The fact is, though, that in becoming the ultimate programming language for statistical applications, R has sometimes fallen flat in other areas. Other languages competing for developers’ affections – including those mentioned below – are often more generalized. Because of this, a common approach is to first build the framework of an analytical application in R, taking advantage of its modular nature and support infrastructure. Then, once a solution – such as a working analytics engine – has been devised, the code might be recreated in another, more general-purpose programming language to complete the application’s production.

Python

Python is far more general purpose than R and will be more immediately familiar to anyone who has used object-oriented programming languages before.

Python’s sheer popularity has helped cement its place as the second most common tool for data science – and although it may not be quite as widely used as R, its user base has been growing at a greater rate. It’s certainly easier to get used to than R if you don’t already have a solid background in statistical computing.

Python’s user base has devoted itself to producing extensions and libraries aimed at helping it match the usefulness of R when it comes to data wrangling. One of the first of these tools was the NumPy extension, which gives it many of the same matrix-based algorithm capabilities as R. This attracted coders interested in analytics and statistics to the language, and over the years it has led to the development of more and more complex functions and methodologies.

Because of this, Python has become a popular choice for applications using the most cutting-edge techniques, such as machine learning and natural language processing. Open source libraries such as scikit-learn and the Natural Language Toolkit (NLTK) make it relatively simple for coders to put these technologies to work, and PySpark gives Python access to the Apache Spark framework. However, if you’re only interested in more traditional analytical and statistical computing, then you may find that R presents a more complete and integrated development environment than Python.

Julia

R and Python are still the reigning champions when it comes to data and analytics-oriented programming languages, but there are several other languages attracting attention for their suitability in this field.

One that certainly deserves a mention is Julia. It has only been in development for a few years but is already proving to be a popular choice. Like Python and R, Julia is built for scalability and speed of operation when handling large data sets. It was designed with a “best of all worlds” ethos – the idea was that it would combine the strengths of other popular analytics-oriented programming languages. One key influence was the widely used numerical computing language MATLAB, whose syntax it resembles in many places.

Julia has specific features built into the core language, such as parallelization and in-database analytics, that make it particularly suitable for working with the real-time streams of Big Data that industry wants to leverage these days. The fact that code written in Julia executes very quickly adds to its suitability here.
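In-database analytics means asking the database to do the calculation and return only the result, instead of pulling every row into the program first. The sketch below illustrates the idea in Python, not Julia, using the SQLite engine from Python’s standard library as a stand-in for a production database; the `readings` table, its values, and the function names are hypothetical.

```python
import sqlite3


def average_by_sensor(conn):
    """Let the database aggregate; only one summary row per sensor comes back."""
    query = """
        SELECT sensor, COUNT(*) AS n, AVG(reading) AS mean_reading
        FROM readings
        GROUP BY sensor
        ORDER BY sensor
    """
    return conn.execute(query).fetchall()


def to_fahrenheit(celsius):
    """A user-defined function (UDF) that SQL queries can call."""
    return celsius * 9 / 5 + 32


# An in-memory database with a few made-up sensor readings (degrees Celsius).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 20.0), ("a", 22.0), ("b", 30.0)],
)

# Register the UDF so it can be used inside SQL statements.
conn.create_function("to_fahrenheit", 1, to_fahrenheit)

summary = average_by_sensor(conn)
hottest_f = conn.execute(
    "SELECT MAX(to_fahrenheit(reading)) FROM readings"
).fetchone()[0]

print(summary)
print(hottest_f)
```

With three rows the saving is invisible, but the pattern is the same at scale: the query travels to the data, and only the small summary travels back.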

In a head-to-head comparison with R or Python, Julia’s youth is its Achilles’ heel. Its ecosystem of extensions and libraries is not as mature or developed as those of the more established languages. It is getting there, however, and most popular functions are available, with more emerging at a steady rate.

The Right Tool for the Job

From a general perspective, it may seem that R would be the natural choice for running large numbers of calculations against big-volume datasets, Python would be the go-to for advanced analytics involving AI or ML, and Julia a natural fit for projects involving in-database analytics on real-time streams.

In reality, the nuanced differences among the languages and the environments they provide to the programmer mean there’s rarely a one-size-fits-all solution. It’s also worth remembering that their open nature (they are all open source projects) means that they don’t pretend to live in isolation. The active communities behind each language frequently cooperate to port functionality between them, and extensions can be used to run code written in one language from within another.

All of the languages here are living projects that are constantly evolving and updated to be capable of new things. Each has its strengths and weaknesses, but they are all robust choices for enterprise initiatives involving Big Data and analytics.

 

Bernard Marr is a bestselling author, keynote speaker, strategic performance consultant, and analytics, KPI, and big data guru. In addition, he is a member of the Data Informed Board of Advisers. He helps companies to better manage, measure, report, and analyze performance. His leading-edge work with major companies, organizations, and governments across the globe makes him an acclaimed and award-winning keynote speaker, researcher, consultant, and teacher.

 


6 Comments

  1. Posted November 22, 2016 at 9:51 am | Permalink

    Good discussion Bernard.

    I’d like to add that to be competent in the field it is highly recommended to be proficient in several of these languages.

    I regularly find myself using R for bog-standard analytics – an example being time series analysis, for which R has a wide variety of advanced packages (e.g. forecast) – whereas for machine learning I often find myself dipping into Python and deploying scikit-learn.

    In summary – both are very accessible languages and can complement each other on the task at hand, so I usually recommend trying both.

  2. REZA JABERI
    Posted November 23, 2016 at 12:49 pm | Permalink

    VERY INTERESTING

  3. oren tal
    Posted November 25, 2016 at 1:44 am | Permalink

    what about Scala?

  4. Mark Lewis
    Posted November 25, 2016 at 11:45 am | Permalink

    I’m okay with calling these the main languages for data science, but the title, and a few places in the article use the term “Big Data”. I would argue that these are not the languages for big data. To support that claim, just go look at frameworks like Spark and Flink. Spark is mentioned in the discussions of both R and Python, but what isn’t mentioned is that Spark isn’t written in R, Python, or Julia. Spark is written in Scala. There is a reason for that. If your data set is truly big and needs to be distributed across multiple machines for processing, the languages listed here aren’t up to the task. If you want to really do big data, you should probably pick the language that the people who are writing the most significant big data frameworks are using. As of today, that would be Scala.

  5. Caren Dymond
    Posted November 28, 2016 at 11:19 am | Permalink

    I find it odd that you don’t mention whether or not the language does the analysis in RAM. I find that a limitation in R and am frequently trying to find ways around it, and I don’t use big data. Do Python and Julia work the same way? Perhaps that’s why it’s not mentioned. It seems to me that it would be better to have the heavy lifting done on a solid state drive than in RAM, but maybe I’m dreaming.

  6. Mike
    Posted December 12, 2016 at 2:06 pm | Permalink

    Great overview of these three solutions. I’ve used all three in different situations. I will challenge your comment that they are built for scalability. In my experience all can have performance issues on very large sets of data. I’ve had better luck writing user-defined functions (UDFs) that run in-database on MPP and Hadoop. R, Julia, and Python aren’t going to process 486 billion rows of data in any sort of timely manner, but in-database analytics can.
