R is the most popular statistical programming language in the world. It is used by 70 percent of data scientists, according to a Rexer Analytics study, including those at big data gravity centers such as Facebook, Google, and Twitter. In addition, thousands of university students around the world use R, and thousands more take R courses on Coursera.
Joseph Kambourakis, lead data science instructor at EMC Corporation, fielded questions from Data Informed about R’s popularity, strengths, weaknesses, and what he sees ahead for the open-source programming language.
Data Informed: Why is R so popular?
Joseph Kambourakis: It’s popular primarily because it works well and is free. R is also slightly better adapted to the way data scientists think when compared to tools such as Java or Python, which are more adapted to the way computer scientists think. The vast number of libraries and packages (available for R) really make anything possible.
What advantages does being open source give R over proprietary software?
Kambourakis: Being open source makes it much more agile and fast growing. Users can write new packages and functions at any time, and much more quickly than a company can. There is also no limit to the number of people who can do so, whereas a proprietary software company is limited by its current number of employees. There is no hassle or time spent negotiating, updating, or maintaining software licenses. The download is also a fraction of the size.
How quickly are user-created tools and libraries being created? Is this creation accelerating over time?
Kambourakis: In R, user-created tools and libraries take the form of packages. Multiple new R packages are created every day. This graphic shows how dramatic this growth has been lately:
What are some business use cases R is best suited for?
Kambourakis: R is best suited for developing data models or building graphics. At this point, every business has data they should be modeling, and everyone needs graphics to help explain the models.
What are some of R’s shortcomings?
Kambourakis: Some companies don’t allow the use of open-source tools for security reasons. R lacks the certification and product-specific training that a company like SAS or Oracle would typically administer. There isn’t customer support, so if something goes wrong there is no one to call and nowhere to turn for help aside from message boards. (Revolution Analytics recently announced AdviseR, the first commercial support program for R – ed.)
What are some of the ways users are misusing R?
Kambourakis: I think the two biggest misuses are in how and where they use it. Using R without an integrated development environment (IDE) or graphical user interface (GUI) makes everything much harder than it needs to be. The other misuse is putting it right into production. It really should be used for developing and testing a model. Then, for production, something faster should be implemented.
Are there potential business consequences of using this tool for the wrong business problems?
Kambourakis: Thankfully, I haven’t seen this problem before. I’m sure it exists, but I haven’t heard any real horror stories comparable to things I’ve heard about other tools such as Excel and the London Whale scandal at JPMC.
Is there a simple way to evaluate if R is the proper tool for a particular business challenge?
Kambourakis: I think the data size and speed dictate most of the decisions. If it’s small data, then R is a great tool. If you have multiple gigabytes of data, then you’re likely to run out of memory. R is much slower than compiled languages such as C. If you need the algorithm to run in microseconds, R likely will be too slow.
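(Editor’s note: The data-size rule of thumb above can be made concrete with a rough back-of-envelope estimate. The sketch below is written in Python purely for illustration. It relies on the fact that R stores numeric values as 8-byte doubles and works on data held in memory. The function names and the `headroom` multiplier of 3, an allowance for temporary copies made during modeling, are hypothetical choices and not measured values.)

```python
# Rough estimate of whether an all-numeric table fits in memory.
BYTES_PER_DOUBLE = 8  # R stores numeric values as 8-byte doubles


def estimated_gib(n_rows: int, n_cols: int) -> float:
    """Approximate size in GiB of n_rows x n_cols double-precision values."""
    return n_rows * n_cols * BYTES_PER_DOUBLE / 2**30


def fits_in_memory(n_rows: int, n_cols: int, ram_gib: float,
                   headroom: float = 3.0) -> bool:
    """True if the table, times a working-copy allowance, fits in RAM.

    headroom is an assumed multiplier for temporary copies; it is a
    rule of thumb for illustration only.
    """
    return estimated_gib(n_rows, n_cols) * headroom <= ram_gib


if __name__ == "__main__":
    # 50 million rows by 20 numeric columns on a machine with 16 GiB of RAM
    print(round(estimated_gib(50_000_000, 20), 2))
    print(fits_in_memory(50_000_000, 20, ram_gib=16))
```

(Under these assumptions, a table of 50 million rows and 20 numeric columns comes to roughly 7.5 GiB before any working copies are made, which is the multi-gigabyte range where memory becomes the limiting factor.)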
What do you see as the future of R?
Kambourakis: I really see R becoming the most popular and most commonly used language for any kind of mathematics or statistics. Because it is being taught in universities, more and more students and future employees will be using it. The availability of functions will only continue to expand. I also think many of the problems that R currently has will be worked out. The community behind R is very dedicated and passionate.