Granted, the price tag is attractive. But should you use them for serious work? In the mainframe era, the saying went: “No one ever got fired for choosing IBM.” So is it better to play safe and stick to commercial packages such as SAS or SPSS—the latter, coincidentally, now owned by IBM?
It’s a deceptively simple question, to which there isn’t a single, simple answer.
Certainly, the statistical language R, for instance, is hugely popular these days, not least because it’s free. SAS and SPSS, by contrast, require users to pay hefty, annually renewable license fees, where prices in the thousands of dollars are the norm.
But just because something is popular, or free, doesn’t mean that it’s right for your needs.
So here’s a rough guide to some of the core issues facing serious users who want to perform serious statistical analyses in a commercial environment.
Gretl (Gnu Regression, Econometrics and Time-series Library), as its name suggests, is aimed at econometrics and time-series analysis. The statistical language R presents itself as a multi-purpose statistical workbench, while PSPP, currently in version 0.8 and under development since 1998, aims to be a free, open source equivalent to IBM’s SPSS. Eventually, the plan for PSPP is to duplicate the full functionality of SPSS, but it hasn’t got there yet.
In practice, this means evaluating a free, open source package to determine whether it will perform the analyses you want, and establishing for yourself the format in which those analyses are output.
Internet-based reviews can help, as can user conferences and user groups, but the fastest solution may be to download a copy and see for yourself. That’s free in cash terms, although of course there’s the time and resources aspect to consider.
In contrast, offerings such as SAS and SPSS have extensive and rich functionality, as well as good graphics and visualization capabilities—but have price tags to match.
Training, Support, and Documentation
For Steve Messenger, managing director in charge of data analysis, modeling, and insights at London-based marketing analytics company Red Route International, package support is a critical issue. Red Route knows the major open source packages, he explains, and has looked at them.
“But what we want is a company that we can go to if there’s a problem, where there will be someone who will take ownership of our problem and provide a solution,” he says.
In other words, in a commercial environment, he doesn’t want to depend on the kindness of strangers answering questions in an Internet user group, or to post bug-fix requests on an open source project’s SourceForge portal.
Training and documentation—at least training aimed at serious, high-level users—can also be an issue in the free and open source world.
Free PDFs abound on how to get started with R, and there’s even R for Dummies ($20 at Amazon). But while some online providers offer training in R, the training available for less mainstream packages is patchy or non-existent. That said, the econometrics and time-series package Gretl has a free 352-page downloadable manual, and there are various free guides to PSPP.
With SAS and SPSS, training provision isn’t a problem—and besides, many users will come to SAS and SPSS in the commercial world having had prior exposure to one or both at university.
The buzz around R doesn’t always make clear that R in its free, open source version has serious constraints on dataset size. Once the size of a dataset reaches about half of the free RAM available on the computer on which it is being run, “you start to see performance degradation issues,” says David Smith, vice-president of marketing at R specialists Revolution Analytics.
For the dataset sizes associated with big data, then, that can clearly be a problem. Revolution’s answer: sign up for its own—paid-for—version of R, Revolution R Enterprise. It has no such constraints, thanks to its ability to run multi-threaded on multiple nodes, and manage data in blocks, held on disk but available to the processor.
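The general idea of managing data in blocks can be illustrated with a minimal Python sketch. It is not Revolution’s implementation, and the function name `blockwise_mean`, the column name `spend`, and the block size are all hypothetical, chosen purely for illustration. The point is simply that a statistic such as a mean can be computed while only one block of rows is held in memory at a time.

```python
import csv
import io
from itertools import islice


def blockwise_mean(rows, column, block_size=10_000):
    """Return the mean of `column`, holding at most `block_size` rows in memory."""
    reader = csv.DictReader(rows)
    total = 0.0
    count = 0
    while True:
        # Pull the next block of rows from the file; earlier blocks are discarded.
        block = list(islice(reader, block_size))
        if not block:
            break
        total += sum(float(row[column]) for row in block)
        count += len(block)
    if count == 0:
        raise ValueError("no data rows found")
    return total / count


# Usage: an in-memory stand-in for a large CSV file held on disk.
sample = io.StringIO("spend\n10\n20\n30\n40\n")
print(blockwise_mean(sample, "spend", block_size=2))  # prints 25.0
```

In real use, `rows` would be an open file handle rather than an in-memory string, so memory use depends on the block size rather than on the size of the dataset.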
Typically, says Smith, the resulting increase in speed is at least 15 to 20 times, and can be greater. When recently switching from the open source version to Revolution R Enterprise, he reports, a marketing analytics company saw an analysis job that routinely took three to four days finish in just 50 minutes instead. In terms of cost, Revolution claims that Revolution R Enterprise costs around half as much as competing “legacy” solutions (think SAS and SPSS).
That said, if the analytics software budget is either limited or zero, then rigidly sticking to free, open source software may be the only option. And critically in a big data context, PSPP has no applicable dataset size limitations. On a conventional PC, says PSPP co-developer John Darrington, “the dataset size is only limited by the size of the free space on the hard disk.” Beyond the terabyte or so that this might equate to, data can be streamed from an enterprise server.
That’s one reason, he explains, why PSPP’s user base is so polarized—on the one hand, small businesses and students attracted by the price tag; and on the other, professional statisticians, attracted by PSPP’s ability to handle datasets that cause problems for R and Gretl.
No, I’m not referring to the size of your analytics project, but the size of the development team maintaining and developing the statistics package in question. R, for instance, has a core team of about 20 developers, with another 50 or so credited with bug fixes and special pieces of code. Gretl’s developer team is of a broadly comparable size.
PSPP, in contrast, has a smaller team, with Darrington and co-developer Ben Pfaff, who is the project founder and lead developer, shouldering much of the burden. On the plus side, that means that questions to the project’s Internet user group page are answered by the people who wrote the code. The downside is that they’ve got day jobs, and when they’re answering users’ questions, they can’t be coding as well.
Broadly speaking, the bigger the project, the better the documentation and the faster the pace of development.
Reputational Risk and Perception
Back at marketing analytics firm Red Route, data analysis managing director Steve Messenger repeats that his firm has explored free and open source alternatives to the SAS and SPSS packages that it routinely uses. But quite apart from any consideration of the effectiveness of open source tools, he says, “You have to consider how it looks to the client.”
“There’s an element of being able to provide a total assurance that the solutions we are proposing to use are safe, and proven, and are industry standard, and are of high quality,” he explains.
And to clients at a senior level—typically people in their fifties and sixties—open source software doesn’t yet pass that bar. Needless to say, with SPSS and SAS, the question doesn’t arise.
Malcolm Wheatley, a freelance writer and contributing editor at Data Informed, is old enough to remember analyzing punched card datasets in batch mode, using SPSS on mainframes. He lives in Devon, England, and can be reached at firstname.lastname@example.org.