The title “data scientist” evokes for many a vision of a long-haired sorcerer locked away in a tower, crafting magical potions from recipes scrawled upon scrolls only they can read. For others, the image is more modern – the bearded developer surrounded by monitors and LED-lit keyboards, too entrenched in source code to be bothered with the mundane details of daily life (like shaving). Either way, the data scientist, and data science itself, is often viewed as an elite craft reserved for a similarly elite sector of society.
The Open Data Science movement, however, is breaking down these barriers and showing that magic can be a product of teamwork, open-source access, and shared innovation.
Open Data Science closes the skills gap between sorcerer and society, turning data scientists into team players and greatly increasing the power of data science itself. It enables the type of precision analysis that lets marketers, advertisers, and other non–data experts (and data experts, too) be better at their jobs – without a computer science degree, multiple Ph.D.s, or a series of software certifications.
The Open Data Science movement is often confused with open source. Although the two work together, the difference between them is fundamental. Open-source innovation allows data science teams to avoid the roadblocks of proprietary software. In terms of our sorcerer, proprietary software is the equivalent of writing scrolls in Latin instead of in the common tongue. True freedom, however, requires solutions that are cross-platform – Windows, Linux, OS X – and cross-language – R, Python, Scala, Java, C/C++, etc.
Open Data Science makes that possible.
As the amount of available information explodes, new technologies continually emerge to analyze and compute in what is supposed to be a connected ecosystem. In reality, the ecosystem isn’t very connected at all. For example, data can move from R or Python to Scala (via Spark) only through a second-class API. That forces data scientists to move all their data out of its native environment, which comes with a huge performance penalty. And beyond the performance penalties of trying to make proprietary systems speak nicely to each other, a new problem also arises: siloed expertise.
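To make the cost of leaving the native environment a little more tangible, here is a minimal Python sketch. It is not how Spark or any particular product moves data; the helper names `cross_boundary` and `share_in_place` are hypothetical, and `pickle` simply stands in for whatever serialization a cross-runtime hand-off might use. The point it illustrates is that crossing a boundary means copying every value, while staying in one environment lets tools look at the same memory.

```python
import pickle
from array import array


def cross_boundary(values: array) -> array:
    """Simulate handing data to another runtime (hypothetical helper).

    The data is serialized into bytes and rebuilt on the "other side",
    so every value is copied at least once in each direction.
    """
    payload = pickle.dumps(values)   # copy 1: values -> byte string
    return pickle.loads(payload)     # copy 2: byte string -> new array


def share_in_place(values: array) -> memoryview:
    """Expose the same underlying buffer without copying (hypothetical helper)."""
    return memoryview(values)


data = array("d", [1.0, 2.0, 3.0])

copied = cross_boundary(data)   # independent copy of the data
shared = share_in_place(data)   # a view onto the original buffer

# Writing through the view changes the original array,
# while the serialized copy keeps the old value.
shared[0] = 99.0
```

With three numbers the difference is invisible; the argument in the text is that at the scale of a full data set, the copying step is where the time goes.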
Proprietary software tends to pigeonhole developers into single tracks of knowledge and tool sets, while Open Data Science embraces a rich ecosystem of compatible open-source technologies. Open Data Science eliminates the need to force data into a new place and opens up data analysis to the masses. It lets data science teams realize true interoperability and accomplish mind-blowing feats. DARPA’s Memex project uses Open Data Science to stop human trafficking. The TaxBrain project uses Open Data Science to bring transparency to policy and help voters educate themselves about the potential impacts of local, regional, and national issues.
Open Data Science allows individuals and enterprises to create, collaborate, and execute advanced analytics among a wide population of team members, whether they are in the same cubicle row or scattered across the globe. Teams use different technology gateways and different approaches to solve different problems. Rather than mandating the command line as the only way for data science teams to work, Open Data Science treats it as just one option among many.
This freedom and openness are vitally important because the divisions created by proprietary software are costly, not only in transferring data between incompatible systems, but also in paying the salaries of numerous experts in each silo.
Let’s return to that idea of siloed expertise and look at it in a different light. That same sorcerer, crafting potions in his tower, has very different tools and ingredients for each potion. To make a potion, you might need flasks and test tubes and Petri dishes, but these tools are not interchangeable. The knowledgeable sorcerer would never use a Petri dish in place of an Erlenmeyer flask, for the boiling ingredients would quickly spill over the edge. Without Open Data Science, the sorcerer may be limited in his creations, as he is given the right tools or ingredients for one type of potion but not others. With Open Data Science, the barriers are broken down and our sorcerer enjoys a fully stocked lab of flasks, test tubes, and Petri dishes alike, with recipes written in easily understandable (and non-proprietary) language. With Open Data Science, there is no more chasm between the alchemist, the apothecary, and the sorcerer – the tools and knowledge are equally available to all.
In more concrete terms, Open Data Science is open at a data level, enabling teams to support many different data platforms. It’s open at the analytics level to different libraries, and at the computational level to different architectures – everything from a desktop and server to clusters and GPUs.
This flexibility and freedom are why Open Data Science is gaining momentum so quickly and so powerfully. Traditional solutions are an administrative nightmare. Open Data Science brings all of the technologies together so it’s quick and easy to pick up the right one.
Michele Chambers is an entrepreneurial executive with over 25 years of industry experience. She has authored two books: Big Data, Big Analytics, published by Wiley, and Modern Analytics Methodologies, published by Pearson FT Press. Prior to Continuum Analytics, Michele held executive leadership roles at the database and analytic companies IBM, Netezza, Revolution Analytics, MemSQL, and RapidMiner. In her career, Michele has been responsible for strategy, sales, marketing, product management, channels, and business development. She holds a B.S. in Computer Engineering from Nova Southeastern University and an M.B.A. from Duke University.