Put Data Science Skills Before Big Data Infrastructure

By David Johnston   |   January 5, 2016 5:30 am

David Johnston, Lead Data Scientist, ThoughtWorks

“Big data” and “data science” are today’s buzzwords. Capitalism has responded predictably to the business media frenzy: The world is flooded with big data products, and businesses have invested in them enthusiastically for years. Many companies are trying to modernize their data platforms and enable their employees to monetize valuable data, but most are not seeing the benefits.

A recent white paper and survey from IDG Research and Kapow Software states: “Big data projects are taking far too long, costing too much, and not delivering on anticipated ROI because it’s really difficult to pinpoint and surgically extract critical insights without hiring expensive consultants or data scientists in short demand.”

As a data science consultant, I have recognized the key problem behind most of these failures: too little attention to data science skills and too much emphasis on infrastructure. Many companies have the tools but lack the ideas and the right kind of talent to use them. Product companies benefit from this problem and actively contribute to it: It’s easier to sell a software license than to solve a real problem, and it’s easy to believe that your employees can already solve problems and merely lack the tools being sold to you.

If you think you are in this situation, the answers to these questions might help:

    • Can you point to concrete examples in which your employees created innovative data products in the lab that you couldn’t implement for lack of technology?
    • Have your employees created accurate predictive analytics solutions that just didn’t scale well enough to run in production?
    • Has the strategy of “build it and they will come” been successful at your company?

With our clients, we rarely see situations in which a lack of tools and technology is holding back a data science team. Most often, the problem is either a lack of skill or inexperience with integrating data science into full software applications. Problems of scaling should be addressed from an information perspective before resorting to brute-force distributed solutions like Hadoop. A good data science team will know best which tools, if any, are needed. Executives in non-technical roles should resist the pitch that they should choose platform-level tools themselves and then convince their technical people to use them.

Be Agile

Software experts, including those at ThoughtWorks, have had great success over the past two decades convincing businesses of the value of agile methodology and lean approaches. Build or buy what you need only when you know you need it next. Build things iteratively, with feedback loops in place. If someone tells you that you will see benefits only after building some enormous system, run away quickly!

As my colleague Ken Collier argues in his book, Agile Analytics, data infrastructure seems to have survived the big-upfront-investment extinction that happened in the rest of the software industry. While the big data product champions might claim that their products enable agility, the truth is that they are massive, inflexible systems suited only for scaling up systems whose idea development is nearly completed. Their tools solve an important problem. But that problem isn’t one that most companies really have.

Put Small Before Big

There is no point in building a scalable data environment before you have proven some ideas at a smaller scale. Choosing something like a Hadoop ecosystem or a NoSQL database before you actually require it is not only wasteful but also may not provide the kind of scaling that you will need. What if your bottleneck is network bandwidth rather than I/O or processing? This is like building a six-lane highway system before finishing the invention of the car.

Experienced data scientists know how to reduce data, that is, how to extract the small amount of valuable information from a larger data set. This may include sampling, variable selection, dimensionality reduction, compression, or choosing a more appropriate algorithm. Never in my 20 years of experience have I required more data than can fit on my personal computer in order to learn enough to develop a useful algorithm. Working with more data than this actually hinders development of initial ideas and delays the development of a minimum viable product. If you put small before big, then by the time you need to run on larger amounts of data, or all the data, you will know exactly what kind of scalable technology is needed and how to make an intelligent, timely investment if necessary.
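
Of the reduction techniques listed above, sampling is the simplest to illustrate. The following is a minimal sketch, not a prescribed method: it assumes the records arrive as a stream too large to load at once, and it uses reservoir sampling (one common approach among several) to keep a fixed-size uniform random subset in memory. The function name `reservoir_sample`, its parameters, and the synthetic rows are hypothetical and exist only for this example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of k records from a stream of unknown length.

    Hypothetical helper for illustration; only k records are held in memory.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Pick a slot in [0, i]; the new record replaces an old one
            # only when the slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


# Usage: a generator stands in for a data set that does not fit in memory.
rows = (f"row-{n}" for n in range(100_000))
subset = reservoir_sample(rows, k=1_000)
print(len(subset))
```

A subset like this can be explored on a laptop while ideas are still forming. Whether a uniform sample is appropriate depends on the problem: rare events, for example, may call for stratified sampling instead.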

Have the Right Kind of Data Science Talent

Does your company really have the right kind of data science talent? If so, congratulations! You are in the minority.

Most companies will answer this question in one of two ways. Some will admit that they probably do not have the right kind of talent, but do have software developers or less experienced data analysts who can play the part until they can put together a data science team. They are struggling to hire that data science team and, in the meantime, they have hired a product company to set up a big data platform that currently adds little value.

Other companies will simply answer: Well, I hope we have the right kind of talent! They may have people on the payroll whose job title is “data scientist” or have an entire analytics team, but while these people are doing constructive data work, they may not be capable of creating and using advanced data science algorithms and are under a lot of stress from working well beyond their areas of expertise.

If You Don’t Have It, Rent It

At the risk of appearing to be self-serving, I would be remiss if I didn’t point out the benefits of hiring an experienced data science consultant. If you are unsure about the capabilities of your data science team, this is the quickest way to find out. Like a management consultant, a good data science consultant can work with your existing team and help it be more effective at delivering data science applications. Data science consultants can help with the high-level strategies around utilizing data to improve your business model, and some consulting organizations can provide the entire delivery team and help with hiring people to support it once the delivery is complete.

Data scientists do not compete against software products. Data science problems that are interesting and worth pursuing generally are also unique enough to require custom software. Nobody writes general purpose software to handle your company’s unique problems. And if they did, what would that say about your company’s competitive advantage?

Many companies haven’t even considered hiring data scientists as consultants and simply wish to hire them as permanent employees. While I am strongly in favor of doing this as well, the reality is that data scientists are in great demand and are very difficult to hire. The riskiest data scientist you’ll hire is your first one, as he or she will largely define the team that follows. It’s best to make that important hire in close consultation with an expert, preferably one with whom you have been working on your actual business problems. While there are benefits of having permanent employees, you never know when talented people are going to jump ship. What consultants may lack in knowledge of what happens inside your company, they often make up for with knowledge of what happens outside your company.

I recommend that companies hold off on investing in big technology initiatives until they have spoken with a data science consultant about the bigger data strategy, preferably a consultant who isn’t selling products. Consider your commitment to agile methodologies, and apply the same strategy to your data initiatives. If you are trying to build a data science practice, a data science consultant can help you steer that venture and get started on the right foot. Making the most of your unique information should be your core competency, so invest in it wisely.

David Johnston is a Lead Data Scientist at ThoughtWorks. He creates statistical models and predictive algorithms at the core of the innovative data science applications that we build for our clients. He has a Ph.D. in physics and over 20 years of experience working with data. Prior to his career in consulting, he conducted cosmological research at academic institutions, NASA, and government laboratories, developing data processing pipelines, statistical algorithms, and optimization strategies for space missions and astronomical experiments.
