I’d like to offer up some thoughts about what it means to practice data science in the real world, because merely knowing the math isn’t enough.
Anyone who knows me well knows that I’m not the sharpest knife in the drawer. My quantitative skills are middling, but I’ve seen folks much smarter than me fail mightily as analytics professionals. The problem is that while they’re brilliant, they don’t know the little things that can cause technical endeavors to fail within the business environment. So let’s cover these softer items that can make the difference between success and failure for your analytics project or career.
Get to Know the Problem
My favorite movie of all time is the 1992 film Sneakers. The movie centers on a band of penetration testers led by Robert Redford that steals a “black box” capable of cracking RSA encryption. Hijinks ensue. (If you haven’t watched it, I envy you, because you have an opportunity to see it for the first time!)
There’s a scene where Robert Redford encounters an electronic keypad on a locked office door at a think tank, and he needs to break through.
He reaches out to his team via his headset. They’re waiting in a van outside the building. “Anybody ever had to defeat an electronic keypad?” he asks.
“Those things are impossible,” Sidney Poitier exclaims. But Dan Aykroyd, also waiting in the van, comes up with an idea, and they explain its complexities to Redford over the comms.
Robert Redford nods his head and says, “Okay, I’ll give it a shot.”
He ignores the keypad and kicks in the door.
You see, the problem wasn’t “defeating an electronic keypad” at all. The problem was getting inside the room. Dan Aykroyd understood this.
This is the fundamental challenge of analytics: understanding what actually must be solved. You must learn the situation, the processes, the data, and the circumstances. You need to characterize everything around the problem as best you can in order to understand exactly what an ideal solution is.
In data science, you’ll often encounter the “poorly posed problem”:
1. Someone else in the business encounters a problem.
2. They use their past experience and (lack of?) analytics knowledge to frame the problem.
3. They hand their conception of the problem to the analyst as if it were set in stone and well posed.
4. The analytics person accepts and solves the problem as-is.
This can work. But it’s not ideal, because the problem you’re asked to solve is often not the problem that needs solving. And when *this* problem (defeat the keypad) is really about *that* problem (get into the room), analytics professionals cannot afford to be passive.
You cannot accept problems as handed to you in the business environment. Never allow yourself to be the analyst to whom problems are “thrown over the fence.” Engage with the people whose challenges you’re tackling to make sure you’re solving the right problem. Learn the business’s processes and the data that’s generated and saved. Learn how folks are handling the problem now, and what metrics they use (or ignore) to gauge success.
Solve the correct, yet often misrepresented, problem. This is something no mathematical model will ever say to you. No mathematical model can ever say, “Hey, good job formulating this optimization model, but I think you should take a step back and change your business a little instead.” And that leads me to my next point: Learn how to communicate.
We Need More Translators
I’m assuming you know a thing or two about analytics. You’re familiar with the tools that are available to you. You’ve prototyped in them. And that allows you to identify analytics opportunities better than most, because you know what’s possible. You needn’t wait for someone to bring an opportunity to you. You can potentially go out into the business and find them.
But without the ability to communicate, it becomes difficult to understand others’ challenges, articulate what’s possible, and explain the work you’re doing.
In today’s business environment, it is often unacceptable to be skilled at only one thing. Data scientists are expected to be polyglots who understand math, code, and the plain-speak (or sports analogy-ridden speak . . . ugh) of business. And the only way to get good at speaking to other folks, just like the only way to get good at math, is through practice.
Take any opportunity you can to speak with others about analytics, formally and informally. Find ways to discuss with others in your workplace what they do, what you do, and ways you might collaborate. Speak with others at local meet-ups about what you do. Find ways to articulate analytics concepts within your particular business context.
Push your management to involve you in planning and business development discussions. Too often the analytics professional is approached with a project only after that project has been scoped, but your knowledge of the techniques and data available makes you indispensable in early planning.
Push to be viewed as a person worth talking to and not as an extension of some number-crunching machine that problems are thrown at from a distance. The more embedded and communicative an analyst is within an organization, the more effective he or she is.
For too long analysts have been treated like Victorian women — separated from the finer points of business, because they couldn’t possibly understand it all. Oh, please. Let people feel the weight of your well-rounded skill set — just because they can’t crunch numbers doesn’t mean you can’t discuss a PowerPoint slide. Get in there, get your hands dirty, and talk to folks.
Beware the Three-Headed Geek-Monster: Tools, Performance, and Mathematical Perfection
Many things can sabotage the use of analytics within the workplace. Politics and infighting perhaps; a bad experience from a previous “enterprise, business intelligence, cloud dashboard” project; or peers who don’t want their “dark art” optimized or automated for fear that their jobs will become redundant.
Not all hurdles are within your control as an analytics professional. But some are. There are three primary ways I see analytics folks sabotage their own work: overly complex modeling, tool obsession, and fixation on performance.
Many moons ago, I worked on a supply chain optimization model for a Fortune 500 company. This model was pretty badass if I do say so myself. We gathered all kinds of business rules from the client and modeled their entire shipping process as a mixed-integer program. We even incorporated normally distributed future demand in a novel way that ended up getting published.
But the model was a failure. It was dead out of the gate. By dead, I don’t mean that it was wrong, but rather that it wasn’t used. Frankly, once the academics left, there was no one left in that part of the company who could keep the cumulative forecast error means and standard deviations up to date. The boots on the ground just didn’t understand it, regardless of the amount of training we gave.
This is a difference between academia and industry. In academia, success is not gauged by usefulness. A novel optimization model is valuable in its own right, even if it is too complex for a supply chain analyst to keep running.
But in industry, analytics is a results-driven pursuit, and models are judged by their practical value as much as by their novelty.
In this case, I spent too much time using complex math to optimize the company’s supply chain but never realistically addressed the fact that no one would be able to keep the model up to date.
The mark of a true analytics professional, much like the mark of a true artist, is in knowing when to edit. When do you leave some of the complexity of a solution on the cutting room floor? To get all cliché on you, remember that in analytics perfect is the enemy of good. The best model is one that strikes the right balance between functionality and maintainability. If an analytics model is never used, it’s worthless.
Right now in the world of analytics (whether you want to call that “data science,” “big data,” “business intelligence,” “blah blah blah cloud,” and so on), people have become focused on tools and architecture.
Tools are important. They enable you to deploy your analytics and data-driven products. But when people talk about “the best tool for the job,” they’re too often focused on the tool and not on the job.
Software and services companies are in the business of selling you solutions to problems you may not even have yet. And to make matters worse, many of us have bosses who read stuff like the Harvard Business Review and then look at us and say, “We need to be doing this big data thing. Go buy something, and let’s get Hadoop-ing.”
This all leads to a dangerous climate in business today: management points to tools as proof that analytics is being done, providers are happy to sell those tools, and there’s little accountability for whether any actual analysis gets done.
So here’s a simple rule: Identify the analytics opportunities you want to tackle in as much detail as possible before acquiring tools.
Do you need Hadoop? Well, does your problem require a divide-and-conquer aggregation of a lot of unstructured data? No? Then the answer may be no. Don’t put the cart before the horse and buy the tools (or the consultants who are needed to use the open source tools) only to then say, “Okay, now what do we do with this?”
If I had a nickel for every time someone raised their eyebrows when I told them MailChimp uses R in production for our abuse-prevention models, I could buy a Mountain Dew. People think the language isn’t appropriate for production settings. If I were doing high-performance stock trading, it probably wouldn’t be. I’d likely code everything up in C. But I’m not, and I won’t.
For MailChimp, most of our time isn’t spent in R. It’s spent moving data to send through the AI model. It’s not spent running the AI model, and it’s certainly not spent training the AI model.
I’ve met folks who are very concerned with the speed at which their software can train their artificial intelligence model. Can the model be trained in parallel, in a low-level language, in a live environment?
They never stop to ask themselves if any of this is necessary and instead end up spending a lot of time gold-plating the wrong part of their analytics project.
At MailChimp, we retrain our models offline once a quarter, test them, and then promote them into production. In R, it takes me a few hours to train the model. And even though we as a company have terabytes of data, the model’s training set, once prepped, is only 10 gigabytes, so I can even train the model on my laptop. Crazy.
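The retrain-test-promote loop described above can be sketched in a few lines. To be clear, this is a hypothetical illustration, not MailChimp’s actual pipeline (which runs in R); the function name, scores, and file layout here are invented for the example. The point is simply that “promotion” can be a dumb, auditable gate:

```python
import json
import tempfile
from pathlib import Path

def promote(candidate_score: float, production_score: float,
            artifact: dict, path: Path) -> bool:
    """Publish a candidate model only if it beats the production model.

    Hypothetical sketch: "promotion" here just means writing the
    serialized model where the production scorer will pick it up.
    """
    if candidate_score <= production_score:
        return False  # keep the current production model
    path.write_text(json.dumps(artifact))
    return True

# Example: a quarterly retrain produced a slightly better model.
model_dir = Path(tempfile.mkdtemp())
promoted = promote(0.93, 0.91, {"weights": [0.2, -1.1, 0.7]},
                   model_dir / "abuse_model.json")
```

The gate is the whole idea: a worse candidate never reaches production, and the check is simple enough for anyone on the team to read and maintain.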
Given that that’s the case, I don’t waste my time on R’s training speed. I focus on more important things, like model accuracy.
I’m not saying that you shouldn’t care about performance. But keep your head on straight, and in situations where it doesn’t matter, feel free to let it go.
You Are Not the Most Important Function of Your Organization
Okay, so there are three things to watch out for. But more generally, keep in mind that most companies are not in the business of doing analytics. They make their money through other means, and analytics is meant to serve those processes.
You may have heard elsewhere that data scientist is the “sexiest job of the 21st century”! That’s because of how data science serves an industry. Serves being the key word.
Consider the airline industry. They’ve been doing big data analytics for decades to squeeze that last nickel out of you for that seat you can barely fit in. That’s all done through revenue optimization models. It’s a huge win for mathematics.
But you know what? The most important part of their business is flying. The products and services an organization sells matter more than the models that tack on pennies to those dollars. Your goals should be things like using data to facilitate better targeting, forecasting, pricing, decision-making, reporting, compliance, and so on. In other words, work with the rest of your organization to do better business, not to do data science for its own sake.
Excerpted with permission from the publisher, Wiley, from Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman. Copyright © 2013.
John W. Foreman, author of Data Smart: Using Data Science to Transform Information into Insight, is chief data scientist for MailChimp.com, where he leads a data science product development effort called the Email Genome Project. As an analytics consultant, John has created data science solutions for The Coca-Cola Company, Royal Caribbean International, Intercontinental Hotels Group, Dell, the Department of Defense, the IRS, and the FBI.