How to Impose Structure on Unstructured Data

April 7, 2015

Evan A. Schnidman, founder and CEO, Prattle Analytics


Editor’s Note: Evan is presenting a talk on “Using Domain Expertise to Shed Light on Unstructured Data” at a Meetup event today at 6:30 p.m. in Cambridge, MA.

For over a decade, the business world has been enthralled with big data. Some of this data arrives structured, but much of it is unstructured, often in the form of text. Over this same period, it has become common for analysts to mine unstructured data for useful correlations.

As interesting as some of the correlations this approach generates can be, much of what is found is presented without the structure of theory. Like blindly shining a flashlight into a dark forest to find the way out, this method guarantees narrow discovery, but not much else. Sure, you might find the right path, but you are just as likely to end up on the wrong path or simply staring at trees. Structuring your analysis of unstructured data allows you to systematically record where you have shined the light, so you know the options in front of you. Adding domain expertise to this simple structure adds wattage to your flashlight, better illuminating the alternatives throughout the forest. Deep domain expertise allows you to utilize structure and shine a veritable floodlight on the world of big data.

Text Analytics

One of the frontiers of unstructured data is the text that invades every facet of modern life. Whether the task is analyzing social network comments, full news stories, or sophisticated regulatory information, text is vital to virtually every industry. So how do we go about analyzing that text? The traditional way is simple: read it. But in the modern world, there is just too much information. Even if it were possible to read everything available, readership bias would plague every step of the process. So, text data must be analyzed in a systematic, unbiased way. This type of examination is commonly referred to as sentiment analysis.

Unfortunately, sentiment analysis has a bad reputation, earned by years of rudimentary software that simply uses dictionary-defined terms as buzzwords. These buzzwords are categorized as good or bad, and sentiment is then scored as the count of good buzzwords minus the count of bad buzzwords. Slightly more sophisticated versions of this method incorporate the context of those buzzwords by subtracting bad buzz-phrases from good buzz-phrases to glean sentiment. Although more precise, this phrase-analysis method is almost as biased as the simple word approach because both introduce selection bias by pre-defining key words or elements of a communication.
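To make the shortcoming concrete, here is a minimal sketch of that dictionary approach in Python. The word lists are hypothetical, not drawn from any real sentiment dictionary:

GOOD_WORDS = {"growth", "strong", "improve", "robust"}
BAD_WORDS = {"weak", "decline", "risk", "slowdown"}

def buzzword_score(text: str) -> int:
    """Sentiment = count of good buzzwords minus count of bad buzzwords."""
    tokens = text.lower().split()
    good = sum(1 for t in tokens if t in GOOD_WORDS)
    bad = sum(1 for t in tokens if t in BAD_WORDS)
    return good - bad

print(buzzword_score("Growth remains strong despite downside risk"))  # prints 1

Notice that this scorer is blind to negation, to context, and to any word the dictionary’s author did not anticipate, which is exactly the selection bias described above.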


While it would be great to have a pure data science solution to this problem, this is simply not possible when analyzing complex text. For any remotely complicated subject matter, it is crucial to introduce expertise into the computing solution. In particular, deep domain expertise can allow a human to train a text analytics algorithm based on comprehensive rules that encompass whole communications rather than simple words or phrases. This method eliminates buzzword bias and ensures that context is crucial to the analysis. Essentially, this means having an expert scale documents in an impartial way to train a sophisticated text analytics system.
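As a rough sketch of what that expert-scaled training data could look like (the field names here are illustrative assumptions, not any vendor’s actual schema), each record pairs a whole communication with an expert-assigned score rather than a list of buzzwords:

from dataclasses import dataclass

@dataclass
class LabeledCommunication:
    text: str     # the entire communication, not extracted keywords
    score: float  # expert-assigned ordinal sentiment score

training_set = [
    LabeledCommunication("Full text of a speech...", 1.0),
    LabeledCommunication("Full text of meeting minutes...", -0.5),
]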

Take, for example, analysis of central bank policy. Astute readers might remember the “Briefcase Watch” of the 1990s, when reporters followed Federal Reserve Chairman Alan Greenspan on the day of a Fed meeting to get a glimpse of his briefcase. The idea was that if the briefcase was thick, he had been reviewing data and rates were likely to change; if it was thin, rates would remain the same. Aside from the fact that a modern briefcase would contain a laptop rather than stacks of paper, the modern method of Fed watching has not changed much. The big difference is that, nowadays, central banks release far more information to the public. The trouble is that, as in the Greenspan era, Fed watchers are still focused on a narrow subject, a few key words, to determine whether policy is going to change.

The rationale behind modern Fed watchers’ focusing on words or phrases in larger communications derives from the Fed’s own method of editing press releases. Its use of track changes in those releases led reporters to key in on simple word changes. Unfortunately, this method of analysis has carried over to much larger, uniquely written communications like meeting minutes and speeches, not to mention communications from central banks all over the world that may or may not use track changes. The bottom line is that this narrow method of analysis leaves carefully crafted data (words) unanalyzed.

To properly analyze the complex communications released by central banks, it is vital not only to examine every word of every communication, but also to scale the system with as little bias as possible. In the context of central banks, this means using historical market reaction to define “hawkish” or “dovish” communications on an ordinal scale. To mitigate the selection bias of cherry-picking overtly hawkish or dovish communications, the training set can be rounded out with a sufficiently large sample of random communications. The text analysis algorithm can then score an archive of historical communications based on similarities in language to the training data. With each added input, the system better learns and parameterizes hawkish/dovish sentiment and adapts to shifting language and language patterns.
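One plausible sketch of that pipeline uses off-the-shelf scikit-learn components. The actual system described here is proprietary, so treat the model choice, features, and labels below as assumptions for illustration only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical training set: whole communications labeled on an ordinal
# hawkish (+) / dovish (-) scale derived from historical market reaction,
# padded with randomly sampled communications to limit selection bias.
train_texts = [
    "Inflation pressures warrant a firmer policy stance.",
    "Considerable slack argues for continued accommodation.",
    "The committee sees risks to the outlook as balanced.",
]
train_labels = [1.5, -1.5, 0.0]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
model = Ridge(alpha=1.0)
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# Score an archive of historical communications by the similarity of
# their language to the training data.
archive = ["The committee judges that further accommodation is warranted."]
print(model.predict(vectorizer.transform(archive)))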

Essentially, this approach pairs natural language processing with machine learning so that changes in language can be properly incorporated and scores reflect new and changing terms in the central bankers’ lexicon.
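One way that adaptive step could work in practice is an online model over hashed n-gram features, so newly coined terms never require rebuilding the feature space. Again, this is an illustrative assumption, not a description of the production system:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDRegressor

hasher = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)
online_model = SGDRegressor()

def update(text: str, observed_score: float) -> None:
    """Fold a newly scored communication into the model incrementally."""
    online_model.partial_fit(hasher.transform([text]), [observed_score])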

The end result is a single score for every communication produced by a central bank. These scores are normalized around zero, such that negative numbers indicate dovish sentiment and positive numbers indicate hawkish sentiment. Nearly all scores fall between -2 and +2 because those values lie two standard deviations from the mean. These scores represent the central bank’s sentiment toward the state of the economy (or at least toward inflation expectations), so the trend (rising or falling) is essential to understanding where monetary policy and, more broadly, the economy are likely to go in the near future.
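The normalization itself is straightforward standardization; for roughly normally distributed scores, about 95 percent of values then land within two standard deviations, that is, between -2 and +2:

import numpy as np

def normalize(raw_scores):
    """Center scores at zero and rescale to unit standard deviation."""
    raw = np.asarray(raw_scores, dtype=float)
    return (raw - raw.mean()) / raw.std()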

Using Central Bank Sentiment Data 

Although economics (and, even more so, finance) is a quantitative discipline, this approach deviates significantly from the established method of Fed watching. Many portfolio managers find value in plugging this data directly into a complex multi-factor model as a key signal. Other portfolio managers see value in using the data to better inform their qualitative strategies by eliminating bias and speeding along their existing macro analysis. Regardless of how the data is used, the most important thing to remember is that any individual score is telling only in how it fits into the larger trend.
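To emphasize the trend over any single reading, a simple rolling average of the score series is one minimal way to extract the signal; the window length below is an arbitrary choice for illustration:

import pandas as pd

def sentiment_trend(scores: pd.Series, window: int = 5) -> pd.Series:
    """Rolling mean of normalized scores: rising values suggest a
    hawkish drift, falling values a dovish drift."""
    return scores.rolling(window).mean()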

Analysis of this trend data has revealed stark correlations across asset classes, which means sentiment scores can be used to mitigate downside risk and generate significant alpha. Moreover, this method of using domain expertise to impose structure on central bank texts can serve as a template for analyzing other complex documents.

We all know that the era of big data is here to stay, but while much of the corporate world has spent years accumulating this data, it has only begun to scratch the surface in analyzing it. Domain and subject-matter expertise should be sought-after skills for automating vital analysis. Some experts will always ask whether helping to train a system will eventually make them irrelevant, but I contend that subject-matter experts should want to be on the leading edge, training the systems, so they know how they work and how to interpret the newly structured data. Similarly, data scientists should want to work with domain experts to advance data analytics efficiently, without stumbling through the dark forest.

Evan A. Schnidman is the founder and CEO of Prattle Analytics, a financial data company. He has been featured in Bloomberg News and on The Deal. Evan’s financial research has been featured in The Wall Street Journal, Bloomberg View, and Seeking Alpha, and his academic research has been published in journals and edited volumes. Evan’s financial research will be further showcased in his forthcoming book, How the Fed Moves Markets.

In his consulting capacity, Evan has vetted the political, economic, and financial risks of major infrastructure investments for large corporations. Evan has also vetted finances, management structures, and community engagement of small and midsize financial institutions to maximize relationships, tax status, and grant opportunities from the government. From 2010-2014, Evan taught courses in economics, public policy, and political science at Harvard University and Brown University.

Evan holds a Ph.D. from Harvard University as well as Bachelor’s and Master’s Degrees from Washington University in St. Louis.

