Todd Mostak’s first tangle with big data didn’t go well. As a master’s student at the Center for Middle Eastern Studies at Harvard in 2012, he was mapping tweets for his thesis project on Egyptian politics during the Arab Spring uprising. It was taking hours or even days to process the 40 million tweets he was analyzing.
Mostak saw immediately the value in geolocated tweets for socio-economic research, but he did not have access to a system that would allow him to map the large dataset quickly for interactive analysis.
So over the next year, Mostak created a cost-effective workaround. By applying his analytical skills and creativity, taking advantage of access to education and using hardware designed for computer gamers, he performed his own version of a data science project, developing a new database that solved his problem. Now his inventive approach has the potential to benefit others in both academia and business.
While taking a class on databases at MIT, Mostak built a new parallel database, called MapD, that allows him to crunch complex spatial and GIS data in milliseconds, using off-the-shelf gaming graphics processing units (GPUs) like a rack of mini supercomputers. Mostak reports performance gains of upwards of 70 times over CPU-based systems.
Mostak said there is more development work to be done on MapD, but the system works and will be available in the near future. He said he is planning to release the new database system under an open source business model similar to MongoDB and its company 10gen.
“I had the realization that this had the potential to be majorly disruptive,” Mostak said. “There have been all these little research pieces about this algorithm or that algorithm on the GPU, but I thought, ‘Somebody needs to make an end-to-end system.’ I was shocked that it really hadn’t been done.”
Mostak’s undergraduate work was in economics and anthropology; he realized the need for his interactive database while studying at Harvard’s Center for Middle Eastern Studies. His multidisciplinary background isn’t typical for a data scientist or database architect, but his hacker-style approach to problem-solving is an example of how attacking a problem from new angles can yield better solutions.
Sam Madden, the director of big data at MIT’s Computer Science and Artificial Intelligence Lab (CSAIL), said some faculty thought he was “crazy” for hiring Mostak to work at CSAIL, since Mostak has almost no academic background in computer science. But Mostak’s unconventional approach has yielded one of the most exciting computer science projects at MIT.
Madden said that when a talented person with an unusual background presents himself, it’s key to recognize what that person can accomplish, not what track he or she took to get there. “When you find somebody like that you’ve got to nurture them and give them what they need to be successful,” Madden said of Mostak. “He’s going to do good things for the world.”
Using Tweets to Challenge Assumptions
The start of MapD’s creation was Mostak’s master’s thesis, which tested the theory that poorer neighborhoods in Egypt are more likely to be Islamist. He looked at geocoded tweets from around Cairo during the Arab Spring uprising. He examined whether the tweet writer followed known Islamist politicians or clerics.
Mostak lived in Egypt for a year, learning Arabic at the American University in Cairo and working as a translator and writer for the local newspaper Al-Masry Al-Youm. He knew the situation leading up to the Arab Spring uprising better than most.
He cross-referenced the language in the tweets with forums and message boards he knew to be Islamist to measure sentiment. He also checked the time stamps to see if Twitter activity stopped during the five daily prayers.
Then he plotted the Islamist indicators from 40 million tweets, ranging from August 2011 through March 2012, against 5,000 political districts from the Egyptian census.
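The article does not describe the code Mostak used for this step. As a rough sketch of what plotting points against districts involves, the Python below assigns each point to the polygon that contains it and computes the share of flagged tweets per district. The district shapes, the tweet tuples and the function names are all made up for illustration. Assuming a naive approach like this one, 40 million points tested against 5,000 districts could require up to 200 billion point-in-polygon checks, which gives a sense of why a laptop would struggle.

```python
# Hypothetical illustration: districts and tweets below are toy data, not Mostak's.

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside polygon, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def indicator_share_by_district(tweets, districts):
    """Return {district: share of its tweets flagged by the indicator, or None if empty}."""
    totals = {name: 0 for name in districts}
    flagged = {name: 0 for name in districts}
    for lon, lat, has_indicator in tweets:
        for name, polygon in districts.items():
            if point_in_polygon(lon, lat, polygon):
                totals[name] += 1
                flagged[name] += int(has_indicator)
                break  # assumes districts do not overlap
    return {
        name: (flagged[name] / totals[name] if totals[name] else None)
        for name in districts
    }


districts = {
    "district_a": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "district_b": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
tweets = [
    (0.5, 0.5, True),
    (1.5, 1.0, False),
    (3.0, 1.0, True),
    (3.5, 0.5, True),
    (9.0, 9.0, True),  # falls outside every district, so it is ignored
]
shares = indicator_share_by_district(tweets, districts)
print(shares)
```

Every point is checked against every district until a match is found, so the work grows with the product of the two counts. Each check is independent of the others, which is the property that makes this kind of job a candidate for parallel hardware.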
In his first attempt to plot the points on the map using his laptop, he discovered it would take several days to run the analysis. Mostak said that at this point he was far from an expert at crunching data of that size, but even with optimized code the dataset was too big to get reasonable performance.
“You could do it if you had this big cluster, but what if you’re a normal guy like me?” Mostak said. “There really is a need for something to do this kind of workload faster.”
Mostak’s Harvard professors helped him get access to better computing resources to finish his thesis. His results, which he plotted on a choropleth map, suggested that more rural areas, not poorer ones, leaned towards Islamism. His thesis won first prize in the department’s annual contest.
After graduating from his master’s program in May 2012, he began a six-month fellowship at the Ash Center for Democratic Governance and Innovation at the John F. Kennedy School of Government with his advisor, Prof. Tarek Masoud, expanding his effort to analyze social media for insight on Egyptian political changes.
Database Class at MIT
But Mostak had found a new problem to solve: looking for a better way to do spatial or GIS analytics. For his last semester at Harvard, he registered for a course on databases taught by Sam Madden at the Massachusetts Institute of Technology.
Mostak said that when he first signed up for the course, he hadn’t yet encountered his big data problem. Mostak graduated from the University of North Carolina at Chapel Hill with degrees in economics and anthropology and a minor in math; he was looking to take advantage of the opportunity to learn at MIT while he still had the chance through Harvard.
But as his thesis project began to encounter problems analyzing millions of tweets, Mostak said he saw the class as a chance to better understand how to organize and query data for mapping projects.
“I wanted to know what was going on under the hood, and how to better work with my data,” he said. “That was pretty serendipitous.”
Mostak had already dabbled in programming for 3D graphics with the OpenGL API when he was making iPhone apps as a hobby. He knew how powerful the top graphics processing units, or GPUs, that hardware companies made for high-end gaming computers could be.
During the class in the spring of 2012 he learned CUDA, Nvidia’s platform for general-purpose programming on GPUs, and that opened the door to programming GPUs to divide advanced computations across the GPU’s massively parallel architecture.
He knew he had something when he wrote an algorithm to connect millions of points on a map, joining the data together spatially. The performance of his GPU-based computations compared to the same operation done with CPU power on PostGIS, the GIS module for the open-source database PostgreSQL, was “mind-blowing,” he said.
“The speed ups over PostGIS … I’m not an expert, and I’m sure I could have been more efficient in setting up the system in the first place, but it was 700,000 times faster,” Mostak said. “Something that would take 40 days was done in less than a second.”
That was with a $200, mid-level consumer graphics card. With two GeForce Titan GPUs made by Nvidia, the fastest graphics cards on the market, Mostak’s database is able to crunch data at the same speed as the world’s fastest supercomputer in the year 2000. That machine cost $50 million at the time, and ran on the same amount of electricity it took to power 850,000 light bulbs. Mostak’s system, all told, costs around $5,000 and runs on five light bulbs’ worth of power.
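To put those figures side by side, the short Python calculation below divides the supercomputer’s quoted cost and power draw by the figures given for Mostak’s system. The variable names are illustrative; the numbers are the ones reported above, and the light-bulb comparison is taken at face value.

```python
# Figures as quoted in the article; the ratios are simple division.
supercomputer_cost_usd = 50_000_000
mapd_system_cost_usd = 5_000
supercomputer_power_bulbs = 850_000
mapd_power_bulbs = 5

cost_ratio = supercomputer_cost_usd / mapd_system_cost_usd
power_ratio = supercomputer_power_bulbs / mapd_power_bulbs

print(f"Cost ratio: {cost_ratio:,.0f}x")    # prints "Cost ratio: 10,000x"
print(f"Power ratio: {power_ratio:,.0f}x")  # prints "Power ratio: 170,000x"
```

On these numbers, the 2000-era machine cost 10,000 times as much and drew 170,000 times as much power.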
Mostak said his system uses SQL queries to access the data, and with its brute force GPU approach, will be well suited for not only geographic and mapping applications but machine learning, trend detection and analytics for graph databases.
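The article does not show MapD’s schema or its SQL dialect. As a sketch of the kind of query a SQL-driven mapping tool might run, the Python below uses the standard library’s SQLite module as a stand-in engine to count tweets that mention a term inside a bounding box. The table name, columns and rows are hypothetical.

```python
import sqlite3

# Hypothetical schema and rows; SQLite stands in for a SQL engine here, not MapD.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (lon REAL, lat REAL, sent_at TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    [
        (31.24, 30.04, "2011-11-25T12:00:00", "tahrir square is full"),
        (31.23, 30.05, "2011-11-25T12:05:00", "heading to tahrir"),
        (29.92, 31.20, "2011-11-25T12:10:00", "quiet day in alexandria"),
    ],
)

# Count tweets containing a term inside a longitude/latitude bounding box.
query = """
    SELECT COUNT(*)
    FROM tweets
    WHERE lon BETWEEN ? AND ?
      AND lat BETWEEN ? AND ?
      AND body LIKE ?
"""
(count,) = conn.execute(query, (31.0, 31.5, 29.8, 30.3, "%tahrir%")).fetchone()
print(count)  # prints 2
conn.close()
```

Without an index, a query like this has to examine every row. That full scan is the brute-force workload the article says the GPU approach is meant to speed up.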
Building the GPU database became his final project for Madden’s class. The database was impressive enough that Madden offered Mostak a job at CSAIL. Mostak has been able to develop his database system at MIT while working on other projects for the university.
Madden said there are three elements that make Mostak’s database a disruptive technology. The first is the millisecond response time for SQL queries across “huge” datasets. Madden, who was a co-creator of the Vertica columnar database, said MapD can do in milliseconds what Vertica can do in minutes. That difference in speed is everything when doing iterative research, he said.
The second is the very tight coupling between data processing and visually rendering the data; this is a byproduct of building the system on GPUs from the beginning. That adds the ability to visualize the results of the data processing in under a second. Third is the cost to build the system. MapD runs on a server that costs around $5,000.
“He can do what a 1000 node MapReduce cluster would do on a single processor for some of these applications,” Madden said.
MapD and the Harvard World Map
Mostak’s MapD database system was built initially to solve problems involving millions of points on maps, and Mostak’s development efforts have started with GIS applications.
At his project’s conceptual stage, Mostak found a sounding board and technical advisor in Ben Lewis, the project manager for Harvard’s open source World Map project. The World Map is housed in Harvard’s Center for Geographic Analysis, and it serves as a free collaborative tool for researchers around the world to share and display GIS data.
Mostak’s database came at a great time for Lewis and World Map, just as the number of users began to increase and Lewis was starting to think about how the system would scale. World Map is a web service, so nearly all of the processing is done on Harvard’s servers.
Lewis has been running a Hadoop system, but at best all the batch-oriented system can do is preprocess data and prepare it for display while running in the background. Mostak’s MapD is instantaneous.
“The thing is this is a whole different animal,” Lewis said. “It’s not using [open source search database] Solr, or not using a different database. It’s truly parallelizing this stuff, and it’s actually searching on the fly through 100 million things without preprocessing. That may seem like a subtle difference, but in terms of what it can enable, in terms of analytics, it’s completely different.”
Lewis has helped Mostak get some funding for hardware to run the system, and has helped him find projects that fund the development of MapD. Through World Map, Mostak worked for the Japan Data Archive, a project to collect data from the 2011 earthquake and tsunami. The project uses MapD to display several data sets on a map instantly.
Mostak is working with Harvard to visualize the Kumbh Mela, a 55-day Hindu religious festival held only once every 12 years that is expected to draw more than 80 million people. Mostak and MapD will visualize anonymized cell phone data to analyze crowd flow and social networks.
World Map also serves as a platform for Mostak’s first visualization project, TweetMap, which allows users to look at Twitter heat maps from 125 million tweets sent in a three-week span in December of 2012.
It’s open for anyone to use and explore. The project is still in alpha, but users can enter terms and see where and when the highest density of people included those terms in tweets. “Hockey” lights up the northern United States, Canada and Sweden; “hoagie” shows the term for a sandwich is almost exclusively used in New Jersey and Pennsylvania.
The heat map is a good example of the system’s horsepower. The visualization reads each of the 125 million geocoded and time-stamped tweets and relates it to those sent nearby, gauging which areas used the terms most often and displaying the result on World Map in milliseconds.
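How TweetMap computes density is not spelled out in the article. One simple way to approximate a term heat map, shown in the Python sketch below, is to bin matching tweets into a fixed latitude/longitude grid and count them per cell. The function name, the one-degree cell size and the sample tweets are assumptions for illustration, and grid binning is a simplification of relating each tweet to those sent nearby.

```python
from collections import Counter


def term_density(tweets, term, cell_size=1.0):
    """Count tweets mentioning term per grid cell of cell_size degrees."""
    counts = Counter()
    for lon, lat, text in tweets:
        if term in text.lower():
            # Floor division maps each coordinate to the index of its grid cell.
            cell = (int(lon // cell_size), int(lat // cell_size))
            counts[cell] += 1
    return counts


# Made-up sample tweets: (longitude, latitude, text)
tweets = [
    (-75.2, 39.9, "Best hoagie in town"),
    (-75.6, 39.5, "grabbing a hoagie for lunch"),
    (-74.4, 40.5, "Hoagie time"),
    (-79.4, 43.7, "hockey night"),
]
density = term_density(tweets, "hoagie")
print(density.most_common(1))  # prints [((-76, 39), 2)]
```

The loop touches every tweet once per query with no precomputed index, the same search-on-the-fly pattern Lewis describes. A real system would also need a time filter and a rendering step.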
Open Source Plans
Mostak said he struggled with how to commercialize his idea. He filed the paperwork for a provisional patent, he said, but he’s “99 percent sure” he’s going to take MapD open source instead. He plans to keep certain parallel processing algorithms he’s written for the system proprietary, but the base of the data processing system and the computation modules will be open to everyone.
This division between open source and proprietary technologies is Mostak’s way of keeping the full-time pressures of running a for-profit entity at a distance from his research interests. “The business side was just stressing me out so much,” Mostak said. “I wanted to slow it down, and I didn’t even have time to focus on my [research] work because I was so stressed with the business side of things.
“It wasn’t any fun to me,” Mostak said. “Say [seeking private funding] had a greater income potential, I still think you can do well with open source. If it’s open source I can still work here at MIT, write research papers around some of this and then finally, maybe at some point in the future I can build a company doing consulting around this. There are a lot of companies on this model, like 10gen and MongoDB.”
Mostak said he has to clean up his 35,000 lines of code before publishing them under an open source license, which will take months. In the meantime, keeping the project in the university setting and working towards an open source license will open up some public and academic funding opportunities, he said.
Once the project does go open source, then he can rely on a community to help him build out the system.
“It’s much more exciting that way,” Mostak said. “If you think of it as the idea that people could really benefit from a very fast database that can run on commodity hardware, or even on their laptops all the way up to big scalable GPU servers, I think it would be really fun for people.”
By opening the code, Mostak opens up the possibility of having a competitor rapidly develop a system and force him to the edge of the market. That possibility doesn’t bother him, he said.
“If worse comes to worst, and somebody steals the idea, or nobody likes it, then I have a million other things I want to do too, in my head,” Mostak said. “I don’t think you can be scared. Life is too short.
“I want to build the neural nets,” he said. “I want to do the trend detection, I want to do clustering. Maybe only one or two of them are novel, and a lot of them have been done, but they’d just be very cool to do on the GPU, and really fast. If I had a few people to help me that would just be awesome.”
This version of the story has been updated to correct the power requirements for the world’s fastest supercomputer in the year 2000, and to include the name of Mostak’s thesis advisor, Prof. Tarek Masoud. The article has also been updated to correctly state that MongoDB is a project curated by the company 10gen.