Since rolling out Apache Cassandra a year ago, the Weather Channel has been able to process nearly 300 billion requests a month across platforms including TV, the web, mobile, and APIs. But the popular NoSQL database wasn’t always earning accolades.
“It used to be that Cassandra was something of a beast,” recalled Robbie Strickland, the Weather Channel’s software development manager. “It was like an old space shuttle cockpit with a million knobs.” In fact, Strickland says that before he joined the Weather Channel, using Cassandra required spending “most of the day in the Cassandra IRC channel talking to developers.”
That was then. Today, Cassandra has morphed from a highly complex technology into a scalable solution for managing large data sets. But growth and maturity aside, organizations like the Weather Channel are hedging their bets by relying on multiple platforms for different purposes.
“The days of the one-size-fits-all data store are gone,” said Strickland. “What you need to do now is think about what is the right data store to solve a particular problem.”
In the case of the Weather Channel, a potent combination of Cassandra and MongoDB ensures that all of its transactions are processed quickly and effectively. For instance, although the Weather Channel processes about 100 million transactions per day, volume can vary significantly – a factor that requires a database technology like Cassandra, which can handle high throughput and plenty of data variability. MongoDB, on the other hand, is ideal for weather data sets that won’t grow large but need to be queried in multiple ways, said Strickland.
In just a year, the Weather Channel has already discovered plenty of applications for Cassandra. Its first use case involved tracking application statistics in an effort to measure the performance of its systems and gather metrics on system interactions such as instances of latency.
Next, the Weather Channel began using Cassandra to create data mashups in its content generation system (CGS). Essentially, the Weather Channel’s CGS receives various requests to populate weather data templates. Cassandra gathers this data from a variety of sources, mashes it up, and then delivers it to consumers in a variety of forms, from an up-to-date forecast to video content.
Most recently, the Weather Channel updated its iPhone app in late April to include a new feature called Social Weather, which allows users to report the current weather conditions from their GPS location, as well as submit photos. If only one percent of the Weather Channel’s 20 million iPhone app subscribers begin using the feature, Strickland said, it will produce upwards of 200,000 additional transactions a day.
It’s a spike in traffic that Strickland said underscores the importance of Cassandra’s virtual node technology. Cassandra features virtual nodes (vnodes) that enable the Weather Channel to scale no matter how large a cluster becomes. Typically, in a Cassandra cluster, data is distributed across a number of nodes. Each node is assigned a token, and that token owns an exact range of data. Vnodes, however, allow for multiple tokens per node, so each node owns many smaller ranges of data scattered around the ring rather than one large range. As a result, when new nodes are added to a cluster, administrators are spared from having to drastically rearrange data or reassign tokens to each and every node on a ring.
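The difference between one token per node and many tokens per node can be sketched in a few lines of Python. This is a simplified model, not Cassandra’s implementation: MD5 stands in for the partitioner’s hash, tokens are derived from made-up labels rather than allocated by the cluster, and replication is ignored. The node names, key names, and helper functions are all hypothetical. The sketch counts how many existing nodes hand data to a newcomer when it joins.

```python
import hashlib
from bisect import bisect_left


def token_for(label: str) -> int:
    """Map a string to a 64-bit position on the ring (MD5 is only a stand-in hash)."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest()[:8], "big")


def build_ring(nodes, tokens_per_node):
    """Give every node `tokens_per_node` tokens; return sorted tokens and their owners."""
    pairs = sorted(
        (token_for(f"{node}/token-{i}"), node)
        for node in nodes
        for i in range(tokens_per_node)
    )
    return [t for t, _ in pairs], [n for _, n in pairs]


def owner_of(ring, key):
    """A key belongs to the first token at or after its hash, wrapping around the ring."""
    tokens, owners = ring
    return owners[bisect_left(tokens, token_for(key)) % len(tokens)]


def donors_when_adding(nodes, new_node, tokens_per_node, keys):
    """Return the existing nodes that hand keys to `new_node` when it joins."""
    before = build_ring(nodes, tokens_per_node)
    after = build_ring(nodes + [new_node], tokens_per_node)
    donors = set()
    for key in keys:
        old, new = owner_of(before, key), owner_of(after, key)
        if old != new:  # existing tokens never move, so the key went to new_node
            donors.add(old)
    return donors


if __name__ == "__main__":
    nodes = [f"node-{n}" for n in range(6)]
    keys = [f"station-{k}" for k in range(10_000)]
    for tokens_per_node in (1, 256):
        donors = donors_when_adding(nodes, "node-6", tokens_per_node, keys)
        print(f"{tokens_per_node} token(s) per node: "
              f"{len(donors)} existing node(s) stream data to the newcomer")
```

With a single token, the newcomer splits exactly one existing range, so at most one neighbor gives up data and the ring stays lopsided until tokens are reassigned by hand. With many tokens, the newcomer takes small slices from ranges all over the ring, so the load is shared without manual rebalancing. In a real cluster the number of tokens per node is controlled by the `num_tokens` setting in `cassandra.yaml`.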
“With virtual nodes, we can simply spin up new nodes, join them to the cluster and they immediately start participating with very little administrative overhead,” said Strickland, noting that the process can now be completed “in literally minutes instead of hours.”
Over the past year, the Weather Channel has grown its node count from 3 to 36 on Amazon Web Services. In addition to greater scalability, the arrangement has bolstered the organization’s repair processes. For example, without vnodes, if a node fails, an administrator must replace that node, and only the few nodes that hold replicas of its data can help rebuild it. In the end, the failure of a single node can result in as many as three nodes having to work overtime during the repair process. However, with vnodes, because data is distributed in small pieces across the cluster, all of a cluster’s nodes can participate in the rebuilding process.
“It makes it a much more efficient and faster process by having more nodes that are able to continue to handle transactions,” said Strickland.
Despite its ability to prevent data loss from node failure, Cassandra does have its shortcomings. According to Strickland, “You are going to pay a convenience price for using Cassandra. For example, the price of not being able to just run ad-hoc queries against your data. And then there’s the price of having to write your data multiple ways. These things are inconvenient at times. But if you really need a high throughput data store, [Cassandra] is an administrator’s dream.”
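What “writing your data multiple ways” means in practice can be illustrated with a small sketch. It uses plain Python dictionaries as a stand-in for tables and is not Cassandra code; the `ObservationStore` class, its fields, and the sample values are hypothetical. Each weather report is stored once per query the application needs to answer, because there is no ad-hoc query to fall back on.

```python
from collections import defaultdict


class ObservationStore:
    """Toy model of query-first design: one 'table' per lookup pattern."""

    def __init__(self):
        self.by_location = defaultdict(list)  # answers "reports for this place"
        self.by_user = defaultdict(list)      # answers "reports from this user"

    def record(self, user, location, reported_at, condition):
        """Write the same report to every table that must serve a query."""
        row = {
            "user": user,
            "location": location,
            "reported_at": reported_at,
            "condition": condition,
        }
        self.by_location[location].append(row)
        self.by_user[user].append(row)

    def for_location(self, location):
        """Read path: a single keyed lookup, no scanning or filtering."""
        return list(self.by_location.get(location, []))

    def for_user(self, user):
        return list(self.by_user.get(user, []))


if __name__ == "__main__":
    store = ObservationStore()
    store.record("alice", "Atlanta", "09:00", "rain")
    store.record("bob", "Atlanta", "09:05", "sunny")
    print(store.for_location("Atlanta"))
    print(store.for_user("alice"))
```

The trade-off matches the one Strickland describes: every write costs more because it lands in several places, and a question nobody planned for (say, all reports of rain) has no table to serve it.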
Cindy Waxer, a contributing editor who covers workforce analytics and other topics for Data Informed, is a Toronto-based freelance journalist and a contributor to publications including The Economist and MIT Technology Review. She can be reached at firstname.lastname@example.org or via Twitter: @Cwaxer.