In a very interesting and detailed blog post, a researcher at Neustar performed analytics on the open source data released by the New York City Taxi and Limousine Commission, which logs all trips driven by NYC taxis in 2013. Access to this information can be tremendously useful for city planning: determining traffic patterns, time-of-day and seasonal differences in traffic, which roads are used most and might therefore require more timely maintenance, and so on. But, as this researcher highlights, if you correlate this data with other public data, such as paparazzi photos of a celebrity getting into a cab during that timeframe, you can discover where that person went, how much the cab ride cost, and even what they tipped. Now, to be clear, this is 2013 data, so the risk of someone knowing where you went almost two years ago is not high, but it highlights the privacy risks of derived data. The author then used the location of a certain gentlemen's club in an otherwise isolated neighborhood and tracked all late-night to early-morning cab rides from that source to determine the likely homes of the customers. Maybe I was wrong: data from two years ago could be dangerous—when a customer's spouse finds out! Similar to my blog post Is Your Cat Selling You Out, I share this to illustrate how your home could be identified by sources you didn't even consider. The author goes on to discuss the topic of Differential Privacy, which could be a solution for protecting privacy while releasing public data; but that is beyond my comprehension, so I'll move on to my point.
What I find smart about this is the intelligence-gathering approach the author took. He starts by knowing what questions to ask, what data he needs, and where to get it in order to derive the answers. Doing correlation and analytics to get an answer is not a simple automated process—the first time. Once you develop the approach, it can be automated going forward. In Cyber Security, many folks want to buy a SIEM correlation tool, point all their logs at it, and expect it to give them answers that stop intrusions and malware. It's not that easy; as this author shows, having a set of data is just a starting point. You then have to figure out WHAT you want to know about the data, and include other data that provides context, relevance, and timeliness (if timeliness is part of the question).
One of the trending capabilities in Cyber Security is the use of 3rd party threat feeds. These are sources of data on known bad IPs, URLs, and domains that have been identified as hosting malicious web sites or malware, or as the source of a phishing campaign or a botnet controller. This data is very useful if you have a way to ingest it and apply it to your security process (as well as a process to respond and remediate). But like the taxi data, if you just look at it flatly, you only get simple answers: charting the longest trips, the most expensive cab rides, or the most common taxi pickup locations—interesting, but none of it actionable. This is similar to how folks use a SIEM: yes, I can get stats about how many bad things happen, which hosts are most active, what the most common sources of traffic are, etc. None of which is actionable. What we need to do is figure out what information is actionable, such as: when is one of my critical boxes talking to a known bad Internet host? Or, when is a known spear phishing actor the source of an otherwise innocuous email sent to my CEO? Start with what answers you want, and then figure out how to ask the question—don't just let the data tell you.
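As a minimal sketch of that "critical box talking to a known bad host" question, the logic boils down to joining your connection logs against two sets: a threat feed of bad IPs and an inventory of critical assets. Everything below is illustrative—real deployments would pull the feed from a provider (e.g., via STIX/TAXII or CSV) and the logs from a SIEM, not in-memory lists, and all the IPs and host names here are made up for the example:

```python
# Threat feed: known bad Internet hosts (illustrative, documentation-range IPs)
bad_ips = {"203.0.113.7", "198.51.100.23"}

# Asset inventory: which internal hosts are considered critical
critical_hosts = {"10.0.0.5", "10.0.0.12"}

# Outbound connection log records: (source host, destination IP)
connections = [
    ("10.0.0.5", "203.0.113.7"),    # critical host -> known bad IP: actionable
    ("10.0.0.99", "198.51.100.23"), # non-critical host -> bad IP: lower priority
    ("10.0.0.12", "192.0.2.44"),    # critical host -> unlisted IP: no match
]

# The actionable question: critical host AND known bad destination.
# Filtering on both conditions is what turns raw stats into an alert.
alerts = [
    (src, dst)
    for src, dst in connections
    if src in critical_hosts and dst in bad_ips
]

for src, dst in alerts:
    print(f"ALERT: critical host {src} contacted known bad host {dst}")
```

The point is that the feed alone tells you nothing actionable; it is the intersection with your own context (which boxes matter) that produces the answer you actually wanted.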
– Rick Doten, DMI Chief Information Security Officer (CISO)