January 2013

DJ Patil, formerly of LinkedIn, recently wrote a great article [1] on the
development of 'data products', loosely defined as those which
'facilitate an end goal through the use of data'. The main theme is that
such products should be approached in the same way that a Jujitsu fighter
approaches a fight: by clever manipulation rather than brute force. 'In my experience', says Patil, 'meeting the problem head-on is a recipe for disaster'. Instead, he continues, 'there’s a method to solving data problems that avoids the big, heavyweight solution, and instead, concentrates on building something quickly and iterating'. In parallel with the martial art, Patil refers to his methodology as 'Data Jujitsu'. To paraphrase brutally, a black-belt of the art is one who has demonstrated competence in handling the common foes of:
Naturally, Patil exemplifies Data Jujitsu with references to social media and adjacent industries, but his points generalize easily to the Smart Grid.

Some elements of a good data product

Though by no means a complete definition, Patil provides some very interesting characterizations of a good data product: the thing has to be accurate, it has to be grounded in the real world of things, and it has to refrain from vomiting on its users.

Accurate

A data-product is one that translates raw data into insights or automated
actions. Evidently, inaccurate insights are useless, and automated actions could be downright dangerous if they are not highly informed. So cross-validation, RMSE, model selection: it's all good stuff. However, Patil warns that accuracy is not something that the Quants can define alone because it can often be subjective. Even the best algorithms in the world won't be able to account for the full variation of personal preference and so, from the outset, algorithms are doomed to under-perform or outright fail for some people. The Californian utility PG&E has learnt this lesson the hard way after rolling out their smart meters before an uncommonly hot summer. The smart meters were repeatedly demonstrated to work, but the public remained discontent because their year-on-year bills had increased. And so, inexorably, PG&E found themselves mounting a PR campaign based on Thermodynamics 101.

The costs of algorithmic failure in the Smart Grid are perhaps middling. Though there's probably a smaller risk of burning through $440 million per day than with the algorithmic failures of quantitative finance, there's probably more at stake than simply recommending the wrong movie. Either way, there should always be a manual mode...

Grounded in the real world

Patil recommends that data-products be grounded in the real world, comparing Amazon's suggestion to 'browse similar items' to what you'd do naturally in a physical shop. The rationale is clear: by drawing parallels with previous experiences, your product becomes far more intuitive for the user. To this end I predict a backlash against those products which attempt to quantify energy savings in terms of green leaves, smiley faces, or other such vagaries. These things are not real. The only real ways to talk about savings are kWh (for the professionals) or dollars (for the general public).

Data vomit

Data vomit makes your head reel, and induces you to consider a simpler life. Though it might include chunks of value, data vomit is rarely of any use at all. If you've ever spent any time with a computational physicist you'll know about data vomit. It's easy to confuse data for knowledge, but they are not the same...
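
To make the last two points a little more concrete, here is a minimal sketch of collapsing a pile of raw meter readings into one grounded figure, expressed in kWh and dollars. The function name, the readings, and the flat tariff are all hypothetical and chosen purely for illustration; a real product would have to deal with time-of-use pricing, missing intervals, and a defensible baseline.

```python
from dataclasses import dataclass


@dataclass
class SavingsSummary:
    kwh_saved: float
    dollars_saved: float


def summarize_savings(baseline_kwh, actual_kwh, price_per_kwh):
    """Collapse two series of interval readings into one grounded figure."""
    if len(baseline_kwh) != len(actual_kwh):
        raise ValueError("baseline and actual series must have the same length")
    # Total energy saved across all intervals, in kWh.
    kwh_saved = sum(baseline_kwh) - sum(actual_kwh)
    # Assumption: a single flat price per kWh.
    return SavingsSummary(kwh_saved=kwh_saved,
                          dollars_saved=kwh_saved * price_per_kwh)


# Hypothetical hourly readings (kWh) and a hypothetical flat tariff ($/kWh).
baseline = [1.5, 2.0, 2.5, 2.0]
actual = [1.0, 1.5, 2.0, 1.5]
summary = summarize_savings(baseline, actual, price_per_kwh=0.25)

# One sentence for the user instead of a table of raw readings.
print(f"You saved {summary.kwh_saved:.1f} kWh (${summary.dollars_saved:.2f})")
# prints: You saved 2.0 kWh ($0.50)
```

The user sees one sentence in units they already understand, and the raw readings stay out of sight.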
It is explicitly the role of the data-product to assimilate, process, and filter data in order to generate high-quality knowledge for the user. Don't vomit on your users.

Building the darn thing

The best product in the world isn't worth a dime until it's built, and unfortunately these things can be complicated. In the worst-case scenario you end up with several PhDs who demand whole rooms of whiteboards and three months' 'thinking time' before they write a single line of code (ahem). Patil discusses some concepts that seem like pretty basic product management such as 'minimum-viable product' and opportunism. However, he also provides a couple of suggestions that are more specific to data-products. Both involve using humans.

Using mechanical turks

This is quite simple: instead of building a complex machine learning algorithm, just pay a human to do it. They'll probably do a pretty good job and Amazon's mechanical turks can complete tasks for just cents. Depending on how quickly you scale, this interim solution could well buy you months of development time and, as a bonus, you'll have a much better idea of the what and how by the time you unleash the PhDs. Along these lines, Twitter uses humans to identify and categorize trending topics to determine, for example, that #bindersfullofwomen refers to politics rather than office accessories.
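
If you do put humans in the loop, you will probably want several of them to look at each item and then reconcile their answers. The sketch below uses a simple majority vote. This is my own assumption about how one might do it, not something Patil or Twitter describes; the function name, the threshold, and the worker answers are all hypothetical.

```python
from collections import Counter


def majority_label(worker_labels, min_agreement=0.5):
    """Return the most common label, or None if agreement is too weak."""
    if not worker_labels:
        return None
    label, votes = Counter(worker_labels).most_common(1)[0]
    # Require strictly more than min_agreement of the workers to agree.
    if votes / len(worker_labels) <= min_agreement:
        return None  # no clear winner: ask more workers or escalate
    return label


# Hypothetical answers from five workers asked to categorize a hashtag.
answers = ["politics", "politics", "office supplies", "politics", "politics"]
print(majority_label(answers))  # prints: politics
```

Items that come back as None are exactly the ones worth a second look, and the labels collected along the way make useful training data once the PhDs are unleashed.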