Mainly Data
An exploration of people and data management, the evolution of learning and the scientific method in an era of data-intensive distributed computing, and efficient knowledge capture and distribution using the web. Probably other stuff, too.
Hadoop at Facebook
There’s a post on the Facebook Engineering blog today from one of the Data team engineers, Joydeep Sen Sarma, discussing how we use Hadoop here at Facebook. Check it out.
For more on Hadoop at Facebook, you can check out the slides from a set of lectures I gave at IBM’s Cloud Computing Center in Dublin.
For a deeper look at the architecture of HDFS, check out the presentation [PDF] that dhruba, a recent addition to the Data team, gave at the IBM storage team’s recent offsite.
Participatory Sensing, a project from the Center for Embedded Networked Sensing at UCLA, reconceptualizes your mobile phone as a diverse collection of sensors that can be used to record data throughout your day. This data is then shipped to a centralized aggregation service that provides different views and analyses of your data. The Reality Mining project from MIT takes a similar perspective but has more targeted goals.
Both of these research projects were conducted in conjunction with Nokia. I gave a talk over at Nokia Research a few months back and was fascinated by their take on the future of the internet. They’re positioning Ovi as your portal into the data collected via all of your Nokia sensors. I’m quite interested to see how their offerings evolve, particularly in parallel to the iPhone and Android.
The Peach open movie project, initiated by the Blender Foundation and hosted at the Blender Institute in Amsterdam, is an innovative project to produce high-quality digital media with open source tools.
Another interesting project in this space is Justin Frankel’s REAPER project. These tools, in addition to the venerable GIMP, are rapidly commoditizing the software needed to produce digital media of professional quality.
Unfortunately, I lack the artistic talent required to make nontrivial projects with these tools, but I’m looking forward to consuming the creations of my more talented friends. In addition, a lower barrier to the creation of digital media means a rapid increase in the amount of multimedia data to be stored and analyzed. Doing machine learning over audio, image, and video data is something that Hadoop should handle well. If anyone has a project using Hadoop to do data mining over a multimedia data set, let me know!
Keeping up with the Facebook Data Team
I’m going to start collecting public presentations and coherent commentary on our work so that I have a single point of reference when people ask me, “What does the Data team do?”. If you see any mentions of the Facebook Data team in the wild, pass them my way, and if you are interested in the problems we’re working on, drop me a line!
p.s. I just learned that del.icio.us lets you create descriptions for your tags. Pretty handy. I’ve always found del.icio.us to be the most useful site on the internet; I can’t tell if I’m happy or depressed that Yahoo hasn’t done much with the product.
peer reviewed journals for source code and data: narrative forms for the modern scientific method
greg wilson has a great post about a new journal devoted to source code for biology and medicine. it’s a fascinating idea, and greg asks, “why don’t we do this?”. clearly there are many reasons; one is the significant financial reward awaiting programmers for solving problems better than others for an extended period of time. in the research community, the prestige of publishing a result often outweighs the financial rewards of keeping that result to yourself.
even for computer science professors, the allure of financial rewards keeps them from providing their code to the open source community for criticism: see, for example, ken birman’s licensing of the astrolabe [pdf] source code to amazon for a large sum.
suppose we could align incentives appropriately and a healthy community of peer reviewed journals for source code emerged. how would you structure one of the articles in these journals? they would probably take inspiration from don knuth’s literate programming (see also knuth’s book on the topic). tools for literate programming seem to be getting more sophisticated recently (especially in the two languages i use most, python and r). another related idea is reproducible research, which involves publishing empirical data in addition to code. it’s a logical extension of literate programming to the scientific realm.
as an aside, it’s somewhat notable that these journals are emerging outside of the computer science community, where code is needed primarily to manufacture results. i suppose it’s indicative of the strange relationship computer science professors have with programming, as pointed out to me by a professor of statistics this weekend.
in fields where i meander from time to time, there has been some progress on this issue. well, sort of, though none are quite analogous to the journal highlighted by greg’s post. statistics has the journal of statistical software; machine learning has mloss; and databases have the vldb experiments and analyses papers.
i’ve always enjoyed books that use actual code to illustrate their ideas (example one, example two), and i’d love to see this trend extend to the academic literature. as we all adjust to the new tools of modern science (hypotheses, code, and data), having a unified narrative method for conveying your results to other researchers will grow in importance. it looks like medicine and biology are moving pretty quickly on this front; if you know of any other examples that illustrate how certain fields are moving forward with literate programming or reproducible research, please send them my way!
scalable messaging
has twitter hired someone from a financial services firm? their funding comes from new york, where they know a thing or two about scalable messaging, and the finance sector isn’t exactly booming. i’ve never used twitter, but if each message is immutable and limited in size, this problem should look quite familiar to the finance types, yes?
mathematicians and infrastructure
we were recently invited to give a talk at this year’s sigmod. it’s quite an honor. another talk on the same industrial track is being given by some folks from google about megastore, a layer they’ve written on top of bigtable to make it easier to build web applications. the full abstract and authors list is below.
the last author is a former classmate of mine at harvard and a fellow mathematics major. we recently had another former mathematician come by the facebook offices to present his work on distributed storage: peter braam, who architected the lustre file system.
Megastore: A Scalable Data System for User Facing Applications
JJ Furman, Jonas S Karlsson, Jean-Michel Leon, Alex Lloyd, Steve Newman, and Philip Zeyliger
Megastore provides a rich model and API that facilitates implementation of user facing applications storing data in Bigtable. Our goal is to enable Google developers to quickly build and launch highly available applications at Google scale. We extend Bigtable to provide strong consistency guarantees and higher level abstractions such as transactions, secondary indexes and synchronous replication. Megastore takes a practical approach to schema management, providing integrated declarative schemas with rich data extensions, such as logical data partitioning, which is key to achieve high performance querying and scalable massively parallel transactions.
measuring an ad bundle
in my last post, a few firms were mentioned as possible partners for the major studios and brand advertisers as they search for a consolidated and effective set of metrics for evaluating the success of their new advertising products. clearly the list was not complete; here are two more firms that could help measure the success of ad bundles:
- omniture: for a brand advertiser, web analytics is no more exotic than old school performance marketing. however, because the internet is a new technology (bear with me, we’re talking about media firms here) and each website has a different structure, it’s much more difficult to produce relevant metrics across sites and campaigns. omniture has a strong hold on the traditional web analytics market and could make a strong move into integrated ad bundle measurement as more ad spends move online. if they can complete the visual sciences integration and keep up the strong revenue growth, they’ll be well positioned to grow into the larger audience measurement space.
- spot runner: spot runner is creating a really compelling proposition for getting your campaign on television. given the rise of ad bundles, and because spot runner is native to the web, they may be in the strongest position to capture some of the integrated audience measurement business that will be emerging. it looks as though they’ll use their new war chest to expand the basic product into other markets; in addition, i hope they keep an eye on the smaller audience measurement firms mentioned in the last post. if they can start to hit a meaningful run rate, an acquisition would be a reasonable next move.
media web trail: follow up
in a previous post i poked at a few things going on in media that seem significant: the diversification of distribution channels for content producers and the proliferation of potential products to be packaged and distributed throughout the content creation process.
today fortune ran an article about the decreasing importance of upfronts. according to the article, media companies used to lock up 80% of their advertising revenues through this single source; in recent years, the sales process has evolved into a continuous engagement between advertisers and content producers. apparently the content producers are having some success selling “ad bundles”, which are “cross-promotional deals where advertisers purchase traditional spots on broadcast TV along with, say, the sponsorship of a program when it is streamed on the network’s website”.
for these ad bundles to grow, the content producers really need standardization of the various components of an ad bundle. they also need metrics to assess the effectiveness of the ad bundle, and the advertisers need a way to relate the metrics for the ad bundle to the metrics for a pure television spot.
a smart network is going to align themselves with some of the sharper firms doing analysis of brand advertising through channels other than television: quantcast, buzzlogic, admob, and the rest.
if you’re aware of any media partnerships with these media measurement firms that have proven successful with advertisers, drop me a line! i’m especially interested to learn if any firms are using immi: their technology looks rad.
statistics and government organizations
while digging up some numbers for my previous post i came across some interesting government organizations related to statistics. given that science is measurement, let’s take a look at some government organizations that aggregate measurements.
in the eu, there’s eurostat, whose focus is on pulling together these entities from across their member states. in the uk, last month saw the creation of the uk statistics authority, as required by the statistics and registration service act of 2007. the goal of this new, independent entity is to monitor and report on all official statistics and to provide oversight for the office for national statistics. all of this reform is apparently the result of a eurobarometer survey which showed that british people trust their official statistics less than the citizens of any other member nation. in france, another nation known for its history of achievements in statistics and probability, things seem to be a bit more settled: the insee has been operating since 1946.
in the united states, we have an array of statistics-gathering entities accessible via the fedstats portal. most critical statistics are collected via the economics and statistics administration, under the department of commerce, or via the bureau of labor statistics, which is under the department of labor. the cia, an independent agency, produces its own set of worldwide statistics and makes them available via their factbook. i wonder what kind of master data management or data quality initiatives they have in place to make sure these numbers are aligned?
let’s take a look at a few other countries. in china, the national bureau of statistics has a website that is easy to use and frequently updated. israel’s central bureau of statistics seems to be a similarly modern institution, but india’s ministry of statistics and programme implementation has a website due for an overhaul.
there are a few international organizations with similar missions, as well: the un has the statistics division and the oecd has sourceoecd. for information on countries not mentioned above, both the un and the oecd have links to their respective statistics authorities.
in a future post, i hope to go deeper into what sorts of statistics these entities collect and what problems they hope to address with the intelligent application of statistics. i wonder if they have any data warehouse architects on hand to treat their organizations as giant data integration challenges; if you have any information on the history of any of these institutions or their technical underpinnings, drop me a line! i picked up schumpeter’s history of economic analysis hoping to find some leads myself.
of course the folks at gapminder and swivel, among others, have started collecting this data for you to manipulate. but while the data is interesting, it’s primarily the ways in which the data and analyses influence policy that interests me. i’d love to see a site collecting mentions of data in the deliberations of government so that we can better understand how the data we collect is ultimately used to make decisions in government.
microbig: follow-up
first, it annoys me that you can’t reblog your own post. see below for the precursor to this post.
second, my friend zander was kind enough to point me to an interesting candidate for the distinction of “microbig”: paul, a bakery in france.
paul is a well-known brand in france with over three hundred bakeries throughout the country. they just recently opened their first bakeries in america and their global presence is expanding: they have around 100 shops outside of france. in their most recent company information sheet, they claim to serve over 5 million customers per month. let’s stop now to formulate our first (admittedly imprecise) requirement for being a microbig:
1. significant involvement in the lives of over 1 million people
now, let’s examine their impact on the world’s financial ecosystem. because paul is privately held by the holder groupe, we can’t inspect their financial results, though they do offer information on their market size. according to paul, as of 2001, the bakery and pastry products market totals 8.2 billion euros, with 54% of the market going to bread and 36% to pastries (don’t ask about the remaining 10%, apparently). of this market, artisanal businesses account for 71%. i have no idea where they get these statistics from, so let’s pretend they don’t exist. instead, we’ll look at a related company: panera bread.
panera’s an interesting company; i first learned about their history from my friend brian a few weeks ago. au bon pain was founded in 1978. in 1994, they bought a company called the st. louis bread company. in 1999, they rebranded the st. louis bread company offerings outside of st. louis as “panera bread company” in an attempt to give the chain national appeal. at the same time, they sold off all of their non-panera divisions, including au bon pain, and renamed the company “panera bread company”.
anyways, on to their financials: as i write, NASDAQ:PNRA has a market capitalization of USD$1.56 billion. a sizable company, certainly, but not a giant. they pull in over USD$1 billion a year in revenues but only scratch out USD$57 million in income after taxes. that’s pretty minuscule. i’m not really sure i want to focus on income/earnings, however; panera has a pretty sizable impact on the financial ecosystem given their revenue, financing activities, franchising program, and real estate dealings.
so, let’s formulate a second imprecise criterion for being a microbig:
2. annual revenues of under USD$100 million
this clearly is not a perfect metric but i want to start somewhere. does paul qualify, then? well, according to their corporate website, panera has 1,168 locations, 493 of which are company-owned. given that franchises account for significantly less revenue than company-owned locations, and given that paul has approximately 400 locations worldwide, i would expect that paul is at least within an order of magnitude of panera’s revenues. since panera pulls in over USD$1 billion a year, that would put paul above USD$100 million in annual revenues. thus, they’re not a microbig.
further consideration of what constitutes a microbig makes me think we need one more criterion to match gillmor’s intended usage:
3. people are aware of the involvement of the microbig in their life
thus we can rule out parts suppliers and other silent microbigs and just focus on those with widespread mindshare who lack a significant proximal impact on our financial ecosystem.
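as a rough sketch, the three criteria above can be written down as a simple check. the class and field names are made up for illustration, the revenue figure for paul is only the lower bound implied by the order-of-magnitude guess above (paul does not publish revenues), and the second company is entirely hypothetical.

```python
from dataclasses import dataclass

# thresholds taken from the three (admittedly imprecise) criteria in the post
MIN_PEOPLE_REACHED = 1_000_000
MAX_ANNUAL_REVENUE_USD = 100_000_000


@dataclass
class Company:
    name: str
    people_reached: int         # criterion 1: people significantly involved
    annual_revenue_usd: float   # criterion 2: annual revenues
    publicly_recognized: bool   # criterion 3: people know it is in their lives


def is_microbig(company: Company) -> bool:
    """Return True only if all three criteria hold."""
    return (
        company.people_reached > MIN_PEOPLE_REACHED
        and company.annual_revenue_usd < MAX_ANNUAL_REVENUE_USD
        and company.publicly_recognized
    )


# paul: 5 million customers per month (from their company sheet); the revenue
# is an ASSUMED lower bound, one tenth of panera's USD$1 billion-plus.
paul = Company("paul", 5_000_000, 100_000_000, True)

# an entirely hypothetical bits-not-atoms company, for contrast
hypothetical_site = Company("hypothetical site", 10_000_000, 5_000_000, True)

print(is_microbig(paul))
print(is_microbig(hypothetical_site))
```

with these assumed numbers paul fails the revenue criterion, which matches the conclusion above; a parts supplier would instead fail on the third field.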
note that both paul and panera make physical goods and require a physical presence to distribute these goods. i am thinking we’ll find microbigs elsewhere, delivering bits rather than atoms. ubiquitous platforms for the delivery of bits include televisions, movie theaters, mobile phones, personal computers, and perhaps digitized billboards, among others.
so, yeah: any thoughts on good candidates in these domains?
Want Ad: Beautiful Minds
Met Brad Grossman this morning (profiled in above New Yorker piece) who up until three weeks ago was Brian Grazer’s cultural attaché, a job concept that he and Grazer apparently popularized in Hollywood. Basically, Brad was responsible for keeping Brian informed of any and everything, from organic chemistry to the outcome of the World Series, as well as arranging for him to meet interesting people on a weekly basis. For this he was paid handsomely and flown around with Brian everywhere on a private jet. Brad struck out on his own and now Brian’s looking for a replacement.
And I thought leaving my perfect job at CV was hard ….
so what is brad doing now? who was the most interesting person he introduced to grazer? i need to know these things, zach.
media web trail
producing quality content for the consumption of many is a fairly involved process. at some point, a media company decides to snapshot the product and begin distributing that snapshot through whatever channels they have available: radio, television, newspapers, magazines, those television screens in elevators, tied to the back of an airplane, whatever.
the process continues after the snapshot, as consumers of that media place it in context and annotate its content. a creative mind can apprehend the value in all components of this process, both prior to and after the snapshot, and package many products in addition to the primary output of the project: “making of” shows, outtakes, soundtracks, etc. each of these secondary outputs requires rigorous control of the production process so that you can capture the secondary outputs and manipulate their distribution to generate revenue.
there are so many interaction points (distribution channels) now between the consumer and all parts of the production process that modern media companies are not able to keep up with proliferation. as traditional distribution channels throw off less and less revenue, media companies are scrambling to restrict and better define the alternative interaction points with consumers: after all, you must define before you can control.
some interesting movements recently in the world of modern media:
- one: cbs, trying to use showtime’s recent critical success to strengthen their bargaining position with the pay station’s content partners, has seen their strategy backfire: viacom, their recently separated twin, is working with mgm, lions gate, and now blockbuster (wtf?!) to aggregate content into another premium cable station (read: distribution channel).
- two: cbs had an okay first quarter without the superbowl, even upping their dividend (is that bravado?). they break out their $3.7 billion in revenues into three operating segments (distribution channels): television ($2.6 billion), radio ($364 million), and outdoor ($497 million). given that they are creating television shows, own the rights to premium sports content, and now have a little movie-making division, i’d love to see them reconceptualize their operating segments around properties of content production and report on which content performed well in the different distribution channels. they’re sitting on almost a billion dollars in free cash flow and are clearly looking to make acquisitions in the online and outdoor distribution channels, but i’d be interested in seeing them reconsolidate around their strength as a content creator.
- three: youtube, an alternative distribution channel for video content, is probably starting to realize they can’t monetize the long tail of ugc; they’re going to need premium content to drive alternative revenue generating strategies. major media companies, of course, are trying to create yet another alternative distribution channel rather than realizing their position of strength as premium content creators. it’s cool, though; hulu is light years ahead of youtube on user experience. that’s what happens when you have to focus on scaling your infrastructure and preventing abuse instead of offering a good user experience.
- four: publicis, who knows how to create and distribute content, are starting to turn the crank on their digitas acquisition. their partnership with google is compelling; to be honest, i think they have more to teach google than to learn. i’d love to see content producers just get it over with and merge with these agencies.
- five: twx finally washes their hands of twc. some interesting quotes in here from bewkes: “in a fragmenting world, we think brands will increasingly matter more, not less” and “we believe that ultimately all packaged media will move to digital distribution”. these statements are fairly lucid predictions, but i’m not sure twx is well positioned to benefit wholly from either trend.
Gillmor coins the term "microbig"
dig the style, and he nails his analysis of consumer internet properties that have evolved over the last five years: their impact on people’s lives and the media far outdistances their impact on capital markets.
i’d like to find a historical parallel of companies that permeated the daily lives of millions without generating significant revenue or impacting capital flows in a major fashion. anybody?