The O’Reilly Radar had an excellent post on the importance of public data, pointing to a recent debate in the UK over the hidden value of government data.
For years, there have been groups fighting to make government data free and available to the public. As technology evolves, this is more and more feasible.
Unfortunately, much of the focus of the article is on data the government has not released, largely ignoring that huge amounts of data that are “publicly available” are not really “available” in any usable format.
Did you know there is no single repository where you can download all the data available from the US Census Bureau? This is data that is supposedly freely available to everyone.
The pieces you do get – via multiple web sites like FactFinder or software installs like CSPro – are largely unusable without an advanced degree in demographics or computer science or both.
A recent conversation with a senior official from the US Census Bureau:
Me: “Hey, so, I’d really like to get a hold of all the census data. It would be great to have in Swivel.”
Census guy: “Well, which data are you interested in?”
Me: “All of it.”
Census guy: “Hmmm. Do you mean the economic census or the household?”
Me: “Both”
Census guy: [Blank stare] “That is a lot of data”
Me: “Yes, yes it is. So, know where I can get it?”
Census guy: “Have you tried the fact finder?”
Me: “Yes, great if you know what you want going in and only want a little at a time, but I want it all. Can’t you put everything onto an FTP server somewhere and I can harvest it all and ongoing as new data are available?”
Census guy: “Um, no. We have nothing like that. It is too hard.”
It is not too hard. We aren’t asking the right questions. We are only asking: "Why are you charging for data? And what data are you hiding?" We aren’t asking: "Why aren’t you making all of our data freely available in a way that makes sense?"
The Radar post raises important issues, but they are just the tip of the iceberg.
While I agree with you that the availability of federal data on the web is terrible (I spent an hour trying to walk an office assistant through finding our OWN data on the IPEDS PAS website, and she's darned smart but the interface is utterly baffling), I think that any responsible provision of data is in fact "hard" or at least incredibly time-consuming. Data have to be described and documented in order to be meaningful, so a giant file of census data on the web would need to have reams of documentation associated with it. Ideally, the enormous data sets would be broken down into subsets of interest. These are available in a variety of ways, but the fact is that you really *do* need to understand what you're looking for, and "fishing expeditions" (even when they're off tips of icebergs) are at great risk for terrible misuse and misinterpretation of data. I'm all for making public data public, but I'd like to ask the federal government to do so in a meaningful and responsible way, and I don't think that means slapping ascii files up on a server without documentation. If data availability were the main issue, we'd be done: it's interpretation and management of data that are the real challenges.
Posted by: mamacate | November 27, 2007 at 02:11 PM
That is an excellent point, thank you. The issue of data curation is crucial to the understanding and use of data. Furthermore, data curation is time-consuming and expensive. Academics have been looking into this issue for years, and governments could learn from their lead. Have a look at some of the work being done at places like ICPSR at the University of Michigan ( http://www.icpsr.umich.edu/ ), the Institute for Quantitative Social Sciences at Harvard ( http://www.iq.harvard.edu/ ) or the Data Enclave at NORC University of Chicago ( http://www.norc.org ).
Posted by: Sara Wood | November 27, 2007 at 02:36 PM
It seems as though the problem is a separation of authority from technical knowledge. The people with authority don't have technical knowledge (Marybeth Peters, US Register of Copyrights, is a self-proclaimed Luddite) and assume it's not possible (and/or misinterpret regulations). Then, the people with knowledge are either not in the government or are lazy and tell their bosses that it can't be done.
Those with the motivation and skill to rise to a position of authority who also possess technical knowledge may tend to the private sector, where they can, you know, make money.
Posted by: Matt | November 27, 2007 at 07:40 PM
I agree strongly with mamacate's comment above. Making those data available with machine readable metadata is not a trivial task, even if there were a standardized metadata format that could simply be applied to those data. The Umich link above seems to describe one effort to standardize on such a format (the Data Documentation Initiative).
Aside: When I clicked on the links above for www.iq.harvard.edu and www.norc.org, I found that they are bad links. I think these are just typos or stray characters that have corrupted the links.
Posted by: Jon Koomey | January 11, 2008 at 12:50 PM
There is a lot of historical census data from the IPUMS project at U. Minnesota, http://usa.ipums.org/usa. And yes, understanding it and interpreting it is complex. But still, having it there is way better than not.
From their description:
"What is IPUMS?
The Integrated Public Use Microdata Series (IPUMS) consists of thirty-nine high-precision samples of the American population drawn from fifteen federal censuses and from the American Community Surveys of 2000-2006. Some of these samples have existed for years, and others were created specifically for this database. The thirty-nine samples, which draw on every surviving census from 1850-2000, and the 2000-2006 ACS samples, collectively comprise our richest source of quantitative information on long-term changes in the American population. However, because different investigators created these samples at different times, they employed a wide variety of record layouts, coding schemes, and documentation. This has complicated efforts to use them to study change over time. The IPUMS assigns uniform codes across all the samples and brings relevant documentation into a coherent form to facilitate analysis of social and economic change. "
Posted by: Joe Hellerstein | January 11, 2008 at 03:06 PM
“Can’t you put everything onto an FTP server somewhere and I can harvest it all and ongoing as new data are available?”
You can get almost all of the census datasets for free here:
ftp://ftp2.census.gov/
http://www2.census.gov
Posted by: Bill | January 16, 2008 at 06:13 PM
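For anyone who wants to try the "harvest it all and ongoing" idea against the FTP host Bill mentions, here is a minimal sketch in Python. It rests on assumptions the thread does not confirm: that the server accepts anonymous logins, that it supports the MLSD listing command, and that comparing file size and modification stamp is enough to spot new or changed files. The function names and the manifest layout are hypothetical, chosen only for illustration.

```python
import os
from ftplib import FTP


def list_remote(host, directory):
    """Return {filename: (size, modified)} for the files in one directory.

    Assumes anonymous login and MLSD support (both unconfirmed assumptions).
    """
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        listing = {}
        for name, facts in ftp.mlsd(directory, facts=["type", "size", "modify"]):
            if facts.get("type") == "file":
                listing[name] = (int(facts.get("size", -1)), facts.get("modify", ""))
        return listing


def files_to_fetch(remote, manifest):
    """Names that are new or changed since the last harvest, sorted.

    `manifest` has the same shape as `remote` and records what was
    downloaded on the previous run.
    """
    return sorted(
        name for name, stamp in remote.items() if manifest.get(name) != stamp
    )


def download(host, directory, names, dest):
    """Fetch the named files from `directory` into the local folder `dest`."""
    os.makedirs(dest, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()
        ftp.cwd(directory)
        for name in names:
            with open(os.path.join(dest, name), "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
```

A run would call `list_remote("ftp2.census.gov", some_directory)`, pass the result and the saved manifest to `files_to_fetch`, download only those names, and then save the new listing as the next manifest. Note that this only moves files around; it does nothing about the documentation and interpretation problems raised in the comments above.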