Coming soon: Data mining made easier
SHOWCASE | July 11, 2009
New York Times and ProPublica editors have won a $700,000 grant, the Knight News Challenge’s largest award, for designing an archive that makes documents used in investigative reporting available for future use by reporters and others. As one of the designers puts it, it will provide tools that “there’s no way they would have had access to otherwise.”
By Alex Byers
Maybe you've been there.
Months of investigative research and reporting culminate in hard-hitting, exclusive stories. Other news organizations play catch-up, and your own follow-ups keep the story running. And then it fades away, referred to rarely and found only by an archives search online. The reports, transcripts and documents you gathered find a home in a dusty corner of your filing cabinet, likely never to be seen again by new inquiring minds.
It's exactly the kind of thing that the founders of DocumentCloud want to change.
DocumentCloud, the brainchild of ProPublica editors Scott Klein and Eric Umansky, New York Times Interactive News Technologies Editor Aron Pilhofer, and Times software engineer Ben Koski, is an online database that could change the way the public consumes investigative reporting. The largest grant winner in this year’s Knight News Challenge, the foursome will receive more than $700,000 to launch an online, searchable database that will allow journalists and the public to find, inspect, and contribute original source documents gathered from investigative reporting.
“We want to take all these documents that all these organizations are collecting and acquiring via FOIA, and we would like to make them easier for people to find, easier for people to share, easier to search,” Pilhofer said. “We want to take advantage of some of the incredible advances in data mining and text mining technology that we’ve seen over the last couple of years.”
The project is designed to be more than just an aggregation of public documents on the Web, however. DocumentCloud will give news consumers and journalists the ability to look back and find older documents that might have been used in a previous investigation. Sometimes documents will have additional and previously unknown value after their first use, Pilhofer said.
In addition, the team wants the database to link documents by several criteria, such as location, topic, company, and others. That would give users the ability to search for all documents related to a specific entity and originating near a certain city, Pilhofer said.
“Now you actually have meaningful entities that you can use to link documents together,” he said. “Something you could do, for example, would be to say ‘show me all the documents that reference IBM that also reference a place within 50 miles of New York City.’ Those are the kinds of searches you could do that you just absolutely cannot do any other way.”
Essentially, journalists and the public will be able to search the DocumentCloud database to find any documents submitted from contributing organizations on any topic they choose. Searchers can narrow queries to show only documents that relate to two specific entities, such as a company and a place.
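To make that kind of combined query concrete, here is a minimal sketch in Python. It is not DocumentCloud's actual design, which the article does not describe. The `Document` and `Place` classes, the `search` function, and the sample filings are all hypothetical, and the sketch assumes that each document has already been tagged with the organizations and geocoded places it mentions. Distance is computed with the standard haversine formula.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass
class Place:
    """A geocoded place mentioned in a document (hypothetical structure)."""
    name: str
    lat: float
    lon: float


@dataclass
class Document:
    """A source document with pre-extracted entities (hypothetical structure)."""
    title: str
    organizations: set[str] = field(default_factory=set)
    places: list[Place] = field(default_factory=list)


def miles_between(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles, using the haversine formula."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def search(docs, organization, center, radius_miles):
    """Return documents that mention `organization` AND at least one
    place within `radius_miles` of `center`, a (lat, lon) pair."""
    lat, lon = center
    return [
        doc for doc in docs
        if organization in doc.organizations
        and any(miles_between(lat, lon, p.lat, p.lon) <= radius_miles
                for p in doc.places)
    ]


# Made-up sample data, for illustration only.
NEW_YORK_CITY = (40.7128, -74.0060)
docs = [
    Document("Sample filing A", {"IBM"}, [Place("Armonk, NY", 41.1265, -73.7140)]),
    Document("Sample filing B", {"IBM"}, [Place("Austin, TX", 30.2672, -97.7431)]),
    Document("Sample filing C", {"Acme Corp"}, [Place("Newark, NJ", 40.7357, -74.1724)]),
]

matches = search(docs, "IBM", NEW_YORK_CITY, 50)
print([doc.title for doc in matches])
```

Only the first sample filing satisfies both conditions: the second mentions IBM but no nearby place, and the third mentions a nearby place but not IBM. A real archive would presumably rely on a search index rather than a linear scan, but the logic of the query is the same.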
The foursome will lead the project without taking leaves of absence from their current jobs, Pilhofer said. They will also hire staff for coding and development, he added.
The founders are currently looking for organizations to provide documents. Almost all academic, journalistic, or otherwise public-supportive organizations will be able to join, and those interested can contact the team by email. Some organizations already on board include the Times and ProPublica, as well as the National Security Archive and Talking Points Memo.
“We’re going to have a limited universe of contributors, so the sorts of orgs that are going to be contributing are those orgs who have a track record of accuracy and authority and all those sorts of things. The onus, for the most part, is going to be on the contributor to ensure that the document is accurate,” Pilhofer said.
The Knight News Challenge grant will fund the project for its first two years. Pilhofer says the team will search for sustainable funding and hopes to have a good idea of its long-term financial situation by the end of the first year.
DocumentCloud will be open to the public sometime in its first year, Pilhofer said. “The tool that we’re building, I hope, will help … do some things that otherwise might have been not technologically possible in the past,” he said. “It will give them access to tools that facilitate that kind of reporting that there’s no way they would have had access to otherwise.”
07/13/2009, 10:19 PM
Bravo! I'm hoping it'll be a powerful tool for resisting the self-serving tide of secrecy in government (the anti-accountability weapon of choice that has darkened the public forum at the very moment it needs bright lights)! Might be just the do-over the 21st century needed to bring attention back where it belongs - on public policy decisions. It's been focused too long on whatever half-baked plots can be imagined by people who see malice wherever they look, most disturbingly while in the act of violating personal privacy (that quaint concept expressed in the Constitution as protection against unreasonable search and seizure, not to mention the right to due process of law). Sorry for the rant. It's been festering for quite a while now.