Coming soon: Data mining made easier
SHOWCASE | July 11, 2009

New York Times and ProPublica editors have won a $700,000 grant, the Knight News Challenge’s largest award, for designing an archive that makes documents used in investigative reporting available for future use by reporters and others. As one of the designers puts it, it will provide tools that “there’s no way they would have had access to otherwise.”


By Alex Byers
byersalex@niemanwatchdog.org

 

Maybe you've been there.

Months of investigative research and reporting culminate in hard-hitting, exclusive stories. Other news organizations play catch-up, and your own follow-ups keep the story running. And then it fades away, referred to rarely and found only by an archives search online. The reports, transcripts and documents you gathered find a home in a dusty corner of your filing cabinet, likely never to be seen again by new inquiring minds.

It's exactly the kind of thing that the founders of DocumentCloud want to change.

DocumentCloud, the brainchild of ProPublica editors Scott Klein and Eric Umansky, New York Times Interactive News Technologies Editor Aron Pilhofer, and Times software engineer Ben Koski, is an online database that could change the way the public consumes investigative reporting. As the largest grant winners in this year’s Knight News Challenge, the foursome will receive more than $700,000 to launch an online, searchable database that will allow journalists and the public to find, inspect, and contribute original source documents gathered from investigative reporting.

 

“We want to take all these documents that all these organizations are collecting and acquiring via FOIA, and we would like to make them easier for people to find, easier for people to share, easier to search,” Pilhofer said. “We want to take advantage of some of the incredible advances in data mining and text mining technology that we’ve seen over the last couple of years.”

 

The project is designed to be more than just an aggregation of public documents on the Web, however. DocumentCloud will give news consumers and journalists the ability to look back and find older documents that might have been used in a previous investigation. Sometimes documents will have additional and previously unknown value after their first use, Pilhofer said.

 

In addition, the team wants the database to link documents by several criteria, such as location, topic, and company. That would give users the ability to search for all documents related to a specific entity and originating near a certain city, Pilhofer said.

 

“Now you actually have meaningful entities that you can use to link documents together,” he said. “Something you could do, for example, would be to say ‘show me all the documents that reference IBM that also reference a place within 50 miles of New York City.’ Those are the kinds of searches you could do that you just absolutely cannot do any other way.”

 

Essentially, journalists and the public will be able to search the DocumentCloud database to find any documents submitted by contributing organizations on any topic they choose. Searchers can narrow queries to show only documents that relate to two specific entities, such as a company and a place.
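
The article does not describe how DocumentCloud will implement such searches. As a rough illustration only, the kind of query Pilhofer describes (documents that mention a company and also mention a place within a given distance of a city) could be sketched as below. Every name here (`Document`, `search`, `DOCS`, the sample titles and coordinates) is hypothetical and invented for the example; it assumes each document has already been tagged with the entities and place coordinates it mentions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass
class Document:
    """Hypothetical record: a document plus the entities and places it mentions."""
    title: str
    entities: set[str] = field(default_factory=set)
    places: list[tuple[float, float]] = field(default_factory=list)  # (lat, lon)


def haversine_miles(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


def search(docs, entity, center, radius_miles):
    """Return documents that mention `entity` and at least one place
    within `radius_miles` of `center`."""
    return [
        d for d in docs
        if entity in d.entities
        and any(haversine_miles(center, p) <= radius_miles for p in d.places)
    ]


# Made-up sample data for illustration.
NEW_YORK = (40.7128, -74.0060)
NEWARK = (40.7357, -74.1724)
CHICAGO = (41.8781, -87.6298)

DOCS = [
    Document("Contract memo", {"IBM"}, [NEWARK]),
    Document("Audit report", {"IBM"}, [CHICAGO]),
    Document("Zoning filing", {"Acme"}, [NEWARK]),
]

# Documents that reference IBM and a place within 50 miles of New York City.
print([d.title for d in search(DOCS, "IBM", NEW_YORK, 50)])
```

With this sample data, only the memo that mentions both IBM and Newark matches. A real system would first have to extract entities and place names from the document text, which this sketch takes as given.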

 

The foursome will lead the project but will not take leaves of absence from their current jobs to do so, Pilhofer said. They will also be hiring staff for coding and development, he added.

 

The founders are currently finding organizations to provide documents. Almost all academic, journalistic, or otherwise public-supportive organizations will be able to join, and those interested can contact the team by email. Organizations already on board include the Times and ProPublica, as well as the National Security Archive and Talking Points Memo.

 

“We’re going to have a limited universe of contributors, so the sorts of orgs that are going to be contributing are those orgs who have a track record of accuracy and authority and all those sorts of things. The onus, for the most part, is going to be on the contributor to ensure that the document is accurate,” Pilhofer said.

 

The project will be funded through the Knight News Challenge grant for its first two years. Pilhofer said the team will search for sustainable funding, with the hope of having a good idea of its long-term financial situation by the end of its first year.

 

DocumentCloud will be open to the public sometime in its first year, Pilhofer said. “The tool that we’re building, I hope, will help … do some things that otherwise might have been not technologically possible in the past,” he said. “It will give them access to tools that facilitate that kind of reporting that there’s no way they would have had access to otherwise.”

 

-

Guy
Posted by Dan
07/13/2009, 10:19 PM

Bravo! I'm hoping it'll be a powerful tool for resisting the self-serving tide of secrecy in government (the anti-accountability weapon of choice that has darkened the public forum at the very moment it needs bright lights)! Might be just the do-over the 21st century needed to bring attention back where it belongs - on public policy decisions. It's been focused too long on whatever half-baked plots can be imagined by people who see malice wherever they look, most disturbingly while in the act of violating personal privacy (that quaint concept expressed in the Constitution as protection against unreasonable search and seizure, not to mention the right to due process of law). Sorry for the rant. It's been festering for quite a while now.

