DocumentCloud goes from startup to newsroom standard

This piece is one of a series that looks at the Knight News Challenge winners, and their thoughts on future trends, on the occasion of the challenge’s 10th anniversary. 

Aron Pilhofer, co-founder of DocumentCloud, was running a product technology team at The New York Times and experiencing great frustration with the way journalists there – and at most newspapers – published documents online.

“I got really frustrated by the lack of transparency and the lack of openness, not because journalists didn’t want to publish source documents, but because there was no good way to do it back then,” said Pilhofer, who is also executive director of the program, which is based at Temple University in Philadelphia, where he holds an endowed chair.

The breaking point – now known as the moment of inspiration – came when an investigative reporter did a piece on a commuter railroad and published the related documents in a 12,000-page PDF.

“Good for him,” Pilhofer said. “But from the point of view of the reader, that’s almost like not doing it at all.”

It happened again during the 2008 Democratic presidential primary season when the Clinton campaign agreed to release Hillary Clinton’s diary from the time when she was first lady, 1993-2000.

“We knew this was going to be a massive document dump,” Pilhofer recalled. “And we knew that it was probably going to be in a really crap format. It was not going to be searchable, and it was going to be thousands and thousands of pages. We did an in-house, all-night hack-athon, during which we built, effectively, a prototype of what ended up being DocumentCloud.”

Pilhofer and his team took all the PDFs from the Clinton diaries and converted them into web standard images using HTML, CSS and JavaScript so they would load fast and be searchable, and so they could direct readers to specific pages of interest.

DocumentCloud was among Knight Foundation’s second class of News Challenge winners. It launched in 2009 and has been broadly used and wildly successful. More than 8,000 journalists representing 1,600 organizations worldwide have uploaded at least one document. There are currently 3.7 million documents in the DocumentCloud repository and up to 60,000 new documents are uploaded weekly. On the consumption side, the public accesses 100 million documents a year.

The first iteration of DocumentCloud was intended as a generic platform to allow journalists to annotate and share documents, securely and privately, while they’re working on stories, using technology to enhance and expand their journalistic capabilities.

“At that time, the technology that most investigative reporters brought to a deep document story would be a yellow legal pad and a highlighter,” Pilhofer said with a laugh. “We’d run it through optical character recognition to make it searchable. You could share a document collection with a fellow reporter, or with somebody in a different newsroom, even. The big picture idea was trust and transparency. That’s what we wanted to encourage. It’s just a better way to work with documents.”

Winning a Knight News Challenge grant changed the course of Pilhofer’s career.

“It changed my life,” he said. “Before doing something ambitious like DocumentCloud, I had always been in the nerd wing of journalism. I’ve always been a data journalist. But building tools that can scale to the size of something like DocumentCloud, the ability to actually get something like that done and see that grow to where it is now … if not for the News Challenge, it never would have happened.”

In the decade since the Knight News Challenge launched, Pilhofer recognizes several significant developments on the internet, including the combination of machine learning and artificial intelligence, ubiquitous computing, and what he calls the third act of data journalism.

Of the first, the combination of machine learning and AI, he says that algorithmically enabled reporters are now empowered to be better at their jobs. “We haven’t even begun to scratch the surface of what’s possible at a time when journalists are being asked to do more and more and more. And there are fewer and fewer of us,” Pilhofer said.

“Ubiquitous computing,” as he describes it, is the idea that wherever you go, there’s a computer present, whether it’s in your kitchen through a woman talking to you from a metallic tube or on your wrist. “You’re never that far away from a device that can help you make decisions, and help you solve problems, and achieve a goal.”

Finally, Pilhofer breaks down what he sees as the three acts of data journalism – so far:

“The first act of data journalism was kind of when I got in the business. PCs enabled journalists when they became cheap and easy to get. Suddenly journalists could apply technologies to their reporting that they never were able to before,” he said.

“I think the second act is where we are right now, the rise of data visualization and the interactive web, making journalism more visual and interactive. Creating pieces of journalism that allow readers to find their own narrative.

“And,” he adds, “the third act – and this is maybe a little forward-looking – but data journalism will definitely have a third act. It will probably involve algorithmic applications for data journalism that I don’t think we’ve even conceived of yet. That’s going to be pretty exciting.”

Bob Andelman is a Florida-based journalist.
