10 Years of Turning Documents into Data: A Q&A with DocumentCloud

DocumentCloud, an open-source platform that allows journalists and their readers to upload, analyze, annotate and collaborate on primary source documents, was founded in 2009 with a grant from the Knight News Challenge. Since then, it has been adopted by news outlets big and small for collaboration on investigations such as the Panama Papers and to make sourcing transparent to readers. Knight has continued to invest in the platform.

The platform was the brainchild of Temple University’s Aron Pilhofer and ProPublica’s Scott Klein and Eric Umansky. Over the past decade, other project leaders such as Ted Han, Jeremy Ashkenas and Amanda Hickman have contributed to DocumentCloud’s development and success. In 2018, DocumentCloud merged with MuckRock under the leadership of co-founders Michael Morisy and Mitchell Kotler. 

On its 10th anniversary, we asked DocumentCloud’s founders and project leaders to share some thoughts about the program, the organization and what to expect in the coming years. Their responses are below.

Q&A with DocumentCloud’s Founders and Project Leaders

What motivated you to create DocumentCloud?

Aron Pilhofer: “Helping journalists be more transparent about their sourcing: that was and remains DocumentCloud’s mission. Ten years ago, there were no good tools to help journalists share, annotate and publish source documents in a web-friendly way. To the extent newsrooms published documents at all, it was usually in the proprietary and bloated PDF format: newsrooms would publish massive documents, expecting users to download them to their local hard drives and try to make sense of them. It was nuts!

My team at The New York Times had built an early version we called Document Viewer to handle a massive document dump during the 2008 presidential election. Scott Klein and Eric Umansky at ProPublica saw it and asked if we would share the tool with them. As we started talking, we soon realized this could be a much bigger thing: we could build something any newsroom could use. And before we knew it, we were writing the Knight News Challenge grant.”

What were some of the early challenges you had to overcome while developing the platform?

Jeremy Ashkenas: “At its 2009 launch, DocumentCloud was an early and ambitious example of a rich web application — a powerful workspace for organizing, annotating, searching and visualizing documents that ran entirely in the browser. It was a different era of tool-building for the web, and we had to invent a lot of the technology required to get the job done: a web framework, an asset packager, a traffic tracker and a job queuing system, among other open source components.

But we got lucky, and several of the pieces of DocumentCloud were widely adopted by industry, helping power rich-web applications for companies such as WordPress, Airbnb, Stripe, USA TODAY and Hulu, just to name a handful. Large-scale adoption led to feedback and code contributions that helped solidify and improve the core components. And now, 10 years later, the general approach that DocumentCloud helped pioneer is no longer a curiosity.”

How did newsrooms respond to DocumentCloud’s initial launch?

Amanda Hickman: “It was a really easy sell, but we still had to sell it. Everyone wanted to see DocumentCloud, but during the first year almost all the newsrooms that signed up had seen a demo somewhere else; and as soon as they saw the demo, they definitely wanted an account. We led a lot of conference sessions, workshops and brown bag lunches. We were very hands-on for the first few years; sitting down with users really helped us dial in what needed to work better, what needed better documentation, and how newsrooms were actually using what we were building.”

What are some interesting ways journalists are using DocumentCloud?

Ted Han: “We’ve been fortunate that our users have tried so much with DocumentCloud. Some of those experiments and projects are technical, but many aren’t.

Just the act of publishing documents to an interested audience can turn up stories, as was the case when BuzzFeed’s Kendall Taggart published FOIA requests she received regarding President Trump’s New Jersey bankruptcies. Saint Louis Public Radio published the Ferguson Grand Jury documents and asked readers to follow along with them as they analyzed and annotated the documents publicly.

Newsrooms have built entire archives on top of DocumentCloud, like The Intercept’s Snowden Archive. Brad Heath at USA TODAY has taken his transparency work and archive-building even further and turned Big Cases Bot into procedural coverage of federal court cases, which is followed by tens of thousands.

DocumentCloud has also been the home of multi-newsroom collaborations like ProPublica’s Free the Files, which collected public ad records from TV stations and gamified the analysis of those documents with its audience.

But the project’s influence reaches outside journalism too.  For example, the FBI adopted our document viewer for their public records vault.”

How has DocumentCloud changed since its inception?

Michael Morisy: “The most visible changes are with the viewer, from embedding whole documents, to making it easy to highlight just a particular page, to allowing newsrooms to weave in specific annotated passages that help tell a broader narrative. Documents really can be a storytelling tool, and it’s been fantastic to see that become more widespread.

There’s also been great innovation as a platform. Newsrooms have built entire news apps on top of the DocumentCloud platform, such as WRAL’s award-winning tool that let readers home in on specific public figures who appeared throughout a large document release, or Brad Heath’s Big Cases Bot, which automatically spots court filings on important legal proceedings, archives them in DocumentCloud, and then tweets them out.

But the most important change is the one outside of our platform: changing the culture around newsroom transparency. It used to be rare to post primary source documents, and now it’s the expectation. I think that’s an excellent opportunity to build trust and really bring readers into the reporting that continues to excite me.”

What can we expect from DocumentCloud in the coming years?

Michael Morisy: “In the short term, speed and stability. Getting thousands or tens of thousands of pages is no longer an extraordinary event, and newsrooms need to quickly respond when the next Mueller Report-level release comes out. So we’re working hard to adapt to that.

We’re also focused on helping newsrooms find new ways to get their readers to engage with their reporting. Finding ways to let the audience actively participate, whether that’s by highlighting sections of a document or helping crowdsource analysis, means that these stories stick with them longer and readers build a direct relationship with the newsroom.

Finally, expect DocumentCloud to tackle new ways of helping you understand what’s in those documents, whether that’s new analytical tools or using machine learning to partially automate some kinds of analysis, or a more robust editorial program that provides clear guidance on reading a contract or delves into the ethics of how newsrooms handle mugshots.”

Read more about the launch of DocumentCloud.

Paul Cheung is director of journalism and technology innovation at Knight Foundation. You can follow him on Twitter at @pcheung630.


Image (top) by myrfa on Pixabay. Headshots courtesy of DocumentCloud.