DocumentCloud releases first piece of software: “CrowdCloud”


Warning: Geekspeak™ ahead!

As the folks at DocumentCloud (2009 Knight News Challenge winners) got to work on building their document crowdsourcing platform, they quickly realized they were going to need a lot of processing power:

Our PDFs need to have their text extracted, their images scaled and converted, and their entities extracted for later cataloging. All of these things are computationally expensive, keeping your laptop hot and busy for minutes, especially when the documents run into the hundreds or thousands of pages.

So they’ve created a program called CrowdCloud that will allow all these tasks to be distributed across a set of machines, and released it as open-source. (True geeks will delight in the pun inherent in the name of the software: while we’re crowdsourcing the task of sifting through PDFs to a group of people, why not crowdsource the task of making those PDFs siftable to a group of computers?)

This means other developers can use the CrowdCloud software to distribute their processor-heavy tasks.
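For the curious, here is a rough sketch of the general idea: break a big job into independent pieces, hand the pieces to a pool of workers, and stitch the results back together. This is not the CrowdCloud API. The function names (`split`, `process`, `merge`, `run_job`) are hypothetical, word counting stands in for the expensive work, and threads on a single machine stand in for the set of machines the real software would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split(document):
    """Split one large job into independent work units (here: pages)."""
    return list(enumerate(document["pages"]))


def process(unit):
    """Stand-in for an expensive step such as text extraction.

    Hypothetical: a real worker would extract text or convert images here.
    """
    page_number, text = unit
    return page_number, len(text.split())


def merge(results):
    """Combine per-page results back into one answer for the document."""
    return {page_number: count for page_number, count in sorted(results)}


def run_job(document, workers=4):
    # Assumption: threads on one machine stand in for a pool of worker machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process, split(document)))
    return merge(results)


if __name__ == "__main__":
    doc = {"pages": ["first page of text", "second page", "third"]}
    print(run_job(doc))
```

Because each page is handled on its own, adding more workers shortens the wait for long documents, which is the appeal of spreading the work across many machines.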

Read more about the software on the DocumentCloud blog.

Update (9/15): The headline and body of this post were edited after publication to reflect the fact that this is central to the DocumentCloud software, not just an extra add-on. Scott Klein, one of the collaborators on the DocumentCloud project, told me we can expect incremental code releases like this in the future: “We’re going to be releasing components as we go instead of doing a big code release at the end,” he said.