Munging Posterous With CouchDB
Previously I used the Posterous API to retrieve all my blog posts.
In this post, I’m going to show how easy it is to use CouchDB’s
_bulk_docs API to get lots of data in via JSON import. Later on,
I’ll transform it using CouchDB’s show, view and list functions.
CouchDB’s bulk loading API requires JSON documents to be
embedded in an array called docs within a parent JSON object
that contains optional parameters to indicate to CouchDB how to
handle the upload. Note that the _id value must be a string.
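A minimal Python sketch of that shape, for illustration only. The two documents are made-up placeholders, not the real Posterous data, and all_or_nothing is written as a top-level flag on the assumption that the CouchDB version in use accepts it (older 1.x releases documented it; newer ones may not honour it).

```python
import json

# Illustrative payload: the documents and their fields are invented.
payload = {
    # Optional parameter, assumed to be accepted by the target CouchDB version.
    "all_or_nothing": True,
    # The documents themselves always live in an array called "docs".
    "docs": [
        {"_id": "1001", "title": "First post"},
        {"_id": "1002", "title": "Second post"},
    ],
}


def ids_are_strings(body):
    """Return True when every _id that is present in body["docs"] is a string."""
    return all(isinstance(doc.get("_id", ""), str) for doc in body["docs"])


print(json.dumps(payload, indent=2))
```

The ids_are_strings helper is a hypothetical convenience for catching numeric ids before an upload attempt.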
I’m going to use the all_or_nothing model and keep munging my data
until it uploads in one go. You could just as easily keep removing documents
that were successfully uploaded from your parent JSON object, which might
be a better approach if you had a lot of data.
The Posterous data I retrieved last time delivers my posts as a single JSON array of post objects.
Each post’s id field looks like an ideal choice to map to CouchDB’s _id field.
It’s not a string though, so we’ll need to quote its value.
As an old-school kinda guy, I did it with a Perl one-liner. I am sure you Node.js
ninjas out there can do it in 20 lines of valid JS with beautiful
nested callbacks though.
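For anyone who would rather not reach for a regex, the same id-to-_id mapping can be sketched in Python. This is a stand-in for the one-liner, not a reproduction of it, and the sample post and the to_couch_doc name are assumptions made for illustration.

```python
import json


def to_couch_doc(post):
    """Copy a post, replacing its numeric id with a string _id."""
    doc = {key: value for key, value in post.items() if key != "id"}
    doc["_id"] = str(post["id"])  # CouchDB requires _id to be a string
    return doc


# Stand-in sample: a JSON array holding one post with a numeric id.
posts = json.loads('[{"id": 12345, "title": "Hello"}]')
docs = [to_couch_doc(post) for post in posts]
```

Working on parsed JSON rather than raw text avoids accidentally rewriting an "id" key nested somewhere inside a post.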
The final step is to wrap the documents in the _bulk_docs format we saw
initially, { "options": ..., "docs": [array] }, and then run it through
jsonlint to confirm that Douglas Crockford is happy. No doubt it would
look prettier in Ruby.
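A rough Python equivalent of the wrap-and-validate step, assuming the documents already carry string _id values. The wrap_for_bulk_docs name is hypothetical, and parsing the serialised text back with json.loads stands in for the jsonlint check.

```python
import json


def wrap_for_bulk_docs(docs, all_or_nothing=True):
    """Serialise docs inside the parent object that _bulk_docs expects."""
    body = {"docs": list(docs)}
    if all_or_nothing:
        # Optional flag, assumed to be accepted by the target CouchDB version.
        body["all_or_nothing"] = True
    return json.dumps(body)


text = wrap_for_bulk_docs([{"_id": "12345", "title": "Hello"}])
parsed = json.loads(text)  # raises ValueError if the text is not valid JSON
```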
So let’s push this into a new CouchDB database and see what happens. In the worst case, our upload will be rejected in its entirety and we’ll simply need to retry with improved data.
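The upload itself is a single POST to the database's _bulk_docs endpoint. Here is a hedged sketch using only the Python standard library; the base URL, the database name and all three function names are assumptions, and the database is assumed to exist already. The response is treated as a JSON array with one row per document, where failed rows carry an "error" key.

```python
import json
import urllib.request


def build_bulk_request(base_url, db, body):
    """Build (but do not send) a POST request for the _bulk_docs endpoint."""
    url = f"{base_url.rstrip('/')}/{db}/_bulk_docs"
    data = json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def failed_rows(rows):
    """Return the result rows that report an error."""
    return [row for row in rows if "error" in row]


def upload(base_url, db, body):
    """Send the bulk request and return the parsed JSON response."""
    request = build_bulk_request(base_url, db, body)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling upload("http://127.0.0.1:5984", "posterous", body) against a local CouchDB and passing the result to failed_rows shows which documents, if any, need more munging before a retry.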
In the next post, I’ll transform it using CouchDB’s show, view and list functions to load into Octopress.