Random Musings

O for a muse of fire, that would ascend the brightest heaven of invention!

Generating YAML From CouchDB Docs

Continuing the theme of the last two posts, the old Posterous blog content is now available as JSON inside CouchDB. I’m now going to combine a few pieces that are unique to CouchDB to build up the components that will become blog fodder for Octopress.

Octopress, which is based on Jekyll, uses a mixture of YAML and Markdown for pages and posts. We’ll use a show function, which passes a JSON document to a JavaScript transformation function, to build this up.

First, the Posterous format includes a whole lot of stuff we won’t need. For Octopress we want title, display_date, tags, and body_full only:

Excerpt of typical post as a CouchDB JSON doc
{
   "_id": "21298063",
   "_rev": "1-8ba11a44954e3171de4f4fa9d68c3210",
   "is_owned_by_current_user": true,
   "slug": "setting-up-a-shared-photo-library-in-picasa3",
   "tags": [
   ],
   "title": "setting up a shared photo library in Picasa3 on MacOS",
   "display_date": "2009/01/09 13:49:53 -0800",
   "body_full": "setting up a shared photo library for several..."
}

Assuming Octopress can parse the date format, this will be easy. Let’s map these to title, date, categories, and content to build our YAML like this:

Typical Octopress YAML header
---
layout: post
title: "setting up a shared photo library in Picasa3 on MacOS"
date: 2009/01/09 13:49:53 -0800
comments: true
categories: []
---
... body_full goes here ...

The show function is pretty straightforward:

CouchDB show to transform Posterous JSON into YAML
function(doc, req) {
    if (doc.slug && doc.title && doc.display_date && doc.tags && doc.body_full) {
        return {
            body: '---\nlayout: post\n' +
              'title: "' + doc.title + '"\n' +
              'date: ' + doc.display_date + '\n' +
              'comments: true\n' +
              'categories: ' + doc.tags + '\n---\n',
            headers: {
                'Content-Type': 'application/text'
            }
        }
    }
}

Let’s walk through that line by line.

  1. Declare the function, and therefore our scope.
  2. First I check that all the entities we require are present. This avoids generating an expensive exception in the JavaScript engine if later on I try to access data that isn’t actually present.
  3. Return an object comprising the body content and the headers.
  4. The body is built up from JSON properties of the supplied doc object.
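The steps above can be checked without a running CouchDB, since a show function is just JavaScript that takes a document and returns an object. The sketch below is a throwaway harness, not part of CouchDB: `showYaml` is a hypothetical name given to the function above so it can be called directly, and `sampleDoc` is trimmed from the excerpt earlier in this post.

```javascript
// Hypothetical local harness: CouchDB normally calls the show function for us.
// showYaml is the show function from above, named so it can be invoked here.
var showYaml = function (doc, req) {
    if (doc.slug && doc.title && doc.display_date && doc.tags && doc.body_full) {
        return {
            body: '---\nlayout: post\n' +
              'title: "' + doc.title + '"\n' +
              'date: ' + doc.display_date + '\n' +
              'comments: true\n' +
              'categories: ' + doc.tags + '\n---\n',
            headers: { 'Content-Type': 'application/text' }
        };
    }
};

// Sample document, trimmed from the excerpt shown earlier.
var sampleDoc = {
    _id: '21298063',
    slug: 'setting-up-a-shared-photo-library-in-picasa3',
    tags: [],
    title: 'setting up a shared photo library in Picasa3 on MacOS',
    display_date: '2009/01/09 13:49:53 -0800',
    body_full: 'setting up a shared photo library for several...'
};

// The second argument is the request object, which this function never reads.
var response = showYaml(sampleDoc, {});
console.log(response.body);

// A document missing a required field skips the if block and yields undefined.
var incomplete = showYaml({ title: 'no slug here' }, {});
console.log(incomplete === undefined);
```

An empty tags array is truthy in JavaScript, so the guard still passes, but concatenating it into a string produces nothing, which is why the categories line comes out blank for untagged posts.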

That seems like a good start, so wrap that up into a design document, drop it into your CouchDB and test it out:

testing our show via cURL
$ curl --silent --header "Content-Type: application/text" \
  http://localhost:5984/posts/_design/posts/_show/yaml/31797293
---
layout: post
title: "ubuntu saves the day"
date: 2010/10/28 04:25:37 -0700
comments: true
categories:
---
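For reference, a show function has to be stored as a source string under the `shows` key of a design document before that URL works. The sketch below assumes the names visible in the URL above (database `posts`, design document `_design/posts`, show `yaml`). The variable names are made up for illustration, and the function body is a shortened stand-in to keep the example small.

```javascript
// Stand-in for the full show function; only the overall shape matters here.
var show = function (doc, req) {
    return {
        body: 'title: "' + doc.title + '"\n',
        headers: { 'Content-Type': 'application/text' }
    };
};

// CouchDB keeps show functions as source strings under "shows",
// keyed by the name used in the _show/<name>/<docid> URL.
var designDoc = {
    _id: '_design/posts',
    shows: {
        yaml: show.toString()
    }
};

// This JSON is what would be saved into the posts database,
// for example with an HTTP PUT to /posts/_design/posts.
var payload = JSON.stringify(designDoc, null, 2);
console.log(payload);
```

Keeping the function as real JavaScript and serialising it with `toString()` avoids hand-escaping quotes and newlines inside the JSON.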

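Notice the Content-Type: application/text header? curl prints the body whether or not the request sends it; what matters is that the show function returns that type in its response. Try that same link in your browser: it doesn’t know how to display application/text, so you’re prompted for a download named after the _id stored in CouchDB.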

It would be nicer to get that with the correct markdown filename already. Let’s use doc.slug for the name, prefixed with the date of the original post. Octopress expects a yyyy-mm-dd format so I’ve sprinkled liberally with regex pixie dust.
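The regex pixie dust is easier to trust when each step is run on its own. This sketch applies two replacements to the display_date from the curl output above and then builds the filename from it. `toPostDate` is a hypothetical helper name, not something CouchDB or Octopress provides.

```javascript
// Hypothetical helper: convert a Posterous display_date such as
// "2010/10/28 04:25:37 -0700" into the yyyy-mm-dd form Octopress expects.
function toPostDate(displayDate) {
    // Step 1: swap every slash for a dash -> "2010-10-28 04:25:37 -0700"
    var dashed = displayDate.replace(/\//g, '-');
    // Step 2: keep the leading run of digits and dashes and drop the rest.
    // The run stops at the first space, so the time and zone are discarded.
    return dashed.replace(/^([-0-9]+).+/, '$1');
}

var postDate = toPostDate('2010/10/28 04:25:37 -0700');
var fileName = postDate + '-' + 'ubuntu-saves-the-day' + '.md';
console.log(postDate);
console.log(fileName);
```

The second replacement relies on the date being the first thing in the string; a display_date in some other layout would need a different pattern.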

Finally, an additional HTTP header, Content-Disposition: attachment; filename=<file.ext>, is required to provide the proposed name via our show function.

Now’s a good time to append the actual blog post content too.

CouchDB show to transform Posterous JSON to Octopress
function(doc, req) {
    if (doc.slug && doc.title && doc.display_date && doc.tags && doc.body_full) {
        // Replace / with - and trim display_date to yyyy-mm-dd only,
        // to match the post filename format Octopress expects.
        // post_name holds the full Content-Disposition header value, which
        // browsers or wget will use as the proposed filename.
        var post_date = doc.display_date.replace(/\//g, '-').replace(/^([-0-9]+).+/, "$1");
        var post_name = 'attachment; filename=' + post_date + '-' + doc.slug + '.md';
        return {
            body: '---\nlayout: post\n' +
              'title: "' + doc.title + '"\n' +
              'date: ' + post_date + '\n' +
              'comments: true\n' +
              'categories: ' + doc.tags + '\n---\n' +
              doc.body_full + '\n',
            headers: {
                'Content-Type': 'application/text',
                'Content-Disposition': post_name
            }
        }
    }
}

I’ve put this into a separate show function, and this is what comes back:

Results from Octopress show function
$ curl --silent --verbose --header "Content-Type: application/text" \
 http://localhost:5984/posts/_design/posts/_show/octo/31797293
* About to connect() to localhost port 5984 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 5984 (#0)
> GET /posts/_design/posts/_show/octo/31797293 HTTP/1.1
> User-Agent: curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 \
 OpenSSL/0.9.8r zlib/1.2.5
> Host: localhost:5984
> Accept: */*
> Content-Type: application/text
>
< HTTP/1.1 200 OK
< Vary: Accept
< Server: CouchDB/1.1.1 (Erlang OTP/R14B04)
< Etag: "6PML44SHRNE54M212K0O6BLXZ"
< Date: Thu, 22 Dec 2011 13:02:36 GMT
< Content-Type: application/text
< Content-Length: 1668
< Content-Disposition: attachment; filename=2010-10-28-ubuntu-saves-the-day.md
<
{ [data not shown]
* Connection #0 to host localhost left intact
* Closing connection #0
---
layout: post
title: "ubuntu saves the day"
date: 2010-10-28
comments: true
categories:
---
My work laptop had a BSOD today, which looks like it was caused by bit rot ...

$ wget http://localhost:5984/posts/_design/posts/_show/octo/31797293 \
  --content-disposition
Resolving localhost... 127.0.0.1, ::1, fe80::1
Connecting to localhost|127.0.0.1|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1668 (1.6K) [application/text]
100%[======>] 1,668       --.-K/s   in 0s

2011-12-22 14:04:46 (79.5 MB/s) - `2010-10-28-ubuntu-saves-the-day.md' saved [1668/1668]

Now we can transform arbitrary Posterous blog entries via CouchDB into Markdown format. Next time, I’ll use CouchDB to pull all the data out in one swoop.

Munging Posterous With CouchDB

Previously I used the Posterous API to retrieve all my blog posts. In this post, I’m going to show how easy it is to use CouchDB’s _bulk_docs API to get lots of data in via JSON import. Later on, I’ll transform it using CouchDB’s show, view and list functions.

CouchDB’s bulk loading API requires JSON documents to be embedded in an array called docs within a parent JSON object that contains optional parameters to indicate to CouchDB how to handle the upload. Note that the _id value must be a string.
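As a rough sketch of that shape, the snippet below wraps an array of posts in the `docs` envelope and forces each `_id` to a string. It assumes each Posterous post carries a numeric `id` field, which is a guess about the API output rather than something shown here. `toBulkDocs` is a made-up helper name, and the sample posts only reuse the ids and titles seen elsewhere on this blog.

```javascript
// Hypothetical helper: wrap posts in the { "docs": [...] } envelope
// that _bulk_docs expects, giving every document a string _id.
function toBulkDocs(posts) {
    var docs = posts.map(function (post) {
        var doc = {};
        // Shallow-copy the post so the input array is left untouched.
        Object.keys(post).forEach(function (key) {
            doc[key] = post[key];
        });
        // Assumption: the source id is numeric; CouchDB needs a string.
        doc._id = String(post.id);
        return doc;
    });
    return { docs: docs };
}

var posts = [
    { id: 21298063, title: 'setting up a shared photo library in Picasa3 on MacOS' },
    { id: 31797293, title: 'ubuntu saves the day' }
];

var payload = toBulkDocs(posts);
console.log(JSON.stringify(payload, null, 2));
```

The resulting JSON would be sent as the body of a POST to /posts/_bulk_docs with a Content-Type of application/json.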

Using cURL and the Posterous API

The Posterous API docs have a really neat feature: you can call the API directly from the web page. Unfortunately, I only need it to migrate off to Octopress.

Log into Posterous and then open the API page. Use the first entry to obtain your API token. Put this, your login and password below and you should be able to obtain a list of your Posterous Sites, or Spaces as they’re now called.

Ubuntu Saves the Day

My work laptop had a BSOD today, which looks like it was caused by bit rot
on the root partition. While everything’s backed up onto S3, restores from NZ
of 20GiB of data take a while, so I was kinda hoping to recover smoothly
without getting the IT guys to visit, who probably will just rebuild it. I’m
pretty sure that’s fair payment for getting chocolate inside my laptop’s fan.

It was a good opportunity to give ubuntu maverick a spinup before it goes
on the big iMac at home as dual-boot. The install is slowly tweaked each time,
and it’s really clean. I am pretty sure my mum could do this without any
help now, and the fresh look is nice - it’s truly a class act OS now.

One workaround was needed to resolve what is probably a stuck trackpad
on the loan laptop
http://xpapad.wordpress.com/2009/09/09/dealing-with-mouse-and-touchpad-freeze…
with a ‘rmmod psmouse’ and then it was all go. Everything works which
really is an impressive step forward for Canonical, with strong OEM relations
clearly now paying off. Hats off guys!

Anyway long story short, nautilus and brasero to the rescue, and I now have
a bunch of md5 checksummed DVDs stashed before the hired goons come
tomorrow to blow it away. I love the ntfs integration in linux, and the new
maverick Ubuntu gets thumbs up all round - especially as it’s now got
CouchDB 1.0.1 included - yay!