Atom Feeds

2 January 2007

(Updated as Sam Ruby persuaded me to use second-precision on my dates.)

When I started this bliki, on that flight to Bangalore, I decided to use RSS 0.91 for my feed on the grounds that it was simple. All I had to do was look at an example (PragDave's as it happened) and I could easily create the XML to provide that capability. It's served me well, but I do get occasional complaints that the posts aren't dated.

I asked Ade Oshineye, the man who keeps ThoughtBlogs running, for his advice. He gave me a page of carefully considered reasons for either staying with RSS 0.91 or choosing a new format (and which one). In the end I followed his rather more passionate conclusion: "For the love of god please use Atom."

Cutting to the chase, I now have atom feeds. I still have the RSS 0.91 feeds, but if I ever have to do any work to maintain them I will summarily drop them. So I suggest cutting over to the atom feeds when you can. I've updated the references on my web pages for them or you can find them in the now badly named RssFeeds page.

What follows are a few experiences and thoughts on the conversion.

Over Christmas I dug out what information I could find on atom. My first thought was to find and use Ruby libraries. Ruby has a pretty sophisticated feed processing library called FeedTools. It claimed it could produce a feed, and I believe it. However all the documentation was about consuming and converting feeds, caching them in a database, and the like. It introduced a bunch of dependencies and it wasn't glaringly obvious how to use it just to create a feed.

So I decided to create the XML file myself. After all this is very easy in Ruby, especially now we have the awesome builder library.

So the next trick was to figure out what an atom file looked like and what the bits meant. I found three things very helpful to me.

Me being me, I always want a real life example. I reckoned Sam Ruby's feed should be a good exemplar.
One big reason Ade gave for favoring atom was a solid specification. As with most specifications, I skimmed it to answer the questions I had. In general I prefer to start with an example and gradually tweak it until it works, going to the specification when I have a problem. This is the typical behavior of a moron.
Possibly the best reason to use Atom is the excellent test framework: feedvalidator. I found this to be extremely helpful.

I have three feeds to work on: my updates feed, my bliki feed, and the feed for refactoring.com. The data for the feeds came from different formats, so this was the common but tedious task of data conversion from one arbitrary format to another. Much of enterprise software is like this, and it's not the fun part.

I started out by creating my own feed and entry objects to act as gateways. This way I could program to objects that made sense to me for the three conversions and keep the XML conversion and any atom weirdness that might appear in one place. Initially I wondered if this would be worthwhile, after all builder is so simple to use. I quickly found it to be worthwhile.
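
The gateway idea might look something like the sketch below. It is not the actual code behind these feeds: the AtomEntry and AtomFeed names, their fields, the example.com URLs, and the hand-rolled escaping are all assumptions made for illustration, and plain string assembly stands in for builder so that the example has no dependencies.

```ruby
require 'time'

# Hypothetical gateway objects: the conversion code talks to these,
# and only they know what the Atom XML looks like.
AtomEntry = Struct.new(:id, :title, :link, :updated, :summary)

class AtomFeed
  attr_reader :entries

  def initialize(id:, title:, author:)
    @id, @title, @author = id, title, author
    @entries = []
  end

  def add_entry(entry)
    @entries << entry
    self
  end

  # Assumption: the feed's updated time is that of its newest entry.
  def updated
    @entries.map(&:updated).max || Time.now
  end

  def to_xml
    lines = ['<?xml version="1.0" encoding="utf-8"?>',
             '<feed xmlns="http://www.w3.org/2005/Atom">',
             "  <title>#{escape(@title)}</title>",
             "  <id>#{escape(@id)}</id>",
             "  <updated>#{updated.getutc.xmlschema}</updated>",
             "  <author><name>#{escape(@author)}</name></author>"]
    @entries.each { |e| lines.concat(entry_lines(e)) }
    lines << '</feed>'
    lines.join("\n")
  end

  private

  def entry_lines(e)
    ['  <entry>',
     "    <title>#{escape(e.title)}</title>",
     "    <link href=\"#{escape(e.link)}\"/>",
     "    <id>#{escape(e.id)}</id>",
     "    <updated>#{e.updated.getutc.xmlschema}</updated>",
     "    <summary>#{escape(e.summary)}</summary>",
     '  </entry>']
  end

  # Minimal XML escaping for text and attribute values.
  def escape(text)
    text.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;').gsub('"', '&quot;')
  end
end

# Usage with made-up data.
feed = AtomFeed.new(id: 'http://example.com/feed.atom',
                    title: 'Example Bliki', author: 'A. Writer')
feed.add_entry(AtomEntry.new('http://example.com/bliki/AtomFeeds.html',
                             'Atom Feeds',
                             'http://example.com/bliki/AtomFeeds.html',
                             Time.utc(2007, 1, 2, 14, 30, 0),
                             'Moving from RSS 0.91 to Atom'))
puts feed.to_xml
```

The point of the shape is that each of the three conversions only has to build entries and hand them to the feed; escaping, date formatting, and element order live in one place.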

Most of the process was very straightforward. I just looked at how I created the RSS feed and did the same with the atom feed. (Yes I know I should have used gateways for the old RSS feeds. I get to be foolish too.) The tricky bits were really about things that were new to the atom feed.

The first of these were ids. Atom insists that you give each entry an id. This makes it easier for aggregators to spot multiple copies of the same entry that might come from different sources, or just to tell if a new entry is a truly new entry or an updated old entry. For my bliki it was easy to choose an id - the entries correspond exactly to the web bliki entries so I just used the URL of the bliki entry.

For news updates there isn't a particular page. Looking at Sam Ruby's page I saw he used tags. These were new to me, but googling found an explanation. I generated tags with my domain name, a date, and cleaned up text from the title - copying from Sam Ruby again.

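# Builds a tag URI (RFC 4151) to use as the atom id of an entry with no page of its own.
def calculate_atom_id
  # Replace every non-word character in the title with a hyphen.
  specific = title.gsub(/\W/, '-')
  "tag:#{domain_name},#{date.strftime('%Y-%m-%d')}:#{specific}"
end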

The real driving purpose of this was to add dates, and this introduced a couple of oddities. First were the RFC 3339 dates, which I had to look up to see how they worked. It didn't look like the Ruby date classes had a method to return an RFC 3339 date, but after some poking around I realized that the Time class has exactly what I need in its xmlschema method (available once you require 'time').
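
As a small illustration of that method, here is a sketch using an arbitrary example timestamp rather than anything taken from the real feeds:

```ruby
require 'time'   # makes xmlschema available on Time (formatting and parsing)

# An arbitrary example instant; a real feed would use each entry's own timestamp.
posted = Time.utc(2007, 1, 2, 14, 30, 0)
puts posted.xmlschema   # => 2007-01-02T14:30:00Z

# A time with a local offset keeps that offset in the output.
local = Time.new(2007, 1, 2, 9, 30, 0, '-05:00')
puts local.xmlschema    # => 2007-01-02T09:30:00-05:00

# The class-level method goes the other way and parses such a string.
parsed = Time.xmlschema('2007-01-02T09:30:00-05:00')
puts parsed == posted   # => true
```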

One thing that wasn't clear in the spec was what the updated date really meant. The spec merely said it was "the most recent instant in time when an entry or feed was modified in a way the publisher considers significant." When I change a bliki entry it's either to fix a typo, or to revise the entry in some way. Typos I don't consider significant. I do expect aggregators to update their copy of the entry, but I don't expect them to highlight it as new or changed. The latter changes I do expect to be highlighted. It would have been helpful had the spec given some suggestions as to how aggregators and readers might interpret the date - after all it's that interpretation that conveys the true meaning of the field. I often find this problem: writers of specifications are reluctant to put in a standard what clients should do, because they don't want to constrain clients. I understand that concern, but I do think it's really helpful to say how they imagine it might be used, with some scenarios.

The most awkward aspect of the updated date for me is its precision. The atom spec says that "Date values should be as accurate as possible. For example, it would be generally inappropriate for a publishing system to apply the same time-stamp to several entries that were published during the course of a single day." However I've always looked at my updates as something with date precision. The time I upload the entry to the server isn't relevant to me, just which date I did it. My timestamps thus reflect that - they only mention the day (and indeed use Ruby's Date class, which only has date precision).

My initial thought was to leave them at Date precision, picking an arbitrary 00:00Z for the time part to satisfy RFC 3339. The atom spec said 'should' rather than 'shall', which is an important distinction in StandardsSpeak, and feedvalidator marked two entries with the same time-stamp as a warning rather than an error. Unless I could understand a downstream problem, I didn't see why I should put in the work to handle second-precision updates rather than date-precision.
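
That date-precision approach would amount to something like this sketch, in which the midnight_utc helper is a name made up for illustration:

```ruby
require 'date'
require 'time'

# Hypothetical helper: pad a date-precision timestamp out to 00:00Z.
def midnight_utc(date)
  Time.utc(date.year, date.month, date.day)
end

posted_on = Date.new(2007, 1, 2)
puts midnight_utc(posted_on).xmlschema   # => 2007-01-02T00:00:00Z
```

Every entry posted on the same day gets the same timestamp this way, which is what triggers the feedvalidator warning.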

Sam Ruby provided a compelling scenario. Some people, including him, aggregate multiple feeds and read them by starting at the latest and reading back till they reach something they read earlier. My entries would usually be inserted earlier in the time log than they should be and wouldn't get read. (I could give them an arbitrary late time part which would keep them at the top of the list, but that would just irritate the reader.)
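
To make that scenario concrete, here is a small sketch of what a newest-first aggregator would do with a date-precision entry. The feed names, timestamps, and the river method are invented for illustration; this is not Sam Ruby's or anyone else's actual aggregator.

```ruby
require 'time'

Item = Struct.new(:source, :title, :updated)

# Invented entries from other feeds, all published with full timestamps.
others = [
  Item.new('feed-a', 'Morning post',   Time.utc(2007, 1, 2, 8, 15, 0)),
  Item.new('feed-b', 'Lunchtime post', Time.utc(2007, 1, 2, 12, 40, 0))
]

# The same bliki entry, really published at 14:30, stamped two ways.
date_precision   = Item.new('bliki', 'Atom Feeds', Time.utc(2007, 1, 2))            # 00:00Z
second_precision = Item.new('bliki', 'Atom Feeds', Time.utc(2007, 1, 2, 14, 30, 0))

# A toy aggregator view: newest entries first.
def river(items)
  items.sort_by(&:updated).reverse.map { |i| "#{i.updated.xmlschema} #{i.source}" }
end

puts river(others + [date_precision])    # bliki lands at the bottom
puts '--'
puts river(others + [second_precision])  # bliki lands at the top
```

A reader who had already caught up to the lunchtime post would stop before ever reaching the entry stamped 00:00Z.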

So I decided to use second precision. I needed to replace the Date objects I was using to handle timestamps with Ruby's Time objects. I also now need to start putting full times in posts, which from this entry onwards I will do.

I downloaded a copy of feedvalidator to test my feeds as I gradually filled them out. It was easy, although I grimaced at actually having to install it by hand as opposed to just using apt-get - I guess I'm getting soft.

As a final aside, a story. A year or two ago a Very Large Software Company (one that makes software I'm sure you're familiar with) asked me if I minded having my feed aggregated into an architecture feed they were producing. My response, as it usually is, was "fine, that's what feeds are for". A month or two later I got an email saying they couldn't use my feed and I needed to change it to RSS 2.0. It was more effort than I fancied, so I declined. But I couldn't help but chuckle that this big organization, which had clearly set up a full blown project to do this work, couldn't do what Ade does for us in his spare time.