Tools bliki

MovingToNokogiri tools 10 January 2011

Most of this site, including this bliki, is built using an XML to HTML transformation process. I write the articles and bliki entries in my own XML vocabulary and then transform these sources to the HTML you read. When I started back in 2000 I did it in XSLT. While I got pretty good at programming XSLT, I came to the conclusion that I was not enough of a masochist to want to continue using it. After a short experiment, writing the bliki transformer in Ruby on a flight to Bangalore, I switched to Ruby using the REXML library. Now it's time to change that core library to Nokogiri.

When I started the Ruby transformer the default way to parse XML in Ruby was the REXML library. Although it had its quirks, on the whole I liked it. The API was certainly much easier to work with than the Java libraries of that era. But time has marched on. REXML is written in pure Ruby and is thus slow compared to libraries based on libxml. Other libraries have come out that provide a nicer API to work with.

The popular choice for XML parsing these days seems to be Nokogiri. As a result over the last few months I've given it a spin for a number of transformation tasks and have grown to like it. It soon became my first choice for new transformation tasks. But this still leaves the big question, should I replace REXML for my core transformations?

Until recently, my life has been dominated by the DSL book, so I didn't consider any serious work on my site generation code. Once that was done my first priority was to redo the look and feel of the site and introduce the guide pages. This didn't require much surgery to the existing Ruby code, so I left it as it was. But my next steps require more serious refactoring of that code, which pushed the thought of a Nokogiri replacement to the front of my mind.

Indeed I decided to tackle it first, for two reasons. The first is that much of the transformation code involves mucking around with XML, and I want to use Nokogiri's API to do that. The second is that my primary functional test is to rebuild the site and diff the result with the released version. Nokogiri's speed advantage (10 seconds versus 1 minute) becomes more important when I'm doing that.
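That rebuild-and-diff functional test can be sketched as a small Ruby routine. This is only an illustration of the idea, not the actual site code; the function and directory names here are invented.

```ruby
require 'find'

# Map each file under dir to its contents, keyed by relative path.
def snapshot(dir)
  files = {}
  Find.find(dir) do |path|
    next unless File.file?(path)
    files[path.delete_prefix(dir).delete_prefix('/')] = File.read(path)
  end
  files
end

# List every file that differs between the freshly built site and the
# released version; an empty list means the rebuild matched exactly.
def site_diff(built_dir, released_dir)
  built = snapshot(built_dir)
  released = snapshot(released_dir)
  (built.keys | released.keys).reject { |f| built[f] == released[f] }
end
```

The virtue of this kind of test is that it exercises the whole transformation at once, which is exactly when the library's raw speed starts to matter.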

Making the change

Replacing the XML library in a program that mainly does XML processing is often seen as a fraught task. There are REXML calls all over the code, so this is very much a global change. Since I'm the only programmer I can be more casual about this than if I were working on a team, but I still follow much the same habits that I would if working with other people.

The basic plan comes in three steps:

  • Introduce an insulating layer between my code and REXML. This way all my transformation code calls this insulating layer which then passes on the calls to REXML. At this stage the interface of the insulating layer is close to REXML.
  • Create an alternative implementation of the insulating layer that passes the same calls onto Nokogiri instead. Once I'm done with this I can build the site entirely with Nokogiri.
  • Adjust the interface and application code to change it from REXML style to Nokogiri style. Finish by removing the insulation layer.
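The first step of that plan can be sketched as a thin facade with swappable backends. All the class and method names here are invented for illustration; the real layer would mirror whichever REXML calls the transformation code actually makes.

```ruby
# The transformation code talks only to XmlElement; only the backend
# object knows which XML library is doing the work.
class XmlElement
  def initialize(backend)
    @backend = backend
  end

  def name
    @backend.name
  end

  def text
    @backend.text
  end
end

# Stand-ins for the REXML-backed and Nokogiri-backed wrappers. In step
# two the second implementation gets filled in; the calling code that
# uses XmlElement never changes.
class RexmlBackend
  def name; 'p'; end
  def text; 'hello'; end
end

class NokogiriBackend
  def name; 'p'; end
  def text; 'hello'; end
end
```

Swapping `RexmlBackend` for `NokogiriBackend` at the point where elements are created is then the whole of step two, and step three collapses the facade once only one backend remains.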

I use this approach to keep the steps smaller. Switching over to Nokogiri in one fell swoop is too big a change; instead I can gradually implement it while my site still builds fine with the REXML version until the Nokogiri implementation is complete. If I were working with others, this would be more important, as I'd need to have them building new functionality while I was doing the surgery. This way I could gradually ease them over to the insulating layer while it was being built.

There's an argument for leaving the insulating layer in place, effectively turning it into an anti-corruption layer. That would be a good idea if I wanted to use a different API from Nokogiri's. I didn't do it in this case since I actively want to use the Nokogiri API. Of course this means that should I change the library I'll have to rebuild the layer then, but I'd rather pay that price then than pay the price of dealing with an unnecessary layer now.

SnowLeopard tools 9 November 2010

I've been intending to upgrade my laptop to Snow Leopard for ages, particularly once I got Aperture 3, which I'm told works better with it. But I never quite got around to it; after all, operating system upgrades are usually such a pain. (Although Ubuntu upgrades are much less painful than most.)

But yesterday I finally decided to bite the bullet as I needed to use some software that requires it. As usual I spent ages backing up lots of stuff that was already backed up, but I always get extra-paranoid around moments like this.

The actual upgrade seems to have gone well so far. Ironically Aperture failed to start, and I needed to download the 3.1 upgrade again and reinstall it. Then it launched fine; we'll see if I notice any better performance.

The biggest yak so far was software I'd installed via MacPorts. I tried several times to get MacPorts to rebuild its software but kept failing after long compiles and inscrutable error messages. My moans on Twitter (there's something compelling about sharing your pain on Twitter) led to a flood of people suggesting Homebrew. I'd heard people speak positively of Homebrew before, and since the only thing I could think of to fix MacPorts was to completely eradicate and reinstall it, I decided to try the switch. Thus far, it's working fine.

VcsSurvey tools 8 March 2010

When I discussed VersionControlTools I said that it was an unscientific agglomeration of opinion. As I was doing it I realized that I could add some spurious but mesmerizing numbers to my analysis by doing a survey. Google's spreadsheet makes the mechanics of conducting a survey really simple, so I couldn't resist.

I conducted the survey from February 23 2010 until March 3 2010 on the ThoughtWorks software development mailing list. I got 99 replies. In the survey I asked everyone to rate a number of version control tools using the following options:

  • Best in Class: Either the best VCS or equal best
  • OK: Not the best, but you're OK with it.
  • Problematic: You would argue that the team really ought to be using something else
  • Dangerous: This tool is really bad and ThoughtWorks should press hard to have it changed
  • No opinion: You haven't used it

The results were as follows:

[Table: for each tool, the counts of Best, OK, Problematic, Dangerous, and No Opinion responses, plus the calculated Active Responses and Approval % columns]

As well as the raw summary values, I've added two calculated columns here to help summarize the results.

  • Active Responses: The total of responses excluding "No Opinion". (eg for git: 65 + 19 + 1 + 0)
  • Approval %: The sum of best and ok responses divided by active responses, expressed as a percentage. (eg for git: (65 + 19) / 85)
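The two calculated columns are simple arithmetic. Using the git numbers quoted above (65 best, 19 OK, 1 problematic, 0 dangerous), a quick check in Ruby:

```ruby
# Active responses: everything except "No Opinion".
def active_responses(best, ok, problematic, dangerous)
  best + ok + problematic + dangerous
end

# Approval %: share of active respondents rating the tool Best or OK.
def approval_pct(best, ok, problematic, dangerous)
  active = active_responses(best, ok, problematic, dangerous)
  100.0 * (best + ok) / active
end

active_responses(65, 19, 1, 0)        # => 85
approval_pct(65, 19, 1, 0).round      # => 99
```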

The graph shows a scatter plot of approval percentage and active responses. As you can see there's a clear cluster around Subversion, git, and Mercurial with high approval and a large number of responses. It's also clear that there's a big divide in approval between those three, together with Bazaar and Perforce, versus the rest.

Although the graph captures the headline information well, there are a couple of other subtleties I should mention.

  • Although the trio of Subversion, git, and Mercurial cluster close together on approval, git does get a notably higher number of "best" scores (65 versus 20 and 33).
  • VSS got the most "dangerous" responses, but a couple of people approved of it.
  • Neither TFS nor ClearCase is liked much, but ClearCase got more "dangerous" responses than TFS (41 versus 22).
  • Don't read too much into small differences as I'm sure they aren't significant. I'm sure the difference in approval percentage between VSS, TFS, and ClearCase isn't significant, but the difference between these three and the leaders is.

Some caveats. This is a survey of opinion of ThoughtWorkers who follow our internal software development discussion list, nothing more. It's possible some of them may have been biased by my previous article (although unlikely, since I've never managed to get my ThoughtBot opinion-control software to work reliably). Opinions of tools are often colored by processes that are more about the organization than the tool itself. But despite these, I think it's an interesting data point.

I should also stress the important point to take away from this isn't the comparison between those close in the numbers, eg comparing git and Mercurial or comparing TFS and ClearCase. Any survey like this has a certain amount of noise in it, and I suspect the noise here is greater than such a difference. The important point is the big approval gap between the leading tools (Subversion, git, and Mercurial) and the laggards - essentially the point in VersionControlTools.

VersionControlTools tools 17 February 2010

If you spend time talking to software developers about tools, one of the biggest topics I hear about is version control tools. Once you've got to the point of using version control tools, and any competent developer does, then they become a big part of your life. Version control tools are not just important for maintaining a history of a project; they are also the foundation for a team to collaborate. So it's no surprise that I hear frequent complaints about poor version control tools. In our recent ThoughtWorks technology radar, we called out two items as version control tools that enterprises should be assessing for use: Subversion and Distributed Version Control Systems (DVCS). Here I want to expand on that, summarizing many discussions we've had internally about version control tools.

But first some pinches of salt. I wrote this piece based on an unscientific agglomeration of conversations with my colleagues inside ThoughtWorks and various friends and associates outside. I haven't engaged in any rigorous testing or structured comparisons, so like most of my writing this is based on AnecdotalEvidence. My personal experience in recent years is mostly subversion and mercurial, but my usage patterns are not typical of a software development team. Overwhelmingly my contacts like to work in an agile-xp approach (even if many sniff at the label) and need tools that support that work style. I expect many people to be annoyed by this article. I hope that annoyance will lead to good articles that I can link to.

(After writing this I did do a small VcsSurvey which didn't undermine my conclusions.)

Fundamentally there are three version control systems that get broad approval: subversion (svn), git, and mercurial (hg).

Behind the Recommendability Threshold

Many tools fail to pass the recommendability threshold. There are two reasons why: poor capability or poor visibility.

Many tools garner consistent complaints from ThoughtWorkers about their lack of capability. (ThoughtWorkers being what they are, all tools, including the preferred set, get some complaints. Those behind the threshold get mostly complaints.) Two in particular generate a lot of criticism: ClearCase (from IBM) and TFS (from Microsoft). One reason they get a lot of criticism is that they are very popular on client sites, often with company policies mandating their use (I'll describe a coping strategy for that at the end).

It's fair to say that often these problems are compounded by company policies around using VCS. I've heard of some truly bizarre work-flows imposed on teams that make it a constant hurdle to get anything done. Since the VCS is the tool that enforces these work-flows, it does tend to get tarred with that brush.

I'm not going to go into details here about the problems the poor-capability tools have; that would be another article. (This has probably made me even more unpopular in IBM and Microsoft as it is.) I will, at least for the moment, leave it with the fact that developers I respect have worked extensively with, and do not recommend, these products.

The second reason for shuffling a tool behind the recommendability threshold is that I don't hear many comments about it. This is an issue because less-popular tools make it difficult to find developers who know how to use them or want to find out. There are many reasons why otherwise good tools can fall behind here. I used to hear people say good things about Perforce, but now the feeling seems to be that it doesn't have compelling advantages over Subversion, let alone the DVCSs. Speaking of DVCSs, there are more than just the two I've highlighted here. Bazaar, in particular, is one I occasionally hear good things about, but again I hear about it much less often than git or Mercurial.

Before I finish with those behind the threshold, I just want to say a few things about a particularly awful tool: Visual Source Safe, or as I call it: Visual Source Shredder. We see this less often now, thank goodness, but if you are using it we'd strongly suggest you get off it. Now. Not only is it a pain to use, I've heard too many tales of repository corruption to trust it with anything more valuable than foo.txt.

So this leaves three tools that my contacts are generally happy with. I find it interesting that all three are open-source. Choosing between these tools involves first deciding between a centralized or distributed VCS model and then, if you choose DVCS, choosing between git and mercurial.

Distributed or Centralized

Most of the time, the choice between centralized and distributed rests on how skilled and disciplined the development team is. A distributed system opens up lots of flexibility in work-flow, but that flexibility can be dangerous if you don't have the maturity to use it well. Subversion encourages a simple central repository model, discouraging large scale branching. In an environment that's using Continuous Integration, which is how most of my friends like to work, that model fits reasonably well. As a result Subversion is a good choice for most environments.

And although DVCSs give you lots of flexibility in how you arrange your work-flows, most people I know still base their work patterns on the notion of a shared mainline repository that's used with Continuous Integration. Although modern VCSs have almost magical tools to merge different people's changes, these merges are still just merging text. Continuous Integration is still necessary to get semantic consistency. So even a team using DVCS usually still has the notion of a central master repository.

Subversion has three main downsides compared to its cooler distributed cousins.

Because distributed systems always give you a local disk copy of the whole repository, repository operations are always fast as they don't involve network calls to central servers. This is a palpable difference if you are looking at logs, diffing to old revisions, or doing anything else that involves the full repository. If this is noticeable on my home network, it is a huge issue if your repository is on another continent - as we find with our distributed projects.

If you travel away from your network connection to the repository, a distributed system will still allow you to work with the repository. You can commit checkpoints of your work, browse history, and compare revisions on an airplane without a network connection.

The last downside is more of a social issue than a true tool issue. DVCS encourages quick branching for experimentation. You can do branches in Subversion, but the fact that they are visible to all discourages people from opening up a branch for experimental work. Similarly a DVCS encourages check-pointing of work: committing incomplete changes, that may not even compile or pass tests, to your local repository. Again you could do this on a developer branch in Subversion, but the fact that such branches are in the shared space makes people less likely to do so.

This last point also leads to the argument against a DVCS: that it encourages wanton branching, which feels good early on but can easily lead you to merge doom. In particular the FeatureBranch approach is a popular one that I don't encourage. As with similar comments earlier I must point out that reckless branching isn't something that's particular to one tool. I've often heard people in ClearCase environments complain of the same issue. But DVCSs encourage branching, and that's the major reason why I say a team needs more skill to use a DVCS well.

There is one particular case where subversion is the better choice even for a team that is skilled at using a DVCS. This is where the artifacts you're collaborating on are binary and cannot be merged by the VCS - for example Word documents or presentation decks. In this case you need to revert to pessimistic locking with single-writer checkouts - and that requires a centralized system.

Git or Mercurial

So if you're going to go the DVCS route - which one should you choose? Mercurial and git get most of the attention, so I feel the choice is between them. Then the choice boils down to power versus usability, with a dash of mind-share and the shadow of github.

Git certainly seems to be liked for its power. Folks go ga-ga over its near-magical ability to do textual merges automatically and correctly, even in the face of file renames. I haven't seen any objective tests comparing merge capabilities, but the subjective opinion favors git.

(Merge-through-rename, as my colleague Lucas Ward defines it, describes the following scenario. I rename Foo.cs to Bar.cs, Lucas makes some changes to Foo.cs. When we merge his changes are correctly applied to Bar.cs. Both git and Mercurial handle this.)

For many, git's biggest downside was its oft-cryptic commands and mental model. Ben Butler-Cole phrased it beautifully: "there is this amazingly powerful thing writhing around in there that will basically do everything I could possibly ask of it if only I knew how." To its detractors, git lacks discoverability - the ability to gradually infer what it does from its apparent design. Git's advocates say that much of this is because it uses a different mental model from other VCSs, so you have to unlearn more of your knowledge of VCSs to appreciate git. Whatever the reason, git seems to appeal more to those who enjoy learning the internals, while mercurial seems to appeal more to those who just want to do version control.

The shadow of github is important here. Even git-skeptics rate it as a superb place for project collaboration. Mercurial's equivalent, bitbucket, just doesn't inspire the same affection. However there are other sites that may begin to close the gap, in particular Google Code and Microsoft's Codeplex. (I find Codeplex's use of Mercurial very encouraging. Microsoft is often, rightly, criticized for not collaborating well with complementary open source products. Their use of Mercurial on their open-source hosting site is a very encouraging sign.)

Historically git worked poorly on Windows, poorly enough that we'd not suggest it. This has now changed, providing you run it using msysgit and not cygwin. Our view now is that msysgit is good enough to make comparison with Mercurial a non-issue for Windows.

People generally find that git handles branching better than Mercurial, particularly for short-lived branches used for experimentation and check-pointing. Mercurial encourages other mechanisms, such as fast cloning of separate repository directories and patch queues, but git's branching is a simpler and better model.

Mercurial does seem to have an issue with large binary files. My general suggestion is that such things are usually better managed with subversion, but if you have too few of them to warrant separate management, then Mercurial may get hung up by the few that you have.

Multiple VCS

There's often value to using more than one VCS at the same time. This is generally where there is a wider need to use a less capable VCS than your team wants to use.

The case that we run into frequently is where there is a corporate standard for a deficient VCS (usually ClearCase) but we wish to work efficiently. In that case we've had success using a different VCS for day-to-day team work and committing to the corporate VCS when necessary. So while the team VCS will see several commits per person per day, the corporate VCS sees a commit every week or two. Often that's what the corporate admins prefer in any case. Historically we've done this using svn as the local VCS but in the future we're more likely to use a DVCS for local fronting.

This dual usage scenario is also common with git-svn where people use git locally but commit to a shared svn system. Git-svn is another reason for preferring git over mercurial. Using a local DVCS is particularly valuable for remote site working, where network outages and bandwidth problems can cripple a site that's remote from a centralized VCS.

A lot of teams can benefit from this dual-VCS working style, particularly if there's a lot of corporate ceremony enforced by their corporate VCS. Using dual-VCS can often make both the local development team happier and the corporate controllers happier as their motivations for VCS are often different.

One Final Note

Remember that although I've jabbered on a lot about tools here, often it's the practices and workflows that make a bigger difference. Tools can certainly make it much easier to use a good set of practices, but in the end it's up to the people to use an effective way of working for their environment. I like to see approaches that allow many small changes that are rapidly integrated using Continuous Integration. I'd rather use a poor tool with CI than a good tool without.

MercurialSquashCommit tools 9 July 2009

I've recently had a bit of a fiddle squashing some commits with Mercurial, so thought it was worth a post in case anyone else is looking to do this. I don't know whether this is the best procedure, but it seemed to work pretty well for me.

hg clone base working
# tip of base is revision 73
cd working
# do work, committing on the way
cd ..
hg clone working squash
cd squash
hg qimport -r 74:tip
hg qgoto 74.diff
hg qfold $(hg qunapp)
hg qrefresh -m "reorganized files"
hg qfinish -a
cd ../base
hg pull ../squash

The basic task I was doing was some fairly severe moving around of files and folders. I wanted to do this in several steps to checkpoint my work as I went, but I wanted a single commit in the version history. (I gather git does this more easily with rebase.) Making a single commit makes it easier to understand what happened - particularly since moving files tends to complicate looking at repository logs. Moving files also complicates the process - a couple of times I ended up with a procedure that didn't work because it lost the ability to track the moves - I want to be able to run hg log -f and see when and what the original commits were before the move.

To begin I needed to enable the mq extension (mercurial queues) and set my diffs to git style. Git style diffs help to track file moves properly.

# in ~/.hgrc
[extensions]
mq =

[diff]
git = true
When using Mercurial in this way, it seems the general way of working is to have multiple repositories. Mercurial encourages different repositories where other systems, eg git or svn, would use different branches. People argue about this, but it's the Mercurial way of working. For this example I had 'base' as my original repos.

My first step was to clone base into a working repos.

hg clone base working

At this point the tip of base (and working) was revision 73. I did the file moves, with several checkpoint revisions as I went.

cd working
hg mv foo1 newdir/foo1
.. more hg mv ..
hg ci -m "moving around"
.. more hg mv ..
hg ci -m "moving around"
.. more hg mv and hg ci..
cd ..

By the time I was done the last revision was 80.

To squash them down into a single commit I cloned another repos.

hg clone working squash

It's important to clone at this point because I was about to edit history, so wanted to keep the original history handy until I knew it had worked. I now moved into there.

cd squash

Now I turned all the commits I'd done for the revisions into patches for the mercurial patch queue mechanism.

hg qimport -r 74:tip

I made the first change the current patch.

hg qgoto 74.diff

I squashed all the patches together into a single patch.

hg qfold $(hg qunapp)

The commit message for this folded patch would be all the individual commit messages linked together. I wanted a single message for my clean commit.

hg qrefresh -m "reorganized files"

I then turned the patch into a regular commit.

hg qfinish -a

I now had a single commit with all that work. I looked through it to see that everything was sane, in particular testing hg log -f on some moved files to ensure the history was still there. Once I was convinced all was well, I pulled the single changeset into the base repos.

cd ../base
hg pull ../squash

It's interesting to see how the attention on version control systems has changed over the years. Early on the primary and only purpose was audit - to be able to safely go back to older revisions - mainly to diagnose problems. Then attention switched to how they enabled collaboration between people. This didn't replace the need for audit, but built on top of it. Now there's more attention to using them to provide a narrative of how a code base changes - hence the desire for history rewriting commands like this. Again this need is built on top of the other two, but introduces new capabilities and new tensions.

My thanks to my colleague Chris Turner for his help and I also found this page very useful.

Android tools 6 July 2009

One of the side benefits of speaking at the Google IO conference last month was that I got a new phone - the HTC Magic android phone that Google gave to all attendees. I was actually in the market for changing my phone to something like this, so it came at a good time. Here are my impressions after carrying it around for a month or so.

My previous phone was a Nokia E61. I liked the E61 as a phone, but found its web browser to be slow and unreliable, and that, together with the relatively small screen, was beginning to bug me - hence the desire for something else. Naturally I considered an iPhone, but although the company phone plan that I use is AT&T, it isn't possible to use the iPhone on it and I didn't fancy the hassle of sorting out a new phone plan. I tried a Blackberry Storm for a few days, but (how about this for irony) the email was no good for me. Blackberries copy every email that comes into the email account, so it doesn't work well for an IMAP account with server-side filters - which is how I use my gmail account.

The short statement is that I do like the HTC Magic android.

The Good

  • Physically the device works very well for me. It's small, light, and fits well in my hand. The screen is bright, making web browsing much nicer than with the Nokia.
  • Battery life seems reasonable, a day or two with my usual usage.
  • The app market seems to have a fair few useful things; I've downloaded a bunch of little apps which have seemed handy.
  • Video play works well. I've watched some TED videos and transcoded some other video using Handbrake which played well on the screen.
  • I like that I can upgrade the memory using a micro-SD card. It came with 2GB, and I'm upgrading to 8GB since it's pretty cheap.
  • I use gmail and Google calendar and the phone syncs nicely with those.
  • The phone charges via a mini-USB connection. One less charger to have to carry around.
  • I read one of the prags's books using an ebook reader and it worked pretty well.

The Bad

  • My biggest irritant so far is that it makes it hard to browse local HTML pages. It doesn't support file:// URLs. This is a big issue for me as I often copy static HTML files to my phone for reference purposes. There is a work-around, but it's kludgy.
  • Like every other calendar app on the planet, Google calendar suffers from TimeZoneUncertainty. This is a big issue with a phone whose time zone you want to change as you travel.
  • I miss the Nokia's keyboard when typing. The soft keyboard just doesn't work as well.
  • While the touch navigation works pretty well, I'm sure I'd prefer the iPhone's multi-touch gestures.

The Uncertain

  • I haven't tried writing an app for it. I'd like to experiment, but I'm not allowing myself any such fun until I get the book finished.