Technical Notes

This website is constructed with the help of Apache, Blosxom (with several plugins), a bit of JavaScript and Web Standards. It is very much a work in progress, but then point me to a weblog that isn’t. Here are some notes on the techniques used.

Apache tricks

I use a couple of simple .htaccess tricks to help create more human-friendly URLs. Most of these hacks use mod_rewrite, and none are particularly esoteric. The comments in the following extract from the site-wide .htaccess should explain what’s going on:

# This little trick allows me to rename the blosxom
# script from the cumbersome script.cgi format to
# something simpler (in this case, 'sam'):
<Files sam>
  SetHandler cgi-script
</Files>

# Rewrite rules:
RewriteEngine On

# As I prefer to use just 'sgp.me.uk' as the base
# domain, this rule redirects requests for
# 'www.sgp.me.uk'.  This both allows the use of the
# 'www.' prefix while hopefully encouraging people
# not to use it by visibly redirecting rather than
# simply rewriting.
RewriteCond %{HTTP_HOST} ^www\.sgp\.me\.uk$ [NC]
RewriteRule ^(.*)$ http://sgp.me.uk/$1 [R=301,L]

# Pages devoted to baby photos have been integrated
# into my blog. This rule preserves the old
# hierarchy and provides me with a really simple
# URL to point family to for these photos:
RewriteRule ^uma(.*)$ /sam/uma$1

# I liked the idea of losing the required
# 'index.$flavour' suffixes for rss feeds:
RewriteRule ^sam(.*)/rss$ /sam$1/index.rss
RewriteRule ^sam(.*)/rdf$ /sam$1/index.rdf
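
To see what those rules do to a request, here is a rough Python model of the pattern matching. It is only a sketch under stated assumptions: the function names are invented for the example, it applies just the first rule that matches, and it assumes (as in an .htaccess file) that the leading slash is stripped before matching. Real mod_rewrite does rather more than this.

```python
import re

# (pattern, substitution) pairs mirroring the RewriteRule lines above.
# Apache writes back-references as $1; Python's re module uses \1.
RULES = [
    (r"^uma(.*)$", r"/sam/uma\1"),
    (r"^sam(.*)/rss$", r"/sam\1/index.rss"),
    (r"^sam(.*)/rdf$", r"/sam\1/index.rdf"),
]


def redirect_host(host, path):
    """Model the www rule: return a redirect target, or None if no redirect."""
    if re.match(r"^www\.sgp\.me\.uk$", host, re.IGNORECASE):  # [NC] flag
        return "http://sgp.me.uk/" + path.lstrip("/")
    return None


def rewrite(path):
    """Return the internally rewritten path (first matching rule only)."""
    relative = path.lstrip("/")  # per-directory rules see no leading slash
    for pattern, substitution in RULES:
        if re.search(pattern, relative):
            return re.sub(pattern, substitution, relative)
    return path  # nothing matched, so the request is left alone
```

With this model, rewrite("/sam/tech/rss") gives "/sam/tech/index.rss", and redirect_host("WWW.sgp.me.uk", "/sam") gives "http://sgp.me.uk/sam".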

Blosxom configuration

The site uses Blosxom 2.0. The script has been slightly modified to remove the code used to escape HTML when outputting RSS (see this post for more information on this and a patch). I also use the following plugins, some of which I wrote myself:

Standards

The website is designed to be accessible in all of the major browsers (having said that, it does render best in Firefox because of -moz-border-radius). It uses vaguely semantic HTML for structure and CSS for layout. The HTML template is written to validate against the XHTML 1.0 Strict Doctype, and the xhtmlmime plugin means that non-well-formed entries quickly get picked up. The RSS feeds are all validated against the Feed Validator from time to time.

To do

  • application/xhtml+xml (done!)
  • Sort out some better page titles (done!)
  • Comments/Writebacks
  • Tags (think del.icio.us). Probably implement this using a hacked version of the meta plugin
  • Atom feed, once the spec is finalised (done!)
  • More stuff as I think of it
  • Oh yeah, actually post some real content

Blosxom and application/xhtml+xml

Since this website is written to the XHTML 1.0 Strict Doctype, I thought it would be nice to serve it with the correct MIME type to conforming user-agents. I remembered hearing about a plugin called xhtml that would do this, but when a cursory search came up with nothing I decided that I’d just write my own.

So here’s xhtmlmime. It uses CGI.pm to sniff the Accept: HTTP header from the user-agent and then has blosxom serve the page with the preferred MIME type. There are two variables that need to be set:

  • $flavours needs to be set to a list of flavours upon which to act. This defaults to empty, and the plugin will exit quietly until you set it.
  • $charset should be set to the character encoding used on your weblog. This defaults to utf-8.

You can force the plugin to send the application/xhtml+xml MIME type by specifying a URL parameter of mime=xhtml. Setting mime to anything else will result in the user getting text/html.
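
The negotiation described above can be sketched in Python. This is not the plugin's own code (which is Perl and uses CGI.pm); the function names, the rule that XHTML wins when its q-value is at least that of text/html, and the decision to ignore wildcards such as */* are all assumptions made for illustration. The $flavours check is left out for brevity.

```python
XHTML = "application/xhtml+xml"
HTML = "text/html"


def parse_accept(header):
    """Map each media type in an Accept header to its q-value (default 1.0)."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparseable q-value as not acceptable
        prefs[media_type] = q
    return prefs


def choose_mime(accept_header, mime_param=None):
    """Pick the MIME type for a response."""
    if mime_param is not None:
        # The mime URL parameter overrides the Accept header entirely.
        return XHTML if mime_param == "xhtml" else HTML
    prefs = parse_accept(accept_header or "")
    xhtml_q = prefs.get(XHTML, 0.0)
    # Assumption: wildcards are ignored, so a bare */* gets text/html.
    if xhtml_q > 0 and xhtml_q >= prefs.get(HTML, 0.0):
        return XHTML
    return HTML


def content_type(mime, charset="utf-8"):
    """Build the Content-Type header value from the MIME type and charset."""
    return f"{mime}; charset={charset}"
```

A browser sending "application/xhtml+xml,text/html;q=0.9" would get application/xhtml+xml under these rules, while one sending only "*/*" would get text/html unless the URL carries mime=xhtml.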

Update 2005-07-21

The plugin now exports a variable – $xhtmlmime::meta_http_equiv – for use in your head templates. If you use a meta element with an http-equiv attribute, set it as follows:

And the plugin will ensure that it is set correctly.

Warning

As noted by Bill Lovett, serving as application/xhtml+xml raises a couple of issues. The most important one is that this will lead to very strict interpretation of your pages by the web browser, so unless your pages are well-formed – contain no mistakes in the markup – your visitors will just get error messages! Bad plugin!

So, before using this plugin you need to be confident that this is the case, and that you have some method for ensuring that only well-formed, valid markup ends up on your pages.

Feedback

If you have any feedback, either contact me directly or post a message to the blosxom mailing list.

Clanking Replicator

Via the BBC, news that researchers at Cornell University have constructed a self-replicating machine and published their notes in Nature. This one is certainly a “Clanking Replicator” type and not a more exotic nanoassembler, but it’s pretty cool stuff nevertheless. I for one welcome the arrival of our new robotic overlords and look forward with great anticipation to the launch of our first von Neumann probe and the beginning of (post)human galactic domination – Bwahahaha!

Hide plugin and date-based URLs

The hide plugin allows you to conceal certain posts and/or categories from the standard blosxom index pages – this works for either path or date based URLs. A side effect of this is that if you are using date-based links like me – http://example.com/YYYY/MM/DD/filename – then the posts won’t show up when you visit their permalinks – they’ll only display when referenced with a URL like http://example.com/path/to/filename.flavour.

A solution is to add a quick test to the start subroutine in the hide plugin that checks to see whether blosxom knows about any date-based path information:

sub start {
  # Stand down when the URL carries a day, as a dated permalink does
  $blosxom::path_info_da and return 0;
  ...

As I don’t want posts in about/ to display in the monthly archives, I used the $path_info_da variable, but you could just as easily use $path_info_yr or $path_info_mo instead to allow the posts to display on less specific date-based index pages.
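
The trade-off between the three variables can be laid out as a small Python model. It illustrates the logic only and is not code from the hide plugin; the function name and the dictionary of date parts are made up for the example.

```python
def hide_is_active(date_parts, test="da"):
    """Return True if the hide plugin would still conceal posts.

    date_parts holds whichever of 'yr', 'mo' and 'da' blosxom found in the
    URL; test names the one checked at the top of start().
    """
    return not date_parts.get(test)


permalink = {"yr": "2005", "mo": "07", "da": "21"}
monthly_archive = {"yr": "2005", "mo": "07"}
```

Testing on 'da' keeps hidden posts out of the monthly archive but lets them through on a dated permalink; testing on 'yr' would let them through on every date-based page.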

Stripping out <script> tags from RSS feeds

It’s good practice not to include <script> tags inside your RSS feeds, and feedvalidator.org will flag them if you do, so here’s a plugin that strips them out for you.

By default, the plugin will attempt to strip <script> tags from the $body of your posts when $blosxom::flavour equals “rss”. This is configurable, of course.

Note that the plugin is not designed as a security precaution, so the regular expression used to find the tags isn’t particularly sophisticated. It is case-insensitive but expects the tags themselves to be unbroken by white space or line breaks; the contents of the tags, however, can be arranged in any way you please.
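
As a rough idea of the kind of matching involved, here is a Python sketch. The regular expression is a guess at a pattern with the properties described above, not the one the plugin actually uses, and the function and parameter names are hypothetical. Like the plugin, it is no defence against deliberately malformed markup.

```python
import re

# Case-insensitive; DOTALL lets the contents of the tag span several lines.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)


def strip_scripts(body, flavour, flavours=("rss",)):
    """Remove <script>...</script> blocks when the flavour is one we act on."""
    if flavour not in flavours:
        return body  # other flavours are passed through untouched
    return SCRIPT_RE.sub("", body)
```

A tag written as "< script>" slips straight through, which is the limitation noted above.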

Auto escaping HTML in RSS with blosxom

Out of the box, blosxom comes with simple HTML and RSS formats built in. In order to get the RSS 0.91 feed set up correctly, blosxom escapes HTML tags during the story generation phase where it finds an XML content-type (specifically: $content_type =~ m{\Wxml$}).

This feature is fine until you start experimenting with your RSS feeds, at which point it can become a bug. For example, if you decide to provide the full text for each story in a feed enclosed in a CDATA section, you don’t want this escaping to take place. So I recommend that you move the escaping section out of blosxom itself and into a plugin, where you can call it if you need it.

The escaping code is lines 378-384 in the standard blosxom 2.0 script. Just for fun I’ve made a patch, but it’s probably simpler to just open a text editor and do it by hand. Here’s a simple plugin that allows you to control whether or not the code is called during the story generation phase.
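
The idea of making the escaping optional can be sketched in Python. This models the logic rather than reproducing the patch or the plugin: the function names and the escape_html switch are hypothetical, and the set of escaped characters (<, >, & and the double quote) is an assumption about what the stock script handles.

```python
import re

# Assumed escape table; each character is replaced in a single pass so that
# the ampersands produced by earlier replacements are not escaped again.
ESCAPES = {"<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;"}
ESCAPE_RE = re.compile('[<>&"]')

# Same test as blosxom's: a non-word character followed by "xml" at the end.
XML_TYPE_RE = re.compile(r"\Wxml$")


def escape(text):
    """Escape markup characters in one pass."""
    return ESCAPE_RE.sub(lambda m: ESCAPES[m.group(0)], text)


def story(title, body, content_type, escape_html=True):
    """Return (title, body), escaped only for XML content types when enabled."""
    if escape_html and XML_TYPE_RE.search(content_type):
        return escape(title), escape(body)
    return title, body
```

Passing escape_html=False is the equivalent of leaving the plugin switched off, which is what you would want for a feed that wraps each story in a CDATA section.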

Untainting blosxom

If you want to run Blosxom in taint mode (using the -T switch), you’ll hit some problems with 2.0 as it comes out of the box. Joe Landman posted a patch dealing with one issue on the mailing list over the weekend, but if you’re using any plugins then you’ll immediately encounter another problem when blosxom does a readdir() on your $plugins_dir. Here’s a patch that solves both these issues.

Blog spam and unmaintained lists of referers

I wish people who don’t closely maintain their blogs would stop publishing lists of top referers (unfortunately we are stuck with this illiterate spelling of “referrer” in this context). These lists are often crammed full of referer spam, and help to provide the incentive for the idiots who do this kind of thing to carry on.

While I’ve almost completely stopped getting comment spam – partly, I think, due to closing comments on posts older than 2 weeks – I still get vast quantities of referer spam despite not publishing my logs. Referer spamming my site gets the spammer nothing; I can only assume that they spam massive numbers of sites knowing that a proportion will blindly go on publishing their logs, thus hopefully adding to the spammer’s PageRank and at least getting them the odd click-through. (Comment spammers often also leave referer spam with their comments, just to add insult to injury, so perhaps the referer spam I see is the result of failed attempts to comment spam.)

Referer spam isn’t quite so annoying as comment spam, but it’s irritating none the less – on a fairly low traffic site like this one, legitimate referers get lost in amongst the crap from spammers, and like most people who maintain a website I like to know who is linking to me. Wading through spammed logs is just another chore I could do without – and I can think of lots of things I’d rather be doing than continually updating my blacklist.

So if you run a blog, and don’t have the time or expertise to prevent spam appearing in your referer list, please don’t publish one. You’ll be doing your little bit to help remove the incentive for this kind of thing.

Government + IT = Farce

The BBC headline says it all: “Benefits computer failure chaos”. Do we trust these people to manage the proposed National Identity Register (PDF)? I mean, this isn’t the first time there’s been a complete and utter collapse of a government IT project – in the last week or so the Child Support Agency has also been in the news for similar reasons, and that’s just one example of many. Even if you’re not concerned about the civil liberties implications of the register and card, there are serious questions to ask about the competence of the government to run a scheme like this.

Update: I’ve just read that the catastrophe happened due to a screw-up during a trial upgrade of some of their computers from Windows 2000 to XP. Oh, the joys of Microsoft.

No software patents!

The proposed European software patent legislation could have some pretty serious effects, not least on the ability of smaller players to compete with software giants like Microsoft. Today three European open source luminaries – Linus Torvalds, Michael Widenius and Rasmus Lerdorf – lent their support to the anti-patents campaign. Let’s hope this helps, and that this short-sighted and foolish legislation is kept out of our legal system. (Via Simon Willison.)