I got a hit in my logs recently from the NameProtect web crawler. Perhaps not such a big deal on the surface – the user-agent string provides the above URL for more info, and a quick visit reveals some fairly standard crawler documentation, including a claim that the NPBot will honour your site’s robots.txt file.
(For the uninitiated: web crawlers, also known as robots, spiders or bots, are automated programs that wander the great network of links that makes up the web for a variety of tasks – some helpful, some nefarious. Possibly one of the most famous is the Googlebot, which helps keep the Google search engine up to date. A well-behaved crawler will first read a file on a website called robots.txt and follow the instructions it contains – you might not want a site to be indexed by a search engine, for instance.)
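To make that a bit more concrete, here is a rough sketch of the check a polite crawler might make before fetching a page. It uses Python's standard urllib.robotparser module. The bot name "ExampleBot", the example.com URLs and the may_fetch helper are all made up for illustration, and real crawlers will differ in the details.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: keep ExampleBot out of /private/ only
EXAMPLE_RULES = """\
User-agent: ExampleBot
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("http://example.com/private/page.html",
                "http://example.com/index.html"):
        print(url, may_fetch(EXAMPLE_RULES, "ExampleBot", url))
```

Note that nothing forces a crawler to run a check like this; robots.txt only works if the bot chooses to honour it.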
So what, you might think. Well, NameProtect describe themselves like this:
NameProtect(R) is a Digital Brand Protection company that provides a comprehensive suite of Trademark Research, Trademark Watching and advanced Online Brand Monitoring services that assist trademark professionals in meeting the evolving Intellectual Property challenges of the digital era.
And they describe the reasons for their crawler activity thus:
As a Digital Brand Asset Management company, NameProtect engages in crawling activity in search of a wide range of brand and other intellectual property violations that may be of interest to our clients.
So, they are sniffing around the Internet looking for any copyright infringements they might be able to make a bit of cash from. Even so, you still might think, so what – people who infringe copyright deserve what they get, right? Well, that’s a discussion for another day, but I felt sure I’d heard of these guys before, so I did a bit of surfing. Sure enough, they’ve been shown to have lied in the past about the behaviour of their web crawler – it has allegedly been known to ignore directives in robots.txt files – and to have used unidentified crawlers to harvest websites alongside their documented ones. Now that just shouldn’t be encouraged.
Since the info I’ve dug up isn’t exactly recent, I’ve decided to create a robots.txt for this site, and chuck them in it:
User-agent: NPBot
Disallow: /
and to keep an eye out to see if the NPBot or anything that looks like it returns. If it does, I’ll ban their IP address range as suggested on the sites linked to above.
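Keeping an eye out by hand gets tedious, so here is a minimal sketch of how the log watching could be automated. It assumes the server writes the common Apache-style "combined" log format, which may not match your setup. The sample log lines, the addresses (taken from the ranges reserved for documentation) and the user-agent strings are invented for the example, not copied from real NPBot traffic.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def hits_by_ip(lines, needle="NPBot"):
    """Count requests per client IP whose user-agent contains needle."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Case-insensitive substring match on the user-agent field
        if match and needle.lower() in match.group("agent").lower():
            counts[match.group("ip")] += 1
    return counts


# Invented sample data, for illustration only
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2000:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 26 "-" "NPBot (made-up example string)"',
    '192.0.2.10 - - [01/Jan/2000:10:00:05 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "NPBot (made-up example string)"',
    '198.51.100.7 - - [01/Jan/2000:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (made-up browser)"',
]

if __name__ == "__main__":
    for ip, count in hits_by_ip(SAMPLE_LOG).most_common():
        print(ip, count)
```

A request for anything other than robots.txt from a matching user agent, after the Disallow rule is in place, would suggest the bot is ignoring it. A crawler that hides behind a different user-agent string will of course slip past a check like this.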