Sunday, January 28, 2007; 3:14 pm. An email arrives. It’s from Barbara, with the subject “[Tom's Blog] New Comment Posted to 'Some mashups'”. Wow! Somebody is reading my blog; let’s check and publish the comment… Oh, but what happened?
Lesbian Sistas say never planned her tryst lesbian kissing [...]
Barbara-- please! not here!
As Bud pointed out in one of the first lectures, the robots and spiders have finally found us and are trying to put spam onto Learningremix. But a defense against these web crawlers has existed for a long time, at least against Google’s long arms: the robots.txt file. Here is an excerpt from Wikipedia’s description:
The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.
However, robots.txt doesn’t seem to fit very well into Web 2.0, where the emphasis is on sharing and spreading knowledge, including through search engines. Another problem is that many of today’s crawlers may simply ignore the file and violate the politeness policy. Anyway, to protect my intellectual property (often referred to as bullshit ;-) from being cached by Google or any other crawler, I just uploaded a robots.txt that forbids all of them from entering my directory. Let’s see if it works.
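For the curious, here is a minimal sketch of what such a file could look like, run through Python's standard urllib.robotparser module to show what a cooperating crawler would conclude from it. The /blog/ path, the example.com host and the helper name is_allowed are placeholders for illustration, not my real setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler ("*") is asked to stay out of /blog/.
# The directory name is a placeholder, not the real one.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/blog/some-mashups.html",
        "http://example.com/index.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Note that this only models a polite crawler, one that asks before it fetches. A bot that never reads the file is not stopped by anything in it.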
200701300039
Comments (1)
Hate to say it, but spammers don't respect robots.txt. It's voluntary.
Posted by Bud Gibson | January 30, 2007 8:40 AM