Friday 25 January 2013

robots.txt retroactively removes content from Wayback Machine

There is a fun experiment anyone can try: create some unique piece of information and store it on two kinds of media: printed on a sheet of paper, and saved in a file on a USB pen drive. Then put both objects on a hard, solid surface like a concrete floor. Take a hammer and hit the paper hard, once. Then hit the body of the USB pen drive with the same force. Next, try to recover the information from both media. The paper may have a hole in it, and perhaps a few words will be impossible to read. The pen drive, however, is likely to be a total loss. If the silicon chip is cracked, your only option is to bring it to a specialized laboratory, which will charge you a fee you cannot even imagine just for a tiny chance of recovering perhaps a few words of the text.
What I am saying with this whole story is that I laugh at every advert that claims “save your old photos by scanning them with our digital photo scanner!” The easier it is to create and replicate information, the easier it generally is to lose it as well. I am certain that at some point in history there will be something like a super-sized version of that hammer, hitting our fragile digital archives. If it is bad enough, humanity will be catapulted back to medieval times and history will be a black hole starting from around the year 2000. Maybe they will believe the world really did go to hell at the end of 2012. If you have something you really want to preserve, make a hard copy of it. No: make as many copies of it as possible, on all kinds of media.
Now, this whole introduction serves to illustrate how grave a certain issue with the Internet Archive's “Wayback Machine” is. The Wayback Machine is a great initiative: its goal is to create digital archives of old websites. I once believed that once a website was archived, it would stay accessible until either the whole Wayback Machine was destroyed or someone explicitly asked for the information to be deleted. Now, however, I have discovered that information can also disappear in a far more trivial and dumb way.
If someone places a ‘robots.txt’ file on a domain that prohibits crawlers from retrieving its content, the Internet Archive will apply this prohibition retroactively. There is logic behind this: if someone notices that a confidential website has leaked and has been archived over the past months, this system allows them to remove the archive without much fuss. The mechanism, however, is dumb as a brick: if a domain expires and is subsequently bought by someone who has no rights whatsoever to the original content, they can still put anything they want in the robots.txt and retroactively remove everything from the archive.
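To make that concrete: the two lines “User-agent: *” and “Disallow: /” are all it takes to shut out every crawler. The sketch below, which uses Python's standard urllib.robotparser module and a placeholder domain, shows how an obedient crawler interprets such a file; it illustrates the general robots.txt convention, not the Wayback Machine's own code.

    # A minimal sketch, using Python's standard urllib.robotparser, of how an
    # obedient crawler interprets a blanket robots.txt. The domain and page
    # below are placeholders for illustration only.
    import urllib.robotparser

    rules = [
        "User-agent: *",   # applies to every crawler
        "Disallow: /",     # forbids fetching anything on the domain
    ]

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)

    # A well-behaved crawler now refuses to fetch any page on the domain; the
    # Wayback Machine additionally stops serving pages it archived before these
    # rules ever existed.
    print(rp.can_fetch("ia_archiver", "https://example.com/old-page.html"))  # False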
It turns out that there are many domain name squatters who buy old domains and place a prohibitive robots.txt on the empty “for sale” page because they do not want it to litter search engines, which in itself is a good thing. What is bad, however, is that this instantly hides the entire archive of the old website in the Wayback Machine. There is no justification for this aside from laziness of the programmers and excessive prudence. The squatter has no right whatsoever to influence the information that was stored on the old website; he has only bought a domain name. Therefore, I would greatly appreciate it if the people responsible for the Wayback Machine would implement a better way to balance legal concerns against the completeness of their valuable archive.
