this is a fantastically cool idea

Posted on September 8, 2006 by Lessig

Check out webcitation.org — a project run at the University of Toronto. The basic idea is to create a permanent URL for citations, so that when the Supreme Court, e.g., cites a webpage, there’s a reliable way to get back to the webpage it cited. They do this by creating a reference URL, which then will refer back to an archive of the page created when the reference was created. E.g., I entered the URL for my blog (“http://lessig.org/blog“). It then created an archive URL “http://www.webcitation.org/5IlFymF33“. Click on it and it should take you to an archive page for my blog.

Why, you might ask, would you ever want to substitute that long ugly URL for the short and spiffy http://lessig.org/blog? Well first, and most obviously if you’ve ever written something for publication, URLs are not always short and spiffy. Second, the point is to create an archive of a page at a particular moment.

A bunch of us have been talking about a service like this for sometime. One idea we had been talking about was a slight modification: Rather than a link that always took you to the archive, the link would first check whether the page referenced is still there unchanged. If so, it would give you that page; if not, it would take you to the archive. Difficulty with this is dynamic pages.

It would be fantastic if the consortium running this would keep a publicly accessible archive of the URLs they generate tied to the original URL — so if the service goes bunk, there’s a way to recover the original URL. And someone should write an app that could sit on a toolbar — ArchiveMe — and when clicked, generated the URL (and put it in the copy/paste field).

But these are quibbles: This is a very cool project, really really needed.

This entry was posted in good code. Bookmark the permalink.

14 Responses to this is a fantastically cool idea

anonymous says:

September 8, 2006 at 9:16 pm

It’s interesting but highlights a problem with the there when the electric is on and off when it’s not nature of info on the net. As we move more and more to relying on the info being there when we hit the bookmark or Google, a lot of really useful information vanishes as people change or delete their websites, lose interest, find the hosting expense prohibitive, etc.

The 4th Dr. Who once said as a space ship was in crisis and all systems were down “I told them not to put the emergency manual on the computer”.

Log in to Reply
Seth Finkelstein says:

September 8, 2006 at 10:28 pm

Careful, that sort of format is dangerous. A citation like:

http://www.webcitation.org/2006-09-08-23:50:32/http://lessig.org/blog/

at least gives someone a ghost of a chance of finding the material again when the server goes the way of many other projects, to that great big bitbucket in the sky.

A format like http://www.webcitation.org/5IlFymF33 is totally opaque.

It’s a very bad lock-in for an archival service.

Log in to Reply
James P. Howard, II says:

September 9, 2006 at 5:13 am

How is this any different from http://www.purl.org or DOI’s for web pages?

Log in to Reply
Djibril says:

September 9, 2006 at 5:33 am

It’s different from a purl because it does not just redirect to the page in question, it copies and archives a snapshot of the page in question at the date+time in question. An essential distinction.

The Internet Archive does something like this, I think, but does not guarantee to provide a permanent copy with stable URL. As far as I understand it, webcitation only creates an archive version of a page when someone requests it (a bit like tinyurl). I agree that a more transparent URL would be good, though not as snappy…

Log in to Reply
Andreas says:

September 9, 2006 at 11:52 am

Although I definitely see the value of this kind of service (I second Seth’s doubts about the opaque URI format though), isn’t this a bit a grey zone with regards to copyright? Services like Furl and other bookmark managers that save a copy of the pages in your library, haven’t added “world sharing” or even “group sharing” features for exactly that reason. More at http://furl.net/faq.jsp#copy

Log in to Reply
Philip Weiss says:

September 9, 2006 at 1:30 pm

It’s nice to have that feature, but there are a whole host of problems related to it.

First, there’s no guarantee that the page archived is the same one you viewed. Given that many web sites show different versions of the web page based on browser, based on IP address, based on country of the viewer, etc. They may return one thing to the citation company, and another to you.

Similarly, there’s no guarantee that what a person views now is the same as what was originally stored in the archive. While probably good enough for referencing in news articles, if you want to use it for legal purposes, or for other reasons where you need a correct copy, this is not a good way to do it. (For example, I want to record the price advertised for an item so I can take advantage of a competitor’s “meet or beat” guarantee.) Can you trust the citation company and it’s agents to maintain integrity? I wouldn’t put money on it.

Log in to Reply
Max Battcher says:

September 10, 2006 at 2:35 am

“If so, it would give you that page; if not, it would take you to the archive. Difficulty with this is dynamic pages.”

A truely dynamic page shouldn’t be used in a citation, as it isn’t a verifiable source, righy? A “dynamic page” (ultimately static over time but within a dynamic web engine) generated by blog software or what have you, should still attempt to send a correct Last-Modified header if at all possible, and you could certainly rely on that just as so many cache engines do (such as Google Cache).

Log in to Reply
Guilherme P. de Freitas says:

September 10, 2006 at 8:55 am

I’d just like to emphasize the points made by Seth and Phillip:

1. “Opaque” URLs are not nice (Seth).
2. At least in principle, they may archive a different version of the webpage due to location differences (Phillip).
3. Authenticity of the archived content is an issue (Phillip).

Point 3 is critical; point 1 could be easily addressed, I guess; point 2 is important, but you can always check if it archived the right version of the webpage in order to avoid mistakes.

Log in to Reply
QrazyQat says:

September 10, 2006 at 11:00 am

Plus, really, who says this service is going to be around forever? Is it really going to archive all those pages for years; where’s their incentive? At least the original owner/poster has some incentive to keep the info online, although they may not. Why would you assume this service will do it?

I note that web archive, which I understood was originally going to archive it all (sure), simply dropped pages after a while.

Log in to Reply
Lessig says:

September 11, 2006 at 12:56 am

I think some of this is missing the point. The purpose is not to prove with 100% certainty what was at a particular URL at a particular time. The purpose, as I understand it, is to make URL’s useable as citations. A simple way, that is, to go back to the thing cited, for the purpose of completing the reference. Sure, more confidence is better than less. But that you don’t have perfect conference doesn’t mean the service has no important value.

Second, as I said in the post, better would be if the cite published a table of its opaque URLs and the originals, so someone could at least go back to the original (and alternative archives for the original URL) if needed.

Third, of course there’s always a risk that the archive disappears. But if people start supporting the archiving movement, there’s less risk they will disappear than that a single URL will disappear.

Log in to Reply
Juan says:

September 11, 2006 at 6:21 am

Hi all. I am the developer behind WebCite and I wanted to clarify a few of the points noted here.

1. the snapshot ID shown is just a way to get URLs to be short and intended to look ‘pretty’ on print publications. It is also possible to search for a given webcite by using URL parameters on the query page. For example, http://www.webcitation.org/query?url=http://lessig.org/blog/&date=2006-09-08 also gets you to the archive of this blog.

2. How do you know that WebCite wont itself disappear? WebCite is in talks with the University of Toronto Library and it appears that they will provide hosting should the Centre for eHealth Innovation ever stop being the host. — BioMedCentral already archives with us and PLoS is starting soon — showing that some big players are already relying on the service and thus providing some ‘guarantee’.

3. The purpose of WebCite is not to be used as legal proof that a given website looked a certain way. It was designed for the purpose of archiving web citations on academic papers.

I welcome any other questions/comments: jalperin [ at ] ehealthinnovation.org

Please also visit our FAQ. Note there is a link there to a “Best Practices Guide” that explains all URL parameters and other technical ins and outs.

Log in to Reply
Gunther Eysenbach says:

September 11, 2006 at 10:02 am

Some of the commentators here would benefit from looking at the detailed technical description (http://www.webcitation.org/doc/WebCiteBestPracticesGuide.pdf ).
First of all, URLs are not necessarily opaque. A format such as http://www.webcitation.org/query?url=http://www.ehealthinnovation.org&date=2006-02-
02 is also functional and can be used as alternative format for citation purposes. This URL however gets very long (if the cited URL is already long), so it can be replaced by a shorter URL using a ID such as http://www.webcitation.org/5IlFymF33 or http://www.webcitation.org/query?id=5IlFymF33 .
If you only know the cited URL but not the ID but want to cite the ID version then you can look up the ID by using the query form at http://www.webcitation.org/query
Copyright concerns are addressed at the FAQ (http://www.webcitation.org/query) – essentially, recent jurisdiction in the context of the Google cache supports archiving projects like this where the copyright holder can opt out using robot exclusion standards, metatags, or an email requesting removal of material.
Somebody mentioned the idea of a toolbar. WebCite welcomes any initiatives to create a toolbar (or to embed this into existing toolbars), but it should be pointed out that a “bookmarklet” – as offered and described on the WebCite page – works as easy and convenient as well.
The answer to questions about sustainability of such a service and incentives for the WebCite consortium to actually maintain the archive is that the WebCite consortium is a consortium of academic editors and publishers who are using WebCite in their journals. They have an intrinsic motivation to keep this service running, otherwise everything which has been cited in their (printed) academic journals would vanish.
The WebCite consortium alsp collaborates with the U of T library and seeks active collaboration with other archiving projects such as IA.
Future iterations of WebCite will contain features like first displaying the live page and only if it is not the same as the archived version displaying the archived snapshot.
The cited snapshot is usually exactly what the citing author saw and archived (a given webpage at a certain date/time). The ONLY exception is if the dynamic page looks different for different viewer IPs (e.g. different countries) – in which case WebCite will archive/display the page the WebCite robot – which is located in Canada “sees” – but as somebody remarked, such pages probably should not be cited anyway.
I just also clarify that WebCite is not a “company”, but a open source / community project, and everybody who things he could contribute code or ideas is more than welcome to contact the WebCite consortium.

For further background about this see also the following article (published in a journal which uses WebCite rountinely for all references):

Eysenbach G, Trudel M
Going, Going, Still There: Using the WebCite Service to Permanently Archive Cited Web Pages
J Med Internet Res 2005;7(5):e60
“>“>http://www.jmir.org/2005/5/e60/>

Log in to Reply
Lars Bell says:

August 1, 2007 at 11:55 pm

Hi,

I have a web service now that should answer many if not all of the above concerns.
http://www.stayboystay.com
Free on demand archival service.

The standard output is a URL that reveals within it the original URL so no more opaque URLs. The date of the capture is also within the new URL.

In addition the new archive URL has a hash built into it. This provides a guarantee that the cached version has not been changed long after it was stored. Due do the cryptographic nature of the public domain hash it is computationally infeasible to change the content and then come up with the same hash.

This service is free and simple to use. You can use it anonymously or you can sign up for a free account and get some additional administration functions.

thanks

Lars Bell

Log in to Reply
Gunther Eysenbach says:

November 24, 2007 at 10:20 pm

Great move, Lars.
What exactly is the point of plagiarizing WebCite?
Your “innovation” that the standard output is “a URL that reveals within it the original URL so no more opaque URLs.” is not really an innovation.
Perhaps you missed my response where I am pointing out that “transparent: URLs like
http://www.webcitation.org/query?url=http://www.ehealthinnovation.org&date=2006-02-02
are fully supported by WebCite. The abbreviated format using a TinyURL format is mainly for publishers and citing authors, who in their list of references usually also provide the “live” URL.

WebCite is meanwhile used by hundreds of journals and publishers like Biomed Central.

Well, I guess Imitation is the sincerest form of flattery.

Log in to Reply

this is a fantastically cool idea

Related

14 Responses to this is a fantastically cool idea

Leave a Reply Cancel reply

Archives

Meta