Archiving the WWWeb

In 1901 a little-known German neurologist began attending to a female patient known as Auguste D. At a mere 51 years of age, the woman was unable to remember her name, her husband’s name or how long she had been in hospital.


After Auguste D’s death, the doctor, one Alois Alzheimer, would find her brain covered in lesions (which he would name plaques and tangles) and subsequently give his name to a now well-known disease. For the time being, however, both doctor and patient were bewildered by the latter’s declining mental abilities. On Auguste D’s first day in hospital Alzheimer tried to get her to write her name. She failed several times before looking up at him, exasperated, and announcing, “I have lost myself.”

A little over a century later, it’s possible to liken Auguste D’s deteriorating mental state to a threat facing our collective digital memory. It’s an unforeseen consequence of the spread of the internet (the principal manifestation of the digital revolution) and a problem that libraries and academic institutions are grappling with the world over.

While it’s possible to identify and capture all, or at least most, published material that is printed, purely digital material is another matter: its sheer volume and transience are a challenge to archivists.

Much of the information about and commentary on recent events of which New Zealanders feel enormously proud, such as the Lord of the Rings film trilogy or the country’s America’s Cup successes, is only on the web, not in print, says the National Library’s Penny Carnaby. “If we do not preserve it, we are effectively losing our heritage.”

Consider, too, that somewhere out there the next Janet Frame or Colin McCahon may well be bashing out emails that could give future biographers fascinating insights into their lives and creative processes. In the past their letters would have been saved, but today their emails, after hanging around on a few computers and servers for a while, will simply disappear.

To ensure that large chunks of New Zealand’s history do not go the way of Auguste D’s memory, the National Library has embarked on a four-year, $24 million project to preserve the country’s digital heritage. The project will see video, sound, text and graphics, including websites and weblogs, saved as digital objects. When it’s completed, in mid-2008, historians, researchers and the general public will have access to a massive digital database, which will be maintained and continually updated. Because of the earthquake risk in Wellington, and as a precaution against hackers, the archive will be replicated and updated daily at another site, probably in Auckland.

Leading the project is the manager of the National Library’s Innovation Centre, Steve Knight. “The library’s been thinking about how to deal with digital material since 2001,” he says. “We have a mandate to comprehensively collect New Zealand and Pacific material in the print environment and we knew at some stage we would have to extend that to the online world.”

Collecting and storing digital material is not as straightforward as one might think. Because technology changes so fast, the digital archive will need to be transferred to a new operating system every 10–15 years. It’s also important to preserve the technological context in which material is created. Take a poem, for example. It isn’t just the words but also how they are arranged on a page that gives it its cadence and meaning. So how can you ensure a poem written using Microsoft Word remains intelligible in the 22nd century even though no one will be using Microsoft Word by then?

To tackle this problem Knight has developed a meta-data extraction tool that burrows into websites, pulls out proprietary information from different file types and translates it into a format that’s consistent with the meta-data model.

Knight’s tool, which is still a work in progress, was one of five technologies short-listed for the UK’s prestigious Pilgrim Trust digital-preservation award last year. Knight is in discussion with Harvard University about combining his tool with a similar one developed there. And his expertise has been internationally recognised with his selection by the US National Science Foundation to be on a panel to evaluate bids for funding digital-preservation projects. Not a bad effort for someone with no technical training, who learnt technology on the job during more than 20 years working in libraries.

The massive volume of material on the internet and the speed at which it is updated (43 days is the average life span of a web page) mean it’s impossible to collect everything. In the print world the library has a mandate to collect 100 per cent of published material relating to New Zealand and the Pacific. The Legal Deposit System requires publishers to lodge two copies of everything they publish with the National Library. Inevitably not everything is saved; the general consensus is that about 80 per cent of material is collected and preserved. But, owing to its nature, printed material endures, and items not collected at the time of publication are often acquired later. Most countries are looking at capturing only 3–5 per cent of digitally published material, although, as collection methods become more sophisticated and storage costs drop, the target may rise.

Ingrid Mason, the National Library’s e-collections manager, has recently returned from a trip to the US, where she visited the US Library of Congress and the International Internet Preservation Group to investigate how best to save online material. “You can’t suck up the whole internet,” she says. “Most nations are looking at different methods of collecting online-published material.”

The changing iterations of certain websites deemed important enough will be preserved. (An iteration here is a version of a site as it stood at a given point in time.) Choosing what to save comes down to curatorial judgement, says Mason. “It’s difficult when you have been selecting comprehensively in the print environment because you have to be more selective in the online world.”

This raises some interesting questions about what to include and what to leave out. Will the Destiny Church’s website, or local pornography sites, be saved, for example? While the inclusion of such material in the archive might irk social and cultural conservatives, leaving it out would effectively airbrush history. “If it’s not illegal, we’ll collect it,” says Carnaby. “Pornography is a grey area, and where there’s a grey area we’ll err on the side of collecting it. We must not sanitise our history. It’s about the memory of a nation and we must not lose it.”

As with print, the library will be guided by its collections policy and the law, with the curator of the library’s published collection, Clark Stiles, overseeing decisions on online collection.

The library is also looking at “event capture”—collecting all online material related to certain events, such as the recent tour of the British and Irish Lions rugby team or the forthcoming general election. “Using artificial intelligence you can run web harvesters and get them to pull in everything on a thematic basis,” says Mason.

Event capture is already happening. Around 30 websites relating to the election (political party sites, lobby group sites and political blogs) are being saved on a daily basis. “The changing iterations of these sites already give a fascinating glimpse of New Zealand’s recent history,” says Carnaby.

Finally, the library will also take regular broad “snapshots”, sucking up everything in the .nz web domain at a certain point in time.

More problematic is saving the electronic correspondence of artists or writers. Carnaby says some people are already depositing electronic correspondence with the National Library, while some younger people, whom she terms digital natives, have created blogs to record their lives. “But we need to start having a conversation with the authors, artists and creators of today about how we capture and preserve their process of creativity in the 21st century.”

Because the library is interested in collecting not just from that part of the web that is publicly accessible but also content that is made available on a commercial basis only, the cooperation of online publishers will be essential to the success of the project.

“We are currently in discussions with publishers,” says Carnaby. “There’s understandable nervousness around the provisions of access. The power is with the publishers. They will determine the rights of access. On day one they’ll be conservative about what they submit, but hopefully they’ll come to realise that giving access will actually benefit them in the long run.”

The digital-archiving project is a key plank in the government’s recently released Digital Strategy. In another project under the strategy’s auspices the library is mapping the content asset (libraries, databases, radio and film archives) of New Zealand. Once this has been done, the library will employ New Zealand’s best web-design talent to produce a user interface called NZOnline, which will be the public’s window to the content resource and allow federated searching across a number of databases. This will allow someone researching the southern albatross, for example, to pull up scholarly articles about the bird from libraries, sound recordings of its calls from the radio archive, and digital images from Matapihi (a web-based library of digital images, sounds and objects), as well as other relevant material from the National Library’s digital archive.

Archivists around the world are watching the project with interest. It’s being peer-reviewed by a number of academic institutions, including the British National Library, the Royal Library of the Netherlands, and Cornell University and the Getty Research Institute in the US. Two companies, IBM and Endeavour Information Systems, have been short-listed as technology partners, with the final choice due to be made next year.

Carnaby reckons New Zealand’s approach to the problem is unique. “I don’t see many countries having a comprehensive strategy to preserve their digital memory. We’re looking at it holistically from creation to preservation and then looking at an access strategy.”
