UABDivulga
09/2009

Aly Conteh, Digitisation Programme Manager at the British Library


"Document analysis and recognition technology allows people to undertake resource discovery on massive amounts of data"

Aly Conteh directs the Digitisation Programme at the British Library, one of the largest libraries in the world, where some 150 million documents in all known languages and formats can be found. He is currently coordinating a project to digitise 23 million pages of 19th century books, 4 million pages of pre-1900 newspapers and hundreds of manuscripts, which will then be made available to researchers, students and the general public through the library's website. In July Aly Conteh was invited to speak at the 10th International Conference on Document Analysis and Recognition (ICDAR), organised by the UAB Computer Vision Centre. The field of document analysis and recognition combines image processing techniques, pattern recognition and computer vision to automatically extract the textual and graphical content of digitised documents.

Aly Conteh has been Digitisation Programme Manager at the British Library since 2003. He serves on the Executive Board of the Impact project, a large-scale integrating project funded by the European Commission as part of the Seventh Framework Programme (FP7). He is a member of the European Commission's Member States' Expert Group on Digitisation and Digital Preservation and has advised UK government departments on digitisation matters.

- What is document analysis and recognition technology?

- When we talk about Document Analysis and Recognition from the viewpoint of a national library, like the British Library, we're considering a key activity, which is digitisation. This activity is really about how we take historical material, e.g. newspapers, books, manuscripts... and make it available on the web. What Document Analysis and Recognition allows us to do is add value to this documentation. For example, traditional research with newspapers involves a physical newspaper or microfiche, and you need to trawl through each page to find the information you need. That's fine if you want to look for a particular issue or date, but when you are searching for something more general, it is quicker and easier to use technologies such as Optical Character Recognition (OCR), which takes scanned pages and detects the characters of each word individually. The software holds a database of these characters, which is how it can detect them within a text. This means that by putting the digital images through that process you can support functionality such as searching for keywords in a text. Until recently you could only do this with physical material or microfiche, and you had to search for the words yourself. This is the main benefit of moving from the physical to the digital environment. It allows people to undertake resource discovery on massive amounts of data in a way that wasn’t possible before. So it really opens new avenues for research with this type of material.
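As a rough illustration of how OCR output makes keyword search possible, here is a minimal Python sketch of an inverted index built over recognised page text. It is not the British Library's system: the page identifiers, the sample text and the helper names `build_index` and `search` are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted ids of pages that contain every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical OCR output for three newspaper pages (invented sample data)
pages = {
    "times-1851-05-02-p1": "The Great Exhibition opened in Hyde Park",
    "times-1851-05-02-p2": "Reform of the factory acts was debated",
    "star-1848-04-11-p1": "Chartist meeting calls for reform in London",
}
index = build_index(pages)
print(search(index, "reform"))
print(search(index, "reform London"))
```

Once the index exists, a query touches only the words it contains rather than every page, which is what makes searching millions of pages practical.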

- Could you explain what the large scale digitisation project consists of in terms of techniques, equipment, etc?

- One of our projects consisted of the digitisation of 23 million pages of 19th century books, which equates to about 80,000 volumes. This is what we describe as a mass digitisation project, so the workflow consists of getting the material stored in the basements of the British Library into the digitisation studio for digital capture and post-processing, and then making the digital resource available to users.
We do this by using a digital capture device built as a V-shaped cradle with two high-end digital cameras mounted above. As the book sits in the cradle, you're capturing the two facing pages at once. The additional benefit of this device is its automated page-turner, consisting of a head that comes into minimal contact with the pages. This is possible thanks to a small curtain surrounding the head, which creates a vacuum in its interior and is able to turn the pages. With this system our productivity can reach up to four times that of a person turning the pages manually; it also puts less stress and wear and tear on the book. A person normally turns each page in more or less the same place, but this head touches the middle of the page and therefore helps to preserve the books. An operator sitting at the terminal can immediately see, on the screen next to the capture device, the images that have been taken, assess their quality and check for mistakes. If any errors are detected, the process can be quickly stopped and repeated.
Then there are things like the "fold-out" pages. For example, in many geography books, there are fold-out maps which are larger than a normal page. This device cannot handle fold-outs, so the operator marks them and, when the digitisation of the book is finished, an overhead scanner captures the images of the fold-outs. The software in the device detects that a fold-out belongs to that specific book and inserts it in the right place among the rest of the images.
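The fold-out step can be pictured as a simple merge of two image sequences. The sketch below only illustrates the idea: the interview does not say how the device's software works, so the file names, the position-based marking and the function `merge_foldouts` are assumptions.

```python
def merge_foldouts(cradle_images, foldout_images):
    """Return a single reading-order list of images.

    cradle_images: image names in capture order from the cradle device.
    foldout_images: maps the 0-based position of the page that a fold-out
    follows to the image name produced by the overhead scanner.
    (Both structures are hypothetical, chosen for illustration.)
    """
    merged = []
    for position, image in enumerate(cradle_images):
        merged.append(image)
        if position in foldout_images:
            # The operator marked a fold-out after this page
            merged.append(foldout_images[position])
    return merged


# Invented example: a map folds out after the second captured page
cradle = ["p001.tif", "p002.tif", "p003.tif", "p004.tif"]
foldouts = {1: "foldout_map.tif"}
book = merge_foldouts(cradle, foldouts)
print(book)
```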

- What are the main handicaps to progress in this technology and what advances will we see in coming years?

- The biggest handicap is the quality of the OCR. This software is brilliant for modern printed material, for which it was developed, but with historic material, where the fonts, the language and the quality of paper that may be centuries old all pose challenges, we are seeing lower accuracy rates. This hinders the kind of services we can develop with this type of material. Where I think we will advance is in greater sophistication and tuning of OCR software to handle historic text, not only in recognising characters and solving problems such as poor quality due to bleed-through, but also in handling issues of language, such as introducing historical dictionaries into the software to detect archaic language or changes in spelling.
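One way to picture the historical-dictionary idea is as a post-processing pass over OCR tokens. The Python sketch below is only an assumption about how such a step might look, not a description of any real OCR engine: the tiny lexicon, the `normalise` function and the similarity cutoff are all invented for illustration. It uses the standard library's `difflib.get_close_matches` to tolerate small recognition errors.

```python
import difflib

# Hypothetical historical lexicon: archaic spelling -> modern form
HISTORICAL_LEXICON = {
    "shew": "show",
    "gaol": "jail",
    "connexion": "connection",
    "to-morrow": "tomorrow",
}


def normalise(token, lexicon=HISTORICAL_LEXICON, cutoff=0.8):
    """Map an OCR token to a modern spelling when the lexicon knows it.

    Exact matches are looked up directly; otherwise the closest archaic
    spelling above the similarity cutoff is used, which absorbs small
    OCR errors. Unknown tokens are returned lowercased and unchanged.
    """
    word = token.lower()
    if word in lexicon:
        return lexicon[word]
    close = difflib.get_close_matches(word, list(lexicon), n=1, cutoff=cutoff)
    return lexicon[close[0]] if close else word


print(normalise("Gaol"))       # exact archaic spelling
print(normalise("connexlon"))  # archaic spelling with one misread character
print(normalise("paper"))      # not in the lexicon, left as is
```

The cutoff is a trade-off: set too low, ordinary modern words may be "corrected" to unrelated entries; set too high, misread archaic spellings slip through.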

- Could you briefly describe what can be found at the British Library and give some numbers in terms of the books and documents in its catalogue?

- The amazing thing is that the British Library has just about everything. There are 150 million items, including around 15 million books and 825 million pages of newspapers. Other objects include prints and drawings, philatelic items, stamps, manuscripts... For example, if we were to digitise all the Anglo-Saxon medieval manuscripts found at the British Library, it would create about 8 million objects. The British Library probably has the greatest collection of medieval manuscripts in the world, alongside journals, periodicals, everything you can think of. There is single-sheet material, such as theatre playbills and other quirky items. For example, all magazines are deposited in the library and many have things stuck on the front, such as CDs, a lipstick, soft toys... and the library collects all this stuff. It collects everything.

- What is the main objective of the British Library's digitisation project and how will it be useful for the public in general?

- The main objective is to bring these items to a wider audience. At the moment, if you want to see the material, you have to physically go to the British Library. We can’t send material out, which makes it different from other public libraries where you can borrow items and take them home. I was asked once "Who is this for? Is it just for researchers?" It is for researchers, but it is not only for researchers. For example, we've digitised 4 million pages of newspapers, and a researcher studying social reform in Victorian Britain can go and get a good sense of the different papers and political views of that era. But the resource is also important for those doing genealogy or looking at their family history, or people who may want to draw parallels between the past and what happens in today's world. So you can appeal to everyone from the serious researcher working in great detail right through the spectrum to the curious member of the general public who just wants to see what was happening on a particular day or being said on a specific topic. The important thing with these resources is to deliver them in a way that shows they are made to serve a wide range of interests.

- Will libraries as we know them now still make sense in a virtual future?

- Yes, definitely. It's interesting because with the amount of digital content we are producing you would think that the printed word would start to decline, but that's not the case. We get more and more printed material every day. Personally I believe that mankind will always be interested in the physical representation. What will happen to libraries is that they will become more hybrid institutions. They will need to operate in and support research in the digital environment, but people will always want to interact with the physical and be able to look back and understand how we today, and how past generations, consumed information. Maybe over time the balance between the physical and digital material coming in will change. That may change, but I don’t think the library as a physical place you go to and interact with physical things will change for many generations.

 

Interview: Dímpel Soto. Photography: Antonio Zamora
Universitat Autònoma de Barcelona
 