Transcription process and accuracy levels

Wednesday, January 21st, 2009

There have been a few questions on transcription accuracy and our policy towards certain aspects of transcribing the records. We hope this post clears up a few questions!

According to recent tests, the transcription accuracy of the 1911census.co.uk website at launch is in excess of 98.5%. This threshold is a requirement set by The National Archives.

Transcribing the census is a massive exercise: every single digitised document has to be read and transcribed, which adds up to over 7 billion keystrokes over the course of the project. Naturally, with that many keystrokes, more than a few errors will be made.

However, during the transcription process, we do apply a number of processes (developed during our many years’ experience of digitising censuses and other historical documents) to correct the most obvious errors and keep inaccuracy to a minimum.

The 1911 census poses particular problems. Because the household summaries, rather than enumerators’ books, are the core documents, the variety of handwriting is significantly wider. In fact there are 8 million different hands writing returns, which makes interpreting the handwriting a much more challenging task!

Now some good news - the 98.5% accuracy at launch will improve over time.

The first way that it will be improved is by users of 1911census.co.uk reporting errors to us. Each report is reviewed by hand by the transcription team and, if the change is approved, it is incorporated into the search results, usually within a month (when the next data upload is made to the website).

Our policy is to accept changes only if they match what is on the original page (i.e. the household form). So if your ancestor made spelling mistakes on the original page, they will be carried through into the transcript. This is actually more common than you might think, so please check the original page before you assume that the transcript is wrong: it may be an accurate transcription of a mistake in the original document.

The second way that we improve the quality of the transcription over time is by applying ‘data standardisation’ processes. This is basically a set of rules that we develop as we identify recurring errors, and then apply to the data. A basic standardisation, for example, is converting “Geo” to “George”, or listing records from Kent, Surrey and Middlesex as “London” if they fall within the metropolitan London area. We are developing and applying more data standardisations over time to eliminate more of the current transcription errors and to make searching easier, but some of these processes are much easier to apply once the data is complete.
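To give a flavour of what a standardisation rule looks like, here is a minimal Python sketch of the two examples above. It is an illustration only, not the production rule set: the field names, the `standardise` function and the small lookup tables are all hypothetical, and the list of places treated as metropolitan London is made up for the example.

```python
# Hypothetical rule tables for illustration; the real rule set is not shown here.
FORENAME_ABBREVIATIONS = {"Geo": "George"}

# Assumed (county, place) pairs that fall within metropolitan London.
METROPOLITAN_LONDON_PLACES = {
    ("Kent", "Greenwich"),
    ("Surrey", "Lambeth"),
    ("Middlesex", "Islington"),
}


def standardise(record):
    """Return a copy of the record with standardised search fields added.

    The transcribed values are left untouched, so the transcript still
    matches the original page; only the extra *_std fields are changed.
    """
    result = dict(record)

    # Expand a known abbreviation, ignoring a trailing full stop ("Geo.").
    forename = record["forename"].rstrip(".")
    result["forename_std"] = FORENAME_ABBREVIATIONS.get(forename, record["forename"])

    # List the record under "London" if its place is in the metropolitan area.
    key = (record["county"], record["place"])
    in_london = key in METROPOLITAN_LONDON_PLACES
    result["county_std"] = "London" if in_london else record["county"]
    return result
```

In this sketch a record keyed as “Geo” in Lambeth, Surrey would gain the search values “George” and “London”, while a record from Maidstone, Kent would keep “Kent”.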

All of our transcriptions undergo thorough batch sampling, by the transcription house, by The National Archives and by our in-house Quality Control team. Any batch failing to meet the required level of accuracy is rejected and rekeyed.
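The idea behind batch sampling can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of the actual quality control procedure: the `batch_passes` function, the sample size and the field-by-field comparison against a checker's reading are all hypothetical, and only the 98.5% threshold comes from this post.

```python
import random

REQUIRED_ACCURACY = 0.985  # the launch threshold quoted in this post


def batch_passes(keyed, checked, sample_size, rng=None):
    """Check a random sample of fields from one batch.

    keyed   -- list of field values as transcribed
    checked -- list of the same fields as read by a quality checker
    Returns True if the sampled accuracy meets the required level.
    """
    if not keyed or len(keyed) != len(checked):
        raise ValueError("batch must be non-empty and lists must align")
    rng = rng or random.Random()

    # Pick distinct field positions to inspect (the whole batch if it is small).
    indices = rng.sample(range(len(keyed)), min(sample_size, len(keyed)))
    correct = sum(1 for i in indices if keyed[i] == checked[i])
    return correct / len(indices) >= REQUIRED_ACCURACY
```

A batch for which this returns False would be the kind that is rejected and rekeyed.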

One way of reducing transcription errors is by ‘double-keying’ every entry. This basically means getting the transcriptions done twice (by different people), then comparing the two versions and resolving the differences by hand. However, doing this naturally doubles the transcription cost and would not improve the accuracy rate by a hugely significant degree (you can never reach 100%). The costs would have had to be passed on to the public, resulting in higher prices for the census service.
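For readers curious about how double-keying works in principle, the comparison step might look something like the Python sketch below. The `compare_double_keyed` function and the field names are hypothetical and chosen purely for illustration.

```python
def compare_double_keyed(first_pass, second_pass):
    """Compare two independent keyings of the same record.

    Each argument is a dict of field name -> transcribed value.
    Returns (agreed, disputed): fields where both keyers match, and
    fields where they differ and a person must check the original.
    """
    agreed, disputed = {}, {}
    for field in first_pass.keys() | second_pass.keys():
        a = first_pass.get(field)
        b = second_pass.get(field)
        if a == b:
            agreed[field] = a
        else:
            disputed[field] = (a, b)
    return agreed, disputed


# Example: the two keyers read the surname differently.
agreed, disputed = compare_double_keyed(
    {"forename": "George", "surname": "Smith", "age": "34"},
    {"forename": "George", "surname": "Smyth", "age": "34"},
)
```

Here only the surname would go to a reviewer. Note that when both keyers misread a field in the same way the error still slips through, which is one reason double-keying can never reach 100%.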

We could also have taken the route of transcribing fewer fields – just a name index, like the old pre-digital booklets – but we feel that this would have resulted in fewer people being able to find their ancestors, as it would reduce the number of fields you can search on. It would also have made the transcription much less useful for academic study, which is one of the uses to which the 1911 census will be put when it is completed.

It is important to remember that the transcription is designed as a finding aid for the original documents, which should be viewed as the “source of truth”; happily most users are able to find their ancestors despite the inevitable errors that creep in.

We have also provided very flexible search options (using wildcards, for example), which, with some lateral thinking, can also help you track down those who do not appear on the first search. The search options had to be constrained at launch to allow for the volumes of people searching, but we have been unlocking these features as the week has worn on, and there is more to come (see other blog posts).
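To show why wildcards help with variant or mistranscribed spellings, here is a small Python sketch using the standard library's `fnmatch` module. It assumes the common convention that `*` stands for any run of characters and `?` for exactly one; the wildcard syntax on the website itself may differ, and the `wildcard_search` function is hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern, names):
    """Return the names matching a wildcard pattern, ignoring case."""
    lowered = pattern.lower()
    return [name for name in names if fnmatchcase(name.lower(), lowered)]


# One pattern covers several spellings of the same surname.
matches = wildcard_search("Sm?th*", ["Smith", "Smyth", "Smythe", "Smart"])
```

In this example the pattern picks up “Smith”, “Smyth” and “Smythe” but not “Smart”, so a single search covers several plausible readings of the handwriting.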