OCR Correction Makes Our Shared History More Accessible to All


This topic contains 0 replies, has 1 voice, and was last updated by  Regan 4 months, 3 weeks ago.

  • Post
    Regan
    Keymaster

    We all win through this crowdsourcing activity.

    We at the State Library are honored to be partnering with the cultural and civic organizations within our state to add so much interesting, unique, and entertaining historic content to the Colorado Historic Newspapers Collection (CHNC).

    Over the past three years, we have added more than 600,000 pages of historic newspapers to the collection, including 91 new titles, 5 new languages, student papers from our institutions of higher education, previously unrepresented geographic regions, and so much more. But we need your help! Technology alone can only do so much. It takes human intervention to make the collection even more valuable to the tens of thousands of students, genealogists, and researchers who use it every month. What do I mean by human intervention? Why, OCR correction, of course.

    Optical Character Recognition, or OCR, is a process by which software reads a page image and translates it into a text file by recognizing the shapes of the letters. OCR enables searching of large quantities of full-text data, but it is never 100% accurate. The level of accuracy depends on the print quality of the original newspaper issue, its condition at the time of microfilming, the level of detail captured by the microfilm scanner, and the quality of the OCR software. Poor-quality paper, small print, mixed fonts, multi-column layouts, and damaged pages can all reduce OCR accuracy. OCR software has improved dramatically over the years; however, many pages within the CHNC were added more than 10 years ago, and the OCR-created text for those pages can only be corrected manually. That is where “the crowd” comes in. We need your help to clean up the database.
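
    To make "never 100% accurate" a little more concrete, here is a minimal sketch of one common way to estimate OCR accuracy: count the single-character edits needed to turn the OCR text into the corrected text, then divide by the length of the corrected text. This is only an illustration, not how the CHNC software measures anything. The function names are hypothetical, and the garbled name is an invented example of the kind of error OCR can produce.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, corrected_text: str) -> float:
    """Share of the corrected text that the OCR got right (0.0 to 1.0)."""
    if not corrected_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, corrected_text)
    return max(0.0, 1 - errors / len(corrected_text))


# Invented example: a plausible OCR misreading of a name.
ocr = "Vinccnt Jolmson"
corrected = "Vincent Johnson"
print(edit_distance(ocr, corrected))                 # 3
print(round(character_accuracy(ocr, corrected), 2))  # 0.8
```

    Three wrong characters out of fifteen may not sound like much, but it is enough to keep a name from turning up in an exact search.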

    Here is an example of some pretty bad OCR from content added to CHNC back in the early 2000s. We can read the original article with little difficulty because our eyes and brain work together to “fill in the blanks”. Early OCR software could not do this easily, so the resulting text of this article is missing many important words and names, and would probably not be found by someone searching for Vincent Johnson.

    The good news is that the CHNC database has a built-in text correction tool that allows users to fix errors in the OCR text as they discover them. Using this tool, any registered user can edit the OCR text of the articles they find or use in the database. Correcting text is simple and safe: it does not alter the original image of the newspaper article, only the searchable text created from it.

    Using the text correction tool, I made edits to the article’s OCR to the right, and now it looks like this. All of the names are now entered correctly, and all other words are corrected as well.
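
    For the technically curious, here is a rough sketch of why a correction matters for searching and why it is safe. It is not the CHNC's actual code. The `Article` class, its fields, the file name, and the garbled name are all made up for illustration. The idea is simply that the page image and the searchable text are stored separately, and a correction only ever touches the text.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical record: a scanned image plus its searchable text."""
    image_file: str                 # the scan; a correction never changes it
    ocr_text: str                   # the text layer that search looks at
    previous_versions: list = field(default_factory=list)

    def correct_text(self, new_text: str) -> None:
        # Keep the old text so a correction can be reviewed or undone.
        self.previous_versions.append(self.ocr_text)
        self.ocr_text = new_text


def search(query: str, articles: list) -> list:
    """Return the articles whose text contains the query, ignoring case."""
    needle = query.casefold()
    return [a for a in articles if needle in a.ocr_text.casefold()]


# Invented example: the name as OCR might have misread it.
article = Article(image_file="page_0001.tif", ocr_text="Vinccnt Jolmson")

print(len(search("Vincent Johnson", [article])))  # 0, the article is missed

article.correct_text("Vincent Johnson")

print(len(search("Vincent Johnson", [article])))  # 1, the article is found
print(article.image_file)                         # page_0001.tif, unchanged
```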

    To date, 485 users have collectively corrected 2,647,588 lines of text in articles held within the Colorado Historic Newspapers Collection. Our top five correctors are listed on the front page of the database, in a place of honor for their contributions to the resource.

    I recently asked some of our top correctors what motivated them to correct text in the database and here are some snippets of their responses.

    “When I am correcting text, I feel like I am bring[ing] the people and events back to life, if only for a moment.”

    “For me, originally I was looking for information on grandparents in Routt County. … Since then I just correct because I realize that there are other people who are looking for their family histories as well.”

    Whatever your reason for correcting, we appreciate every correction made, because it makes the CHNC experience better for everyone who follows. Help us make the CHNC better by correcting text. To learn more, see our help and forums page, and check out our new text correction video on how to do it yourself.

    For more information about text correcting, or to inquire about specific content relevant to you and your research, contact Leigh Jeremias. Thank you for helping us make the CHNC the wonderful resource that it is for Colorado.
