Scanning Texts

Scanning Tibetan Texts

Contributor(s): Chris Bell, Jama Coartney, David Germano, Zach Rowinski, Gene Smith, Kedrub.

Reference on image formats and scanning tips: https://www.scantips.com/basics09.html

Scanning for OCR

Contributor(s): Sam Chrisinger, Zach Rowinski, Steven Weinberger, Than Grove

If you are scanning Tibetan texts in dpe cha format (unbound folios) in order to run OCR on the scans, use the Canon ImageFormula DR-M160 scanner, which has a sheet feeder. It belongs to Religious Studies and is currently in Gibson 053. Follow this procedure:

  1. This scanner only works with Windows computers; it does NOT work with Macs.
  2. Install Capture On Touch software from DVD
    1. Scanner software and drivers are also available online, but it is better to install them from the DVDs that came with the scanner, because the versions available online may differ and may not run as well: http://www.usa.canon.com/cusa/support/office/imageformula_scanners/imageformula_dr_m160/imageformula_dr_m160?selectedName=DriversAndSoftware
      1. M160DRIT_ ****.exe
      2. M160COT_V24.exe
      3. If installation issues: RepairReg.exe
  3. Plug the scanner's USB cable into your Windows computer's USB port
  4. Turn on the scanner by pressing the On/Off button on the top of the scanner
  5. Launch the Capture On Touch software
  6. Configure Capture On Touch settings:
    1. Job title: Pecha scanning
    2. Output method selection: Save to folder
    3. Check output after scanning: off
    4. Enable continuous scanning: on
    5. Scans in the full automode: off
    6. Press Scanner Setting button
      1. Color Mode: black and white
      2. Page Size: Match original size
      3. Dots per Inch: 600 dpi
      4. Scanning Side: duplex
      5. Automatically straightens skewed image: off
      6. Rotate image to match orientation of text: off
      7. Use advanced settings dialog box: off
    7. Now press Output Setting button:
      1. File name: click the gear to the right
      2. Start file name with character string: enter the text you want to prepend to the filename for each file
      3. Add date and time: OFF
      4. Add counter to file name: ON
      5. Pulldown menu select: 3 digits (or 4 if the text has 1000 or more pages)
      6. Start counter with this number: 1 if the first file should be 001. You could set it to 0 if there is a title page, etc.
      7. Press OK and the File name box will close
      8. File Type: TIFF
      9. Click gear to right of File Type dropdown menu:
        1. Select: Create a file for each specified number of pages. Enter 1 in the box
        2. Compress image: OFF
        3. Press OK and TIFF setting box will close
    8. Save in Folder: press the folder icon and navigate to the folder on your hard drive where you want to save the files
  7. Put dpe cha pages in the sheet feeder. It scans from the bottom of the stack of pages in the feeder. You can hold the stack of pages while it scans but this is probably not necessary.
  8. Press the start button in the application (at the top of the CaptureOnTouch screen, right above "Job title" and below "ImageFORMULA").
  9. Watch the paper to make sure it doesn’t stick as it comes out. You can detach the output tray if it is hindering the pages exiting the scanner.
  10. After the stack of pages is scanned, then put another stack in. Press the green Start button on the scanner to start the scanning.
  11. If a page doesn't feed or for some reason doesn't scan, when you are ready to scan again, press the "Scan" button in the lower right of the CaptureOnTouch window
  12. When you have finished scanning and the software no longer says "Scanning" in the middle of the window, press Next Step at the lower right of the window and it will save the files.
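
The file name settings in step 6.7 amount to a prefix followed by a zero-padded counter. The following Python sketch reproduces that scheme so you can predict which file names a session should produce. The function names, the example prefix, and the `.tif` extension are assumptions made for illustration; they are not part of Capture On Touch.

```python
def counter_digits(page_count: int) -> int:
    """Return 3 for texts under 1000 pages, otherwise 4, as in the pulldown menu step."""
    return 3 if page_count < 1000 else 4


def expected_names(prefix: str, page_count: int, start: int = 1, ext: str = ".tif") -> list[str]:
    """List the names a session should produce: prefix + zero-padded counter + extension.

    `ext` is an assumed extension; check what your software actually writes.
    """
    digits = counter_digits(page_count)
    return [f"{prefix}{n:0{digits}d}{ext}" for n in range(start, start + page_count)]


if __name__ == "__main__":
    # "vol01_" is a hypothetical prefix entered under "Start file name with character string".
    for name in expected_names("vol01_", 4):
        print(name)
```

Comparing this list against the contents of the output folder is a quick way to notice a skipped counter value.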

Note about scanning in black & white vs. color, from Zach Rowinski: If you are scanning in color and 600 dpi is very slow, you could set it to 400 dpi. I wouldn't go lower than 400 dpi. Do not save in JPEG format, since it is lossy. Generally speaking, it is a bad idea to scan in color if all you are capturing is text, unless you are scanning old original manuscripts, in which case all the detail you can capture is useful.

Scanning in black and white and saving to TIFF has the added benefit that the images can be compressed using the Group 4 fax setting, which makes the file sizes small.

Choice of Equipment

Note: see the following page for scanning Karmapa Dege volumes

You can use scanners or digital cameras to produce images of texts. Scanners can be either flatbed or have automatic feeders. Flatbed scanners are good for not damaging paper, but also take more time since you have to manually place each piece of paper and remove it. In addition, they are large and bulky, especially if you have one large enough to accommodate long pages. Scanners with automated feeding are much more compact and thus easy to transport as well as accommodate long pages, but they may also damage paper in the process - especially if the paper is torn, fragile, or of irregular texture. Either way, there is a natural wear and tear on the glass as paper passes through scanners. The paper itself, depending on how rough its texture is, often scratches the glass and results in lines from the scratches appearing in the scans, which can be very problematic.

It should be stressed that any serious conservator will strongly resist any use of automated feeders because of the damage that they potentially do to a manuscript. If you are working with a unique or uncommon manuscript, we strongly advise against using an automated feed system for this reason. However, one may very well be scanning a common print so that damage is not a concern.

We are currently using Fujitsu scanners with automated feeding in Tibet because this is what people are able and willing to do. It is important that the scanner can take the full length of the page without cutting the originals. In terms of adjusting settings to accommodate Tibetan style books, the scanning application provided on the CD with your auto-feed scanner usually has a list of standard page sizes, but will also allow you to specify the length according to your needs. Be sure to adjust the settings before you begin to scan so the machine knows to expect an unusual page size. Also note that upper-end scanners like the Fujitsu 5120c and related family of models will scan both the front and back of a page simultaneously, which makes them an attractive option, especially when scanning a high volume of materials at a high quality setting (such as 600 dpi; see "Preparing to Scan" below).

Scanners are generally easier to use, but digital cameras can be particularly good when you are worried about fragile or torn paper being damaged as it is fed through a scanner. They also avoid the problem of rough paper scratching the glass of a scanner and causing lines in the resulting scans, which can be a major problem.

The main challenge with cameras is having a tripod or related setup that allows you to easily keep the camera pointed at the pages with a perfectly straight shot looking down. There are special structures for doing this called copy stands, but they tend to be expensive and heavy. The Library of Congress in the US is using a special Phase 1 P45 Camera Back, which involves a single column structure that holds it in place and facilitates a relatively rapid work process despite the inherent slowness of using a camera. These are used for oversize items like newspapers or maps, or relatively small and fragile items. For information on photographing Tibetan texts, see photographing Tibetan texts.

Issues Pertaining to the Actual Paper

The ease of scanning depends in large part on the material nature of the pages you are scanning. Traditional Tibetan texts are loose-leaf long rectangles with text on the front and back. The loose-leaf nature of Tibetan texts makes scanning much easier since there is no binding to be concerned about, but the length of the page can pose challenges. In addition, contemporary paper with its smooth surface and consistent character is much easier to scan than traditional Tibetan paper with its rough and inconsistent surfaces.

Another issue is the clarity of the print. Traditional wood block prints are often very inconsistent in terms of how much ink is used in each page, so that some pages may be very light and hard to scan. Thus one has to be careful not to get into a rhythm and miss that some pages are much lighter than others. In addition, some special Buddhist scriptures such as the Kangyur may be printed in red, which tends to be much more difficult to scan than black ink.

Preparing to Scan

Before you scan, assess the paper. How long are the pages? How clear and consistent is the print in terms of the darkness of the ink? What color is the ink? What is the texture of the paper? Depending on the answers, you should choose what equipment and what settings to use. Experiment with different settings, especially the dpi and color, to determine which to use.

In addition, you have to be concerned about missing pages and pages out of order. Since Tibetan texts are traditionally loose-leaf without binding, it is easy for pages to go missing or to get out of order. This can either be checked prior to scanning, or you can try to sort out the digital images afterwards. If there is time, inspecting the paper and fixing problems where possible prior to scanning can be more efficient, since paper is easier to look at than digital images.
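
If you record the folio number of each leaf in the order you find it, a short script can flag the problems described above. This is a minimal Python sketch, assuming the folio numbers have been read off the pages by hand and entered as a list; the function name and the report keys are made up for illustration.

```python
def check_sequence(folios: list[int]) -> dict[str, list[int]]:
    """Report folio numbers that are missing, repeated, or out of ascending order."""
    seen: set[int] = set()
    repeated: list[int] = []
    out_of_order: list[int] = []
    highest = None  # largest folio number encountered so far
    for number in folios:
        if number in seen:
            repeated.append(number)
        elif highest is not None and number < highest:
            out_of_order.append(number)
        seen.add(number)
        highest = number if highest is None else max(highest, number)
    missing: list[int] = []
    if seen:
        # Every number between the smallest and largest folio should appear once.
        missing = [n for n in range(min(seen), max(seen) + 1) if n not in seen]
    return {"missing": missing, "repeated": repeated, "out_of_order": out_of_order}


if __name__ == "__main__":
    # Hypothetical stack: folio 3 is misplaced, folio 5 is absent, folio 6 appears twice.
    print(check_sequence([1, 2, 4, 3, 6, 6]))
```

The check only covers the range between the lowest and highest number recorded, so it cannot tell you whether leaves are missing from the very beginning or end of the text.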

It may also be that paper has become creased, so that it will present an irregular surface for scanning. In this case, you may want to experiment with exposing the paper to humidity to essentially "stretch it out" to make it more flat for scanning.

If you are scanning Tibetan paper, you need to gently rub each page with a soft cloth to wipe away small stones and splinters that can scratch the lens. A scratched lens or screen on an autofeed scanner is difficult to replace, if it is replaceable at all.

Open a target folder for the new scans. Output files should be TIFF, with no compression or increase. Set the scan type to long page duplex and specify the exact size of the pages (for Derge paper it is usually 680mm long X 100mm tall). Setting the scan type to duplex is important: it ensures the autofeed scanner will scan both sides of the page simultaneously.

In terms of determining the scanner settings, the real litmus test is this: when you print the scanned image, does it look good, is it readable, etc.?

File format: TIFF. For printing, you may want to subsequently convert to PDFs and print from the PDF. However, you should always save the scans as TIFFs.

Compression: none

Resolution: 300 dots per inch (dpi) minimum. If the source material is really rich in detail, then do a test scan at 300 and at 600 dpi. If you see more detail in the 600 dpi scan, then I would use the higher setting. For example, is the scanned text more readable at 600 dpi? If so, then use 600 dpi.

Color/Grayscale/bitonal: variable. This really depends on the source material. If it's yellow parchment colored paper with faint gray writing on it, then you definitely want to scan in 24-bit color. Actually, you can't go wrong if you scan in color, and if it makes sense to do so, it can always be converted later. One issue, however, is that scanning in color can be much slower and take up much more storage, both of which can pose challenges. Still, color can be crucial where detail is necessary. Everything depends on the document.
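
To get a feel for the storage cost of these choices, you can estimate the uncompressed size of one scanned side from the page size, the resolution, and the bit depth. The Python sketch below uses the Derge page size quoted above (680 mm by 100 mm) as its example. It is only an approximation: it ignores TIFF headers and row padding, and the function names are made up for illustration.

```python
MM_PER_INCH = 25.4


def pixels(length_mm: float, dpi: int) -> int:
    """Convert a physical length in millimetres to a pixel count at the given dpi."""
    return round(length_mm / MM_PER_INCH * dpi)


def uncompressed_bytes(width_mm: float, height_mm: float, dpi: int, bits_per_pixel: int) -> int:
    """Approximate raw image size in bytes: width x height x bit depth, rounded up."""
    total_bits = pixels(width_mm, dpi) * pixels(height_mm, dpi) * bits_per_pixel
    return (total_bits + 7) // 8


if __name__ == "__main__":
    # 1 bit per pixel = black and white (bitonal), 8 = grayscale, 24 = color.
    for label, bits in [("bitonal", 1), ("grayscale", 8), ("24-bit color", 24)]:
        size_mb = uncompressed_bytes(680, 100, 600, bits) / 1_000_000
        print(f"{label}: about {size_mb:.1f} MB per side at 600 dpi")
```

Under these assumptions a 600 dpi color scan of one side comes to roughly 114 MB, against under 5 MB for black and white, which is why the choice of color mode matters so much for large projects.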

How to Scan

If you use an autofeed scanner that scans both sides of a page simultaneously, the scanning process entails little more than feeding pages into the scanner. It is recommended that you feed the pages into the scanner one after another yourself, rather than stacking a few pages in the tray and assuming the autofeed will scan them in the correct order without grabbing multiple pages at a time. Indeed, for Tibetan style pages, an autofeed scanner will often make mistakes unless someone makes sure pages are fed one at a time in succession. Most autofeed scanners have guides in the feed tray which you can adjust according to the size of the pages you are scanning. These are extremely helpful in ensuring pages feed through the scanner evenly. As one page feeds through, prepare to insert the next page as soon as the page being scanned is nearly finished. The Fujitsu autofeed scanners typically have a short delay between when one scan finishes and another starts. Thus, it's not difficult to maintain a steady rhythm of scanning many pages in succession, even if you fumble for a moment in preparing the next page to be fed into the scanner.

It is helpful to intermittently save a group of scans as you go so as not to lose all of your work if the computer crashes. Initially, you should number the files in sync with the text's page numbers themselves. Thus in case a page goes un-scanned, etc., it is easy to figure out exactly where you made a mistake. Your scanning equipment may automatically apply a custom name and number to scanned images as it saves them. This can greatly save time, though be careful to make sure the numbers it applies and the text's own page numbers are in sync. Later, you can easily batch rename scans according to whatever makes the most sense.
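
Batch renaming can be done with many tools. As one possibility, here is a minimal Python sketch that first builds a rename plan you can inspect and only then applies it. The folder layout, the prefix, and the function names are assumptions for illustration. The sketch relies on the existing names sorting in scan order, which holds when the counter is zero-padded.

```python
from pathlib import Path


def rename_plan(folder: Path, new_prefix: str, digits: int = 3, start: int = 1) -> list[tuple[Path, Path]]:
    """Pair each TIFF in `folder` (sorted by name) with its new name. Nothing is renamed here."""
    scans = sorted(p for p in folder.iterdir() if p.suffix.lower() in {".tif", ".tiff"})
    return [
        (old, old.with_name(f"{new_prefix}{n:0{digits}d}{old.suffix}"))
        for n, old in enumerate(scans, start=start)
    ]


def apply_plan(plan: list[tuple[Path, Path]]) -> None:
    """Carry out the renames, refusing to overwrite any file that already exists."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(f"{new} already exists")
        old.rename(new)
```

A typical use would be to print the pairs returned by `rename_plan(Path("scans"), "vol01_")`, check that they match the text's own page numbers, and then pass the plan to `apply_plan`. Working on a copy of the folder is a sensible precaution.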

We advise making one image for each folio side, rather than storing the front and back (recto and verso) of a given folio on a single image. The reason for this is that by having one folio side on one image, you preserve maximum flexibility for how you later deliver them.
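
With one image per folio side and duplex scanning, the position of an image in the scan sequence determines which folio and side it shows. A small Python sketch of that mapping follows. It assumes that every folio was scanned on both sides, in order, starting with the recto of the first folio; the function name is made up for illustration.

```python
def folio_side(image_index: int, first_folio: int = 1) -> tuple[int, str]:
    """Map a 1-based image position to (folio number, side).

    Assumes images alternate recto, verso with no skipped or blank sides.
    """
    if image_index < 1:
        raise ValueError("image_index is 1-based")
    folio = first_folio + (image_index - 1) // 2
    side = "recto" if image_index % 2 == 1 else "verso"
    return folio, side


if __name__ == "__main__":
    for index in range(1, 5):
        folio, side = folio_side(index)
        print(f"image {index}: folio {folio} {side}")
```

If a title page or a blank side breaks the alternation, the mapping shifts from that point on, so it is worth spot-checking a few images against the printed folio numbers.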

Scanning Pages With Background Distortion

When scanning pages of Tibetan text there may occasionally be thin pages where the back-page text actually bleeds partially through to the front-page, distorting the text. This is different from pages so thin that the back-page text can be seen, an issue easily resolved with a thick piece of white paper. In the case of bleed-through, there are two solutions depending on severity. Either of these solutions should successfully lighten the distorting bleed-through in the background and make the front-page text more readable.

  1. Under "Advanced Settings" > "Image Quality" set the brightness to the lightest possible setting before scanning.
  2. Do the above and photocopy the pages. Then scan the photocopies while repeating the above step, thus effectively doubling the lightness. This solution is more time-consuming and results can be somewhat grainy, so it is advised only for severe instances.

Processing

After scanning, processing may entail renaming files, organizing images into folders by text or volume, converting images to JPEG or PDF, and, in special cases, editing the image itself to remove blemishes or other distortions. For processing purposes it is helpful to use one or two programs that allow for (1) easily browsing a large number of TIFF files, (2) batch renaming, and (3) editing. Adobe Photoshop can perform all of these functions, though it tends to perform slowly. "Thumbs Up" is a program that allows for easily viewing and renaming TIFF images. Kodak Imaging is a lightweight, fast Windows application for editing images.

Storing images

High resolution scans take a large amount of disk space. You may want to use some form of lossless compression technique like "zip" if you are in need of a temporary solution for conserving disk space. "Lossless" means that the image will compress without any loss of information. (By contrast, a "lossy" image is one in which compression resulted in loss of information. JPEG is a lossy image format.)
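
The meaning of "lossless" can be checked directly: after zipping and unzipping, the bytes must be identical to the original. The Python sketch below does this in memory using the standard `zipfile` module. The sample data and the entry name are placeholders standing in for a real TIFF file, and the helper names are made up for illustration.

```python
import io
import zipfile


def zip_bytes(name: str, data: bytes) -> bytes:
    """Store `data` under `name` in an in-memory zip archive using DEFLATE compression."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, data)
    return buffer.getvalue()


def unzip_bytes(archive_bytes: bytes, name: str) -> bytes:
    """Read the entry `name` back out of an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(name)


if __name__ == "__main__":
    original = b"placeholder pixel data " * 10_000  # stands in for a TIFF's contents
    zipped = zip_bytes("page001.tif", original)
    restored = unzip_bytes(zipped, "page001.tif")
    print("identical after round trip:", restored == original)
    print("original bytes:", len(original), "zipped bytes:", len(zipped))
```

How much space zipping saves depends heavily on the image; highly repetitive data like the placeholder here compresses far better than a real color scan would.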

Zipping a file is acceptable for a temporary on-site storage solution, but a zipped file is not considered an acceptable archival format. For the long term, it is advisable to store the files uncompressed on a dedicated hard disk. A 300 GB hard drive can be found for about $120.

IMPORTANT: Compress a file using zip only after you have saved the image as an uncompressed .tiff file! Do NOT save the file as a compressed TIFF! This may be tempting as it does save disk space, and your scanning software may in fact offer it as a default option. However, some TIFF compression options (JPEG compression, for example) are lossy, so you risk losing information and being left with an inferior quality image.

As an alternative to the .zip format, it's fine to use other compression techniques (like .7z or other lossless methods) if they provide better results, as long as, again, these solutions are only temporary. Note however that .7z and other formats may not always work across platforms. (For example, you may not be able to decompress a .7z file on a Mac.)

Delivery options

Typically, final scans are delivered on DVDs or a hard drive. We generally recommend use of a hard drive if possible, since then it is easier to transfer the images as a whole rather than having to serially process a large number of DVDs. Of course this becomes less of an issue as high capacity DVD storage becomes more common. In addition, it is essential to keep two entirely separate copies of all scans in case the media of storage gets corrupted or lost.
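
When you keep two separate copies, it helps to confirm that they really are identical. One common approach, sketched here in Python, is to compute a SHA-256 checksum for every file in each copy and list the files that differ or exist in only one copy. The function names are made up for illustration, and the sketch reads each file fully into memory, which is fine for individual scans but is an assumption worth revisiting for very large files.

```python
import hashlib
from pathlib import Path


def manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to `folder`) to its SHA-256 checksum."""
    return {
        p.relative_to(folder).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def differing_files(copy_a: Path, copy_b: Path) -> list[str]:
    """Return relative paths whose contents differ or that exist in only one copy."""
    a, b = manifest(copy_a), manifest(copy_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `differing_files` means the two copies match file for file. Saving the manifest alongside the scans also lets you recheck a copy later for corruption.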

Training

The training given to the people doing the scanning is crucial. It is important to closely supervise their initial work, and also stress to them not to change agreed upon settings later on because they decide they have to go faster, or because they switch staff, etc.

Provided for unrestricted use by the Tibetan and Himalayan Library