Open Source Tools for Digitization

Digitization is the conversion of print or analog material into digital format. Through digitization, materials can be preserved for future use and study.

Paper to Electronic Format Converters

PDFCreator (http://www.pdfforge.org/products/pdfcreator)

PDFCreator is a Windows application that converts documents into Portable Document Format (PDF) files. The software is released under the GNU General Public License. The primary features of PDFCreator, as listed in a 2007 review by Clifnotes on FreewareWiki, are:
• the ability to create PDFs from any program which has a print function
• the ability to encrypt PDFs to specify rules for opening, printing, etc.
• the ability to e-mail PDF files as they are created
• the ability to save to additional formats including PNG, JPG, TIFF, BMP, PCX, PS, and EPS
• the ability to AutoSave files to folders and filenames based on tags such as “Username,” “Date,” “Time,” etc.
• the ability to merge a number of files into one combined PDF file
• easy installation
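
The tag-based AutoSave naming in the list above can be illustrated with a short sketch. It is not PDFCreator's own code: the angle-bracket tag syntax, the function name expand_autosave_name, and the date and time formats are all assumptions made for illustration.

```python
from datetime import datetime


def expand_autosave_name(template, username, when):
    """Replace <Tag> placeholders in a filename template (hypothetical syntax)."""
    values = {
        "Username": username,
        "Date": when.strftime("%Y-%m-%d"),
        "Time": when.strftime("%H-%M-%S"),
    }
    # Substitute each known tag; unknown tags are left untouched.
    for tag, value in values.items():
        template = template.replace(f"<{tag}>", value)
    return template


example = expand_autosave_name(
    "<Username>_<Date>_<Time>.pdf", "jsmith", datetime(2009, 5, 8, 14, 30, 0)
)
print(example)
```

Building the name from tags rather than prompting the user means a batch of documents can be saved unattended without filename collisions, provided the template includes a time component.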

Usage: The Yale Medical Library has PDFCreator installed on all library machines to aid in the creation of PDF files (http://elibrary.med.yale.edu/blog/).

PDF995 (http://www.pdf995.com/)

PDF995 offers an easy-to-use interface that lets users create PDF files by selecting the "print" command from any application. The resulting documents can be viewed on any computer with a PDF viewer.

PDF reDirect v2 (http://www.exp-systems.com/PDFreDirect/Features.htm?1)

PDF reDirect is a freeware utility for Windows that allows the user to create PDF files from most applications that have a print option. The program operates by creating a virtual printer that shows up as an option in the “Print” menu. When the PDF reDirect virtual printer is chosen, the program exports the file as a PDF file. According to a review of the application from software.informer, “The program also includes a built-in previewer that shows you the PDF file, allowing you to optimize the file settings on the fly. It also allows you to select the output printer settings such as picture quality, color model and page rotation.”

Other features of PDF reDirect include:
• live previews to allow settings to be optimized “on the fly”
• optimization for print or web-quality PDF files
• the ability to merge multiple documents into a single PDF
• the ability to choose how the PDF is displayed when it is opened
• encryption capabilities using 40-bit password protection
• no popups or watermarks limiting usage of the program

A paid professional version of the program, with additional features, is available for $19.99 at the official web site.

Usage: In a 2007 article, Michael Bennett discussed how PDF reDirect was used in the creation of Digital Treasures (http://dlib.cwmars.org), a digital library project that contains historical documents and information about the agricultural and industrial cultural history of central and western Massachusetts. Bennett (2007) wrote, “Using PDF file creator freeware known as PDF reDirect, PDF versions of both the network's descriptive and administrative metadata standards as well as the scanning lab's imaging standards were produced from Word document originals. These would be open for public view from Digital Treasures' 'about' link once the site went live.”

Reference: Bennett, M.J. (2007). Digital repository implementation: a toolbox for streamlined success. OCLC Systems and Services, 23(3), 254-261. Retrieved May 8, 2009, from Research Library database. (Document ID: 1325810991).

CutePDF (http://www.cutepdf.com)

CutePDF creates a PDF file from almost any printable document. It has an open SDK and does not burden the software with popups or watermarks. CutePDF installs itself as a printer subsystem, allowing users to save any document as a PDF.

Release notes for the software list these changes:
· Added support for both 32-bit and 64-bit Windows Vista.
· Added support for 64-bit Windows XP/2003.
· Improved support for foreign-language versions of Windows.
· Ghostscript is no longer included; it can be downloaded and installed separately.

Open Office (http://www.openoffice.org)

Open Office is an open source alternative to Microsoft Office. The suite includes word processing, spreadsheet, presentation, graphics, and database applications, and it is available in multiple languages. Open Office saves documents in an international open standard format, so other common software packages can open and edit the files.

Open Office is easy to install and use. The package offers a full list of features that rival those of other office software packages, and it is available on a number of different operating systems.

Usage: The Open Office suite is used in a host of facilities all over the world:
· Dewitt Public Schools in Michigan saves nearly $48,000 in licensing fees per year using Open Office
· Noxon Schools in Montana has over 185 desktops using Open Office
· Earlham College in Richmond uses Open Office on all its public computers
· University of the Philippines
· Howard County Library System uses Open Office on 283 computers
· State of Nevada Department of Corrections
· The city of Prague replaced MS Office on 60 computers

These are only a few instances of Open Office's use around the world; see the reference section for a full list of its noted deployments.

Optical Character Recognition (OCR) Software

GOCR (http://jocr.sourceforge.net)

GOCR is an OCR (Optical Character Recognition) program developed under the GNU General Public License. It converts scanned images of text back into text files. Several user interface front ends build on GOCR, which makes it widely accessible. The software can read many image formats, not just BMP and JPEG, and it can even recognize barcodes. The software continues to improve at a rapid pace.

SimpleOCR (http://www.simpleocr.com/)

SimpleOCR is free OCR software that is used in thousands of applications all over the world. SimpleOCR also offers a royalty-free SDK that developers can use to customize the software for their own purposes. All users need is a scanner: instead of retyping scanned documents, they can use SimpleOCR to convert them to text easily and at no cost.

OCRopus (http://code.google.com/p/ocropus/)

OCRopus is an open source document analysis and optical character recognition (OCR) system intended “for high-throughput, high-volume document conversion efforts,” released under the Apache License, Version 2.0 (http://code.google.com/p/ocropus/).

It features pluggable layout analysis and character recognition, statistical natural language modeling, and support for multiple languages. Google sponsors the development of OCRopus to simplify the task of digitizing books for its Google Books project. According to the project's home page, the OCRopus engine “is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods.” The project is developed for use with the Linux operating system, although it can be successfully run on Mac OS X (http://en.wikipedia.org/wiki/OCRopus).

Thomas Breuel, lead developer on the project, said (2009) that unlike many commercial OCR systems, the OCRopus system “uses a strictly forward architecture,” reducing the coupling between components and making it easier to plug in other modules for text recognition and layout analysis (p. 392). The three main stages in the system's architecture are layout analysis, text line recognition, and statistical language modeling (Breuel, 2009, p. 393).
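
The three-stage forward architecture described above can be sketched in a few lines of Python. This is not OCRopus code or its API: the function names, the toy stages, and the tiny vocabulary are hypothetical stand-ins chosen only to show how data moves in one direction through pluggable stages.

```python
def run_pipeline(page, analyze_layout, recognize_line, pick_reading):
    """Run the three stages strictly forward; no stage calls back into an earlier one."""
    regions = analyze_layout(page)                      # stage 1: layout analysis
    candidates = [recognize_line(r) for r in regions]   # stage 2: text line recognition
    return [pick_reading(c) for c in candidates]        # stage 3: language modeling


# Toy stand-ins (assumptions for illustration, not real OCR components).
VOCABULARY = {"open", "source", "ocr", "system"}


def toy_layout(page):
    # Treat each non-empty line of a string as one text-line region.
    return [ln.strip() for ln in page.splitlines() if ln.strip()]


def toy_recognizer(region):
    # Pretend the recognizer cannot decide between zero and the letter "o".
    return [region, region.replace("0", "o")]


def toy_language_model(candidates):
    # Prefer the candidate containing the most known words.
    def score(text):
        return sum(word in VOCABULARY for word in text.lower().split())
    return max(candidates, key=score)


page = "0pen s0urce\n\nOCR system\n"
print(run_pipeline(page, toy_layout, toy_recognizer, toy_language_model))
```

Because each stage only consumes the output of the one before it, any of the three functions can be swapped for a different implementation without touching the others, which is the loose coupling the paragraph above describes.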

Reference: Breuel, T.M. (2009). Applying the OCRopus OCR system to scholarly Sanskrit literature. In Sanskrit computational linguistics (pp. 391-402). Berlin: Springer.

Video Editing

HyperEngine-AV (http://sourceforge.net/projects/hyperengine)

HyperEngine-AV is an open source program for the Macintosh which lets users “capture, arrange, edit and process video, audio and text in a free-form, trackless document, to create movies and slide shows” (Sourceforge.net, 2009). It supports a large number of file formats including TIFF, PICT, Photoshop, GIF, JPEG, MPEG, DV, QuickTime, MP3, MP4, and WAV. The program is capable of capturing digital video in real time from a FireWire-capable digital video camera or importing existing digital video clips or digital photos. The software also comes with text editing tools that allow for easy creation of credits and subtitles (MacDirectory.com, 2004).

Imaging

Engauge Digitizer (http://digitizer.sourceforge.net)

This open source digitizing software converts an image file showing a graph or map into numbers. The image file can come from a scanner, digital camera, or screenshot. The software is used in a wide assortment of fields, including aeronautical engineering, cryogenics, engine catalysts, bioinformatics, biomedicine, chemistry, and many others (Digitizer.sourceforge.net, 2007).
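
To give a sense of how a graph image can be turned into numbers, the sketch below maps pixel positions to data values using two known reference points per axis. It is a simplified illustration, not the software's actual method: it assumes linear axes aligned with the image edges, and the function name make_axis_map and all pixel values are hypothetical.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function converting a pixel coordinate to a data value on a linear axis."""
    if pixel_a == pixel_b:
        raise ValueError("calibration pixels must differ")
    scale = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        return value_a + (pixel - pixel_a) * scale

    return to_value


# Calibration: pixel column 100 is x = 0 and column 500 is x = 10.
to_x = make_axis_map(100, 0.0, 500, 10.0)
# Pixel rows grow downward, so row 400 is y = 0 and row 50 is y = 70.
to_y = make_axis_map(400, 0.0, 50, 70.0)

# Pixel positions picked along a curve in the image (made-up values).
curve_pixels = [(100, 400), (300, 225), (500, 50)]
points = [(to_x(px), to_y(py)) for px, py in curve_pixels]
print(points)
```

Calibrating each axis separately keeps the arithmetic simple, but a rotated or skewed scan would need a more general transform than this sketch provides.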

Usage: The software is currently used in the University of Delaware's engineering department lab.