mdxi.collapsar.net::software::ocr-mode

Description

No, ocr-mode doesn't actually perform Optical Character Recognition. To quote its own docstring, it is:

A major mode for assistive first-pass editing of OCR'd text via side-by-side display of scanned pages and raw OCR data for that page. Idea brazenly stolen from Project Gutenberg's Distributed Proofreaders website.

ChangeLog

Synopsis

Commands
M-x ocr-init            Start a proofing session; instantiate ocr-mode
ESC-n M-x ocr-init      Start a proofing session at pageset n
M-x ocr-resume          Restart ocr-mode where previous session was halted (does not work across Emacs sessions)
C-c n                   Load next pageset
C-c p                   Load previous pageset
ESC-n C-c j             Jump to pageset n during a proofing session
C-c q                   Terminate ocr-mode
Variables
ocr-source-dir          Sets the location of pageset data (no default; user is asked if not bound)
ocr-tmp-image           Sets location of temporary display image (defaults to /tmp/ocr-mode.jpeg; should always be a JPEG file)
ocr-image-extension     Sets file extension to match for generating image list (defaults to .tiff)
ocr-text-extension      Sets file extension to match for generating text file list (defaults to .txt)
ocr-fill-column         Overrides the default fill-column while ocr-mode is in effect, restoring fill-column upon ocr-mode termination (defaults to the pre-existing value of fill-column)
ocr-justify             Determines if paragraphs will be justified when filled (defaults to nil)
ocr-edit-window-width   Controls how wide, in characters, the ocr-edit window will be (defaults to 2 characters wider than ocr-fill-column). When specified, this value should be given as a negative number (e.g. -72 for 72 columns wide).
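
As a rough illustration, the variables above might be set in ~/.emacs along these lines. Only the variable names come from the list above; the directory and the numeric values are made up for the example.

```elisp
;; Illustrative ocr-mode settings; adjust to taste.
(setq ocr-source-dir "~/scans/my-book/")   ; hypothetical directory of pagesets
(setq ocr-image-extension ".tiff")         ; same as the documented default
(setq ocr-text-extension ".txt")           ; same as the documented default
(setq ocr-fill-column 70)                  ; used only while ocr-mode is active
(setq ocr-justify nil)                     ; do not justify filled paragraphs
(setq ocr-edit-window-width -72)           ; negative: 72 columns, 2 wider than the fill column
```

The negative width follows the same convention as Emacs's own split-window-right, where a negative SIZE gives the right-hand window that many columns. Whether ocr.el passes the value through to that function unchanged is an assumption.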

Installation

Put ocr.el anywhere on your load-path (e.g. /usr/share/emacs/site-lisp/), add (require 'ocr) to your .emacs file, and restart Emacs.
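
If ocr.el is kept somewhere Emacs does not already search, the directory has to be added to load-path before the require. A minimal sketch, assuming a hypothetical ~/elisp/ directory:

```elisp
;; Only needed when ocr.el lives outside the default load-path.
(add-to-list 'load-path (expand-file-name "~/elisp/"))  ; hypothetical location of ocr.el
(require 'ocr)
```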

Usage

Since ocr-mode is not associated with a specific filetype, it should be invoked by doing M-x ocr-init (and not M-x ocr-mode). Upon invocation, it will ask for the location of a directory full of scanned page images and matching raw OCR'd text files, like this:

0000.tiff  0001.txt   0003.tiff  0004.txt   0006.tiff  0007.txt   0009.tiff  0010.txt   0012.tiff
0000.txt   0002.tiff  0003.txt   0005.tiff  0006.txt   0008.tiff  0009.txt   0011.tiff  0012.txt
0001.tiff  0002.txt   0004.tiff  0005.txt   0007.tiff  0008.txt   0010.tiff  0011.txt

...or whatever you want to call them; it doesn't matter as long as there's a 1:1 correspondence between image and text files, both groups sort in the same order, and they have different extensions. The directory location can be bound to ocr-source-dir in your ~/.emacs file; doing so will keep Emacs from asking for it when ocr-init is called. The variable ocr-tmp-image can also be bound, but it is never asked for; it defaults to /tmp/ocr-image.jpeg. Likewise, ocr-image-extension specifies the extension for image files (defaulting to ".tiff") and ocr-text-extension specifies the extension for text files (defaulting to ".txt").
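
The requirements on the directory can be checked ahead of time with a few lines of Emacs Lisp. The function below is a hypothetical helper, not part of ocr.el, and how ocr.el itself builds its file lists is not shown here. It only counts files and pairs them in sorted order, so it catches a missing page but not a misnamed one.

```elisp
(require 'cl-lib)

(defun my-ocr-check-pagesets (dir &optional image-ext text-ext)
  "Return a list of (IMAGE . TEXT) file name pairs found in DIR.
IMAGE-EXT and TEXT-EXT default to \".tiff\" and \".txt\".
Signal an error if the two groups differ in size.
Hypothetical helper for illustration; not part of ocr.el."
  (let* ((image-ext (or image-ext ".tiff"))
         (text-ext (or text-ext ".txt"))
         ;; `directory-files' returns names sorted alphabetically, so the
         ;; two lists line up when both groups sort in the same order.
         (images (directory-files dir nil
                                  (concat (regexp-quote image-ext) "\\'")))
         (texts (directory-files dir nil
                                 (concat (regexp-quote text-ext) "\\'"))))
    (unless (= (length images) (length texts))
      (error "Found %d image files but %d text files in %s"
             (length images) (length texts) dir))
    (cl-mapcar #'cons images texts)))
```

Calling (my-ocr-check-pagesets "~/scans/my-book/") on a well-formed directory returns the pagesets in the order they would be stepped through. Because the patterns are anchored at the end of the file name, saved files such as 0012.txt.ocr are not counted as text files.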

Once this is accomplished, your Emacs frame will be split into two side-by-side windows. The first image will be scaled to fit the left window, and the contents of the first text file will be loaded into the right window (which will be ocr-edit-window-width characters wide). If the text was loaded from a raw OCR data file, fill-paragraph will be called on it. The result will look something like this:

From here the text is simply edited as desired. When you're done with a page, pressing C-c n will load the next pageset, C-c p loads the previous pageset, and C-c q prompts for confirmation to exit ocr-mode (cleaning up the OCR windows and buffers behind itself). When any of these commands is issued, the text buffer is tested for modification, and if anything has changed, the data is saved to a file named [current-text-file].ocr (e.g. 0012.txt.ocr). Further, the loading functions test for the existence of these files and load data from them instead of the original text files if they are available, so the whole operation is as seamless and easy as possible.
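
The save-and-reload behaviour described above can be sketched in a few lines. Both functions are hypothetical stand-ins written for illustration; the actual function names and details in ocr.el may differ.

```elisp
(defun my-ocr-file-to-load (text-file)
  "Return TEXT-FILE with \".ocr\" appended if that file exists, else TEXT-FILE.
Hypothetical helper; not the actual ocr.el function."
  (let ((edited (concat text-file ".ocr")))
    (if (file-exists-p edited) edited text-file)))

(defun my-ocr-save-if-modified (text-file)
  "Write the current buffer to TEXT-FILE.ocr if the buffer was modified.
Return non-nil when a file was written.  The original TEXT-FILE is
never overwritten.  Hypothetical helper; not the actual ocr.el function."
  (when (buffer-modified-p)
    (write-region (point-min) (point-max) (concat text-file ".ocr"))
    (set-buffer-modified-p nil)
    t))
```

Keeping the edited text in a separate .ocr file means the raw OCR output stays untouched, so a page can always be compared against, or restarted from, the original data.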

Dependencies

TODO

Download
