I noticed, by accident, that Cuneiform, an Optical Character Recognition (OCR) program, was in the repository. I had used Cuneiform at work about 10 years ago but then found a better program: FineReader.
However, after trying Cuneiform out today I can say it has improved a LOT in those 10 years. While browsing Amazon's book collection 6 months ago I saw a photocopy of a page from a book by B.F. Morris, which intrigued me because what he wrote about 1864 seems to apply today. I copied that page as a JPG file. Today I decided to use it to test Cuneiform.
My first attempt failed because the image wasn't a BMP file. I used GIMP to convert it to a BMP and then reran the command:
cuneiform -t text Exerpt_Xian_life_character_intro.bmp
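The convert-then-recognize step above could also be scripted. This is only a minimal sketch: it assumes ImageMagick's `convert` is installed (I used GIMP by hand instead), it relies on the -o option described in the man page, and the function names and file names are made up for illustration.

```shell
#!/bin/sh
# Derive the BMP file name for an input image, e.g. page.jpg -> page.bmp.
bmp_name() {
    printf '%s.bmp\n' "${1%.*}"
}

# Convert an image to BMP, then run Cuneiform on the result.
# Assumption: ImageMagick's `convert` is available; the post used GIMP.
ocr_image() {
    src=$1
    bmp=$(bmp_name "$src")
    txt="${src%.*}.txt"
    convert "$src" "$bmp" || return 1
    # Plain text is the default output format according to the man page,
    # so only the output file is given here.
    cuneiform -o "$txt" "$bmp"
}

# Example (hypothetical file name):
# ocr_image page.jpg
```

Keeping the file-name logic in its own small function makes it easy to check without having Cuneiform installed.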
Every occurrence of " and " was converted to " snd ", which I had to convert back globally using KWrite. There was one occurrence of " l' or " which was supposed to be " for ", but those were the only errors. All in all, Cuneiform did an excellent job converting the text in a graphic image to a text file.
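The same two corrections could be made from the command line instead of in an editor. A minimal sketch, assuming a POSIX `sed`; the function name and the file names in the example are hypothetical, and the patterns are just the two literal strings I ran into on this one page.

```shell
#!/bin/sh
# Apply the two corrections to OCR output read from standard input.
# The surrounding spaces are part of each pattern, so words such as
# "sand" are left alone; a "snd" at the very start or end of a line
# would not be matched.
fix_ocr() {
    sed -e 's/ snd / and /g' -e "s/ l' or / for /g"
}

# Example (hypothetical file names):
# fix_ocr < cuneiform-out.txt > corrected.txt
```

These substitutions are specific to this scan; another image would likely need its own list.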
I am going to experiment with its ability to correctly interpret images which contain columns of text or special formatting.
If what Morris wrote about 1864 is accurate, then it appears that nothing has changed since then.
The image file (as a JPG) is attached. Here is the text file which resulted:
Meanwhile, from the hearts of multitudes the dignity of honest labor and the dictates of a sober and frugal economy have died out, on the one hand increasing pauperism and crime and lending to misfortune the aggravation of human improvidence, and on the other fostering habits of false show, and thus increasing the temptation to deception, fraud, peculation, and all the dishonesties of the most high-pampered extravagance and excess. Moreover, the wanton neglect or abuse of our providential blessings, and the unconscious apostasy from every sentiment of purity and virtue, have served greatly to defile and degrade the mind of a large portion of the community, and fill the centres of population with slow and vulgar herd, who throng the open temples of obscenity and infamy.
Thus the materials are prepared for human guilt and wretchedness, whose catalogue of crimes and woes exhausts the power of language to express them. Beyond all this, political controversy and partisan strife for the reins and spoils of power, conducted without principle, and reeking with abuse, have taken so fierce a form as often to have driven the best men from the arena and left the worst upon the field. The selfish and profligate stand forward to control the nominations and elections to office, and afterwards gamble with its duties and obligations without shame and without remorse. Nor is this all. Our wrongs to the Indian and the African, continued from the beginning, have brutalized the temper, darkened the understanding, and perverted the judgment of the nation in regard to the plainest principles of common humanity and justice. The tide of emigration from the Old World has borne to our shores a large element of the foreign-born, who speedily become imbued with our native and inexorable prejudice in this respect. Thus, while we claim to be a free government, we have cherished institutions in our midst which are a mockery ...
Here's the man page:
CUNEIFORM(1)              multi-language OCR system              CUNEIFORM(1)

NAME
       cuneiform - multi-language OCR system

SYNOPSIS
       cuneiform [--dotmatrix] [-f FORMAT] [--fax] [-l LANGUAGE] [-o OUTPUT] INPUT

DESCRIPTION
       Cuneiform is an OCR system. In addition to text recognition it also does layout analysis and text format recognition. Cuneiform supports several languages.

OPTIONS
       --dotmatrix
              Undocumented option.

       -f FORMAT
              By default Cuneiform outputs plain text. There are several other output formats. To get a list run the command "cuneiform -f".

       --fax
              Undocumented option.

       -l LANGUAGE
              By default Cuneiform recognizes English text. To change the language use the command line switch -l followed by your language string. To get a list of supported languages type "cuneiform -l".

       -o OUTPUT
              If you do not define an output file with the -o switch, Cuneiform writes the result to a file "cuneiform-out.[FORMAT]". The file extension depends on your output format.

HOMEPAGE
       More information about cuneiform can be found at <http://launchpad.net/cuneiform-linux/>.