Greetings, good people!
This is driving me crazy. First, a little background: I'm putting together a book based on a few decades of publications. For the period before 1995, no electronic files exist, so I've scanned the publications, run OCR on them, and then had to format them and so on. It is time-consuming and awful.
For the period 2000 and after, I have electronic files, which makes things much easier. Now I can cut and paste as I like without the hellish scan/OCR/cleanup.
Ah, but for those in-between five years: I have electronic files. They are binary, generated by some application, but they bear no file extension. They do contain text, though it's little hunks of text with a lot of non-ASCII gibberish attached -- about a paragraph per page. I thought they might have been QuarkXPress files, but a program that reads those files ain't having any. I'm trying to get in touch with those responsible for creating these things, to see if I can find out what program made them. So far no luck there, either.
Thing is, I remember having a Linux word processor sometime in the foggy distant past -- I thought it was TextMaker, but if so the feature seems to have been abandoned -- that would open files by brute force and display the ASCII text therein, thereby letting me easily (well, relatively easily) copy and paste to my heart's content. I'd of course still need to format and all that.
Anyway, I cannot for the life of me remember a way of doing this other than using strings from the command line, which is nice, but I'd like a little more control.
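To make "a little more control" concrete, here is a rough Python sketch of what I understand strings to be doing: pull out runs of printable ASCII of at least some minimum length. The function name extract_text_runs and the min_len parameter are my own inventions for illustration, and this assumes the text in these files is plain ASCII, which I haven't confirmed. Anything stored in an 8-bit encoding (accents, curly quotes) would get split or dropped.

```python
import re
import sys


def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long.

    Hypothetical helper: assumes the embedded text is plain ASCII.
    """
    # Printable ASCII is 0x20 (space) through 0x7E (~); tabs are kept as well.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Only runs when a file path is supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]
    min_len = int(sys.argv[2]) if len(sys.argv) > 2 else 4
    with open(path, "rb") as fh:
        for run in extract_text_runs(fh.read(), min_len):
            print(run)
```

Raising the minimum length (say, to 20) should discard most of the short formatting fragments and leave mainly the paragraph text, at the cost of losing short lines such as headings.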
Anybody know of an application that will nicely suck the text out of a binary file, where the binary stuff appears to be formatting?
Thanks in advance.