catdoc & xls2csv
Overview
catdoc is a program which reads one or
more Microsoft Word files and writes the text contained inside them
to standard output. It thus does the same job for .doc files as the
Unix cat command does for plain ASCII files.
It is now accompanied by xls2csv, a program which
converts an Excel spreadsheet into a comma-separated value file.
Optionally, catdoc can translate some non-ASCII characters into the corresponding
TeX escape sequences and convert charsets from the Windows ANSI codepage to
the local codepage of the target machine. (Because catdoc is a Russian program,
by default it converts cp1251 to koi8-r when running under
UNIX and to cp866 when running under DOS.)
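The default charset conversion can be illustrated with a small Python sketch. This is not catdoc's own code (catdoc is written in C and uses its own charset tables); the function name recode and the sample string are assumptions made for illustration, and Python's built-in codecs stand in for catdoc's tables.

```python
def recode(data: bytes, source: str = "cp1251", target: str = "koi8_r") -> bytes:
    """Decode bytes from the source charset and re-encode them in the target charset.

    Hypothetical helper for illustration only; it relies on Python's
    built-in codecs rather than catdoc's charset files.
    """
    text = data.decode(source)
    # Characters that do not exist in the target charset are replaced with '?'.
    return text.encode(target, errors="replace")


# A Russian word stored the way a Windows machine would store it.
sample = "Привет".encode("cp1251")

unix_bytes = recode(sample, "cp1251", "koi8_r")  # default under UNIX
dos_bytes = recode(sample, "cp1251", "cp866")    # default under DOS

print(unix_bytes.decode("koi8_r"))
print(dos_bytes.decode("cp866"))
```

The same text ends up as three different byte sequences, which is why a Word file written on Windows looks like garbage on a koi8-r terminal unless it is recoded.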
Catdoc has rudimentary table handling. In TeX mode it inserts & when it
encounters a field delimiter and \\ when it encounters the end of a table row. No
table headers are produced, though.
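A rough Python sketch of this substitution follows. It is an illustration only, not catdoc's implementation: the function name and the tab and newline characters used as field and row delimiters are placeholders chosen for the example, not the markers Word actually stores.

```python
def tex_table_text(text: str, field_delim: str = "\t", row_delim: str = "\n") -> str:
    """Replace table delimiters with their TeX equivalents.

    field_delim and row_delim are hypothetical placeholders; the real
    markers inside a Word file are different.
    """
    out = []
    for ch in text:
        if ch == field_delim:
            out.append(" & ")        # column separator in a TeX table
        elif ch == row_delim:
            out.append(" \\\\\n")    # TeX row terminator, then a line break
        else:
            out.append(ch)
    return "".join(out)


print(tex_table_text("Name\tQty\nApples\t3\n"))
```

The result contains only the row contents. The surrounding tabular environment and column specification still have to be written by hand, since no table headers are produced.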
Catdoc doesn't even try to preserve MS-Word character formatting. Its
goal is to extract plain text and allow you to read it and, possibly,
reformat it with TeX according to TeXnical rules that most Word users
haven't even heard about.
xls2csv does roughly the same for Excel files. It extracts
the data and leaves out any formatting info and formulas. The idea is that
you want to see the data, not the way it was created.
Supported platforms
- Unix. Catdoc was initially developed for Linux and Sparc Solaris.
It also runs on a variety of other Unices. For instance, it is included in
the FreeBSD ports collection.
- MS-DOS. Catdoc also runs on MS-DOS, even on XT machines. MS-DOS is
the only platform for which compiled executables are provided. These
executables are 16-bit real mode. I think that a protected mode version of
xls2csv might be useful, but I don't have time to support it.
There is no support for catdoc under Windows.
Not because I hate Windows, just because I don't use it. Note that DOS
catdoc is not intended to be used under Windows. For example, it doesn't
support long file names.
License
catdoc and xls2csv are distributed under the GNU General Public License.
Commercial licensing of some parts of catdoc is also available
upon request.
Current status
The current development version of catdoc is 0.93.3.
It is finally able to autodetect Unicode/non-Unicode Word files and also
recognizes (and hopefully parses) MS-Write files and RTF.
It also eliminates the garbage which troubled the previous
version of catdoc. Note that footnotes and fast saves are still not handled.
The previous version is 0.92.
This is the first version in which RTF support appeared. It seems that the 0.93
series is just as stable as 0.92 and does its job better.
The old stable version of catdoc is 0.90.3.
It supports Unicode and uses runtime-configurable charset tables.
Autodetection of Word files has also changed, so catdoc can be used
blindly on any file with a .doc extension, producing reasonable results
in most cases. These improvements have the drawback that it is no longer
enough to copy the executable to install catdoc properly.
"Competing" products
There are several other free programs which read MS-Word files:
- Microsoft's own freeware viewer.
It is probably the best of all those listed at handling the Word format (and the only one
which can handle Word 97).
Unfortunately, it requires Windows to run, although I have heard that it runs successfully under
Wine.
- Laola, the Perl library to read OLE files
- word2x, a C++ program much more elaborate than catdoc. It sometimes reads files which catdoc can't, and vice versa.
- wvware. An elaborate system which parses
Word files and converts them into something more interesting, e.g. HTML
or LaTeX.
- xlHtml. A converter of Excel and
PowerPoint files into readable and processable form (HTML, CSV, XML).
- Word2Html (link appears to be dead).
A very simple program initially developed for OS/2. It doesn't handle tables
and charsets, but finds the start and end of the text stream correctly, based
on code borrowed from LAOLA.
- Filters project.
A project to create GPLed filters for various proprietary formats. It now has
a C library, cole, which does the same job as LAOLA. It aims to completely
convert all information from a Word file (including formatting) to XML.
- w2t, a Perl program
which claims to convert Word to LaTeX. Based on LAOLA.
- antiword
- Spreadsheet::ParseExcel
Perl module to read Excel files.
- doc2xml, a Python script
which reads post-97 Word files. Handles fast saves.
Revision history
- 0.93.3 Nov 15 2003
- It was planned as a feature release. It has support for Excel date
formatting, output of blank cells and a help window in wordview.
Unfortunately, during its development an important bug was found in
the OLE parser code, so I had to publish this release very soon after
the previous one.
- 0.93.2 Nov 14 2003
- Improved performance of the OLE parser, fixed problems with Unicode
chars 0xFF00-0xFFFF in catdoc, rewrote wordview for the Unicode-aware
version of Tcl, with support for displaying text in a language different
from the current locale. Reworked autoconf configuration.
- 0.93.1 Sep 24 2003
- Fixed numerous bugs in the newer OLE and RTF code, including a problem
with incorrectly interpreting the last (incomplete) 256-byte block of text
as Unicode. Restored support for pre-OLE Word versions, which was
accidentally lost in 0.93.
- 0.93 July 29 2003
- Added proper handling of OLE structure (by Alex Ott).
- 0.92 June 16 2003
- Added RTF parser at last (contributed by Alex Ott). MS-DOS
executable for xls2csv is included. Some code clean up and splitting.
- 0.91.6 May 24 2003
- Added autodetection of the output charset from the current locale.
Fixed handling of RK and MULRK records in xls2csv. No more missing
numbers. Fixed a long-standing bug with the loss of the first 8 symbols when
recoding a text file.
I finally began to provide MS-DOS executables for the 0.91.x series.
- 0.91.5 January 30 2002
- I finally got to catdoc again. UTF-8 output is added.
Just specify utf-8 as output charset.
- 0.91.4 December 30 1999
- Fixed important bug in xls2csv - improper recognition of numeric
cells (as opposed to formula). Fixed segfault when catdoc is used to
recode plain text files.
- 0.91.3 December 14 1999
- Mainly xls2csv fixes: xls2csv now recognizes some options (the man page
is in sync), and an endianness check was added to configure, so xls2csv compiles
correctly out of the box on big-endian machines.
- 0.91.2 October 19 1999
- This is the first version which includes the xls2csv program. Also,
some long-standing bugs are fixed, as well as a newly introduced bug where catdoc
hangs on broken files, although these files are not read properly
without the -b switch. A new charset, koi8-u, is added
to the distribution. If you want to use it in the stable version, just
download it from here and put it in the catdoc library directory.
A new switch -l is added. It causes catdoc to list available charsets in
the current charset path.
- 0.91.1 October 15 1999
- As was expected, it was the wrong decision to trust the information
about the extended charset from the Word document header. Now we analyze
the encoding of each 256-byte page separately (because it is possible that
the first ones are 8-bit and others 16-bit). When processing non-Word
files (e.g. plain text), encodings are converted and -u is taken into
account, so catdoc can be used as a generic character converter which
supports utf8 and utf16 (both byte orders) as input.
- 0.91.0 October 12 1999
- Implemented new format analysis. Now most versions of the Word format
as well as MS-Write and RTF are detected. Boundaries of the main text stream
are also detected, so no more garbage is produced at the end of the file.
- 0.90.3 August 11 1999
- Fixed small OS-specific bugs: broken isspace in Turbo C under DOS,
and %x was replaced with %i for compatibility with SunOS 4.
- 0.90.2 May 24 1999
- Artem Chuprina pointed out a
segfault when a non-existent charset is specified on the command line.
It turned out to be a silly bug in the check_charset function with a one-line
fix. You can get the one-line patch.
- 0.90.1 Nov 26 1998
- Duncan Simpson pointed out numerous places in the catdoc source where
a paranoid sysadmin could suspect a buffer overflow. They were investigated
and either rewritten or commented to explain why they are safe.
Also fixed a minor bug in Makefile (make args are now propagated to
subdirs) and wordview (saving files now works).
- 0.90 Oct 29 1998
- Fixed bug with redeclaring source_csname and target_csname in main.
Fixed bug in configure when dealing with wish 8.0.3. Catdoc is considered
stable enough to be released.
- 0.90b5 Oct 14 1998
- Fixed handling of 0x1F char (soft hyphen in Word 6.0),
now it is translated to 0x00AD (unicode soft hyphen)
Fixed permissions for manual page
Added --with-install-root configure arg to simplify
building of binary packages.
- 0.90b4 September 17 1998
- Added proper configuration of library dir in wordview.
Added --disable-charset-check config option
Added 0x2026 symbol in ascii.rpl
Added more Windows codepages in distribution
- 0.90b3 September 11 1998
- Added -x option to simplify debugging of substitution maps
- 0.90b2 September 10 1998
- Added replacement sequences for some special characters which are
present in cp1251 and cp1252. Fixed some filename-handling problems
in wordview.
- 0.90b1 September 8 1998
- Added Cyrillic transliteration to ascii.replchars,
fixed some bugs in configuration. Added us-ascii output charset.
- 0.90a3 September 7 1998
- Fixed small bug in table handling, which caused catdoc to output
an extra column delimiter just before the row delimiter. Added autoconf
configuration. install is back, although not for charsets.
- 0.90a2 August 18 1998
- Version 0.90 was tested on BSDI and Solaris platforms. Makefile was
rewritten to avoid use of the highly incompatible
/usr/{ucb,bin}/install.
- 0.90a1 August 13 1998
- Catdoc has undergone a major rewrite. Now it has proper charset
handling, including Unicode and runtime configurability.
- 0.35 - June 5 1998
- Fixed bug with -s switch which prevented catdoc from returning non-zero
code when invoked on a UNIX text file.
- 0.34 - Apr 28 1998
- Files are now opened in binary mode, thus allowing catdoc to work on DOS
and similar systems. All specs arrays now have a terminating NULL.
- 0.33 - October 1997
- Fixed missing terminating NUL in specs array, which caused random
segfaults on Linux and many other systems, because specs is searched
by the strchr function.
- 0.32 - August 1997
- First major public release, uploaded to CTAN. Tk interface appeared,
manual page was written. Unfortunately, this release was buggy.