Thanks.
Do you know of any 3rd party tools that convert these files to HTML? Or which
export option best suits this purpose? The problem is that my client uses
Ventura and I don't have it. And he's far from professional. I need to import
stuff from Ventura in a readable format from which I can extract text (no
graphic objects) and identify key styles to recognize titles, keywords, etc.
Benny
The good news is that V4.x stores most text in external text files. The file
format can be a variety of things, including ASCII, so he can run them through
Ventura and save them as ASCII for you.
If it's saved as ASCII, the text will have Ventura markup codes in it, but they
are actually very close to HTML, so it should be easy to go through and convert
them.
As for third party tools, the current version of MasterHelp (Performance
Software, 804-794-1012) will convert Ventura files to Word for Windows. Contact
Performance Software for specific info regarding conversions.
(http://www.richmond.infi.net/~psi/).
Zandar provides filters for Ventura 4.2 and 5 that convert to formatted RTF
files. They have a web page at http://www.tagwrite.com. Please understand that
this product is sold as is without any available support.
These vendors might have HTML conversion support as well, but I'm not sure.
-- Eric
[C_TECH Volunteer]
No problem doing that. The text and style markup part of a VP4.2 file is
usually held as an ascii file (could be another wp format, according to
choice of the user), with a simple SGML-based mark-up scheme with a lot of
similarities to HTML.
Extracting the text and styles to recognize titles, keywords etc and index
them is pretty much a doddle if the file has been consistently formatted.
The actual mark-up coding of the text files is fully documented in the
Ventura help file. In simplest terms:
A paragraph style is indicated by para starting with @stylename =
Rest-of-para-text....
If no stylename is indicated, it is implicitly 'Body text' style.
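As a rough illustration of the paragraph tagging just described, here is a minimal Python sketch. It assumes each paragraph has already been pulled out of the text file as one string (how paragraphs are separated is not covered here), and the function name split_paragraph is made up for this example.

```python
import re

# Assumed syntax, taken from the description above: "@stylename = text".
# Untagged paragraphs fall back to the implicit 'Body text' style.
TAG_RE = re.compile(r"^@(?P<style>[^=]+?)\s*=\s*(?P<text>.*)$", re.DOTALL)

def split_paragraph(para):
    """Return (style_name, text) for a single paragraph string."""
    match = TAG_RE.match(para)
    if match:
        return match.group("style").strip(), match.group("text")
    return "Body text", para

if __name__ == "__main__":
    for p in ["@Chapter Title = Introduction", "Just an ordinary paragraph."]:
        print(split_paragraph(p))
```

Collecting the style names this returns across a whole file is one way to see which tags mark titles, keywords and so on. The special @ tags used for tables are not handled by this sketch.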
The @ tagging is also used with special tags to construct tables of any
complexity required. It is easy to skip over tables or extract the raw text
and structure. In later versions it is also possible to embed style
overrides to the basic tagged style on a para-by-para basis - also by simple
inline text coding.
In-line attributes are delimited with <brackets> as in HTML. For example:
Bold is turned on with <B>
Italic with <I>
Underscore with <U>
Normal default text with <D>
The attributes may be combined (eg <BI>Bold italic<D>) and there are many
more attributes for controlling colour, point size, baseline shifts,
kerning and so on. Just as in HTML, you can essentially ignore the ones you
don't recognize. The actual < and > characters are encoded as << and >>.
Non-ascii7 characters are encoded by number - a simple number for ascii
codes (eg <130> for e-acute) or prefixed with @ for ansi codes (eg <@150>
for en-dash).
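To make the escaping and character-code rules concrete, here is a small Python sketch that strips attribute codes and decodes the numbered characters. It rests on two assumptions that are not stated above: that the plain "ascii" numbers refer to the IBM PC code page (cp437, where 130 is e-acute) and that the "ansi" numbers refer to Windows-1252 (where 150 is the en-dash). The name decode_inline is hypothetical.

```python
import re

# Order matters: doubled brackets first, then numeric codes, then any other <code>.
TOKEN_RE = re.compile(r"<<|>>|<(@?)(\d+)>|<([^<>]*)>")

def decode_inline(text, keep_codes=False):
    """Decode << >> escapes and numbered characters; drop (or keep) attribute codes."""
    def repl(m):
        tok = m.group(0)
        if tok == "<<":
            return "<"
        if tok == ">>":
            return ">"
        if m.group(2) is not None:
            value = int(m.group(2))
            if value > 255:
                return ""  # not a single-byte character code; ignored in this sketch
            # Assumption: "@" prefix means Windows-1252, no prefix means cp437.
            codec = "cp1252" if m.group(1) else "cp437"
            return bytes([value]).decode(codec, errors="replace")
        return tok if keep_codes else ""
    return TOKEN_RE.sub(repl, text)

if __name__ == "__main__":
    print(decode_inline("caf<130> <@150> <B>bold<D> a <<b>>"))
```

If the files were written on a system using a different code page, the two codec names would need to change accordingly.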
Hope that helps - check out the Help file, and take a look at the text files
associated with the documents you want to index - you will see it is very
straightforward, certainly no harder than parsing HTML. Well, just a tad
harder, because Ventura doesn't use an always-paired /switch to turn off
attributes; it uses a general return-to-default code, or a 255 code for
numerical values (eg <P8>8-point section<P255> back to default size text).
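Because attributes are switched off by a return-to-default code rather than a paired closing tag, a converter to HTML has to remember what is currently switched on. The Python sketch below shows one possible way to do that for the B, I and U attributes only. It assumes attributes accumulate until <D> is seen, and it simply ignores every other code (including point sizes and numbered characters). The name to_html and the tag mapping are choices made for this example, not anything Ventura defines.

```python
import html
import re

HTML_TAGS = {"B": "b", "I": "i", "U": "u"}  # only the attributes named above
TOKEN_RE = re.compile(r"<<|>>|<([^<>]*)>")

def to_html(text):
    """Convert B/I/U attribute codes to paired HTML tags, closing them on <D>."""
    out, open_tags, pos = [], [], 0
    for m in TOKEN_RE.finditer(text):
        out.append(html.escape(text[pos:m.start()]))
        pos = m.end()
        tok, code = m.group(0), m.group(1)
        if tok == "<<":
            out.append("&lt;")
        elif tok == ">>":
            out.append("&gt;")
        elif code == "D":
            # Return to default: close everything that is open, innermost first.
            while open_tags:
                out.append(f"</{open_tags.pop()}>")
        elif code and all(c in HTML_TAGS for c in code):
            for c in code:  # handles combined codes such as <BI>
                tag = HTML_TAGS[c]
                if tag not in open_tags:
                    open_tags.append(tag)
                    out.append(f"<{tag}>")
        # Any other code (eg <P8>, <P255>) is skipped in this sketch.
    out.append(html.escape(text[pos:]))
    while open_tags:  # close anything still open at the end of the paragraph
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)

if __name__ == "__main__":
    print(to_html("<BI>Bold italic<D> plain <P8>small<P255>"))
```

Closing any open tags at the end of each paragraph keeps the HTML well-formed even when the source never returns to default.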
> Plus, did Ventura doc files change their structure in later versions and
> when?
They didn't change the text mark-up system, except by adding more and more
codes for fancier features. The attribute system is still plain ascii markup
in <brackets>.
But, as a user option in later versions of Ventura the text file can be
embedded along with the graphics, layout and style definitions, in a single
compound publication file. It is held internally as an RTF file object in a
Microsoft compound document structure (just as in MS Word and loads of other
apps). If you have access to a suitable copy of Ventura, you can tell it to
re-export the text to an ascii/ansi or RTF file.
The later versions respond to all the same inline codes as the earlier ones,
but generally use more complex versions. For example, a weight command
replaces Bold, and its variants, including normal, are coded by weight
numbers (eg <W0> for normal weight).
As I said, all this is documented in the help file of each version, and is
not too complex.
Hope that helps.
Alex Gray [C_Tech]
Corel User magazine
www.coreluser.com
Thanks a lot!
Benny