ALTO XML


Michelle MORAVEC

Jan 29, 2015, 1:41:25 PM
to ant...@googlegroups.com
Hello all, 

I'm quite lucky to be receiving a HUGE corpus to work with, but so far it has been provided in ALTO XML format.  I cannot seem to get the settings right (if there is such a thing) in AntConc.  I generally work with txt files so I'm quite lost here. 

Any help would be appreciated. 

best regards

Michelle Moravec

Laurence Anthony

Jan 29, 2015, 5:13:55 PM
to ant...@googlegroups.com
Hi,

XML is text. Just open the file in AntConc and it can be analyzed. If you are unsure how to open XML files, change the extension of *one* file from xml to txt to confirm that the file can be viewed in AntConc. Then change the extension back, and check how to open XML files via the global settings.

Laurence.


###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


Michelle Moravec

Jan 29, 2015, 5:23:18 PM
to ant...@googlegroups.com
Thanks Laurence for the quick reply.  

I can get the files into AntConc, but the results still have all the XML tags in them.

I'll attach a screenshot.  Perhaps I've inadvertently ticked a box I ought not to have?  Or perhaps I'm just not understanding how XML output looks in AntConc.

Michelle Moravec

Jan 29, 2015, 5:29:19 PM
to ant...@googlegroups.com
Sorry, I should have added that I've tried hiding the tags, but the XML puts each word INSIDE the < > as an attribute value, and looking at the line I can't figure out any tags to hide that aren't going to include the CONTENT=

<String ID="String2" CONTENT="stayed" HPOS="289" VPOS="110" WIDTH="95" HEIGHT="32" STYLEREFS="TS2" WC="1" CC="9 9 9 9 9 9"/>

Laurence Anthony

Jan 29, 2015, 5:37:03 PM
to ant...@googlegroups.com
Hi again,

><String ID="String2" CONTENT="stayed" HPOS="289" VPOS="110" WIDTH="95" HEIGHT="32" STYLEREFS="TS2" WC="1" CC="9 9 9 9 9 9"/>

I see the problem. It comes back to an age-old limitation in AntConc that it only allows the user to ignore tags, but doesn't allow the user to specify content of interest within a tag.

At the moment, the only way around this limitation is with a regular expression that looks behind to find the CONTENT=" and looks ahead to find the closing ". To find "stayed" in the above line, you would use something like the following:
(?<=CONTENT=")stayed(?=")

where ?<= means look behind and ?= means look ahead.
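
For anyone who wants to check the look-behind/look-ahead idea outside AntConc, here is a minimal sketch in Python. Python's `re` module is an assumption of this example (nothing in this thread says which regex engine AntConc uses), and `find_content_words` is a made-up helper name. It runs against the sample line quoted above.

```python
import re

# The sample ALTO line quoted in this thread.
SAMPLE = ('<String ID="String2" CONTENT="stayed" HPOS="289" VPOS="110" '
          'WIDTH="95" HEIGHT="32" STYLEREFS="TS2" WC="1" CC="9 9 9 9 9 9"/>')

# (?<=CONTENT=") looks behind for the attribute opener,
# (?=") looks ahead for the closing quote; neither is part of the match.
SPECIFIC = re.compile(r'(?<=CONTENT=")stayed(?=")')

# Same idea, but matching whatever word sits inside CONTENT="...".
ANY_WORD = re.compile(r'(?<=CONTENT=")[^"]*(?=")')


def find_content_words(line):
    """Hypothetical helper: return every CONTENT value found in a line."""
    return ANY_WORD.findall(line)


print(SPECIFIC.search(SAMPLE).group(0))
print(find_content_words(SAMPLE))
```

Only the word itself is matched; the surrounding attribute syntax is used as context and left out of the hit.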

Try it and let me know if it works.

Laurence.



Michelle Moravec

Jan 29, 2015, 5:50:11 PM
to ant...@googlegroups.com
Thanks for confirming I'm not crazy.  I was pretty sure this was a non-starter.

I thought of regex, but of course doing stats on the corpus would still not be possible, right?  

I think my sole option may be scripting to remove the xml. It is very nice to be given a corpus of course so I won't complain.
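
A minimal sketch of what such a script might look like, assuming Python and assuming the words of interest all sit in the CONTENT attribute of String elements grouped under TextLine elements, as in the sample line earlier in the thread. The function names and the sample document (including its namespace) are illustrative only, and hyphenated words split across lines are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace: '{ns}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_bytes):
    """Hypothetical helper: one output line of plain text per TextLine."""
    root = ET.fromstring(xml_bytes)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Keep only the CONTENT of String children; spacing elements are skipped.
        words = [child.get("CONTENT", "")
                 for child in elem
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(w for w in words if w))
    return "\n".join(lines)


# Tiny made-up document, for illustration only.
SAMPLE = b"""<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
<Layout><Page><PrintSpace><TextBlock>
<TextLine><String CONTENT="She"/><SP/><String CONTENT="stayed"/></TextLine>
<TextLine><String CONTENT="home"/></TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>"""

print(alto_to_text(SAMPLE))
```

For real files, reading each one with `open(path, "rb").read()` and writing the result to a .txt file of the same name would give AntConc plain text to work with, so word lists and other stats become possible again.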



Laurence Anthony

Jan 29, 2015, 5:52:52 PM
to ant...@googlegroups.com
Looking at the XML in the screenshot, none of it is particularly useful for language study. I would agree that deleting the 'noise' would be a useful starting point. Notepad++ could do this easily without requiring any scripting.

Laurence.



Michelle Moravec

Jan 29, 2015, 5:54:08 PM
to ant...@googlegroups.com
Thanks.  It is a VERY large OCR'd set of women's movement periodicals that I'll be doing historical analysis on.  I'm excited to get it but so wish it were in a nicer format!

JFlorian

Jan 29, 2015, 9:21:49 PM
to ant...@googlegroups.com
A couple comments...

First, make several backups of the original file(s) before doing anything!

You'll have two problems:

1. Reviewing the OCR text for errors. Old text (newsprint or mags) often causes sm'al errors, like that stray single quote.  Or big errors like replacing small LL with capital II or other weirdness.  Typically, OCR reproduces the same or similar errors throughout the document.   I'd suggest reading through a document first.  Notice repetitive errors to see if you can do Search/Replace on those first.  But manually choose "replace" on EACH item; don't do it globally by computer, because Replace ALL will lead to more errors.  Copy-Paste exact errors to Notepad before you start replacing. Then, if something goes wrong, you'll know exactly what you were removing.


2. Removing coding/tags.  Again, read through the codes.  Notice which xml/html tags won't matter much if you remove them, for example tags for span or color.  Look for codes that might be easy to put back, either manually or with Search/Replace.  Weigh whether you want to preserve appearance or have a document to study.  If what's most important is studying the document, I'd ditch the height/width tags and those ID tags.  Try it on one or two lines and check the appearance; you can always put back what you removed.  Or, copy-paste a <...br> tag after each line, just to keep things partly readable, and delete all the other codes except for paragraphs and line breaks <...p>.   Go over anything in a table very carefully before removing those tags.  I always hated to lose italics <...i> and bold <...b> but I've deleted those too sometimes just to get a clean document.  But I was cleaning files made in MS Word, the worst proprietary coding beast that was ever invented.   Note:  Make sure you remove FULL tags, including any brackets, colons, etc. used inside a <...tag>, or you'll have a huge mess of code punctuation/delimiters mixed into your text punctuation.
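
On point 1, the "look at each one before you replace it" advice can also be sketched in Python for anyone who would rather script the review than do it by hand. This is only an illustration: the helper name, the sample sentence, and the suspected error pattern are all invented. It lists every hit for a suspected OCR error with a little context, so each hit can be judged individually before anything is changed.

```python
import re


def show_hits(text, pattern, width=15):
    """Hypothetical helper: return (position, context) for every match."""
    hits = []
    for m in re.finditer(pattern, text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append((m.start(), text[start:end]))
    return hits


# Invented sample: the first "II" is an OCR error for "ll",
# the second is a genuine Roman numeral.
TEXT = "The women wiII meet. Chapter II begins."

for position, context in show_hits(TEXT, r"II"):
    print(position, repr(context))
```

In the invented sample, a blanket replace of "II" with "ll" would repair "wiII" but also damage "Chapter II", which is the kind of collateral error a Replace ALL invites.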

Judy