Conversion file for Santa Barbara Corpus of Spoken American English


Pilar González

Aug 11, 2009, 11:10:15 AM
to WordSmi...@googlegroups.com

Hello,


I am working with the Santa Barbara Corpus of Spoken American English (SBCSAE) in WordSmith Tools 5.0, and I would like to ask the more experienced group members for their help, since I have encountered some problems:


1.    Part 2 of the corpus (16 XML files) has around 47,000 tokens. However, according to the statistics in the wordlist, there are 72,771 tokens. I have tried to obtain a few concordances, and the problem seems to be that some tags are being processed as words. I also get many lines with just a few words (although I have set Concord to save 1,000 characters per entry), so it is not possible to read the concordances and exclude tokens where the search word belongs to a different grammatical category from the one under study.


2.    I have tried to convert the files to TXT format with Text Converter, but it has not worked, probably because I used the conversion file for the BNC and the tagging may be different. Does anyone know where I can find a conversion file for SBCSAE, or how I can create one myself? I have found the transcription conventions on the web (http://projects.ldc.upenn.edu/SBCSAE/transcription/sb-csae-conventions.html), but I do not know how to use them to create a conversion file.


Many thanks in advance and best regards,


Pilar González

Mike

Aug 12, 2009, 4:41:56 AM
to WordSmith Tools
Pilar, hi

I haven't got SBCSAE myself so am not sure. But here is how I would
tackle the problem.

1. Open up just one of the shortest corpus texts in Notepad. Look at
the format and try to understand how it works. Nowadays many XML
formats have a large amount of mark-up between the individual words of
the text, and each bit is usually marked up with <>. The format is
unlikely to be identical to the one used in the BNC XML version.
But the idea will be similar, possibly the word plus the lemma, the
part-of-speech, maybe some meaning or pronunciation or discourse mark-
up too.

2. Make an untagged version using Text Converter ("removing all tags"
at http://www.lexically.net/downloads/version5/HTML/?convert_text_file_format.htm)
and read that to see if it looks coherent. You should just get the
words spoken, without part-of-speech etc. Probably without the
people's names either. The untagged version is what is meant by a TXT
version (just changing the ending from .xml to .txt is not enough!).
Now make a word-list of it so as to know what words there ought to be
after processing the tagged version.
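
[Editor's note: the untagged version described in step 2 can be sanity-checked outside WordSmith with a crude script like this. It is only a sketch, assuming that tags are everything between < and > -- Text Converter remains the proper tool:]

```python
import re

def strip_tags(xml_text):
    """Crudely remove everything between < and >, then tidy whitespace.
    A rough approximation of Text Converter's 'removing all tags',
    just for eyeballing what the untagged output should look like."""
    no_tags = re.sub(r"<[^>]*>", " ", xml_text)
    return re.sub(r"\s+", " ", no_tags).strip()

# A made-up one-utterance sample in the style of the corpus mark-up:
sample = '<u who="LYNNE"><w>But</w> <w>strange</w></u>'
print(strip_tags(sample))  # But strange
```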

3. If there's a header at the start of each corpus text look for a
sign that it ends, such as </header>. You might want to clean that
out too if you want only the plain words of the text itself.

4. After you've done 2. you will understand better what the mark-up
formats are like, and now reading the corpus documentation will be
much easier. You should now be able to decide what needs converting. I
wonder whether anything needs converting, actually -- you probably
just need to get WordSmith to handle the corpus right.

5. One possibility is that the closing > might be too far away from
the opening < sometimes. That is set by altering the "search-span" --
look at the screenshot in http://www.lexically.net/downloads/version5/HTML/?proc_tags_as_selectors.htm
to find it, where it is set at 200. In that case, if the > comes,
say, 250 characters after the <, WordSmith will assume that the mark-up
is ordinary text and will not ignore it.

Cheers -- Mike

Pilar González

Aug 12, 2009, 5:43:18 AM
to WordSmith Tools
Hi Mike,

Many thanks for your reply. I agree that the best option would be to
handle the files properly without having to delete the tags. I have
tried increasing the search span as you suggested, but the wordlist
was reduced by only a few hundred words. Checking the wordlist, I have
noticed that words intended to represent paralinguistic signs (e.g.
SNIFF or LAUGH) are included as tokens. These are some
samples of the concordance lines I have obtained for "well":

SNIFF <w>Well
<w>We<p type="drawl"/>ll no
in <w type="fragment">Well
in:long <w>Well you know I

The concordance lines are very difficult to interpret, which is bad
enough, because I have to read all of them to ascertain whether the
items are used as discourse markers; but my biggest problem is not
being able to obtain accurate statistics.

Maybe I need to change the settings relating to the Mark-Up that
should be excluded or included. I have tried creating an empty tag
file for the Mark-up to be included, since I thought this might prevent
tags from being processed as words, but it made no difference.

I paste here a sample of one of the files (a single utterance):

<u who="LYNNE" uID="u666">
<pause symbolic-length="long" />
<w>But</w>
<w>he's</w>
<w>just</w>
<w>really</w>
<w>really</w>
<w>really</w>
<w>strange</w>
<t type="p" />
<media start="1509.960" end="1515.320" unit="s" />
</u>
</CHAT>

This is the end of the file, so </CHAT> is probably signalling the end
of the header. I also tried to change this in the tag settings, but
again it did not solve the problem.
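[Editor's note: since these files are well-formed XML, the true spoken-word count can be checked independently of WordSmith with Python's standard-library ElementTree. This is only a sketch; the talkbank namespace is taken from the file header quoted later in the thread:]

```python
import xml.etree.ElementTree as ET

# Default namespace declared in the <CHAT> root element of each file.
NS = "{http://www.talkbank.org/ns/talkbank}"

def count_words(xml_string):
    """Count <w> elements (the actual spoken words) in one CHAT file,
    ignoring <e>, <pause>, <media> and other non-word elements."""
    root = ET.fromstring(xml_string)
    return sum(1 for _ in root.iter(NS + "w"))

sample = (
    '<CHAT xmlns="http://www.talkbank.org/ns/talkbank">'
    '<u who="LYNNE" uID="u666">'
    '<w>But</w><w>strange</w>'
    '<e type="happening">SNIFF</e>'
    '</u></CHAT>'
)
print(count_words(sample))  # 2 -- SNIFF is not a <w>, so it is not counted
```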

I have also tried to convert to TXT removing all tags as you
suggested, yet paralinguistic information (e.g. in:long) is still
kept:

in:long
and
I'm
not
ex

that
good
or

I hope there is a quicker solution than going through the converted
files one by one and deleting the unwanted information. Maybe I need
to do some more changes in the tag settings?

Thank you again for your help and kind regards,

Pilar



Mike

Aug 12, 2009, 8:53:47 AM
to WordSmith Tools
You need to spend some time looking carefully at the source texts, as
I said.

> was reduced only in a few hundred words. Checking the wordlist, I have
> noticed that words intended to account for paralinguistic signs (i.e.
> SNIFF or LAUGH) are included in the wordlist as tokens. These are some

You'd need to find cases of SNIFF to see exactly how they're marked
up. From the sample you supplied all I could see was that words are
marked with <w> before and </w> after. What about the paralinguistic
features that you want to cut out? For example if
it was <paralinguistic> sniff </paralinguistic> then you could handle
that in WS5 by making a file of markup to EXclude which had
<paralinguistic>*</paralinguistic>
(See http://www.lexically.net/downloads/version5/HTML/?tag_file.htm)

> This is the end of the file, so </CHAT> is probably signalling the end
> of the header. I also tried to change this in the tag settings, but

Not safe to assume that end of header is </CHAT>. Look at the start
of the text and find it. It will be the same for all texts, I bet.

Cheers -- Mike

Pilar González

Aug 12, 2009, 11:40:45 AM
to WordSmith Tools
Hi Mike,

After examining the XML files, I created a TAG file with this
information:

<long-feature type="end">*</long-feature>
<long-feature type="begin">*</long-feature>
<w type="fragment">*</w>
<long-feature type="begin">*</long-feature>
<e type="happening">*</e>
<w type="fragment">*</w>

Is there anything else I need to add? After changing the tag settings
(loading the TAG file in the Mark-up to exclude), I am still getting
all of the SNIFF, LAUGH, etc., although I now know that they are
represented in the XML files with <e type="happening">SNIFF</e> and <e
type="happening">laug</e>, so they should be covered. Maybe there is
something wrong with my syntax in the TAG file?

Regarding </CHAT> as the end of header, I had checked previously and
the files start like this:

<?xml version="1.0" encoding="UTF-8" ?>
<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.talkbank.org/ns/talkbank" xsi:schemaLocation="http://
www.talkbank.org/ns/talkbank http://talkbank.org/software/talkbank.xsd"
Media="02" Mediatypes="audio" Font="CAfont:13:0" Version="1.4.2"
Lang="en" Corpus="SBCSAE" Id="02" Date="1984-01-01">

...so I guess this is right.

Many thanks for your help. Previously I had used WordSmith Tools only
to search small DIY corpora for terminology and collocational
patterns, so all this tagging business is new to me.

Cheers,

Pilar

Mike

Aug 14, 2009, 6:08:15 AM
to WordSmith Tools
Pilar, hi

I have just tested with this little text:

***
<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.talkbank.org/ns/talkbank"
xsi:schemaLocation="http://
www.talkbank.org/ns/talkbank http://talkbank.org/software/talkbank.xsd"
Media="02" Mediatypes="audio" Font="CAfont:13:0" Version="1.4.2"
Lang="en" Corpus="SBCSAE" Id="02" Date="1984-01-01">

SOME WORDS WHICH SHOULD BE INCLUDED.
<e type="happening">SNIFF</e>

</CHAT>
***
and with this in a .tag file

<e type="happening">*</e>

And I set the settings which you can see at
http://www.lexically.net/downloads/version5/HTML/?proc_tags_as_selectors.htm
so that the search span was 1000 (not 200 as in the picture, because
your CHAT mark-up is so lengthy)
and the Markup to EXclude file was the one mentioned above.
When I ran a word list I got the words wanted and no SNIFF.
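[Editor's note: the effect of that single tag-file line can be imitated outside WordSmith by turning the wildcard pattern into a regular expression and deleting the matches before word-listing. A rough sketch, assuming * stands for any stretch of characters, as it does in WordSmith tag files:]

```python
import re

def wildcard_to_regex(pattern):
    """Turn a WordSmith-style exclusion line such as
    <e type="happening">*</e> into a regex, treating * as 'anything'."""
    return re.escape(pattern).replace(r"\*", ".*?")

def exclude(text, pattern):
    """Delete every stretch of text matching the wildcard pattern."""
    return re.sub(wildcard_to_regex(pattern), " ", text)

sample = 'SOME WORDS WHICH SHOULD BE INCLUDED. <e type="happening">SNIFF</e>'
cleaned = exclude(sample, '<e type="happening">*</e>')
print("SNIFF" in cleaned)  # False -- the happening element is gone
```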

Cheers -- Mike

Zainur Rijal Abdul Razak

Aug 14, 2009, 6:13:32 AM
to WordSmi...@googlegroups.com
Thank you very much for the update.


Pilar González

Aug 14, 2009, 12:19:47 PM
to WordSmith Tools
Hi Mike,

I really do not understand what is wrong with these files. I have
tried again, and I am getting all of the sniffs and laughs. Then I
tried to copy and paste the XML files one by one into TXT files and
then tried one more time. Curiously, I get even more types and tokens
this time, and of course sniffs, laughs, etc. are still there.

I have checked the tag settings several times and they are exactly
like the ones in the link you sent, except for the search span and the
tag file in Mark-up to EXclude, which has this text:

<long-feature type="end">*</long-feature>
<long-feature type="begin">*</long-feature>
<w type="fragment">*</w>
<long-feature type="begin">*</long-feature>
<e type="happening">*</e>
<a type="comments">*</a>
<nonvocal type="end">*</nonvocal>

I had previously searched several files for the word "laugh" and
checked the mark-up again carefully to make sure that every possible
tag was included.

I do not know if there is anything else that can be done. I have
contacted the Linguistic Data Consortium to check if they had
unannotated transcriptions, but they do not.

Many thanks for your help anyway.

Cheers,

Pilar

Mike

Aug 14, 2009, 12:36:41 PM
to WordSmith Tools
Well, try just putting the one line that I showed in your markup to
EXclude tag file, leaving out the others, and run the process on just
one text file where you are certain SNIFF appears. That is what I did.
If it doesn't work then you are probably doing something wrong; it
works for me and should work for you too (assuming you are using the
latest download of WS5). If it works, add more tag file lines and see
if it still works.

Mike

Pilar González

Aug 15, 2009, 6:02:44 AM
to WordSmith Tools
Hi Mike,

I think the problem was related to the installation, because I tried
to change other settings, like the columns to be shown in the
concordances, and the changes were not taking effect. I have installed
the software again, and this time the changes I make in the settings
actually work, so finally no more LAUGH or SNIFF!

A million thanks for your help.

Cheers,

Pilar