Hello,
I am working with the Santa Barbara Corpus of Spoken American English (SBCSAE) in WordSmith Tools 5.0, and I would like to ask group members with more expertise than I have for their help, since I have run into some problems:
1. Part 2 of the corpus (16 XML files) has around 47,000 tokens. However, the wordlist statistics report 72,771 tokens. I have run a few concordances, and the problem seems to be that some tags are being processed as words. I also get many lines containing just a few words (although I have set Concord to save 1,000 characters per entry), so I cannot read the concordances and exclude the hits in which the search word belongs to a different grammatical category from the one under study.
2. I have tried to convert the files to TXT format with Text Converter, but it has not worked, probably because I used the conversion file for the BNC and the tagging may be different. Does anyone know where I can find a conversion file for SBCSAE, or how I can create one myself? I have found the transcription conventions on the web (http://projects.ldc.upenn.edu/SBCSAE/transcription/sb-csae-conventions.html), but I do not know how to use them to create a conversion file.
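For reference, one possible workaround outside WordSmith is to strip the markup with a short script before building the wordlist. The following is only a minimal sketch in Python, not a tested recipe for SBCSAE: it assumes the files are UTF-8 XML in which everything between angle brackets is markup, and the function names, the folder names and the `who` attribute in the sample line are hypothetical. The SBCSAE transcription symbols themselves (pauses, overlaps and so on) are not handled, and the simple pattern would misfire on a comment that contains a `>` character.

```python
import html
import re
from pathlib import Path

# Anything between angle brackets is treated as markup (assumption).
TAG_RE = re.compile(r"<[^>]+>")
# A token is a run of letters, optionally joined by an apostrophe or hyphen.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def strip_markup(xml_text):
    """Remove tags, decode entities such as &amp;, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", xml_text)
    decoded = html.unescape(no_tags)
    return " ".join(decoded.split())


def count_tokens(text):
    """Count word-like tokens in a string."""
    return len(TOKEN_RE.findall(text))


def convert_folder(src_dir, dst_dir):
    """Write a .txt copy of every .xml file and return token counts per file."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    counts = {}
    for xml_file in sorted(src.glob("*.xml")):
        raw = xml_file.read_text(encoding="utf-8")
        plain = strip_markup(raw)
        (dst / (xml_file.stem + ".txt")).write_text(plain, encoding="utf-8")
        # Store (tokens with tags counted as words, tokens after stripping).
        counts[xml_file.name] = (count_tokens(raw), count_tokens(plain))
    return counts


if __name__ == "__main__":
    # Hypothetical folder names; adjust to wherever the corpus files live.
    for name, (with_tags, without_tags) in convert_folder("sbcsae_xml", "sbcsae_txt").items():
        print(name, with_tags, without_tags)
```

Comparing the two counts per file would show how much of the difference between 47,000 and 72,771 tokens comes from tags being counted as words, and the resulting TXT files could then be loaded into WordList and Concord directly.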
Many thanks in advance and best regards,
Pilar González
Thank you very much for the update. --- On Fri, 14/8/09, Mike <mi...@lexically.net> wrote: