chat to text conversion (accents in Spanish)


Elnaz Kia

28 Oct 2021, 13:28:09
to chibolts
Hi Everyone,

I have a question about accents in Spanish. So, here is what I have on a .cha transcript file:

*STU: yo [:: _] me gusta ver(lo) él [:: _] porque él es muy bien
[:: bueno] en el xxx eh porque es el goleador .

And when I convert the file to text, this happens:
*STU:	yo [:: _] me gusta ver(lo) él [:: _] porque él es muy bien 
[:: bueno] en el xxx eh porque es el goleador .

And this is how I convert .cha files to .txt files:

chstring +re +cbullets.cut *.cha
ren -f +re *.chstr.cex *.txt 

My question is, how can I avoid this problem?

Thanks,
Elnaz




Brian Macwhinney

28 Oct 2021, 14:06:58
to ChiBolts, Elnaz Kia
Dear Elnaz,
     In order to display accented characters such as é, as well as other diacritics, CLAN relies on the Arial Unicode font, which covers not only special European characters but also Chinese, Sinhalese, etc., because it spans a very large part of Unicode.  If you convert a CHAT file viewed in Unicode to .txt format, you are going to see what you are seeing now unless your editor for the .txt file can read it as Unicode and lets you load a Unicode font.
 
— Brian MacWhinney
Teresa Heinz Professor of Cognitive Psychology, 
Computational Linguistics, 
and Modern Languages, CMU


Leonid Spektor

28 Oct 2021, 14:50:37
to chib...@googlegroups.com, Elnaz Kia
Elnaz,

I just want to add more specific information to what Brian wrote. Your text editor needs to be able to display Unicode characters encoded as UTF-8. If you open the .txt file with CLAN, you will see that the characters from the .txt file are displayed correctly.
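The garbling described in this thread is typical of UTF-8 bytes being decoded with a single-byte encoding. Below is a minimal Python sketch of that effect. It assumes the viewer falls back to Windows-1252 (cp1252); the thread does not say which encoding was actually involved, and `misread_utf8` is a hypothetical helper name used only for illustration.

```python
def misread_utf8(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding.

    This imitates a viewer that does not know the file is UTF-8.
    The cp1252 default is an assumption, not something stated in the thread.
    """
    return text.encode("utf-8").decode(wrong_encoding)


line = "él es muy bien"
garbled = misread_utf8(line)
print(garbled)  # Ã©l es muy bien

# The bytes on disk are still intact: reversing the wrong decoding
# recovers the original text.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered == line)  # True
```

If the garbled text in a viewer follows this pattern, the file itself is probably fine and only the decoding step is wrong.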

If the characters in your .txt file are not displayed correctly in CLAN, then please make sure that you have the latest version of CLAN. Otherwise, please email a sample file that shows this problem to me directly for further testing.

Copying the line from the .cha transcript in your email below to a test file on my computer and then running the CHSTRING and REN commands produces the correct result on my computer.


Leonid.

Elnaz Kia

28 Oct 2021, 15:03:21
to Leonid Spektor, chib...@googlegroups.com
Dear Leonid and Brian,

Thank you both so much for your detailed responses.

@Leonid Spektor you are right. I just realized that the txt files that I created using the CHSTRING and REN commands on my computer were correct. However, when the database crew at the university upload the txt files to the university database, the txt files do not show the correct characters. Do you have any solutions for this? 
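One way to narrow down where the characters break is to inspect the raw bytes of a .txt file before it is uploaded and again after it is downloaded from the database. The following Python sketch classifies a byte string; `describe_encoding` is a hypothetical helper, and the sample line is adapted from the transcript above.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Classify raw file bytes as ASCII, UTF-8 (with or without BOM), or neither."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    if text.isascii():
        return "ASCII only (no accented characters present)"
    return "UTF-8 with BOM" if has_bom else "UTF-8 without BOM"


# In practice the bytes would come from a file, e.g. open(path, "rb").read().
sample = "*STU:\tél es el goleador .\n".encode("utf-8")
print(describe_encoding(sample))  # UTF-8 without BOM
```

If the file is valid UTF-8 before the upload but not after it, the conversion is happening somewhere in the upload process rather than in CLAN.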

Another issue is with the pdf versions of the mentioned files. Even on my computer when I convert the correctly converted txt files to pdf, it does not show the characters correctly. Are there any solutions for this problem? Note: I create the pdf files in batch using the Adobe Acrobat DC Create PDFs Tool.

Many thanks for taking the time and answering my questions!

Best,
Elnaz


Brian Macwhinney

28 Oct 2021, 15:08:24
to ChiBolts, Elnaz Kia, Leonid Spektor
When I convert files to PDF using Adobe Acrobat DC, all the characters look fine.
I just did this for one file.  Perhaps batch doesn’t work well?

Why are your computer people uploading in text format? They should just be uploading in CHAT format.

— Brian MacWhinney

Leonid Spektor

28 Oct 2021, 15:20:31
to Brian Macwhinney, ChiBolts, Elnaz Kia
Elnaz,

I am not familiar with Adobe Acrobat DC, so I will defer to Brian's email.

The problem with uploading the txt files to the university database is something that the people who support this process need to clarify. It is possible that they expect a BOM (byte order mark) at the beginning of the txt file to explicitly indicate the text encoding. It would be best to tell them that those text files are UTF-8 encoded and let them say what they believe is missing from those files for the upload to work. Normally, newer text editors can automatically detect a text file's encoding and adjust their display accordingly.
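If the upload process does turn out to expect a BOM, it can be added without touching the rest of the file. The Python sketch below is one way to do that; it is not part of CLAN, the helper names `with_utf8_bom` and `add_bom_to_file` are hypothetical, and whether the database actually needs a BOM is an open question in this thread.

```python
import codecs
from pathlib import Path


def with_utf8_bom(data: bytes) -> bytes:
    """Return the bytes prefixed with a UTF-8 BOM, unless one is already there."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    # Refuse to label non-UTF-8 data as UTF-8: this raises UnicodeDecodeError.
    data.decode("utf-8")
    return codecs.BOM_UTF8 + data


def add_bom_to_file(path: Path) -> None:
    """Rewrite one file in place so that it starts with a UTF-8 BOM."""
    path.write_bytes(with_utf8_bom(path.read_bytes()))
```

Calling `add_bom_to_file(p)` for each `p` in `Path(".").glob("*.txt")` would process a whole folder. Working on copies first is safer, since the files are rewritten in place.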


Leonid.

Elnaz Kia

28 Oct 2021, 15:30:26
to Leonid Spektor, Brian Macwhinney, ChiBolts
Hi Leonid,

I just forwarded your message to them. Now that you mention it, I think I might have had something to do with that, because another step that I took after creating the text files was to remove the @UTF8 and @Window lines at the beginning of the text files. :-(
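Removing those two header lines does not by itself change the encoding of the remaining text, provided the tool that does the removal reads and writes the file as UTF-8. A minimal Python sketch of such a step follows; the helper names are hypothetical, and the thread does not establish that the removal was the cause of the upload problem.

```python
HEADER_PREFIXES = ("@UTF8", "@Window")


def strip_clan_headers(text: str) -> str:
    """Drop lines that start with @UTF8 or @Window and keep everything else."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        # A leading BOM, if any, is ignored for the comparison
        # (and is dropped together with the @UTF8 line).
        if not line.lstrip("\ufeff").startswith(HEADER_PREFIXES)
    ]
    return "".join(kept)


def strip_headers_in_file(path: str) -> None:
    """Rewrite one file in place, reading and writing explicitly as UTF-8."""
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding="utf-8", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(strip_clan_headers(text))
```

The explicit `encoding="utf-8"` matters here: relying on the platform default when reading or writing is a common way to corrupt accented characters during this kind of edit.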

Referring to what Brian said about the pdf tool working for him, I just tried that again with the same txt file with the @UTF8 line intact and still got the incorrect results. 

Best,

Brian Macwhinney

28 Oct 2021, 15:44:52
to Elnaz Kia, ChiBolts, Leonid Spektor
Dear Elnaz,
     I meant that I was able to make a fine-looking PDF from the original CHAT file.  If you make it from the TXT file that you created, it will indeed have the problem you described.

— Brian MacWhinney

Elnaz Kia

28 Oct 2021, 15:59:08
to Brian Macwhinney, ChiBolts, Leonid Spektor
Dear Brian,

Oh, got it! I have two questions, though:

1. How did you convert .cha to .pdf? This is the message that I get when I try to do the same.
2. Also, the txt file that I sent to you looked fine in terms of showing accents. Why do you think it should not be working when we convert it to pdf?

[Attached image: image.png]

Many thanks!