split words/hyphenation problems in creating text files for corpus building

kat

Oct 12, 2011, 3:31:03 AM
to AntConc-discussion
Many journal articles in PDF format are set in two columns, so words are
often split across lines with hyphens or underscore marks. When students
copy, paste, and save these articles as text files to build a corpus, the
split words are not recognized by AntConc.

Laurence, thanks for your help in solving this issue; I will describe
your solution here for others:

After copying and pasting the text into a doc file, go to EDIT, then FIND
AND REPLACE. In FIND WHAT, type _^p. This finds each underscore mark (_)
that is immediately followed by a paragraph mark (^p), i.e., a carriage
return. Leave the REPLACE WITH field blank (so that "_^p" is replaced with
nothing). Click REPLACE ALL and SAVE. Now you can save the document as a
text file. The first few times I tried this, it didn't work, so if there is
a problem, first use FIND to search for a simple word ("the" or "is"). Once
it snaps out of its reverie, it will then find and replace the codes.
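
For anyone who prefers a script, the same find-and-replace can be sketched in Python. This is only an illustration, not part of the procedure above: the function name `join_split_words` is made up for this example, and treating an end-of-line hyphen the same way as an underscore is an assumption (the Word steps above only target the underscore).

```python
import re

# Matches an underscore or hyphen at the very end of a line, together with
# the line break that follows it (Windows "\r\n" or Unix "\n").
# Assumption: hyphens are handled like underscores here.
SPLIT_WORD = re.compile(r"[_-]\r?\n")


def join_split_words(text: str) -> str:
    """Delete each end-of-line underscore/hyphen along with its line break."""
    return SPLIT_WORD.sub("", text)


sample = "The experi_\nment tested recog-\nnition of words.\n"
print(join_split_words(sample))
```

Note that a genuine hyphenated compound that happens to break at the end of a line (e.g. "well-" / "known") would lose its hyphen with this pattern, so it is worth spot-checking the output.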

I have 30 students this term creating their own corpora from web-based
journal articles in their fields. Only one student reported this
problem.

Laurence Anthony

Oct 12, 2011, 5:53:49 AM
to ant...@googlegroups.com
Hi Kat,

Thank you for explaining your procedure to the group. This is probably the easiest way for non-specialists (e.g., students) to pre-process corpus texts. However, if you are a researcher wanting to batch pre-process an entire corpus, the best way is probably to use a more advanced text editor like Notepad++ or TextPad. With these tools, you can specify exactly what characters to find and replace, and then do a batch process for the entire corpus.

(Of course, you could also write a simple Perl, Python, or R script to do the same thing, but it's probably not worth it in this case.)

Regards,
Laurence.

Dr CK Jung

Oct 12, 2011, 6:35:06 AM
to ant...@googlegroups.com
Hi Kat

Thanks for sharing this. Very helpful!

Best Rgds
CK

> --
> You received this message because you are subscribed to the Google Groups "AntConc-discussion" group.
> To post to this group, send email to ant...@googlegroups.com.
> To unsubscribe from this group, send email to antconc+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/antconc?hl=en.
>
>

--
---
Best regards
Dr CK Jung BEng (Hons) Birmingham MSc Warwick EdD Warwick Cert Oxford

Senior Research Fellow
Center for Corpus Research, 306-1 Widang Hall, Yonsei University, 50
Yonsei-ro, Seodaemun-gu, Seoul, 120-749, Korea
Email: corp...@yonsei.ac.kr
Tel (Direct): +82 (0)2 2123 7516
Fax: +82 (0)2 362 2381

Lecturer
Department of English Language Education, Inha University, 253
Yonghyun-dong, Nam-gu Incheon, 402-751, Korea
Email: c.k....@inha.ac.kr
Tel: +82(0)32 860 7850
Fax: +82 (0)32 875 3857

Thesis Supervisor & Tutor
Centre for English Language Studies, Westmere House, University of
Birmingham, Edgbaston, Birmingham, B15 2TT, UK
Email: c.k....@bham.ac.uk
Tel: +44 (0)121 414 5695
Fax: +44 (0)121 414 3298

> Referee, Language and Intercultural Communication (Listed in the Thomson Reuters Social Sciences Citation Index (SSCI)).
> Referee, The English Linguistics Society of Korea (Listed in the Korea Research Foundation Citation Index)
> Planning Director, The Applied Linguistics Association of Korea (ALAK), South Korea.
> Advisory Committee, The (First) Asia Pacific Corpus Linguistics Conference, New Zealand http://corpling.com/conf/
> Columnist, Monthly Chosun (the largest monthly politics and news magazine in South Korea) http://monthly.chosun.com/client/column/list.asp?C_CC=C&tbKey=C.K.Jung

For more about Dr Jung's work, please visit: http://plaza4.snu.ac.kr/~ckjung

Warren Tang

Oct 12, 2011, 6:49:35 AM
to ant...@googlegroups.com
Kat and Laurence,
I already use Word, Perl and a couple of other programs to do my text cleaning (it is enough for most of what I need to do). But I would like to know how to do it in Notepad++ and to do batch edits to streamline and improve my editing process.

Laurence, any recommendations of webpages and books which could teach me this?


Regards,
Warren

Laurence Anthony

Oct 12, 2011, 8:18:01 AM
to ant...@googlegroups.com
Hi All,

I've attached a screenshot of the search/replace window for Notepad++. Basically, just load all the corpus files into Notepad++ and then hit the "Replace all in all open documents" button. A very similar option is available in TextPad and most other good text editors. Most good text editors can also handle regular expressions, so even more powerful search/replace pre-processing operations are possible.
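
The "replace all in all open documents" step can also be approximated with a short script. The following is a minimal Python sketch, not a description of what Notepad++ or TextPad does internally. It assumes the corpus is a folder of UTF-8 `.txt` files; the function name `replace_in_corpus` and the folder name used in the usage note are hypothetical.

```python
import re
from pathlib import Path


def replace_in_corpus(folder, pattern, replacement=""):
    """Apply one regex search/replace to every .txt file in `folder`.

    Files are read and written as UTF-8 (an assumption about the corpus).
    Returns the number of files whose contents changed.
    """
    regex = re.compile(pattern)
    changed = 0
    for path in sorted(Path(folder).glob("*.txt")):
        original = path.read_text(encoding="utf-8")
        cleaned = regex.sub(replacement, original)
        if cleaned != original:
            # Overwrites the file in place, so run this on a copy of the corpus.
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

For example, `replace_in_corpus("corpus", r"_\r?\n")` would delete every end-of-line underscore and its line break in all text files in a folder called `corpus`. Because the files are overwritten, it is safer to work on a backup copy.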

(Sorry to any group members who think this is not really related to AntConc).

Laurence.
Attachment: search_and_replace.png

Le Thi Ngoc Phuong

Aug 11, 2013, 4:33:58 AM
to ant...@googlegroups.com
Dear all, 

Thank you so much for raising this point! I find it very useful, as I am compiling a corpus of 150 research articles and am experiencing the same problem. But I have another problem, and I look forward to your suggestions.

When I convert a research article in PDF format into a doc file, sentences are split up like this:

THE STUDY–ABROAD EXPERIENCE FOR LANguage learning is a subject of increasing importance in foreign language education. A number of recent studies have investigated the development of oral proficiency (Brecht, Davidson, & Ginsberg, 1993, 1995; Freed, 1995; Magnan, 1986; Magnan & Back, 2007; Segalowitz & Freed, 2004), the use of communication strategies (Lafford,

1995, 2004), and the acquisition of grammatical (Collentine, 2004; Duperron, 2006; Isabelli,

2004, 2007), pragmatic (Barron, 2003; Cohen & Shively, 2007; Magnan & Back, 2006; Rodr´ıguez,

2001), and sociolinguistic competence (Barron,

2006; Regan, 1995, 2003). 

I am wondering whether this influences the results when I use the different tools in AntConc.

I have edited such files manually, which takes me a lot of time. Does anyone know how to fix it?

Thank you very much!

Kind regards,

Phuong

Laurence Anthony

Aug 20, 2013, 7:36:26 PM
to ant...@googlegroups.com
Dear Phuong,

AntConc does not process doc files. You should be saving your files as plain text (.txt) files, preferably with UTF-8 encoding.

When you convert PDF to text (not doc!), you will also find problems with line breaks. This is due to the nature of the PDF file format. However, within AntConc, almost all tools work at the word level, so there should not be a problem. When processing texts, AntConc will remove line breaks, replacing them with spaces (in the default mode), so as long as this is appropriate for your data (which it usually is), the results should be valid.
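
To illustrate why stray line breaks do not change word-level results, here is a small Python sketch. It is not AntConc's own code; the function name `unwrap_lines` is made up, and it simply replaces every run of whitespace (including line breaks) with a single space, using a fragment of the sample text from the question.

```python
import re


def unwrap_lines(text: str) -> str:
    """Replace every run of whitespace, including line breaks, with one space."""
    return re.sub(r"\s+", " ", text).strip()


broken = "communication strategies (Lafford,\n\n1995, 2004), and the acquisition"
print(unwrap_lines(broken))

# The sequence of whitespace-separated words is the same before and after.
assert broken.split() == unwrap_lines(broken).split()
```

This also removes real paragraph breaks, so it is only suitable when paragraph boundaries do not matter for the analysis.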

Laurence.


###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.antlab.sci.waseda.ac.jp/
###############################################################

