split words/hyphenation problems in creating text files for corpus building

Oct 12, 2011
to AntConc-discussion
Many journal articles in PDF format are set in two columns, so words
are often split across lines with hyphens or underscore marks. When
students copy, paste, and save these articles as text files to build a
corpus, the split words are not recognized by AntConc.

Laurence, thanks for your help in solving this issue; I will describe
your solution here for others:

After copying/pasting to a doc file, go to EDIT, then FIND AND
REPLACE. In FIND WHAT, type in _^p. This will find the underscore mark
(_) followed by the paragraph mark (^p), i.e., the carriage return
after the underscore. Leave the REPLACE WITH box blank (so the "_^p" is
replaced with "nothing"). Click REPLACE ALL and SAVE. Now you can save
as a text file. The first few times I tried this, it didn't work, so if
there is a problem, first use FIND to find a simple word ("the" or
"is"). Once it snaps out of its reverie, it will then find and replace
the codes.

I have 30 students this term creating their own corpora from web-based
journal articles in their fields. Only one student reported this

Laurence Anthony

Oct 12, 2011
to ant...@googlegroups.com
Hi Kat,

Thank you for explaining your procedure to the group. This is probably the easiest way for non-specialists (e.g., students) to pre-process corpus texts. However, if you are a researcher wanting to batch pre-process an entire corpus, the best way is probably to use a more advanced text editor like Notepad++ or TextPad. With these tools, you can specify exactly what characters to find and replace, and then do a batch process for the entire corpus.

(Of course, you could also write a simple Perl, Python, or R script to do the same thing, but it's probably not worth it in this case.)
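
For anyone who does want to script it, here is a minimal Python sketch of the same find-and-replace applied to a whole folder. It assumes the corpus is a folder of UTF-8 .txt files, and the names join_split_words and clean_corpus are made up for illustration. Note that it removes line-final hyphens as well as underscores, so a genuinely hyphenated compound that happens to break at the end of a line will lose its hyphen.

```python
import re
from pathlib import Path

# A hyphen or underscore that sits between a word character and a line break
# which is immediately followed by another word character.
SPLIT_WORD = re.compile(r"(?<=\w)[-_]\r?\n(?=\w)")


def join_split_words(text: str) -> str:
    """Delete the line-final hyphen/underscore together with the line break."""
    return SPLIT_WORD.sub("", text)


def clean_corpus(folder: str) -> int:
    """Rewrite every .txt file in `folder` in place.

    Returns the number of files that were actually changed.
    Assumption: all files are UTF-8 encoded.
    """
    changed = 0
    for path in sorted(Path(folder).glob("*.txt")):
        original = path.read_text(encoding="utf-8")
        cleaned = join_split_words(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Calling clean_corpus("my_corpus") (a hypothetical folder name) overwrites the files, so it is safer to run it on a copy of the corpus.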


Dr CK Jung

Oct 12, 2011
to ant...@googlegroups.com
Hi Kat

Thanks for sharing this. Very helpful!

Best Rgds

Warren Tang

Oct 12, 2011
to ant...@googlegroups.com
Kat and Laurence,
I already use Word, Perl and a couple of other programs to do my text cleaning (it is enough for most of what I need to do). But I would like to know how to do it in Notepad++ and to do batch edits to streamline and improve my editing process.

Laurence, any recommendations of webpages and books which could teach me this?


Laurence Anthony

Oct 12, 2011
to ant...@googlegroups.com
Hi All,

I've attached a screenshot of the search/replace window for Notepad++. Basically, just load all the corpus files into Notepad++ and then hit the "Replace all in all open documents" button. A very similar option is available in TextPad and most other good text editors. Most good text editors can also handle regular expressions, so even more powerful search/replace pre-processing operations are possible.

(Sorry to any group members who think this is not really related to AntConc).


Le Thi Ngoc Phuong

Aug 11, 2013
to ant...@googlegroups.com
Dear all, 

Thank you so much for raising this point! I find it very useful, as I am compiling a corpus of 150 research articles and am experiencing the same problem. But I have another problem, and look forward to your suggestions.

When I convert a research article in PDF format into a doc file, sentences are split up like this:

THE STUDY–ABROAD EXPERIENCE FOR LANguage learning is a subject of increasing importance in foreign language education. A number of recent studies have investigated the development of oral proficiency (Brecht, Davidson, & Ginsberg, 1993, 1995; Freed, 1995; Magnan, 1986; Magnan & Back, 2007; Segalowitz & Freed, 2004), the use of communication strategies (Lafford,

1995, 2004), and the acquisition of grammatical (Collentine, 2004; Duperron, 2006; Isabelli,

2004, 2007), pragmatic (Barron, 2003; Cohen & Shively, 2007; Magnan & Back, 2006; Rodr´ıguez,

2001), and sociolinguistic competence (Barron,

2006; Regan, 1995, 2003). 

I am wondering whether this influences the results when I use different tools in AntConc.

I have edited such files manually, which takes me a lot of time. Does anyone know how to fix it?

Thank you very much!

Kind regards,


Laurence Anthony

Aug 20, 2013
to ant...@googlegroups.com
Dear Phuong,

AntConc does not process doc files. You should be saving your files as plain text (.txt) files, preferably with a UTF-8 encoding.
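
If you already have text files saved in some other encoding, they can be re-saved as UTF-8 with a few lines of Python. This is only a sketch: the function name resave_as_utf8 is made up, and the default source encoding (cp1252, common on Windows) is an assumption that you would need to check against your own files.

```python
from pathlib import Path


def resave_as_utf8(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Read `src` in its original encoding and write it to `dst` as UTF-8.

    Assumption: `source_encoding` really is the encoding of `src`;
    a wrong guess will raise an error or produce garbled characters.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```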

When you convert PDF to text (not doc!), you will also find problems with line breaks. This is due to the nature of the PDF file format. However, within AntConc, almost all tools work at the word level, so there should not be a problem. When processing texts, AntConc will remove line breaks, replacing them with spaces (in the default mode), so as long as this is appropriate for your data (which it usually is), the results should be valid.
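
To see why stray line breaks should not matter for word-level tools, here is a small Python sketch. It is not AntConc's own code: normalize_line_breaks and simple_tokens are made-up names, and the tokenizer is deliberately crude. It just shows that a fragment like the one in Phuong's message gives the same word list before and after the line breaks are replaced with spaces.

```python
import re


def normalize_line_breaks(text: str) -> str:
    """Replace every run of whitespace (line breaks included) with one space."""
    return re.sub(r"\s+", " ", text).strip()


def simple_tokens(text: str) -> list[str]:
    """Very rough lowercase word tokenizer, for illustration only."""
    return re.findall(r"\w+", text.lower())


# A fragment broken across lines, as in the converted PDF above.
broken = "the use of communication strategies (Lafford,\n\n1995, 2004), and the acquisition"
joined = normalize_line_breaks(broken)

print(joined)
# The word lists are identical, so word-level counts are unaffected.
print(simple_tokens(broken) == simple_tokens(joined))
```

The same normalize_line_breaks function could be used to tidy the files before reading them by eye, although it also removes real paragraph breaks.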


