Shameless advertising

Stefan Th. Gries

Nov 3, 2016, 7:56:15 PM
to StatForLing with R
Sorry for the shameless self-promotion, but for those of you who are interested: the new edition of QCLWR has just appeared. Amazon still says "not released", but Routledge has it, and I know of one person who received her copy today (even before I got mine) :-) Again for those who are interested, here's a brief overview of some changes:

This book has changed quite a bit from the first edition; it is now structured as follows. Chapter 2 defines the notion of a corpus and provides a brief overview of what I consider to be the most central corpus-linguistic methods, namely frequency lists, dispersion, collocations, and concordances; in addition, I briefly mention different kinds of annotation. The main change here is the addition of some discussion of the important notion of dispersion.
   Chapter 3 introduces the fundamentals of R, covering a variety of functions from different domains, but the area which receives most consideration is that of text processing. There are many small changes in the code and the examples (for instance, I now introduce free-spacing), but the main differences from the first edition are: (1) a revision of the section on Unicode, which is now more comprehensive; (2) the addition of a new section specifically discussing how to get the most out of XML data using dedicated packages that can parse the hierarchical structure of XML documents; (3) an improved version of my exact.matches function; and (4) a new section on how to write your own functions for text processing and other things – this is taken up a lot in Chapter 5.
   Chapter 4 is what used to be Chapter 5 in the first edition. It introduces you to some fundamental aspects of statistical thinking and testing. The questions to be covered in this chapter include: What are hypotheses? How do I check whether my results are noteworthy? How might I visualize results? Given considerations of space and focus, this chapter is informative, I hope, but still short.
   The main chapter of this edition, Chapter 5, is brand new and, in a sense, brings it all together: More than 30 case studies in 27 sections illustrate various aspects of how the methods introduced in Chapters 3 and 4 can be applied to corpus data. Using a variety of different kinds of corpora, corpus-derived data, and other data, you will learn through detailed step-by-step instructions how to write your own programs in R for corpus-linguistic analyses, text processing, and some statistical analysis and visualization. Every single analysis is discussed on multiple levels of abstraction, and altogether more than 6,000 lines of code, nearly every one of them commented, help you delve deeply into how powerful a tool R can be for your work.
   Finally, Chapter 6 is a very brief conclusion that points you to a handful of useful R packages that you might consider exploring next.

As for the case studies in Chapter 5:
The sequencing I am using here is designed to help you zoom in from a general description of the task into the more specific aspects of the code such as (1) the R functions you will need to use and (2) the structure of the script, i.e., how and in which order the functions are used. These are the four parts, or subsection headings:
  • What are the things we will need to do? This section explains in plain English which steps the relevant task involves; it uses hardly any R code but already introduces the kinds and names of a few data structures that the script will contain.
  • What are the functions we will need for that? This section lists all the main functions one will need to use to perform the things implied by the description formulated in the previous step.
  • Thus, this is the overall structure of the script. This section provides a skeleton of the R code we will use in what is called pseudocode: a description of the algorithm and structure of a program that performs a particular task. It is imperative that you read this part with the relevant script open in RStudio because in the pseudocode I will provide the line numbers of the relevant R script from the companion website so you can see exactly which lines of code in the script do which part of the pseudocode, or how the pseudocode is ‘translated’ into actual R code. The files with all the scripts are in the folder <_qclwr2/_scripts/> and all begin with “05_”; this part will greatly help you understand the logic of the scripts.
  • Which aspects of the script are worth additional comment? If you look at the script files, you will see that they are very heavily commented: Often even a single function call is broken up into several lines so that each argument can be explained, and sometimes I will break down regular expressions into multiple lines (remember free-spacing from above?) to explain everything in detail. However, sometimes scripts involve something that I think merits additional explanation here, or something you haven’t seen yet at all, or not in the form in which it is used in the script. If there are such situations – and not every case study has such parts – then this section provides additional discussion of these aspects of the scripts.

</end of promotion>