Pro Corpus

Giorgio Aguilar

Aug 4, 2024, 8:48:56 PM

to granacreasouv

In order to make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
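As a rough illustration of what POS and lemma annotation adds to raw text, here is a minimal Python sketch. The tiny LEXICON, the tag names, and the token_TAG output style are assumptions made up for this example; real corpora are annotated with trained taggers and an agreed tagset.

```python
# Toy lexicon: word form -> (part-of-speech tag, lemma).
# Hypothetical entries, for illustration only.
LEXICON = {
    "the": ("DET", "the"),
    "cats": ("NOUN", "cat"),
    "were": ("VERB", "be"),
    "sleeping": ("VERB", "sleep"),
}


def annotate(sentence):
    """Return one (token, tag, lemma) triple per whitespace-separated token."""
    annotated = []
    for token in sentence.split():
        # Unknown words get the placeholder tag "X" and their lowercased form as lemma.
        tag, lemma = LEXICON.get(token.lower(), ("X", token.lower()))
        annotated.append((token, tag, lemma))
    return annotated


def to_tagged_text(annotated):
    """Render the annotation inline, one token_TAG pair per word."""
    return " ".join(f"{token}_{tag}" for token, tag, _ in annotated)
```

Calling to_tagged_text(annotate("The cats were sleeping")) gives the sentence with a tag attached to each word, while the triples from annotate() keep the lemma alongside the tag.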

Some corpora have further structured levels of analysis applied. In particular, smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of structured linguistic analysis are possible, including annotations for morphology, semantics and pragmatics.

a data.frame (or a tibble tbl_df), whose default document id is a variable identified by docid_field; the text of the document is a variable identified by text_field; and other variables are imported as document-level meta-data. This matches the format of data.frames constructed by the readtext package.

Names to be assigned to the texts. Defaults to the names of the character vector (if any); doc_id for a data.frame; the document names in a tm corpus; or a vector of user-supplied labels equal in length to the number of documents. If none of these are found, then "text1", "text2", etc. are assigned automatically.

optional column index of a document identifier; defaults to "doc_id", but if this is not found, then will use the rownames of the data.frame; if the rownames are not set, it will use the default sequence based on quanteda_options("base_docname").

the character name or numeric index of the source data.frame indicating the variable to be read in as text, which must be a character vector. All other variables in the data.frame will be imported as docvars. This argument is only used for data.frame objects.
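To make the data.frame rules above concrete, here is a small Python sketch of the same fallback logic. It is not quanteda code: corpus_from_rows, the row_names argument, and the list-of-dicts input are hypothetical stand-ins for an R data.frame, chosen only to show the order in which document names are resolved and how the remaining columns become docvars.

```python
def corpus_from_rows(rows, docid_field="doc_id", text_field="text",
                     row_names=None, base_docname="text"):
    """Split tabular rows into document names, texts, and document variables.

    Name resolution mirrors the description above:
    1. the docid_field column, if present;
    2. otherwise the row names, if set;
    3. otherwise a default sequence: base_docname + 1, 2, 3, ...
    """
    names, texts, docvars = [], [], []
    for i, row in enumerate(rows):
        if docid_field in row:
            name = str(row[docid_field])
        elif row_names is not None:
            name = str(row_names[i])
        else:
            name = f"{base_docname}{i + 1}"

        text = row[text_field]
        if not isinstance(text, str):
            raise TypeError(f"column {text_field!r} must contain character data")

        names.append(name)
        texts.append(text)
        # Every column other than the id and the text is kept as a docvar.
        docvars.append({key: value for key, value in row.items()
                        if key not in (docid_field, text_field)})
    return names, texts, docvars
```

With rows such as [{"text": "a", "year": 2001}] and no id column or row names, the first document is named "text1" and its docvars are {"year": 2001}.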

logical; if TRUE, split each kwic row into two "documents", one for "pre" and one for "post", with this designation saved in a new docvar context and with the new number of documents therefore being twice the number of rows in the kwic.
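The splitting behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the package's implementation: split_kwic_context and the dict keys "pre", "keyword", and "post" are assumed names for this example.

```python
def split_kwic_context(kwic_rows):
    """Turn each keyword-in-context row into two documents.

    Each row yields one document for the text before the keyword and one
    for the text after it, tagged with a "context" variable, so the result
    has twice as many documents as there are rows.
    """
    documents = []
    for row in kwic_rows:
        for side in ("pre", "post"):
            documents.append({"text": row[side], "context": side})
    return documents
```

For a single row with pre "a b" and post "d e", the result is two documents, the first marked context "pre" and the second "post".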

For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.

The HCRC's mission is to provide timely, high-quality legal representation for indigent petitioners in death penalty habeas corpus proceedings before the California state courts and the federal courts.

The HCRC also recruits and trains attorneys to expand the pool of private counsel qualified to accept appointments in death penalty habeas corpus proceedings and serves as a resource to appointed counsel, thereby reducing the number of unrepresented indigents on California's death row. For more information regarding appointments, please refer to our court appointments page.

The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings. For a gentle introduction to the corpus, see the corpus overview. To access the data, follow the directions given there. Around two-thirds of the data has been elicited using a scenario in which the participants play different roles in a design team, taking a design project from kick-off to completion over the course of a day. The rest consists of naturally occurring meetings in a range of domains. Detailed information can be found in the documentation section.

The corpus callosum (plural: corpora callosa) is the largest of the commissural fibers, linking the cerebral cortex of the left and right cerebral hemispheres. It is the largest white matter tract in the brain.

Immediately above the body of the corpus callosum lies the interhemispheric fissure, in which run the falx cerebri and branches of the anterior cerebral vessels. The superior surface of the corpus callosum is covered by a thin layer of grey matter known as the indusium griseum.

The corpus callosum has a rich, relatively constant blood supply and is uncommonly involved by infarcts. The majority of the corpus callosum is supplied by the pericallosal arteries (the small branches and accompanying veins forming the pericallosal moustache) and the posterior pericallosal arteries, branches of the anterior and posterior cerebral arteries respectively. In 80% of patients, additional supply comes from the anterior communicating artery, via either the subcallosal artery or median callosal artery.

subcallosal artery (50% of patients) is essentially a large version of a hypothalamic branch, which in addition to supplying part of the hypothalamus also supplies the medial portions of the rostrum and genu

median callosal artery (30% of patients) can be thought of as a more extended version of the subcallosal artery, in that it travels along the same course, supplies the same structures but additionally reaches the body of the corpus callosum

Various small veins draining the central parts of the corpus callosum drain into the internal cerebral veins, in turn draining into the straight sinus. Tributaries of the internal cerebral veins draining the corpus callosum include 10:

Studies, including those using MR tractography, cast some doubt on this assertion, instead suggesting that the anterior body develops first and then continues bidirectionally, with the anterior portions (genu) developing earlier/more prominently than the posterior portions (splenium) 7,8. This is not, however, universally accepted 11.

The OpenCitations Corpus (OCC) is an open repository of scholarly citation data made available under a Creative Commons public domain dedication (CC0), which provides accurate bibliographic references harvested from the scholarly literature that others may freely build upon, enhance and reuse for any purpose, without restriction under copyright or database law. An in-depth description of the OCC is available in the following paper:

The corpus URL ( ) identifies the entire OCC, which is composed of several sub-datasets, one for each of the aforementioned bibliographic entities included in the corpus. Each of these has a URL composed by suffixing the corpus URL with the two-letter short name for the class of entity (e.g. be for a bibliographic reference) followed by an oblique slash (e.g. ). Individual members of each sub-dataset are identified by incrementing numbers, unique within that sub-dataset, e.g. or
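The naming scheme just described (corpus URL, then a two-letter short name and a slash, then an incrementing number) can be sketched as below. The base URL is a hypothetical placeholder, since the real one is not given in this post, and the assumption that numbering starts at 1 is mine.

```python
# Hypothetical placeholder; the actual corpus URL is not shown in the text above.
CORPUS_URL = "https://example.org/corpus/"


def subdataset_url(short_name):
    """URL of a sub-dataset: corpus URL + two-letter short name + slash."""
    if len(short_name) != 2:
        raise ValueError("short name must be exactly two letters, e.g. 'be'")
    return f"{CORPUS_URL}{short_name}/"


def entity_url(short_name, number):
    """URL of one member of a sub-dataset, identified by its number."""
    if number < 1:
        # Assumption: identifiers are positive integers counted from 1.
        raise ValueError("entity numbers are assumed to start at 1")
    return f"{subdataset_url(short_name)}{number}"
```

Under these assumptions, the third bibliographic reference would be addressed by entity_url("be", 3), which appends "be/3" to whatever the corpus URL is.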

The ingestion of citation data into the OCC, briefly summarised in Figure 1, is handled by two Python scripts called the Bibliographic Entries Extractor (BEE) and the SPAR Citation Indexer (SPACIN), available in the OpenCitations GitHub repository.

In particular, for each article retrieved by means of the Europe PubMed Central API, BEE stores all the possible identifiers (in the example, doi, pmid, pmcid, and localid) and all the textual references, enriched by their own related identifiers if these are available. In addition, the JSON file also includes provenance information about the source, its provider and the curator (i.e. the particular BEE Python class responsible for the extraction of these metadata from the source).

Starting from the output provided by BEE, SPACIN processes each JSON file, retrieving metadata information about all the citing/cited articles described in it by querying the Crossref API and the ORCID API. These APIs are also used to disambiguate bibliographic resources and agents by means of the identifiers retrieved (e.g., DOI, ISSN, ISBN, ORCID, URL, and Crossref Member URL). Once SPACIN has retrieved all these metadata, appropriate RDF resources are created (or reused, if they have already been added to the OCC in the past). These are stored in the file system in JSON-LD format and additionally within the OCC triplestore. It is worth noting that, for space and performance reasons, the triplestore includes all the data about the curated entities, but does not store their provenance data nor the descriptions of the datasets themselves, which are accessible only via HTTP.
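The "create or reuse" step can be pictured with a short Python sketch. This is not the SPACIN code: ResourceStore, get_or_create, and the in-memory dictionaries are hypothetical, and the real scripts work with RDF resources and external API lookups rather than plain dicts. It only shows the idea of using identifiers to avoid creating the same resource twice.

```python
class ResourceStore:
    """In-memory stand-in for a store of bibliographic resources."""

    def __init__(self):
        self._by_identifier = {}  # (scheme, value) -> resource
        self._next_number = 1     # incrementing number for new resources

    def get_or_create(self, identifiers):
        """Return the resource matching any given identifier, or a new one.

        identifiers maps a scheme to a value, e.g. {"doi": "10.1000/x"}.
        """
        for key in identifiers.items():
            if key in self._by_identifier:
                resource = self._by_identifier[key]  # reuse existing resource
                break
        else:
            resource = {"number": self._next_number, "identifiers": {}}
            self._next_number += 1

        # Record every identifier so later lookups by any of them succeed.
        for scheme, value in identifiers.items():
            resource["identifiers"][scheme] = value
            self._by_identifier[(scheme, value)] = resource
        return resource
```

A second call with the same DOI plus a newly learned PMID returns the original resource and also makes it findable by that PMID afterwards.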

The SPACIN workflow introduced in Figure 1 is a process that runs until no more JSON files are available from BEE. Thus, the current instance of the OCC evolves dynamically over time (although ingestion is currently paused while the ingestion scripts are updated), and can easily be extended beyond ingest from Europe PubMed Central by reconfiguring it to interact with additional REST APIs provided by different bibliographic sources, so as to gather new article metadata and their related references, thereby expanding the scope and coverage provided by the OCC.

These rules have been adopted to provide the procedure for post-conviction habeas corpus proceedings as they are set forth in West Virginia Code 53-4A-1 et seq. These rules supplement, and in designated instances supersede, the statutory procedures set forth in 53-4A-1 et seq. of the West Virginia Code. For petitions filed in any circuit court in the State, all of the rules apply. For petitions filed in the Supreme Court of Appeals, only Rule 2 applies.

Within such time as may be specified by the court, the State shall file an answer which shall respond to the allegations of the petition. The answer may be consolidated with other pleadings, such as a motion under Rule 12(b)(6) or Rule 56 of the West Virginia Rules of Civil Procedure. The answer shall indicate what transcripts (of pretrial, trial, sentencing, and post-conviction proceedings) are available, when they can be furnished and what proceedings have been recorded and not transcribed. There shall be attached to the answer such portions of the transcripts as the answering party deems relevant. The court, on its own motion or upon request of the petitioner, may order that further portions of the existing transcripts be transcribed and furnished. If a transcript is neither available nor procurable, a properly verified narrative summary of the evidence may be submitted.