International Symposium on Diachronic Speech Corpora


Takehiko Maruyama

Aug 25, 2017, 1:45:15 AM
to linguis...@googlegroups.com
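Dear list members,

This is Takehiko Maruyama of Senshu University and the National
Institute for Japanese Language and Linguistics (NINJAL).

# With 10 days to go, this is a reminder. The talk abstracts appear
# at the end of this message.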

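On September 4, the International Symposium on Diachronic Speech
Corpora will be held at NINJAL.

We are inviting four guests from across Europe, including Professor
Bas Aarts (UCL), who will be visiting Japan for the first time, to
discuss methods for studying change over time in spoken language
using corpora.

This is the first occasion to discuss how old sound recordings can be
turned into corpora. Please come along and bring your colleagues. No
advance registration is required.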

Website:
http://pj.ninjal.ac.jp/conversation/event/sympo2017.html

====================================================================
International Symposium on Diachronic Speech Corpora

◆ Date: Monday, September 4, 2017, 10:00-17:00
◆ Venue: Auditorium, 2nd floor, NINJAL

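● Purpose

Since the 1990s, a wide variety of corpora have been built around the
world. As corpora diversify into written corpora, spoken corpora,
learner corpora, parallel corpora, and so on, the diachronic speech
corpus is seen as one of the next targets. By collecting old sound
recordings, compiling them into a corpus, and comparing and
contrasting them with recent speech data, it should be possible to
show empirically how spoken language changes over time (accent,
intonation, vocabulary, grammar, and so on).

For this symposium we have invited guests from the UK, Finland,
Italy, and France. Together with Japan, speakers from five countries
will present, with demonstrations, how diachronic speech corpora are
being developed and analyzed.

Note: The entire event will be held in English.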

● Program

10:00-10:15 Opening Remarks

10:15-11:15 Bas Aarts (University College London, UK)
"Exploring the grammar of spoken English using the Diachronic Corpus of Present-Day Spoken English"

11:15-12:15 Marja-Liisa Helasvuo (University of Turku, Finland)
"Finnish spoken corpora: A diachronic perspective"

13:15-14:15 Takehiko Maruyama (Senshu University / NINJAL, Japan)
"What's left for diachronic researches of Japanese Speech?"

14:15-15:15 Alessandro Panunzi (University of Florence, Italy)
"The LABLITA Corpus of spoken Italian in diachrony: Theoretical framework, corpus design, and a lexical comparison"

15:30-16:30 Marie Skrovec (University of Orleans, France)
"A diachronic spoken corpus for French: ESLO, a variationist survey"

16:30-17:00 Commentaries and discussion

● Organizer and contact

Takehiko Maruyama (Senshu University / NINJAL)
maruyama <at> isc.senshu-u.ac.jp

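Note: This symposium is jointly organized by the NINJAL Spoken
Language Division collaborative research project "A Multifaceted
Study of Spoken Language Using a Large-scale Corpus of Everyday
Japanese Conversation" and JSPS KAKENHI Grant 16H03426, "An Empirical
Study of Diachronic Change in Spoken Language through the
Construction of the Showa Spoken Language Corpus" (Grant-in-Aid for
Scientific Research (B), principal investigator: Takehiko Maruyama).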

● Note

There are few restaurants and convenience stores near the venue, so
we recommend bringing your own lunch.

● Talk abstracts

===================================================================
Bas Aarts (UCL)
Exploring the grammar of spoken English using the Diachronic Corpus
of Present-Day Spoken English

In the first part of my talk, I will begin by presenting the corpus
exploration software ICECUP (International Corpus of English Corpus
Utility Program) that we developed at the Survey of English Usage
(SEU) at UCL. This software can be used to explore the two corpora
that we compiled, namely the British Component of the International
Corpus of English (ICE-GB) and the Diachronic Corpus of Present-Day
Spoken English (DCPSE). Both are fully tagged and parsed corpora of
British English. I will demonstrate the functionality of the software
and its capabilities. Specifically, I will show how the innovative
Fuzzy Tree Fragment facility allows users to search for grammatical
patterns in the corpora.

In the second part of my talk I will discuss some of the SEU's recent
linguistic research on changes in the grammar of Present-Day English
using DCPSE, with special attention being paid to the use of the
progressive construction and the use of the core modal verbs.

Ferdinand de Saussure famously said that:

"The contrast between the two points of view, synchronic and
diachronic, is absolute and allows no compromise." (Cours de
Linguistique Générale)

In my talk I will argue that the research that we carried out in the
SEU demonstrates that this view is contestable.

===================================================================
Marja-Liisa Helasvuo (University of Turku)
Finnish spoken corpora - a diachronic perspective

In Finnish studies, there is a long tradition of research on the
spoken varieties of Finnish. The orientation was first
dialectological: the research focused on areal characteristics and
differences between different dialects. The earliest studies used
direct observation of spontaneous speech as their data: the examples
were written down immediately when they were heard.

However, there are also collections of spoken narratives from the late
19th century that have been published and used for research. These
could be considered as the first corpora of the spoken language. With
the development of recording equipment, more sophisticated data
collection methods have been developed. In 1967, the first electronic
corpus of spoken Finnish was started (project leader prof. Osmo Ikola,
University of Turku).

In my presentation, I will give an overview of Finnish spoken corpora
and discuss the possibilities they offer and their limitations.

===================================================================
Takehiko Maruyama (Senshu University / NINJAL)
What's left for diachronic researches of Japanese Speech?

In this talk I will investigate how a diachronic speech corpus of
Japanese can be realized and how it should be analyzed.

A diachronic speech corpus must be a collection of recorded speech
across multiple time periods. It should be carefully designed and
systematically organized for analyzing diachronic changes of
speech. The recorded data must be digitized to enable playback and
listening with as good sound quality as possible. Rich annotation is
also needed, such as transcriptions, POS tagging, parsing, and
metadata including speaker information, recording date, and speaking
situation. The problem is that old recordings are far scarcer and
more limited than written texts.

In this talk I will illustrate what kinds of recorded materials are
available for us to compile into a diachronic speech corpus of
Japanese: these include political speeches recorded from the 1910s to
the 1940s, NINJAL's pioneering recordings of Japanese daily
conversations and lectures from the 1950s to the 1960s, and
contemporary large corpora of spoken Japanese built at NINJAL after
2000.

In addition I will present some pilot studies analyzing these spoken
data from the point of view of diachronic change, such as changes of
intonation patterns and grammatical forms during the last 80 years.

===================================================================
Alessandro Panunzi (University of Florence)
The LABLITA Corpus of spoken Italian in diachrony:
Theoretical framework, corpus design, and a lexical comparison

The LABLITA Linguistic Laboratory of the University of Florence has
collected a large corpus of spontaneous spoken Italian, transcribed and
analyzed on the basis of Language into Act Theory. This theory assumes
that spoken language is governed by pragmatic principles, whose main
features (illocutionary values and information structure) are conveyed
by prosody.

The talk focuses on the description of two sub-corpora of the LABLITA
collection, namely the Stammerjohann corpus (recorded in 1965 in
Florence) and a comparable corpus mainly derived from a sampling of
the C-ORAL-ROM Italian corpus (texts collected in the Florence area in
the years 1990-2002). The two resources share a common design,
specifically adopted in order to ensure maximum comparability. The
lexical comparison shows that the regional lexicon has decreased in
the spontaneous speech of the Florentine area by roughly 20%, but also
that a high-frequency Tuscan lexical core remains in active use today.

===================================================================
Marie Skrovec (University of Orleans)
A diachronic spoken corpus for French: ESLO, a variationist survey

At the LLL (Orléans, France), researchers are building a reference
corpus of spoken French, the ESLO corpus (Enquête Sociolinguistique à
Orléans: Sociolinguistic Survey in Orléans), which takes into
account sociolinguistic variation with a micro-diachronic span, since
ESLO contains two sets of data (ESLO 1, ESLO 2).

The first survey (ESLO 1) was undertaken from 1968 to 1971 by British
scholars. Their aim was to record spontaneous interactions to teach
French as a foreign language at secondary school level. The data
gathered constitute an important spoken corpus of about 300 hours of
speech (4,500,000 words), with interviews and other recordings. A new
survey, ESLO 2, has been undertaken by the LLL since 2008, in order to
constitute, forty years on, a corpus which may be comparable in terms
of data gathering and archiving. The objective was set at 400 hours of
speech data, that is, about 6,000,000 words. Together, ESLO 1 and
ESLO 2 now form a collection of 700 hours of recordings and about
8 million words, which is today considered a reference value for the
processing and investigations planned.

In this presentation, I will first give an overview of the origin
of the project in the 1960s and the actual corpus design, and then
address some diachronic studies investigating linguistic variation and
change over the last 40 years at different linguistic levels, such as
phonology, morphosyntax, and discourse. Particular attention will be
given to the special case of the future tense in modern French.

===================================================================

----
Takehiko Maruyama (maru...@isc.senshu-u.ac.jp)
Department of Japanese Language, School of Letters, Senshu University
Spoken Language Division, NINJAL (visiting)