Description of the conference

Corpora and Tools in Linguistics, Languages and Speech:

Status, Uses and Misuse


Conference organised by the Research Unit 1339 Linguistics, Languages and Speech – LiLPa

University of Strasbourg – UNISTRA

in collaboration with











3 – 5 July 2013

Strasbourg - France


This international and interdisciplinary conference focuses on original and innovative work relating to methods of analysing empirical data and to the use and status of such data in the linguistic sciences. The conference concerns all types of data extracted from various sources (texts, sound recordings, multimedia, images, movies, web data, etc.) that are central to all areas of the linguistic sciences, as well as to other scientific disciplines interested in linguistic issues (e.g. sciences and technologies contributing to the study, design and implementation of models of information systems, computer science, medicine, etc.). The conference is in line with certain issues raised by the French ANR (National Research Agency) call for projects “Corpus”[1]. Setting up or elaborating corpora and databases, and developing and using analysis and processing tools, are essential steps in research across the various areas of the linguistic sciences. Conceptual models or tools contribute to theoretical breakthroughs and to the modelling of cognitive issues, which are usually quite complex.

The availability of large volumes of written data and of processing tools provides new perspectives for analysing synchronic and diachronic variation in manuscripts, in syntactic structures or in semantic constants.

Regarding writing systems, corpora make it possible, from a didactic perspective for example, to study errors and their consequences for performance and learning, or for learning a new language.

In the area of languages, procedures make it possible to describe corpora of different languages, define typologies, and document and archive these corpora in order to study, from a linguistic or sociolinguistic perspective, the origins and evolution of the languages, taking into account, for example, the regional distribution of variants.

Likewise, natural language processing uses corpora to build resources such as lexicons or electronic grammars. New annotation tools and procedures provide new data, enrich resources and open new perspectives for using such data.

In speech production and perception, 3D representation or digital simulation techniques help to interpret data that were acquired in a piecemeal manner.

Thus, the development of structured, annotated corpora opens the way to several exploratory research themes in the diverse areas of the linguistic sciences, including discourse analyses, by rendering all sorts of sources (written, spoken, audio-visual, etc.) more coherent and also by facilitating their systematic quantitative or qualitative exploration.

The availability of large-scale and varied corpora, and of adequate tools for their exploration, implies a change in the way these resources may be used. The large amounts of data extracted from such corpora require specific methodological and practical choices. Working methods should adapt to these new conditions in order to cope with ever larger volumes of available data. The meeting intends to shed new light on the different uses of currently available corpora in all areas of the linguistic sciences.

Regardless of the area investigated, the notion of error or of noise (signal-to-noise ratio) should be properly addressed, since it is inherent in any corpus or data set that the researcher inevitably has to deal with. Hence, one cannot bypass the analysis and supervision of deviant written linguistic data (typos, misspellings or grammatical errors, unfinished sentences, inadequate translations, etc.) or of deviant spoken data (disfluencies, dysphonia, etc.) when analysing real data or when building new tools. Automatic analysis tools also introduce an error rate which, even when small, may influence the results of linguistic analyses. Methodological problems arise as to how to treat these errors, both when analysing the data and when constructing resources.

Besides questions related to corpus building and processing, the researcher should nowadays consider the nature of the data (iconic, multimodal, multi-code… data), their use (a corpus as an object vs. a corpus as a support) and their validity (extension, attestation, etc.).

Finally, the status, uses and misuse of corpora and of tools will also be examined by taking into account questions dealing with the protection of confidentiality of personal data and with respect for legal rights. Such issues, relating to constraints on the use of corpora and databases, will be explicitly addressed by examining legal problems related to original and annotated documents, to the protection of persons and civil liberties, to the protection of intellectual and commercial property, etc.

Proposals should address one of the themes listed below:

1) the study of a specific issue in the domain of the linguistic sciences, related to corpus or data analysis;

2) an issue allowing the enhancement or development of the methods, tools and analysis procedures required for the scientific use of corpora or of a set of data in the area of the linguistic sciences;

3) a reflection on the balance between the advantages and the limits of a corpus and of its uses: blind spots of a corpus, unanswered questions following exploration of a corpus, necessary readjustments of queries after elaboration and exploration of a corpus, error treatment. In this perspective, relationships between intuition and empirical approach, between theory and corpus, and between deduction and induction could be questioned with regard to corpus-based research.

In all cases, proposals should be clearly in line with the perspective adopted by the conference.


Official Languages of the Conference: French and English

[1] “Corpora, data and tools for research in the Humanities and Social Sciences”.
