OPUS 4 | Search

Conference Proceeding

5 search hits

For a fistful of blogs: Discovery and comparative benchmarking of republishable German content (2014)

Barbaresi, Adrien ; Würzner, Kay-Michael

We introduce two corpora gathered on the web and related to computer-mediated communication: blog posts and blog comments. In order to build such corpora, we addressed the following issues: website discovery and crawling, content extraction constraints, and text quality assessment. The blogs were manually classified as to their license and content type. Our results show that it is possible to find blogs in German under Creative Commons licenses, and that it is possible to perform text extraction and linguistic annotation efficiently enough to allow for a comparison with more traditional text types such as newspaper corpora and subtitles. The comparison gives insights into distributional properties of the processed web texts on the token and type level. For example, quantitative analysis reveals that blog posts are close to written language, while comments are slightly closer to spoken language.

The TITUS Project : 25 years of corpus building in ancient languages (2013)

Gippert, Jost

The article summarizes the contents and the structural premises of the “Thesaurus Indogermanischer Text- und Sprachmaterialien” (TITUS), focussing on search functions and facilities and questions of the encoding of ancient languages written in various scripts. Examples are taken from Tocharian, Greek, Vedic Sanskrit, and other ancient Indo-European languages covered by TITUS.

Canonicalizing the Deutsches Textarchiv (2013)

Jurish, Bryan

Virtually all conventional text-based natural language processing techniques - from traditional information retrieval systems to full-fledged parsers - require reference to a fixed lexicon accessed by surface form, typically trained from or constructed for synchronic input text adhering strictly to contemporary orthographic conventions. Unconventional input such as historical text which violates these conventions therefore presents difficulties for any such system due to lexical variants present in the input but missing from the application lexicon. To facilitate the extension of synchronically-oriented natural language processing techniques to historical text while minimizing the need for specialized lexical resources, one may first attempt an automatic canonicalization of the input text. This paper provides an informal overview of the various canonicalization techniques currently employed by the Deutsches Textarchiv project at the Berlin-Brandenburg Academy of Sciences and Humanities to prepare a corpus of historical German text for part-of-speech tagging, lemmatization, and integration into a robust online information retrieval system.

The Ramses project : Methodology and practices in the annotation of Late Egyptian Texts (2013)

Polis, Stéphane ; Winand, Jean

This paper is an updated presentation of the Ramses project currently being developed at the University of Liège. The first section stresses the main objectives and gives a technical description of the general architecture of the Ramses software. The second part describes the encoding procedures and reviews the current state of the annotation. In the third section, some changes brought about by the use of large-scale corpora are discussed from an epistemological viewpoint. The paper ends with the presentation of some new avenues for research that will ensue from the use of a complex multilevel corpus.

The Ramses project in perspective : Managing evolving linguistic data (2013)

Rosmorduc, Serge

As the initial phase of development of Ramsès is almost done, with a working prototype of a syntactic editor, we have started to think about ways of improving the encoding process and ensuring the consistency of our data. This paper explains the current state of our ideas on the subject.
