For a fistful of blogs: Discovery and comparative benchmarking of republishable German content
(2014)
We introduce two corpora gathered on the web and related to computer-mediated communication: blog posts and blog comments. To build these corpora, we addressed the following issues: website discovery and crawling, content extraction constraints, and text quality assessment. The blogs were manually classified by license and content type. Our results show that it is possible to find German-language blogs under Creative Commons licenses, and that text extraction and linguistic annotation can be performed efficiently enough to allow a comparison with more traditional text types such as newspaper corpora and subtitles. The comparison offers insights into the distributional properties of the processed web texts at the token and type level. For example, quantitative analysis reveals that blog posts are close to written language, while comments are slightly closer to spoken language.
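As a rough illustration of what a comparison at the token and type level can involve, the sketch below counts tokens, types, and the type-token ratio for two toy texts. It is not the authors' pipeline: the regex tokenizer, the function names `tokenize` and `type_token_stats`, and the sample sentences are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): lowercase the text
    # and keep runs of word characters. A real corpus study would use a
    # proper linguistic tokenizer.
    return re.findall(r"\w+", text.lower())


def type_token_stats(text):
    # Tokens are all word occurrences; types are the distinct word forms.
    tokens = tokenize(text)
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Type-token ratio; defined as 0.0 for empty input to avoid division by zero.
    ttr = n_types / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "ttr": ttr}


# Hypothetical sample texts, not taken from the corpora described above.
samples = {
    "post": "Das ist ein Beispiel. Das Beispiel ist kurz.",
    "comment": "ja ja, das stimmt, das stimmt wirklich",
}

stats = {name: type_token_stats(text) for name, text in samples.items()}

for name, s in stats.items():
    print(name, s["tokens"], s["types"], round(s["ttr"], 3))
```

The type-token ratio depends strongly on text length, so a meaningful comparison across corpora would need samples of equal size or a length-normalized measure.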