September 15-16, 2022
The Slovenian Language Technologies Society (SDJT), the Centre for Language Resources and Technologies at the University of Ljubljana (CJVT), the Institute of Contemporary History (INZ) and the research infrastructures CLARIN.SI and DARIAH-SI organised the conference “Language Technologies and Digital Humanities” on 15th and 16th September 2022. The biennial conference “Language Technologies” was first held in 1998, with the thematic expansion to Digital Humanities introduced in 2016.
- Conference proceedings
- Timetable of the conference
- Invited speakers
- Pre-conference tutorials
- Presentation of the interim results of the “Development of Slovene in a Digital Environment” project
- Conference and social events information
“Large-scale language models: challenges and perspective” [Video]
Abstract: The emergence of large-scale neural language models in Natural Language Processing (NLP) research and applications has improved the state of the art in most NLP tasks. However, training such models requires enormous computational resources and training data. The characteristics of the training data have an impact on the behaviour of the models trained on it, depending for instance on the data’s homogeneity and size. In this talk, I will speak about how we developed the large-scale multilingual OSCAR corpus. I will describe the lessons we learned while training the French language model CamemBERT, the first large-scale monolingual model for a language other than English, especially in terms of the influence of size and heterogeneity of the training corpus. I will also sketch out a few research questions related to biases in large-scale language models, with a focus on the impact of tokenisation and language imbalance, in the context of the BigScience initiative. I will conclude with my thoughts on the future of language models and their impact on NLP and other data processing fields (speech, vision).
Bio: Benoît Sagot, Directeur de Recherches (Senior Researcher) at Inria, is the head of the Inria project-team ALMAnaCH in Paris, France. A specialist in natural language processing (NLP) and computational linguistics, his research focuses on language modelling, language resource development, machine translation, text simplification, part-of-speech tagging and parsing, computational morphology and, more recently, digital humanities (computational historical linguistics and historical language processing). He has been the PI or co-PI of a number of national and international projects, and is the holder of a chair in the PRAIRIE institute dedicated to research in artificial intelligence. He is also the co-founder of two start-ups where he uses his expertise in NLP and data mining for the automatic analysis of employee survey results.
“Designing computational systems to support humanities and social sciences research” [Video]
Abstract: From the viewpoint of the humanities and social sciences, collaborations with computer scientists often fail to deliver. In my research group, we have tried to understand why this is, and what to do about it. In this talk, I will discuss three key elements that we have discovered:
- Often, datasets in the humanities and social sciences are not neatly representative of the object of interest. Systems need to provide ways in which to evaluate and counter the biases, confounders and noise in the data.
- Often, there is also a large gap between what is in the data and what would be of interest. This gap needs to be bridged using algorithms, but care must be taken that a) what the algorithm produces actually matches the interest and b) its application does not introduce bias of its own (interestingly, the algorithm performance metrics of interest here often differ from those generally used in NLP/computer science).
- On a process level, collaboration between researchers from different disciplines is hard due to discrepancies in expectations relating to all facets of research, from research questions through methodology to the publication of results. Projects and systems need to acknowledge this, and be designed to facilitate iterative movement in the right direction.
Bio: Eetu Mäkelä is an associate professor in Human Sciences–Computing Interaction at the University of Helsinki, and a docent (adjunct professor) in computer science at Aalto University. At the Helsinki Centre for Digital Humanities, he leads a research group that seeks to figure out the technological, processual and theoretical underpinnings of successful computational research in the humanities and social sciences. Additionally, he serves as a technological director at the DARIAH-FI infrastructure for computational humanities and is one of three research programme directors in the datafication research initiative of the Helsinki Institute for Social Sciences and Humanities. For his work, he has obtained a total of 19 awards, including multiple best paper awards in conferences and journals, as well as multiple open data and open science awards. He also has a proven track record in creating systems fit for continued use by their audience.
On Wednesday, September 14th, 2022, two tutorials were held as part of the conference:
Topic modelling parliamentary debates before and during the COVID-19 pandemic
The tutorial introduces researchers in the humanities and social sciences to text mining and shows the value of such approaches for research in the aforementioned fields. The tutorial presents the particularities of parliamentary discourse and topic modelling by answering concrete research questions. The practical example is based on the freely accessible corpus of parliamentary debates ParlaMint and the Orange data mining software. No programming knowledge is required, but the participants will need their own laptop with Orange installed.
Lecturer: Ajda Pretnar Žagar
Boost your research with CLARIN.SI
This tutorial will introduce the CLARIN.SI research infrastructure, which facilitates the creation, processing, archiving, and reuse of language data from books, newspapers, social media, interviews, etc. We will demonstrate how to use the digital repository to find existing language resources relevant for your research questions as well as the most common tools to analyse them. We will also present the rich knowledge base and funding instruments offered by CLARIN that researchers can benefit from when dealing with legal, standardisation, annotation and other issues related to their language data. The tutorial is ideal for novice and experienced researchers from linguistics but also other fields that rely on collecting and analysing written and spoken language materials, such as literary studies, translation studies, history, media studies, anthropology, and sociology, who would like to become more familiar with the CLARIN.SI research infrastructure. [PDF]
Lecturers: Jakob Lenardič and Kristina Pahor de Maiti
Presentation of the interim results of the “Development of Slovene in a Digital Environment – Language Resources and Technologies” project
On Friday 16 September, at the end of the conference, a presentation of the project “Development of Slovene in a Digital Environment – Language Resources and Technologies” took place, in which the leaders of the work packages presented the current results of the project. [Video]
Thematic areas of the conference
The conference aims to bring together researchers from various backgrounds and methodological frameworks. The main topics will include but are not limited to:
- Speech and other mono- and multilingual language technologies
- Digital linguistics: translation studies, corpus linguistics, lexicology and lexicography, standardisation
- Digital humanities and historical studies, ethnology, literary studies, musicology, cultural heritage, archaeology, and fine arts
- Digital humanities in education and digital publishing
We welcome submissions that present guidelines, research, good practices, projects and results in these areas. The conference will also include invited lectures, a student section, and roundtables on topics related to the conference. The official languages of the conference will be Slovene and English.
- May 15th, 2022: Deadline for submission of papers and extended abstracts
- May 30th, 2022: Extended deadline for submission of papers and extended abstracts
- June 30th, 2022: Notification of acceptance
- August 15th, 2022: Submission of final papers
- August 16th, 2022: Registration deadline
- September 15th-16th, 2022: Conference
Instructions for authors
The authors are invited to submit either a full paper or an extended abstract describing work to be presented at the conference. The extended abstract will be published in the book of abstracts and the full papers in the conference proceedings, both of which will be published on the conference website under the Creative Commons license at the beginning of the conference. We leave it up to the authors whether to submit their contributions anonymized or not.
The official languages of the conference are Slovene and English.
The extended abstracts should be 2–4 pages long and the full papers 6–8 pages, formatted according to the conference guidelines:
- extended abstract: example, Word template
- full paper: example, Word template, LaTeX template
- templates are also available for papers written in Slovene; you can find them on the Slovene page of the conference.
Contributions are submitted through EasyChair at this link.
The authors of full papers should indicate if it is a student contribution by adding “student paper” to the list of keywords. All the co-authors of student papers should be students. These papers will be presented in a separate student session and will be eligible for the best student paper award.
For more information please contact the Organising Committee at the following e-mail address (email@example.com)
- Mojca Šorn, chair
- Ana Cvek
- Kaja Dobrovoljc
- Jerneja Fridl
- Katja Meden
- Mihael Ojsteršek
- Nataša Rozman
- Darja Fišer (chair), Faculty of Arts, University of Ljubljana and Institute of Contemporary History
- Simon Dobrišek, Faculty of Electrical Engineering, University of Ljubljana
- Tomaž Erjavec, Jožef Stefan Institute
- Andrej Pančur, Institute of Contemporary History
- Matej Klemen (student section), Faculty of Computer and Information Science, University of Ljubljana
- Aleš Žagar (student section), Faculty of Computer and Information Science, University of Ljubljana
Members of the programme committee
- Špela Arhar Holdt, Faculty of Arts, University of Ljubljana
- Petra Bago, Faculty of Arts, University of Zagreb
- Vuk Batanović, Faculty of Electrical Engineering, University of Belgrade
- Zoran Bosnić, Faculty of Computer and Information Science, University of Ljubljana
- Narvika Bovcon, Faculty of Computer and Information Science, University of Ljubljana
- Václav Cvrček, Institute of the Czech National Corpus, Charles University in Prague
- Jaka Čibej, Faculty of Arts, University of Ljubljana
- Helena Dobrovoljc, Fran Ramovš Institute of the Slovenian Language, ZRC SAZU
- Kaja Dobrovoljc, Faculty of Arts, University of Ljubljana
- Jerneja Fridl, ZRC SAZU
- Polona Gantar, Faculty of Arts, University of Ljubljana
- Vojko Gorjanc, Faculty of Arts, University of Ljubljana
- Jurij Hadalin, Institute of Contemporary History
- Miran Hladnik, Faculty of Arts, University of Ljubljana
- Ivo Ipšić, University of Rijeka
- Mateja Jemec Tomazin, Fran Ramovš Institute of the Slovenian Language, ZRC SAZU
- Alenka Kavčič, Faculty of Computer and Information Science, University of Ljubljana
- Iztok Kosem, Faculty of Arts, University of Ljubljana
- Simon Krek, Artificial Intelligence Laboratory, Jožef Stefan Institute
- Jakob Lenardič, Faculty of Arts, University of Ljubljana
- Nikola Ljubešić, Department of Knowledge Technologies, Jožef Stefan Institute
- Nataša Logar, Faculty of Social Sciences, University of Ljubljana
- Matija Marolt, Faculty of Computer and Information Science, University of Ljubljana
- Sanda Martinčić Ipšić, University of Rijeka
- Maja Miličević Petrović, University of Bologna
- Dunja Mladenić, Artificial Intelligence Laboratory, Jožef Stefan Institute
- Matija Ogrin, Institute of Slovene Literature and Literary Sciences, ZRC SAZU
- Matevž Pesek, Faculty of Computer and Information Science, University of Ljubljana
- Dan Podjed, Institute of Slovenian Ethnology, ZRC SAZU
- Senja Pollak, Department of Knowledge Technologies, Jožef Stefan Institute
- Ajda Pretnar Žagar, Faculty of Computer and Information Science, University of Ljubljana
- Marko Robnik Šikonja, Faculty of Computer and Information Science, University of Ljubljana
- Tanja Samardžić, University of Zurich
- Miha Seručnik, Milko Kos Historical Institute, ZRC SAZU
- Mirjam Sepesy Maučec, Faculty of Electrical Engineering and Computer Science, University of Maribor
- Marko Stabej, Faculty of Arts, University of Ljubljana
- Branislava Šandrih Todorović, Faculty of Philology, University of Belgrade
- Mojca Šorn, Institute of Contemporary History
- Janez Štebe, Faculty of Social Sciences, University of Ljubljana
- Simon Šuster, University of Melbourne
- Daniel Vasić, University of Mostar
- Darinka Verdonik, Faculty of Electrical Engineering and Computer Science, University of Maribor
- Andrej Žgank, Faculty of Electrical Engineering and Computer Science, University of Maribor
- Jerneja Žganec Gros, Alpineon d.o.o.
- Branko Žitko, Faculty of Science, University of Split