JOTA

JOTA is a series of occasional lectures with a language-technology flavour that has been running since 2005. JOTA was founded by Špela Vintar; in the 2016/17 season it is organised by Darja Fišer. If you would like to give a talk at JOTA or suggest a speaker, write to us.

2016/17 SEASON


28. 2. 2017

Gordana Hržica, University of Zagreb – Department of Speech and Language Pathology, Laboratory for Psycholinguistic Research

Non-professional and specialised corpora

An ongoing trend in corpus linguistics has been its orientation towards professional writing or, in the case of speech corpora, towards professional speakers (e.g. TV hosts, lecturers). Exceptions (for some examples, see McEnery and Wilson, 2001) are both rare and small in size. Corpora of professional writers and speakers provide useful information, but they cannot be representative of everyday written language use (e.g. emails, letters, notes, essays, business correspondence, telephone calls, oral or written instructions…). The current trend of building web-based corpora has allowed non-professional texts to be included in written corpora. However, additional retrieval is required to access non-professional writing.

Persons with language difficulties (specific language impairment, dyslexia, aphasia) are a special group of non-professional writers/speakers. They have difficulties with speed and accuracy when producing written or spoken language. Also, due to their problems in language processing, they might err in different ways than writers of typical language status (for an overview, see Ramus, 2014). Up to now, specialised corpora have been rather small.

In the Laboratory for Psycholinguistic Research (Department of Speech and Language Pathology, University of Zagreb) two new corpora have been developed: the Croatian Adult Spoken Language Corpus (HrAL) and the Croatian Corpus of Non-Professional Written Language. Methodological issues, as well as some results from corpus analyses, will be presented.


24. 1. 2017
Filip Muki Dobranić, Danes je nov dan
Parlameter: an introduction to the platform, methodology, and datasets
 
In this lecture we will present Parlameter (https://parlameter.si), a platform for monitoring and exploring what goes on in the Slovenian parliament. We will talk about the technology that powers Parlameter and present the methodology behind some of the more interesting analyses. Finally, we will look at the structure of the data available for further analysis (votes and transcripts) and dream for a moment about what else could be done with them.

13. 12. 2016
Peter Holozan, Amebis

How does Besana correct commas?

Besana (http://besana.amebis.si) is a grammar checker for Slovene, a considerable part of which is devoted to correcting commas in texts. This part has been substantially extended in recent years, largely as a result of introducing sentence-structure analysis. The lecture will present how Besana works and the basics of its sentence analysis. It will also present Vejica 1,0, a collection of examples of comma usage that can be used both to test comma-placement programs and to apply machine learning to comma placement.



22. 11. 2016
Ben Verhoeven, Research Centre for Computational Linguistics and Psycholinguistics, University of Antwerp

Author profiling: more linguistics and explanation

The task of author profiling is predicting psychological or sociological characteristics (e.g. gender or age) of any author based on their linguistic writing style. Current approaches mostly use word-based features of text to base classification decisions on. Over the last few years, we created two novel corpora for author profiling which we will show to have been very useful resources. We also present the start of our efforts in bringing more linguistics into this task by developing and re-using methods of discourse and semantic analysis in order to use their output as new features for our experiments. Ultimately, we hope to improve our understanding of sociological and psychological diversity of writing style by looking at the explanation behind the behavior of our classifiers.

[slides]


2015/16 SEASON


10. 5. 2016
Bruno Nahod, IHJJ

Struna: Methodology, Practice, and Future

Struna is the Croatian national term bank (http://struna.ihjj.hr/). Its aim is to gradually standardize Croatian terminology for all professional domains. After eight years of database development, 20 processed domains, and a number of large and small adjustments to the methodology, the Department of General Linguistics is getting ready to take on a new batch of terminology-processing projects. The aim of this presentation is to introduce the unique model of terminological processing and term retrieval that is used in the Struna term base. Furthermore, we will present some of the problems in term-unit processing that are specific to multi-domain term bases, as well as the solutions that were implemented to address some of them. In the final part of the presentation, we will present the theoretical background and a possible practical implementation of Domain Cognitive Models (DCM), a new terminology paradigm. The DCM is being developed as a possible means of dealing with the inadequacies of the current methodology, which is, in most aspects, based on the Vienna School of terminology.


21. 4. 2016
Milan Ojsteršek, FERI UM

The Textproc framework and its use in text processing

At the Laboratory for Heterogeneous Computer Systems at the Faculty of Electrical Engineering and Computer Science, University of Maribor, we have developed the Textproc software package for processing texts in Slovene, English, and German. The framework makes it possible to connect software components into various workflows for natural language processing or text mining (e.g. converting various document types to text, text parsing, lemmatisation, morphosyntactic tagging, coreference resolution, semantic annotation, semi-automatic extension of a semantic lexicon, semantic process description…). In the lecture I will first focus on describing the framework and the language infrastructure we use, and then present some of the projects we have carried out with it: detection of similar content and content recommendation in the national open-access infrastructure.

[slides]


29. 3. 2016
Nada Lavrač, IJS

Text mining for creative cross-domain knowledge discovery

The remarkable growth in the number of published scientific articles accessible to users online enables scientists to develop and use dedicated methods for creative, cross-domain knowledge discovery. With computer support we can now efficiently explore hypothetical links between previously unconnected specialised research areas. In the lecture we will discuss methods for finding bisociative links by exploring connections between articles in the Medline library. These include finding bisociative terms with the CrossBee system and finding links in documents that are atypical of a given domain (so-called outliers). Along the way we will review some basic approaches to text preprocessing, document clustering, and text mining.

[slides]


11. 2. 2016
Marko Grobelnik and Gregor Leban, IJS

Monitoring global media

In this lecture we will present Event Registry, a system for monitoring global media in real time. The system follows news from more than 110,000 sources in various languages. Each retrieved document (news item) is processed linguistically and semantically with Wikifier, a subsystem for semantic annotation. From the news items we assemble events, which are then linked into stories. All of this is tied together in the Event Registry system, which supports searching, browsing, and analysing global social dynamics. The system was built entirely at IJS; in the lecture we will show the technical solutions and give a demonstration of the system.

[slides]


25. 11. 2015
Maja Miličević, University of Belgrade

Quantitative analysis of corpus data using the R environment

The goal of the workshop is to provide an introduction to quantitative analysis of corpus data using the R environment. The rationale is that (1) quantitative analysis is needed to properly describe corpus data, and in particular to generalise from one language sample to other similar samples and to language in general; (2) R is one of the most powerful tools available for quantitative analysis, and it is free. The workshop will be divided into three sessions, dedicated in turn to basic considerations of corpus data and R, sample description, and statistical inference. All sessions will use (meta)data from JANES.

Prerequisites: Experience in working with corpora will be assumed. No previous knowledge of statistics or R is required; an introductory handout will be provided about a week before the workshop to help participants brush up on some basic math concepts and form expectations about R.

  • Session 1: “Obtaining data from corpora: How and why?”
  • Session 2: “Describing and visualising corpus data”
  • Session 3: “Generalising from corpus data”

[slides]


13. 11. 2015
Kaja Dobrovoljc, Zavod Trojina

Annotating corpora with the WebAnno tool

WebAnno is one of the most versatile web-based tools for linguistic and other annotation of corpora. It does not need to be installed on the user's computer, supports project-based and remote work, and requires no programming skills, which makes it ideal for linguists, humanities scholars, social scientists, and students.

Programme
13. 11., lecture room MPŠ
“Annotating corpora with WebAnno”, Kaja Dobrovoljc (zavod Trojina)
15:00 – 15:45 Lecture
15:45 – 16:30 Demo
16:30 – 16:45 Coffee
16:45 – 17:30 Exercises

[slides]


20. 10. 2015
Žiga Pušnik, Faculty of Computer and Information Science, University of Ljubljana

Training deep convolutional neural networks on raw text

Deep convolutional neural networks have proven themselves in visual information processing. Recently there have been attempts to apply them in various other areas, and natural language processing is no exception. In the paper “Text Understanding from Scratch”, the authors applied deep convolutional neural networks to raw text in order to classify articles. The experimental results showed the great potential of this approach for natural language processing. This research motivated the diploma thesis, in which we set out to evaluate the new approach.
We will first describe how neural networks work, looking at how the networks are built and how gradient descent is used for training. We will then present convolution, its properties, and how it can be used in a neural network. An advantage of training convolutional neural networks on text is that the text does not need to be preprocessed, since the model learns at the level of characters or letters. Successful and efficient training does, however, require a few tricks, such as quantization by vectorizing characters. We will explain what models we built and evaluate the results.
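
The character quantization mentioned above can be illustrated with a minimal sketch. It assumes a simple one-hot scheme over a fixed alphabet; the alphabet, the function name `quantize`, and the `max_len` parameter are hypothetical choices made for illustration and are not taken from the thesis or the cited paper.

```python
# Hypothetical alphabet: lowercase Slovene letters, digits, and a few
# punctuation marks. The thesis may have used a different character set.
ALPHABET = "abcčdefghijklmnoprsštuvzž0123456789 .,;!?"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, max_len=16):
    """Turn text into a max_len x len(ALPHABET) matrix of 0/1 values.

    Each row is the one-hot vector of one character. Characters outside
    the alphabet and positions past the end of the text stay all-zero.
    """
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


if __name__ == "__main__":
    for row in quantize("Ab?", max_len=4):
        print("".join(str(v) for v in row))
```

A matrix of this shape can be fed directly to a one-dimensional convolution, which is why no tokenisation or lemmatisation step is needed beforehand.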

[slides] [diploma thesis]


23. 9. 2015
Maarten Janssen, University of Lisbon

TEITOK: tagging and digital humanities

Historic corpora typically come in two different types: on the one hand, corpora created by philologists, which are annotated with textual information concerning line breaks, letter types, deletions, additions, etc., and on the other hand, corpora created by corpus linguists, which are annotated with linguistic information concerning part-of-speech, lemmatisation, and (shallow) syntactic information. Corpora that contain both are rare, although they do exist, such as the Sainte Graal of the university in Lyons, or the IMP corpus of the university of Ljubljana. One reason for the lack of such corpora is that people working with textual markup are often not interested in the linguistic markup, and vice versa. However, another important reason is that there are no tools to help with the hard task of creating corpora that combine both types of information. In this presentation, I will demonstrate the TEITOK system, which aims to do exactly that: it allows adding and modifying layers of linguistic annotation on top of documents containing textual annotation. It is a system for the creation, maintenance, and distribution of corpora combining the two types of annotation. The system supports corpus queries in the rich CQP query language and shows the results in a way that closely resembles the original format.
The presentation will give a general overview of the philosophy and architecture of the system, with a focus on those aspects that are most relevant for historic corpora: how the system allows several orthographic variants of each word to be displayed and searched, most notably the original orthography and the normalized orthography; how the system deals with tokenisation mismatches where, for instance, the grammatical word does not align with the orthographic word, or where the original orthography does not align with the normalized orthography; and how the system allows users to switch from a more text-oriented view to a more manuscript-oriented view.

[slides]


2014/15 SEASON


25. 5. 2015
Vid Podpečan, Department of Knowledge Technologies, IJS

A robot that hears and speaks

In the last few years the humanoid robot NAO has become one of the most popular robotic systems, used in numerous research and educational institutions around the world. Excellent technical design and a polished, attractive appearance have helped Aldebaran Robotics, a garage start-up from Paris, become one of the most important manufacturers of humanoid robots within a few years.
The lecture will focus on the language skills of the NAO robotic system. We will present its speaking and listening abilities and the building blocks available for creating new robot applications that involve voice communication. We will also mention the possibility of communicating in Slovene, the problems encountered in doing so, and proposals for development in this area. Finally, we will present some NAO applications that we are currently developing or that are in the design phase.
All the technologies and capabilities of the NAO robot mentioned above will also be demonstrated live in a few entertaining robot applications.

[slides] [videolectures]


16. 4. 2015
Ana Zwitter Vitez, FHŠ UP and FF UL

Authorship attribution: from machine learning to forensic linguistics

Plagiarism, pseudonyms, and anonymous threats have caused major social upheaval in recent years. How can an author's traces be detected in a text? What are the limitations of this kind of research? What should the researcher's role in this field be?
[slides]


31. 3. 2015
Darja Fišer, FF

An introduction to the family of corpus tools for linguists: SkeLL, noSkE and SketchEngine


26. 2. 2015
Jasmina Smailović, IJS

Sentiment analysis of Twitter microblogging posts

In recent years, more and more people have been using social networking and microblogging Web sites to post messages about their observations, opinions, and emotions. Such messages are suitable for various analyses because of their large volume and near-real-time publishing. In our study, we are interested in the analysis of opinions expressed in Twitter microblogging posts. In order to detect expressed opinions we employ sentiment analysis, a research area concerned with detecting opinions, attitudes, and emotions in texts. The main focus of the talk will be on selecting the most suitable sentiment analysis algorithm and determining the best text preprocessing setting for Twitter messages. Moreover, the mechanism for employing a binary Support Vector Machine (SVM) classifier to classify tweets into three sentiment categories of positive, negative, and neutral (instead of positive and negative only) will be shown. Finally, real-world applications of the developed sentiment analysis methodology will be briefly presented.
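
The abstract does not say how the binary SVM is extended to three categories. One common option, sketched here purely as an assumption, is a neutral zone: examples whose signed decision value lies close to the separating hyperplane are labelled neutral. The function names, the example weights, and the `neutral_width` threshold are all hypothetical and are not taken from the talk.

```python
def decision_value(weights, bias, features):
    """Signed score of a linear SVM: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify_with_neutral_zone(score, neutral_width=0.5):
    """Map a signed decision value to one of three sentiment labels.

    Scores within neutral_width of the hyperplane (score == 0) are
    treated as neutral; the threshold is an illustrative assumption.
    """
    if score > neutral_width:
        return "positive"
    if score < -neutral_width:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    weights, bias = [1.0, -1.0], 0.0  # toy model, not a trained one
    for features in ([2.0, 0.0], [0.0, 2.0], [1.0, 0.8]):
        score = decision_value(weights, bias, features)
        print(features, classify_with_neutral_zone(score))
```

In practice the width of the neutral zone would be tuned on held-out data rather than fixed by hand.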

[slides] [doctoral dissertation]


5. 1. 2015
Kaja Dobrovoljc, zavod Trojina

Processing Slovene in the NooJ development environment

With advances in building and annotating large computerised collections of authentic texts, the number of studies based on them is growing in Slovene linguistics as well. Despite the increasing adoption of corpus methods, there is currently a considerable gap in Slovene studies between the many research opportunities that existing corpora offer and their actual use. This is partly due to a lack of simple tools for more complex processing of corpus texts. As an example of a tool that is computationally powerful yet friendly to linguist users, we present NooJ, a linguistic development environment for building large-scale formalised descriptions of natural languages and applying them to text corpora. Using selected language resources and rules from the pilot module for Slovene, we present the most important functionalities of this development environment and the advantages of linking it with existing resources and tools for Slovene.

[slides]


16. 12. 2014
Dr. Žiga Golob, Alpineon d.o.o.

Reducing information redundancy in pronunciation dictionaries for speech synthesis

For highly inflected languages such as Slovene, pronunciation dictionaries are a very memory-hungry part of speech synthesis. To enable high-quality speech synthesis on systems with less memory as well, we developed several procedures that are based on a finite-state transducer representation and make it possible to drop information from pronunciation dictionaries that is unnecessary for grapheme-to-allophone conversion. This allowed us to represent the remaining information more compactly. It turns out that the presented methods also allow fairly accurate grapheme-to-allophone conversion for certain words that are not contained in the pronunciation dictionary.


18. 11. 2014
Dr. Nikola Dobrić, Department of English and American Studies, Alpen-Adria University, Klagenfurt

Error annotation in learner corpora – Thoughts on agreement and norm

It is difficult to set a clear norm for error annotation of learner corpora, because an acceptable and ‘correct’ phrase, clause, or sentence relaying the same semantic content can in theory have an unlimited number of instantiations. Hence we have a situation where one annotator analyzes a writing performance and produces a strikingly different error analysis from someone else who is equally competent and equally trained and is looking at the same text. Bearing in mind the vagaries of language, it is difficult to assume that this problem can ever be fully solved through any kind of standardization or training effort. From the point of view of error annotation, it can, however, be reduced to a minimum by focusing on the choice of a particular error taxonomy and on training in its application, together with an insistence on using unified and corpus-based descriptors of the norm. The lecture addresses the use of a newly developed error taxonomy (the SD Error Taxonomy) and the level of inter-rater agreement it produces as an indication of its primary usability in learner corpus annotation. The results indicate that a more transparent and coarse-grained taxonomy which leaves the feedback information out of its domain produces high levels of agreement even with minimal rater training.
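
The abstract does not name the agreement measure used. As a minimal sketch, assuming two annotators who each assign one error category to the same items, Cohen's kappa corrects the raw agreement rate for agreement expected by chance. The category labels in the usage example are hypothetical and are not part of the SD Error Taxonomy.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    rater_1 = ["lexical", "grammar", "grammar", "lexical"]
    rater_2 = ["lexical", "grammar", "lexical", "lexical"]
    print(cohen_kappa(rater_1, rater_2))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, so a coarser taxonomy can raise kappa simply by leaving fewer categories to disagree on.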


28. 10. 2014
Mihael Arčan, Insight, Irish Centre for Data Analytics, Galway

Statistical Machine Translation and Terminology

Professional translators deal on a daily basis with texts coming from different domains (information technology (IT), legal, agriculture, etc.), which require a specific lexical knowledge of the domain.
Nowadays, statistical machine translation (SMT) systems are suitable for translating very frequent expressions, but fail in translating domain-specific terms. This is mostly due to a lack of domain-specific parallel data from which the SMT systems can learn. Generic systems such as Google Translate and Bing Translator are the most common solutions, and are often used to translate manuals or very specific texts, resulting in modest translations.
On the other hand, online terminological resources (e.g. the ‘Interactive Terminology for Europe’, IATE) are a valuable and fundamental support for translators, although their continuous use can be time demanding. For all these reasons, the integration of the terminological knowledge in the SMT system is a crucial step to increase translator productivity and limit their initial overload when working in different domains.
The talk will give an overview of how an SMT system generates a translation from source language to target language. The main focus of the presentation will be on embedding terminology into open-source phrase-based SMT (PB-SMT) systems such as Moses. Finally, the talk will conclude with a discussion about the future of SMT.

[slides]


17. 9. 2014
Dr. Antoni Oliver Gonzalez, Open University of Catalonia, Barcelona

WN-Toolkit: automatic creation of WordNets. Results for Croatian and Slovene (preliminary)

In this talk I will present some algorithms for automatic wordnet creation following the expand model. The methodologies we are using are based on bilingual dictionaries and parallel corpora. I will show complete evaluation results for Croatian, and preliminary results for Slovene (using only dictionary-based techniques).

[slides]


PAST SEASONS

Invitations to the Jezikovnotehnološki abonma (language technology lecture series) and materials from past lectures can be found at this link.