JOTA is a series of occasional lectures on language technologies that has been running since 2005. JOTA was founded by Špela Vintar; this season it is organised by Darja Fišer, Kaja Dobrovoljc and Katja Zupan. If you would like to give a talk at JOTA or suggest a speaker, write to us.

Lectures given as part of the Jezikovnotehnološki abonma (the language technology lecture series) are now also available on the Videolectures.NET portal.
Publication of the recordings is made possible by CLARIN.SI.

SEASON 2017/18

17. 4. 2018
Ruslan Mitkov, University of Wolverhampton
Anaphora and coreference resolution: still a hard nut to crack? How far has it gone, what is its impact on NLP and what are the ways forward?

Anaphora and coreference resolution are arguably among the most challenging Natural Language Processing (NLP) tasks. Research in anaphora resolution and coreference resolution has focused almost exclusively on the development and intrinsic evaluation of various algorithms. While publications report positive results, the speaker will show that the replication of some of the best-known algorithms reveals reasons for concern in that the performance is far from ideal and the evaluation is far from transparent.

Anaphora and coreference resolution are crucial for the operation of NLP systems and should not be regarded in isolation but in the wider picture of NLP applications. The extrinsic evaluation of an anaphora or coreference resolution module, that is, its impact on the larger NLP system of which it is part, is an under-researched topic, and several studies conducted by the speaker seek to fill this gap. More specifically, the speaker will discuss whether anaphora and coreference resolution can improve the performance of four NLP applications (text summarisation, term extraction, text categorisation and textual entailment) and, if so, to what extent.

The presentation will finish with suggested ways forward as to how anaphora and coreference resolution can do better and will outline the latest related research of the author.

26. 3. 2018
Prashant Pardeshi, Kobe University
Japanese Basic Verbs Handbook: what is it and what does it do?

Polysemous lexical items, in which one form has more than one meaning, are pervasive in the basic vocabulary of human languages and are generally considered difficult for foreign learners of any language to acquire. In linguistic theory, semanticists have provided excellent insights into the semantic connections among the multiple meanings of polysemous lexical items. However, dictionaries meant for native speakers tend merely to list the multiple meanings of polysemous lexical items without describing the semantic connections among them, probably because the users of such dictionaries already know them intuitively. In the absence of learner’s dictionaries that describe the semantic connections among the multiple meanings of a given form, understanding and acquiring those meanings remains a difficult task for foreign learners of a language. The Japanese Basic Verbs Handbook, an online resource offering a comprehensive description of multiple meanings and the connections among them, is an attempt to fill this gap. In this talk we will offer an overview of the Japanese Basic Verbs Handbook and demonstrate its salient features, in addition to introducing the two corpora used in its compilation. All the slides of the presentation are in Japanese and English, and the talk will be given in English in order to make it accessible to those interested in general issues related to the teaching and learning of polysemous basic verbs.

19. 3. 2018
Jerneja Žganec Gros, Alpineon
eBralec – a speech synthesizer for Slovene

eBralec is a speech synthesizer for Slovene that automatically converts Slovene text into pleasant, intelligible speech. For many blind and partially sighted users, and for people with dyslexia, it is an indispensable everyday aid. The talk will present how it was built, along with numerous examples of its use.


21. 2. 2018
Luka Krsnik, FRI
Predicting stress in Slovene words with machine learning methods

There is no simple algorithm for assigning stress to Slovene words; speakers learn the stress of each word as they learn the word itself. Machine learning methods have proved successful at stress assignment. In the master's thesis we applied deep neural networks to the problem. We tested various neural network architectures, several data representations and ensembles of networks. The best results came from the ensemble approach, which correctly predicted 87.62% of the words in the test set. The proposed approach improved on the results of other machine learning methods by several percent.


10. 1. 2018
Tadej Škvorc, FRI
Constrained clustering based on texts and graphs for scheduling academic papers

Compiling schedules by hand can often be very time-consuming. This is especially true when organising scientific conferences, where the organisers must suitably schedule the paper presentations, of which there may be hundreds at large conferences. Presentations must be arranged into sessions so that each session contains presentations of papers on related topics. In the lecture we will show how various natural language processing and machine learning methods can be used to automate the construction of such a schedule. We first use natural language processing methods to find similar papers on the basis of their texts. We also draw on additional metadata in the form of graphs that is available to conference organisers. Based on the similarities found, clustering can then be used to assign the papers to the sessions of the conference schedule. The work was carried out on English material, but the lecture will be given in Slovene.
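
As a rough illustration of the idea of grouping papers into sessions by text similarity, here is a minimal sketch. It is not the speaker's method: it assumes plain TF-IDF vectors with cosine similarity, replaces constrained clustering with a simple greedy heuristic whose only constraint is a maximum session size, and ignores the graph metadata altogether. The function names and the sample abstracts are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict) per text."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(word for doc in docs for word in doc)  # document frequency
    return [
        {w: tf * math.log((1 + n) / (1 + df[w])) for w, tf in doc.items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def schedule(texts, session_size):
    """Greedily fill sessions of at most `session_size` papers (assumed constraint)."""
    vecs = tfidf_vectors(texts)
    unassigned = list(range(len(texts)))
    sessions = []
    while unassigned:
        seed = unassigned.pop(0)
        # Rank the remaining papers by similarity to the seed paper.
        ranked = sorted(unassigned, key=lambda i: cosine(vecs[seed], vecs[i]), reverse=True)
        members = [seed] + ranked[: session_size - 1]
        unassigned = [i for i in unassigned if i not in members]
        sessions.append(sorted(members))
    return sessions


abstracts = [  # hypothetical one-line abstracts
    "speech synthesis for slovene",
    "sentiment analysis of tweets",
    "speech recognition for slovene dialects",
    "emotion analysis of tweets",
]
sessions = schedule(abstracts, session_size=2)
```

With these sample abstracts, the two speech papers end up in one session and the two Twitter papers in the other. A real scheduler would need a proper constrained clustering algorithm and the additional similarity signals mentioned in the abstract.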


13. 12. 2017
Matthew Purver, Queen Mary University of London
Analysing Dialogue for Diagnosis and Prediction in Mental Health

Conditions which affect our mental health often affect the way we use language; and treatment often involves linguistic interaction. This talk will present work on three related projects investigating the use of computational natural language processing (NLP) to help understand and improve diagnosis and treatment for such conditions. We will look at clinical dialogue between patient and doctor or therapist, in cases involving schizophrenia, depression and dementia; in each case, we find that diagnostic information and/or treatment outcomes are related to observable features of a patient’s language and interaction with their conversational partner. We discuss the nature of these phenomena and the suitability and accuracy of NLP techniques for detecting them automatically.


22. 11. 2017
Geraint A. Wiggins, Computational Creativity Lab, Queen Mary University of London
Creativity, deep symbolic learning, and the information dynamics of thinking

I present a hypothetical theory of cognition which is based on the principle that mind/brains are information processors and compressors, that are sensitive to certain measures of information content, as defined by Shannon (1948). The model is intended to help explicate processes of anticipatory and creative reasoning in humans and other higher animals. The model is motivated by the evolutionary value of prediction in information processing in an information-overloaded world.

The Information Dynamics of Thinking (IDyOT) model brings together symbolic and non-symbolic cognitive architectures, by combining sequential modelling with hierarchical symbolic memory, in which symbols are grounded by reference to their perceptual correlates. This is achieved by a process of chunking, based on boundary entropy, in which each segment of an input signal is broken into chunks, each of which corresponds with a single symbol in a higher level model. Each chunk corresponds with a temporal trajectory in the complex Hilbert space given by a spectral transformation of its signal; each symbol above each chunk corresponds with a point in a higher space which is in turn a spectral representation of the lower space. Norms in the spaces admit measures of similarity, which allow grouping of categories of symbol, so that similar chunks are associated with the same symbol. This chunking process recurses “up” IDyOT’s memory, so that representations become more and more abstract.
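
To make the notion of chunking by boundary entropy more concrete, here is a minimal sketch. It is an illustration under strong assumptions, not IDyOT itself: it uses a first-order (bigram) model over single-character symbols, measures the Shannon entropy of the next-symbol distribution, and closes a chunk wherever that entropy reaches a fixed threshold. The function names and the threshold value are hypothetical.

```python
import math
from collections import Counter, defaultdict


def next_symbol_entropy(sequence):
    """Shannon entropy (in bits) of the next-symbol distribution after each symbol."""
    followers = defaultdict(Counter)
    for cur, nxt in zip(sequence, sequence[1:]):
        followers[cur][nxt] += 1
    entropy = {}
    for cur, counts in followers.items():
        total = sum(counts.values())
        entropy[cur] = -sum(c / total * math.log2(c / total) for c in counts.values())
    return entropy


def chunk(sequence, threshold=1.0):
    """Close a chunk after any symbol whose continuation is uncertain (assumed rule)."""
    entropy = next_symbol_entropy(sequence)
    chunks, current = [], []
    for symbol in sequence:
        current.append(symbol)
        if entropy.get(symbol, 0.0) >= threshold:
            chunks.append("".join(current))
            current = []
    if current:  # keep whatever is left at the end of the signal
        chunks.append("".join(current))
    return chunks


chunks = chunk("abcabdabcabd")
```

In this toy sequence, "a" is always followed by "b", whereas "b" is followed by "c" or "d" equally often, so a boundary is placed after every "b". The hierarchical memory and spectral representations described above are outside the scope of the sketch.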

It is possible to construct a Markov Model along a layer of this model, or up or down between layers. Thus, predictions may be made from any part of the structure, more or less abstract, and it is in this capacity that IDyOT is claimed to model creativity, at multiple levels, from the construction of sentences in everyday speech to the improvisation of musical melodies.

IDyOT’s learning process is a kind of deep learning, but it differs from the more familiar neural network formulation because it includes symbols that are explicitly grounded in the learned input, and its answers will therefore be explicable in these terms.

In this talk, I will explain and motivate the design of IDyOT with reference to various different aspects of music, language and speech processing, and to animal behaviour.

24. 10. 2017
Peter Holozan, Amebis
New Amebis web services

At Amebis we have decided to offer several web services aimed primarily at developers of virtual assistants. The first was a service for advanced lemmatisation of sentences (which we simply call normalisation); the second will assign context to questions/answers based on the preceding conversation. We will present examples of how context assignment can improve the answers in the SecondEgo system.


27. 9. 2017 
Niko Colnerič, Faculty of Computer and Information Science
Emotion recognition on Twitter

Have you ever wanted to recognise emotions in tweets, but there were too many of them to go through by hand? Would it help if a computer could work out for itself whether someone was angry or happy when they wrote something? At the September lecture we will present a system for automatic emotion recognition in English tweets. We will discuss how to obtain training data and how to preprocess it. We will compare various predictive models (both classical ones and neural networks) and determine which perform best. We will present the shortcomings of current approaches and demonstrate the system on a few examples.


SEASON 2016/17

25. 5. 2017
Kaja Zupanc, Faculty of Computer and Information Science
Automated essay scoring with the SAGE+ system

Understanding text is still a very demanding task for computers. Nevertheless, computers have for many years been used abroad for essay scoring alongside time-consuming manual scoring. At the May JOTA we will present the field of automated essay scoring and SAGE+, a system for the automated scoring of English essays. In addition to assessing the syntax of a text, we will also look at semantics. We will learn how SAGE+ automatically builds ontologies from texts and see how they help it understand the content.


20. 4. 2017
Simon Dobrišek, Faculty of Electrical Engineering
The present and future of speech technologies

Automatic speech-to-text converters, keyword spotting systems, text-to-speech converters, spoken dialogue management systems, recognisers of speakers and their psychophysical states, recognisers of spoken language and dialect, systems for assessing the correctness and quality of pronunciation, systems for processing and annotating multimedia speech content, and so on: all of these fall under what are known as speech technologies. In the lecture we will present the many years of experience of the Laboratory of Artificial Perception, Systems and Cybernetics at the Faculty of Electrical Engineering, University of Ljubljana, in developing speech technologies, with an emphasis on support for spoken Slovene. We will reflect on the future of these technologies and on the future of speech as the most natural form of human communication.


16. 3. 2017
Aljaž Košmerlj, Jožef Stefan Institute
Tacita – a tool to assist with the anonymisation of court judgments

In the lecture we will present Tacita, a tool to assist with the anonymisation of court judgments, which we developed at the Jožef Stefan Institute for the Ministry of Justice. To protect privacy, all information that could identify the parties involved in the proceedings must be removed from every judgment before it is published. Tacita uses machine learning to predict which parts of the text are highly likely to need removing, and so helps with the otherwise time-consuming manual anonymisation process. We will look at how we built the tool and which natural language analysis tools we used.


28. 2. 2017
Gordana Hržica, University of Zagreb – Department of Speech and Language Pathology, Laboratory for Psycholinguistic Research
Non-professional and specialised corpora

An ongoing trend in corpus linguistics has been its orientation towards professional writing or, in the case of speech corpora, towards professional speakers (e.g. TV hosts, lecturers). Exceptions (some examples: McEnery and Wilson, 2001) are both rare and small in size. Corpora of professional writers and speakers provide useful information, but cannot be representative of everyday written language use (e.g. emails, letters, notes, essays, business correspondence, telephone calls, oral or written instructions…). The current trend of building web-based corpora has allowed non-professional texts to be included in written corpora. However, additional retrieval is required to access non-professional writing.

Persons with language difficulties (specific language impairment, dyslexia, aphasia) are a special group of non-professional writers/speakers. They have difficulties with speed and accuracy when producing written or spoken language. Also, due to their problems in language processing, they might err in a different way than writers of typical language status (e.g. the overview in Ramus, 2014). Up to now, specialised corpora have been rather small.

In the Laboratory for Psycholinguistic Research (Department of Speech and Language Pathology, University of Zagreb), two new corpora have been developed: the Croatian Adult Spoken Language Corpus (HrAL) and the Croatian Corpus of Non-Professional Written Language. Methodological issues, as well as some results from corpus analyses, will be presented.


24. 1. 2017
Filip Muki Dobranić, Danes je nov dan
Parlameter – presentation of the platform, methodology and datasets

In the lecture we will present Parlameter, a platform for monitoring and investigating what happens in the Slovenian parliament. We will talk about the technology that powers Parlameter and present the methodology behind some of the more interesting analyses. Finally, we will look at the structure of the data available for further analysis (votes and transcripts) and dream for a moment about what else could be done with it.

13. 12. 2016
Peter Holozan, Amebis
How does Besana correct commas?

Besana is a grammar checker for Slovene in which a considerable part is devoted to correcting commas in texts. This part has been substantially extended in recent years, largely as a result of introducing sentence structure analysis. The lecture will present how Besana works and the basics of its sentence analysis. It will also present Vejica 1.0, a collection of examples of comma usage, which can be used both to test comma placement programs and to apply machine learning to comma placement.

22. 11. 2016
Ben Verhoeven, Computational Linguistics and Psycholinguistics Research Centre, University of Antwerp
Author profiling: more linguistics and explanation

The task of author profiling is to predict psychological or sociological characteristics (e.g. gender or age) of an author based on their linguistic writing style. Current approaches mostly base classification decisions on word-based features of the text. Over the last few years, we have created two novel corpora for author profiling, which we will show to be very useful resources. We also present the start of our efforts to bring more linguistics into this task by developing and re-using methods of discourse and semantic analysis in order to use their output as new features for our experiments. Ultimately, we hope to improve our understanding of the sociological and psychological diversity of writing style by looking at the explanation behind the behavior of our classifiers.


SEASON 2015/16

10. 5. 2016
Bruno Nahod, IHJJ
Struna: Methodology, Practice, and Future

Struna is the Croatian national term bank. Its aim is to gradually standardize Croatian terminology for all professional domains. After eight years of database development, 20 processed domains and a number of big and small customizations of the methodology, the Department of General Linguistics is getting ready to take on a new batch of terminology processing projects. The aim of this presentation is to introduce the unique model of terminological processing and term retrieval that is used in the Struna term base. Furthermore, we will present some of the problems in term-unit processing that are specific to multi-domain term bases, as well as the solutions that were implemented to address some of them. In the final part of the presentation, we will present the theoretical background and a possible practical implementation of Domain Cognitive Models (DCM), a new terminology paradigm. The development of the DCM arises as a possible means of dealing with the inadequate methodology that is, in most aspects, based on the Vienna School of terminology.

21. 4. 2016
Milan Ojsteršek, FERI UM
The Textproc framework and its use in text processing

At the Laboratory for Heterogeneous Computer Systems at the Faculty of Electrical Engineering and Computer Science, University of Maribor, we have developed Textproc, a software package for processing texts in Slovene, English and German. The framework allows software components to be connected into various workflows for natural language processing or text mining (e.g. converting various document types to text, text parsing, lemmatisation, morphosyntactic tagging, coreference resolution, semantic annotation, semi-automatic extension of a semantic lexicon, semantic process description…). In the lecture I will first focus on describing the framework and the language infrastructure we use, and then present some of the projects we have carried out with it: detection of similar content and content recommendation in the national open access infrastructure.


29. 3. 2016
Nada Lavrač, IJS
Text mining for creative cross-domain knowledge discovery

The remarkable growth in the number of published scientific articles accessible to users online enables scientists to develop and use dedicated methods for creative, cross-domain knowledge discovery. With computer support we can now efficiently explore hypothetical links between previously unconnected specialised research fields. In the lecture we will discuss methods for finding bisociative links by exploring connections between articles in the Medline library. These include finding bisociative terms with the CrossBee system and finding links in documents atypical of a given domain (so-called outliers). Along the way we will review some basic approaches to text preprocessing, document clustering and text mining.


11. 2. 2016
Marko Grobelnik and Gregor Leban, IJS
Global media monitoring

In the lecture we will present Event Registry, a system for monitoring global media in real time. The system follows news from over 110,000 sources in various languages. Each document (news item) collected is processed linguistically and semantically with Wikifier, a subsystem for semantic annotation. From the news items we assemble events, which are then linked into stories. All of this is tied together in Event Registry, which supports searching, browsing and analysing global social dynamics. The system was built entirely at IJS; in the lecture we will show the technical solutions and give a demonstration of the system.


25. 11. 2015
Maja Miličević, University of Belgrade
Quantitative analysis of corpus data using the R environment

The goal of the workshop is to provide an introduction to quantitative analysis of corpus data using the R environment. The rationale is that (1) quantitative analysis is needed to properly describe corpus data, and in particular to generalise from one language sample to other similar samples and to language in general; (2) R is one of the most powerful tools for quantitative analysis available, and is free. The workshop will be divided into three sessions, dedicated in turn to basic considerations of corpus data and R, sample description, and statistical inference. All sessions will use (meta)data from JANES.

Prerequisites: Experience of working with corpora will be assumed. No previous knowledge of statistics or R is required; an introductory handout will be provided about a week before the workshop to help participants brush up on some basic math concepts and form expectations about R.

  • Session 1: “Obtaining data from corpora: How and why?”
  • Session 2: “Describing and visualising corpus data”
  • Session 3: “Generalising from corpus data”


13. 11. 2015
Kaja Dobrovoljc, Trojina Institute
Corpus annotation with the WebAnno tool

WebAnno is one of the most versatile web-based tools for linguistic and other corpus annotation. It does not need to be installed on your own computer, supports project-based and remote work, and requires no programming skills, which makes it ideal for linguists, humanities scholars, social scientists and students.

13. 11., MPŠ lecture room
“Corpus annotation with WebAnno”, Kaja Dobrovoljc (Trojina Institute)
15:00 – 15:45 Lecture
15:45 – 16:30 Demo
16:30 – 16:45 Coffee
16:45 – 17:30 Exercises


20. 10. 2015
Žiga Pušnik, Faculty of Computer and Information Science, University of Ljubljana
Training deep convolutional neural networks on raw text

Deep convolutional neural networks have proved themselves in visual information processing. Recently there have been attempts to apply them in various other fields, and natural language processing is no exception. In the paper “Text Understanding from Scratch”, the authors used deep convolutional neural networks on raw text to classify articles. The experimental results showed the great potential of this approach for natural language processing. This research motivated the diploma thesis, in which we set out to evaluate the new approach.
We will first describe how neural networks work. We will look at how the networks are built and how gradient descent is used for training. We will present convolution, its properties and how it can be used in a neural network. An advantage of training convolutional neural networks on text is that the text does not need to be preprocessed, since the model learns at the level of characters or letters. Successful and efficient training does, however, require a few tricks, such as quantisation by vectorising the characters. We will explain what models we built and evaluate the results.
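
Character-level quantisation of this kind is commonly done with one-hot vectors. The sketch below shows that idea only; the alphabet, the fixed input length and the function name are assumptions made for the example, not details taken from the thesis.

```python
# Hypothetical alphabet; a real model would fix its own character set.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,;!?"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, length):
    """Encode `text` as a fixed-length list of one-hot character vectors."""
    rows = []
    for ch in text.lower()[:length]:      # truncate texts that are too long
        row = [0] * len(ALPHABET)
        if ch in CHAR_INDEX:              # characters outside the alphabet stay all-zero
            row[CHAR_INDEX[ch]] = 1
        rows.append(row)
    # Pad short texts with all-zero rows so every input has the same shape.
    rows += [[0] * len(ALPHABET) for _ in range(length - len(rows))]
    return rows


matrix = quantize("Abc č", length=8)
```

The result is a `length × len(ALPHABET)` matrix that a convolutional layer can slide over, which is why no tokenisation or other preprocessing is needed beforehand.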

[slides] [diploma thesis]

23. 9. 2015
Maarten Janssen, University of Lisbon
TEITOK: tagging and digital humanities

When looking at historic corpora, there are typically two different types: on the one hand, corpora created by philologists, which are annotated with textual information concerning line breaks, letter types, deletions, additions, etc., and on the other hand, those created by corpus linguists, which are annotated with linguistic information concerning part of speech, lemmatisation, and (shallow) syntactic information. Corpora that contain both are rare, although they do exist, such as the Sainte Graal of the university in Lyons, or the IMP corpus of the university of Ljubljana. One reason for the lack of such corpora is that people working with textual markup are often not interested in the linguistic markup, and vice versa. However, another important reason is that there are no tools to help in the hard task of creating corpora combining both types of information. In this presentation, I will demonstrate the TEITOK system, which aims to do exactly that: it allows adding and modifying layers of linguistic annotation on top of documents containing textual annotation. It is a system for the creation and maintenance as well as the distribution of such corpora combining two types of annotation. The system allows corpus queries using the rich CQP query language and shows the results in a way that closely resembles the original format.
The presentation will give a general overview of the philosophy and architecture of the system, with a focus on those aspects that are most relevant for historic corpora: how the system allows orthographic variants of each word, most notably the original and the normalized orthography, to be displayed and searched; how the system deals with tokenisation mismatches where, for instance, the grammatical word does not align with the orthographic word, or where the original orthography does not align with the normalized orthography; and how the system allows users to switch from a more text-oriented view to a more manuscript-oriented view.


SEASON 2014/15

25. 5. 2015
Vid Podpečan, Department of Knowledge Technologies, IJS
A robot that hears and speaks

The humanoid robot NAO has in the last few years become one of the most popular robotic systems, used in numerous research and educational institutions around the world. An excellent technical design and a polished, attractive appearance helped Aldebaran Robotics, a garage company from Paris, become one of the most important manufacturers of humanoid robots within a few years.
The lecture will focus on the language skills of the NAO robotic system. It will present its speaking and listening skills and the building blocks available for creating new robotic applications that involve voice communication. We will also touch on the possibility of communication in Slovene, the problems this raises and proposals for development in this area. Finally, we will present some NAO applications that we are currently developing or that are in the design phase.
All the technologies and capabilities of the NAO robot mentioned will also be demonstrated live in a few entertaining robotic applications.

[slides] [videolectures]

16. 4. 2015
Ana Zwitter Vitez, FHŠ UP and FF UL
Authorship attribution: from machine learning to forensic linguistics

Plagiarism, pseudonyms and anonymous threats have caused major social upheavals in recent years. How can we detect an author's traces in a text? What are the limitations of such research? What should the researcher's role be in this field?

31. 3. 2015
Darja Fišer, FF
Introduction to a family of corpus tools for linguists: SkeLL, noSkE and SketchEngine

26. 2. 2015
Jasmina Smailović, IJS
Sentiment analysis of Twitter microblogging posts

In recent years, more and more people have been using social networking and microblogging Web sites to post messages about their observations, opinions, and emotions. Such messages are suitable for various analyses because of their large volume and near-real-time publishing. In our study, we are interested in the analysis of opinions expressed in Twitter microblogging posts. In order to detect expressed opinions we employ sentiment analysis – a research area concerned with detecting opinions, attitudes, and emotions in texts. The main focus of the talk will be on selecting the most suitable sentiment analysis algorithm and determining the best text preprocessing setting for Twitter messages. Moreover, the mechanism for employing a binary Support Vector Machine (SVM) classifier to classify tweets into three sentiment categories of positive, negative, and neutral (instead of positive and negative only) will be shown. Finally, real-world applications of the developed sentiment analysis methodology will be briefly presented.
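
The abstract does not say how the binary classifier is extended to three classes. One plausible mechanism, assumed here purely for illustration, is a neutral zone: tweets whose signed distance from the SVM hyperplane is small are labelled neutral. The margin value and the scores below are made up.

```python
def three_way_label(decision_value, neutral_margin=0.5):
    """Map a binary classifier's signed score to positive, negative or neutral.

    `decision_value` is assumed to be the signed distance from the separating
    hyperplane; `neutral_margin` is a hypothetical tuning parameter.
    """
    if abs(decision_value) < neutral_margin:
        return "neutral"  # too close to the boundary to call either way
    return "positive" if decision_value > 0 else "negative"


scores = [1.8, -0.2, -1.1, 0.4]  # made-up decision values for four tweets
labels = [three_way_label(s) for s in scores]
```

The appeal of such a scheme is that the classifier can still be trained on positive and negative examples only, with the width of the neutral zone chosen afterwards on held-out data.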

[slides] [doctoral dissertation]

5. 1. 2015
Kaja Dobrovoljc, Trojina Institute
Processing Slovene in the NooJ development environment

As the field of building and annotating large computerised collections of authentic texts develops, the number of studies based on them is also growing in Slovene linguistics. Despite the increasing adoption of corpus methods, there is currently a considerable gap in Slovene studies between the many research opportunities that existing corpora offer and their actual use. This is to some extent also due to a lack of simple tools for more complex processing of corpus texts. As an example of a computationally powerful yet linguist-friendly tool, we present NooJ, a linguistic development environment for building large-scale formalised descriptions of natural languages and applying them to text corpora. Using selected language resources and rules from the pilot module for Slovene, we present the most important features of this development environment and the benefits of linking it with existing resources and tools for Slovene.


16. 12. 2014
Dr. Žiga Golob, Alpineon d.o.o.
Reducing information redundancy in pronunciation lexicons for speech synthesis

For morphologically rich languages such as Slovene, pronunciation lexicons are a very memory-hungry part of speech synthesis. To enable high-quality speech synthesis on systems with less memory, we developed several procedures that are based on a finite-state transducer representation and make it possible to drop the storage of information not needed for grapheme-to-allophone conversion from pronunciation lexicons. This allowed us to represent the remaining information more compactly. It turns out that the methods presented allow grapheme-to-allophone conversion to be performed fairly accurately even for certain words that are not contained in the pronunciation lexicon.

18. 11. 2014
Dr. Nikola Dobrić, Department of English and American Studies, Alpen-Adria University, Klagenfurt
Error annotation in learner corpora – Thoughts on agreement and norm

It is difficult to set a clear norm for error annotation of learner corpora, because an acceptable and ‘correct’ phrase, clause or sentence conveying the same semantic content can theoretically have an unlimited number of instantiations. Hence we have a situation where one annotator analyzes a writing performance and produces a strikingly different error analysis from someone else, equally competent and equally trained, looking at the same text. Bearing in mind the vagaries of language, it is difficult to assume that this problem can ever be fully solved through any kind of standardization or training effort. It can, however, be reduced to a minimum from the point of view of error annotation by focusing on the choice of a particular error taxonomy and training in its application, together with an insistence on using unified and corpus-based descriptors of the norm. The lecture will address the use of a newly developed error taxonomy (the SD Error Taxonomy) and the level of inter-rater agreement it produces as an indication of its primary usability in learner corpus annotation. The results indicate that a more transparent and coarse-grained taxonomy which leaves the feedback information out of its domain produces high levels of agreement even with minimal rater training.
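
The abstract does not name the agreement measure used. As an illustration only, the sketch below computes Cohen's kappa, a common chance-corrected measure of agreement between two annotators; the error labels are hypothetical and are not categories of the SD Error Taxonomy.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical error tags assigned by two annotators to four learner errors.
annotator_1 = ["grammar", "grammar", "lexis", "lexis"]
annotator_2 = ["grammar", "lexis", "lexis", "lexis"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5. A coarser taxonomy tends to raise such scores simply because there are fewer categories to disagree on, which is worth keeping in mind when comparing taxonomies.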

28. 10. 2014
Mihael Arčan, Insight Centre for Data Analytics, Galway
Statistical Machine Translation and Terminology

Professional translators deal on a daily basis with texts coming from different domains (information technology (IT), legal, agriculture, etc.), which require a specific lexical knowledge of the domain.
Nowadays, statistical machine translation (SMT) systems are suitable for translating very frequent expressions, but fail to translate domain-specific terms. This is mostly due to a lack of domain-specific parallel data from which the SMT systems can learn. Generic models such as Google Translate and Bing Translator are the most common solutions and are often used to translate manuals or very specific texts, resulting in modest translations.
On the other hand, online terminological resources (e.g. the ‘Interactive Terminology for Europe’, IATE) are a valuable and fundamental support for translators, although their continuous use can be time-consuming. For all these reasons, the integration of terminological knowledge into the SMT system is a crucial step to increase translator productivity and limit their initial overload when working in different domains.
The talk will give an overview of how an SMT system generates a translation from source language to target language. The main focus of the presentation will be on the embedding of terminology into open-source phrase-based SMT (PB-SMT) systems such as Moses. Finally, the talk will conclude with a discussion about the future of SMT.


17. 9. 2014
Dr. Antoni Oliver Gonzalez, Open University of Catalonia, Barcelona
WN-Toolkit: automatic creation of WordNets. Results for Croatian and Slovene (preliminary)

In this talk I will present some algorithms for automatic wordnet creation following the expand model. The methodologies we are using are based on bilingual dictionaries and parallel corpora. I will show complete evaluation results for Croatian and preliminary results for Slovene (using dictionary-based techniques only).
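
In the expand model, the synset structure of an existing wordnet is kept and only the lemmas are translated. A minimal dictionary-based sketch of that idea follows. It is an assumed simplification, not the WN-Toolkit's actual algorithm: it translates only English lemmas that belong to a single synset, since those can be projected without sense disambiguation. The synset identifiers and dictionary entries are invented.

```python
def expand(synsets, dictionary):
    """Project target-language lemmas onto source synsets via a bilingual dictionary."""
    # Count how many synsets each source lemma belongs to.
    senses = {}
    for lemmas in synsets.values():
        for lemma in lemmas:
            senses[lemma] = senses.get(lemma, 0) + 1
    target = {}
    for synset_id, lemmas in synsets.items():
        for lemma in lemmas:
            if senses[lemma] == 1:  # monosemous lemmas need no disambiguation
                target.setdefault(synset_id, set()).update(dictionary.get(lemma, []))
    return target


# Hypothetical English synsets and an English-Slovene dictionary.
synsets = {
    "s1": ["dog", "domestic_dog"],
    "s2": ["bank"],
    "s3": ["bank", "riverbank"],
}
dictionary = {"dog": ["pes"], "bank": ["banka", "breg"], "riverbank": ["breg"]}
wordnet_sl = expand(synsets, dictionary)
```

The ambiguous lemma "bank" is skipped, so synset "s2" stays empty while "s3" is filled through the monosemous "riverbank". Handling such polysemous lemmas is the kind of case where parallel corpora would be expected to help.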



Invitations to the Jezikovnotehnološki abonma and materials from past lectures can be found at this link.