UDK 81’322
Professional paper (Stručni rad)
Manuscript received 30 September 2019, accepted for publication 29 November 2019
doi.org/10.31724/rihjj.46.2.8

Radovan Garabík
Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences
Panská 26, SK-81101 Bratislava
radovan.garabik@kassiopeia.juls.savba.sk

Word Embedding Based on Large-Scale Web Corpora as a Powerful Lexicographic Tool

The Aranea Project offers a set of comparable corpora for two dozen (mostly European) languages, providing a convenient dataset for NLP applications that require training on large amounts of data. The article presents word embedding models trained on the Aranea corpora and an online interface to query the models and visualize the results. The implementation is aimed at lexicographic use but can also be useful in other fields of linguistic study, since the vector space is a plausible model of the semantic space of word meanings. Three different models are available – one for a combination of part of speech and lemma, one for raw word forms, and one based on the FastText algorithm, which uses subword vectors and is not limited to whole or known words in finding their semantic relations. The article describes the interface and the major modes of its functionality; it does not attempt a detailed linguistic analysis of the presented examples.

1. Introduction

The Aranea Project (Benko 2014) offers a set of comparable corpora for two dozen (mostly European) languages, providing a convenient dataset for NLP applications that require training on large amounts of data. The corpora are built using the same methodology and compatible natural language processing tools and are available via the NoSketch Engine interface (Rychlý 2007) at the web page of the project.[1]

[1] http://aranea.juls.savba.sk

Word embedding is a collective name for various methods of representing words as vectors within a many-dimensional vector space. Although first mentioned as a theoretical concept in (Harris 1954), the approach gained momentum with the publication and open-sourcing of the word2vec software (Mikolov et al. 2013), and it is currently an indispensable part of many NLP-related tasks and processes. It is supported by several mature open-source frameworks; in particular, gensim (Řehůřek and Sojka 2010) is rather popular with researchers preferring the Python programming language – this framework is also used to generate our vector models. The vector space, and the relations of the words represented by vectors within it, is connected with the more abstract semantic meanings of the words and their relations.

Our work presents an online accessible interface to vector models for the main languages of the Aranea corpora, to be used in lexicographic work. The implementation is somewhat Slovak-centric in the sense that some features are either available only for Slovak, or their implementation for other languages is not tuned for coverage or accuracy. This is understandable, since the implementation is geared towards use in Slovak lexicography, and indirectly because of the state-of-the-art lemmatization (Garabík 2006) used in the Slovak corpora. The models use automatic detection of bigrams, which aids the lexicographic description of multiword expressions (though there are better tools available for collocation analysis, so this is useful in a supplemental role only). It is also possible to filter out out-of-dictionary lemmas, which is useful in uncovering non-obvious meanings of existing words. Otherwise, the lemmas obtained by the statistical and heuristic guesser can be erroneous, but their inclusion often displays unexpected relations between words not covered by existing dictionaries (and these are not exclusively neologisms).
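The paper does not publish its training pipeline, but the components named above (gensim, skip-gram training, automatic bigram detection) can be sketched in a few lines. The corpus file name, the vector dimensionality and the phrase-detection thresholds below are illustrative assumptions, not the project's actual settings:

```python
# Minimal sketch of training a word embedding model with gensim,
# with automatic bigram detection (all file names are hypothetical).
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Assumed input: one tokenized sentence per line.
sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]

# Automatic bigram detection: frequent pairs such as "Jadranka Kosor"
# become single "Jadranka_Kosor" tokens (thresholds are illustrative).
bigram = Phraser(Phrases(sentences, min_count=10, threshold=10.0))
sentences = [bigram[sentence] for sentence in sentences]

# Skip-gram (sg=1) with a 7-word window and a frequency threshold of 10,
# matching the parameters described in Section 2; vector_size is a guess.
model = Word2Vec(sentences, sg=1, window=7, min_count=10,
                 vector_size=100, workers=4)
model.save("model.w2v")
```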
2. Vector models

At the time of writing, there are usable vector models for 22 languages, and three more language models are in the test phase. Because of the aim to provide a useful resource for lexicographic work, we provide a special vector model (Slovak ll in Table 1) using a very low word frequency threshold (10 occurrences in the corpus), while the other models use a variable threshold depending on the size of the corpus (20 for the smallest corpora, 400 for the biggest ones). Although such a low threshold brings a lot of “noise” (uncommon typos, mislemmatized entries, errors in tokenization, foreign language citations etc.) into the results, it also helps to uncover rare but relevant semantic relations or synonyms. The models use a context window 7 words wide and the skip-gram training algorithm.

Table 1: Overview of language models and their source corpora

language | corpus | corpus size
Arabic | Araneum Arabicum | 978 M
Bulgarian | Araneum Bulgaricum | 1.2 G
Chinese (simplified)# | Hanku | 1.2 G
Croatian | hrWaC v2.0 | 1.9 G
Czech | Araneum Bohemicum | 5.2 G
Dutch | Araneum Nederlandicum | 1.2 G
English | Araneum Anglicum | 11.4 G
Estonian | Araneum Estonicum | 430 M
Finnish | Araneum Finnicum | 1.2 G
French | Araneum Francogallicum | 8.7 G
German | Araneum Germanicum | 9.1 G
Hungarian | Araneum Hungaricum | 1.2 G
Italian | Araneum Italicum | 1.2 G
Latin | Araneum Latinum | 109 M
Latvian | Araneum Lettonicum | 671 M
Polish | Araneum Polonicum | 1.2 G
Portuguese | Araneum Portugallicum | 1.2 G
Russian | Omnia Russica | 29.7 G
Slovak | Araneum Slovacum + prim-8.0-juls-all | 4.6 G
Slovak ll& | Araneum Slovacum + prim-8.0-juls-all | 4.6 G
Slovene | slWaC v2.1 | 895 M
Spanish | Araneum Hispanicum | 1.2 G
Swedish* | Araneum Suedicum | 1.2 G

* not lemmatized; the wordform model serves as a fallback when the lemmata model is selected
# lemmatization not applicable; the lemmata model differs from the wordform one only by the background presence of part of speech information (Gajdoš, Garabík and Benická 2016)
& differs from the baseline Slovak model by using a substantially lower threshold for word occurrence (frequency)

Test models

language | corpus | corpus size
Georgian | Araneum Georgianum | 254 M
Norwegian$ | Araneum Norvegicum | 1.6 G
Romanian | Araneum Dacoromanicum | 1.2 G

$ mixture of Nynorsk and Bokmål

Traditionally, cosine similarity is used to quantify semantic difference in word embeddings – words close in meaning have almost parallel vectors (the angle θ between them close to zero, and cos 0 = 1), while unrelated words have almost perpendicular vectors (a right angle, and cos π/2 = 0). Our users, however, prefer to align their spatial intuition with the semantic space model and think of semantically “close” words as “near” in the spatial sense and of “unrelated” words as “far” in the spatial sense; we therefore define the “semantic difference” as √(1 − cos² θ) = sin θ, which is close to zero for near-synonyms and close to one for unrelated words.
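As a worked illustration of this definition, converting the cosine similarity returned by an embedding library into the semantic difference is a one-line computation; a minimal sketch in NumPy:

```python
import numpy as np

def semantic_difference(u: np.ndarray, v: np.ndarray) -> float:
    """sin(theta) between two word vectors: close to 0 for near-synonyms,
    close to 1 for semantically unrelated words."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos = np.clip(cos, -1.0, 1.0)  # guard against floating-point rounding
    return float(np.sqrt(1.0 - cos * cos))
```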
For most languages, three different models are available: a model trained on the combination of part of speech and lemmas, a model trained on word forms, and a FastText model (Mikolov et al. 2018). Common linguistic preprocessing in all the models includes text normalization, deduplication, boilerplate removal, tokenization, and sentence-level segmentation (Benko 2014).

The model trained on the combination of part of speech and lemmas (called the lemmata model in this article) is trained on the sequence of combinations of part of speech (tagged with the Araneum Universal Tagset Version 1.0) and lemmatized tokens. We keep uppercase lemmas in the capitalization that is “natural” for the language in question; in particular, proper names in almost all languages and German nouns are capitalized. This spares casual users undue cognitive load, while allowing homonymous common and proper names (and, in the case of German, nouns and other parts of speech) to be distinguished. The model over lemmas loses information on inflected forms (and, by implication, perhaps interesting syntactic features), but users generally expect to enter lemmas, and it is the lemmas that carry the semantic information. The main disadvantage of this model is that it exposes errors in lemmatization, particularly if the queried word was not known to the morphological database or is lemmatized incorrectly.

The model trained on word forms is the closest to commonplace word embedding usage. However, we normalize the capitalization of tokens according to their predominant capitalization (more than 90% of occurrences in the corpus at positions 1) not at the beginning of the sentence and 2) not immediately after direct speech punctuation, such as quotes or dashes). The model is otherwise independent of any existing linguistic annotation; it is therefore not tainted by possible systematic errors or shortcomings of existing tools (systematic errors in lemmatization in particular are known to skew the results significantly), and it can be used even if other NLP processing components are not available for a given language.

The FastText model uses sub-word character n-gram vectors for certain values of n, added to the main word vector, calculated over the space of case-normalized word forms. This model captures intra-word information and allows the calculation of vectors for input words not present in the training corpora. It is especially useful when searching for compound words, for languages rich in such compounds (such as German), or when querying substandard or erroneous input. As a convenient side effect, the semantic closeness of the vectors extends to morphemes within words, which is especially visible if there are otherwise no close synonyms to the input.
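A minimal sketch of this out-of-vocabulary behaviour with gensim's FastText implementation (the model file name and the query word are hypothetical; the point is that a vector is assembled from character n-grams even for a compound never seen in training):

```python
from gensim.models import FastText

# Hypothetical file name; any trained gensim FastText model will do.
model = FastText.load("araneum_germanicum.ft")

# The compound need not occur in the training corpus: its vector is
# composed from the stored sub-word character n-gram vectors.
for word, cosine in model.wv.most_similar("Donaudampfschifffahrt", topn=5):
    print(word, cosine)
```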
3. User interface

Conceptually, the web interface (available at https://www.juls.savba.sk/semä.html) consists of several modules, and we strive to provide the most streamlined user experience possible: the exact functionality is determined by the user input, without additional (or visible) options. The input consists of two query fields, one for positive words (the normalized sum of the corresponding vectors) and one for negative words (the sum of these vectors is subtracted from the positive ones). The negative word field is hidden by default and is exposed when the user moves the mouse pointer over the “minus” sign. The words should be separated by spaces or plus signs. It is also possible to enter negative words into the positive input field prefixed by a minus sign, effectively constructing a simple arithmetic expression.

The following options are available:
− the model, one of lemmata, wordform, fasttext
− known lexicon filter (only for Slovak and only for the lemmata model)
− visualization method

The result of the query consists of three fields:
− informational message
− lexical similarities table
− visualization

The “informational message” field contains optional messages for the user, mostly about the status of the server (e.g. the backend is not working, the server is overloaded) or a notice that an unknown word has been queried. In the case of a successful query (no error and the queried word is known to the model), the lexical similarities table and the visualization are displayed. The lexical similarities table shows the semantic difference coefficient (rounded to three decimal places), the word (semantically close to the query), its raw count in the corpus, and links to external resources, with a letter-like symbol denoting the type of link.

Figure 1: Example of the user interface. From top to bottom, left to right, there are these elements: language selector (hr), model selection (wordform), visualization selection (Gnuplot), input field (Zagreb), informational message (“cached”), lexical similarities table (with the columns containing semantic difference, word, raw count in the corpus, hyperlinks to external sources), and the visualization field.

The links to external resources are (together with the hyperlink symbols):
− link to the Aranea corpus[2], version Minus: a
− link to the Aranea corpus, version Maximum (if available; otherwise Maius; requires registration): A
− link to the Google search[3] for the word, restricted to the top-level domain typical for the language[4]: G
− link to the English language Wiktionary[5] entry for the word: W
− link to the Slovak National Corpus search interface[6] (only for the Slovak language; requires registration): P
− link to the Slovak Dictionary Portal[7] (only for the Slovak language): S
− link to Yandex Search[8] (only for the Russian language): Я
− link to the search interface of the Dictionnaire de l’Académie Française[9] (only for the French language): F
− link to the hrWaC corpus search interface[10] (only for the Croatian language): C
− link to the slWaC corpus search interface[11] (only for the Slovene language): C
− link to the Slovene dictionary portal Fran[12] (only for the Slovene language): F

[2] http://aranea.juls.savba.sk
[3] https://google.com
[4] With the following exceptions: Arabic, English, Georgian and Latin are not restricted; German is restricted to the .de, .at and .ch domains; Portuguese is restricted to the .pt, .br, .ao and .mz domains; Russian is restricted to the .su, .ru, .by, .бел and .рф domains.
[5] https://en.wiktionary.org
[6] https://bonito.korpus.sk
[7] https://slovnik.juls.savba.sk
[8] https://yandex.ru
[9] https://academie.atilf.fr
[10] http://nl.ijs.si/noske/all.cgi/first_form?corpname=hrwac
[11] http://nl.ijs.si/noske/all.cgi/first_form?corpname=slwac
[12] https://fran.si

The individual words in the semantic similarities table are hyperlinked to a query for that word using the same language model and options. The number of rows (returned results) can be increased by clicking on the down arrow symbol at the bottom of the table.
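The query handling described in this section can be emulated outside the web interface. The sketch below parses a query expression into positive and negative terms and prints a table of semantic differences; the model file name is a hypothetical placeholder, the parsing is simplified, and, unlike the interface, gensim's most_similar excludes the input words from the results:

```python
import math
import re
from gensim.models import KeyedVectors

wv = KeyedVectors.load("hrwac.lemma.kv")   # hypothetical file name

def query(expression: str, topn: int = 10):
    """Split a query like "kralj + žena - muškarac" into positive and
    negative terms, then rank neighbours by sin(theta) difference."""
    positive, negative = [], []
    sign = "+"
    for token in re.findall(r"[+-]|[^\s+-]+", expression):
        if token in "+-":
            sign = token
        else:
            (positive if sign == "+" else negative).append(token)
            sign = "+"      # a bare space between words means "+"
    hits = wv.most_similar(positive=positive, negative=negative, topn=topn)
    return [(math.sqrt(max(0.0, 1.0 - cos * cos)), word) for word, cos in hits]

for diff, word in query("kralj + žena - muškarac"):
    print(f"{diff:.3f}\t{word}")
```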
4. Visualization

The visualization is especially important for quickly conveying information about semantic clusters and about the relation of semantically close words to the queried word. We use the ISOMAP method of dimensionality reduction (Tenenbaum, De Silva and Langford 2000) to get a presentable scatter-like graph of the semantic neighborhood of the query. We use reduction to two dimensions for the basic graph and reduction to three dimensions for a three-dimensional graph (displayed in 2D projection); and since there are people able to conceptualize and perceive four spatial dimensions (Francis and Brinkmann 2009), there is also the possibility to display a four-dimensional graph, as a colour-coded 3D graph projected onto a 2D screen.

The graphs in SVG format are produced by the Gnuplot[13] software (Janert 2010), running on the server. Gnuplot is a mature, well-established multiplatform visualization package with a rich set of capabilities, and its batch-like mode makes integrating it into a web-centric interface rather straightforward and effortless. The software is also very efficient, and generating the graphs takes a virtually negligible amount of resources. The raw dimensionality-reduced coordinates in Gnuplot format are available via a link (⚙), providing Gnuplot-enabled users with the ability to pan, zoom and rotate the graphs (though rotation is possible only in planes perpendicular to the ana-kata axis). Users can also use the raw data in the visualization or statistical software of their choice.

There are also two additional visualization modules available, providing two different word clouds – a static image[14] and a dynamic rotating sphere[15]. These are provided purely for demonstration or illustrative purposes.

[13] http://gnuplot.info
[14] Based on the wordcloud2.js software, https://wordcloud2-js.timdream.org.
[15] Based on the jsTagSphere software, http://jstagsphere.sf.net.

5. Usage Examples

5.1. Near Synonyms

Perhaps the most basic application of the interface is as an extensive collection of thesauri in various languages, each of them displaying not only synonyms of the queried word but also quantifying the semantic difference of each synonym (in the lexical similarities table). In the visualization field, we can immediately spot prominent clusters of semantically similar words, which can bring new insights into semantic relations and behavior.

Table 2: Semantic similarities table for the query djevojka, Croatian lemmata model

0.000 | djevojka | 130074
0.346 | mladić | 50694
0.420 | djevojčica | 57120
0.502 | dečko | 99050
0.508 | curiti | 24738
0.509 | dječak | 57554
0.516 | cura | 35098
0.531 | muškarac | 233772
0.536 | žena | 670474
0.549 | momak | 42770
0.574 | ljepotica | 12390

Figure 2: 2D visualization of the query djevojka. Several semantic clusters are visible

Table 3: Semantic similarities table for the query kralj, Croatian lemmata model

0.000 | kralj | 83678
0.511 | vladar | 24612
0.541 | car | 26474
0.564 | knez | 10444
0.625 | kraljica | 28280
0.635 | prijestolje | 8526
0.641 | Petr_i. | 70
0.644 | gospodar | 25802
0.665 | princ | 12992
0.690 | vitez | 14742
0.696 | Pipin_Mali | 112

Figure 3: 3D visualization of the query kralj. Several semantic clusters are visible
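A scatter graph in the spirit of Figure 2 can be approximated with scikit-learn's Isomap implementation; a sketch, assuming a trained gensim model under a hypothetical file name (the interface itself feeds the reduced coordinates to Gnuplot rather than printing them):

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.manifold import Isomap

wv = KeyedVectors.load("hrwac.lemma.kv")   # hypothetical file name

# The query word plus its 30 nearest neighbours form the neighbourhood.
words = ["djevojka"] + [w for w, _ in wv.most_similar("djevojka", topn=30)]
vectors = np.array([wv[w] for w in words])

# Non-linear reduction of the high-dimensional vectors to a 2D plane.
coords = Isomap(n_neighbors=10, n_components=2).fit_transform(vectors)
for word, (x, y) in zip(words, coords):
    print(f"{x:8.3f} {y:8.3f}  {word}")
```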
5.2. Vector Arithmetic

Since we have the words represented as vectors in a multidimensional Euclidean[16] space, we can perform simple vector arithmetic (addition and subtraction) corresponding to the addition and subtraction of the meanings of the words in the traditional sense. The usual example used to demonstrate word embedding arithmetic is to subtract the masculinity from the word king, add femininity and get the word queen, using the equation:

king + woman – man = queen (1)

which we demonstrate on the Croatian model, i.e. the equation will be of the form:

kralj + žena – muškarac = x (2)

[16] Not necessarily Euclidean; using other metrics (e.g. Manhattan or a general Ln norm) could emphasise or suppress various aspects of semantic differences. However, the interpretation of other norms is difficult and their relation to inherent linguistic properties is opaque; other metrics are therefore not used, and their discussion is beyond the scope of this article.

First, the semantic space around the single query kralj is shown in Table 3 and Figure 3; we can recognize several semantic groups, e.g. that of important European kings (Henrik II, Ludovik II, Pipin Mali, Fridrik II – the bottom group), important Croatian/Bosnian rulers (Stjepan Tvrtko, Stjepan Držislav – the upper group), other titles and rulers (gospodar, ban, vojvoda, car, princ, kraljica – in the middle of the graph) and several semantically less connected, solitary words (prijestolje, kraljevski).

Table 4: The most similar vectors to the query x = kralj + žena – muškarac

0.193 | kralj | 83678
0.608 | kraljica | 28280
0.621 | knez | 10444
0.642 | vladar | 24612
0.649 | Petr_i. | 70
0.661 | car | 26474
0.684 | prijestolje | 8526
0.688 | dvor | 19908
0.698 | Pipin_Mali | 112
0.699 | krunidba | 658
0.711 | vitez | 14742
... | |
0.987 | žena | 670474

Table 5: The most similar vectors to the query x = premijer + žena – muškarac

0.201 | premijer | 105350
0.472 | premijerka | 18732
0.493 | vlada | 401842
0.523 | Jadranka_Kosor | 29904
0.537 | premijerka_Jadranka | 7882
0.538 | Ivo_Sanader | 14840
0.540 | Zoran_Milanović | 19544
0.553 | Kosor | 32200
0.580 | premijerka_Kosor | 5390
0.585 | potpredsjednik_vlada | 12096
0.598 | Iva_Sanader | 12796
... | |
0.978 | žena | 670474

The results of equation (2) nearest to the calculated vector x are shown in Table 4. Unsurprisingly, the nearest word not equal to the input is kraljica, with the other vectors being close because of their closeness to kralj (in other words, there are no other clearly feminine near-synonyms). Note that we calculate the semantic difference relative to the vector x and always include all the input words with positive arithmetic signs in the results, to give the user an idea of how close the result really is to the words presented in the table. This alleviates some of the concerns raised in (Nissim et al. 2019).[17] The connecting lines in the visualization graph originate at the position of the vector x as well; however, their length is necessarily distorted and generally does not correspond to the semantic closeness. Also note that the word nearest to the x of equation (2) happens to be kralj itself; kraljica is the second nearest.

[17] The idea of originating the connecting lines in the visualization at the resulting vector (and not at the nearest word) and of including the semantic closeness between this origin and the original words in the tables was inspired by (Nissim et al. 2019), brought to our attention by the anonymous reviewer of this article. We are also grateful for the comments provided by the reviewer.
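For readers who want to replicate equation (2) outside the interface, the sketch below builds the vector x from normalized word vectors and reports the sin θ difference of the nearest words (and of the positive input words) relative to x, as in Table 4. The model file name is hypothetical, and the normalization scheme is an assumption of this sketch:

```python
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("hrwac.lemma.kv")   # hypothetical file name

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Equation (2): x = kralj + žena - muškarac, over normalized vectors.
x = unit(unit(wv["kralj"]) + unit(wv["žena"]) - unit(wv["muškarac"]))

def difference_to_x(word: str) -> float:
    cos = float(np.dot(x, unit(wv[word])))
    return float(np.sqrt(max(0.0, 1.0 - cos * cos)))

# similar_by_vector does not exclude the inputs, so kralj itself can
# appear first, as in Table 4 ...
words = [w for w, _ in wv.similar_by_vector(x, topn=10)]
# ... and we append the positive input words, as the interface does.
for w in words + [w for w in ("kralj", "žena") if w not in words]:
    print(f"{difference_to_x(w):.3f}\t{w}")
```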
In modern times, of course, a monarchy is not typical for Croatia, and we do not expect many texts about monarchies in the web corpus. We thus repeat the query with a more modern example:

premijer + žena – muškarac = x (3)

The results are shown in Table 5; as expected, the nearest word (apart from premijer itself) is premijerka, but we also see several close proper names, most prominently the (feminine) Jadranka Kosor.[18] The table also incidentally illustrates automatic bigram detection – all the bigrams (Jadranka Kosor, premijerka Jadranka, Ivo Sanader, etc.) were automatically inferred from the corpus data – as well as errors in lemmatization: Iva Sanader is such an error.[19] It also demonstrates the closeness (in the abstract semantic space, of which our vector space is hopefully a reasonably adequate model) of proper nouns to common nouns.

[18] Prime minister of Croatia from 2009 to 2011.
[19] The correct lemma is Ivo Sanader.

Figure 4: 2D visualization of the query premijer + žena – muškarac; the connecting lines originate at the vector representing the arithmetic result

5.3. Quantification of Semantic Difference

Another mode of operation is activated automatically if exactly two terms are entered into the positive query field: it calculates and displays the semantic difference, defined as sin θ, where θ is the angle between the two vectors. To illustrate the results, Table 6 contains several pairs of synonyms, several not-quite synonyms, and also word pairs with a rather low semantic difference that are nevertheless not considered synonyms in the traditional sense at all.

Table 6: Quantification of semantic differences for several word pairs

first word | second word | sin θ
kralj | premijer | 0.885
kralj | knez | 0.564
kralj | car | 0.541
lingvistika | jezikoslovlje | 0.661
premijer | predsjednik | 0.628
apoteka | ljekarna | 0.427
tri | dva | 0.162
osam | sedam | 0.120
euro | kuna | 0.327
Beč | Budimpešta | 0.339
Slovačka | Češka | 0.240
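The two-word mode reduces to a single cosine lookup per pair; a sketch reproducing a few rows of Table 6, under the same hypothetical Croatian model file as above and applying the sin θ definition from Section 2:

```python
import math
from gensim.models import KeyedVectors

wv = KeyedVectors.load("hrwac.lemma.kv")   # hypothetical file name

def semantic_difference(a: str, b: str) -> float:
    """sin(theta) for a word pair, from gensim's cosine similarity."""
    cos = float(wv.similarity(a, b))
    return math.sqrt(max(0.0, 1.0 - cos * cos))

for a, b in [("kralj", "car"), ("tri", "dva"), ("euro", "kuna")]:
    print(f"{a} – {b}: {semantic_difference(a, b):.3f}")
```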
6. Conclusion

Word embedding is an indispensable method in modern natural language processing. By presenting a simple yet powerful web-accessible interface to various word vector models built upon the Aranea corpora family, we hope to bridge the gap between contemporary NLP and traditional linguistic and lexicographic research, and to allow lexicographers to consult the rich information that word embeddings trained on huge corpora can provide.

References

Benko, Vladimír. 2014. Aranea: Yet Another Family of (Comparable) Web Corpora. Text, Speech and Dialogue. 17th International Conference, TSD 2014. Eds. Sojka, Petr et al. Springer International Publishing Switzerland. Brno. 257–264.
Erjavec, Tomaž; Ljubešić, Nikola; Logar, Nataša. 2015. The slWaC corpus of the Slovene Web. Informatica: an international journal of computing and informatics 39/1. 35–42.
Francis, George K.; Brinkmann, Peter. 2009. Human four-dimensional spatial intuition in virtual reality. Psychonomic Bulletin & Review 16/5. 818–823.
Gajdoš, Ľuboš; Garabík, Radovan; Benická, Jana. 2016. The New Chinese Webcorpus Hanku – Origin, Parameters, Usage. Studia Orientalia Slovaca 15/1. 21–33.
Garabík, Radovan. 2008. Storing Morphology Information in a Wiki. Lexicographic Tools and Techniques. IITP RAS. Moscow. 55–59.
Harris, Zellig S. 1954. Distributional structure. Word 10/2–3. 146–162.
Janert, Philipp K. 2010. Gnuplot in action: understanding data with graphs. Manning Publications Co. New York.
Ljubešić, Nikola; Klubička, Filip. 2014. {bs,hr,sr}wac – web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9). Eds. Bildhauer, Felix; Schäfer, Roland. Association for Computational Linguistics. Gothenburg. 29–35.
Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey. 2013. Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR. Université de Montreal. Scottsdale.
Mikolov, Tomas; Grave, Edouard; Bojanowski, Piotr; Puhrsch, Christian; Joulin, Armand. 2018. Advances in Pre-Training Distributed Word Representations. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association. Miyazaki.
Nissim, Malvina; van Noord, Rik; van der Goot, Rob. 2019. Fair is Better than Sensational: Man is to Doctor as Woman is to Doctor. Computational Linguistics 46/2. 487–497. doi.org/10.1162/coli_a_00379.
Řehůřek, Radim; Sojka, Petr. 2010. Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Eds. Witte, René et al. ELRA. Valletta. 46–50.
Rychlý, Pavel. 2007. Manatee/Bonito – A Modular Corpus Manager. 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Masaryk University. Brno. 65–70.
Slovenský národný korpus – prim-8.0-juls-all. 2018. Jazykovedný ústav Ľ. Štúra Slovenskej akadémie vied. Bratislava. http://korpus.juls.savba.sk (accessed 28 December 2019).
Tenenbaum, Joshua B.; de Silva, Vin; Langford, John C. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290/5500. 2319–2323. doi.org/10.1126/science.290.5500.2319.

Summary (Sažetak)

The Aranea Project comprises a set of comparable corpora for 24 (mostly European) languages. It provides a convenient dataset for natural language processing (NLP) applications that require training on large amounts of data. The paper presents word embedding models trained on the Aranea corpora, together with a web interface for querying the models and visualizing the results. This can be useful in lexicographic practice, but also in other fields of linguistic study, since the vector space is a plausible model of the semantic space of word meanings. Three models are available: the first for the combination of part of speech and lemma, the second for raw word forms, and the third based on the FastText algorithm, which uses sub-word vectors and is not limited to whole or known words when finding semantic relations. The paper describes the interface and the basic modes of its functioning, but it does not attempt an exhaustive linguistic analysis of the presented examples.

Keywords: corpus, word embedding, vector similarity, semantic similarity, web corpora, visualization