More about the analyser

What do the morphological tags mean?

They give morphological information about each word: for example, V-p is a verb in the present tense, Nv is a verbal noun, and Ncpfn is a plural feminine common noun in the nominative. All tags are described in the file Scottish_Gaelic_Part-of-Speech_Annotatio.pdf provided with the ARCOSG corpus.
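To make the positional reading of a tag such as Ncpfn concrete, here is a minimal Python sketch. It is not part of the analyser. It only knows the letters that can be read off the examples in this FAQ (Ncpfn, plus the assumption that the s and m in Ncsmn stand for singular and masculine); the function name and tables are hypothetical, and the annotation manual remains the authority for the full tagset.

```python
# Letters attested in (or inferred from) this FAQ's examples only.
# Assumption: "s" = singular and "m" = masculine, read off the tag Ncsmn.
# Anything else is reported as unknown rather than guessed.
NOUN_TYPE = {"c": "common"}
NUMBER = {"s": "singular", "p": "plural"}
GENDER = {"m": "masculine", "f": "feminine"}
CASE = {"n": "nominative"}

SLOTS = (("type", NOUN_TYPE), ("number", NUMBER), ("gender", GENDER), ("case", CASE))


def describe_noun_tag(tag):
    """Spell out a five-letter noun tag such as 'Ncpfn', letter by letter."""
    if len(tag) != 5 or tag[0] != "N":
        raise ValueError(f"not a five-letter noun tag: {tag!r}")
    description = {}
    for letter, (name, table) in zip(tag[1:], SLOTS):
        description[name] = table.get(letter, f"unknown ({letter})")
    return description


if __name__ == "__main__":
    print(describe_noun_tag("Ncpfn"))
```

Other tag families (verbs, verbal nouns and so on) follow their own layouts and are deliberately rejected by this sketch.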

I got the tag Ncsmn for eile, are you kidding me?

The tagger was trained on the ARCOSG corpus using machine learning (Conditional Random Fields). Its output is the system's best guess, based on the form of the word and on the neighbouring words. In our evaluation, 90.7% of units in a text were tagged correctly on average. POS tagging mistakes tend to involve categories that are difficult to predict statistically, such as case and gender. Lemmatisation is still experimental. If you need a perfect result, you will have to correct the output by hand.
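If you do correct a sample by hand, you can estimate how the tagger performs on your own texts by comparing the automatic tags with your corrected ones. The following is a minimal Python sketch; the function name and the sample tags are invented for illustration and are not analyser output.

```python
def tagging_accuracy(predicted, corrected):
    """Share of units whose automatic tag equals the hand-corrected tag."""
    if len(predicted) != len(corrected):
        raise ValueError("both tag lists must cover the same units")
    if not predicted:
        raise ValueError("cannot compute accuracy on an empty sample")
    matches = sum(p == c for p, c in zip(predicted, corrected))
    return matches / len(predicted)


if __name__ == "__main__":
    # Invented sample: one of four tags differs after hand correction.
    automatic = ["V-p", "Ncsmn", "Nv", "Ncpfn"]
    by_hand = ["V-p", "Ncsfn", "Nv", "Ncpfn"]
    print(f"{tagging_accuracy(automatic, by_hand):.1%}")
```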

What is the simplified version of the tagger?

While the default version of the tagger was trained upon the full version of ARCOSG, which uses a tagset of 246 tags, the simplified version was trained upon a version with just 41 tags. Although the simplified tagger is less informative, its higher accuracy (close to 95%) makes it more attractive for many tasks. The tagset is described in the Annotation Guidelines manual, available here: https://github.com/Gaelic-Algorithmic-Research-Group/ARCOSG-S.

What do the syntactic tags mean?

They give syntactic information about the words in a sentence: for example, nsubj is a noun-like subject and amod is an adjective-like modifier. The tags belong to the Universal Dependencies syntactic tagset.

What do the numbers before the syntactic tags mean?

Each number is the position in the sentence of the word on which the given word depends (its head). If the number is 0, the word depends on nothing: it is the main word of the sentence (the root of the dependency tree). The parser is based on dependency syntax, a syntactic model in which the words of a sentence are linked through dependency relations.
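The head numbers can be illustrated with a short Python sketch. The sentence below is a hand-built example, not actual analyser output, and the tuple layout and function names are assumptions made for the illustration.

```python
# Hand-built illustration, NOT analyser output.
# Each entry: (position in sentence, word, position of its head, relation).
SENTENCE = [
    (1, "Chunnaic", 0, "root"),
    (2, "an", 3, "det"),
    (3, "cù", 1, "nsubj"),
    (4, "mòr", 3, "amod"),
    (5, "cat", 1, "obj"),
]


def find_root(tokens):
    """Return the word whose head number is 0, i.e. the root of the tree."""
    roots = [word for _, word, head, _ in tokens if head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


def dependents_of(tokens, position):
    """Return (word, relation) pairs for every word depending on `position`."""
    return [(word, rel) for _, word, head, rel in tokens if head == position]


if __name__ == "__main__":
    print(find_root(SENTENCE))
    print(dependents_of(SENTENCE, 3))
```

Here the word at position 3 (cù) depends on position 1, while the words at positions 2 and 4 both point back to position 3.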

Will you improve the analyser?

We do not plan to stop now, and we see room to improve the analyser significantly. However, we are working on it in our free time, so do not expect enhancements every week. We will try to keep the versioning page up to date and to document improvements.

Can I write Gaelic in the traditional way? Do I have to spell things a certain way?

Although the analyser can tolerate both grave and acute accents to an extent, we assume that texts have been written according to the Gaelic Orthographic Conventions (GOC) 2009. The analyser assumes that texts are spelled correctly. Various Gaelic spelling resources are freely available on the web. Here are two:

How do I send a file to the analyser for analysis?

First, install cURL.

Once cURL is installed, type the following command on the command line. It assumes that the text you want to annotate is called text.txt (100 kb maximum) and that you want the annotated output to be called text.ann.txt:

curl -X POST -H "Content-Type: text/plain" --data-binary "@text.txt" https://klc.vdu.lt/gaelic_tagger/tagger -o "text.ann.txt"

Be careful: the data file (text.txt or any other name) must be in the folder from which you call cURL; otherwise you will have to give its path. The results (e.g. text.ann.txt) will be written to the same folder.

If you prefer the simplified results, use the following command instead:

curl -X POST -H "Content-Type: text/plain" --data-binary "@text.txt" https://klc.vdu.lt/gaelic_tagger/simple_tagger -o "text.ann.txt"
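If you would rather call the service from a script, the same POST request can be sketched in Python with the standard library. This is an unofficial equivalent of the cURL commands above, not something provided with the analyser: the function names are hypothetical, and reading "100 kb" as 100 * 1024 bytes is an assumption.

```python
import urllib.request

BASE_URL = "https://klc.vdu.lt/gaelic_tagger/"
MAX_BYTES = 100 * 1024  # assumption: "100 kb maximum" read as 102400 bytes


def build_request(data, endpoint="tagger"):
    """Build the POST request that the cURL command sends (raw bytes, text/plain)."""
    if len(data) > MAX_BYTES:
        raise ValueError("text is larger than the 100 kb limit")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=data,
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def annotate_file(in_path, out_path, endpoint="tagger"):
    """Send in_path to the service and write the reply to out_path."""
    with open(in_path, "rb") as source:  # bytes, like --data-binary
        data = source.read()
    with urllib.request.urlopen(build_request(data, endpoint), timeout=60) as reply:
        body = reply.read()
    with open(out_path, "wb") as target:
        target.write(body)


if __name__ == "__main__":
    annotate_file("text.txt", "text.ann.txt")  # or endpoint="simple_tagger"
```

Passing endpoint="simple_tagger" corresponds to the second cURL command.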

Is the web service running?

Try the following command:

curl https://klc.vdu.lt/gaelic_tagger/hello

If you get an answer, the server is running.
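The same check can be sketched in Python. This is an unofficial helper with a hypothetical name; it assumes that a successful HTTP response counts as "an answer" and treats connection errors, timeouts and HTTP error statuses as "not running".

```python
import urllib.request

HELLO_URL = "https://klc.vdu.lt/gaelic_tagger/hello"


def server_is_running(url=HELLO_URL, opener=urllib.request.urlopen, timeout=10):
    """Return True if the hello endpoint answers with a 2xx status."""
    try:
        with opener(url, timeout=timeout) as reply:
            return 200 <= reply.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    print("running" if server_is_running() else "no answer")
```

The opener parameter exists only so that the function can be tried without network access by passing a stand-in for urlopen.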

What are the different analyser options with cURL?

Tagger with light vertical output: curl -X POST -H "Content-Type: text/plain" --data-binary "@text.txt" https://klc.vdu.lt/gaelic_tagger/tagger -o "text-ann.txt"

Tagger with CoNLL-U format output: curl -X POST -H "Content-Type: text/plain" --data-binary "@text.txt" https://klc.vdu.lt/gaelic_tagger/tagger_conllu -o "text-ann.conllu"

Tagger with simplified tagset, light vertical output only: curl -X POST -H "Content-Type: text/plain" --data-binary "@text.txt" https://klc.vdu.lt/gaelic_tagger/simple_tagger -o "text-ann.txt"

Tagger and parser with light vertical output: curl -X POST -H "Content-Type: text/plain" --data-binary "@text.txt" https://klc.vdu.lt/gaelic_tagger/parser -o "text-ann.txt"

Tagger and parser with CoNLL-U format output: curl -X POST -H "Content-Type: text/plain" --data-binary "@text.txt" https://klc.vdu.lt/gaelic_tagger/parser -o "text-ann.conllu"

What is the CoNLL-U format?

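CoNLL-U is a plain-text, tab-separated format used for linguistic annotation. Each word occupies one line, its annotations are split into ten tab-separated columns, and sentences are separated by blank lines.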
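As a rough illustration of how such a file can be read, here is a minimal Python sketch. The ten column names follow the CoNLL-U specification; the sample text is hand-built (with "_" wherever this illustration leaves a column empty) and is not analyser output, and the function name is hypothetical.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hand-built sample, NOT analyser output; "_" marks an unfilled column.
SAMPLE = (
    "# text = Chunnaic an cù cat\n"
    "1\tChunnaic\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "2\tan\t_\t_\t_\t_\t3\tdet\t_\t_\n"
    "3\tcù\t_\t_\t_\t_\t1\tnsubj\t_\t_\n"
    "4\tcat\t_\t_\t_\t_\t1\tobj\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Return a list of sentences, each a list of {column name: value} dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # comment or metadata line
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(FIELDS, columns)))
    if current:  # file did not end with a blank line
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["id"], token["form"], token["head"], token["deprel"])
```

All values are kept as strings; convert head to an integer yourself if you need to follow the dependency links.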