Text Annotation · NLP with R and UDPipe

udpipe models

Pre-trained models

Before you can start on performing the annotation, you need a model. Pre-trained models build on Universal Dependencies treebanks are made available for more than 65 languages based on 101 treebanks, namely:

afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb.

For R users who want to use these open-sourced models provided by the UDPipe community and start on tagging, you can proceed as follows to download the model of the language of your choice.

library(udpipe)
dl <- udpipe_download_model(language = "dutch")
str(dl)

Build your own annotator models

The udipe R package also allows you to easily train your own models, based on data in CONLL-U format, so that you can use these for your own commercial or non-commercial purposes. This is described in the other vignette of this package which you can view by the command vignette("udpipe-train", package = "udpipe") `

Annotate text

Currently the package allows you to do tokenisation, tagging, lemmatization and dependency parsing with one convenient function called udpipe_annotate. This goes as follows.

Load the model

First load the model which you have downloaded or which you have stored somewhere on disk.

## Either give a file in the current working directory
udmodel_dutch <- udpipe_load_model(file = "dutch-alpino-ud-2.5-191206.udpipe")
## Or give the full path to the file 
udmodel_dutch <- udpipe_load_model(file = dl$file_model)

Annotate your text

Tokenisation, tagging and parsing

Once you have this model, you can start on annotating. Provide a vector of text and use udpipe_annotate. The resulting tagged output is in CONLL-U format as described at http://universaldependencies.org/format.html. You can put this in a data.frame format with as.data.frame.

txt <- c("Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt? Jazeker meneer", 
         "Het gaat vooruit, het gaat verbazend goed vooruit")
x <- udpipe_annotate(udmodel_dutch, x = txt)
x <- as.data.frame(x)
str(x)
table(x$upos)

Only part of the annotation

Mark that by default udpipe_annotate does Tokenization, Parts of Speech Tagging, Lemmatization and Dependency parsing. If you want to gain some time because you require only a part of the annotation, you can specify to leave parts of the annotation out. This is done as follows.

## Tokenization + finds sentences, does not execute POS tagging, nor lemmatization or dependency parsing
x <- udpipe_annotate(udmodel_dutch, x = txt, tagger = "none", parser = "none")
x <- as.data.frame(x)
table(x$upos)
table(x$dep_rel)

## Tokenization + finds sentences, does POS tagging and lemmatization but does not execute dependency parsing
x <- udpipe_annotate(udmodel_dutch, x = txt, tagger = "default", parser = "none")
x <- as.data.frame(x)
table(x$upos)
table(x$dep_rel)

## Tokenization + finds sentences and executes dependency parsing but does not do POS tagging nor lemmatization
x <- udpipe_annotate(udmodel_dutch, x = txt, tagger = "none", parser = "default")
x <- as.data.frame(x)
table(x$upos)
table(x$dep_rel)

My text data is already tokenised

If your data is already tokenised according to your needs using other tools like the tidytext / tokenizers / text2vec R packages or any other external software or just by manual work. You can still use udpipe to do parts of speech annotation and dependency parsing and skip the tokenisation. This is done as follows.

## Either put every token on a new line and use tokenizer: vertical
input <- list(doc1 = c("Ik", "ben", "de", "weg", "kwijt", ",", "kunt", "u", "me", "zeggen", 
                       "waar", "de", "Lange Wapper", "ligt", "?", "Jazeker", "meneer"),
              doc2 = c("Het", "gaat", "vooruit", ",", "het", "gaat", "verbazend", "goed", "vooruit"))
txt <- sapply(input, FUN=function(x) paste(x, collapse = "\n"))
x <- udpipe_annotate(udmodel_dutch, x = txt, tokenizer = "vertical")
x <- as.data.frame(x)

## Or put every token of each document in 1 string separated by a space and use tokenizer: horizontal
##   Mark that if a token contains a space, you need to replace the space 
##   with the 'NO-BREAK SPACE' (U+00A0) character to make sure it is still considered as one token
txt <- sapply(input, FUN=function(x){
  x <- gsub(" ", intToUtf8(160), x) ## replace space with no-break-space
  paste(x, collapse = " ")
})
x <- udpipe_annotate(udmodel_dutch, x = as.character(txt), tokenizer = "horizontal")
x <- as.data.frame(x)

Remarks

Some remarks:

If your model is not trained to be able to do parsing/tagging, you can not request it to do parsing/tagging
Use argument doc_id to udpipe_annotate so that you can link your document to the tagged terms later on
Your text has to be in UTF-8 Encoding when you pass it to udpipe_annotate, if you don't have that Encoding use standard R facilities like iconv to convert it to UTF-8. You get also results in UTF-8 encoding back.

dl <- udpipe_download_model(language = "sanskrit", udpipe_model_repo = "jwijffels/udpipe.models.ud.2.0")
udmodel_sanskrit <- udpipe_load_model(file = dl$file_model)
txt <- "ततः असौ प्राह क्षत्रियस्य तिस्रः भार्या धर्मम् भवन्ति तत् एषा कदाचिद् वैश्या सुता भविष्यति तत् अनुरागः ममास्याम् ततः रथकारः तस्य निश्चयम् विज्ञायावदत् वयस्य किम् अ धुना कर्तव्यम् कौलिकः आह किम् अहम् जानामि त्वयि मित्रे यत् अभिहितं मया ततः"
x <- udpipe_annotate(udmodel_sanskrit, x = txt)
Encoding(x$conllu)
x <- as.data.frame(x)

If you want to work with other tools which are capable of handling CONLL-U format, just export your annotation to a file as shown below

x <- udpipe_annotate(udmodel_sanskrit, x = txt)
cat(x$conllu, file = "myannotation.conllu")

Support in text mining

Need support in text mining. Contact BNOSAC: http://www.bnosac.be