• Home
  • Docs
  • Support
  • GitHub
  • Blog

›Introduction

Introduction

  • Try it out
  • Introduction

Annotation

  • Text Annotation
  • Model building
  • Parallel Annotation

Analytical use cases

  • Use Cases I
  • Use Cases II
  • Use Cases III

Universe

  • udpipe universe

MoreDocs

  • More docs

Try it out

Install the R package.

install.packages("udpipe")

Example

Get your language model and start annotating.

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(udmodel, x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x, detailed = TRUE)
x

Or just do as follows.

library(udpipe)
x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.", 
            object = "dutch")

The annotation returns paragraphs, sentences, tokens, the location of the token in the original text, morphology elements like the lemma, the universal part of speech tag and the treebank-specific parts of speech tag, morphosyntactic features and returns as well the dependency relationship. More information at https://universaldependencies.org/guidelines.html

 doc_id paragraph_id sentence_id start end term_id token_id     token     lemma  upos                                        xpos                               feats head_token_id      dep_rel            misc
   doc1            1           1     1   2       1        1        Ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             2        nsubj            <NA>
   doc1            1           1     4   7       2        2      ging      gaan  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             0         root            <NA>
   doc1            1           1     9  10       3        3        op        op   ADP                                     VZ|init                                <NA>             4         case            <NA>
   doc1            1           1    12  15       4        4      reis      reis  NOUN                  N|soort|ev|basis|zijd|stan              Gender=Com|Number=Sing             2          obl            <NA>
   doc1            1           1    17  18       5        5        en        en CCONJ                                    VG|neven                                <NA>             7           cc            <NA>
   doc1            1           1    20  21       6        6        ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             7        nsubj            <NA>
   doc1            1           1    23  25       7        7       nam     nemen  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             2         conj            <NA>
   doc1            1           1    27  29       8        8       mee       mee   ADP                                      VZ|fin                                <NA>             7 compound:prt   SpaceAfter=No
   doc1            1           1    30  30       9        9         :         : PUNCT                                         LET                                <NA>             7        punct            <NA>
...

A small note on encodings

Mark that it is important that the x argument to udpipe_annotate is in UTF-8 encoding. You can check the encoding of your text with Encoding('your text'). You can convert your text to UTF-8, using standard R utilities: as in iconv('your text', from = 'latin1', to = 'UTF-8') where you replace the from part with whichever encoding you have your text in, possible your computers default as defined in localeToCharset(). So annotation would look something like this if your text is not already in UTF-8 encoding:

  • udpipe_annotate(udmodel, x = iconv('your text', to = 'UTF-8')) if your text is in the encoding of the current locale of your computer.
  • udpipe_annotate(udmodel, x = iconv('your text', from = 'latin1', to = 'UTF-8')) if your text is in latin1 encoding.
  • udpipe_annotate(udmodel, x = iconv('your text', from = 'CP949', to = 'UTF-8')) if your text is in CP949 encoding.
UDPipe Introduction →
NLP with R and UDPipe
Docs
Getting StartedIntroductionNatural Language Processing
Community
User ShowcaseReport issue
More
BlogGitHubStar
BNOSAC Open Source
Copyright © 2020 BNOSAC