R: DTM with n-gram tokenizer plus dictionary broken in Ubuntu?


I am creating a document-term matrix with a dictionary and n-gram tokenization. It works on my Windows 7 laptop, but not on a similarly configured Ubuntu 14.04.2 server. Update: it also works on a CentOS server.

    library(tm)
    library(RWeka)
    library(SnowballC)

    # tokenize into unigrams and bigrams; fall back to plain words if nothing comes back
    newbigramtokenizer = function(x) {
      tokenizer1 = RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 2))
      if (length(tokenizer1) != 0L) {
        return(tokenizer1)
      } else return(WordTokenizer(x))
    }

    textvect <- c("this story girl",
                  "this story boy",
                  "a boy , girl went store",
                  "a store place buy things",
                  "you can buy things boy or girl",
                  "the word store can verb meaning position later use")

    textvect <- iconv(textvect, to = "utf-8")
    textsource <- VectorSource(textvect)
    textcorp <- Corpus(textsource)

    textdict <- c("boy", "girl", "store", "story about")
    textdict <- iconv(textdict, to = "utf-8")

    # ok
    dtm <- DocumentTermMatrix(textcorp, control=list(dictionary=textdict))

    # ok on windows laptop
    # freezes or generates error on ubuntu server
    dtm <- DocumentTermMatrix(textcorp, control=list(tokenize=newbigramtokenizer,
                                                     dictionary=textdict))

Error on the Ubuntu server (at the last line of the source example):

    /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
    Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
      'i, j' invalid
    In addition: Warning messages:
    1: In mclapply(unname(content(x)), termFreq, control) :
      scheduled core 1 encountered error in user code, all values of the job will be affected
    2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
      NAs introduced by coercion

I have already tried the suggestions in "Twitter Data Analysis - Error in Term Document Matrix" and "Error in simple_triplet_matrix -- unable to use RWeka to count Phrases".

I had thought the problem could be attributed to one of the following, but the script runs fine on a CentOS server with the same locales and JVM as the problematic Ubuntu server.

  • the locales
  • the minor difference in jvms
  • the parallel library? mclapply is mentioned in the error message, and parallel is listed in the session info (for all systems, though).
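One way to test the third suspicion is to force single-core execution before building the matrix. The sketch below assumes that tm calls `mclapply` with its default `mc.cores` argument, which `parallel` reads from `getOption("mc.cores", 2L)`. The warning above suggests this, but I have not confirmed it for every tm version. The toy `mclapply` call only shows that the option takes effect. The commented line is where the failing call from the question would go.

```r
library(parallel)

# Assumption: tm does not pass mc.cores itself, so this option controls it.
old <- options(mc.cores = 1L)

# With one core, mclapply runs the jobs sequentially in the current process.
res <- mclapply(1:3, function(i) i * 2)

# The failing call from the question would go here, e.g.:
# dtm <- DocumentTermMatrix(textcorp, control = list(tokenize = newbigramtokenizer,
#                                                    dictionary = textdict))

# Restore the previous setting afterwards.
options(old)
```

If the call succeeds with one core, the interaction between forked workers and the JVM started by rJava would be the likely culprit, rather than the locales or the JVM build.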

Here are the environments:

    R version 3.1.2 (2014-10-31)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    PS C:\> java -version
    Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
    java version "1.7.0_72"
    Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] RWeka_0.4-23 tm_0.6       NLP_0.1-5

    loaded via namespace (and not attached):
    [1] grid_3.1.2         parallel_3.1.2     rJava_0.9-6        RWekajars_3.7.11-1 slam_0.1-32
    [6] tools_3.1.2

    R version 3.1.2 (2014-10-31)
    Platform: x86_64-pc-linux-gnu (64-bit)

    $ java -version
    java version "1.7.0_79"
    OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
    OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

    locale:
     [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8
     [4] LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8
    [10] LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] RWeka_0.4-23 tm_0.6       NLP_0.1-5

    loaded via namespace (and not attached):
    [1] grid_3.1.2         parallel_3.1.2     rJava_0.9-6        RWekajars_3.7.11-1 slam_0.1-32
    [6] tools_3.1.2

    R version 3.2.0 (2015-04-16)
    Platform: x86_64-redhat-linux-gnu (64-bit)
    Running under: CentOS Linux 7 (Core)

    $ java -version
    java version "1.7.0_79"
    OpenJDK Runtime Environment (rhel-2.5.5.1.el7_1-x86_64 u79-b14)
    OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

    locale:
     [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
     [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
    [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] RWeka_0.4-24 tm_0.6-2     NLP_0.1-8

    loaded via namespace (and not attached):
    [1] parallel_3.2.0     tools_3.2.0        slam_0.1-32        grid_3.2.0
    [5] rJava_0.9-6        RWekajars_3.7.12-1

If you prefer something simpler but no less flexible or powerful, how about trying out the quanteda package? It can make quick work of your dictionary and bigram task in three lines:

    # or: devtools::install_github("kbenoit/quanteda")
    require(quanteda)

    # use dictionary() to construct a dictionary from a named list
    textdict <- dictionary(list(mydict = c("boy", "girl", "store", "story about")))

    # convert to a document-feature matrix, 1grams + 2grams, apply dictionary
    dfm(tolower(textvect), dictionary = textdict, ngrams = 1:2, concatenator = " ")
    ## Document-feature matrix of: 6 documents, 1 feature.
    ## 6 x 1 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    mydict
    ##   text1      2
    ##   text2      2
    ##   text3      3
    ##   text4      1
    ##   text5      2
    ##   text6      1

    # alternative: treat the dictionary as a thesaurus of synonyms,
    # not exclusive in feature selection, unlike a dictionary
    dfm.all <- dfm(tolower(textvect), thesaurus = textdict,
                   ngrams = 1:2, concatenator = " ", verbose = FALSE)
    topfeatures(dfm.all)
    ##       mydict   boy  girl              story   about
    ##     11      11       3       3       3       3       3       2       2       2
    sort(dfm.all)[1:6, 1:12]
    ## Document-feature matrix of: 6 documents, 12 features.
    ## 6 x 12 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    mydict boy girl is a story about buy
    ##   text1 2      2     0      1  1    1  0       1     1       1    0   0
    ##   text2 2      2     1      0  1    1  0       1     1       1    0   0
    ##   text3 2      3     1      1  0    0  1       0     0       0    0   0
    ##   text4 2      1     0      0  1    1  1       0     0       0    0   1
    ##   text5 2      2     1      1  0    0  0       0     0       0    1   1
    ##   text6 1      1     0      0  0    0  1       0     0       0    1   0
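If you would rather stay with tm, another option is to take Java out of the picture with a tokenizer written in plain R. The sketch below is a hypothetical helper (`simple_ngram_tokenizer` is my own name, not part of tm or RWeka) that splits on whitespace and returns unigrams plus bigrams. It assumes that `as.character()` on the document yields its text, and I have not tested it as a `tokenize=` control against the failing Ubuntu setup.

```r
# Hypothetical helper: unigram + bigram tokenizer in base R (no JVM involved).
simple_ngram_tokenizer <- function(x, n_max = 2L) {
  # Split the text on runs of whitespace and drop empty pieces.
  words <- unlist(strsplit(as.character(x), "[[:space:]]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0L) return(character(0))

  out <- character(0)
  for (n in seq_len(min(n_max, length(words)))) {
    # Every window of n consecutive words becomes one token.
    starts <- seq_len(length(words) - n + 1L)
    grams <- vapply(starts,
                    function(i) paste(words[i:(i + n - 1L)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

tokens <- simple_ngram_tokenizer("this story girl")
# tokens: "this" "story" "girl" "this story" "story girl"
```

Unlike the RWeka tokenizer it does no punctuation handling, so a stray "," stays in as a token. That seemed an acceptable trade-off for a diagnostic run.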
