R: dtm with ngram tokenizer plus dictionary broken in Ubuntu?
I am creating a document-term matrix with a dictionary and ngram tokenization. It works on my Windows 7 laptop, but not on a similarly configured Ubuntu 14.04.2 server. UPDATE: it also works on a CentOS server.
    library(tm)
    library(RWeka)
    library(SnowballC)

    newBigramTokenizer = function(x) {
      tokenizer1 = RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 2))
      if (length(tokenizer1) != 0L) {
        return(tokenizer1)
      } else return(WordTokenizer(x))
    }

    textvect <- c("this is a story about a girl",
                  "this is a story about a boy",
                  "a boy and a girl went to the store",
                  "a store is a place to buy things",
                  "you can buy things if you are a boy or a girl",
                  "the word store can also be a verb meaning to position something for later use")
    textvect <- iconv(textvect, to = "UTF-8")

    textsource <- VectorSource(textvect)
    textcorp <- Corpus(textsource)

    textdict <- c("boy", "girl", "store", "story about")
    textdict <- iconv(textdict, to = "UTF-8")

    # OK
    dtm <- DocumentTermMatrix(textcorp, control = list(dictionary = textdict))

    # OK on the Windows laptop
    # freezes or generates an error on the Ubuntu server
    dtm <- DocumentTermMatrix(textcorp,
                              control = list(tokenize = newBigramTokenizer,
                                             dictionary = textdict))
The error on the Ubuntu server (at the last line of the source example above):
    /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
    Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
      'i, j' invalid
    In addition: Warning messages:
    1: In mclapply(unname(content(x)), termFreq, control) :
      scheduled core 1 encountered error in user code, all values of the job will be affected
    2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
      NAs introduced by coercion
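As a quick diagnostic sketch (the sample string below is arbitrary), the Weka tokenizer can be called directly on the Ubuntu server, outside of tm; if this also fails, the problem is on the rJava/JVM side rather than in tm:

    library(RWeka)

    # a direct call to the Java-backed tokenizer, bypassing tm entirely
    RWeka::NGramTokenizer("a quick check of the weka tokenizer",
                          RWeka::Weka_control(min = 1, max = 2))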
I have tried some of the suggestions in Twitter Data Analysis - Error in Term Document Matrix and Error in simple_triplet_matrix -- unable to use RWeka to count phrases.
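Another route, which would take Java out of the picture entirely, is the n-gram tokenizer recipe from the tm FAQ built on NLP::ngrams() instead of RWeka. Here is a sketch adapted to unigrams plus bigrams, reusing textcorp and textdict from the code above (the function name is mine, and I have not verified this on the Ubuntu server):

    library(tm)
    library(NLP)

    # build 1-grams and 2-grams with NLP::ngrams() so that no JVM is involved
    # while DocumentTermMatrix() tokenizes the documents
    unigramBigramTokenizer <- function(x) {
      w <- words(x)  # word tokens of the text document
      unlist(lapply(1:2, function(k)
        vapply(ngrams(w, k), paste, character(1), collapse = " ")),
        use.names = FALSE)
    }

    dtm <- DocumentTermMatrix(textcorp,
                              control = list(tokenize = unigramBigramTokenizer,
                                             dictionary = textdict))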
I had thought the problem could be attributed to one of the following, but the script runs on a CentOS server with the same locales and JVM as the problematic Ubuntu server:
- the locales
- the minor difference in JVMs
- the parallel library? mclapply is mentioned in the error message, and parallel is listed in the session info (for all systems, though). A single-core check is sketched after this list.
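Since parallel::mclapply() defaults to mc.cores = getOption("mc.cores", 2L), the single-core check mentioned above would look roughly like this, reusing the objects from the code above (a diagnostic sketch only, not a fix for whatever is wrong on the Java side):

    # force tm's internal mclapply() call onto a single core, so the Java-backed
    # tokenizer never runs inside a forked child process
    options(mc.cores = 1)

    dtm <- DocumentTermMatrix(textcorp,
                              control = list(tokenize = newBigramTokenizer,
                                             dictionary = textdict))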
Here are the environments:
Windows 7 laptop (works):

    R version 3.1.2 (2014-10-31)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    PS C:\> java -version
    Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
    java version "1.7.0_72"
    Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] RWeka_0.4-23 tm_0.6       NLP_0.1-5

    loaded via a namespace (and not attached):
    [1] grid_3.1.2         parallel_3.1.2     rJava_0.9-6        RWekajars_3.7.11-1 slam_0.1-32
    [6] tools_3.1.2
Ubuntu 14.04.2 server (fails):

    R version 3.1.2 (2014-10-31)
    Platform: x86_64-pc-linux-gnu (64-bit)

    $ java -version
    java version "1.7.0_79"
    OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
    OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
     [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=en_US.UTF-8        LC_ADDRESS=en_US.UTF-8
    [10] LC_TELEPHONE=en_US.UTF-8   LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] RWeka_0.4-23 tm_0.6       NLP_0.1-5

    loaded via a namespace (and not attached):
    [1] grid_3.1.2         parallel_3.1.2     rJava_0.9-6        RWekajars_3.7.11-1 slam_0.1-32
    [6] tools_3.1.2
CentOS 7 server (works):

    R version 3.2.0 (2015-04-16)
    Platform: x86_64-redhat-linux-gnu (64-bit)
    Running under: CentOS Linux 7 (Core)

    $ java -version
    java version "1.7.0_79"
    OpenJDK Runtime Environment (rhel-2.5.5.1.el7_1-x86_64 u79-b14)
    OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=en_US.UTF-8
     [9] LC_ADDRESS=en_US.UTF-8     LC_TELEPHONE=en_US.UTF-8
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] RWeka_0.4-24 tm_0.6-2     NLP_0.1-8

    loaded via a namespace (and not attached):
    [1] parallel_3.2.0     tools_3.2.0        slam_0.1-32        grid_3.2.0
    [5] rJava_0.9-6        RWekajars_3.7.12-1
If you would prefer something simpler but no less flexible or powerful, how about trying out the quanteda package? It can make quick work of the dictionary and bigram task in three lines:
    install.packages("quanteda")
    # or: devtools::install_github("kbenoit/quanteda")
    require(quanteda)

    # use dictionary() to construct a dictionary from a named list
    textdict <- dictionary(list(mydict = c("boy", "girl", "store", "story about")))

    # convert to a document-feature matrix, 1-grams + 2-grams, and apply the dictionary
    dfm(tolower(textvect), dictionary = textdict, ngrams = 1:2, concatenator = " ")
    ## Document-feature matrix of: 6 documents, 1 feature.
    ## 6 x 1 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    mydict
    ##   text1      2
    ##   text2      2
    ##   text3      3
    ##   text4      1
    ##   text5      2
    ##   text6      1

    # an alternative is to treat the dictionary as a thesaurus of synonyms,
    # which is not exclusive in feature selection the way a dictionary is
    dfm.all <- dfm(tolower(textvect), thesaurus = textdict, ngrams = 1:2,
                   concatenator = " ", verbose = FALSE)

    topfeatures(dfm.all)
    ## mydict boy girl story about
    ##     11  11   3    3     3  3  3  2  2  2

    sort(dfm.all)[1:6, 1:12]
    ## Document-feature matrix of: 6 documents, 12 features.
    ## 6 x 12 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    mydict boy girl is a story about buy
    ##   text1 2 2 0 1 1 1 0 1 1 1 0 0
    ##   text2 2 2 1 0 1 1 0 1 1 1 0 0
    ##   text3 2 3 1 1 0 0 1 0 0 0 0 0
    ##   text4 2 1 0 0 1 1 1 0 0 0 0 1
    ##   text5 2 2 1 1 0 0 0 0 0 0 1 1
    ##   text6 1 1 0 0 0 0 1 0 0 0 1 0