sequence - R - Compute Mismatch By Group
I am wondering how to compute mismatching cases by group.
Let's imagine this data:
    sek = rbind(c(1, 'a', 'a', 'a'),
                c(1, 'a', 'a', 'a'),
                c(2, 'b', 'b', 'b'),
                c(2, 'c', 'b', 'b'))
    colnames(sek) <- c('group', paste('t', 1:3, sep = ''))

The data look like this:

         group t1  t2  t3
    [1,] "1"   "a" "a" "a"
    [2,] "1"   "a" "a" "a"
    [3,] "2"   "b" "b" "b"
    [4,] "2"   "c" "b" "b"

The goal is to get:

    group 1 : 0
    group 2 : 1

It would be fancy to use the stringdist library to compute this.
Something like:

    seqdistgroupstr = function(x) stringdistmatrix(x, method = 'hamming')

    sek %>%
      as.data.frame() %>%
      group_by(group) %>%
      seqdistgroupstr()

but it is not working.
Any ideas?
Quick update: how would one solve the question of weights? For example, how can I pass an argument, a value (1, 2, 3, ...), when setting the mismatch between two characters? A mismatch between b and c would cost 2 while a mismatch between a and c would cost 1, and so on.
The code below gives the number of mismatches by group, where a mismatch is defined as one less than the number of unique values in each column (t1, t2, etc.) for each level of group. I think you would need to bring in a string distance measure only if you need more than a binary measure of mismatch, but a binary measure suffices for the example you gave. Also, if all you want is the number of distinct rows in each group, then @alex's solution is more concise.
    library(dplyr)
    library(reshape2)

    sek %>%
      as.data.frame %>%
      melt(id.vars = "group") %>%
      group_by(group, variable) %>%
      summarise(mismatch = length(unique(value)) - 1) %>%
      group_by(group) %>%
      summarise(mismatch = sum(mismatch))

      group mismatch
    1     1        0
    2     2        1

Here's a shorter dplyr method to count individual mismatches. It doesn't require reshaping, but it requires other data gymnastics:
    sek %>%
      as.data.frame %>%
      group_by(group) %>%
      summarise_each(funs(length(unique(.)) - 1)) %>%
      mutate(mismatch = rowSums(.[-1])) %>%
      select(-matches("^t[1-3]$"))
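For the follow-up about weights, here is a minimal base R sketch, not a definitive answer. It assumes that "mismatch" means a weighted Hamming distance summed over all pairs of rows within a group, which differs from the unique-values count above once a group has more than two rows. The names `cost`, `weighted_hamming` and `group_mismatch` are made up for illustration, and the a/b cost of 1 is an assumption because the question only gives the b/c and a/c costs. As far as I know, the `weight` argument in stringdist weights edit operations rather than specific character pairs, so this sketch does not use that package.

```r
# Example data from the question.
sek <- rbind(c(1, 'a', 'a', 'a'),
             c(1, 'a', 'a', 'a'),
             c(2, 'b', 'b', 'b'),
             c(2, 'c', 'b', 'b'))
colnames(sek) <- c('group', paste('t', 1:3, sep = ''))

# Hypothetical symmetric cost matrix with a zero diagonal.
states <- c('a', 'b', 'c')
cost <- matrix(0, 3, 3, dimnames = list(states, states))
cost['a', 'b'] <- cost['b', 'a'] <- 1  # assumed, not given in the question
cost['a', 'c'] <- cost['c', 'a'] <- 1  # from the question
cost['b', 'c'] <- cost['c', 'b'] <- 2  # from the question

# Weighted Hamming distance: look up the cost of each position-wise pair
# by indexing the cost matrix with a two-column character matrix.
weighted_hamming <- function(x, y, cost) sum(cost[cbind(x, y)])

# Sum the weighted distance over every pair of rows in one group.
group_mismatch <- function(m, cost) {
  if (nrow(m) < 2) return(0)
  pairs <- combn(seq_len(nrow(m)), 2)
  sum(apply(pairs, 2, function(p) {
    weighted_hamming(m[p[1], ], m[p[2], ], cost)
  }))
}

seqs <- sek[, -1, drop = FALSE]
weighted <- sapply(split(seq_len(nrow(sek)), sek[, 'group']),
                   function(idx) group_mismatch(seqs[idx, , drop = FALSE], cost))
weighted
# 1 2
# 0 2
```

With a cost of 1 for every off-diagonal pair, the same functions return 0 and 1 for the two groups, which matches the unweighted result asked for in the question.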