R - Remove duplicate rows
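A similar question has already been answered here, but I could not make the answer work for my case.
I have the data frame below and want to remove duplicate rows based on symbol, using the call column to decide which duplicate to keep. The priority is p > a > m: if a symbol has p, a, and m, keep p; if it has a and m, keep a; otherwise keep m.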
       symbol  intensity call
    1    ddr1  596.95050    p
    2    rfc2  420.28708    p
    3   hspa6  510.73254    p
    4    ddr1 1717.99487    a
    5  guca1a  121.53488    a
    6    uba7 1810.49780    p
    7    uba7  301.51944    m
    8  guca1a   34.53987    a
    9    ccl5 5966.24609    p
    10 cyp2e1   95.15707    a
    11 cyp2e1  164.95276    m
    12  esrra 1024.88745    p
    13 cyp2a6  502.48877    a
    14   gas6  921.70923    p
    15  mmp14  524.96863    a
    16   gas6 3069.48462    p
    17   fntb  266.77686    a
    18   pld1  187.65569    a
    19   pld1 1891.04541    p
    20   pld1  258.79028    m

I tried the code I found here:

    library(data.table)
    setDT(df)[, list(call = call[which.min(factor(call, levels = c('p', 'a', 'm')))]), .(symbol)]

but it removes the second column, intensity. Please help, and please make sure the code is as fast as possible. Thanks.
Expected output:

       symbol  intensity call
    1    ddr1  596.95050    p
    2    rfc2  420.28708    p
    3   hspa6  510.73254    p
    5  guca1a  121.53488    a
    6    uba7 1810.49780    p
    9    ccl5 5966.24609    p
    10 cyp2e1   95.15707    a
    12  esrra 1024.88745    p
    13 cyp2a6  502.48877    a
    14   gas6  921.70923    p
    15  mmp14  524.96863    a
    17   fntb  266.77686    a
    19   pld1 1891.04541    p
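
For anyone who wants to run the code in this thread, the listing above can be rebuilt as a data frame. This is only a minimal sketch: the symbol and intensity values are copied from the question, while the call values are an assumption wherever the listing does not show p or m. Those are set to a, which is what the expected output implies.

```r
# Rebuild the example data from the question as a plain data.frame.
# Assumption: every call that is not shown as "p" or "m" is taken to be "a".
df <- data.frame(
  symbol = c("ddr1", "rfc2", "hspa6", "ddr1", "guca1a", "uba7", "uba7",
             "guca1a", "ccl5", "cyp2e1", "cyp2e1", "esrra", "cyp2a6",
             "gas6", "mmp14", "gas6", "fntb", "pld1", "pld1", "pld1"),
  intensity = c(596.95050, 420.28708, 510.73254, 1717.99487, 121.53488,
                1810.49780, 301.51944, 34.53987, 5966.24609, 95.15707,
                164.95276, 1024.88745, 502.48877, 921.70923, 524.96863,
                3069.48462, 266.77686, 187.65569, 1891.04541, 258.79028),
  call = c("p", "p", "p", "a", "a", "p", "m", "a", "p", "a",
           "m", "p", "a", "p", "a", "p", "a", "a", "p", "m"),
  stringsAsFactors = FALSE
)

# Quick look at the structure: 20 rows, 3 columns.
str(df)
```

The data.table and dplyr snippets in the answer can then be run against this df unchanged.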
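You can either use order (in the i position) to order by the "call" column, converting it to a factor with the levels specified in the correct order, and then subset the first observation (.SD[1L]), grouped by 'symbol':

    library(data.table)
    # Sort by call priority (p, then a, then m), then keep the first row per symbol
    setDT(df)[order(factor(call, levels = c('p', 'a', 'm'))), .SD[1L], by = symbol]

Or, modifying your code, instead of list(call = ..) you can use .SD to subset the rows:

    # Within each symbol, keep the whole row whose call has the highest priority
    setDT(df)[, .SD[which.min(factor(call, levels = c('p', 'a', 'm')))], .(symbol)]

An option using dplyr is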
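
    library(dplyr)
    # Sort by call priority, then keep the first row of each symbol group
    df %>%
      group_by(symbol) %>%
      arrange(factor(call, levels = c('p', 'a', 'm'))) %>%
      slice(1L)

Or use which.min within slice:

    # Pick the highest-priority row per symbol directly, without sorting
    df %>%
      group_by(symbol) %>%
      slice(which.min(factor(call, levels = c('p', 'a', 'm'))))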
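
The same idea can also be written in base R, which is not part of the answer above but may be handy when neither package is available. The sketch below ranks each call with match(), orders the rows by that rank, keeps the first row seen per symbol with duplicated(), and then restores the original row order so the result lines up with the expected output. The data frame is rebuilt here so the block runs on its own; the a calls are an assumption, chosen to agree with the expected output in the question. No speed claim is made for this version.

```r
# Example data from the question.
# Assumption: calls not shown as "p" or "m" in the listing are "a".
df <- data.frame(
  symbol = c("ddr1", "rfc2", "hspa6", "ddr1", "guca1a", "uba7", "uba7",
             "guca1a", "ccl5", "cyp2e1", "cyp2e1", "esrra", "cyp2a6",
             "gas6", "mmp14", "gas6", "fntb", "pld1", "pld1", "pld1"),
  intensity = c(596.95050, 420.28708, 510.73254, 1717.99487, 121.53488,
                1810.49780, 301.51944, 34.53987, 5966.24609, 95.15707,
                164.95276, 1024.88745, 502.48877, 921.70923, 524.96863,
                3069.48462, 266.77686, 187.65569, 1891.04541, 258.79028),
  call = c("p", "p", "p", "a", "a", "p", "m", "a", "p", "a",
           "m", "p", "a", "p", "a", "p", "a", "a", "p", "m"),
  stringsAsFactors = FALSE
)

# Priority rank of each call: p = 1, a = 2, m = 3.
call_rank <- match(df$call, c("p", "a", "m"))

# Row indices sorted by priority; tied rows keep their original order.
idx <- order(call_rank)

# Walking the rows in priority order, keep the first row seen for each symbol.
keep <- idx[!duplicated(df$symbol[idx])]

# Put the kept rows back into their original order.
result <- df[sort(keep), ]
print(result)
```

When two rows of the same symbol share the best call (as with gas6), this keeps the one that appears first in the data, which is also what the expected output shows.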