R - Spread data over an interval
I have a data table with a start date and an end date that I want to reshape so that it repeats the information for each single date in the interval between the start and end date.
My data is as follows:

        tripstart    tripend country
    1: 2014-10-07 2014-10-10      us
    2: 2013-06-12 2013-06-13      fr
    3: 2013-02-07 2013-02-10      dk

Based on this data, the result I am looking for is similar to:

    day        country
    2014-10-10 us
    2014-10-09 us
    2014-10-08 us
    2014-10-07 us
    2013-06-13 fr
    2013-06-12 fr
    2013-02-10 dk
    2013-02-09 dk
    2013-02-08 dk
    2013-02-07 dk

I tried the following without success:

    setkey(hotel_stays, tripstart, tripend)
    # The first date is used as the transaction date.
    max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
    min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)
    # One row per calendar day between the earliest and latest date.
    hotel_stays_long <- data.table(day = seq.Date(min_date, to = max_date, by = "day"))
    setkey(hotel_stays_long, day)
    foverlaps(hotel_stays, hotel_stays_long)

R code for the data:

    library(data.table)

    hotel_stays <- data.table(
      tripstart = c(as.Date("2014-10-07"), as.Date("2013-06-12"), as.Date("2013-02-07")),
      tripend   = c(as.Date("2014-10-10"), as.Date("2013-06-13"), as.Date("2013-02-10")),
      country   = c("us", "fr", "dk"))

Thanks to Frank, I have two solutions.

    library(data.table)

    hotel_stays <- data.table(
      tripstart = c(as.Date("2014-10-07"), as.Date("2013-06-12"), as.Date("2013-02-07")),
      tripend   = c(as.Date("2014-10-10"), as.Date("2013-06-13"), as.Date("2013-02-10")),
      country   = c("us", "fr", "dk"))

    ### Solution 1
    setkey(hotel_stays, tripstart, tripend)
    # The first date is used as the transaction date.
    max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
    min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)
    # One row per calendar day between the earliest and latest date.
    hotel_stays_long <- data.table(day = seq.Date(min_date, to = max_date, by = "day"))
    # foverlaps() needs an interval in both tables, so each day becomes [day, end].
    hotel_stays_long[, end := day]
    setkey(hotel_stays_long, day, end)
    hotel_stays_long <- foverlaps(hotel_stays, hotel_stays_long)
    hotel_stays_long[, c("end", "tripstart", "tripend") := NULL]

    ### Solution 2
    # Build the day sequence row by row, starting from the original table.
    hotel_stays[, .(day = seq(tripstart, tripend, by = "day"), country),
                by = 1:nrow(hotel_stays)]

I ran both examples on a private data set that contains additional columns. Info on the data set:

    > dim(hotel_stays)
    [1] 4675   28

The first solution leads to:

      replications elapsed relative user.self sys.self user.child sys.child
    1          100   1.898        1     1.889    0.005          0         0

The second solution leads to:

      replications elapsed relative user.self sys.self user.child sys.child
    1          100  45.244        1    45.253        0          0         0

The test environment is:

    > sessionInfo()
    R version 3.2.0 (2015-04-16)
    Platform: x86_64-unknown-linux-gnu (64-bit)
    Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] rbenchmark_1.0.0 data.table_1.9.5 RODBC_1.3-11

    loaded via a namespace (and not attached):
    [1] tools_3.2.0  chron_2.3-45

In conclusion, the first solution is faster but less elegant.
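For anyone who wants to reproduce the comparison on the small sample data, here is a minimal sketch, not the exact script used for the timings above. The helper names `expand_foverlaps` and `expand_by_row` are made up for illustration, and the timing uses base R's `system.time()` rather than rbenchmark so that data.table is the only dependency. It checks that both approaches return the same day/country rows before timing them.

```r
library(data.table)

# Sample data from the question.
hotel_stays <- data.table(
  tripstart = as.Date(c("2014-10-07", "2013-06-12", "2013-02-07")),
  tripend   = as.Date(c("2014-10-10", "2013-06-13", "2013-02-10")),
  country   = c("us", "fr", "dk")
)

# Solution 1 (foverlaps), wrapped in a hypothetical helper.
expand_foverlaps <- function(stays) {
  stays <- copy(stays)                 # do not re-key the caller's table
  setkey(stays, tripstart, tripend)
  days <- data.table(
    day = seq(min(stays$tripstart), max(stays$tripend), by = "day")
  )
  days[, end := day]                   # one-day intervals [day, end]
  setkey(days, day, end)
  out <- foverlaps(stays, days)        # one row per (stay, overlapping day)
  out[, .(day, country)]
}

# Solution 2 (per-row seq), wrapped in a hypothetical helper.
expand_by_row <- function(stays) {
  out <- stays[, .(day = seq(tripstart, tripend, by = "day"), country),
               by = .(row = seq_len(nrow(stays)))]
  out[, .(day, country)]
}

a <- expand_foverlaps(hotel_stays)
b <- expand_by_row(hotel_stays)

# Sort both results the same way so they can be compared row by row.
setorder(a, day, country)
setorder(b, day, country)

# 4 + 2 + 4 days across the three stays gives 10 rows in each result.
stopifnot(nrow(a) == 10L, nrow(b) == 10L)
stopifnot(all(a$day == b$day), all(a$country == b$country))

# Rough timing over 100 repetitions; absolute numbers depend on machine and data.
print(system.time(for (i in 1:100) expand_foverlaps(hotel_stays)))
print(system.time(for (i in 1:100) expand_by_row(hotel_stays)))
```

On three rows the gap will probably be much smaller than on the 4675-row table, since the per-row version pays its overhead once per stay.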