python - Optimized Way to Splice a Pandas Dataframe


My problem: I have a large time series (~5-10 million observations) with events marked by flags. In my case, a drop in stock prices triggers the event, so each row has a dummy variable that is 1 or 0 depending on whether the event was triggered or not. From the time series I want to extract both the events and the subsequent 29 days of data. Obviously, this involves some type of splicing of arrays.

I have simple code that should do the trick (it merely marks the flags and the next 29 days with a 2, so there is then a simple filter on the dataframe), but it relies on pandas dataframe slicing, which is not quick. Here is the code:

def first_drop(df):
    # indices of the rows where the event was triggered
    flagged = df.dropflag[df.dropflag == 1].index
    for x in flagged:
        # mark the flagged row and the next 29 rows with 2
        df.dropflag[x:x + 30] = 2
    return df.dropflag

dstk['dropflag2'] = dstk[["permno", "dropflag"]].groupby('permno').apply(first_drop)

Is there a faster way that anyone else has found for this type of splicing of the next x number of entries? I'm thinking it might be faster with numpy arrays or maybe a Cythonized function, but I can't quite see where to start.
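One vectorized alternative (not from the original post) worth trying before reaching for Cython: within each permno group, a backward-looking rolling maximum over a 30-row window tells you whether any of the last 30 rows (the current row plus the 29 before it) carried a flag, which is exactly the "mark the flag and the next 29 days with 2" step, done without a Python-level loop over events. A minimal sketch, assuming rows are sorted by date within each permno; `mark_drop_window` is a name invented here:

```python
import pandas as pd


def mark_drop_window(flags, window=30):
    # flags: 0/1 Series for one permno, sorted by date.
    # rolling(window).max() at row i covers rows i-window+1 .. i, so it is 1
    # exactly when a flag occurred on this row or any of the previous 29.
    within = flags.rolling(window, min_periods=1).max()
    # keep the original 0 outside event windows, write 2 inside them
    return flags.where(within == 0, 2)


# applied per group:
# df['dropflag2'] = df.groupby('permno')['dropflag'].transform(mark_drop_window)
```

The rolling maximum is computed in C inside pandas, so the per-group cost is linear in the group length rather than proportional to the number of flagged rows times the slice width.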

Here is one possible way to do it. It is maybe not the fastest, but it takes about 1 min to process a 10,000,000-row dataset. The idea is that, by populating new columns with the data of subsequent days using .shift(-i), it avoids looping over rows inside each groupby. It also has the advantage of flexibility in reshaping the resulting dataframe, for example using stack() to get stacked records.

import pandas as pd
import numpy as np

# generate artificial data, 10,000,000 rows
# ============================================================
np.random.seed(0)
dates = pd.date_range('2001-01-01', periods=2500, freq='B')
permno = np.arange(1000, 5000)  # 4000 symbols
multi_index = pd.MultiIndex.from_product([permno, dates], names=['permno', 'dates'])
data = np.random.randn(10000000)
dropflag = np.random.choice([0, 1], size=10000000)

df = pd.DataFrame({'data': data, 'dropflag': dropflag}, index=multi_index).reset_index('permno')

Out[273]:
            permno    data  dropflag
dates
2001-01-01    1000  1.7641         1
2001-01-02    1000  0.4002         1
2001-01-03    1000  0.9787         0
2001-01-04    1000  2.2409         1
2001-01-05    1000  1.8676         0
...            ...     ...       ...
2010-07-26    4999  0.5902         1
2010-07-27    4999  0.4676         1
2010-07-28    4999 -1.9447         1
2010-07-29    4999 -0.3440         1
2010-07-30    4999 -0.7402         0

[10000000 rows x 3 columns]

# processing
# ============================================================
def func(group):
    all_data = [group]
    for i in np.arange(1, 30):
        temp = group.data.shift(-i)
        temp.name = 'data_subday{}'.format(i)
        all_data.append(temp)
    dataset = pd.concat(all_data, axis=1).iloc[:-30]
    return dataset.loc[dataset.dropflag == 1]

%time df.groupby('permno').apply(func)

CPU times: user 59.7 s, sys: 1.83 s, total: 1min 1s
Wall time: 1min 5s

Out[277]:
                   permno    data  dropflag  data_subday1  data_subday2  ...  data_subday28  data_subday29
permno dates
1000   2001-01-01    1000  1.7641         1        0.4002        0.9787  ...         1.5328         1.4694
       2001-01-02    1000  0.4002         1        0.9787        2.2409  ...         1.4694         0.1549
       2001-01-04    1000  2.2409         1        1.8676       -0.9773  ...         0.3782        -0.8878
       2001-01-08    1000 -0.9773         1        0.9501       -0.1514  ...        -1.9808        -0.3479
       2001-01-09    1000  0.9501         1       -0.1514       -0.1032  ...        -0.3479         0.1563
...                   ...     ...       ...           ...           ...  ...            ...            ...
4999   2010-06-09    4999  2.1195         1        1.5564        1.0739  ...        -1.4011         1.1292
       2010-06-15    4999 -1.1747         1        0.2159        0.1221  ...         1.3593         0.5902
       2010-06-16    4999  0.2159         1        0.1221        0.0136  ...         0.5902         0.4676
       2010-06-17    4999  0.1221         1        0.0136        0.8378  ...         0.4676        -1.9447
       2010-06-18    4999  0.0136         1        0.8378        0.4887  ...        -1.9447        -0.3440

[4941409 rows x 32 columns]
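If the per-group apply is still too slow, a further refinement (an untested-at-scale sketch, not part of the original answer) is to perform each shift once over the whole frame instead of once per group, and afterwards drop the rows whose window would spill across a permno boundary. This assumes the frame is sorted by permno and then by date, as in the example data; `widen` and `n_days` are names invented here:

```python
import pandas as pd


def widen(df, n_days=29):
    # df has columns permno, data, dropflag, sorted by permno then date
    out = df.copy()
    for i in range(1, n_days + 1):
        # one whole-column shift per lag, computed across all groups at once
        out['data_subday{}'.format(i)] = df['data'].shift(-i)
    # a row is valid only if the row n_days ahead still belongs to the
    # same permno, i.e. the whole window stays inside one group
    same_group = df['permno'].shift(-n_days) == df['permno']
    return out[same_group & (out['dropflag'] == 1)]
```

This trades 29 cheap whole-column shifts plus one boolean mask for 4000 separate group applies, at the cost of briefly computing shifted values that leak across group boundaries before they are masked out.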
