python - Most effective way to parse CSV and take action based on content of row -


I have a CSV file that Splunk generates, similar in format to the following:

    category,url,hash,id,"__mv_hash","_mkv_id"
    binary,somebadsite.com/file.exe,12345abcdef,123,,,
    callback,bad.com,,567,,,

What I need to do is iterate through the CSV file, maintaining the header order, and take a different action depending on whether the result is a binary or a callback. For example, if the result is a binary I'll return an arbitrary "clean" or "dirty" rating, and if it's a callback I'll print out the details.

Below is the code I'm planning to use, but I'm new to Python and would like feedback on the code and on whether there is a better way to accomplish this. I'm not clear on the difference between how I'm handling the case where the result is a binary: for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""))) and how I handle it when it's not. Both achieve the same result, so what's the benefit of one over the other?

    import gzip
    import csv
    import json

    csv_file = 'test_csv.csv.gz'

    class gzipcsvreader:
        def __init__(self, filename):
            self.gzfile = gzip.open(filename)
            self.reader = csv.DictReader(self.gzfile)
            self.fieldnames = self.reader.fieldnames

        def next(self):
            return self.reader.next()

        def close(self):
            self.gzfile.close()

        def __iter__(self):
            return self.reader.__iter__()

    def get_rating(hash):
        if hash == "12345abcdef":
            rating = "dirty"
        else:
            rating = "clean"
        return hash, rating

    def print_callback(result):
        print json.dumps(result, sort_keys=True, indent=4, separators=(',',':'))

    def process_results_content(r):
        for row in r:
            values = {}
            values_misc = {}

            if row["category"] == "binary":
                # Iterate through the key:value pairs and add them to the dictionary
                for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""))):
                    v = row[k]
                    values[k] = v
                rating = get_rating(row["hash"])
                if rating[1] == "dirty":
                    print rating
            else:
                for k in r.fieldnames:
                    if not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""):
                        v = row[k]
                        values_misc[k] = v
                print_callback(values_misc)
        r.close()

    if __name__ == '__main__':
        r = gzipcsvreader(csv_file)
        process_results_content(r)
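One way to compare the two filtering styles is to run both on the same row. This is a minimal Python 3 sketch, not the code above: keep, filter_with_generator and filter_with_inline_if are hypothetical helper names, and the row is typed in by hand from the sample data rather than read from a file.

```python
def keep(name):
    """Return True for ordinary columns, False for the __mv_/_mkv_ ones."""
    # str.startswith accepts a tuple of prefixes.
    return not name.startswith(("__mv_", "_mkv_"))

def filter_with_generator(row, fieldnames):
    # Style 1: the filter lives in a generator expression in the loop header.
    values = {}
    for k in (k for k in fieldnames if keep(k)):
        values[k] = row[k]
    return values

def filter_with_inline_if(row, fieldnames):
    # Style 2: loop over every field name and test inside the loop body.
    values = {}
    for k in fieldnames:
        if keep(k):
            values[k] = row[k]
    return values

# Hand-written stand-in for one parsed row (an assumption, not file output).
fieldnames = ["category", "url", "hash", "id", "__mv_hash", "_mkv_id"]
row = {
    "category": "binary",
    "url": "somebadsite.com/file.exe",
    "hash": "12345abcdef",
    "id": "123",
    "__mv_hash": "",
    "_mkv_id": "",
}

a = filter_with_generator(row, fieldnames)
b = filter_with_inline_if(row, fieldnames)
print(a == b)   # True
print(list(a))  # ['category', 'url', 'hash', 'id']
```

Both functions visit the field names in header order and build the same dictionary, so the choice between them is mostly about readability.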

Finally, would a for...else loop be better than doing something like if row["category"] == "binary"? For example:

    def process_results_content(r):
        for row in r:
            values = {}
            values_misc = {}

            for k in (k for k in r.fieldnames if (not row["category"] == "binary")):
                v = row[k]
                ...
            else:
                v = row[k]
                ...

It seems like this would be the same logic, where the first clause captures everything that is not binary and the second captures everything else, but it does not seem to produce the correct result.
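A likely reason is that the else clause of a for loop is not the "otherwise" branch of the filter condition: it runs once after the loop finishes, unless the loop was left with break. The Python 3 sketch below illustrates this; loop_then_else and find_first_binary are hypothetical names used only for the demonstration.

```python
def loop_then_else(items):
    """Record which parts of a for...else statement run."""
    log = []
    for item in items:
        log.append("body:" + item)
    else:
        # Runs after the loop ends normally, even if the loop body never ran.
        log.append("else")
    return log

def find_first_binary(rows):
    """Typical use of for...else: a search with a 'not found' fallback."""
    for row in rows:
        if row["category"] == "binary":
            found = row
            break  # leaving with break skips the else clause
    else:
        found = None
    return found

print(loop_then_else(["a", "b"]))  # ['body:a', 'body:b', 'else']
print(loop_then_else([]))          # ['else']

rows = [{"category": "callback"}, {"category": "binary"}]
print(find_first_binary(rows))      # {'category': 'binary'}
print(find_first_binary(rows[:1]))  # None
```

In the snippet above there is no break, so the else block would run for every row, binary or not. The filter in the generator expression also does not depend on k, so it yields either every field name or none of them.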

Here is my take, using the pandas library.

code:

    import pandas as pd

    csv_file = 'test_csv.csv'
    df = pd.read_csv(csv_file)
    df = df[["category","url","hash","id"]] # Remove the other columns.

    get_rating = lambda x: "dirty" if x == "12345abcdef" else "clean"
    df["rating"] = df["hash"].apply(get_rating) # Assign a value to each row based on its hash value.

    print df

    j = df.to_json() # Self-explanatory. :)
    print j

result:

       category                       url         hash   id rating
    0    binary  somebadsite.com/file.exe  12345abcdef  123  dirty
    1  callback                   bad.com          NaN  567  clean
    {"category":{"0":"binary","1":"callback"},"url":{"0":"somebadsite.com\/file.exe","1":"bad.com"},"hash":{"0":"12345abcdef","1":null},"id":{"0":123,"1":567},"rating":{"0":"dirty","1":"clean"}}

If this is the intended result, substitute the above into your gzip reader, since I did not emulate the opening of the gzip file.
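For the gzip part, one possible approach on Python 3 with only the standard library is sketched below. It is an illustration under assumptions, not the code from the question: read_gzip_csv is a hypothetical helper, and the sample file is written to a temporary directory with rows trimmed to match the six header columns.

```python
import csv
import gzip
import os
import tempfile

def read_gzip_csv(path):
    """Yield each row of a gzip-compressed CSV file as a dict keyed by header."""
    # "rt" opens the archive in text mode, which the csv module needs on Python 3.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            yield row

# Hypothetical sample data, written out so the sketch is self-contained.
sample = (
    'category,url,hash,id,"__mv_hash","_mkv_id"\n'
    "binary,somebadsite.com/file.exe,12345abcdef,123,,\n"
    "callback,bad.com,,567,,\n"
)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "test_csv.csv.gz")
with gzip.open(path, "wt", newline="") as handle:
    handle.write(sample)

rows = list(read_gzip_csv(path))
print([row["category"] for row in rows])  # ['binary', 'callback']
```

Using a with block closes the file even if processing a row raises an exception, so no separate close() call is needed.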

