Python - Most effective way to parse CSV and take action based on content of row
I have a CSV file that Splunk generates, similar in format to the following:
    category,url,hash,id,"__mv_hash","_mkv_id"
    binary,somebadsite.com/file.exe,12345abcdef,123,,,
    callback,bad.com,,567,,,
What I need to do is iterate through the CSV file, maintaining header order, and take a different action depending on whether the result is a binary or a callback. For example, if the result is a binary I'll return an arbitrary "clean" or "dirty" rating, and if it's a callback I'll print out the details.
Below is the code I'm planning to use, but I'm new to Python and would like feedback on the code and on whether there is a better way to accomplish this. I'm not clear on the difference between how I'm handling the case where the result is a binary:

    for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""))):

and how I handle it if it's not. Both achieve the same result, so what's the benefit of one over the other?
    import gzip
    import csv
    import json

    csv_file = 'test_csv.csv.gz'

    class GzipCsvReader:
        def __init__(self, filename):
            self.gzfile = gzip.open(filename)
            self.reader = csv.DictReader(self.gzfile)
            self.fieldnames = self.reader.fieldnames

        def next(self):
            return self.reader.next()

        def close(self):
            self.gzfile.close()

        def __iter__(self):
            return self.reader.__iter__()

    def get_rating(hash):
        if hash == "12345abcdef":
            rating = "dirty"
        else:
            rating = "clean"
        return hash, rating

    def print_callback(result):
        print json.dumps(result, sort_keys=True, indent=4, separators=(',',':'))

    def process_results_content(r):
        for row in r:
            values = {}
            values_misc = {}
            if row["category"] == "binary":
                # Iterate through key:value pairs and add them to the dictionary
                for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""))):
                    v = row[k]
                    values[k] = v
                rating = get_rating(row["hash"])
                if rating[1] == "dirty":
                    print rating
            else:
                for k in r.fieldnames:
                    if not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""):
                        v = row[k]
                        values_misc[k] = v
                print_callback(values_misc)
        r.close()

    if __name__ == '__main__':
        r = GzipCsvReader(csv_file)
        process_results_content(r)
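On the question of the two filtering styles, a small comparison may help. The sketch below assumes Python 3; the `fieldnames` list and `row` dict are made-up stand-ins for what `csv.DictReader` would produce, and `is_visible` is a hypothetical helper, not something from the code above. It suggests that the generator expression and the `if` inside the loop select the same keys in the same order, so the choice is mostly about readability. Naming the predicate once and reusing it in both branches is one reasonable option.

```python
# Hypothetical stand-ins for r.fieldnames and one row from csv.DictReader.
fieldnames = ["category", "url", "hash", "id", "__mv_hash", "_mkv_id"]
row = {"category": "binary", "url": "somebadsite.com/file.exe",
       "hash": "12345abcdef", "id": "123", "__mv_hash": "", "_mkv_id": ""}


def is_visible(name):
    """Return True for columns that do not use the filtered prefixes."""
    # str.startswith accepts a tuple of prefixes.
    return not name.startswith(("__mv_", "_mkv_"))


# Style 1: filter in a generator expression, then loop over the survivors.
values = {}
for k in (k for k in fieldnames if is_visible(k)):
    values[k] = row[k]

# Style 2: loop over every field name and filter with an if in the body.
values_misc = {}
for k in fieldnames:
    if is_visible(k):
        values_misc[k] = row[k]

# Compare the two results, including key order.
same_result = list(values.items()) == list(values_misc.items())
print(same_result)
```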
Finally, would a for...else loop be better than doing something such as if row["category"] == "binary"? For example, something such as:
    def process_results_content(r):
        for row in r:
            values = {}
            values_misc = {}
            for k in (k for k in r.fieldnames if (not row["category"] == "binary")):
                v = row[k]
                ...
            else:
                v = row[k]
                ...
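It seems like this would be the same logic, where the first clause would capture anything that is not a binary and the second would capture everything else, but it does not seem to produce the correct result.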
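For context on why this does not behave like `if`/`else`: in Python, the `else` clause of a `for` loop runs once after the loop finishes without hitting `break`. It is not an alternative branch taken per item. In the snippet above, the generator yields either nothing (binary rows) or every field name (all other rows), and the `else` body runs afterwards in both cases. A minimal Python 3 sketch, using a hypothetical `trace` helper and made-up inputs:

```python
def trace(items, stop_at=None):
    """Record which parts of a for...else run for the given items."""
    events = []
    for item in items:
        if item == stop_at:
            events.append("break")
            break
        events.append("body:" + item)
    else:
        # Runs only when the loop ended without break, even for empty input.
        events.append("else")
    return events


print(trace(["url", "hash"]))                  # loop completes, else runs
print(trace([]))                               # empty loop, else still runs
print(trace(["url", "hash"], stop_at="hash"))  # break skips the else
```

A plain `if`/`else` on `row["category"]`, as in the first version of the code, is the construct that expresses "one action for binaries, another for everything else".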
My take, using the pandas library.

Code:
    import pandas as pd

    csv_file = 'test_csv.csv'
    df = pd.read_csv(csv_file)
    df = df[["category","url","hash","id"]]  # Remove the other columns.
    get_rating = lambda x: "dirty" if x == "12345abcdef" else "clean"
    df["rating"] = df["hash"].apply(get_rating)  # Assign a value to each row based on the hash value.
    print df
    j = df.to_json()  # Self-explanatory. :)
    print j
Result:
       category                       url         hash   id rating
    0    binary  somebadsite.com/file.exe  12345abcdef  123  dirty
    1  callback                   bad.com          NaN  567  clean
    {"category":{"0":"binary","1":"callback"},"url":{"0":"somebadsite.com\/file.exe","1":"bad.com"},"hash":{"0":"12345abcdef","1":null},"id":{"0":123,"1":567},"rating":{"0":"dirty","1":"clean"}}
If this is the intended result, just substitute the above into your gzip reader, since I did not emulate the opening of the gzip file.
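The gzip step that was left out could be handled with the standard library alone. This is a minimal sketch, assuming Python 3 rather than the Python 2 used in the question. The `iter_rows` helper and the sample file written here are hypothetical, and the sample rows are trimmed to six fields so that they line up with the six-column header.

```python
import csv
import gzip
import os
import tempfile

# Made-up sample data modeled on the question's format.
SAMPLE = (
    'category,url,hash,id,"__mv_hash","_mkv_id"\n'
    "binary,somebadsite.com/file.exe,12345abcdef,123,,\n"
    "callback,bad.com,,567,,\n"
)


def iter_rows(path):
    """Yield each CSV row as a dict, minus the __mv_/_mkv_ columns."""
    # Text mode ("rt") matters in Python 3: csv expects str, not bytes.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            yield {k: v for k, v in row.items()
                   if not k.startswith(("__mv_", "_mkv_"))}


# Write a small gzipped sample so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "test_csv.csv.gz")
with gzip.open(path, "wt", newline="") as handle:
    handle.write(SAMPLE)

rows = list(iter_rows(path))
for row in rows:
    if row["category"] == "binary":
        rating = "dirty" if row["hash"] == "12345abcdef" else "clean"
        print(row["hash"], rating)
    else:
        print(row)
```

Filtering the columns once inside the generator keeps the per-category branches short, and the `with` block closes the file without a separate `close()` call.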