How to filter out sequences based on a given file using Python?
I want to filter out sequences that I don't want, based on a given file, a.fasta. The original file contains all the sequences. A FASTA file starts with a sequence ID followed by the nucleotides, represented by A, T, C, G. Can anyone help me?
a.fasta

    >chr12:15747942-15747949
    tgacatca
    >chr2:130918058-130918065
    tgacctca

original.fasta

    >chr3:99679938-99679945
    tgacgtaa
    >chr9:135822160-135822167
    tgacctca
    >chr12:15747942-15747949
    tgacatca
    >chr2:130918058-130918065
    tgacctca
    >chr2:38430457-38430464
    tgacctca
    >chr1:112381724-112381731
    tgacatca

Expected output, c.fasta

    >chr3:99679938-99679945
    tgacgtaa
    >chr9:135822160-135822167
    tgacctca
    >chr2:38430457-38430464
    tgacctca
    >chr1:112381724-112381731
    tgacatca

Code
    import sys
    import warnings
    from Bio import SeqIO
    from Bio import BiopythonDeprecationWarning
    warnings.simplefilter('ignore', BiopythonDeprecationWarning)

    fasta_file = sys.argv[1]   # Input fasta file
    remove_file = sys.argv[2]  # Input wanted file, one gene name per line
    result_file = sys.argv[3]  # Output fasta file

    remove = set()
    with open(remove_file) as f:
        for line in f:
            line = line.strip()
            if line != "":
                remove.add(line)

    fasta_sequences = SeqIO.parse(open(fasta_file), 'fasta')

    with open(result_file, "w") as f:
        for seq in fasta_sequences:
            nuc = seq.seq.tostring()
            if nuc not in remove and len(nuc) > 0:
                SeqIO.write([seq], f, "fasta")

The code above filters out the repeated sequences, but I want to keep repeated sequences if they appear in the expected output.
Check out Biopython. Here is a solution using that:
    from Bio import SeqIO

    input_file = 'a.fasta'          # records to exclude
    merge_file = 'original.fasta'   # full set of records
    output_file = 'results.fasta'   # filtered output

    # Collect the IDs of every record that should be removed.
    exclude = set()
    fasta_sequences = SeqIO.parse(input_file, 'fasta')
    for fasta in fasta_sequences:
        exclude.add(fasta.id)

    # Write out only the records whose ID is not in the exclusion set.
    fasta_sequences = SeqIO.parse(merge_file, 'fasta')
    with open(output_file, 'w') as output_handle:
        for fasta in fasta_sequences:
            if fasta.id not in exclude:
                SeqIO.write([fasta], output_handle, "fasta")
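The key point of this answer is that records are matched by sequence ID, not by nucleotide string, so a sequence such as tgacctca is kept when it appears under an ID that is not listed in a.fasta. If Biopython is not installed, the same idea can be sketched with only the standard library. The helper names `read_fasta`, `record_id` and `filter_by_id` below are made up for illustration and are not part of any library; the sketch also assumes the ID is the first whitespace-delimited token of each header line, which is how the sample data is laid out.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def record_id(header):
    """Return the first whitespace-delimited token of a header (assumed ID)."""
    parts = header.split()
    return parts[0] if parts else ""


def filter_by_id(original, unwanted):
    """Return records from `original` whose ID does not occur in `unwanted`."""
    exclude = {record_id(h) for h, _ in read_fasta(unwanted)}
    return [(h, s) for h, s in read_fasta(original) if record_id(h) not in exclude]


# In-memory copies of the sample files from the question.
a_fasta = io.StringIO(
    ">chr12:15747942-15747949\ntgacatca\n"
    ">chr2:130918058-130918065\ntgacctca\n"
)
original_fasta = io.StringIO(
    ">chr3:99679938-99679945\ntgacgtaa\n"
    ">chr9:135822160-135822167\ntgacctca\n"
    ">chr12:15747942-15747949\ntgacatca\n"
    ">chr2:130918058-130918065\ntgacctca\n"
    ">chr2:38430457-38430464\ntgacctca\n"
    ">chr1:112381724-112381731\ntgacatca\n"
)

kept = filter_by_id(original_fasta, a_fasta)
for header, seq in kept:
    print(f">{header}\n{seq}")
```

With the sample data this keeps the four records listed in the expected c.fasta. To use it on real files, pass handles from `open('original.fasta')` and `open('a.fasta')` in place of the `io.StringIO` objects.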