Filter .bam files for a specific chromosome and start positions of reads

Working with sequence/alignment-data, you sometimes need to filter an existing .bam file e.g. for data reduction or read extraction. In this post we describe and implement one way to filter the content of .bam files by a given chromosome as well as by a given range for read start positions.

A straight forward and flexible way to change an existing .bam file is to convert it to .sam format, filter the .sam file using bash and finally convert it back into the .bam format. Therefore we make use the following samtools commands in our demo script which can be downloaded here:

samtools view -h "alignment.bam" > "alignment.sam"
samtools view -bS "alignment.sam" > "alignment.bam"
samtools sort "alignment.bam" "alignment_sorted"
samtools index "alignment_sorted.bam" "alignment_sorted"

After converting the .bam file into the .sam format, our script extracts a predefined amount of reads, writes them back into a sorted .bam file and finally creates a .bai index file from it. The extraction filters for reads on a given chromosome, with a start position in between a given range.

##############################
#set the following variables
##############################

#use the bamfile without .bam
myfile="alignment"
#the chromosome to filter for
mychrom="chr1"
# reads that start inbetween $from and $to get selected 
from="43500000"
to=  "43600000"
#how many reads should be in the resultfile at max
readnumber="50"

##############################
#script starts here
##############################

samfile="${myfile}.sam"
bamfile="${myfile}.bam"
samchromfile="${myfile}_${mychrom}".sam
samchromtargetfile="${myfile}_${mychrom}_from_${from}_to_${to}.sam"
bamchromtargetfile="${myfile}_${mychrom}_from_${from}_to_${to}.bam"
sortedbamchromtargetfile="${myfile}_${mychrom}_from_${from}_to_${to}_sorted.bam"
tmpfile="tmp.txt"

##convert initial bam to sam
samtools view -h "${bamfile}" > "${samfile}"
#filter by chromosome
cat ${samfile} | awk -v mychrom=${mychrom} 'BEGIN{FS="\t";} {if($3==mychrom) print $0}' > ${samchromfile}
#filter by given target (start and end)
cat ${samchromfile} | awk -v from=${from} -v to=${to} '{FS="\t"} $4>=from && $4<=to' from=$from to=$to > ${samchromtargetfile}
#take the first x lines
head -n ${readnumber} ${samchromtargetfile} > ${tmpfile} && mv ${tmpfile} ${samchromtargetfile}
#push the original header to the beginning of the new file
head -n 1000 ${samfile} | grep  "^@" | cat - ${samchromtargetfile} > ${tmpfile} && mv ${tmpfile} ${samchromtargetfile}

#convert sam to bam
samtools view -bS ${samchromtargetfile} > ${bamchromtargetfile}
tmpfilename=`echo "${sortedbamchromtargetfile}" | sed "s/.bam$//"`
# create sorted bam
samtools sort "${bamchromtargetfile}" "${tmpfilename}"
tmpfilename=`echo "${sortedbamchromtargetfile}" | sed "s/.bam$//"`
#create bai
samtools index "${sortedbamchromtargetfile}" "${tmpfilename}.bai"

Using this small script you can easily filter for chromosome and/or the start position. You can adapt the two lines of code doing the filtering to your needs using bash, awk or any other script languages/file-processing tools. For simple examples you can also rely purely on the filter options of samtools.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: