DIY Bioinformatics: A Whole New Galaxy

This blog post was first published by my company on June 16, 2011

Inspired by Will’s recent book review and the crowdsourced analysis of the European E. coli outbreak, I thought I would take another look at DIY (Do It Yourself) biology in this week’s post. Unlike some, I have no interest in trying to run molecular biology experiments out of my kitchen. As anyone who has had the misfortune of trying my cooking would tell you, if there is a way to make a PCR reaction dry and tasteless I'm sure I'd find it. DIY bioinformatics I find more intriguing. I'm not a practitioner as I'm too busy with PSTDIFY (Pay Somebody To Do It For You) bioinformatics, but I like the vision of the lone, amateur scientist, sitting amongst a pile of empty pizza boxes and Red Bull cans finding unknown biological treasure with just their laptop, curiosity and some serendipity.

This vision is not so unlikely. Large biological data sets are readily available, including thousands of microarray experiments and even full genomes. Someone modestly adept at programming or a package like R can interrogate, correlate and mine this data - and indeed this is happening all the time. What about the true amateur, however, who lacks even programming skills? Can the Excel Warrior or my web-savvy grandma participate in their own DIY bioinformatics adventure? That’s what I set out to discover this week.

As a test, I went back to a favorite paper of mine by Majewski and Ott (Genome Research, 2004). What I like about the paper is the number of insights made simply through careful mining of genomic databases. For example, even with inherently noisy data sets like SNP databases and the annotated human genome, the authors were able to clearly see the extent and importance (for splice regulation) of sites near exon-intron boundaries simply by looking at the overall frequency of SNPs discovered in these positions compared to other sites. This figure (F2) from the paper shows the low SNP frequency in the immediate 5' and 3' positions of the intron where it meets the neighboring exons.

My test was to see if I could reproduce at least a part of this analysis using only free public tools and no programming. I settled on the web-based analysis tool Galaxy, as it seemed to have a lot of the functionality I would need and I wasn’t very familiar with it - making me a better stand-in for the Red Bull-intoxicated amateur scientist. After some time poking around, I arrived at these steps in Galaxy:

  1. Get introns from chromosome 12 via UCSC’s Table Browser (I just did chromosome 12 to keep my data sets manageable for this example).
  2. Get all SNPs from chromosome 12.
  3. Join the introns and SNPs, producing a table of only those SNPs that fall within an intron.
  4. Calculate the position of each SNP relative to the 5’ end of its intron.
  5. Count the number of SNPs found at each position relative to the 5’ end.
  6. Sort results by position (probably not necessary).
  7. Limit results to just positions within 50 bp of the exon-intron boundary.
  8. Plot the SNP frequency vs. SNP location.
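
For readers who do program, steps 3 through 7 above can be roughed out in a few lines of Python. This is only a sketch and not the Galaxy workflow itself. The function name, the `(start, end, strand)` tuple layout, the 0-based half-open coordinates and the strand-aware handling of the 5’ end are all my assumptions for illustration, and the coordinates in the usage lines are made-up toy values, not real chromosome 12 data. Plotting (step 8) is left out.

```python
from bisect import bisect_left
from collections import Counter


def snp_counts_near_5prime(introns, snp_positions, window=50):
    """Count SNPs at each offset from the 5' end of the intron containing them.

    introns: iterable of (start, end, strand) tuples; coordinates are assumed
        to be 0-based and half-open, strand is "+" or "-" (hypothetical layout).
    snp_positions: iterable of 0-based SNP positions on the same chromosome.
    window: keep only offsets smaller than this many bases (step 7).
    """
    snps = sorted(snp_positions)
    counts = Counter()
    for start, end, strand in introns:
        # Step 3: find the SNPs that fall inside this intron.
        lo = bisect_left(snps, start)
        hi = bisect_left(snps, end)
        for pos in snps[lo:hi]:
            # Step 4: offset from the 5' end, which sits at the other
            # coordinate for minus-strand introns.
            offset = pos - start if strand == "+" else end - 1 - pos
            if offset < window:
                counts[offset] += 1  # Step 5: tally SNPs per offset.
    # Step 6: return (offset, count) pairs sorted by offset.
    return sorted(counts.items())


# Toy data, invented purely to show the call.
toy_introns = [(100, 200, "+"), (300, 400, "-")]
toy_snps = [100, 105, 150, 250, 305, 394, 399]
print(snp_counts_near_5prime(toy_introns, toy_snps))  # [(0, 2), (5, 2)]
```

Sorting the SNPs once and using `bisect` keeps the lookup reasonable for a whole chromosome. Like the Galaxy version, this counts raw SNPs and does nothing to account for how many introns are long enough to reach a given offset.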

For more detail, you can see my workflow. And this was my final result:

This corresponds pretty well to the left portion of Majewski and Ott's own intron plot and was finished before cracking my second can of Red Bull. Score one for DIY!

Before I quit my day (and night) job to make room for the waves of empowered amateurs, it's worth pointing out a few minor details. First, this rudimentary analysis glosses over many important details (such as normalizing for intron lengths), and any publication-ready workflow would be much more complex. Second, like all tools of this type, Galaxy walks a fine line to balance functionality and usability. It took me quite a bit of exploring to find the right functions, and many of these functions probably only made sense to me because I knew a lot about programming, databases, genomics, etc. Third, it's near impossible for a web-based tool like Galaxy to match the power, speed and flexibility a programmer has when analyzing data. And finally, although I am empowered by Galaxy to do the steps, the know-how of what questions to ask and the science to understand the observations I make come from many years of experience - unfortunately there's no shortcut around that.

With that said, Galaxy has some very nice features and is a powerful addition to the DIYer's toolbox. Stock up on your Red Bull now.