Excel Hacks: Reverse Complement a Nucleotide Sequence

Reverse complementing a nucleotide sequence in Excel turns out to be really ugly - in fact, this may be one case where it's better to do some cut and pasting into a web tool or finally go pick up that Perl book that's been sitting on your shelf. However, if you’re desperate and absolutely have to do it in Excel, here's how you can reverse complement short (~30 base) nucleotide sequences.

Complementing

Let's break reverse complementing into its two steps - reversing the DNA sequence and then taking its complement. We'll start with complementing:
= SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( A1, "A", 1), "C", 2), "G", 3),"T", 4), 1, "T"), 2, "G"), 3, "C"), 4, "A")

As in previous posts, we assume the sequence to be complemented is in cell A1.
This rather lengthy formula is performing a two-step substitution. First, nucleotides are converted to numbers, e.g. “A” to “1”, “C” to “2”, etc. Next, the numbers are then replaced with the complementary nucleotide, so “1” goes to “T”, “2” to “G” and so on. Doing the substitution in two steps allows us to distinguish which bases have been complemented from those that have not. The formula works correctly for nucleotides “A”, “C”, “G”, “T” and “N”. Other degenerate IUPAC codes could be handled, but we’ll leave this as an exercise for the reader.

Reversing

Surprisingly, reversing the DNA sequence turns out to be even trickier. Most programming languages have a function to reverse text, but not Excel. Breaking out a macro in Excel might be worthwhile since writing a little Visual Basic to reverse text isn't too bad. Unfortunately, this is not possible for folks with Macs using Office 2008 since macro functionality was removed.

For a general method to reverse our sequence, our options are limited and ugly. Perhaps one of the less awful methods is repeated use of the MID function. Assuming that our complemented nucleotide sequence is in cell B1, we would do something like this:

= MID(B1,30,1) & MID(B1,29,1) & MID(B1,28,1) & MID(B1,27,1) & MID(B1,26,1) & MID(B1,25,1) & MID(B1,24,1) & MID(B1,23,1) & MID(B1,22,1) & MID(B1,21,1) & MID(B1,20,1) & MID(B1,19,1) & MID(B1,18,1) & MID(B1,17,1) & MID(B1,16,1) & MID(B1,15,1) & MID(B1,14,1) & MID(B1,13,1) & MID(B1,12,1) & MID(B1,11,1) & MID(B1,10,1) & MID(B1,9,1) & MID(B1,8,1) & MID(B1,7,1) & MID(B1,6,1) & MID(B1,5,1) & MID(B1,4,1) & MID(B1,3,1) & MID(B1,2,1) & MID(B1,1,1)

The formula constructs the reverse sequence base-by-base up to 30 bases. Make certain that your sequence to reverse is 30 or fewer nucleotides, otherwise your sequence will be truncated without warning. You could extend the function to handle larger sequences, but clearly this method is only practical for short sequences.

Excel Hacks: Calculating GC%

As a quick hack, you can calculate the GC% of a DNA sequence in Excel without breaking out the macros. Here’s how you do it assuming your sequence is in Excel cell A1.

= (1 - LEN(SUBSTITUTE(SUBSTITUTE(A1,"G",""),"C",""))/LEN(A1)) * 100

Although the formula is a bit daunting, the principle is straightforward. First, use the SUBSTITUTE command to get a version of your sequence with all G’s and C’s removed. Determine the length of this new sequence using the LEN command and compare this to the length of the original sequence. This gives you the fraction of sequence that is NOT a G or a C. To get the GC%, subtract this fraction from 1 and multiply by 100% to go from a fraction to a percent.

This will probably satisfy most folks, but it’s not perfect. For an accurate calculation, we need to take into account ambiguous IUPAC base codes. For example, a "N" should be treated as 0.5 of a G-C base. A "D" should count as 0.33. Although it’s possible to extend the hack above to handle these cases, it might be a good time to break out Perl or Python.