How Does OligoArray Work?


        OligoArray is a program that computes gene-specific oligonucleotides that are free of secondary structure, for the construction of genome-scale oligonucleotide microarrays. Selection is based on three major criteria: the oligonucleotide melting temperature (Tm), specificity to a single target (or at least to the shortest possible list of targets), and the inability to fold into a stable secondary structure at the hybridization temperature. Here, I describe how Tm, specificity and secondary structure are computed.
Tm Computation
        Oligonucleotide melting temperature is computed with the nearest-neighbor model, using the DNA parameters published by Dr. J. SantaLucia (Proc Natl Acad Sci U S A (1998) 95: 1460-5). The published data are for a buffer with a 1 M sodium concentration. To compute the Tm, I use the formula Tm = ΔH° / (ΔS° + R ln([DNA]/4)) - 273.15, where ΔH° and ΔS° are the enthalpy and entropy of duplex formation, R is the gas constant (1.9872 cal/(K·mol)) and [DNA] is the DNA concentration. The result is in degrees Celsius.

        For the computation, I have fixed the DNA concentration at 10E-6 M and I assume that both strands are present at equal concentration. This parameter is fixed because we do not know the exact concentration of each probe and each target during hybridization. Changing the DNA concentration from 10E-6 to 10E-9 M decreases the Tm by only a few degrees.
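The formula above can be sketched in a few lines of Python. This is only an illustration, not OligoArray's own code: the function name and the example ΔH° and ΔS° totals are hypothetical, the nearest-neighbor summation that would produce those totals is left out, and I read "10E-6 M" as 10^-6 M.

```python
import math

GAS_CONSTANT = 1.9872  # cal/(K*mol), the value quoted in the text


def melting_temperature(delta_h: float, delta_s: float, dna_conc: float = 1e-6) -> float:
    """Return the Tm in degrees Celsius.

    delta_h is the total enthalpy in cal/mol, delta_s the total entropy in
    cal/(K*mol), and dna_conc the DNA concentration in mol/L.
    """
    return delta_h / (delta_s + GAS_CONSTANT * math.log(dna_conc / 4.0)) - 273.15


# Hypothetical duplex totals, chosen only to illustrate the formula.
example_dh = -400_000.0
example_ds = -1_080.0

tm_micromolar = melting_temperature(example_dh, example_ds, 1e-6)
tm_nanomolar = melting_temperature(example_dh, example_ds, 1e-9)
print(f"Tm at 1e-6 M: {tm_micromolar:.1f} C, at 1e-9 M: {tm_nanomolar:.1f} C")
```

With these made-up totals, lowering the concentration a thousandfold shifts the Tm by a few degrees, which is the behaviour described above.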

Specificity Computation
         The Blast program (Altschul et al. 1997 NAR 25(17):3389-402) is used to search for sequence similarities between the oligonucleotide sequence and all other sequences contained in the local Blast database (see here how to build this database). For gene expression studies, this database should contain all transcribed sequences from the organism you are working on.

        OligoArray automatically runs a Blast search for every oligo. The Blast options are set as follows: -W 7 -F F -S 1 -e 100. The last value is suitable for small genomes; for large genomes, you may need to increase it until the Blast output contains alignments of at least 13 nucleotides. The -S 1 option restricts the search to the plus strand, which is useful only for oligos designed for gene expression. The paper does not mention the -W option (the default is W = 11); we now use W = 7 to seed the Blast search with shorter words.
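For readers who want to reproduce the search by hand, the sketch below assembles the corresponding command line. It assumes the legacy NCBI `blastall` executable running `blastn`; the text above only lists the four options, so the program name, the `-p`, `-d` and `-i` arguments, and the file and database names are my assumptions.

```python
from typing import List


def build_blast_command(query_fasta: str, database: str) -> List[str]:
    """Build a legacy blastall command using the options quoted in the text."""
    return [
        "blastall", "-p", "blastn",  # assumed program; not stated in the text
        "-d", database,              # local Blast database
        "-i", query_fasta,           # file holding the oligo sequence
        "-W", "7",                   # word size used to seed alignments
        "-F", "F",                   # low-complexity filtering switched off
        "-S", "1",                   # search the plus strand only
        "-e", "100",                 # expectation value cutoff
    ]


# Hypothetical file and database names.
command = build_blast_command("oligo.fasta", "transcripts_db")
print(" ".join(command))
```

The list can be passed as is to `subprocess.run` on a machine where `blastall` is installed.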

        OligoArray reads two values from the Blast output for each alignment: its length and its percentage of identity. These values are compared to threshold parameters (see the table below) designed to be more restrictive than those recommended by Kane et al. (Nucleic Acids Res, 2000, 28: 4552-7) and Hughes et al. (Nat Biotechnol, 2001, 19: 342-7). An oligonucleotide is accepted if the identity detected with Blast is below these thresholds, which are a function of the length of the sequence alignment. For example, if the alignment is 40 nucleotides long, then the oligo is rejected if the percentage of identity exceeds 60 %.

Length of identity              Threshold (% of identity)
> 50 nucleotides                50 %
36 to 50 nucleotides            60 %
15 to 35 nucleotides            70 %
less than 15 nucleotides *      100 %

 * In some cases (see algorithm), the program can tolerate perfect short alignments of up to 15 consecutive nucleotides.
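The threshold table can be expressed as a small lookup, sketched below. The function names are hypothetical, and I assume that an alignment counts against the oligo only when its identity strictly exceeds the threshold, as in the 40-nucleotide example. The sketch does not model the special tolerance for perfect 15-nucleotide alignments mentioned in the footnote.

```python
def identity_threshold(alignment_length: int) -> float:
    """Return the identity threshold (in %) for an alignment of this length."""
    if alignment_length > 50:
        return 50.0
    if alignment_length >= 36:
        return 60.0
    if alignment_length >= 15:
        return 70.0
    return 100.0  # shorter alignments can never exceed 100 %


def is_rejected(alignment_length: int, percent_identity: float) -> bool:
    """True when the alignment is similar enough to reject the oligo."""
    return percent_identity > identity_threshold(alignment_length)


print(is_rejected(40, 65.0))   # 40-nt alignment at 65 % identity
print(is_rejected(12, 100.0))  # short perfect alignment
```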

Family Members and Cross-Hybridization
        When OligoArray does not find any oligo for an input sequence because of specificity failures, I consider that the sequence belongs to a set of conserved sequences. Such sequences are processed differently (see algorithm). OligoArray first considers all sequences that show an identity above 90 % over alignments longer than the oligo length minus 5 nucleotides. These sequences are called family members and are reported as main targets of the oligo in the output file. All sequences with identities between the thresholds described above and this 90 % threshold are considered possible cross-hybridizations and are also reported in the output file.
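As a rough illustration of this classification, the sketch below labels a single Blast hit. The function name and labels are hypothetical, the length-dependent threshold is passed in as a parameter, and I assume that a hit above that threshold which does not qualify as a family member is reported as a possible cross-hybridization.

```python
def classify_hit(alignment_length: int, percent_identity: float,
                 oligo_length: int, rejection_threshold: float) -> str:
    """Label one Blast hit for an oligo from a conserved sequence.

    rejection_threshold is the identity threshold (in %) that applies to an
    alignment of this length.
    """
    # Family member: more than 90 % identity over nearly the whole oligo.
    if alignment_length > oligo_length - 5 and percent_identity > 90.0:
        return "family member"
    # Assumption: any other hit above the threshold is a cross-hybridization.
    if percent_identity > rejection_threshold:
        return "cross-hybridization"
    return "ignored"


print(classify_hit(48, 95.0, oligo_length=50, rejection_threshold=60.0))
print(classify_hit(40, 75.0, oligo_length=50, rejection_threshold=60.0))
```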
Secondary Structure Prediction
         One will probably want to reject an oligonucleotide that has a stable secondary structure at the hybridization temperature. To predict such structures, OligoArray calls the Mfold software developed by Prof. M. Zuker. Predictions are made using the parameters for DNA at a 1 M sodium concentration. An oligonucleotide is rejected if it has a secondary structure with a Tm above the user-defined threshold (see option setting).

        The published version of OligoArray accesses the Mfold WWW server remotely. This means that you need to run OligoArray on a computer connected to the Internet. If you get the Mfold license from WUSTL, I can provide you with a piece of code to link OligoArray to your local version of Mfold instead of the web server.

        The algorithm is presented in the chart below. For each sequence from the input file, OligoArray starts at the 3' end and reads the last possible window whose length equals the length of the oligo. First, the oligo Tm is computed and compared to the Tm range chosen by the user. If tag sequences were entered, they are added to the oligo sequence, and the full sequence is then tested for the absence of prohibited sequences (if defined by the user). If there are no prohibited sequences, the oligo specificity is tested using a threshold of 14, which means that perfect alignments are accepted only if they are no longer than 14 nucleotides. If the oligo is specific, it is then tested for the absence of secondary structure with a Tm above the threshold chosen by the user. If no such structure is found, the oligo is saved in the output file and the next sequence from the input file is processed.

        If one of the four tests fails, the reading window is moved 10 nucleotides upstream. The process is repeated until an oligo is found or until the maximal accepted distance between the 5' end of the oligo and the 3' end of the input sequence is reached. If no oligo is found under these conditions, OligoArray runs a new cycle using a specificity threshold of 15. This means that we tolerate a few perfect alignments of up to 15 nucleotides.
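The window scan and the two specificity cycles can be sketched as follows. This is a simplified illustration, not the program's code: the function names are hypothetical, and the four tests (Tm, prohibited sequences, specificity, secondary structure) are folded into a single caller-supplied `passes_tests` function that receives the current perfect-match limit.

```python
from typing import Callable, Iterator, Optional


def candidate_windows(sequence: str, oligo_length: int,
                      max_distance: int, step: int = 10) -> Iterator[str]:
    """Yield oligo-sized windows, starting at the 3' end and moving upstream."""
    start = len(sequence) - oligo_length
    # len(sequence) - start is the distance between the 5' end of the oligo
    # and the 3' end of the input sequence.
    while start >= 0 and len(sequence) - start <= max_distance:
        yield sequence[start:start + oligo_length]
        start -= step


def find_oligo(sequence: str, oligo_length: int, max_distance: int,
               passes_tests: Callable[[str, int], bool]) -> Optional[str]:
    """Run the strict cycle (limit 14), then the permissive one (limit 15)."""
    for perfect_match_limit in (14, 15):
        for oligo in candidate_windows(sequence, oligo_length, max_distance):
            if passes_tests(oligo, perfect_match_limit):
                return oligo
    return None  # the caller would fall back to the family procedure
```

Returning `None` marks the case where both cycles fail and the input sequence has to be handled as one of a set of closely related sequences.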

        If no oligo is found at the end of this more permissive cycle, the input sequence is considered to belong to a set of closely related sequences and is processed differently (right side of the chart). Every oligonucleotide sequence is tested for its Tm and for the absence of secondary structures. If the oligo fulfills these criteria, it is blasted against the database. All alignments longer than the length of the oligo minus 5 nt and with an identity greater than 90% define a family. Families (lists of sequence identifiers) are kept in memory until all oligonucleotides have been tested. These lists are then sorted from the shortest to the longest. The oligo corresponding to the shortest list is tested for the absence of secondary structure, and the following ones are tested in turn until a valid oligonucleotide is found.
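The final selection step can be sketched as a sort followed by a scan. The function and parameter names are hypothetical; the secondary structure test is represented by a caller-supplied function, and the family lists are assumed to have been collected from the Blast results beforehand.

```python
from typing import Callable, Dict, List, Optional, Tuple


def select_family_oligo(
    families: Dict[str, List[str]],
    is_structure_free: Callable[[str], bool],
) -> Optional[Tuple[str, List[str]]]:
    """Pick the oligo with the shortest family that passes the structure test.

    families maps each candidate oligo to its list of sequence identifiers.
    """
    ranked = sorted(families.items(), key=lambda item: len(item[1]))
    for oligo, members in ranked:
        if is_structure_free(oligo):
            return oligo, members
    return None


# Hypothetical candidates and identifiers.
example_families = {
    "AAA": ["gene1", "gene2", "gene3"],
    "CCC": ["gene1"],
    "GGG": ["gene1", "gene2"],
}
print(select_family_oligo(example_families, lambda oligo: oligo != "CCC"))
```

In this made-up example the oligo with the shortest list fails the structure test, so the oligo with the next shortest list is kept.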