ClustalW+

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

DENDROGRAM

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALGORITHM

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION


ClustalW+ creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also create a dendrogram (.dnd) showing the clustering relationships used to create the alignment.

DESCRIPTION


The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular biology. Multiple alignments are used to find diagnostic patterns to characterize protein families; to detect or demonstrate homology between new sequences and existing families of sequences; to help predict the secondary and tertiary structures of new sequences; to suggest oligonucleotide primers for PCR; and as an essential prelude to molecular evolutionary analysis. The rate of appearance of new sequence data is steadily increasing, and the development of efficient and accurate automatic methods for multiple alignment is, therefore, of major importance.

The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment.

Before alignment, the sequences are first clustered by similarity to produce a dendrogram, or tree representation of clustering relationships. It is this dendrogram that directs the order of the subsequent pairwise alignments. 

EXAMPLE


18:31~43> clustalw+

ClustalW+ is a general purpose multiple sequence alignment program for DNA or Proteins.  It produces biologically meaningful multiple sequence alignments of divergent sequences.  It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.

clustalw+ of what sequence(s) (*  *) ? gb_pl.msf

What kind of action (alignfast, alignslow, tree) :  (* alignslow *) ? alignslow

What should I call the output file.

Default name for alignment file: clustalw.msf.

Default name for tree file : clustalw.ph

Output file:  (*  *) ?

Creating clustalw.msf as the output file.

CLUSTAL W (1.83) Multiple Sequence Alignments

Sequence format is Clustalw+/MSF

Sequence 1: AB016060.gb_pl      1824 bp

Sequence 2: AB016060            1824 bp

Sequence 3: AB016062.gb_pl      1824 bp

Sequence 4: AB016063.gb_pl      1824 bp

Sequence 5: AB016064.gb_pl      1824 bp

Sequence 6: AB016065.gb_pl      1824 bp

Sequence 7: AB016066.gb_pl      1824 bp

Sequence 8: YSCGCN4.gb_pl       1824 bp

Sequence 9: YSCGCN4_1.gb_pl     1824 bp

Sequence 10: yscgcn4.gb_pl       1824 bp


Start of Pair wise alignments

Aligning...

Sequences (1:2) Aligned. Score:  100

Sequences (1:3) Aligned. Score:  33

Sequences (1:4) Aligned. Score:  1

Sequences (1:5) Aligned. Score:  1

Sequences (1:6) Aligned. Score:  0

Sequences (1:7) Aligned. Score:  4

Sequences (1:8) Aligned. Score:  2

Sequences (1:9) Aligned. Score:  2

Sequences (1:10) Aligned. Score:  2

Sequences (2:3) Aligned. Score:  33

Sequences (2:4) Aligned. Score:  1

Sequences (2:5) Aligned. Score:  1

Sequences (2:6) Aligned. Score:  0

Sequences (2:7) Aligned. Score:  4

Sequences (2:8) Aligned. Score:  2

Sequences (2:9) Aligned. Score:  2

Sequences (2:10) Aligned. Score:  2

Sequences (3:4) Aligned. Score:  1

Sequences (3:5) Aligned. Score:  1

Sequences (3:6) Aligned. Score:  3

Sequences (3:7) Aligned. Score:  1

Sequences (3:8) Aligned. Score:  3

Sequences (3:9) Aligned. Score:  3

Sequences (3:10) Aligned. Score:  3

Sequences (4:5) Aligned. Score:  57

Sequences (4:6) Aligned. Score:  53

Sequences (4:7) Aligned. Score:  68

Sequences (4:8) Aligned. Score:  1

Sequences (4:9) Aligned. Score:  1

Sequences (4:10) Aligned. Score:  1

Sequences (5:6) Aligned. Score:  81

Sequences (5:7) Aligned. Score:  65

Sequences (5:8) Aligned. Score:  1

Sequences (5:9) Aligned. Score:  1

Sequences (5:10) Aligned. Score:  1

Sequences (6:7) Aligned. Score:  62

Sequences (6:8) Aligned. Score:  1

Sequences (6:9) Aligned. Score:  1

Sequences (6:10) Aligned. Score:  1

Sequences (7:8) Aligned. Score:  6

Sequences (7:9) Aligned. Score:  6

Sequences (7:10) Aligned. Score:  6

Sequences (8:9) Aligned. Score:  100

Sequences (8:10) Aligned. Score:  100

Sequences (9:10) Aligned. Score:  100

Guide tree        file created:   [/var/tmp/bslskAAAomayNz.dnd]


Start of Multiple Alignment

There are 9 groups

Aligning...

Group 1: Sequences:   2      Score:15353

Group 2: Sequences:   2      Score:23055

Group 3: Sequences:   4      Score:17100

Group 4:                     Delayed

Group 5: Sequences:   3      Score:34656

Group 6: Sequences:   7      Score:12169

Group 7: Sequences:   2      Score:28956

Group 8: Sequences:   3      Score:16945

Group 9: Sequences:  10      Score:10629

Alignment Score 142840

GCG-Alignment file created      [/var/tmp/bslskBAApmayNz.msf]

Moved tree file from /var/tmp/bslskAAAomayNz.dnd to clustalw.dnd

 

Moved alignment file from /var/tmp/bslskBAApmayNz.msf to clustalw.msf

OUTPUT

Here is a portion of the output file clustalw.msf:

 

!!NA_MULTIPLE_ALIGNMENT 1.0

MSF: 2420  Type: N  December 07, 2004 18:34  Check: 4225 ..

 Name: AB016060.gb_pl  Len: 2420  Check: 6882  Weight: 1.0

 Name: AB016060  Len: 2420  Check: 6882  Weight: 1.0

 Name: AB016062.gb_pl  Len: 2420  Check: 6225  Weight: 1.0

 Name: AB016063.gb_pl  Len: 2420  Check: 6517  Weight: 1.0

 Name: AB016064.gb_pl  Len: 2420  Check: 3692  Weight: 1.0

 Name: AB016065.gb_pl  Len: 2420  Check: 668  Weight: 1.0

 Name: AB016066.gb_pl  Len: 2420  Check: 7136  Weight: 1.0

 Name: YSCGCN4.gb_pl  Len: 2420  Check: 8741  Weight: 1.0

 Name: YSCGCN4_1.gb_pl  Len: 2420  Check: 8741  Weight: 1.0

 Name: yscgcn4.gb_pl  Len: 2420  Check: 8741  Weight: 1.0

 

//

                1                                                   50

AB016060.gb_pl  .......... .......... .......... .......... ..........

AB016060        .......... .......... .......... .......... ..........

AB016062.gb_pl  .......... .......... .......... .......... ..........

AB016063.gb_pl  .......... .......... .......... .......... ..........

AB016064.gb_pl  .......... .......... .......... .......... ..........

AB016065.gb_pl  .......... .......... .......... .......... ..........

AB016066.gb_pl  .......... .......... .......... .......... ..........

YSCGCN4.gb_pl   ATCTTCGGGG ATATAAAGTG CATGAGCATA CATCTTGAAA AAAAAAGATG

YSCGCN4_1.gb_pl ATCTTCGGGG ATATAAAGTG CATGAGCATA CATCTTGAAA AAAAAAGATG

yscgcn4.gb_pl   ATCTTCGGGG ATATAAAGTG CATGAGCATA CATCTTGAAA AAAAAAGATG

 

                51                                                 100

AB016060.gb_pl  .......... .......... .......... .......... ..........

AB016060        .......... .......... .......... .......... ..........

AB016062.gb_pl  .......... .......... .......... .......... ..........

AB016063.gb_pl  .......... .......... .......... .......... ..........

AB016064.gb_pl  .......... .......... .......... .......... ..........

AB016065.gb_pl  .......... .......... .......... .......... ..........

AB016066.gb_pl  .......... .......... .......... .......... ..........

YSCGCN4.gb_pl   AAAAATTTCC GACTTTAAAT ACGGAAGATA AATACTCCAA CCTTTTTTTC

YSCGCN4_1.gb_pl AAAAATTTCC GACTTTAAAT ACGGAAGATA AATACTCCAA CCTTTTTTTC

yscgcn4.gb_pl   AAAAATTTCC GACTTTAAAT ACGGAAGATA AATACTCCAA CCTTTTTTTC

 

                101                                                150

AB016060.gb_pl  .......... .......... ...ATGGTGT TGTCTGAGTC CAACTTCCTG

AB016060        .......... .......... ...ATGGTGT TGTCTGAGTC CAACTTCCTG

AB016062.gb_pl  .......... .......... .......... ..ATGGATTT CTACACACT.

AB016063.gb_pl  ..AAGCAAAC GCAGCATTGG GAGATAGAAA GAGAGAGAGA AAGAGAGAGA

AB016064.gb_pl  .......... .......... .......... .......... ..........

AB016065.gb_pl  .......... ......TTCA CCCTCCGCCG CCTCGTCAAT TCCACGCGAA

AB016066.gb_pl  .......... .......... .......... .......... ..........

YSCGCN4.gb_pl   CAATTCCGAA ATTTTAGTCT TCTTTAAAGA AGTTTCGGCT CGCTGTCTTA

YSCGCN4_1.gb_pl CAATTCCGAA ATTTTAGTCT TCTTTAAAGA AGTTTCGGCT CGCTGTCTTA

yscgcn4.gb_pl   CAATTCCGAA ATTTTAGTCT TCTTTAAAGA AGTTTCGGCT CGCTGTCTTA

 

                151                                                200

AB016060.gb_pl  TTATGTCTTA TTTCCATTTC AATAGCTTCT GTTTTCTTCT TTCTCTTGAA

AB016060        TTATGTCTTA TTTCCATTTC AATAGCTTCT GTTTTCTTCT TTCTCTTGAA

AB016062.gb_pl  .TGCGT.TTG GATCAATTTT ..TGCCTGCG GTTTGCTTTA TATTCTAGCA

AB016063.gb_pl  GAGAAAGACC CTTACCCTTC TCTATCGCTC GCTTTCCTTT GACGCTTCTG

AB016064.gb_pl  .......... .......... .......... .......... ..TGCAAAAA

AB016065.gb_pl  CGCGAGAGCT CTCGGAAAGC ACCACCACCA GCACAGAGCC AGCGCGAGAG

AB016066.gb_pl  .......... .......... .......... .......... ..........

YSCGCN4.gb_pl   CCTTTTAAAA TCTTCTACTT CTTGACAGTA CTTATCTTCT TATATAATAG

YSCGCN4_1.gb_pl CCTTTTAAAA TCTTCTACTT CTTGACAGTA CTTATCTTCT TATATAATAG

yscgcn4.gb_pl   CCTTTTAAAA TCTTCTACTT CTTGACAGTA CTTATCTTCT TATATAATAG

The gaps at the ends of each sequence are written as dots (.), which may represent differences in input sequence lengths rather than missing characters or significant differences in the alignment. Internal gaps in each sequence are written as periods (.). See Appendix III of the GCG 11.0 manual for more information about the two different gap characters.

DENDROGRAM

ClustalW+ creates a dendrogram file, named clustalw.dnd by default. It contains the following information:

(

(

(

(

(

AB016060.gb_pl:0.00000,

AB016060:0.00000)

:0.33486,

AB016062.gb_pl:0.33312)

:0.15633,

(

(

YSCGCN4.gb_pl:0.00000,

YSCGCN4_1.gb_pl:0.00000)

:0.00000,

yscgcn4.gb_pl:0.00000)

:0.48288)

:0.28754,

(

AB016064.gb_pl:0.08285,

AB016065.gb_pl:0.10552)

:0.11331)

:0.03569,

AB016063.gb_pl:0.18615,

AB016066.gb_pl:0.12824);

 

Any dendrogram tree viewer can interpret the distances in the dendrogram file to draw an appropriate dendrogram for the given set of input sequences.
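The dendrogram file shown above uses the parenthesised Newick style, so it can also be read by a short script. The following is a minimal Python sketch, not part of ClustalW+: the names parse_newick and leaf_names and the nested-tuple representation are illustrative assumptions, and only the subset of the format seen in the sample file (parentheses, commas, unquoted names, ":length" values) is handled.

```python
import re


def parse_newick(text):
    """Parse a Newick string into nested (name, length, children) tuples.

    Whitespace and line breaks are ignored, so the file can be passed as-is.
    Internal nodes have an empty name; a missing length is returned as None.
    """
    s = re.sub(r"\s+", "", text).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume '('
            children.append(node())
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ','
                children.append(node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1  # consume ')'
        # Optional node name: everything up to the next structural character.
        name = re.match(r"[^(),:]*", s[pos:]).group(0)
        pos += len(name)
        length = None
        if pos < len(s) and s[pos] == ":":
            m = re.match(r":([-+0-9.eE]+)", s[pos:])
            if m is None:
                raise ValueError(f"bad branch length at offset {pos}")
            length = float(m.group(1))
            pos += len(m.group(0))
        return (name, length, children)

    tree = node()
    if pos != len(s):
        raise ValueError(f"unexpected trailing text at offset {pos}")
    return tree


def leaf_names(tree):
    """Return the sequence names at the tips of the tree, left to right."""
    name, _length, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


# One subtree copied from the sample clustalw.dnd above.
subtree = parse_newick("(AB016064.gb_pl:0.08285,\nAB016065.gb_pl:0.10552)\n:0.11331;")
print(leaf_names(subtree))
```

Applied to the whole sample file, leaf_names would list the ten input sequence names in the order in which they appear in the tree.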

INPUT FILES


Clustalw+ accepts multiple (two or more) nucleotide sequences or multiple (two or more) protein sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example Genbank:*. The function of Clustalw+ depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI of GCG 11.0 manual for information on how to change or set the type of a sequence.

If the input sequences are named in a list file, you can specify the reverse complement strand of any particular nucleotide sequence in the list as input by using the strand:- sequence attribute. You can restrict the range of interest for any particular sequence with appropriate sequence attributes like Begin:43 and End:682. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for more information about sequence attributes in list files.) For example:

 

This is part of a list file suitable for input to CLUSTALW+.

 

                   October 6, 1998  ..

 

PIR:A32493

PIR:S05776        Begin:43 End:682

PIR:B36590

 

///////////////////////////////////////

You can limit the range of interest for all of the sequences in the alignment by including expressions like -BEGin=20 and -END=70 on the command line. The command-line range limiters take precedence over the range limiters for sequences in a list file when both are used. If no range limitation is specified, the entire length of each sequence is aligned.

You can force the program to align the forward strand of all nucleotide sequences by including -NOREVerse on the command line. Conversely, you can force the program to align the reverse complement strand for all nucleotide sequences by including -REVerse on the command line. The command-line strand specification takes precedence over the strand specifications for sequences in a list file when both are used. If no strands are specified, the forward strands of all nucleotide sequences are aligned.

RELATED PROGRAMS

ClustalW+ creates a multiple sequence alignment from a group of related sequences.

RESTRICTIONS

Make sure that all of your sequences have different names; only the first 30 characters of a name are significant. If ClustalW+ finds two or more sequences with the same name, it will fail.

Files prepared with some word processors may yield unpredictable results because hidden control characters may be present in them. It is best to save input files with the Unix format option to avoid hidden Windows characters.
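The name restriction can be checked before a run. The sketch below is an illustration only (the function name find_name_clashes is hypothetical); it assumes that names are compared case-sensitively, which the EXAMPLE section suggests because YSCGCN4.gb_pl and yscgcn4.gb_pl are accepted together.

```python
from collections import defaultdict

SIGNIFICANT_CHARS = 30  # number of leading characters that are significant


def find_name_clashes(names, significant=SIGNIFICANT_CHARS):
    """Group names by their first `significant` characters.

    Returns {truncated_name: [original names]} for every group that
    contains more than one name, i.e. the names that would collide.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:significant]].append(name)
    return {key: vals for key, vals in groups.items() if len(vals) > 1}


names = ["AB016060.gb_pl", "AB016060", "YSCGCN4.gb_pl", "yscgcn4.gb_pl"]
print(find_name_clashes(names))
```

An empty result means that no two names share the same first 30 characters.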

 

ALGORITHM


The basic alignment method

The basic multiple alignment algorithm consists of three main stages: 1) all pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences; 2) a guide tree is calculated from the distance matrix; 3) the sequences are progressively aligned according to the branching order in the guide tree. An example using 7 globin sequences of known tertiary structure (25) is given in figure 1.

1) The distance matrix/pairwise alignments

In the original CLUSTAL programs, the pairwise distances were calculated using a fast approximate method (22). This allows very large numbers of sequences to be aligned, even on a microcomputer. The scores are calculated as the number of k-tuple matches (runs of identical residues, typically 1 or 2 long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment between two sequences minus a fixed penalty for every gap. We now offer a choice between this method and the slower but more accurate scores from full dynamic programming alignments using two gap penalties (for opening or extending gaps) and a full amino acid weight matrix. These scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded). Both of these scores are initially calculated as percent identity scores and are converted to distances by dividing by 100 and subtracting from 1.0 to give number of differences per site. We do not correct for multiple substitutions in these initial distances. In figure 1 we give the 7x7 distance matrix between the 7 globin sequences calculated using the full dynamic programming method.
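The conversion from percent identity scores to distances described above can be sketched in Python as follows. The function names are illustrative assumptions; the three scores are taken from the pairwise alignment log in the EXAMPLE section (sequences 1, 2 and 3).

```python
def percent_identity_to_distance(percent_identity):
    """Convert a percent identity score (0-100) to differences per site."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    return 1.0 - percent_identity / 100.0


def distance_matrix(percent_scores, n):
    """Build a symmetric n x n distance matrix.

    percent_scores maps 1-based index pairs (i, j) to percent identity.
    No correction for multiple substitutions is applied.
    """
    d = [[0.0] * n for _ in range(n)]
    for (i, j), pct in percent_scores.items():
        d[i - 1][j - 1] = d[j - 1][i - 1] = percent_identity_to_distance(pct)
    return d


scores = {(1, 2): 100, (1, 3): 33, (2, 3): 33}
for row in distance_matrix(scores, 3):
    print(["%.2f" % value for value in row])
```

A score of 100 gives a distance of 0.0, and a score of 33 gives a distance of about 0.67.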

2) The guide tree

The trees used to guide the final multiple alignment process are calculated from the distance matrix of step 1 using the Neighbour-Joining method (21). This produces unrooted trees with branch lengths proportional to estimated divergence along each branch. The root is placed by a "mid-point" method (15) at a position where the means of the branch lengths on either side of the root are equal. These trees are also used to derive a weight for each sequence (15). The weights are dependent upon the distance from the root of the tree but sequences which have a common branch with other sequences share the weight derived from the shared branch. In the example in figure 1, the leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442 which is equal to the length of the branch from the root to it. The Human beta globin (Hbb_Human) gets a weight consisting of the length of the branch leading to it that is not shared with any other sequences (0.081) plus half the length of the branch shared with the horse beta globin (0.226/2) plus one quarter the length of the branch shared by all four haemoglobins (0.061/4) plus one fifth the branch shared between the haemoglobins and the myoglobin (0.015/5) plus one sixth the branch leading to all the vertebrate globins (0.062/6). This sums to a total of 0.221. By contrast, in the normal progressive alignment algorithm, all sequences would be equally weighted. The rooted tree with branch lengths and sequence weights for the 7 globins is given in figure 1.
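The weighting rule can be written as a sum over the branches on the path from a sequence to the root, with each branch length divided by the number of sequences that share that branch. The sketch below is an illustration with a hypothetical function name. It uses the rounded branch lengths quoted in the text, and with those values the Hbb_Human sum comes to roughly 0.223 rather than exactly the quoted 0.221; the small difference is presumably due to rounding of the printed branch lengths.

```python
def sequence_weight(path):
    """Sum branch_length / n_sharing over the path from a leaf to the root.

    path is a list of (branch_length, n_sequences_sharing_branch) pairs.
    """
    return sum(length / n_sharing for length, n_sharing in path)


# Branches from Hbb_Human up to the root, as described in the text.
hbb_human = [
    (0.081, 1),  # branch not shared with any other sequence
    (0.226, 2),  # shared with the horse beta globin
    (0.061, 4),  # shared by all four haemoglobins
    (0.015, 5),  # shared by the haemoglobins and the myoglobin
    (0.062, 6),  # leading to all the vertebrate globins
]

# Lgb2_Luplu hangs directly off the root on an unshared branch.
lgb2_luplu = [(0.442, 1)]

print(round(sequence_weight(hbb_human), 3))
print(sequence_weight(lgb2_luplu))
```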

3) Progressive alignment

The basic procedure at this stage is to use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order in the guide tree. You proceed from the tips of the rooted tree towards the root. In the globin example in figure 1 you align the sequences in the following order: human vs. horse beta globin; human vs. horse alpha globin; the 2 alpha globins vs. the 2 beta globins; the myoglobin vs. the haemoglobins; the cyanohaemoglobin vs. the haemoglobins plus myoglobin; the leghaemoglobin vs. all the rest. At each stage a full dynamic programming (26,27) algorithm is used with a residue weight matrix and penalties for opening and extending gaps. Each step consists of aligning two existing alignments or sequences. Gaps that are present in older alignments remain fixed. In the basic algorithm, new gaps that are introduced at each stage get full gap opening and extension penalties, even if they are introduced inside old gap positions (see the section on gap penalties below for modifications to this rule). In order to calculate the score between a position from one sequence or alignment and one from another, the average of all the pairwise weight matrix scores from the amino acids in the two sets of sequences is used; i.e., if you align 2 alignments with 2 and 4 sequences respectively, the score at each position is the average of 8 (2x4) comparisons. This is illustrated in figure 2. If either set of sequences contains one or more gaps in one of the positions being considered, each gap versus a residue is scored as zero. The default amino acid weight matrices we use are rescored to have only positive values. Therefore, this treatment of gaps treats the score of a residue versus a gap as having the worst possible score. When sequences are weighted (see improvements to progressive alignment, below), each weight matrix value is multiplied by the weights from the 2 sequences, as illustrated in figure 2.
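The scoring of one position of one alignment against one position of another can be sketched as follows. This is an illustration under stated assumptions, not the program's code: the function name column_score and the tiny TOY_MATRIX with positive values are made up for the example, the gap character is taken to be the period used in the MSF output, and the weighted sum is assumed to be divided by the number of comparisons.

```python
# Hypothetical, all-positive toy scores; a real run would use a full matrix.
TOY_MATRIX = {("A", "A"): 10, ("A", "G"): 4, ("G", "G"): 10}


def pair_score(matrix, a, b):
    """Look up a symmetric pair score regardless of residue order."""
    return matrix[tuple(sorted((a, b)))]


def column_score(col_a, col_b, matrix, weights_a=None, weights_b=None, gap="."):
    """Average pairwise score between two alignment columns.

    Every residue in col_a is compared with every residue in col_b.
    Any comparison that involves a gap contributes zero. If sequence
    weights are given, each matrix value is multiplied by both weights.
    """
    weights_a = weights_a or [1.0] * len(col_a)
    weights_b = weights_b or [1.0] * len(col_b)
    total = 0.0
    for res_a, w_a in zip(col_a, weights_a):
        for res_b, w_b in zip(col_b, weights_b):
            if res_a == gap or res_b == gap:
                continue  # gap versus residue is scored as zero
            total += pair_score(matrix, res_a, res_b) * w_a * w_b
    return total / (len(col_a) * len(col_b))


# A column from a 2-sequence alignment against a column from a 4-sequence
# alignment: 2 x 4 = 8 comparisons, two of which involve a gap.
print(column_score(["A", "A"], ["A", "G", "G", "."], TOY_MATRIX))
```

With the toy values this prints 4.5, which is (10 + 4 + 4 + 0) counted twice and divided by 8.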

Improvements to progressive alignment

All of the remaining modifications apply only to the final progressive alignment stage. Sequence weighting is relatively straightforward and is already widely used in profile searches (15,16). The treatment of gap penalties is more complicated. Initial gap penalties are calculated depending on the weight matrix, the similarity of the sequences, and the length of the sequences. Then, an attempt is made to derive sensible local gap opening penalties at every position in each pre-aligned group of sequences that will vary as new sequences are added. The use of different weight matrices as the alignment progresses is novel and largely by-passes the problem of initial choice of weight matrix. The final modification allows us to delay the addition of very divergent sequences until the end of the alignment process when all of the more closely related sequences have already been aligned.

Sequence weighting

Sequence weights are calculated directly from the guide tree. The weights are normalised such that the biggest one is set to 1.0 and the rest are all less than one. Groups of closely related sequences receive lowered weights because they contain much duplicated information. Highly divergent sequences without any close relatives receive high weights. These weights are used as simple multiplication factors for scoring positions from different sequences or prealigned groups of sequences. The method is illustrated in figure 2. In the globin example in figure 1, the two alpha globins get downweighted because they are almost duplicate sequences (as do the two beta globins); they receive a combined weight of only slightly more than if a single alpha globin was used.

Initial gap penalties

Initially, two gap penalties are used: a gap opening penalty (GOP) which gives the cost of opening a new gap of any length and a gap extension penalty (GEP) which gives the cost of every item in a gap. Initial values can be set by the user from a menu. The software then automatically attempts to choose appropriate gap penalties for each sequence alignment, depending on the following factors.

1) Dependence on the weight matrix

It has been shown (16,28) that varying the gap penalties used with different weight matrices can improve the accuracy of sequence alignments. Here, we use the average score for two mismatched residues (i.e., off-diagonal values in the matrix) as a scaling factor for the GOP.

2) Dependence on the similarity of the sequences

The percent identity of the two (groups of) sequences to be aligned is used to increase the GOP for closely related sequences and decrease it for more divergent sequences on a linear scale.

3) Dependence on the lengths of the sequences

The scores for both true and false sequence alignments grow with the length of the sequences. We use the logarithm of the length of the shorter sequence to increase the GOP with sequence length.

Using these three modifications, the initial GOP calculated by the program is:

GOP -> (GOP+log(MIN(N,M))) * (average residue mismatch score) * (percent identity scaling factor) where N, M are the lengths of the two sequences.

4) Dependence on the difference in the lengths of the sequences

The GEP is modified depending on the difference between the lengths of the two sequences to be aligned. If one sequence is much shorter than the other, the GEP is increased to inhibit too many long gaps in the shorter sequence. The initial GEP calculated by the program is:

GEP -> GEP*(1.0+|log(N/M)|) where N, M are the lengths of the two sequences.
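The two formulas for the initial penalties can be combined in a short Python sketch. The function name and argument names are illustrative assumptions. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here, and the mismatch score and identity scaling factor are passed in by the caller rather than derived.

```python
import math


def initial_gap_penalties(gop, gep, n, m, avg_mismatch_score, identity_scale):
    """Return the adjusted (GOP, GEP) for two sequences of lengths n and m.

    GOP -> (GOP + log(min(n, m))) * avg_mismatch_score * identity_scale
    GEP -> GEP * (1.0 + |log(n / m)|)
    The natural logarithm is an assumption of this sketch.
    """
    if n <= 0 or m <= 0:
        raise ValueError("sequence lengths must be positive")
    new_gop = (gop + math.log(min(n, m))) * avg_mismatch_score * identity_scale
    new_gep = gep * (1.0 + abs(math.log(n / m)))
    return new_gop, new_gep


# Equal lengths leave the GEP unchanged; unequal lengths raise it.
print(initial_gap_penalties(10.0, 0.2, 100, 100, 1.0, 1.0))
print(initial_gap_penalties(10.0, 0.2, 200, 100, 1.0, 1.0))
```

Because the absolute value of the log ratio is used, swapping the two lengths gives the same GEP.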

Position-specific gap penalties

In most dynamic programming applications, the initial gap opening and extension penalties are applied equally at every position in the sequence, regardless of the location of a gap, except for terminal gaps which are usually allowed at no cost. In ClustalW+, before any pair of sequences or pre-aligned groups of sequences are aligned, we generate a table of gap opening penalties for every position in the two (sets of) sequences. An example is shown in figure 3. We manipulate the initial gap opening penalty in a position specific manner, in order to make gaps more or less likely at different positions.

The local gap penalty modification rules are applied in a hierarchical manner. The exact details of each rule are given below. Firstly, if there is a gap at a position, the gap opening and gap extension penalties are lowered; the other rules do not apply. This makes gaps more likely at positions where there are already gaps. If there is no gap at a position, then the gap opening penalty is increased if the position is within 8 residues of an existing gap. This discourages gaps that are too close together. Finally, at any position within a run of hydrophilic residues, the penalty is decreased. These runs usually indicate loop regions in protein structures. If there is no run of hydrophilic residues, the penalty is modified using a table of residue specific gap propensities (12). These propensities were derived by counting the frequency of each residue at either end of gaps in alignments of proteins of known structure. An illustration of the application of these rules from one part of the globin example, in figure 1, is given in figure 3.

1) Lowered gap penalties at existing gaps

If there are already gaps at a position, then the GOP is reduced in proportion to the number of sequences with a gap at this position and the GEP is lowered by a half. The new gap opening penalty is calculated as:

GOP -> GOP*0.3*(no. of sequences without a gap/no. of sequences).

2) Increased gap penalties near existing gaps

If a position does not have any gaps but is within 8 residues of an existing gap, the GOP is increased by:

GOP -> GOP*(2+((8-distance from gap)*2)/8)

3) Reduced gap penalties in hydrophilic stretches

Any run of 5 hydrophilic residues is considered to be a hydrophilic stretch. The residues that are to be considered hydrophilic may be set by the user but are conservatively set to D, E, G, K, N, Q, P, R or S by default. If, at any position, there are no gaps and any of the sequences has such a stretch, the GOP is reduced by one third.

4) Residue specific penalties

If there is no hydrophilic stretch and the position does not contain any gaps, then the GOP is multiplied by one of the 20 numbers in table 1, depending on the residue. If there is a mixture of residues at a position, the multiplication factor is the average of all the contributions from each sequence.
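The four rules can be sketched as one function that returns the local GOP for a single position. This is an illustration under several assumptions that the text does not settle: the rules are applied as a strict hierarchy in which the first matching rule wins; "within 8 residues" is read as a distance of 8 or less; "reduced by one third" is read as multiplication by 2/3; and the residue-specific factor from table 1, which is not reproduced here, is supplied by the caller. All function and argument names are hypothetical.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues from the text


def in_hydrophilic_run(seq, pos, run=5):
    """True if seq[pos] lies inside a run of at least `run` hydrophilic residues."""
    if seq[pos] not in HYDROPHILIC:
        return False
    start = pos
    while start > 0 and seq[start - 1] in HYDROPHILIC:
        start -= 1
    end = pos
    while end + 1 < len(seq) and seq[end + 1] in HYDROPHILIC:
        end += 1
    return end - start + 1 >= run


def position_gop(gop, n_seqs, n_gaps, dist_to_gap=None,
                 hydrophilic=False, residue_factor=1.0):
    """Local gap opening penalty for one alignment position.

    n_gaps         number of sequences with a gap at this position
    dist_to_gap    distance to the nearest existing gap, or None
    hydrophilic    True if any sequence has a hydrophilic stretch here
    residue_factor averaged multiplier from table 1 (caller-supplied)
    """
    if n_gaps > 0:  # rule 1: lowered penalty at existing gaps
        return gop * 0.3 * (n_seqs - n_gaps) / n_seqs
    if dist_to_gap is not None and dist_to_gap <= 8:  # rule 2: near a gap
        return gop * (2 + ((8 - dist_to_gap) * 2) / 8)
    if hydrophilic:  # rule 3: hydrophilic stretch
        return gop * (2.0 / 3.0)
    return gop * residue_factor  # rule 4: residue-specific factor


print(position_gop(10.0, n_seqs=4, n_gaps=1))
print(position_gop(10.0, n_seqs=4, n_gaps=0, dist_to_gap=4))
print(in_hydrophilic_run("AADEGKNAA", 4))
```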

Weight matrices

Two main series of weight matrices are offered to the user: the Dayhoff PAM series (3) and the BLOSUM series (4). The default is the BLOSUM series. In each case, there is a choice of matrix ranging from strict ones, useful for comparing very closely related sequences, to very "soft" ones that are useful for comparing very distantly related sequences. Depending on the distance between the two sequences or groups of sequences to be compared, we switch between 4 different matrices. The distances are measured directly from the guide tree. The ranges of distances and tables used with the PAM series of matrices are: 80-100%: PAM20, 60-80%: PAM60, 40-60%: PAM120, 0-40%: PAM350. The ranges used with the BLOSUM series are: 80-100%: BLOSUM80, 60-80%: BLOSUM62, 30-60%: BLOSUM45, 0-30%: BLOSUM30.
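The switching between matrices amounts to a table lookup on the percentage value. The sketch below is illustrative only. The text does not say which matrix applies when a value falls exactly on a boundary such as 80%, so the sketch assumes that a boundary value belongs to the higher range; the names choose_matrix, BLOSUM_SERIES and PAM_SERIES are hypothetical.

```python
# (lower bound in percent, matrix name), ordered from highest range down.
BLOSUM_SERIES = [(80, "BLOSUM80"), (60, "BLOSUM62"), (30, "BLOSUM45"), (0, "BLOSUM30")]
PAM_SERIES = [(80, "PAM20"), (60, "PAM60"), (40, "PAM120"), (0, "PAM350")]


def choose_matrix(percent, series=BLOSUM_SERIES):
    """Return the matrix name whose range contains `percent`.

    A value on a boundary is assigned to the higher range (an assumption).
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    for lower_bound, name in series:
        if percent >= lower_bound:
            return name
    raise AssertionError("unreachable: the last range starts at 0")


print(choose_matrix(85))
print(choose_matrix(45))
print(choose_matrix(45, PAM_SERIES))
```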

Divergent sequences

The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences are merged first. This may give a better chance of correctly placing the gaps and matching weakly conserved positions against the rest of the sequences. A choice is offered to set a cutoff (default is 40% identity or less with any other sequence) that will delay the alignment of the divergent sequences until all of the rest have been aligned.
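One way to read the cutoff is that a sequence is delayed when its best pairwise identity with every other sequence is at or below the threshold. The sketch below implements that reading as an assumption, with a hypothetical function name. The scores are a subset of the pairwise log in the EXAMPLE section (sequences 1 to 5 only), so the result illustrates the rule and does not reproduce the program's own decision.

```python
def divergent_sequences(pair_identity, n_seqs, cutoff=40.0):
    """Return the 1-based indices of sequences to delay.

    pair_identity maps (i, j) index pairs to percent identity. A sequence
    is reported when its highest identity to any other sequence is at or
    below `cutoff`.
    """
    best = {i: 0.0 for i in range(1, n_seqs + 1)}
    for (i, j), pct in pair_identity.items():
        best[i] = max(best[i], pct)
        best[j] = max(best[j], pct)
    return [i for i in sorted(best) if best[i] <= cutoff]


scores = {
    (1, 2): 100, (1, 3): 33, (1, 4): 1, (1, 5): 1,
    (2, 3): 33, (2, 4): 1, (2, 5): 1,
    (3, 4): 1, (3, 5): 1,
    (4, 5): 57,
}
print(divergent_sequences(scores, 5))
```

With these scores only sequence 3 is reported, because its best match (33) is below the default cutoff of 40.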

COMMAND-LINE SUMMARY


 

ClustalW+ performs multiple alignments on a set of sequences or a set of previously aligned sequences.

 

 

Minimal Syntax: % clustalw+ [-infile=]value -Default

 

 

Minimal Parameters (case-insensitive):

 

-infile         [Type: InFile / Default: EMPTY / Aliases: infile1 in]

                Input file specification

 

Prompted Parameters (case-insensitive):

 

-action         [Type: String / Default: 'alignslow' / Aliases: act]

                The following actions are supported:

                alignfast: multiple alignment using fast pair wise alignment

                alignslow: multiple alignment using slow pair wise alignment

                tree: generate a phylogenetic tree given the alignment

               

-outfile        [Type: OutFile / Default: EMPTY / Aliases: out]

                Output file produced by ClustalW. This will be an

                alignment file if ACTION is alignfast or alignslow. It

                is a tree file if ACTION is tree. Default output file

                is clustalw.msf
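A non-interactive run can be scripted by assembling the parameters above into an argument list. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the "-name=value" spelling for -action and -outfile is assumed by analogy with the documented "-infile=" form. The function only builds the list and does not run the program.

```python
VALID_ACTIONS = ("alignfast", "alignslow", "tree")


def build_clustalw_command(infile, action="alignslow", outfile=None,
                           use_defaults=True):
    """Build an argument list for a ClustalW+ run without executing it."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {VALID_ACTIONS}")
    cmd = ["clustalw+", f"-infile={infile}", f"-action={action}"]
    if outfile:
        cmd.append(f"-outfile={outfile}")
    if use_defaults:
        cmd.append("-default")  # accept default values for other parameters
    return cmd


print(" ".join(build_clustalw_command("gb_pl.msf", outfile="clustalw.msf")))
```

On a system where the program is installed, the returned list could be handed to subprocess.run(cmd, check=True).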

 

Optional Parameters (case-insensitive):

 

-check          [Type: Boolean / Default: 'false' / Aliases: che help]

                Prints out this usage message

-default        [Type: Boolean / Default: 'false' / Aliases: d def]

                Specifies that sensible default values be used for all

                parameters where possible.

-documentation  [Type: Boolean / Default: 'true' / Aliases: doc]

                Prints banner at program startup

-quiet          [Type: Boolean / Default: 'false' / Aliases: qui]

                Tells application to print only a minimal amount of

                information

-outorder       [Type: String / Default: 'input' / Aliases: ord]

                Whether the sequences should be output in input order

                or the order in which they were aligned. Valid values

                are : Input,   Aligned

-range          [Type: List / Default: EMPTY / Aliases: rng]

                The sequence range to write as a comma-separated

                value, e.g. m,n will write from m to m+n

-pwmatrix       [Type: String / Default: EMPTY]

                The scoring matrix to use for pair wise alignment when

                performing slow pair wise alignments. Depending upon

                sequence type, it may either refer to DNA scoring

                matrix or a protein scoring matrix. This option is

                relevant only when ACTION=alignslow. Valid matrices are

                : blosum, pam, gonnet, id or filename

-pwgapopen      [Type: Double / Default: EMPTY]

                The gap opening penalty during slow pair wise

                alignments. This option is relevant only when  

                ACTION=alignslow.

-pwgapext       [Type: Double / Default: EMPTY]

                The gap extension penalty during slow pair wise

                alignments. This option is relevant only when

                ACTION=alignslow.

-ktuple         [Type: Integer / Default: EMPTY]

                Window size while doing fast pair wise alignment. This

                option is relevant only when ACTION=alignfast.

-topdiags       [Type: Integer / Default: EMPTY]

                Number of windows around best diagonals while doing

                fast pair wise alignment. This option is relevant only

                when ACTION=alignfast.

-window         [Type: Integer / Default: EMPTY]

                Number of windows around each of the top diagonals. This

                option is relevant only when ACTION=alignfast.

-pairgap        [Type: Integer / Default: EMPTY]

                The number of matching residues required to open a gap

                while doing fast pair wise alignments. This option is

                only relevant when ACTION=alignfast.

-score          [Type: String / Default: 'absolute']

                Whether pair wise alignment scores should be reported

                as (raw) absolute scores or percentages

                {absolute|percent}. This option is relevant only

                when ACTION=alignfast.

-matrix         [Type: String / Default: EMPTY]

                The scoring matrix to be used for multiple sequence

                alignments.

-gapopen        [Type: Double / Default: EMPTY]

                The gap opening penalty when doing multiple sequence

                alignments

-gapext         [Type: Double / Default: EMPTY]

                The gap extension penalty when doing multiple sequence

                alignments.

-endgaps        [Type: Boolean / Default: 'false']

                Turn on/off end gap separation penalty

-pgap           [Type: Boolean / Default: 'true']

                Turn on/off residue-specific gap separation penalties

-hgap           [Type: Boolean / Default: 'true']

                Turn on/off hydrophilic residue gaps

-hgapresidues   [Type: List / Default: EMPTY]

                List of hydrophilic residues

-maxdiv         [Type: Double / Default: EMPTY]

                Minimum percentage identity required for delay

-outputtree     [Type: String / Default: 'nj' / Aliases: outtreeformat treefmt]

                The output format of the guide tree produced. Valid

                values: NJ, PHYLIP, DIST, NEXUS

-negative       [Type: Boolean / Default: 'false' / Aliases: neg]

                Sets whether protein matrix contains negative values.

-monitor        [Type: Boolean / Default: 'false' / Aliases: mon]

                Turn on/off result monitoring.

LOCAL DATA FILES

[ Previous | Top | Next ]

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory [$WPROOT/share/matrix/] unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.
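The search order described above can be sketched in Python. This is only an illustration of the lookup rule, not the program's own code. The function names find_matrix and default_search_dirs are hypothetical, and treating each logical name (MyData, GenMoreData, GenRunData) as an environment variable that holds a directory path is an assumption made for the example.

```python
import os
import tempfile
from pathlib import Path


def find_matrix(name, search_dirs):
    """Return the first existing file called `name` in `search_dirs`, else None."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


def default_search_dirs(env=os.environ):
    """Build the search list: local directory first, then the logical names in order."""
    # Assumption: each logical name resolves to a directory through an
    # environment variable of the same name; names that are not set are skipped.
    dirs = [Path.cwd()]
    for logical in ("MyData", "GenMoreData", "GenRunData"):
        if env.get(logical):
            dirs.append(Path(env[logical]))
    return dirs


if __name__ == "__main__":
    # Demonstration with throwaway directories standing in for the real ones.
    with tempfile.TemporaryDirectory() as root:
        my_data = Path(root) / "mydata"
        public = Path(root) / "public"
        my_data.mkdir()
        public.mkdir()
        (my_data / "mymatrix.cmp").write_text("personal copy\n")
        (public / "mymatrix.cmp").write_text("public copy\n")
        print(find_matrix("mymatrix.cmp", [my_data, public]))
```

Because the directories are tried in order, a copy in an earlier directory hides a file of the same name in a later one, which is why a matrix in your own directory takes precedence over the public copy.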

ClustalW+ reads a scoring matrix, containing a value for every possible symbol comparison, from your local directory or the public database. The file Clustalw+dna.cmp has a 10 at every position where the sets of bases implied by the two IUB ambiguity codes (see Appendix III) overlap. All of the other positions have zeros. The file blosum62.cmp is based on substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff. The scores in this matrix for pairwise amino acid comparisons range from -4 to +11. You can use the Fetch+ program to copy these files and then modify them to suit your own needs. (See the CONSIDERATIONS topic for more information about scoring matrices.)
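The overlap rule for the nucleotide matrix can be illustrated with a short Python sketch. It rebuilds the scores from the standard IUB ambiguity codes rather than reading Clustalw+dna.cmp, so the symbol set is an assumption and the real file may include symbols not listed here. The names IUB, overlap_score and build_matrix are hypothetical.

```python
# Standard IUB nucleotide ambiguity codes and the bases each one implies.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def overlap_score(a, b, match=10, mismatch=0):
    """Score 10 when the two codes share at least one base, otherwise 0."""
    shared = set(IUB[a.upper()]) & set(IUB[b.upper()])
    return match if shared else mismatch


def build_matrix(symbols=tuple(IUB)):
    """Return a dictionary mapping every ordered symbol pair to its score."""
    return {(a, b): overlap_score(a, b) for a in symbols for b in symbols}


if __name__ == "__main__":
    matrix = build_matrix()
    print(matrix[("A", "R")], matrix[("A", "Y")])
```

For example, A and R (A or G) share the base A and score 10, while A and Y (C or T) share nothing and score 0.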

 

 

PARAMETER REFERENCE

[ Previous | Top ]

-CHEck

 

prints out this usage message.

 

 

-DEFault

 

specifies that sensible default values be used for all parameters where possible.

 

 

-DOCumentation

 

prints banner at program startup.

 

-QUIet

 

tells application to print only a minimal amount of information.

 

-outorder

 

whether the sequences should be output in input order or in the order in which they were aligned. Valid values are: Input, Aligned.

 

-range

 

the sequence range to write, given as a comma-separated pair of values; for example, m,n will write from position m to position m+n.
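A small Python sketch may make the m,n convention concrete: the second value is an offset from the first, not an end position. The function names parse_range and slice_alignment are hypothetical, and the sketch assumes positions are counted from 1 with both ends included, which this manual does not state.

```python
def parse_range(value):
    """Turn a string such as '5,10' into (start, end) with end = m + n."""
    first, sep, second = value.partition(",")
    if not sep:
        raise ValueError("expected two comma-separated integers, e.g. 5,10")
    m, n = int(first), int(second)
    if m < 1 or n < 0:
        raise ValueError("m must be at least 1 and n must not be negative")
    return m, m + n


def slice_alignment(rows, value):
    """Cut the same column range out of every aligned row."""
    start, end = parse_range(value)
    # Assumption: positions are 1-based and both ends are included.
    return [row[start - 1:end] for row in rows]


if __name__ == "__main__":
    print(parse_range("5,10"))
    print(slice_alignment(["ABCDEFGHIJ", "abcdefghij"], "2,3"))
```

Under these assumptions, a range of 5,10 covers positions 5 through 15.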

 

-pwmatrix

 

the scoring matrix to use when performing slow pairwise alignments. Depending upon the sequence type, it refers to either a DNA scoring matrix or a protein scoring matrix. This option is relevant only when ACTION=alignslow. Valid matrices are: blosum, pam, gonnet, id, or a filename.

 

-pwgapopen

 

the gap opening penalty during slow pairwise alignments. This option is relevant only when ACTION=alignslow.

 

-pwgapext

 

the gap extension penalty during slow pairwise alignments. This option is relevant only when ACTION=alignslow.

 

-ktuple

 

word (k-tuple) size used while doing fast pairwise alignment. This option is relevant only when ACTION=alignfast.

 

-topdiags

 

number of best diagonals to consider while doing fast pairwise alignment. This option is relevant only when ACTION=alignfast.

 

-window

 

size of the window around each of the top diagonals. This option is relevant only when ACTION=alignfast.

 

-pairgap

 

the number of matching residues required to open a gap while doing fast pairwise alignments. This option is relevant only when ACTION=alignfast.

 

-score

 

whether pairwise alignment scores should be reported as (raw) absolute scores or as percentages {absolute|percent}. This option is relevant only when ACTION=alignfast.

 

-matrix

 

the scoring matrix to be used for multiple sequence alignments.

 

-gapopen

 

the gap opening penalty when doing multiple sequence alignments.

 

-gapext

 

the gap extension penalty when doing multiple sequence alignments.

-gapdist

 

Gap Separation distance is used to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalised more than other gaps.
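The idea of a gap separation distance can be shown with a brief Python sketch that finds neighbouring gaps lying closer together than the chosen distance. It illustrates only which gaps would attract the extra penalty; how much larger that penalty is, and how the program measures distance internally, are not described here. The function name close_gap_pairs and the use of column numbers as gap positions are assumptions.

```python
def close_gap_pairs(gap_positions, gapdist):
    """Return neighbouring gap positions that are less than `gapdist` apart."""
    positions = sorted(gap_positions)
    # Compare each gap with the next one along the alignment.
    return [(a, b) for a, b in zip(positions, positions[1:]) if b - a < gapdist]


if __name__ == "__main__":
    # Hypothetical gap start columns in one aligned sequence.
    print(close_gap_pairs([3, 5, 20, 40], gapdist=8))
```

With gaps at columns 3, 5, 20 and 40 and a distance of 8, only the pair at columns 3 and 5 is close enough to be flagged.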

 

-endgaps

 

turn on/off end gap separation penalty.

 

-pgap

 

turn on/off residue-specific gap penalty.

 

-hgap

 

turn on/off hydrophilic residue gaps.

 

-hgapresidues

 

list of hydrophilic residues.

 

-maxdiv

 

minimum percentage identity required for delay.

 

-outputtree

 

the output format of the guide tree produced. Valid values: NJ, PHYLIP, DIST, NEXUS.

 

-NEGative

 

sets whether protein matrix contains negative values.

 

-MONitor

 

turn on/off result monitoring.

 

Printed: January 27, 2005  15:04