DENDROGRAM
ClustalW+ creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also create a dendrogram (.dnd) showing the clustering relationships used to create the alignment.
The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular biology. Multiple alignments are used to find diagnostic patterns to characterize protein families; to detect or demonstrate homology between new sequences and existing families of sequences; to help predict the secondary and tertiary structures of new sequences; to suggest oligonucleotide primers for PCR; and as an essential prelude to molecular evolutionary analysis. The rate of appearance of new sequence data is steadily increasing, and the development of efficient and accurate automatic methods for multiple alignment is, therefore, of major importance.
The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment.
Before alignment, the sequences are first clustered by similarity to produce a dendrogram, or tree representation of clustering relationships. It is this dendrogram that directs the order of the subsequent pairwise alignments.
18:31~43> clustalw+
ClustalW+ is a general purpose multiple sequence alignment program for DNA or Proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.
clustalw+ of what sequence(s) (* *) ? gb_pl.msf
What kind of action (alignfast, alignslow, tree) : (* alignslow *) ? alignslow
What should I call the output file.
Default name for alignment file: clustalw.msf.
Default name for tree file : clustalw.ph
Output file: (* *) ?
Creating clustalw.msf as the output file.
CLUSTAL W (1.83) Multiple Sequence Alignments
Sequence format is Clustalw+/MSF
Sequence 1: AB016060.gb_pl 1824 bp
Sequence 2: AB016060 1824 bp
Sequence 3: AB016062.gb_pl 1824 bp
Sequence 4: AB016063.gb_pl 1824 bp
Sequence 5: AB016064.gb_pl 1824 bp
Sequence 6: AB016065.gb_pl 1824 bp
Sequence 7: AB016066.gb_pl 1824 bp
Sequence 8: YSCGCN4.gb_pl 1824 bp
Sequence 9: YSCGCN4_1.gb_pl 1824 bp
Sequence 10: yscgcn4.gb_pl 1824 bp
Start of Pair wise alignments
Aligning...
Sequences (1:2) Aligned. Score: 100
Sequences (1:3) Aligned. Score: 33
Sequences (1:4) Aligned. Score: 1
Sequences (1:5) Aligned. Score: 1
Sequences (1:6) Aligned. Score: 0
Sequences (1:7) Aligned. Score: 4
Sequences (1:8) Aligned. Score: 2
Sequences (1:9) Aligned. Score: 2
Sequences (1:10) Aligned. Score: 2
Sequences (2:3) Aligned. Score: 33
Sequences (2:4) Aligned. Score: 1
Sequences (2:5) Aligned. Score: 1
Sequences (2:6) Aligned. Score: 0
Sequences (2:7) Aligned. Score: 4
Sequences (2:8) Aligned. Score: 2
Sequences (2:9) Aligned. Score: 2
Sequences (2:10) Aligned. Score: 2
Sequences (3:4) Aligned. Score: 1
Sequences (3:5) Aligned. Score: 1
Sequences (3:6) Aligned. Score: 3
Sequences (3:7) Aligned. Score: 1
Sequences (3:8) Aligned. Score: 3
Sequences (3:9) Aligned. Score: 3
Sequences (3:10) Aligned. Score: 3
Sequences (4:5) Aligned. Score: 57
Sequences (4:6) Aligned. Score: 53
Sequences (4:7) Aligned. Score: 68
Sequences (4:8) Aligned. Score: 1
Sequences (4:9) Aligned. Score: 1
Sequences (4:10) Aligned. Score: 1
Sequences (5:6) Aligned. Score: 81
Sequences (5:7) Aligned. Score: 65
Sequences (5:8) Aligned. Score: 1
Sequences (5:9) Aligned. Score: 1
Sequences (5:10) Aligned. Score: 1
Sequences (6:7) Aligned. Score: 62
Sequences (6:8) Aligned. Score: 1
Sequences (6:9) Aligned. Score: 1
Sequences (6:10) Aligned. Score: 1
Sequences (7:8) Aligned. Score: 6
Sequences (7:9) Aligned. Score: 6
Sequences (7:10) Aligned. Score: 6
Sequences (8:9) Aligned. Score: 100
Sequences (8:10) Aligned. Score: 100
Sequences (9:10) Aligned. Score: 100
Guide tree file created: [/var/tmp/bslskAAAomayNz.dnd]
Start of Multiple Alignment
There are 9 groups
Aligning...
Group 1: Sequences: 2 Score:15353
Group 2: Sequences: 2 Score:23055
Group 3: Sequences: 4 Score:17100
Group 4: Delayed
Group 5: Sequences: 3 Score:34656
Group 6: Sequences: 7 Score:12169
Group 7: Sequences: 2 Score:28956
Group 8: Sequences: 3 Score:16945
Group 9: Sequences: 10 Score:10629
Alignment Score 142840
GCG-Alignment file created [/var/tmp/bslskBAApmayNz.msf]
Moved tree file from /var/tmp/bslskAAAomayNz.dnd to clustalw.dnd
Moved alignment file from /var/tmp/bslskBAApmayNz.msf to clustalw.msf
Here is some portion of the output: clustalw.msf
!!NA_MULTIPLE_ALIGNMENT 1.0
MSF: 2420 Type: N December 07, 2004 18:34 Check: 4225 ..
Name: AB016060.gb_pl Len: 2420 Check: 6882 Weight: 1.0
Name: AB016060 Len: 2420 Check: 6882 Weight: 1.0
Name: AB016062.gb_pl Len: 2420 Check: 6225 Weight: 1.0
Name: AB016063.gb_pl Len: 2420 Check: 6517 Weight: 1.0
Name: AB016064.gb_pl Len: 2420 Check: 3692 Weight: 1.0
Name: AB016065.gb_pl Len: 2420 Check: 668 Weight: 1.0
Name: AB016066.gb_pl Len: 2420 Check: 7136 Weight: 1.0
Name: YSCGCN4.gb_pl Len: 2420 Check: 8741 Weight: 1.0
Name: YSCGCN4_1.gb_pl Len: 2420 Check: 8741 Weight: 1.0
Name: yscgcn4.gb_pl Len: 2420 Check: 8741 Weight: 1.0
//
1 50
AB016060.gb_pl .......... .......... .......... .......... ..........
AB016060 .......... .......... .......... .......... ..........
AB016062.gb_pl .......... .......... .......... .......... ..........
AB016063.gb_pl .......... .......... .......... .......... ..........
AB016064.gb_pl .......... .......... .......... .......... ..........
AB016065.gb_pl .......... .......... .......... .......... ..........
AB016066.gb_pl .......... .......... .......... .......... ..........
YSCGCN4.gb_pl ATCTTCGGGG ATATAAAGTG CATGAGCATA CATCTTGAAA AAAAAAGATG
YSCGCN4_1.gb_pl ATCTTCGGGG ATATAAAGTG CATGAGCATA CATCTTGAAA AAAAAAGATG
yscgcn4.gb_pl ATCTTCGGGG ATATAAAGTG CATGAGCATA CATCTTGAAA AAAAAAGATG
51 100
AB016060.gb_pl .......... .......... .......... .......... ..........
AB016060 .......... .......... .......... .......... ..........
AB016062.gb_pl .......... .......... .......... .......... ..........
AB016063.gb_pl .......... .......... .......... .......... ..........
AB016064.gb_pl .......... .......... .......... .......... ..........
AB016065.gb_pl .......... .......... .......... .......... ..........
AB016066.gb_pl .......... .......... .......... .......... ..........
YSCGCN4.gb_pl AAAAATTTCC GACTTTAAAT ACGGAAGATA AATACTCCAA CCTTTTTTTC
YSCGCN4_1.gb_pl AAAAATTTCC GACTTTAAAT ACGGAAGATA AATACTCCAA CCTTTTTTTC
yscgcn4.gb_pl AAAAATTTCC GACTTTAAAT ACGGAAGATA AATACTCCAA CCTTTTTTTC
101 150
AB016060.gb_pl .......... .......... ...ATGGTGT TGTCTGAGTC CAACTTCCTG
AB016060 .......... .......... ...ATGGTGT TGTCTGAGTC CAACTTCCTG
AB016062.gb_pl .......... .......... .......... ..ATGGATTT CTACACACT.
AB016063.gb_pl ..AAGCAAAC GCAGCATTGG GAGATAGAAA GAGAGAGAGA AAGAGAGAGA
AB016064.gb_pl .......... .......... .......... .......... ..........
AB016065.gb_pl .......... ......TTCA CCCTCCGCCG CCTCGTCAAT TCCACGCGAA
AB016066.gb_pl .......... .......... .......... .......... ..........
YSCGCN4.gb_pl CAATTCCGAA ATTTTAGTCT TCTTTAAAGA AGTTTCGGCT CGCTGTCTTA
YSCGCN4_1.gb_pl CAATTCCGAA ATTTTAGTCT TCTTTAAAGA AGTTTCGGCT CGCTGTCTTA
yscgcn4.gb_pl CAATTCCGAA ATTTTAGTCT TCTTTAAAGA AGTTTCGGCT CGCTGTCTTA
151 200
AB016060.gb_pl TTATGTCTTA TTTCCATTTC AATAGCTTCT GTTTTCTTCT TTCTCTTGAA
AB016060 TTATGTCTTA TTTCCATTTC AATAGCTTCT GTTTTCTTCT TTCTCTTGAA
AB016062.gb_pl .TGCGT.TTG GATCAATTTT ..TGCCTGCG GTTTGCTTTA TATTCTAGCA
AB016063.gb_pl GAGAAAGACC CTTACCCTTC TCTATCGCTC GCTTTCCTTT GACGCTTCTG
AB016064.gb_pl .......... .......... .......... .......... ..TGCAAAAA
AB016065.gb_pl CGCGAGAGCT CTCGGAAAGC ACCACCACCA GCACAGAGCC AGCGCGAGAG
AB016066.gb_pl .......... .......... .......... .......... ..........
YSCGCN4.gb_pl CCTTTTAAAA TCTTCTACTT CTTGACAGTA CTTATCTTCT TATATAATAG
YSCGCN4_1.gb_pl CCTTTTAAAA TCTTCTACTT CTTGACAGTA CTTATCTTCT TATATAATAG
yscgcn4.gb_pl CCTTTTAAAA TCTTCTACTT CTTGACAGTA CTTATCTTCT TATATAATAG
The gaps at the ends of each sequence are written as dots (.), which may represent differences in input sequence lengths rather than missing characters or significant differences in the alignment. Internal gaps in each sequence are written as periods (.). See Appendix III of GCG 11.0 manual for more information about the two different gap characters.
Clustalw+ creates a dendrogram file called clustalw.dnd (default). It has the following information.
(
(
(
(
(
AB016060.gb_pl:0.00000,
AB016060:0.00000)
:0.33486,
AB016062.gb_pl:0.33312)
:0.15633,
(
(
YSCGCN4.gb_pl:0.00000,
YSCGCN4_1.gb_pl:0.00000)
:0.00000,
yscgcn4.gb_pl:0.00000)
:0.48288)
:0.28754,
(
AB016064.gb_pl:0.08285,
AB016065.gb_pl:0.10552)
:0.11331)
:0.03569,
AB016063.gb_pl:0.18615,
AB016066.gb_pl:0.12824);
Any dendrogram tree viewer can interpret the distances given in the dendrogram file to draw an appropriate dendrogram for the given set of input sequences.
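The dendrogram file shown above is plain nested-parenthesis text in which each sequence name is followed by a colon and a branch length. As a rough illustration of how a viewer might start reading it, the sketch below pulls out the name and terminal branch length of every leaf. The function name leaf_branch_lengths and the shortened tree text are assumptions made for this example; they are not part of Clustalw+.

```python
import re

# Abbreviated tree text in the same style as clustalw.dnd above.
newick = """
(
(
AB016064.gb_pl:0.08285,
AB016065.gb_pl:0.10552)
:0.11331,
AB016063.gb_pl:0.18615,
AB016066.gb_pl:0.12824);
"""

def leaf_branch_lengths(text):
    """Return {sequence name: terminal branch length}.

    Internal branches are written as ':length' with no name in front,
    so requiring a name before the colon skips them.
    """
    pairs = re.findall(r"([A-Za-z0-9_.]+):([0-9.]+)", text)
    return {name: float(length) for name, length in pairs}

leaves = leaf_branch_lengths(newick)
for name, length in leaves.items():
    print(f"{name:20s} {length:.5f}")
```

This only lists the leaves; drawing the tree also requires tracking the nesting of the parentheses, which a full tree viewer does.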
Clustalw+ accepts multiple (two or more) nucleotide sequences or multiple (two or more) protein sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example Genbank:*. The function of Clustalw+ depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI of GCG 11.0 manual for information on how to change or set the type of a sequence.
If the input sequences are named in a list file, you can specify the reverse complement strand of any particular nucleotide sequence in the list as input by using the strand:- sequence attribute. You can restrict the range of interest for any particular sequence with appropriate sequence attributes like Begin:43 and End:682. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for more information about sequence attributes in list files.) For example:
This is part of a list file suitable for input to CLUSTALW+.
October 6, 1998 ..
PIR:A32493
PIR:S05776 Begin:43 End:682
PIR:B36590
///////////////////////////////////////
You can limit the range of interest for all of the sequences in the alignment by including expressions like -BEGin=20 and -END=70 on the command line. The command-line range limiters take precedence over the range limiters for sequences in a list file when both are used. If no range limitation is specified, the entire length of each sequence is aligned.
You can force the program to align the forward strand of all nucleotide sequences by including -NOREVerse on the command line. Conversely, you can force the program to align the reverse complement strand for all nucleotide sequences by including -REVerse on the command line. The command-line strand specification takes precedence over the strand specifications for sequences in a list file when both are used. If no strands are specified, the forward strands of all nucleotide sequences are aligned.
Clustalw+ creates a multiple sequence alignment from a group of related sequences
Please make sure that your sequences have different names as the first 30 characters of the name are significant. If Clustalw+ finds two or more sequences with the same name it will fail!
Some word processors may yield unpredictable results because hidden or control characters may be present in the files. It is best to save files with the Unix format option to avoid hidden Windows characters while preparing the input files.
The basic multiple alignment algorithm consists of three main stages: 1) all pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences; 2) a guide tree is calculated from the distance matrix; 3) the sequences are progressively aligned according to the branching order in the guide tree. An example using 7 globin sequences of known tertiary structure (25) is given in figure 1.
1) The distance matrix/pairwise alignments
In the original CLUSTAL programs, the pairwise distances were calculated using a fast approximate method (22). This allows very large numbers of sequences to be aligned, even on a microcomputer. The scores are calculated as the number of k-tuple matches (runs of identical residues, typically 1 or 2 long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment between two sequences minus a fixed penalty for every gap. We now offer a choice between this method and the slower but more accurate scores from full dynamic programming alignments using two gap penalties (for opening or extending gaps) and a full amino acid weight matrix. These scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded). Both of these scores are initially calculated as percent identity scores and are converted to distances by dividing by 100 and subtracting from 1.0 to give number of differences per site. We do not correct for multiple substitutions in these initial distances. In figure 1 we give the 7x7 distance matrix between the 7 globin sequences calculated using the full dynamic programming method.
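The conversion from percent identity to distance described above is a one-line calculation. The sketch below applies it to a few of the pairwise scores from the sample session, assuming those reported scores are percent identities; the function name identity_to_distance is chosen for this example only.

```python
def identity_to_distance(percent_identity):
    """Convert a percent identity score (0-100) to differences per site."""
    return 1.0 - percent_identity / 100.0

# A few (sequence i, sequence j) scores from the sample session above.
pair_scores = {(4, 5): 57, (5, 6): 81, (8, 9): 100}

distances = {pair: identity_to_distance(score)
             for pair, score in pair_scores.items()}

for pair, dist in sorted(distances.items()):
    print(pair, round(dist, 2))
```

No correction for multiple substitutions is applied, which matches the description of the initial distances.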
2) The guide tree
The trees used to guide the final multiple alignment process are calculated from the distance matrix of step 1 using the Neighbour-Joining method (21). This produces unrooted trees with branch lengths proportional to estimated divergence along each branch. The root is placed by a "mid-point" method (15) at a position where the means of the branch lengths on either side of the root are equal. These trees are also used to derive a weight for each sequence (15). The weights are dependent upon the distance from the root of the tree but sequences which have a common branch with other sequences share the weight derived from the shared branch. In the example in figure 1, the leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442 which is equal to the length of the branch from the root to it. The Human beta globin (Hbb_Human) gets a weight consisting of the length of the branch leading to it that is not shared with any other sequences (0.081) plus half the length of the branch shared with the horse beta globin (0.226/2) plus one quarter the length of the branch shared by all four haemoglobins (0.061/4) plus one fifth the branch shared between the haemoglobins and the myoglobin (0.015/5) plus one sixth the branch leading to all the vertebrate globins (0.062/6). This sums to a total of 0.221. By contrast, in the normal progressive alignment algorithm, all sequences would be equally weighted. The rooted tree with branch lengths and sequence weights for the 7 globins is given in figure 1.
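The weight for Hbb_Human can be reproduced by walking from the sequence towards the root and dividing each branch length by the number of sequences that share that branch. The sketch below uses the rounded branch lengths quoted in the text; the names hbb_human_path and sequence_weight are assumptions made for this example. With these rounded values the sum comes to roughly 0.22, close to the 0.221 quoted above.

```python
# (branch length, number of sequences sharing the branch), listed from
# the tip towards the root, using the figures quoted for Hbb_Human.
hbb_human_path = [
    (0.081, 1),  # branch unique to Hbb_Human
    (0.226, 2),  # shared with horse beta globin
    (0.061, 4),  # shared by all four haemoglobins
    (0.015, 5),  # shared by haemoglobins and myoglobin
    (0.062, 6),  # branch leading to all vertebrate globins
]

def sequence_weight(path):
    """Sum each branch length divided by the number of sequences sharing it."""
    return sum(length / n_sharing for length, n_sharing in path)

weight = sequence_weight(hbb_human_path)
print(round(weight, 3))
```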
3) Progressive alignment
The basic procedure at this stage is to use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order in the guide tree. You proceed from the tips of the rooted tree towards the root. In the globin example in figure 1 you align the sequences in the following order: human vs. horse beta globin; human vs. horse alpha globin; the 2 alpha globins vs. the 2 beta globins; the myoglobin vs. the haemoglobins; the cyanohaemoglobin vs the haemoglobins plus myoglobin; the leghaemoglobin vs. all the rest. At each stage a full dynamic programming (26,27) algorithm is used with a residue weight matrix and penalties for opening and extending gaps. Each step consists of aligning two existing alignments or sequences. Gaps that are present in older alignments remain fixed. In the basic algorithm, new gaps that are introduced at each stage get full gap opening and extension penalties, even if they are introduced inside old gap positions (see the section on gap penalties below for modifications to this rule). In order to calculate the score between a position from one sequence or alignment and one from another, the average of all the pairwise weight matrix scores from the amino acids in the two sets of sequences is used i.e. if you align 2 alignments with 2 and 4 sequences respectively, the score at each position is the average of 8 (2x4) comparisons. This is illustrated in figure 2. If either set of sequences contains one or more gaps in one of the positions being considered, each gap versus a residue is scored as zero. The default amino acid weight matrices we use are rescored to have only positive values. Therefore, this treatment of gaps treats the score of a residue versus a gap as having the worst possible score. When sequences are weighted (see improvements to progressive alignment, below), each weight matrix value is multiplied by the weights from the 2 sequences, as illustrated in figure 2.
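The scoring of one alignment position against another, as described above, can be sketched as follows. The toy matrix TOY_MATRIX, the gap symbol "-" and the function names are assumptions made for this example and are not the matrices or code used by ClustalW+. A gap against a residue contributes zero, and each matrix value is multiplied by the weights of the two sequences involved.

```python
# Toy all-positive weight matrix, keyed by a sorted residue pair (hypothetical).
TOY_MATRIX = {("A", "A"): 4, ("A", "S"): 1, ("S", "S"): 4}

def pair_score(x, y, matrix=TOY_MATRIX):
    """Look up the matrix score for two residues in either order."""
    return matrix[tuple(sorted((x, y)))]

def column_score(col_a, col_b, weights_a=None, weights_b=None):
    """Average weighted score of every residue in col_a against col_b."""
    weights_a = weights_a or [1.0] * len(col_a)
    weights_b = weights_b or [1.0] * len(col_b)
    total = 0.0
    for x, wx in zip(col_a, weights_a):
        for y, wy in zip(col_b, weights_b):
            if x == "-" or y == "-":
                continue  # gap versus residue is scored as zero
            total += wx * wy * pair_score(x, y)
    # Average over all comparisons, e.g. 2 x 4 = 8 for the case in the text.
    return total / (len(col_a) * len(col_b))

# One column from an alignment of 2 sequences against one of 4 sequences.
score = column_score(["A", "S"], ["A", "A", "S", "-"])
print(score)
```

Because the toy matrix has only positive values, a gap scored as zero is the worst possible score at that position, as the text describes.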
Improvements to progressive alignment
All of the remaining modifications apply only to the final progressive alignment stage. Sequence weighting is relatively straightforward and is already widely used in profile searches (15,16). The treatment of gap penalties is more complicated. Initial gap penalties are calculated depending on the weight matrix, the similarity of the sequences, and the length of the sequences. Then, an attempt is made to derive sensible local gap opening penalties at every position in each pre-aligned group of sequences that will vary as new sequences are added. The use of different weight matrices as the alignment progresses is novel and largely by-passes the problem of initial choice of weight matrix. The final modification allows us to delay the addition of very divergent sequences until the end of the alignment process when all of the more closely related sequences have already been aligned.
Sequence weighting
Sequence weights are calculated directly from the guide tree. The weights are normalised such that the biggest one is set to 1.0 and the rest are all less than one. Groups of closely related sequences receive lowered weights because they contain much duplicated information. Highly divergent sequences without any close relatives receive high weights. These weights are used as simple multiplication factors for scoring positions from different sequences or prealigned groups of sequences. The method is illustrated in figure 2. In the globin example in figure 1, the two alpha globins get downweighted because they are almost duplicate sequences (as do the two beta globins); they receive a combined weight of only slightly more than if a single alpha globin was used.
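The normalisation step described above amounts to dividing every weight by the largest one. A minimal sketch, using the two raw weights quoted for the globin example and a hypothetical helper name normalise_weights:

```python
def normalise_weights(raw_weights):
    """Scale weights so that the largest becomes 1.0."""
    biggest = max(raw_weights.values())
    return {name: w / biggest for name, w in raw_weights.items()}

# Raw weights quoted in the guide tree section for two of the globins.
raw = {"Lgb2_Luplu": 0.442, "Hbb_Human": 0.221}
normalised = normalise_weights(raw)

for name, w in normalised.items():
    print(name, round(w, 3))
```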
Initial gap penalties
Initially, two gap penalties are used: a gap opening penalty (GOP) which gives the cost of opening a new gap of any length and a gap extension penalty (GEP) which gives the cost of every item in a gap. Initial values can be set by the user from a menu. The software then automatically attempts to choose appropriate gap penalties for each sequence alignment, depending on the following factors.
1) Dependence on the weight matrix
It has been shown (16,28) that varying the gap penalties used with different weight matrices can improve the accuracy of sequence alignments. Here, we use the average score for two mismatched residues (i.e. off-diagonal values in the matrix) as a scaling factor for the GOP.
2) Dependence on the similarity of the sequences
The percent identity of the two (groups of) sequences to be aligned is used to increase the GOP for closely related sequences and decrease it for more divergent sequences on a linear scale.
3) Dependence on the lengths of the sequences
The scores for both true and false sequence alignments grow with the length of the sequences. We use the logarithm of the length of the shorter sequence to increase the GOP with sequence length.
Using these three modifications, the initial GOP calculated by the program is:
GOP -> (GOP+log(MIN(N,M))) * (average residue mismatch score) * (percent identity scaling factor) where N, M are the lengths of the two sequences.
4) Dependence on the difference in the lengths of the sequences
The GEP is modified depending on the difference between the lengths of the two sequences to be aligned. If one sequence is much shorter than the other, the GEP is increased to inhibit too many long gaps in the shorter sequence. The initial GEP calculated by the program is:
GEP -> GEP*(1.0+|log(N/M)|) where N, M are the lengths of the two sequences.
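The two initial penalty formulas above can be written out directly. This is only a sketch: the text does not state the base of the logarithm, so the natural logarithm is assumed here, and the mismatch score and identity scaling factor are passed in as plain numbers because their exact calculation is not given. The function names are chosen for this example.

```python
import math

def initial_gop(gop, n, m, mismatch_score, identity_scale):
    """GOP -> (GOP + log(MIN(N,M))) * mismatch score * identity scaling."""
    return (gop + math.log(min(n, m))) * mismatch_score * identity_scale

def initial_gep(gep, n, m):
    """GEP -> GEP * (1.0 + |log(N/M)|)."""
    return gep * (1.0 + abs(math.log(n / m)))

# Equal lengths leave the GEP unchanged; unequal lengths raise it.
print(initial_gep(0.2, 100, 100))
print(initial_gep(0.2, 50, 200))
print(initial_gop(10.0, 100, 200, 1.0, 1.0))
```

The absolute value makes the GEP adjustment symmetric, so it does not matter which of the two sequences is the shorter one.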
Position-specific gap penalties
In most dynamic programming applications, the initial gap opening and extension penalties are applied equally at every position in the sequence, regardless of the location of a gap, except for terminal gaps which are usually allowed at no cost. In ClustalW+, before any pair of sequences or pre-aligned groups of sequences are aligned, we generate a table of gap opening penalties for every position in the two (sets of) sequences. An example is shown in figure 3. We manipulate the initial gap opening penalty in a position specific manner, in order to make gaps more or less likely at different positions.
The local gap penalty modification rules are applied in a hierarchical manner. The exact details of each rule are given below. Firstly, if there is a gap at a position, the gap opening and gap extension penalties are lowered; the other rules do not apply. This makes gaps more likely at positions where there are already gaps. If there is no gap at a position, then the gap opening penalty is increased if the position is within 8 residues of an existing gap. This discourages gaps that are too close together. Finally, at any position within a run of hydrophilic residues, the penalty is decreased. These runs usually indicate loop regions in protein structures. If there is no run of hydrophilic residues, the penalty is modified using a table of residue specific gap propensities (12). These propensities were derived by counting the frequency of each residue at either end of gaps in alignments of proteins of known structure. An illustration of the application of these rules from one part of the globin example, in figure 1, is given in figure 3.
1) Lowered gap penalties at existing gaps
If there are already gaps at a position, then the GOP is reduced in proportion to the number of sequences with a gap at this position and the GEP is lowered by a half. The new gap opening penalty is calculated as:
GOP -> GOP*0.3*(no. of sequences without a gap/no. of sequences).
2) Increased gap penalties near existing gaps
If a position does not have any gaps but is within 8 residues of an existing gap, the GOP is increased by:
GOP -> GOP*(2+((8-distance from gap)*2)/8)
3) Reduced gap penalties in hydrophilic stretches
Any run of 5 hydrophilic residues is considered to be a hydrophilic stretch. The residues that are to be considered hydrophilic may be set by the user but are conservatively set to D, E, G, K, N, Q, P, R or S by default. If, at any position, there are no gaps and any of the sequences has such a stretch, the GOP is reduced by one third.
4) Residue specific penalties
If there is no hydrophilic stretch and the position does not contain any gaps, then the GOP is multiplied by one of the 20 numbers in table 1, depending on the residue. If there is a mixture of residues at a position, the multiplication factor is the average of all the contributions from each sequence.
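Rules 1 to 3 above can each be expressed as a small function. The sketch below keeps them separate and does not attempt to reproduce how the program combines them; rule 4 is omitted because it depends on the values in table 1. Reading "reduced by one third" as multiplying by 2/3 is an assumption, as are all function names.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default residues listed under rule 3

def gop_at_existing_gap(gop, n_seqs, n_with_gap):
    """Rule 1: GOP * 0.3 * (sequences without a gap / all sequences)."""
    return gop * 0.3 * (n_seqs - n_with_gap) / n_seqs

def gop_near_gap(gop, distance_from_gap):
    """Rule 2: GOP * (2 + ((8 - distance) * 2) / 8)."""
    return gop * (2 + ((8 - distance_from_gap) * 2) / 8)

def has_hydrophilic_stretch(seq, run=5):
    """True if seq contains `run` consecutive hydrophilic residues."""
    count = 0
    for residue in seq:
        count = count + 1 if residue in HYDROPHILIC else 0
        if count >= run:
            return True
    return False

def gop_in_hydrophilic_stretch(gop):
    """Rule 3: GOP reduced by one third (assumed to mean GOP * 2/3)."""
    return gop * 2.0 / 3.0

print(gop_at_existing_gap(10.0, 4, 1))
print(gop_near_gap(10.0, 4))
print(has_hydrophilic_stretch("AADEGKNAA"))
```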
Weight matrices
Two main series of weight matrices are offered to the user: the Dayhoff PAM series (3) and the BLOSUM series (4). The default is the BLOSUM series. In each case, there is a choice of matrix ranging from strict ones, useful for comparing very closely related sequences, to very "soft" ones that are useful for comparing very distantly related sequences. Depending on the distance between the two sequences or groups of sequences to be compared, we switch between 4 different matrices. The distances are measured directly from the guide tree. The ranges of distances and the matrices used with the PAM series are: 80-100%: PAM20, 60-80%: PAM60, 40-60%: PAM120, 0-40%: PAM350. The ranges used with the BLOSUM series are: 80-100%: BLOSUM80, 60-80%: BLOSUM62, 30-60%: BLOSUM45, 0-30%: BLOSUM30.
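The switching between matrices can be pictured as a simple table lookup on the percentage ranges listed above. In this sketch a value that falls exactly on a boundary (for example 80%) is assigned to the stricter matrix; the text does not say how boundaries are handled, so that choice, like the name pick_matrix, is an assumption.

```python
# (lower bound of range in percent, matrix name), strictest first.
BLOSUM_RANGES = [(80, "BLOSUM80"), (60, "BLOSUM62"), (30, "BLOSUM45"), (0, "BLOSUM30")]
PAM_RANGES = [(80, "PAM20"), (60, "PAM60"), (40, "PAM120"), (0, "PAM350")]

def pick_matrix(percent, ranges=BLOSUM_RANGES):
    """Return the matrix whose range contains the given percentage."""
    for lower, name in ranges:
        if percent >= lower:
            return name
    raise ValueError("percentage must be between 0 and 100")

print(pick_matrix(85))
print(pick_matrix(45))
print(pick_matrix(45, PAM_RANGES))
```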
Divergent sequences
The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences are merged first. This may give a better chance of correctly placing the gaps and matching weakly conserved positions against the rest of the sequences. A choice is offered to set a cut off (default is 40% identity or less with any other sequence) that will delay the alignment of the divergent sequences until all of the rest have been aligned.
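One way to picture the cut off is to look, for each sequence, at its best percent identity with any other sequence and to delay it when even that best value is at or below the threshold. This reading of the rule, the sample identities and the function name divergent_sequences are assumptions made for this sketch.

```python
def divergent_sequences(identity, cutoff=40.0):
    """Return names whose best identity with any other sequence is <= cutoff.

    identity maps a pair of sequence names to their percent identity.
    """
    names = sorted({name for pair in identity for name in pair})
    delayed = []
    for name in names:
        best = max(pid for pair, pid in identity.items() if name in pair)
        if best <= cutoff:
            delayed.append(name)
    return delayed

# Hypothetical percent identities for three sequences.
identity = {("a", "b"): 81, ("a", "c"): 33, ("b", "c"): 30}
delayed = divergent_sequences(identity)
print(delayed)
```

Here sequence c would be held back until a and b have been aligned, which corresponds to the "Delayed" message shown for one group in the sample session.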
ClustalW+ performs multiple alignments on a set of sequences or a set of previously aligned sequences.
Minimal Syntax: % clustalw+ [-infile=]value -Default
Minimal Parameters (case-insensitive):
-infile [Type: InFile / Default: EMPTY / Aliases: infile1 in]
Input file specification
Prompted Parameters (case-insensitive):
-action [Type: String / Default: 'alignslow' / Aliases: act]
The following actions are supported:
alignfast: multiple alignment using fast pair wise
alignment
alignslow: multiple alignment using slow pair wise
alignment
tree: generate a phylogenetic tree given the alignment
-outfile [Type: OutFile / Default: EMPTY / Aliases: out]
Output file produced by ClustalW. This will be an
alignment file if ACTION is alignfast or alignslow. It
is a tree file if ACTION is tree. Default output file
is clustalw.msf
Optional Parameters (case-insensitive):
-check [Type: Boolean / Default: 'false' / Aliases: che help]
Prints out this usage message
-default [Type: Boolean / Default: 'false' / Aliases: d def]
Specifies that sensible default values be used for all
parameters where possible.
-documentation [Type: Boolean / Default: 'true' / Aliases: doc]
Prints banner at program startup
-quiet [Type: Boolean / Default: 'false' / Aliases: qui]
Tells application to print only a minimal amount of
information
-outorder [Type: String / Default: 'input' / Aliases: ord]
Whether the sequences should be output in input order
or the order in which they were aligned. Valid values
are : Input, Aligned
-range [Type: List / Default: EMPTY / Aliases: rng]
The sequence range to write as a comma-separated
value, e.g. m,n will write from m to m+n
-pwmatrix [Type: String / Default: EMPTY]
The scoring matrix to use for pair wise alignment when
performing slow pair wise alignments. Depending upon
sequence type, it may either refer to DNA scoring
matrix or a protein scoring matrix. This option is
relevant only when ACTION=alignslow. Valid matrices are
: blosum, pam, gonnet, id or filename
-pwgapopen [Type: Double / Default: EMPTY]
The gap opening penalty during slow pair wise
alignments. This option is relevant only when
ACTION=alignslow.
-pwgapext [Type: Double / Default: EMPTY]
The gap extension penalty during slow pair wise
alignments. This option is relevant only when
ACTION=alignslow.
-ktuple [Type: Integer / Default: EMPTY]
Window size while doing fast pair wise alignment. This
option is relevant only when ACTION=alignfast.
-topdiags [Type: Integer / Default: EMPTY]
Number of windows around best diagonals while doing
fast pair wise alignment. This option is relevant only
when ACTION=alignfast.
-window [Type: Integer / Default: EMPTY]
Number of windows around each of the top diagonals. This option is relevant only when ACTION=alignfast.
-pairgap [Type: Integer / Default: EMPTY]
The number of matching residues required to open a gap
while doing fast pair wise alignments. This option is
only relevant when ACTION=alignfast.
-score [Type: String / Default: 'absolute']
Whether pair wise alignment scores should be reported
as (raw) absolute scores or percentages
{absolute|percent}. This option is relevant only
when ACTION=alignfast.
-matrix [Type: String / Default: EMPTY]
The scoring matrix to be used for multiple sequence
alignments.
-gapopen [Type: Double / Default: EMPTY]
The gap opening penalty when doing multiple sequence
alignments
-gapext [Type: Double / Default: EMPTY]
The gap extension penalty when doing multiple sequence
alignments.
-endgaps [Type: Boolean / Default: 'false']
Turn on/off end gap separation penalty
-pgap [Type: Boolean / Default: 'true']
Turn on/off residue-specific gap separation penalties
-hgap [Type: Boolean / Default: 'true']
Turn on/off hydrophilic residue gaps
-hgapresidues [Type: List / Default: EMPTY]
List of hydrophilic residues
-maxdiv [Type: Double / Default: EMPTY]
Minimum percentage identity required for delay
-outputtree [Type: String / Default: 'nj' / Aliases: outtreeformat treefmt]
The output format of the guide tree produced. Valid values: NJ, PHYLIP, DIST, NEXUS
-negative [Type: Boolean / Default: 'false' / Aliases: neg]
Sets whether protein matrix contains negative values.
-monitor [Type: Boolean / Default: 'false' / Aliases: mon]
Turn on/off result monitoring
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory [$WPROOT/share/matrix/] unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.
Local Scoring Matrices
This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.
Clustalw+ reads a scoring matrix from your local directory or the public database with the values for every possible symbol comparison. The file Clustalw+dna.cmp has a 10 at every place where the set of bases implied by the alphabetic IUB ambiguity codes (see Appendix III) overlap. All of the other locations have zeros. The file blosum62.cmp is based on substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff. The scores in this matrix for pairwise amino acid comparisons range from -4 to +11. You can use the Fetch+ program to copy these files and then modify them to suit your own needs. (See the CONSIDERATIONS topic for more information about scoring matrices.)
-CHEck
prints out this usage message.
-DEFault
specifies that sensible default values be used for all
parameters where possible.
-DOCumentation
prints banner at program startup.
-QUIet
tells application to print only a minimal amount of information.
-outorder
whether the sequences should be output in input order or the order in which they were aligned. Valid values are : Input,
Aligned.
-range
the sequence range to write as a comma-separated value, e.g.
m,n will write from m to m+n
-pwmatrix
the scoring matrix to use for pair wise alignment when performing slow pair wise alignments. Depending upon sequence type, it may either refer to DNA scoring matrix or a protein scoring matrix. This option is relevant only when ACTION=alignslow. Valid matrices are: blosum, pam, gonnet, id or filename
-pwgapopen
the gap opening penalty during slow pair wise alignments. This option is relevant only when ACTION=alignslow.
-pwgapext
the gap extension penalty during slow pair wise alignments. This option is relevant only when ACTION=alignslow.
-ktuple
window size while doing fast pair wise alignment. This option is relevant only when ACTION=alignfast.
-topdiags
number of windows around best diagonals while doing fast
pair wise alignment. This option is relevant only when ACTION=alignfast.
-window
number of windows around each of the top diagonals. This option is relevant only when ACTION=alignfast.
-pairgap
the number of matching residues required to open a gap while doing fast pair wise alignments. This option is only relevant when ACTION=alignfast.
-score
whether pair wise alignment scores should be reported as (raw)
absolute scores or percentages {absolute|percent}. This
option is relevant only when ACTION=alignfast.
-matrix
the scoring matrix to be used for multiple sequence alignments.
-gapopen
the gap opening penalty when doing multiple sequence alignments
-gapext
the gap extension penalty when doing multiple sequence alignments.
-gapdist
Gap Separation distance is used to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalised more than other gaps.
-endgaps
turn on/off end gap separation penalty.
-pgap
turn on/off residue-specific gap penalty.
-hgap
turn on/off hydrophilic residue gaps.
-hgapresidues
list of hydrophilic residues.
-maxdiv
minimum percentage identity required for delay.
-outputtree
the output format of the guide tree produced. Valid values: NJ, PHYLIP, DIST, NEXUS.
-NEGative
sets whether protein matrix contains negative values.
-MONitor
turn on/off result monitoring.
Printed: January 27, 2005 15:04