TreeArrange and Treeps: User guide
----------------------------------

The current version of this software can be downloaded from
http://monod.uwaterloo.ca/downloads/treearrange/

1. Introduction
---------------

Treeps is a tool for displaying expression array data, with an optional
associated hierarchical clustering, as encapsulated PostScript files.
TreeArrange is a program that reorders the leaves of a hierarchical
clustering tree to place similar leaves together. For details of the
reordering algorithms, see [1].

Both programs are command-line programs controlled by a number of
switches. This makes them especially suitable for processing multiple
inputs. Encapsulated PostScript files created by Treeps can be included
in other documents. The output of Treeps can also be viewed in a suitable
PostScript viewer (free viewers include gv and Ghostview). Such viewers
often allow the user to zoom in on different parts of the picture. This
is a sufficient substitute for the interactive features provided by the
TreeView program by Michael Eisen. However, files produced by
TreeArrange can also be viewed in TreeView.

Input files for TreeArrange and Treeps can be produced, for example, by
the following programs:

- Cluster by Michael Eisen (for Windows)
  http://rana.lbl.gov/EisenSoftware.htm
- XCluster by Gavin Sherlock (for Windows, Unix, Macintosh)
  http://genome-www.stanford.edu/~sherlock/cluster.html

2. Installation
---------------

a) Installation for Linux and UNIX platforms
--------------------------------------------

TreeArrange and Treeps are distributed as a .tgz file containing source
code. The source code is written in GNU C++, so it should be easy to
compile on major UNIX platforms (so far it has been tested on Linux
only). In future we may prepare binary distributions as well.

To install the software on a UNIX platform, follow these steps:

1. Unpack the source distribution.

     gunzip treearrange.tgz
     tar xf treearrange.tar

   This will create a new directory treearrange.

2.
Compile the source code. Be sure to use GNU make (on some platforms
   the command has a different name than make).

     cd treearrange
     make

3. If everything goes well, the compilation produces two files, Treeps
   and TreeArrange, in the treearrange directory. Copy these two files to
   your preferred location for executable files (e.g. /usr/local/bin or
   ~/bin).

You can now try a sample run on the files provided in the source
distribution (the sample files were created by XCluster):

1. Change to the directory treearrange/sample.

2. Run the following commands (this assumes that TreeArrange and Treeps
   are in your executable search path; otherwise add the path):

     TreeArrange sample reordered
     Treeps reordered out.eps

3. View the file out.eps in your preferred PostScript viewer (e.g.
   ghostview or gv).

b) Installation for Windows platforms
-------------------------------------

Executable files for Windows are provided in a zipped format.

3. Input and output files
-------------------------

Both programs accept input files in the format produced by the Cluster
program written by Michael Eisen. There are at most three files in each
data set:

  .cdt  contains the expression data themselves
  .gtr  contains the hierarchical clustering of genes
  .atr  contains the hierarchical clustering of experiments

The atr file is optional. It is ignored by Treeps and is simply copied by
TreeArrange. The gtr file is optional for Treeps, but it is required by
most of the methods in TreeArrange.

It is assumed that the user first uses the Cluster program to preprocess
the data and to produce the cdt, gtr and atr files. Other programs using
the same output format can be used as well. A cdt file can also be
created easily in a spreadsheet (see the format description below).

TreeArrange produces its output in the same format, creating three files
with a different name. The output cdt file is the original cdt file with
reordered lines. The gtr and atr files stay the same.
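
As an illustration of the file naming described above, the following
Python sketch reports which files of a data set are present and what that
implies. The function names dataset_files and describe are assumptions
made for this example; they are not part of TreeArrange or Treeps.

```python
from pathlib import Path


def dataset_files(base):
    """Return a dict mapping each suffix to True/False (file exists).

    `base` is the data set name without a suffix, e.g. "sample" for
    sample.cdt, sample.gtr and sample.atr. It may include a path.
    """
    return {suffix: Path(f"{base}.{suffix}").is_file()
            for suffix in ("cdt", "gtr", "atr")}


def describe(base):
    """Summarise what can be done with this data set."""
    present = dataset_files(base)
    if not present["cdt"]:
        return "no cdt file: neither program can run"
    if not present["gtr"]:
        # The gtr file is optional for Treeps but required by most
        # TreeArrange methods.
        return "cdt only: Treeps works, most TreeArrange methods do not"
    return "cdt and gtr present: tree-based methods are available"
```

For the sample data set, describe("sample") would be called from the
treearrange/sample directory.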
The output of TreeArrange is in the same format as the input; only the
lines are arranged in a different order. Treeps outputs encapsulated
PostScript (an .eps file).

3.1 Format of cdt file
----------------------

A cdt file is a text file with entries separated by tabs. Each line
contains data for one gene, and each column corresponds to one
experiment. Unknown values are left empty. The first several rows and
columns have special meaning. You can create such a file by loading the
data into a spreadsheet, adding the required special rows and columns,
and then saving the table as tab-separated text. Note that the programs
are case sensitive: use keywords in capital letters as shown below.

There are one, two, or three rows with special meaning. All other rows
contain data for genes. The first row always contains column names. The
first several columns have special meanings and have fixed names that
must be used (see below). Columns with experimental data can be named
arbitrarily, for example to describe the experiment or give its
identifier. The first row can optionally be followed by a row containing
the weights of individual experiments. This row starts with the word
EWEIGHT. Also optional is a row containing experiment identifiers. This
row starts with the word AID. Experiment identifiers are created
automatically by the Cluster program and are not used by our programs.

There are at most four columns with special meaning. The first column
should be named GID and should contain a gene identifier. This is a
special identifier added by the Cluster program during hierarchical
clustering to identify the leaves of the hierarchical tree. If you do not
use hierarchical clustering, this column can be omitted. The second
column (or the first, if there is no GID column) should be named UNIQID
and should contain a unique identifier for each gene. You can use any
kind of identifier as long as no two rows share the same one. The
following column should be named NAME and should contain any description
of the gene that you want to display.
It does not need to be unique. This column can also be safely omitted.
The last special column is called GWEIGHT and contains the weights of
genes (one weight per row). It is not used by the programs and can be
safely omitted.

Empty lines and lines starting with the word REMARK are ignored.

Some examples of tables:

  GID      UNIQID     NAME        GWEIGHT  EXPERIMENT1  EXPERIMENT2
  EWEIGHT                                  1.0          2.5
  GENE2X   SID112179  EST T91987  1.5      0.5          0.4
  GENE1X   SID112153  EST N5632   2.0      -0.1         0.2
  GENE0X   SID314213  EST FF3456  0.5      0.1          2.6

  GID      UNIQID     EXPERIMENT1  EXPERIMENT2
  GENE2X   SID112179  0.5          0.4
  GENE1X   SID112153  -0.1         0.2
  GENE0X   SID314213  0.1          2.6

These two are only examples of spreadsheet tables; actual cdt files need
to be tab separated.

3.2 Format of gtr file
----------------------

Each line of a gtr file describes one internal node of the hierarchical
clustering tree. It contains four fields separated by whitespace. The
first field is a unique identifier of this node; the next two fields are
the unique identifiers of its two children. If a child is a leaf (i.e. a
gene), its identifier is the GID from the GID column of the cdt file. For
internal nodes, new identifiers are introduced. The last field contains
the length of an edge, given as the average correlation between the two
clusters (i.e. a number between -1 and 1).

Example:

  NODE1X GENE2X GENE1X 0.99
  NODE2X GENE0X NODE1X 0.98

4. TreeArrange command line options
-----------------------------------

TreeArrange accepts the following command line parameters:

  TreeArrange [options] INPUT OUTPUT

INPUT and OUTPUT are the names of the input and output data sets. They do
not include the .cdt suffix, but they do include a path if the files are
not in the current directory. It is assumed that INPUT.cdt exists and
that INPUT.gtr and INPUT.atr optionally exist. OUTPUT is, similarly, the
name for the output files. We recommend using a different name in order
to preserve the original files. The program generates OUTPUT.cdt, and
also OUTPUT.gtr and OUTPUT.atr if the input gtr and atr files were
supplied.
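
Because the input and output data set names are plain positional
arguments, several data sets can be processed in a loop. The Python
sketch below builds one TreeArrange command per data set, following the
pattern of the sample run (TreeArrange sample reordered). The helper
names and the "_reordered" output suffix are assumptions chosen for this
example; no options are passed.

```python
import subprocess


def treearrange_command(input_base, output_base):
    """Build the argument list for one TreeArrange run (no options)."""
    if input_base == output_base:
        # A different output name preserves the original files.
        raise ValueError("output name must differ from input name")
    return ["TreeArrange", input_base, output_base]


def run_all(input_bases, suffix="_reordered", dry_run=True):
    """Run TreeArrange on each data set and return the commands used.

    The suffix is a naming choice made for this example only.
    With dry_run=True the commands are built but nothing is executed.
    """
    commands = [treearrange_command(b, b + suffix) for b in input_bases]
    if not dry_run:
        for cmd in commands:
            # Assumes TreeArrange is in the executable search path.
            subprocess.run(cmd, check=True)
    return commands
```

Calling run_all(["sample"]) only returns the command list; pass
dry_run=False to actually execute the runs.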
Options (each option is followed by a space and a value):

  -m  (possible values are O, I, W, R, T)
      Method to use for reordering (default: I)
        O: optimal reordering consistent with the tree
        I: as O, but with improvements that save running time
        W: order by average expression level (see [2]), consistently
           with the tree
        R: random ordering consistent with the tree
        T: order heuristically, ignoring the tree constraint
           ("Travelling Salesman" 2-OPT heuristic)
      All methods except T require a hierarchical tree in the gtr file.

  -d  (possible values are P, U, E)
      Distance measure used for the similarity of genes (default: P)
        P: Pearson correlation (centred)
        U: Pearson correlation (uncentred)
        E: Euclidean distance

  -i  (possible values are positive integers)
      Number of iterations for the T heuristic (default: 30). More
      iterations mean a longer running time but a slightly higher chance
      of getting a better result.

For a more detailed explanation of the methods and distance measures,
see [1].

5. Treeps command line options
------------------------------

Treeps accepts the following command line parameters:

  Treeps [options] INPUT OUTPUT

INPUT is the name of the input data set, with the same meaning as in
TreeArrange. OUTPUT is the name of the encapsulated PostScript file
(including the .eps or .ps suffix).

Options (each option is followed by a space and a value; the list shows
each option with the format of its value and a short description, and
more details are given after the list):
  -t 0|1     Display tree on/off (default: 1)
  -l 0|1     Display gene labels on/off (default: 1)
  -c 0|1     Display gene groups on/off (default: 1)
  -d 0|1     Display node labels on/off (default: 0)
  -s W,H     Size of one cell should be W x H points
  -S W,H     Size of the map should be W x H points (1 inch = 72 points)
  -b N       Start with the N-th gene
  -e N       End with the N-th gene
  -f NODE    Display only the subtree rooted in the given node
  -p R,G,B   Colour for positive values (default: 255,0,0)
  -n R,G,B   Colour for negative values (default: 0,255,0)
  -z R,G,B   Colour for zero values (default: 0,0,0)
  -m R,G,B   Colour for missing values (default: 100,100,100)
  -a X       Contrast (positive number; default: 3)
  -P FILE    Filename with the palette of gene group colours

Options -t, -l, -c and -d switch the display of certain features on and
off. Value 0 means off, 1 means on. These features are:

- the hierarchical clustering tree for genes (the tree for experiments is
  never displayed);
- gene labels, i.e. names of genes or other material included in the NAME
  column. If the NAME column is missing, the UNIQID column is used. Parts
  of the NAME column between dollar signs are not displayed (see gene
  groups);
- gene groups. Users can specify several groups of genes that they want
  to highlight with a colour bar. The genes in one group may form a
  cluster or may be dispersed throughout the data (e.g. genes sharing one
  function). The group of a gene is given in the NAME column between
  dollar signs. Groups are numbered from 0. For example, if the NAME
  column contains the string "My favourite $3$gene", the gene belongs to
  group number 3 and the string $3$ is removed, so the displayed name is
  "My favourite gene". The group number can appear inside the gene name,
  at the beginning as in "$0$My gene", or at the end as in "My gene $6$".
  Genes whose name contains no group number get white in place of the
  colour bar;
- node labels, i.e. the unique identifiers of internal nodes of the
  hierarchical clustering tree. These identifiers are not nicely aligned
  and are not meant to appear in the final output.
  However, you need to know the identifier of a node if you want to use
  the -f option. We therefore recommend running with the -d option first,
  finding the label, and then using the -f option with this label.

Options -s and -S determine the size of the picture. PostScript can be
rendered at arbitrary resolution, so the picture can be enlarged without
loss of quality (provided your tools handle PostScript well). However, it
is convenient to produce PostScript of the right size directly. Both
options take a value consisting of two numbers (width and height)
separated by a comma (no spaces). Both lengths are given in points, where
one inch is 72 points. The -s option gives the size of one cell of the
colour map of gene expressions; -S gives the size of the entire map. Use
only one of these options. The hierarchical tree has a fixed width. The
font size for gene labels is set according to the height of the cell.

Options -b, -e and -f determine which genes to display (if they are not
specified, all genes are displayed). -f displays the subtree rooted in
the node with the given node identifier. To find the appropriate node
identifier, use the -d option to view the identifiers. -b and -e allow
you to choose the first and last gene to be displayed. Genes are numbered
from 0 in the order in which they appear in the cdt file. If the cdt file
was reordered by TreeArrange, you need to specify positions in the new
cdt file. The genes in the specified range do not need to form one
subtree.

Options -p, -n, -z and -m specify the colours of the colour map. Each
colour is given as three numbers separated by commas (no spaces). Each
number is between 0 and 255. The three values describe the red, green and
blue components of the colour. For example, 255,0,0 is red, 0,0,255 is
blue, 255,255,255 is white and 0,0,0 is black. Option -p specifies the
colour for positive values, -n for negative values, -z for zero values,
and -m for missing values.
For example, if we specify -p as red and -z as black, then zero will be
black and positive numbers will range from black to red, with more
intense red for higher values.

Option -a specifies the contrast. It is a positive real number. The
higher this number, the more the zero colour prevails. If the values in
your data are close to zero and your diagram is too black, use a lower
contrast value. If your diagram is too red/green, use a higher contrast
value to bring in more black areas.

Finally, option -P allows you to specify the colours to use for gene
groups. Put these colours in one file, one colour per line. The colour on
the first line is the colour for group 0, the colour on the second line
is the colour for group 1, and so on. Each colour is given by three
numbers (as in the -p, -n, -z and -m arguments). The only difference is
that the numbers are separated by spaces, not commas. The three numbers
can be followed by a comment on the same line. For example:

  255 0 0    //group 0 will be red
  0 255 0    //group 1 will be green
  0 0 255    //group 2 will be blue
  255 244 0  //group 3 will be yellow

References
----------

1. Therese Biedl, Brona Brejova, Erik D. Demaine, Angele M. Hamel,
   Tomas Vinar. Optimal Arrangement of Leaves in the Tree Representing
   Hierarchical Clustering of Gene Expression Data. Technical Report
   CS-2001-14, Dept. of Computer Science, University of Waterloo,
   April 2001.
   http://monod.uwaterloo.ca/papers/expanded.php3?paper=2001004

2. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster
   Analysis and Display of Genome-Wide Expression Patterns. Proceedings
   of the National Academy of Sciences of the U.S.A., 95(25):14863-14868,
   1998.