00README

* 0. Introduction
This directory contains a computer program for predicting one-dimensional
protein structures (secondary structures [SS], contact numbers [CN], and
residue-wise contact orders [RWCO]) by the method of critical random networks
described in:

Ref. 1 (Description of the software)
  "CRNPRED: Highly accurate prediction of one-dimensional protein structures
    by large-scale critical random networks."
  Kinjo AR, Nishikawa K.
  submitted (2006)

and

Ref. 2 (Method of critical random networks)
  "Predicting secondary structures, contact numbers, and residue-wise contact
    orders of native protein structure from amino acid sequence using critical
    random networks."
  Kinjo AR, Nishikawa K.
  BIOPHYSICS, 1:67-74 (2005) (DOI: 10.2142/biophysics.1.67).

This software is in the public domain. You can use, modify and/or destroy it
freely, but we do not take any responsibility for the consequences of your use.

* 1. INSTALLING CRNPRED.
  To install the CRNPRED program, you need the following:
  (0) UNIX-like operating system (Linux, Mac OS X, *BSD, etc.)
  (1) bash (or zsh)
  (2) make
  (3) gcc
  (4) PSI-BLAST and related databases (amino acid sequences and BLOSUM
      scoring matrices).

  First, set the environment variable CRNPRED_DIR to this directory (that is,
  the directory containing this file "00README").
  If you are using sh, ksh, bash, or zsh, write

    export CRNPRED_DIR=/path/to/this/directory

  in your ~/.profile and do

        % . ~/.profile

  If you are using csh or tcsh, write

    setenv CRNPRED_DIR /path/to/this/directory

  in your ~/.cshrc and do

        % source ~/.cshrc
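
  Before compiling, it can be useful to confirm that the variable points
  where you expect. The following is only a sketch for sh-compatible shells.
  The function name "check_crnpred_dir" is hypothetical and not part of
  CRNPRED; the check relies only on the fact stated above that CRNPRED_DIR
  should be the directory containing the file "00README".

```shell
# Hypothetical helper, not shipped with CRNPRED.
# Succeeds only if CRNPRED_DIR is set and contains the file 00README.
check_crnpred_dir() {
    if [ -z "${CRNPRED_DIR:-}" ]; then
        echo "CRNPRED_DIR is not set" >&2
        return 1
    fi
    if [ ! -f "${CRNPRED_DIR}/00README" ]; then
        echo "no 00README found in ${CRNPRED_DIR}" >&2
        return 1
    fi
    echo "CRNPRED_DIR looks fine: ${CRNPRED_DIR}"
}
```

  After sourcing your profile, run "check_crnpred_dir" and proceed only if
  it reports that the directory looks fine.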

  To compile the program, do

        % ./make.sh

  This installs the program named "xpredm" under the directory
  ${CRNPRED_DIR}/bin.

  After xpredm has been installed, test it by running

        % ${CRNPRED_DIR}/bin/xpredm sample/d3nul__.prof > hoge.out

  Compare hoge.out with sample/d3nul__.out. There are a few sample
  inputs and outputs in the directory named "sample".
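
  The comparison can be done with standard tools. The sketch below uses cmp
  and diff; the function name "compare_prediction" is hypothetical and not
  part of CRNPRED. This file does not say whether the two outputs should be
  byte-identical, so treat small differences (for example in the header line
  or in the last digit of a number) with judgment rather than as a failure.

```shell
# Hypothetical helper, not shipped with CRNPRED.
# Compares a fresh prediction ($1) with a reference output ($2) and,
# if they differ, prints the first 20 lines of the diff.
compare_prediction() {
    new=$1
    ref=$2
    if cmp -s "$new" "$ref"; then
        echo "OK: $new is identical to $ref"
    else
        echo "DIFFERENT: $new vs $ref"
        diff "$new" "$ref" | head -n 20
        return 1
    fi
}
```

  For the test above, this would be called as
  "compare_prediction hoge.out sample/d3nul__.out".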

* 2. RUNNING CRNPRED.
  Make sure you have set the environment variable CRNPRED_DIR appropriately.
  A utility shell script "run_crn.sh" is available for your convenience.
  If you have an amino acid sequence file in FASTA format (say, "test.seq"), do

    ${CRNPRED_DIR}/bin/run_crn.sh -d uniref100 test.seq

  where "uniref100" is the sequence database used by PSI-BLAST.
  After some time, you will have a file named "test.seq.d.out" which contains
  the result of the prediction. If it does not work, check the content of
  "run_crn.sh" and modify the environment variables such as BLASTDB, BLASTMAT,
  and CRNPRED_DIR. You may also have to change the first line "#!/bin/sh" to
  something like "#!/usr/bin/env bash" or "#!/usr/bin/env zsh".
  Run

    ${CRNPRED_DIR}/bin/run_crn.sh -h

  to see other options.

  Alternatively, you can run the program directly. You first need to run
  PSI-BLAST to make a position-specific scoring matrix:

    blastpgp -d nr -h 0.0005 -j 3 -i test.seq -Q test.prof > /dev/null

  Then do

    ${CRNPRED_DIR}/bin/xpredm test.prof > test.out

  The result is saved in "test.out".
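
  When several sequences need to be processed, the two manual steps above
  can be wrapped in a loop. The following is a minimal sketch, not a
  replacement for run_crn.sh. The function name "predict_all" is
  hypothetical; the blastpgp options are copied from the command above, and
  the output names follow the same pattern (test.seq gives test.prof and
  test.out). It assumes that blastpgp is on your PATH and that BLASTDB and
  BLASTMAT are already set up for it.

```shell
# Hypothetical helper, not shipped with CRNPRED.
# Usage: predict_all DATABASE file1.seq [file2.seq ...]
predict_all() {
    db=$1
    shift
    for seq in "$@"; do
        # Strip the extension of the file name, if it has one.
        case ${seq##*/} in
            *.*) base=${seq%.*} ;;
            *)   base=$seq ;;
        esac
        # Step 1: PSI-BLAST profile (same options as the manual command).
        blastpgp -d "$db" -h 0.0005 -j 3 -i "$seq" -Q "$base.prof" \
            > /dev/null || return 1
        # Step 2: prediction from the profile.
        "${CRNPRED_DIR}/bin/xpredm" "$base.prof" > "$base.out" || return 1
        echo "wrote $base.out"
    done
}
```

  For example, "predict_all nr a.seq b.seq" would leave the predictions in
  "a.out" and "b.out". The loop stops at the first sequence for which either
  step fails.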

* 3. INTERPRETING THE RESULTS.
  An example of a prediction is shown below.

    * Lines starting with "AA" show the input amino acid sequence.
    * Lines starting with "SS" show the predicted secondary structures
      where "H", "E", and "C" mean "alpha-helix", "beta-strand", and "coil",
      respectively.
    * Lines starting with "CN" show the predicted contact numbers in 2-state
      description where "B" and "E" mean "buried" and "exposed", respectively.
      The threshold values are the average contact number for each residue
      type (see Appendix A below for the list of the average contact numbers).
    * Lines after "># AA : SS P_H P_E P_C : CN : RWCO" are the details of the
      prediction:
          o The column corresponding to "AA" indicates the residue numbers
            and the amino acid residues.
          o The column corresponding to "SS" indicates the predicted secondary
            structure followed by the ad hoc probability for each secondary
            structure class (i.e., "P_H" for the probability for the residue to
            be in the alpha-helix class, etc.).
          o The column corresponding to "CN" indicates the predicted contact
            numbers in 2-state description ("B" or "E") followed by the
            predicted contact numbers as numerical values.
          o The column corresponding to "RWCO" indicates the predicted
            residue-wise contact orders (numerical values).

---------sample output starts here--------
>prediction for: test.prof


#                  *         *         *         *         *         *
AA:       SWQSYVDDHLMCDVEGNHLTAAAILGQDGSVWAQSAKFPQLKPQEIDGIKKDFEEPGFLA
SS:       CCHHHHHHHHHCCCCCCCCHEEEEECCCCCEEEECCCCCCCCHHHHHHHHHCCCCCCCCC
CN:       BBBBBBEBBEBBBBBBBBEEEEEEEEEEEEEEEEEBBBBBEEBBEBBBEEBEBBBBBBBB
#                  *         *         *         *         *         *
AA:       PTGLFLGGEKYMVIQGEQGAVIRGKKGPGGVTIKKTNQALVFGFYDEPMTGGQCNLVVER
SS:       CCEEEECCCEEEEEECCCCEEEEECCCCCEEEEEECCCEEEEEEECCCCCCHHHHHHHHH
CN:       EBEEBEBBEEEEEEEBBBBBBEEEEEBBBEEEEEEEEEEEEEEEBBBBBBBBEEEBEEEB
#                  *
AA:       LGDYLIESEL
SS:       HHHHHHHCCC
CN:       EEEBEBBBBB
//

>#   AA : SS P_H P_E P_C : CN     : RWCO
    1 S : C   11   7  82 : B   14 :  840
    2 W : C   23  10  67 : B   22 : 1221
    3 Q : H   59  11  30 : B   18 :  864
    4 S : H   79   8  12 : B   18 :  860
    5 Y : H   86   6   7 : B   25 : 1199
    6 V : H   89   5   6 : B   27 : 1276
    7 D : H   90   5   6 : E   21 :  855
    8 D : H   90   4   6 : B   17 :  728
    9 H : H   89   5   6 : B   22 :  954
   10 L : H   85   6   8 : E   30 : 1188
   11 M : H   72   9  18 : B   24 :  850
   12 C : C   44  11  46 : B   22 :  826
   13 D : C   18   8  73 : B   18 :  669
   14 V : C   10   7  83 : B   22 :  751
   15 E : C    8   6  86 : B   17 :  593
   16 G : C    8   7  85 : B   18 :  640
   17 N : C   10   8  82 : B   17 :  696
   18 H : C   16  11  73 : B   24 :  808
   19 L : C   30  16  54 : E   32 : 1103
   20 T : H   45  22  32 : E   26 :  962
   21 A : E   37  43  20 : E   27 : 1017
   22 A : E   15  75  10 : E   37 : 1265
   23 A : E    7  88   5 : E   38 : 1286
   24 I : E    6  89   5 : E   41 : 1341
   .
   .
   .
   .
---------sample output ends here--------
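
The per-residue detail lines are whitespace-separated with ":" between the
column groups, so they are easy to post-process. The sketch below converts
them to tab-separated values with awk. The function name "crn_details_tsv"
is hypothetical and not part of CRNPRED, and the field positions are an
assumption based on the layout of the sample output above.

```shell
# Hypothetical helper, not shipped with CRNPRED.
# Reads xpredm output (files or stdin) and prints one line per residue:
#   number, residue, SS, P_H, P_E, P_C, CN state, CN, RWCO
crn_details_tsv() {
    awk '
        # The detail section starts after the ">#" header line.
        /^>#/ { in_details = 1; next }
        # A detail line has 12 fields, three of which are ":" separators.
        in_details && NF == 12 && $3 == ":" {
            printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
                   $1, $2, $4, $5, $6, $7, $9, $10, $12
        }
    ' "$@"
}
```

For example, "crn_details_tsv test.out > test.tsv" writes a table that can
be loaded into a spreadsheet or a plotting program.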

* 4. CONTACT INFORMATION

Akira Kinjo
Center for Information Biology and DNA Data Bank of Japan,
National Institute of Genetics,
Mishima, 411-8540, JAPAN
email: akinjo @ genes . nig . ac . jp

* Appendix A.
  The average contact number of each residue type is listed below:
-------------------
  25.430 , /* A */
  21.038 , /* R */
  20.093 , /* N */
  18.594 , /* D */
  29.647 , /* C */
  20.206 , /* Q */
  18.008 , /* E */
  22.505 , /* G */
  23.572 , /* H */
  29.469 , /* I */
  28.173 , /* L */
  18.452 , /* K */
  26.466 , /* M */
  28.057 , /* F */
  20.350 , /* P */
  21.420 , /* S */
  22.747 , /* T */
  26.913 , /* W */
  26.627 , /* Y */
  28.656 , /* V */
-------------------
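
To check a predicted contact number against these thresholds by hand, the
table can be turned into a small lookup. This is only a sketch: the function
names "cn_threshold" and "cn_vs_average" are hypothetical and not part of
CRNPRED. The second function only reports on which side of the average a
value lies; it does not reproduce the program's own 2-state labels.

```shell
# Hypothetical helpers, not shipped with CRNPRED.
# cn_threshold RESIDUE: print the average contact number from the table.
cn_threshold() {
    case $1 in
        A) echo 25.430 ;; R) echo 21.038 ;; N) echo 20.093 ;;
        D) echo 18.594 ;; C) echo 29.647 ;; Q) echo 20.206 ;;
        E) echo 18.008 ;; G) echo 22.505 ;; H) echo 23.572 ;;
        I) echo 29.469 ;; L) echo 28.173 ;; K) echo 18.452 ;;
        M) echo 26.466 ;; F) echo 28.057 ;; P) echo 20.350 ;;
        S) echo 21.420 ;; T) echo 22.747 ;; W) echo 26.913 ;;
        Y) echo 26.627 ;; V) echo 28.656 ;;
        *) echo "unknown residue type: $1" >&2; return 1 ;;
    esac
}

# cn_vs_average RESIDUE CN: print "above" or "below" (ties count as below).
cn_vs_average() {
    avg=$(cn_threshold "$1") || return 1
    awk -v cn="$2" -v avg="$avg" \
        'BEGIN { if (cn + 0 > avg + 0) print "above"; else print "below" }'
}
```

For example, "cn_vs_average D 21" prints "above" (21 > 18.594) and
"cn_vs_average S 14" prints "below" (14 < 21.420). These are residues 7 and 1
of the sample output in Section 3, where they carry the labels "E" and "B".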

* Appendix B.
** Faster but less accurate predictions.
The default implementation of CRNPRED uses 5000-dimensional state vectors for
critical random networks. This makes the prediction process quite slow when
you use the program on old computers or when you predict large proteins.

If you want predictions quickly, there are two options:
(1) the linear predictor or (2) 2000-dimensional state vectors.

*** Using the linear predictor
The linear predictor described in Ref. 2 is implemented as a separate
program named "lpredm", which is installed along with xpredm (CRNPRED).
Use it as follows:

  ${CRNPRED_DIR}/bin/lpredm test.prof > test.out

*** Using 2000-dimensional CRNPRED
To use CRNPRED with 2000-dimensional state vectors, you need to recompile the
program. Do it as follows:

  cd ${CRNPRED_DIR}/src
  make realclean
  make NDIM=2000 install
  cd ..
  cp w2000/WMATS .
  cp w2000/WMAT_ENS .

This produces the executable file "xpredm" just like before, but it now
uses 2000-dimensional state vectors.

*** Comparison of predictors
Here is a brief summary of the speed and accuracy of the linear predictor
(lpredm), xpredm (2000), and xpredm (5000). The CPU times were measured for
the sample file "sample/d8abp__.prof" (305 AA) on Mac OS X (PPC G5, 2.5GHz).
The CPU time is (almost) proportional to the protein length.
In the table, "Q3" is the three-state accuracy in percent and "Cor" is the
correlation coefficient between predicted and observed values.

program         speed           accuracy        note
--------------------------------------------------------------
xpredm          very slow       SS:Q3=81        default
(5000)          5min52s         CN:Cor=0.75
                                RWCO:Cor=0.61

xpredm          slow            SS:Q3=79
(2000)          1min12s         CN:Cor=0.74
                                RWCO:Cor=0.61

lpredm          fast            SS:Q3=76
                0.558s          CN:Cor=0.72
                                RWCO:Cor=0.59
--------------------------------------------------------------

Note that the accuracies are average values based on a benchmark.
The difference between Q3=81 and Q3=79 may seem insignificant on average, but
there can be a big difference for individual predictions [e.g., a segment
incorrectly predicted as an alpha helix by xpredm (2000) may be correctly
predicted as a beta strand by xpredm (5000)].
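
Because the CPU time is roughly proportional to the protein length, the
timings in the table can be scaled to give a rough estimate for a protein of
another length on comparable hardware. The sketch below does this
arithmetic: 5min52s = 352 s and 1min12s = 72 s for 305 residues, that is,
about 1.15 s and 0.24 s per residue. The function name "estimate_cpu_seconds"
and the labels "xpredm5000", "xpredm2000", and "lpredm" are hypothetical and
not part of CRNPRED; timings on current machines will differ.

```shell
# Hypothetical helper, not shipped with CRNPRED.
# Usage: estimate_cpu_seconds PROGRAM LENGTH
# Scales the reference timings (305 residues, PPC G5 2.5GHz) linearly.
estimate_cpu_seconds() {
    case $1 in
        xpredm5000) ref=352 ;;    # 5min52s
        xpredm2000) ref=72 ;;     # 1min12s
        lpredm)     ref=0.558 ;;
        *) echo "unknown program: $1" >&2; return 1 ;;
    esac
    LC_ALL=C awk -v ref="$ref" -v len="$2" \
        'BEGIN { printf "%.1f\n", ref * len / 305 }'
}
```

For example, "estimate_cpu_seconds xpredm5000 100" prints 115.4, that is,
roughly two minutes for a 100-residue protein on the reference machine.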

Updated: 2009-06-10

# Local variables:
# mode: outline
# End: