This directory contains a computer program for predicting one-dimensional
protein structures (secondary structures [SS], contact numbers [CN], and
residue-wise contact orders [RWCO]) by the method of critical random networks

Ref. 1 (Description of the software)
"CRNPRED: Highly accurate prediction of one-dimensional protein structures
by large-scale critical random networks."
Kinjo AR, Nishikawa K.

Ref. 2 (Method of critical random networks)
"Predicting secondary structures, contact numbers, and residue-wise contact
orders of native protein structure from amino acid sequence using critical
random networks."
Kinjo AR, Nishikawa K.
BIOPHYSICS, 1:67-74 (2005) (DOI: 10.2142/biophysics.1.67).

This software is in the public domain. You may use, modify, and/or destroy it
freely, but we take no responsibility for the consequences of your use.

* 1. INSTALLING CRNPRED.
To install the CRNPRED program, you need the following:
(0) A UNIX-like operating system (Linux, Mac OS X, *BSD, etc.)
(4) PSI-BLAST and related databases (amino acid sequences and BLOSUM

First, set the environment variable CRNPRED_DIR to this directory (that is,
the directory containing this file "00README").
If you are using sh, ksh, bash, or zsh, write

    export CRNPRED_DIR=/path/to/this/directory

in your ~/.profile and do
If you are using csh or tcsh, write

    setenv CRNPRED_DIR /path/to/this/directory

in your ~/.cshrc and do

To compile the program, do
Then the program named "xpredm" is installed under the directory
After xpredm has been installed, test it by running

    % ${CRNPRED_DIR}/bin/xpredm sample/d3nul__.prof > hoge.out

Compare hoge.out with sample/d3nul__.out. There are a few sample
inputs and outputs in the directory named "sample".
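
The comparison can be done with the standard "diff" utility. The following is
only a minimal sketch: the function name "check_sample" is hypothetical and not
part of CRNPRED, and it assumes that the two files are expected to be identical.
If only the header line naming the input file differs, that is presumably
harmless.

```sh
# Hypothetical helper (not part of CRNPRED): compare a freshly generated
# prediction with a reference output and report the result.
check_sample() {
    new_out=$1   # e.g. hoge.out
    ref_out=$2   # e.g. sample/d3nul__.out
    if diff "$new_out" "$ref_out" > /dev/null; then
        echo "OK: $new_out matches $ref_out"
    else
        echo "DIFFERENT: $new_out vs $ref_out"
        return 1
    fi
}
```

For example, "check_sample hoge.out sample/d3nul__.out". Run "diff" by hand on
the two files to see the differing lines themselves.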

Make sure you have set the environment variable CRNPRED_DIR appropriately.
A utility shell script "run_crn.sh" is available for your convenience.
If you have an amino acid sequence file in FASTA format (say, "test.seq"), do

    ${CRNPRED_DIR}/bin/run_crn.sh -d uniref100 test.seq

where "uniref100" is the sequence database used by PSI-BLAST.
After some time, this produces a file named "test.seq.d.out" which contains
the result of the prediction. If it does not work, check the content of
"run_crn.sh" and modify the environment variables such as BLASTDB, BLASTMAT,
and CRNPRED_DIR. You may also have to change the first line "#!/bin/sh" to
something like "#!/usr/bin/env bash" or "#!/usr/bin/env zsh".

    ${CRNPRED_DIR}/bin/run_crn.sh -h

Alternatively, you can run the program directly. You first need to run
PSI-BLAST to make a position-specific scoring matrix:

    blastpgp -d nr -h 0.0005 -j 3 -i test.seq -Q test.prof > /dev/null

Then run xpredm on the resulting profile:

    ${CRNPRED_DIR}/bin/xpredm test.prof > test.out

The result is saved in "test.out".
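
If you have several sequence files, the two steps above can be wrapped in a
loop. The following is a rough sketch, not a replacement for "run_crn.sh": the
function name "crn_predict_all" is hypothetical, the PSI-BLAST options are
simply copied from the command above, and it assumes that "blastpgp" is in
your PATH and that the input files end in ".seq".

```sh
# Hypothetical wrapper (not part of CRNPRED): run PSI-BLAST and then xpredm
# for every FASTA file given as an argument.
crn_predict_all() {
    if [ -z "${CRNPRED_DIR:-}" ]; then
        echo "CRNPRED_DIR is not set" >&2
        return 1
    fi
    for seq in "$@"; do
        base=${seq%.seq}    # strip a trailing ".seq", if any
        # Step 1: make the profile (same options as in the text above).
        if ! blastpgp -d nr -h 0.0005 -j 3 -i "$seq" -Q "$base.prof" > /dev/null; then
            echo "blastpgp failed for $seq" >&2
            continue
        fi
        # Step 2: predict from the profile.
        "${CRNPRED_DIR}/bin/xpredm" "$base.prof" > "$base.out"
    done
}
```

For example, "crn_predict_all a.seq b.seq" would leave the predictions in
"a.out" and "b.out".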

* 3. INTERPRETING THE RESULTS.
Below is an example of a prediction.

* Lines starting with "AA" show the input amino acid sequence.
* Lines starting with "SS" show the predicted secondary structures
  where "H", "E", and "C" mean "alpha-helix", "beta-strand", and "coil",
  respectively.
* Lines starting with "CN" show the predicted contact numbers in 2-state
  description where "B" and "E" mean "buried" and "exposed", respectively.
  The threshold values are the average contact number for each residue
  type (see Appendix below for the list of the average contact numbers).
* Lines after "># AA : SS P_H P_E P_C : CN : RWCO" are the details of the
  predictions:
  o The column corresponding to "AA" indicates the residue numbers
    and the amino acid residues.
  o The column corresponding to "SS" indicates the predicted secondary
    structure followed by the ad hoc probability for each secondary
    structure class (i.e., "P_H" for the probability for the residue to
    be in the alpha-helix class, etc.).
  o The column corresponding to "CN" indicates the predicted contact
    numbers in 2-state description ("B" or "E") followed by the real
    predicted contact numbers.
  o The column corresponding to "RWCO" indicates the predicted
    residue-wise contact orders (real numbers).
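
The "AA", "SS", and "CN" lines are wrapped into blocks, so it can be handy to
join them back into one string per type. The following is a minimal sketch
using awk; the function name "crn_join" is hypothetical, and it assumes that
each such line consists of the tag ("AA:", "SS:", or "CN:") followed by
whitespace and the string, as in the sample output.

```sh
# Hypothetical helper (not part of CRNPRED): join the wrapped "AA:", "SS:",
# or "CN:" lines of a prediction file into one continuous string.
crn_join() {
    tag=$1     # AA, SS, or CN
    file=$2    # prediction file, e.g. test.out
    # Append the second field of every line whose first field is the tag.
    awk -v tag="$tag:" '$1 == tag { s = s $2 } END { print s }' "$file"
}
```

For example, "crn_join SS test.out" prints the whole predicted secondary
structure string on a single line.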

---------sample output starts here--------
>prediction for: test.prof
AA: SWQSYVDDHLMCDVEGNHLTAAAILGQDGSVWAQSAKFPQLKPQEIDGIKKDFEEPGFLA
SS: CCHHHHHHHHHCCCCCCCCHEEEEECCCCCEEEECCCCCCCCHHHHHHHHHCCCCCCCCC
CN: BBBBBBEBBEBBBBBBBBEEEEEEEEEEEEEEEEEBBBBBEEBBEBBBEEBEBBBBBBBB

AA: PTGLFLGGEKYMVIQGEQGAVIRGKKGPGGVTIKKTNQALVFGFYDEPMTGGQCNLVVER
SS: CCEEEECCCEEEEEECCCCEEEEECCCCCEEEEEECCCEEEEEEECCCCCCHHHHHHHHH
CN: EBEEBEBBEEEEEEEBBBBBBEEEEEBBBEEEEEEEEEEEEEEEBBBBBBBBEEEBEEEB

># AA : SS P_H P_E P_C : CN : RWCO
  1 S : C 11  7 82 : B 14 :  840
  2 W : C 23 10 67 : B 22 : 1221
  3 Q : H 59 11 30 : B 18 :  864
  4 S : H 79  8 12 : B 18 :  860
  5 Y : H 86  6  7 : B 25 : 1199
  6 V : H 89  5  6 : B 27 : 1276
  7 D : H 90  5  6 : E 21 :  855
  8 D : H 90  4  6 : B 17 :  728
  9 H : H 89  5  6 : B 22 :  954
 10 L : H 85  6  8 : E 30 : 1188
 11 M : H 72  9 18 : B 24 :  850
 12 C : C 44 11 46 : B 22 :  826
 13 D : C 18  8 73 : B 18 :  669
 14 V : C 10  7 83 : B 22 :  751
 15 E : C  8  6 86 : B 17 :  593
 16 G : C  8  7 85 : B 18 :  640
 17 N : C 10  8 82 : B 17 :  696
 18 H : C 16 11 73 : B 24 :  808
 19 L : C 30 16 54 : E 32 : 1103
 20 T : H 45 22 32 : E 26 :  962
 21 A : E 37 43 20 : E 27 : 1017
 22 A : E 15 75 10 : E 37 : 1265
 23 A : E  7 88  5 : E 38 : 1286
 24 I : E  6 89  5 : E 41 : 1341
---------sample output ends here--------
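
The per-residue lines in the sample above are easy to convert to a
tab-separated table for further processing. The following is a sketch only:
the function name "crn_details" is hypothetical, and it assumes that every
detail line has the twelve whitespace-separated fields seen in the sample
(with ":" as the 3rd, 8th, and 11th fields).

```sh
# Hypothetical helper (not part of CRNPRED): print the per-residue details as
# tab-separated columns:
#   number, residue, SS, P_H, P_E, P_C, CN state, CN, RWCO
crn_details() {
    awk '
        /^>#/ { in_details = 1; next }    # header of the per-residue table
        in_details && NF == 12 && $3 == ":" && $8 == ":" && $11 == ":" {
            printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
                   $1, $2, $4, $5, $6, $7, $9, $10, $12
        }
    ' "$@"
}
```

For example, the following lists the residues predicted as helix with P_H of
at least 80:

    crn_details test.out | awk -F '\t' '$3 == "H" && $4 >= 80'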

* 4. CONTACT INFORMATION
Center for Information Biology and DNA Data Bank of Japan,
National Institute of Genetics,
Mishima, 411-8540, JAPAN
email: akinjo @ genes . nig . ac . jp

The average contact number of each residue type is listed below:

** Faster but less accurate predictions.
The default implementation of CRNPRED uses 5000-dimensional state vectors for
critical random networks. This makes the prediction process quite slow on
old computers or for large proteins.

If you want predictions quickly, there are two options:
(1) the linear predictor or (2) 2000-dimensional state vectors.

*** Using linear predictor
The linear predictor described in Ref. 2 is implemented as a separate
program named "lpredm", which is installed along with xpredm (CRNPRED).

    ${CRNPRED_DIR}/bin/lpredm test.prof > test.out

*** Using 2000-dimensional CRNPRED
To use CRNPRED with 2000-dimensional state vectors, you need to recompile the
program. Do it as follows:

    cd ${CRNPRED_DIR}/src
    make NDIM=2000 install

This produces the executable file "xpredm" just like before, but it now
uses 2000-dimensional state vectors.

*** Comparison of predictors
Here is a brief summary of the speed and accuracy of the linear predictor
(lpredm), xpredm (2000), and xpredm (5000). The CPU times were measured for
the sample file "sample/d8abp__.prof" (305 AA) on Mac OS X (PPC G5, 2.5GHz).
The CPU time is (almost) proportional to the protein length.

program     speed       accuracy      note
--------------------------------------------------------------
xpredm      very slow   SS:Q3=81      default
(5000)      5min52s     CN:Cor=0.75
(2000)      1min12s     CN:Cor=0.74
--------------------------------------------------------------

Note that the accuracies are average values over a benchmark.
The difference between Q3=81 and Q3=79 may seem insignificant on average, but
there can be a big difference for individual predictions [e.g., a beta strand
incorrectly predicted as an alpha helix by xpredm (2000) may be correctly
predicted as a beta strand by xpredm (5000)].
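
Because the CPU time is roughly proportional to the protein length, the
timings quoted above for the 305-residue sample can be scaled to get a rough
estimate for another protein. The following is a back-of-the-envelope sketch:
the function name "crn_estimate_seconds" is hypothetical, and the numbers only
apply to a machine comparable to the one used for the measurements.

```sh
# Hypothetical helper (not part of CRNPRED): rough CPU-time estimate in
# seconds, scaled from the timings quoted for the 305-residue sample.
crn_estimate_seconds() {
    length=$1          # number of residues in your protein
    ndim=${2:-5000}    # 5000 (default build) or 2000
    case $ndim in
        5000) ref=352 ;;   # 5min52s for 305 residues
        2000) ref=72 ;;    # 1min12s for 305 residues
        *) echo "unknown NDIM: $ndim" >&2; return 1 ;;
    esac
    # Integer arithmetic, rounded to the nearest second.
    echo $(( ($length * $ref + 152) / 305 ))
}
```

For example, "crn_estimate_seconds 610" prints 704 (about 12 minutes with the
default build), and "crn_estimate_seconds 610 2000" prints 144.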