The Markov Cluster algorithm (MCL), also called Markov clustering, is a fast, scalable, unsupervised clustering algorithm based on simulating flow in graphs. The algorithm finds clusters in a graph by means of random walks. It uses two operators, expansion and inflation: "Expansion takes the power of a stochastic matrix using the normal matrix product. Inflation takes the Hadamard power of a matrix (taking powers entrywise), followed by a scaling step, such that the resulting matrix is stochastic again, i.e. the matrix elements (on each column) correspond to probability values." More information can be found here and here. The R code below describes how to perform it step by step. A nice explanation is also presented here.
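To make the two operators concrete, here is a minimal base R sketch of the expansion/inflation loop. It is only an illustration, not the original code of this post: the function name `mcl_sketch`, the added self-loops, the default parameters, the convergence tolerance and the way clusters are read off the final matrix are all assumptions.

```r
# Illustrative sketch of MCL (assumed names and defaults, not the original post's code).
mcl_sketch <- function(adj, expansion = 2, inflation = 2,
                       max_iter = 100, tol = 1e-8) {
  # Assumption: add self-loops, then scale each column to sum to 1
  M <- adj + diag(nrow(adj))
  M <- sweep(M, 2, colSums(M), "/")
  for (i in seq_len(max_iter)) {
    # Expansion: power of the stochastic matrix using the normal matrix product
    E <- M
    for (k in seq_len(expansion - 1)) E <- E %*% M
    # Inflation: entrywise (Hadamard) power, then rescale columns to sum to 1
    I <- E^inflation
    I <- sweep(I, 2, colSums(I), "/")
    converged <- max(abs(I - M)) < tol
    M <- I
    if (converged) break
  }
  # Assumption: nodes whose columns load on the same rows form one cluster
  key <- apply(M > 1e-6, 2, function(col) paste(which(col), collapse = ","))
  split(seq_len(ncol(M)), key)
}

# Toy graph: two triangles (1-2-3 and 4-5-6) joined by the single edge 3-4
adj <- matrix(0, 6, 6)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1
clusters <- mcl_sketch(adj)
print(clusters)
```

A larger inflation value generally gives finer clusters, so it is the parameter worth experimenting with first.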
Open Chemical Information and Research
This blog is for cheminformaticians and chemogenomics enthusiasts.
Friday, October 24, 2014
Wednesday, October 1, 2014
Link Prediction using Bipartite Networks.
Predicting missing links in networks is of practical significance in modern science, for example in social networks, biological networks, food networks and many others. The Adamic/Adar index refines the simple counting of common neighbors by assigning more weight to the less-connected neighbors, as given by the equation below. There are other indexes as well.
$$\sum_{w \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(w)|}$$
The code takes a bipartite graph as input (stored in a text file as an adjacency list) and computes the Adamic/Adar similarity of each non-neighboring node pair. The similarity is computed from the degrees of the intermediate nodes. The output is written to a text file with three fields per row: score, protein and drug. The same approach can be applied to other bipartite networks.
After the calculation, the predicted links are stored in the output file, and the highest-scoring predicted links can be obtained by sorting on the first column. The bipartite data (inputdata.txt) and the code are available on Git.
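For readers who just want to see the index itself in action, here is a small base R sketch that evaluates the Adamic/Adar sum on a generic adjacency matrix. It is an assumption-laden illustration (the function name `adamic_adar` and the toy graph are made up here) and it does not reproduce the file handling or the protein/drug scoring of the script on Git.

```r
# Illustrative sketch: Adamic/Adar scores for all non-neighbouring node pairs.
# `adj` is assumed to be a symmetric 0/1 adjacency matrix without self-loops.
adamic_adar <- function(adj) {
  deg <- rowSums(adj)
  # Weight of each potential common neighbour w: 1 / log(degree of w).
  # Degree-1 nodes cannot be a common neighbour of two distinct nodes, so they get weight 0.
  w <- ifelse(deg > 1, 1 / log(deg), 0)
  # score[u, v] = sum over w of adj[u, w] * weight(w) * adj[w, v]
  score <- adj %*% (w * adj)
  dimnames(score) <- dimnames(adj)
  diag(score) <- 0
  score[adj > 0] <- 0  # keep only pairs that are not already linked
  score
}

# Toy graph: a and b are both linked to c and d
nodes <- c("a", "b", "c", "d")
adj <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
adj["a", c("c", "d")] <- 1
adj["b", c("c", "d")] <- 1
adj <- adj + t(adj)
aa <- adamic_adar(adj)
print(round(aa, 3))
```

Sorting the non-zero entries of the returned matrix in decreasing order gives the candidate links with the highest scores first.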
Sunday, September 14, 2014
Some R codes and Examples to perform Random Walks on a network.
My research is mainly focused on drug-target prediction and on drug-to-target-to-adverse-event prediction. I use random walks on different heterogeneous networks for this. Several papers have appeared on the topic, such as:
1) Kohler 2008 (Walking the Interactome for Prioritization of Candidate Disease Genes)
2) The power of protein interaction networks for associating genes with diseases
3) Chen et al. (Drug Target Prediction with Random walk with restart)
4) Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network
Here I will post some R code showing how to perform a normal random walk on a small network using two methods, one proposed by Kohler (2008) and the other by Vanunu (2010).
The method iteratively simulates random transitions of a walker from a node to a randomly selected neighbour node; at any time step the walk can be restarted with a predefined probability. Random walk with restart differs slightly from PageRank with priors in the way it normalizes the link weights. Convergence is reached when the difference between the probability vectors of two consecutive time steps, measured by the L1 or Frobenius norm, falls below 10e-10, at which point the loop breaks. The function supports two methods:
1) Kohler -> which is just a simple random walk
2) Vanunu -> which is a modification of random walk with restart such that the link weight is normalized not only by the number of outgoing edges but also by the number of incoming edges. They mention: "We chose to normalize the weight of an edge by the degrees of its endpoints, since the latter relate to the probability of observing an edge between the same endpoints in a random network with the same node degrees".
No further parameters or similarity matrices are used here. This is the heart of the random walk code, and one can build on top of it as much as one likes.
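The description above can be sketched in a few lines of base R. This is a hedged illustration rather than the original function: the name `rwr`, the restart probability of 0.7, the tolerance of 1e-10 and the toy path network are assumptions, and the graph is assumed to be undirected with no isolated nodes.

```r
# Illustrative random walk with restart (assumed names and defaults).
rwr <- function(adj, seeds, restart = 0.7, method = c("kohler", "vanunu"),
                tol = 1e-10, max_iter = 1000) {
  method <- match.arg(method)
  deg <- colSums(adj)
  if (method == "kohler") {
    # Column normalisation: every column of W sums to 1
    W <- sweep(adj, 2, deg, "/")
  } else {
    # Vanunu-style normalisation by the degrees of both endpoints
    d <- 1 / sqrt(deg)
    W <- adj * outer(d, d)
  }
  # Start vector: probability mass spread evenly over the seed nodes
  p0 <- numeric(nrow(adj))
  p0[seeds] <- 1 / length(seeds)
  p <- p0
  for (i in seq_len(max_iter)) {
    p_next <- as.vector((1 - restart) * (W %*% p) + restart * p0)
    delta <- sum(abs(p_next - p))  # L1 norm of the change between two steps
    p <- p_next
    if (delta < tol) break
  }
  p
}

# Toy network: the path 1 - 2 - 3 - 4, walker restarted at node 1
adj <- matrix(0, 4, 4)
adj[cbind(1:3, 2:4)] <- 1
adj <- adj + t(adj)
p_kohler <- rwr(adj, seeds = 1, method = "kohler")
p_vanunu <- rwr(adj, seeds = 1, method = "vanunu")
print(round(p_kohler, 4))
print(round(p_vanunu, 4))
```

With the column normalisation the result is a probability vector that sums to 1; with the symmetric normalisation the values are scores and need not sum to 1.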
Hope this code makes life simple for some people when they try to understand it :)
Thursday, September 11, 2014
Some function for VS: AUC , BEDROC and RIE in R
Below are some R functions to compute the area under the curve (AUC), the Robust Initial Enhancement (RIE) metric and the Boltzmann-Enhanced Discrimination of ROC (BEDROC), as described in
Truchon et al. Evaluating Virtual Screening Methods: Good and Bad Metrics for the "Early Recognition" Problem. J. Chem. Inf. Model. (2007) 47, 488-508.
These metrics are used for the early recognition problem in virtual screening.
AUC
RIE
BEDROC
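As a rough guide to what such functions can look like, here is a compact base R sketch of the three metrics. The function names (`vs_auc`, `vs_rie`, `vs_bedroc`), the default alpha of 20 and the input convention (higher score means more likely active, labels are 1 for actives and 0 for decoys) are assumptions for this illustration, and BEDROC is written as RIE rescaled between its minimum and maximum value.

```r
# Illustrative sketch of AUC, RIE and BEDROC (assumed names and conventions).
vs_auc <- function(scores, labels) {
  n_act <- sum(labels == 1)
  n_dec <- sum(labels == 0)
  r <- rank(scores)  # ascending ranks, ties averaged
  (sum(r[labels == 1]) - n_act * (n_act + 1) / 2) / (n_act * n_dec)
}

vs_rie <- function(scores, labels, alpha = 20) {
  N <- length(scores)
  n <- sum(labels == 1)
  # Rank 1 is the best-scored compound; keep the ranks of the actives only
  ranks <- rank(-scores, ties.method = "first")[labels == 1]
  observed <- sum(exp(-alpha * ranks / N)) / n
  expected <- (1 / N) * (1 - exp(-alpha)) / (exp(alpha / N) - 1)
  observed / expected
}

vs_bedroc <- function(scores, labels, alpha = 20) {
  ra <- sum(labels == 1) / length(scores)  # fraction of actives
  rie <- vs_rie(scores, labels, alpha)
  rie_min <- (1 - exp(alpha * ra)) / (ra * (1 - exp(alpha)))
  rie_max <- (1 - exp(-alpha * ra)) / (ra * (1 - exp(-alpha)))
  (rie - rie_min) / (rie_max - rie_min)
}

# Toy example: 10 compounds, 3 actives
scores <- c(0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
labels <- c(1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
print(c(auc = vs_auc(scores, labels),
        rie = vs_rie(scores, labels),
        bedroc = vs_bedroc(scores, labels)))
```

The alpha parameter controls how strongly the top of the ranked list is weighted, so values reported for different alpha are not directly comparable.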
Monday, July 21, 2014
Converting InChi to Mol using PL/PYTHON and RDKit
I am at EBI this summer, working on the UniChem database virtualization. One part of the project is to search over 50 million compounds and generate images of them. This can be done on the fly, but people here suggested that I generate mol files for all the compounds. The data only has InChIs available, so each one has to be converted into a Mol object and written to a mol file, and the files then loaded into the database. Another very fast and efficient method is PL/Python, which lets you run your Python code inside postgres and generate the database directly. Postgres and Python together are quite fancy, so I chose that option for the conversion. I used RDKit to read the molecules and convert them to mol files; erroneous molecules are written to an error log file. I have given the PL/Python code below: just paste it in and press enter, and you will see CREATE FUNCTION on the screen. Before using the script you need to set up plpython as a language in your database, which is done by
mydb=# CREATE PROCEDURAL LANGUAGE plpython2u;
Once you are done with the script, executing the following SQL statement will generate the mol files for you in the ctab column.
select uci,stdinchi,inchi_mol(stdinchi) as ctab into ndb_mol from db_mol ;
That's it. It takes about 44-48 hours to generate all the mol files for 65 million compounds. I used a loop in a Python script to extract sets of 1 million compounds and compute the mol files.
Thursday, July 17, 2014
RDKit , R and PostgreSQL : Predictive Modeling / QSAR with ChEMBL data
This post is about predictive modeling with R and the RDKit postgres cartridge. If you are an RDKit user then I think you already use the RDKit postgres cartridge; if not, start using it today, it is free and very useful. Here is a nice documentation of the installation of RDKit.
There are many IPython notebooks on using Python and RDKit for predictive modeling and QSAR. You can also search for them on Google, and some tutorials are given here.
But this post is for those who are R lovers and like to use R for their regular modeling purposes.
The tools I have used:
RDKit postgres cartridge
R
R libraries: RPostgreSQL, BMS, ggplot2, elasticnet.
If you already have a working version of the ChEMBL postgres database, great; otherwise please download and load it. The cartridge can generate several types of fingerprints; to save space they are generated in hexadecimal format. The default setting for the Morgan fingerprint is 512 bits with a radius of 2. You can change the fingerprint settings in the postgres cartridge using the options below, which Greg Landrum suggested to me.
The options available are:
rdkit.dice_threshold rdkit.layered_fp_size
rdkit.do_chiral_sss rdkit.morgan_fp_size
rdkit.featmorgan_fp_size rdkit.rdkit_fp_size
rdkit.hashed_atompair_fp_size rdkit.ss_fp_size
rdkit.hashed_torsion_fp_size rdkit.tanimoto_threshold
Note that a change to a configuration variable as done here only affects the current session. If you want to make it the default for the database as a whole you need to change the database configuration:
contrib_regression=# alter database contrib_regression set rdkit.morgan_fp_size=1024;
ALTER DATABASE
Then disconnect (close psql) and reconnect to pick up the new setting.
I used the default 512 bits. The SQL query makes a subset table of ChEMBL compound IDs and all human ChEMBL targets, with their standard and published activity values below 50 uM. The SQL code below shows how to do it. It also generates the fingerprints into the rdkfps_1 table.
Once you have done this, your database is ready for modeling. I am using Postgres.app and R version 3.0.3 (2014-03-06). The following code shows how to connect to the postgres database, run the query, convert the hexadecimal fingerprints to binary (hex2bin()) and fit a ridge regression model on the serotonin 2a (5-HT2a) receptor dataset. I have written two functions, r2se() and plotsar(): r2se() computes the r^2 and the root mean square error, and plotsar() plots the data.
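To give an idea of what the two helper functions might look like, here is a hedged base R sketch of hex2bin() and r2se(). These are not the original implementations: the handling of a leading "\x" prefix, the bit order within each hex digit and the use of the squared Pearson correlation for r^2 are all assumptions.

```r
# Illustrative sketch of the two helpers (assumed behaviour, not the original code).

# Convert one hexadecimal fingerprint string into a 0/1 integer vector.
hex2bin <- function(hex) {
  hex <- sub("^\\\\x", "", hex)  # assumption: drop a leading "\x" prefix if present
  digits <- strtoi(strsplit(hex, "")[[1]], base = 16L)
  # Expand every hex digit into 4 bits, most significant bit first (assumed order)
  bits <- vapply(digits, function(d) (d %/% c(8L, 4L, 2L, 1L)) %% 2L, integer(4))
  as.vector(bits)
}

# Squared Pearson correlation and root mean square error of a prediction.
r2se <- function(observed, predicted) {
  r2 <- cor(observed, predicted)^2
  rmse <- sqrt(mean((observed - predicted)^2))
  c(r2 = r2, rmse = rmse)
}

print(hex2bin("a5"))
print(r2se(c(5.1, 6.3, 7.0, 8.2), c(5.0, 6.5, 7.4, 8.0)))
```

Whatever bit order is chosen, it only has to be the same for every compound, since the model treats each bit position as an independent descriptor.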
Let me know if you have any questions about the code, the modeling or generating this plot. I am writing some more R code to perform analytics with the ChEMBL data, so stay tuned.
Friday, February 14, 2014
Converting a Similarity Matrix into Pairwise p-values
Using the fitted Gaussian function, we can calculate the probability p(X > x) that a random pairwise similarity score X is bigger than a value x, and so transform a Tanimoto similarity matrix into p-values p(X > x) as follows:
where t(xi,xj) is the Tanimoto similarity matrix and h is the smoothing factor, which you need to estimate.
Hope you can now easily see how to calculate your p-value from a large distribution.
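One simple way to read the idea above is sketched here in base R: fit a single normal distribution to all distinct pairwise similarities and take the upper tail probability for every entry. This is only an assumed interpretation for illustration. The function name `sim_to_pvalue` is made up, and the sketch does not use the smoothing factor h mentioned above.

```r
# Illustrative sketch: upper-tail p-values from a normal fit to the pairwise similarities.
sim_to_pvalue <- function(sim) {
  x <- sim[upper.tri(sim)]  # background distribution: all distinct pairs
  mu <- mean(x)
  sigma <- sd(x)
  # p(X > x) under the fitted normal distribution
  p <- pnorm(sim, mean = mu, sd = sigma, lower.tail = FALSE)
  dim(p) <- dim(sim)
  dimnames(p) <- dimnames(sim)
  p
}

# Toy Tanimoto similarity matrix for 4 compounds
sim <- matrix(c(1.00, 0.20, 0.30, 0.80,
                0.20, 1.00, 0.25, 0.30,
                0.30, 0.25, 1.00, 0.35,
                0.80, 0.30, 0.35, 1.00), nrow = 4, byrow = TRUE)
pvals <- sim_to_pvalue(sim)
print(round(pvals, 4))
```

Pairs with an unusually high similarity get a small p-value; the diagonal is not meaningful and can be ignored.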
Tuesday, October 22, 2013
Fast Tanimoto Similarity Calculation using rcdk
Many of you who use R for cheminformatics probably know rcdk. The Tanimoto calculation in rcdk takes a long time: the code looks neat, but the similarity calculation can be performed much faster using inner products. Below is a simple piece of code to do that; it runs more than 10 times faster than the rcdk code, quite an impressive performance boost. I have made a pull request to Rajarshi's code, so it should be available in the main package soon.
Time taken for the new method
user system elapsed
2.962 0.012 2.971
Time taken for the normal method in rcdk
user system elapsed
43.644 0.064 43.707
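The inner-product trick can be sketched in base R as follows. This is an illustration under assumptions, not the code from the pull request: the function name `fast_tanimoto` is made up, and the fingerprints are assumed to be available as a 0/1 matrix with one row per molecule (rcdk fingerprint objects would first have to be converted to that form).

```r
# Illustrative sketch: all-against-all Tanimoto similarity via inner products.
# `fp` is assumed to be a 0/1 matrix, one row per molecule, one column per bit.
fast_tanimoto <- function(fp) {
  common <- tcrossprod(fp)  # number of bits set in both molecules
  on <- rowSums(fp)         # number of bits set in each molecule
  # Tanimoto = common / (bits in A + bits in B - common);
  # a pair of all-zero fingerprints gives 0/0 = NaN
  common / (outer(on, on, "+") - common)
}

# Toy fingerprints for three molecules
fp <- rbind(m1 = c(1, 1, 0, 0),
            m2 = c(1, 0, 1, 0),
            m3 = c(1, 1, 0, 0))
tsim <- fast_tanimoto(fp)
print(round(tsim, 3))
```

The speed-up comes from replacing a loop over pairs with a single matrix product.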
Dynamic plots with R Studio.
A few days back I came to know that RStudio provides dynamic plots: you can plot histograms and move sliders, tick check boxes and select from a drop-down list the items of your dataset you want to display. I will provide some examples of this below. Quite cool from the RStudio group.
# Sliders
library(mosaic)
if(require(manipulate)) {
  manipulate(
    histogram( ~ eruptions, data=faithful, n=n),
    n = slider(5,40)
  )
}
# CheckBoxes
library(mosaic)
if(require(manipulate)) {
  manipulate(
    histogram( ~ age, data=HELP, n=n, density=density),
    n = slider(5,40),
    density = checkbox()
  )
}
Check box with density plot
# Dropdowns
library(mosaic)
if(require(manipulate)) {
  manipulate(
    histogram( ~ age, data=HELP, n=n, fit=distribution, dlwd=4),
    n = slider(5,40),
    distribution =
      picker('normal', 'gamma', 'exponential', 'lognormal',
             label="distribution")
  )
}
# Dropdown and density
library(mosaic)
if(require(manipulate)) {
  manipulate(
    histogram(as.matrix(mtcars[,factor]),
              beside = TRUE, main = factor, density=density),
    factor = picker("mpg", "disp", "hp", "drat", "wt"),
    density = checkbox()
  )
}
This way you can make your plots dynamic with the manipulate package.
Monday, July 22, 2013
In January, I started an interesting project using random walks to predict drug targets. I found that many papers were recently published in this domain, one by Chen and another by Xing Chen. It looks like interesting work, and there are several more papers on this topic that you can find with a Google search.
The point I want to raise is whether the molecular descriptors they used are adequate. I did a descriptor-based study too, in which some pharmacophore descriptors gave me very good results.
Validation is still an important question: once you have a model, you need to test it on your data. In these papers they did not show much cross-target-class validation, which led me to investigate the method. I am still working on it, but today I will post some of the interesting results I got while working with the random walk with restart algorithm (also known as personalized PageRank).
A random walk is a finite Markov chain that is time-reversible. In fact, there is not much difference between the theory of random walks on graphs and the theory of finite Markov chains; every Markov chain can be viewed as a random walk on a directed graph, if we allow weighted edges. Similarly, time-reversible Markov chains can be viewed as random walks on undirected graphs, and symmetric Markov chains as random walks on regular symmetric graphs.
A random walk on a graph starts at a node x and iteratively moves to a neighbor of x chosen uniformly at random from the set Γ(x). The hitting time H(x,y) from x to y is the expected number of steps required for a random walk starting at x to reach y. Because the hitting time is not in general symmetric, it is also natural to consider the commute time C(x,y) := H(x,y) + H(y,x). Both of these measures serve as natural proximity measures and hence (negated) can be used as score(x, y).
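For a small graph the hitting and commute times can be computed exactly by solving a linear system, as in this base R sketch. The function names `hitting_time` and `commute_time` and the toy path graph are assumptions for illustration, and the graph is assumed to be connected and undirected.

```r
# Illustrative sketch: exact hitting and commute times on a small connected graph.
hitting_time <- function(adj, target) {
  P <- adj / rowSums(adj)  # transition matrix: uniform choice among neighbours
  keep <- setdiff(seq_len(nrow(adj)), target)
  # For x != target: H(x) = 1 + sum_z P[x, z] * H(z), with H(target) = 0,
  # which is the linear system (I - P[keep, keep]) h = 1
  h <- numeric(nrow(adj))
  h[keep] <- solve(diag(length(keep)) - P[keep, keep, drop = FALSE],
                   rep(1, length(keep)))
  h
}

commute_time <- function(adj, x, y) {
  hitting_time(adj, y)[x] + hitting_time(adj, x)[y]
}

# Toy graph: the path 1 - 2 - 3
adj <- matrix(c(0, 1, 0,
                1, 0, 1,
                0, 1, 0), nrow = 3, byrow = TRUE)
h_to_3 <- hitting_time(adj, target = 3)   # H(1,3), H(2,3), H(3,3)
h_to_2 <- hitting_time(adj, target = 2)   # H(1,2), H(2,2), H(3,2)
h_to_1 <- hitting_time(adj, target = 1)   # H(1,1), H(2,1), H(3,1)
c_13 <- commute_time(adj, 1, 3)
print(list(h_to_3 = h_to_3, h_to_2 = h_to_2, h_to_1 = h_to_1, c_13 = c_13))
```

On this path the walk from node 1 reaches node 2 in exactly one step, while the way back takes three steps on average, which shows why the hitting time is not symmetric.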
Now, some results
The figure shows the statins network; the 10 top-scoring predicted genes are listed for the compounds along with the true links. The true links are drawn as solid blue lines and the predicted links as dashed red lines. It has been reported that both lovastatin and simvastatin have the side effects of alopecia and hair loss, and post-marketing side effects include a variety of skin problems related to these drugs, such as nodules, discoloration, and dryness of skin/mucous membranes. We have found an association of simvastatin and lovastatin with the KRA53 and KRA52 genes. KRA53 (keratin-associated protein 5 type 3) is an essential gene for the formation of a rigid and resistant hair shaft through extensive disulfide bond cross-linking with the abundant cysteine residues of hair keratins. The matrix proteins include the high-sulfur and high-glycine-tyrosine keratins. The majority of keratinizing disorders affect the epidermis and/or its adnexal structures such as hair and nail, or sweat and sebaceous glands, although a number of these diseases affect other epithelia such as mucosal or corneal epithelia. We hypothesize here that the hair loss side effect of lovastatin and simvastatin might be associated with KRA53 or KRA52.
The method also shows some very good results, listed in the table below. Overall, the combination of sequence and descriptor similarity performs well. But is it good enough for cross-target-class prediction? We have to see and validate more.
Drugs | Targets
zolmitriptan | Dopamine D1 Receptor, Dopamine D2 Receptor
Lamotrigine | Sodium channel protein type III alpha subunit
sildenafil | adenosine A2A receptor, adenosine A2B receptor
Timolol | adrenoceptor alpha 1B
Acetaminophen | Phospholipase A2, PPARG, phosphoglycerate kinase 1
Glipizide | Alkaline Phosphatase, PPARD, Multidrug resistance protein
Pyridostigmine | Liver carboxylesterase 1, Acetylcholine receptor subunit alpha
fluocinonide | Synaptosomal-associated protein 25
Metronidazole | Beta-2 adrenergic receptor
Doxazosin | 5-HT2A