Monday, July 21, 2014

Converting InChi to Mol using PL/PYTHON and RDKit

I am at EBI this summer and working in the Unichem database virtualization. One of the part of the project is to perform a search of over 50 million compounds and generate the images of those compounds. It can be done on the fly but people here suggested me to generate all the mol files for those compounds. The data has only InChi's available so you need to convert it into Mol object and write it into mol file and then use database to dump the files. Another very fast and efficient method is to use PL/PYTHON which is very fast and you can integrate all the python code on postgres and generate the database. Quite fancy postgres and python. Certainly I choose that option for conversion. I used rdkit for reading the molecules and conversion to mol files and also the erroneous molecules are written as an error log file. I have given the PL/PYTHON code below just paste it and enter and then you create function at screen. Before using the script above you need to set the plpython as a language in your database which is done by


Once you're done with the script then  executing the following sql statement below will generate the mol files for you in the ctab column.

select uci,stdinchi,inchi_mol(stdinchi) as ctab into ndb_mol from db_mol ;

Thats it . It takes almost 44-48 hours to generate all the mol files for 65 million compounds. I used a loop in a python script to extract 1 million set of compounds and compute the mol files.

Thursday, July 17, 2014

RDKit , R and PostgreSQL : Predictive Modeling / QSAR with ChEMBL data

This post is based on doing Predictive Modeling with R and RDKit postgres cartridge . If you are a rdkit user then i think you do all use the rdkit postgres cartridge if not then start using it today it Free and very useful. Here is a nice documentation of installation of rdkit .

There are many ipython notebooks upon using python and rdkit for predictive modeling and qsar. You can also search it on google and some tutorials are given here.

But this post is for those who are R lovers and like to use R for their regular modeling purposes.
The tools I have used :
RDkit postgres cartridge
R libraries : RPostgreSQL, BMS, ggplot2 , elasticnet.

If you already have a working version of the ChEMBL postgres database then should be great otherwise please download and load it. The cartridge can be used to generate several types of fingerprints however to save the space it gets generated in hexadecimal format. The default setting is 512 for morgan fingerprint with path radius of 2. You can change the setting in the postgres cartridge for generation of fingerprints using the options below which Greg Landrum has suggested to me.

The options available are:
rdkit.dice_threshold           rdkit.layered_fp_size
rdkit.do_chiral_sss            rdkit.morgan_fp_size
rdkit.featmorgan_fp_size       rdkit.rdkit_fp_size
rdkit.hashed_atompair_fp_size  rdkit.ss_fp_size
rdkit.hashed_torsion_fp_size   rdkit.tanimoto_threshold

Note that a change to a configuration variable as done here only affects the current session. If you want to make it the default for the database as a whole you need to change the database configuration:

contrib_regression=# alter database contrib_regression set rdkit.morgan_fp_size=1024;

Then disconnect (close psql) and reconnect to pick up the new setting.

I used the default 512 bits. The sql query below makes a subset table based on chembl compound id, all human chembl targets, its standard and published activity values which are less than 50 uM. The sql code below  shows it how to perform it. It also generates fingerprints into rdkfps_1 table .

Once you have done this then your database is ready for modeling. I am using postgresapp and R version 3.0.3 (2014-03-06) . The following code shows you how to connect to postgres data and the query to run and how to convert hexadecimal to binary fingerprint (hex2bin()) and run a ridge regression model on the Serotonin 2a (5-HT2a) receptor dataset. I have written two function r2se() and plotsar() . r2se computes the r^2 and root mean square error and plotsar() plots the data . 


[1] 0.3224558

[1] 0.9095131

Let me know if you have any questions about the code and modeling and generating this plot. I am writing some more R codes to perform analytics with the ChEMBL data stay tuned .

Friday, February 14, 2014

Converting Similarity Matrix into a into a pairwise p-value

We can calculate the probability that a given random pairwise similarity score X is bigger than a value x as p(X > x) using the fitted Gaussian function, we can transform a Tanimoto similarity matrix into a p-value p(X > x) as follows:

were t(xi,xj) is the tanimoto similarity matrix and h is the smoothing factor which you need to estimate. 

Hope now you all can very easily understand how you can calculate your pvalue from a large distribution.

Tuesday, October 22, 2013

Fast Tanimoto Similarity Calculation using rcdk

Well many of you ,who are using r for cheminformatics must be knowing rcdk . Regarding the tanimoto calculation i have seen it seem it takes a long time to calculate the code in rcdk code looks neat but still the similarity calculation can be performed much faster using the inner products.Below given a simple code to do that and also the time taken is like 10 times faster than the rcdk code. Quite an impressive performance boost . I have made a pull request to Rajarshi's code, it should be available soon in the main package.

##Consider m is the binary matrix of 0 and 1 which you calculated using fp.sim.matrix###

Time taken for the new method
user  system elapsed 
  2.962   0.012   2.971 

#Normal method in rcdk

user  system elapsed 
 43.644   0.064  43.707 

Dynamic plots with R Studio.

Few days back i came to know that R Studio provides dynamic plots like you can plot histograms and move sliders, tick check boxes and also you can select from the drop down list the items you want to display from your dataset.I will provide some examples of this below. Quite cool enough from Rstudio group.

if(require(manipulate)) {
histogram( ~ eruptions, data=faithful, n=n),
n = slider(5,40)


if(require(manipulate)) {
histogram( ~ age, data=HELP, n=n, density=density),
n = slider(5,40),
density = checkbox()

Check box with density plot


if(require(manipulate)) {
histogram( ~ age, data=HELP, n=n, fit=distribution, dlwd=4),
n = slider(5,40),
distribution =
picker('normal', 'gamma', 'exponential', 'lognormal',

//dropdown and density

          beside = TRUE, main = factor,density=density),
  factor = picker("mpg", "disp", "hp","drat","wt"),density = checkbox())

This way you can make your ggplots dynamic with the manipulate package.

Monday, July 22, 2013

In January, I started  doing an interesting project using random walks to predict drug -target. I found many papers got recently published in this domain one is by Chen and another one from Xing Chen . Looks interesting work and there are several papers related to this topic you can just type it in google .

Now the point I am trying to indicate is that the molecular descriptors which they used is kind of ok or not. I made  descriptors based study too in which some pharmacophore descriptors gave me very good results.
The validation is still an important question. Ok, if you have got some model you need to test
your data. Looks like in these papers they didn't show  cross target class validation much which made me to do research on the method. Well i am still onto it.But today i will be posting some of the interesting results i got while working with the random walk with restart algorithm or you may call personalized page rank /shortest paths etc.

A random walk is a finite Markov chain that is time-reversible  In fact, there is not much difference between the theory of random walks on graphs and the theory of finite Markov chains; every Markov chain can be viewed as random walk on a directed graph, if we allow weighted edges. Similarly, time-reversible Markov chains can be viewed as random walks on undirected graphs, and symmetric Markov chains, as random walks on regular symmetric graphs.

A random walk on Graph starts at a node x and iteratively moves to a neighbor of x chosen uniformly at random from the set (x). The hitting time H(x,y) from x to y is the expected number of steps required for a random walk starting at x to reach y. Because the hitting time is not in general symmetric, it is also natural to consider the commute time C(x,y) := H(x,y) + H(y,x). Both of these measures serve as natural proximity measures and hence (negated) can be used as score(x, y).

Now,  some results

Figure  shows the statins network and the top scoring 10 predicted genes are listed for the compounds along with the true links. The true links are coloured in blue solid lines and red dashed lines are the predicted network.It has reported that both lovastatin and simvastatin having the side effect of alopecia and hair loss along with post marketing side effects shows variety of skin problem related to these drugs such as nodules, discoloration, dryness of skin/mucous membranes. 

We have found an association of simvastatin and lovastatin with KRA53 and KRA52 genes. KRA53(Keratin associated protein 5 type 3) is an essential gene for the formation of a rigid and resistant hair shaft through their extensive disulfide bond cross-linking with abundant cysteine residues of hair keratins. The matrix proteins include the high-sulfur and high-glycine-tyrosine keratins.  The majority of keratinizing disorders affect the epidermis and/or its adnexal structures such as hair and nail, or sweat and sebaceous glands, although a number of these diseases affect other epithelia such as mucosal or corneal epithelia. We hypothesize here the side effect of hairloss of lovastatin and simvastatin might be associated with KRA53 or KRA52.

The method also shows some very good results listed in the table below. Overall combination of sequence and descriptor similarity performs good. But still is it good enough to predict cross target class prediction. Well we have to see and validate more .

Dopamine D1 Receptor, Dopamine D2 Receptor
Sodiumchannel protein type III alpha subunit,
adenosine A2A receptor, adenosine A2B receptor
adrenoceptor alpha 1B
PhospholipaseA2,PPARG, phosphoglycerate kinase 1
Alkaline Phosphatase,PPARD,Multi drug resistance protein
Liver carboxylesterase 1, Acetylcholine receptor subunit alpha
Synaptosomal-associated protein 25,
Beta-2 adrenergic receptor

Thursday, December 13, 2012

Drug Repurposing Explorer

Following from the previous work i have made an tool suitable for drug repurposing using the side effect, Protein sharing, Fingerprints(Pubchem, ECFP6) and also ROCS (3D) .

I collected tThe side effect information of 727 drugs from SIDER .The target information, ATC Codes
from the Drugbank. We also manually added some of the drugs ATC codes from KEGG database. The
disease information of 727 drugs was obtained from Comparative Toxicogenomics Database (CTD)
[8].CTD is a curated information resource for chemical- gene, chemical-diseases, gene-diseases, chemical
–ontology interaction which is curated by hand by the CTD bio-curators. We extracted the Chemical to
disease association between the compounds which are related to the therapeutic area only which will help
to identify possible diseases to treat the drug and also if new diseases could be identified. We then
converted side effect and target data into binary matrices of 0 and 1 and created drug-side effect, drug
target matrices.

A user searching for a drug in Drug Repurposing explorer should be aware of drug  and also for what purpose it is used.


If any user using this tool found something interesting please mail your results to or It will help to make a good manual for the user while searching the database.

Sunday, November 25, 2012

fingerprints,fragments,side effect of drugs...

After a long time i am writing a blog . Though the material looks quite interesting to me for study the results indicate some essential fingerprints are related to side effect profile as well the compounds therapeutic area and also in target similarity between two compounds.

So lets see what the study tells....

I mapped around 746 compounds from drugbank and SIDER  by CID and then by name of the compounds and  manually checked and made a final dataset for my study. The recent SIDER provides 4500 side effects profiles.The side effect profiles are used to create a Compound - side effect binary matrix .

The side effects are used to calculate the compound similarity using matrix multiplication and normalization the algorithm i used is discussed in paper Metapath . It provides a fast and efficient way to calculate dice coefficient which i used to calculate the similarity more at Rajarshi's Guha Blog .

For the Final data of 746  compounds i removed metals and compounds molecular weight greater than 1000  and ultimately came down to 728 compounds.

I also created a compound- target matrix and compound - atc code matrix ( data is collected from drugbank) from which I calculated the similarity of compounds by side effect, ATC codes and protein  using using the metapath algorithm respectively.

Earlier research from Yamanishi's work on predicting side effects from substructure profile and drug target prediction using learning models  were very good papers . One of the important paper was from Campillos of drug target identification using side effect similarity which focused on similar side effect profile which is kind of similar to the off target profile. Another paper (SLAP) from our lab at Indiana University using semantics to identify target was a good improvement of integrating multiple heterogeneous network and predicting targets.

For my work i used pubchem fingerprints, maccs keys 166 , ECFP 4, ECFP 6 FCFP 4, FCFP 6 to find how much is the relation between the these fingerprints with the side effects and ATC codes and proteins.

I used Tanimoto similarity for the fingerprint based similarity of compounds and the pathsim similarity with side effect,ATC and protein of compounds. After performing a simple correlation study I found pubchem fingerprints are highly correlated with side effect though the correlation is about 0.16 but it was statically significant p-value < 2.2e-16 so as for  maccs keys and extended connectivity fingerprints. Below i provide correlation matrix of the different fingerprints,side effects,ATC Code, protein similarity.

Pubchem Fingerprint similarity Distribution

side effect distribution
Below is the matrix shown

sideffect pubchem maccs ECFP4 ECFP6 FCFP4 FCFP6 atc protein
sideffect 1 0.160535 0.156551 0.142548 0.140111 0.125838 0.12807 0.055374 0.097784
pubchem 0.160535 1 0.585017 0.604037 0.566609 0.617526 0.576539 0.018368 0.132186
maccs 0.156551 0.585017 1 0.523484 0.491311 0.477387 0.434868 0.014732 0.169165
ECFP4 0.142548 0.604037 0.523484 1 0.990811 0.757584 0.777071 0.096187 0.289458
ECFP6 0.140111 0.566609 0.491311 0.990811 1 0.747782 0.787673 0.102716 0.295364
FCFP4 0.125838 0.617526 0.477387 0.757584 0.747782 1 0.981399 0.094613 0.263978
FCFP6 0.12807 0.576539 0.434868 0.777071 0.787673 0.981399 1 0.106107 0.283874
atc 0.055374 0.018368 0.014732 0.096187 0.102716 0.094613 0.106107 1 0.166476
prot 0.097784 0.132186 0.169165 0.289458 0.295364 0.263978 0.283874 0.166476 1

It is found that ECFP and FCFP  shows very good correlation with Protein and atc code similairty fingerprint . But on the other hand pubchem fingeprint shows a best correlation with the side effect .
A question is arised here if the correlation is such low how much does Yamanishi's paper is able to predict the true relations? Can different substrctures methods able to correlate with the side effects ?

Well i am still thinking what's behind the side effects and the impact of right substructures on side effects.

Sunday, August 12, 2012

Kinase compounds polypharmacology can it be studied using Mol Wt & Lipophilicity

I have came across a post in blog FBDD and also from the linkedIn groups about the effect of liphophilicity and Molecular weight in promiscuity. Peter have referred one very good paper from Michael Hahn and Andrew Leach of Molecular complexity and fragment-based drug discovery: ten years on .This review suggest the importance of liphophilicity, Molecular weight,positive charge , Heavy atoms count in promiscuity.According to the paper Molecular complexity is playing a crucial role in polypharmacology. 
My Question is here what do you mean by molecular complexity? 

From the paper they have mentioned the use of mol weight and heavy atom count to in molecular complexity.Several studies were made for example 75000 compounds from pfizer were tested against 220 assays ,they showed that promiscuity decrease with mol weight, to counter it Novartis did 160 HT assays were they found a postive effect of mol wt in promiscuity.They also found that compounds containing a carboxylic acid showed significant higher selectivity.Other than that sprinthorpe seminal work proposed that increased liphophilicity and presence of basic moiety is playing a important role in promiscuity. From the paper mentioned"People at Roche also mentioned presence of positively charged groups and increase promiscuity."

Based on leach paper i did some study of 72 known Kinase inhibitors which was assayed against 442 different kinases studied by Davis etal . The 72 different inhibitors are available at .

I used AlogP for liphophilicty, mol weight, polar surface area and Heavy atoms count against the selectivity score of the compound ar 3 uM. The kinase selectivity score at 3uM is defined as the number of kinases the compound is bound at 3uM to the total number of kinase domain queried.The selectivity scores are being distributed between 0.2 nad 0.7  the The data is available in as additional file in . They have observed "The lowest selectivity scores, and therefore the greatest selectivity, were observed for the MEK inhibitors AZD-6244/ARRY-886 and CI-1040, the MET inhibitor SGX-523, the CSF1R inhibitor GW-2580 and the ERBB2/EGFR inhibitor lapatinib (Tykerb)."

The figure below explains an interesting result. For liphophilicity there exist a negative correlation but for all other it was positve though the correlation was maximum for mol weight around 0.31 but PSA and heavy atoms count it was below 0.1. Both the figure displaying same type of distribution of compounds at 300nM and 3 uM in which lower selectivity observed for different classes of kinase inhibitors.


Most of the compounds studied was having a selectivity below 0.2 which indicates the selectivity of class I and class II. For class I and class II inhibitors they are not much a difference Class I was made for big gate keepers like phenyl alanine and class II made for small and medium gatekeepers.Figure below gives the distribution from selectivity scores of different classes of inhibitors from Davis paper.

I think more compounds needs to get explored for the study of kinase inhibitor selectivity and prosmiscuity. Does these signify that the kinases are selective for compounds? 
However i still have question what other property comes into molecular complexity when dealing with promiscuity?

Saturday, August 11, 2012

PCA Plot

Continuing from the last post i have done some PCA anaylsis using 12 different physicochemical property descriptors including molecular refractivity, atom polarizabilities, bond polarizabilities, hydrogen bond donors and acceptors, petitjean number, topological polar surface area, number of rotatable bonds,liphophilicity XLogP, molecular weight, topological shape and geometrical shape. 

I used CDK for this developed by Rajarshi Guha and plotted the plot using ggplot2

Quite impressive that the compounds which have been screened was in the same area of the PknB inhibitors. Some outliers like Mitoxantrone,straurosporine are their which are scattered in plot . But majority of the predicted ones are in close association with the PknB inhibitors.