
New Approaches to Virtual Screening

Ivan Solt, Anna Tomin, Krisztian Niesz, ChemAxon Ltd.

Virtual screening (VS) aims to reduce the enormous virtual space of chemical compounds (a practical virtual library might comprise ~10^15 molecules) to a more manageable number for further synthesis and screening against biological targets, which could lead to potential drug candidates. Although the origin of the computational methods goes back to the 1970s, the term “virtual screening” did not appear until 1997.1 Since then, several successful case studies and approved drugs have demonstrated the relevance of computer-aided drug design (CADD). Additionally, the approach has proved useful in identifying relevant candidates for several drug repositioning applications.2 Despite recent developments, the potential of virtual screening to help medicinal chemists develop new drugs in a time- and cost-effective manner is still questioned by many.

There are two generally accepted approaches to virtual screening: ligand-based and structure-based (docking) methods. While ligand-based virtual screening (LBVS) uses 2D or 3D similarity searches between large compound databases and known actives, structure-based virtual screening (SBVS) applies different modeling techniques to mimic the binding interaction of a ligand with a biomolecular target. Hence, the biggest difference between LBVS and SBVS is that the latter requires structural information for the target, usually obtained from X-ray crystallography or nuclear magnetic resonance (NMR). If that information does not exist, which is often the case with membrane receptors such as GPCRs, homology models can be used in its place.3 Although a recent literature study has shown that docking is arguably the most widely used approach in early-phase drug discovery, the same research also points out that LBVS methods in general yield a higher fraction of potent hits.4

Structure-based approaches

Traditionally, SBVS consists of several steps: preparing the target and the compound library, running the actual docking algorithms, post-processing, and ranking the results for bioassays with a predefined scoring function.5 All of these steps involve a good many assumptions, which may lead to artifacts. But the key element that has been missing so far in overcoming the shortcomings of docking methods is the innovative thinking needed to replace the current static models with dynamic models that are applicable to the whole system.6

Figure 2: An example workflow for 3D shape-based flexible alignment: 1) 2D input structures; 2) a 3D conformer is generated and the shape is colored by atomic types; 3) the volume intersection, which is maximized during the alignment, is shown along with the resulting pose.


Undoubtedly, one of the greatest challenges for docking software is accounting for protein flexibility.7 These macromolecules are not static objects, and conformational changes are often key elements in ligand binding. Using multiple high-quality static receptor conformations as snapshots in docking runs and selecting the highest scoring conformation for further investigation is one way to tackle the problem. Other approaches described in detail in the literature include ensemble docking enabled by molecular dynamics simulations; “soft docking,” in which the interaction of the protein and the ligand is allowed to change continuously; and 4D docking, which allows ligands to be fitted against multiple target conformations in a single run.9-11
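The snapshot strategy can be illustrated with a minimal sketch. It assumes that each ligand has already been docked against every receptor snapshot by some external program and that lower scores are better; the snapshot names, ligand names, and score values are hypothetical.

```python
def best_over_ensemble(scores_by_snapshot, lower_is_better=True):
    """For each ligand, keep the receptor snapshot that gave its best docking score.

    scores_by_snapshot maps snapshot name -> {ligand name -> score}.
    Returns {ligand name -> (best snapshot, best score)}.
    """
    best = {}
    for snapshot, ligand_scores in scores_by_snapshot.items():
        for ligand, score in ligand_scores.items():
            current = best.get(ligand)
            if current is None:
                improved = True
            elif lower_is_better:
                improved = score < current[1]
            else:
                improved = score > current[1]
            if improved:
                best[ligand] = (snapshot, score)
    return best


# Hypothetical docking scores (arbitrary units, more negative = better).
scores = {
    "xray_apo":   {"lig1": -6.0, "lig2": -8.2},
    "md_frame_1": {"lig1": -7.5, "lig2": -7.9},
    "md_frame_2": {"lig1": -7.1, "lig2": -8.6},
}
best = best_over_ensemble(scores)
for ligand, (snapshot, score) in best.items():
    print(ligand, snapshot, score)
```

Taking the best score over the ensemble is only one possible aggregation; averaging over snapshots is another, and which one works better is likely to depend on the target.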

Another equally important and nontrivial issue is the handling of water molecules. Water can significantly affect ligand binding through the formation of hydrogen bonds and can contribute to both the enthalpy and the entropy of binding. In general, the thermodynamics of ligand-receptor interactions are still treated much like those of molecular reactions, which is often not the optimal way to approach the problem. To account for water, several approaches have been developed recently that are complementary to the experimental information from X-ray and NMR spectroscopy.12-15

Although these methods provide considerable enhancements, the general belief is that they are very target specific and do not work effectively with all protein types. How we account for the water molecules is directly connected to the ligand-flexibility problem, also known as conformational sampling of ligand molecules within the binding pocket of the protein during docking. In the attempt to solve this, four main methods are currently in broad usage: 1) building up the entire ligand from fragments within the binding pocket;16 2) generating a rotamer library, a collection of low-energy conformers, and docking them individually;17 3) Monte Carlo simulations;18 4) applying evolutionary methods, such as the traditional or Lamarckian genetic algorithms, to find the local energy minimum of the system.19
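To make the Monte Carlo option more concrete, here is a minimal Metropolis sketch over ligand torsion angles. It is an illustration under loud assumptions, not the method of any cited program: the threefold cosine potential `torsion_energy` is a toy stand-in for a real scoring function, and the step size, temperature factor `kT`, and step count are arbitrary.

```python
import math
import random


def torsion_energy(angles_deg):
    """Toy threefold torsional potential (arbitrary units); NOT a real force field."""
    return sum(1.0 + math.cos(math.radians(3.0 * a)) for a in angles_deg)


def metropolis_search(n_torsions, n_steps=2000, max_step_deg=30.0, kT=0.3, seed=0):
    """Sample torsion angles with the Metropolis criterion and track the best state."""
    rng = random.Random(seed)
    angles = [0.0] * n_torsions          # start from the eclipsed (highest-energy) state
    energy = torsion_energy(angles)
    best_angles, best_energy = list(angles), energy
    for _ in range(n_steps):
        trial = list(angles)
        i = rng.randrange(n_torsions)    # perturb one randomly chosen torsion
        trial[i] = (trial[i] + rng.uniform(-max_step_deg, max_step_deg)) % 360.0
        trial_energy = torsion_energy(trial)
        delta = trial_energy - energy
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            angles, energy = trial, trial_energy
            if energy < best_energy:
                best_angles, best_energy = list(angles), energy
    return best_angles, best_energy


initial_energy = torsion_energy([0.0, 0.0, 0.0])
best_angles, best_energy = metropolis_search(n_torsions=3)
print(initial_energy, best_energy, best_angles)
```

Accepting some uphill moves is what lets the search leave a local minimum; in a docking setting the energy function would also include the ligand's interactions with the binding pocket.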

While today’s methods are often capable of finding the correct binding modes, they are still far from accurately predicting binding affinity or potency. Current scoring methods include traditional methods, such as force field-based scoring functions, which use classical force fields to calculate noncovalent interactions between the ligand and the target; empirical methods, which combine several energetic terms, such as polar and apolar interactions, in a weighted fashion; and knowledge-based methods, which use the sum of distance-dependent statistical potentials.20,21 A refinement of the above scoring functions is consensus scoring, which is a hybrid method using multiple functions.22
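One simple way to combine several scoring functions is to average the rank each function assigns to a compound. The sketch below shows this rank-based scheme; it is only one of several possible consensus strategies and is not claimed to be the one used in the cited work. It assumes that every table scores the same compounds and that lower scores are better; the compound names and score values are hypothetical.

```python
def rank_positions(scores, lower_is_better=True):
    """Map each compound to its 1-based rank under one scoring function."""
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def consensus_rank(score_tables, lower_is_better=True):
    """Average the per-function ranks; return (compound, mean rank) sorted best first."""
    totals = {}
    for table in score_tables:
        for name, pos in rank_positions(table, lower_is_better).items():
            totals[name] = totals.get(name, 0) + pos
    n_tables = len(score_tables)
    mean_ranks = [(name, total / n_tables) for name, total in totals.items()]
    return sorted(mean_ranks, key=lambda item: item[1])


# Hypothetical scores from three kinds of scoring function (more negative = better).
force_field = {"lig1": -9.1, "lig2": -7.4, "lig3": -8.0}
empirical = {"lig1": -6.2, "lig2": -6.9, "lig3": -6.5}
knowledge_based = {"lig1": -120.0, "lig2": -95.0, "lig3": -110.0}

consensus = consensus_rank([force_field, empirical, knowledge_based])
for name, mean_rank in consensus:
    print(name, round(mean_rank, 2))
```

Working with ranks rather than raw scores avoids having to put functions with different units and ranges on a common scale.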

However, since the different scoring functions are collinear, some scientists question whether this method can significantly improve the accuracy of the process.23 To overcome the accuracy issue, machine learning techniques are currently attracting a great deal of attention. Neural networks, support vector machines and the random forest technique are able to describe the nonlinear dependence of the ligand-target interactions during binding without taking solvation and entropic effects into account.24 Also of interest is the structural interaction fingerprint (SIFt) method, which uses the 3D structure of the protein-ligand complex to generate a 1D binary fingerprint.25 This fingerprint is then used to characterize ligand poses derived from the docking procedure and to compare them with the native substrate’s interaction map.

Ligand-based approaches

In contrast to docking methods, ligand-based approaches do not take the target structure directly into account. These techniques are based on the assumption that compounds with a similar topology have similar biological activity. There are many ways to define molecular similarity; ligand-based screening in drug discovery typically uses topology-based descriptors involving the pharmacophoric sites of the molecules. The descriptors of the known active molecules and the potential hit molecules are compared using pre-defined mathematical expressions (metrics) to quantify molecular similarity. These approaches essentially neglect any information about the target biomolecule as well as the 3D structure of the ligand compounds. Nevertheless, they are very efficient and are often applied in combination with structure-based approaches to identify potential bioactive hits that can then be fed into docking experiments.
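A common choice of similarity metric for binary fingerprints is the Tanimoto coefficient, the number of shared on-bits divided by the number of bits set in either fingerprint. The sketch below ranks a small library against one known active. The fingerprints are hypothetical sets of on-bit indices; in practice they would come from a cheminformatics toolkit, and the 0.5 cutoff is an arbitrary illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0                      # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query_fp, library, threshold=0.0):
    """Return (name, similarity) pairs at or above the threshold, most similar first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, sim) for name, sim in scored if sim >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical toy fingerprints (sets of on-bit positions).
known_active = {1, 4, 7, 9, 12}
library = {
    "cand_A": {1, 4, 7, 9, 13},   # shares 4 bits with the active, 6 distinct bits overall
    "cand_B": {2, 3, 5},          # shares no bits with the active
    "cand_C": {1, 4, 7, 9, 12},   # identical fingerprint
}

ranked = rank_by_similarity(known_active, library, threshold=0.5)
for name, sim in ranked:
    print(name, round(sim, 3))
```

Compounds that pass the cutoff form the reduced set that could then be passed on to docking.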

Besides the structure-based and traditional ligand-based methods, there is a third possible approach to predicting the bioactivity of molecules in a virtual chemical space. Methods that belong to this third branch can be thought of as extensions of the ligand-based approach, with the major difference that instead of considering only the molecular topology, they create or consider 3D coordinates of both the active and the potential lead molecules for the similarity comparison, and then estimate the 3D shape similarity of these molecules. These algorithms are called shape methods, although, just as in the traditional ligand-based algorithms, a number of different ways to generate 3D similarity measures exist.

The first question that arises is how the flexibility of the molecules, which is no less important here than in docking, is accounted for during the alignment process. Some approaches generate an exhaustive conformational ensemble and use rigid alignment of the generated conformations against a rigid active molecule. These approaches have the advantage of a very efficient alignment step but can be rather sensitive to the conformational space that is generated. Also, generating the conformers can be time consuming, yet it is critical to the quality of the alignment. Nevertheless, successful approaches utilizing a rigid alignment model for 3D ligand screening exist.26

As an alternative, it is possible to use flexible instead of rigid alignment. In this case, no conformational ensemble is pregenerated for the screening; instead, the conformations are created on the fly during the alignment procedure. Flexible alignment methods, such as the 3D Screen from ChemAxon Ltd. (Budapest), have the advantage of not being sensitive to the quality of the conformational space that can be generated.27 An interesting refinement found in flexible alignment methods is the option to keep the known active ligand rigid and the hit structure flexible. This hybrid approach makes it possible to take a native substrate in its bound conformation and try to mimic this conformation with the candidate molecules, giving results comparable to docking.

Besides the flexibility of the active and the candidate compounds, the second question is how to bias the alignment procedure or, in other words, what the objective function of the alignment is. Some approaches (shape methods) aim to maximize the van der Waals overlap of the molecules and calculate the shape similarity score based on this.28 Although these methods can provide an estimate of the actual shape similarity, they give little insight into how the binding characteristics of the candidate molecule relate to those of the actives.
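A shape similarity score based on volume overlap can be sketched as follows. This is an assumption-laden illustration rather than the algorithm of the cited work: each atom is modeled as an identical spherical Gaussian, only first-order (pairwise) overlap terms are summed, and the width parameter `alpha` and the coordinates are hypothetical. The score is evaluated for one fixed relative pose; an alignment procedure would search over rotations, translations, and possibly conformations to maximize it.

```python
import math


def gaussian_overlap(mol_a, mol_b, alpha=0.8):
    """First-order overlap volume: sum of pairwise integrals of spherical Gaussians.

    For two Gaussians exp(-alpha * r^2) whose centers are a distance d apart, the
    overlap integral is (pi / (2 * alpha))**1.5 * exp(-alpha * d^2 / 2).
    """
    prefactor = (math.pi / (2.0 * alpha)) ** 1.5
    total = 0.0
    for xa, ya, za in mol_a:
        for xb, yb, zb in mol_b:
            d2 = (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
            total += prefactor * math.exp(-0.5 * alpha * d2)
    return total


def shape_tanimoto(mol_a, mol_b, alpha=0.8):
    """Shape Tanimoto: overlap(A, B) / (overlap(A, A) + overlap(B, B) - overlap(A, B))."""
    o_ab = gaussian_overlap(mol_a, mol_b, alpha)
    o_aa = gaussian_overlap(mol_a, mol_a, alpha)
    o_bb = gaussian_overlap(mol_b, mol_b, alpha)
    return o_ab / (o_aa + o_bb - o_ab)


# Hypothetical atom coordinates in angstroms.
active = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in active]    # same shape, poorly superimposed
far_away = [(x + 50.0, y, z) for x, y, z in active]  # no overlap at all

print(shape_tanimoto(active, active))
print(shape_tanimoto(active, shifted))
print(shape_tanimoto(active, far_away))
```

The score is 1 for a perfect superposition and falls toward 0 as the overlap vanishes, which is why it can serve as the quantity to maximize during alignment.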

This drawback can be overcome by taking specific atom-type information, such as pharmacophore sites, into account during the alignment procedure (match methods). Algorithms considering this information would be capable of generating alignments where patterns with similar binding character are oriented in a similar fashion in the active and the candidate, providing a more realistic picture of the potential bioactive similarity of the molecules.

Combined use of methods

Combining the structure-based and ligand-based strategies described above is also becoming a desirable and common path for researchers, and different methods, including sequential and parallel approaches, have been reported.29 Although the first step in most combined sequential strategies is to prepare a manageable compound set from large databases via molecular similarity searches, reverse approaches, in which docking protocols are carried out first to select the ligands for further investigation, have also been successfully applied.30 Hybrid approaches, in which ligand- and structure-based applications are truly merged (protein-ligand pharmacophores) and integrated into one standalone technique to enhance accuracy and performance, have also been described.29


Cheminformatics helps accelerate drug discovery, and virtual screening plays a crucial role in the early phase of that process by reducing the size of the haystack. In this paper, we have tried to summarize the basic approaches, including different structure-based and ligand-based methods as well as their integration into combined approaches. We have pointed out recent developments, but some major pitfalls and challenges still need to be resolved in order to handle the exponentially increasing volume of data (compound and biological activity information) expected soon to be unleashed as a consequence of entering the new genomic era.31

Date: May 11, 2015

