Publication: Open PHACTS computational protocols for in silico target validation of cellular phenotypic screens: knowing the knowns

Phenotypic screening is in a renaissance phase and is expected by many academic and industry leaders to accelerate the discovery of new drugs for new biology. Given that phenotypic screening is by definition target agnostic, the emphasis of in silico and in vitro follow-up work is on the exploration of possible molecular mechanisms and efficacy targets underlying the biological processes interrogated by the phenotypic screening experiments. Herein, we present six exemplar computational protocols for the interpretation of cellular phenotypic screens based on the integration of compound, target, pathway, and disease data established by the IMI Open PHACTS project. The protocols annotate phenotypic hit lists and allow follow-up experiments and mechanistic conclusions. The annotations included are from ChEMBL, ChEBI, GO, WikiPathways and DisGeNET. Also provided are protocols which select selective compounds from the IUPHAR/BPS Guide to PHARMACOLOGY interaction file to probe potential targets, and a correlation robot which systematically searches for an overlap of active compounds between the phenotypic assay and any kinase assay. The protocols are applied to a phenotypic pre-lamin A/C splicing assay selected from the ChEMBL database to illustrate the process. The computational protocols make use of the Open PHACTS API and data and are built within the Pipeline Pilot and KNIME workflow tools.

D. Digles, B. Zdrazil, J.-M. Neefs, H. Van Vlijmen, C. Herhaus, A. Caracoti, J. Brea, B. Roibás, M. I. Loza, N. Queralt-Rosinach, L. I. Furlong, A. Gaulton, L. Bartek, S. Senger, C. Chichester, O. Engkvist, C. T. Evelo, N. I. Franklin, D. Marren, G. F. Ecker and E. Jacoby

Full publication: Med. Chem. Commun., 2016, 7, 1237-1244

Publication: Using the Semantic Web for Rapid Integration of WikiPathways with Other Biological Online Data Resources

The diversity of online resources storing biological data in different formats provides a challenge for bioinformaticians to integrate and analyse their biological data. The semantic web provides a standard to facilitate knowledge integration using statements built as triples describing a relation between two objects. WikiPathways, an online collaborative pathway resource, is now available in the semantic web through a SPARQL endpoint. Having biological pathways in the semantic web allows rapid integration with data from other resources that contain information about elements present in pathways using SPARQL queries. In order to convert WikiPathways content into meaningful triples, we developed two new vocabularies that capture the graphical representation and the pathway logic, respectively. Each gene, protein, and metabolite in a given pathway is defined with a standard set of identifiers to support linking to several other biological resources in the semantic web. WikiPathways triples were loaded into the Open PHACTS discovery platform and are available through its Web API to be used in various tools for drug development. We combined various semantic web resources with the newly converted WikiPathways content using a variety of SPARQL query types and third-party resources, such as the Open PHACTS API. The ability to use pathway information to form new links across diverse biological data highlights the utility of integrating WikiPathways in the semantic web.

Andra Waagmeester, Martina Kutmon, Anders Riutta, Ryan Miller, Egon L. Willighagen, Chris T. Evelo, Alexander R. Pico

Full publication: PLOS Computational Biology, June 2016
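As a sketch of the triple-and-SPARQL approach described in the WikiPathways entry above, the snippet below composes a SPARQL query for pathways that contain a gene with a given label. The prefixes, class names, and property names used here are illustrative assumptions, not taken from the paper; consult the published WikiPathways vocabularies for the actual terms.

```python
# Sketch: build a SPARQL query for a WikiPathways-style RDF endpoint.
# All vocabulary terms (wp:GeneProduct, dcterms:isPartOf, ...) are
# illustrative assumptions for this example.

def build_pathway_query(gene_label: str) -> str:
    """Return a SPARQL query selecting pathways containing `gene_label`."""
    return f"""
PREFIX wp:      <http://vocabularies.wikipathways.org/wp#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?pathway ?title
WHERE {{
  ?gene a wp:GeneProduct ;
        rdfs:label "{gene_label}" ;
        dcterms:isPartOf ?pathway .
  ?pathway a wp:Pathway ;
           dcterms:title ?title .
}}
"""

# The query string can then be sent to a SPARQL endpoint by any HTTP client.
query = build_pathway_query("TP53")
assert "SELECT DISTINCT" in query
```

The point of the sketch is that once pathway content is expressed as triples, cross-resource questions reduce to graph patterns like the one above.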

Publication: The FAIR Guiding Principles for scientific data management and stewardship

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J.G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A.C ’t Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, Barend Mons

Full publication: Scientific Data, Volume 3, Article 160018, March 2016

Publication: Selectivity profiling of BCRP versus P-gp inhibition: from automated collection of polypharmacology data to multi-label learning

The human ATP binding cassette transporters Breast Cancer Resistance Protein (BCRP) and Multidrug Resistance Protein 1 (P-gp) are co-expressed in many tissues and barriers, especially at the blood–brain barrier and at the hepatocyte canalicular membrane. Understanding their interplay in affecting the pharmacokinetics of drugs is of prime interest. In silico tools to predict inhibition and substrate profiles towards BCRP and P-gp might serve as early filters in the drug discovery and development process. However, to build such models, pharmacological data must be collected for both targets, which is a tedious task, often involving manual and poorly reproducible steps.

Floriane Montanari, Barbara Zdrazil, Daniela Digles, Gerhard F. Ecker

Full publication: Journal of Cheminformatics, Volume 8:7, February 2016

Publication: WikiPathways: capturing the full diversity of pathway knowledge

WikiPathways is an open, collaborative platform for capturing and disseminating models of biological pathways for data visualization and analysis. Since our last NAR update, 4 years ago, WikiPathways has experienced massive growth in content, which continues to be contributed by hundreds of individuals each year. New aspects of the diversity and depth of the collected pathways are described from the perspective of researchers interested in using pathway information in their studies. We introduce the Quick Edit feature for pathway authors and curators, in addition to new means of publishing pathways and maintaining custom pathway collections to serve specific research topics and communities. In addition to the latest milestones in our pathway collection and curation effort, we also highlight the latest means to access the content as publishable figures, as standard data files, and as linked data, including bulk and programmatic access.

Martina Kutmon, Anders Riutta, Nuno Nunes, Kristina Hanspers, Egon L. Willighagen, Anwesha Bohler, Jonathan Mélius, Andra Waagmeester, Sravanthi R. Sinha, Ryan Miller, Susan L. Coort, Elisa Cirillo, Bart Smeets, Chris T. Evelo, Alexander R. Pico

Full publication: Nucleic Acids Research, 2015

Publication: Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents

First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever-increasing number of patent applications, manual processing and curation on such a large scale become even more challenging. An alternative approach, better suited to this large corpus of documents, is the automated extraction of chemical structures. A number of patent chemistry databases generated using the latter approach are now available, but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.

Stefan Senger, Luca Bartek, George Papadatos, Anna Gaulton

Full publication: Journal of Cheminformatics, Volume 7:49, October 2015

Publication: Medicinal chemistry in the era of big data

In the era of big data, medicinal chemists are exposed to an enormous amount of bioactivity data. Numerous public data sources allow for querying across medium to large data sets, mostly compiled from the literature. However, the data available are still quite incomplete and of mixed quality. This mini review focuses on how medicinal chemists might use such resources and how valuable the current data sources are for guiding drug discovery.

Lars Richter, Gerhard F. Ecker

Full publication: Drug Discovery Today, Volume 14, July 2015, Pages 37–41

Publication: The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets

There are presently hundreds of online databases hosting millions of chemical compounds and associated data. As a result of the number of cheminformatics software tools that can be used to produce the data, subtle differences between the various cheminformatics platforms, as well as the naivety of the software users, there are a myriad of issues that can exist with chemical structure representations online. In order to help facilitate validation and standardization of chemical structure datasets from various sources we have delivered a freely available internet-based platform to the community for the processing of chemical compound datasets.

Karen Karapetyan, Colin Batchelor, David Sharpe, Valery Tkachenko, Antony J Williams

Full publication: Journal of Cheminformatics, Volume 7:30, June 2015

Publication: DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes

DisGeNET is a comprehensive discovery platform designed to address a variety of questions concerning the genetic underpinning of human diseases. DisGeNET contains over 380 000 associations between >16 000 genes and 13 000 diseases, which makes it one of the largest repositories of its kind currently available. DisGeNET integrates expert-curated databases with text-mined data, covers information on Mendelian and complex diseases, and includes data from animal disease models. It features a score based on the supporting evidence to prioritize gene-disease associations. It is an open access resource available through a web interface, a Cytoscape plugin and as a Semantic Web resource. The web interface supports user-friendly data exploration and navigation. DisGeNET data can also be analysed via the DisGeNET Cytoscape plugin, and enriched with the annotations of other plugins of this popular network analysis software suite. Finally, the information contained in DisGeNET can be expanded and complemented using Semantic Web technologies and linked to a variety of resources already present in the Linked Data cloud. Hence, DisGeNET offers one of the most comprehensive collections of human gene-disease associations and a valuable set of tools for investigating the molecular mechanisms underlying diseases of genetic origin, designed to fulfill the needs of different user profiles, including bioinformaticians, biologists and health-care practitioners.

Janet Piñero, Núria Queralt-Rosinach, Àlex Bravo, Jordi Deu-Pons, Anna Bauer-Mehren, Martin Baron, Ferran Sanz, Laura I Furlong

Full publication: The Journal of Biological Databases and Curation, 2015

Publication: Publishing DisGeNET as Nanopublications

The increasing and unprecedented publication rate in the biomedical field is a major bottleneck for discovery in the Life Sciences. Although the scientific community is limited by its inability to manually curate facts from published papers, recent approaches enable the automatic, scalable and reliable extraction of assertions from the scientific literature. While the publication of assertions on the Semantic Web is gaining traction, it also creates new challenges to ensure proper provenance, such as versioning for dataset change-sensitive link generation. Here, we address these issues and describe our efforts to represent the DisGeNET database of human gene-disease associations as permanent, immutable, and provenance-rich digital objects called nanopublications. This is the first Linked Dataset that ensures stable interlinking to the assertion and its metadata by trusty URIs. As DisGeNET integrates expert-curated and text-mined data of different origin, the semantic description of the evidence for each assertion is provided to confer trust and allow evidence-based hypothesis generation. We describe our steps to ensure high quality and demonstrate the utility of linking our dataset to others on the emerging Semantic Web.

Núria Queralt-Rosinach, Tobias Kuhn, Christine Chichester, Michel Dumontier, Ferran Sanz, Laura I Furlong

Full publication: Semantic Web, 2015

Publication: Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research

Background: Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for the identification of actionable knowledge from free text repositories. We present the BeFree system, aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. Results: By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources.

Laura I Furlong, Alex Bravo, Janet Piñero, Núria Queralt-Rosinach, Michael Rautschka

Full publication: BMC Bioinformatics 2015, 16:55

Publication: The Application of the Open Pharmacological Concepts Triple Store (Open PHACTS) to Support Drug Discovery Research

Integration of open access, curated, high-quality information from multiple disciplines in the Life and Biomedical Sciences provides a holistic understanding of the domain. Additionally, the effective linking of diverse data sources can unearth hidden relationships and guide potential research strategies. However, given the lack of consistency between descriptors and identifiers used in different resources and the absence of a simple mechanism to link them, gathering and combining relevant, comprehensive information from diverse databases remains a challenge. The Open Pharmacological Concepts Triple Store (Open PHACTS) is an Innovative Medicines Initiative project that uses semantic web technology approaches to enable scientists to easily access and process data from multiple sources to solve real-world drug discovery problems. The project draws together sources of publicly-available pharmacological, physicochemical and biomolecular data, represents it in a stable infrastructure and provides well-defined information exploration and retrieval methods. Here, we highlight the utility of this platform in conjunction with workflow tools to solve pharmacological research questions that require interoperability between target, compound, and pathway data.

Joseline Ratnam, Barbara Zdrazil, Daniela Digles, Emiliano Cuadrado-Rodriguez, Jean-Marc Neefs, Hannah Tipney, Ronald Siebes, Andra Waagmeester, Glyn Bradley, Chau Han Chau, Lars Richter, Jose Brea, Chris T. Evelo, Edgar Jacoby, Stefan Senger, Maria Isabel Loza, Gerhard F. Ecker, Christine Chichester

Full publication: PLOS ONE, December 2014

Publication: Using the BioAssay Ontology for Analyzing High-Throughput Screening Data

High-throughput screening (HTS) is the main starting point for hit identification in drug discovery programs. This has led to a rapid increase in available screening data, both within pharmaceutical companies and in the public domain. We have used the BioAssay Ontology (BAO) 2.0 for assay annotation within AstraZeneca to enable comparison with external HTS methods. The annotated assays have been analyzed to identify technology gaps, evaluate new methods, verify active hits, and compare compound activity between in-house and PubChem assays. As an example, the binding of a fluorescent ligand to formyl peptide receptor 1 (FPR1, involved, for example, in inflammation) in an in-house HTS was measured by fluorescence intensity. In total, 155 active compounds were also tested in an external ligand-binding flow cytometry assay, a method not used for in-house HTS detection. Twelve percent of the 155 compounds were found active in both assays. Through the annotation of assay protocols using BAO terms, internal and external assays can easily be identified and method comparison facilitated. The annotations can further be used to evaluate the effectiveness of different assay methods, design appropriate confirmatory and counterassays, and analyze the activity of compounds for the identification of technology artifacts.

Linda Zander Balderud, David Murray, Niklas Larsson, Uma Vempati, Stephan C. Schürer, Marcus Bjäreland, Ola Engkvist

Full publication: Journal of Biomolecular Screening, March 2015 Volume 20, No. 3, Pages 402-415

Publication: Drug Discovery FAQs: Workflows for answering cross concept drug discovery questions

Modern data-driven drug discovery requires integrated resources to support decision-making and enable new discoveries. The Open PHACTS Discovery Platform was built to address this requirement by focusing on drug discovery questions that are of high priority to the pharmaceutical industry. Although complex, most of these frequently asked questions (FAQs) revolve around the combination of data concerning compounds, targets, pathways and diseases. Computational drug discovery using workflow tools and the integrated resources of Open PHACTS can deliver answers to most of these questions. Here, we report on a selection of workflows used for solving these use cases and discuss some of the research challenges. The workflows are accessible online from myExperiment and are available for reuse by the scientific community.

Christine Chichester, Daniela Digles, Paul Groth, Ronald Siebes, Lee Harland, Antonis Loizou

Full publication: Drug Discovery Today, Volume 20, Issue 4, April 2015, Pages 399–405

Publication: On the formulation of performant SPARQL queries

The combination of the flexibility of RDF and the expressiveness of SPARQL provides a powerful mechanism to model, integrate and query data. However, these properties also mean that it is nontrivial to write performant SPARQL queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to write “good” queries. The goal of this paper is to begin to bridge this gap. It describes 5 heuristics that can be applied to create optimised queries. The heuristics are informed by formal results in the literature on the semantics and complexity of evaluating SPARQL queries, which ensures that queries following these rules can be optimised effectively by an underlying RDF store. Moreover, we empirically verify the efficacy of the heuristics using a set of openly available datasets and corresponding SPARQL queries developed by a large pharmacology data integration project. The experimental results show improvements in performance across six state-of-the-art RDF stores.

Antonis Loizou, Renzo Angles, Paul Groth

Full publication: Web Semantics: Science, Services and Agents on the World Wide Web, Volume 31, March 2015, Pages 1–26

Publication: Scientific Lenses to Support Multiple Views over Linked Chemistry Data

When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criterion? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data, with no way of varying the notion of equivalence to be applied.

In this paper, we present an approach that enables applications to choose the equivalence criteria to apply between datasets, thus supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large-scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.

Colin Batchelor, Christian Y.A. Brenninkmeijer, Christine Chichester, Mark Davies, Daniela Digles, Ian Dunlop, Chris T. Evelo, Anna Gaulton, Carole Goble, Alasdair J. Gray, Paul Groth, Lee Harland, Karen Karapetyan, Antonis Loizou, John P. Overington, Steve Pettifer, Jon Steele, Robert Stevens, Valery Tkachenko, Andra Waagmeester, Antony Williams, Egon L. Willighagen

Full publication: The Semantic Web – ISWC 2014, Lecture Notes in Computer Science, Volume 8796, 2014, Pages 98-113

See also: Presentation: Scientific Lenses to Support Multiple Views over Linked Chemistry Data

Publication: Querying neXtProt nanopublications and their value for insights on sequence variants and tissue expression

Understanding how genetic differences between individuals impact the regulation, expression, and ultimately function of proteins is an important step toward realizing the promise of personalized medicine. There are several technical barriers hindering the transition of biological knowledge into the applications relevant to precision medicine. One important challenge for data integration is that new biological sequences (proteins, DNA) have multiple issues related to interoperability, potentially creating a quagmire in the published data, especially when different data sources do not appear to be in agreement. Thus, there is an urgent need for systems and methodologies to facilitate the integration of information in a uniform manner to allow seamless querying of multiple data types which can illuminate, for example, the relationships between protein modifications and causative genomic variants. Our work demonstrates for the first time how semantic technologies can be used to address these challenges using the nanopublication model applied to the neXtProt data set, a curated knowledgebase of information about human proteins. We have applied the nanopublication model to demonstrate querying over several named graphs, including the provenance information associated with the curated scientific assertions from neXtProt. We show by way of use cases using sequence variations, post-translational modifications (PTMs) and tissue expression, that querying the neXtProt nanopublication implementation is a credible approach for expanding biological insight.

Christine Chichester, Pascale Gaudet, Oliver Karch, Paul Groth, Lydie Lane, Amos Bairoch, Barend Mons, Antonis Loizou

Full publication: Web Semantics: Science, Services and Agents on the World Wide Web, Volume 29, December 2014, Pages 3–11 | Open access pre-print

Publication: A Knowledge-Driven Approach to Extract Disease-Related Biomarkers from the Literature

The biomedical literature represents a rich source of biomarker information. However, both the size of literature databases and their lack of standardization hamper the automatic exploitation of the information contained in these resources. Text mining approaches have proven to be useful for the exploitation of information contained in the scientific publications. Here, we show that a knowledge-driven text mining approach can exploit a large literature database to extract a dataset of biomarkers related to diseases covering all therapeutic areas. Our methodology takes advantage of the annotation of MEDLINE publications pertaining to biomarkers with MeSH terms, narrowing the search to specific publications and, therefore, minimizing the false positive ratio. It is based on a dictionary-based named entity recognition system and a relation extraction module. The application of this methodology resulted in the identification of 131,012 disease-biomarker associations between 2,803 genes and 2,751 diseases, and represents a valuable knowledge base for those interested in disease-related biomarkers. Additionally, we present a bibliometric analysis of the journals reporting biomarker related information during the last 40 years.

Alex Bravo, Montserrat Cases, Núria Queralt-Rosinach, Ferran Sanz, Laura Inés Furlong

Full publication: BioMed Research International, Volume 2014, Article ID 253128, 11 Pages

Publication: Transporter taxonomy – a comparison of different transport protein classification schemes

Currently, there are more than 800 well-characterized human membrane transport proteins (including channels and transporters), and it is estimated that about 10% (approx. 2000) of all human genes are related to transport. Membrane transport proteins are of interest as potential drug targets, for drug delivery, and as a cause of side effects and drug–drug interactions. In light of the development of Open PHACTS, which provides an open pharmacological space, we analyzed selected membrane transport protein classification schemes (Transporter Classification Database, ChEMBL, IUPHAR/BPS Guide to Pharmacology, and Gene Ontology) for their ability to serve as a basis for pharmacology-driven protein classification. A comparison of these membrane transport protein classification schemes using a set of clinically relevant transporters as a use case reveals the strengths and weaknesses of the different taxonomy approaches.

Michael Viereck, Anna Gaulton, Daniela Digles, Gerhard F. Ecker

Full publication: Drug Discovery Today: Technologies, Volume 12, June 2014, Pages e37–e46

Publication: Transporter assays and assay ontologies: useful tools for drug discovery

Transport proteins represent an eminent class of drug targets and ADMET (absorption, distribution, metabolism, excretion, toxicity) associated genes. There exists a large number of distinct activity assays for transport proteins, depending not only on the measurement needed (e.g. transport activity, strength of ligand–protein interaction), but also on the heterogeneous assay setups used by different research groups. Efforts to systematically organize this (divergent) bioassay data have large potential impact in public–private partnerships and conventional commercial drug discovery. In this short review, we highlight some of the frequently used high-throughput assays for transport proteins, and we discuss emerging assay ontologies and their application to this field. Focusing on human P-glycoprotein (Multidrug resistance protein 1; gene name: ABCB1, MDR1), we exemplify how annotation of bioassay data per target class could improve and add to existing ontologies, and we propose to include an additional layer of metadata supporting data fusion across different bioassays.

Barbara Zdrazil, Christine Chichester, Linda Zander Balderud, Ola Engkvist, Anna Gaulton, John P. Overington

Full publication: Drug Discovery Today: Technologies, Volume 12, June 2014, Pages e47–e54

Publication: Exploiting open data: a new era in pharmacoinformatics

Within the last decade, open data concepts have gained increasing interest in the area of drug discovery. With the launch of ChEMBL and PubChem, an enormous amount of bioactivity data was made easily accessible in the public domain. In addition, platforms that semantically integrate those data, such as the Open PHACTS Discovery Platform, permit querying across different domains of open life science data beyond the concept of ligand-target-pharmacology. However, most public databases are compiled from literature sources and are thus heterogeneous in their coverage. In addition, assay descriptions are not uniform and most often lack relevant information in the primary literature and, consequently, in databases. This raises the question of how useful large public data sources are for deriving computational models. In this perspective, we highlight selected open-source initiatives and outline the possibilities and also the limitations when exploiting this huge amount of bioactivity data.

Daria Goldmann, Floriane Montanari, Lars Richter, Barbara Zdrazil, Gerhard F. Ecker

Full publication: Future Medicinal Chemistry, Volume 6, No. 5 , Pages 503–514

Publication: Toxins in transit

The Pharmacoinformatics Research Group seeks to further understanding of transporter proteins and their interactions with drugs, with a particular focus on multidrug resistance in cancer. The development of the eTOX and Open PHACTS databases should encourage greater integration of pharmacoinformatics datasets so that more efficient in silico models can be created to aid the development of new drugs.

Gerhard F. Ecker

Full publication: International Innovation, Issue 127: Mapping Medicine, Pages 40–42

Publication: Applying Linked Data Approaches to Pharmacology: Architectural Decisions and Implementation

The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. In this application report, we describe a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies. We describe the architecture of the platform focusing on seven design decisions that drove its development with the aim of informing others developing similar software in this or other domains. The utility of the platform is demonstrated by the variety of drug discovery applications being built to access the integrated data.

Alasdair J. G. Gray, Paul Groth, Antonis Loizou, Sune Askjaer, Christian Brenninkmeijer, Kees Burger, Christine Chichester, Chris T. Evelo, Carole Goble, Lee Harland, Steve Pettifer, Mark Thompson, Andra Waagmeester, Antony J. Williams

Full publication: Semantic Web, Volume 5, No. 2, 2014

Publication: Nanopublication Guidelines

This document describes the structure of nanopublications and offers guidelines in their composition, implementation and use. It was produced by members of the Concept Web Alliance (CWA), an open collaborative community that is actively addressing the challenges associated with the production, management, interoperability and analysis of unprecedented volumes of data.

Paul Groth, Erik Schultes, Mark Thompson, Zuotian Tatum, Michel Dumontier, Alasdair J G Gray, Christine Chichester, Kees Burger, Spyros Kotoulas, Antonis Loizou, Valery Tkachenko, Andra Waagmeester, Sune Askjaer, Steve Pettifer, Lee Harland, Carina Haupt, Colin Batchelor, Miguel Vazquez, José María Fernández, Jahn Saito, Andrew Gibson, Louis Wich, Tobias Kuhn, Jesse van Dam

Full publication: Concept Web Alliance Working Draft 15 December 2013

Publication: Computing Identity Co-Reference Across Drug Discovery Datasets

This paper presents the rules used within the Open PHACTS Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied in the context of their application. We highlight the challenges of automatically computing co-reference and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
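The idea of context-controlled co-reference can be sketched in Python: each cross-dataset link carries the justification under which it holds, and a consumer chooses which justifications count as equivalence before chains are computed. The dataset prefixes and context labels below are hypothetical examples, not the service's actual rules:

```python
# Hypothetical cross-dataset identifier links, each tagged with the
# justification (context) under which the two identifiers are equivalent.
LINKS = [
    ("chembl:CHEMBL25", "chebi:15365", "exact-structure-match"),
    ("chebi:15365", "drugbank:DB00945", "exact-structure-match"),
    ("chembl:CHEMBL25", "wikipedia:Aspirin", "name-match"),
]

def coreference_chains(links, accepted_contexts):
    """Union-find over only those links whose context the consumer accepts."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, context in links:
        if context in accepted_contexts:
            parent[find(a)] = find(b)

    chains = {}
    for x in list(parent):
        chains.setdefault(find(x), set()).add(x)
    return list(chains.values())

# A strict consumer accepts only structure-based equivalence ...
strict = coreference_chains(LINKS, {"exact-structure-match"})
# ... while a lenient consumer also accepts name matches, growing the chain.
lenient = coreference_chains(LINKS, {"exact-structure-match", "name-match"})
```

The same link set yields different chains per consumer, which is the operational point of capturing context alongside each equivalence.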

Christian Y.A. Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J.G. Gray, Steve Pettifer, Robert Stevens

Full publication: Paper for SWAT4LS 2013, Semantic Web Applications and Tools for Life Sciences, Edinburgh, UK, December 9–12 2013

Publication: Nanopublications for exposing experimental data in the life-sciences: a Huntington’s Disease case study

Data from high-throughput experiments often produce far more results than can ever appear in the main text or tables of a single research article. In these cases, the majority of new associations are often archived either as supplemental information in an arbitrary format or in publisher-independent databases that can be difficult to find. These data are not only lost from scientific discourse, but are also elusive to automated search, retrieval and processing. Here, we use the nanopublication model to make scientific assertions that were concluded from a workflow analysis of Huntington’s Disease data machine-readable, interoperable, and citable. We followed the nanopublication guidelines to semantically model our assertions as well as their provenance metadata and authorship. We demonstrate interoperability by linking nanopublication provenance to the Research Object model. These results indicate that nanopublications can provide an incentive for researchers to expose large-scale data in an interoperable and machine-readable form.

Eleni Mina, Mark Thompson, Rajaram Kaliyaperumal, Jun Zhao, Kristina Hettne, Erik Schultes, Marco Roos

Full publication: Paper for SWAT4LS 2013, Semantic Web Applications and Tools for Life Sciences, Edinburgh, UK, December 9–12 2013

Publication: Open PHACTS Explorer: Bringing the web to the semantic web

The Open PHACTS Explorer is a web application that supports drug discovery via the Open PHACTS API without requiring knowledge of SPARQL or of the underlying RDF data being searched. It provides a UI layer on top of the Open PHACTS linked data cache and also provides a JavaScript library that facilitates easy access to the Open PHACTS API.
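A thin client wrapper of the kind the Explorer's library provides can be sketched as follows. The base URL, endpoint path, and parameter names here are illustrative assumptions, not the documented API surface:

```python
from urllib.parse import urlencode

# Hypothetical sketch of how a client wrapper for the Open PHACTS API might
# build its requests; domain and parameters below are placeholders.
BASE_URL = "https://api.openphacts.example/compound"

def compound_info_url(concept_uri, app_id, app_key, response_format="json"):
    """Build a request URL asking for compound information by concept URI."""
    query = urlencode({
        "uri": concept_uri,
        "app_id": app_id,        # per-application credentials (assumed)
        "app_key": app_key,
        "_format": response_format,
    })
    return f"{BASE_URL}?{query}"

url = compound_info_url(
    "http://www.conceptwiki.org/concept/example-compound",
    app_id="my-app-id",
    app_key="my-app-key",
)
```

The value of such a wrapper is that application code asks for a concept by URI and receives structured results, never touching SPARQL or the RDF store directly.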

Ian Dunlop, Rishi Ramgolam, Stephen Pettifer, Alasdair J G Gray, James Eales, Carole Goble, Jan Velterop

Full publication: Paper for SWAT4LS 2013, Semantic Web Applications and Tools for Life Sciences, Edinburgh, UK, December 9–12 2013

Publication: Scientific requirements for the next generation semantic web-based chemogenomics and systems chemical biology molecular information system OPS

This book focuses on applications of compound library design and virtual screening to expand the bioactive chemical space, to target hopping of chemotypes to identify synergies within related drug discovery projects or to repurpose known drugs, to propose mechanism of action of compounds, or to identify off-target effects by cross-reactivity analysis. Both ligand-based and structure-based in silico approaches, as reviewed in this book, play important roles for all these applications. Computational chemogenomics is expected to increase the quality and productivity of drug discovery and lead to the discovery of new medicines.

Jacoby, E., Azzaoui, K., Senger S., Cuadrado Rodríguez, E., Loza, M., Zdrazil, B., Pinto, M., Williams, A.J., de la Torre, V., Mestres, J., Taboureau, O., Rarey, M., Chichester, C., Blomberg, N., Harland, L., Ecker, G.F.

Full publication: Computational Chemogenomics, Pan Stanford Publishing, Singapore, December 2013, Pages 213-242

Publication: PAV ontology: provenance, authoring and versioning

We present the Provenance, Authoring and Versioning ontology (PAV): a lightweight ontology for capturing “just enough” descriptions essential for tracking the provenance, authoring and versioning of web resources. We argue that such descriptions are essential for digital scientific content. PAV distinguishes between contributors, authors and curators of content and creators of representations in addition to the provenance of originating resources that have been accessed, transformed and consumed. We explore five projects (and communities) that have adopted PAV illustrating their usage through concrete examples. Moreover, we present mappings that show how PAV extends the W3C PROV-O ontology to support broader interoperability.
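The distinctions PAV draws can be illustrated with a short Turtle fragment; the `ex:` resources are hypothetical placeholders, while the `pav:` properties come from the ontology:

```turtle
@prefix pav: <http://purl.org/pav/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

ex:dataset-v2
    pav:authoredBy      ex:alice ;            # originated the content
    pav:curatedBy       ex:bob ;              # checked and maintained it
    pav:createdBy       ex:carol ;            # produced this representation
    pav:createdOn       "2013-11-01T12:00:00Z"^^xsd:dateTime ;
    pav:importedFrom    ex:upstream-source ;  # accessed and transformed resource
    pav:version         "2.0" ;
    pav:previousVersion ex:dataset-v1 .
```

Separating who authored the content from who created this particular representation, and chaining versions via `pav:previousVersion`, is exactly the "just enough" provenance the abstract describes.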

Paolo Ciccarese, Stian Soiland-Reyes, Khalid Belhajjame, Alasdair J G Gray, Carole Goble, Tim Clark

Full publication: Journal of Biomedical Semantics 2013, 4:37