SureChem’s 15 Million Chemical Structures from Patents to be Made Freely Available through EBI
In a significant move for the scientific community, SureChem, a collection of more than 15 million chemical structures extracted from patents, is set to be made freely available through the European Bioinformatics Institute (EMBL-EBI, or EBI for short). Digital Science, a division of Macmillan Science & Education, donated the collection to EMBL-EBI.
Nicko Goncharoff of Digital Science described the collection as a “largely organic chemistry database from patents with a strong bias toward small molecule chemistry used in drug discovery.” He noted that the primary beneficiaries of this move would be researchers working on curing human diseases.
SureChem is a unique tool that extracts chemical structure data from the full text and images of patents. Until now, such data were held within commercial systems and were inaccessible to most researchers. The donation marks the first time a complete patent chemistry data source will be freely available. The data will be integrated with the other life-science informatics resources at EMBL-EBI, which already offers molecular data.
Goncharoff explained the potential applications of this data, stating, “If you find some novel chemistry, you can go into the patents and download the chemistry of the patents and any related chemicals. You can go back then and search those against EMBL and download any related data.”
SureChem was initially established to meet the demand for bulk quantities of chemistry data extracted from patent documents produced by pharmaceutical firms. Macmillan acquired the business in 2009. Macmillan still has products focused on the corporate market, but SureChem was at an earlier stage of development. Goncharoff noted that Digital Science decided to donate the database to EBI because the institute met all of its criteria and was the best home for the SureChem database.
Antony Williams, head of cheminformatics at the Royal Society of Chemistry, commented on the move, stating, “I think this is an interesting shift and certainly they are moving the data and platform to a group of people who really understand cheminformatics and the value of integrating data.” He added that the move could be “potentially extremely disruptive” to commercial businesses dealing with patents and chemical structures.
Williams also suggested that if the data are made available quickly, pharmaceutical companies will likely pull them in-house. He also expected the release to benefit initiatives such as the Open PHACTS project and the PharmaSea project.
In 2011, IBM donated its database of more than 2.4 million chemical structures, extracted from the patent literature and biomedical journals, to PubChem. In comparison, SureChEMBL, as the collection is called at EMBL-EBI, will hold more than 15 million structures.
EBI’s main focus is serving the life science community. However, Williams noted that chemistry encompasses a lot more than organic molecules of interest to this community. He posed the question of who would step into the space to make use of these data and support direct chemistry, potentially going up against big players like CAS. He also wondered if there was a way to extend this effort to extract even more chemistry-related data for the general community.
This move marks a significant shift in the accessibility of chemical structure data, potentially revolutionizing the way researchers and pharmaceutical companies operate. It also opens up new possibilities for the extraction and use of chemistry-related data for the broader scientific community.