SynBio Hub, full application

The SynBio Hub platform will allow the scientific community to monitor, review and discover the latest developments in synthetic biology Intellectual Property (IP). The open source platform will initially track all IP published via the United States Patent and Trademark Office (USPTO) for relevance to synthetic biology.

Who we are

Ben Pellegrini, former CTO of CambridgeIP, where he was responsible for an online patent search platform covering over 100 million scientific documents.

Dr John Liddicoat, Philomathia Post Doctoral Fellow in Intellectual Property Law and Genomics, Centre for Law, Medicine and Life Sciences, Faculty of Law.

Dr Ben Tregenna, PhD in Quantum Computing at Imperial, early member of Autonomy with specialist skills in Information Retrieval (IR), data search and indexing systems.

Dr Peter Murray-Rust, Emeritus Reader at the Department of Chemistry and Shuttleworth Foundation Fellow, founder of ContentMine, a project providing open source software and training for text and data mining.

The idea

The SynBio Hub platform will allow the scientific community to monitor, review and discover the latest developments in synthetic biology Intellectual Property (IP). The open source platform will initially track all IP published via the United States Patent and Trademark Office (USPTO). The USPTO currently makes all patent applications and granted patents available as a weekly dump of around 14,000 documents. Although the data is freely accessible, it is hard to manage and view without access to premium services and tools that present the information more readably. Monitoring this stream of patent data will provide a valuable resource for the synthetic biology community: it gives access both to detailed scientific information about inventions and to detailed information on key players in the space, inventors and companies alike, which can help in identifying collaboration opportunities. The platform will also let users tune and edit their areas of interest, so that more relevant data ‘streams’ can be created and shared, e.g. new genome editing techniques or applications featuring particular DNA ‘parts’.

We hope that, should the project be successful, we can develop it further to include other patent sources, e.g. the EPO (European Patent Office) and WIPO (World Intellectual Property Organization), and non-patent literature sources, e.g. CrossRef and PubMed, using ContentMine tools. Historic data could also be loaded into the system, giving a much more holistic view of the state of IP within the domain of synthetic biology. However, this would significantly increase the technical challenges for the system as the volume of data grows. We are proactively seeking funding, and having a prototype would be beneficial.

Implementation

The USPTO publishes around 14,000 patent applications and grants per week, so over the 6-month project we will likely have to search roughly 336,000 documents (14,000 per week over 24 weeks). It is therefore vital that the system is designed to scale, with all the implications of big data analysis. A cloud-based infrastructure on AWS (Amazon Web Services) will be used to allow quick deployment of servers and potentially rapid growth. Data ingestion and migration routines will be written to transform the structured data dumps (XML) into a NoSQL store, which allows better management of the data and more flexibility in handling it going forward. Once the data is accessible through the NoSQL store, we will create indexes to make the documents easily accessible and searchable.

At this point we will be able to start filtering the data for synthetic biology related documents using a search algorithm that we will develop. We assume this algorithm will initially be broad enough to capture all relevant documents. It will primarily consist of broad patent classification filters and keyword filters, together with standard ontologies, identifiers and custom search criteria, possibly in the form of regular expressions generated in collaboration with synthetic biologists. Tools will then be added to the front end to allow users to fine-tune the data they see.

Once we have reduced the data to the relevant documents, we will need to present it to end users through an easy-to-use interface. Filtering the data for patents in a narrower field of interest, such as microbial, plant or biomedical patents, would also be possible using the same technologies and user interface.
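The ingestion and filtering steps described above could be sketched roughly as follows. This is a minimal Python illustration only, not the project's implementation: the XML layout, the field names, the classification prefixes (C12N, C12P) and the keyword patterns are all simplified assumptions chosen for the example, and the real USPTO bulk files use a considerably richer schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified record layout; real USPTO bulk XML differs.
SAMPLE_XML = """
<patents>
  <patent id="A1">
    <title>CRISPR-based genome editing in yeast</title>
    <abstract>A method for genome editing using a guide RNA.</abstract>
    <classification>C12N 15/10</classification>
  </patent>
  <patent id="A2">
    <title>Bicycle brake assembly</title>
    <abstract>A caliper brake with improved pads.</abstract>
    <classification>B62L 1/00</classification>
  </patent>
  <patent id="A3">
    <title>Software for designing a synthetic gene circuit</title>
    <abstract>Computer-aided design of genetic parts.</abstract>
    <classification>G16B 5/00</classification>
  </patent>
</patents>
"""

# Assumed broad filters (illustrative only): classification prefixes
# and keyword patterns that might be agreed with synthetic biologists.
CLASS_PREFIXES = ("C12N", "C12P")
KEYWORD_PATTERNS = [
    re.compile(r"synthetic\s+(biology|gene|genome)", re.IGNORECASE),
    re.compile(r"genome\s+editing", re.IGNORECASE),
]


def parse_patents(xml_text):
    """Turn each <patent> element into a plain dict, i.e. a record
    shaped for a document-oriented (NoSQL-style) store."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for node in root.iter("patent"):
        records.append({
            "id": node.get("id"),
            "title": node.findtext("title", default=""),
            "abstract": node.findtext("abstract", default=""),
            "classification": node.findtext("classification", default=""),
        })
    return records


def is_relevant(record):
    """Broad filter: keep a record if its classification OR its text matches."""
    if record["classification"].startswith(CLASS_PREFIXES):
        return True
    text = record["title"] + " " + record["abstract"]
    return any(pattern.search(text) for pattern in KEYWORD_PATTERNS)


def filter_relevant(records):
    """Return only the records that pass the broad relevance filter."""
    return [record for record in records if is_relevant(record)]


if __name__ == "__main__":
    for rec in filter_relevant(parse_patents(SAMPLE_XML)):
        print(rec["id"], rec["title"])
```

Combining the classification and keyword checks with OR keeps the filter deliberately broad, in line with the aim of capturing everything relevant first and letting users narrow the results later through the front end.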

The front end will be delivered in Ruby on Rails, making use of freely available gems and open source plugins to deliver maximum functionality in the time available (e.g. account management, analytics, search, filtering and export).

Red Katipo will develop the system. Meetings will be arranged at regular intervals with John Liddicoat, Peter Murray-Rust, members of the Synthetic Biology SRI and potentially wider stakeholders to ensure the system design is useful to potential users.

Benefits and Outcomes

This monitoring service, and the eventual corpus of synthetic biology related literature, would be freely available and accessible in a centralised place and would allow people to comment on and share content and information. It would act as a step towards developing routes for determining freedom to operate in synthetic biology. It would also complement efforts promoting IP-free sharing, the development of standardised Open MTAs (Material Transfer Agreements) and the exploration of innovative licensing arrangements.

We see an opportunity to embed such a platform in the technology and knowledge transfer workflow of research institutions, providing a storefront for IP.

We propose to use this grant to seed-fund the early development of SynBio Hub. We will seek additional funding and have had positive responses to the idea from SynbiCITE. Business models to ensure its long-term sustainability are being explored, most likely a freemium model offering value-added services. The basic platform developed during this phase will be open source and provide open data for all.

Budget

John Liddicoat, 2 days' consultancy and guidance from a genomic IP legal scholar's perspective, £800

IP Search Strategy consultant, £200

Red Katipo, 14 days' software development, £2800

£200 per day (66% discount on standard rates), in 3 main components:

software development to retrieve, extract and index USPTO data - 8 days
provision of a front end tool to publish data and allow users to access, filter, group and share data - 4 days
development of an algorithm to identify synthetic biology based documents - 2 days

Cloud-based server funded for 6 months on Amazon Web Services, based on 1 medium server with 400 GB storage, £200

Total: £4000

The additional £1000 for outreach and follow-on would be used to improve the user interface of the system, keep the server running and fund stakeholder meetings to gather constructive feedback, which will influence future plans and funding applications.
