Michel Dumontier is a Distinguished Professor of Data Science at Maastricht University. His research focuses on the development of computational methods for scalable and responsible discovery science. Previously a faculty member at Carleton University in Ottawa and Stanford University in Palo Alto, Dr Dumontier now leads the Interfaculty Institute of Data Science at Maastricht University to develop socio-technological systems for accelerating scientific discovery, improving human health and well-being, and empowering communities with ethical data-driven decision making. He is a principal investigator in the Dutch National Research Agenda, the European Open Science Cloud, the NCATS Biomedical Data Translator, and the NIH Data Commons Pilots. He is the editor-in-chief for the journal Data Science and an associate editor for the journal Semantic Web. He is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data. In this interview Michel provides an overview of the core principles of FAIR in the biomedical research domain and gives a preview of what to expect at his SEMANTiCS 2019 Keynote.
In your video portrait for Maastricht University you talk about the overlapping areas of different scientific domains that impact your focal research area: biomedicine. Can you characterize this domain for us?
Biomedical research explores the nature and extent of fundamental biological processes, their components and how they are regulated in both healthy and abnormal states. This work involves human and non-human systems, and its experimental design, as well as the collection and analysis of the resulting data, raises a variety of social, ethical and legal issues. For instance, there are ethical review processes and legal protections that govern the collection and use of human data, and a European mandate to reduce the use of animals in pre-clinical studies. As such, it is crucial that we not only keep informed of relevant legislation (GDPR, EU Copyright Directive, Directive 2010/63/EU, etc.), but also ensure that we are compliant in the work that we do, and that we respect the privilege of mining personal health data for the purpose of understanding human health and disease.
Where in this area do semantic technologies come into play?
Semantic technologies have proven crucial not only to how we represent and reason about complex biomedical knowledge, but also to how we make this knowledge available to others. Semantic technologies such as ontologies enable us to describe our knowledge in biomedicine and enable machines to deduce the answers to questions about that knowledge. Moreover, semantic technologies such as RDF and SPARQL provide a powerful paradigm to structure data and to query that information across different knowledge providers. We have used these technologies to structure biomedical knowledge and, in doing so, have made it easier to mine this knowledge in the pursuit of new discoveries.
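As an editorial illustration of the idea of structuring data as triples and querying across providers, here is a minimal sketch in plain Python. It is not RDF or SPARQL itself and not the interviewee's method: the `ex:` identifiers, the two provider lists, and the functions `match` and `genes_linked_to` are all hypothetical and exist only for this example.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# A statement is a (subject, predicate, object) triple, as in RDF.
Triple = Tuple[str, str, str]

# Hypothetical data from two independent knowledge providers.
PROVIDER_A: List[Triple] = [
    ("ex:geneA", "ex:encodes", "ex:proteinA"),
    ("ex:geneB", "ex:encodes", "ex:proteinB"),
]
PROVIDER_B: List[Triple] = [
    ("ex:proteinA", "ex:associatedWith", "ex:diseaseX"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


def genes_linked_to(disease: str, *sources: Iterable[Triple]) -> List[str]:
    """Join two patterns over the merged sources: gene -> protein -> disease."""
    merged = [triple for source in sources for triple in source]
    genes = []
    for protein, _, _ in match(merged, p="ex:associatedWith", o=disease):
        for gene, _, _ in match(merged, p="ex:encodes", o=protein):
            genes.append(gene)
    return sorted(genes)


if __name__ == "__main__":
    print(genes_linked_to("ex:diseaseX", PROVIDER_A, PROVIDER_B))
```

Because both providers use the same identifier for the protein, the question can only be answered once their data are combined. In practice this role would be played by shared URIs in RDF and a SPARQL query rather than Python lists.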
“Accelerating biomedical discovery with an Internet of FAIR data and services” is the title of your Keynote at SEMANTiCS 2019. Give us a preview of this idea! What is an “Internet of FAIR data and services”? Please elaborate.
In 2016, we published a paper in Nature Scientific Data describing the FAIR - Findable, Accessible, Interoperable, Reusable - principles with the goal of improving how we discover and reuse high-value research data. Since then, the FAIR principles have been endorsed by the G20, the European Commission, Horizon 2020, NWO, NIH, and many other funding agencies, journals, and communities. Being able to access and reuse research data is crucial for tasks involving reproducibility, hypothesis generation, and validation. But achieving those functional objectives requires us to radically rethink how we make research data available and, importantly, to ensure that the content we create is accessible to machines. Why? Not only because machines can help us make sense of the immense treasure trove of content we collectively create, but also because we need to examine the evidence that underlies the published facts. This collective of FAIR content - an Internet of FAIR data and services - is globally produced, decentralized content that adheres to the 15 FAIR principles and in doing so makes those resources easier to find and reuse.
Can you name some typical use cases for FAIR data and services and their main benefits over prior approaches?
discovery science: first and foremost, establishing an Internet of FAIR data and services will enable and accelerate discovery science. Discovery science makes use of collected data to uncover novel and plausible patterns that lie underneath complex phenomena. The more data are made available, the more opportunities arise for data mining and machine learning.
reproducibility: we are increasingly discovering that not only are most research findings false, but our ability to reproduce original studies is seriously compromised, leading to high rates of non-reproducibility. FAIR plays a role in this by promoting an agenda of making research results available for others to verify. This comes from, among other aspects, the use of persistent identifiers, use of institutionally supported repositories, clear instructions for accessing protected content, and detailed provenance for how these results were generated. It’s important not only that other people are able to attempt to replicate a finding (where possible), but also that the research findings can be reproduced in complementary systems. We need to know the boundaries of findings generated from a particular experimental system, and an Internet of FAIR data and services should help establish what we know and what we don’t know, and help researchers focus on building the strength of evidence and uncover fertile, unexplored ground.
validation: discovery science produces novel associations that need further examination. Typically, this may require new experiments to be performed. However, the results of such experiments may in fact already be available, thereby bypassing the need for a new experiment to be performed. Hence, discovery scientists may be able to validate their research results with external data, provided those data are FAIR.
What are the key accelerators in this domain? Which improvements would you like to see in future contributions in this domain?
There’s substantial hype, optimism, and pessimism around FAIR - while people generally agree that FAIR is largely sensible, there’s an enormous gap in how to make FAIR a reality for the everyday person. I think that key challenges lie in how we integrate our semantic technologies into everyday tasks so that researchers don’t need to think twice about making their data FAIR. How do we incorporate new technologies to automatically capture the provenance of data collection, annotation, cleaning, analysis, publication, and discussion? How do we make it so that sensitive data can not only be discovered but also be accessed by authorized users following standardized and computer-facilitated protocols? How can we promote a culture of data sharing with fine-grained attribution indicators and immediate returns on investments? It’s important for me to emphasize that the solution to this long-standing problem is social, legal, as well as technological.
What are the main issues that you see researchers and practitioners in the biomedical domain confronted with in their digital transformation journeys at the moment, and what are the five most important steps for researchers and practitioners to make an Internet of FAIR data and services work?
Learn what FAIR is and what it could mean for you. Starting with the original paper and reading up on the latest contributions, get a sense for what FAIR stands for, and what it entails.
Critically examine every part of your research workflow - are the inputs to your work FAIR? How can you make your research products (datasets, software, publications, web services, repositories, etc.) FAIR? Is there a way to automatically gather more high-quality metadata instead of manually curating it after the fact?
Develop a plan to make your content FAIR. Inquire whether your institution has support for FAIR data management. Contact members of your community and evaluate what tools and resources (e.g. ontologies, data formats, etc.) exist. Reach out to GO-FAIR and find others who are thinking about their approaches.
Execute your plan in a new project and share your experiences with others. Everybody is looking for guidance and insight into making their data and services FAIR. Sharing how you did it will help others understand what needs to be done, and where the pitfalls currently lie. Reach out to software development groups and companies to see if they can help improve the workflow to be more efficient.
Make use of and contribute to the emerging Internet of FAIR data and services. More and more researchers are learning the tricks of the trade in using published data and services to explore research questions. I challenge you to publish your data first, then develop a method of analysis for that FAIR data, and data like it. Many initially sceptical groups have now recognized that their data and software may act as an incredibly fertile ground to accelerate discovery science worldwide. Create and use a meta-analytical framework to increase the confidence of your results before you publish, and enable others to reproduce your work.
About SEMANTiCS
The annual SEMANTiCS conference is the meeting place for professionals who make semantic computing work, understand its benefits, and know its limitations. Every year, SEMANTiCS attracts information managers, IT architects, software engineers, and researchers from organisations ranging from NPOs, universities, and public administrations to the largest companies in the world. http://www.semantics.cc