1. Can you briefly describe what your research is about?
The secure analysis of health data on institutional infrastructure. We used distributed machine learning and privacy-preserving techniques to analyze partitioned health data. The FAIRHealth project developed a solution that makes these data available for processing and analysis, with open metadata, so that the data do not have to leave the hospital.
In the FAIRHealth project, we established a scalable technical and governance framework, which can combine access-restricted data from the Maastricht Study and CBS in a privacy-preserving manner. We first made the data FAIR at the data source and coupled FAIR data to a federated learning framework based on the “Personal Health Train” architecture. The project also invested in developing a governance framework, including the legal and ethical basis for processing personal data from individuals.
2. How did you do your research?
By using privacy-preserving distributed machine learning.
We applied privacy-preserving distributed data mining methods and the Personal Health Train architecture to establish a secure infrastructure for analyzing distributed data. In brief, this infrastructure enables researchers to send data analysis models to the data sources rather than transferring the original data to the researchers.
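The idea of sending the analysis to the data can be illustrated with a minimal Python sketch. It is not the project's implementation: the `DataStation` class, the function names, and the sample values are all hypothetical, and it only shows a mean computed from per-source aggregates. A real deployment would add the privacy-preserving techniques mentioned above, since aggregates alone do not guarantee privacy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DataStation:
    """Hypothetical stand-in for a data source that never exposes raw records."""
    name: str
    _records: List[float]  # private data; stays at the source

    def run(self, analysis: Callable[[List[float]], Tuple[float, int]]) -> Tuple[float, int]:
        # Execute the visiting analysis locally and return only its aggregate result.
        return analysis(self._records)


def local_sum_and_count(values: List[float]) -> Tuple[float, int]:
    """The 'analysis model' that travels to each station."""
    return sum(values), len(values)


def federated_mean(stations: List[DataStation]) -> float:
    """Combine per-station aggregates into a global mean without pooling raw data."""
    total, count = 0.0, 0
    for station in stations:
        station_sum, station_count = station.run(local_sum_and_count)
        total += station_sum
        count += station_count
    if count == 0:
        raise ValueError("no records available at any station")
    return total / count
```

The researcher only ever sees the pair returned by each station (a sum and a count), never the individual records.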
3. What tools did you use to make your research FAIR?
- Docker containers, Conda and Jupyter notebooks for making the project reproducible (FAIR software)
- Zenodo for findability (PID and file storage)
- API for accessibility
We used Docker, Conda (Python), GitLab, GraphDB, BioPortal, and data standards and ontologies (e.g., SNOMED Clinical Terms, LOINC).
4. What UM-services did you use?
We used Disqover to obtain the data variables’ metadata and a Linux machine in the UM network.
5. To what extent were you able to make your research FAIR?
We tried to make all data, analysis models, and the development process FAIR.
Before we ran experiments with real personal data, we generated simulated data and published them under a valid license. The real data variables used in this project were also documented on both the UM and CBS sides. To preserve individuals' privacy, information about data instances (whose data have been used) is not available.
The analysis models and tools used to develop the infrastructure are documented publicly in our repository with detailed information, including parameters and specific versions.
6. Is your data machine-readable?
Yes. As described above, the FAIRHealth project enables data to be processed and analyzed at their sources. To that end, the data were made interoperable and machine-readable so that the analysis model can be executed on them without human interaction in between. We converted the data from CSV to RDF (Resource Description Framework) format and stored them in a graph database.
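As a rough illustration of what a CSV-to-RDF conversion involves, the sketch below turns each CSV cell into one RDF triple in N-Triples syntax, using only the Python standard library. It does not show the tooling used in the project. The `http://example.org/` namespace, the column names, and the function names are hypothetical, and a real conversion would map columns to ontology terms rather than to placeholder predicates.

```python
import csv
import io
from typing import List
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def escape_literal(value: str) -> str:
    """Escape characters that are not allowed raw inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text: str, id_column: str) -> List[str]:
    """Emit one triple per non-empty cell: (row subject, column predicate, cell literal)."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}participant/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # the id becomes the subject; missing values yield no triple
            predicate = f"<{BASE}variable/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Made-up sample data; the second row has a missing value.
sample = "id,age,bmi\np1,54,27.3\np2,61,\n"
for line in csv_to_ntriples(sample, id_column="id"):
    print(line)
```

The resulting lines can be loaded into any RDF triple store that accepts N-Triples.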
7. What lessons have you learnt from the experience?
- Good FAIR data should be described with well-recognized terminology and should state where the data are stored and whom to ask for access.
- It is essential to keep your work reproducible for yourself and, more importantly, for others.
- Interoperability remains a challenge.
- FAIR data is not the same as Open data.
For the reproducibility of our scientific work, FAIR is a concept that we should always keep in mind when we conduct research. We should make the data, publications, models, tools, and development steps in our research FAIR. Depending on your domain and specific topic, you can emphasize one or multiple parts of FAIR.
8. How do you think we can benefit from FAIR research?
FAIR data and reproducible scientific workflows are drivers of good science.
9. Are your metadata shared in a repository?
About Chang Sun
Chang Sun is a PhD student who started at the Institute of Data Science at Maastricht University in October 2017. Her research interests cover privacy-preserving data mining, federated/distributed machine learning, and personal health data sharing and analysis.