The role of standards for provenance and metadata in federated TRE architectures

Chair: TBA

Prompts

What metadata do you provide to researchers about data released to them?
Do you conform to any existing metadata standards when providing data to researchers? Which standards?
Do you utilise any metadata standards when publishing data externally? Which standards?
Do you share/publish code with researchers? Why or why not?
What are barriers to tracking and sharing provenance and metadata between organisations? What are ways to improve this?
Outputs produced by technical solutions for provenance tracking might be inaccessible to non-technical decision makers. What can we do to address this?
To what extent are rights/licence/access-conditions currently described formally in metadata?

The importance of data provenance was discussed, along with how to ensure that provenance and additional metadata can be managed properly.

There is still little common understanding of how metadata should be requested, or of data provenance requirements.

Next steps are to map out these requirements for the community.

Key provenance questions: Where did the data come from? What data cleaning steps were applied?
Big data approach: collect as much as you can, as you don’t know what you might need in the future. Provenance is important for preserving knowledge that would otherwise disappear when people leave (e.g., longitudinal studies spanning a number of decades).
A lot of provenance information is kept in human-readable form (screenshots, PDFs, text files), which is also a preferred way of consuming it (e.g., a researcher wants to see a PDF of the questionnaire that was used). Processing provenance at scale is difficult because it is resource intensive.
The CLOSER Discovery platform (https://discovery.closer.ac.uk/) was mentioned as an example of a metadata management platform for longitudinal population studies.
There is still no common way of requesting metadata – everybody is asking for data using different templates, which is very challenging for data processors.
It was highlighted that provenance could be problematic if it exposes too much detail (e.g., potential privacy implications).
Provenance could be useful in determining whether data was indeed used in line with the permissions that were given. Currently this is discovered only through a manual process (e.g., a publication for a certain study type mentions the data, but permission was only given for a study of another condition).
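
To illustrate the last point, the sketch below shows one way a machine-readable provenance record could support an automated permission check instead of a manual one. It is only an illustration under stated assumptions: the `ProvenanceRecord` class, its field names, the `uses_outside_permission` helper, and the example values are all hypothetical and do not correspond to any existing standard or to a system discussed in the session.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical, minimal provenance record for one released dataset."""
    dataset_id: str
    source: str  # where the data came from
    cleaning_steps: list[str] = field(default_factory=list)  # what was done to it
    permitted_purposes: set[str] = field(default_factory=set)  # approved uses


def uses_outside_permission(record: ProvenanceRecord, declared_uses: set[str]) -> set[str]:
    """Return the declared uses that the recorded permissions do not cover."""
    return declared_uses - record.permitted_purposes


# Illustrative values only; not taken from any real study.
record = ProvenanceRecord(
    dataset_id="cohort-extract-001",
    source="example longitudinal study, wave 3",
    cleaning_steps=["removed direct identifiers", "recoded missing values"],
    permitted_purposes={"asthma"},
)

# A publication reports using the data for two conditions.
flagged = uses_outside_permission(record, {"asthma", "diabetes"})
print(sorted(flagged))  # prints ['diabetes']
```

The record deliberately holds only coarse information (a source label, a list of cleaning steps, and a set of permitted purposes), which reflects the concern above that provenance exposing too much detail may itself have privacy implications.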