Sight unseen: how far can we go with keeping data hidden from users?



Keeping the real data hidden from users is the model used by OpenSAFELY. The session explored how to ensure that the provided metadata is sufficient, how to extend the approach to more complex data (highly relational or linked databases), and the implied need for code review before anything runs on the actual data.

In summary, this can be done, but there are limitations.

Raw notes

  • What are the advantages and disadvantages of hiding data from users?

  • How do we minimise barriers and frustration when working with unseen data?

  • Pros and cons of hiding data. Is it even worth it?

  • Challenge with interpreting the question: is this only about restricting identifiable information?

  • In what scenario would it be beneficial to keep data hidden?

  • Federated analytics, as in the OpenSAFELY model: users see data that is structured the same as the original but filled with random (synthesised?) values.

    • Can we provide sufficient metadata to allow for unclean or missing data?

    • Additional challenge with more complex data (highly relational/linked databases)

    • There is a need for code review before running on the original data

  • Whose responsibility is it to create the metadata and do the cleaning? The data provider? The TRE (probably not)?

  • On the question of how far we can take this:

    • It is possible, but there are limitations, including reducing the chance of the results.

  • Pros of hiding data:

    • increased trust in research

    • potential for higher-quality research (no p-hacking, more hypothesis testing, less data mining, etc.)

  • There are some doubts about the value/need for this. Aren’t TREs with anonymised data enough?
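
The dummy-data idea noted above (data with the same structure as the original but random contents) can be sketched roughly as follows. This is only an illustration, not OpenSAFELY's actual tooling: the `SCHEMA` format, the column names, the `missing` rates, and the functions `dummy_value` and `make_dummy_rows` are all hypothetical. It assumes the data provider publishes, for each column, a type, a plausible range or set of values, and an approximate share of missing values.

```python
import random

# Hypothetical metadata a data provider might publish: column name -> spec.
# "missing" is the assumed fraction of empty values in the real data.
SCHEMA = {
    "patient_id": {"type": "int", "min": 1, "max": 10**6},
    "age": {"type": "int", "min": 0, "max": 110, "missing": 0.05},
    "sex": {"type": "category", "values": ["F", "M", "U"]},
    "bmi": {"type": "float", "min": 10.0, "max": 60.0, "missing": 0.3},
}


def dummy_value(spec, rng):
    """Draw one random value matching the column spec, or None if 'missing'."""
    if rng.random() < spec.get("missing", 0.0):
        return None
    if spec["type"] == "int":
        return rng.randint(spec["min"], spec["max"])
    if spec["type"] == "float":
        return round(rng.uniform(spec["min"], spec["max"]), 1)
    if spec["type"] == "category":
        return rng.choice(spec["values"])
    raise ValueError(f"unknown column type: {spec['type']}")


def make_dummy_rows(schema, n_rows, seed=0):
    """Build n_rows dicts with the schema's columns and random contents."""
    rng = random.Random(seed)  # seeded so dummy data is reproducible
    return [
        {name: dummy_value(spec, rng) for name, spec in schema.items()}
        for _ in range(n_rows)
    ]


if __name__ == "__main__":
    for row in make_dummy_rows(SCHEMA, n_rows=5):
        print(row)
```

Analysis code written against such rows has to cope with the declared missing values and ranges, which is where the quality of the metadata matters. Anything the metadata does not describe (unexpected codes, inconsistent links between tables) will only surface when the reviewed code runs on the real data.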

Roadmap plan


  • What would a solution to this problem look like?

  • What resources would be needed (people, time, funds, infrastructure etc.)?

  • How can this community support you in getting them?

  • What working groups/orgs are already working on this, if any? How can we collaborate with them effectively?


  • Something along the lines of the OpenSAFELY model could work

  • Requires trust in the data providers and researchers

  • Limited in the types of data and types of analyses it can support

  • Resources required: people to do the code review step