Workshop: Data processing tools#

Leads: James Friel (University of Dundee), Aida Sanchez (UCL)

Discussion around data processing, de-identification, and cohort building.

Required preparation#

A general understanding of data anonymisation. The ICO anonymisation guidance & the ADF (anonymisation decision making framework) may be of interest as a grounding in this.

Target audience#

People who work in data de-identification and data providers for TREs

Prompts#

  • Risk appetite to deposit data in a TRE - What level of de-identification is comfortable for use within a TRE? e.g truncation, pseudo-anonymization

  • What do current data processing pipelines look like? And are their pain points in the process?

  • What de-identification tools are being used? What has worked? What hasn’t?

Notes#

CPRD Clinical Practice Research Datalink

Canon: have non-opensource tools (DICOM, FHIR, CSV, Free Text, ‘omics, Pathology)

NetCDF, ArcGIS Enterprise, 100+TB data, SPARK to process data

  • Provide data to federated TREs

Plans for using OpenShift. Possible batch schedulers:

Summary#

General discussion of approaches and tools used