Current state of the art re data linkage/federation/AI&ML&LLM across infrastructures: federation, governance, safe output methods
Overview
Summary
Issues around the federation of datasets were discussed, including identifying different datasets across multiple systems, how to collect identifiable information robustly, and how to link up the different approaches across the four nations effectively.
There was further discussion on how to effectively check ML models within TREs.
On governance, it was suggested that a project working across multiple TREs should have a single governance process.
Next steps
Create a ‘panel’ focused on a specific type of data/research (e.g. health, crime, financial) that can oversee specific research projects within these fields.
Raw notes
Data Linkage
How do you go about the NHS Number?
Uses NHS Standard NF5, after 3 they went to manual to track through the system.
Issues with health and non-health data
Names such as Dave / David can cause problems.
Linksmart is a solution for this.
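To illustrate the Dave / David problem, here is a minimal sketch of name canonicalisation before comparison. It is not a description of how Linksmart works; the nickname table and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical nickname table (an assumption for this sketch); a real linkage
# service would rely on a much larger, curated list.
NICKNAMES = {"dave": "david", "debbie": "debra"}


def canonical_first_name(name: str) -> str:
    """Trim whitespace and trailing dots, lower-case, then map known nicknames to one form."""
    cleaned = name.strip().strip(".").lower()
    return NICKNAMES.get(cleaned, cleaned)


def same_first_name(a: str, b: str) -> bool:
    """Compare two first names after canonicalisation."""
    return canonical_first_name(a) == canonical_first_name(b)


print(same_first_name("Dave", "David"))   # True
print(same_first_name("David", "Debra"))  # False
```

A lookup table like this only covers variants someone has listed in advance, which is one reason name-based matching remains fragile.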
Collecting Crime Data
Scotland’s Approach
a national ID number
Federation between datasets
Identifying with confidence across TREs is important
Problem: linking health data with something else is difficult, because the records have to be matched up on addresses and names
Separation functions
Person has all the identifying information, but they do not have the data
Communication between TREs needs specific criteria; Scotland has 5 TREs
Having more than two, and introducing a central one is a possibility
Issues with identifying A-B data sets across multiple systems
Seeding Death Data – David and Debra Smith: both appear as ‘D. Smith’, which causes gender incompatibility issues
National Drug Treatment Data – at source they only collected initials (‘D.S.’), gender and MM/YYYY of DOB. De-identifying can cause linking problems.
Education to non-education, where they don’t have a common ‘number’ – if you’re not sharing names & addresses, how confident can we be that Participant A is the same participant in another TRE?
Bringing in NHS data and also pseudonymising it – how can you work with it without a key?
Once you have a data linkage, the different data types can be brought into one dataset (TRE). E.g. linking mental health data and shopping data: if you anonymise that and have their own key, they can do it anonymously for external sources
Education data between England, Scotland and Wales might use different notations
Residential Data can be used as a key
‘E-child’ trying to link the NHS with the Department of Education
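The questions above about pseudonymised linkage and weak identifiers can be sketched with a keyed hash (HMAC). This is a minimal sketch under stated assumptions, not the method used by any TRE mentioned in the discussion: the key, the `pseudonym` helper and all records are invented for illustration.

```python
import hashlib
import hmac


def pseudonym(secret_key: bytes, *fields: str) -> str:
    """Derive a stable pseudonym from identifying fields using a keyed hash."""
    normalised = "|".join(field.strip().lower() for field in fields)
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key, assumed to be held only by a trusted linkage service.
KEY = b"example-key-held-by-linkage-service"

# Full identifiers: the same person links despite formatting differences,
# and two different people stay apart.
a = pseudonym(KEY, "David", "Smith", "1980-03-14")
b = pseudonym(KEY, "david ", "SMITH", "1980-03-14")
c = pseudonym(KEY, "Debra", "Smith", "1980-03-14")
print(a == b)  # True
print(a == c)  # False

# Weak identifiers (initials and MM/YYYY only): two different people collide.
weak_david = pseudonym(KEY, "D.S.", "03/1980")
weak_debra = pseudonym(KEY, "D.S.", "03/1980")
print(weak_david == weak_debra)  # True

# Adding gender separates these two invented records, but not every such pair.
with_gender_david = pseudonym(KEY, "D.S.", "M", "03/1980")
with_gender_debra = pseudonym(KEY, "D.S.", "F", "03/1980")
print(with_gender_david == with_gender_debra)  # False
```

Without the key, a TRE holding only pseudonyms cannot recompute or reverse them, which is why the question of who holds the key matters. The weak-identifier case shows that a matching pseudonym is only as trustworthy as the fields that went into it.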
AI & ML
People confuse the terms AI & ML with ‘Statistical Modeling’
Based on risk factors, you can predict the chance of pre-diabetes with 70% precision
Accessing ‘clinical like data’ with similar terminology to mimic clinic systems
AI – Offline AI: you can have an offline machine learning model – yes
Would multiple AIs learn the same thing on same data sets? – no
You can make it work with a shared API though (Stroke Prediction)
APRs – 8-9 expensive centre
ML is interpreted differently depending on context: ML on health data is seen as something that ‘takes your job’, while ML in other scenarios might be socially acceptable
Pattern-finding models are popular and precise; this is lacking in statistical modeling
At the end of the day, it is not understood why ML on medical data gives the results it does
Checking models is problematic and difficult; uncertain results and uncertain model contents raise the question of the model’s authenticity
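As a reminder of what the 70% precision figure mentioned above measures, here is a minimal sketch. The counts are invented for illustration and are not results from any model discussed.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of positive predictions that were actually correct."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no positive predictions, so precision is undefined")
    return true_positives / flagged


# Invented counts: 100 people flagged as pre-diabetic, of whom 70 really are.
print(precision(70, 30))  # 0.7
```

Precision says nothing about the pre-diabetic people the model failed to flag, so on its own it does not show that a model is safe to rely on.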
Governance
The process is repeated a lot; the committees do not talk to each other and each is a separate entity
Cannot start work unless approved
For a project between TREs, each TRE has its own approval process. Ideally a multi-TRE project would require a single approval process, with that decision accepted by the other TREs
What would a solution to this problem look like?
Current state of the art is the overarching question – needs a TRE panel to decide what is state of the art
Single ‘panel’ on a specialty (e.g. health, crime) that deals with specific projects, plus members of the national TRE supervision