Smart Data Foundry has a simple but ambitious vision: to become the trusted UK organisation for partnering with the private sector to safely and securely share deidentified data to power impactful research into society’s big challenges.
Creating access to data that is usually unavailable for researchers does not come without challenge. But the value to society that can be unlocked from this data makes it well worth overcoming the challenges of sharing data at scale.
At Smart Data Foundry, we have pioneered a new approach to data partnerships that makes this possible. As an independent, not-for-profit organisation, we are making significant progress towards this vision by:
- Bringing together the best of academia (research, rigour, independence) through a special relationship with the University of Edinburgh
- Working within a professional infrastructure, including the safe haven at EPCC, the University of Edinburgh’s world-class supercomputing and data facilities centre, an Information Governance team, a Data Science team and a close working relationship with the Information Commissioner’s Office (ICO)
- Working with data partners and government to produce pragmatic insights with real-world impact
- Producing deidentified, granular, research-ready data, updated monthly, from trusted data partners.
The focus of this blog is to share the work we’ve been doing with the ICO to explore the boundaries of what is possible through data sharing for good.
What is the ICO Sandbox project?
The Sandbox is a service developed by the ICO to support organisations that are creating products and services which use personal data in innovative and safe ways. The project is based on co-developing resources, making the most of guidance put forward by the ICO, and feeding clarifications and developments back into that guidance in the form of a final report.
As a result of this project, we were able to work together on several important documents:
- a full DPIA assessing our research proposition and storage of deidentified financial data,
- an “anonymisation assessment” to test the identifiability of the data we store,
- and, crucially, testing and scrutiny of the safeguards in our Five Safes Framework approach.
Our project set out to tackle the data protection implications of:
- Creating a repository of financial data to be processed for research
- Exploring linking multiple datasets provided by multiple data controllers to support UK researchers across different research domains and use cases
- Creating synthetic datasets from data held in the above repository to help support social and economic innovation
During the Sandbox, we decided that creating synthetic datasets (or ‘doubles’) from real data was not the synthetic data approach for us. Instead, we focused on the agent-based modelling method, which you can read more about in our white paper on the topic. We were nonetheless really pleased with the help we received to establish a supportive legal basis, should we pursue synthetic doubles in future.
What did the project find?
We worked in consultation with the ICO to tease out important details on how to achieve our goal of creating lasting social impact and driving economic growth through data-driven research and innovation. Some key discussions included:
Purpose: to ensure we comply with the UK GDPR ‘purpose limitation’ principle, which protects data from being used for a purpose different from the one it was originally collected for, we used the ICO’s research guidance and together co-developed a ‘legitimate interest assessment’ covering ‘research in the public interest’, establishing our own separate lawful basis to process the data. A version of this assessment, which will be published on our website soon, assesses:
- The purpose of the activity, and what the benefits would be for the organisation and the data subjects
- The necessity and proportionality of the processing – is it reasonable to achieve the aims, is it the most effective method, could it be done less intrusively (for example with less granular data)
- The sensitivity of the data, whether individuals would expect this processing to take place, any communication with the data subject, and any potential impacts on an individual’s rights
After weighing the benefits against the impacts, we and the ICO concluded that the processing carries sufficient public benefit, and that the safeguards mitigate the impact sufficiently, for legitimate interest to apply to the processing, provided we assess all new or changed projects against this balance.
Data providers working with us are still required to conduct a compatibility test as part of their own data protection assessments when sharing deidentified personal data. However, as the purpose limitation provisions in the Data Protection Act state that reusing personal data for research-related purposes is compatible with the original purpose as long as appropriate safeguards are in place, this data processing activity is likely to be assessed as compatible.
Identifiability of data: One of the key “bars” to reach when deidentifying research data, which we do by default, is to anonymise the data to the point that it can be deemed “effectively anonymised”, and hence no longer needs to be treated as personal data under GDPR. Reaching this bar requires a combination of factors: effective pseudonymisation, “banding” or aggregating information, not having access to fields from which individuals can be “singled out” or potentially linked, and only accessing the data in environments with no access to further data that could “de-anonymise” it, that is, re-identify people.
To test all our datasets against this bar, we co-developed an anonymisation assessment which sets out, for all our data:
- How identifiable the data is, considering the ability to single individuals out, whether any of the data is linkable, and any inferences or predictions that can be made from specific data fields
- How the data environment applies controls to the data and reduces the identifiability highlighted within the data
- The reasonable means available to a motivated intruder in conjunction with those controls
- What additional controls and anonymisation can be applied on top of the above to ensure the dataset is effectively anonymised
We tested this process on the data that is shared with us by NatWest Group and validated that, with the controls in place and the semi-aggregation techniques used, this data meets the bar of “effectively anonymised” as long as it is only accessed within our EPCC Safe Haven environment in conjunction with our Five Safes Framework.
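To make the “singling out” question more concrete, here is a minimal Python sketch of one way such a check could look. It is an illustration only, not the assessment we co-developed with the ICO: the function name, the field names, the sample records and the threshold `k` are all assumptions. It counts how many records share each combination of banded fields and flags any combination shared by fewer than `k` records.

```python
from collections import Counter


def singled_out(records, quasi_identifiers, k=5):
    """Return combinations of quasi-identifier values shared by fewer than k records.

    Hypothetical helper: a small group size suggests the individuals in it
    could be singled out from these fields alone.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Made-up, already-banded example records (not real data).
records = [
    {"age_band": "30-39", "postcode_area": "EH", "spend_band": "low"},
    {"age_band": "30-39", "postcode_area": "EH", "spend_band": "low"},
    {"age_band": "30-39", "postcode_area": "EH", "spend_band": "low"},
    {"age_band": "70-79", "postcode_area": "G", "spend_band": "high"},
]

risky = singled_out(records, ["age_band", "postcode_area"], k=3)
print(risky)  # {('70-79', 'G'): 1}
```

A flagged combination like the one above would be a candidate for wider bands or further aggregation before the data is made available to researchers.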
What about if special category data was involved?
There are additional safeguards if our research ever involves special category data. These include requiring a separate legitimate interest assessment to demonstrate the necessity of processing that specific special category field, showing that the public interest is congruent with the wider societal interest, and showing that the research is scientific in nature. Our work so far does not use these data fields. If we were to do so, we would publish our methodology explaining exactly why this was required and what our safeguards entailed.
What does this mean for our data processing?
This project was an important validation of our data protection processes, and it also helped us refine them further and record information in more detail. We have successfully integrated outputs from this project, notably the DPIA and the anonymisation assessment, into our governance framework. We will be publishing more and more information over the coming weeks on exactly how we protect the data, the pseudonymisation controls we apply, our disclosure checks, and more detail on how we operate our Five Safes Framework.
Where can I find these outputs?
We will be publishing simplified and redacted versions of both legitimate interest assessments, the anonymisation assessment and the overall DPIA for our research proposition on our website very soon. Look out for their release.
What about the project report?
How has this affected our approach going forward?
We are an organisation that focuses on deidentifying data and managing our Trusted Research Environment controls to the point where the data can be considered effectively anonymised. We do this through a combination of techniques, including hashed and salted pseudonymisation, segregation of datasets, a sufficiently large sample size, and semi-aggregation of transactions, ages and partial postcodes where possible.
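For readers who want a feel for what techniques like these involve, the Python sketch below shows salted-hash pseudonymisation, age banding and postcode truncation in their simplest form. It is illustrative rather than a description of our actual pipeline: the function names, the ten-year band width, the sample values and the choice of HMAC-SHA256 (a keyed hash, with the secret key playing the role of the salt) are all assumptions made for the example.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, salt: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, an assumed choice).

    The same identifier and salt always give the same pseudonym, so records
    can still be grouped, but the original value is not stored.
    """
    return hmac.new(salt, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def band_age(age: int, width: int = 10) -> str:
    """Map an exact age to a band such as '30-39' (band width is an assumption)."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def partial_postcode(postcode: str) -> str:
    """Keep only the outward part of a UK postcode, e.g. 'EH8 9BT' -> 'EH8'.

    The inward part of a UK postcode is always three characters, so it is
    dropped from the end after removing whitespace.
    """
    compact = "".join(postcode.upper().split())
    return compact[:-3]


# Made-up example record; in practice the salt must be kept secret.
salt = b"example-secret-salt"
record = {"customer_id": "cust-0001", "age": 34, "postcode": "eh8 9bt"}

deidentified = {
    "pseudonym": pseudonymise(record["customer_id"], salt),
    "age_band": band_age(record["age"]),
    "postcode_district": partial_postcode(record["postcode"]),
}
print(deidentified["age_band"], deidentified["postcode_district"])  # 30-39 EH8
```

None of these steps is sufficient on its own; they only contribute to effective anonymisation in combination with the environment controls described here.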
We rely on our Five Safes Framework, and the Trusted Research Environment controls that are part of it, to ensure the data we are entrusted with is kept safe and carries as low a risk as possible of a breach or security incident.
We utilise our DPIA and legitimate interest processes effectively to judge if the projects we undertake are compliant with our “research in the public interest” legal basis, and are congruent with our missions. We will be developing further work on our ethics assessments and other public engagement to complement these processes and ensure they stand up to scrutiny.
We hope reading the report helps strengthen wider public trust in how data can be used to drive research and insights that uncover data stories to address our missions, and we thank the ICO for a rewarding experience.