# December report #35

Merged (20 commits) on Jan 31, 2024
@@ -7,3 +7,26 @@
- What is the split (if any) between TRE/SDEs being hosted locally vs cloud?
- What are the merits of either option?
- [assuming there are participants who have hosted locally] Are there any challenges or tips for building and hosting locally?

## Summary

The conversation centred on the cost implications of local vs cloud TREs.

Possible approaches discussed included pan-cloud providers such as Snowflake, as well as hybrid approaches combining on-premises and cloud components.

Next steps included using the UK TRE Community to standardise investment in the space and so reduce the cost of TRE provision.

## Raw notes

- We all have the same cost management problems: all of us use limited budgets to balance storage, compute and operations costs. But why does this infrastructure not exist nationally, with national investment to maintain a single architecture for shared use? We should lobby 'up' as a community to influence policy and policy makers.
- Separating SDEs and TREs allows storage costs to be managed separately from the dynamics of compute costs. Having the flexibility to let project TREs pay for their own compute is advantageous (a toy cost-splitting sketch follows these notes).
- Suggestions to look at using Snowflake or Data fabric to facilitate this
- Bringing compute to data (by allowing the project TREs to see the data in the SDE) allows the balance of forecasting to sit with the funded party (the research project via its funding)
- Forecasting costs, as well as operating costs, are key. All plant will need refreshing every few years, which is a HUGE investment case, scaling exponentially. This is better handled by a national infrastructure provider.
- Decide on costs before choosing a provider. Lock-in to a single cloud is inevitable; pan-cloud providers (like Snowflake) offer a solution.
- Hybrid working with on-prem and cloud may be a good balance, and is influenced by the exact use cases of the organisation (and its clients)
- An additional factor worth considering is the flexibility to keep up with developments across the industry (i.e. what is the new shiny thing, and being able to facilitate/provide access to it)
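
To make the storage/compute split discussed above concrete, here is a minimal cost-splitting sketch. All rates and usage figures are hypothetical placeholders, not real cloud pricing; the point is only that the SDE carries a predictable storage cost while each project TRE pays for the compute it actually uses, so forecasting sits with the funded project.

```python
# Illustrative sketch only: all rates and usage figures below are hypothetical
# placeholders, not real cloud pricing.

STORAGE_RATE_PER_TB_MONTH = 20.0   # hypothetical £/TB/month, paid by the SDE
COMPUTE_RATE_PER_VCPU_HOUR = 0.05  # hypothetical £/vCPU-hour, paid by each project TRE

def sde_storage_cost(terabytes_held: float, months: int) -> float:
    """Fixed storage cost carried centrally by the SDE."""
    return terabytes_held * STORAGE_RATE_PER_TB_MONTH * months

def project_compute_cost(vcpu_hours: float) -> float:
    """Variable compute cost billed to the project TRE that used it."""
    return vcpu_hours * COMPUTE_RATE_PER_VCPU_HOUR

if __name__ == "__main__":
    # One SDE holding 50 TB for a year, plus two project TREs with very
    # different compute profiles.
    print(f"SDE storage (50 TB, 12 months): £{sde_storage_cost(50, 12):,.2f}")
    print(f"Project A compute (2,000 vCPU-hours): £{project_compute_cost(2_000):,.2f}")
    print(f"Project B compute (40,000 vCPU-hours): £{project_compute_cost(40_000):,.2f}")
```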

### Next steps

- Advocate for the UK-TRE community (or equivalent) to lobby 'up' for standardised investment, with the aim of reducing costs across the sector (rather than splitting funding across numerous initiatives)
@@ -10,3 +10,56 @@ This breakout will discuss whether and how a common governance model can be achieved
- What percentage of the problem of data silos will be solved by a common governance model?
- What can be incorporated in a common governance model, and what can't?
- How do governance models need to be updated to work for a multi-TRE network, rather than just individual TREs?

## Summary

The [Swansea paper](https://ijpds.org/article/view/2164) was referenced, highlighting its finding that the problem of multi-TRE analysis can be addressed considerably through adoption of a common governance model.

The question was then asked - how do you get there, in a trustworthy and safe way?

It was discussed whether an entirely new framework is needed or whether existing arrangements can be adapted, and whether solutions should exist within existing TREs or new TREs should be built.

Many next steps were discussed, with a special focus on lobbying as a community to government, data providers and policy makers to move towards a more common, aligned approach to data provision and research.

## Raw notes

- Variation in data governance arrangements results in silos, a barrier to federated analysis of data held across multiple TREs
- Review of TREs involved in federated collaborations (?) and their governance arrangement models: https://ijpds.org/article/view/2164
- Conclusions of the paper: problem _can_ be addressed considerably through adoption of a common governance model
- How do you get there?
- Changing existing models and cultures is hard
- When participating with another legal entity, the responsible party will still remain the responsible party
- In a joint partnership the new model has to distribute the risk across the partnership group
- MRC example by Paul Colville-Nash: accessing GPs in 4 nations, relationship between parties
- Infrastructure requirements can be quite different, standardisation would need to encompass these
- Old TREs have very similar models but different naming
- How do we get there, and how does it all work?
- How do we think the world works? Do we bring data to researchers, or researchers to data?
- What feels more trustworthy from the data owners' view?
- How safe are the existing TREs?
- How a structure can enable multi-TRE collaboration
- Are there examples where this has been done successfully?
- Use-case with rare disease where large sample size needed (but needed complex sharing agreements)
- COVID: federation was not solved
- Funder looking at global population level
- To what extent do we need a completely new framework, if changing existing arrangements is too hard? Ripping up existing governance arrangements carries a risk of undermining trust.
- Should we be moving faster in this space? Getting any type of data takes time, and we may need to let time solve some things
- Funders would challenge establishment of a new TRE
- The wave is changing towards establishing TREs inside TREs with shared governance models
- The existing TRE landscape is a free market: should we, and do we want to, introduce new 'TRE products' _within_ existing trusted TREs that are designed around the ability to federate/carry out multi-TRE analyses? That way data providers benefit from the experience of the established host TRE. This can enable a gradual migration from one governance model to another without needing to rip up the entire TRE governance model (analogous to a software migration). It can start with a small number of such TRE 2.0 products and expand more organically (land and expand).
- The traumatic brain injury hub is a secure dataset living in a separate hub within DPUK
- Is this a tailored access model per dataset? We need to separate out the datasets from the access to and analysis of them (SDE vs TRE), and then allow the existing IG to see the new approach as a small, acceptable step of growth.
- TRE operators have to convince data providers not just to provide the data, but also to curate and document it, repeated n times, all before the data can be provided to a project. The onus on the data owner is not consistent; this is often done by the TRE instead, and is very costly to undertake.
- 'Ground truth' of datasets/versioning if multiple versions are held across different TREs?

### Next steps

- Bring the governance models close enough to each other to be harmonised, giving FAIR data as the way forward
- Build a communication line between the legal entities called 'Data Owners' giving their data to 'Data Providers', to facilitate balanced risk and outcome delivery. **The primary objective in this model is to reach a balance and lower costs.**
- Open lines of communication between the TRE community and government for the implementation of regulations
- For specific use cases we have achieved this; the next step is to expand the solutions to more general, population-wide use cases in multi-TRE work
- Streamline a repeated process for provisioning data via different paths
- In some sense it feels like we have the opportunity, i.e. cloud-based models offer a way of rebuilding anew
- Get rid of different instances of similar entities in a cloud environment
- Adaptation of the existing models (or creation of a new one?)
- We as a community need to be lobbying for a change of focus for data owners/controllers, to make sure that they recognise that they have a purpose and responsibility to make the harder research problems possible, the time to research faster, and the costs to do the research lower (time and £). If not this community, then who? We need to act as a lobbying function into government policy makers.
@@ -9,3 +9,32 @@
- What, if any, will be the accreditation standards SDEs hold them to?
- What are the current options for unmet technical capability?
- Do you foresee the new data access policy changing your workflows at all? If so, how?

## Summary

The NHS SDE network, and its direction of travel, were discussed.

The main principle is that health data will exist within SDEs, and that sub-optimal solutions will likely exist in the short term before full maturity of the SDE network is reached.

The NHS SDE team hopes to be more directly involved with the community going forwards.

## Raw notes

- High level context of definitions
- 11 Regional SDEs across England + National NHSE SDE
- Different types of SDEs
- some curate only
- some curate + have technical environment for analysis
- NHS Research SDE Network - HDRI Gateway
- Challenges in federation, working across multiple TREs
- Direction of travel is that health data will exist in SDEs - policy will default to SDEs
- Do you go for a fully federated set up?
- Requires common set of standards
- There's a long road ahead in terms of maturity, will likely have to temporarily settle for sub-optimal solutions as the situation evolves
- Link to NHS England National SDE information - https://digital.nhs.uk/services/secure-data-environment-service
- Is there a way to get other TREs accredited for the network?
- Probably not for now; there are no plans to accredit others outside the network

### Next steps

- Have an NHSE SDE policy team person attend a future session as well (March 2024)

_Reviewer comment (Member):_ There is an action here for next event. uk-tre/community-management#55

@@ -15,3 +15,28 @@ There will be a brief presentation on public expectations from using personal he
- How many community members require PPIE support? Enough for a working group?
- How to overcome health / tech literacy limitations during public engagement activities?
- Public involvement - web landing page for public? Curated newsletter?

## Summary

This session was a broad exploration of how the public can better engage with both TRE teams and the UK TRE Community.

Next steps focused on encouraging those working within PPIE spaces to join the UK TRE community.

## Raw notes

- Who is doing PPIE work in TREs - is it dedicated roles or additional work to other jobs?
- What support would be useful for people within TREs and researchers with regard to PPIE
- Is a survey needed to find this information or does this info already exist in previous landscape surveys e.g. PEDRI or DARE UK?
- Routes for the public to engage - a landing page on, or signpost to, the TRE Community website?
- Are there any general public requirements for TREs?
  - Build upon section 4.8 in https://satre-specification.readthedocs.io/en/v1.0.0/pillars/supporting.html
- How TREs track and communicate data used - beyond just project info? Can patients find out what data is used?
- Discussion around opt-out and the impact on the data of who opts out
- How many PPIE professionals, or people with PPIE as a priority, are joining the TRE Community sessions and mailing list?

### Next steps

- Discuss TRE Community at next PEDRI Delivery Group meeting and make sure updates and opportunities are communicated to TRE Comm network. PEDRI currently planning their focus for next year
- Explore PPIE support opportunities in bringing agency to the citizen
- Encourage PPIE roles from TREs to join TRE Community
@@ -11,3 +11,25 @@
- What are barriers to tracking and sharing provenance and metadata between organisations? What are ways to improve this?
- Outputs produced by technical solutions for provenance tracking might be inaccessible to non-technical decision makers. What can we do to address this?
- To what extent are rights/licence/access-conditions currently described _formally_ in metadata?

## Summary

The importance of data provenance was discussed, and how to ensure data provenance and additional metadata can be managed properly.

There is still no common way of requesting metadata, and little shared understanding of data provenance requirements.

Next steps are to map out these requirements for the community.

## Raw notes

- Key provenance questions: Where did the data come from? What data cleaning steps were applied?
- Big Data approach -> collect as much as you can, as you don’t know what you might need in the future. Provenance is important to preserve the knowledge that would otherwise disappear when people leave, etc. (e.g., longitudinal studies spanning a number of decades)
- Lots of provenance information is kept in a human-readable form (screenshots, PDFs, text files) - it is also the preferred way of consuming such data (e.g., a researcher wants to see a PDF of the questionnaire that was used). Processing provenance at scale is difficult as it is resource intensive (a minimal machine-readable sketch follows these notes).
- The Closer Discovery platform (https://discovery.closer.ac.uk/) was mentioned as an example of a metadata management platform for longitudinal population studies.
- There is still no common way of requesting metadata – everybody is asking for data using different templates, which is very challenging for data processors.
- It was highlighted that provenance could be problematic if it exposes too much detail (e.g., potential privacy implications)
- Provenance could be useful in determining whether the data was indeed used in line with the permissions that were given. Currently this is found out only via a manual process (e.g., a publication for a certain study type mentions the data, but permission was only given for a study of another condition).
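
As a complement to the human-readable provenance described above, a machine-readable record of cleaning steps might look something like the sketch below. Field names and values are illustrative, loosely inspired by the W3C PROV idea of activities acting on entities; this is not any TRE's actual schema.

```python
import json
from datetime import datetime, timezone

# Illustrative provenance record: field names and values are hypothetical,
# loosely inspired by W3C PROV (entities, activities, agents), not a standard schema.
provenance_record = {
    "entity": {
        "id": "dataset:cohort-extract-v2",
        "derived_from": "dataset:cohort-extract-v1",
    },
    "activities": [
        {
            "type": "cleaning",
            "description": "Removed records with missing date of birth",
            "performed_by": "agent:data-wrangler-team",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
        {
            "type": "recoding",
            "description": "Mapped free-text ethnicity to a controlled vocabulary",
            "performed_by": "agent:data-wrangler-team",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    ],
}

# Machine-readable output that can still be rendered for human review.
print(json.dumps(provenance_record, indent=2))
```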

### Next steps

- Map out provenance requirements for the TRE community?
@@ -15,3 +15,82 @@ But what next? Forming a community working group to develop the architecture and
- How best would we organise and collaborate? Docs on GH webpages, with issues, boards etc? Something else?

[^1]: _All models are wrong, but some are useful_

## Summary

The conversation started by asking what work is already being done in this space.
This included exploring pre-existing standards, whether there is already a shared definition of federation, and where use cases for federation already exist.

The need to map the ecosystem was highlighted, to show where the influence lies and how key decisions are made - especially in spaces where the data is not allowed to move out of internal systems.

The need to factor in data and information governance, in tandem with any technical solution, was highlighted.

Next steps include setting up a high-level federation working group and using it to explore critical topics within the idea of federation.

## Raw notes

- Important to note that there are organisations that are already working in this space - their meetings will be set.
- Question: what other working groups exist and why do we need a new one? What aren't they covering?
- Question: Are there any existing standards?
- Small scale projects have succeeded - some funded by DARE UK - but not broad demonstration... yet?
- Teleport:
- https://dareuk.org.uk/driver-project-teleport
- https://dareuk.org.uk/preserving-public-trust-in-the-evolution-of-trusted-research-environments-teleport-federated-data-access
- TRE-FX:
- https://dareuk.org.uk/driver-project-tre-fx/
- Final report from the TRE-FX DARE UK project: https://zenodo.org/records/10055354
- Question: Do we have a shared definition of federation?
- How is that different to data pooling?
- Should we be aiming for meta-learning? Distributed learning?
- Potential definition: federation is a group of TREs that trust each other
- Doesn't have to involve moving data... but can...
- Examples of federation in the genomics space
- Question: what does it mean that the federation is specific to genomics?
- No limit on the federation architecture, but front-end standardisation would be needed to assess the data for egress.
- Question: Do we have a list of use-cases that describe the need for federation to exist?
- Yes for small ones but not coherently collected in one place.
- One idea: run a workshop to build up "grand challenge" style definitions
- From the zoom chat:
- The 'trustworthy' and 'governance' bits depend entirely on what information flows over this inter-SDE network.
- https://dash.ohdsi.org/research
- https://www.datashield.org/about/about-datashield-collated
- the above are examples of existing use-cases
- Note that the individual TREs publish many more research outputs than OHDSI - interested to know why that is - what are the needs that OHDSI aren't meeting?
- Improved analyses / insights / inference with very large data sets
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6451771/
- Note though that without a common data model it is very hard to combine data for such large scale analyses.
- Common data standards are very expensive to implement.
- Not total agreement in the room.
- But note that the mapping to a data standard - e.g. OMOP - is very variable (see the toy mapping sketch after these notes).
- Note that many government departments/organisations are focusing on "data not moving" - so where is the space for federation?
- "lift and shift" stops being able to scale very well....
- Need to map the ecosystem of working groups
- Need to understand who has the power and influence...
- "A federated protocol for the working groups!"
- Note that the design of a TRE is conceptualised to bring researchers in and to NOT let data out...
- So do we need a concept beyond a TRE? Do we need a new phrase / name?
- TREs that are designed from the outset to federate?
- DataSHIELD has been around for 10 years - not complete but huge open source community of researchers.
- And there is this -> https://www.hpe.com/us/en/hpe-swarm-learning.html
- Data always moves - even in genomics, which is years ahead, the raw data does not move but the findings from the genomics do - and that's data
- NHS SDE - adopting OMOP, I believe - seeking to share maps across SDEs rather than each doing their own
- I think the big challenge is around people understanding this space and removing the IG blockers.
- The NW SDE is 'federated' by design.
- How does anyone validate federated analyses if data doesn't move! You can't see the outputs...
- Addressing the techie part of federation without factoring in the data & governance pieces is only a fraction of the answer.
- These things do run at different speeds
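
To illustrate the mapping-variability point from the notes above, the toy sketch below shows two sites mapping the same local smoking-status codes onto a shared target vocabulary in different ways. The target codes are placeholders, not real OMOP concept IDs, and the record structure is invented for illustration.

```python
# Toy illustration of why site-level mappings to a common data model diverge.
# The "target" codes below are placeholders, NOT real OMOP concept IDs.

SITE_A_MAPPING = {
    "smoker": "TARGET:0001",      # current smoker
    "ex-smoker": "TARGET:0002",   # former smoker
    "non-smoker": "TARGET:0003",  # never smoked
}

SITE_B_MAPPING = {
    "smoker": "TARGET:0001",
    "ex-smoker": "TARGET:0003",   # Site B collapses former smokers into "never"
    "non-smoker": "TARGET:0003",
}

def harmonise(records: list[dict], mapping: dict[str, str]) -> list[dict]:
    """Apply a site-specific mapping from local codes to the shared vocabulary."""
    return [{**r, "smoking_status": mapping[r["smoking_status"]]} for r in records]

local_records = [{"person_id": 1, "smoking_status": "ex-smoker"}]

# The same local record ends up with different harmonised codes at each site,
# so a federated query over the "common" model returns non-comparable answers.
print(harmonise(local_records, SITE_A_MAPPING))
print(harmonise(local_records, SITE_B_MAPPING))
```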

### Next steps

- Propose a high-level federation WG (IG?) as an umbrella/place to start, using the UK TRE/DARE UK/RDA model

_Reviewer comment (Member):_ An action here or not? Given the time it takes us to process the report, there should be a more immediate action to follow up on these.

_Reviewer comment (Member):_ I presume Rob will be the one to propose this, or at least will be heavily involved, so I think we can discuss this separately.

- Probably host on GitHub somewhere
- Use DARE UK Federated Architecture Blueprint v2.0 to seed (coming soon!)
- Within that, tease out the best approaches to key topics:
- Federation terminology (cf. Pete/Madalyn's idea, and the [DARE UK Driver common vocab](https://docs.google.com/document/d/1SJ6CJG8yHzsvtU7MyzdNOF_S0fZVJb_i/edit) )
- What is a TRE? Is it an individual "cluster of boxes", or can it be a formal federation of a number of "clusters of boxes"? (a super-cluster?)
- What does TRE accreditation for a piece of public cloud mean, for instance?
- Is the NHS SDE Network a TRE in itself?
- What is the sound of one hand clapping?
- Governance around this broader idea of a TRE as a "federation of smaller sub-TREs"
- Is "thinking federal" from the get-go useful? Possible?
- Data & queries: how can we ever harmonise data enough for a federated query to return comparable answers? (A toy summary-statistics sketch follows this list.)
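
A toy sketch of the summary-statistics flavour of federation discussed above (in the spirit of tools such as DataSHIELD, though not using their actual API): each TRE computes a local aggregate, only those aggregates leave the environment, and the pooled result is computed without any row-level data moving.

```python
# Toy federated aggregation: row-level data never leaves a TRE; only
# (count, sum) summaries are shared and pooled. Values are made up for illustration.

def local_summary(values: list[float]) -> tuple[int, float]:
    """Computed inside a TRE; only the aggregate leaves the environment."""
    return len(values), sum(values)

def pooled_mean(summaries: list[tuple[int, float]]) -> float:
    """Combine per-TRE summaries into an overall mean."""
    total_n = sum(n for n, _ in summaries)
    total_sum = sum(s for _, s in summaries)
    return total_sum / total_n

# Each TRE holds its own records (illustrative numbers only).
tre_a_values = [5.1, 6.2, 5.8]
tre_b_values = [4.9, 6.0]

summaries = [local_summary(tre_a_values), local_summary(tre_b_values)]
print(f"Pooled mean across TREs: {pooled_mean(summaries):.2f}")
```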