In Calligo’s latest Beyond Data podcast, Tessa Jones (Chief Data Scientist) is joined by Dr Ellie Graeden, Research Professor (Center for Global Health Science and Security) at Georgetown University. Here we explore some of the episode’s highlights:

  • The inherent conflict of private data and the public good
  • Protecting individual rights within federated learning
  • The importance of effective communication and a common language
  • Designing systems and policies that work together
  • Focusing regulation on outcomes, not creating data siloes

At societal level, poor communication costs lives

Transitioning data across and between departments and data systems has historically been fraught with problems – who owns it? Who pays for it? Is it understandable and translatable into meaningful and actionable insights for the end user? 

Having worked extensively in disaster response, Dr Graeden has seen first-hand the potentially life-threatening issues that can arise when government departments’ data platforms produce incompatible outputs:

  • If 20,000 people need water, how many pallets need to be shipped?
  • If 10,000 electricity meters have been knocked out by a hurricane, how many people need feeding?

In such scenarios, identifying individuals amongst population-level data is crucial if the help provided is to be sufficient.

“We have to be able to really effectively move and communicate and share data that are relevant, in ways that they can get used by people all across the system”

Of course, any data system design should ensure privacy and protection for personal data. ‘Big data’ is still relatively new, and as such more powerful and widespread regulatory controls are now being introduced, although the US still does not have consistent requirements for how data should be handled. Fundamentally, meeting a population’s needs today, and planning for them tomorrow, requires the data of individual people to be analysed. Personal data must be shared quickly, effectively and all the while protecting individual rights. Data system design must therefore:

  • Include all players
  • Consider cultural constraints
  • Keep out bias
  • Ensure the right words and phrases are used
  • Focus on the ‘so what’, why does it matter?

“Every single thing we experience can be captured as data”

Even the most mundane moments in our daily lives leave a digital footprint, we shed data everywhere. But when does ‘my’ data become public, or the property of the software developer or the service provider? VR headsets collect ephemeral data that is analysed and applied for that one end user, but if that data is assumed to fall under GDPR the potential to use it for positive outcomes is severely limited. For example, should authorities be notified if content viewed and generated is illegal or harmful? And what if that chip can detect if the user is having a stroke, is that data classified as ‘health’ data? Can it be used to alert the individual to their medical emergency without contravening legislation? What if your mouse clicks can detect the early stages of Parkinson’s? Should you, could you, be told?

“If you’re treating this data as health data, then they have a very different set of regulatory constraints. HIPAA isn’t going to regulate those because it’s not a health care provider or a health insurer”

Piercing the veil

The conflict between personal protection and public good is everywhere, and Dr Graeden believes that some new data laws will create problems for federated learning. Legislation has clear boundaries (speed limits, blood alcohol levels) whereas science deals in spectrums, probabilities and unknowns.

Deleting an individual’s personal data from the model breaks the system, contradicting what regulators are trying to achieve. The solution is to prioritize outcomes, not processes – it doesn’t matter whether you write the rules with a pen and paper, or with AI, as long as you write the rules. Expanding the framework by setting gradients of data availability affords protection for individuals, whilst making data available that informs better decision making for public bodies.

“Data is nothing more, nothing less, than an abstract description of our world. A useful and powerful language that can tell us things that other languages don’t”

Data can no longer exist in siloes if it’s to be useful to society

There is now a healthy global appetite for the discussion around data, thanks in the main to two recent developments:

  • Covid gave us huge amounts of data about mortality levels, vaccination rates, hospitalisation trends – all of which were in the public consciousness every day
  • AI and ChatGPT – articles and debates about the pros and cons are everywhere, discussion is not just in the scientific community

The key challenges now for data scientists are expectation management and communication – we need to be clear about aims and specific about context, as well as knowing what to leave out to avoid overwhelm and misunderstanding. Unfortunately, scientists are not always great communicators (using complex terminology and detail, rather than common parlance and generalization) as Covid demonstrated:

  • Did having a vaccine mean you wouldn’t get sick? Or just less sick?
  • ‘Everyone should wear a mask’ became ‘wear a mask if you can’. This was due to limited supply, but it appeared that the science was not clear

“The scientific approach means you never have an answer… we are trained as scientists to focus on the fact that we don’t know”

In fact, the only answer is that the right data, used consistently and communicated clearly, will always allow us to be prepared, not reactive. To make decisions for the public good that protect every individual.

You can find out more about the common language of privacy in our Rosetta Stone eBook.

You can also watch Tessa’s fascinating podcast with Dr Graeden below.