A double-edged sword
Since word got out that ‘data is the new oil’, companies have been enthusiastically collecting as much of it as possible. Data helps businesses make predictions, assess risk, understand their customers, and evolve their products and services. It’s precious and powerful but collecting it by the barrel comes with drawbacks – not just for firms, but for us all.
When handling data provided by humans, firms must be aware not only of data security and the risk of breaches, but also of the harms that can result from how the data is used.
AI – raising the stakes
In the era of ChatGPT, GPT-4, Google’s Bard and other AI systems, how we treat sensitive information takes on new significance. AI-enabled risk management has already led to claims of discrimination, where algorithms trained on biased material negatively skew outcomes for minority groups and women.
AI technologies can build complex psychological profiles from a few scraps of personal data: even social media ‘likes’ can reliably predict personality traits and political persuasions. The algorithms developed by AI tools can also compromise privacy through what they overlook – omitted variable bias. An AI system works according to what it’s trained on, and if that material is missing a vital element, mistakes are easily made, as in the well-known case of Target’s pregnancy-prediction algorithm, which neglected to exclude minors and sent marketing materials that revealed a teenager’s pregnancy to her parents.
For businesses, responsible data handling is a priority not just for ethical reasons, but to limit the impact of fines, brand damage and future legislative complications.
AI systems are designed to benefit society, and they do so in all kinds of ways. But whatever their intended purpose, these technologies use human-created data as their fuel, and we need to manage which kinds of information they run on. We can do this in part by making sure sensitive data and other personal information is collected, processed, and stored responsibly.
The five ‘Ps’ of ethical data handling
We developed the five Ps model to help guide companies collecting human data or making use of existing databases.
• Provenance
Firms that collect human data must pay special attention to where the data comes from, who provided and collected it, and whether it was obtained with consent, free of coercion or subterfuge.
This applies not only to new data collection but also retroactively – many firms have stores of ‘dark data’ collected from customers in the past. This is typically unstructured data, such as visitor logs, social media comments and uploaded media, that went unused at the time but is now being exploited.
• Purpose
It goes without saying that the reasons for collecting data must be ethically sound. It’s also important that the people who consented to data collection know how, and for what, their data is used. Having collected personal data, businesses often reuse it for purposes different from those originally intended. It is important to consider whether the people who provided their data would agree to its use for another project.
In the case of customer data, repurposing of personal data has become its own industry. Many companies routinely sell their first-party data as a product, some of them using it as a primary revenue source. However, this practice is becoming less acceptable, and has led to fines and sanctions for some organizations.
If the purpose of data collection changes, or the company finds a new use for existing data, they must consider whether consent should be obtained again.
• Protection
Personal information can be exposed by data breaches; even the largest and most sophisticated organizations experience hacks and leaks. In the USA, organizations reported some 2,000 data breaches in 2021; breaches are not uncommon even in more strictly regulated EU markets.
For many businesses, data security is outsourced to specialist firms. But this is no watertight guarantee, particularly if the provider is linked to commercial or political entities that may gain internal access to the data.
It’s important for firms to determine how they will protect the personal data they collect, who can access it, how it will be anonymized, and when it will be destroyed.
• Privacy
Companies must strike a balance between holding data securely and ensuring it’s still usable for their purposes. For example, anonymization helps protect data providers, but if over-applied, it can make the data useless for marketing.
Data can be aggregated, approximated, subtly altered or pseudonymized. But when it is cross-referenced with other information, individuals can still be identified from just a few details, as various high-profile examples have shown. Netflix published an anonymized dataset of 100 million customer movie ratings, challenging data scientists to use it to create a new recommendation algorithm. Researchers were able to identify 84% of the individuals in Netflix’s ‘anonymized’ dataset by cross-referencing it with another one from the movie-ranking site IMDb.
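To make the idea concrete, the Python sketch below shows one common approach: replacing a direct identifier with a keyed hash and coarsening quasi-identifiers such as postal code and birth year. The field names, values and key are hypothetical examples, not a recommended scheme – and, as the Netflix case shows, pseudonymization alone does not rule out re-identification once the data is cross-referenced with other sources.

```python
# Illustrative sketch only: pseudonymizing a direct identifier with a keyed
# hash (HMAC). Field names, values and the secret key are hypothetical.
import hmac
import hashlib

SECRET_KEY = b"store-and-rotate-this-key-outside-the-dataset"  # hypothetical key


def pseudonymize(customer_id: str) -> str:
    """Replace a direct identifier with a stable, keyed pseudonym."""
    return hmac.new(SECRET_KEY, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()


record = {"customer_id": "C-10293", "zip_code": "02139", "birth_year": 1987, "rating": 4}

safer_record = {
    # Direct identifier replaced; the mapping can only be rebuilt with the key.
    "customer_id": pseudonymize(record["customer_id"]),
    # Quasi-identifiers are coarsened, since combinations such as postal code
    # plus birth year can still single people out when cross-referenced.
    "zip_code": record["zip_code"][:3] + "xx",
    "birth_year": (record["birth_year"] // 10) * 10,
    "rating": record["rating"],
}
print(safer_record)
```

Even here there is a trade-off: the coarser the quasi-identifiers, the harder re-identification becomes, but the less useful the data is for segmentation or marketing.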
Another privacy pitfall is geolocation, used to provide location-based recommendations and map services, among other applications. Geolocation can tie an individual’s IP address to their physical address, making their home easy to find. It can also mistakenly link them to a nearby building or organization that has nothing to do with them. This could have unintended consequences for the individual, who has little recourse for correcting the errors.
• Preparation
Data is often imperfect, inconsistent, and incomplete. It might appear in multiple languages, contain typos, or vary in format. Data cleaning is a necessary step for making it useful to analysts, but when data arrives in large volumes, they must rely on software rather than manual review to do the job. This opens the door to programming errors that can introduce wildly inaccurate figures, skewing the results unless researchers catch and correct the problem.
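As a simple illustration of what such an automated cleaning pass might look like, the Python sketch below uses hypothetical column names, file name and thresholds. The point is that sanity checks should flag implausible values for human review rather than silently ‘correcting’ them and skewing the results.

```python
# Illustrative sketch only: a minimal automated cleaning pass with sanity
# checks. The file name, column names and thresholds are hypothetical.
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical input file

# Normalize formatting differences before analysis.
df["country"] = df["country"].str.strip().str.title()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Sanity checks: flag values an automated pipeline could silently distort,
# rather than overwriting them.
suspect_ages = df[(df["age"] < 0) | (df["age"] > 120)]
missing_dates = df["signup_date"].isna().sum()

print(f"Rows with implausible ages: {len(suspect_ages)}")
print(f"Rows with unparseable signup dates: {missing_dates}")
```

Flagged rows can then be reviewed by a person before analysis, which is where many large-scale cleaning errors would otherwise slip through.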