Data Ethics: simplified perspectives for dummies

“Power over business, democracy and education will likely continue to lie with data and data-dependent tools, such as machine learning and artificial intelligence” (Arvanitakis, Frances & Obst, 2018). Data has become focal in our ecosystem. Let’s take a look around us. Data-driven practices are present in all walks of life, private and public: smartphones, banking apps, Fitbit, Uber, Netflix, Airbnb, Google Maps, smart cities, Khan Academy, digital blood glucose meters, and more.

Data has surpassed oil in value. Since The Economist published its story in 2017, “data is the new oil” has become a widely discussed theme. Bhageshpur (2019) highlighted data-driven innovations such as aircraft with advanced imaging for more accurate agricultural forecasting, wildfire avoidance, and emergency evacuation. None of these would be possible without enormous data input.

Everything has two sides, and data is no exception. Reported data-use issues include leaks, breaches, discrimination, and the misuse of data exhaust, the trails left behind by web browsers and Internet of Things devices. This article first discusses the idea that data is never raw or neutral; it always carries context. It then examines common ethical theories to understand the link between data and ethics. Finally, it proposes a framework for codes of ethics in today’s data ecosystem.

Brainstorming the relationship between data and ethics

Data is never raw or neutral

Photo by Markus Winkler on Unsplash

Raw data is unprocessed computer data that needs to be refined before it is useful to us. It stands in contrast to processed data, which has been manipulated from the raw form. “At a certain level the collection and management of data may be said to presuppose interpretation” (Lisa Gitelman and Virginia Jackson, “Raw Data” Is an Oxymoron, 2013). Because data collection is itself already a form of processing, data is never simply raw. The questions of why it was collected and how it was collected and analyzed are always relevant. A single data point is meaningless; it must be linked to its surrounding context points, its metadata, to make sense and to become information.
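To make this concrete, here is a minimal Python sketch (the reading, the subject ID, and every field name are hypothetical) of how a bare value becomes information only once its metadata is attached:

```python
# A bare data point: impossible to interpret on its own.
reading = 37.2

# The same point with its metadata: now it carries meaning.
reading_with_context = {
    "value": 37.2,
    "unit": "degrees Celsius",                     # what was measured
    "source": "digital thermometer",               # how it was collected
    "subject": "patient_0042",                     # who it describes
    "timestamp": "2020-07-01T08:30:00",            # when it was taken
    "purpose": "post-operative fever monitoring",  # why it was collected
}

# 37.2 alone could be a temperature, a latitude, or a share price.
# With the metadata above, it becomes clinical information.
print(reading, "->", reading_with_context["purpose"])
```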

So why does it matter that data is never raw, that it is always contextual? Once collected, “raw” data in most cases still cannot be used without going through a clean-up process. Data, small or big, contains bad spots and outliers. As the machine learning rule of thumb goes: “garbage in, garbage out”. Selecting the data input affects the output. One of the emerging problems in data analysis is selection bias: it is not always apparent which individuals are included in or excluded from a dataset. As a consequence, even big data will not produce a big picture without “thick” data, that is, context. A decade ago marked the downfall of Nokia, the dominant mobile phone maker from 1998 to 2008. Tricia Wang, in a TED talk, shared her experience collecting qualitative data for Nokia in China’s low-income market. Her research revealed human behaviors and newly emerging patterns that had not been quantified and included in Nokia’s big data. Because Nokia’s data came from its own market share figures and surveys, its “big picture” was narrow. Ignoring the context that poor consumers were willing to spend their whole income on a smartphone to gain higher perceived status led to a failed strategy: Nokia invested in short-term new devices instead of new operating systems, leaving the competitive advantage to Apple.
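A toy sketch of that point, with made-up survey numbers: two analysts applying different, equally defensible clean-up rules to the same “raw” data reach very different conclusions.

```python
from statistics import mean

# Hypothetical daily incomes (in dollars) from a small survey.
incomes = [22, 25, 19, 24, 21, 23, 180, 20, 26, 210]

# Analyst A treats incomes above $100/day as entry errors and drops them.
cleaned_a = [x for x in incomes if x <= 100]

# Analyst B keeps everything, treating the large values as real high earners.
cleaned_b = incomes

print("Analyst A mean:", mean(cleaned_a))  # 22.5 -- a market of low earners
print("Analyst B mean:", mean(cleaned_b))  # 57   -- a much richer market
# Same "raw" data, two defensible cleaning rules, two different stories:
# the selection decision, not the data alone, drives the conclusion.
```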

Big data is often quantified within a contained system, with the aim of finding correlations. Correlations that happen by pure chance are spurious. Tyler Vigen, a legal scholar and data analyst, has shown some weird connections between different data sources: for example, US spending on science, space and technology correlates with the suicide rate by hanging, strangulation and suffocation. More data does not mean better decisions. Global spending on big data analytics was $180 billion in 2019. In Australia, organizations spent $1.3 billion on big data in 2020, up nearly 14% from 2019. Research also shows that 73% of big data projects are not profitable. It is vital that data use be put in context and that users dig deep into the “why, what, how, when & who” of the “raw” data.
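Why do such weird correlations keep turning up? A short sketch (pure random noise, no real data) shows that scanning enough unrelated series almost guarantees a strong correlation by chance:

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(42)
# 200 completely unrelated "indicators", each a 10-year series of noise.
series = [[random.gauss(0, 1) for _ in range(10)] for _ in range(200)]

# Search all ~20,000 pairs for the strongest correlation.
best = max(
    ((i, j, pearson(series[i], series[j]))
     for i in range(len(series)) for j in range(i + 1, len(series))),
    key=lambda t: abs(t[2]),
)
print(f"Strongest 'relationship': series {best[0]} vs {best[1]}, r = {best[2]:.2f}")
# With this many pairs to search, an |r| above 0.9 routinely appears even
# though every series is pure noise. Correlation found by search rather than
# by mechanism is exactly how "spending correlates with suicides" arises.
```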

Big data is contextual because of how it is formed, through a process called datafication. In the video below, Kenneth Cukier, co-author of Big Data: A Revolution That Will Transform How We Live, Work, and Think, explains the concept. Qualitative things have been datafied like never before: emotions, sentiment, relationships, interactions, speed, movements, and culture. Netflix is famous for the recommendation algorithms that created a new media viewing experience. Statistical analysis of users’ behaviors gave Netflix the optimal intersection of genre, actors and directors. Beyond streaming, Netflix has also produced successfully targeted content.
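Netflix’s production systems are of course far more elaborate, but a minimal sketch of the underlying idea, recommending titles by similarity between datafied viewing histories, might look like this (users, titles and ratings are all invented):

```python
# Toy user-ratings matrix: datafied viewing behavior.
ratings = {
    "alice": {"Dark Mirror": 5, "Crown Games": 4, "Space Docs": 1},
    "bob":   {"Dark Mirror": 4, "Crown Games": 5, "Baking Show": 2},
    "carol": {"Space Docs": 5, "Baking Show": 4, "Dark Mirror": 1},
}

def cosine(u, v):
    """Cosine similarity over the titles two users have both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[t] * v[t] for t in shared)
    nu = sum(u[t] ** 2 for t in shared) ** 0.5
    nv = sum(v[t] ** 2 for t in shared) ** 0.5
    return dot / (nu * nv)

def recommend(user):
    # Weight every other user's ratings by their similarity to `user`,
    # then suggest the highest-scoring title `user` has not seen yet.
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for title, r in their.items():
            if title not in ratings[user]:
                scores[title] = scores.get(title, 0) + sim * r
    return max(scores, key=scores.get) if scores else None

print(recommend("alice"))  # "Baking Show", mostly via similarity to bob
```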

Big data powers today’s analytics tools: AI, machine learning algorithms, and digital twins. Algorithms improve our quality of life, economic growth, and security. They increase the efficiency and accuracy of services and shed light on areas that previously depended on human judgment, such as marriage matchmaking and criminal sentencing. However, rewards come with challenges, and sense-making is one challenge for algorithms. FaceApp was accused of racism when its beauty filters defaulted to European facial features. In April 2020, Google Translate launched a new AI rewriter tool to address gender bias in its output: Google Translate had been biased when translating from a gender-neutral language like Turkish into a gender-specific one like English. (https://medium.com/analytics-vidhya/gender-bias-in-google-translate-e4014055fefd). In more serious applications, data selection bias can reach the boundary between life and death.
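The mechanism behind the Translate example is statistical rather than intentional: if “doctor” mostly co-occurs with “he” in the training text, a naive model resolves a gender-neutral pronoun to “he”. A toy illustration with an invented corpus (this is not how Google Translate actually works):

```python
from collections import Counter

# Hypothetical training corpus: the bias lives in the word statistics.
corpus = [
    "he is a doctor", "he is a doctor", "he is a doctor",
    "she is a doctor",
    "she is a nurse", "she is a nurse", "she is a nurse",
    "he is a nurse",
]

def pick_pronoun(occupation):
    """Resolve a gender-neutral pronoun the way a naive model would:
    by majority vote over the training text."""
    counts = Counter(
        sentence.split()[0]
        for sentence in corpus
        if sentence.endswith(occupation)
    )
    return counts.most_common(1)[0][0]

# Turkish "o bir doktor" is gender-neutral; a purely statistics-driven
# translation collapses it to the majority pattern in the training data.
print("o bir doktor  ->", pick_pronoun("doctor"), "is a doctor")  # "he"
print("o bir hemsire ->", pick_pronoun("nurse"), "is a nurse")    # "she"
```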

Given such a vast and complex relationship between data and its power over us, one must ask: where are the boundaries, what separates right from wrong, and what does this mean for ethical practice? The article goes on to explore two ethical theories and their application to data analytics.

Virtue Ethics

A good person is one who consistently acts “in ways that aligned to a long list of virtues which included honesty, fairness and respect. The values in virtue theory are universal and apply to all regardless of religion, culture or class.” The theory provides us with a guide for living but no specific rules for resolving ethical dilemmas. Nowadays, data analysts and data scientists can easily find themselves caught between their responsibility to commit and perform at work and their social responsibility, between two things that may both be considered “right”. At the end of the day, it is often a case of balancing benefits and harms.

Google’s motto is “Don’t be evil”. As a giant tech company, Google is extremely aware and protective of its reputation, and it promotes ethical practice, diversity and fairness. Google and Alphabet’s AI have unarguably delivered benefits for society, from language to medical science. Yet ethical challenges inevitably arise. In the news recently: Google engineers resigned over the firing of Timnit Gebru, a prominent AI researcher at Google and one of the few high-profile Black women in the field (https://www.reuters.com/article/us-alphabet-resignations/two-google-engineers-resign-over-firing-of-ai-ethics-researcher-timnit-gebru-idUSKBN2A4090). Gebru had criticized research projects at Google that potentially disadvantaged minority groups. She co-authored a paper discussing the biased use of mass English web text in Google’s large language models: the flaw is that such models pick up the societal racism, sexism and other biases reflected in the writing. Gebru also raised concerns about the environmental impact of running the huge data centers these algorithms require.

https://www.bloomberg.com/news/newsletters/2020-12-08/how-timnit-gebru-s-academic-paper-set-off-a-firestorm-at-google

Virtue ethics, in this Google vs. Gebru case, is not so useful in providing a framework or guidance to differentiate “right” from “wrong”. Gebru was honest in publishing research on the dark side of her company’s famous technology. Gebru was fair in asking to be treated fairly as a minority in her mostly white, male tech community. Gebru was respectful in following her company’s internal review policy before publishing, to ensure no company secrets got out. In the virtue ethics picture, Gebru was a hero. Yet in reality Gebru was fired for her “bad behaviors”. Was it for tarnishing Google’s image? For pulling off work orders without top executives’ approval? For calling for congressional action against Google? For controversial stories like these, perhaps a better theory to examine is the social contract.

Social Contract Theory

A constitution is an explicit example of a social contract: people living in a country agree to be governed by the moral and political rules and obligations written into it. The general will is the key to the social contract, while personal preferences are set aside. The social contract evolves as our society rolls into the technology era. Don Tapscott has presented a social contract for the digital age that addresses the big four emerging issues: data privacy, unemployment, social discourse, and declining trust in democracy. Self-sovereign identity, he argues, should be regulated and enforced.

Personal data and individual privacy are among the pivotal themes of the new social contract. We all give personal data away every day. Photos we upload on social media are used to train machines to see. Check-ins and app locations tell companies where the prime shop sites are. Google search history reveals personal traits and characteristics. Privacy rights are legal rights, which is why, before using a web service, we are always required to accept its terms and conditions. Ironically, surveys have shown that most of the time we accept without even glancing through such long, wordy documents. A common rationale we hear for the prompt approval is: “my life is not that exciting and adventurous; there is nothing you would find by spying on my search/posts/profile…”. Since the general will is to click accept on the privacy terms and get the products or services we want or need as soon as possible, social contract theory suggests we are doing the right thing, and a happy life and good outcomes should follow. Unfortunately, that was not the case for Facebook and its partner Cambridge Analytica in 2018, when their misuse-of-data scandal went public. Hardly anyone would ever have imagined that their posts about pets or photos of their children could end up changing a country’s election, creating propaganda, or facilitating coups.

As the social contract is being renegotiated and rewritten for the digital age, professional data ethics should be part of the theme. It is important to view data analytics as the center of our ecosystem so as to grasp the broader, more comprehensive scope of its impacts.

Towards professional data ethics

The data supply chain involves three phases: collecting, processing and practicing. Each phase has seen ethical challenges that require morally sound solutions.

Personal privacy and group privacy are both concerns in data acquisition. In 2016, the Commonwealth government released 10% of its de-identified patient data on the government open data portal for public access. The data revealed details of services by location, implants used on patients, and health care provider and recipient numbers. Researchers at the University of Melbourne tested the likelihood of re-identification of such sensitive data. As a result, they were able to decrypt and identify all service providers, and the dataset was removed from the open data portal.
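The Melbourne researchers’ actual technique targeted the encrypted provider IDs, but the general re-identification risk can be sketched as a simple linkage attack: joining a “de-identified” release to public auxiliary data on quasi-identifiers (all records below are invented):

```python
# "De-identified" health records: names removed, quasi-identifiers kept.
released = [
    {"postcode": "2007", "birth_year": 1958, "sex": "F", "procedure": "implant_x"},
    {"postcode": "2010", "birth_year": 1990, "sex": "M", "procedure": "scan_y"},
    {"postcode": "2007", "birth_year": 1985, "sex": "M", "procedure": "therapy_z"},
]

# Public auxiliary data (e.g. electoral rolls, social media, news stories).
auxiliary = [
    {"name": "Jane Citizen", "postcode": "2007", "birth_year": 1958, "sex": "F"},
    {"name": "John Someone", "postcode": "2010", "birth_year": 1990, "sex": "M"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")

def reidentify(record):
    """Link a 'de-identified' record back to a name via quasi-identifiers."""
    matches = [
        person["name"] for person in auxiliary
        if all(person[k] == record[k] for k in QUASI_IDENTIFIERS)
    ]
    return matches[0] if len(matches) == 1 else None

for record in released:
    name = reidentify(record)
    if name:
        print(f"{name} -> {record['procedure']}")
# A unique combination of postcode, birth year and sex is enough to
# re-attach names to two of the three "anonymous" records.
```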

Paula Helm conducted a literature review on “group privacy in times of big data” in 2016, identifying two issues: group formation through automated algorithms, and the lack of civic awareness of the consequences of that formation. An example of group profiling is trending topics on social platforms. Autonomous individuals with common comments or hashtags are connected to form a passive group, and the virtual interaction becomes physical interaction, as we have observed in the global #BlackLivesMatter campaign.
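A small sketch of the formation mechanism Helm describes: autonomous individuals become a “passive group” simply because an algorithm indexes them by a shared hashtag (users and posts are invented):

```python
from collections import defaultdict

# Invented posts from individuals who never opted into any group.
posts = [
    ("user_a", "#BlackLivesMatter justice now"),
    ("user_b", "lovely weather today #sunday"),
    ("user_c", "marching downtown #BlackLivesMatter"),
    ("user_d", "#BlackLivesMatter solidarity from abroad"),
]

# The platform's indexing step: cluster authors by shared hashtag.
groups = defaultdict(list)
for user, text in posts:
    for word in text.split():
        if word.startswith("#"):
            groups[word].append(user)

print(dict(groups))
# {'#BlackLivesMatter': ['user_a', 'user_c', 'user_d'], '#sunday': ['user_b']}
# None of these users consented to "membership", yet the platform can now
# profile or target them collectively: group privacy is at stake.
```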

Data analysis and processing require frameworks, platforms, tools and techniques that are responsible and accountable, while data-driven practices must maintain privacy and fairness, protect data, and respect permission to reuse it.

Time is of the essence in creating competitive advantages in technology. The revolution in big data, machine learning, AI and other data analytics tools is not going to slow down. As a result, increasing reliance on automation and shrinking human involvement and oversight will be prevalent. To balance the benefits and harms of data science, corporations are moving towards professional ethical practice. Corporations can act as moral agents, at their own expense, to fulfill their moral duties: autonomy (obtaining and giving consent), beneficence (doing good and no harm) and justice (being fair in distribution). Corporations can take accountability for their moral actions. However, as corporations are often hierarchical and complex, how and by whom accountability for moral actions should be taken is not a straightforward question.

Emerging data ethics principles

A Hippocratic Oath for data practitioners has been suggested as a bold statement committing them to their intent in collecting, applying and sharing data. The oath would morally acknowledge the rights and responsibilities of data users, building trustful and respectful individuals, communities and, eventually, a society. It should be noted, however, that a data scientist, custodian or steward is not a doctor who directly cares for patients and takes transparent responsibility for them. Would taking a moral pledge help data practitioners avoid bias in their algorithm designs? Perhaps, to some extent. Lewis Mitchell and Joshua Ross at the University of Adelaide argue that data and statistical literacy is the more pressing concern for eliminating unethical algorithm design in the future.

Professions that require qualifications, such as medicine, law, accounting and finance, have codes of conduct or ethics guidelines: for example, the American Institute of CPAs’ Code of Professional Conduct, the Society of Actuaries’ Code of Professional Conduct, and the Ethical Guidelines for Statistical Practice. It is time for data science to become a profession with a professional code of conduct and ethics. Such codes have been proposed to cover the following emerging data ethics principles.

“1. Enhance perspectives on what data is, to sustain and empower communities, as well as to seek liberation from exploitative and oppressive views of datafication.

2. Focus on those directly impacted by the outcomes of data-driven processes.

3. Prioritize impact on the community over benefits to private parties.

4. View consent as emergent from a collaborative and ongoing process.

5. Treat data custodians and stewards as facilitators, not experts.

6. Recognize everyone as an expert in their own field and a valuable input resource; the design of regulatory frameworks should be a comprehensive and collective decision.

7. Share knowledge and tools for data analysis, and keep techniques transparent.

8. Ensure the outcomes of data-driven practices are sustainable, community-led and community-controlled.”

Enforcing data ethics principles enables trust and transparency between individuals and organizations. Businesses use customers’ data for three purposes: making their products and services better, targeting marketing, and increasing revenue through resale. The 2019 Ranking Digital Rights index reported that Microsoft and Apple scored first and second for their handling of privacy disclosures; Facebook ranked fifth in the privacy category and has been making efforts to improve its privacy policies (https://rankingdigitalrights.org/index2019/report/privacy/).

On an end note, big data and data analytics have been and will remain essential to our society. The ongoing Covid-19 pandemic has proved the significance of big data in accelerating the fight against the novel virus. A massive amount of data has been produced from patient identification, movements, symptoms, locations, the virus life cycle, and contact tracing. This big data enables analysis of disease transmission and movement, and powers health monitoring and prevention systems. It is critical that governments and corporations act with the highest integrity and practice data ethics in these uncertain times.

Master of Business Analytics student, UTS
