What makes data real?

The beautiful images of galaxies, nebulas, and other astronomical objects produced by radio telescopes have been processed several times and colorized before we see them, but we still consider these images to be real and not synthetic.

So, what makes data real? Real data are data that have been generated by a process that is appropriately connected to real phenomena, where the terms “appropriately connected” and “real” are defined by the relevant research community. For example, we can say that an MRI image of the brain is real because it has been produced by a process that is appropriately connected to a real brain. However, MRI machines sometimes produce images that radiologists classify as (unreal) artifacts because they are caused, for example, by the scanner itself or by the patient’s movements.

Referring to data as “real” does not necessarily entail a commitment to a physicalist notion of reality. Data could be about physical, chemical, biological, social, or psychological phenomena. For example, we would consider data concerning biodiversity, stock prices, suicidal ideation, or cultural taboos to be real data, even though the phenomena they refer to cannot be equated with specific physical objects. The data could be about things we cannot directly observe, such as electrons, quarks, entropy, or dark matter. What matters most is that the relevant scientific community considers the data to be about real phenomena.

Read more at PNAS (Proceedings of the National Academy of Sciences of the United States of America)

A healthy balance between model building and data gathering

Too much theory without data, and speculations run amok. We get lost in a fog of models and idealizations that seldom have much to say about the world we live in; the maps invent all sorts of worlds, and we wander off into fantasy. With too much data and no theory, though, we drown in confusion. We don’t know how to tell the story we are supposed to tell. We hear all sorts of tales about what is out there in the wilderness, but we don’t know how to chart the best path to reach our destination. The better the balance between speculative thinking and data gathering, the healthier the science that comes out.

Marcelo Gleiser, writing in Big Think

The Algorithms of Nostalgia

Nostalgia has become a template for the serial production of more content, a new income stream for copyright holders, a new data stream for platforms, and a new way to express identity for users. And there’s so much pop culture in the past to draw from that platform capitalism will seemingly never run out. We’re told our data is collected in an attempt to predict what we want, but this isn’t quite true. In attempting to predict our tastes, streaming services work to produce them in their image. Since algorithms are trained on the past, they aren’t merely transmitting nostalgia through neutral channels; they’re cultivating nostalgic biases, seeking to predispose users to crave retro.

Even as Silicon Valley positions itself as progressive, its algorithms are stuck in the past.

Grafton Tanner, writing in Real Life Magazine

The algorithmic feedback loop

Users keep encountering similar content because the algorithms keep recommending it to them. As this feedback loop continues, no new information is added; the algorithm is designed to recommend content that affirms what it construes as their taste.

Reduced to component parts, culture can now be recombined and optimized to drive user engagement. This threatens to starve culture of the resources to generate new ideas, new possibilities. 

If you want to freeze culture, the first step is to reduce it to data. And if you want to maintain the frozen status quo, algorithms trained on people’s past behaviors and tastes would be the best tools.

The goal of a recommendation algorithm isn’t to surprise or shock but to affirm. The process looks a lot like prediction, but it’s merely repetition. The result is more of the same: a present that looks like the past and a future that isn’t one. 

Grafton Tanner, writing in Real Life Magazine
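
Read as a mechanism, the loop is easy to sketch. Below is a minimal toy in Python (my own construction, not any platform’s actual recommender, with made-up categories and probabilities): a recommender trained only on a user’s click history keeps surfacing whatever that history already favors, the user mostly clicks what is surfaced, and the history narrows without any new information entering the loop.

```python
# Toy sketch of the recommendation feedback loop described above.
# Assumptions: made-up categories, a 90% chance the user picks from the
# recommended slate, and a recommender that simply mirrors past clicks.

import random
from collections import Counter

CATEGORIES = ["retro pop", "new releases", "jazz", "world", "ambient"]

def recommend(history, k=5):
    """Recommend k items by sampling categories in proportion to past clicks."""
    counts = Counter(history)
    weights = [counts[c] for c in CATEGORIES]
    return random.choices(CATEGORIES, weights=weights, k=k)

def simulate(rounds=50, seed=0):
    random.seed(seed)
    # Start with a nearly flat history that slightly favors "retro pop".
    history = CATEGORIES + ["retro pop"]
    for _ in range(rounds):
        slate = recommend(history)
        if random.random() < 0.9:
            history.append(random.choice(slate))        # user clicks a recommendation
        else:
            history.append(random.choice(CATEGORIES))   # rare independent discovery
    return Counter(history)

if __name__ == "__main__":
    print(simulate().most_common())
    # One category (usually the initially favored one) tends to dominate:
    # the "prediction" is mostly repetition of the history it was trained on.
```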

Goodhart’s law

Once a useful number becomes a measure of success, it ceases to be a useful number. This is known as Goodhart’s law, and it reminds us that the human world can move once you start to measure it. Deborah Stone writes about Soviet factories and farms that were given production quotas, on which jobs and livelihoods depended; workers learned to hit the numbers by producing whatever the quota rewarded, whether or not it was of any use.

Numbers can be at their most dangerous when they are used to control things rather than to understand them. Yet Goodhart’s law is really just hinting at a much more basic limitation of a data-driven view of the world … there’s a critical gap between even the best proxies and the real thing: between what we’re able to measure and what we actually care about.

Hannah Fry, writing in The New Yorker
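
That proxy gap can be shown with numbers. Here is a deliberately crude toy in Python (my own construction, not from Fry’s article; the effort split, the gaming_fraction parameter, and the 3x inflation factor are all invented): once people are rewarded for moving a measured number, effort flows into whatever moves the number, so the measure keeps climbing while the real value it was supposed to track declines.

```python
# Toy illustration of Goodhart's law: a rewarded proxy drifts away from
# the real thing it was meant to track. All numbers are made up.

def outcomes(gaming_fraction, effort=100.0):
    """Split a fixed effort budget between real work and metric gaming.

    Real value comes only from real work; the measured number also counts
    gamed output, which is cheap to produce but worthless.
    """
    real_work = effort * (1.0 - gaming_fraction)
    gamed = effort * gaming_fraction * 3.0   # gaming inflates the number quickly
    measured = real_work + gamed
    true_value = real_work
    return measured, true_value

if __name__ == "__main__":
    for g in (0.0, 0.25, 0.5, 0.75):
        measured, true_value = outcomes(g)
        print(f"gaming={g:.2f}  measured={measured:6.1f}  true value={true_value:6.1f}")
    # As the measure becomes the target, the measured number keeps climbing
    # while the thing it was supposed to track declines.
```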

Bullet-riddled Fighter Planes

During World War II, researchers at the Center for Naval Analyses, a non-profit research group, were tasked with a problem: reinforcing the military’s fighter planes at their weakest spots. To accomplish this, they turned to data. They examined every plane that came back from a combat mission and noted where bullets had hit the aircraft. Based on that information, they recommended that the planes be reinforced at those precise spots.

Do you see any problems with this approach?

The problem, of course, was that they looked only at the planes that returned and not at the planes that didn’t. Data from the planes that had been shot down would almost certainly have been far more useful in determining where fatal damage was likely to occur, since those were the planes that had suffered catastrophic damage. In fact, the spots where the returning planes showed no damage were the ones that most needed armor: planes hit there rarely made it back to be counted.

The research team suffered from survivorship bias: they looked only at the data that were available to them without analyzing the larger situation. This is a form of selection bias in which we implicitly filter data on some arbitrary criterion and then try to make sense of it without realizing or acknowledging that we’re working with incomplete data.

Rahul Agarwal, writing in Built In
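
The bias is easy to reproduce in a small simulation. The sketch below (Python, with entirely made-up hit counts and lethality probabilities) deals hits uniformly across a plane’s sections, downs planes hit in critical spots more often, and then counts bullet holes only on the planes that return, which is exactly the data the analysts had.

```python
# Monte Carlo sketch of survivorship bias. All numbers are assumptions for
# illustration: hits land uniformly across sections, but hits to critical
# sections are much more likely to keep a plane from returning.

import random
from collections import Counter

SECTIONS = ["engine", "fuel system", "fuselage", "wings", "tail"]
# Assumed probability that a single hit to this section downs the plane.
LETHALITY = {"engine": 0.6, "fuel system": 0.5, "fuselage": 0.05,
             "wings": 0.05, "tail": 0.05}

def fly_mission(rng):
    """Return (returned, hits); each hit lands on a uniformly random section."""
    hits = [rng.choice(SECTIONS) for _ in range(rng.randint(1, 6))]
    returned = all(rng.random() > LETHALITY[s] for s in hits)
    return returned, hits

def simulate(n_planes=100_000, seed=1):
    rng = random.Random(seed)
    seen, actual = Counter(), Counter()
    for _ in range(n_planes):
        returned, hits = fly_mission(rng)
        actual.update(hits)          # every hit that really happened
        if returned:
            seen.update(hits)        # what the analysts get to count
    return seen, actual

if __name__ == "__main__":
    seen, actual = simulate()
    seen_total, actual_total = sum(seen.values()), sum(actual.values())
    for s in SECTIONS:
        print(f"{s:12s} share of hits on survivors {seen[s]/seen_total:.2f}   "
              f"true share of all hits {actual[s]/actual_total:.2f}")
    # Engine and fuel-system hits are common in reality but rare among the
    # survivors, so armoring where the survivors were hit protects the wrong spots.
```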