Funny, I’ve always assumed that at least some of the reason for calling it #BigData was by analogy to this:

Supermajor: Big Oil

Petroleum and gas supermajors are sometimes collectively referred to as “Big Oil”, a term that emphasizes their economic power and perceived influence on politics, particularly in the United States. Big Oil is often associated with the Energy Lobby.

Usually used to refer to the industry as a whole in a pejorative or derogatory manner, “Big Oil” has come to encompass the enormous impact crude oil exerts over first-world industrial society.

…i.e. a term encompassing the industry and its corporate interests, too.

TechCrunch disagrees:

Why have I grown to hate the words “big data”? Because I think the term itself is outdated, and consists of an overly general set of words that don’t reflect what is actually happening now with data. It’s no longer about big data, it’s about what you can do with the data. It’s about the apps that layer on top of data stored, and insights these apps can provide. And I’m not the only one who has tired of the buzzword. I’ve talked to a number of investors, data experts and entrepreneurs who feel the same way.

According to Vincent McBurney, “Big Data” originates from Francis Diebold of the University of Pennsylvania, who in July 2000 wrote about the term in relation to financial modeling. That was over 10 years ago. In the meantime, so much has happened since then with respect to how and what people can do with these enormous data sets.

And big data is not just about the enterprise. The fact is that every company, from consumer giants like Facebook and Twitter to the fast-growing enterprise companies like Cloudera, Box, Okta and Good Data are all big data companies by definition of the word. Every technology company with a set of engaged regular users is collecting large amounts of data, a.k.a. “big data.” In a world where data is the key to most product innovation, being a “big data” startup isn’t that unique, and honestly doesn’t say much about the company at all.

Oh well, it’ll be back…

3 Replies to “Funny, I’ve always assumed that at least some of the reason for calling it #BigData was by analogy to this:”

  1. Note that nearly everything written about “big data” is concerned with what companies do. Not individuals.

    Even Leena Rao’s cited piece in TechCrunch starts with individuals and then veers back into business-land, and the need for “a different way to describe startups.”

    Indeed, it is like “big oil.” What individual runs their own oil refinery?

  2. I thought the TechCrunch article was a good one, too; and when it comes to a number of aspects of “Big Data”, I admit I’ve been sceptical from the outset, and I haven’t seen anything which turns my view around, yet. Here’s a quick sketch of the area as I see it; it’s intended as a provocative target, so please shoot at it.

    “Big Data” is about taking volumes of data which would be unwieldy for a person to work with analytically, and applying automated techniques to make it somehow wieldy. Such data is often initially unstructured from the perspective of the system doing the analysis (it may actually be structured, with different records structured in a whole number of different ways, but the analysing system doesn’t necessarily know this). It is given to a machine to work on; the machine infers structure from, and applies structure to, those records (see, e.g., Splunk). It can then edit those records down in a way which destroys the minimum amount of useful information deducible or inferable from that data, to produce some kind of human-consumable summary (see, e.g., MapReduce) or graphical representation of that data; and/or it can scan those records to look for correlations in changes of the data (“when X moves, Y moves in the same direction”) or trends in the data over some variety of axes (time, location, etc.).

    Thinking about it, the correlation generation is what we used to call “data mining” – this isn’t a term I’ve seen used for a while. So, either it has been subsumed into “Big Data”, or the scope of “Big Data” is limited purely to inference and application of record formatting…

    So, we start piling automated inferences on top of each other and automatically editing the data based upon those inferences, before a representation of the data gets to a pair of human eyes. The good thing that can be said about doing this is that the automated manipulation can be reasonably expected to be free from conscious or intentional bias (especially if the inference engine designer is unaware of the context of the data), but as was discovered from neural network research in the early ’80s, inferences and correlations made by a machine are often not the ones a human would logically make.

    However, given a chaotic system and an infinite amount of data from measuring its states, a correlation engine won’t fail to find any correlations; rather, it will find *every* possible correlation, with equal weight. Distinguishing between truly chaotic systems and merely extremely complex systems can be a very fraught exercise where the system has a large number of inputs. Large-but-finite data sets from chaotic systems carry random biases, and tools that make such data sets tractable let the human brain draw erroneous conclusions far more easily.

    The human brain appears to have evolved a bunch of hard-wiring when it comes to pattern-matching, along with cause, effect and inferring things about them, without necessarily recognising the differing risks of different directions of reasoning in complex systems. While predicting a set of effects from a set of causes is usually a reasonable thing to do, ascribing a set of interesting-looking effects to interactions between a set of causes frequently isn’t – and it’s this kind of reasoning that Big Data seems to enable more of.

    Ultimately, therefore, I see Big Data as a means of enabling us to draw inferences about “what’s going on behind” huge data sets more quickly than we have previously. These inferences will always be a mix of right (or at least, “good and reproducible enough to be repeatedly applicable”), wrong, useful and useless, and some of these inferences will result from random bias in finite data sets – and that’s in circumstances where the automated data processing elements both operate perfectly and are appropriate for the data sets being fed to them. So, I don’t see a particular positive case for going down the Big Data path – its success or failure can only be measured by doing it and seeing what emerges. Throw some dice…
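The point in the comment above about random bias in finite data sets can be made concrete with a small experiment. What follows is only a minimal sketch under stated assumptions, not anything the commenter proposed: it uses independent Gaussian noise as a stand-in for a system with no real relationships, and the helper names `pearson` and `strongest_chance_correlation` are hypothetical, introduced purely for illustration. It searches every pair of unrelated series and reports the strongest correlation it finds.

```python
import itertools
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_series, n_points, seed=0):
    """Largest |r| over all pairs of independent random series.

    Assumption: every series is pure Gaussian noise, so any correlation
    found here is an accident of the finite sample, not a real effect.
    """
    rng = random.Random(seed)
    series = [
        [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        for _ in range(n_series)
    ]
    return max(
        abs(pearson(a, b)) for a, b in itertools.combinations(series, 2)
    )


if __name__ == "__main__":
    # Same number of observations each time; only the number of
    # unrelated variables being searched grows.
    for n_series in (2, 20, 200):
        r = strongest_chance_correlation(n_series, n_points=20)
        print(f"{n_series:>3} unrelated series -> strongest |r| = {r:.2f}")
```

Because the seed is fixed, each larger run contains the smaller run's series as a subset, so the strongest correlation found can only stay the same or grow as more unrelated variables are added. That is the trap described above: the wider the search, the more impressive the best accidental pattern looks.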
