Uncharted Territory



Exploring the Origins of Big Data

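I get the sense that the term Big Data is taking root in Japan as well. The New York Times ran an article that tries to trace who first started using the term. As someone who studies language, I found it intriguing.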

FEBRUARY 1, 2013, 9:10 AM
The Origins of ‘Big Data’: An Etymological Detective Story

Words and phrases are fundamental building blocks of language and culture, much as genes and cells are to the biology of life. And words are how we express ideas, so tracing their origin, development and spread is not merely an academic pursuit but a window into a society’s intellectual evolution.


Digital technology is changing both how words and ideas are created and proliferate, and how they are studied. Just last month, for example, the Library of Congress said its archive of public Twitter messages has reached 170 billion tweets and rising, by about 500 million tweets a day.

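With digital data now so abundant, you might expect tracing the origin of Big Data to be easy, but apparently it was not. That stands in contrast to the word “software,” which can be traced back to a 1958 article.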

The unruly digital data of the Web is a big ingredient in what is now being called “Big Data.” And as it turns out, the term Big Data seems to be most accurately traced not to references in news or journal archives, but to digital artifacts now posted on technical Web sites, appropriately enough.


But Mr. Shapiro couldn’t find anything as crisp and definitive as he had done for me years earlier when I asked him to try to find the first reference to the word “software” as a computing term. It was in 1958, in an article in “The American Mathematical Monthly,” written by John Tukey, a Princeton mathematician.

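The difficulty in tracing the origin of the term Big Data is that it combines two common words, so finding an early instance is not enough. One also has to consider whether it was used in a sense equivalent to today's. Searching with meaning and usage taken into account is also something computer-based data analysis handles poorly.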

The term Big Data is so generic that the hunt for its origin was not just an effort to find an early reference to those two words being used together. Instead, the goal was the early use of the term that suggests its present connotation — that is, not just a lot of data, but different types of data handled in new ways.

The credit, it seemed to me, should go to someone who was aware of the computing context. That is why, in my view, a very intriguing reference, discovered by the Yale researcher Mr. Shapiro, does not qualify.

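The article gives the following two examples as ones that do not quite fit. One is a paper by an economist, and the other is an essay by a best-selling author. The economist, Mr. Diebold, seems rather proud to claim that he was the first to use Big Data, which comes across as a little cringeworthy (wry smile).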

Francis X. Diebold, an economist at the University of Pennsylvania, got in touch and even wrote a paper, with the mildly tongue-in-cheek title, “I Coined the Term ‘Big Data’ ” I had not thought of economics as the breeding ground for the term, but it is not unreasonable. Some of the statistical and algorithmic methods now in the Big Data tool kit trace their heritage to economic modeling and Wall Street.


In 1989, Erik Larson, later the author of bestsellers including “The Devil in the White City” and “In The Garden of Beasts,” wrote a piece for Harper’s Magazine, which was reprinted in The Washington Post. The article begins with the author wondering how all that junk mail arrives in his mailbox and moves on to the direct-marketing industry. The article includes these two sentences: “The keepers of big data say they do it for the consumer’s benefit. But data have a way of being used for purposes other than originally intended.”

Prescient indeed. But not, I don’t think, a use of the term that suggests an inkling of the technology we call Big Data today.

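Big Data does not carry its present-day sense unless it appears in an IT context. In that sense, the article argues that Mr. Mashey of Silicon Graphics is the fitting candidate. However, academic papers apparently do not support this attribution, because Mr. Mashey was using the term in talks to small groups in the late 1990s.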

Since I first looked at how he used the term, I liked Mr. Mashey as the originator of Big Data. In the 1990s, Silicon Graphics was the giant of computer graphics, used for special-effects in Hollywood and for video surveillance by spy agencies. It was a hot company in the Valley that dealt with new kinds of data, and lots of it.

There are no academic papers to support the attribution to Mr. Mashey. Instead, he gave hundreds of talks to small groups in the middle and late 1990s to explain the concept and, of course, pitch Silicon Graphics products. The case for Mr. Mashey is on the Web sites of technical and professional organizations, like Usenix. There, some of his presentation slides from those talks are posted, including “Big Data and the Next Wave of Infrastress” in 1998.


When I called Mr. Mashey recently, he said that Big Data is such a simple term, it’s not much a claim to fame. His role, if any, he said, was to popularize the term within a portion of the high-tech community in the 1990s. “I was using one label for a range of issues, and I wanted the simplest, shortest phrase to convey that the boundaries of computing keep advancing,” said Mr. Mashey, a consultant to tech companies and a trustee of the Computer History Museum in Mountain View, Calif.

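So it appears that Big Data in today's sense was first used by Mr. Mashey of Silicon Graphics in the late 1990s. Still, I imagine the term spread not only because of his originality but also because, for those on the receiving end, it was a simple expression whose significance was easy to grasp.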

The article closes by suggesting that, thanks to Big Data, the way etymological research is done will change as well. Until now, the databases in question were “legal documents, news articles and other documents, in computerized archives,” but sources like Twitter are now becoming searchable too. Elsewhere, Mr. Shapiro said, “It’s almost like oral language instead of edited text.”

Tracing the origins of Big Data points to the evolution in the field of etymology, according to Mr. Shapiro. The Yale researcher began his word-hunting nearly 35 years ago, as a student at the Harvard Law School, poring through the library stacks. He was an early user of databases of legal documents, news articles and other documents, in computerized archives.

The Web, Mr. Shapiro said, opens up new linguistic terrain. “What you’re seeing is a marriage of structured databases and novel, less structured materials,” he said. “It can be a powerful tool to see far more.”


January 2013
Update on the Twitter Archive At the Library of Congress

In April, 2010, the Library of Congress and Twitter signed an agreement providing the Library the public tweets from the company’s inception through the date of the agreement, an archive of tweets from 2006 through April, 2010. Additionally, the Library and Twitter agreed that Twitter would provide all public tweets on an ongoing basis under the same terms. The Library’s first objectives were to acquire and preserve the 2006-10 archive; to establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date. This month, all those objectives will be completed. To date, the Library has an archive of approximately 170 billion tweets.

The Library’s focus now is on confronting and working around the technology challenges to making the archive accessible to researchers and policymakers in a comprehensive, useful way. It is clear that technology to allow for scholarship access to large data sets is lagging behind technology for creating and distributing such data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task. The Library is now pursuing partnerships with the private sector to allow some limited access capability in our reading rooms. These efforts are ongoing and a priority for the Library.