
Google Ngram Viewer brings big data to small users

by Thomas Russell

[Image: God and Data Ngram. Credit: Thomas Russell]
By 1973, the word “data” had eclipsed the word “God” in relative frequency of use in our society. That kind of statistic would have required extensive research to prove before 2010, but with the introduction of Google Ngram Viewer, quantifying data is a few clicks away.

Google Ngram Viewer allows users to explore how often words appeared in books and see how those frequencies changed over time. Launched in December 2010, Ngram Viewer is made up of 11 publicly searchable databases containing text from some 5 million books published between 1500 and 2008.

Ngram users type words or phrases, separated by commas, in the search bar, and the tool generates a graph plotting the frequency of each word or phrase over a specified period of time.
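For readers who like to see the arithmetic, here is a minimal Python sketch of the idea behind such a graph. It assumes that "frequency" means the number of times a phrase appears in a given year divided by the total number of phrases of the same length published that year. The function names and the tiny corpus are made up for illustration; this is not Google's code or data, and it ignores punctuation and other details a real tool would have to handle.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def relative_frequency(corpus_by_year, phrase):
    """Map each year to (count of phrase) / (count of all phrases of that length)."""
    phrase = phrase.lower()
    n = len(phrase.split())
    result = {}
    for year, texts in sorted(corpus_by_year.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text, n))
        total = sum(counts.values())
        result[year] = counts[phrase] / total if total else 0.0
    return result


# Made-up toy corpus, for illustration only (NOT real Ngram data).
toy_corpus = {
    1900: ["god is good", "the data is thin"],
    2000: ["the data is big", "big data is here"],
}

for word in ("god", "data"):
    print(word, relative_frequency(toy_corpus, word))
```

Dividing by the yearly total is what makes different years comparable: far more books were published in 2000 than in 1900, so raw counts alone would make almost every word look like it is on the rise.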

Pretty cool, but what is the practical use? Ngram is a tool for studying culture by analyzing the words published over the last five centuries. Researchers can track changing food preferences, identify periods of censorship, follow the rise of environmental awareness, spot significant social and political events, and much more.

In the book "Uncharted: Big Data as a Lens on Human Culture," authors Erez Aiden and Jean-Baptiste Michel highlight how Ngram is opening a new perspective on the world by enabling anyone to research things like how quickly technology spread through history, or why we talk less about God and more about data nowadays. One of my favorite examples from Aiden and Michel's work is that "United States" was originally used in the plural form, as written in the Constitution (adopted in 1787), but by the mid-20th century the term was definitively considered singular. According to James McPherson’s Pulitzer Prize-winning book "Battle Cry of Freedom," the change took place after the American Civil War, and the singular usage was adopted in the 1880s. At least, that was the general belief, and it made perfect logical sense.

But research using Ngram shows something slightly different. When charting the terms "The United States is" and "The United States are," we find that the transition from plural to singular actually started early in the 1800s, almost half a century before the Civil War began. And the singular had already outpaced the plural by 1830, about 30 years before the war started.
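The "outpaced" moment in a comparison like this is simply the first year in which one line on the chart climbs above the other. A minimal Python sketch of that check follows. The function name is hypothetical, and the numbers are placeholders invented so the function has something to chew on; they are not Ngram output and say nothing about when the real crossover happened.

```python
def first_crossover(years, rising, falling):
    """Return the first year in which `rising` exceeds `falling`
    after having been at or below it the year before, or None."""
    previously_above = None
    for year, a, b in zip(years, rising, falling):
        above = a > b
        if above and previously_above is False:
            return year
        previously_above = above
    return None


# Placeholder values only (NOT real frequencies, NOT real years).
years = [1, 2, 3, 4, 5]
singular = [1, 2, 3, 4, 5]   # stand-in for "The United States is"
plural = [5, 4, 3, 2, 1]     # stand-in for "The United States are"

print(first_crossover(years, singular, plural))
```

To try this on real numbers, one would replace the placeholder lists with yearly frequencies read off the Ngram chart for each phrase.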

Point is, with more data-driven information available to us than ever before, historical facts come under much more scrutiny. In the case of "The United States is" versus "are," we know the question should be researched further, and the history itself possibly rewritten.

In my classroom, I introduce Ngram to students as a supplement to the research tools they already have. Ngram may not be the best route to a definitive answer to historical questions, and it may not always be the best way to prove an argument, but having so much data so easily accessible helps open students' imaginations to different ways of finding and interpreting data. Some students wander into uncharted territory: which profanities are most popular (surprise!), the transitions from "negro" to "colored" to "African American" to "black," and when we first used the words "flying" and "saucer" together, or "robot" and "rocket."

On the flip side, Ngram can be used in a not-so-constructive way by students and others. Kevin Drum of Mother Jones called Ngram “the greatest timewaster in the history of the Internet.”

Ngram is certainly not perfect. Some words are more difficult to track accurately for a litany of reasons. Ngram can't differentiate a musical "keyboard" from a computer "keyboard," for example, nor the modern "computer" from the 17th-century "computer" (a human who computes or calculates something). It may take a little more creativity to get at the true meaning or context of the search results.

Even with some drawbacks, "big data" sets like Ngram give the average person the chance to play Galileo, Einstein or Darwin, debunking history and discovering new information by filtering through tons of data all at once. Not everyone will be counted among humanity's greatest minds, but the possibility of finding some novel idea through data mining and manipulation is enough to ensure that data tools like Ngram will only continue to grow in popularity and ease of use.

Thomas Russell is a high school information technology teacher and retired Army Signal Corps soldier. He is the founder of SEMtech (Student Engagement and Mentoring in Technology) and an Advisory Board Member of Educating Children of Color. His hobbies include writing, photography and hiking. Contact Thomas via Russell’s Room on Facebook or by email at thruss09@gmail.com, and see his photography at thomasholtrussell.zenfolio.com.
