Mac researcher Victor Kuperman is mining the web for big data



The world’s largest collection of free data is located right at our fingertips.

But what’s to be done when the desired sample size is so vast it could take weeks, months or even years to collect and process via conventional Internet browsing?

Enter McMaster assistant professor Victor Kuperman.

Throughout the past year, Kuperman and two McMaster PhD candidates have been reading and studying roughly 1.8 million blogs and webpages from 340,000 websites around the globe using a fleet of high-powered computers in Togo Salmon Hall. The sites are personal, commercial and governmental in nature.

Their search has yielded valuable information from 20 countries and geographic regions where English is a primary language of communication, including Canada, Australia, New Zealand, India, Jamaica and the United Kingdom.

The team is also collecting metadata that allows them to identify the source, genre and time of creation of each piece of text.

Their goal is to mine the web’s seemingly bottomless pit of “big data,” in the hopes that their findings will impact everything from government policy to how we understand our global neighbours.

“Blogs and websites are active all the time,” says Kuperman, an assistant professor in the Department of Linguistics and Languages, and an expert in statistical and experimental methods in language research.

“People exchange massive quantities of language and information every second of the day, and that big data is right there waiting to be analyzed.”

Most recently, Kuperman’s team has been focusing on how our perception of various countries and cultures is shaped by what we glean from websites and blogs. They’ve found that first-world nations in Scandinavia and elsewhere in Western Europe are often mentioned in posts alongside “positive” adjectives and “excited” sentiments.

By comparison, third-world countries and nations embroiled in armed conflicts and political crises are often associated with “negative” adjectives.

The United States is a unique case, says Kuperman. Posts involving the U.S. are often “more exciting, arousing and passionate,” whether leaning positive or negative.

Big data, the foundation of the team’s research, is a relatively new term used to describe pools of information so large and unwieldy that they’re often impossible to read or study in any meaningful way.

Kuperman’s team is not only succeeding in doing so but is also being rewarded for its efforts.

The team recently received a $70,000 SSHRC Insight Development Grant, and will also receive $290,000 from the Canada Foundation for Innovation’s John R. Evans Leaders Fund and the Ontario Research Fund to continue their work.

The real value in Kuperman’s research can be found in the forthright nature of the Internet, and the willingness of its users to share honest, unfiltered commentary.

In other words, big data culled from the web provides a unique cross section of public opinion, without the added social filter that often comes with face-to-face interaction.

“Our methods are not unlike traditional phone surveys run by polling agencies, where they ask a resident, ‘what do you think about this or that?’ and record the data,” says Kuperman.

“The major difference is, people speak more freely online. They’re much more open, because it can be a one-way exchange. Over the phone, in a one-on-one conversation, you may hold something back and not express your true opinion.”

When McMaster’s new L.R. Wilson Hall is complete, Kuperman and his two PhD students — Bryor Snefjella and Daniel Schmidtke, fellows of the Lewis & Ruth Sherman Centre for Digital Scholarship — will move across campus and continue their work using even faster computers and software in the University’s state-of-the-art home for the humanities and social sciences.

Kuperman is confident that several more PhD candidates will be able to join once the lab relocates, and that the team will be able to expand its focus to popular social networks, including Facebook and Twitter.

The latter could prove to be an invaluable resource for collecting and studying big data.

Twitter’s roughly 288 million monthly active users post more than 500 million tweets per day, and more than 77 per cent of the network’s accounts are located outside the United States.

Below, Victor Kuperman explains the unique eye-tracking equipment he uses to study and analyze reading habits among Canadians: