
When machine learning, Twitter and te reo Māori merge

11 February 2019

Left to right: Andreea Calude, David Trye, Felipe Bravo Márquez.

Researchers have whittled a massive 8 million tweets down to a more manageable 1.2 million to look at how te reo Māori is being used in the genre.

The team from the University of Waikato have focused on 77 Māori loanwords (te reo Māori words used in an English context) and used them as training data for their machine-learning model. Machine learning allows data scientists to provide a computer with a large data set, and teach it to make predictions based on that data.

Computing and Mathematical Sciences student David Trye spent the summer working on the project, with supervisors Dr Andreea Calude and Dr Felipe Bravo Márquez. The initial 8 million tweets contained a fair bit of distracting data ‘noise’. David says the irrelevant tweets were those that did not use the words in a New Zealand English context, or were otherwise unrelated. “For example, Kiwi is the name of a song by Harry Styles, so people will tweet things like ‘listening to Kiwi’. And Moana can turn up as a Disney princess rather than the sea.”

Dr Calude and David manually coded about 4,000 tweets, then David and Dr Bravo Márquez trained a machine-learning model to weed out the irrelevant ones. Afterwards, they used Word2Vec, another machine-learning technique, invented by Google, to automatically extract the meaning of words according to their context. Dr Bravo Márquez says the technique was invented a few years ago. “But it was a huge revolution in the area of computational linguistics, or the use of computers to process human language. It was the first algorithm to do it in an efficient way.”
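To give a feel for the filtering step, here is a minimal sketch of training a classifier on hand-labelled tweets and using it to flag irrelevant ones. It is not the team's model: the article does not say which algorithm they used, so a simple Naive Bayes classifier stands in for it here, and the example tweets, the labels "relevant" and "irrelevant", and the function names `train` and `classify` are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical hand-labelled tweets, invented for illustration (not project data).
LABELLED = [
    ("listening to kiwi by harry styles on repeat", "irrelevant"),
    ("moana is my favourite disney princess", "irrelevant"),
    ("watching moana the disney movie again", "irrelevant"),
    ("new harry styles song kiwi is great", "irrelevant"),
    ("the kiwi is a native bird of aotearoa", "relevant"),
    ("learning my whakapapa with the whanau this weekend", "relevant"),
    ("our whanau went to the marae for a hui", "relevant"),
    ("proud kiwi here in aotearoa with my whanau", "relevant"),
]


def train(examples):
    """Count words per label; this is all a Naive Bayes model needs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Start from the prior: how common this label is in the training data.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:  # words never seen in training are skipped
                # Add-one smoothing avoids a zero probability for unseen pairs.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(LABELLED)
print(classify("kiwi by harry styles", model))
print(classify("my whanau at the marae", model))
```

With only eight toy examples this is a demonstration of the idea rather than a usable filter; the researchers worked from about 4,000 manually coded tweets.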

They’ve done the technical bit; now comes the fun part: asking some questions of the data. David is planning to grow the project into a dissertation, also bringing in Dr Te Taka Keegan and Dr Nicholas Vanderschantz to work on the project. He says it will be interesting to see who the users are. “If they are mainly te reo speakers, and if not, why they use the words. Is it because they want to practise and maintain their te reo or is it because it’s such a distinctive part of the New Zealand identity?”

Their analysis involves locating the other words that are associated with the Māori loanwords. Dr Calude says it will give them a different kind of idea about how the words are being used. “In a dictionary you tend to have what the word means, abstractly out of context, or with a synonym or two. But here you have more of a network of related words, which may not have the same meaning but seem to occur in the same contexts. With whakapapa you have obvious words but also words like maunga, so it’s not the meaning of the word, but it’s like the baggage of the word that comes with it.” The diagram below shows how some of this data is being visualised. Red dots are the 40 closest words in the semantic space to the search word.
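Separately from the diagram, the idea of "closest words in the semantic space" can be sketched in a few lines of code. This is not Word2Vec and not the team's code: as a stand-in, each word is represented by a count of the words it appears alongside, and closeness is measured with cosine similarity. The toy tweets and the function names `cooccurrence_vectors`, `cosine` and `nearest` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration (not project data).
TWEETS = [
    "my whakapapa links me to my maunga and my awa",
    "our maunga and our awa are part of our whakapapa",
    "tracing whakapapa back to the iwi and the maunga",
    "the iwi gathered near the awa",
    "coffee and a muffin before the bus",
    "missed the bus so more coffee",
]


def cooccurrence_vectors(sentences):
    """Map each word to a Counter of the words sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(word, vectors, k=3):
    """Return the k words whose context vectors are most similar to word's."""
    target = vectors[word]
    scored = [
        (cosine(target, vector), other)
        for other, vector in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


vectors = cooccurrence_vectors(TWEETS)
print(nearest("whakapapa", vectors, k=3))
```

In this toy corpus, whakapapa ends up closer to maunga than to coffee because it shares far more of its surrounding words with the former. Word2Vec learns dense vectors from context rather than counting raw co-occurrences, but the neighbour lookup works on the same principle.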



Dr Calude has been involved in research on Māori loanwords in newspapers as part of a Marsden-funded project. She has already noticed a difference with Twitter use during the manual coding phase. “By comparison the words are more integrated. There’s more language mixing: full sections of Māori and full phrases in English together. So, it’s more like code switching, which is what bilinguals do. We might be looking at almost a different phenomenon of how Māori is used in Twitter.”

The theory behind the project has been around for quite a while, but combining it with machine learning means they have created a remarkably vast and accurate corpus of words to analyse. The researchers want to make it possible for others to do the same, so they’re sharing their work on the open-source platform GitHub: https://waikato.github.io/kiwiwords/. They’re adding to the website as they go along, so it is a growing resource.


