Prediction Algorithm

Jose Rios
July 08, 2018

Introduction

The goal of this product is to build a natural language processing model that receives a fragment of English text as input and predicts the next word.

This is just like the virtual keyboard on a cellphone, which suggests the next word given the ones already typed.

This work is part of the SwiftKey Capstone Project course offered on Coursera by the Johns Hopkins Bloomberg School of Public Health.

How the model works

Just give it a sentence in plain English, hit Enter, and the model will return the most likely word to complete it.

The model considers only the last two words of the input when predicting the continuation of the sentence.

Given that the purpose of this work is to build an application that will run on a cellphone, both the size of the predictive model and the runtime had to be drastically reduced. Therefore, the model returns only a single suggested word.
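The two-word lookup can be sketched as follows. This is an illustrative toy, not the actual implementation: the `model` dictionary here holds made-up entries, whereas the real model is built from the corpus as described in the next slides.

```python
import re

# Hypothetical toy model: maps the last two typed words to the
# most likely next word (real entries come from the corpus).
model = {
    ("the", "whole"): "thing",
    ("an", "old"): "friend",
}

def predict(sentence):
    """Return the predicted next word, or None if the two-word
    prefix is not in the model (or fewer than two words were typed)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if len(words) < 2:
        return None
    return model.get((words[-2], words[-1]))

print(predict("I received this gift from an old"))  # -> friend (toy model)
```

Because the model is a plain dictionary keyed on word pairs, a prediction is a single hash lookup, which is what keeps the runtime small enough for a phone.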

How the model was built (1)

  • first, 50,000 lines were randomly selected from a 900,000-line corpus to build a sample text
  • next, all shingles of three words were saved
  • the number of occurrences of each shingle was tallied
  • shingles with out-of-vocabulary words were removed
  • the vocabulary is Josh Kaufman's word list: https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa-no-swears.txt
  • for those shingles whose first two words are exactly the same, only the most common combination was kept
  • example: if “inside the box” occurs more frequently than “inside the magic” and “inside the car”, only the former was kept
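The steps above can be sketched as a single pass over the sample text. Function and variable names are illustrative assumptions, not the actual code; the toy `sample` and `vocab` below stand in for the 50,000-line sample and Kaufman's vocabulary list.

```python
from collections import Counter

def build_model(lines, vocab):
    """Tally three-word shingles, drop those with out-of-vocabulary
    words, and keep only the most frequent continuation for each
    two-word prefix (a sketch of the steps described above)."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - 2):
            shingle = tuple(words[i:i + 3])
            # discard shingles containing out-of-vocabulary words
            if all(w in vocab for w in shingle):
                counts[shingle] += 1
    # for each two-word prefix, keep only the most common third word
    best = {}
    for (w1, w2, w3), n in counts.items():
        key = (w1, w2)
        if key not in best or n > best[key][1]:
            best[key] = (w3, n)
    return {k: w3 for k, (w3, n) in best.items()}

sample = ["inside the box", "inside the box", "inside the car"]
vocab = {"inside", "the", "box", "car"}
print(build_model(sample, vocab))  # {('inside', 'the'): 'box'}
```

On the toy data, "inside the box" occurs twice and "inside the car" once, so only the former survives for the prefix ("inside", "the"), exactly as in the example above.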

How the model was built (2)

  • shingles with a frequency below a threshold of 0.1% of the total number of words were discarded
  • next, an index table was created with all unique words in the model
  • each word of each shingle was then replaced by its index in the table
  • lastly, only the 50,000 most frequent shingles were kept; after all filtering, the final model contains 43,927 shingles
  • the indexed model occupies only 528 KB and the index table 408 KB
  • in total, the model data take up just 936 KB, that is, less than 1 MB of RAM
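The indexing step can be sketched like this. The function name and data are illustrative assumptions; the point is that storing small integers instead of repeated strings is what shrinks the model to under 1 MB.

```python
def index_model(model):
    """Replace every word in a shingle model by an integer index,
    returning the indexed model plus the index table needed to map
    indices back to words (a sketch of the compression step above)."""
    # collect every unique word appearing in any shingle
    words = sorted({w for (w1, w2), w3 in model.items()
                    for w in (w1, w2, w3)})
    idx = {w: i for i, w in enumerate(words)}
    indexed = {(idx[w1], idx[w2]): idx[w3]
               for (w1, w2), w3 in model.items()}
    return indexed, words

model = {("inside", "the"): "box", ("an", "old"): "friend"}
indexed, table = index_model(model)

# recover a prediction by going through the index table
w1, w2 = table.index("inside"), table.index("the")
print(table[indexed[(w1, w2)]])  # box
```

At prediction time the input words are mapped to indices through the table, the indexed model is queried, and the resulting index is mapped back to a word.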

Examples of use

input: the whole process will take about…
output: minutes

input: I received this gift from an old…
output: friend

input: the school committee will soon address this…
output: issue

input: You should worry what you do with your own…
output: life