Netflix’s The Great Hack and GPT-2

“Big data [1] is everything.”

Note: This article is in no way political and does not discuss Cambridge Analytica or ethics; I’m only really interested in the technology and how it can be a double-edged sword.

Anyone who watched the documentary The Great Hack [2] on Netflix recently will now have a basic overview of how powerful parsing big data can be, and how advancements in computer algorithms can not only utilise large data sets to build an accurate profile of something but, once you have enough data, even make predictions.

Machine Learning [3] and Artificial Intelligence [4] are really nothing new, but the field is still young and advancements are being made all the time. There is also the cost involved in reaching the real cutting edge: for now, it’s out of the grasp of anyone who can’t afford 200k+ for the hardware required.

Netflix’s “The Great Hack” helps illustrate, probably not in the best light, how far we’ve come if you have the resources and data. This is a significant step forward and is very exciting for fields such as weather forecasting or medicine. Being able to predict the weather more accurately is not only good for letting you know when to pull out the barbecue, it can also save lives. The power to make predictions in anything medical really needs no explanation. Having enough data to satisfy a model that can predict you visiting the doctor before you realise you need to? Well, hand me the crystal ball. I’m all in.

While I was at university, I wrote a program in Java for handwriting recognition that attempted to classify gender at the same time. For this I used JOONE – Java Object Oriented Neural Engine [5] and a feed-forward back-propagation network. I printed out templates that lecturers would hand out at the end of each lecture for students to fill in, so I could collect data to train my model. Unfortunately, this wasn’t enough data and I never really got to see how accurate the model could be. I wish I had been able to collect enough data to see it through, but it just wasn’t possible with my limited resources.

In the previous article related to email I touched on NLP (Natural Language Processing) [6] and GPT-2 [7], as well as how controversial discussions around this are right now [8]. The reason for this controversy is that GPT-2 highlights how far these advancements have come in relation to, basically, the Turing Test [9], in which Alan Turing describes a situation where conversing with a machine would be indistinguishable from conversing with a human.

GPT-2 was trained on 10x the data that GPT was: 40GB of text drawn from 8 million web pages, scraped from Reddit posts linking to articles with more than 3 karma. The model itself is basically a large transformer-based language model.

So how well does it work?

In the most recent and controversial example, the GPT-2 model was fed the opening paragraph of a research paper that included the discovery of unicorns “living in a remote, previously unexplored valley, in the Andes Mountains” that “spoke perfect English”. With this context, GPT-2 was asked to predict the next word, then the next, and so on until it had completed this gem:

Context was what was fed into GPT-2, the rest was produced by the model.
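The predict-a-word-then-repeat loop described above can be sketched with a toy model. GPT-2 conditions on the whole context with a large transformer, but the generation procedure is the same idea: score candidate next words given what has been written so far, pick one, append it, and repeat. The bigram table below is entirely hypothetical and only illustrates the sampling loop, not GPT-2 itself.

```python
import random

# Toy "language model": probability of the next word given only the previous
# word. A real GPT-2 conditions on the full context with a transformer; this
# hypothetical bigram table just demonstrates the autoregressive loop.
BIGRAMS = {
    "the": {"unicorns": 0.5, "valley": 0.5},
    "unicorns": {"spoke": 1.0},
    "spoke": {"perfect": 1.0},
    "perfect": {"English": 1.0},
}

def generate(context, max_new_words=10, seed=0):
    """Autoregressive generation: predict the next word, append it, repeat."""
    rng = random.Random(seed)
    words = context.split()
    for _ in range(max_new_words):
        dist = BIGRAMS.get(words[-1])
        if dist is None:  # the toy model has nothing to say after this word
            break
        choices, weights = zip(*dist.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the unicorns"))
```

The key design point is that nothing is planned ahead: each word is chosen only from the distribution conditioned on the text so far, which is why scaling up the model and context is what turns this simple loop into coherent paragraphs.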

This is crazy! Think about it: a computer came up with this, not a human. Sure, you could code up something that might work quite well for a specific purpose, but what is really powerful here is that GPT-2 will most likely be able to do this for anything. It’s so close to being indistinguishable from having been written by a human that it has all kinds of implications.

For additional information, watch this short 9-minute video [10] from Computerphile.

The problem with this advancement is that it can be abused. You may already be bombarded with terms such as “fake news”, or news articles related to how bad actors with a political agenda may be abusing social media to steer opinion.

I also urge you to watch this follow up Computerphile video [11] which explains why they didn’t release it, even though GPT-1 was released as open source.

Could you imagine bad actors utilising this technology to automate bots on YouTube, Twitter, Facebook and any other platforms you can think of? The consensus is that because the model was held back from being released, we have approximately 6 months to implement better bot detection algorithms.

Good Luck.
