In the movie Die Hard with a Vengeance (aka Die Hard 3), there is this famous scene where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced to solve a problem or be blown up: Given a 3 gallon jug and a 5 gallon jug, how do you measure out exactly 4 gallons of water?
Hypothesis has an excellent implementation of property-based testing for Python. I thought to myself: I wonder if you can write that Die Hard specification using Hypothesis? As it turns out, Hypothesis supports stateful testing, and I was able to port the TLA+ example to Python pretty easily:
Next we define invariants, which are properties that must always hold true in our system. Our first invariant, physics_of_jugs, says that the jugs must hold an amount of water that makes sense. For example, the big jug can never hold more than 5 gallons of water.
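To make the idea of actions plus invariants concrete, here is a minimal plain-Python sketch. It is not the Hypothesis port described above: it swaps in an exhaustive breadth-first search for Hypothesis's randomized stateful testing, and every name except physics_of_jugs (next_states, find_counterexample, the capacity constants) is an assumption made for illustration. The trick is the same one the TLA+ example relies on: assert that the big jug never holds 4 gallons, then let the search produce the sequence of steps that breaks that assertion.

```python
from collections import deque

SMALL_CAP, BIG_CAP = 3, 5  # jug sizes from the puzzle


def next_states(small, big):
    """Yield (action, new_state) for every legal move from a state."""
    yield "fill small", (SMALL_CAP, big)
    yield "fill big", (small, BIG_CAP)
    yield "empty small", (0, big)
    yield "empty big", (small, 0)
    poured = min(small, BIG_CAP - big)
    yield "pour small into big", (small - poured, big + poured)
    poured = min(big, SMALL_CAP - small)
    yield "pour big into small", (small + poured, big - poured)


def physics_of_jugs(small, big):
    """Invariant: neither jug is negative or over capacity."""
    return 0 <= small <= SMALL_CAP and 0 <= big <= BIG_CAP


def find_counterexample(target=4):
    """Search for the shortest action sequence after which the big jug
    holds `target` gallons, i.e. one that breaks 'big != target'."""
    start = (0, 0)
    paths = {start: []}  # state -> actions that reach it
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for action, new_state in next_states(*state):
            assert physics_of_jugs(*new_state), new_state
            if new_state in paths:
                continue  # already reached by a path at least as short
            paths[new_state] = paths[state] + [action]
            if new_state[1] == target:
                return paths[new_state]
            queue.append(new_state)
    return None


solution = find_counterexample()
print(solution)
```

The returned list is the sequence of moves that measures out 4 gallons. Because the search is breadth-first, it is also a shortest such sequence.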
While we could have agreed to disagree, I had just passed the Azure AI Fundamentals certification and the topics covered on that certification gave me an idea that could resolve this issue once and for all.
Machine learning is a combination of mathematics and computer science that allows computer algorithms to make certain types of predictions or groupings based on data. There are a number of different flavors of machine learning including regression which predicts numerical values, clustering which sorts data into groups, and classification which determines what type of label to put on something based on its data.
Classification can be used for things like determining if a loan should be approved or rejected, or if a mole is cancerous or benign. However, it could conceivably be used to determine whether or not a movie should be considered a Christmas movie based on historical data from other movies.
To avoid adding my own bias to the experiment, I took the top five Christmas movie lists returned by an internet search and considered any movie that appeared on at least two of those five lists a Christmas movie.
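The "at least two of five lists" rule is easy to express in code. The sketch below is only an illustration of that rule: the function name is an assumption and the list contents are invented placeholders, not the lists actually used.

```python
from collections import Counter


def label_christmas_movies(lists, min_lists=2):
    """Return the titles that appear on at least `min_lists` of the lists."""
    counts = Counter()
    for titles in lists:
        counts.update(set(titles))  # set() so each list counts a title once
    return {title for title, n in counts.items() if n >= min_lists}


# Hypothetical list contents, invented purely for illustration.
example_lists = [
    ["Movie A", "Movie B", "Movie C"],
    ["Movie B", "Movie D"],
    ["Movie A", "Movie B"],
    ["Movie E"],
    ["Movie C", "Movie F"],
]
christmas = label_christmas_movies(example_lists)
print(sorted(christmas))
```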
Interestingly, Die Hard appeared on 4 out of 5 of these lists with the 5th list, Thrillist, referencing Die Hard in its second paragraph and stating that it was explicitly excluding it. This put Die Hard at the same level of consideration as a Christmas movie as The Santa Clause and Christmas Vacation and above Rudolph the Red-Nosed Reindeer, How the Grinch Stole Christmas, and several versions of A Christmas Carol.
Now that I had a set of Christmas movies, I needed to find a source of data for movies that might have a sufficient level of detail about the movies for a machine learning algorithm to be able to make inferences as to whether or not a movie is a Christmas movie.
After a bit of searching, I discovered that Rounak Banik had uploaded a very popular dataset known simply as The Movies Dataset which contained over 45,000 movies along with ratings and keyword information on these movies.
This dataset seemed sufficient for my needs and looked to give a machine learning routine a good chance at finding relationships between the data. Perhaps most importantly, it included every one of my movies that would be flagged as Christmas movies for the algorithm and also contained information on Die Hard.
Once I had the data downloaded to my machine I was able to set up a Jupyter notebook project to explore the data using the Python programming language. This lets me load and interact with data from Python code in a very iterative and visual manner.
Exploring data is a more advanced topic to describe for an article about a specific project, but typically involves a mix of data visualization and using descriptive statistics to identify the characteristics of different columns of data. While most of my original exploratory data analysis and statistical analysis work was eventually removed to make way for the final version of the project, some examples are still present in my ProcessedEDA notebook on GitHub.
Because I was dealing with a dataset with 45,000 movies and only 50 movies flagged as Christmas movies by my process listed earlier, I could afford to throw out outliers. Any movie that was released before the earliest known Christmas movie or after the latest known one was discarded. Any unreleased movie was discarded. Any movie with a greater than 2.5 hour runtime was discarded (some had runtimes that appeared to resemble entire TV seasons). I also dropped a number of unreliable columns including the budget and revenue columns.
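The filtering rules above might look roughly like this in plain Python. This is a sketch under stated assumptions rather than the notebook's actual code: the field names (title, release_date, runtime, status), the "Released" status value, and the choice to drop rows with a missing runtime are all hypothetical, and the sample rows are invented.

```python
from datetime import date

MAX_RUNTIME_MINUTES = 150  # 2.5 hours


def filter_movies(movies, christmas_titles):
    """Keep released movies that fall inside the date window spanned by the
    known Christmas movies and run no longer than 2.5 hours."""
    dates = [m["release_date"] for m in movies
             if m["title"] in christmas_titles and m["release_date"]]
    earliest, latest = min(dates), max(dates)
    kept = []
    for m in movies:
        if m["status"] != "Released" or m["release_date"] is None:
            continue  # unreleased
        if not (earliest <= m["release_date"] <= latest):
            continue  # outside the Christmas-movie date window
        if m["runtime"] is None or m["runtime"] > MAX_RUNTIME_MINUTES:
            continue  # assumption: missing runtimes are dropped too
        kept.append(m)
    return kept


# Invented sample rows for illustration only.
sample = [
    {"title": "Xmas Early", "release_date": date(1950, 1, 1), "runtime": 100, "status": "Released"},
    {"title": "Xmas Late", "release_date": date(2010, 1, 1), "runtime": 95, "status": "Released"},
    {"title": "Too Old", "release_date": date(1930, 1, 1), "runtime": 90, "status": "Released"},
    {"title": "Too Long", "release_date": date(1990, 1, 1), "runtime": 600, "status": "Released"},
    {"title": "Unreleased", "release_date": date(2000, 1, 1), "runtime": 90, "status": "Rumored"},
    {"title": "Keeper", "release_date": date(1988, 1, 1), "runtime": 120, "status": "Released"},
]
kept_titles = [m["title"] for m in filter_movies(sample, {"Xmas Early", "Xmas Late"})]
print(kept_titles)
```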
For example, I wanted to have a column for each type of genre and note if a movie was that genre or not with either a 0 or a 1. This would help a machine learning algorithm make distinctions as to what genres something was if that wound up being relevant for classification. However, the movie data I received stored its genre in a column with values that looked like this:
That meant that I needed to write Python code to look at each row in my 45,000 row dataset and translate it from values like this to readable columns like Is Adventure, Is Fantasy, and Is Family with correct values for each movie. Thankfully there are amazing libraries like Pandas that are very good at this, but this is still a typical task a data scientist or data analyst must do when processing data.
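As a rough idea of what that translation involves, here is a dependency-free sketch. The project itself used Pandas, so this is a stand-in, not the original code. It assumes each genre cell is a string holding a Python-style list of objects with a name field. That format, the function name, and the sample rows are assumptions for illustration.

```python
import ast


def add_genre_columns(rows, field="genres"):
    """Replace a stringified genre list with one 0/1 'Is <Genre>' column
    per genre seen anywhere in the data."""
    parsed = [{g["name"] for g in ast.literal_eval(row[field])} for row in rows]
    all_genres = sorted(set().union(*parsed))
    result = []
    for row, names in zip(rows, parsed):
        new_row = {k: v for k, v in row.items() if k != field}
        for genre in all_genres:
            new_row[f"Is {genre}"] = int(genre in names)
        result.append(new_row)
    return result


# Hypothetical rows; the cell format is an assumption about the dataset.
rows = [
    {"title": "Movie A", "genres": "[{'id': 1, 'name': 'Adventure'}, {'id': 2, 'name': 'Family'}]"},
    {"title": "Movie B", "genres": "[{'id': 3, 'name': 'Fantasy'}]"},
    {"title": "Movie C", "genres": "[]"},
]
encoded = add_genre_columns(rows)
print(encoded[0])
```

Using ast.literal_eval rather than eval keeps the parsing limited to plain literals, which matters when the cell contents come from a downloaded file.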
The full scope of Azure Machine Learning Studio (also called ML Studio) is a topic for another article, but a quick summary is this: ML Studio lets data scientists of all skill and experience levels run machine learning experiments however they feel comfortable.
However, performance metrics can be deceiving in machine learning and there are a number of different metrics you need to look at such as recall and precision. Perhaps the best way of visualizing the performance of a classification routine is via a Confusion Matrix which displays a grid of predictions vs actual values.
The vertical axis here is what something actually was and the horizontal axis is what my routine said it was. In this case, 13,769 movies were not Christmas movies and my routine said they were not. However, 46 movies were Christmas movies and my routine said they were not.
This resulted in a more realistic result with 35 out of the 48 movies being correctly classified as Christmas movies, but a large number of non-Christmas movies were falsely classified in the process.
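To see why accuracy alone misleads on data this imbalanced, it helps to compute the metrics by hand. The sketch below builds confusion-matrix counts from two label lists and derives accuracy, precision, and recall. The function names and the toy labels are assumptions for illustration and are not the experiment's data.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif p:
            counts["fp"] += 1
        elif a:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def safe_ratio(numerator, denominator):
    """Return None instead of dividing by zero."""
    return numerator / denominator if denominator else None


def summarize(counts):
    """Derive accuracy, precision, and recall from confusion counts."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + fn + tn),
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),
    }


# Toy data: 2 positives among 100 rows, and a model that always says "no".
actual = [1] * 2 + [0] * 98
predicted = [0] * 100
metrics = summarize(confusion_counts(actual, predicted))
print(metrics)
```

A model that never predicts the rare class scores 98% accuracy on this toy data while finding none of the positives, which is the same pattern as the first confusion matrix described above. For the later run that caught 35 of 48 Christmas movies, recall works out to 35/48, or about 0.73. Precision cannot be computed from the figures quoted here because the number of false positives is not given.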
I tried a number of other configurations and got more aggressive in cleaning to reduce the imbalance between Christmas movies and non-Christmas movies, but none of these models performed accurately enough on the validation data.
I realized that if I were having these issues, then it was likely very unfair on my machine learning routine and it might not be able to produce a good result (though machine learning can often find associations where humans cannot).
This resulted in a new Keywords column being added to the movies dataset with some JSON-like values. Take a look at the following set of tags (with one removed that contained the name of the film) and see if you can guess which movie it came from:
Looking at this data, I saw that I as a human now had enough information to make informed decisions as to what was and was not a Christmas movie just by genres, release dates, and the raw keyword data, but I still needed to get this keyword data into a format that Azure could recognize.
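One common way to make free-form tags usable by a classifier is to turn each sufficiently frequent keyword into its own 0/1 column. The sketch below shows that general technique only. It is not necessarily the transformation used in this project, and the function name, the kw_ prefix, the min_movies threshold, and the sample keywords are all hypothetical.

```python
from collections import Counter


def keyword_indicator_rows(movie_keywords, min_movies=2):
    """Map each movie to 0/1 columns for keywords used by at least
    `min_movies` movies; rarer keywords are dropped as noise."""
    counts = Counter()
    for keywords in movie_keywords.values():
        counts.update(set(keywords))
    vocabulary = sorted(k for k, n in counts.items() if n >= min_movies)
    return {
        title: {f"kw_{k}": int(k in keywords) for k in vocabulary}
        for title, keywords in movie_keywords.items()
    }


# Invented keywords, assumed already parsed from the JSON-like column.
sample_keywords = {
    "Movie A": ["holiday", "snow", "family"],
    "Movie B": ["holiday", "heist"],
    "Movie C": ["heist", "snow"],
}
features = keyword_indicator_rows(sample_keywords)
print(features["Movie A"])
```

Dropping keywords that appear on only one movie keeps the column count manageable, since a tag unique to a single film gives a model nothing to generalize from.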
I have an article discussing deep learning in more detail, but it is essentially a more complex flavor of a neural network that aims to achieve a certain degree of reasoning about data in a way that mimics complex layers of neurons in the brain. At its core, deep learning brings an added degree of complexity to its structure that can be used to detect patterns intelligently.
This confusion matrix is displaying the percentages of things instead of the raw count, but it tells a story of perfection: this model correctly classified each movie it encountered as a Christmas movie or a non-Christmas movie with zero mistakes.
Here Azure found that a number of keywords were very likely to result in a movie being categorized as a Christmas movie while others were maybe less likely to do so or were likely to mark a movie as a non-Christmas movie.
Deploying a model creates a new web service that others can send data to and get a response with some data. You use this with a trained model to generate predictions or classifications based on the data it was originally trained on.
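Calling such a web service usually amounts to an authenticated HTTP POST with a JSON body. The sketch below uses only the standard library and is deliberately generic: the URL, the key, the Bearer-style header, the {"data": [...]} payload shape, and the feature names are all assumptions, so check the deployed endpoint's own sample code for the real request schema.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, features):
    """Build (but do not send) a JSON POST request for a scoring endpoint."""
    body = json.dumps({"data": [features]}).encode("utf-8")  # assumed schema
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def score(request):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and feature values; nothing is sent here.
request = build_scoring_request(
    "https://example.invalid/score",
    "not-a-real-key",
    {"Is Action": 1, "kw_holiday": 1},
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.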
Matt Eland is a software engineering leader and data scientist who has served as a senior engineer, software engineering manager, and professional programming instructor, and has helped build enterprise-level software at a variety of organizations before distinguishing himself as a Microsoft MVP in Artificial Intelligence by using technology to accomplish ridiculous things in the name of science and teaching others.
Matt is a Microsoft Certified Azure Data Scientist and AI Engineer associate and is pursuing a master's in data analytics focusing on machine learning and artificial intelligence as he continues to build and learn new things and look for ways to share them with the community. Matt is the author of Refactoring with C# and is currently creating a course on computer vision on Azure and a new book.
There you have it: 29 perfect examples of why Die Hard is a Christmas movie. Let us finally put this all to rest. There are more important things to argue about. Like the best order to watch all the Star Wars movies.
Once we proceed beyond considering individuals only at a point in time, we could in principle examine all individuals on a continuum between choosing to spend their entire career history in self-employment, or none of it, or any fraction. However, the polarised nature of our data, the crucial elements of which are drawn from the sixth sweep of the UK National Child Development Study (NCDS), suggests that we ought to test whether a natural distinction exists between individuals who are pure wage-workers (no self-employment in their work history or likely future), and those who have ever been self-employed (or may be in the future). We label this latter group Entrepreneurial Types (ETs), some of whom may only be very briefly self-employed. The cross-section of those who are self-employed at any point in time within our sample period will be a proper subset of the ETs, since some individuals may only be self-employed before or after the date of the cross-section. The more inclusive ET set should thus provide more insight into the fundamental determinants of a propensity for self-employment.