So for my first blog post I decided I would talk about a fun side project I did with my friend, Andrew Tinits, called Dumb Reviews.
Originally, we set out on a journey to pursue interesting things that nobody had really done before. We had ideas like: let's build an AI that plays QWOP (but this was already a thing * ), or what about a Twitch channel that plays a classic game? Ditto (pun intended ^^.) We were quickly running out of ideas here, so we decided to go with ol' faithful and go golfing. Who knew putting off a decision could lead to good ideas?
It was then that my friend mentioned that the internet would soon be graced by the presence of a multitude of new gTLDs, that is, "generic Top Level Domains". A mouthful, I know, but this is the magical acronym that would help us out of our procrastination session, and back into development mode. Simply put, Top Level Domains are the last few letters at the end of a domain name (e.g. .com, .gov, .org.)
What makes a gTLD so generic? Well, picture our boring .com's and .org's and wonder, "What if we could end our website domain name with more appealing TLDs?" Then take a flood of new TLDs like .fish, .sex, .construction, or .restaurant for example, and set them free to roam the internet as they please (that's over-simplifying it a bit, but you get the idea.) Now we can really begin to have interesting domain names like crypto.fish, or dumb.domains (lots of inspiration drawn from this one.)
It was while we were busy laughing at dumb domain names that we decided that we should buy our own funny domain, and make a website of some sort. Easy, right? Now to just come up with a domain name and the purpose of the site! I decided that there aren't enough generators out on the web for my liking and that we needed to make one more, which posed another problem. What should it generate? We decided that we already did enough that day and started to procrastinate again by looking at the Video Game Name Generator, and Dumb Domains.
After some googling, we eventually decided that there aren't any movie review generators on the internet, and settled on that as our site's purpose. That's also when I attempted to make a logo for the site, and after a little searching, I eventually found Hipster Logo Generator. It's a delightful site that takes your hipster Abercrombie-and-Fitch-esque logo ideas and brings them to life. The only thing is that you have to shell out 5 bucks for a better-looking, higher-res result (I disagreed with that and stuck with the lower-res result.)
And now for the technical portion of this post, where I reveal how someone who has never before created an entire webpage in his life, helped throw one into existence.
Step 1: Register your Domain Name
Because nobody wants to implement their website around a specific name, only to find out when they finish that the smart domain name they had picked out is already taken. Nope, that's why there's a handy domain name registry lookup on any domain name registrar company's site. After you discover that your domain name is still available (congrats!), it's time to register it. We decided to go for GoDaddy as our site's host, but personal preferences and hosting costs may sway you to other hosting companies, so be sure to do a bit of research before you commit to one specific company.
Step 2: Use the Flask
What is Flask, you may ask? Flask is a very handy Python-oriented web framework that will save you a lot of trouble setting up your backend. It's so user-friendly that practically anyone with basic knowledge of Python, web development, and a bit of Google magic can write their own web server. Needless to say, I was pretty pleased with how painless the process was, and hope that all of you people thinking about making a site of your very own consider starting with Flask.
Step 3: Style Is Everything
The first thing you should think about when styling your site is background tiles. If you don't intend on using one large background for your site, you should consider using a pattern that can be tiled seamlessly across the page.
Subtle Patterns is a great site to find such patterns, as it contains many different styles that will appeal to a wide range of people. Triangular ended up being our final contender for the site (for those that are wondering what we used.)
Step 4: Figure Out The Mechanics
Right now you may be asking the question, "How does it come up with all those <insert appropriate adjective here> reviews?" That's a great question, which I will get to answering shortly, but first I need to briefly talk about something very core to the site. That something is an equally interesting and useful concept in artificial intelligence known as Markov Chains.
A Markov Chain can be simply described as "A series of independent events [where] the probability distribution of the next step depends non-trivially on the current state." * A simple example of this is the board game snakes and ladders. Your position on the board after the next roll depends only on the square you are on now and the roll itself, not on how you got there, creating a chain of events, or a Markov Chain.
There, now that we have that out of the way, I can start explaining what we use Markov Chains for! We perused the interwebs for a database of movie reviews, and happened upon the database that Stanford University includes on their list of resources for a certain computer science class. More specifically, it's a collection of 50,000 IMDB user reviews for an arbitrary list of movies. What we wanted to do was turn that very large database into something that will generate every movie review page that the site spits out, which is where Markov Chains come into the picture.
Luckily, there was already an implementation handy for us to use, known as PyMarkovChain. It takes a database of text files and turns them into arrays of probability distributions that calculate the likelihood of any sections of text unbroken by whitespace appearing immediately after another section of text. In other words, it figures out the likelihood of any word in the database appearing immediately after another one.
This is useful because now we can just ask it to generate a sentence word-by-word, randomly picking out the first, and then using that result to pick a word that is likely to come after it, and then the next one after that word, and so on. We also managed to separate the database into positive and negative Markov Chains. That way we can generate sentences for positive, negative, or neutral sections of text. Each personal review will either be positive or negative, while the movie title and description will be a neutral mix of both.
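If you'd like to see the idea in miniature, here is a rough sketch of a word-level Markov Chain. To be clear, this is not PyMarkovChain's actual code or API, just a toy version written for illustration; the function names and the tiny sample reviews are made up.

```python
import random
from collections import defaultdict


def build_chain(texts):
    """Map each word to the list of words seen immediately after it."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()  # words are chunks of text unbroken by whitespace
        for current, following in zip(words, words[1:]):
            # Repeats are kept on purpose: a word that follows more often
            # is proportionally more likely to be drawn later.
            chain[current].append(following)
    return dict(chain)


def generate_sentence(chain, rng, max_words=12):
    """Pick a random first word, then repeatedly sample a follower."""
    word = rng.choice(sorted(chain))
    sentence = [word]
    while len(sentence) < max_words and word in chain:
        word = rng.choice(chain[word])
        sentence.append(word)
    return " ".join(sentence)


# Toy stand-ins for the positive and negative halves of the review set.
positive_reviews = ["a great movie with a great cast", "I loved this great film"]
negative_reviews = ["a terrible movie with a dull plot", "I hated this dull film"]

positive_chain = build_chain(positive_reviews)
negative_chain = build_chain(negative_reviews)
neutral_chain = build_chain(positive_reviews + negative_reviews)

rng = random.Random(42)
print(generate_sentence(positive_chain, rng))
print(generate_sentence(negative_chain, rng))
print(generate_sentence(neutral_chain, rng))
```

Building the neutral chain from both halves at once is one simple way to get the "mix of both" effect; the real site may well combine them differently.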
While the potential for cohesive sentences is always there, Markov Chains are only as good as your training set is. Our training set is written by the general public, so it means the Markov Chains reach into a mixed bag of potential spelling mistakes and bad grammar. Even if the entire set were perfect, the algorithm itself is not.
Each word generated using this method depends only on the word that came right before it, so the text has no memory of how the sentence began. While reading the text on the site, a sentence may start off well, but it quickly degrades into something else and loses any coherence it may have had as it goes. Then it goes without saying that shorter sentences will usually make more sense than longer ones. Keeping this in mind, hilarity still finds a way into the site on the off-chance the sentences manage to make sense grammatically.
That's great! But how are the rating, release date, runtime, genre, director, cast, reviewer names, and individual scores generated?
Step 5: Generate Everything Else
Now that we have the cool AI-related probability stuff behind us, we should probably start talking about some good old-fashioned non-AI-related probability stuff.
Let's begin with how the individual scores are generated. We decided that we wouldn't be smart and use some clever algorithm to detect what the rating will be based off of their reviews. No, we just went and gave each one a random score out of 5. Besides, we thought it would be funny if there was a negative review paired with a high score, and vice versa. In fact, it's pretty hard to tell what the reviewers are saying half of the time anyway, so it doesn't really detract from the experience.
Next, we should talk about how the release date and runtime are calculated. The release date is a randomly chosen date between January 1, 1970 and today's date. Nothing too special, I guess. Neither is the runtime, as it's just another random time between 30min and 300min.
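For the curious, picking a random date and runtime like this only takes a few lines. This is a small sketch rather than the site's real code; the function names are invented, and I'm assuming whole days and whole minutes with both ends of each range included.

```python
import random
from datetime import date, timedelta


def random_release_date(rng, today=None):
    """Uniformly pick a date from January 1, 1970 through today."""
    start = date(1970, 1, 1)
    today = today or date.today()
    days_in_range = (today - start).days
    return start + timedelta(days=rng.randint(0, days_in_range))


def random_runtime(rng):
    """Pick a runtime in whole minutes, from 30 to 300 inclusive."""
    return rng.randint(30, 300)


rng = random.Random()
print(random_release_date(rng), f"{random_runtime(rng)} min")
```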
I guess I should top that last paragraph with a more interesting one. How about I explain how the overall score is generated? Yes, that sounds more interesting than the last two paragraphs combined. It is, I promise!
I decided that it would be way too boring to use another random number here, so why not get a bit fancy? I started by getting a long list of the top 500 actors according to an entry I found on IMDB, and two lists of directors, one being a 1,785-person list of directors from around the world, and the other being a top 100 list from IMDB. Sounds good, but when do we get to the overall score part? Fine, these lists end up influencing the rating entirely, which is why I started talking about them.
It starts with the site choosing a cast of a random size between 1 and 13 from the list of top actors. It places the actors in order of rank, so that lower-ranked actors get less important roles when there are higher-ranked actors in the cast. The ranks of all the actors are then turned into a raw score out of 5 by first bracketing each rank to a value in [0, 1, 2, 3, 4]. This is done by using a for loop:
The loop checks if the rank is less than or equal to 100, 200, 300, 400, or 500. The first threshold that makes this true gives the actor a score of 0, 1, 2, 3, or 4, respectively. The code then reuses cast as a container for the list of actors' names, which is later displayed on the site. Immediately after that, the overall rating is calculated as 5 minus the average score. Pretty simple, no? Actually, no, there's a little bit more to it than that, but we're almost there.
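In case the bracketing is hard to picture, here is a reconstruction of the idea. It is a sketch written from the description above, not the original excerpt; bracket and base_rating are names I made up, and drawing the cast as random ranks from 1 to 500 is an assumption.

```python
import random


def bracket(rank):
    """Map a rank of 1-500 to 0-4: ranks 1-100 give 0, 401-500 give 4."""
    for score, upper_bound in enumerate((100, 200, 300, 400, 500)):
        if rank <= upper_bound:
            return score
    raise ValueError(f"rank {rank} is outside the top 500")


def base_rating(ranks):
    """5 minus the average bracket score of the cast."""
    scores = [bracket(rank) for rank in ranks]
    return 5 - sum(scores) / len(scores)


rng = random.Random()
cast_size = rng.randint(1, 13)
# Sorting by rank puts the best-known actors in the leading roles.
cast_ranks = sorted(rng.sample(range(1, 501), cast_size))
print(cast_ranks, base_rating(cast_ranks))
```

So a cast made up entirely of top 100 actors starts at a perfect 5, while a cast drawn only from ranks 401 to 500 starts at 1.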
The loop only spits out the base score based on the cast, but now the code has to decide what hand the director will have in all of this. So, I've included another excerpt of my code to show you:
Here it gets interesting, as the director and the cast now have to play an intricate waltz in order to create the final rating. Wait a minute, this sounds just like real life! I told you it would get interesting ^^. Now let's work our way through the loop, shall we? The director will always make a hit movie if they appear in the top 100 list, which is usually a given anyway, but if that's not the case, then we have to introduce some new conditions.
We know the director is not in the top 100 list now, so I thought that I would be generous and give one of the better known actors a shot at directing their own movie, because why not. This is only possible if any of the cast is known to be a previous director, otherwise we'd have any Joe putting together a movie, and that just isn't realistic enough. If the cast is lucky enough to be graced by the presence of a director, then that actor gets a generous 60% chance of being the new director. If the new director is in the top 100, they automatically score 5 stars because that's what happens in real life, and I said so. Otherwise they have a 40% chance of scoring it big. That's cool, but now we have to consider how some movies get their low ratings.
If all of these conditions return false, then we are quite possibly facing the most unlucky bastard to direct a movie, or just another not-too-well-known indie director that hasn't made it big yet. Your pick. If this aforementioned soul is unfortunate enough, they halve the total score. In this case, they have a 70% chance of failing, meaning that our critics might be just too harsh.
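Here is one way those rules could be written down. Treat it as my reading of the description rather than the original excerpt: the function name, its arguments, the order of the branches, "scoring it big" meaning 5 stars, and picking the highest-ranked actor-director are all assumptions.

```python
import random


def final_rating(base, director, cast, top_directors, all_directors, rng):
    """Adjust the cast-based rating according to who directs.

    Returns (rating, director). The branch order is an assumption.
    """
    if director in top_directors:
        return 5.0, director  # a top 100 director always makes a hit

    # Cast members who also appear in either director list.
    actor_directors = [a for a in cast if a in all_directors or a in top_directors]
    if actor_directors and rng.random() < 0.6:
        director = actor_directors[0]  # cast is ordered by rank, best first
        if director in top_directors or rng.random() < 0.4:
            return 5.0, director
        return base, director

    # A lesser-known director has a 70% chance of halving the score.
    if rng.random() < 0.7:
        return base / 2, director
    return base, director


rng = random.Random()
top_directors = {"Director A"}  # hypothetical names
all_directors = {"Director B", "Actor Y"}
cast = ["Actor X", "Actor Y"]
print(final_rating(3.2, "Director B", cast, top_directors, all_directors, rng))
```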
One of my other friends was fascinated by this and wondered what the overall distribution of scores might look like, so I made a small script that generates 10,000 movie reviews, tallies up each score, and creates a handy histogram for our enjoyment:
Aside from the obvious lack of styling, the final result is a bit striking. Over 60% of all the reviews generated will be given a score of 2 or lower, only 6% ever get the middle-of-the-road score of 3, and 34% ever make it above that.
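A tally script along those lines doesn't need much. The sketch below is not the script I actually ran: it prints a plain-text histogram, rounds each score to the nearest whole star (an assumption), and uses a placeholder scorer, so its output will not match the percentages above.

```python
import random
from collections import Counter


def tally_scores(generate_score, runs=10_000):
    """Call generate_score() `runs` times and count each rounded score."""
    return Counter(round(generate_score()) for _ in range(runs))


def text_histogram(counts, width=50):
    """Render one bar per score, scaled so the tallest bar is `width` wide."""
    tallest = max(counts.values())
    lines = []
    for score in sorted(counts):
        bar = "#" * round(width * counts[score] / tallest)
        lines.append(f"{score} | {bar} {counts[score]}")
    return "\n".join(lines)


rng = random.Random()
# Placeholder scorer: a uniform draw, NOT the site's cast/director logic.
counts = tally_scores(lambda: rng.uniform(0, 5))
print(text_histogram(counts))
```

Swap the lambda for whatever function produces a movie's overall score and the same two helpers will chart it.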
Ok, well that just leaves us the genre and reviewer names. Reviewer names are simply random first and last names from the actor and director lists combined. Not too bad. For the genres, I compiled a list of popular genres and subgenres, and the code randomly chooses one to use. As a bonus, it has a 50% chance of grabbing another genre from the list to create a crossover genre. Anyone up for a wholesome Gay / Lesbian-Alien Invasion film?
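Both of these are short enough to sketch. The genre list and names below are stand-ins rather than the real data, and forcing the second genre to differ from the first is my own assumption.

```python
import random

# Hypothetical sample; the site's actual genre list is not reproduced here.
GENRES = ["Comedy", "Horror", "Western", "Alien Invasion", "Musical"]


def random_genre(rng, genres=GENRES):
    """Pick one genre, with a 50% chance of adding a second, different one."""
    first = rng.choice(genres)
    if rng.random() < 0.5:
        second = rng.choice([g for g in genres if g != first])
        return f"{first}-{second}"
    return first


def random_reviewer(rng, people):
    """Combine a random first name and a random last name from `people`."""
    first_names = [name.split()[0] for name in people]
    last_names = [name.split()[-1] for name in people]
    return f"{rng.choice(first_names)} {rng.choice(last_names)}"


rng = random.Random()
print(random_genre(rng))
print(random_reviewer(rng, ["Ann Lee", "Bob Stone", "Cara Diaz"]))
```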
Whew, that covers all the fun content generation stuff. Wait, I hear someone in the back row yelling, "If everything is generated randomly, how come we can go back and see movie reviews that we've generated in the past?" True, I almost forgot to mention how the site can magically remember each and every generated review.
Step 6: Remember That Seeds Exist
Wait, what are seeds? Whoops, well luckily I can answer that question, and no, it does not involve botany. Truly random number generators are a pipe dream that we can't achieve anytime soon without nature's help, so we have to make do with a pseudo-random number generator (there's a nice wiki page that talks about this topic in further detail if you are interested.)
Without a seed, one does not a pseudo-random number generator get. Those are wise words to live by, but what does a pseudo-random number generator need a seed for? A seed can be any old number, string of characters, etc. used to prime the pseudo-random number generator. Basically it's like meeting up with a bunch of your buddies and supplying the set of dice/deck of cards for that board game/card game you were going to play. Any old deck of cards/set of dice would have done just fine, but you brought that special set/lucky deck.
The subtle uniqueness of each deck of cards/set of dice influences the outcome slightly. The deck might be bent one way, the dice might have a slight difference in weight. Either way, the results that come out of rolling the dice/shuffling the deck will be unique to that individual set of dice/deck of cards. In our case, each seed will always generate the same set of outcomes when used the same way, hence why it's pseudo-random.
Now that we know what seeds do, how can we use them to our advantage? Conveniently, the Markov Chain implementation uses the exact same pseudo-random number generator as the other randomly generated content on the site, so now we can kill two birds with one seed, if you will. It's as easy as letting the seed equal the current time, in seconds since the epoch. This is technically what the seed is set to by default, but explicitly setting it here lets us know that the seed is being used to generate a new review.
To make each review publicly accessible after it's been generated, we need to hand out a permalink containing the seed. So when the server gets a request for any page that looks like "http://dumb.reviews/generate/<seed>", it will set the seed back to what it was for that page and regenerate the site using that seed. Neat, no?
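The whole trick fits in a few lines of Python. This is a simplified sketch, not the site's code: generate_review and permalink are made-up names, and the two fields stand in for the full review page.

```python
import random
import time


def generate_review(seed):
    """Seed the shared generator, then derive every 'random' value from it."""
    random.seed(seed)
    return {
        "seed": seed,
        "score": random.randint(0, 5),  # stand-in for a reviewer's score
        "runtime": random.randint(30, 300),
    }


def permalink(seed):
    """Build the permalink for a seed, following the URL pattern above."""
    return f"http://dumb.reviews/generate/{seed}"


seed = int(time.time())  # seconds since the epoch
review = generate_review(seed)
print(permalink(seed), review)
# Re-seeding with the same value reproduces the same review.
assert generate_review(seed) == review
```

Because everything draws from the one module-level generator, the order of the calls matters: generate the fields in a different order and the same seed will give a different review.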
Step 7: Write A Blog Post About It
Aaand check. Now that I've done that, I hope that my journey making something not quite relevant to your daily lives has brought about cognitive growth of some sort, maybe. Not only was this a fun project to put together, but it also brings about a small sense of accomplishment when I can finally show someone something that I made happen, insignificant as it may be. I hope you too can find that insignificant accomplishment that you can show other people and laugh about, as it's not only good practice, but it also looks good on a resume!