MyAnimeList Webscraper and Dashboard
Project Goal and Stakeholders
By 2023, I had learned enough Python to read and use it somewhat comfortably, but I always felt stuck in a loop of watching tutorials without ever applying them to a concrete project. I had just completed Harvard's free CS50P Introduction to Python course, and one of its recommendations was to apply the lessons to a topic I personally find interesting. The timing was ideal, since that period coincided with the series finale of Attack on Titan and there was a lot of retrospection on how anime has changed since it first aired 10 years ago.
Since I have very much been addicted to watching anime content for the past several years, I figured a fun project would be to see what characteristics were common to the most popular anime and how these characteristics may have changed in the last 10 years. Whenever I think of 'Popular Anime', the resource that comes to mind first is MyAnimeList (MAL), since it has a well-organized database of anime shows spanning decades.
As a result, the goal of the project became clear to me: use Python to scrape data from the MAL website and perform an analysis on 10 years' worth of anime.
MAL is well known for having a great collection of anime shows, and it provides a great snapshot of what the anime community considers a good or popular show to watch.
Approach and Methodology
IDENTIFYING PROJECT SCOPE
The project began with identifying the parameters I wanted to follow. These include the following:
-
The entire project needed to be conducted using Python as the primary tool for ETL and Analysis.
-
MyAnimeList would be the primary resource for the data I intend to use for my analysis.
-
I would limit my analysis to Anime TV Shows that were newly released in the last 10 years. This meant that the scope would be 2014 - 2023 and that popular long-running shows like 'One Piece' or movies such as 'Your Name' would not be included.
The general process flow I intend to follow would be the following:
Once I had all these parameters and process flows outlined, I was ready to begin work on my Python code.
PHASE 1: CODING THE WEBSCRAPER
I began by identifying the specific data points I was interested in collecting and where on the MAL website I would obtain them. This meant I first needed to understand how MAL website content is structured in general and what the best approach would be to consolidate those data points.
While the specifics involving what kind of functions and packages I used can be seen in my actual code, it is worth mentioning that at a high level, my code was structured with the following intent:
-
Each Anime Season has a specific webpage that lists all of the anime content released within that season (Winter, Spring, Summer, Fall) and year. It divides the content into TV (New), TV (Continuing), OVA, ONA, Movie and Special categories. For each season and each year from 2014 - 2023, only the anime listed under TV (New) are considered for the dataset.
-
Each Anime Show has a specific link that lists important details such as genre, themes, start date, and additional statistical information such as MAL Score and Number of Members. For each anime that is part of the dataset, we go to its webpage and obtain these specific metrics.
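As a rough illustration of the two-level structure described above (season pages first, then one page per show), here is a minimal sketch. It is not the project's actual code: the URL pattern in `BASE_URL`, the function names, and the three injected callables (`fetch`, `parse_show_links`, `parse_show_details`) are all assumptions made for the example. In a real scraper, the two parsing callables would typically be written with Beautiful Soup.

```python
from typing import Callable

# Assumed URL pattern for MAL season pages; verify against the live site.
BASE_URL = "https://myanimelist.net/anime/season"
SEASONS = ("winter", "spring", "summer", "fall")


def season_urls(first_year: int = 2014, last_year: int = 2023) -> list[str]:
    """Build one season-page URL per season and year in the project scope."""
    return [
        f"{BASE_URL}/{year}/{season}"
        for year in range(first_year, last_year + 1)
        for season in SEASONS
    ]


def scrape(
    urls: list[str],
    fetch: Callable[[str], str],
    parse_show_links: Callable[[str], list[str]],
    parse_show_details: Callable[[str], dict],
) -> list[dict]:
    """Visit each season page, then each show page it links to.

    fetch              -- downloads a URL and returns its HTML (hypothetical)
    parse_show_links   -- extracts the TV (New) show links from a season page
    parse_show_details -- extracts genre, score, members, etc. from a show page
    """
    records = []
    for season_url in urls:
        season_html = fetch(season_url)
        for show_url in parse_show_links(season_html):
            records.append(parse_show_details(fetch(show_url)))
    return records
```

Passing the download and parsing steps in as arguments keeps the crawl order separate from the HTML details, so the loop can be tried out with canned pages before pointing it at the real website.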
The website is incredibly well-organized, with details per show listed out for community members to look through. For example, tags on the genres, themes and demographics allow members to determine if the show is something they would be interested in watching.
One of the challenges I encountered in working on this part of the code was that the information available for one show could be inconsistent with another. For example, data points such as demographics, themes or genres could be complete for one show but missing in another. Another concern was working out how the data points should be formatted and stored, as some were dates while others were strings or numerical values. Finally, there was the issue of making too many requests in a short period, since the scraper (built with the Beautiful Soup package) has to request a separate page for every show. Fortunately, I was able to set up my code to work around these issues, and I managed to create the CSV output along with some exploratory data analysis to brainstorm what kind of visuals I could set up for my dashboard.
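The three issues above (missing fields, mixed data types, and request pacing) can be sketched as follows. This is only one possible way to handle them, not necessarily how my code does it: the field names, the `"Unknown"` placeholder, the `"Apr 6, 2014"` date format, and the two-second delay are all assumptions chosen for the example.

```python
import time
from datetime import date, datetime
from typing import Callable, Iterable, Iterator, Optional

MISSING = "Unknown"  # assumed placeholder for fields a show page does not list


def parse_number(text: Optional[str]) -> Optional[float]:
    """Turn '8.54' or '12,345' into a number; None for 'N/A' or missing."""
    if not text:
        return None
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def parse_start_date(text: Optional[str]) -> Optional[date]:
    """Parse a date like 'Apr 6, 2014'; None when absent or incomplete."""
    if not text:
        return None
    try:
        return datetime.strptime(text.strip(), "%b %d, %Y").date()
    except ValueError:
        return None


def clean_record(raw: dict) -> dict:
    """Give every show the same fields and types, even if some were missing."""
    members = parse_number(raw.get("members"))
    return {
        "title": raw.get("title") or MISSING,
        "genres": raw.get("genres") or [MISSING],
        "themes": raw.get("themes") or [MISSING],
        "demographic": raw.get("demographic") or MISSING,
        "start_date": parse_start_date(raw.get("start_date")),
        "score": parse_number(raw.get("score")),
        "members": int(members) if members is not None else None,
    }


def throttled(
    items: Iterable,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator:
    """Yield items one at a time, pausing between them to space out requests."""
    for index, item in enumerate(items):
        if index:
            sleep(delay_seconds)
        yield item
```

Looping over `throttled(show_urls)` instead of `show_urls` directly spaces out the page requests, and running each scraped row through `clean_record` before writing the CSV keeps the columns consistent.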
It was incredibly satisfying to finally see the webscraper function properly and be able to actually visualize the generated dataset once it was thoroughly cleaned.
PHASE 2: CODING THE DASHBOARD
The second half of the project involved creating a web app that could host a working dashboard. I patterned the dashboard on Coding is Fun's YouTube video: Turn An Excel Sheet Into An Interactive Dashboard Using Python (Streamlit).
After reviewing the contents of my generated dataset, I decided to structure it in the following way:
-
The filter section of the dashboard would allow you to look at the anime one year at a time. That way, you can see the progression in terms of common themes, genres and the like. Additionally, there will be filters such as which season you are interested in deep-diving into, and options to see Top Anime based on Member Count or Average MAL Score.
-
The KPI section of the dashboard looks into the numerical metrics that I feel would be interesting to learn about. These include the total number of Anime TV Shows that were released, the average score, average member count, and the total amount of time it would take to watch all of the shows released within the selected year/season.
-
The DF section of the dashboard takes a cleaned-up version of the dataset and lists (by either Score or Member Count) the Top 10, Top 100 or all anime within the selected time period. That way, people can see the specific titles and characteristics of the shows they want to explore.
-
The Visuals Section of the dashboard is comprised of 6 bar charts displaying the main characteristics I want people to focus on when looking at anime in the past 10 years. These include Genre, Studio, Theme, Demographic, Source and Rating.
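The logic behind the four sections above can be sketched in plain Python, independent of Streamlit. This is an illustrative sketch rather than the dashboard's actual code: the field names (`year`, `season`, `score`, `members`, `episodes`, `minutes_per_episode`) are hypothetical, and computing watch time as episodes multiplied by episode length is my assumption about one reasonable way to do it.

```python
from collections import Counter
from statistics import mean
from typing import Optional


def filter_shows(shows: list[dict], year: int, season: Optional[str] = None) -> list[dict]:
    """Filter section: keep one year, optionally narrowed to one season."""
    return [
        s for s in shows
        if s["year"] == year and (season is None or s["season"] == season)
    ]


def kpis(shows: list[dict]) -> dict:
    """KPI section: show count, average score/members, total watch time."""
    scores = [s["score"] for s in shows if s["score"] is not None]
    members = [s["members"] for s in shows if s["members"] is not None]
    total_minutes = sum(s["episodes"] * s["minutes_per_episode"] for s in shows)
    return {
        "show_count": len(shows),
        "avg_score": round(mean(scores), 2) if scores else None,
        "avg_members": round(mean(members)) if members else None,
        "watch_hours": round(total_minutes / 60, 1),
    }


def top_n(shows: list[dict], by: str = "score", n: Optional[int] = 10) -> list[dict]:
    """DF section: rank by score or members; n=None returns every show."""
    ranked = sorted(
        (s for s in shows if s[by] is not None),
        key=lambda s: s[by],
        reverse=True,
    )
    return ranked if n is None else ranked[:n]


def tag_counts(shows: list[dict], field: str) -> Counter:
    """Visuals section: count tags (e.g. genres) to feed one bar chart."""
    return Counter(tag for s in shows for tag in s[field])
```

In a Streamlit app, the year and season passed to `filter_shows` would come from sidebar widgets, and the outputs of the other three functions would be handed to the metric, table, and chart components respectively.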
The biggest challenge I encountered here was actually getting the Streamlit app to run locally, as it involved working in my terminal, something I had not done before. However, once I had everything working as expected, I consolidated all the related code files and documents into my very first GitHub repository.
I recommend clicking the fullscreen option to see the dashboard in its entirety. As this is hosted on Streamlit and I don't anticipate people looking at it 24/7, feel free to wake the app up in case it falls asleep. Here is a shareable link to the dashboard as well.
Project Outcome
While I did not have to present this project to any stakeholders, it was very enlightening to explore different trends in anime TV shows over the last 10 years. Below are just some of the insights that I found most interesting:
-
Manga has always been the most popular source material for anime TV shows, with anime-originals a consistent runner-up for most of the past decade. It was in 2023 that the number of anime-originals dwindled and was overtaken by anime adapting light novels. We could be reaching a point where it is harder to create completely original content, hence an increasing reliance on other sources.
-
From 2014 - 2019, the comedy genre coupled with school themes was the most popular combination for anime shows. Starting in 2020, it was overtaken by Fantasy and Action anime, with Isekai gaining traction as a theme. Interestingly, this coincides with the year the pandemic started. Perhaps this was a time when people just really wanted to escape their reality, which happens to be the plot of a lot of Isekai-Fantasy anime shows.
-
Studio Mappa has definitely gained a lot of brand recognition for the anime it has been releasing in the last few years. In 2020, 2022 and 2023, you would find its shows in the Top 10 most popular anime in terms of members and score. However, one of the biggest criticisms of the studio has been the working conditions it subjected its employees to in order to produce so much anime output. If one were to zoom out and look at the sheer volume of anime content released by studios on a per-year basis, Mappa is just a microcosm of this larger industry issue, with studios like J.C. Staff and Lindenfilms releasing up to 8 anime shows in the last year alone.
Overall, this project taught me a lot about the work it takes to develop and deploy an end-to-end dashboard using Python. There were a lot of concepts I used in my project that I probably wouldn't have learned if I had just stuck to watching tutorials. While the code and the dashboard can always be improved, I'm glad to have found a way to translate the Python skills I have picked up over time into a concrete project.