Molière, the 17th-century French playwright, actor and poet, was a revolutionary force in his day, using his art to reflect the tumultuous societal issues and deep-rooted questions of his time. His plays and poetry offer a glimpse into the customs and norms of that era, and analyzing his work today still delivers an electrifying insight into its concerns. In a similar manner, cinema, the "seventh art", has proven to be a powerful tool for understanding the thoughts and beliefs of society.
In this study, we set out to unlock society's deepest mysteries by analyzing the movie plot summaries of the CMU Movie Summary Corpus. Using modern natural language processing methods, we delve into lexical fields and word occurrences, as well as word polarity and overall sentiment, to uncover the main concerns that society faces today (climate catastrophe, war, technological disaster, family disunity) and to evaluate the emotions associated with them. By tracing the evolution of these concerns over time, we come closer to fully understanding the dire reality of the situation.
What information does the CMU corpus provide?
Our study focuses on the movies in the CMU Movie Summary Corpus and, more specifically, on their release dates and plot summaries. This database gives us access to more than 42,000 movies with English plot summaries, of which 93.8% come with a release year. The dataset spans a long period, from 1893 to 2014, with recent movies more heavily represented. It also covers a wide range of countries, the five most represented being the United States of America, India, the United Kingdom, Japan and France.
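As a rough illustration of how release-year coverage can be measured, the sketch below extracts a year from release-date strings and computes the share of movies that have one. It assumes dates appear as `YYYY`, `YYYY-MM` or `YYYY-MM-DD` strings and may be empty. The helper names `parse_year` and `year_coverage` and the sample values are hypothetical and are not taken from our actual pipeline.

```python
import re
from typing import Iterable, Optional

# Assumption: a valid release date is "YYYY", "YYYY-MM" or "YYYY-MM-DD".
_YEAR = re.compile(r"^(\d{4})(?:-\d{2}){0,2}$")


def parse_year(release_date: Optional[str]) -> Optional[int]:
    """Return the release year, or None when the date is missing or malformed."""
    if not release_date:
        return None
    match = _YEAR.match(release_date.strip())
    return int(match.group(1)) if match else None


def year_coverage(release_dates: Iterable[Optional[str]]) -> float:
    """Return the fraction of movies for which a release year could be parsed."""
    dates = list(release_dates)
    if not dates:
        return 0.0
    known = sum(parse_year(d) is not None for d in dates)
    return known / len(dates)


# Toy example with made-up values (not the real corpus).
sample = ["2001-08-24", "1988", "", None, "1931-05"]
print(year_coverage(sample))
```

Keeping the parsing in one small function makes it easy to decide, in a single place, what counts as a usable release year.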
Which topics are the most common?
Our first goal is to extract the main topics from the plot summaries in order to classify them. To achieve this, we use the BERTopic modeling technique, which analyzes the words and sentences contained in the plot summaries to find the relevant topics they address. Running BERTopic on all plot summaries resulted in a set of 50 topics, each named by a number determined by the model followed by the three most frequent words in that topic. In the following plot, the 19 most frequent topics are represented with their associated words and proportions. To better see the less represented topics, you can click on a bubble to enlarge it. From this plot, we observe that the most frequent topic is 2_town_men_horse, representing cowboy/western movies, and the second is 1_police_murder_detective, which can be associated with crime or more violent movies. Overall, among the most represented topics, we find topics related to conflicts (e.g. 14_hitler_nazi_berlin), relationships (e.g. 3_mother_father_family), technology (e.g. 4_earth_planet_space) and health (e.g. 18_dr_patient_hospital), reflecting major societal issues. In the following sections, we delve deeper into these topics to understand what they represent from a societal perspective.
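To make the naming scheme concrete, here is a minimal sketch of how a label such as 2_town_men_horse can be assembled from a topic number and the three most frequent words of the summaries assigned to it. This is a simplification under stated assumptions: BERTopic itself ranks words with a class-based TF-IDF score rather than raw counts, and the function `topic_label`, the stop-word list and the toy summaries are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "his", "her"}


def topic_label(topic_id, summaries, top_n=3):
    """Build a label '<id>_<word1>_<word2>_<word3>' from a topic's summaries."""
    counts = Counter()
    for text in summaries:
        # Lowercase and keep alphabetic tokens only.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    top_words = [word for word, _ in counts.most_common(top_n)]
    return "_".join([str(topic_id), *top_words])


# Toy summaries (made up) assigned to a hypothetical topic number 2.
toy = [
    "The town sheriff rides his horse into town.",
    "Men from the town chase the men on a horse.",
    "The men return to town.",
]
print(topic_label(2, toy))  # prints "2_town_men_horse"
```

Raw counts tend to favor words that are common everywhere, which is why a weighting scheme that compares each topic against the others usually gives more distinctive labels.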
How do these topics evolve over time?
Now that we have obtained our topics, we can observe their evolution over the years. The figure below displays the proportion of each topic per five-year period (e.g. 1960 covers the years 1960 to 1964) and reveals that the distribution has changed considerably over the last century. We see a recent expansion of topic 4_earth_planet_space, reflecting the importance of technology in modern society. Topics related to foreign countries have also increased, such as 7_tokyo_conan_japan or 6_wong_kong_master. In addition, topics related to relationships, and more precisely to family, are well represented in recent years, as shown by topic 0_father_police_family, which increases steadily over time, and by topic 3_mother_father_family, which also increases but undergoes a small decline at the very end of the period. In contrast, we see a drastic reduction of topic 2_town_men_horse, which is prominent overall but underrepresented in the last 30 years. We also observe a strong increase of topic 14_hitler_nazi_berlin in the period starting in 1940, in line with the historical events of that time. Interestingly, topics 11_sharpe_soldiers_japanese and 24_soviet_agent_nuclear, both related to conflicts, evolve in a similar manner, with an increase through the 20th century followed by a decline in the 21st. This could reflect the climate of conflict that marked the 20th century and that has since calmed down in Western societies.
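The five-year grouping described above can be sketched as follows. This is only an illustration, assuming each movie carries a release year and a single topic name and that proportions are computed within each period; the helper names `five_year_bin` and `topic_proportions` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def five_year_bin(year):
    """Map a year to the start of its five-year period, e.g. 1963 -> 1960."""
    return year - year % 5


def topic_proportions(records):
    """Compute the share of each topic within each five-year period.

    records: iterable of (release_year, topic_name) pairs.
    Returns {period_start: {topic_name: share of that period's movies}}.
    """
    counts = defaultdict(Counter)
    for year, topic in records:
        counts[five_year_bin(year)][topic] += 1
    return {
        period: {topic: n / sum(c.values()) for topic, n in c.items()}
        for period, c in sorted(counts.items())
    }


# Toy records (made up): three movies fall in 1960-1964, one in 1965-1969.
toy_records = [
    (1960, "2_town_men_horse"),
    (1963, "2_town_men_horse"),
    (1964, "1_police_murder_detective"),
    (1965, "4_earth_planet_space"),
]
for period, shares in topic_proportions(toy_records).items():
    print(period, shares)
```

Normalizing within each period, rather than over the whole corpus, keeps the comparison fair given that recent movies are far more numerous than early ones.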