The media's reporting of women in the creative industries has changed markedly in recent years. Not only has there been a substantial rise in references to women, albeit from a low base, but greater space has been afforded to women's thoughts and opinions. That said, there are still large gender imbalances in reporting on areas such as technology and games, echoing the imbalances that persist within the creative industries. More broadly, this research shows how big data and machine learning can provide new insights on gender inequality.
The analysis that follows is based on over half a million articles published in The Guardian newspaper between 2000 and 2018. These articles were taken from sections of the paper relating to the creative industries, including Fashion, Stage, Media, Books and Games.1 Amongst British newspapers, The Guardian reports extensively on the creative industries and, unlike any other major newspaper, it offers open access (via an API) to its content. This is why our research focuses on The Guardian, and why we were unable to measure the representation of women in other newspapers. Open data is the key ingredient in enabling more in-depth analysis of diversity.
From each article we collected mentions of male and female third-person singular pronouns (he, himself, his, him, she, herself, hers and her) as well as the words that followed 'he' and 'she'. Unfortunately it was not possible to collect mentions of people who identify as non-binary because non-gendered pronouns are also used as plural pronouns. We did find that references to these pronouns (they, them, their, theirs and themselves) have remained fairly constant over the last 19 years, comprising between 25% and 29% of all third-person pronouns. The word 'non-binary' has been used just over 100 times (in the set of articles that we analysed), and over 50 of those mentions were made in the last year (2018).
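The pronoun count described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the research: the function names and the regular expression are assumptions, and only the list of pronouns comes from the text.

```python
import re
from collections import Counter

# Pronoun lists taken from the article text.
FEMALE = {"she", "herself", "hers", "her"}
MALE = {"he", "himself", "his", "him"}

# Word boundaries stop "the" or "there" from being counted as "he" or "her".
# The pattern itself is an assumption made for this sketch.
PRONOUN_RE = re.compile(
    r"\b(he|himself|his|him|she|herself|hers|her)\b", re.IGNORECASE
)


def count_gendered_pronouns(text):
    """Return (female_count, male_count) for one article's text."""
    counts = Counter(match.lower() for match in PRONOUN_RE.findall(text))
    female = sum(counts[word] for word in FEMALE)
    male = sum(counts[word] for word in MALE)
    return female, male


def female_share(female, male):
    """Percentage of gendered pronouns that are female, or None if there are none."""
    total = female + male
    return 100.0 * female / total if total else None
```

Summing the two counts over all articles in a year, and then calling `female_share`, would give a yearly percentage of the kind reported in this article.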
In the last five years, there has been a large increase in references to women within the creative sections of The Guardian. The increase is interesting because for many years the gender mix had been fairly stagnant. Between 2000 and 2013, female pronouns consistently comprised less than a third of all gendered pronouns within the creative sections of the newspaper. This began to change in 2014, and by 2018, the percentage of female pronouns had reached 40%. To put that rise into context, the gender mix amongst workers in the UK's creative industries has remained largely unchanged in recent years, with the share of women climbing just one percentage point in six years to reach 37%. This meant that 2018 was the first year in which the share of references to women in The Guardian exceeded the share of workers in the creative industries who are female. Going forward, the more equal representation of women in the press may have two effects: it may encourage more women to enter the creative industries, but it may also give the impression that the creative industries are more balanced than is really the case.
Percentage of gendered pronouns that are male and female
Notes: The gender balance amongst workers in the UK's creative industries was calculated from figures in Table 24 of the DCMS's Sector Economic Estimates 2018: Employment.
Different paths at different paces
There are large differences between the creative sections of The Guardian, both with regard to their current gender mix and in terms of how that mix has changed over time. In 2018 the Fashion section gave the greatest space to women, and this is the only section where the balance has tipped over 50%. At the other end of the spectrum are the Technology and Games sections, where female pronouns comprised just a quarter of all gendered pronouns in 2018. While these figures are low, they are consistent with the gender mix of workers in IT, Software and Computer Services, which was estimated at 21% in 2018.2 And although every creative section of the paper (except Fashion) now makes relatively more references to women than it did in 2000, the sections have taken quite different paths to improving their gender mix. For example, in the Film and Books sections the gender gap has been gradually narrowing since 2012, while in the Music and Stage sections the gains have been concentrated in just the last two years.
Percentage of gendered pronouns that are female by section
Notes: The dotted line indicates that fewer than 150 articles were published in the year where the line starts.
The groups of words that follow he and she
While it is straightforward to calculate the mix of female and male pronouns, this metric cannot capture every aspect of 'gender balance'. In particular, it does not tell us about the differences in how men and women are portrayed within the paper. To investigate these differences, we analysed the words that followed 'he' and 'she' in the text of the articles.3 The words were clustered into groups based on their meanings.4 It is important to remember that not necessarily every word in an article was chosen by its author, because articles may include quotes and comments made by others.
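As a rough illustration of the first step, the word that follows each 'he' or 'she' could be collected with a regular expression. The sketch below is an assumption about how this might be done rather than the method used in the research. In particular, it reads footnote 3 as dropping 'himself' and 'herself' when they are the following word, which is only one interpretation. The clustering step is not shown.

```python
import re
from collections import Counter

# Matches "he" or "she" as whole words, followed by whitespace and one word.
# Hypothetical pattern: the original tokenisation rules are not documented.
FOLLOW_RE = re.compile(
    r"\b(he|she)\s+([a-z]+(?:['\u2019-][a-z]+)*)", re.IGNORECASE
)

# Assumed reading of footnote 3: skip these when they follow the pronoun.
EXCLUDED = {"himself", "herself"}


def following_words(text):
    """Count the words that directly follow 'he' and 'she' in a text."""
    counts = {"he": Counter(), "she": Counter()}
    for pronoun, word in FOLLOW_RE.findall(text):
        word = word.lower()
        if word in EXCLUDED:
            continue
        counts[pronoun.lower()][word] += 1
    return counts
```

Because the pattern requires whitespace directly after the pronoun, a pronoun that ends a sentence ("so did he.") contributes no following word.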
Compared to men, there appears to have been both a greater focus on certain non-verbal reactions of women, and more references to particular sounds made by women. Examples of these non-verbal reactions include 'smiles', 'grins' and 'nods', while references to sounds include 'laughs', 'cries', 'giggles', and 'coos'. The share of 'she' (relative to 'he') before this group of words is around 8 percentage points higher than it is across all words that follow these pronouns. Words such as 'giggles' were never used frequently, but when they were used, they were more likely than other words to refer to women rather than men.
Words that imply creative achievements and leadership roles were less likely than other words to refer to women. These words include 'directed', 'performed', 'painted' and 'designed' as well as 'managed', 'founded' and 'launched'. The use of these words after 'she' is around 5 percentage points lower than the overall mix of 'she' and 'he'. This finding is not too surprising. According to the British Film Institute, amongst the 11,000 credits for directors of British films over 100 years, just 5% of the credits were for women (while 23% of all crew members were women).
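The percentage-point comparisons above (around 8 points higher for one group, around 5 points lower for another) can be expressed as a simple calculation: the share of 'she' before a group of words, minus the share of 'she' before all words. The sketch below assumes the counts are held in two dictionaries keyed by word. The function names are hypothetical, and the numbers in the usage note are made up for illustration, not taken from the research.

```python
def she_share(she_count, he_count):
    """Percentage of 'he'/'she' occurrences that are 'she'."""
    return 100.0 * she_count / (she_count + he_count)


def group_gap(group_words, she_counts, he_counts):
    """Percentage-point gap between a word group's 'she' share and the overall share.

    she_counts and he_counts map each following word to the number of times
    it appeared after 'she' and 'he' respectively.
    """
    overall = she_share(sum(she_counts.values()), sum(he_counts.values()))
    group_she = sum(she_counts.get(word, 0) for word in group_words)
    group_he = sum(he_counts.get(word, 0) for word in group_words)
    return she_share(group_she, group_he) - overall
```

For example, with invented counts in which 'she' accounts for 40% of all occurrences but 60% of those before 'smiles', `group_gap({"smiles"}, she_counts, he_counts)` would return a gap of 20 percentage points.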
Groups of words that are more and less likely to follow 'she'
Notes: The words shown are a selection of the most frequently-used words that are most likely to belong in each group.
In recent years substantially more space has been given to the voices of creative women. There is a difference between simply mentioning someone in an article, and directly quoting their views and opinions. To identify the latter we searched for closing quotation marks (”) that were followed by 'she' or 'he'. In 2000, just under a quarter of these quotes were by women. However, in the last four years, there has been a marked rise in the number of women quoted. And since 2016 the share of quotes by women has been slightly higher than the share of pronouns referring to women. This suggests a shift in emphasis, away from merely mentioning women and towards giving more room for their voices. Based on current trends, 2019 may be the first year in which women are quoted as often as men in a given month.
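The quote search described above can be sketched as a regular expression that looks for a closing quotation mark followed by 'she' or 'he'. This is a simplified illustration: the exact pattern used in the research is not given, so the handling of whitespace and the restriction to the curly closing mark (”) are assumptions.

```python
import re
from collections import Counter

# A curly closing quotation mark, then whitespace, then "he" or "she".
# Straight quotes are ignored here because they do not distinguish
# opening from closing marks (an assumption made for this sketch).
QUOTE_RE = re.compile(r"\u201d\s+(he|she)\b", re.IGNORECASE)


def count_quotes(text):
    """Return (quotes_by_she, quotes_by_he) found in a text."""
    counts = Counter(match.lower() for match in QUOTE_RE.findall(text))
    return counts["she"], counts["he"]


def share_of_quotes_by_women(she_quotes, he_quotes):
    """Percentage of attributed quotes that are followed by 'she'."""
    total = she_quotes + he_quotes
    return 100.0 * she_quotes / total if total else None
```

A pattern like this only captures quotes attributed with a pronoun, so quotes attributed by name ("said Smith") are not counted for either gender.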
Percentage of quotes by men and women
Notes: The words shown are those which most often follow the combination of closing quotation marks and either 'he' or 'she'.
A closer look at individual words
The chart below shows those individual words that have been significantly more likely to follow 'he' and 'she'.5 The number of circles next to each word denotes the number of periods in which the word was significant (there are 15 months in each period), with larger circles representing more recent periods. Only words that were significant in at least two periods are shown, except for the most recent period (following the Me Too movement), where all significant words are shown.
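The article does not state which statistical test was used to decide whether a word was significantly more likely to follow one pronoun. As one possible approach, the sketch below applies a two-sided two-proportion z-test, which is an assumption and not necessarily the authors' method. The 5% threshold and the minimum of 10 uses per period come from the notes and footnotes. The function and parameter names are hypothetical.

```python
import math


def two_proportion_z_test(word_she, total_she, word_he, total_he):
    """Two-sided z-test comparing how often a word follows 'she' versus 'he'.

    word_she / word_he: times the word followed each pronoun in a period.
    total_she / total_he: all occurrences of each pronoun in that period.
    Returns (z, p_value); z > 0 means the word leans towards 'she'.
    """
    p_she = word_she / total_she
    p_he = word_he / total_he
    pooled = (word_she + word_he) / (total_she + total_he)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_she + 1 / total_he))
    if se == 0:
        return 0.0, 1.0
    z = (p_she - p_he) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def is_significant(word_she, total_she, word_he, total_he,
                   alpha=0.05, min_count=10):
    """Apply the minimum-use filter, then test at the given level."""
    if word_she + word_he < min_count:
        return False
    _, p_value = two_proportion_z_test(word_she, total_she, word_he, total_he)
    return p_value < alpha
```

Running a test like this once per word and per 15-month period would involve many comparisons, so some words would be expected to pass the 5% threshold by chance alone.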
A tense matter
As the chart below shows, words that were significantly more likely to follow 'he' are often in the past tense, while words that follow 'she' tend to be in the present tense. For example, 'said' has been significantly more likely to follow 'he', while 'says' has been more likely to follow 'she'. This might be due to the gender mix in the creative industries having been even more unbalanced in the past than it is currently. Decades of large imbalances may have led to many more instances of 'he said' than 'she said', causing these past-tense words to become associated with male pronouns. The gradual increase in female creatives has meant that the ratio of 'she says' to 'he says' is substantially higher, leading these present-tense words to become associated with 'she'.
Words that are significantly more likely to follow he or she
Notes: The horizontal position of the word indicates how often it followed 'she' relative to 'he', and this is averaged for each period in which the word was significant. The vertical position of the word indicates how frequently it was used. The word is placed into a band based on its average frequency over the periods that it was significant. Only words that were used at least 10 times over each 15-month period were considered.
For many years certain creative activities were more closely associated with one gender. For example, the chart above shows that the words 'sings', 'sang', 'dances' and 'danced' were all more likely to refer to women than men (compared to other words), while the words 'produced', 'directed' and 'painted' were all significantly more likely to refer to men. As mentioned, these differences broadly mimic those amongst workers in the creative industries.
The Me Too movement
Following the Me Too movement (which became widespread in October 2017) there has been a shift in the language associated with men and women. The reporting of the movement is clear to see in the chart above. After October 2017 the words 'alleges' and 'alleged' become significantly more likely to follow 'she', while the words 'raped', 'sexually' and 'unequivocally' (as in 'he unequivocally denies') become more closely associated with men. The increase in space given to women's voices is also visible. 'Speaks' becomes significantly more likely to follow 'she' rather than 'he', as do several synonyms such as 'considers', 'muses', and 'explains'. And finally, there is tentative evidence to suggest that certain creative pursuits have become slightly more gender neutral in recent years. Specifically the words 'wrote', 'produced' and 'directed' are no longer significantly more likely to follow 'he'. And the word 'plays' (as in she plays a role in a film or on stage) has actually become more likely to follow 'she' than 'he'.
The bigger picture
This research has shown how we can use big data and machine learning to generate more meaningful insights on gender inequality. Rather than simply counting the number of men and women we should aim to capture differences in how they are portrayed. This is only possible when data is open, and for this reason open data is the crucial ingredient in developing meaningful measures of diversity.
1 The complete list of sections is Art and design, Music, Fashion, Games, Film, Books, Stage, Television and radio, Technology, Culture and Media. Duplicate articles were removed, as were articles that contained fewer than 20 words.
2 Calculated from figures in Table 24 of the DCMS's Sector Economic Estimates 2018: Employment.
3 Excluding 'himself' and 'herself'.
4 The words were clustered based on their vector representations in part of the Google News dataset, and the clustering was performed using a Gaussian Mixture Model.
5 Significant at the 5% level.