Question types:
Computational Linguistics, Natural Language Processing, Corpus Linguistics
Critical Discourse Analysis
corpus, frequency, collocation, keyness, concordance
Challenge dominance of English & Global North
Examples created with 4CAT, the R language and the quanteda package, which offers many useful text analysis functions.
These are just a taster - join us in Honours for more Digital Methods!
Baker, P. et al. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273–306.
Baker, P. (2006). Using Corpora in Discourse Analysis. London: Continuum. (Chapter on Concordance)
Boumans, J. W., & Trilling, D. (2016). Taking stock of the toolkit: An overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4(1), 8–23. doi:10.1080/21670811.2015.1096598
We will start with a relatively simple example and then move to a more complex social media dataset.
Some Quiz questions will focus on these datasets.
We will investigate the text from two historical speeches by Nelson Mandela. You can download them from the links below:
(You will need to use these for some of the quiz exercises.)
Discourse as the conventions which govern language "above the sentence"
e.g. generic conventions in writing, and norms which shape interaction between speaker & addressee in spoken discourse
e.g. academic discourse, legal discourse, media discourse
Discourse as power and normativity, or “practices which systematically form the objects of which they speak”, e.g. medical discourse structures knowledge and social practices around health, and constitutes entities (e.g. “mental illness”, “homosexuality”) and subject positions (doctor and patient) (see Foucault, 1972:49)
Discourse as practice, or “language-in-action” indexically anchored in social and cultural patterns (Blommaert, 2005:2)
Discourse includes “any meaningful semiotic human activity” as well as the material and historical conditions in which texts are produced.
Discourse produces gendered subjects through performativity. Gender identity is actually an ongoing process of “citing” gender norms, or “doing” gender (Butler, 171-180)
“The hidden power of media discourse and the capacity of … power-holders to exercise this power depend on systematic tendencies in news reporting and other media activities.” (Fairclough, 1989:54)
Every object or concept is surrounded by different ways of constructing it, which reflect different ways of representing the world. (Baker, 2006)
Language is just one part of discourse
Discourse can also be multimodal (visual, audio, haptic signifiers)
Classic qualitative approaches to media research such as Critical Discourse Analysis (CDA) are traditionally focused on relatively small amounts of text.
Such approaches have continued relevance and new uses if they can scale up to engage with larger collections of text e.g. evaluating new computational tools and auditing biases and ideological assumptions in AI training data.
These approaches can be expanded and enhanced by using computational approaches such as automated text analysis.
We need to know where new tools come from (their provenance) and what they assume about language and the world (e.g. what counts as “context”, English dominance, built-in gender biases etc).
Computational linguistics, Natural Language Processing and Corpus Linguistics are related areas which provide different approaches, concepts and tools for analysing textual data.
Computational linguistics is a broad inter-disciplinary area of study where software and algorithms are developed to study, analyse and synthesise language and speech for applications such as machine translation, speech recognition, machine learning and deep learning (“AI”).
Corpus linguistics has developed methods to study trends and patterns in language use and sociolinguistic variation by analysing large collections of electronically stored, naturally occurring texts.
Natural Language Processing (NLP) is a subfield of Computer Science. NLP develops algorithms and models for computers to “understand”, interpret, and generate human language in a contextually relevant way.
From Eliza the “therapy bot” to ChatGPT
Automatic text analysis methods used in media studies include:
These methods are very useful for describing large collections of textual data such as social media posts, web pages, or interview transcripts.
While Computational Linguistics has a strongly quantitative and statistical focus, Corpus Linguistics also involves much qualitative work interpreting text (such as examining concordance lines). It can therefore be used to extend the scope of Critical Discourse Analysis (CDA) and other smaller-scale media studies approaches to analysing linguistic discourse.
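As a rough illustration of what "examining concordance lines" involves, the sketch below prints each occurrence of a keyword with a few words of context on either side (a keyword-in-context, or KWIC, view). It is a minimal Python sketch using only the standard library, not the quanteda workflow used for the course examples. The function names `tokenize` and `kwic`, the `window` parameter and the sample text are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

# Toy text invented for illustration; it is not taken from either speech.
sample = "The people want peace. The people want freedom. Peace needs the people."

for left, word, right in kwic(sample, "people", window=2):
    print(f"{left:>20} | {word} | {right}")
```

Note that this simple version ignores sentence boundaries, so a context window can run across a full stop. The counting is automated, but reading and interpreting the lines is still qualitative work.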
Van Atteveldt, W., & Peng, T. (2018). When communication meets computation: Opportunities, challenges, and pitfalls in computational communication science. Communication Methods and Measures, 12(2–3), 81–92. doi:10.1080/19312458.2018.1458084
Mandela’s speech from the dock, Rivonia Trial, 20 April 1964
Mandela’s inaugural speech as President, 10 May 1994
A corpus is a set of documents which stores large quantities of real-life text. The plural form of the word is corpora.
You can find a set of South African language corpora on the SADiLaR corpus portal website.
Individual documents or posts which make up the corpus can be labelled and stored separately from one another in the corpus format.
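One way to picture "labelled and stored separately" is as a list of documents, each with its own identifier, text and metadata. The sketch below is a hypothetical Python structure, not the format used by quanteda or any corpus portal. The `Document` class, the `select` helper and the document identifiers are assumptions, and the texts are placeholders to be replaced with the downloaded speeches.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """One labelled document in the corpus."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)

# Placeholder texts: paste the downloaded speeches in place of these strings.
corpus = [
    Document("rivonia_1964", "<text of the 1964 speech>",
             {"speaker": "Nelson Mandela", "year": 1964}),
    Document("inaugural_1994", "<text of the 1994 speech>",
             {"speaker": "Nelson Mandela", "year": 1994}),
]

def select(corpus, **criteria):
    """Return the documents whose metadata matches every given key/value pair."""
    return [d for d in corpus
            if all(d.metadata.get(k) == v for k, v in criteria.items())]

print([d.doc_id for d in select(corpus, year=1994)])
```

Keeping documents separate in this way makes it possible to compare sub-corpora, for example the 1964 speech against the 1994 speech, rather than treating all the text as one block.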
Frequency is a key concept underpinning the analysis of text and corpora.
Nonetheless, as a purely quantitative measure it needs to be used with sensitivity to the word-distribution patterns in human languages and the importance of context for meaning.
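A raw frequency count can be sketched in a few lines. The example below is a minimal Python illustration using `collections.Counter`, not the quanteda functions used for the course examples. The function name `word_frequencies` and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word token occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Toy sentence invented for illustration; it is not taken from either speech.
sample = "The people of the country want peace and the people want freedom."

freq = word_frequencies(sample)
print(freq.most_common(3))
```

Even in this one-sentence toy example the most frequent token is "the", which says little about what the text is about. This is the word-distribution pattern that raw counts have to be read against.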
Mandela’s inaugural speech as President, 10 May 1994
The most frequent word in the 1994 speech is “us”.
Word:   us   world   people   country   peace   human   south   let   freedom
Count:  9    8       8        6         6       5       5       5     4
Other pronouns are missing.
Why do you think this is the case?
Frequencies are usually only analysed after the most common words in a language are removed (e.g. “a”, “the”, “I”, “you”).
These common words are called stopwords.
We would need to edit the stopword list to check the frequency of other pronouns which may be used to construct a group identity.
"I/me" highlights individual stance.
"We/us" groups people together, suggests community, masks power.
"They/them" disidentifies, “others”.
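The effect of editing a stopword list can be sketched as follows. This is a minimal Python illustration, not the stopword list or functions that quanteda provides. `STOPWORDS` is a deliberately tiny, made-up list, and `PRONOUN_GROUPS`, `content_frequencies` and `pronoun_group_counts` are hypothetical names. The sample text is invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in",
             "i", "me", "we", "us", "you", "they", "them"}

# The three pronoun groups discussed above.
PRONOUN_GROUPS = {
    "I/me": {"i", "me"},
    "we/us": {"we", "us"},
    "they/them": {"they", "them"},
}

def content_frequencies(text, stopwords):
    """Count word tokens after dropping every token found in the stopword list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

def pronoun_group_counts(text):
    """Edit the stopword list so pronouns are kept, then total each group."""
    all_pronouns = set().union(*PRONOUN_GROUPS.values())
    freq = content_frequencies(text, STOPWORDS - all_pronouns)
    return {label: sum(freq[w] for w in words)
            for label, words in PRONOUN_GROUPS.items()}

# Toy text invented for illustration; it is not taken from either speech.
sample = "We ask you to join us. They doubted me, but I trust them and we will build."

print(content_frequencies(sample, STOPWORDS).most_common(3))
print(pronoun_group_counts(sample))
```

With the unedited list every pronoun disappears from the counts. Once the pronouns are taken out of the stopword list, the three groups can be compared directly.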
Download the text of the Rivonia speech, copy it and paste it into the Wordtree tool and use it to answer the questions in the quiz.
Which word is most frequently juxtaposed with “government” in the Rivonia speech?
Which is the most frequent conceptual association with the word “Africans” in the Rivonia speech?
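Questions like these ask about collocation: which words tend to appear near a chosen node word. The Wordtree tool shows this visually. The sketch below shows one simple way the same idea could be counted in Python, assuming a fixed window of words on either side of the node. The function name `collocates`, the `window` parameter, the tiny stopword list and the sample text are all assumptions, and the toy text is deliberately unrelated to the quiz.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"a", "the", "of", "and", "to", "is", "by"}

def collocates(text, node, window=2, stopwords=frozenset()):
    """Count the words found within `window` tokens either side of the node word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(t for t in neighbours if t not in stopwords)
    return counts

# Toy text invented for illustration; it is not taken from either speech.
sample = ("The council passed the law. The council passed the budget. "
          "Voters questioned the council.")

print(collocates(sample, "council", window=2, stopwords=STOPWORDS).most_common())
```

Raw co-occurrence counts like these are only a starting point. The concordance lines still need to be read to see how the node word and its neighbours are actually being used.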