Computational approaches in discourse analysis

Marion Walton

Outline

  1. What is discourse?
  2. Computational linguistics overview
  3. Discourse analysis & Corpus linguistics
  4. Automating content analysis in Media studies
  5. Multimodal discourses

Quiz

  • Multiple Choice Questions
  • Take-home, Randomized

Question types:

  1. Based on readings
  2. Apply concept to examples

Objectives

  • Disciplinary roots of automated text analysis: Computational Linguistics, Natural Language Processing, Corpus Linguistics
  • Use to extend Critical Discourse Analysis
  • Key concepts: corpus, frequency, collocation, keyness, concordance.
  • Build skills to challenge dominance of English & Global North

Questions

  • What disciplinary assumptions about language underlie various tools?
  • What is gained and lost when we automate text analysis?
  • How is text data prepared for analysis?
  • What are the strengths and weaknesses of frequency data?

Tools for text analysis

Exercises for MCQ

  1. Wordtree by Jason Davies
  2. Multilingual concordancer by Voyant Tools

Examples created with 4cat, R language and the quanteda package which offers many useful text analysis functions.

These are just a taster - join us in Honours for more Digital Methods!

Readings

  • Baker,P.et al.(2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees an d asylum seekers in the UK press. Discourse & society. 19.3.273–306.

  • Baker, P. (2006). Using Corpora in Discourse Analysis. Continuum: London. (Chapter on Concordance)

  • Boumans, J. W., & Trilling, D. (2016). Taking stock of the toolkit: An overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4(1), 8–23. doi:10.1080/21670811.2015.1096598

Example data

We will start with a relatively simple example and then move to a more complex social media dataset.

Some Quiz questions will focus on these datasets.

Mandela speeches

Mandela speeches

We will investigate the text from two historical speeches by Nelson Mandela. You can download them from the links below:

(You will need to use these for some of the quiz exercises.)

Clicks Videos

  1. Video descriptions and comments on YouTube videos posted about the Clicks/Tresseme controversy (2020). (see Amathuba for link to files)

What is discourse?

  • Discourse can be a problematic term - used in many ways.
  • Focus has been on language and written text but also other signifiers in context.

Linguistics

Discourse as the conventions which govern language "above the sentence" e.g. generic conventions in writing, and norms which shape interaction between speaker & addressee in spoken discourse

e.g. academic discourse, legal discourse, media discourse

Poststructuralism

Discourse as power and normativity, or “practices which systematically form the objects of which they speak” - e.g. medical discourse structures knowledge and social practices around health, constitutes entities e.g. “mental illness”, “homosexuality” and subject positions (doctor and patient) (see Foucault, 1972:49)

Anthropology

Discourse as practice or “language-in-action” indexically anchored in social and cultural patterns (Blommaert, 2005:2)

Discourse includes “any meaningful semiotic human activity” as well as the material and historical conditions in which texts are produced.

Queer theory

Discourse produces gendered subjects through performativity. Gender identity is actually an ongoing process of “citing” gender norms, or “doing” gender (Butler, 171-180)

Media

“The hidden power of media discourse and the capacity of … power-holders to exercise this power depend on systematic tendencies in news reporting and other media activities.”(Fairclough, 1989:54)

Language as discourse

Every object or concept is surrounded by different ways of constructing it, which reflect different ways of representing the world. (Baker, 2006)

Language vs discourse

Language is just one part of discourse

Discourse can also be multimodal (visual, audio, haptic signifiers)

Discourse in Media research

Classic qualitative approaches to media research such as Critical Discourse Analysis (CDA) are traditionally focused on relatively small amounts of text.

Such approaches have continued relevance and new uses if they can scale up to engage with larger collections of text e.g. evaluating new computational tools and auditing biases and ideological assumptions in AI training data.

Can be expanded and enhanced by using computational approaches such as automated text analysis.

Interrogating AI

We need to know where new tools come from, or their provenance and what they assume about language and the world (e.g. what counts as “context”, English dominance, built-in gender biases etc).

Disciplinary roots

Computational linguistics, Natural Language Processing and Corpus Linguistics are related areas which provide different approaches, concepts and tools for analysing textual data.

Computational linguistics

Computational linguistics is a broad inter-disciplinary area of study where software and algorithms are developed to study, analyse and synthesise language and speech for applications such as machine translation, speech recognition, machine learning and deep learning (“AI”).

Corpus linguistics

Corpus linguistics has developed methods to study trends and patterns in language use and sociolinguistic variation by analysing large collections of electronically stored, naturally occurring texts.

Natural language processing

Natural Language Processing (NLP) is a subfield of Computer Science NLP develops algorithms and models for computers to “understand”, interpret, and generate human language in a contextually relevant way.

From Eliza the “therapy bot” to ChatGPT

NLP and media studies

Automatic text analysis methods used in media studies include:

  • Supervised text classification
  • Topic modelling
  • Word embeddings

Very useful for describing large collections of textual data such as social media posts, web pages, or interview transcripts.

Quantitative vs qualitative

While Computational Linguistics has a strongly quantitative and statistical focus, Corpus Linguistics can also include qualitative analysis (such as examining concordance lines). Corpus linguistics involves much qualitative work interpreting text, and so can be used to extend the scope of Critical Discourse Analysis (CDA) and other smaller-scale media studies approaches to analysing linguistic discourse.

Further Reading

Van Atteveldt, W. & T. Peng, (2018) When Communication Meets Computation: Opportunities, Challenges, and Pitfalls in Computational Communication Science, Communication Methods and Measures, 12:2-3, 81-92, DOI: 10.1080/19312458.2018.1458084

Text as data

1964

Mandela’s speech from the dock, Rivonia Trial, 20 April 1964

1994

Mandela’s inaugural speech as President, 10 May 1994

What is a Corpus?

A corpus is a set of documents which stores large quantities of real-life text. The plural form of the word is corpora.

You can find a set of South African language corpora on the SADILAR corpus portal website.

Individual documents or posts which make up the corpus can be labelled and stored separately from one another in the corpus format.

Frequency

Frequency is a key concept underpinning the analysis of text and corpora.

Nonetheless, as a purely quantitative measure it needs to be used with a sensitivity to

  1. The word-distribution patterns in human languages.

  2. The importance of context for meaning

Frequency lists

  • Help direct researcher’s investigations
  • Measures of dispersion can reveal trends across texts
  • Can provide a sociological profile of a given word or phrase

Why frequency?

  • Language is not a random affair - rule based / rule-generating
  • Words tend to occur in relationship to other words with high levels of predictability

Frequency & ideology

  • People have choices “No terms are neutral. Choice of words expresses an ideological position”. (Stubbs, 1996:107; Baker, 47)
  • “If people speak or write in an unexpected way, or make one linguistic choice over another, more obvious one, then that reveals something about their intentions, whether conscious or not.” (Baker:48)
  • Only certain choices are available at any one time (e.g. historically)

Problems with frequency

  • Can be reductive and generalising
  • Can oversimplify
  • Focus on differences between word distributions can obscure more interesting interpretations.

1994

Mandela’s inaugural speech as President, 10 May 1994

Thinking about pronouns

The most frequent word in the 1994 speech is “us”.

     us   world  people country   peace   human   south     let freedom 
      9       8       8       6       6       5       5       5       4 

Other pronouns are missing.

Why do you think this is the case?

Removing stopwords

Frequencies are usually only analysed after the most common words in a language are removed (e.g. “a”,“the”,“I”,“you”).

These common words are called stopwords

We would need to edit the stopword list to check for frequency of other pronouns which may be used to construct a group identity.

Pronouns and nationalism

"I/me" highlights individual stance.

"We/us" groups people together, suggests community, masks power.

"They/them" Disidentifies, “others”

Remove pronouns from stopwords

Exercise

Download the text of the Rivonia speech, copy it and paste it into the Wordtree tool and use it to answer the questions in the quiz.

Example

Which word is most frequently juxtaposed with “government” in the Rivonia speech?

  1. Democratic
  2. Buildings
  3. Force
  4. 1 & 2

Example

Which is the most frequent conceptual association with the word “Africans” in the Rivonia speech?

  • Political demands
  • Death and disease
  • Large numbers