Computational approaches in discourse analysis III

Marion Walton

Outline

  1. What is discourse?
  2. Computational linguistics overview
  3. Discourse analysis & Corpus linguistics
  4. Automating content analysis in Media studies
  5. Multimodal discourses

Quiz

  • Multiple Choice Questions
  • Take-home, Randomized

Question types:

  1. Based on readings and lectures
  2. Use Wordtree and Voyant tools to answer questions about Mandela speech and Clicks Corpus.
  3. Apply concepts from readings to examples from Clicks corpus

Starting points - Concordance

Lecture 3 - Machine learning and AI

ML & Clicks Video Descriptions

Framing

“Framing essentially involves selection and salience. To frame is to select some aspects of a perceived reality and make them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described.” (Entman, 1993:52)

Machine learning and frame analysis

  • Co-occurrence of words interpreted as a frame using statistical techniques like cluster analysis.
  • Cluster analysis - organising items into groups, or clusters, on the basis of how closely associated they are.
  • Co-occurrences of words graphically visualized as networks of words
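A minimal sketch of the idea in the bullets above, using only the Python standard library. The documents are invented toy examples, and the function names and the `min_count` threshold are assumptions made for illustration; published frame analyses typically use more elaborate statistical clustering.

```python
from collections import Counter
from itertools import combinations

# Invented toy documents (already tokenised); not drawn from the Clicks corpus.
docs = [
    ["hair", "dry", "damaged"],
    ["hair", "dry", "damaged", "advert"],
    ["boycott", "stores", "protest"],
    ["boycott", "stores", "advert"],
]

def cooccurrence_counts(documents):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for tokens in documents:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts

def cluster_words(counts, min_count=2):
    """Group words linked by pairs seen together at least `min_count` times.

    This is single-linkage clustering cut at one threshold: the clusters are
    the connected components of the thresholded co-occurrence network.
    """
    neighbours = {}
    for (a, b), n in counts.items():
        if n >= min_count:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    clusters, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            word = stack.pop()
            if word in component:
                continue
            component.add(word)
            stack.extend(neighbours[word] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

counts = cooccurrence_counts(docs)
clusters = cluster_words(counts, min_count=2)
print(clusters)
```

Each resulting cluster is a candidate frame that still needs human interpretation; the pairs that pass the threshold are also the edges one would draw in a word network.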

ML & Clicks Comments

Comments - Collocations

Top Collocations: Clicks Comments (n=35)

     collocation     count
  2  black people      135
  1  black women       106
  3  white people      102
  7  natural hair       48
  9  black woman        45
  4  fine flat          41
145  black hair         39
 13  flat hair          35
  5  dry damaged        28
 15  can use            28
  6  face cream         25
  8  go back            25
 27  black person       24
 21  straight hair      22
 49  can get            21
 22  aneeza gold        20
 76  white women        20
 64  damaged hair       17
 10  political party    16
 11  thank ninja        16
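A rough sketch of how a count table like the one above could be produced, assuming collocations are taken as adjacent word pairs after stopword removal. The stopword list and the comments are invented for illustration, and this is not necessarily the procedure used for the Clicks corpus; dedicated collocation tools usually also score the strength of association rather than reporting raw counts alone.

```python
from collections import Counter

# Hypothetical stopword list and invented comments, for illustration only.
STOPWORDS = {"the", "a", "is", "for", "and", "my", "to", "of"}

comments = [
    "Natural hair is not damaged hair",
    "The advert showed fine flat hair and dry damaged hair",
    "My natural hair is fine",
]

def bigram_counts(texts, stopwords):
    """Count adjacent word pairs within each text after dropping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in text.lower().split() if t not in stopwords]
        counts.update(zip(tokens, tokens[1:]))
    return counts

counts = bigram_counts(comments, STOPWORDS)
for pair, n in counts.most_common(3):
    print(" ".join(pair), n)
```

Because stopwords are dropped before pairing, a counted pair can bridge words that were separated by a stopword in the original comment.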

n-grams

A contiguous sequence of “n” terms, often used in predictive text, spell-checking, language modeling etc. Also useful for discourse analysis.

          feature frequency rank docfreq group
48         racist       271   48     217   all
60         racism       208   60     167   all
392        racial        38  388      34   all
586      a_racist        27  566      27   all
607     of_racism        26  595      23   all
656     racism_is        25  622      25   all
665       racists        24  657      23   all
744    not_racist        22  729      22   all
938     racism_in        18  909      18   all
957     be_racist        18  909      15   all
1159    is_racist        15 1092      14   all
1259   racist_and        14 1165      13   all
1294   the_racist        13 1268      13   all
1436    is_racism        12 1402      10   all
1498   are_racist        12 1402      11   all
1713 being_racist        11 1551      11   all
1714   racism_and        11 1551      10   all
1982    to_racism         9 1936       8   all
2029 about_racism         9 1936       9   all
2046   the_racism         9 1936       8   all
                      feature frequency  rank docfreq group
744                not_racist        22   729      22   all
3732           was_not_racist         5  3642       5   all
4389               not_racism         5  3642       5   all
4469            is_not_racist         5  3642       5   all
6452            not_racist_it         3  6234       3   all
6457        ad_was_not_racist         3  6234       3   all
10014       was_not_racist_it         2  9552       2   all
10015       not_racist_it_was         2  9552       2   all
15642          not_racism_the         2  9552       2   all
15758          but_not_racist         2  9552       2   all
16573      totally_not_racist         2  9552       2   all
16574          not_racist_its         2  9552       2   all
16603   is_totally_not_racist         2  9552       2   all
18236         not_racist_they         2  9552       2   all
18334          its_not_racist         2  9552       2   all
19044           not_racist_at         2  9552       2   all
19046       not_racist_at_all         2  9552       2   all
19155           not_racist_to         2  9552       2   all
61441     truthful_not_racist         1 21181       1   all
61444 was_truthful_not_racist         1 21181       1   all
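The columns in the tables above can be illustrated with a small sketch: `frequency` counts every occurrence of an n-gram, while `docfreq` counts the number of comments it appears in. The comments, function names and whitespace tokenisation are assumptions for illustration, not the pipeline that produced the tables.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return contiguous n-token sequences joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_table(texts, sizes=(1, 2, 3), contains="racist"):
    """Tally frequency and document frequency for n-grams containing a term."""
    frequency, docfreq = Counter(), Counter()
    for text in texts:
        tokens = text.lower().split()
        grams = [g for n in sizes for g in ngrams(tokens, n)
                 if contains in g.split("_")]
        frequency.update(grams)      # every occurrence counts
        docfreq.update(set(grams))   # each comment counts once per n-gram
    return frequency, docfreq

# Invented comments, not taken from the corpus.
comments = [
    "the ad was not racist",
    "it is not racist it is not racist at all",
    "that ad is racist",
]
frequency, docfreq = ngram_table(comments)
print(frequency["not_racist"], docfreq["not_racist"])
```

A comment that repeats a phrase raises `frequency` but not `docfreq`, which is why the first can exceed the second in the tables.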

Topic Modeling - Comments

      Topic 1  Topic 2     Topic 3     Topic 4   Topic 5  Topic 6  Topic 7 
 [1,] "clicks" "🤣"        "black"     "racist"  "just"   "can"    "hair"  
 [2,] "eff"    "go"        "people"    "us"      "like"   "one"    "white" 
 [3,] "must"   "know"      "think"     "country" "get"    "use"    "ad"    
 [4,] "right"  "say"       "women"     "time"    "need"   "also"   "even"  
 [5,] "take"   "u"         "😂"        "people"  "want"   "yes"    "look"  
 [6,] "malema" "see"       "racism"    "eff"     "love"   "good"   "make"  
 [7,] "thing"  "said"      "person"    "never"   "way"    "skin"   "fine"  
 [8,] "matter" "really"    "beautiful" "racism"  "going"  "face"   "saying"
 [9,] "stupid" "something" "white"     "now"     "well"   "please" "flat"  
[10,] "stores" "back"      "many"      "race"    "better" "thank"  "racist"

Topic Modeling

In NLP, topic modeling applies unsupervised learning to a corpus to produce sets of terms that summarise the collection’s main topics.

Topic Modeling

Used to identify topics present in a corpus.

The LDA (latent Dirichlet allocation) algorithm identifies co-occurrence patterns of words and the latent structure of the text.

LDA assumes:

  • each document is a mixture of topics
  • each topic has a characteristic word distribution
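The two LDA assumptions can be made concrete with a small sketch. All topic names and probabilities below are hypothetical values chosen for illustration; in practice LDA estimates them from the corpus, and this code only shows how the two distributions combine, not how they are fitted.

```python
# Hypothetical numbers chosen for illustration; LDA would estimate them from data.
# Each topic is a probability distribution over words.
topic_word = {
    "hair":   {"hair": 0.5, "flat": 0.3, "racist": 0.1, "boycott": 0.1},
    "racism": {"hair": 0.1, "flat": 0.1, "racist": 0.5, "boycott": 0.3},
}
# Each document is a mixture of topics.
doc_topic = {
    "comment_a": {"hair": 0.8, "racism": 0.2},
    "comment_b": {"hair": 0.25, "racism": 0.75},
}

def word_probability(word, doc):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topic_word[topic][word]
               for topic, weight in doc_topic[doc].items())

for doc in doc_topic:
    print(doc, round(word_probability("racist", doc), 3))
```

Under these made-up values, "racist" is more probable in `comment_b` because that comment draws more heavily on the "racism" topic.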

Why Use Topic Modeling

  • Exploratory work on large dataset
  • Summarise key themes
  • Reduces complexity and size of dataset (dimensionality reduction)
  • Faster Information Retrieval - find by themes not keyword matches

Goals of ML

The goal of machine learning is to:

  • make accurate predictions
  • use large datasets
  • use complex models which recognise nonlinear relationships between several variables.

Overview - Automated Content Analysis

Boumans & Trilling, 2016

Boumans, J. W., & Trilling, D. (2016). Taking stock of the toolkit: an overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4(1), 8-23. https://doi.org/10.1080/21670811.2015.1096598

Approaches

Simple automation

Dictionary approaches (e.g. sentiment analysis)

  • specifies explicit rules
  • best for coding manifest data
  • sentiment is contextual, so a dictionary does not always travel well to new contexts
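A minimal sketch of a dictionary approach, assuming a tiny hypothetical lexicon and whitespace tokenisation. It is not any particular published sentiment dictionary; it is only meant to show how explicit rules work and where they break down.

```python
# Hypothetical mini-lexicon; real sentiment dictionaries are far larger.
LEXICON = {"love": 1, "beautiful": 1, "good": 1,
           "stupid": -1, "racist": -1, "damaged": -1}

def dictionary_score(text, lexicon=LEXICON):
    """Sum the lexicon values of the words in `text` (an explicit, fixed rule)."""
    return sum(lexicon.get(token, 0) for token in text.lower().split())

positive = dictionary_score("I love my beautiful hair")
negated = dictionary_score("the ad was not racist")
print(positive, negated)
```

The second comment scores as negative even though it denies that the ad was racist: the rule sees the word "racist" but not the negation, which is one way context defeats a fixed dictionary.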

“Sentiment” in Clicks transcripts

Supervised machine learning

Model "learns" from decisions by human coders.

  • Makes more efficient use of human work.
  • Classifiers can be re-used and allow for faster response (e.g., Hopkins and King 2010; Jurka et al. 2013).
  • Inter-cultural generalisability?
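To illustrate how a model can "learn" from human coding decisions, here is a small multinomial naive Bayes sketch written with the standard library only. The training texts and the codes "product" and "politics" are invented, and naive Bayes is just one possible classifier chosen for brevity; it is not claimed to be the method used in the cited studies.

```python
import math
from collections import Counter, defaultdict

def train(labelled_texts):
    """Learn word counts per label from (text, label) pairs coded by humans."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_texts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the most probable label under multinomial naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training examples with hypothetical codes "product" and "politics".
coded = [
    ("which face cream can i use", "product"),
    ("this shampoo is good for dry hair", "product"),
    ("the party must close the stores", "politics"),
    ("protest at the stores today", "politics"),
]
model = train(coded)
prediction = classify("can i use this cream for dry skin", model)
print(prediction)
```

Once trained, the same `model` can be applied to new comments without further hand coding, which is the efficiency gain noted above; whether it generalises to other cultural contexts depends on how well the coded examples represent them.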

Unsupervised machine learning

  • Unsupervised machine learning helps to describe discourses, frames or topics in an open way - doesn’t impose prior assumptions
  • Similar to qualitative methods
  • Difficult to audit
  • Can replicate bias

Strengths

  • Flexibility & scale
  • Classification tasks
  • Reality-based tasks with right and wrong answers

Weaknesses

  • Culturally constructed categories
  • Potential linguistic/culture/race/gender biases
  • Unintended uses
  • Variability & difficulties with auditing

Further reading

  • Jurka, Timothy P., Loren Collingwood, Amber Boydstun, Emiliano Grossman, and Wouter van Atteveldt. 2013. “RTextTools: A Supervised Learning Package for Text Classification.” The R Journal 5 (1): 6–12.

  • Hopkins, Daniel J., and Gary King. 2010. “A Method of Automated Nonparametric Content Analysis for Social Science.” American Journal of Political Science 54 (1): 229–247.

  • Vlieger, Esther, and Loet Leydesdorff. 2012. “Content Analysis and the Measurement of Meaning: The Visualization of Frames in Collections of Messages.” In Research Methodologies, Innovations and Philosophies in Systems Engineering and Information Systems, edited by Manuel Mora, Ovsei Gelman, Anette Steenkamp and Manesh S. Raisinghani, 322–340. Hershey, PA: Information Science Reference.

Conclusion

CDA & CL

Interaction and synergy (Baker et al.)

Corpus linguistics can expand the scope of discourse analysis:

  • Techniques provided a “map” of the corpus, pinpointed areas for subsequent close analysis
  • Found examples and quantified them through absolute and relative frequencies
  • Lexical patterns, keywords, clusters, collocates revealed novel patterns of use.
  • “it can reinforce, refute or revise a researcher’s intuition and show them why and how much their suspicions were grounded” (Partington 2003:12)

Natural Language Processing

  • Sentiment analysis - major limitations
  • Topic modeling less useful for this small, focused (& reflexive) corpus
  • Sociograms particularly useful as they recorded contours of discourse practices

Importance of Critical Discourse Analysis

  • Key to address Northern/Anglocentric biases in tools
  • Important skillset for auditing and adapting tools and training datasets