Introduction: Analysing Web-Based Musical Discourse


  1. Corpus linguistics is the study of language as part of a body of text, wherein language appears in its “natural” context.
  2. Corpus studies involve the compilation and analysis of collections of text (i.e. the body), which afford insights into the nature, structure, and use of language in this context.
  3. Natural language processing (NLP) is a field that develops techniques for understanding the contexts and rules that govern how language functions.
  4. Corpus methods can be used to determine the underlying patterns and contextual associations of words and phrases in a body of text, amongst other uses. This is known as a “distant reading” approach to texts, focused on identifying and quantifying patterns across large datasets.
  5. The insights yielded from NLP approaches to text can supplement more traditional techniques of (critical) discourse analysis.

Exploring Text-Based Corpora with Voyant


  1. The Voyant website allows you to dive into your corpus right away by uploading a document (or several) to its server.
  2. Most of Voyant’s dashboard tools, such as Cirrus, TermsBerry, and Contexts, rely on token frequency counts and collocations (the sketch after this list illustrates these underlying computations).
  3. These types of tools can yield insights into the lexical diversity of the corpus, which terms in a given corpus are most or least prominent, and the other words with which they frequently appear. This can highlight patterns or trends that can be analysed further with other tools.
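
Voyant performs these calculations on its own server, but the underlying ideas can be sketched in a few lines of Python. The snippet below is an illustration only, not Voyant’s implementation: it counts term frequencies and simple window-based collocates over a short, made-up sample text.

```python
from collections import Counter
import re

# Toy, made-up text standing in for an uploaded document
corpus = "the drop hits hard and the crowd moves when the drop hits again"

# Naive tokenisation on word characters (Voyant handles this step for you)
tokens = re.findall(r"\w+", corpus.lower())

# Term frequencies: the kind of counts visualised by Cirrus
print(Counter(tokens).most_common(3))

# Simple collocations: words within a 2-token window of "drop",
# similar in spirit to what TermsBerry and Contexts display
window = 2
collocates = Counter()
for i, tok in enumerate(tokens):
    if tok == "drop":
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                collocates[tokens[j]] += 1
print(collocates.most_common())
```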

Introduction to Natural Language Processing with spaCy


  1. spaCy is a free and open-source Python library for Natural Language Processing that is built on pre-trained processing pipelines.
  2. spaCy’s language pipelines automatically tokenize text and tag parts of speech, with powerful entity recognition and prediction features built in.
  3. You can use simple list comprehension syntax to extract information that is relevant to your research question (see the sketch after this list).
  4. Which of spaCy’s NLP functions you can call depends on the type of data object you are working with (e.g. a Doc, Token, or Span).
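
A minimal sketch of this workflow appears below. It assumes the small English model en_core_web_sm has been downloaded, and the sample sentence is invented for illustration.

```python
import spacy

# Load a small pre-trained English pipeline
# (assumes: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("The remix dropped last night and the bassline is unreal.")

# Each token carries part-of-speech and lemma annotations
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# List comprehension to pull out only the nouns
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
print(nouns)

# Named entities detected by the pipeline (may be empty for this sentence)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```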

Organising Data with Pandas


  1. Pandas is a Python library for data manipulation and analysis.
  2. You can use Pandas to create dataframes, i.e. data structures consisting of rows and columns, from your spaCy Doc and list objects (a sketch follows this list).
  3. You can add or remove columns and rows based on their labels or values and quantitatively sort and analyse the contents.
  4. You can also use Pandas to write and export .csv and .tsv files for further manual coding or other manipulation.
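
The sketch below illustrates these operations on a pair of hypothetical lists of the kind the spaCy step might produce; the column names and file names are illustrative only.

```python
import pandas as pd

# Hypothetical token and part-of-speech lists, e.g. extracted with spaCy
tokens = ["remix", "dropped", "bassline", "unreal"]
pos_tags = ["NOUN", "VERB", "NOUN", "ADJ"]

# Build a dataframe with one row per token
df = pd.DataFrame({"token": tokens, "pos": pos_tags})

# Add a column derived from existing values
df["token_length"] = df["token"].str.len()

# Filter rows by value and sort quantitatively
nouns = df[df["pos"] == "NOUN"].sort_values("token_length", ascending=False)

# Export for manual coding or further manipulation
nouns.to_csv("nouns.csv", index=False)
nouns.to_csv("nouns.tsv", sep="\t", index=False)
```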

Introduction to Word Embeddings


  1. Word embeddings rely on a conception of language, understood as a finite vocabulary, as a totality that occupies a multi-dimensional space.
  2. Word embeddings are sets of vectors that position each word in this multi-dimensional space; in the raw, one-hot vocabulary representation every word is orthogonal to every other word, whereas the trained embedding places words so that the distances between them are meaningful.
  3. Embeddings are trained by unsupervised machine learning on large text corpora: the training looks for co-occurrences between words, syllables, and characters, and builds predictions as to the probability of given n-grams occurring in proximity. This presupposes that words that are semantically close (i.e. have similar meanings or usages) will be close together in the multi-dimensional space, while words that are semantically different will be far apart.
  4. There are many established statistical methods for manipulating and interpreting word embeddings, such as cosine similarity between word vectors (see the sketch after this list).
  5. TensorFlow’s Embedding Projector has built-in functions for dimensionality reduction which create intuitively readable visualisations of embeddings in 2- or 3-dimensional space.
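
As an illustration of one such method, the sketch below uses cosine similarity between pre-trained word vectors. It assumes spaCy’s medium English model (en_core_web_md), which ships with word vectors; the example words are arbitrary.

```python
import spacy

# Assumes the medium English model, which includes pre-trained word vectors,
# has been downloaded: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

dance, techno, violin = nlp("dance techno violin")

# Each token maps to a dense vector in the embedding space
print(dance.vector.shape)        # e.g. (300,) for this model

# Cosine similarity: semantically related words typically score higher
print(dance.similarity(techno))  # typically higher
print(dance.similarity(violin))  # typically lower
```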