Content from Introduction: Analysing Web-Based Musical Discourse
Last updated on 2024-04-07
Overview
Questions
- What is corpus linguistics?
- What can I do using corpus linguistic methods?
- How does this method differ from analogue ‘close reading’ methods in musicology?
Objectives
- Recognise the disciplinary objectives of corpus linguistics and natural language processing.
- Understand the aims of quantitative approaches to understanding texts.
Discussion
In groups of 3–4 people, please take 10 minutes to discuss:
– What do you normally read for when you read text?
– What do you read for when you read text about music?
– How does the nature of online text change that?
– How might working with large amounts of text change how we read it or what we read into it?
Please nominate one member of your group to take notes in the Etherpad and be ready to share your thoughts with the whole group once time is up.
Key Points
- Corpus linguistics is the study of language as part of a body of text, wherein language appears in its “natural” context.
- Corpus studies involve the compilation and analysis of collections of text (i.e. the body) which afford insights into the nature, structure, and use of language in this context.
- Natural language processing (NLP) is a field that formulates techniques for understanding contexts and rules of language function.
- Corpus methods can be used to determine the underlying patterns and contextual associations of words and phrases in a body of text, amongst other uses. This is known as a “distant reading” approach to texts, focused on identifying and quantifying patterns across large datasets.
- The insights yielded from NLP approaches to text can supplement more traditional techniques of (critical) discourse analysis.
Content from Exploring Text-Based Corpora with Voyant
Last updated on 2024-04-07
Overview
Questions
- What is Voyant and what is it used for?
- What are the principles underpinning its dashboard tools?
- What kinds of insights do they yield?
Objectives
- Define key terms relating to Natural Language Processing.
- Load data into Voyant and conduct inductive analysis.
- Identify affordances and limitations of a frequency-based approach to corpus analysis.
Introduction to Voyant
Voyant is a simple, powerful and user-friendly open-source reading and analysis environment for digital texts. It was created by Stéfan Sinclair and is now maintained by Geoffrey Rockwell, Andrew MacDonald and Cecily Raynor at McGill University and the University of Alberta in Canada. It is browser-based and allows you to upload documents or copy and paste text directly into the interface, which Voyant then automatically analyses according to some core Natural Language Processing (NLP) principles.
Voyant is a useful tool for those new to NLP because its dashboard provides an instant and customisable, synchronic overview of many facets of the corpus uploaded while keeping the workings ‘under the hood’. This allows users to explore the linguistic features of their corpus intuitively, before getting to grips with the linear, incremental workflow of NLP libraries such as the Natural Language Toolkit (NLTK), spaCy and Flair.
Callout
Voyant does not require any prior programming knowledge and enables inductive, corpus-driven observations to be made with relative ease. This allows you to refine your research questions in response to the data you are working with, and then move on to using other tools in a more targeted way.
Core NLP Terminology
Before we start using Voyant to explore our text corpora in more detail, it is useful to define some core NLP concepts so that you can understand what Voyant is looking for as it reads text data.
First of all, Voyant needs to parse the text as a sequence of characters (or string) and identify meaningful linguistic units such as words, numbers and punctuation.
- A token is a defined unit within a string, such as an individual appearance of a word or number, usually separated from other tokens by whitespace.
- A type, on the other hand, is a unique word form, which may appear many times in a corpus.
- A lemma is the root form of a derived (inflected) word, e.g. ‘music’ is the lemma of the types ‘musically’, ‘musician’, ‘musicking’, etc.
- A hapax legomenon (or simply hapax) is a type that appears only once in a corpus.
- A concordance is a generated list of all tokens that appear in a digital corpus, often displayed with their immediate contexts.
The relationship between tokens and types (or type-token ratio), i.e. how many unique word forms there are in a corpus versus how many times they are used, is a measure of the lexical diversity of the corpus.
Besides parsing a corpus for tokens and types, Voyant also analyses the relationships between tokens and their neighbours, in line with the importance placed on knowing the company words keep in computational linguistics, discussed in the last episode.
- An n-gram is a contiguous or consecutive sequence of tokens in a text. For example a bigram is a pair of consecutive written units, i.e. characters, syllables or words; a trigram is a sequence of three consecutive written units, and so on. Any number of consecutive units can be specified for an n-gram and it functions somewhat like a sample-rate for extracting information from the text that could be important for predicting aspects of meaning or function.
- A collocation is typically a bigram, often at word level, which occurs in a text at a rate greater than chance. For example, type-pairs like ‘red wine’ and ‘string quartet’ are collocations in English.
The original and most common application of automated text analysis uses computers to count how often certain words occur in a given text. The analytical strategy employed in a frequency-based approach is relatively simple: count the number of occurrences of a specific word token and then normalise this according to how many words there are in the text overall to obtain relative term frequency.
Despite its simplicity, this approach is extremely powerful and versatile. Besides relative term frequency, another common technique used for making frequency counts reflect data more meaningfully is term frequency-inverse document frequency (TF-IDF), especially if working with corpora across multiple documents. This helps to minimise the weighting of frequently occurring but perhaps less significant terms (such as ‘the’, ‘a’, etc.) while making less frequent terms have a higher impact.
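To make the arithmetic concrete, here is a small, purely illustrative Python sketch (independent of Voyant, using made-up toy sentences) of relative term frequency and a simple inverse document frequency weighting:
PYTHON
import math

# toy corpus: two invented one-sentence 'documents'
documents = [
    "the string quartet opens the album",
    "the bassline anchors the track",
]

term = "the"
for document in documents:
    tokens = document.split()
    # raw count of the term, normalised by the total number of tokens in the document
    relative_frequency = tokens.count(term) / len(tokens)
    print(document, "->", round(relative_frequency, 2))

# inverse document frequency: a term that appears in every document scores 0,
# so ubiquitous words like 'the' are downweighted
documents_with_term = sum(1 for document in documents if term in document.split())
idf = math.log(len(documents) / documents_with_term)
print("idf of", term, "=", idf)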
Callout
Frequency-based text analysis is often traced back to Father Roberto Busa, a Jesuit Priest who worked with IBM in the 1940s to manually index 11 million medieval Latin words from the writings of St. Thomas Aquinas, count each appearance of the word ‘in’ and look at its collocations in order to explore the concept of ‘presence’ in his work. This somewhat quaint story has come to be venerated as the origin myth of the field of humanities computing as a whole, and has more recently been the subject of characteristically humanistic critique.1
In music studies, the use of frequency-based text analysis methods is growing, but not yet commonplace. A potential explanation for this reticence was given in a 2015 ISMIR paper by Charles Inskip and Frans Wiering, which itself uses frequency-based methods to reach its conclusions. Inskip and Wiering analysed responses to a survey on musicologists’ attitudes towards using technology in their work and found that, while music scholars were generally enthusiastic about incorporating software and other technologies into their research, data literacy represented a barrier that they perceived as frustrating.2
With these core principles in mind, we can now have a go at using Voyant to explore the Boomkat corpus.
Discussion
In pairs or groups of three, choose one of the subgenre datasets from the ‘Episode 2’ Zenodo repository which you should already have downloaded to your computer. Then, go to the Voyant website and click the ‘Upload’ button to load the data to Voyant’s server. This may take a few moments.
Once the data has loaded, take 5 minutes to look around the various panes of the Dashboard and think about:
- What do the individual panels do? Can you describe it using the NLP terms we defined earlier?
- Which tool or finding grabs your attention? Why?
- Are there any problems with how the data appears that prevent a more meaningful analysis?
- Are there any features that you do not understand?
Discuss these with your group and jot down some observations in the Etherpad to share with everyone else.
Fine-Tuning Parameters for Corpus Sensitivity
From this first pass at our corpus, we can see that Voyant has recognised punctuation mark tokens and omitted them from frequency counts, the word cloud, and other analyses. This is useful for readability and can almost go unnoticed. However, it should be remembered that punctuation marks are tokens and the decision to remove them is an intentional one, which other NLP libraries do not perform automatically.
Because Voyant has so far only performed a raw frequency count of the corpus, the most common terms that Voyant has identified are, perhaps unsurprisingly, quite generic (articles, prepositions, etc.). They do not give a sense of what is distinctive about the corpus, because it could be assumed that a majority of texts would also make regular use of these types of words. Of course, depending on your research question, this could be precisely the type of information you are after - Father Busa’s project, after all, was all about the seemingly generic word ‘in’.
If you already have some idea about the kind of linguistic devices you are looking for, either because of your initial research question or through what you have observed from this initial analysis, it is possible to get Voyant to filter out certain terms so that you can fine-tune your findings. This involves creating a list of stopwords, which is a common technique in corpus linguistics and NLP more generally.
Stopwords
A stopword is a word (or any token) that is automatically omitted from a computer-generated concordance. Many NLP libraries have automatic lists of stopwords specific to individual languages, and these can often be edited to suit your specific needs, or you can create your own.
In Voyant, you can inspect the stopword lists by clicking the ‘settings’ icon on the top right corner of the ‘Terms’ tool. As can be seen, there are lists for the different languages that Voyant supports, as well as an ‘auto-detect’ list. It can be assumed that the removal of punctuation marks from Voyant’s reading of our corpus is a feature of its ‘auto-detect’ stopword list.
Callout
Remember that data cleaning is an iterative process, not simply an initial step. We have been working with a dataset that has already been converted from .csv to .xlsx, UTF-8 encoded and which contains one uniform data type within it. Consideration of the type of data you are working with and how it is presented for analysis is important in pre-processing but does not end there.
Relatedly, stopword lists and any other filters should be thought through carefully in order to remain sensitive to the discursive priorities of the types of texts being analysed while also being used to streamline the dataset and remove ‘noise’. This, too, should be an iterative process and can involve trial and error as you get a feel for the shape and contents of your corpus.
Key Points
- The Voyant website allows you to dive into your corpus right away by uploading a document (or several) to its server.
- Most of Voyant’s dashboard tools, such as Cirrus, TermsBerry and Contexts, rely on token frequency counts and collocations.
- These types of tools can yield insights into the lexical diversity of the corpus, which terms in a given corpus are most or least prominent, and the other words with which they frequently appear. This can highlight patterns or trends that can be analysed further with other tools.
Jacob, Arun. 2021. ‘Punching Holes in the International Busa Machine Narrative’. In Kim, Dorothy and Koh, Adeline, eds. Alternative Historiographies of the Digital Humanities. Punctum: 121–139.
Inskip, Charles and Wiering, Frans. 2015. ‘In Their Own Words: Using Text to Identify Musicologists’ Attitudes Towards Technology’. Presented at: 16th International Society for Music Information Retrieval Conference, Malaga, Spain. https://discovery.ucl.ac.uk/id/eprint/1470590/
Content from Introduction to Natural Language Processing with spaCy
Last updated on 2024-04-05
Overview
Questions
- What is spaCy and what can I do with it?
- What are the steps in its English-language NLP pipeline?
- How can I get started with Python and spaCy?
Objectives
- Understand the components of spaCy’s text processing pipeline.
- Understand the fundamentals of Python syntax.
- Write simple Python code in Jupyter Notebook to run spaCy commands.
- Recognise different types of spaCy objects.
The (Very) Basics of Python
In progress…
Getting Started with spaCy
SpaCy is a free, open-source software library for Natural Language Processing that uses the Python and Cython programming languages. One of the advantages of using spaCy is that it automatically tokenizes text and is able to recognise sentences, grammatical attributes and dependencies, named entities, and token lemmas. It can also calculate text similarity and perform classification and training tasks.
Unlike other NLP libraries like NLTK and Flair, however, spaCy was developed for professional and commercial rather than educational purposes. Practically, this means it has a different learning curve because while some basic tasks are automatically built into its language pipelines, others are more idiosyncratic. You can find out more about its features by visiting spaCy’s documentation pages.
To begin, open Jupyter Notebook. You should already have installed spaCy as part of the lesson set-up instructions. If not, you need to do so before you import the package.
Then, you need to load one of spaCy’s language pipelines, in this case the large English pipeline, and assign it to the variable name nlp. Variables are assigned by typing the name of the new variable, the equals sign, and the name or pathway of the object you want to assign to that variable. From now on, spaCy will work with this pipeline every time you call nlp functions.
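A minimal sketch of these two steps, assuming the large English pipeline (en_core_web_lg) was installed during set-up:
PYTHON
import spacy

# load the large English pipeline and assign it to the variable nlp
nlp = spacy.load("en_core_web_lg")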
Now, let’s input some simple text to begin processing. For Jupyter, Python and spaCy to recognise what we input as structured text rather than code, a different variable name or an undefined string, we will need to assign it a variable name, e.g. text, and place the text itself into quotation marks. Either single or double quotation marks can be used for the same purpose, but consistency is key.
Once you have input your text, you need to create a doc variable which you will use to call spaCy’s nlp function on your text. It is good practice to use the print() function immediately after creating the doc (or indeed any new variable) to double-check that the output appears as you expect it to.
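Using the greeting shown in the output below, this might look like:
PYTHON
text = "Let's get started with spaCy :) ."
greeting_doc = nlp(text)
print(greeting_doc)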
As expected, the output reads
Let's get started with spaCy :) .
Given the importance of count-based approaches to machine reading discussed previously, you can calculate the length of your greeting_doc by using the len() function, which will output the number of tokens in your text doc. If you want to see a list of the tokens spaCy has identified and counted, you should create a new variable and use ‘for’ loops with list comprehension syntax to iterate through the greeting_doc and extract the tokens every time they are encountered in the text array:
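A minimal sketch of both steps:
PYTHON
# number of tokens spaCy has identified in the doc
print(len(greeting_doc))

# collect each token as it is encountered in the doc
greeting_tokens = [token for token in greeting_doc]
print(greeting_tokens)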
Challenge
Manually count what you would classify as tokens in your greeting_doc. Then, in Jupyter Notebook, use the len() function to check your answer. Do they match? If not, what might be an explanation?
Use the list comprehension method from above to create the greeting_tokens variable to see the tokens spaCy identified.
From this, we can see that spaCy recognises the smiley face emoji as a single token rather than two punctuation mark tokens. We can also see that it recognises the ’s as a single token dependency of the previous token, not as two separate tokens. This is pretty clever, and shows that spaCy already has a contextual view of the text loaded into it.
In part, this has to do with spaCy’s understanding of the tokens in the text as a specific sequence. SpaCy automatically indexes objects in an array based on their position so that they can be called by their individual index number, which is always an integer, and is presented in [] brackets.
It is important to note that indexes in Python begin with zero: the first object is assigned the index 0, the second object the index 1, and so on. You can specify a single index number or a range by using a colon between the numbers inside square brackets to extract them from your overall text array.
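For example, with the greeting_doc from above, the two outputs below can be produced along these lines:
PYTHON
print(greeting_doc[0])     # the single token at index 0
print(greeting_doc[2:5])   # the tokens at indices 2, 3 and 4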
Let
get started with
If you don’t want to specify a beginning or end index number, but want to go from or to there from a specified point, you can omit either the number before or after the colon. This could be useful if, for example, you had a set of texts and wanted to compare their beginnings or endings.
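For instance, slicing from index 3 to the end of the doc produces the output below:
PYTHON
print(greeting_doc[3:])    # from index 3 to the end of the doc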
started with spaCy :) .
Indexing in Python
Remember: if the length of your array is n, the first index is 0 and the last index is n-1.
In an index range, the output will be from the object indexed as the number before the colon up to but not including the number after the colon.
Recognizing Parts Of Speech
Being able to identify and extract specific parts of speech can be incredibly useful when working with large text corpora. Helpfully, spaCy’s NLP pipelines have been trained on large language datasets to be able to automatically, and quite reliably, recognise certain linguistic attributes such as parts of speech, grammatical functions, named entities, and so on.
For example, with the Boomkat corpus we explored in the previous episode, you might want to further analyse the descriptive words it contains and their sentiments, or isolate named entities such as record labels and place names to begin to construct a network.
Tasks such as these require understanding how spaCy parses text and classifies parts of speech, as well as using ‘for’ loops to iterate through text arrays to extract relevant information. Let us inspect our greeting_doc in this more systematic way to extract the different token attributes that form part of spaCy’s NLP pipeline. The words after the full stop in each expression below (e.g. token.pos_) are the attributes.
PYTHON
# print each token together with its main linguistic attributes
for token in greeting_doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
Let let VERB VB ROOT Xxx True False
's us PRON PRP nsubjpass 'x False True
get get AUX VB auxpass xxx True True
started start VERB VBN ccomp xxxx True False
with with ADP IN prep xxxx True True
spaCy spaCy PROPN NNP pobj xxxXx True False
:) :) PUNCT . punct :) False False
. . PUNCT . punct . False False
Here, every line of the output pertains to a single token in the greeting doc, and displays the token text, the lemmatized form of the token, the part of speech it represents, its tag, its dependency, and its letter case shape. Using Boolean True or False values, it also outputs whether the token is an alphanumeric character, and whether it is part of spaCy’s default stopwords list.
Some of the output text is intuitively understandable but, if not, you can always use the spacy.explain() function to find out more. For example:
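Here is a minimal sketch using the NNP tag from the output above, which returns the gloss shown underneath:
PYTHON
spacy.explain("NNP")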
'noun, proper singular'
From this, we can tell that spaCy has recognised its own name as a noun, not as an adjective meaning space-like. This is an accurate reading in this example not only because of the syntactic positioning of this word in our text but also because of the token shape, which includes the characteristic uppercase C that forms part of spaCy’s brand identity.
Let us now move on to a more complex text example. In Jupyter Notebook, create a new text variable which takes the whole of the first description from the ‘modern_classical_ambient_desc_only’ spreadsheet you downloaded during the last episode.
Note that Jupyter often auto-fills quotation marks and this can get confusing if your text also has quotation marks within it, so double-check that they match at the beginning and end of your text. Use the print() function to check that the text looks as you expected.
PYTHON
text = "'Clouds' is a perfectly measured suite of warm and hazy downbeats from Gigi Masin, Marco Sterk (Young Marco), and Johnny Nash recorded in the heart of Amsterdam's red light district over one weekend in April, 2014.It's all about louche vibes and glowing notes, gently absorbing and transducing the buzz of the streets outside the studio's open windows into eight elegantly reserved improvisations segueing between lush ambient drift, dub-wise solo piano pieces, and chiming late night jazz patter. In that sense, there's striking similarities between 'Clouds' and the recent Sky Walking album by Lawrence and co., but where they really go for the looseness, Gaussian Curve keep it supple yet tight, bordering on adult contemporary suaveness anointed with finest hash oil. Imbibe slowly."
print(text)
'Clouds' is a perfectly measured suite of warm and hazy downbeats from Gigi Masin, Marco Sterk (Young Marco), and Johnny Nash recorded in the heart of Amsterdam's red light district over one weekend in April, 2014.It's all about louche vibes and glowing notes, gently absorbing and transducing the buzz of the streets outside the studio's open windows into eight elegantly reserved improvisations segueing between lush ambient drift, dub-wise solo piano pieces, and chiming late night jazz patter. In that sense, there's striking similarities between 'Clouds' and the recent Sky Walking album by Lawrence and co., but where they really go for the looseness, Gaussian Curve keep it supple yet tight, bordering on adult contemporary suaveness anointed with finest hash oil. Imbibe slowly.
Then, turn the text into an nlp doc variable, as before, and check the length.
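A sketch of these two steps, reusing the text variable defined above:
PYTHON
doc = nlp(text)
print(len(doc))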
147
In Python, you can use ‘if’ and ‘if not’ statements to further refine your ‘for’ loop list comprehensions, which makes it possible to focus on particular token attributes while ignoring others. These ‘for … in … if/not …’ statements rely on standard logical conditions from maths, such as equals to (==), not equals to (!=), greater/less than (> / <), etc.
For example, with the Boomkat doc from above, we can create a list of tokens that excludes all punctuation marks:
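One possible sketch, using the is_punct token attribute (the list name no_punct is illustrative):
PYTHON
no_punct = [token for token in doc if not token.is_punct]
print(no_punct)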
[Clouds, is, a, perfectly, measured, suite, of, warm, and, hazy, downbeats, from, Gigi, Masin, Marco, Sterk, Young, Marco, and, Johnny, Nash, recorded, in, the, heart, of, Amsterdam, 's, red, light, district, over, one, weekend, in, April, 2014.It, 's, all, about, louche, vibes, and, glowing, notes, gently, absorbing, and, transducing, the, buzz, of, the, streets, outside, the, studio, 's, open, windows, into, eight, elegantly, reserved, improvisations, segueing, between, lush, ambient, drift, dub, wise, solo, piano, pieces, and, chiming, late, night, jazz, patter, In, that, sense, there, 's, striking, similarities, between, Clouds, and, the, recent, Sky, Walking, album, by, Lawrence, and, co., but, where, they, really, go, for, the, looseness, Gaussian, Curve, keep, it, supple, yet, tight, bordering, on, adult, contemporary, suaveness, anointed, with, finest, hash, oil, Imbibe, slowly]
You can also use equals to (==) and Boolean values of True or False to obtain the same result.
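The same illustrative list, this time testing is_punct against a Boolean value:
PYTHON
no_punct = [token for token in doc if token.is_punct == False]
print(no_punct)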
[Clouds, is, a, perfectly, measured, suite, of, warm, and, hazy, downbeats, from, Gigi, Masin, Marco, Sterk, Young, Marco, and, Johnny, Nash, recorded, in, the, heart, of, Amsterdam, 's, red, light, district, over, one, weekend, in, April, 2014.It, 's, all, about, louche, vibes, and, glowing, notes, gently, absorbing, and, transducing, the, buzz, of, the, streets, outside, the, studio, 's, open, windows, into, eight, elegantly, reserved, improvisations, segueing, between, lush, ambient, drift, dub, wise, solo, piano, pieces, and, chiming, late, night, jazz, patter, In, that, sense, there, 's, striking, similarities, between, Clouds, and, the, recent, Sky, Walking, album, by, Lawrence, and, co., but, where, they, really, go, for, the, looseness, Gaussian, Curve, keep, it, supple, yet, tight, bordering, on, adult, contemporary, suaveness, anointed, with, finest, hash, oil, Imbibe, slowly]
Furthermore, you can use or and and conditions to refine your list comprehension. For example, you could specify that you want to extract only nouns and proper nouns from your doc.
PYTHON
nouns_and_proper_nouns = [token
                          for token in doc
                          if token.pos_ == "NOUN" or token.pos_ == "PROPN"]
print(nouns_and_proper_nouns)
[Clouds, suite, downbeats, Gigi, Masin, Marco, Sterk, Young, Marco, Johnny, Nash, heart, Amsterdam, light, district, weekend, April, 2014.It, vibes, notes, buzz, streets, studio, windows, improvisations, drift, dub, piano, pieces, night, jazz, patter, sense, similarities, Clouds, Sky, Walking, album, Lawrence, co., looseness, Gaussian, Curve, adult, suaveness, hash, oil]
Challenge
Use for … if … syntax to create a list of adjectives and adverbs with the variable name adj_and_adv from your doc.
It is a common mistake, but if you use and in the code you will get a blank [] result, because the and specifies that an individual token should be both an adjective and an adverb. This is impossible in this instance, hence the blank result, but could have created confusion with the noun/proper noun example above. So, you should use or to avoid this and print only tokens that are either adjectives or adverbs.
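A sketch of the or version:
PYTHON
adj_and_adv = [token for token in doc
               if token.pos_ == "ADJ" or token.pos_ == "ADV"]
print(adj_and_adv)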
Callout
Like Voyant, spaCy has sets of stopwords for the languages it supports that can be called and modified. For this, you have to import the stop_words module, which contains the English language STOP_WORDS set, and create a stop_words variable in your own project.
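One way this might look (printing the size of the set as a quick check):
PYTHON
from spacy.lang.en.stop_words import STOP_WORDS

stop_words = STOP_WORDS
print(len(stop_words))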
Use this link to find more information on how to edit spaCy stopword sets.
Named Entity Recognition
Another useful feature built into spaCy’s nlp pipeline is Named Entity Recognition (NER). This recognises and classifies proper nouns like place names, countries, currencies, political organizations, and even (some) works of art. As with other tools, we cannot expect NER to be 100% reliable, especially when working with musical sublanguages, but it can still save time and/or streamline manual coding and tagging exercises.
To see a list of the NER labels spaCy uses, you can type the following code (and remember to use the spacy.explain() function if you need further clarification on any abbreviations):
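One way to do this is to ask the pipeline’s NER component for its labels, which returns the tuple shown below:
PYTHON
print(nlp.get_pipe("ner").labels)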
('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')
It is important to note that named entities are a different type of spaCy object than the token or doc objects we have been working with so far. Predictably, and presumably unrelated to the talking trees in Lord of the Rings, they are called ents. We can create a list of named entities in our doc by typing:
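A minimal sketch, using the doc’s ents attribute:
PYTHON
print(doc.ents)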
(Gigi Masin, Marco Sterk, Marco, Johnny Nash, Amsterdam, one weekend, April, late night, Sky Walking, Lawrence and co., Gaussian Curve)
The ent objects have .label_ and .text attributes, which tell us, respectively, which of the NER labels printed above a given entity has been assigned, and the text of the tokens that label was assigned to. If you are interested in finding out the label for a specific named entity, you can use the list comprehension method from above. For example:
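A minimal sketch, looking up the label assigned to ‘Gaussian Curve’:
PYTHON
print([ent.label_ for ent in doc.ents if ent.text == "Gaussian Curve"])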
['ORG']
Challenge
Using the same method as when generating annotations of each token in the greeting_doc earlier:
- (a) Create a list of ent texts with ent labels for each named entity in the doc.
- (b) Then, extract only the entities representing people or organizations.
Note that .label (without the underscore) outputs the numerical index of the label type rather than the text string name of the label, which could be less useful for intuitive readability when you are exploring your document properties.
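A sketch of one possible solution (the list name people_and_orgs is illustrative), whose output is shown below:
PYTHON
# (a) each named entity together with its label
for ent in doc.ents:
    print(ent.text, ent.label_)

# (b) only the entities labelled as people or organisations
people_and_orgs = [ent.text for ent in doc.ents
                   if ent.label_ == "PERSON" or ent.label_ == "ORG"]
print(people_and_orgs)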
Gigi Masin PERSON
Marco Sterk PERSON
Marco PERSON
Johnny Nash PERSON
Amsterdam GPE
one weekend DATE
April DATE
late night TIME
Sky Walking PRODUCT
Lawrence and co. ORG
Gaussian Curve ORG
['Gigi Masin', 'Marco Sterk', 'Marco', 'Johnny Nash', 'Lawrence and co.', 'Gaussian Curve']
The ent.text and ent.label_ ‘for’ loop output of part (a) of the exercise is interesting in itself but cannot be worked on further because it was not assigned as a new variable. This can be done with the following syntax:
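A minimal sketch of such an assignment, using a hypothetical variable name ent_text_labels:
PYTHON
# hypothetical variable name holding the (text, label) pairs from part (a)
ent_text_labels = [(ent.text, ent.label_) for ent in doc.ents]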
However, if you try to do further NLP tasks on this variable, such as a simple token count, you will run into issues:
0
The reason that the len() function did not work as expected is that the type of object this new variable constitutes is not a doc object to which spaCy’s NLP functions (such as token counting) apply. You can inspect the type of object you are dealing with by passing the variable to the type() function inside print():
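For instance, with the hypothetical ent_text_labels variable from the sketch above:
PYTHON
print(type(ent_text_labels))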
<class 'list'>
On the other hand, if you inspect the type of our original doc:
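A minimal sketch:
PYTHON
print(type(doc))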
<class 'spacy.tokens.doc.Doc'>
List objects are not tokenized or indexed in the same way that doc objects are, which means you cannot perform further NLP tasks on them in their current state. Nevertheless, they are often useful data sources that you will need to know how to manipulate to be able to quantify and analyse their contents in other ways.
How to do this will be the subject of the next episode.
Key Points
- SpaCy is a free and open-source Python library for Natural Language Processing which is based on pre-trained processing pipelines.
- SpaCy uses language pipelines which automatically tokenize and tag parts of speech, with powerful recognition and prediction features built-in.
- You can use simple list comprehension syntax to extract information that is relevant to your research question.
- The ability to call spaCy’s NLP functions depends on the type of data object you are working with.
Content from Organising Data with Pandas
Last updated on 2024-04-07
Overview
Questions
- What is Pandas and what is it for?
- What is a dataframe?
- How can I use Python and Pandas to create dataframes?
- How can I organise the data presented in a dataframe?
Objectives
- Use the Pandas package in Python and the dframCy module for spaCy to organise data in a tabular format.
- Manipulate different types of objects for organisation and analysis.
Challenge
Using the first item in the ‘modern_classical_ambient_desc_only’ dataset that you worked on in the previous episode, create a dataframe that includes token.text and token.pos_ values for all of the nouns and proper nouns contained in that text. Make your final output the .shape attribute of the dataframe, and check this against the results in the last episode.
Then, write a .csv file of this dataframe, titled ‘clouds_nouns.csv’.
You will need to create a dictionary of annotations and then create a dataframe and filter out columns and rows.
PYTHON
import pandas as pd

desc_doc = nlp("'Clouds' is a perfectly measured suite of warm and hazy downbeats from Gigi Masin, Marco Sterk (Young Marco), and Johnny Nash recorded in the heart of Amsterdam's red light district over one weekend in April, 2014.It's all about louche vibes and glowing notes, gently absorbing and transducing the buzz of the streets outside the studio's open windows into eight elegantly reserved improvisations segueing between lush ambient drift, dub-wise solo piano pieces, and chiming late night jazz patter. In that sense, there's striking similarities between 'Clouds' and the recent Sky Walking album by Lawrence and co., but where they really go for the looseness, Gaussian Curve keep it supple yet tight, bordering on adult contemporary suaveness anointed with finest hash oil. Imbibe slowly.")

# dictionary of annotations for each token in the doc
desc_doc_dict = [{
    "text": token.text,
    "pos": token.pos_,
    "tag": token.tag_,
    "dep": token.dep_,
    "head": token.head}
    for token in desc_doc]
desc_doc_dict

# build a dataframe, keep only the text and pos columns, then filter to nouns and proper nouns
desc_df = pd.DataFrame.from_records(desc_doc_dict)
desc_pos_df = desc_df.drop(columns=["tag", "dep", "head"])
desc_nouns_df = desc_pos_df[desc_pos_df["pos"].isin(["NOUN", "PROPN"])]
desc_nouns_df
text pos
0 Clouds NOUN
6 suite NOUN
11 downbeats NOUN
13 Gigi PROPN
14 Masin PROPN
16 Marco PROPN
17 Sterk PROPN
19 Young PROPN
20 Marco PROPN
24 Johnny PROPN
25 Nash PROPN
29 heart NOUN
31 Amsterdam PROPN
34 light NOUN
35 district NOUN
38 weekend NOUN
40 April PROPN
42 2014.It PROPN
47 vibes NOUN
50 notes NOUN
57 buzz NOUN
60 streets NOUN
63 studio NOUN
66 windows NOUN
71 improvisations NOUN
76 drift NOUN
78 dub NOUN
82 piano NOUN
83 pieces NOUN
88 night NOUN
89 jazz NOUN
90 patter NOUN
94 sense NOUN
99 similarities NOUN
102 Clouds PROPN
107 Sky PROPN
108 Walking PROPN
109 album NOUN
111 Lawrence PROPN
113 co. PROPN
122 looseness NOUN
124 Gaussian PROPN
125 Curve NOUN
134 adult NOUN
136 suaveness NOUN
140 hash NOUN
141 oil NOUN
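Finally, a sketch of the last two steps of the challenge, writing the .csv file and returning the .shape attribute (whose value appears below):
PYTHON
# export the filtered dataframe for further manual coding
desc_nouns_df.to_csv("clouds_nouns.csv")

# final output: (rows, columns) of the dataframe
desc_nouns_df.shape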
(47, 2)
Key Points
- Pandas is a Python library for data manipulation and analysis.
- You can use Pandas to create dataframes, i.e. data structures consisting of rows and columns, from your spaCy doc and list objects.
- You can add or remove columns and rows based on their labels or values and quantitatively sort and analyse the contents.
- You can also use Pandas to write and export .csv and .tsv files for further manual coding or other manipulation.
Content from Introduction to Word Embeddings
Last updated on 2024-04-07
Overview
Questions
- What are word embeddings and what are they for?
- How can I create embeddings using spaCy?
- How can I visualise the results?
Objectives
- Understand the principles behind word embeddings.
- Use Python, spaCy and Pandas to create word embeddings for a corpus of musical terms.
- Use Google TensorFlow’s Embedding Projector tool to visualise and interpret the results.
The dataset for this episode, containing terms from Boomkat’s ‘Grime / FWD’ and ‘Industrial / Wave / Electro’ subgenre corpora, can be accessed via the ‘Episode 5’ Zenodo repository.
Challenge
You may have noticed that the merge_embeddings_genre_df has more rows than the unduplicated grime_ind_terms_df we created earlier. Why could that be? Can you amend the code used to create the grime_ind_embeddings_df to make the number of rows match the number of unduplicated terms?
The terms in our lists are word types but are not necessarily single tokens. As you will recall from our first encounter with spaCy, it treats possessives like ’s as separate tokens. This means that, in the process of creating a list of doc objects of terms based on their tokens, spaCy’s nlp function has added rows (and vectors) for these extra tokens.
We can override this by specifying that only the first token in each doc is necessary to represent the term and that others should be discounted.
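A minimal, hypothetical sketch of this fix: the names terms and term_docs below are stand-ins for the variables created earlier in the episode, and the example terms are invented.
PYTHON
import spacy

nlp = spacy.load("en_core_web_lg")

# stand-in examples for the unduplicated genre terms
terms = ["bassweight", "dancefloor's"]
term_docs = [nlp(term) for term in terms]

# keep only the first token's vector to represent each term,
# so extra tokens such as possessive 's do not add extra rows
term_vectors = [term_doc[0].vector for term_doc in term_docs]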
Discussion
In breakout rooms of pairs or groups of three, go to the website for Google TensorFlow’s Embedding Projector Tool and load the grime_industrial_embedding data and metadata files. Play around with the settings and the projection and use the Etherpad to write down questions and observations to share with the whole group.
Use the ‘Color By’ function to differentiate terms by genre. This will assign a different colour to Grime versus Industrial terms and allow you to visually identify semantic themes that are more prevalent in one or the other genre.
Enable the 3D labels mode to see the terms themselves rather than points.
Key Points
- Word embeddings rely on a conception of language (as a finite vocabulary) as a totality that occupies multi-dimensional space.
- Word embeddings are sets of vectors that position a given word in this multi-dimensional space, with each dimension orthogonal to every other dimension.
- Embeddings are trained by unsupervised machine learning on large text corpora, which looks for collocations between words, syllables and characters, and builds predictions as to the probability of given n-grams occurring in proximity. This presupposes that words that are semantically close (i.e. have similar meanings or usages) will be close together in the multi-dimensional space, while words that are semantically different will be far apart.
- There are many established statistical methods for manipulating and interpreting word embeddings.
- Google TensorFlow’s Word Embedding Projector has built-in functions for dimensionality reduction which create intuitively readable visualisations of embeddings in 2- or 3-dimensional space.