Content from Introduction: Analysing Web-Based Musical Discourse
Last updated on 2024-04-07
Overview
Questions
- What is corpus linguistics?
- What can I do using corpus linguistic methods?
- How does this method differ from analogue ‘close reading’ methods in musicology?
Objectives
- Recognise the disciplinary objectives of corpus linguistics and natural language processing.
- Understand the aims of quantitative approaches to understanding texts.
Discussion
In groups of 3–4 people, please take 10 minutes to discuss:
– What do you normally read for when you read text?
– What do you read for when you read text about music?
– How does the nature of online text change that?
– How might working with large amounts of text change how we read it or what we read into it?
Please nominate one member of your group to take notes in the Etherpad and be ready to share your thoughts with the whole group once time is up.
Key Points
- Corpus linguistics is the study of language as part of a body of text, wherein language appears in its “natural” context.
- Corpus studies involve the compilation and analysis of collections of text (i.e. the body) which afford insights into the nature, structure, and use of language in this context.
- Natural language processing (NLP) is a field that formulates techniques for understanding contexts and rules of language function.
- Corpus methods can be used to determine the underlying patterns and contextual associations of words and phrases in a body of text, amongst other uses. This is known as a “distant reading” approach to texts, focused on identifying and quantifying patterns across large datasets.
- The insights yielded from NLP approaches to text can supplement more traditional techniques of (critical) discourse analysis.
Content from Exploring Text-Based Corpora with Voyant
Last updated on 2024-04-07
Overview
Questions
- What is Voyant and what is it used for?
- What are the principles underpinning its dashboard tools?
- What kinds of insights do they yield?
Objectives
- Define key terms relating to Natural Language Processing.
- Load data into Voyant and conduct inductive analysis.
- Identify affordances and limitations of a frequency-based approach to corpus analysis.
Introduction to Voyant
Voyant is a simple, powerful and user-friendly open-source reading and analysis environment for digital texts. It was created by Stéfan Sinclair and is now maintained by Geoffrey Rockwell, Andrew MacDonald and Cecily Raynor at McGill University and the University of Alberta in Canada. It is browser-based and allows you to upload documents or copy and paste text directly into the interface, which Voyant then automatically analyses according to some core Natural Language Processing (NLP) principles.
Voyant is a useful tool for those new to NLP because its dashboard provides an instant and customisable, synchronic overview of many facets of the corpus uploaded while keeping the workings ‘under the hood’. This allows users to explore the linguistic features of their corpus intuitively, before getting to grips with the linear, incremental workflow of NLP libraries such as the Natural Language Toolkit (NLTK), spaCy and Flair.
Callout
Voyant does not require any prior programming knowledge and enables inductive, corpus-driven observations to be made with relative ease. This allows you to refine your research questions in response to the data you are working with, and then move on to using other tools in a more targeted way.
Core NLP Terminology
Before we start using Voyant to explore our text corpora in more detail, it is useful to define some core NLP concepts so that you can understand what Voyant is looking for as it reads text data.
First of all, Voyant needs to parse the text as a sequence of characters (or string) and identify meaningful linguistic units such as words, numbers and punctuation.
- A token is a defined unit within a string, such as an individual appearance of a word or number, usually separated from other tokens by whitespace.
- A type, on the other hand, is a unique word form, which may appear many times in a corpus.
- A lemma is the root form of a derived (inflected) word, e.g. ‘music’ is the lemma of the types ‘musically’, ‘musician’, ‘musicking’, etc.
- A hapax legomenon (or simply hapax) is a type that appears only once in a corpus.
- A concordance is a generated list of all tokens that appear in a digital corpus, often displayed with their immediate contexts.
The relationship between tokens and types (or type-token ratio), i.e. how many unique word forms there are in a corpus versus how many times they are used, is a measure of the lexical diversity of the corpus.
Besides parsing a corpus for tokens and types, Voyant also analyses the relationships between tokens and their neighbours, in line with the importance placed on knowing the company words keep in computational linguistics, discussed in the last episode.
- An n-gram is a contiguous or consecutive sequence of tokens in a text. For example a bigram is a pair of consecutive written units, i.e. characters, syllables or words; a trigram is a sequence of three consecutive written units, and so on. Any number of consecutive units can be specified for an n-gram and it functions somewhat like a sample-rate for extracting information from the text that could be important for predicting aspects of meaning or function.
- A collocation is typically a bigram, often at word level, which occurs in a text at a rate greater than chance. For example, type-pairs like ‘red wine’ and ‘string quartet’ are collocations in English.
The original and most common application of automated text analysis uses computers to count how often certain words occur in a given text. The analytical strategy employed in a frequency-based approach is relatively simple: count the number of occurrences of a specific word token and then normalise this according to how many words there are in the text overall to obtain relative term frequency.
Despite its simplicity, this approach is extremely powerful and versatile. Besides relative term frequency, another common technique used for making frequency counts reflect data more meaningfully is term frequency-inverse document frequency (TF-IDF), especially if working with corpora across multiple documents. This helps to minimise the weighting of frequently occurring but perhaps less significant terms (such as ‘the’, ‘a’, etc.) while making less frequent terms have a higher impact.
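To make the arithmetic concrete, here is a small, purely illustrative Python sketch (independent of Voyant, using made-up toy sentences) of relative term frequency and a simple inverse document frequency weighting:
PYTHON
import math

# toy corpus: two invented one-sentence 'documents'
documents = [
    "the string quartet opens the album",
    "the bassline anchors the track",
]

term = "the"
for document in documents:
    tokens = document.split()
    # raw count of the term, normalised by the total number of tokens in the document
    relative_frequency = tokens.count(term) / len(tokens)
    print(document, "->", round(relative_frequency, 2))

# inverse document frequency: a term that appears in every document scores 0,
# so ubiquitous words like 'the' are downweighted
documents_with_term = sum(1 for document in documents if term in document.split())
idf = math.log(len(documents) / documents_with_term)
print("idf of", term, "=", idf)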
Callout
Frequency-based text analysis is often traced back to Father Roberto Busa, a Jesuit Priest who worked with IBM in the 1940s to manually index 11 million medieval Latin words from the writings of St. Thomas Aquinas, count each appearance of the word ‘in’ and look at its collocations in order to explore the concept of ‘presence’ in his work. This somewhat quaint story has come to be venerated as the origin myth of the field of humanities computing as a whole, and has more recently been the subject of characteristically humanistic critique.1
In music studies, the use of frequency-based text analysis methods is growing, but not yet commonplace. A potential explanation for this reticence was given in a 2015 ISMIR paper by Charles Inskip and Frans Wiering, which itself uses frequency-based methods to reach its conclusions. Inskip and Wiering analysed responses to a survey on musicologists’ attitudes towards using technology in their work and found that, while music scholars were generally enthusiastic about incorporating software and other technologies into their research, data literacy represented a barrier that they perceived as frustrating.2
With these core principles in mind, we can now have a go at using Voyant to explore the Boomkat corpus.
Discussion
In pairs or groups of three, choose one of the subgenre datasets from the ‘Episode 2’ Zenodo repository which you should already have downloaded to your computer. Then, go to the Voyant website and click the ‘Upload’ button to load the data to Voyant’s server. This may take a few moments.
Once the data has loaded, take 5 minutes to look around the various panes of the Dashboard and think about:
- What do the individual panels do? Can you describe it using the NLP terms we defined earlier?
- Which tool or finding grabs your attention? Why?
- Are there any problems with how the data appears that prevent a more meaningful analysis?
- Are there any features that you do not understand?
Discuss these with your group and jot down some observations in the Etherpad to share with everyone else.
Fine-Tuning Parameters for Corpus Sensitivity
From this first pass at our corpus, we can see that Voyant has recognised punctuation mark tokens and omitted them from frequency counts, the word cloud, and other analyses. This is useful for readability and can almost go unnoticed. However, it should be remembered that punctuation marks are tokens and the decision to remove them is an intentional one, which other NLP libraries do not perform automatically.
Because Voyant has so far only performed a raw frequency count of the corpus, the most common terms that Voyant has identified are, perhaps unsurprisingly, quite generic (articles, prepositions, etc.). They do not give a sense of what is distinctive about the corpus, because it could be assumed that a majority of texts would also make regular use of these types of words. Of course, depending on your research question, this could be precisely the type of information you are after - Father Busa’s project, after all, was all about the seemingly generic word ‘in’.
If you already have some idea about the kind of linguistic devices you are looking for, either because of your initial research question or through what you have observed from this initial analysis, it is possible to get Voyant to filter out certain terms so that you can fine-tune your findings. This involves creating a list of stopwords, which is a common technique in corpus linguistics and NLP more generally.
Stopwords
A stopword is a word (or any token) that is automatically omitted from a computer-generated concordance. Many NLP libraries have automatic lists of stopwords specific to individual languages, and these can often be edited to suit your specific needs, or you can create your own.
In Voyant, you can inspect the stopword lists by clicking the ‘settings’ icon on the top right corner of the ‘Terms’ tool. As can be seen, there are lists for the different languages that Voyant supports, as well as an ‘auto-detect’ list. It can be assumed that the removal of punctuation marks from Voyant’s reading of our corpus is a feature of its ‘auto-detect’ stopword list.
Callout
Remember that data cleaning is an iterative process, not simply an initial step. We have been working with a dataset that has already been converted from .csv to .xlsx, UTF-8 encoded and which contains one uniform data type within it. Consideration of the type of data you are working with and how it is presented for analysis is important in pre-processing but does not end there.
Relatedly, stopword lists and any other filters should be thought through carefully in order to remain sensitive to the discursive priorities of the types of texts being analysed while also being used to streamline the dataset and remove ‘noise’. This, too, should be an iterative process and can involve trial and error as you get a feel for the shape and contents of your corpus.
Key Points
- The Voyant website allows you to dive into your corpus right away by uploading a document (or several) to its server.
- Most of Voyant’s dashboard tools, such as Cirrus, TermsBerry and Contexts, rely on token frequency counts and collocations.
- These types of tools can yield insights into the lexical diversity of the corpus, which terms in a given corpus are most or least prominent, and the other words with which they frequently appear. This can highlight patterns or trends that can be analysed further with other tools.
Jacob, Arun. 2021. ‘Punching Holes in the International Busa Machine Narrative’. In Kim, Dorothy and Koh, Adeline, eds. Alternative Historiographies of the Digital Humanities. Punctum: 121–139.
Inskip, Charles and Wiering, Frans. 2015. ‘In Their Own Words: Using Text to Identify Musicologists’ Attitudes Towards Technology’. Presented at: 16th International Society for Music Information Retrieval Conference, Malaga, Spain. https://discovery.ucl.ac.uk/id/eprint/1470590/
Content from Introduction to Natural Language Processing with spaCy
Last updated on 2024-04-05
Overview
Questions
- What is spaCy and what can I do with it?
- What are the steps in its English-language NLP pipeline?
- How can I get started with Python and spaCy?
Objectives
- Understand the components of spaCy’s text processing pipeline.
- Understand the fundamentals of Python syntax.
- Write simple Python code in Jupyter Notebook to run spaCy commands.
- Recognise different types of spaCy objects.
The (Very) Basics of Python
In progress…
Getting Started with spaCy
SpaCy is a free, open-source software library for Natural Language Processing that uses the Python and Cython programming languages. One of the advantages of using spaCy is that it automatically tokenizes text and is able to recognise sentences, grammatical attributes and dependencies, named entities, and token lemmas. It can also calculate text similarity and perform classification and training tasks.
Unlike other NLP libraries like NLTK and Flair, however, spaCy was developed for professional and commercial rather than educational purposes. Practically, this means it has a different learning curve because while some basic tasks are automatically built into its language pipelines, others are more idiosyncratic. You can find out more about its features by visiting spaCy’s documentation pages.
To begin, open Jupyter Notebook. You should already have installed spaCy as part of the lesson set-up instructions. If not, you need to do so before you import the package.
Then, you need to load one of spaCy’s language pipelines, in this case the large English pipeline, and assign it to the variable name nlp. Variables are assigned by typing the name of the new variable, the equals sign, and the name or pathway of the object you want to assign to that variable. From now on, spaCy will work with this pipeline every time you call nlp functions.
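A minimal sketch of these two steps, assuming the large English pipeline (en_core_web_lg) was installed during set-up:
PYTHON
import spacy

# load the large English pipeline and assign it to the variable nlp
nlp = spacy.load("en_core_web_lg")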
Now, let’s input some simple text to begin processing. For Jupyter, Python and spaCy to recognise what we input as structured text rather than code, a different variable name or an undefined string, we will need to assign it a variable name, e.g. text, and place the text itself into quotation marks. Either single or double quotation marks can be used for the same purpose, but consistency is key.
Once you have input your text, you need to create a doc variable which you will use to call spaCy’s nlp function on your text. It is good practice to use the print() function immediately after creating the doc (or indeed any new variable) to double-check that the output appears as you expect it to.
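Using the greeting shown in the output below, this might look like:
PYTHON
text = "Let's get started with spaCy :) ."
greeting_doc = nlp(text)
print(greeting_doc)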
As expected, the output reads
Let's get started with spaCy :) .
Given the importance of count-based approaches to machine reading discussed previously, you can calculate the length of your greeting_doc by using the len() function, which will output the number of tokens in your text doc. If you want to see a list of the tokens spaCy has identified and counted, you should create a new variable and use ‘for’ loops with list comprehension syntax to iterate through the greeting_doc and extract the tokens every time they are encountered in the text array:
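A minimal sketch of both steps:
PYTHON
# number of tokens spaCy has identified in the doc
print(len(greeting_doc))

# collect each token as it is encountered in the doc
greeting_tokens = [token for token in greeting_doc]
print(greeting_tokens)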
Challenge
Manually count what you would classify as tokens in your greeting_doc. Then, in Jupyter Notebook, use the len() function to check your answer. Do they match? If not, what might be an explanation?
Use the list comprehension method from above to create the greeting_tokens variable to see the tokens spaCy identified.
From this, we can see that spaCy recognises the smiley face emoji as a single token rather than two punctuation mark tokens. We can also see that it recognises the ’s as a single token dependency of the previous token, not as two separate tokens. This is pretty clever, and shows that spaCy already has a contextual view of the text loaded into it.
In part, this has to do with spaCy’s understanding of the tokens in the text as a specific sequence. SpaCy automatically indexes objects in an array based on their position so that they can be called by their individual index number, which is always an integer, and is presented in [] brackets.
It is important to note that indexes in Python begin with zero: the first object is assigned the index 0, the second object the index 1, and so on. You can specify a single index number or a range by using a colon between the numbers inside square brackets to extract them from your overall text array.
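For example, with the greeting_doc from above, the two outputs below can be produced along these lines:
PYTHON
print(greeting_doc[0])     # the single token at index 0
print(greeting_doc[2:5])   # the tokens at indices 2, 3 and 4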
Let
get started with
If you don’t want to specify a beginning or end index number, but want to go from or to there from a specified point, you can omit either the number before or after the colon. This could be useful if, for example, you had a set of texts and wanted to compare their beginnings or endings.
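For instance, slicing from index 3 to the end of the doc produces the output below:
PYTHON
print(greeting_doc[3:])    # from index 3 to the end of the doc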
started with spaCy :) .
Indexing in Python
Remember: if the length of your array is n, the first index is 0 and the last index is n-1.
In an index range, the output will be from the object indexed as the number before the colon up to but not including the number after the colon.
Recognizing Parts Of Speech
Being able to identify and extract specific parts of speech can be incredibly useful when working with large text corpora. Helpfully, spaCy’s NLP pipelines have been trained on large language datasets to be able to automatically, and quite reliably, recognise certain linguistic attributes such as parts of speech, grammatical functions, named entities, and so on.
For example, with the Boomkat corpus we explored in the previous episode, you might want to further analyse the descriptive words it contains and their sentiments, or isolate named entities such as record labels and place names to begin to construct a network.
Tasks such as these require understanding how spaCy parses text and classifies parts of speech, as well as using ‘for’ loops to iterate through text arrays to extract relevant information. Let us inspect our greeting_doc in this more systematic way to extract the different token attributes that form part of spaCy’s NLP pipeline. The words after the full stop in each expression below (e.g. token.pos_) are the attributes.
PYTHON
# print each token together with its main linguistic attributes
for token in greeting_doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
Let let VERB VB ROOT Xxx True False
's us PRON PRP nsubjpass 'x False True
get get AUX VB auxpass xxx True True
started start VERB VBN ccomp xxxx True False
with with ADP IN prep xxxx True True
spaCy spaCy PROPN NNP pobj xxxXx True False
:) :) PUNCT . punct :) False False
. . PUNCT . punct . False False
Here, every line of the output pertains to a single token in the greeting doc, and displays the token text, the lemmatized form of the token, the part of speech it represents, its tag, its dependency, and its letter case shape. Using Boolean True or False values, it also outputs whether the token is an alphanumeric character, and whether it is part of spaCy’s default stopwords list.
Some of the output text is intuitively understandable but, if not, you can always use the spacy.explain() function to find out more. For example:
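Here is a minimal sketch using the NNP tag from the output above, which returns the gloss shown underneath:
PYTHON
spacy.explain("NNP")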
'noun, proper singular'
From this, we can tell that spaCy has recognised its own name as a noun, not as an adjective meaning space-like. This is an accurate reading in this example not only because of the syntactic positioning of this word in our text but also because of the token shape, which includes the characteristic uppercase C that forms part of spaCy’s brand identity.
Let us now move on to a more complex text example. In Jupyter Notebook, create a new text variable which takes the whole of the first description from the ‘modern_classical_ambient_desc_only’ spreadsheet you downloaded during the last episode.
Note that Jupyter often auto-fills quotation marks and this can get confusing if your text also has quotation marks within it, so double-check that they match at the beginning and end of your text. Use the print() function to check that the text looks as you expected.
PYTHON
text = "'Clouds' is a perfectly measured suite of warm and hazy downbeats from Gigi Masin, Marco Sterk (Young Marco), and Johnny Nash recorded in the heart of Amsterdam's red light district over one weekend in April, 2014.It's all about louche vibes and glowing notes, gently absorbing and transducing the buzz of the streets outside the studio's open windows into eight elegantly reserved improvisations segueing between lush ambient drift, dub-wise solo piano pieces, and chiming late night jazz patter. In that sense, there's striking similarities between 'Clouds' and the recent Sky Walking album by Lawrence and co., but where they really go for the looseness, Gaussian Curve keep it supple yet tight, bordering on adult contemporary suaveness anointed with finest hash oil. Imbibe slowly."
print(text)
'Clouds' is a perfectly measured suite of warm and hazy downbeats from Gigi Masin, Marco Sterk (Young Marco), and Johnny Nash recorded in the heart of Amsterdam's red light district over one weekend in April, 2014.It's all about louche vibes and glowing notes, gently absorbing and transducing the buzz of the streets outside the studio's open windows into eight elegantly reserved improvisations segueing between lush ambient drift, dub-wise solo piano pieces, and chiming late night jazz patter. In that sense, there's striking similarities between 'Clouds' and the recent Sky Walking album by Lawrence and co., but where they really go for the looseness, Gaussian Curve keep it supple yet tight, bordering on adult contemporary suaveness anointed with finest hash oil. Imbibe slowly.
Then, turn the text into an nlp doc variable, as before, and check the length.
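A sketch of these two steps, reusing the text variable defined above:
PYTHON
doc = nlp(text)
print(len(doc))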
147
In Python, you can use ‘if’ and ‘if not’ statements to further refine your ‘for’ loop list comprehensions, which makes it possible to focus on particular token attributes while ignoring others. These ‘for … in … if/not …’ statements rely on standard logical conditions from maths, such as equals to (==), not equals to (!=), greater/less than (> / <), etc.
For example, with the Boomkat doc from above, we can create a list of tokens that excludes all punctuation marks:
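One possible sketch, using the is_punct token attribute (the list name no_punct is illustrative):
PYTHON
no_punct = [token for token in doc if not token.is_punct]
print(no_punct)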
[Clouds, is, a, perfectly, measured, suite, of, warm, and, hazy, downbeats, from, Gigi, Masin, Marco, Sterk, Young, Marco, and, Johnny, Nash, recorded, in, the, heart, of, Amsterdam, 's, red, light, district, over, one, weekend, in, April, 2014.It, 's, all, about, louche, vibes, and, glowing, notes, gently, absorbing, and, transducing, the, buzz, of, the, streets, outside, the, studio, 's, open, windows, into, eight, elegantly, reserved, improvisations, segueing, between, lush, ambient, drift, dub, wise, solo, piano, pieces, and, chiming, late, night, jazz, patter, In, that, sense, there, 's, striking, similarities, between, Clouds, and, the, recent, Sky, Walking, album, by, Lawrence, and, co., but, where, they, really, go, for, the, looseness, Gaussian, Curve, keep, it, supple, yet, tight, bordering, on, adult, contemporary, suaveness, anointed, with, finest, hash, oil, Imbibe, slowly]
You can also use equals to (==) and Boolean values of True or False to obtain the same result.
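The same illustrative list, this time testing is_punct against a Boolean value:
PYTHON
no_punct = [token for token in doc if token.is_punct == False]
print(no_punct)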
[Clouds, is, a, perfectly, measured, suite, of, warm, and, hazy, downbeats, from, Gigi, Masin, Marco, Sterk, Young, Marco, and, Johnny, Nash, recorded, in, the, heart, of, Amsterdam, 's, red, light, district, over, one, weekend, in, April, 2014.It, 's, all, about, louche, vibes, and, glowing, notes, gently, absorbing, and, transducing, the, buzz, of, the, streets, outside, the, studio, 's, open, windows, into, eight, elegantly, reserved, improvisations, segueing, between, lush, ambient, drift, dub, wise, solo, piano, pieces, and, chiming, late, night, jazz, patter, In, that, sense, there, 's, striking, similarities, between, Clouds, and, the, recent, Sky, Walking, album, by, Lawrence, and, co., but, where, they, really, go, for, the, looseness, Gaussian, Curve, keep, it, supple, yet, tight, bordering, on, adult, contemporary, suaveness, anointed, with, finest, hash, oil, Imbibe, slowly]
Furthermore, you can use or and and conditions to refine your list comprehension. For example, you could specify that you want to extract only nouns and proper nouns from your doc.
PYTHON
nouns_and_proper_nouns = [token
                          for token in doc
                          if token.pos_ == "NOUN" or token.pos_ == "PROPN"]
print(nouns_and_proper_nouns)
[Clouds, suite, downbeats, Gigi, Masin, Marco, Sterk, Young, Marco, Johnny, Nash, heart, Amsterdam, light, district, weekend, April, 2014.It, vibes, notes, buzz, streets, studio, windows, improvisations, drift, dub, piano, pieces, night, jazz, patter, sense, similarities, Clouds, Sky, Walking, album, Lawrence, co., looseness, Gaussian, Curve, adult, suaveness, hash, oil]
Challenge
Use for … if … syntax to create a list of adjectives and adverbs with the variable name adj_and_adv from your doc.
It is a common mistake, but if you use and in the code you will get a blank [] result, because the and specifies that an individual token should be both an adjective and an adverb. This is impossible in this instance, hence the blank result, but could have created confusion with the noun/proper noun example above. So, you should use or to avoid this and print only tokens that are either adjectives or adverbs.
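A sketch of the or version:
PYTHON
adj_and_adv = [token for token in doc
               if token.pos_ == "ADJ" or token.pos_ == "ADV"]
print(adj_and_adv)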
Callout
Like Voyant, spaCy has sets of stopwords for the languages it supports that can be called and modified. For this, you have to import the stop_words module, which contains the English language STOP_WORDS set, and create a stop_words variable in your own project.
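One way this might look (printing the size of the set as a quick check):
PYTHON
from spacy.lang.en.stop_words import STOP_WORDS

stop_words = STOP_WORDS
print(len(stop_words))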
Use this link to find more information on how to edit spaCy stopword sets.
Named Entity Recognition
Another useful feature built into spaCy’s nlp pipeline is Named Entity Recognition (NER). This recognises and classifies proper nouns like place names, countries, currencies, political organizations, and even (some) works of art. As with other tools, we cannot expect NER to be 100% reliable, especially when working with musical sublanguages, but it can still save time and/or streamline manual coding and tagging exercises.
To see a list of the NER labels spaCy uses, you can type the following code (and remember to use the spacy.explain() function if you need further clarification on any abbreviations):
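One way to do this is to ask the pipeline’s NER component for its labels, which returns the tuple shown below:
PYTHON
print(nlp.get_pipe("ner").labels)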
('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')
It is important to note that named entities are a different type of spaCy object than the token or doc objects we have been working with so far. Predictably, and presumably unrelated to the talking trees in Lord of the Rings, they are called ents. We can create a list of named entities in our doc by typing:
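A minimal sketch, using the doc’s ents attribute:
PYTHON
print(doc.ents)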
(Gigi Masin, Marco Sterk, Marco, Johnny Nash, Amsterdam, one weekend, April, late night, Sky Walking, Lawrence and co., Gaussian Curve)
The ent objects have .label_ and .text attributes, which tell us, respectively, which of the NER labels printed above a given entity has been assigned, and the text of the tokens that label was assigned to. If you are interested in finding out the label for a specific named entity, you can use the list comprehension method from above. For example:
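A minimal sketch, looking up the label assigned to ‘Gaussian Curve’:
PYTHON
print([ent.label_ for ent in doc.ents if ent.text == "Gaussian Curve"])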
['ORG']
Challenge
Using the same method as when generating annotations of each token in the greeting_doc earlier:
- (a) Create a list of ent texts with ent labels for each named entity in the doc.
- (b) Then, extract only the entities representing people or organizations.
Note that .label (without the underscore) outputs the numerical index of the label type rather than the text string name of the label, which could be less useful for intuitive readability when you are exploring your document properties.
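A sketch of one possible solution (the list name people_and_orgs is illustrative), whose output is shown below:
PYTHON
# (a) each named entity together with its label
for ent in doc.ents:
    print(ent.text, ent.label_)

# (b) only the entities labelled as people or organisations
people_and_orgs = [ent.text for ent in doc.ents
                   if ent.label_ == "PERSON" or ent.label_ == "ORG"]
print(people_and_orgs)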
Gigi Masin PERSON
Marco Sterk PERSON
Marco PERSON
Johnny Nash PERSON
Amsterdam GPE
one weekend DATE
April DATE
late night TIME
Sky Walking PRODUCT
Lawrence and co. ORG
Gaussian Curve ORG
['Gigi Masin', 'Marco Sterk', 'Marco', 'Johnny Nash', 'Lawrence and co.', 'Gaussian Curve']
The ent.text and ent.label_ ‘for’ loop output of part (a) of the exercise is interesting in itself but cannot be worked on further because it was not assigned as a new variable. This can be done with the following syntax:
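A minimal sketch of such an assignment, using a hypothetical variable name ent_text_labels:
PYTHON
# hypothetical variable name holding the (text, label) pairs from part (a)
ent_text_labels = [(ent.text, ent.label_) for ent in doc.ents]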
However, if you try to do further NLP tasks on this variable, such as a simple token count, you will run into issues:
0
The reason that the len() function did not work as expected is that the type of object this new variable constitutes is not a doc object to which spaCy’s NLP functions (such as token counting) apply. You can inspect the type of object you are dealing with by passing the variable to the type() function inside print():
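For instance, with the hypothetical ent_text_labels variable from the sketch above:
PYTHON
print(type(ent_text_labels))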
<class 'list'>
On the other hand, if you inspect the type of our original doc:
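A minimal sketch:
PYTHON
print(type(doc))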
<class 'spacy.tokens.doc.Doc'>
List objects are not tokenized or indexed in the same way that doc objects are, which means you cannot perform further NLP tasks on them in their current state. Nevertheless, they are often useful data sources that you will need to know how to manipulate to be able to quantify and analyse their contents in other ways.
How to do this will be the subject of the next episode.
Key Points
- SpaCy is a free and open-source Python library for Natural Language Processing which is based on pre-trained processing pipelines.
- SpaCy uses language pipelines which automatically tokenize and tag parts of speech, with powerful recognition and prediction features built-in.
- You can use simple list comprehension syntax to extract information that is relevant to your research question.
- The ability to call spaCy’s NLP functions depends on the type of data object you are working with.
Content from Organising Data with Pandas
Last updated on 2024-04-07
Overview
Questions
- What is Pandas and what is it for?
- What is a dataframe?
- How can I use Python and Pandas to create dataframes?
- How can I organise the data presented in a dataframe?
Objectives
- Use the Pandas package in Python and the dframCy module for spaCy to organise data in a tabular format.
- Manipulate different types of objects for organisation and analysis.
Challenge
Using the first item in the ‘modern_classical_ambient_desc_only’ dataset that you worked on in the previous episode, create a dataframe that includes token.text and token.pos_ values for all of the nouns and proper nouns contained in that text. Make your final output the .shape attribute of the dataframe, and check this against the results in the last episode.
Then, write a .csv file of this dataframe, titled ‘clouds_nouns.csv’.
You will need to create a dictionary of annotations and then create a dataframe and filter out columns and rows.
PYTHON
import pandas as pd

desc_doc = nlp("'Clouds' is a perfectly measured suite of warm and hazy downbeats from Gigi Masin, Marco Sterk (Young Marco), and Johnny Nash recorded in the heart of Amsterdam's red light district over one weekend in April, 2014.It's all about louche vibes and glowing notes, gently absorbing and transducing the buzz of the streets outside the studio's open windows into eight elegantly reserved improvisations segueing between lush ambient drift, dub-wise solo piano pieces, and chiming late night jazz patter. In that sense, there's striking similarities between 'Clouds' and the recent Sky Walking album by Lawrence and co., but where they really go for the looseness, Gaussian Curve keep it supple yet tight, bordering on adult contemporary suaveness anointed with finest hash oil. Imbibe slowly.")

# dictionary of annotations for each token in the doc
desc_doc_dict = [{
    "text": token.text,
    "pos": token.pos_,
    "tag": token.tag_,
    "dep": token.dep_,
    "head": token.head}
    for token in desc_doc]
desc_doc_dict

# build a dataframe, keep only the text and pos columns, then filter to nouns and proper nouns
desc_df = pd.DataFrame.from_records(desc_doc_dict)
desc_pos_df = desc_df.drop(columns=["tag", "dep", "head"])
desc_nouns_df = desc_pos_df[desc_pos_df["pos"].isin(["NOUN", "PROPN"])]
desc_nouns_df
text pos
0 Clouds NOUN
6 suite NOUN
11 downbeats NOUN
13 Gigi PROPN
14 Masin PROPN
16 Marco PROPN
17 Sterk PROPN
19 Young PROPN
20 Marco PROPN
24 Johnny PROPN
25 Nash PROPN
29 heart NOUN
31 Amsterdam PROPN
34 light NOUN
35 district NOUN
38 weekend NOUN
40 April PROPN
42 2014.It PROPN
47 vibes NOUN
50 notes NOUN
57 buzz NOUN
60 streets NOUN
63 studio NOUN
66 windows NOUN
71 improvisations NOUN
76 drift NOUN
78 dub NOUN
82 piano NOUN
83 pieces NOUN
88 night NOUN
89 jazz NOUN
90 patter NOUN
94 sense NOUN
99 similarities NOUN
102 Clouds PROPN
107 Sky PROPN
108 Walking PROPN
109 album NOUN
111 Lawrence PROPN
113 co. PROPN
122 looseness NOUN
124 Gaussian PROPN
125 Curve NOUN
134 adult NOUN
136 suaveness NOUN
140 hash NOUN
141 oil NOUN
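Finally, a sketch of the last two steps of the challenge, writing the .csv file and returning the .shape attribute (whose value appears below):
PYTHON
# export the filtered dataframe for further manual coding
desc_nouns_df.to_csv("clouds_nouns.csv")

# final output: (rows, columns) of the dataframe
desc_nouns_df.shape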
(47, 2)
Key Points
- Pandas is a Python library for data manipulation and analysis.
- You can use Pandas to create dataframes, i.e. data structures consisting of rows and columns, from your spaCy doc and list objects.
- You can add or remove columns and rows based on their labels or values and quantitatively sort and analyse the contents.
- You can also use Pandas to write and export .csv and .tsv files for further manual coding or other manipulation.
Content from Introduction to Word Embeddings
Last updated on 2024-04-07
Overview
Questions
- What are word embeddings and what are they for?
- How can I create embeddings using spaCy?
- How can I visualise the results?
Objectives
- Understand the principles behind word embeddings.
- Use Python, spaCy and Pandas to create word embeddings for a corpus of musical terms.
- Use Google TensorFlow’s Embedding Projector tool to visualise and interpret the results.
The dataset for this episode, containing terms from Boomkat’s ‘Grime / FWD’ and ‘Industrial / Wave / Electro’ subgenre corpora, can be accessed via the ‘Episode 5’ Zenodo repository.
Challenge
You may have noticed that the merge_embeddings_genre_df has more rows than the unduplicated grime_ind_terms_df we created earlier. Why could that be? Can you amend the code used to create the grime_ind_embeddings_df to make the number of rows match the number of unduplicated terms?
The terms in our lists are word types but are not necessarily single tokens. As you will recall from our first encounter with spaCy, it treats possessives like ’s as separate tokens. This means that, in the process of creating a list of doc objects of terms based on their tokens, spaCy’s nlp function has added rows (and vectors) for these extra tokens.
We can override this by specifying that only the first token in each doc is necessary to represent the term and that others should be discounted.
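A minimal, hypothetical sketch of this fix: the names terms and term_docs below are stand-ins for the variables created earlier in the episode, and the example terms are invented.
PYTHON
import spacy

nlp = spacy.load("en_core_web_lg")

# stand-in examples for the unduplicated genre terms
terms = ["bassweight", "dancefloor's"]
term_docs = [nlp(term) for term in terms]

# keep only the first token's vector to represent each term,
# so extra tokens such as possessive 's do not add extra rows
term_vectors = [term_doc[0].vector for term_doc in term_docs]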
Discussion
In breakout rooms of pairs or groups of three, go to the website for Google TensorFlow’s Embedding Projector Tool and load the grime_industrial_embedding data and metadata files. Play around with the settings and the projection and use the Etherpad to write down questions and observations to share with the whole group.
Use the ‘Color By’ function to differentiate terms by genre. This will assign a different colour to Grime versus Industrial terms and allow you to visually identify semantic themes that are more prevalent in one or the other genre.
Enable the 3D labels mode to see the terms themselves rather than points.
Key Points
- Word embeddings rely on a conception of language (as a finite vocabulary) as a totality that occupies multi-dimensional space.
- Word embeddings are sets of vectors that position a given word in this multi-dimensional space, with each dimension orthogonal to every other dimension.
- Embeddings are trained by unsupervised machine learning on large text corpora, which looks for collocations between words, syllables and characters, and builds predictions as to the probability of given n-grams occurring in proximity. This presupposes that words that are semantically close (i.e. have similar meanings or usages) will be close together in the multi-dimensional space, while words that are semantically different will be far apart.
- There are many established statistical methods for manipulating and interpreting word embeddings.
- Google TensorFlow’s Word Embedding Projector has built-in functions for dimensionality reduction which create intuitively readable visualisations of embeddings in 2- or 3-dimensional space.