This first version of our user portal provides access to two datasets, along with notebooks for exploring the data.
- Farms to Freeways - Data collected in 1991-1992 in a project titled Western Sydney Women’s Oral History Project ‘From farms to freeways: Women’s memories of Western Sydney’, which sought to analyse the experiences of women who had lived in the Blacktown and Penrith areas since the early 1950s, including their responses to social changes brought about by rapid suburbanisation in the Western Sydney region in the post-war period.
- Corpus of Oz Early English (CoOEE) - Approximately 2 million tokens of material produced in Australia between 1788 and 1900, divided into four time periods and four registers.
Discursis is a communication analytics technology that allows a user to analyse text-based communication data, such as conversations, web forums and training scenarios. It uses natural language processing algorithms to automatically process transcribed text and highlight participant interactions around specific topics and over the time course of the conversation. Discursis can assist practitioners in understanding the structure, information content, and inter-speaker relationships present within the input data. Discursis also provides quantitative measures of key metrics, such as topic introduction, topic consistency, and topic novelty. See this blog post for more details.
Discursis was developed by Dan Angus, Janet Wiles and Andrew Smith, and has been reworked as an open source tool by staff of the Sydney Informatics Hub.
This Quotation Tool can be used to extract quotes from a text. In addition to extracting the quotes, the tool also provides information about who the speakers are, the location of the quotes (and the speakers) within the text, the identified named entities, and other information which can be useful for text analysis.
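To give a feel for the kind of output described above, here is a minimal sketch that finds quoted spans and records their character positions. It is not the Quotation Tool's implementation: it only matches text between straight or curly double quotes with a regular expression, and it does not identify speakers or named entities. The function name `extract_quotes` and the sample sentence are invented for illustration.

```python
import re

# Matches text inside straight ("...") or curly (“...”) double quotes.
QUOTE_PATTERN = re.compile(r'"([^"]+)"|“([^”]+)”')


def extract_quotes(text):
    """Return each quoted span with its character offsets in the text."""
    results = []
    for match in QUOTE_PATTERN.finditer(text):
        # Exactly one of the two groups is filled, depending on quote style.
        quote = match.group(1) or match.group(2)
        results.append(
            {"quote": quote, "start": match.start(), "end": match.end()}
        )
    return results


if __name__ == "__main__":
    # Made-up sample text, not taken from either dataset.
    sample = 'She said, "We moved to Penrith in 1952." Then she added, “It was all farms.”'
    for item in extract_quotes(sample):
        print(item["start"], item["end"], item["quote"])
```

The offsets include the quotation marks themselves, so `text[start:end]` returns the quote together with its delimiters.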
This Semantic Tagger uses the Python Multilingual Ucrel Semantic Analysis System (PyMUSAS) to tag text so that you can extract token-level semantic tags from the tagged text. PyMUSAS is a rule-based token and Multi Word Expression (MWE) semantic tagger. The tagger can support any semantic tagset; however, the currently released tagset is for the UCREL Semantic Analysis System (USAS) semantic tags. In addition to the USAS tags, you will also see the lemmas and Part-of-Speech (POS) tags in the text. For English, the tagger also identifies and tags Multi Word Expressions (MWEs), i.e., expressions formed by two or more words that behave like a unit, such as ‘South Australia’.
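The idea of rule-based tagging with multi-word expressions can be sketched with a toy lexicon lookup. This is not the PyMUSAS API and does not reproduce its rules: the function `tag_tokens`, both lexicons, and the tag values assigned in them are hypothetical and chosen only to show how a longest-match MWE lookup can take priority over single-token lookup.

```python
def tag_tokens(tokens, single_lexicon, mwe_lexicon, max_mwe_len=3, unknown="Z99"):
    """Tag tokens, preferring the longest multi-word expression at each position.

    `unknown` is a placeholder tag for tokens found in neither lexicon.
    """
    tagged = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first, down to two-word expressions.
        for length in range(min(max_mwe_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + length])
            if candidate in mwe_lexicon:
                match = (length, mwe_lexicon[candidate])
                break
        if match:
            length, tag = match
            tagged.append((" ".join(tokens[i:i + length]), tag))
            i += length
        else:
            tag = single_lexicon.get(tokens[i].lower(), unknown)
            tagged.append((tokens[i], tag))
            i += 1
    return tagged


if __name__ == "__main__":
    # Hypothetical toy lexicons; the tag values are illustrative only.
    single = {"moved": "M1"}
    mwe = {("south", "australia"): "Z2"}
    print(tag_tokens(["She", "moved", "to", "South", "Australia"], single, mwe))
```

Here ‘South Australia’ is tagged as one unit rather than as two separate tokens, which is the behaviour the MWE description above refers to.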
These tools help with recognising placenames in historical documents, matching them against online gazetteers to determine which known locations they correspond to, and gathering related geolocation data such as coordinates. The tools accelerate this workflow while leaving scope for the user to have input in disambiguation.
This notebook presents a Keyword Analysis tool that analyses words in a collection of corpora and identifies whether certain words are over- or under-represented in a particular corpus compared with their representation in the other corpora.
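One common way to quantify over- and under-representation is the log-likelihood (G2) keyness statistic. The sketch below assumes that measure; the statistics the notebook actually reports are not stated here, and the function names `log_likelihood` and `keyness_table` are invented for this example. The sign convention (positive for over-represented, negative for under-represented in the target corpus) is also a choice made for the sketch.

```python
import math
from collections import Counter


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Signed G2 keyness of one word: target corpus versus reference corpus.

    Both corpus sizes must be greater than zero.
    """
    total = size_target + size_ref
    combined = freq_target + freq_ref
    # Expected counts if the word were spread evenly across both corpora.
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    if freq_target > 0:
        g2 += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    g2 *= 2
    # Positive when the word is relatively more frequent in the target corpus.
    over = freq_target / size_target >= freq_ref / size_ref
    return g2 if over else -g2


def keyness_table(target_tokens, reference_tokens):
    """Return (word, score) pairs sorted from most over- to most under-represented."""
    target, ref = Counter(target_tokens), Counter(reference_tokens)
    n_target, n_ref = len(target_tokens), len(reference_tokens)
    scores = {
        word: log_likelihood(target[word], n_target, ref[word], n_ref)
        for word in set(target) | set(ref)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Tiny made-up token lists, for illustration only.
    print(keyness_table("farm farm farm road".split(), "road road road farm".split()))
```

When comparing one corpus against several others, the reference tokens can simply be the concatenation of the other corpora.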
This notebook presents a tool to identify similar documents in your corpus and decide whether to keep them in the corpus or to remove them. See this blog post for more details.
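As an illustration of how similar documents can be detected, the sketch below compares documents by the Jaccard similarity of their word shingles (overlapping runs of n words). This is an assumed technique for the example, not necessarily the one the notebook uses; the names `shingles`, `jaccard` and `similar_pairs`, the threshold of 0.5, and the sample documents are all hypothetical.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in the text."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts become a single shingle (or none, if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_pairs(docs, threshold=0.5, n=3):
    """Return (name_a, name_b, score) for document pairs at or above the threshold."""
    sets = {name: shingles(text, n) for name, text in docs.items()}
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs


if __name__ == "__main__":
    # Made-up documents, for illustration only.
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy cat",
        "c": "completely different text about something else entirely",
    }
    print(similar_pairs(docs))
```

The flagged pairs can then be reviewed by hand to decide which documents to keep. Comparing every pair is fine for small corpora; larger ones usually need an approximate method such as MinHash.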
Language Technology and Data Analysis Laboratory (LADAL)
The Language Technology and Data Analysis Laboratory offers a range of tools for text analysis (and more).