Paper: Towards Olfactory Information Extraction from Text – A Case Study on Detecting Smell Experiences in Novels

This weekend, Marieke van Erp presented a paper on extracting olfactory information from English text at the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, organised in conjunction with COLING 2020. The paper was presented as a poster, sadly not in Barcelona but in a gather.town session.

In this paper, we ran a first set of experiments on how best to recognise references to smell in texts, which is an important task in Odeuropa’s Work Package 3. We first created an annotated dataset, i.e. a set of texts in which humans (= Odeuropa team members) marked whether the text contained a reference to a smell. We then created patterns based on a set of smell-related words from the Cambridge dictionary of English, such as ‘smells like X’ and ‘a Y fragrance’, where X and Y can stand for nouns and adjectives. We ran the patterns over a large set of texts to see whether they would find more expressions referring to smells than the dictionary smell keywords alone, and our experiments showed that patterns indeed worked better than keywords. In Odeuropa, we will build on this further, as well as try out other methods (such as machine learning) to recognise references to smells in Latin, English, Italian, German, French, Dutch, and Slovene texts from 1600 – 1920 across different genres.
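To give a rough idea of the difference between keyword spotting and pattern matching, here is a minimal sketch in Python. It is not the code used in the paper: the keyword list, the two regular expressions and the function names are assumptions made for illustration, and X and Y are simplified to a single word.

```python
import re

# Hypothetical keyword list for illustration; the paper took its smell
# words from a dictionary, and this is not that list.
SMELL_KEYWORDS = {"smell", "smells", "scent", "odour", "fragrance", "stench"}

# Illustrative regex versions of 'smells like X' and 'a Y fragrance'.
# X and Y are captured as one word (an assumption of this sketch).
PATTERNS = [
    re.compile(r"\bsmells? like (?:an? |the )?(?P<source>\w+)", re.IGNORECASE),
    re.compile(r"\ban? (?P<quality>\w+) fragrance\b", re.IGNORECASE),
]


def keyword_hits(sentence):
    """Return the smell keywords that occur in the sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [token for token in tokens if token in SMELL_KEYWORDS]


def pattern_hits(sentence):
    """Return the words captured by each pattern that matches the sentence."""
    hits = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            hits.append(match.groupdict())
    return hits


if __name__ == "__main__":
    for text in ["The room smells like lavender.", "She wore a sweet fragrance."]:
        print(text, keyword_hits(text), pattern_hits(text))
```

In this sketch a keyword hit only tells us that a smell word is present, while a pattern hit also returns the word filling the X or Y slot, such as the source or the quality of the smell.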

This research paper is based on Ryan Brate’s MSc thesis work, which he did for the University of Amsterdam’s Data Science degree programme under the supervision of prof. dr. Paul Groth and dr. Marieke van Erp. Full citation:

Brate, Ryan, Paul Groth, and Marieke van Erp. “Towards Olfactory Information Extraction from Text: A Case Study on Detecting Smell Experiences in Novels.” In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 147-155. 2020.