Mining Dracula with Data Science


If you’ve ever woken up from that nightmare where you’re giving a presentation on a book you’ve never read, then you know just how stressful that can be. Unless, of course, you’re Tyler Diehl. Tyler, a junior Finance/Statistics double major, stood in front of a graduate class last month and presented his analysis of Bram Stoker’s Dracula without ever turning a single page. He coolly explained persistent themes throughout the book, described the writing styles of the various characters, and even identified the saddest sentence in the whole novel. How did he do that, you might ask? Well, Tyler had a secret weapon: data science.

You see, Tyler has been working this semester as part of the Research STAR¹ program in the Office of Information Technology’s Research and Data Science Services group. This group works with researchers to help take their research to the next level with technology, and Tyler was doing just that. He had been tasked with developing a workshop on using the programming language R to do text mining, the process of extracting information from text using data science techniques. Working with his mentor, he settled on Dracula for two reasons: it is an epistolary novel with several distinct internal authors, which makes it interesting to analyze, and it is in the public domain, which makes the text readily available². He then wrote an analysis in R that processed the text and broke it down into a form he could examine with the statistical methods he knew from his classes. From there, he was able to use tools like sentiment analysis, topic modeling, and word clouds to extract relevant information about the text. He then found himself standing in front of a dozen graduate students in the Monsters in Myth, Literature, and Video Games class, sharing insights about a Gothic classic without so much as cracking the spine.
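To give a feel for what this kind of text mining involves, here is a minimal sketch of two of the steps described above: counting frequent words (the raw material for a word cloud) and scoring sentences by sentiment to find the "saddest" one. Tyler's analysis was written in R and its code is not shown here, so this is a Python illustration rather than his method. The tiny sentiment lexicon, the stop-word list, the function names, and the sample sentences are all made up for the example; none of them come from the novel or from a published lexicon.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: word -> sentiment score (negative = sad).
# A real analysis would load a published sentiment lexicon instead.
SENTIMENT = {
    "glad": 2, "good": 1, "happy": 2,
    "sad": -2, "fear": -2, "dead": -3, "lonely": -2, "dark": -1,
}

# Hypothetical stop-word list: common words that carry little meaning.
STOP_WORDS = {"the", "a", "and", "i", "we", "was", "were", "is", "to", "of"}

# Invented sample text, standing in for the full novel.
SAMPLE = (
    "The castle was dark and I was lonely. "
    "We were glad to see the good morning light. "
    "I fear the count is dead and the castle is dark."
)


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase the text and return its words."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(sentence):
    """Sum the lexicon scores of the words in a sentence (unknown words score 0)."""
    return sum(SENTIMENT.get(word, 0) for word in tokenize(sentence))


def saddest_sentence(text):
    """Return the sentence with the lowest sentiment score."""
    return min(split_sentences(text), key=sentiment_score)


def top_words(text, n=3):
    """Return the n most frequent words after removing stop words."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return Counter(words).most_common(n)


if __name__ == "__main__":
    print(top_words(SAMPLE, 2))
    print(saddest_sentence(SAMPLE))
```

Summing word scores is the simplest form of lexicon-based sentiment analysis. It ignores negation and context, so a real project would treat its "saddest sentence" as a starting point for reading, not a final answer.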

Tyler Diehl presented his findings at the Monsters in Myth, Literature, and Video Games class.

Tyler has since started to read Dracula to get a better understanding of his findings and take his analysis to the next level, but as he pointed out to the class, he had never text mined before the semester; he just had a working knowledge of R and a few hours a week dedicated to it.

If you are interested in trying out text mining or have an idea for a research project related to data science, high-performance computing, artificial intelligence, or the internet of things, please reach out to the OIT Research and Data Science group at help@smu.edu. If you are a student looking to work on cool projects and want to join the STAR program, we’re hiring, so reach out to Dr. Eric Godat at egodat@smu.edu.



¹ Student Technology Assistant in Residence: https://stars.smu.edu
² Courtesy of Project Gutenberg: https://www.gutenberg.org


Published by

Eric Godat

Dr. Eric Godat is a Data Scientist, Adjunct Professor, and Team Lead of the Research and Data Science Services Group in the Office of Information Technology at Southern Methodist University (SMU). He leads a team of experts in data science, high-performance computing, artificial intelligence, and the internet of things who help researchers at SMU enhance their research capabilities through technology. Dr. Godat's background is in high energy physics, specifically particle phenomenology using parton distribution functions to model heavy ion collisions at the Large Hadron Collider. He has recently been involved in collaborative projects across multiple disciplines on campus.