What is a corpus?
A linguistic corpus is a collection of written text on a certain topic, compiled for use as a research database. Some corpora are tagged and annotated according to the needs of the researchers compiling them, and some simply consist of raw text data. They can be used to examine linguistic phenomena found in the text passages, for example by comparing sentence structures or word meanings. Depending on the content, some corpora can also be used to research social or psychological aspects.
What kind of corpus is VGCoST?
The Video Game Corpus of Speech and Text is a collection of dialogue and in-game text extracted from video games. It consists of raw text data in .txt files, ordered by the game they come from. It is entirely Open Access/Open Science and can therefore be used freely by anyone.
So far, the corpus only includes English text. While English will stay the main focus of VGCoST, some German files might find their way into the collection if time allows.
No idea how to work with a corpus?
I wrote a little introduction to working with VGCoST in German here. You can find extensive English explanations on a lot of university websites, such as this one from the University of Heidelberg.
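For a first look at raw-text files like the ones in this corpus, a few lines of Python are often enough. The following is only a minimal sketch of a word-frequency count. It assumes the download has been unpacked into a local folder of UTF-8 .txt files. The function name count_words and the simple token pattern are illustrative choices and not part of VGCoST itself.

```python
from collections import Counter
from pathlib import Path
import re

# Very simple tokenizer: runs of letters, optionally with one apostrophe part
# (e.g. "it's"). This is an illustrative assumption, not a linguistic standard.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(corpus_dir):
    """Count lowercase word frequencies over all .txt files below corpus_dir."""
    counts = Counter()
    # rglob also descends into subfolders, e.g. one folder per game.
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        # errors="replace" keeps the loop running if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return counts
```

Calling count_words on the unpacked folder and then .most_common(20) on the result lists the twenty most frequent words. For real research questions a dedicated corpus tool or a proper tokenizer will usually be the better choice.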
Need help or want to contribute?
If you need specific information, help, data or anything else connected to the Video Game Corpus of Speech and Text, use the contact form below. I will try to help as best I can. If you are a game developer and want the text of your game included in the corpus for science, you are the best! You can send me an email at the address on the ‘Site Notice’ page. All games are welcome, whether independently published or AAA. The only requirement is that the text is in English! You can send in German text as well if you have it and want to, preferably together with the English version.
Download
Via Dropbox (Version 10/2019)