Open Source Software Projects
I am involved in several open source software projects related to NLP research. In addition to the overview on this page, feel free to look at my GitHub profile.
DiscourseDB aims to represent online conversations from multiple platforms in a uniform structure. This enables discourse analyses that span platforms and thus provide a bird's-eye view of online conversations. It also captures the interactions between users and the context of the discourse.
DKPro TC is a UIMA-based text classification framework built on top of DKPro Core and DKPro Lab. It is intended to facilitate supervised machine learning experiments with any kind of textual data. It comes with:
- getting-started example code for standard text collections, e.g. the Reuters-21578 Text Categorization corpus, in Java and Groovy
- many generic feature extractors, e.g. n-grams and POS tags
- convenient parameter optimization capabilities
- comprehensive reporting with support for many standard performance measures
- support for single- and multi-label classification, sequence classification and regression in various frameworks, e.g. Weka and Mallet
JWPL is a free, Java-based application programming interface that allows easy programmatic access to Wikipedia articles and metadata via a preprocessed database. With its own efficient storage format for page revisions, it provides access to the whole revision history of any Wikipedia edition while requiring only a fraction of the storage space needed for the uncompressed data dumps. At the same time, it allows random access to any page revision in the database, which is not possible with bulk-compressed dumps.
DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. Many NLP tools are already freely available in the research community. DKPro Core provides UIMA components wrapping these tools (and some original tools) so they can be used interchangeably in UIMA processing pipelines. It builds heavily on uimaFIT, which allows for rapid and easy development of processing pipelines, for wrapping existing tools, and for creating original UIMA components.
jWeb1T is an open source Java tool for efficiently searching n-gram data in the Web 1T 5-gram corpus format. It is based on a binary search algorithm that finds an n-gram and returns its frequency count in time logarithmic in the number of entries searched. As the corpus is stored in many files, a simple index is used to locate the file that contains a given n-gram.
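The lookup idea can be illustrated with a minimal Java sketch. It is not taken from the jWeb1T code base: the class name, the method names, the in-memory arrays standing in for corpus files, and the index that maps the first n-gram of each file to a file name are all assumptions made for illustration. It also assumes each entry has the form "n-gram, tab, count" and that entries are sorted in the same order that `String.compareTo` uses.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of index-plus-binary-search lookup; not the jWeb1T API.
final class NGramLookupSketch {

    // Binary search over entries sorted by n-gram. Each entry is assumed to be
    // well-formed: "<n-gram>\t<count>". Returns the count, or 0 if absent.
    static long lookup(String[] sortedLines, String ngram) {
        int lo = 0;
        int hi = sortedLines.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;              // overflow-safe midpoint
            String line = sortedLines[mid];
            int tab = line.lastIndexOf('\t');
            int cmp = line.substring(0, tab).compareTo(ngram);
            if (cmp == 0) {
                return Long.parseLong(line.substring(tab + 1));
            }
            if (cmp < 0) {
                lo = mid + 1;                       // target sorts after mid
            } else {
                hi = mid - 1;                       // target sorts before mid
            }
        }
        return 0L;
    }

    // Assumed index layout: first n-gram of each file -> file name.
    // The candidate file is the one with the greatest first n-gram that is
    // less than or equal to the query; null means no file can contain it.
    static String fileFor(TreeMap<String, String> index, String ngram) {
        Map.Entry<String, String> entry = index.floorEntry(ngram);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, String> index = new TreeMap<>();
        index.put("a big dog", "file-0");
        index.put("the old man", "file-1");

        // Stand-in for the contents of "file-0".
        String[] file0 = {"a big dog\t7", "a big house\t42", "a small cat\t3"};

        System.out.println(fileFor(index, "a big house"));
        System.out.println(lookup(file0, "a big house"));
    }
}
```

In this sketch the index narrows the search to a single file, and the binary search then needs only a logarithmic number of comparisons within that file. A tool working on the real corpus would search the files on disk rather than arrays held in memory.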