Post-processing framework for large-scale document collections

Type:

Master

Student name:

none

The goal of this work is to build a post-processing framework for a large-scale document collection. One part of this framework is an RDF triple store containing large amounts of metadata. Optimizing how this large-scale metadata is collected and accessed in RDF is part of the aim of the work. The individual tasks are:

Developing and implementing filters and processors that run (a) within a Web crawler and (b) in part after the crawl, to post-process the crawled data. This includes, e.g., creating additional data, handling dependencies on additional data, and data conversions.
Tuning the code of the Web crawler, which stores RDF metadata about Web Services.
Ensuring that only necessary data is written to RDF.
Tuning a given RDF triple store, Virtuoso, to ingest as many triples as possible; determining the limits on how many triples can be stored and how many queries can be processed; and defining and implementing query quotas (e.g., forbidding expensive queries).
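
To illustrate the filter and processor tasks listed above, the following is a minimal sketch of how a chain of processors could reduce crawled triples to the necessary ones before they are written to RDF. It is not the crawler's actual design: the function names, the whitelist NECESSARY_PREDICATES, and the representation of a triple as a plain (subject, predicate, object) tuple are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Assumption: a triple is modeled as a plain (subject, predicate, object) tuple.
Triple = Tuple[str, str, str]
Processor = Callable[[Iterable[Triple]], Iterator[Triple]]

# Hypothetical whitelist of predicates considered "necessary" for the store.
NECESSARY_PREDICATES = {
    "http://purl.org/dc/terms/title",
    "http://purl.org/dc/terms/source",
}


def keep_necessary(triples: Iterable[Triple]) -> Iterator[Triple]:
    """Filter: pass through only triples whose predicate is whitelisted."""
    for s, p, o in triples:
        if p in NECESSARY_PREDICATES:
            yield (s, p, o)


def drop_duplicates(triples: Iterable[Triple]) -> Iterator[Triple]:
    """Processor: suppress triples that were already seen in this run."""
    seen = set()
    for triple in triples:
        if triple not in seen:
            seen.add(triple)
            yield triple


def run_pipeline(triples: Iterable[Triple],
                 processors: List[Processor]) -> List[Triple]:
    """Apply the processors in order and collect the surviving triples."""
    stream: Iterable[Triple] = triples
    for processor in processors:
        stream = processor(stream)
    return list(stream)
```

Because each stage is a generator, the same processors could in principle run inside the crawler on a stream of triples or afterwards on stored data. Which of the two is preferable depends on the crawler and on the triple store, which is part of what the work is meant to investigate.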