Spreading excellence and disseminating the cutting-edge results of our research and development efforts is crucial to our institute. Check out our educational offers for Bachelor, Master and PhD studies at the University of Innsbruck!

Document Feature Set Extraction from Web Resources

Student name: 

The goal of this work is to:

  • define feature sets from Web documents
    • according to syntactic properties of the HTML pages, such as the number of camelCase tokens, links, links that are not marked up as hyperlinks, currency symbols (e.g. in prices), long number sequences (e.g. phone numbers), etc.
    • according to structural differences in the HTML pages, such as headers and their corresponding paragraphs, the ratio of markup to text, etc.
  • employ rules to extract concrete usable information from feature sets
  • build clusters of feature sets
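
To make the goals above more concrete, the following is a minimal sketch of how a feature set could be computed for one HTML page, using only the Python standard library. It is not a prescribed design for the thesis. The feature names, the regular expressions (for example, treating seven or more digits as a "long number sequence"), the definition of the markup-to-text ratio, and the example rule has_price_information are all illustrative assumptions.

```python
import re
from html.parser import HTMLParser

# Illustrative patterns; the thresholds and symbol sets are assumptions.
CAMEL_CASE = re.compile(r"\b[A-Za-z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b")
URL = re.compile(r"https?://\S+")
CURRENCY = re.compile(r"[$€£¥]")
LONG_NUMBER = re.compile(r"\+?\d(?:[ \-/]?\d){6,}")  # at least 7 digits

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}  # their content is not visible text


class FeatureParser(HTMLParser):
    """Collects tag counts and visible text while parsing a page."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.headers = 0
        self.paragraphs = 0
        self.text = []           # all visible text chunks
        self.unlinked_text = []  # visible text outside <a> elements
        self._anchor_depth = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
            if any(name == "href" for name, _ in attrs):
                self.links += 1
        elif tag in HEADER_TAGS:
            self.headers += 1
        elif tag == "p":
            self.paragraphs += 1
        elif tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a":
            self._anchor_depth = max(0, self._anchor_depth - 1)
        elif tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return
        self.text.append(data)
        if not self._anchor_depth:
            self.unlinked_text.append(data)


def extract_features(html):
    """Return a dict of syntactic and structural features for one page."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text)
    unlinked = " ".join(parser.unlinked_text)
    text_chars = sum(len(chunk) for chunk in parser.text)
    return {
        # syntactic features
        "camel_case_tokens": len(CAMEL_CASE.findall(text)),
        "links": parser.links,
        "unlinked_urls": len(URL.findall(unlinked)),
        "currency_symbols": len(CURRENCY.findall(text)),
        "long_numbers": len(LONG_NUMBER.findall(text)),
        # structural features
        "headers": parser.headers,
        "paragraphs": parser.paragraphs,
        "markup_ratio": 1 - text_chars / len(html) if html else 0.0,
    }


def has_price_information(features):
    """Hypothetical rule: a currency symbol suggests the page lists prices."""
    return features["currency_symbols"] > 0
```

Each page is reduced to a flat dictionary of numbers, so a collection of pages could be turned into numeric vectors and passed to any standard clustering method. Which features, rules, and clustering approach actually work well is left open here.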