Spreading excellence and disseminating the cutting-edge results of our research and development efforts is crucial to our institute. Check out our educational offerings for Bachelor, Master and PhD studies at the University of Innsbruck!

Evaluating the potential of graph algorithms for extracting the main text from HTML files

Type: Bachelor
Supervisor: 
Student name: Lukas Lechner
Assignment Date: November 9, 2022

The objective of this thesis is to explore graph algorithms as a novel approach to identifying and extracting the main text content of web pages.
At the time of writing, main text extraction from web pages has not been proposed as an application domain for graph algorithms. This is a missed opportunity: the DOM tree of an HTML document is naturally represented as a hierarchical graph, which suggests using graph algorithms for the associated ML problems. The main goals of the thesis are the following:

  1. investigate the potential of graph algorithms for main text extraction in web pages,
  2. design and test different graph representations of a given HTML document (nodes, links, node attributes) to determine the optimal graph representation for main text extraction, and
  3. evaluate and compare graph algorithms against state-of-the-art approaches.
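
To make the second goal more concrete, the following is a minimal sketch of one possible graph representation, assuming that each HTML element becomes a node, each parent-child relation becomes a link, and the node attributes are the tag name and the length of the directly contained text. The names `DomGraphBuilder` and `build_dom_graph` and the attribute `text_len` are hypothetical and chosen for illustration only; the thesis may well settle on a different representation.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Hypothetical helper: collects nodes and parent->child links from HTML."""

    def __init__(self):
        super().__init__()
        self.nodes = {}   # node id -> {"tag": str, "text_len": int}
        self.edges = []   # (parent id, child id) pairs
        self._stack = []  # ids of the currently open elements

    def _add_node(self, tag):
        node_id = len(self.nodes)
        self.nodes[node_id] = {"tag": tag, "text_len": 0}
        if self._stack:
            # Link the new node to the innermost open element.
            self.edges.append((self._stack[-1], node_id))
        return node_id

    def handle_starttag(self, tag, attrs):
        node_id = self._add_node(tag)
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: a leaf node, never pushed.
        self._add_node(tag)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Assumed node attribute: amount of text directly inside the element.
        text = data.strip()
        if text and self._stack:
            self.nodes[self._stack[-1]]["text_len"] += len(text)


def build_dom_graph(html):
    """Return (nodes, edges) for the given HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


if __name__ == "__main__":
    sample = ("<html><body><nav><a href='#'>Home</a></nav>"
              "<article><p>Hello world</p><br><p>More text</p></article>"
              "</body></html>")
    nodes, edges = build_dom_graph(sample)
    for node_id, attrs in nodes.items():
        print(node_id, attrs)
    print(edges)
```

The resulting node and edge lists could be handed to a graph library or used to derive further attributes (for example link density or depth), which is the kind of design variation the second goal is meant to compare.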