Evaluating the potential of graph algorithms for extracting the main text from HTML files

Student name: Lukas Lechner
Assignment Date: November 9, 2022

The objective of this thesis is to explore graph algorithms as a novel approach for identifying and extracting the main text content from web pages.
At the time of writing, main text extraction from web pages has not been proposed as an application domain for graph algorithms. This is a missed opportunity: the DOM tree of an HTML document is naturally represented as a hierarchical graph, which suggests using graph algorithms for the associated machine learning (ML) problems. The main goals of the thesis are the following:

  1. investigate the potential of graph algorithms for main text extraction in web pages,
  2. design and test different graph representations of a given HTML document (choice of nodes, links, and node attributes) to determine which representation is best suited to main text extraction, and
  3. evaluate and compare graph algorithms against state-of-the-art approaches.
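
To make the idea of a graph representation concrete, the following is a minimal sketch of how an HTML document could be turned into a graph of nodes, links, and node attributes using only Python's standard library. It is an illustration under stated assumptions, not the representation the thesis will settle on: the names `DomGraphBuilder` and `html_to_graph` are hypothetical, and the node attributes `tag`, `text_len`, and `depth` are just one plausible choice among many.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Hypothetical helper: collects DOM elements as nodes and parent-child links."""

    def __init__(self):
        super().__init__()
        self.nodes = {}   # node id -> {"tag", "text_len", "depth"} (assumed attributes)
        self.edges = []   # (parent id, child id) pairs
        self._stack = []  # ids of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes[node_id] = {"tag": tag, "text_len": 0,
                               "depth": len(self._stack)}
        if self._stack:
            # Link the new element to its nearest open ancestor.
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Attribute text to the innermost open element only.
        if self._stack:
            self.nodes[self._stack[-1]]["text_len"] += len(data.strip())


def html_to_graph(html):
    """Return (nodes, edges) for an HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


if __name__ == "__main__":
    sample = ("<html><body><nav><a href='#'>Home</a></nav>"
              "<article><p>Hello world</p><br><p>More text</p></article>"
              "</body></html>")
    nodes, edges = html_to_graph(sample)
    for node_id, attrs in nodes.items():
        print(node_id, attrs)
    print(edges)
```

In this sketch the graph is simply the DOM tree itself. Alternative representations, for example adding links between sibling elements or using richer node attributes, would change only how `nodes` and `edges` are filled, which is the kind of design choice the second goal is meant to compare.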