Scrape test module

The scrape test module is intended to test the implementation of the library at scale by parsing a large number of webpages and checking the quality of the results.

Data

At the moment, the data comes from two sources:

  • one dataset found on Kaggle.
  • another from Moz (Top 500 most visited websites).

I'd like a more varied set of data from different types of sources, since the current set seems to contain mostly homepages, but such data is surprisingly hard to find.

Running the tests

For various reasons, I am not uploading the downloaded pages for the various URLs. To run the analysis yourself:

  1. Run Scraper.kt once, which grabs all the webpages and places them in the data/web folder.
  2. Run ParserTest.kt, which runs the Parser on each of those webpages and checks whether the tags can be extracted and whether the page is considered valid.
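
To give a rough idea of what the two steps amount to, here is a minimal sketch. It is not the actual content of Scraper.kt or ParserTest.kt. The names `fileNameFor`, `scrape` and `countValid` are hypothetical, and the `isValid` lambda only stands in for the library's Parser, whose API is not shown in this README.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.nio.file.Files
import java.nio.file.Path
import java.time.Duration

// Hypothetical helper: turn a URL into a file-system-safe file name.
fun fileNameFor(url: String): String =
    url.removePrefix("https://").removePrefix("http://")
        .replace(Regex("[^A-Za-z0-9.-]"), "_") + ".html"

// Step 1 (sketch): download each URL once and store it under outputDir.
fun scrape(urls: List<String>, outputDir: Path) {
    Files.createDirectories(outputDir)
    val client = HttpClient.newBuilder()
        .followRedirects(HttpClient.Redirect.NORMAL)
        .connectTimeout(Duration.ofSeconds(10))
        .build()
    for (url in urls) {
        val target = outputDir.resolve(fileNameFor(url))
        if (Files.exists(target)) continue // already downloaded, skip it
        try {
            val request = HttpRequest.newBuilder(URI.create(url)).GET().build()
            val response = client.send(request, HttpResponse.BodyHandlers.ofString())
            Files.writeString(target, response.body())
        } catch (e: Exception) {
            // A large URL list will always contain dead or slow sites.
            println("Skipping $url: ${e.message}")
        }
    }
}

// Step 2 (sketch): run a check on every stored page and count the outcomes.
// `isValid` is an assumption standing in for the real Parser call.
// Returns (number of valid pages, total number of pages).
fun countValid(inputDir: Path, isValid: (String) -> Boolean): Pair<Int, Int> {
    var valid = 0
    var total = 0
    Files.list(inputDir).use { paths ->
        paths.filter { Files.isRegularFile(it) }.forEach { page ->
            total++
            if (isValid(Files.readString(page))) valid++
        }
    }
    return valid to total
}
```

Under these assumptions, calling `scrape(urls, Path.of("data", "web"))` once and then `countValid(Path.of("data", "web")) { html -> ... }` mirrors the two steps above. Keeping the download separate from the check means the pages are fetched only once, and the parsing run can be repeated offline as the library changes.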