# Scrape test module

The scrape test module is intended to test the implementation of the library at scale by parsing a large number of webpages and checking the quality of the results.

## Data

At this moment, two datasets are used:

  • one found on Kaggle,
  • another from Moz (the Top 500 most visited websites).

I'd like a more varied set of data from different types of sources, since the current sets mostly seem to contain homepages, but such data is surprisingly hard to find.

## Running the tests

For various reasons, I am not uploading the actual data of the various URLs. To run the analysis yourself:

  1. Run Scraper.kt once; it will fetch all the webpages and place them in the data/web folder.
  2. Run ParserTest.kt; it will run the Parser on each of those webpages and check whether the tags can be extracted and whether the page is considered valid.
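The kind of check step 2 performs can be sketched as follows. This is a simplified, self-contained stand-in, not the library's actual code: it extracts Open Graph meta tags with a regex (the real Parser uses Ksoup), and `isValidPage` is a hypothetical validity rule assuming the four basic OG properties must be present.

```kotlin
// Simplified stand-in for the Parser: pull og:* meta tags out of raw HTML.
// The real library parses the DOM with Ksoup; a regex is used here only
// to keep the sketch dependency-free.
val ogTag = Regex("""<meta\s+property="(og:[^"]+)"\s+content="([^"]*)"""")

fun extractOgTags(html: String): Map<String, String> =
    ogTag.findAll(html).associate { it.groupValues[1] to it.groupValues[2] }

// Hypothetical validity rule: the four basic Open Graph properties exist.
fun isValidPage(tags: Map<String, String>): Boolean =
    listOf("og:title", "og:type", "og:image", "og:url").all { it in tags }

fun main() {
    val html = """
        <html><head>
          <meta property="og:title" content="Example" />
          <meta property="og:type" content="website" />
          <meta property="og:image" content="https://example.com/img.png" />
          <meta property="og:url" content="https://example.com" />
        </head></html>
    """
    val tags = extractOgTags(html)
    println(tags.size)         // 4
    println(isValidPage(tags)) // true
}
```

In the actual test run, the same two checks (tags extracted, page valid) are applied to every file saved under data/web, which gives a rough quality score across the whole dataset.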