Files
OpenGraphKt/scrape-test/README.md
julien Lengrand-Lambert 5372fab21c Fix types (#22)
* Improves types
* Adds missing properties to music album
* Changes gender from String to Enum
* Changes URL to an actual URL
* Fix typo
* Adds scalable live testing on real data
* Uses OffsetDateTime for articles, videos and books
2025-06-03 00:36:03 +02:00

21 lines
974 B
Markdown

#Scrape test module
The scrape test module is intended to test the immplementation of the library at scale by parsing a large amount of webpages and checking the quality of its results
## Data
At this moment
* one dataset was found on [Kaggle](https://www.kaggle.com/datasets/hetulmehta/website-classification).
* another on [Moz](https://moz.com/top-500/download/?table=top500Domains) (Top 500 most visited websites).
I'd like a more varied set of data from different types of sources, and the current set mostly seem to contain homepages but it's surprisingly hard to find.
## Running the tests
For various reasons, I am not uploading the actual data of the various URLs. To run the analysis yourself:
1. Run `Scraper.kt` once, which will grab all the webpages and place them in the `data/web` folder.
2. Run `ParserTest.kt`, which will run the `Parser` on each of those web pages and check whether the tags can be extracted, and if the page is considered valid.