JSoup Method for Page Scraping
I’m currently writing a web scraper for the forums on Gaia Online. I used to write web scrapers in Python with the very handy BeautifulSoup library; Java has an equivalent called JSoup. Here I have written a class that is extended by every class in my project that needs to scrape HTML. This ‘Scraper’ class fetches the HTML and converts it into a JSoup tree, which can then be navigated to pick out the data. It identifies itself as a ‘web spider’ type of web agent, and it waits a random 0-7 seconds before fetching each page so that it cannot be used to overload a web server. It also converts the entire page to ASCII, which may not be the best choice for multi-language web pages, but it has certainly made scraping the English-language site Gaia Online much easier. ...
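To make the two politeness and clean-up steps described above concrete, here is a minimal sketch using only the Java standard library. It is not the Scraper class itself: the class name `ScraperHelpers`, the method names, and the choice of Unicode decomposition for the ASCII conversion are all assumptions made for illustration, and the original may well do the conversion differently.

```java
import java.text.Normalizer;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper class; the names here are illustrative, not from the original Scraper.
final class ScraperHelpers {

    // Upper bound of the random wait: 7 seconds, as described in the text.
    static final long MAX_WAIT_MILLIS = 7_000;

    private ScraperHelpers() {
    }

    // Picks a random delay between 0 and 7000 ms (inclusive).
    static long randomWaitMillis() {
        return ThreadLocalRandom.current().nextLong(MAX_WAIT_MILLIS + 1);
    }

    // Sleeps for a random 0-7 seconds so repeated fetches do not hammer the server.
    static void politeWait() throws InterruptedException {
        Thread.sleep(randomWaitMillis());
    }

    // Assumed approach to "convert the page to ASCII": decompose accented
    // characters (e.g. U+00E9 becomes 'e' plus a combining accent), then
    // drop every character outside the 7-bit ASCII range.
    static String toAscii(String html) {
        String decomposed = Normalizer.normalize(html, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\x00-\\x7F]", "");
    }
}
```

A scraper built this way would call `politeWait()` before each request and pass the result of `toAscii(...)` to `Jsoup.parse(String)` to get the tree to navigate. With JSoup doing the fetch, the ‘web spider’ identification can be set through `Jsoup.connect(url).userAgent(...)`. Note that this ASCII conversion silently deletes any character with no ASCII base form, which is the multi-language drawback mentioned above.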