JSoup Method for Page Scraping

I’m currently in the process of writing a web scraper for the forums on Gaia Online . Previously, I used to use Python to develop web scrapers, with the very handy Python library BeautifulSoup . Java has an equivalent called JSoup. Here I have written a class which is extended by each class in my project that wants to scrape HTML. This ‘Scraper’ class deals with the fetching of the HTML and converting it into a JSoup tree to be navigated and have the data picked out of. It advertises itself as a ‘web spider’ type of web agent and also adds a 0-7 second random wait before fetching the page to make sure it isn’t used to overload a web server. It also converts the entire page to ASCII, which may not be the best thing to do for multi-language web pages, but certainly has made the scraping of the English language site Gaia Online much easier. ...

September 7, 2011 · 3 min · David Craddock

Disabling Control-Enter and Control-B shortcut keys in Outlook 2003

At work, I still have to use Windows XP and Outlook 2003. I don’t particually mind this, except when I draft an email to someone and accidently I press Control-B instead of Control-V. Control-B will go ahead and send your partially composed email, resulting in some embarassment as you have to tell everyone to disregard it. So I wanted to remove the ‘send email’ shortcut keys in Outlook 2003. There are two ways of doing this, one involves editing your group policy, which is something only my IT administration team can do, and I didn’t want to have to involve them. The other way is by making a change to your registry, which I will describe here. ...

July 13, 2011 · 1 min · David Craddock

Directory names not visable under ls? Change your colours.

There is a problem I frequently encouter on Redhat/Fedora/CentOS systems with the output of the ls command. Under those distributions, the default setup is to display directories in a very dark colour. If you usually use a white foreground and a black background on your terminal client (such as Putty) then you will struggle to read the names of the directories under Redhat-based distributions. There are two soloutions that I have used: ...

May 4, 2011 · 2 min · David Craddock

Scraping Gumtree Property Adverts with Python and BeautifulSoup

I am moving to Manchester soon, and so I thought I’d get an idea of the housing market there by scraping all the Manchester Gumtree property adverts into a MySQL database. Once in the database, I could do things like find the average monthly price for a 2 bedroom flat in an area, and spot bargains through using standard deviation from the mean on the price through using simple SQL queries via phpMyAdmin . ...

May 1, 2011 · 7 min · David Craddock

RESTful Web Services

REST (Representational State Transfer) is a way of delivering web services. When a web service conforms to REST, it is known as RESTful. The largest RESTful web service is the Hypertext Transfer Protocol (HTTP) which you use every day to send and receive information from web servers while browsing the internet. To implement RESTful web services, you should implement four methods: GET, PUT, POST and DELETE. Resources on RESTful web services are typically defined as collections of elements. The REST methods can either act on a whole collection, or a specific element in a collection. ...

March 2, 2011 · 2 min · David Craddock

'Weather Forecast' Calendar Service in PHP

The BBC provide 3 day weather RSS feeds for most locations in the UK. I thought it would be interesting to create a web service to turn the weather feed into calendar feed format, so I could have a constantly updated forecast of the next 3 days of weather mapped on to my iPhone’s calendar. Here it is on my iPhone: Overview The service is separated into five files: ical.php – this contains the class ical which corresponds to a single calendar feed. A method called ‘addevent’ allows you to add new events to the calendar, and a method called ‘returncal’ redirects the resulting calendar file to the browser so people can subscribe to it using their calendar application. forecast.php – this file contains the class forecast, which has properties for all aspects that we want to record for each day’s forecast, ie: Wind Speed and Humidity. It also contains the forecast set, which is a collection of forecast objects. The set class is serializable, which means each forecast object can be stored in a text file, including the Wind Speed, Humidity and all other things we want to record for each day. scrape-weather.php – this file contains code that scrapes the weather feed, populates the forecast set with all the weather information for the next 3 days, and stores the result in a file called forecasts.ser. forecasts.ser – this is all the data for the three day weather forecast, in serialized format. It is automatically deleted and recreated when the scrape-weather.php script is run. reader.php – this file converts the forecasts.ser file into an iCal calendar, and outputs the iCal formatted result to the calendar application that accesses reader.php page. It uses two external libraries: ...

February 24, 2011 · 4 min · David Craddock

Find large files by using the OSX commandline

To quickly find large files to delete if you have filled your startup disk, enter this command on the OSX terminal: sudo find / -size +500000 -print This will find and print out file paths to files over 500MB. You can then go through them and delete them individually by typing rm “”, although there is no undelete so make sure you know you won’t miss them.

February 22, 2011 · 1 min · David Craddock

Finding files in Linux modified between two dates

You use the ’touch’ command to create two blank files, with a last modified date that you specify - one with a date of the start of the range you want to specify, and the second with a date at the end of the range you want to specify. Then you reference to those two files in your find command: touch /tmp/temp -t 200604141130 touch /tmp/ntemp -t 200604261630 find /data/ -cnewer /tmp/temp -and ! -cnewer /tmp/ntemp

February 16, 2011 · 1 min · David Craddock

Writing simple email alerts in PHP with MagpieRSS

I wrote an email alerter that sends me an email whenever the upcoming temperature may dip below freezing. It uses the Magpie RSS reader to pull down a 3 day weather forecast that is provided for my area in RSS form by the BBC weather site. It then parses this forecast and determines if either today’s or tomorrow’s weather may dip below freezing. If it might, it sends an email to my email address to warn me. ...

February 12, 2011 · 1 min · David Craddock

Reverting back to a previous version in CVS - the magic "undo" feature

If you’ve committed some code into to CVS, and made a mistake on that commit, you will want to know how to revert to a previously saved version. Here is the command line command for CLI versions of CVS: $ cvs update -D '1 week ago' Run this command in the main directory of your checked out working copy. This will revert your working copy to the version of the code that was checked in ‘1 week ago’ from the present date. You also use commands like “1 day ago” and “5 days ago”. ...

January 28, 2011 · 1 min · David Craddock