Whilst there is a ton of useful data on the EPA Ireland website, it’s not exactly easy to track what’s going on. After my recent RSS post, I got a request from Ashley to see if something better was doable with the EPA data.
After a bit of playing around, I was able to scrape the thousands of individual RSS feeds and generate something that will hopefully be helpful to those of you who wish to monitor submissions on the site. Ashley has a good thread about it here on Twitter.
The code is all up on GitHub here. In summary, this is what it does:
- Once a day around 1.30am GMT, it scrapes all the feeds on the EPA site
- It saves all the data into a SQLite database and uploads that to Amazon S3
- It updates a single small RSS feed with all the submissions from the previous day
- It generates a new CSV file with the same data as the RSS feed and saves that to GitHub
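To give a feel for the steps above, here is a minimal sketch in Python. It is not the code from the repo: the `SAMPLE_FEED` string, the `submissions` table and its columns, and the function names are all made up for illustration, and it skips the actual fetching from the EPA site and the upload to S3.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample standing in for one of the many EPA feeds.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Submission A</title><link>https://example.org/a</link>
<pubDate>Wed, 21 Sep 2022 10:00:00 GMT</pubDate></item>
<item><title>Submission B</title><link>https://example.org/b</link>
<pubDate>Tue, 20 Sep 2022 09:30:00 GMT</pubDate></item>
</channel></rss>"""


def parse_items(feed_xml):
    """Return (title, link, ISO date) tuples for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    rows = []
    for item in root.iter("item"):
        published = parsedate_to_datetime(item.findtext("pubDate"))
        rows.append(
            (item.findtext("title"), item.findtext("link"), published.date().isoformat())
        )
    return rows


def store(conn, rows):
    """Save rows into SQLite; the link is the key, so re-scraping does not duplicate."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        "title TEXT, link TEXT PRIMARY KEY, published TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO submissions VALUES (?, ?, ?)", rows)
    conn.commit()


def daily_csv(conn, day):
    """Return the submissions published on `day` (YYYY-MM-DD) as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["title", "link", "published"])
    writer.writerows(
        conn.execute(
            "SELECT title, link, published FROM submissions "
            "WHERE published = ? ORDER BY title",
            (day,),
        )
    )
    return buf.getvalue()


conn = sqlite3.connect(":memory:")
store(conn, parse_items(SAMPLE_FEED))
print(daily_csv(conn, "2022-09-21"))
```

Keying the table on the link is just one way to make a daily re-scrape safe to repeat; the real project may well handle duplicates differently.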
Subscribing to the RSS Feed
Use this URL in Feedly or similar.
Viewing the daily CSV files
They are all here in the repo, starting on Sep 22nd 2022. You can view them directly there, or download them and load them into Excel.
Getting notified by email (experimental)
If you’d like to receive an email with a link to the latest CSV each day:
- Create a GitHub account
- Click the drop-down menu beside “Watch” in the top right of this project’s page.
- Select “Custom” and tick the box beside “Issues”. Then click Apply.
- You should start receiving the emails beginning tomorrow.
Examining the data in the SQLite Database using Datasette Lite
You can use a very cool project by Simon Willison called Datasette Lite to browse and query all the latest data in your browser by going here. I highly recommend playing around with it, as you can query by keywords and date ranges.
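If you would rather run that kind of keyword and date-range query from a script, something along these lines should work against a SQLite file. Treat it as a rough sketch: the `submissions` table, its `title` and `published` columns, and the sample rows are assumptions for illustration, so check the real table and column names in Datasette Lite first.

```python
import sqlite3


def search(conn, keyword, start, end):
    """Rows whose title contains `keyword`, published between two ISO dates (inclusive)."""
    return conn.execute(
        "SELECT title, published FROM submissions "
        "WHERE title LIKE ? AND published BETWEEN ? AND ? "
        "ORDER BY published",
        (f"%{keyword}%", start, end),
    ).fetchall()


# Tiny in-memory stand-in for the real database (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (title TEXT, published TEXT)")
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?)",
    [
        ("Wastewater discharge report", "2022-09-20"),
        ("Annual emissions report", "2022-09-22"),
        ("Wastewater licence review", "2022-10-05"),
    ],
)

print(search(conn, "wastewater", "2022-09-01", "2022-09-30"))
```

Storing dates as `YYYY-MM-DD` text is what makes the plain `BETWEEN` comparison work here, and SQLite's `LIKE` ignores case for ASCII letters by default.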
This was thrown together very quickly. There is minimal error checking or handling. If you’d like to improve this project, let me know and I’ll give you access. It’s unlikely I’ll have much time in the coming months to work on it.