Data Science Problem 2017

Our data science problem is to understand the housing market in Dublin, Ireland. We will examine data that is available online and process it so that we can answer interesting questions about that market.

The data is available through the Residential Property Price Register, which covers Dublin along with the rest of Ireland: https://www.propertypriceregister.ie

Download the CSV file for Dublin for 2017. Note that we may need to download data for several years to understand how prices vary over time.

We will use our Linux skills to explore the data, do some basic calculations, and possibly clean the data.

When processing big data, i.e. data that may be too big to store, or too big to store in several formats, Linux is ideal. Ask about this in class and we can discuss why.

What research questions might we want to answer with this data? What basic questions do we need to ask about the data itself, its format, and the meaning of its fields?

Cleaning the Data

Often the raw data is not in a format that can be easily processed. Before we can investigate it, it must be transformed into a more consistent format. This is called cleaning the data. Once the data is cleaned, the cleaned copy is used for analysis.
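
As a rough illustration of what cleaning can involve, here is a minimal Python sketch. It does not read the real download: the sample rows, the column names (Date of Sale, Postal Code, Price, Description) and the price format with a euro sign and thousands separators are all assumptions made for the example, and the helper functions clean_price, clean_postcode and clean_rows are hypothetical names. Check the actual file before relying on any of them.

```python
import csv
import io

# Hypothetical sample rows. The column names and value formats are
# assumptions for illustration and may not match the real download.
SAMPLE = '''Date of Sale,Address,Postal Code,Price,Description
03/01/2017,"1 Example Road, Dublin",Dublin 4,"€450,000.00",Second-Hand Dwelling
04/01/2017,"2 Sample Street, Dublin",dublin 7 ,"€275,500.00",New Dwelling
05/01/2017,"3 Test Lane, Dublin",,"€310,000.00",Second-Hand Dwelling
'''


def clean_price(raw):
    """Turn a string such as '€450,000.00' into the float 450000.0."""
    return float(raw.replace("€", "").replace(",", "").strip())


def clean_postcode(raw):
    """Normalise spacing and capitalisation; empty values become 'Unknown'."""
    text = " ".join(raw.split()).title()
    return text or "Unknown"


def clean_rows(text):
    """Parse CSV text and return a list of dicts with consistent fields."""
    reader = csv.DictReader(io.StringIO(text))
    cleaned = []
    for row in reader:
        cleaned.append({
            "date": row["Date of Sale"],
            "postcode": clean_postcode(row["Postal Code"]),
            "price": clean_price(row["Price"]),
            "description": row["Description"].strip(),
        })
    return cleaned


rows = clean_rows(SAMPLE)
for r in rows:
    print(r["date"], r["postcode"], r["price"])
```

The point of the sketch is that every row ends up with a numeric price and a postcode written the same way, so later calculations do not have to deal with stray symbols or inconsistent capitalisation.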

Basic Questions about the Data (exploring the data)

How many postcodes are represented in the data?
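
One possible way to answer this kind of question is to count the distinct values in the postcode field. The sketch below uses a short hypothetical list of postcode values, assumed to be already cleaned, in place of the real column.

```python
from collections import Counter

# Hypothetical postcode values, assumed to be already cleaned.
postcodes = ["Dublin 4", "Dublin 7", "Dublin 4", "Unknown", "Dublin 15", "Dublin 7"]

# Count how many sales fall under each postcode.
counts = Counter(postcodes)

# The distinct postcodes are the keys of the counter.
distinct = sorted(counts)
print(len(distinct), "distinct postcodes")

# List postcodes from most to least frequent.
for code, n in counts.most_common():
    print(code, n)
```

Looking at the counts as well as the number of distinct values helps to spot missing or misspelled entries, which usually show up as unexpected extra categories.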

Research Questions (interesting questions that might direct decision-making)

What is the most expensive postcode? The least expensive?
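
"Most expensive" needs a definition before it can be computed. The sketch below assumes we rank postcodes by median sale price, which is only one reasonable choice (the mean or the maximum would be others). The sales list is hypothetical sample data, not taken from the register.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (postcode, price) pairs, assumed to be already cleaned.
sales = [
    ("Dublin 4", 450000.0),
    ("Dublin 4", 650000.0),
    ("Dublin 7", 275500.0),
    ("Dublin 7", 310000.0),
    ("Dublin 15", 240000.0),
]

# Group the prices by postcode.
prices_by_postcode = defaultdict(list)
for postcode, price in sales:
    prices_by_postcode[postcode].append(price)

# Assumption: rank postcodes by their median sale price.
median_by_postcode = {pc: median(p) for pc, p in prices_by_postcode.items()}

most_expensive = max(median_by_postcode, key=median_by_postcode.get)
least_expensive = min(median_by_postcode, key=median_by_postcode.get)

print("Most expensive:", most_expensive, median_by_postcode[most_expensive])
print("Least expensive:", least_expensive, median_by_postcode[least_expensive])
```

The median is less sensitive than the mean to a handful of very expensive sales, which is why it is used here, but postcodes with very few sales will still give unreliable figures.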

Notes

Potential Research Questions

  1. What is the average house price?
  2. Plot a histogram of house prices. How much variation is there in prices?
  3. Plot house price against date as a scatter plot. Can you see a trend?
  4. Which postcodes are the most expensive and the cheapest? Can you explain this by looking at a map of Dublin?
  5. Is there any difference between new dwellings and second-hand dwellings?
  6. What was the most expensive house and why might it be worth so much?
  7. Can we find some way to estimate the size of the house?
  8. What features are important for understanding house prices?
  9. Can we use a model to predict house prices?
  10. What else do we need to improve predictions?
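
As a starting point for questions 1 and 2, here is a minimal Python sketch that computes the average and median price and prints a rough text histogram. The prices are hypothetical sample values and the bin width of 100,000 is an arbitrary assumption; both would need to be replaced once the real data is loaded.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical cleaned prices in euro.
prices = [240000.0, 275500.0, 310000.0, 450000.0, 650000.0, 1200000.0]

average = mean(prices)
middle = median(prices)
print(f"Mean price:   {average:,.2f}")
print(f"Median price: {middle:,.2f}")

# Assumed bin width; each price is assigned to the bin starting at the
# nearest lower multiple of BIN_WIDTH.
BIN_WIDTH = 100_000
bins = Counter(int(p // BIN_WIDTH) * BIN_WIDTH for p in prices)

# Print one row of '#' marks per non-empty bin.
for start in sorted(bins):
    end = start + BIN_WIDTH - 1
    print(f"{start:>9,} - {end:>9,}: {'#' * bins[start]}")
```

Comparing the mean with the median is a quick check for skew: if a few very expensive houses pull the mean well above the median, the histogram will show a long tail to the right.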