Data Science Problem 2017
Our data science problem is about understanding the housing market in Dublin, Ireland.
We will examine data that is available online and process it so that we can answer
interesting questions about it.
The data is available through the Property Price Register for Ireland: https://www.propertypriceregister.ie
Download the CSV file for Dublin for 2017. Note that we might need to download data for several years to understand how prices vary over time.
We will use our Linux skills to explore the data, do some basic calculations, and possibly clean the data.
When processing big data, i.e. data that may be too big to store (or to store in several formats), Linux is ideal. Ask about this in class and we can discuss why.
What research questions might we want to answer with this data? What basic questions do we have about the data itself, its format, and the meaning of its fields?
Cleaning the Data
Raw data is often not in a format that can be processed easily. Before it can be investigated, it must be transformed into a more consistent format. This is called cleaning the data. Once the data is cleaned, the cleaned copy is used for analysis.
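As an illustration of what cleaning can involve, here is a minimal Python sketch (the same steps could be done with Linux tools or MATLAB). The sample rows, the column names, and the price format with a euro sign and thousands separators are assumptions made for the example; check them against the file you actually download.

```python
import csv
import io

# Hypothetical sample rows, assumed to resemble the register's CSV layout.
RAW = '''Date of Sale (dd/mm/yyyy),Address,Postal Code,County,Price
03/01/2017,"1 Example Road, Dublin",Dublin 4,Dublin,"€450,000.00"
04/01/2017,"2 Sample Street, Dublin",,Dublin,"€275,500.00"
'''


def clean_price(text):
    """Turn a string such as '€450,000.00' into the float 450000.0."""
    return float(text.replace("€", "").replace(",", "").strip())


def clean_rows(handle):
    """Yield one cleaned dictionary per CSV row."""
    for row in csv.DictReader(handle):
        yield {
            "date": row["Date of Sale (dd/mm/yyyy)"].strip(),
            # An empty postcode is kept, but marked so it can be counted later.
            "postcode": row["Postal Code"].strip() or "Unknown",
            "price": clean_price(row["Price"]),
        }


cleaned = list(clean_rows(io.StringIO(RAW)))
for record in cleaned:
    print(record)
```

For a real file, replace `io.StringIO(RAW)` with an opened file handle. The downloaded file may not be UTF-8 encoded, so you may need to pass an explicit `encoding` to `open`.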
Basic Questions about the data (exploring the data)
How many postcodes are represented in the data?
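One way to answer this kind of question is to count the distinct values in the postcode column. The sketch below uses a hypothetical list of postcode values in place of the real column, and assumes that some rows have no postcode at all.

```python
from collections import Counter

# Hypothetical values standing in for the postcode column of the CSV file.
postcodes = ["Dublin 4", "Dublin 6", "Dublin 4", "", "Dublin 15", "Dublin 6"]

# Count each non-empty postcode; track empty ones separately.
counts = Counter(code for code in postcodes if code)
missing = sum(1 for code in postcodes if not code)

print("distinct postcodes:", len(counts))
print("rows without a postcode:", missing)
for code, n in counts.most_common():
    print(code, n)
```

Counting the missing values separately is a deliberate choice: it shows how much of the data cannot be used for any postcode-based question.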
Research Questions (interesting questions that might direct decision-making)
What is the most expensive postcode? The least expensive?
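"Most expensive" needs a definition before it can be computed. The sketch below assumes one possible definition, the median sale price per postcode, and uses hypothetical (postcode, price) pairs in place of cleaned data. The median is used here because a single very expensive sale affects it less than it affects the mean.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (postcode, price) pairs standing in for the cleaned data.
sales = [
    ("Dublin 4", 450000.0),
    ("Dublin 4", 650000.0),
    ("Dublin 15", 250000.0),
    ("Dublin 15", 310000.0),
    ("Dublin 6", 520000.0),
]

# Group the prices by postcode.
by_postcode = defaultdict(list)
for code, price in sales:
    by_postcode[code].append(price)

# Summarise each postcode by its median price, then compare postcodes.
medians = {code: median(prices) for code, prices in by_postcode.items()}
most_expensive = max(medians, key=medians.get)
least_expensive = min(medians, key=medians.get)

print("most expensive:", most_expensive, medians[most_expensive])
print("least expensive:", least_expensive, medians[least_expensive])
```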
Notes
- Note that we might need to download data for a few years to understand variations over time.
How does this change your approach to processing the data?
- As we work through the Data Problem, think about how Linux, Excel, and MATLAB can help you
process the data. Which techniques work best for which types of problems?
- Load the data into MATLAB. Hint: use "textread".
- Turn dates in string format into MATLAB date format (numeric values). Hint: use "datenum".
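The idea behind the date conversion is to replace each date string with a day count, so that dates can be subtracted and plotted. Here is a minimal Python sketch of the same idea, assuming the dates are written as dd/mm/yyyy. The date strings are hypothetical. Python's `toordinal` counts days from a different starting point than MATLAB's "datenum", so the two numbers differ by a constant offset of 366.

```python
from datetime import datetime

# Hypothetical date strings, assumed to be in dd/mm/yyyy format.
date_strings = ["03/01/2017", "04/01/2017", "15/06/2017"]


def to_day_number(text):
    """Convert a dd/mm/yyyy string into an integer day count."""
    return datetime.strptime(text, "%d/%m/%Y").toordinal()


day_numbers = [to_day_number(s) for s in date_strings]

# Days elapsed since the first sale: handy as the x-axis of a scatter plot.
days_since_first = [n - day_numbers[0] for n in day_numbers]

print(day_numbers)
print(days_since_first)
```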
Potential Research Questions
- What is the average house price?
- Using a histogram of house prices, how much variation can you find in prices?
- Plot house price against date as a scatter plot. Can you see a trend?
- Which postcodes are the most expensive and the cheapest? Can you explain this by looking at a map of Dublin?
- Is there any difference between new dwellings and second-hand dwellings?
- What was the most expensive house and why might it be worth so much?
- Can we find some way to estimate the size of the house?
- What features are important for understanding house prices?
- Can we use a model to predict house prices?
- What else do we need to improve predictions?
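The first two questions above can be approached with a few summary statistics and a simple binned count. The sketch below uses hypothetical prices and an arbitrarily chosen bin width of 250,000; both are assumptions for illustration, and the bin width in particular changes how the histogram looks.

```python
from statistics import mean, stdev

# Hypothetical prices standing in for the cleaned price column.
prices = [250000.0, 310000.0, 450000.0, 520000.0, 650000.0, 1200000.0]
bin_width = 250000  # assumed bin width; try several values

# Count how many prices fall into each bin, keyed by the bin's lower edge.
histogram = {}
for p in prices:
    lower = int(p // bin_width) * bin_width
    histogram[lower] = histogram.get(lower, 0) + 1

print("mean price:", round(mean(prices)))
print("standard deviation:", round(stdev(prices)))
for lower in sorted(histogram):
    bar = "#" * histogram[lower]
    print(f"{lower:>9,} to {lower + bin_width:>9,}: {bar}")
```

A text histogram like this is enough to see whether prices cluster or have a long tail. A plotting tool such as MATLAB or Excel gives a clearer picture for the full data set.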