use case: bargain detector on immo data


Geoffrey De Smet

Oct 14, 2011, 4:17:24 AM
to bigd...@googlegroups.com
Here's something we might be able to do with the immo bigdata, step by step:

0) Average price overall:
230k

1) Average price per city:
Gent -> 200k
Sint-Martens-Latem -> 300k
Wielsbeke -> 100k

2) Average price per word in the description:
bathroom -> 240k
douche (shower) -> 230k
sauna -> 400k
is -> 210k
gerenoveerd (renovated) -> 250k
2 bedrooms -> 100k

3) Average delta of the price per word in the description, compared to the average price of the ad's city:
bathroom -> +0k
sauna -> +100k
2 bedrooms -> -150k

4) Same as 3), but replace "word in the description" with "building year" (grouped in 10-year buckets):
1930-1940 -> -40k
1990-2000 -> +30k

5) Same as 3), but with the immo ad's display date:
4 years ago -> -30k
3 years ago -> -25k
2 years ago -> -25k
1 year ago -> -10k

...
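
A rough Python sketch of how steps 1-3 could be computed; the "ads" input (a list of dicts with 'price', 'city' and 'description' keys) is only an assumption about how the scraped data might look:

from collections import defaultdict

def city_averages(ads):
    # Step 1: average asking price per city.
    totals = defaultdict(lambda: [0.0, 0])
    for ad in ads:
        totals[ad['city']][0] += ad['price']
        totals[ad['city']][1] += 1
    return {city: s / n for city, (s, n) in totals.items()}

def word_deltas(ads, city_avg):
    # Steps 2-3: average price delta per description word, measured
    # against the average price of the ad's city (step 3 subsumes step 2).
    deltas = defaultdict(list)
    for ad in ads:
        delta = ad['price'] - city_avg[ad['city']]
        for word in set(ad['description'].lower().split()):
            deltas[word].append(delta)
    return {word: sum(ds) / len(ds) for word, ds in deltas.items()}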

Based on all these mappings, you get a formula (to start with, simply sum all the deltas) for a single ad, which results in a "predicted price".
Try that formula on every ad in the database and compare it with the real price of the ad.
For 80% of the cases, this should be correct to within 5%.
The other 20% of the cases are the bargains and anti-bargains.
Now, when a new ad is added to Immoweb, we know whether it's probably a bargain or not :)

Filter out all ads without a description or price or ...
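
A hedged sketch of the prediction and bargain-flagging step, reusing the hypothetical city_averages/word_deltas helpers from the sketch above; the 5% band comes from the numbers in the post, everything else (key names, the overall-average fallback) is an assumption:

def predicted_price(ad, city_avg, deltas, overall_avg):
    # "Sum" formula: city average plus the summed word deltas;
    # overall_avg is the step-0 average, used for unseen cities.
    base = city_avg.get(ad['city'], overall_avg)
    words = set(ad['description'].lower().split())
    return base + sum(deltas.get(w, 0.0) for w in words)

def find_bargains(ads, city_avg, deltas, overall_avg, band=0.05):
    # Drop ads that lack a price or a description before scoring.
    usable = [ad for ad in ads if ad.get('price') and ad.get('description')]
    bargains, anti_bargains = [], []
    for ad in usable:
        predicted = predicted_price(ad, city_avg, deltas, overall_avg)
        if ad['price'] < (1 - band) * predicted:
            bargains.append(ad)       # cheaper than the model expects
        elif ad['price'] > (1 + band) * predicted:
            anti_bargains.append(ad)  # pricier than the model expects
    return bargains, anti_bargains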

Geoffrey De Smet

Oct 14, 2011, 4:20:25 AM
to bigd...@googlegroups.com
This kind of stuff works, if and only if there's enough data. There is a critical amount of data we need for this to work.
Having 100 times more data can mean an accuracy improvement from 70% (useless) to 99% (useful).
I am not sure whether enough houses are being sold for there to be enough data.

André Kelpe

Oct 14, 2011, 10:32:13 AM
to bigd...@googlegroups.com
BTW: Here are the other OpenStreetMap downloads I mentioned yesterday:

http://downloads.cloudmade.com/europe/western_europe/belgium#downloads_breadcrumbs

-- André

Kenny Helsens

Oct 14, 2011, 12:52:36 PM
to bigd...@googlegroups.com
Hi all,

I don't see the formula in there yet, but it's definitely a nice vector of variables one can calculate for each house sale. (My personal favorite is your 3rd point, which would be a tough analysis in itself imho.)
Furthermore, by calculating OpenStreetMap/OpenBelgium variables, we should be able to scale this dataset to bigdata proportions.

e.g., 
- distance to E40
- distance to Gent
- inhabitants per city
- sales per city
- ...
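
For instance, a minimal sketch of the two distance features, assuming each ad can be geocoded to hypothetical 'lat'/'lon' fields (e.g. via the OpenStreetMap extract André linked):

import math

GENT = (51.05, 3.72)  # rough city-centre coordinates, an assumption

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two WGS84 points, in kilometres.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def distance_to_gent(ad):
    return haversine_km(ad['lat'], ad['lon'], *GENT)

The distance to the E40 would need the highway geometry from the same extract, so it is left out of this sketch.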

> This kind of stuff works, if and only if there's enough data. There is a critical amount of data we need for this to work.
> Having 100 times more data can mean an accuracy improvement from 70% (useless) to 99% (useful).
> I am not sure whether enough houses are being sold for there to be enough data.

70% accuracy would be a fair improvement over random, so I would not say that such a prediction would be useless. Also, I think it's not purely the number of houses that is relevant, but rather the discriminative power of the variables we calculate for each sale. If the database contains a few hundred thousand sales, I would definitely expect us to reveal some nice insights.

Keep the discussion going, and let the Doodle come!

best regards,
kenny

Klaas Bosteels

Nov 14, 2011, 2:02:21 PM
to bigd...@googlegroups.com
I wasn't at the last meetup, so you guys might've discussed (some of) these things already, but if I had to implement something like this I'd take the following things into account:

1) Watch out for outliers. Pretty much all possible ways of computing an average tend to be quite sensitive to outliers, so everything could get messed up by a relatively small number of prices that somehow have a few too many zeros for instance. Throwing more data at the problem should mitigate this issue, but even with substantial amounts of data you might still want to make sure your model is robust enough to deal with outliers.

2) Take trends into consideration. A price that would've been considered a bargain a few months or years ago might not be a bargain anymore today, so I'm guessing it would make sense to take trends into account by, e.g., using something like exponential smoothing instead of simply computing averages.

3) Maybe even take seasonality into consideration. I'm not sure to what extent this is true, but I've heard that December is a good month for getting an apartment in some cities, because most people tend to wait for the new year when planning to move. Such seasonal effects could also be taken into account by using a more advanced modeling technique such as exponential smoothing.
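
As an illustration of points 1 and 2, a small sketch of a trimmed mean (less sensitive to outlier prices than a plain mean) and of simple exponential smoothing over a time-ordered price series; the trim fraction and the alpha value are arbitrary assumptions:

def trimmed_mean(prices, trim=0.05):
    # Drop the cheapest and the most expensive 5% before averaging, so a
    # few prices with too many zeros cannot drag the average around.
    prices = sorted(prices)
    k = int(len(prices) * trim)
    core = prices[k:len(prices) - k] or prices
    return sum(core) / len(core)

def smoothed_price(prices, alpha=0.3):
    # Simple exponential smoothing over prices ordered oldest to newest,
    # so recent sales weigh more than older ones.
    level = prices[0]
    for price in prices[1:]:
        level = alpha * price + (1 - alpha) * level
    return level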

Just my 2 cents,
-Klaas
