I think an intermediate step before looking at more complex models
like neural networks might be to try some form of regression like the
ones covered in ml-class (linear, polynomial, logistic, etc.).
For our first submission, I simply used the sample file included on
kaggle's website without writing any code.
I've been quite busy with deadlines in four classes but I was thinking
we can probably re-use the Octave/Matlab code from http://ml-class.org
and http://openclassroom.stanford.edu.
I'm planning to spend more time on kaggle projects as soon as I finish
an SVM project for a machine learning class I'm taking at a local
university.
As Dan mentioned, volumes would most likely be used if we were to
build a real model.
However, I think we can still learn from "toy problems" since we're
fairly new to AI competitions. If you prefer, we can participate in a
different competition. I think we need to get some experience with
simple competitions before we jump to large competitions like
http://heritagehealthprize.com.
What do you think?
Thanks for your interest. ;) Maxime
On Sat, Nov 19, 2011 at 2:48 PM, Paul Tan <paul...@gmail.com> wrote:
> Could you post the code you used to generate the data?
>
> I'm trying to get started, but wanted to get a framework to input the
> data and generate the output data that the competition requested.
>
> I am thinking of just putting the data into a neural network with a
> large number of hidden units and see what happens. I will post my
> Octave code here for others to see if they have a better idea.
>
> Thanks.
>
> Paul Tan.
I'll take a look at the download sample from the site and see if I can
load it into Octave.
Paul Tan.
Here is how to set up a database for the algo training data:
http://www.kaggle.com/c/AlgorithmicTradingChallenge/forums/t/1032/importing-the-data-into-a-database
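(If you just want something quick and local instead, one option is SQLite via
pandas. This is only my own sketch, not the approach from the forum post, and
the file/table names are placeholders.)

# Load training.csv into a local SQLite database so it can be queried
# without re-parsing the CSV every time. File and table names are placeholders.
import sqlite3
import pandas as pd

conn = sqlite3.connect("atrade.db")
# Read in chunks so the whole file does not need to fit in memory at once.
for chunk in pd.read_csv("training.csv", chunksize=100000):
    chunk.to_sql("training", conn, if_exists="append", index=False)
conn.close()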
The script:
https://github.com/fidlej/atrade/blob/master/displaycurves.py
Example usage:
./displaycurves.py path/to/training.csv
Anyone can fork the git repo and enhance the code.
I will describe the method I used in a later email.
For now, enjoy the hot news.
To minimize the error, I used means.
The mean of a set of values minimizes the squared error against those values.
I compute means of price changes:
price_change = event[t].price - event[50].price
Each mean is computed over rows with the same (security_id, initiator). I also
keep separate means at event51, ..., event100. This captures how the mean
change evolves as more events arrive.
When predicting a value, I add the mean change to the price of the
50th event.
See means.py and produce_response.py in the git repo for details.
https://github.com/fidlej/atrade
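For reference, here is a minimal pandas sketch of the same idea. It is not the
actual means.py code, and the column names (security_id, initiator,
bid50..ask100) are assumptions about the CSV layout:

# A minimal sketch of the grouped-means idea (not the actual means.py code).
# Column names are assumptions about the CSV layout.
import pandas as pd

train = pd.read_csv("training.csv")
response_cols = ["%s%d" % (side, t) for side in ("bid", "ask") for t in range(51, 101)]

# price change of each response column relative to the 50th event
changes = pd.DataFrame(index=train.index)
for col in response_cols:
    base = "bid50" if col.startswith("bid") else "ask50"
    changes[col] = train[col] - train[base]
changes["security_id"] = train["security_id"]
changes["initiator"] = train["initiator"]

# one mean change per (security_id, initiator) group and per response column
mean_changes = changes.groupby(["security_id", "initiator"]).mean()

def predict(row):
    # predicted prices for one test row: 50th price plus the group's mean change
    # (ignores the case of a key never seen in training)
    group = mean_changes.loc[(row["security_id"], row["initiator"])]
    return {col: row["bid50" if col.startswith("bid") else "ask50"] + group[col]
            for col in response_cols}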
Feel free to experiment.
You can try computing means over different groups.
You can also do "Error Analysis" as suggested by Ng: see which
validation examples are problematic.
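For instance (my own sketch, not something from the repo; file and column
names are placeholders for a held-out split and its predictions), you could
rank security_ids by their validation error:

# Per-security error analysis sketch. File and column names are placeholders.
import pandas as pd

truth = pd.read_csv("validation_truth.csv")
preds = pd.read_csv("validation_predictions.csv")
response_cols = ["%s%d" % (side, t) for side in ("bid", "ask") for t in range(51, 101)]

# mean squared error per row, then averaged per security_id
row_mse = ((preds[response_cols] - truth[response_cols]) ** 2).mean(axis=1)
per_security = row_mse.groupby(truth["security_id"]).mean().sort_values(ascending=False)
print(per_security.head(20))  # the security_ids that hurt the most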
Wow that's great progress. Thanks a lot for this contribution, I'm
looking forward to helping you out and improving your solution in a
few days. Doing "Error Analysis" will most likely help us focus on
areas to improve.
Good job ;) Maxime
I'm finding it difficult to make sense of the data...
As I understand it, each row is a day, and in a day there are only 50
movements (?).
The movements are:
Q, if the prices are adjusted (but in the data I see times where the
event is Q, yet there is no change in the ask or bid);
T, if there has been a trade (but we have no information about it?).
Can some of you help me or give me some documentation so I can
understand the challenge?
I looked at Ivo's code, but I'm not too familiar with Python... and I'm
finding it difficult to follow.
thnx!
Well, I will try to summarize Ivo's code for those who do not know
Python.
Let me know if I have missed something important or got it wrong.

Summary of Ivo's work:
We train on the training data and update the testing data to produce the
output.
- For each Bid/Ask value in the training set, from the 51st through the
  100th, we calculate the mean/average difference between it and the
  50th Bid/Ask.
- We keep a running total of these differences in the file Changes.pickle,
  summing over rows with the same key combination (security_id/initiator).
- Once trained, we open the testing file and create a new output file by
  replacing the 51st through 100th Bid/Ask prices with the 50th Bid/Ask
  price plus the mean/average value for the corresponding column from the
  Changes.pickle file.

File Changes.pickle:
- It is produced/updated by 'means.py' and is used by 'produce_response.py'
  to generate the output file for submission to Kaggle.
- The raw Changes.pickle file is a serialized document; see
  http://docs.python.org/library/pickle.html for more info.
- It is easier to understand the de-serialized structure of the file:
  - Each row (viewed as a set of columns) in this file is specific to a
    particular key (security_id/initiator combination).
  - The first element of the row (index 0) tells us how many rows in the
    training set with the same key have been processed so far. This value
    is used to calculate the mean (average) later on.
  - The remaining columns, in pairs, keep the sum of differences between
    the base ask/bid price (the 50th ask/bid price in the input row) and
    the bid/ask prices in the input training row, from the 51st through
    the 100th. The relevant code is in 'means.py', lines 60-62.
    - For example, if an input row in the training set has a (base) 50th
      bid/ask price of $23/$24 and an 87th bid/ask price of $18/$19.5, the
      difference of -$5/-$4.5 is added to the existing values at columns
      75/76.
    - The row in the Changes.pickle file to which this difference is added
      depends on the key value (security_id/initiator combination).
- When all training examples have been used to update the corresponding
  rows in Changes.pickle, the output data is produced by lines 41-45 in
  produce_response.py:
  - As discussed in the summary, each output value is the 50th Bid/Ask
    value (in the testing file) plus the mean of the Bid/Ask differences
    calculated from the training set.
  - We use the element at index zero of each row in Changes.pickle to
    calculate the mean Bid/Ask price difference for that row.

Hmm, my explanation didn't really turn out as clear as I wanted it to
be. But it is a start...
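To make the flow concrete, here is a compressed sketch of how a
Changes.pickle-style file ties the two phases together. This is not Ivo's
actual code; the key format and the toy numbers are only illustrative.

# Compressed sketch of the train/predict split described above (not Ivo's code).
import pickle

# --- training phase (the means.py role): accumulate running totals per key ---
changes = {}  # (security_id, initiator) -> [count, sum_diff_1, sum_diff_2, ...]

def accumulate(key, diffs):
    row = changes.setdefault(key, [0] + [0.0] * len(diffs))
    row[0] += 1                 # index 0: training rows seen so far for this key
    for i, d in enumerate(diffs):
        row[1 + i] += d         # remaining columns: summed differences vs. the 50th price

# toy example: two training rows for the same key, two response columns each
accumulate(("sec1", "B"), [0.5, 1.0])
accumulate(("sec1", "B"), [1.5, 2.0])

with open("changes.pickle", "wb") as f:
    pickle.dump(changes, f)

# --- prediction phase (the produce_response.py role) ---
with open("changes.pickle", "rb") as f:
    changes = pickle.load(f)

def predicted_price(key, price_at_event_50, col_index):
    count = changes[key][0]
    mean_change = changes[key][1 + col_index] / count
    return price_at_event_50 + mean_change

print(predicted_price(("sec1", "B"), 23.0, 0))  # 23.0 + (0.5 + 1.5) / 2 = 24.0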
Yes, this seems a bit "strange". I have no idea.
Regards,
Herbert
--
=================================================================
Herbert Muehlburger Software Development and Business Management
Graz University of Technology
www.muehlburger.at www.twitter.com/hmuehlburger
=================================================================
Yes, I saw that too. I guess it's not too hard to find our group with a
few Google searches.
Do you think we should switch to a code repository with passwords?
Thanks. Maxime
Was the group public?
It might be someone in the group, too... It is a difficult problem.
Can you see who downloads the source code if you switch to one with
passwords? That would be the only way to deal with it.
On 23.11.2011 00:13, Maxime Leclerc wrote:
> Yes I saw that too. I guess it's not too hard to find our group with a
> few google searches.
>
> Do you think we should switch to a code repository with passwords?
If you want to switch to a "private" repo, maybe [1] is a good choice?
I think it will always be a bit of a problem as long as you can win
money on Kaggle. Sharing your code is great, but if someone takes
advantage of it, it's not so great for the whole group any more.
Cheers,
Herbert
The improvement was made by analyzing the errors on a validation set.
One security_id is then handled specially.
More details can be seen on our Kaggle "Submissions" page.
It is getting harder to get an improvement in the score.
It will be interesting to read the winning entry description after the
end of the competition.
On Nov 23, 3:43 pm, Ivo Danihelka <i...@danihelka.net> wrote:
> I corrected the order on the Kaggle Leaderboard.
> We are 3rd again. http://www.kaggle.com/c/AlgorithmicTradingChallenge/Leaderboard
What exactly are we trying to predict? What is special about the 50th
bid?
Regards
Eby
Some other resources:
1) The data schema description on the Data page:
http://www.kaggle.com/c/AlgorithmicTradingChallenge/Data
We predict the "Responses" part of the rows.
2) Download and read the data files.
Spend time looking at them.
Visualize them (a rough plotting sketch is included after this list).
Some nice visualization tips from past competitions:
http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
3) Read the existing threads on the Kaggle forum. The organizers
provided some info about the dataset there.
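Here is the rough plotting sketch mentioned above, for eyeballing a single
training row. It assumes matplotlib and pandas, and it guesses at
bid1..bid100 / ask1..ask100 column names; adjust to the real schema on the
Data page.

# Rough sketch for eyeballing one training row (column names are guesses).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("training.csv", nrows=10)
row = df.iloc[0]
events = range(1, 101)
plt.plot(events, [row["bid%d" % t] for t in events], label="bid")
plt.plot(events, [row["ask%d" % t] for t in events], label="ask")
plt.axvline(50, linestyle="--", color="gray", label="event 50")
plt.xlabel("event")
plt.ylabel("price")
plt.legend()
plt.show()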
--
Ivo Danihelka