Rhino ETL to build a stock market database from files


Twist

Nov 28, 2013, 4:27:09 AM11/28/13
to rhino-t...@googlegroups.com
Hi,

I searched for a similar requirement but did not manage to find a suitable answer, so apologies in advance if I am re-asking a question that others have already asked :)

I would like to use Rhino ETL for my file processing. I have about 200K zipped CSV files, each containing daily prices for stocks, and I have to:

1 - unzip 
2 - process
3 - validate (?)
4 - save in db

In total there must be more than 400 million rows of data. Rhino ETL seems very suitable for this because it lets me apply the pipeline pattern "easily", but right now I don't understand how to separate my steps: in every example I have seen, the only type returned by an operation is an IEnumerable&lt;Row&gt;.
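For context, the usual way to separate steps in Rhino ETL is to make each step its own operation and register them in order inside an EtlProcess. A minimal sketch of how the four steps above could be wired up (all operation class names here are hypothetical placeholders):

```csharp
using Rhino.Etl.Core;

// Sketch: each pipeline step becomes one operation; Register() appends it
// to the process, and rows flow through the operations in registration order.
public class StockImportProcess : EtlProcess
{
    protected override void Initialize()
    {
        Register(new UnzipAndReadCsvOperation());  // 1 - unzip + read
        Register(new ParsePricesOperation());      // 2 - process
        Register(new ValidateRowsOperation());     // 3 - validate
        Register(new BulkInsertPricesOperation()); // 4 - save in db
    }
}

// Usage: new StockImportProcess().Execute();
```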
 
I also would like to bulk insert the data. Right now I am using SqlBulkCopy, but some files contain duplicates, which bothers me. I would like to remove them before inserting, but whenever I add a duplicate-existence check, it slows the process down too much.

Right now my pipeline is composed of the steps I mentioned.

Is it possible for an operation to return a stream, or something other than Row?

Again, sorry if this is a repeated subject; please do not hesitate to redirect me.

Thanks!

Louis Haußknecht

Nov 28, 2013, 8:32:18 AM11/28/13
to rhino-t...@googlegroups.com
Hi,

For the input part, you can build an input operation that uses a ZipInputStream (from SharpZipLib) to read directly from an archive:

  using System.Collections.Generic;
  using System.IO;
  using ICSharpCode.SharpZipLib.Zip;
  using Rhino.Etl.Core;
  using Rhino.Etl.Core.Operations;

  public class ReadCompressedCSV : AbstractOperation
  {
      public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
      {
          using (var file = File.OpenRead(@"c:\temp\test.zip"))
          using (var zipInputStream = new ZipInputStream(file))
          {
              // Assume the archive contains only one entry.
              zipInputStream.GetNextEntry();
              using (var sr = new StreamReader(zipInputStream))
              {
                  string read;
                  while ((read = sr.ReadLine()) != null)
                  {
                      var splitted = read.Split(';');
                      var inputFormat = new InputFormat
                      {
                          Column1 = splitted[0],
                          Column2 = splitted[1]
                      };
                      yield return Row.FromObject(inputFormat);
                  }
              }
          }
      }
  }
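Since the original question involves roughly 200K archives, the same idea extends to iterating over a whole directory of zip files in one operation. A sketch, assuming (as above) one CSV entry per archive; the directory path and the InputFormat type are placeholders:

```csharp
using System.Collections.Generic;
using System.IO;
using ICSharpCode.SharpZipLib.Zip;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// Reads every *.zip in a directory, streaming rows out lazily so the
// whole data set never has to fit in memory at once.
public class ReadAllCompressedCSVs : AbstractOperation
{
    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        foreach (var zipPath in Directory.EnumerateFiles(@"c:\temp\prices", "*.zip"))
        {
            using (var file = File.OpenRead(zipPath))
            using (var zip = new ZipInputStream(file))
            using (var sr = new StreamReader(zip))
            {
                zip.GetNextEntry(); // one CSV entry per archive, as above
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    var splitted = line.Split(';');
                    var inputFormat = new InputFormat
                    {
                        Column1 = splitted[0],
                        Column2 = splitted[1]
                    };
                    yield return Row.FromObject(inputFormat);
                }
            }
        }
    }
}
```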

Try to sort your input files first! Then the duplicate check is easy to implement if you remember the last row.
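That last-row check can be its own operation in the pipeline. A sketch, assuming the rows arrive sorted on the key; the key column names ("Symbol", "Date") are made up for illustration:

```csharp
using System.Collections.Generic;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// Drops consecutive duplicates from a sorted row stream. Because the input
// is sorted by key, a duplicate is always adjacent to the row it repeats,
// so remembering only the last row is enough -- no hash set, no db lookup.
public class RemoveConsecutiveDuplicates : AbstractOperation
{
    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        Row last = null;
        foreach (var row in rows)
        {
            if (last == null
                || !Equals(row["Symbol"], last["Symbol"])
                || !Equals(row["Date"], last["Date"]))
            {
                yield return row;
            }
            last = row;
        }
    }
}
```

Registered between the reading and bulk-insert operations, this keeps the insert path fast because the expensive per-row existence check disappears.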



2013/11/28 Twist <htwi...@gmail.com>

--
You received this message because you are subscribed to the Google Groups "Rhino Tools Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rhino-tools-d...@googlegroups.com.
To post to this group, send email to rhino-t...@googlegroups.com.
Visit this group at http://groups.google.com/group/rhino-tools-dev.
For more options, visit https://groups.google.com/groups/opt_out.
