Unable to correctly load data from csv file into zipline

Ravi Shukla

Aug 18, 2015, 2:08:38 PM

to Zipline Python Opensource Backtester

I was trying out zipline with my own CSV file. I took the data that is downloaded from Yahoo in the default case and copied it into a CSV file.

The file format is:

Date,Open,High,Low,Close,Volume,Adj Close

2012-01-03,409.399971,412.499989,408.999989,411.229973,75555200,54.934461

But when I print the data being passed to handle_data, I get this:

{'volume': 1000, 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'price': 409.39997099999999, 'sid': 'Open'}

{'volume': 1000, 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'price': 412.49998900000003, 'sid': 'High'}

{'volume': 1000, 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'price': 408.99998900000003, 'sid': 'Low'}

{'volume': 1000, 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'price': 411.22997299999997, 'sid': 'Close'}

{'volume': 1000, 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'price': 75555200.0, 'sid': 'Volume'}

{'volume': 1000, 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'price': 54.934460999999999, 'sid': 'Adj Close'}

{'volume': 1000, 'dt': Timestamp('2012-01-04 00:00:00+0000', tz='UTC'), 'price': 410.00001099999997, 'sid': 'Open'}

BarData({'Volume': SIDData({'volume': 1000, 'sid': 'Volume', 'source_id': 'DataFrameSource-7006398718743e03d6d00635b63c8e98', 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'type': 4, 'price': 75555200.0}), 'Adj Close': SIDData({'volume': 1000, 'sid': 'Adj Close', 'source_id': 'DataFrameSource-7006398718743e03d6d00635b63c8e98', 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'type': 4, 'price': 54.934461}), 'High': SIDData({'volume': 1000, 'sid': 'High', 'source_id': 'DataFrameSource-7006398718743e03d6d00635b63c8e98', 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'type': 4, 'price': 412.499989}), 'Low': SIDData({'volume': 1000, 'sid': 'Low', 'source_id': 'DataFrameSource-7006398718743e03d6d00635b63c8e98', 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'type': 4, 'price': 408.999989}), 'Close': SIDData({'volume': 1000, 'sid': 'Close', 'source_id': 'DataFrameSource-7006398718743e03d6d00635b63c8e98', 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'type': 4, 'price': 411.229973}), 'Open': SIDData({'volume': 1000, 'sid': 'Open', 'source_id': 'DataFrameSource-7006398718743e03d6d00635b63c8e98', 'dt': Timestamp('2012-01-03 00:00:00+0000', tz='UTC'), 'type': 4, 'price': 409.399971})})

The code I use to load the data and run the algorithm is simply:

data = pd.read_csv('appleDataFromYahoo.csv', index_col='Date', parse_dates=True)

data.index = data.index.tz_localize(pytz.UTC)

data.head()

algo = TradingAlgorithm(initialize=initialize, handle_data=handle_data, capital_base=1000000)

results = algo.run(data)

The data I am getting in the logs is obviously wrong, because in the default case handle_data prints something like this:

BarData({'AAPL': SIDData({'high': 44.118027429170454, 'open': 43.50086143458919, 'price': 44.025853000000005, 'volume': 111284600, 'low': 43.393986829608814, 'sid': 'AAPL', 'source_id': 'DataPanelSource-714fe8ac9fdca11967199c1edefb9597', 'close': 44.025853000000005, 'dt': Timestamp('2011-01-03 00:00:00+0000', tz='UTC'), 'type': 4})})

Can someone please point me in the right direction as to what I might be doing wrong? Is my file format wrong? Do I need to provide another field specifying the symbol ID, or something else?

John Ricklefs

Aug 18, 2015, 7:08:48 PM

to Zipline Python Opensource Backtester

Hi Ravi,

It looks like the format of the DataFrame you are passing into TradingAlgorithm.run() isn't quite right. To use a DataFrame as a datasource, you need it to have symbols for columns, a DatetimeIndex, and price data in each row.

Looking at your output above, it appears that using pd.read_csv() in this way is causing the CSV's columns to be interpreted as symbols, hence all of the "sid" fields in the BarData are "Open", "Close", "Volume", etc.

For an example DataFrame matching the format TradingAlgorithm.run() expects, you could inspect the 'data' object from the sample below:

import pytz
from datetime import datetime
from zipline.utils.factory import load_from_yahoo

# Timezone-aware start and end dates for the download window
start = datetime(2004, 1, 1, 0, 0, 0, 0, pytz.utc)
end = datetime(2008, 1, 1, 0, 0, 0, 0, pytz.utc)
STOCKS = ['AAPL', 'INTC'] # ... etc

# Returns a DataFrame with one column per symbol and a DatetimeIndex
data = load_from_yahoo(stocks=STOCKS, indexes={}, start=start, end=end)
data = data.dropna()
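
For a single CSV like the one in the original post, a rough sketch of getting into that shape with pandas alone might look like the following. The CSV text is the one row quoted above, the 'AAPL' label is assumed from the file name, and picking the Close column as the price is an arbitrary choice for illustration (Adj Close may be what you actually want).

```python
import io

import pandas as pd

# The header and the single data row quoted in the original post.
CSV_TEXT = """Date,Open,High,Low,Close,Volume,Adj Close
2012-01-03,409.399971,412.499989,408.999989,411.229973,75555200,54.934461
"""

raw = pd.read_csv(io.StringIO(CSV_TEXT), index_col="Date", parse_dates=True)
raw.index = raw.index.tz_localize("UTC")

# As loaded, the columns are field names, which is why they appeared as sids.
print(list(raw.columns))
# ['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

# Reshape to one column per symbol with one price per row.
# 'AAPL' is an assumed label; 'Close' is an illustrative choice of price.
data = pd.DataFrame({"AAPL": raw["Close"]})
print(list(data.columns))
# ['AAPL']
```

A frame built this way carries only a single price per bar, so fields such as open, high, low and volume are not available to the algorithm.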

Hope that helps!

--John "JD" Ricklefs
Engineering @ Quantopian

Ravi Shukla

Aug 19, 2015, 5:02:38 AM

to Zipline Python Opensource Backtester

Hi John,

That was really helpful! Thanks :)
I looked into the implementation of the load_bars_from_yahoo method in the loader module to understand how data is loaded in the default case.

For anyone facing this issue in the future, this is what I did:
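from collections import OrderedDict
import pandas as pd
import pytz

dictData = OrderedDict()

data = pd.read_csv('appleDataFromYahoo.csv', index_col='Date', parse_dates=True)
data.index = data.index.tz_localize(pytz.UTC)
data.head()

# Key the frame by its symbol so that 'AAPL' becomes the sid
dictData['AAPL'] = data

# Note: pd.Panel was removed in pandas 0.25, so this needs an older pandas
panel = pd.Panel(dictData)
# Rename Open, High, Low, Close, Volume, Adj Close (in CSV order) to lowercase field names
panel.minor_axis = ['open', 'high', 'low', 'close', 'volume', 'price']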

and then passed the panel object to the run function.
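
Assigning the new field names by position only works while the CSV columns stay in exactly that order. A slightly more defensive sketch of the renaming step, mapping each column by name, could look like this. The FIELD_NAMES mapping and the per_symbol dict are illustrative names, not part of zipline, and the sketch stops short of building the panel because pd.Panel is not available in recent pandas releases.

```python
import io

import pandas as pd

# The header and the single data row quoted in the original post.
CSV_TEXT = """Date,Open,High,Low,Close,Volume,Adj Close
2012-01-03,409.399971,412.499989,408.999989,411.229973,75555200,54.934461
"""

# Hypothetical mapping from Yahoo CSV headers to the lowercase field names
# used in the snippet above ('Adj Close' becomes 'price').
FIELD_NAMES = {
    "Open": "open",
    "High": "high",
    "Low": "low",
    "Close": "close",
    "Volume": "volume",
    "Adj Close": "price",
}

frame = pd.read_csv(io.StringIO(CSV_TEXT), index_col="Date", parse_dates=True)
frame.index = frame.index.tz_localize("UTC")

# Rename by column name, so a reordered CSV cannot silently mislabel fields.
frame = frame.rename(columns=FIELD_NAMES)

# Fail loudly if the CSV lacked any of the expected columns.
missing = set(FIELD_NAMES.values()) - set(frame.columns)
if missing:
    raise ValueError("CSV is missing fields: %s" % sorted(missing))

# One frame per symbol, keyed by the symbol that should appear as the sid.
per_symbol = {"AAPL": frame}
print(list(per_symbol["AAPL"].columns))
```

With an older pandas that still ships pd.Panel, a dict like per_symbol could presumably be handed to pd.Panel in place of dictData, without the positional minor_axis assignment.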
