Performance issues with custom CSV data

Douglas S

Aug 28, 2017, 2:40:25 PM8/28/17
to Zipline Python Opensource Backtester
Hi all,

I'm attempting to run a very simple algorithm on a set of custom datapoints I imported through the data parameter for run_algorithm().

The data itself is at 1-minute resolution. The backtest covers one month of data (~40k bars) and takes somewhere between 1 and 2 minutes to run.

I have a feeling this is slower than it should be. I'm attaching the pstats profile of the run; hopefully someone can take a quick look and spot the major bottleneck.

I can see a lot of the pandas functions taking a significant time to execute, but I'm not sure why or how I can optimise that.
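For anyone who wants to open the attached profile: the stdlib pstats module can rank the hot spots by cumulative time. A minimal, self-contained sketch below (it profiles a toy statement purely so the snippet runs standalone; with the real attachment you would skip the cProfile step and just load algo1.pstat directly):

```python
import cProfile
import io
import pstats

# Profile a throwaway workload and dump it to a stats file, the same
# on-disk format as the attached algo1.pstat.
cProfile.run("sum(i * i for i in range(100_000))", "algo1.pstat")

# Load the profile and print the 10 entries with the highest cumulative
# time -- usually enough to spot the dominant bottleneck.
stream = io.StringIO()
stats = pstats.Stats("algo1.pstat", stream=stream)
stats.sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Sorting by "tottime" instead of "cumulative" is often the second thing to try, since it surfaces functions that are themselves slow rather than merely calling slow things.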

My code for the data import is below:
from collections import OrderedDict

import pandas as pd
import pytz


def csv_mt4_ohlcv_to_panel( filename: str, symbol: str, debug: bool = False ) -> pd.Panel:
    """
    Imports data from a trimmed CSV file into a pandas Panel.
    
    :param filename: Path to the CSV file
    :param symbol: Symbol to key the panel item by
    :param debug: Whether to print the dataframe head and the panel
    :return: A single-item Panel containing the OHLCV data
    """
    
    df = pd.read_csv( filename,
                      # header = None,
                      # names = ['date', 'time', 'open', 'high', 'low', 'close', 'volume'],
                      # parse_dates = [['date', 'time']],
                      parse_dates = ['date_time'],
                      compression = 'infer',
                      )
    
    df.set_index( 'date_time', inplace = True )
    # tz_localize returns a new object, so the result must be assigned back.
    df = df.tz_localize( pytz.timezone( 'EET' ) )
    
    if debug:
        print( df.head() )
    
    od = OrderedDict()
    od[symbol] = df
    
    panel = pd.Panel( od )
    panel.minor_axis = ['open', 'high', 'low', 'close', 'volume']
    
    if debug:
        print( panel )
    
    return panel
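One subtle pandas behavior worth double-checking in code like the above: DataFrame.tz_localize returns a copy rather than modifying the frame in place, so calling it without assigning the result silently leaves the index naive. A minimal standalone demonstration:

```python
import pandas as pd

# Tiny naive-index frame standing in for the imported CSV data.
idx = pd.date_range("2016-02-01", periods=3, freq="min")
df = pd.DataFrame({"close": [1.0, 1.1, 1.2]}, index=idx)

# tz_localize returns a *copy*; the original index stays naive.
df.tz_localize("EET")
assert df.index.tz is None

# The result has to be assigned back for the localization to stick.
df = df.tz_localize("EET")
assert str(df.index.tz) == "EET"
```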

And my code for the actual strategy is below:
from zipline.api import order_target, record, symbol


def initialize( context ):
    context.i = 0
    context.security = symbol( 'EURUSD' )
    return


def handle_data( context, data ):
    context.i += 1
    # Wait until a full long-window's worth of bars is available.
    if context.i < 250:
        return
    
    current_positions = context.portfolio.positions[context.security].amount
    
    # data.history() returns a Series of prices; .mean() reduces it to a scalar.
    ma_long = data.history( context.security, 'price', 250, '1m' ).mean()
    ma_short = data.history( context.security, 'price', 50, '1m' ).mean()
    
    if ma_short > ma_long:
        order_target( context.security, 1 )
    elif ma_short < ma_long:
        order_target( context.security, -1 )
    
    record( eurusd = data.current( context.security, 'price' ),
            short_mavg = ma_short,
            long_mavg = ma_long )
    
    return

algo = run_algorithm(
        # pd.Timestamp localizes correctly; passing a pytz timezone straight
        # into the datetime constructor yields a wrong (LMT) offset.
        start = pd.Timestamp( '2016-02-01', tz = 'EET' ),
        end = pd.Timestamp( '2016-03-31', tz = 'EET' ),
        initialize = initialize,
        handle_data = handle_data,
        # analyze = analyze,
        data = dataset,
        data_frequency = 'minute',
        capital_base = 1e6 )
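As a side note on where the time likely goes: slicing a price window and averaging it on every bar is O(window) work per bar, repeated ~40k times. A pandas-only sketch (synthetic data, not the actual EURUSD feed) showing that the same moving averages can be precomputed once with rolling():

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute price series, roughly 40k bars as in the backtest.
prices = pd.Series(
    1.10 + np.cumsum(np.random.default_rng(0).normal(0, 1e-4, 40_000))
)

# Per-bar approach (what handle_data effectively does): slice the last
# 50 prices and average them, once per bar. Shown for a few bars only.
slow_sample = [prices.iloc[i - 50:i].mean() for i in range(50, 60)]

# Vectorized approach: compute every windowed mean once, up front.
ma_short = prices.rolling(50).mean()
ma_long = prices.rolling(250).mean()

# Both approaches produce identical numbers.
assert np.allclose(slow_sample, ma_short.iloc[49:59])
```

This does not change what zipline does internally per bar, but it illustrates why indicator work is usually much cheaper when hoisted out of the event loop.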

Looking forward to any insight! Thank you in advance.

I'm considering writing an ingest function and building my own bundle, but I'm still getting to grips with it. Will it massively increase my performance?
algo1.pstat
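On the bundle question: zipline ships a csvdir bundle helper that handles most of the ingest plumbing. A sketch of the registration in ~/.zipline/extension.py, assuming zipline 1.x; the bundle name and path are placeholders, and note that zipline's trading calendars are equity-centric, so 24-hour forex sessions won't map cleanly onto them:

```python
# ~/.zipline/extension.py -- sketch only; 'custom-csvdir' and the path
# are placeholders. csvdir_equities is the helper shipped with
# zipline 1.x (zipline.data.bundles.csvdir); it expects the CSVs under
# a minute/ subdirectory of the given root.
from zipline.data.bundles import register
from zipline.data.bundles.csvdir import csvdir_equities

register(
    'custom-csvdir',
    csvdir_equities(
        ['minute'],          # frequencies present in the CSV tree
        '/path/to/csv/root',
    ),
    calendar_name='NYSE',    # placeholder; pick the closest calendar
)
```

After that, `zipline ingest -b custom-csvdir` builds the bundle, and run_algorithm() can be pointed at it via its bundle argument. Whether it helps much depends on where the time goes: the bundle's bcolz storage speeds up data access, but the per-bar simulation overhead remains.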

Douglas S

Sep 2, 2017, 1:44:12 PM9/2/17
to Zipline Python Opensource Backtester
After a fair amount of work building my custom data bundle, I still don't see any performance improvements whatsoever.

I'm starting to think zipline is simply not suitable for any large-scale backtesting -- but this is very disappointing since backtesting is exactly what it should do well.

Does anyone else have any inputs/experiences regarding backtesting of minute-data that they'd like to share? I don't want to give up on zipline, but given how it is performing, I may have to.

// Doug

Lucas S Silva

Dec 11, 2017, 8:06:00 AM12/11/17
to Zipline Python Opensource Backtester
Hi Douglas,

You are right. I think I posted something related to performance a couple of years ago.
With a normal installation/setup you cannot use Zipline for optimization or large-scale testing.

Normally a backtester will execute the strategy and compute all statistics once at the end. Because
Zipline was built for real-time trading, it computes every statistic on every bar. Add to that some
poor data-structure handling, and it is simply not fast enough.

Cheers,
Lucas