Jet performance vs. simple job

yann.b...@gmail.com

Nov 29, 2018, 3:20:29 AM
to hazelcast-jet

Hello, I'm opening a new topic.

First, thanks for your help in my previous post.

Now I'm sharing my test code with you, because I'm a little disappointed:


In the data directory, I have a position.csv example file. To test, I built a larger file with this bash command:

for i in {1..10000} ; do cat position.csv >> position-big.csv ; done

Then, in the code, I changed the input directory and input file.

I have two main classes:

- PocJet, which starts 4 Jet instances and runs my pipeline
- TestFlat, which does the same thing, but in the traditional way.

While TestFlat takes 9.5 s to complete, Jet takes 32 s... Why such a difference?

In TestFlat, if I remove the code that simulates the heavy CPU task, it takes 2 s, so I'm confident that this CPU-consuming code is a realistic simulation.


Thanks in advance.

c...@hazelcast.com

Nov 29, 2018, 3:24:22 AM
to hazelcast-jet
You shouldn't create 4 instances on a single machine. A single instance will by default try to use all the resources on the machine; when you create 4, they just compete for the same resources.
What performance do you get if you start only 1 JetInstance?
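For example, roughly this (just a sketch; `pipeline` stands for your existing pipeline):

JetInstance jet = Jet.newJetInstance();   // a single embedded member uses all available cores by default
try {
    jet.newJob(pipeline).join();
} finally {
    Jet.shutdownAll();
}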

yann.b...@gmail.com

Nov 29, 2018, 5:57:58 AM
to hazelcast-jet
OK, I reran some tests on my Windows computer at work. I think I've located the problem.

File: 1 GB.

The dummy test completes in 31922 ms.

If I use 1 instance with the following code:

pipeline
    .drawFrom(source)
    .filter(fd -> fd.getLineNumber() > 0)   // skip header, managed in source
    .map(FileData::parseLine)               // parse in multiple threads
    .map(fd -> {
        fd.getData()[2] = "test2";
        return fd;
    })
    .mapUsingIMap("bigmap", this::computeHeavy)
    .drainTo(Sinks.noop());

private FileData computeHeavy(IMap<String, List<String[]>> bigmap, FileData fd) {
    List<String[]> liste = bigmap.get("INV" + fd.getLineNumber() % 50000);
    for (String[] data : liste) {
        for (int dataIndex = 0; dataIndex < data.length; dataIndex++) {
            String datum = data[dataIndex] + "" + 3;
            if (datum.equals(fd.getData()[dataIndex])) {
                System.out.println("GOTCHA !");
            }
        }
    }
    return fd;
}


it takes 67836 ms to complete.

Then I removed all the code in the computeHeavy method; it took 5005 ms to complete. Fine.

Then I left the computeHeavy method like this:

private FileData computeHeavy(IMap<String, List<String[]>> bigmap, FileData fd) {
    List<String[]> liste = bigmap.get("INV" + fd.getLineNumber() % 50000);
    return fd;
}

And it took 40763 ms to complete.
It's the access to bigmap, even with 1 instance, that is very, very slow!

How could I have a "local" cache?


c...@hazelcast.com

Nov 29, 2018, 6:07:05 AM
to hazelcast-jet
mapUsingIMap already does the lookup for you; your function should look like the one below. Please read the Javadoc of the method carefully.

private FileData computeHeavy(List<String[]> liste, FileData fd) {

c...@hazelcast.com

Nov 29, 2018, 6:08:36 AM
to hazelcast-jet
Of course you also need the "groupingKey" method.

yann.b...@gmail.com

Nov 29, 2018, 7:14:25 AM
to hazelcast-jet
mapUsingIMap in my first case, on GeneralStage, does not do the lookup. After groupingKey, yes, I agree.

So I changed my code to:

pipeline
    .drawFrom(source)
    .filter(fd -> fd.getLineNumber() > 0)   // skip header, managed in source
    .map(FileData::parseLine)               // parse in multiple threads
    .map(fd -> {
        fd.getData()[2] = "test2";
        return fd;
    })
    .groupingKey(fd -> "INV" + fd.getLineNumber() % 50000)
    .mapUsingIMap("bigmap", this::computeHeavy)
    .drainTo(Sinks.noop());

And 
private FileData computeHeavy(FileData fd, List<String[]> liste) {
    // for (String[] data : liste) {
    //     for (int dataIndex = 0; dataIndex < data.length; dataIndex++) {
    //         String datum = data[dataIndex] + "" + 3;
    //         if (datum.equals(fd.getData()[dataIndex])) {
    //             System.out.println("GOTCHA !");
    //         }
    //     }
    // }
    return fd;
}

This takes 38140 ms (without the simulation of high CPU usage). Close to the previous result.

Access to the IMap slows everything down, even "locally".

yann.b...@gmail.com

Nov 29, 2018, 8:39:22 AM
to hazelcast-jet
To improve on that, how can I use a simple in-memory map in the pipeline?

I mean, I can imagine code that copies the data from the IMap into a plain Map per HazelcastInstance. OK.

But how can I use it in the pipeline?

vlad...@hazelcast.com

Nov 29, 2018, 8:46:33 AM
to hazelcast-jet
Hi Yann,

Attaching additional data to the streaming items is generally referred to as enrichment. You can enrich using a local map, a replicated map, a distributed map (IMap) or an external service to do the lookup.

This blog post explains the topic with code samples: https://blog.hazelcast.com/ways-to-enrich-stream-with-jet/

Please refer to the blog post, as it shows how to enrich the data in your pipeline.
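For instance, local-map enrichment could look roughly like this (just a sketch, assuming the Jet 0.7 ContextFactory/mapUsingContext API and the "bigmap"/FileData names from your earlier snippets):

// Each member copies the IMap into a plain HashMap once at job start,
// so per-item lookups stay local and need no deserialization.
ContextFactory<Map<String, List<String[]>>> localLookup =
        ContextFactory.withCreateFn(jet ->
                new HashMap<>(jet.<String, List<String[]>>getMap("bigmap")));

pipeline
    .drawFrom(source)
    .mapUsingContext(localLookup, (map, fd) -> {
        List<String[]> liste = map.get("INV" + fd.getLineNumber() % 50000);
        // ... heavy CPU work against liste ...
        return fd;
    })
    .drainTo(Sinks.noop());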

Cheers, 
Vladimir



yann.b...@gmail.com

Nov 29, 2018, 9:29:32 AM
to hazelcast-jet
Oh, I have to beg your pardon :(

I made a mistake in the dummy code; the map was not big enough.

Now the dummy code gives me 112277 ms.

If I use a ContextFactory to get a local map on the Jet instance, the pipeline comes down to 50902 ms.

With mapUsingIMap --> 65639 ms

And with a ReplicatedMap -> 50498 ms

So yes, a real gain! :)

Can Gencer

Nov 29, 2018, 9:47:41 AM
to yann Blazart, hazelc...@googlegroups.com
Hi again,

I had a detailed look at your code & also ran it locally. 

The main issue I see is that your entries in the IMap are quite large. IMap stores data in binary (serialized) form, so each lookup you do requires deserializing this data.

In your TestFlat example you are just using a normal HashMap, so there is no serialization/deserialization involved.

There are a few ways to work around this:

1. Enable Near Cache on the IMap. This will make sure that once an item has been deserialized, it won't get deserialized again in subsequent lookups:

JetConfig config = new JetConfig();
Config hzConfig = config.getHazelcastConfig();
hzConfig.getMapConfig("bigmap")
        .setNearCacheConfig(
                new NearCacheConfig()
                        .setInMemoryFormat(InMemoryFormat.OBJECT)
                        .setCacheLocalEntries(true));
With Near Cache enabled, the run time for my test data went from 45 s to 7 s; TestFlat took 10 s.

2. Use a hash join

In this case you read the IMap into a local HashMap on each instance, and the lookups are done against the local HashMap. The disadvantage is that you can't update the map after the job has started running. The pipeline would look like this:

BatchStage<Entry<String, List<String[]>>> bigMap = pipeline.drawFrom(Sources.map("bigmap"));

pipeline
    .drawFrom(source)
    .filter(fd -> fd.getLineNumber() > 0)
    .map(FileData::parseLine)       // parse in multiple threads
    .map(fd -> {
        fd.getData()[2] = "a";      // dummy replacement of value
        return fd;
    })
    .hashJoin(bigMap,
            JoinClause.onKeys(fd -> "INV" + (fd.getLineNumber() % KEY_COUNT), e -> e.getKey()),
            (FileData fd, Entry<String, List<String[]>> e) -> {  // simulate high CPU usage
                for (String[] data : e.getValue()) {
                    for (int dataIndex = 0; dataIndex < 15; dataIndex++) {
                        if (data[dataIndex].equals(fd.getData()[dataIndex])) {
                            System.out.println("GOTCHA !");
                        }
                    }
                }
                return fd;
            })
    .drainTo(Sinks.noop());

This also performs similarly to the Near Cache approach: 7 to 8 seconds.


yann.b...@gmail.com

Nov 29, 2018, 11:29:31 AM
to hazelcast-jet
Thank you for your help!

To be honest, I'm doing a PoC to prove that Jet would be a better choice than an old Spring Batch application, and it's looking good :)

I will apply your suggestions as soon as I'm back home tonight.

Again, thanks a lot!

yann.b...@gmail.com

Nov 29, 2018, 11:30:27 AM
to hazelcast-jet
And I can also add a serializer for FileData, or even wrap List<String[]> in a dedicated class with its own serializer.
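Something like this, perhaps (just a sketch, assuming FileData only carries an int line number plus a String[] of fields and has a matching constructor; the type id is arbitrary, and config is the JetConfig from the Near Cache snippet above):

public class FileDataSerializer implements StreamSerializer<FileData> {

    @Override
    public int getTypeId() {
        return 1000;   // any positive id not used by another custom serializer
    }

    @Override
    public void write(ObjectDataOutput out, FileData fd) throws IOException {
        out.writeInt(fd.getLineNumber());
        out.writeUTFArray(fd.getData());
    }

    @Override
    public FileData read(ObjectDataInput in) throws IOException {
        return new FileData(in.readInt(), in.readUTFArray());   // assumed constructor
    }

    @Override
    public void destroy() {
    }
}

// registered on the member config
config.getHazelcastConfig().getSerializationConfig()
      .addSerializerConfig(new SerializerConfig()
              .setTypeClass(FileData.class)
              .setImplementation(new FileDataSerializer()));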

yann.b...@gmail.com

Nov 30, 2018, 2:36:03 AM
to hazelcast-jet
Hello, I tried your suggestions on my Mac tonight.

I don't understand it; I will check today, but on my Mac Jet is really slower, while on Windows it's really faster. Did I miss something?...

I will keep you posted.

yann.b...@gmail.com

Dec 3, 2018, 10:07:31 AM
to hazelcast-jet
Maybe I'm an idiot...

After refactoring the test code, OK, Jet is faster regardless of the system.

Thanks!