PSL Database queries


mh.ma...@gmail.com

Dec 17, 2021, 5:34:00 PM
to PSL Users
Hi,

I am using PSL at the heart of a rule-based recommender engine that is part of a microservice architecture. The service is a dockerized component of a bigger system and receives its input data through RabbitMQ. I was wondering: is it possible to bypass the PSL DataStore and instead make PSL aware of the predicate domains and query truth values from a data-aware component of my application?
I don't want to grab data from another microservice that is backed by a database and then fill the PSL database with that same data. I just want to bypass the RDBMSInserters and plug in a piece of my code that responds to all predicate queries.

Thanks

Eriq Augustine

Dec 17, 2021, 5:58:51 PM
to mh.ma...@gmail.com, PSL Users
Hello,

It sounds to me like you don't really want to bypass DataStore, but rather just implement your own instance of a DataStore.
Creating your own DataStore is a totally doable task, but it will be a bit difficult.

If I understand your situation correctly, then another path could be to just pull your data from this other data source before you run PSL and serialize it as files that you can then load into PSL in a standard way.
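
For example, the standard loading path might look like this (a sketch; RATING and the file path are placeholders, not anything from your setup):

import org.linqs.psl.database.DataStore;
import org.linqs.psl.database.Partition;
import org.linqs.psl.database.loading.Inserter;

// Sketch: load tab-separated data that was serialized ahead of time.
// RATING is a placeholder StandardPredicate; adjust names to your model.
Partition obsPartition = dataStore.getPartition("observations");
Inserter inserter = dataStore.getInserter(RATING, obsPartition);

// Each line of the file holds the predicate's arguments
// (with loadDelimitedDataTruth() if the file also carries truth values).
inserter.loadDelimitedData("/path/to/rating_obs.txt");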

One question I have that will change my recommendations is:
Do you expect your data to change during inference, i.e., is this an online problem?

-eriq



mh.ma...@gmail.com

Dec 18, 2021, 5:24:00 AM
to PSL Users
Hey Eriq,

It sounds to me like you don't really want to bypass DataStore, but rather just implement your own instance of a DataStore.

Yes, you are right; that would mean creating my own custom DataStore for this purpose. Is there a simple way to change the existing data source so that the target and observation queries don't go through the database?

If I understand your situation correctly, then another path could be to just pull your data from this other data source before you run PSL and serialize it as files that you can then load into PSL in a standard way.
 
I would rather bypass the file serialization step, as there can be many recommendation tasks running in parallel.

One question I have that will change my recommendations is:
Do you expect your data to change during inference, i.e., is this an online problem?

Actually, there may be some changes to the data during inference, but for now we can seal the inference process off from these changes.

Thank you in advance

Eriq Augustine

Dec 18, 2021, 11:00:31 AM
to mh.ma...@gmail.com, PSL Users
Unfortunately, it's not so easy to fully bypass the DataStore.
The real hardship here is that the DataStore (backed by an RDBMS) not only holds the data, but is also responsible for the bulk of the grounding process.
So, it's used for a lot more than storing/querying data.

If this is an online problem, then a possible solution is to use PSL's online engine, which lets you pass data to PSL over the network.

-eriq


Mohammad Hossein Mahsuli

Dec 19, 2021, 4:59:23 AM
to Eriq Augustine, PSL Users
Thank you Eriq, I will look into it.

Regards

mh.ma...@gmail.com

May 18, 2022, 6:04:00 AM
to PSL Users
Hi Eriq,
 
The real hardship here is that the DataStore (backed by an RDBMS) not only holds the data, but is also responsible for the bulk of the grounding process.

Following what you said before, I am inserting data into an RDBMSDataStore programmatically so that I can use the default grounding process. But, as I mentioned before, many recommendation tasks run in parallel, and this creates conflicts in the database layer: I found out that the same tables are used for metadata and partitions. Let me explain how I have implemented my code:

There is a PSLRecommendationJob class that is similar to PSL example classes:

public class PSLRecommendationJob extends Job {
    private static final String PARTITION_OBSERVATIONS = "observations";
    private static final String PARTITION_TARGETS = "targets";

    // The DataStore is set up elsewhere in the job.
    private DataStore dataStore;

    @Override
    public void run() {
        Partition obsPartition = dataStore.getPartition(PARTITION_OBSERVATIONS);
        Partition targetsPartition = dataStore.getPartition(PARTITION_TARGETS);

        // The helpers (definePredicates(), defineRules(), loadData(), ...)
        // follow the standard psl-examples structure and are elided here.
        definePredicates();
        defineRules();
        loadData(obsPartition, targetsPartition);
        runInference(obsPartition, targetsPartition);
        handleResults(targetsPartition);
    }
}

My program is a background service that listens for incoming recommendation requests, so there can be several requests in flight at once, each creating its own instance of the PSLRecommendationJob class. But internally the same H2 database instance is used for all of the partitions, which causes an error when the second PSLRecommendationJob instance tries to create the same PARTITION_OBSERVATIONS. I can fix this error by prefixing the job_id to PARTITION_OBSERVATIONS, but that accumulates many partitions over the lifetime of my application that have no use after their job completes, and I'm not sure these separate partitions guarantee that the input data of each recommendation job is sealed off from the other jobs!
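
Roughly, the workaround I mean looks like this (a sketch; jobId is just a unique identifier my service would assign per request):

import java.util.UUID;

// Per-job partition names, to avoid collisions between parallel jobs.
String jobId = UUID.randomUUID().toString();
Partition obsPartition = dataStore.getPartition(jobId + "_" + PARTITION_OBSERVATIONS);
Partition targetsPartition = dataStore.getPartition(jobId + "_" + PARTITION_TARGETS);

// After the job finishes, drop the partitions so they don't accumulate.
dataStore.deletePartition(obsPartition);
dataStore.deletePartition(targetsPartition);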

I hope I have described my problem well, and I am looking forward to hearing your suggestions!

Thanks

Eriq Augustine

May 18, 2022, 9:36:18 AM
to mh.ma...@gmail.com, PSL Users
Do you need information from one recommendation request to affect another recommendation request?
If you don't, then what about having a different database instance per request?
You can do that just by changing the database path (for H2).
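As a sketch (the temp-directory location and jobId naming are placeholders, nothing PSL requires):

import java.nio.file.Paths;

import org.linqs.psl.database.DataStore;
import org.linqs.psl.database.rdbms.RDBMSDataStore;
import org.linqs.psl.database.rdbms.driver.H2DatabaseDriver;

// One on-disk H2 database per request, isolated by path.
String dbPath = Paths.get(System.getProperty("java.io.tmpdir"), "psl_" + jobId).toString();
DataStore dataStore = new RDBMSDataStore(
    new H2DatabaseDriver(H2DatabaseDriver.Type.Disk, dbPath, true));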

How many requests do you expect to handle at the same time?
What kind of throughput do you want?

-eriq


mh.ma...@gmail.com

May 18, 2022, 11:06:08 AM
to PSL Users
Hey Eriq,

Thanks for your quick response!

Do you need information from one recommendation request to affect another recommendation request?

At the moment the recommendation jobs are sealed from each other, but when we gather more data from our users we plan to use other users' data and add some inter-user rules to enrich recommendations.

If you don't, then what about having a different database instance per request?
You can do that just by changing the database path (for H2).

As I said earlier, my application is part of a microservice architecture: it is a dockerized component of a bigger system and receives its input data through RabbitMQ. I just wanted to avoid file or database serialization for performance reasons, so I decided to use the H2 in-memory database feature with this configuration:
dataStore = new RDBMSDataStore(new H2DatabaseDriver(H2DatabaseDriver.Type.Memory, null, true));
Unfortunately, this causes the same problem I mentioned in my previous post.

How many requests do you expect to handle at the same time?
What kind of throughput do you want?

For now, it is around 10 requests in parallel, but it can rise dramatically once the project launches! I should also mention that this application runs in a Docker environment and is highly scalable, so we can safely ignore any parallelism outside of the PSL inference scope.

I don't know the exact details of the Partition concept. If you can guarantee that the data in different partitions is sealed off during the inference phase, I can use a different partition for each job and delete it after inference is done. Is this a valid approach, or do you have any better suggestions?

Eriq Augustine

May 18, 2022, 12:01:42 PM
to mh.ma...@gmail.com, PSL Users
As I said earlier...

Remember that your last email before today was 5 months ago, so you are going to have to help refresh me.

At the moment the recommendation jobs are sealed from each other, but when we gather more data from our users we plan to use other users' data and add some inter-user rules to enrich recommendations.

How up-to-date does the data have to be?
I can see a system that has some regular job to update the data/database/image that several isolated PSL instances all use.

As I said earlier, my application is part of a microservice architecture: it is a dockerized component of a bigger system and receives its input data through RabbitMQ. I just wanted to avoid file or database serialization for performance reasons, so I decided to use the H2 in-memory database feature with this configuration:
dataStore = new RDBMSDataStore(new H2DatabaseDriver(H2DatabaseDriver.Type.Memory, null, true));
Unfortunately, this causes the same problem I mentioned in my previous post.

I'm a bit confused here.
Based on what you said here, my recommendation of changing the database path sounds like it is the solution.
Have you tried changing the database path (that null parameter you are passing in)?
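
For example, something like this (a sketch; for Type.Memory the path should act as the name of the in-memory database, and jobId is a placeholder unique per job):

// A uniquely named in-memory H2 database per job, so parallel jobs
// no longer collide on the same tables.
String dbName = "psl_job_" + jobId;
DataStore dataStore = new RDBMSDataStore(
    new H2DatabaseDriver(H2DatabaseDriver.Type.Memory, dbName, true));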

I don't know the exact details of the Partition concept.

Here is a description of how data is handled in PSL:

If you can guarantee that the data in different partitions is sealed off during the inference phase, I can use a different partition for each job and delete it after inference is done. Is this a valid approach, or do you have any better suggestions?

Partitions are separate, and the scheme you described should work with Postgres.
However, H2 databases should generally not be shared between PSL instances, as we do not implement H2's client/server mode.
So if you want to use H2, then I think different database instances (paths) are the solution.
If you want to use Postgres, then your method should work.

I think setting up the Postgres infrastructure will be harder (just because of the nature of a non-embedded database), but it will be faster in the end.
Starting with H2 and then moving to Postgres would be a good way to work things out as you go.
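
For the Postgres route, a sketch of your partition-per-job scheme (the database name is a placeholder, and this assumes PostgreSQLDriver's simple databaseName/clearDatabase constructor; note clearDatabase should be false since all jobs share one database):

import org.linqs.psl.database.DataStore;
import org.linqs.psl.database.Partition;
import org.linqs.psl.database.rdbms.RDBMSDataStore;
import org.linqs.psl.database.rdbms.driver.PostgreSQLDriver;

// All jobs share one Postgres database ("psl" is a placeholder name) and
// isolate their data in per-job partitions; do not clear the shared database.
DataStore dataStore = new RDBMSDataStore(new PostgreSQLDriver("psl", false));
Partition obsPartition = dataStore.getPartition(jobId + "_observations");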

-eriq

