OpenQm Big Files

58 views
Skip to first unread message

euobeto

unread,
Jun 30, 2022, 3:41:15 PMJun 30
to mvd...@googlegroups.com
Hello everyone,
i'm starting to build a big file in OpenQm, its a dynamic file, so far has 3 GB and over 5990000 (every time i tried do select to know how many files my computer freezes hahah), i'm importing some cvs data, its over 80GB and billions itens ( with 32 attributes each)... and i read something about split file or multifiles... but i didn't understand how to use... does any one has some tips about this ? 

Thanks 

--
Alberto  Leal
T.I  Campo Grande
LPI ID: LPI000191272
E-mail: alb...@tecwebcg.com
Gmail:  ees....@gmail.com
          

Brian Speirs

unread,
Jun 30, 2022, 4:57:15 PMJun 30
to Pick and MultiValue Databases
Hi Alberto,

You are probably looking to implement a "distributed" file. A distributed file is a "logical" file that uses the data in several "physical" files. Those physical files are best structured as a "multifile" - which means you have multiple DATA portions sharing one DICT portion.

You NEED to determine how the items in the files are allocated to the various data portions. This should be a function of the item-id so that any given item-id can only ever go to a specific data portion. As a simple example, let's say you have sequential item-ids. A partitioning dictionary item might be:

  @PARTNO
  1 I
  2 @ID // 1000000 + 1
  3
  4 PartNo
  5 6R
  6 S


This means that @ID's 1 to 999999 will go to file part 1; @ID's 1000000 to 1999999 will go to part 2 etc.

Now, set up your physical files. If I want to create a distributed file with a name of MYFILE, then I will set up the physical files using a name of MYFILE.D. We do this like:

     CREATE.FILE DICT MYFILE.D
  CREATE.FILE DATA MYFILE.D,PART1
  CREATE.FILE DATA MYFILE.D,PART2
  CREATE.FILE DATA MYFILE.D,PART3

You don't have to use names of part1, part2, etc. Choose any name you like - but it should be logical.

Add the @PARTNO dictionary item (shown above) to the dictionary.

Now, create Q-pointers to the part files. For PART1, this looks like:

  VOC MYFILE.PART1
  1  Q
  2
  3  MYFILE.D,PART1


Now, create the distributed file by adding the partfiles:

  ADD.DF MYFILE MYFILE.PART1 1 @PARTNO RELATIVE
  ADD.DF MYFILE MYFILE.PART2 2 RELATIVE
  ADD.DF MYFILE MYFILE.PART3 3 RELATIVE


Note that you only need to include the partitioning algorithm when you first create the distributed file.

You can check the parts in a file using the LIST.DF command:

  LIST.DF MYFILE
  Partitioning algorithm: @PARTNO
  @ID // 1000000 + 1

  Part no.  Pathname
         1  MYFILE.D\PART1
         2  MYFILE.D\PART2
         3  MYFILE.D\PART3


If we had omitted the RELATIVE qualifier in the ADD.DF commands above, then the parts would be fully pathed here.

Now you can put some data in the files:

  ED MYFILE 1 999999 1000000
  Put anything in these items and file them

See where those data items ended up:

  SORT MYFILE.PART1
  1
  999999

  SORT MYFILE.PART2
  1000000

  SORT MYFILE
  1
  1000000
  999999


So, we entered data into MYFILE (the logical distributed file), and the data ended up split across the physical part files.

ISSUES: You need to make sure that your partitioning algorithm evaluates to a part number that exists! If it evaluates to a part number that does not exist (e.g. 6), then the WRITE to the file will FAIL. Likewise, this depends on numeric item-ids - you may want to put a CONSTRAINT on the @ID to ensure that the @ID is a positive integer.

Indexes must be applied to each individual part file.

Feel free to contact me offline if you want any futher assistance.

Cheers,

Brian

Jim Idle

unread,
Jul 1, 2022, 2:37:18 AMJul 1
to mvd...@googlegroups.com
I am not an expert in OpenQM, but based upon the result below then here’s some thoughts:

  • Maybe QM isn’t the right system to do this in - it should just work
  • Perhaps your system does not have enough power to do this - what operating system etc are you running?
  • Perhaps there is a better way to organize your imported records. Partition files are all well and good but they need to be managed and if you need to change partition counts or sizes, then it is a pain. If you can devise a partition algorithm that can accommodate partition changes, then maybe it is OK.
 
jBASE is the only system I run - no surprise there I suppose - but if I run a program to create 600,000 1K records (I did not allocate enough disk space to go further), then I get this:

~/tmp ᐅ CREATE-FILE JIM    
[ 417 ] File JIM]D created , type = JD
[ 417 ] File JIM created , type = JD

~/tmp ᐅ vi hw.b

~/tmp ᐅ BASIC . hw.b   
hw.b
Source file hw.b compiled successfully

~/tmp ᐅ CATALOG . hw.b 
hw
Object hw cataloged successfully

~/tmp ᐅ time hw
hw  4.41s user 4.86s system 34% cpu 26.708 total

~/tmp ᐅ jstat -v JIM

File ./JIM
Type=JD , Created Fri Jul  1 06:21:24 2022
Frame size = 8192 bytes , Secondary Record Size = 2040 bytes
Hash Index Table: Fid 58504, Frames 129, Groups 262144, Mask Bits 18
File size = 997695488 bytes , Inode = 12791236 , Device = Id 2049
Last Accessed Fri Jul  1 06:22:46 2022 , Last Modified Fri Jul  1 06:22:46 2022
Backup = YES , Log = YES , Rollback = YES , Network = NO , Secure updates = NO
Case = ON - Case sensitive , Encrypted = NO , Lock = MUTEX , Audit Log = NO


Record Count = 600000 , Record Bytes = 600000000
Bytes/Record = 1000 , Bytes/Group = 5125
Primary   file space:   Total Frames = 121554 , Total Bytes = 619088895
Secondary file space:   Total Frames = 0 , Total Bytes = 0


~/tmp ᐅ cat hw.b         
Rec = SPACE(1000)
OPEN "JIM" TO J ELSE ABORT 201, "JIM"

FOR i = 1 TO 600000

 WRITE Rec ON J, i
NEXT


~/tmp ᐅ time COUNT JIM

 600000 Records counted 

COUNT JIM  0.11s user 0.21s system 78% cpu 0.415 total


~/tmp ᐅ time SELECT JIM

 600000 Records selected 

SELECT JIM  0.20s user 0.22s system 94% cpu 0.440 total



So, 27 seconds to create 600,000 records on a VM with low resources. Obviously you will be doing more processing etc, including reading the CVS file. But jBASE JD files are about constant time complexity, so you can extrapolate from that, subject to system resources. 

It’s not really relevant if you must use OpenQM, but it might give you food for thought.

You have a single CSV file that is 80GiB of data? Maybe this is an X Y question? Why is the data exported like that?  

Jim


--
You received this message because you are subscribed to
the "Pick and MultiValue Databases" group.
To post, email to: mvd...@googlegroups.com
To unsubscribe, email to: mvdbms+un...@googlegroups.com
For more options, visit http://groups.google.com/group/mvdbms
---
You received this message because you are subscribed to the Google Groups "Pick and MultiValue Databases" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mvdbms+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mvdbms/CAPZXEvPtSh95pHvVsxb9CGe6mWkP3w8-t5E4DbRX3AWRWq1viw%40mail.gmail.com.

Wols Lists

unread,
Jul 1, 2022, 4:16:42 AMJul 1
to mvd...@googlegroups.com
On 01/07/2022 07:37, Jim Idle wrote:
> I am not an expert in OpenQM, but based upon the result below then
> here’s some thoughts:
>
> * Maybe QM isn’t the right system to do this in - it should just work
> * Perhaps your system does not have enough power to do this - what
> operating system etc are you running?
> * Perhaps there is a better way to organize your imported records.
> Partition files are all well and good but they need to be managed
> and if you need to change partition counts or sizes, then it is a
> pain. If you can devise a partition algorithm that can accommodate
> partition changes, then maybe it is OK.

And is the OP using READSEQ to read the CSV? When eating an elephant,
don't try to eat it in one bite, eat it in little chunks ...

Cheers,
Wol

Martin Phillips

unread,
Jul 1, 2022, 4:31:02 AMJul 1
to mvd...@googlegroups.com

Hi Alberto,

 

OpenQM should be able to handle files of this size or even larger. There is nothing in your posting to suggest that the process has died. It is just taking a long time.

 

It is useful to understand what happens when populating a large file. In your example, the data is coming from a CSV file and I assume is being read line by line using READSEQ or similar. Performance of each READSEQ will not be affected by the total size of the data being imported.

 

The imported data then has to be written to the target file. The position of a record in the file is determined by the hashing algorithm and it is likely that this varies widely from one record to the next. For a small file, the group level caching of the system means that the target group is often already in memory and can be updated very quickly. For a large file, there is a high probability that the group must be read, updated and then written (though the write may not immediately go to disk).

 

The impact of this is that loading a file that is smaller than the memory space available for caching group buffers is fast but, as the file grows beyond the cache size, we reach the situation where we do a read/write pair for each record. Performance ultimately becomes dependent on the speed of the disk.

 

I think that if you leave it running, your data import will eventually finish, however, we need to consider whether the file created will actually be usable from a performance point of view. If you do a query processor command with a selection clause, this must read the entire file, checking whether each record meets the selection criteria. This is going to take a while with an 80Gb file.

 

If the file has indices, the select is much faster as it simply reads the index record, however, updating the index on a write or delete requires us to read/update/write the index record. It is likely that some of the index records will themselves be enormous, impacting performance.

 

You do not give any indication of what this data represents or how it is used but it may be helpful for me to give a simple example of how distributed files can give a big improvement in some selection operations. For the purpose of this explanation, I am going to assume that each record has a date field and that you have historic data going back many years.

 

Instead of storing all the data in a single file, you could split it up so that each year or perhaps month had a separate data file. Now, if you want to select record from a specific year (and maybe other criteria) you can do your selection against just the relevant year's file. This can give substantial performance benefits.

 

But, of course, there are parts of your application that rely on it all being in one data file. No problem. You can define a "distributed file" that links all of the data file together and can be processed exactly as though it really was one file. A distributed file holds no data. It is simply a set of pointers to the individual part files that are to be linked as though they are one file. If the partitioning went to the one file per month level, you could define multiple distributed files giving different views of subsets of the data. There is an example of this in the OpenQM documentation.

 

On Jul 1, 2022 at 3:41:01 AM, euobeto <ees....@gmail.com> wrote:

Hello everyone,
i'm starting to build a big file in OpenQm, its a dynamic file, so far has 3 GB and over 5990000 (every time i tried do select to know how many files my computer freezes hahah), i'm importing some cvs data, its over 80GB and billions itens ( with 32 attributes each)... and i read something about split file or multifiles... but i didn't understand how to use... does any one has some tips about this ? 

 

Thanks 

 

--

Alberto  Leal
T.I  Campo Grande
LPI ID: LPI000191272
E-mail: alb...@tecwebcg.com
Gmail:  ees....@gmail.com

================================
Rocket Software, Inc. and subsidiaries ■ 77 Fourth Avenue, Waltham MA 02451 ■ Main Office Toll Free Number: +1 855.577.4323
Contact Customer Support: https://my.rocketsoftware.com/RocketCommunity/RCEmailSupport
Unsubscribe from Marketing Messages/Manage Your Subscription Preferences - http://www.rocketsoftware.com/manage-your-email-preferences
Privacy Policy - http://www.rocketsoftware.com/company/legal/privacy-policy
================================

This communication and any attachments may contain confidential information of Rocket Software, Inc. All unauthorized use, disclosure or distribution is prohibited. If you are not the intended recipient, please notify Rocket Software immediately and destroy all copies of this communication. Thank you.

euobeto

unread,
Jul 1, 2022, 7:45:22 AMJul 1
to mvd...@googlegroups.com
Hi everyone, I confess that the hardware is not the best, I'm doing this on my notebook, an i5 with 6 GB of ram (hahah).
Importing the data is not a problem, it comes in several files with 100 lines each, so reading and importing them into the database is not a problem. But when they are imported when giving a select on the file it takes forever.
about their Id: it consists of a 14-digit key, and they are not exactly sequential. I carry them on my notebook because I'm the only one who's going to use this base, and it's updated every 3 months. But rethinking here I think I'm going to rebuild the base, today I have a single file with all the records in it, I'm going to delete it, and start from scratch and separating the records by states, so I'll have 27 smaller records instead of one only giant.
In this case is it better to create multiple files as suggested before?
for example:
   CREATE.FILE DICT Company
  CREATE.FILE DATA Company,AC
  CREATE.FILE DATA Company,AM
  CREATE.FILE DATA Company,MS
and so on
or create a file directly for each for example:
  CREATE.FILE AC
  CREATE.FILE AM
  CREATE.FILE MS
?

--
You received this message because you are subscribed to
the "Pick and MultiValue Databases" group.
To post, email to: mvd...@googlegroups.com
To unsubscribe, email to: mvdbms+un...@googlegroups.com
For more options, visit http://groups.google.com/group/mvdbms
---
You received this message because you are subscribed to the Google Groups "Pick and MultiValue Databases" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mvdbms+un...@googlegroups.com.

Wol

unread,
Jul 1, 2022, 8:59:04 AMJul 1
to mvd...@googlegroups.com
On 01/07/2022 12:45, euobeto wrote:
> Hi everyone, I confess that the hardware is not the best, I'm doing this
> on my notebook, an i5 with 6 GB of ram (hahah).
> Importing the data is not a problem, it comes in several files with 100
> lines each, so reading and importing them into the database is not a
> problem. But when they are imported when giving a select on the file it
> takes forever.

It may sound weird, but try creating an index on the id. I can't speak
for QM, but ime selecting the index will do a scan of the entire file.
If all the records are pushed into overflow it's fast because the main
file is pure key, but if the records are small and nothing's in
overflow, then it's a full scan ...

> about their Id: it consists of a 14-digit key, and they are not exactly
> sequential. I carry them on my notebook because I'm the only one who's
> going to use this base, and it's updated every 3 months. But rethinking
> here I think I'm going to rebuild the base, today I have a single file
> with all the records in it, I'm going to delete it, and start from
> scratch and separating the records by states, so I'll have 27 smaller
> records instead of one only giant.

Is the state part of the key? If so, then it's nice and easy to create a
distributed file, create an i-descriptor that converts state to
file-name or whatever the distributed syntax is, and then you can just
copy the old huge file into the new distributed file and QM will sort it
out for you.

If you're worried about disk space having two or three what you can do
is use the copy command to select let's say the first hundred records in
the old file, copy and delete them, rinse and repeat. Make sure you use
the correct syntax to do it, you can tell it to just grab the first keys
it finds so it will always just grab the start of the file.

Or you can write a little program that uses BASIC "SELECT, LOOP,
READNEXT, COPY, DELETE, NEXT" which will just make its way sequentially
through the file.

> In this case is it better to create multiple files as suggested before?
> for example:
>    CREATE.FILE DICT Company
>   CREATE.FILE DATA Company,AC
>   CREATE.FILE DATA Company,AM
>   CREATE.FILE DATA Company,MS
> and so on
> or create a file directly for each for example:
>   CREATE.FILE AC
>   CREATE.FILE AM
>   CREATE.FILE MS
> ?
But the first thing I'd do is create an index on @ID, create indices on
anything else you need, and try to make sure you only select on fields
that are indexed. That might be all you need ...

When you update it next time, try multiple distributed state files and
see if that's better.

But READ UP on them, because the more you come back trying to show
you've clued yourself up, the more we'll be inclined to help :-)

NB - creating this huge distributed file with one sub-file per state,
it's easy enough to write a little program that will do it for you :-)
Actually, given that you're clearly only creating records for a subset
of states, it might be worth modifying your import program to create
them on the fly as required - I've done that before, it's pretty easy.
Then when a new state comes along, it just appears rather than your
import crashing with "subfile doesn't exist".

Cheers,
Wol

euobeto

unread,
Jul 1, 2022, 9:42:41 AMJul 1
to mvd...@googlegroups.com
I think I'm going to create a file for each country independently, as I'm going to treat only one state at a time, I'll never treat the country as a whole, it's easier, because I can get a state and work only with data from that state and I don't need to open the file with the entire country data

the state is an attribute and is not part of the key
it's a list of 27 acronyms of 2 digits only... simpler, and I'll create an extra file if I have an import error, it saves it in the correct place

--
You received this message because you are subscribed to
the "Pick and MultiValue Databases" group.
To post, email to: mvd...@googlegroups.com
To unsubscribe, email to: mvdbms+un...@googlegroups.com
For more options, visit http://groups.google.com/group/mvdbms
---
You received this message because you are subscribed to the Google Groups "Pick and MultiValue Databases" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mvdbms+un...@googlegroups.com.

Brian Speirs

unread,
Jul 1, 2022, 8:27:10 PMJul 1
to Pick and MultiValue Databases
OK. You will need to change the ID to use this data in a distributed file arrangement. Here is what the documentation says:

"The partitioning algorithm is written as an I-type expression in the dictionary that defines the part files. This calculation must be based only on examination of the record id, not the data in the record, because the data is not available when identifying the part that will contain the record for a read operation."

So, your new id might be something like:   OR*43210987654321

You will need a STATE dictionary with an expression like:  FIELD(@ID, '*', 1)

A partitioning algorithm might then be:  LOCATE(STATE, list-of-states, 1; 'AL')

For the @ID above, that function should return 37.

For your files, create them as subfiles (multifiles) - because that way the dictionary is already shared across all files. If you create them as individual files, then you will either need to create a dictionary for each file (not good), or update the dictionary the dictionary reference so that they all point to the same dictionary. The downside of creating the files as multifiles is that you need to create q-pointers to them before you add them to the distributed file.

CREATE.FILE DICT MYFILE.D
CREATE.FILE DATA MYFILE.D,OR
Create a Q-pointer as MYFILE.OR
ADD.DF MYFILE MYFILE.OR 37 {algorithm} {RELATIVE}

I hope that helps.

Brian

Brian Speirs

unread,
Jul 2, 2022, 12:55:08 AMJul 2
to Pick and MultiValue Databases
On reflection, my last answer didn't properly respond to your intentions. You stated you don't need to be able to query the whole country ... which I ignored.

Nevertheless, I would still structure your ID's as outlined above, AND create the base files in a multi-file structure as outlined above. That way, you can easily convert the structure to a distributed file if you need to do so in the future.

You state you don't need  to query the whole country ... but maybe in future, you will want to query adjoining states, or say the Pacific Coast states, in which case you could use a query like:

  SELECT MYFILE WITH STATE EQ "CA" "OR" "WA" etc

My two cents worth ...

Brian
Reply all
Reply to author
Forward
0 new messages