[fityk-devel] new version of code commited

张鹏

unread,

Jul 20, 2007, 3:59:55 PM7/20/07

to Marcin Wojdyr, fityk-dev ML

Hi, Marcin:

I have just committed the new version of the xylib code.

Main changes:
////////////////////////////////////
* changed the existing code following Marcin's advice (fp, left-hand & right-hand, put xylib::util into separated files, etc..)
* added meta-info exporting in export_to_xy() with an optional param to determine whether export meta-info together with the xy data
* new format support: Philips_udf format
* Used a class "UxdLikeDataSet" to present all UXD-like formats (so far, they are UXD, Rigaku_dat and Philips_udf format).
* Re-wrote code of the UXD format parsing and Rigaku_dat format parsing with UxdLikeDataSet
////////////////////////////////////

I have not fixed the testmain.cpp file in this commit; I will make it done ASAP.
After you reviewing the code, please tell me your opinion. Thanks.

--
Regards,

Peng ZHANG (张鹏)

Marcin Wojdyr

unread,

Jul 20, 2007, 6:49:05 PM7/20/07

to 张鹏, fityk-dev ML

On 7/20/07, 张鹏 <zhangp...@gmail.com> wrote:

> * changed the existing code following Marcin's advice (fp, left-hand &
> right-hand, put xylib::util into separated files, etc..)

Let me focus on util.*:

string add_space(const string &str, int max_len)
{ string ret(str);
int spaces = max_len - str.size();
for (; spaces > 0; --spaces) {
ret += " ";
}

return ret;
}

You can replace:

int spaces = max_len - str.size();
for (; spaces > 0; --spaces) {

with:
for (int spaces = max_len - str.size(); spaces > 0; --spaces) {

or even better with:
for (int i = 0; i < max_len - str.size(); ++i) {

You can append spaces to string, to get needed length simply with:
ret.resize(max_len, ' ');
or
ret += string(' ', max_len - str.size());

But IHMO this function is not needed at all, you can use iostream
manipulators instead.

endiannes:
just a note: usually swapping bytes is done with bitwise operators.
But it's ok how it's done now.

// convert a string to double
double string_to_double(const string &str)
{
char **endptr = NULL;
return strtod(str.c_str(), endptr);
}

Instead of
double foo = string_to_double(bar);
you can just write
double foo = strtod(bar, NULL);

the `endptr' is used when you want to check, if the string was
converted properly.
We are touching one of the most difficult problems: what should we do
when the format is not exactly what it should be, in other words, how
strictly we should verify the file format and what variations from the
`standard' should be allowed. It's even worse when we do not have any
specification of the format, only examples.

in xylib.h
DataSet(const std::string &filename_, xy_ftype ftype_ = FT_UNKNOWN)

i think it never can be unknown.

I also think the way type information is handled (g_ftype, g_desc,
etc.) will need to be changed at some point, but let's leave it how it
is for now.

> * added meta-info exporting in export_to_xy() with an optional param to
> determine whether export meta-info together with the xy data

good

> * new format support: Philips_udf format

good

> * Used a class "UxdLikeDataSet" to present all UXD-like formats (so far,
> they are UXD, Rigaku_dat and Philips_udf format).

I'm not sure if it's a good idea, but there is no need to change it now.
You have in your code:
char ch = ll[0];
if (string::npos != string("0123456789+-").find(ch)) ...

Names that consists of only l's and O's are not recommended, because
of similarity to 1 and 0.
But what's worst in the snippet above is that I have to stop and think
for a while what it's supposed to do. Avoid all smart tricks, make is
as simple to read as possible.
I think what this line does is simply:
if (isdigit(ch) || ch == '+' || ch == '-') ...

Rethink using static members in UxdLikeDataSet, it won't work.

Could you write a tentative ordered list of the formats you are going
to implement?

Cheers,
Marcin

--
Marcin Wojdyr | http://www.unipress.waw.pl/~wojdyr/

Marcin Wojdyr

unread,

Jul 21, 2007, 4:50:20 PM7/21/07

to 张鹏, fityk...@lists.sourceforge.net

On 7/21/07, 张鹏 <zhangp...@gmail.com> wrote:
> > We are touching one of the most difficult problems: what should we do
> > when the format is not exactly what it should be, in other words, how
> > strictly we should verify the file format and what variations from the
> > `standard' should be allowed. It's even worse when we do not have any
> > specification of the format, only examples.

> yes, this is a problem indeed. But I think now we can just assume that they
> are all as what they should be; otherwise, it will be very complex to be
> tolerant of the format errors.

Yes. Checking for all possible errors will make reading 2-3 times more
complicated. That's always painful.
You don't have to tolerate errors, you have a choice: either to try to
read the file anyway or to throw exception.

Next few issues:

expression "return ( condition ? true : false)" can (IMO should) be
replaced with just "return condition"

It's better to read the file only once, when it's possible - in
UdfDataSet::load_data (and other places) almost each line is read
twice (you read it first time, then you move stream pointer back, and
read it again). It's not optimal and it's easy to avoid reading it
twice.

> > Could you write a tentative ordered list of the formats you are going
> > to implement?
>

> Next, I will finish the testmain.cpp first, then try the following formats:
>
> JCAMP
> Powder Diffraction File (PDF) format
> Galactic (GRAMS/32) SPC files
> NetCDF

Could you write, what you know about these formats?
- ascii / binary
- how data there is structured logically: one or multiple ranges
(blocks), is it x-y data?
- what your implementation will be based on (format specification,
other open-source implementations, sample data, etc.)

You have a number of sample files for each formats. Please add
somewhere info where these files come from.

Regards,

张鹏

unread,

Jul 23, 2007, 2:14:40 PM7/23/07

to Marcin Wojdyr, fityk-dev ML

On 7/22/07, Marcin Wojdyr <woj...@gmail.com > wrote:

Next few issues:

expression "return ( condition ? true : false)" can (IMO should) be
replaced with just "return condition"

You mean the return value in load_data()?

It's better to read the file only once, when it's possible - in
UdfDataSet::load_data (and other places) almost each line is read
twice (you read it first time, then you move stream pointer back, and
read it again). It's not optimal and it's easy to avoid reading it
twice.

Is it suitable to just call seekg() to set the reading pointer back in the ifstream?

Could you write, what you know about these formats?
- ascii / binary
- how data there is structured logically: one or multiple ranges
(blocks), is it x-y data?
- what your implementation will be based on (format specification,
other open-source implementations, sample data, etc.)

Till now, I only have a basic knowledge of JCAMP-DX format. I think it will be better to implement these formats one by one, so I have not explore the other formats.
JCAMP-DX info:
ascii,
multiple ranges. Maybe some of them have x-y data. The data are organized in a strange format, and now I am working hard to understand the organizaton of the data.
My implementation of JCAMP-DX will based on both format specification and 2 open-source projects on SF.NET (http://sourceforge.net/search/?type_of_search=soft&words=JCAMP-DX). Both of their license is LGPL, is there any problem to borrow some code from them?
BTW, which license should I use in xylib: GPL or LGPL?

You have a number of sample files for each formats. Please add
somewhere info where these files come from.

OK, the source info of sample files has been written in a README file.

>>Rethink using static members in UxdLikeDataSet, it won't work.

Oh, all derived classes' static members share the same memory with their base class.
I have changed these members to non-static.

Marcin Wojdyr

unread,

Jul 23, 2007, 3:28:44 PM7/23/07

to 张鹏, fityk-dev ML

On 7/23/07, 张鹏 <zhangp...@gmail.com> wrote:
>
> > expression "return ( condition ? true : false)" can (IMO should) be
> > replaced with just "return condition"
>
>
> You mean the return value in load_data()?

I meant every case where there is such an expression.

> > It's better to read the file only once, when it's possible - in
> > UdfDataSet::load_data (and other places) almost each line is read
> > twice (you read it first time, then you move stream pointer back, and
> > read it again). It's not optimal and it's easy to avoid reading it
> > twice.
>
>
> Is it suitable to just call seekg() to set the
> reading pointer back in the ifstream?

I'm not sure if I understand your question. If you move the pointer
back, you are probably going to read a part of the file again.

>
> Till now, I only have a basic knowledge of JCAMP-DX format. I think it will
> be better to implement these formats one by one, so I have not explore the
> other formats.
> JCAMP-DX info:
> ascii,
> multiple ranges. Maybe some of them have x-y data. The data are organized in
> a strange format, and now I am working hard to understand the organizaton of
> the data.
> My implementation of JCAMP-DX will based on both format
> specification and 2 open-source projects on SF.NET
> (http://sourceforge.net/search/?type_of_search=soft&words=JCAMP-DX).

I had a look at JCAMP-DX code, there is 10 times more lines of code
there than in your library now. How are you going to implement this
format? Is there anything I fail to see?

> Both of their license is LGPL, is there any problem to borrow some code from
> them?
> BTW, which license should I use in xylib: GPL or LGPL?

I'd recommend LGPL - more project will be able to use it.

> > You have a number of sample files for each formats. Please add
> > somewhere info where these files come from.
>
> OK, the source info of sample files has been written in a README file.

Great, I'm looking forward to see it.

张鹏

unread,

Jul 24, 2007, 3:12:54 AM7/24/07

to Marcin Wojdyr, fityk-dev ML

On 7/24/07, Marcin Wojdyr <woj...@gmail.com > wrote:

I meant every case where there is such an expression.

ok, I have modified all of such code.

>I'm not sure if I understand your question. If you move the pointer
>back, you are probably going to read a part of the file again.

I will change the old code:

while (true) {
    skip_invalid_lines(f);
    peek_line(f, line);
    if (str_startwith(line, rg_start_tag)) {
        break;
    }
    getline(f, line);
    line = str_trim(line);
    ln_type = get_line_type(line);
    ...
}

to the following to avoid reading every line twice.

while (true) {
    skip_invalid_lines(f);
    int pos = f.tellg();
    getline(f, line);
    line = str_trim(line);
    if (str_startwith(line, rg_start_tag)) {
         f.seekg(pos);
        break;
    }

    ln_type = get_line_type(line);
    ...

}

>
> Till now, I only have a basic knowledge of JCAMP-DX format. I think it will
> be better to implement these formats one by one, so I have not explore the
> other formats.
> JCAMP-DX info:
> ascii,
> multiple ranges. Maybe some of them have x-y data. The data are organized in
> a strange format, and now I am working hard to understand the organizaton of
> the data.
> My implementation of JCAMP-DX will based on both format
> specification and 2 open-source projects on SF.NET

> ( http://sourceforge.net/search/?type_of_search=soft&words=JCAMP-DX).

I had a look at JCAMP-DX code, there is 10 times more lines of code
there than in your library now. How are you going to implement this
format? Is there anything I fail to see?

Yes, it is very complex.
How much do you know about JCAMP-DX format? You can find some ref info at http://en.wikipedia.org/wiki/JCAMP
Can you help me to get some papers (actually, they are format specifications) listed at that page? especially the first one:

R. S. Mc Donald and Paul A. Wilks, JCAMP-DX: A Standard Form for Exchange of Infrared Spectra in Computer Readable Form, Appl. Spec., 1988, 2, 151-162.

I am also trying to get these papers.

And, I am wondering: Is there a way to re-use the JCAMP-DX project on SF?

Marcin Wojdyr

unread,

Jul 24, 2007, 6:27:14 AM7/24/07

to fityk...@lists.sourceforge.net

On 7/24/07, 张鹏 <zhangp...@gmail.com> wrote:

> I had a look at JCAMP-DX code, there is 10 times more lines of code
> there than in your library now. How are you going to implement this
> format? Is there anything I fail to see?
>
> Yes, it is very complex.
> How much do you know about JCAMP-DX format? You can find some ref info at
> http://en.wikipedia.org/wiki/JCAMP
> Can you help me to get some papers (actually, they are format
> specifications) listed at that page? especially the first one:
>
> R. S. Mc Donald and Paul A. Wilks, JCAMP-DX: A Standard Form for Exchange of
> Infrared Spectra in Computer Readable Form, Appl. Spec., 1988, 2, 151-162.
>
> I am also trying to get these papers.
>
> And, I am wondering: Is there a way to re-use the JCAMP-DX
> project on SF?

It's hard to estimate how long it will take you to add this format, so
please leave it. There is not much time till the end of GSOC.
Could you rather pick one or two more formats from:
http://www.ccp14.ac.uk/solution/powderdataconv/index.html ?

Do you have a converting program (that's a few lines of code)?

Please submit the README file, where you have info about sample files.

For each file format, give some basic information about it (either in
code or in docs) and write what your implementation was be based on

(format specification, other open-source implementations, sample data,
etc.)

Marcin

张鹏

unread,

Jul 24, 2007, 1:37:35 PM7/24/07

to Marcin Wojdyr, fityk-dev ML

Hi, Marcin

On 7/24/07, Marcin Wojdyr <woj...@gmail.com> wrote:

It's hard to estimate how long it will take you to add this format, so
please leave it. There is not much time till the end of GSOC.

Don't worry. After GSoC, I can still co-work with you to improve the xylib.

Could you rather pick one or two more formats from:
http://www.ccp14.ac.uk/solution/powderdataconv/index.html ?

yes, I will try some simpler formats instead.

Do you have a converting program (that's a few lines of code)?

yes, I have finihed the code of the converting program.
testmain can accept all arguments as the usage() says. And it can also automatically convert all of the sample files.

Please submit the README file, where you have info about sample files.

Added.

A new version of code has been committed to SVN repo:
ChangeLog:

//////////////////////////////
* Added a README file in every sub dir in ./test, to describe some related info about the sample files and format implementation in xylib.
* Improved xylib::guess_file_type(): can guess the file type by preview the file content.
* Finished testmain.cpp. which can test all of the sample files (only on POSIX OSes, not implemented on Windows), and can also just export user specified input / output files.
* Used my_getline() and seekg() instead of peek_line, to avoid reading every line twice.
* Cleaned up code and some other slight changes, according to Marcin's advice.

Michael Richardson

unread,

Jul 24, 2007, 9:05:52 PM7/24/07

to fityk...@lists.sourceforge.net

Apologies for the unexpected voice from the corner, but the JCAMP-DX paper
below has been cited incorrectly by Wikipedia. (I'll update it this
afternoon, NZ time).

> > R. S. Mc Donald and Paul A. Wilks, JCAMP-DX: A Standard Form for Exchange
> > of Infrared Spectra in Computer Readable Form, Appl. Spec., 1988, 2,
> > 151-162.

It should be:

McDonald, Robert S.; Wilks, Paul A., JCAMP-DX: A Standard Form for Exchange of
Infrared Spectra in Computer Readable Form, Applied Spectroscopy, Volume 42,
Number 1, January 1988 , pp. 151-162.

You can download a copy from www.ingentaconnect.com/content/sas/sas (DOI:
10.1366/0003702884428734).

Cheers,

Michael

Marcin Wojdyr

unread,

Jul 24, 2007, 10:53:18 PM7/24/07

to 张鹏, fityk-dev ML

On 7/24/07, 张鹏 <zhangp...@gmail.com> wrote:

> > Could you rather pick one or two more formats from:
> > http://www.ccp14.ac.uk/solution/powderdataconv/index.html
> ?
>
> yes, I will try some simpler formats instead.

I'll try to find documentation (by asking people) for more binary
files. There is a lot of binary formats produced by various
spectrometers, diffractometers, multichannel analyzers, etc., and
these instruments usually come (I think so, I'm not an
experimentalist) with paper manuals, where file formats are
documented, but it's hard to find it on the Internet.

But before I start asking people, it would be good to have a list of
already implemented formats, so when you have time, please update the
list in docs/formats.txt. You can also add there information, which
format is "multi-range".

Are you OK with LGPL licence?

> > Do you have a converting program (that's a few lines of code)?
>
> yes, I have finihed the code of the converting program.
> testmain can accept all arguments as the usage()
> says. And it can also automatically convert all of the
> sample files.

There are a few minor things I don't like about this program:
- inconsistent indentation
- #include "common.h". Please move what you need from this file to
util.h and remove the file.
- most of the code is "auto test", and IMO it was not a good idea to
put there auto-test.
- exceptions are not caught without -a option
- usage
- testmain.h file, that doesn't contain anything useful

I added simpler xyconv tool, that just convert files.
I also added 5-line shell script samples.sh, that does the same
"testmain -a" is doing, but
IMHO usually it's more convenient to test converting from command
line, without this script.
I also removed empty out/* directory from SVN.
BTW, what are empty strings in g_ftype and g_desc for?

> > Please submit the README file, where you have info about sample files.
>
> Added.

Thanks!

张鹏

unread,

Jul 25, 2007, 2:07:08 AM7/25/07

to Michael Richardson, fityk-dev ML

Hi, Michael:

Thank you very much for you info.

You can download a copy from www.ingentaconnect.com/content/sas/sas (DOI:
10.1366/0003702884428734).

I have taken a look at this page, it takes 20$ to get the full-text document.
According to Marcin's advice, he thinks it will be better for me to implement some less complex file formats instead of JCAMP.
BTW, how much do you know about the JCAMP format? and if you have that document, please send me a copy.

At last, thanks again for your kindly help.

张鹏

unread,

Jul 25, 2007, 2:21:46 AM7/25/07

to Marcin Wojdyr, fityk-dev ML

Hi, Macin:

On 7/25/07, Marcin Wojdyr <woj...@gmail.com> wrote:

I'll try to find documentation (by asking people) for more binary

files. ...

great. Please send me a copy of the specifications once you get one. And I will also try to find some.

But before I start asking people, it would be good to have a list of
already implemented formats, so when you have time, please update the
list in docs/formats.txt. You can also add there information, which
format is "multi-range".

ok. In fact, the most precise description is the format specification: like the scanned pages you sent me and the VAMAS paper. However, directly putting these files to the public domain may cause some legal and copyright issues. So, what info should I add into docs/formats? It cannot be as precise and detailed as the official format specification.

Are you OK with LGPL licence?

Yes. no problem.

> > Do you have a converting program (that's a few lines of code)?
>
> yes, I have finihed the code of the converting program.
> testmain can accept all arguments as the usage()
> says. And it can also automatically convert all of the
> sample files.

There are a few minor things I don't like about this program:
- inconsistent indentation

...

I added simpler xyconv tool, that just convert files.
I also added 5-line shell script samples.sh, that does the same
"testmain -a" is doing, but
IMHO usually it's more convenient to test converting from command
line, without this script.

yes, I have also fount that a flexible user tool is very important. Then all of the test
can be done by scripts. Doing this by scrips is much better than by c++.
I will remove the 2 files testmain.cpp/h later.

I also removed empty out/* directory from SVN.

ok

BTW, what are empty strings in g_ftype and g_desc for?

those lines are for FT_UNKNOWN. I will add some text there.

Marcin Wojdyr

unread,

Jul 25, 2007, 4:43:03 AM7/25/07

to 张鹏, fityk-dev ML

On 7/25/07, 张鹏 <zhangp...@gmail.com> wrote:

> > list in docs/formats.txt. You can also add there information, which
> > format is "multi-range".
> ok. In fact, the most precise description is the format specification: like
> the scanned pages you sent me and the VAMAS paper. However, directly putting
> these files to the public domain may cause some legal and copyright issues.
> So, what info should I add into docs/formats? It cannot be as precise and
> detailed as the official format specification.

You don't need to have full specification there. It's up to you what
you put there.
The documentation can be minimal, but should be easy to find and up to date.
I'd be good to have a list of supported formats. You have two such
lists, docs/formats and README, both are incomplete. I understand that
you are still developing the code, but you must remember, that I
frequently review your code and docs, so please keep it clean.
I would recommend to:
update README file
move docs about file formats from "IMPLEMENTATION REF" in
test/*/README and from docs/formats to ds_*.cpp files.
svn rm docs/

Your description of Rigaku .dat file is very clear.

The documentation can be terse.

Example:
You have in README:
test/
Directory containing the original sample files. They are categorized
into different sub-directories according to their types. For more
detailed info (like where they are from), see the README file in each
sub-directory.

It's good, but I'd be also happy with shorter version:

test/ - sample files

or even better, rename test/ to samples/, and it will be self-explaining.

> > BTW, what are empty strings in g_ftype and g_desc for?
>
> those lines are for FT_UNKNOWN. I will add some text there.
>

aha, ok

BTW, IMO it's better to keep all format informations in ds_* files, like this:

struct FormatInfo
{
string name;
string desc; // short, can be used in filter of open-file dialog
vector<string> extensions;
bool binary;
bool multi_range;

TypeInfo(...);
bool has_extension(); // case insensitive
};

in each XXXDataSet:
static const FormatInfo;

in xylib.cpp:
FormatInfo const* formats[] = { ..., &RigakuDataSet::FormatInfo, ...} ;

Michael Richardson

unread,

Jul 25, 2007, 4:38:13 PM7/25/07

to zhangp...@gmail.com, fityk...@lists.sourceforge.net

Hi Peng,

> BTW, how much do you know about the JCAMP
> format? and if you have that document, please send me a copy.

Not much, unfortunately :) Sadly, we don't have access to the journal either.
However, I'll have a hunt around and see if we can get a hard copy of it
through an alternate library.

Best wishes,

Michael.

Marcin Wojdyr

unread,

Jul 25, 2007, 4:54:40 PM7/25/07

to michael.r...@vuw.ac.nz, fityk...@lists.sourceforge.net

Thanks Micheal,
I have it, I'll sent it to Peng.

Are you using files in this format?

Marcin

Michael Richardson

unread,

Jul 25, 2007, 5:03:57 PM7/25/07

to Marcin Wojdyr, fityk...@lists.sourceforge.net

Hi Marcin,

> Are you using files in this format?

No. (These days my main tool for spectroscopy is XPS, thus plain TSV files and
occasionally those in Vamas format.)