[fityk-devel] Support to pdCIF format has been added

13 views
Skip to first unread message

张鹏

unread,
Aug 13, 2007, 2:50:08 PM8/13/07
to woj...@gmail.com, fityk-dev ML
Hi, Marcin:

Support to pdCIF format has been added, and the code has been committed.
Some notes on pdCIF:
* Every field of the format has been parsed and added to the meta-info
* I found the structure to store the X-Y data in xylib need to be modified to adapt pdCIF format.
In fact, in every range of data in pdCIF format, the X-Y data may be like this:

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
loop_
_pd_proc_d_spacing
_pd_proc_intensity_total
_pd_proc_ls_weight
_pd_proc_intensity_bkg_calc
_pd_calc_intensity_total
_pd_proc_point_id
0.45802 0.778(9) 12931. 0.4211 0.7851 1
0.45875 0.880(9) 11421. 0.4206 0.8992 2
0.45949 0.890(9) 11288. 0.4201 0.9062 3
0.46022 0.816(9) 12327. 0.4196 0.8249 4
0.46095 0.709(8) 14193. 0.4191 0.7178 5
0.46168 0.601(8) 16748. 0.4186 0.6240 6
0.46242 0.539(7) 18683. 0.4181 0.5557 7
0.46315 0.503(7) 20015. 0.4177 0.5120 8
0.46388 0.499(7) 20188. 0.4172 0.4896 9
0.46461 0.493(7) 20439. 0.4167 0.4852 10
0.46535 0.498(7) 20230. 0.4163 0.4900 11
0.46608 0.503(7) 20026. 0.4158 0.4877 12
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

And there may be many possible X-Ys here. I have checked pdCIF, the plotting software Toby recommanded to us to read
pdCIF format, to see how it handle such a file to give a 2-D curve. Here are some source code taken form pdCIFplot.tcl

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
global xaxisvars
array set xaxisvars {
_pd_meas_2theta_range_ 2Theta
_pd_proc_2theta_range_ "corrected 2Theta"
_pd_meas_2theta_scan 2Theta
_pd_meas_time_of_flight "TOF, ms"
_pd_proc_2theta_corrected "corrected 2Theta"
_pd_proc_energy_incident "energy, eV"
_pd_proc_wavelength "wavelength, A"
_pd_proc_d_spacing "d-space, A"
_pd_proc_recip_len_Q "Q, 1/A"
_pd_meas_position "linear position, mm"
}
array set yobsvars {
_pd_meas_counts_total Counts
_pd_meas_intensity_total Intensity
_pd_proc_intensity_net "Corrected Intensity"
_pd_proc_intensity_total Intensity
}
array set ybckvars {
_pd_meas_counts_background Background
_pd_meas_counts_container Container
_pd_meas_intensity_background Background
_pd_meas_intensity_container Container
_pd_proc_intensity_bkg_calc "Fit background"
_pd_proc_intensity_bkg_fix "Fixed background"
}
array set ycalcvars {
_pd_calc_intensity_net "Corrected Intensity"
_pd_calc_intensity_total Intensity
}
array set ymodvars {
_pd_proc_ls_weight 1/sqrt(weight)
_pd_meas_counts_total sqrt(counts)
}
array set yesdvars {
_pd_meas_intensity_total "Intensity S.U."
_pd_proc_intensity_net "Intensity S.U."
_pd_proc_intensity_total "Intensity S.U."
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
we can see that there are many possible X-Y combinations.
To fully utilize all of the data in pdCIF, I think the current data structure to hold the x-y points need to be modified.
We can read all of the data in, along with its labels, and then let the user to choose which one as x and which one
as y.

How do you think about this?
We can discuss about this late today or tomorrow.


PS: I am now going to implement the PANalytical's data format.

--
Regards,

Peng ZHANG (张鹏)

Marcin Wojdyr

unread,
Aug 13, 2007, 8:52:27 PM8/13/07
to 张鹏, fityk-dev ML
On 8/13/07, 张鹏 <zhangp...@gmail.com> wrote:

> Support to pdCIF format has been added, and the code has
> been committed.

You forgot to commit it (ds_pdcif.*).

> we can see that there are many possible X-Y combinations.
> To fully utilize all of the data in pdCIF, I think the current data
> structure to hold the x-y points need to be modified.
> We can read all of the data in, along with its labels, and then let the
> user to choose which one as x and which one
> as y.

Yes, this can be also useful for reading CSV files with more than two columns.

I'd suggest class MultiColumnRange with:

int get_column_count();
string get_column_name(int n); // returns "" if column is unnamed
void set_column_x(int n); // by default = 0
void set_column_y(int n); // by default = 1
void set_column_y_stddev(int n); // -1 = no stddev column

int get_column_x(), etc.

vector<string> column_names;
vector<double> all_data;
double get_x(int n) { return all_data[n * get_column_count() +
get_column_x()]; }

other thoughts:
- we can assume that y_has_stddev is a property of "range" rather than of point.
- that's bad that FixedStepRange inherits add_pt() and other methods
and members that make no sense for this class.
- if "calibration" in WinSpec format is linear, has_fixed_step()
should return true
- why add_pt() is virtual?
- IMHO almost all writing methods should be protected
- istream argument can be passed to load_data() instead of ctor.
- i'm not sure if check() should throw exceptions

> PS: I am now going to implement the PANalytical's data format.

I'd suggest other things to do first:

- as you wrote in one of the previous emails:
> I am now going to implement the two new formats and then re-write uxd
> and rigaku format without the UxdLikeDataSet

- check if you have rights to distribute all sample files that are in
samples/ directory

- find other data files (if it's possible) and test your library.
Having several sample files that are from the same source and are
almost identical is not very useful for testing.

Marcin

--
Marcin Wojdyr | http://www.unipress.waw.pl/~wojdyr/

张鹏

unread,
Aug 14, 2007, 4:07:17 AM8/14/07
to fityk-dev ML
sorry, forgot to cc this ML.

---------- Forwarded message ----------
From: 张鹏 <zhangp...@gmail.com>
Date: Aug 14, 2007 4:06 PM
Subject: Re: Support to pdCIF format has been added
To: Marcin Wojdyr <woj...@gmail.com>

Hi, Marcin:

On 8/14/07, Marcin Wojdyr <woj...@gmail.com > wrote:

> we can see that there are many possible X-Y combinations.
> To fully utilize all of the data in pdCIF, I think the current data
> structure to hold the x-y points need to be modified.
>  We can read all of the data in, along with its labels, and then let the
> user to choose which one as x and which one
> as y.

Yes, this can be also useful for reading CSV files with more than two columns.

I'd suggest class MultiColumnRange with: ...

About the multi-column range, I also thought about this. And the following is my proposal:

CODE START
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

//////////////////////////////////////////
// abstract base class for a column
class Column
{
public:
    Cloumn(bool fixed_step_ = false) : fixed_step(fixed_step_), stddev(-1) {}

    virtual double get_val(unsigned n) = 0;
    unsigned get_pt_count() = 0;
   
    bool is_fixed_step() { return fixed_step; }

    bool has_stddev() { return (stddev != -1); }
    double get_stddev() { return stddev; }
    void set_stddev(double stddev_) { stddev = stddev_; }
   
    string get_name() { return name; }
    void set_name(string name_) { name = name_; }
   
protected:
    bool fixed_step;
    double stddev;
    string name;
};

// column uses vector<double> to represent the data
class VecColumn : public Column
{
public:
    VecColumn(bool fixed_step_ = false) : Column(fixed_step_) {}
   
    void add_pt(double val) { dat.push_back(val); }
   
    // implementation of the base interface
    double get_val(unsigned n)
    {
        // sanity checking with index n
        return dat[n];
    }
    unsigned get_pt_count() { return dat.size(); }
   
protected:
    vector<double> dat;
};

class StepColumn : public Column
{
public:
    StepColumn(double start_ = 0, double step_ = 0,
        count_ = std::numeric_limits<unsigned char>::max())
        : Column(true), start(start_), step(step_), count(count_)
    {}
   
    void set_count(double count_) { count = count_; }
    void set_start(double start_) { start = start_; }
    void set_step(double step_) { step = step_; }
    double get_start() { return start; }
    double get_step() { return step; }
   
    // implementation of the base interface
    double get_val(unsigned n)
    {
        // sanity checking with index n
        return (start + n * step);
    }
    unsigned get_pt_count() { return count; }
   
protected:
    double start, step;
};

// All of the old class Range & FixedStepRange and the MultiColumnRange can be replaced with this new Range
class Range
{
public:
    Range() : column_x(0), column_y(0) {}
    ~Range();    // required, because of ptr member

    int get_column_count();
    string get_column_name(unsigned n); // returns "" if column is unnamed
    unsigned get_size()
    {
        return min(all_colomn_size);
    }
   
    void set_column_x(unsigned n); // by default = 0
    void set_column_y(unsigned n); // by default = 1
   
    void get_column_x(unsigned n); // by default = 0
    void get_column_y(unsigned n); // by default = 1
   
protected:
    vector<Column*> cols;
    unsinged column_x, column_y;
};



////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
 CODE END

other thoughts:
- we can assume that y_has_stddev is a property of "range" rather than of point.
ok. in my proposal, I have changed the stddev as a scalar member of Column.

- that's bad that FixedStepRange inherits add_pt() and other methods
and members that make no sense for this class.

no this problem in my new proposal.

- if "calibration" in WinSpec format is linear, has_fixed_step()
should return true

ok, this has been modified. 

- why add_pt() is virtual?

 no this problem in my new proposal.

- IMHO almost all writing methods should be protected

which methods do you mean ?  is there  this problem in  my new proposal?

- istream argument can be passed to load_data() instead of ctor.

yes, and there should also be some changes:
* use istream *p_is  instead of ref member f to store istream object.
* need a member function clear() as following:
And in load_data, clear() should be called first.

DataSet::clear()
{
    vector<Range*>::iterator it;
    for (it = ranges.begin(); it != ranges.end (); ++it) {     
        delete *it;
    }

    ranges.clear();
    meta_map.clear();
}


ranges.clear();

}

 

- i'm not sure if check() should throw exceptions

I think it should not  throw exceptions. if there is something not as  expected,  just simply return false.
I will change the code where throw exceptions in check().

>  PS: I am now going to implement the PANalytical's data format.

I'd suggest other things to do first:

- as you wrote in one of the previous emails:
> I am now going to implement the two new formats and then re-write uxd
> and rigaku format without the UxdLikeDataSet
the PANalytical's data format is one of the 2 new formats.
However, I will re-write the uxd & rigaku first.

- check if you have rights to distribute all sample files that are in
samples/ directory
ok

- find other data files (if it's possible) and test your library.
Having several sample files that are from the same source and are
almost identical is not very useful for testing.
ok, I will try to find some more.

Marcin Wojdyr

unread,
Aug 14, 2007, 12:46:02 PM8/14/07
to 张鹏, fityk...@lists.sourceforge.net
On 8/14/07, 张鹏 <zhangp...@gmail.com> wrote:
> About the multi-column range, I also thought about this. And the following
> is my proposal:

OK, it looks good.

> bool has_stddev() { return (stddev != -1); }
> double get_stddev() { return stddev; }
> void set_stddev(double stddev_) { stddev = stddev_; }

[...]


> > - we can assume that y_has_stddev is a property of "range" rather than of
> point.

> ok. in my proposal, I have changed the stddev as a scalar member of Column.

I meant that y has std. dev. either for all points or for none. But it
is of course different for each point.
BTW, does notation like 0.778(9) in pdCIF mean 0.778 +- 0.009 ?

> > - IMHO almost all writing methods should be protected
>

> which methods do you mean ? is there this problem in my new proposal?

no, I was wrong

> > - istream argument can be passed to load_data() instead of ctor.
>

> yes, and there should also be some changes:
> * use istream *p_is instead of ref member f to store istream object.

Why do you want to store it?

> * need a member function clear() as following:
> And in load_data, clear() should be called first.

ok

张鹏

unread,
Aug 15, 2007, 10:21:47 AM8/15/07
to Marcin Wojdyr, fityk-dev ML

BTW, does notation like 0.778(9) in pdCIF mean 0.778 +- 0.009 ?

This is the official description on this notation called "crystallographic uncertainty":
http://www.iucr.org/iucr-top/comm/cnom/statdes/uncert.html 

> > - istream argument can be passed to load_data() instead of ctor.
>
> yes, and there should also be some changes:
> * use istream *p_is  instead of ref member f to store istream object.

Why do you want to store it?

 
1. it will be convinient for the other member function to use it. Although, it can also passed as a function arg.
2. it can used as a flag to indicate that an sream has been read in. And the methods like  DataSet::get_range() can check it to make sure some data has been read.

const Range&  DataSet::get_range()
{
    my_assert(p_if != NULL, "no stream data read");
    ...
}

 

Marcin Wojdyr

unread,
Aug 15, 2007, 11:57:14 AM8/15/07
to 张鹏, fityk-dev ML
On 8/15/07, 张鹏 <zhangp...@gmail.com> wrote:
> > BTW, does notation like 0.778(9) in pdCIF mean 0.778 +- 0.009 ?
>
> This is the official description on this notation called
> "crystallographic uncertainty":
> http://www.iucr.org/iucr-top/comm/cnom/statdes/uncert.html

Thanks. How do you use it?

> > > > - istream argument can be passed to load_data() instead of ctor.
> > >
> > > yes, and there should also be some changes:
> > > * use istream *p_is instead of ref member f to store istream object.
> >
> > Why do you want to store it?
>
> 1. it will be convinient for the other member function to use it. Although,
> it can also passed as a function arg.

yes, and in some cases you don't have to pass it to other functions

> 2. it can used as a flag to indicate that an sream has been read in. And the
> methods like DataSet::get_range() can check it to make sure
> some data has been read.

I don't think it's really helpful. It's enough to check the number or
ranges (and even this is not necessary). You don't have to check if
the library is used in a correct way, although it is good if there are
assertions here and there. But you can assume that programmer who is
using it have read docs. (you can't assume that the input data is
correct, of course.)

张鹏

unread,
Aug 16, 2007, 9:57:35 AM8/16/07
to Marcin Wojdyr, fityk-dev ML
On 8/15/07, Marcin Wojdyr <woj...@gmail.com> wrote:
>[...the uncertainty]

> Thanks. How do you use it?
what about using it as a separate column in the range?

> I don't think it's really helpful. It's enough to check the number or
> ranges (and even this is not necessary).

ok, I will remove it from the DataSet class in next version of code.

> You don't have to check if
> the library is used in a correct way, although it is good if there are
> assertions here and there. But you can assume that programmer who is
> using it have read docs. (you can't assume that the input data is
> correct, of course.)

mainly there are three kind of errors:
1. format error or file corruption: this is the most important one to
be prevented
2. system fault: like lack of memory new() return NULL, this should
also to be prevented.
3. wrong use of the lib by the client programmer who use xylib: this
can be taken less consideration of.

Reply all
Reply to author
Forward
0 new messages