Analysis of four-dimensional datasets


0kto

Apr 22, 2016, 10:33:19 AM
to julia-stats
Hi all,

I started working with Julia when I began my PhD in physics, and I have written quite a bit of code for my research. However, having no formal training in programming, I am now hitting some limits, both in performance and in my knowledge of existing algorithms. So I am here to ask for (general) help.

  1. I am working in physics, on scattering techniques. I therefore work a lot with four-dimensional datasets (reciprocal space and energy, (H,K,L,E)), which tend to be huge (several gigabytes).
    I found it quite easy to work with DataFrames here.
  2. From a 4D dataset, I need to reduce dimensions.
    Example: say H and E get binned, K and L are integrated and normalized, and the statistical error is calculated. The result would be something like a 2D dataset with axes along H and E.  H = [-2,0.1,2]; -0.1 < K < 0.1; 1 < L < 4; E = [0,1,45]
    I need to find the optimal region to present data, and searching in a 4D dataset is hugely inconvenient. So far I have used a DataFrame's join() on several fields, which creates a small set of duplicate / similar data points that are then combined. This works on small datasets, but since each point is compared against a huge set of points, I can't rely on this technique for sets > 1 M points :)
    I guess clustering is the best way to go; are there any algorithms that come to mind? I googled and found the canopy clustering algorithm by McCallum et al. Since it creates overlapping canopies, it should be a good starting point. However, it has not been implemented in Julia, right?
  3. Last but not least, I need to fit my 1D data to custom models, using the Chi^2 algorithm.
    I had huge problems fitting arbitrary curves (background, several normal distributions, linear offset) to my dataset. The fit should use the Chi^2 algorithm, weighted by the statistical error. Is that already implemented in any package?
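For point 2, a single pass over the points that accumulates straight into a 2D histogram avoids the join altogether. The following is only a minimal sketch under several assumptions: the data are held as plain column vectors, the triples in the example mean (start, step, stop), the intensities are raw counts with Poisson statistics (so the error is the square root of the summed counts), and "normalized" means dividing by the number of points in each bin. The function name `bin_HE` and all argument names are hypothetical.

```julia
# Bin on a regular (H, E) grid while integrating over a window in K and L.
# Hedges and Eedges must be ranges (e.g. -2:0.1:2) so that `step` is defined.
function bin_HE(H, K, L, E, I; Hedges, Eedges, Krange, Lrange)
    nH, nE = length(Hedges) - 1, length(Eedges) - 1
    total = zeros(nH, nE)        # summed counts per bin
    npts  = zeros(Int, nH, nE)   # number of contributing points per bin
    dH, dE = step(Hedges), step(Eedges)
    for n in eachindex(H)
        # Skip points outside the K/L integration window.
        (Krange[1] < K[n] < Krange[2] && Lrange[1] < L[n] < Lrange[2]) || continue
        # Skip points outside the binned H/E range.
        (first(Hedges) <= H[n] < last(Hedges)) || continue
        (first(Eedges) <= E[n] < last(Eedges)) || continue
        # Bin indices; `min` guards against rounding at the upper edge.
        i = min(floor(Int, (H[n] - first(Hedges)) / dH) + 1, nH)
        j = min(floor(Int, (E[n] - first(Eedges)) / dE) + 1, nE)
        total[i, j] += I[n]
        npts[i, j]  += 1
    end
    denom = max.(npts, 1)            # avoid dividing by zero in empty bins
    avg = total ./ denom             # normalized intensity (assumed: per point)
    err = sqrt.(total) ./ denom      # assumed Poisson error on the summed counts
    return avg, err, npts
end
```

If the triples are indeed (start, step, stop), the example would correspond to a call like `bin_HE(H, K, L, E, I; Hedges = -2:0.1:2, Eedges = 0:1:45, Krange = (-0.1, 0.1), Lrange = (1, 4))`. The cost grows linearly with the number of points, since no point is ever compared against another.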
Again -- for each of these tasks I have written some routines, but I have the feeling there is much faster and more elegant stuff out there. It would be a huge help if you could drop me a line with your thoughts on these points.


Looking forward to your input!

Cedric St-Jean

Apr 22, 2016, 9:22:40 PM
to julia-stats
Hi, that sounds like interesting work! Some comments:

1. How are you storing a 4D dataset in a dataframe? One row per data-point? Could you store it in a 4D array instead? If dataframes work, that's great, but they are not optimized for every kind of use. You might get much better performance writing the join explicitly using another structure.

2. Do you have public code? In particular, have you profiled your code for a bottleneck? If you post some fragment that is slow, it'll be easier to provide concrete advice.

3. Re. Chi^2, I'm not familiar with that algorithm, but FYI there seems to be an R implementation, and you can call it with RCall.

4. If you want advice about choice of algorithm, you might have better luck asking in a forum specialized for your field. Also, consider posting on julia-users, it gets more traffic.
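For reference on the Chi^2 point: a fit that minimizes Chi^2 with each residual divided by its statistical error is commonly called weighted (nonlinear) least squares, and a standard way to minimize it is Levenberg-Marquardt. Below is a dependency-free sketch of that idea, not a production fitter. The model (linear background plus a single Gaussian), the parameter ordering, and the names `model`, `chi2` and `fit_chi2` are all assumptions made for illustration; the Jacobian is approximated by finite differences.

```julia
using LinearAlgebra

# Hypothetical model: p = [offset, slope, amplitude, center, width].
model(x, p) = p[1] + p[2] * x + p[3] * exp(-0.5 * ((x - p[4]) / p[5])^2)

# Chi^2: squared residuals, each weighted by its statistical error σ.
chi2(f, x, y, σ, p) = sum(abs2, (y .- f.(x, Ref(p))) ./ σ)

# Minimal Levenberg-Marquardt loop with a forward-difference Jacobian.
function fit_chi2(f, x, y, σ, p0; maxiter = 200, λ = 1e-3, tol = 1e-10)
    p = float.(p0)
    r(q) = (y .- f.(x, Ref(q))) ./ σ        # weighted residual vector
    best = sum(abs2, r(p))
    for _ in 1:maxiter
        r0 = r(p)
        J = zeros(length(x), length(p))     # Jacobian of the weighted residuals
        for k in eachindex(p)
            h = 1e-6 * max(abs(p[k]), 1.0)
            q = copy(p)
            q[k] += h
            J[:, k] = (r(q) .- r0) ./ h
        end
        A = J' * J
        g = J' * r0
        δ = -((A + λ * Diagonal(diag(A))) \ g)   # damped Gauss-Newton step
        trial = sum(abs2, r(p .+ δ))
        if trial < best
            converged = best - trial < tol * best
            p .+= δ                          # accept the step
            best = trial
            λ /= 10                          # and damp less next time
            converged && break
        else
            λ *= 10                          # reject the step, damp harder
        end
    end
    return p, best
end
```

Further peaks can be added by extending `model` with more Gaussian terms and `p0` with their starting values. As with any local optimizer, the result depends on reasonable starting values, which may explain trouble with arbitrary curves.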

Best,

Cédric

0kto

Apr 25, 2016, 9:56:27 AM
to julia-stats
Hi,

I just recoded quite a bit, and after a short excursion into the world of multidimensional arrays I stayed with the DataFrame setup. With the DataFramesMeta package, speed improves quite a bit; the bottleneck is now getting the data into a DataFrame (reading ASCII from disk is slow).

Getting the interesting region from the 6-column / 4-axis DataFrame works like this:
    df = @where(df, :QH .> H[1], :QH .< H[2])
It runs on this 3-million-row DataFrame almost instantly, and the integration then works fast on subsets. This way the time delay from the raw data to the plot is almost gone :)
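The same kind of window selection can also be sketched without any package, using a boolean mask over column vectors. This is only an illustration under assumptions: the table is a NamedTuple of equal-length vectors, the limits are open intervals as in the `@where` line, and the name `select_region` as well as every column name other than `QH` are hypothetical.

```julia
# Keep only the rows whose value on each named axis lies strictly inside (lo, hi).
function select_region(cols::NamedTuple, limits::NamedTuple)
    mask = trues(length(first(cols)))
    for (name, (lo, hi)) in pairs(limits)
        mask .&= (cols[name] .> lo) .& (cols[name] .< hi)
    end
    # Apply the same mask to every column so the rows stay aligned.
    return map(c -> c[mask], cols)
end
```

A call such as `select_region(data, (QH = (H[1], H[2]),))` would mirror the one-axis filter, and further axes can be restricted by adding entries to the limits.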

So thanks again for your hints; they gave me some good ideas. I guess that solves most of my questions!

Cedric St-Jean

Apr 25, 2016, 10:12:53 AM
to julia...@googlegroups.com
Cheers! If your data is in CSV, you can convert it to a better format to fix your loading-time issue; you can use JLD.save() or JLD.write() to do that. That's what I usually do.
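The convert-once idea can be sketched as a small cache wrapper. Note the substitution: to stay dependency-free, this sketch uses Julia's standard-library Serialization module instead of JLD, and it assumes a purely numeric, whitespace-delimited file. The function name `load_cached` and the `.bin` suffix are hypothetical.

```julia
using DelimitedFiles, Serialization

# Parse the slow ASCII file once, then reuse a binary cache on later runs.
function load_cached(ascii_path, cache_path = ascii_path * ".bin")
    # Fast path: the cache exists and is at least as new as the source file.
    if isfile(cache_path) && mtime(cache_path) >= mtime(ascii_path)
        return deserialize(cache_path)
    end
    # Slow path: parse the text file, then write the cache for next time.
    data = readdlm(ascii_path, Float64)
    serialize(cache_path, data)
    return data
end
```

Serialized files are generally not meant to be read back by a different Julia version, so this suits a local cache that can always be rebuilt from the ASCII source; for long-term storage a dedicated format such as the JLD files mentioned above is the safer choice.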
