Some comments and suggestions

Jeff

Mar 3, 2012, 8:10:54 PM
to heuris...@googlegroups.com
Hi everyone,

So, I have been working with the software for the past 2 days quite intensively and wanted to share my thoughts. You guys have an incredible infrastructure here and the Microsoft based GUI potential is unlimited. This is just well beyond the level of most open source efforts. But there are some silly, but serious, problems with the basic interface and functionality that make it extremely frustrating to use and figure out. Many of them are one minute fixes that would yield a lot of bang for the buck in my opinion.

  • You should be able to save a setting for where you will place your dependent (Y) variable-- first column, second column (in case there is a date column for a time series), and last column. And that should be applied universally across the program.
  • You can assume that no user would want to include a variable as both an independent and dependent variable. By looking at the column header text, you can automatically deselect the variable that has been selected as the dependent variable from being in the set of independent variables.
  • The system should have different default behaviors based on the number of total columns of data. You only really need 2: small and large problems. If you have like 600 columns (which isn't even that much), you shouldn't even allow the user to select the variable frequency view, because it will just hang the entire program for sometimes minutes at a time-- again, no user would ever want to disable the program for so long.
  • Here is a more serious problem: if you want to have such a big complex project with essentially no documentation, then you can at least get most of the way by having a "learning mode" where you will get a huge number of extra tool-tips when you hover over anything at all in the program. But instead of having most of the tips be the same ("a result which has a name and holds a value"), each one should be very specific about what the parameter does (preferably examples too). 
  • You need to have a way of showing the user which components are compatible with each other. You could use a color based scheme, or make a drag and drop scheme with a changing cursor, so when two components are supposed to be connected, the cursor will indicate that, and if they can't, you will get a little no smoking sign symbol.
  • The "operator view" really should be the interface where you apply operations in my view. It would make the system so much easier to understand. Then you could connect components visually like in Simulink or Microsoft Visio (indeed I'm sure you can use some of the same WPF controls).
  • The engine type should default to parallel (or this should be a setting the user can store as a preference). 
  • The user should only have to define the problem once, and to the greatest extent, the file should be made to be as compatible as possible with different algorithms, so instead of showing the user an error and saying the file can't be opened, at least you can still use the data inside and the variable selections that you have already specified. 
  • The user should be able to import two files-- a "predicted values" file and an "independent variables" file, and it should be easy to try to automatically predict each of the "predicted values" columns without manually specifying the different combinations separately for each one (the current way it must be done). An even more useful thing would be an automatic mode where the system attempted to solve for each predicted value in parallel, and then every few minutes culled away the columns where little progress was being made (e.g., the best quality score was static for the past K generations), so the system can focus the computational resources on the variables that the system has the best chance of solving in a robust way. 
  • The program has trouble importing csv and txt files that work fine in other programs like Excel and JMP, so I found myself constantly re-saving csv files in Excel that I created in Matlab. Should be easy to find an open source component that never has a problem.
  • Instead of just saying CurrentAverageQuality, it would be nice to have it say in parentheses (or just as a tool-tip) what the quality measure is (Rsquared, MSE).
  • That leads to the next point-- it is the easiest thing in the world to add additional quality measures. You should definitely include MAE and Correlation Coefficient (not just Rsquared), since for many applications these are more relevant.
  • This is the single biggest and most important recommendation: you should have many, many more templates of the software already configured in all the possible relationships. I spent many hours trying to get symbolic regression to work with the particle swarm, and it would never allow me to start the simulation. It would only take a few minutes to create each valid combination and load it up with non-trivial data showing how it is working, so all the user needs to do is swap out the data. 
  • You guys would take over this space if you could get all of this to work on CUDA. The hive stuff is cool and ultimately even more scalable, but you can buy Nvidia cards for a couple hundred bucks that will let you calculate many of these things ~15x faster. Perhaps you can use an open source framework for this such as Thrust. 
  • An in-app demo that takes control of the cursor when you play it would be a really great way to get people to the next level of using this. For example, showing how to construct a complex project, the whole flow of the app: import data into a regression problem template; open a new algorithm and load up the template; create an experiment; import the algorithm into it; change the settings to get it to work. I had to figure all of this out myself over the course of a couple of days, and it was only because I was so determined to get it to work that I even stayed with it-- most users would most likely give up far sooner. This doesn't need to be fancy-- just a sort of macro recording of where to click would be great.
  • It would be pretty cool to get the latest algorithms such as Cuckoo and Spiral search into the app. I wonder if the authors of those even know about this system-- they might agree to write the implementations themselves if they know C#.
  • A view that showed how the scatter plot changes over time and over different generations would be really nice and informative. 
  • It would be great to use more complicated atomic functions like the wavelet transform, FFT, cepstrum, Hilbert, etc. There must be an open source math library that you could link up to that would include a library of functions like that (perhaps Boost).
  • The concept of Bagging (see Breiman and Friedman) would be very powerful within this sort of framework for combining successful models (I'm thinking particularly of the symbolic regressions ) and visually showing the user the gain from using the Bagged version of the model (this would seem to also give you confidence intervals and more robust variable relevance). 
Well, that is a long list, but I think this system has almost unlimited potential if it can address most of those items. Keep up the great work, and I look forward to using future versions of this system.

Michael Kommenda

Mar 6, 2012, 5:09:56 PM
to heuris...@googlegroups.com
Hi Jeff,

First of all, thank you for your extensive feedback on HeuristicLab. We are aware of the lack of documentation in HeuristicLab and are steadily trying to improve it: the mailing list has only been available since last August, we are currently planning video tutorials that show the basic handling of HeuristicLab, tutorials on HeuristicLab are held at scientific conferences, etc.

Now to your concrete feedback. I am assuming most of your comments concern the symbolic regression / classification functionality in HL (please correct me if I am wrong).


  • You should be able to save a setting for where you will place your dependent (Y) variable-- first column, second column (in case there is a date column for a time series), and last column. And that should be applied universally across the program.
Currently HL assumes that the first double column present in the dataset contains the dependent variable. An exception is the classification problem, which only allows a dependent variable with at most 100 distinct values.
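
As a rough illustration of the convention described above (this is not HeuristicLab's actual code), the following Python sketch picks the first all-numeric column as the dependent variable and leaves it out of the input set. The function names `is_number` and `split_target` and the sample column names are hypothetical.

```python
def is_number(text):
    """Return True if text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_target(header, rows):
    """Use the first all-numeric column as target, the other numeric columns as inputs."""
    numeric = [
        name
        for i, name in enumerate(header)
        if all(is_number(row[i]) for row in rows)
    ]
    if not numeric:
        raise ValueError("no numeric column found")
    target = numeric[0]
    # The target is never offered as an input variable.
    inputs = [name for name in numeric if name != target]
    return target, inputs


# Hypothetical dataset with a leading date column, as in a time series.
header = ["date", "y", "x1", "x2"]
rows = [
    ["2012-03-01", "1.5", "0.2", "7"],
    ["2012-03-02", "1.7", "0.3", "9"],
]
target, inputs = split_target(header, rows)
print(target, inputs)
```

A user-configurable preference (first, second, or last column) would replace the `numeric[0]` choice in this sketch.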

  • You can assume that no user would want to include a variable as both an independent and dependent variable. By looking at the column header text, you can automatically deselect the variable that has been selected as the dependent variable from being in the set of independent variables.
That's a good suggestion, and it is true except for the specialized case of autoregressive time series modeling. In the future there will be a specialized symbol in the GP grammar for this use case, so we can then automatically disable the dependent variable.
  • The system should have different default behaviors based on the number of total columns of data. You only really need 2: small and large problems. If you have like 600 columns (which isn't even that much), you shouldn't even allow the user to select the variable frequency view, because it will just hang the entire program for sometimes minutes at a time-- again, no user would ever want to disable the program for so long.
This issue was not known up to now and is recorded in Ticket #1785 (http://dev.heuristiclab.com/trac/hl/core/ticket/1785).

  • Here is a more serious problem: if you want to have such a big complex project with essentially no documentation, then you can at least get most of the way by having a "learning mode" where you will get a huge number of extra tool-tips when you hover over anything at all in the program. But instead of having most of the tips be the same ("a result which has a name and holds a value"), each one should be very specific about what the parameter does (preferably examples too).
Again I must admit that you are right. In a perfect HL all tooltips and descriptions (information icon) should be filled with descriptive text. As this is not the case, I created a ticket #1786 (http://dev.heuristiclab.com/trac/hl/core/ticket/1786) so this should and will be improved in the future.

  • You need to have a way of showing the user which components are compatible with each other. You could use a color based scheme, or make a drag and drop scheme with a changing cursor, so when two components are supposed to be connected, the cursor will indicate that, and if they can't, you will get a little no smoking sign symbol.
Here I am not really sure about what you are talking about. Do you mean the drag & drop over different view components, or in the OperatorGraph visualization? Regarding the drag and drop between different views, this is already indicated by a changing cursor. If you are thinking about features in the OperatorGraph visualization, could you please clarify your suggestions.

  • The "operator view" really should be the interface where you apply operations in my view. It would make the system so much easier to understand. Then you could connect components visually like in Simulink or Microsoft Visio (indeed I'm sure you can use some of the same WPF controls).
What exactly do you mean by "operator view"? The operators side panel containing all available operators that should be used for algorithm modeling or the OperatorGraphVisualization (Operator graph tab in the algorithm)?

  • The engine type should default to parallel (or this should be a setting the user can store as a preference).
Good point! See ticket #1787 (http://dev.heuristiclab.com/trac/hl/core/ticket/1787).

  • The user should only have to define the problem once, and to the greatest extent, the file should be made to be as compatible as possible with different algorithms, so instead of showing the user an error and saying the file can't be opened, at least you can still use the data inside and the variable selections that you have already specified.
An error message should only occur if regression / classification problems are opened in other DataAnalysis algorithms that do not support the saved problem. That said, a more detailed error message could be provided for inexperienced users.

  • The user should be able to import two files-- a "predicted values" file and an "independent variables" file, and it should be easy to try to automatically predict each of the "predicted values" columns without manually specifying the different combinations separately for each one (the current way it must be done). An even more useful thing would be an automatic mode where the system attempted to solve for each predicted value in parallel, and then every few minutes culled away the columns where little progress was being made (e.g., the best quality score was static for the past K generations), so the system can focus the computational resources on the variables that the system has the best chance of solving in a robust way.
This feature has been on our to-do list for a long time. A prototype implementation is already available that allows the specification of multiple "predicted values". It also allows full parameter range tests (e.g. population size in (100, 500, 1000, 5000), mutation rates in [0.05, 0.50] with a step width of 0.05) and meta-optimization runs (optimizing the parameters of a heuristic by another heuristic). This implementation must still be completely reviewed and partly rewritten, and it does not contain an automatic stopping criterion, but it should fulfill your request. The current prototype can be downloaded from our build server at [4]
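
To give a feel for the size of such a full parameter range test, here is a minimal Python sketch that enumerates the example grid from the paragraph above. It only builds the list of configurations; the dictionary keys `population_size` and `mutation_rate` are illustrative names, not HeuristicLab parameters.

```python
from itertools import product

population_sizes = [100, 500, 1000, 5000]
# 0.05, 0.10, ..., 0.50; computed from an integer index and rounded
# so that repeated float addition cannot drift.
mutation_rates = [round(0.05 * i, 2) for i in range(1, 11)]

# One configuration per combination of population size and mutation rate.
configurations = [
    {"population_size": p, "mutation_rate": m}
    for p, m in product(population_sizes, mutation_rates)
]

print(len(configurations))
print(configurations[0], configurations[-1])
```

Even this small example yields 4 × 10 = 40 separate runs per target variable, which is why an automatic stopping or culling criterion matters.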

  • The program has trouble importing csv and txt files that work fine in other programs like Excel and JMP, so I found myself constantly re-saving csv files in Excel that I created in Matlab. Should be easy to find an open source component that never has a problem.
It is known that some csv files cannot be imported correctly. We want to steadily improve our importer, and test files that fail to import are always a good way to do that. Could you please provide examples that do not work?

  • Instead of just saying CurrentAverageQuality, it would be nice to have it say in parentheses (or just as a tool-tip) what the quality measure is (Rsquared, MSE).
This is not that easy to implement, because there is only a loose coupling between an algorithm and its problem. IMHO the quality measure can easily be seen by looking at the selected evaluator.
  • That leads to the next point-- it is the easiest thing in the world to add additional quality measures. You should definitely include MAE and Correlation Coefficient (not just Rsquared), since for many applications these are more relevant.
This is really an easy task, as we already have calculator classes that compute the MAE and the correlation coefficient. However, I suspect that the optimization results differ significantly as the MAE is related to the MSE. Additionally, things can get worse when using Pearson's r instead of R², because a low fitness value is assigned to models with a negative correlation (r < 0) between their predicted and the target values (although this could easily be corrected by a linear transformation). See ticket #1788 (http://dev.heuristiclab.com/trac/hl/core/ticket/1788).
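
The point about r < 0 can be seen in a small Python sketch. It assumes that R² here means the squared Pearson correlation between predictions and targets (the thread does not spell out the exact formula), and the helper names `mae` and `pearson_r` are hypothetical, not HeuristicLab's calculator classes.

```python
import math


def mae(targets, predictions):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


targets = [1.0, 2.0, 3.0, 4.0]
predictions = [4.0, 3.0, 2.0, 1.0]  # perfectly anti-correlated model output

r = pearson_r(targets, predictions)
print("r =", r)            # lowest possible score if r is maximized directly
print("r^2 =", r ** 2)     # highest possible score if squared r is used
print("MAE =", mae(targets, predictions))
```

A model like this carries all the information needed for a perfect fit (a linear scaling with negative slope recovers the targets), yet maximizing r directly would rank it worst, while the squared value ranks it best.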


  • This is the single biggest and most important recommendation: you should have many, many more templates of the software already configured in all the possible relationships. I spent many hours trying to get symbolic regression to work with the particle swarm, and it would never allow me to start the simulation. It would only take a few minutes to create each valid combination and load it up with non-trivial data showing how it is working, so all the user needs to do is swap out the data.
I am sorry that you spent so many hours configuring PSO with symbolic regression. The reason it does not work is that we have not defined "Move operators" for symbolic regression. This is already recorded in our trac system, because we have been thinking for some time about symbolic regression by tabu search instead of GP (ticket #1476 http://dev.heuristiclab.com/trac/hl/core/ticket/1476). The suggestion to provide more valid combinations of algorithms and problems is also a good point.
  • You guys would take over this space if you could get all of this to work on CUDA. The hive stuff is cool and ultimately even more scalable, but you can buy Nvidia cards for a couple hundred bucks that will let you calculate many of these things ~15x faster. Perhaps you can use an open source framework for this such as Thrust.
Sure, that is one possibility to speed up the calculation, and lots of researchers in the community have prototypes and running programs that support computation on graphics cards. However, that would also mean reimplementing large parts of HL for this specialized hardware. We follow another path and provide distributed computing via Hive and parallelization using multiple cores. Additionally, a research project started last autumn that will provide cloud computation facilities for HL. One of the first test cases for this project will be the automatic analysis of datasets by easily configured regression runs, similar to the feature you mentioned before.

  • An in-app demo that takes control of the cursor when you play it would be a really great way to get people to the next level of using this. For example, showing how to construct a complex project, the whole flow of the app: import data into a regression problem template; open a new algorithm and load up the template; create an experiment; import the algorithm into it; change the settings to get it to work. I had to figure all of this out myself over the course of a couple of days, and it was only because I was so determined to get it to work that I even stayed with it-- most users would most likely give up far sooner. This doesn't need to be fancy-- just a sort of macro recording of where to click would be great.
As described at the beginning, we plan to provide video tutorials that explain some of the functionality described above (algorithm configuration, experiment planning, experiment analysis, etc.). So I am pretty sure that we will not have an in-app demo.

  • It would be pretty cool to get the latest algorithms such as Cuckoo and Spiral search into the app. I wonder if the authors of those even know about this system-- they might agree to write the implementations themselves if they know C#.
I must admit I haven't heard of Cuckoo or Spiral search and had to google the terms.  There are plenty of cool algorithms described in research papers and implemented in other frameworks, but it always takes some time to implement them. Currently we are working to provide Scatter Search within the next months and learning classifier systems in the more distant future.

  • A view that showed how the scatter plot changes over time and over different generations would be really nice and informative.
That would be a really interesting concept, but I would formulate it more generally: store all evolved solutions as a result instead of only the current best one. Then you can not only display the scatter plot over time as a movie, but also see the corresponding solutions and formulas. By the way, I am not sure whether you have already discovered it, because you have mostly been dealing with symbolic regression, but a movie view is already available in HL. Just start the "Genetic Algorithm - TSP" sample from the start page and activate the population diversity analyzer from the algorithms tab page. Afterwards, set the store history flag of the population diversity analyzer to true and have a look at the TSP population diversity analyzer results, especially the solution similarity history.
  • It would be great to use more complicated atomic functions like the wavelet transform, FFT, cepstrum, Hilbert, etc. There must be an open source math library that you could link up to that would include a library of functions like that (perhaps Boost).
Currently we are using AlgLib (http://www.alglib.net/) as the main math library for HeuristicLab. Additional symbols that can be used in the symbolic expression trees are always worth thinking about and implementing. We have also been discussing the possibility of providing user-defined symbols, which can be defined within HL.

  • The concept of Bagging (see Breiman and Friedman) would be very powerful within this sort of framework for combining successful models (I'm thinking particularly of the symbolic regressions ) and visually showing the user the gain from using the Bagged version of the model (this would seem to also give you confidence intervals and more robust variable relevance).
Sure, bagging is a powerful and inspiring concept. That is also the reason we are currently working on combining various models in "ensembles" to improve the prediction quality. Although we currently have no way to automatically learn models on different parts of the data and with different input variables, we are working on a general way to combine multiple solutions / models in an ensemble for the reasons mentioned (confidence intervals, robustness, ...). This is currently implemented in a branch and will hopefully make it into one of the next releases.

Well, that is a long list, but I think this system has almost unlimited potential if it can address most of those items. Keep up the great work, and I look forward to using future versions of this system.

Finally, here are a few links that provide a little more information and hints about the usage of HL:
HL Blog:                    http://dev.heuristiclab.com/trac/hl/core/blog
GP features:             http://dev.heuristiclab.com/trac/hl/core/wiki/UsersFeaturesGeneticProgramming
Tutorials + Slides:    http://dev.heuristiclab.com/trac/hl/core/wiki/UsersTutorials

Thank you again for your positive feedback and the useful comments regarding the usability of HL. Please be patient if we cannot provide everything within the next X weeks; we are steadily working on improving the software as well as the available documentation, wiki pages, blog entries, and tutorials.

Best regards,
Michael

Jeffrey Emanuel

Mar 6, 2012, 10:42:43 PM
to heuris...@googlegroups.com
Hi Michael,

Thanks for your quick and thoughtful reply. See my comments below.

On Tue, Mar 6, 2012 at 5:09 PM, Michael Kommenda <mkom...@heuristiclab.com> wrote:
Hi Jeff,

First of all, thank you for your extensive feedback on HeuristicLab. We are aware of the lack of documentation in HeuristicLab and are steadily trying to improve it: the mailing list has only been available since last August, we are currently planning video tutorials that show the basic handling of HeuristicLab, tutorials on HeuristicLab are held at scientific conferences, etc.

Now to your concrete feedback. I am assuming most of your comments concern the symbolic regression / classification functionality in HL (please correct me if I am wrong).


Yes, that is correct. I'm not sure how many of your other users need to minimize the Rastrigin function, but to me the main value of the system is to use it with the regression methods to solve problems (I understand it is also very important in logistics/routing). It is a very powerful strategy and only limited by the power of the functions you use in your grammar.

  • You should be able to save a setting for where you will place your dependent (Y) variable-- first column, second column (in case there is a date column for a time series), and last column. And that should be applied universally across the program.
Currently HL assumes that the first double column present in the dataset contains the dependent variable. An exception is the classification problem, which only allows a dependent variable with at most 100 distinct values.

Yes I understand, I was just saying that the user may already have data prepared with a different ordering. I found myself constantly having to change it from the last to the front until I started generating the files differently.  
  • You can assume that no user would want to include a variable as both an independent and dependent variable. By looking at the column header text, you can automatically deselect the variable that has been selected as the dependent variable from being in the set of independent variables.
That's a good suggestion, and it is true except for the specialized case of autoregressive time series modeling. In the future there will be a specialized symbol in the GP grammar for this use case, so we can then automatically disable the dependent variable.
  • The system should have different default behaviors based on the number of total columns of data. You only really need 2: small and large problems. If you have like 600 columns (which isn't even that much), you shouldn't even allow the user to select the variable frequency view, because it will just hang the entire program for sometimes minutes at a time-- again, no user would ever want to disable the program for so long.
This issue was not known up to now and is recorded in Ticket #1785 (http://dev.heuristiclab.com/trac/hl/core/ticket/1785).

 
OK that's good. The other thing I forgot to mention is that, when you have 100 or more variables, the graph is also quite useless because you can't tell which curve represents which variable. A tool-tip like pop-up that shows what variable the mouse is currently over would be a nice way to address this.  
  • Here is a more serious problem: if you want to have such a big complex project with essentially no documentation, then you can at least get most of the way by having a "learning mode" where you will get a huge number of extra tool-tips when you hover over anything at all in the program. But instead of having most of the tips be the same ("a result which has a name and holds a value"), each one should be very specific about what the parameter does (preferably examples too).
Again I must admit that you are right. In a perfect HL all tooltips and descriptions (information icon) should be filled with descriptive text. As this is not the case, I created a ticket #1786 (http://dev.heuristiclab.com/trac/hl/core/ticket/1786) so this should and will be improved in the future.

  • You need to have a way of showing the user which components are compatible with each other. You could use a color based scheme, or make a drag and drop scheme with a changing cursor, so when two components are supposed to be connected, the cursor will indicate that, and if they can't, you will get a little no smoking sign symbol.
Here I am not really sure about what you are talking about. Do you mean the drag & drop over different view components, or in the OperatorGraph visualization? Regarding the drag and drop between different views, this is already indicated by a changing cursor. If you are thinking about features in the OperatorGraph visualization, could you please clarify your suggestions.

 
Sorry for not being clear. What I mean is the various parts of a system. Suppose I want to use multi-objective symbolic regression as my problem type. Well, I can't use that with most of the algorithms-- the only algorithm that I was able to get to work with that is NSGA-II. But that took me a long time to figure out through trial and error. So what I mean is, if the user could select an icon of a multi-objective symbolic regression problem from a palette of problem types, drag it onto a sort of "canvas", and then select, say, the Particle Swarm algorithm icon from another palette of algorithm types, the cursor could change or turn red or something to show that these two components are not compatible. And if the user then picked NSGA-II, the cursor would be green to show you could plug them together and it would work. And the point is that the "OperatorGraph" visualization, instead of just being a picture that represents the system, should instead be an interactive system (i.e., the "canvas" that I mentioned before) that the user works in to actually build the model. If you have ever used the Visio graphing software, what I am thinking of is similar to the way various components are connected together. It is just a very nice way to make complex things very simple and straightforward so that people who have important problems to solve, but not the time or computer skills to really figure out a complex piece of code like this, can do really powerful things on their own.
  • The "operator view" really should be the interface where you apply operations in my view. It would make the system so much easier to understand. Then you could connect components visually like in Simulink or Microsoft Visio (indeed I'm sure you can use some of the same WPF controls).
What exactly do you mean by "operator view"? The operators side panel containing all available operators that should be used for algorithm modeling or the OperatorGraphVisualization (Operator graph tab in the algorithm)?

I answered this in my response above.
  • The engine type should default to parallel (or this should be a setting the user can store as a preference).
Good point! See ticket #1787 (http://dev.heuristiclab.com/trac/hl/core/ticket/1787).

  • The user should only have to define the problem once, and to the greatest extent, the file should be made to be as compatible as possible with different algorithms, so instead of showing the user an error and saying the file can't be opened, at least you can still use the data inside and the variable selections that you have already specified.
An error message should only occur if regression / classification problems are opened in other DataAnalysis algorithms that do not support the saved problem. That said, a more detailed error message could be provided for inexperienced users.

  • The user should be able to import two files-- a "predicted values" file and an "independent variables" file, and it should be easy to try to automatically predict each of the "predicted values" columns without manually specifying the different combinations separately for each one (the current way it must be done). An even more useful thing would be an automatic mode where the system attempted to solve for each predicted value in parallel, and then every few minutes culled away the columns where little progress was being made (e.g., the best quality score was static for the past K generations), so the system can focus the computational resources on the variables that the system has the best chance of solving in a robust way.
This feature has been on our to-do list for a long time. A prototype implementation is already available that allows the specification of multiple "predicted values". It also allows full parameter range tests (e.g. population size in (100, 500, 1000, 5000), mutation rates in [0.05, 0.50] with a step width of 0.05) and meta-optimization runs (optimizing the parameters of a heuristic by another heuristic). This implementation must still be completely reviewed and partly rewritten, and it does not contain an automatic stopping criterion, but it should fulfill your request. The current prototype can be downloaded from our build server at [4]
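
The culling idea from the suggestion above (drop a target column once its best quality has been static for the past K generations) could look roughly like the following Python sketch. It assumes a quality that is maximized, such as R², and the names `is_stagnant`, `cull`, and the sample histories are made up for illustration; none of this comes from the prototype.

```python
def is_stagnant(best_quality_history, k, tolerance=1e-9):
    """True if the last k generations did not beat the best quality seen before them."""
    if len(best_quality_history) <= k:
        return False  # not enough history to judge yet
    previous_best = max(best_quality_history[:-k])
    recent_best = max(best_quality_history[-k:])
    return recent_best - previous_best <= tolerance


def cull(histories, k):
    """Return the target columns whose runs are still making progress."""
    return [name for name, history in histories.items() if not is_stagnant(history, k)]


# Best quality per generation for two hypothetical target columns.
histories = {
    "y1": [0.10, 0.30, 0.50, 0.62, 0.70],  # still improving
    "y2": [0.20, 0.25, 0.25, 0.25, 0.25],  # flat for the last 3 generations
}
print(cull(histories, k=3))
```

Running the check every few minutes and reassigning the freed cores to the surviving columns would give the "automatic mode" described in the suggestion; for an error measure that is minimized, the comparison would be reversed.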

Great, I will download the prototype and try to figure it out. It might be very useful to make a light wrapper to allow HeuristicLab to communicate with Matlab, since there are some excellent open source black-box parameter optimization toolboxes for that, particularly NOMAD (a Mesh-Adaptive Direct Search algorithm).

  • The program has trouble importing csv and txt files that work fine in other programs like Excel and JMP, so I found myself constantly re-saving csv files in Excel that I created in Matlab. Should be easy to find an open source component that never has a problem.
It is known that some csv files cannot be imported correctly. We want to steadily improve our importer, and files that fail to import are always a good way to do that. Could you please provide some examples that do not work?

The actual file was quite big but I will try to generate a short one the same way and will send it to you. 

  • Instead of just saying CurrentAverageQuality, it would be nice to have it say in parentheses (or just as a tool-tip) what the quality measure is (Rsquared, MSE).

In that case, perhaps you can have a concept of a "good" or "bad" quality and you could show it with the color (green being good, black is avg, red is poor). The point is that, if the user is doing multiple runs with different measures, you need to click around to see what you are looking at. It's nice to know at a glance how it is going. Not the most important feature though!
This is not that easy to implement, because there is only a loose coupling between an algorithm and its problem. IMHO the quality measure in use can easily be seen by just looking at the selected evaluator.
  • That leads to the next point-- it is the easiest thing in the world to add additional quality measures. You should definitely include MAE and Correlation Coefficient (not just Rsquared), since for many applications these are more relevant.
This is really an easy task, as we already have calculator classes that compute the MAE and the correlation coefficient. However, I suspect that the optimization results differ significantly as the MAE is related to the MSE. Additionally, things can get worse when using Pearson's r instead of R², because a low fitness value is assigned to models with an inverse correlation (r < 0) between their predicted values and the target values (although this could easily be corrected by a linear transformation). See ticket #1788 (http://dev.heuristiclab.com/trac/hl/core/ticket/1788).
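The point about Pearson's r versus R² can be illustrated with a small sketch. This is plain Python, not HeuristicLab's calculator classes, and it assumes that R² here means the squared Pearson correlation. A model whose output is perfectly inverted gets r = -1 (a poor fitness if r is maximized), yet r² = 1, because a linear transformation (here a sign flip) would make it a perfect fit.

```python
import math


def mse(y, p):
    """Mean squared error between targets y and predictions p."""
    return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)


def mae(y, p):
    """Mean absolute error between targets y and predictions p."""
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def pearson_r(y, p):
    """Pearson correlation coefficient between y and p."""
    mean_y, mean_p = sum(y) / len(y), sum(p) / len(p)
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, p))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in p)
    return cov / math.sqrt(var_y * var_p)


target = [1.0, 2.0, 3.0, 4.0]
inverted = [-1.0, -2.0, -3.0, -4.0]  # perfectly, but inversely, correlated

r = pearson_r(target, inverted)   # close to -1
r_squared = r ** 2                # close to +1
```

MSE and MAE on the same data are both large (30 and 5), which is why the error-based measures and the correlation-based measures can rank the same model very differently.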

OK. I could see how using MAE would hurt the convergence properties of some of the methods, but the fact is that computers are so fast today, and I at least would gladly wait longer to get a better answer, and if it doesn't converge at all, then you can just switch to the squared error. 

  • This is the single biggest and most important recommendation: you should have many, many more templates of the software already configured in all the possible relationships. I spent many hours trying to get symbolic regression to work with the particle swarm, and it would never allow me to start the simulation. It would only take a few minutes to create each valid combination and load it up with non-trivial data showing how it is working, so all the user needs to do is swap out the data.
I pity you for spending that many hours configuring PSO with symbolic regression. The reason why it is not working is that we have not defined "Move operators" for symbolic regression. This is already recorded in our trac system, because we have been thinking about symbolic regression by tabu search instead of GP for quite some time (ticket #1476 http://dev.heuristiclab.com/trac/hl/core/ticket/1476). The suggestion to provide more valid combinations of algorithms and problems is also a good point.

Hah, I pity myself for that too! But that is how I really learned how the system works. I don't really understand why all of the methods can't be made to work with symbolic regression. The tree grammar and tree mutation rules seem to be separately implemented from the algorithm, and the quality function can be whatever you want. I guess you need a way for the algorithm to move through the problem space, but I would think that this could be done in a generic kind of way using an interface. It would certainly be desirable to have automatic compatibility of different problems to all of the algorithms. By that I mean, if you spend the time to hook up your new problem (say it is some augmented symbolic regression system that brings in outside functions) to the Offspring Selection algorithm, you shouldn't then have to separately integrate it to work with Simulated Annealing and Local Analysis-- they should be written with a common interface so that you can integrate through the interface instead. Hope this makes sense. 
  • You guys would take over this space if you could get all of this to work on CUDA. The hive stuff is cool and ultimately even more scalable, but you can buy Nvidia cards for a couple hundred bucks that will let you calculate many of these things ~15x faster. Perhaps you can use an open source framework for this such as Thrust.
Sure, that is a possibility to speed up calculation, and lots of researchers in the community have prototypes and running programs that support computation on graphics cards. However, that would also mean reimplementing large parts of HL for this specialized hardware. We follow another way and provide distributed computing via Hive and parallelization by using multiple cores. Additionally, a research project started last autumn that will provide cloud computation facilities for HL. One of the first test cases for this project will be the automatic analysis of datasets by easily configured regression runs, similar to the feature you mentioned before.


That sounds great. I wonder if I might be able to be of assistance with adding the 3D card functionality. I know from recent Matlab development experience that you can get much of the benefit of this hardware without necessarily changing that much in your program. If there are bottlenecks you can identify that involve large matrices (or the problem can be represented in this way, which is often the case), you can cast these as a special type and use one of various frameworks to do all the primitive operations, and then cast back to regular arrays. Also there is no reason why you can't combine the two approaches-- in 5 years every computer and phone will have a fast GPU. The ability to get a 15x speedup on every node isn't just a cool feature, it is game-changing and lets you tackle much bigger problems on a small budget.  
  • An in-app demo that takes control of the cursor when you play it would be a really great way to get people to the next level of using this. For example, showing how to construct a complex project, the whole flow of the app: import data into a regression problem template; open a new algorithm and load up the template; create an experiment; import the algorithm into it; change the settings to get it to work. I had to figure all of this out myself over the course of a couple of days, and it was only because I was so determined to get it to work that I even stayed with it-- most users would most likely give up far sooner. This doesn't need to be fancy-- just a sort of macro recording of where to click would be great.
As described in the beginning, we plan to provide video tutorials that explain exactly some of the functionality described above (algorithm configuration, experiment planning, experiment analysis, etc.). So I am pretty sure that we will not have an in-app demo.

Yes, as I think more about it, having simple screen recordings describing what is happening is easy enough. I would be happy to prepare some, as I wanted to show some friends how the system works anyway. 
  • It would be pretty cool to get the latest algorithms such as the Cuckoo and Spiral search into the app. I wonder if the authors of those even know about this system-- they might agree to write the implementations themselves if they know C#.
I must admit I haven't heard of Cuckoo or Spiral search and had to google the terms. There are plenty of cool algorithms described in research papers and implemented in other frameworks, but it always takes some time to implement them. Currently we are working to provide Scatter Search within the next few months and learning classifier systems in the more distant future.

Well I'm glad I could expose you to the wonders of the novel and creatively named algorithms. It would be great if there were some simplified way for a user to add an algorithm (especially one that is built out of the same kind of basic building blocks as the other algorithms) using some kind of scripting language like VBA without having to re-compile the system. In general, anything that makes it easier for users to create and contribute code will likely lead to a more vibrant system and engaged user base (I know I would try to contribute). 
  • A view that showed how the scatter plot changes over time and over different generations would be really nice and informative.
That would be a really interesting concept, but I would formulate it more generally: store all evolved solutions as a result instead of only the current best one. Then you can not only display the scatter plot over time as a movie, but also see the corresponding solutions and formulas. By the way, I am not sure if you have already discovered the feature, because you have mostly been dealing with symbolic regression, but a movie view is already available in HL. Just start the "Genetic Algorithm - TSP" sample from the start page and activate the population diversity analyzer from the algorithm's tab page. Afterwards, set the store history flag of the population diversity analyzer to true and have a look at the TSP population diversity analyzer results, especially the solution similarity history.

OK, I will check it out. I don't really have any way to apply TSP type problems to the kinds of things I am working on though. One thing that would be extremely useful is, after you have done an experiment with hundreds of runs, if you had some automatic way to group the runs together (ideally in some kind of visual way) based on a specified similarity metric-- like, they used the same input variables; or they used the same kinds of functions; or they had similar quality levels. And then for each run in that grouping, you could have a slider to move through the different views and see a live-updated view of the scatter plot and the line plot on the same screen showing how well the model performed. This would make it much easier to make sense of all the data that is now generated when you let an experiment go for a couple of days. This could also be a good interface into a model ensemble creator, where, as the user was exploring and navigating the experiment results, you could pick certain models with a "thumbs up" icon or something to include them in the ensemble. You might want to have a way of helping the user pick solutions that are well suited to being in the same ensemble in that case though, where the program could warn that many of the models use the same inputs or are otherwise too similar to enhance the model accuracy.
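A rough sketch of the grouping and the "too similar" warning, in plain Python. The run records, the field names `name` and `inputs`, and the 0.8 threshold are all made up for illustration; Jaccard overlap of the input variable sets is just one possible similarity metric.

```python
from collections import defaultdict


def group_by_inputs(runs):
    """Group run names by the exact set of input variables they used."""
    groups = defaultdict(list)
    for run in runs:
        groups[frozenset(run["inputs"])].append(run["name"])
    return dict(groups)


def jaccard(a, b):
    """Overlap of two variable sets: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def too_similar(selected, threshold=0.8):
    """Return the pairs of selected runs whose inputs overlap at least `threshold`."""
    pairs = []
    for i, first in enumerate(selected):
        for second in selected[i + 1:]:
            if jaccard(first["inputs"], second["inputs"]) >= threshold:
                pairs.append((first["name"], second["name"]))
    return pairs


# Hypothetical runs of one experiment.
runs = [
    {"name": "run1", "inputs": ["x1", "x2"]},
    {"name": "run2", "inputs": ["x2", "x1"]},
    {"name": "run3", "inputs": ["x3"]},
]
groups = group_by_inputs(runs)
warnings = too_similar(runs)
```

Here `run1` and `run2` land in the same group and would trigger a warning if both were picked for one ensemble, while `run3` adds a different view of the data.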
  • It would be great to use more complicated atomic functions like the wavelet transform, FFT, cepstrum, Hilbert, etc. There must be an open source math library that you could link up to that would include a library of functions like that (perhaps Boost).
Currently we are using AlgLib (http://www.alglib.net/) as the main math library for HeuristicLab. Additional symbols that can be used in the symbolic expression trees are always worth considering and implementing. We have also been discussing the possibility of providing user-defined symbols, which could be defined within HL.

Nice. I see from reviewing the AlgLib site that a cross-correlation function is built into the library. Just this function alone could be extremely powerful combined with the conditional logic-- it would let you test a whole set of lag periods at once. And for other special functions like the Gamma function or the Bessel function, what is the harm of including them? If they aren't useful to most problems, then they simply will not be selected for in the algorithms, right? The beauty of the symbolic regression system is also that the user doesn't need to know how the functions work (at least at first), only that they turned out to be useful in solving their particular problem. Obviously it is good for the user to learn afterwards how the specific functions work as this will yield additional insight into the problem. 
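To show what "testing a whole set of lag periods at once" could look like, here is a small sketch in plain Python. It is a stand-in written for this example, not AlgLib's cross-correlation routine, and the series are made up: `y` is simply `x` delayed by two steps.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def lagged_correlations(x, y, max_lag):
    """Correlate y[t] with x[t - lag] for every lag from 0 to max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        shifted_x = x[:len(x) - lag]  # drop the last `lag` values of x
        aligned_y = y[lag:]           # drop the first `lag` values of y
        result[lag] = pearson_r(shifted_x, aligned_y)
    return result


x = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 10.0]
y = [0.0, 0.0] + x[:-2]  # y follows x with a delay of two steps

correlations = lagged_correlations(x, y, max_lag=4)
best_lag = max(correlations, key=correlations.get)
```

The lag with the highest correlation recovers the two-step delay. A symbol in the regression grammar could expose the same idea, leaving it to the search to decide whether the lagged input is useful.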

  • The concept of Bagging (see Breiman and Friedman) would be very powerful within this sort of framework for combining successful models (I'm thinking particularly of the symbolic regressions) and visually showing the user the gain from using the Bagged version of the model (this would seem to also give you confidence intervals and more robust variable relevance).
Sure, Bagging is quite a powerful and inspiring concept. That is also the reason we are currently working on combining various models in "ensembles" to improve the prediction quality. Although there is currently no way to automatically learn models on different parts of the data and with different input variables, we are working on a general way to combine multiple solutions / models in an ensemble for the mentioned reasons (confidence intervals, robustness, ...). This is currently implemented in a branch and will hopefully make it into one of the next releases.

That's quite interesting and I will try to look at that part of the code if it is available. The easiest way to implement this probably is to just calculate all of the models and then average the results together, and then you can improve on this by randomly changing the weightings you assign to each model until the overall ensemble quality measure increases significantly on both the training and validation data (probably not good to do this on the test data though!).  
Well, that is a long list, but I think this system has almost unlimited potential if it can address most of those items. Keep up the great work, and I look forward to using future versions of this system.

Finally, here are a few links that provide a little more information and hints about the usage of HL:
HL Blog:                    http://dev.heuristiclab.com/trac/hl/core/blog
GP features:             http://dev.heuristiclab.com/trac/hl/core/wiki/UsersFeaturesGeneticProgramming
Tutorials + Slides:    http://dev.heuristiclab.com/trac/hl/core/wiki/UsersTutorials

Thank you again for your positive feedback and the useful comments regarding the usability of HL. Please be patient if we cannot provide everything within the next X weeks, but we are steadily working on improving the software as well as the available documentation, wiki pages, blog entries and tutorials.

Again, thanks so much for taking the time to write such a detailed reply. I hope to make my own small contributions to your efforts, even if it doesn't go much beyond constructive feedback and bug reports. It is an exciting and simply cool part of computer science and I think one that will have a lot more respect and attention in the coming years (the main thing that seems to hold it back in academic circles is the rather serious lack of substantive convergence results or understanding of the asymptotic behavior of these systems). For now, it has a large dose of experimentation and exploration to it, but that at least makes it exciting to use. 

Best regards,
Michael

Michael Kommenda

Mar 9, 2012, 4:46:36 AM3/9/12
to heuris...@googlegroups.com
Hi Jeffrey!

Sorry for the delayed reply to your last suggestions and explanations. I will not include the whole conversation again as this is quite a long one and not really necessary to understand the topics discussed.

Regarding the choice of the first available variable as the target variable: the reason behind this is that it is the default behavior of Weka [0].

OK that's good. The other thing I forgot to mention is that, when you have 100 or more variables, the graph is also quite useless because you can't tell which curve represents which variable. A tool-tip like pop-up that shows what variable the mouse is currently over would be a nice way to address this. 

A tool-tip for the variable frequency chart is already implemented, but it is rather hard to choose the correct line. Therefore, it is possible to filter the displayed variables by clicking on the variable name in the header, but the question remains if a chart with more than 100 variables is really interpretable and informative.



Sorry for not being clear. What I mean is the various parts of a system. Suppose I want to use multi-objective symbolic regression as my problem type. Well, I can't use that with most of the algorithms-- the only algorithm that I was able to get to work with that is NSGA-II. But that took me a long time to figure out through trial and error. So what I mean is, if the user could select an icon of a multi-objective symbolic regression problem from a palette of problem types, drag it onto a sort of "canvas", and then select, say, the Particle Swarm algorithm icon from another palette of algorithm types, the cursor could change or turn red or something to show that these two components are not compatible. And if the user then picked NSGA-II, the cursor would be green to show you could plug them together and it would work. And the point is that the "OperatorGraph" visualization, instead of just being a picture that represents the system, should instead be an interactive system (i.e., the "canvas" that I mentioned before) that the user works in to actually build the model. If you have ever used the Visio graphing software, what I am thinking of is similar to the way various components are connected together. It is just a very nice way to make complex things very simple and straightforward so that people who have important problems to solve, but not the time or computer skills to really figure out a complex piece of code like this, can do really powerful things on their own.

We have already discussed this issue and agree that HL needs a more intuitive way of combining algorithms and problems to create valid configurations and avoid invalid ones. The idea regarding a canvas as a starting point for connecting is quite appealing. Another possibility would be a "Solve with" context menu / button / ... in the problem that only displays algorithms that can handle the concrete problem.
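The "Solve with" idea boils down to a compatibility check, which could be sketched as follows in plain Python. Everything here is hypothetical and not taken from HL's code: the `Problem` and `Algorithm` records, the operator kind names, and the capability sets, which are only loosely modeled on what was said in this thread (no move operators for symbolic regression; NSGA-II as the algorithm that worked for the multi-objective problem).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    name: str
    objectives: int
    operators: frozenset  # operator kinds the problem's encoding offers


@dataclass(frozen=True)
class Algorithm:
    name: str
    requires: frozenset   # operator kinds the algorithm needs
    multi_objective: bool


def can_solve(algorithm, problem):
    """An algorithm fits if its operators exist and the objective count matches."""
    if not algorithm.requires <= problem.operators:
        return False
    return algorithm.multi_objective == (problem.objectives > 1)


def solve_with(problem, algorithms):
    """Names that a "Solve with" menu would show for this problem."""
    return [a.name for a in algorithms if can_solve(a, problem)]


# Illustrative capability declarations (assumptions, see the note above).
TREE_OPERATORS = frozenset({"crossover", "mutation"})
regression = Problem("Symbolic Regression", 1, TREE_OPERATORS)
mo_regression = Problem("Multi-Objective Symbolic Regression", 2, TREE_OPERATORS)

ALGORITHMS = [
    Algorithm("Genetic Algorithm", frozenset({"crossover", "mutation"}), False),
    Algorithm("NSGA-II", frozenset({"crossover", "mutation"}), True),
    Algorithm("Local Search", frozenset({"move"}), False),
    Algorithm("Tabu Search", frozenset({"move"}), False),
]
```

With these declarations, `solve_with(mo_regression, ALGORITHMS)` lists only NSGA-II. The same `can_solve` check could drive the red/green cursor on a canvas.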


Great, I will download the prototype and try to figure it out. It might be very useful to make a light wrapper to allow HeuristicLab to communicate with Matlab, since there are some excellent open source black-box parameter optimization toolboxes for that, particularly NOMAD (a Mesh-Adaptive Direct Search algorithm).

I must admit I haven't tried the MetaOptimization branch myself, but other HL developers have more experience and can provide support.


In that case, perhaps you can have a concept of a "good" or "bad" quality and you could show it with the color (green being good, black is avg, red is poor). The point is that, if the user is doing multiple runs with different measures, you need to click around to see what you are looking at. It's nice to know at a glance how it is going. Not the most important feature though!

Have you already taken a look at the possibilities for result analysis? These are located in the "Runs" tab and provide visual and textual analysis and comparison of multiple runs. A run is simply a collection of all algorithm and problem parameter settings and the produced results. The most common way to compare different runs is by using the bubble chart and the tabular view. The tabular view is simply a data grid that allows sorting and column filtering. In the bubble chart (screenshots at [1]) the user can specify two arbitrary results as axes, and every run is shown as a bubble in this two-dimensional space. By choosing, for example, the "Best training solution. MAE" as y-axis value and "Index" as x-axis value, you can immediately see the best solution on the training partition. The bubble chart also supports automatic coloring (small icons next to the axis combo-box) to make another dimension available, and it opens runs of special interest when you double-click on the bubble.



Hah, I pity myself for that too! But that is how I really learned how the system works. I don't really understand why all of the methods can't be made to work with symbolic regression. The tree grammar and tree mutation rules seem to be separately implemented from the algorithm, and the quality function can be whatever you want. I guess you need a way for the algorithm to move through the problem space, but I would think that this could be done in a generic kind of way using an interface. It would certainly be desirable to have automatic compatibility of different problems to all of the algorithms. By that I mean, if you spend the time to hook up your new problem (say it is some augmented symbolic regression system that brings in outside functions) to the Offspring Selection algorithm, you shouldn't then have to separately integrate it to work with Simulated Annealing and Local Analysis-- they should be written with a common interface so that you can integrate through the interface instead. Hope this makes sense. 

Yes, it is true that the tree grammar, mutation, etc. are all implemented independently of the problem, but the issue is that different optimization algorithms need different ways of navigating through the search space. Most evolutionary population-based heuristics (e.g. genetic algorithms, evolution strategies) use crossover and mutation, whereas trajectory-based algorithms (e.g. local search, tabu search) use moves. There are already generic interfaces defined to represent moves, but these must be implemented in the symbolic expression tree encoding, and up to now no one has found the time to do so.
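As an illustration of why a trajectory-based algorithm needs moves from the encoding, here is a minimal sketch in plain Python. The `Move` and `MoveGenerator` classes are invented for this example and are not HL's interfaces, and the toy encoding (a tuple of integer coefficients) stands in for the much harder case of symbolic expression trees.

```python
from abc import ABC, abstractmethod


class Move(ABC):
    @abstractmethod
    def apply(self, solution):
        """Return a new solution; the original is left untouched."""


class MoveGenerator(ABC):
    @abstractmethod
    def moves(self, solution):
        """Yield every move available from the given solution."""


def local_search(start, generator, quality, max_steps=1000):
    """Best-improvement local search (maximizing) that only sees the interfaces."""
    current, current_quality = start, quality(start)
    for _ in range(max_steps):
        candidates = [move.apply(current) for move in generator.moves(current)]
        if not candidates:
            break
        best = max(candidates, key=quality)
        best_quality = quality(best)
        if best_quality <= current_quality:
            break  # local optimum reached
        current, current_quality = best, best_quality
    return current


# Encoding-specific part: this is what each encoding would have to supply.
class ShiftCoefficient(Move):
    def __init__(self, index, delta):
        self.index, self.delta = index, delta

    def apply(self, solution):
        changed = list(solution)
        changed[self.index] += self.delta
        return tuple(changed)


class CoefficientMoves(MoveGenerator):
    def moves(self, solution):
        for i in range(len(solution)):
            yield ShiftCoefficient(i, +1)
            yield ShiftCoefficient(i, -1)


def quality(solution):
    a, b = solution
    return -((a - 3) ** 2 + (b + 2) ** 2)  # best possible value is 0 at (3, -2)


result = local_search((0, 0), CoefficientMoves(), quality)
```

`local_search` never looks inside the solution, so it works for any encoding that supplies a move generator. Writing that generator is the encoding-specific work, and for expression trees it is the part that is still missing according to the reply above.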

That sounds great. I wonder if I might be able to be of assistance with adding the 3D card functionality. I know from recent Matlab development experience that you can get much of the benefit of this hardware without necessarily changing that much in your program. If there are bottlenecks you can identify that involve large matrices (or the problem can be represented in this way, which is often the case), you can cast these as a special type and use one of various frameworks to do all the primitive operations, and then cast back to regular arrays. Also there is no reason why you can't combine the two approaches-- in 5 years every computer and phone will have a fast GPU. The ability to get a 15x speedup on every node isn't just a cool feature, it is game-changing and lets you tackle much bigger problems on a small budget.  


Well I'm glad I could expose you to the wonders of the novel and creatively named algorithms. It would be great if there were some simplified way for a user to add an algorithm (especially one that is built out of the same kind of basic building blocks as the other algorithms) using some kind of scripting language like VBA without having to re-compile the system.
We already have a way to create new algorithms or modify existing ones without re-compiling the whole system, and everything works within HL. The feature is called operator graph modeling and allows the definition of algorithms as a sequence of operators that must be executed. One drawback is that one needs in-depth knowledge about the internals of HL to model an algorithm correctly. It is much easier to modify existing algorithms by creating a user-defined algorithm (menu -> edit -> convert into user-defined algorithm) and changing its operator graph (operator graphs can be nested, and double-clicking on the visual representation opens the contained operator graph). To ease such modifications and grasp what happens internally, it is also possible to use the "Debug-Engine", which allows stepping through all the operators that are executed and seeing all the intermediate results. Last but not least, HL supports so-called "ProgrammableOperators" that allow programming C# code within HL. Currently these possibilities are mostly used by HL developers because of the lack of documentation, but they may be useful for other users in the future.


In general, anything that makes it easier for users to create and contribute code will likely lead to a more vibrant system and engaged user base (I know I would try to contribute).
IMHO this "anything" is documentation. Although we try to follow coding guidelines and naming conventions, and the code should ideally be self-documenting, HL has become a rather large framework, and without any guidance it is quite hard to find the correct entry point / code location where the desired functionality should be implemented.



OK, I will check it out. I don't really have any way to apply TSP type problems to the kinds of things I am working on though.
Sure, it was only meant to show you some further already available features of HL.



Nice. I see from reviewing the AlgLib site that a cross-correlation function is built into the library. Just this function alone could be extremely powerful combined with the conditional logic-- it would let you test a whole set of lag periods at once. And for other special functions like the Gamma function or the Bessel function, what is the harm of including them? If they aren't useful to most problems, then they simply will not be selected for in the algorithms, right? The beauty of the symbolic regression system is also that the user doesn't need to know how the functions work (at least at first), only that they turned out to be useful in solving their particular problem. Obviously it is good for the user to learn afterwards how the specific functions work as this will yield additional insight into the problem.
It almost never harms to include new mathematical functions in the symbolic regression grammar. The only thing that hampers us is time constraints, as we are working on HL as a side project.



That's quite interesting and I will try to look at that part of the code if it is available. The easiest way to implement this probably is to just calculate all of the models and then average the results together, and then you can improve on this by randomly changing the weightings you assign to each model until the overall ensemble quality measure increases significantly on both the training and validation data (probably not good to do this on the test data though!).

Regression ensembles that average the output of the contained models are already implemented (new -> regression ensemble solution). Afterwards you have to drag the "problem data" onto the list box containing the quality metrics. The next step is to add regression solutions that have been trained on the same problem data, by selecting "ensemble solutions" and dragging them onto the list box that appears. I must warn you that this feature is rather experimental and is currently being rewritten and extended.
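For reference, the averaging and the random re-weighting suggested in the quoted paragraph could look roughly like this in plain Python. This is a sketch of the idea only, not the HL ensemble implementation. The function names, the Gaussian perturbation of the weights and the toy data are all assumptions made for the example, and the acceptance rule only requires that both errors improve, without testing for significance.

```python
import random


def mse(y, p):
    """Mean squared error between targets y and predictions p."""
    return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)


def ensemble_predict(model_outputs, weights):
    """Weighted average of the model outputs; equal weights give the plain mean."""
    total = sum(weights)
    n_rows = len(model_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, model_outputs)) / total
            for i in range(n_rows)]


def tune_weights(train_outputs, y_train, valid_outputs, y_valid, steps=200, seed=0):
    """Random search: keep a perturbed weighting only if train AND validation improve."""
    rng = random.Random(seed)
    weights = [1.0] * len(train_outputs)
    best_train = mse(y_train, ensemble_predict(train_outputs, weights))
    best_valid = mse(y_valid, ensemble_predict(valid_outputs, weights))
    for _ in range(steps):
        candidate = [max(1e-6, w + rng.gauss(0.0, 0.1)) for w in weights]
        train_error = mse(y_train, ensemble_predict(train_outputs, candidate))
        valid_error = mse(y_valid, ensemble_predict(valid_outputs, candidate))
        if train_error < best_train and valid_error < best_valid:
            weights, best_train, best_valid = candidate, train_error, valid_error
    return weights, best_train, best_valid


# Toy data: model A is biased by +0.1, model B by -0.5. The test partition is never touched.
y_train, y_valid = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0]
train_outputs = [[1.1, 2.1, 3.1, 4.1], [0.5, 1.5, 2.5, 3.5]]
valid_outputs = [[5.1, 6.1], [4.5, 5.5]]

start_train = mse(y_train, ensemble_predict(train_outputs, [1.0, 1.0]))
start_valid = mse(y_valid, ensemble_predict(valid_outputs, [1.0, 1.0]))
weights, tuned_train, tuned_valid = tune_weights(
    train_outputs, y_train, valid_outputs, y_valid)
```

Because a candidate weighting is only accepted when it lowers both errors, the tuned ensemble can never be worse than the plain average on either partition.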


Again, thanks so much for taking the time to write such a detailed reply. I hope to make my own small contributions to your efforts, even if it doesn't go much beyond constructive feedback and bug reports. It is an exciting and simply cool part of computer science and I think one that will have a lot more respect and attention in the coming years (the main thing that seems to hold it back in academic circles is the rather serious lack of substantive convergence results or understanding of the asymptotic behavior of these systems). For now, it has a large dose of experimentation and exploration to it, but that at least makes it exciting to use. 


I hope this mail answers your questions and provides more hints to work efficiently with HL. It is also planned to describe some of the mentioned features in our wiki, but this will take some time.

Kind Regards,
Michael

[0] http://www.cs.waikato.ac.nz/ml/weka/
[1] https://dev.heuristiclab.com/trac/hl/core/wiki/UsersFeatures

Youness El Hamzaoui

Jul 31, 2018, 4:34:47 AM7/31/18
to HeuristicLab
You could use Eureqa software.