- You should be able to save a setting for where you will place your dependent (Y) variable-- first column, second column (in case there is a date column for a time series), and last column. And that should be applied universally across the program.
- You can assume that no user would want to include a variable as both an independent and dependent variable. By looking at the column header text, you can automatically deselect the variable that has been selected as the dependent variable from being in the set of independent variables.
- The system should have different default behaviors based on the number of total columns of data. You only really need 2: small and large problems. If you have like 600 columns (which isn't even that much), you shouldn't even allow the user to select the variable frequency view, because it will just hang the entire program for sometimes minutes at a time-- again, no user would ever want to disable the program for so long.
- Here is a more serious problem: if you want to have such a big complex project with essentially no documentation, then you can at least get most of the way by having a "learning mode" where you will get a huge number of extra tool-tips when you hover over anything at all in the program. But instead of having most of the tips be the same ("a result which has a name and holds a value"), each one should be very specific about what the parameter does (preferably examples too).
- You need to have a way of showing the user which components are compatible with each other. You could use a color based scheme, or make a drag and drop scheme with a changing cursor, so when two components are supposed to be connected, the cursor will indicate that, and if they can't, you will get a little no smoking sign symbol.
- The "operator view" really should be the interface where you apply operations in my view. It would make the system so much easier to understand. Then you could connect components visually like in Simulink or Microsoft Visio (indeed I'm sure you can use some of the same WPF controls).
- The engine type should default to parallel (or this should be a setting the user can store as a preference).
- The user should only have to define the problem once, and to the greatest extent, the file should be made to be as compatible as possible with different algorithms, so instead of showing the user an error and saying the file can't be opened, at least you can still use the data inside and the variable selections that you have already specified.
- The user should be able to import two files-- a "predicted values" file and an "independent variables" file, and it should be easy to try to automatically predict each of the "predicted values" columns without manually specifying the different combinations separately for each one (the current way it must be done). An even more useful thing would be an automatic mode where the system attempted to solve for each predicted value in parallel, and then every few minutes culled away the columns where little progress was being made (e.g., the best quality score was static for the past K generations), so the system can focus the computational resources on the variables that the system has the best chance of solving in a robust way.
- The program has trouble importing csv and txt files that work fine in other programs like Excel and JMP, so I found myself constantly re-saving csv files in Excel that I created in Matlab. Should be easy to find an open source component that never has a problem.
- Instead of just saying CurrentAverageQuality, it would be nice to have it say in parentheses (or just as a tool-tip) what the quality measure is (Rsquared, MSE).
- That leads to the next point-- it is the easiest thing in the world to add additional quality measures. You should definitely include MAE and Correlation Coefficient (not just Rsquared), since for many applications these are more relevant.
- This is the single biggest and most important recommendation: you should have many, many more templates of the software already configured in all the possible relationships. I spent many hours trying to get symbolic regression to work with the particle swarm, and it would never allow me to start the simulation. It would only take a few minutes to create each valid combination and load it up with non-trivial data showing how it is working, so all the user needs to do is swap out the data.
- You guys would take over this space if you could get all of this to work on CUDA. The hive stuff is cool and ultimately even more scalable, but you can buy Nvidia cards for a couple hundred bucks that will let you calculate many of these things ~15x faster. Perhaps you can use an open source framework for this such as Thrust.
- An in-app demo that takes control of the cursor when you play it would be a really great way to get people to the next level of using this. For example, showing how to construct a complex project, the whole flow of the app: import data into a regression problem template; open a new algorithm and load up the template; create an experiment; import the algorithm into it; change the settings to get it to work. I had to figure all of this out myself over the course of a couple days, and it was only because I was so determined to get it to work that I even stayed with it-- most users would most likely give up far sooner. This doesn't need to be fancy-- just a sort of macro recording of where to click would be great.
- It would be pretty cool to get the latest algorithms such as Cuckoo and Spiral search into the app. I wonder if the authors of those even know about this system-- they might agree to write the implementations themselves if they know C#.
- A view that showed how the scatter plot changes over time and over different generations would be really nice and informative.
- It would be great to use more complicated atomic functions like the wavelet transform, FFT, cepstrum, Hilbert, etc. There must be an open source math library that you could link up to that would include a library of functions like that (perhaps Boost).
- The concept of Bagging (see Breiman and Friedman) would be very powerful within this sort of framework for combining successful models (I'm thinking particularly of the symbolic regressions) and visually showing the user the gain from using the Bagged version of the model (this would seem to also give you confidence intervals and more robust variable relevance).
Well, that is a long list, but I think this system has almost unlimited potential if it can address most of those items. Keep up the great work, and I look forward to using future versions of this system.
Hi Jeff,
First of all, thank you for your extensive feedback on HeuristicLab. We are aware of the lack of documentation in HeuristicLab and are trying to improve it steadily: the mailing list has only been available since last August, we are currently planning to create video tutorials that show the basic handling of HeuristicLab, tutorials on HeuristicLab are held at scientific conferences, etc.
Now to your concrete feedback. I am assuming most of your comments concern the symbolic regression / classification functionality in HL (please correct me if I am wrong).
Currently HL assumes that the first double column present in the dataset contains the dependent variable. An exception is the classification problem, which only allows a dependent variable with a maximum of 100 distinct values.
- You should be able to save a setting for where you will place your dependent (Y) variable-- first column, second column (in case there is a date column for a time series), and last column. And that should be applied universally across the program.
That's a good suggestion and true except for the specialized case of autoregressive time series modeling. In the future there will be a specialized symbol in the GP grammar for this use case, so we can automatically disable the dependent variable.
- You can assume that no user would want to include a variable as both an independent and dependent variable. By looking at the column header text, you can automatically deselect the variable that has been selected as the dependent variable from being in the set of independent variables.
This issue was not known until now and is recorded in ticket #1785 (http://dev.heuristiclab.com/trac/hl/core/ticket/1785).
- The system should have different default behaviors based on the number of total columns of data. You only really need 2: small and large problems. If you have like 600 columns (which isn't even that much), you shouldn't even allow the user to select the variable frequency view, because it will just hang the entire program for sometimes minutes at a time-- again, no user would ever want to disable the program for so long.
Again I must admit that you are right. In a perfect HL all tooltips and descriptions (information icon) should be filled with descriptive text. As this is not the case, I created ticket #1786 (http://dev.heuristiclab.com/trac/hl/core/ticket/1786), so this should and will be improved in the future.
- Here is a more serious problem: if you want to have such a big complex project with essentially no documentation, then you can at least get most of the way by having a "learning mode" where you will get a huge number of extra tool-tips when you hover over anything at all in the program. But instead of having most of the tips be the same ("a result which has a name and holds a value"), each one should be very specific about what the parameter does (preferably examples too).
Here I am not really sure what you are referring to. Do you mean the drag & drop over different view components, or in the OperatorGraph visualization? Regarding the drag and drop between different views, this is already indicated by a changing cursor. If you are thinking about features in the OperatorGraph visualization, could you please clarify your suggestions?
- You need to have a way of showing the user which components are compatible with each other. You could use a color based scheme, or make a drag and drop scheme with a changing cursor, so when two components are supposed to be connected, the cursor will indicate that, and if they can't, you will get a little no smoking sign symbol.
What exactly do you mean by "operator view"? The operators side panel containing all available operators that should be used for algorithm modeling or the OperatorGraphVisualization (Operator graph tab in the algorithm)?
- The "operator view" really should be the interface where you apply operations in my view. It would make the system so much easier to understand. Then you could connect components visually like in Simulink or Microsoft Visio (indeed I'm sure you can use some of the same WPF controls).
Good point! See ticket #1787 (http://dev.heuristiclab.com/trac/hl/core/ticket/1787).
- The engine type should default to parallel (or this should be a setting the user can store as a preference).
An error message should only occur if regression / classification problems are opened in other DataAnalysis algorithms that do not support the saved problem. That said, a more detailed error message could be provided for inexperienced users.
- The user should only have to define the problem once, and to the greatest extent, the file should be made to be as compatible as possible with different algorithms, so instead of showing the user an error and saying the file can't be opened, at least you can still use the data inside and the variable selections that you have already specified.
This is a feature that has been on our ToDo list for a long time. There is a prototype implementation already available that allows the specification of multiple "predicted values". It also allows full parameter range tests (e.g. population size in (100, 500, 1000, 5000), mutation rates between [0.05, 0.50] with a step width of 0.05) and also meta optimization runs (optimizing the parameters of a heuristic by another heuristic). This implementation must still be completely reviewed and partly rewritten, and it does not contain an automatic stopping criterion, but it should fulfill your request. The current prototype can be downloaded from our build server at [4]
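To make the size of such a full parameter range test concrete, here is a minimal Python sketch that enumerates the example grid quoted above. The parameter names (`population_size`, `mutation_rate`) are hypothetical and this is only an illustration of the idea, not the prototype's actual implementation.

```python
from itertools import product

# Values taken from the example in the text; the names are hypothetical.
population_sizes = [100, 500, 1000, 5000]
# Mutation rates 0.05, 0.10, ..., 0.50, built from integers to avoid float drift.
mutation_rates = [round(0.05 * k, 2) for k in range(1, 11)]


def parameter_grid(pop_sizes, rates):
    """Return every (population size, mutation rate) combination as a dict."""
    return [
        {"population_size": p, "mutation_rate": m}
        for p, m in product(pop_sizes, rates)
    ]


grid = parameter_grid(population_sizes, mutation_rates)
print(len(grid), "configurations to run")
```

With 4 population sizes and 10 mutation rates the grid already contains 40 runs per algorithm and problem, which is why such tests are usually combined with parallel or distributed execution.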
- The user should be able to import two files-- a "predicted values" file and an "independent variables" file, and it should be easy to try to automatically predict each of the "predicted values" columns without manually specifying the different combinations separately for each one (the current way it must be done). An even more useful thing would be an automatic mode where the system attempted to solve for each predicted value in parallel, and then every few minutes culled away the columns where little progress was being made (e.g., the best quality score was static for the past K generations), so the system can focus the computational resources on the variables that the system has the best chance of solving in a robust way.
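The culling rule suggested here (drop a target column when its best quality score has been static for the past K generations) could be sketched roughly as follows in Python. The function names, the tolerance, and the assumption that higher quality is better (as with Rsquared) are all illustrative choices, not part of HeuristicLab.

```python
def is_stagnant(best_quality_history, k, tolerance=1e-9):
    """True if the best quality has not improved over the last k generations.

    Assumes higher quality is better (e.g. Rsquared).
    """
    if len(best_quality_history) <= k:
        return False  # not enough history to judge yet
    reference = best_quality_history[-(k + 1)]
    recent_best = max(best_quality_history[-k:])
    return recent_best - reference <= tolerance


def cull_stagnant(histories, k):
    """Keep only the target columns whose runs are still making progress."""
    return {name: h for name, h in histories.items() if not is_stagnant(h, k)}


# Hypothetical per-target histories of the best quality per generation.
histories = {
    "y1": [0.10, 0.30, 0.45, 0.52, 0.60],  # still improving
    "y2": [0.20, 0.21, 0.21, 0.21, 0.21],  # flat for the last 3 generations
}
active = cull_stagnant(histories, k=3)
print(sorted(active))
```

Running this check every few minutes and reallocating the freed workers to the remaining targets would give the "automatic mode" described in the suggestion.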
It is known that some csv files cannot be imported correctly. We want to steadily improve our importer, and test files that fail to import are always a good way to do that. Could you please provide some examples that do not work?
- The program has trouble importing csv and txt files that work fine in other programs like Excel and JMP, so I found myself constantly re-saving csv files in Excel that I created in Matlab. Should be easy to find an open source component that never has a problem.
- Instead of just saying CurrentAverageQuality, it would be nice to have it say in parentheses (or just as a tool-tip) what the quality measure is (Rsquared, MSE).
This is not that easy to implement, because there is only a loose coupling between an algorithm and its problem. IMHO the quality measure in use can easily be seen by just looking at the selected evaluator.
This is really an easy task, as we already have calculator classes that compute the MAE and the Correlation Coefficient. However, I suspect that the optimization results differ significantly as the MAE is related to the MSE. Additionally, things can get worse when using Pearson's r instead of R², because a low fitness value is assigned to models with an inverse correlation (r < 0) between their predicted values and the target values (although this could easily be corrected by a linear transformation). See ticket #1788 (http://dev.heuristiclab.com/trac/hl/core/ticket/1788).
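The point about Pearson's r can be illustrated with a small, self-contained Python sketch. It assumes that R² here means the squared Pearson coefficient; the helper functions are written from scratch for illustration and are not HeuristicLab's calculator classes.

```python
import math


def mae(y, y_hat):
    """Mean absolute error between targets and predictions."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def pearson_r(y, y_hat):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y)
    mean_y, mean_h = sum(y) / n, sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_h) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_h = sum((b - mean_h) ** 2 for b in y_hat)
    return cov / math.sqrt(var_y * var_h)


target = [1.0, 2.0, 3.0, 4.0]
inverted = [-v for v in target]  # model output with the sign flipped

r = pearson_r(target, inverted)  # strongly negative: poor fitness if maximised
r_squared = r ** 2               # high: the model is one linear scaling away
error = mae(target, inverted)    # large: MAE also penalises the raw output

print(r, r_squared, error)
```

A model whose output is the exact negative of the target has the worst possible r, the best possible squared r, and a large MAE, so the three measures would steer the search quite differently unless the output is linearly rescaled first.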
- That leads to the next point-- it is the easiest thing in the world to add additional quality measures. You should definitely include MAE and Correlation Coefficient (not just Rsquared), since for many applications these are more relevant.
I pity you for spending that many hours configuring PSO with symbolic regression. The reason why it is not working is that we have not defined "Move operators" for symbolic regression. This is already recorded in our trac system, because we have been thinking about symbolic regression by tabu search instead of GP for some time (ticket #1476 http://dev.heuristiclab.com/trac/hl/core/ticket/1476). The suggestion to provide more valid combinations of algorithms and problems is also a good point.
- This is the single biggest and most important recommendation: you should have many, many more templates of the software already configured in all the possible relationships. I spent many hours trying to get symbolic regression to work with the particle swarm, and it would never allow me to start the simulation. It would only take a few minutes to create each valid combination and load it up with non-trivial data showing how it is working, so all the user needs to do is swap out the data.
Sure, that is one way to speed up calculation, and lots of researchers in the community have prototypes and running programs that support computation on graphics cards. However, that would also mean reimplementing large parts of HL for this specialized hardware. We follow another approach and provide distributed computing via Hive and parallelization by using multiple cores. Additionally, a research project started last autumn that will provide cloud computation facilities for HL. One of the first test cases for this project will be the automatic analysis of datasets by easily configured regression runs, similar to the feature you mentioned before.
- You guys would take over this space if you could get all of this to work on CUDA. The hive stuff is cool and ultimately even more scalable, but you can buy Nvidia cards for a couple hundred bucks that will let you calculate many of these things ~15x faster. Perhaps you can use an open source framework for this such as Thrust.
As described in the beginning, we plan to provide video tutorials that explain some of the functionality described above (algorithm configuration, experiment planning, experiment analysis, etc.). So I am pretty sure that we will not have an in-app demo.
- An in-app demo that takes control of the cursor when you play it would be a really great way to get people to the next level of using this. For example, showing how to construct a complex project, the whole flow of the app: import data into a regression problem template; open a new algorithm and load up the template; create an experiment; import the algorithm into it; change the settings to get it to work. I had to figure all of this out myself over the course of a couple days, and it was only because I was so determined to get it to work that I even stayed with it-- most users would most likely give up far sooner. This doesn't need to be fancy-- just a sort of macro recording of where to click would be great.
I must admit I haven't heard of Cuckoo or Spiral search and had to google the terms. There are plenty of cool algorithms described in research papers and implemented in other frameworks, but it always takes some time to implement them. Currently we are working to provide Scatter Search within the next months and learning classifier systems in the more distant future.
- It would be pretty cool to get the latest algorithms such as Cuckoo and Spiral search into the app. I wonder if the authors of those even know about this system-- they might agree to write the implementations themselves if they know C#.
That would be a really interesting concept, but I would formulate it more generally: store all evolved solutions as a result instead of only the current best one. Then you can not only display the scatter plot over time as a movie, but also see the corresponding solutions and formulas. By the way, I am not sure if you have already discovered this feature, because you have mostly been dealing with symbolic regression, but a movie view is already available in HL. Just start the "Genetic Algorithm - TSP" sample from the start page and activate the population diversity analyzer from the algorithms tab page. Afterwards, set the store history flag of the population diversity analyzer to true and have a look at the TSP population diversity analyzer results, especially the solution similarity history.
- A view that showed how the scatter plot changes over time and over different generations would be really nice and informative.
Currently we are using AlgLib (http://www.alglib.net/) as the main math library for HeuristicLab. Additional symbols that can be used in the symbolic expression trees are always worth thinking about and implementing. We have also been discussing the possibility of providing user-defined symbols, which can be defined within HL.
- It would be great to use more complicated atomic functions like the wavelet transform, FFT, cepstrum, Hilbert, etc. There must be an open source math library that you could link up to that would include a library of functions like that (perhaps Boost).
Sure, Bagging is quite a powerful and inspiring concept. That's also the reason we are currently working on combining various models in "ensembles" to improve the prediction quality. Although we have no way right now to automatically learn models on different parts of the data and with different input variables, we are working on a general way to combine multiple solutions / models in an ensemble for the mentioned reasons (confidence intervals, robustness, ...). This is currently implemented in a branch and will hopefully make it into one of the next releases.
- The concept of Bagging (see Breiman and Friedman) would be very powerful within this sort of framework for combining successful models (I'm thinking particularly of the symbolic regressions) and visually showing the user the gain from using the Bagged version of the model (this would seem to also give you confidence intervals and more robust variable relevance).
Finally, here are a few links that provide a little more information and hints about the usage of HL:
Well, that is a long list, but I think this system has almost unlimited potential if it can address most of those items. Keep up the great work, and I look forward to using future versions of this system.
HL Blog: http://dev.heuristiclab.com/trac/hl/core/blog
GP features: http://dev.heuristiclab.com/trac/hl/core/wiki/UsersFeaturesGeneticProgramming
Tutorials + Slides: http://dev.heuristiclab.com/trac/hl/core/wiki/UsersTutorials
Thank you again for your positive feedback and the useful comments regarding the usability of HL. Please be patient if we cannot provide everything within the next X weeks, but we are steadily working on improving the software as well as the available documentation, wiki pages, blog entries, and tutorials.
Best regards,
Michael
OK that's good. The other thing I forgot to mention is that, when you have 100 or more variables, the graph is also quite useless because you can't tell which curve represents which variable. A tool-tip like pop-up that shows what variable the mouse is currently over would be a nice way to address this.
Sorry for not being clear. What I mean is the various parts of a system. Suppose I want to use multi-objective symbolic regression as my problem type. Well, I can't use that with most of the algorithms-- the only algorithm that I was able to get to work with that is NSGA-II. But that took me a long time to figure out through trial and error. So what I mean is, if the user could select an icon of a multi-objective symbolic regression problem from a palette of problem types, drag it onto a sort of "canvas", and then select, say, the Particle Swarm algorithm icon from another palette of algorithm types, the cursor could change or turn red or something to show that these two components are not compatible. And if the user then picked NSGA-II, the cursor would be green to show you could plug them together and it would work. And the point is that the "OperatorGraph" visualization, instead of just being a picture that represents the system, should instead be an interactive system (i.e., the "canvas" that I mentioned before) that the user works in to actually build the model. If you have ever used the Visio graphing software, what I am thinking of is similar to the way various components are connected together. It is just a very nice way to make complex things very simple and straightforward so that people who have important problems to solve, but not the time or computer skills to really figure out a complex piece of code like this, can do really powerful things on their own.
Great, I will download the prototype and try to figure it out. It might be very useful to make a light wrapper to allow HeuristicLab to communicate with Matlab, since there are some excellent open source black-box parameter optimization toolboxes for that, particularly NOMAD (a Mesh-Adaptive Direct Search algorithm).
In that case, perhaps you can have a concept of a "good" or "bad" quality and you could show it with the color (green being good, black is avg, red is poor). The point is that, if the user is doing multiple runs with different measures, you need to click around to see what you are looking at. It's nice to know at a glance how it is going. Not the most important feature though!
Hah, I pity myself for that too! But that is how I really learned how the system works. I don't really understand why all of the methods can't be made to work with symbolic regression. The tree grammar and tree mutation rules seem to be separately implemented from the algorithm, and the quality function can be whatever you want. I guess you need a way for the algorithm to move through the problem space, but I would think that this could be done in a generic kind of way using an interface. It would certainly be desirable to have automatic compatibility of different problems to all of the algorithms. By that I mean, if you spend the time to hook up your new problem (say it is some augmented symbolic regression system that brings in outside functions) to the Offspring Selection algorithm, you shouldn't then have to separately integrate it to work with Simulated Annealing and Local Analysis-- they should be written with a common interface so that you can integrate through the interface instead. Hope this makes sense.
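The "common interface" idea can be sketched in a few lines of Python. Everything here is hypothetical (the `Problem` class, its three methods, and the toy hill climber); it is not HeuristicLab's architecture, only an illustration of an algorithm that talks to a problem exclusively through a small interface, including the kind of move (`neighbor`) that a trajectory-based algorithm needs.

```python
import random
from abc import ABC, abstractmethod


class Problem(ABC):
    """Hypothetical common interface; not HeuristicLab's actual API."""

    @abstractmethod
    def random_solution(self, rng): ...

    @abstractmethod
    def neighbor(self, solution, rng): ...

    @abstractmethod
    def evaluate(self, solution): ...  # lower is better in this sketch


def hill_climb(problem, steps, seed=0):
    """Generic local search that only uses the Problem interface."""
    rng = random.Random(seed)
    best = problem.random_solution(rng)
    best_quality = problem.evaluate(best)
    for _ in range(steps):
        candidate = problem.neighbor(best, rng)
        quality = problem.evaluate(candidate)
        if quality < best_quality:  # accept improvements only
            best, best_quality = candidate, quality
    return best, best_quality


class QuadraticProblem(Problem):
    """Toy problem: minimise (x - 3)^2."""

    def random_solution(self, rng):
        return rng.uniform(-10, 10)

    def neighbor(self, solution, rng):
        return solution + rng.gauss(0, 0.5)

    def evaluate(self, solution):
        return (solution - 3.0) ** 2


best, quality = hill_climb(QuadraticProblem(), steps=2000)
print(best, quality)
```

Any other problem that implements the same three methods could be passed to `hill_climb` unchanged. The hard part for symbolic regression is defining a sensible `neighbor` on expression trees, which is where the missing move operators come in.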
That sounds great. I wonder if I might be able to be of assistance with adding the 3D card functionality. I know from recent Matlab development experience that you can get much of the benefit of this hardware without necessarily changing that much in your program. If there are bottlenecks you can identify that involve large matrices (or the problem can be represented in this way, which is often the case), you can cast these as a special type and use one of various frameworks to do all the primitive operations, and then cast back to regular arrays. Also there is no reason why you can't combine the two approaches-- in 5 years every computer and phone will have a fast GPU. The ability to get a 15x speedup on every node isn't just a cool feature, it is game-changing and lets you tackle much bigger problems on a small budget.
Well I'm glad I could expose you to the wonders of the novel and creatively named algorithms. It would be great if there were some simplified way for a user to add an algorithm (especially one that is built out of the same kind of basic building blocks as the other algorithms) using some kind of scripting language like VBA without having to re-compile the system.
We already have a way to create new algorithms or modify existing ones without re-compiling the whole system, and everything works within HL. The feature is called operator graph modeling and allows the definition of algorithms as a sequence of operators that must be executed. One drawback is that one needs in-depth knowledge about the internals of HL to model an algorithm correctly. It is much easier to modify existing algorithms by creating a user-defined algorithm (menu -> edit -> convert into user-defined algorithm) and changing its operator graph (operator graphs can be nested, and double clicking on the visual representation opens the contained operator graph). To ease such modifications and grasp what happens internally, it is also possible to use the "Debug-Engine", which allows stepping through all the operators that are executed and seeing all the intermediate results. Last but not least, HL supports so-called "ProgrammableOperators" that allow programming C# code within HL. Currently these possibilities are mostly used by HL developers because of the lack of documentation, but they may be useful for other users in the future.
In general, anything that makes it easier for users to create and contribute code will likely lead to a more vibrant system and engaged user base (I know I would try to contribute).
OK, I will check it out. I don't really have any way to apply TSP type problems to the kinds of things I am working on though.
Nice. I see from reviewing the AlgLib site that a cross-correlation function is built into the library. Just this function alone could be extremely powerful combined with the conditional logic-- it would let you test a whole set of lag periods at once. And for other special functions like the Gamma function or the Bessel function, what is the harm of including them? If they aren't useful to most problems, then they simply will not be selected for in the algorithms, right? The beauty of the symbolic regression system is also that the user doesn't need to know how the functions work (at least at first), only that they turned out to be useful in solving their particular problem. Obviously it is good for the user to learn afterwards how the specific functions work as this will yield additional insight into the problem.
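As a rough illustration of testing a whole set of lag periods at once, the following Python sketch computes the correlation between an input series and a target for every lag from 0 to `max_lag`. It is written in plain Python for clarity and does not use AlgLib's cross-correlation routine; the function names and the toy data are assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlations(x, y, max_lag):
    """Correlate x[t - lag] with y[t] for every lag in 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        xs = x[: len(x) - lag]  # earlier values of the input
        ys = y[lag:]            # later values of the target
        result[lag] = pearson_r(xs, ys)
    return result


x = [0.0, 1.0, 0.0, -1.0, 0.0, 2.0, 0.0, -2.0, 1.0, 3.0, -1.0, 0.5]
y = [0.0, 0.0] + x[:-2]  # toy target: x delayed by two steps

corr = lagged_correlations(x, y, max_lag=4)
best_lag = max(corr, key=lambda lag: corr[lag])
print(best_lag)
```

In a symbolic regression setting the same scan could be wrapped in a single symbol, so that the search does not have to discover the right lag by mutating an offset one step at a time.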
That's quite interesting and I will try to look at that part of the code if it is available. The easiest way to implement this probably is to just calculate all of the models and then average the results together, and then you can improve on this by randomly changing the weightings you assign to each model until the overall ensemble quality measure increases significantly on both the training and validation data (probably not good to do this on the test data though!).
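The procedure described above (start from a plain average, then randomly perturb the model weights and keep a change only if the ensemble gets better on both training and validation data) could look roughly like this in Python. The data, the function names, the perturbation size, and the use of MSE as the quality measure are illustrative assumptions. The sketch accepts any improvement rather than only "significant" ones.

```python
import random


def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def ensemble_predict(predictions, weights):
    """Weighted average of the per-model prediction lists."""
    total = sum(weights)
    n = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n)
    ]


def tune_weights(train_preds, y_train, val_preds, y_val, iterations=500, seed=0):
    """Random search over weights; accept only if both errors decrease."""
    rng = random.Random(seed)
    weights = [1.0] * len(train_preds)  # plain average as the starting point
    best_train = mse(y_train, ensemble_predict(train_preds, weights))
    best_val = mse(y_val, ensemble_predict(val_preds, weights))
    for _ in range(iterations):
        candidate = [max(1e-6, w + rng.gauss(0, 0.1)) for w in weights]
        train_err = mse(y_train, ensemble_predict(train_preds, candidate))
        val_err = mse(y_val, ensemble_predict(val_preds, candidate))
        if train_err < best_train and val_err < best_val:
            weights, best_train, best_val = candidate, train_err, val_err
    return weights, best_train, best_val


# Toy data: model 0 is accurate, model 1 always predicts 2.0.
y_train, y_val = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0]
train_preds = [[1.1, 1.9, 3.2, 3.9], [2.0, 2.0, 2.0, 2.0]]
val_preds = [[5.1, 5.8], [2.0, 2.0]]

baseline_train = mse(y_train, ensemble_predict(train_preds, [1.0, 1.0]))
baseline_val = mse(y_val, ensemble_predict(val_preds, [1.0, 1.0]))
weights, train_err, val_err = tune_weights(train_preds, y_train, val_preds, y_val)
print(weights, train_err, val_err)
```

The test partition is deliberately left out of the loop, in line with the caution above about not tuning on test data.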
Again, thanks so much for taking the time to write such a detailed reply. I hope to make my own small contributions to your efforts, even if it doesn't go much beyond constructive feedback and bug reports. It is an exciting and simply cool part of computer science and I think one that will have a lot more respect and attention in the coming years (the main thing that seems to hold it back in academic circles is the rather serious lack of substantive convergence results or understanding of the asymptotic behavior of these systems). For now, it has a large dose of experimentation and exploration to it, but that at least makes it exciting to use.