This is a description and proof of concept report of an innovative learning data extraction, interpretation and modeling tool for disease modeling. In this paper the author demonstrates how the package uses data from the metadata of clinical trials hosted at a public site called Clinicaltrials.gov in the United States. This site combines all data available from registered clinical trials so as to improve the transfer of clinical knowledge to the researchers and health professionals who need it, and to enable further data analysis to produce new knowledge. In this paper he uses 13 type 2 diabetes randomized clinical trials in a sample model. He describes how the package “learns” which units are being used, and updates future imports of trial data; how it assigns values to individual simulated patients based on the distributions of measures it imports; how it models several outcomes; and finally how it optimizes which models contribute the most explanatory value to the final best fitting model.
First, let me clearly state that while I understand the theory and uses of disease models, and the need for accurate clinical parameters I am not an expert in machine learning, or on data mining/extraction. I believe this may actually be helpful in editing this paper in that we may make it more accessible to people who are concerned with the same issues, but may not be as conversant with the jargon, strategies, or processes involved.
Major comments: This is an important piece of work describing a tool which has great potential in chronic and infectious disease modeling. There are two major flaws in the work, both of which are easily addressed. First, it is important to acknowledge the fact that randomized clinical trials (RCTs) contain data which is innately biased. Generally only those people who consent to be studied are eligible, which creates an important and sometimes fatal self selection bias. Automatically this likely excludes those people who are sickest or whom the disease affects disproportionately. For example, people who do not write or speak English well are commonly excluded from trials, due to worries over informed consent, which then biases towards those who do. People who speak, understand and read English then tend to have higher levels of education and better paying jobs, and thus do not represent “normal” people who would be in need of the treatment or intervention. This is not necessarily avoidable, although there have been great strides in attempting to make these trials more realistic. Nonetheless it would add much to the paper if this were to be acknowledged to be a limitation which can be minimized by exactly this kind of application, which allows access to data from many countries, collected using many different protocols.
The second flaw is the very poor grammar and spelling used in the paper. Because of the amount of time taken to review papers, it is considered discourteous to present a paper with this number of grammatical errors, which greatly lengthen the time taken to review the paper. This topic is quite complex, so it behooves the author to be as precise, clear and succinct as possible. The poor use of English is exemplified in the title “The Reference Model models Clinicaltrials.gov”, which is not only confusing, especially to someone wishing to learn more about this, but is also inaccurate. As far as I can tell, the Reference Model does far more than just model: it extracts data, stores it and imports it intelligently based on previous notations for variables and their definitions; then it combines various models, and then optimizes them. To call that a “model” is to grossly undersell its virtues. There are missing articles throughout the paper, such as in the introduction line 1, where you need to add the word “The” to “Medical profession”. Please accept the suggested grammatical changes offered by your word processor, which will minimize the errors.
More minor comments follow.
I think this is a great piece of work, so it is well worth your time spent to improve the grammar, clarify the language and clearly define processes and terms. Otherwise reviewers may reject it out of impatience!
Good luck!!
Yours truly,
Ann Jolly, PhD
##################################################################
Reply to reviewer: Dan Zelterman
######################
This is not exactly what I do but I can make a few useful comments on the manuscript. I am OK with it being unblinded. I am not hiding anything and have become bolder in the process!
Anyway. Here goes.
Overall: This is a good idea. The clinicaltrials.gov database is a rich source of useful information. It is nice to see it is being used for these purposes. My own interactions with this database are limited to very narrowly focused searches.
################
RESPONSE: I thank the reviewer for recognizing the importance of this effort.
################
On the downside: The proposed uses do not include a qualitative aspect. Specifically, some trials are worth so much more than others. Some trials are terminated early because of unexpected patient side effects or if a better treatment comes along. Yet other trials become landmark findings and influence medical decisions for years afterwards. As a result, any good meta-analysis will have a (possibly subjective) measure of quality or influence, sometimes in a Bayesian framework. The search does not include this additional useful information.
################
RESPONSE: You will find some new text in the discussion section of the paper. Yet this opens a big discussion, which I will add to this reply, on what data is best and what data is useful, and this can be argued from multiple perspectives. The Reference Model can help figure out what data is useful, and this was published in the past. For example, the model initially showed a fitness matrix where multiple models were executed on multiple populations. This matrix clearly showed outlier populations, so the model does have capabilities for detecting outliers. However, in the ensemble modeling approach that the model took, the idea is that all data entered influences the decision on the model mixture: the idea is to scoop up as much data as possible to increase the model capabilities. Note that the user does have control over data quality by defining the query they ask. In the simulations used for the paper, two aspects that were taken into consideration in the query are study length and study size. Early terminated trials will have less influence since they are shorter, yet it would be incorrect to eliminate them completely from consideration since they may represent phenomena not seen in other trials. Whatever happened there did happen, and an ideal model should be able to represent the phenomena. The Reference Model just tries to get as close to that ideal model as possible within existing data and assumptions. The size of studies is also considered when influencing the decision: larger studies will have more impact since their data represents more subjects. I added some explanations to the paper to indicate this.
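As a rough illustration of the weighting idea described in this reply, and not the Reference Model's actual code, the Python sketch below assumes that a cohort's influence is proportional to its number of subjects times its length in years. That proportionality rule, the `Cohort` class, the `influence_weights` function, and the cohort names and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    name: str        # hypothetical label, not a real trial identifier
    n_subjects: int  # study size
    years: float     # study length


def influence_weights(cohorts):
    """Return normalized weights, assuming influence ~ subjects * years.

    The subject-years rule is an assumption made for illustration only.
    """
    raw = {c.name: c.n_subjects * c.years for c in cohorts}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one cohort must have positive subject-years")
    return {name: value / total for name, value in raw.items()}


# Made-up numbers: a long, large study and a shorter, early terminated one.
cohorts = [
    Cohort("long_large", n_subjects=10000, years=5.0),
    Cohort("terminated_early", n_subjects=3000, years=2.0),
]
weights = influence_weights(cohorts)
```

Under this assumption the early terminated cohort keeps a small but non-zero weight, which matches the point that such trials should have less influence without being eliminated.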
Yet please acknowledge that this paper is not about the mathematical techniques: this is the first time that this connection to ClinicalTrials.Gov has been implemented. In the future it will be possible to add many more features and conduct many more simulations that explore the clinical trials. The system does output fitness per cohort as well as specific outcomes within a trial, so those can be easily explored. For now, the focus was on creating the interface that did not exist in the past and reporting on its existence. Yes, there is much more to do; please let it be done step by step. As far as I know, no one has made this connection in the past, and I did speak with ClinicalTrials.Gov developers, so this is a step forward for modeling. Please let it be published first.
################
In general: Write using the active voice! Write in the first person! The Abstract contains this passive, third-person gem: "Collecting these data elements from the literature was extremely tedious." Tedious for whom? Instead, write: "We found collecting these data elements to be tedious." But then again: Who cares? You did the research so everybody assumes there was lots of tedious work involved. Get over it!
################
RESPONSE: As for writing style, I went over the paper once more and decided to keep the passive voice yet make it consistent throughout the paper. I can show you a remark I got in a review of another paper last year that asks for the opposite writing style from what you are asking. I guess it depends on venue and reviewer. I do understand your point that writing in the first person makes the text correct, since it represents a viewpoint rather than absolute truth, yet I would ask the reviewer to please accept the chosen writing style as long as it is readable and not wrong. I sometimes prefer first person myself, yet a major rewrite may not be effective here. I made many corrections already in this version.
And specifically on the term “tedious” you mentioned: perhaps it does not represent the intent, and the word “inefficient” should be used instead. The long term idea is to automate the procedure completely. We are far from this point now, yet consider a future system in which, once a clinical trial is reported in ClinicalTrials.Gov, the knowledge is automatically propagated to adapt a living model that includes all this knowledge and keeps on growing. The import mechanism is a small step in that direction. There is a lot of work to do, yet until we get there what you see is still an improvement. For example, in the past it took roughly a week to get a paper and code it into the model. Coding the import mechanism in Python and importing 13 new populations into the system took about 3 months, roughly the time that coding 13 papers by hand would have taken. With the import mechanism ready, it took only a few hours to add an entire new population, and the import code maintains a traceable link back to the source of the data that helps eliminate human error. I have no exact numbers since development overlapped with importing, and therefore I report it only briefly in the paper, yet near the end I was adding several populations a day. This is a dramatic increase in productivity and quality. I made sure this is briefly mentioned in the discussion section.
################
I would strongly suggest a thorough re-write of the whole document with the help of a patient editor. I have found this action to be helpful.
Dan Zelterman
################
RESPONSE: Thanks for the suggestion – I went over the paper and improved it.
Hopefully the reviewer finds it in better shape considering the responses.
################
##################################################################
Reply to reviewer: Ann Jolly
This is a description and proof of concept report of an innovative learning data extraction, interpretation and modeling tool for disease modeling. In this paper the author demonstrates how the package uses data from the metadata of clinical trials hosted at a public site called Clinicaltrials.gov in the United States. This site combines all data available from registered clinical trials so as to improve the transfer of clinical knowledge to the researchers and health professionals who need it, and to enable further data analysis to produce new knowledge. In this paper he uses 13 type 2 diabetes randomized clinical trials in a sample model. He describes how the package “learns” which units are being used, and updates future imports of trial data; how it assigns values to individual simulated patients based on the distributions of measures it imports; how it models several outcomes; and finally how it optimizes which models contribute the most explanatory value to the final best fitting model.
################
RESPONSE: Thanks to the reviewer for describing the intake from the paper – it made me think more if I clarified the objectives. There are some clarifying modifications that I added to the paper:
ClinicalTrials.Gov is perhaps US based, yet it contains worldwide information, so it is global. I mentioned this in the abstract and introduction.
One very long term goal is to automate the process of collecting and interpreting clinical knowledge. Of course researchers will benefit from it, yet first we need to get to the point where machines can understand the data out there. So far it has been difficult to collect the data and transform it to a machine readable format; it took a human to interpret it. Once ClinicalTrials.Gov started gathering the data, a system was needed to interpret it into machine readable knowledge, so this paper is less about researchers and more about making the machine understand this data. I added a sentence about it at the end of the discussion.
Although there are some machine learning elements in the paper – it is less about learning techniques. It is more about how data is imported. You will find a sentence about it in section 3 bullet 1.
################
Great!! Got it. Add a comma after “Although this is the core of the system”.
First, let me clearly state that while I understand the theory and uses of disease models, and the need for accurate clinical parameters I am not an expert in machine learning, or on data mining/extraction. I believe this may actually be helpful in editing this paper in that we may make it more accessible to people who are concerned with the same issues, but may not be as conversant with the jargon, strategies, or processes involved.
################
RESPONSE: Thanks to the reviewer for this suggestion. I think I caught this from the first paragraph. I did make several changes to better explain what is going on, as much as possible within a paper. The paper is less about learning and more about extraction of messy data in a way that can be linked back to the source. Such medical data is hard to get, so it is important to be able to extract as much of it as possible.
################
Good yes thanks Jacob
Major comments; This is an important piece of work describing a tool which has great potential in chronic and infectious disease modeling. There are two major flaws in the work both of which are easily addressed.
################
RESPONSE: The reviewer is observant and caught the notion that the technique described here to read data from ClinicalTrials.Gov can be used for modeling infectious diseases – yet since the author has no experience with infectious disease modeling, it is best left outside this paper to other venues.
I agree though if you need to persuade editors to accept this paper it is wise to appeal to as many readers as possible… all you have to do is say “ Of course this approach can be easily applied to infectious diseases as well as chronic ones.”
################
First, it is important to acknowledge the fact that randomized clinical trials (RCTs) contain data which is innately biased. Generally only those people who consent to be studied are eligible, which creates an important and sometimes fatal self selection bias. Automatically this likely excludes those people who are sickest or whom the disease affects disproportionately. For example, people who do not write or speak English well are commonly excluded from trials, due to worries over informed consent, which then biases towards those who do. People who speak, understand and read English then tend to have higher levels of education and better paying jobs, and thus do not represent “normal” people who would be in need of the treatment or intervention. This is not necessarily avoidable, although there have been great strides in attempting to make these trials more realistic. Nonetheless it would add much to the paper if this were to be acknowledged to be a limitation which can be minimized by exactly this kind of application, which allows access to data from many countries, collected using many different protocols.
################
RESPONSE: The reviewer is absolutely correct: the data is noisy and may contain bias. The first reviewer also caught this. However, this is the data available to us and we have to do the best we can with it. The alternative is having no data and speculating. This is why it is important to have some sort of quantifiable reference to which we can compare the data to find outliers, and perhaps be able to quantify bias in the future by modeling it. Think about it: the starting population is a subset of the human population. Each clinical trial is biased by many factors, such as the inclusion and exclusion criteria the reviewer describes. However, if we record those in a model, we can then mathematically test whether the model matches what we observed in multiple trials. Every trial brings a sample of the human population, and each sample has a different age range or a different sickness pre-condition. We may not capture the entirety of human disease progression, yet if we are able to merge the pieces of information somehow, it is better than mere speculation. It is very similar to the story of the blindfolded wise men trying to describe the elephant: if we can piece together the elephant from the stories the wise men are telling us, we may have a much better picture than from any story told separately. The Reference Model has elements that allow merging this bigger picture, and it has mechanisms that help reduce the effect of less significant data, such as study length and population size, which affect the importance of a study. This was discussed in the revised version.
The Reference Model also has the potential to test some assumptions if those are modeled using the assumption engine and by adding metadata to populations. So in the future it may be possible to test assumptions regarding bias. I am reluctant to add a long discussion about such capabilities since this is not the focus of the paper and is not formulated yet, yet I tried at least to hint in that direction in this reply and added some text in the discussion section.
Hopefully the reviewer will find the changes I made in the paper sufficient to represent the point the reviewer is making.
Bullet 2 Para 3 “ Data is..” should be “Data are”… “Datum “ is Latin singular and “Data” is the Latin plural for many pieces.
Sorry, I must have missed where you inserted the pieces about this. All I found when I did a word search was that the word “bias” was in the Discussion, 8th para. Here all I am asking for is a single sentence at the beginning of the section on “ClinicalTrials.gov” saying something like: “It is important to note randomized clinical trial data are inherently biased to include only those who have consented to take part. This means, for example, that people who cannot read the language of the consent form will be excluded; similarly certain ethnic groups, socio economic classes and other minorities may be consistently omitted.” Just to let you know, Jacob, that also includes women and children much of the time.
################
The second flaw is the very poor grammar and spelling used in the paper. Because of the amount of time taken to review papers, it is considered discourteous to present a paper with this number of grammatical errors, which greatly lengthen the time taken to review the paper. This topic is quite complex, so it behooves the author to be as precise, clear and succinct as possible. The poor use of English is exemplified in the title “The Reference Model models Clinicaltrials.gov”, which is not only confusing, especially to someone wishing to learn more about this, but is also inaccurate. As far as I can tell, the Reference Model does far more than just model: it extracts data, stores it and imports it intelligently based on previous notations for variables and their definitions; then it combines various models, and then optimizes them. To call that a “model” is to grossly undersell its virtues. There are missing articles throughout the paper, such as in the introduction line 1, where you need to add the word “The” to “Medical profession”. Please accept the suggested grammatical changes offered by your word processor, which will minimize the errors.
################
RESPONSE: The reviewer is very gracious in expanding the title of the paper beyond the short notation I used. The reviewer is correct, the model does go through many steps to get to a final answer. Yet allow me to keep the catchy somewhat repetitive shorter title. It is not an error if you consider all of those steps as one modeling operation – so it is in the eye of the beholder.
As you wish
The reviewer is right that this is more like a set of operations than an equation that models something. Yet in the field I am coming from this is what is considered modeling: any set of operations that describes something can be bundled into a function that “models” some phenomenon, i.e. tries to imitate it. At the conference this paper is addressed to there are typically people showing discrete event simulation models that are also a set of operations that model some behavior. The Reference Model is not much different from that.
Yes, I understand, though the “models” which I am used to are those used by mathematical ecologists to predict the sizes or other parameters of disease spread using differential equations. Perhaps you may say just that: “In this paper the word model refers to a set of operations which can be combined into a system designed to imitate a phenomenon.”
I went over the text again and tried to reduce redundancy – from my view it is ok, yet I got the same feedback you are giving from other English speakers that do not like the redundancy and repetition of the word model.
However, I did take all your suggested corrections seriously, ran over the paper with a spell checker, and made many of the corrections you requested. Indeed there was room for improvement. It is funny what the human eye misses when looking at the same text over and over again, so the second eye is much appreciated here.
################
More minor comments follow.
1.Abstract line 1 try not to use the word “model” twice in the same sentence before you even define exactly what it is that you mean. Ensure that the readers to whom this paper is targeted understand “ensemble model”. As I said, the Reference model is actually more than just a model in my understanding.. more like an application or computer program or package.
################
RESPONSE: I tried to remove redundancy by removing “ensemble model” from the abstract. I hope it is better now.
Thanks!!
################
2.Abstract, 2nd and 3rd lines. The words “extremely tedious” and “Humans reading papers” sound odd. Perhaps rephrase to “Requiring researchers to collect model parameters from the literature is extremely labor intensive, error prone and tedious.”
################
RESPONSE: The other reviewer also did not like this phrasing, so I changed “tedious” to “inefficient”; otherwise I used your suggestion. It is very good.
################
3.Introduction line 1, “accumulate” is the wrong use of the word; “knowledge accumulates”; doctors “collect”. Consider changing doctors to physicians as many do not have PhDs.
################
RESPONSE: Thanks for explaining the nuances – I guess I was using the day to day term – I made the changes suggested.
################
4.Para 2 line 2; change the sentence order to; IBM Watson is a … (program or computer or platform) ” designed to read the accumulated knowledge.
################
RESPONSE: Thanks for the suggestion – it actually helped merge two sentences together and remove redundancy.
################
5.Para 2 line 3 “It was reported in Miller that it...”: delete the repetition. What do you mean by equivalent? Equivalent to learn, remember, store, or interpret? When started to what? Read?
################
RESPONSE: Thanks for the suggestion – here is the exact quote from Miller : “When we first encountered it in our training sets, it was as good as a third-year medical student at answering questions. Every day it’s a little bit smarter — right now it’s as good as our fellows.”. I changed the text a little bit without losing meaning – hopefully it is acceptable now.
################
6.Last para on page 1 2nd line, “The Reference model is partially in focus” this is not good English. I think you mean it is partly the focus of this paper.
################
RESPONSE: Thanks. The sentence was changed as suggested and punctuated better.
################
7.Page 2 2nd para, omit this; you already said it.
################
RESPONSE: Sorry, this is unclear. Do you mean that I repeat something written in the abstract? If this is the case, then I would like to keep the repetition. It is important to explain in the body that copying data by hand was not the best way to go. Another emphasis is needed; after all, this is what this paper solves. If you meant omitting the paragraph just before table 1, then I would like to keep it since it elaborates on the difficulties that the system solves. However, I did remove the first sentence in the paragraph before table 1 since it was not very significant. Hopefully this was what you meant.
“…growing fast over the last few years” is very similar to “rapid increase in size”; omit one of them.
8.Page 2 para3 line 1 the adjective in English comes before the noun so change to “registered clinical trials”
################
RESPONSE: Thanks, this was corrected.
################
9.Page 2, para 2 line 2 omit as it is repetitive.
################
RESPONSE: Thanks, you are correct, it was redundant and was removed.
################
10.Page 2 Para 2.;last line, what is a “specific scheme”? do you mean order?
################
RESPONSE: This needs clarification. The XML format is very general, and typically schemes are used to define a specific XML sub-language to describe something. This is called a scheme and is widely used across fields to specialize XML for a specific use. I changed the text to “XML scheme” to be more specific about what I mean. This is a somewhat technical term that programmers will recognize; it is not worth elaborating more than that.
################
11.Page 2 Para 3 “accumulating data” should be changed to “ allows accumulation of data from multiple users.
################
RESPONSE: Thanks, this was corrected as requested.
################
12.Page 2 Para 3 second last line, always use a comma after “therefore” & “however” at the beginning of a sentence.
################
RESPONSE: Thanks, a comma was added.
################
13.Page 2 Para 3 last line; Hereafter is an antiquated word, use “below.”
################
RESPONSE: OK. This is a nice change, and it was made.
################
14.Page 4 para 3 first line “… typically defined in clinical trial report(s)”: add the “s”; change “as” to “in” [a table]
################
RESPONSE: Thanks, - corrected.
################
15.Page 4 Next line remove “that’ add “tabulated”
################
RESPONSE: Thanks. However, I decided to change the sentence instead. You see, Table 1 in most clinical trials defines the population demographics. It is almost a standard, so I would like to keep the notion that this is a specific table rather than any tabulated data. The text now is “…the statistics described in that table…” add within the limit of computing..’;
################
16.Page 4 What are population building blocks? Define.
################
RESPONSE: Well, this is too much for the space available, which is why I provided so many references. I use an object oriented approach where the modeler models small objects that can be assembled together to define populations, which can later be used in the assembly of other populations or modified by others; therefore those are like building blocks. To keep it short, I just send the reader to the references. I did use quotes around building blocks so the reader understands this is not an actual block. Hopefully, this is good enough.
Good!!
################
17.Page 4 para 5, first line, what is the first element; define. “The latter elements” of what? If they represent observed data then evidence based is redundant, omit it.
If the outcomes are observed then they are considered to be empirical data, that is data which are collected directly from the experiments or populations measured. Evidence based knowledge means that the practice of doing something like administering Vitamin C to sailors should be supported by adequate evidence ( which does not necessarily have to be a randomized clinical trial).
################
RESPONSE: Yes, I was afraid this was not clear. I meant the last two enumerated items. I completely restructured the text to be more specific.
################
18.Page 4 Para 5 last line, a paper can have a focus, but only an object can be “in focus”.
################
RESPONSE: The sentence was completely restructured as mentioned above.
################
19.Page 4 Last para heading 2, I think you mean systematic review; systemic is a medical term meaning “affecting all body systems”
################
RESPONSE: Thanks, how nice to have an observant human eye. I meant systematic and the speller did not catch the typo. I made the correction in more than one place. I now suspect auto correction caused this in the first place.
################
20.Page 4 2nd para line 2, “textual” means relating to texts or based on a text, as in “references to Noah’s flood are both oral and textual”. Here you mean text as in the format of the file is text.
################
RESPONSE: Sure, this was changed to text format and the sentence was restructured to better reflect the advantages of this format for this case.
Hi Jacob one more in the middle of the 2nd paragraph after the diagram.
################
21.Page 4 para 2, second line: racial is an adjective and ethnicity is a noun, so I think you mean “race” or “racial group”.
################
RESPONSE: many thanks – corrected.
################
22.Page 4 halfway down; add a comma after “Despite the variety of information,’
################
RESPONSE: thanks, comma added.
################
23.Page 5 line 2”… in case of an error, add comma”
################
RESPONSE: Thanks, comma added.
################
24.Page 5, 3rd para, what is joining.
################
RESPONSE: There were colons missing after “involved”. Joining is only one of the operations supported. I changed the sentence.
################
25.Para 4 first line see 8,above.
################
RESPONSE: Thanks – corrected..
################
26.Para 4 give an example of what the DSL is able to “figure out”. Don’t use this term; it is a colloquialism. I suggest “understand” in quotes or “adapt to”.
################
RESPONSE: Good suggestion – I chose “understand” in quotes - as suggested below.
################
27.Whenever you are giving an inanimate object the abilities of a human use quotes, like “understand”, “recognize” etc.
################
RESPONSE: Good suggestion I used it twice in the paper.
################
28.Para 4 second last line, use “process” rather than “handle”.
################
RESPONSE: Sure, process is a better word
################
29.Please give an example of how the program can combine and recategorize ethnic group. How is this code written? Is it adapted mostly for the United States or can it be used world wide?
################
RESPONSE: If the reviewer is ok with it, I will skip doing this. It almost requires writing the code in the paper to fully understand. A heuristic is used that identifies user defined names of ethnicity/race and compares them to the text defined in the Race/Ethnicity fields in ClinicalTrials.Gov. It then recalculates proportions and tries its best, within limitations, to handle duplicate information such as Hispanic, which is defined as a separate category in some trials. This code may change in the future, as handling more trials may require more complicated interpretation. Yet it was a sufficient compromise for importing the populations in this paper. The method is not specific to US trials, although it is designed to map race/ethnicity to an array of categories the user defines that is used by some US trials. Since this text is very long, I suggest leaving it in this reply rather than in the paper.
Ok but all I was suggesting was to give an example; like if the categories are “Black” does that equate to African American etc? Cherokee to North American Indian etc?
################
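The kind of example the reviewer asks for in comment 29 can be sketched in a few lines. This is only an illustration of the idea described in the reply (match free-text registry labels against user-defined categories, then recalculate proportions); it is not the package's code. The category names, keywords, function names, and counts below are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical user-defined categories and the phrases that map to each.
# These are illustrative assumptions, not the tool's actual tables.
USER_CATEGORIES = {
    "Black": ["black", "african american"],
    "White": ["white", "caucasian"],
    "Asian": ["asian"],
    "Native American": ["american indian", "alaska native", "cherokee"],
    "Hispanic": ["hispanic", "latino"],
}


def map_label(label):
    """Return the first user category with a whole-phrase match in the label, else 'Other'."""
    text = label.lower()
    for category, keywords in USER_CATEGORIES.items():
        for keyword in keywords:
            # Word boundaries stop "asian" from matching inside "caucasian".
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                return category
    return "Other"


def recategorize(counts):
    """Merge registry labels into user categories and return proportions."""
    merged = defaultdict(int)
    for label, n in counts.items():
        merged[map_label(label)] += n
    total = sum(merged.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in merged.items()}


# Made-up baseline counts, for illustration only.
proportions = recategorize({
    "Black or African American": 200,
    "White": 600,
    "American Indian or Alaska Native": 50,
    "Asian": 150,
})
```

In this sketch “Black or African American” falls under “Black” and “American Indian or Alaska Native” under “Native American”. It does not attempt the harder cases mentioned in the reply, such as Hispanic being reported as a separate overlapping category, or negated labels such as “Not Hispanic or Latino”.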
30.Page 6. 1st para. It does not follow that if a module was developed quickly that it keeps track of versions. It seems to me that this could happen even if the module were developed slowly.
################
RESPONSE: Sure, this is a weak link – keeping versions is important always, yet importance is increased when code changes quickly. Yet since the link is weak, I removed the reference to speed of development.
################
31.Page 6 3rd para. In which segment of the citation were the search words specified? The title, abstract, whole article, MESH terms?
################
RESPONSE: The search terms were specified in the general field – here is the search query as presented in the expert search field in ClinicalTrials.Gov
“Diabetes type 2 AND ( Heart OR Stroke OR CVD OR MI ) AND NOT NOTEXT [FIRST-RECEIVED-RESULTS-DATE]”
In the future the query and date should be sufficient for reproducing the trial list. For now I am leaving this outside the paper and in this reply.
Consider placing this in the paper as it is very difficult to know how these things are validated. So if you leave out a certain term, are you dealing with only half the papers you could have had?
################
32.Page 6 3rd para “Short duration studies with duration of less than 2.5 years…” should be changed to “Studies of less than 2.5 years were defined as short term”.
################
RESPONSE: Ok, this makes sense – I omitted the short term altogether.
################
33.What does “marked for scrutiny towards inclusion” mean? Do you mean they were scrutinized for inclusion?
################
RESPONSE: The reviewer is correct that this is convoluted – yet this was a borderline case. Let me explain. This decision happened only after looking more closely at the studies extracted – many were only several months long, so some sort of cutoff was required to remove phenomena that clearly should not be modeled using a time scale of several years. The model time step is one year, and although the model can technically provide results after a year, it was decided to allow at least two modeling time steps to produce results – so all trials below 2.5 years were first considered excluded and scrutinized by eye once more to salvage only those suitable from a modeling perspective. This eventually included Elixa and TAK-875, where the latter study was eventually removed as reported since actual data was missing. Again, this is a modeling decision – anything that is too short will have little effect anyway in the model, and 2 years was considered borderline, so anything less than 2.5 years was looked at. An alternative decision would have been to make a hard exclusion of less than 3 years, yet I decided to include as much as seems reasonable, so eventually one 2-year study was modeled. Do note that the time scales became clear only after the query was produced – so this decision came only after the list of trials was extracted from ClinicalTrials.Gov – time restriction was not part of the query – it is a human decision. This explanation is long and I would rather leave it out of the paper and in this reply. I did change the text a little bit to be clearer – yet this decision was not trivial.
################
Good, got it!
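The screening rule described in the reply to comment 33 amounts to a simple threshold followed by a manual look. A minimal sketch, assuming trial durations are already available in years; the function name and the trial names and durations are placeholders, not data from the paper:

```python
# Threshold taken from the reply: trials shorter than 2.5 years are set
# aside and then inspected by eye rather than dropped automatically.
SCRUTINY_THRESHOLD_YEARS = 2.5


def screen_trials(durations_in_years):
    """Split trials into those kept outright and those flagged for manual review."""
    kept, flagged = [], []
    for name, years in durations_in_years.items():
        if years < SCRUTINY_THRESHOLD_YEARS:
            flagged.append(name)
        else:
            kept.append(name)
    return kept, flagged


# Placeholder names and durations, for illustration only.
kept, flagged = screen_trials({"Trial A": 5.0, "Trial B": 2.0, "Trial C": 0.5})
```

The flagged list is where the human decision comes in: in the reply, a two-year study was salvaged from that list while the studies lasting only months were not.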
34.Next sentence, change to “if they fit other criteria”
################
RESPONSE: Thanks – the change was made.
################
35.Numerals should not be used at the beginning of a sentence.
################
RESPONSE: Ok, the word “Overall” was added.
################
36.Consider creating a diagram with the numbers of papers considered and those excluded for different reasons.
################
RESPONSE: There is no room within the space allotted for another diagram – and as much as this is common in clinical trial papers, it is not the intention here – the spreadsheet that holds the data about the trials has the reason for exclusion / inclusion of each trial all in one page – I think it is better than a diagram – yet again it is too long – I think the numbers are sufficient to understand what is going on. In the more distant future all of this will be done by putting a query in the computer and the modeler will play with the query until satisfied with the final set of trials to be modeled – it is different from clinical trials, where design should be planned ahead of recruitment and this diagram makes more sense – here it is mostly post-collection data processing.
################
Ok too bad…
37.Page 6, 3rd para. Tautology “ each study was imported and mapped for the elements to import”
################
RESPONSE: OK, the sentence was rephrased and shortened. I used “modeled study” to avoid reuse of import.
################
38.Next line, change to “required recoding” and omit ‘needed attention’
################
RESPONSE: Good suggestion – I made the change
################
39.Mid paragraph, how do you extract the population distribution from categorical data? Do you assume equal proportions for each age within a category? Explain
################
RESPONSE: This was a mistake in the sentence – the word “not” was omitted by mistake – I added it and the text now is “so statistics could not be extracted automatically”. Thanks for noticing – in the future some heuristic can be written to cope with such cases – the modeling language does allow complicated definitions of all sorts of population spreads – beyond common statistical distributions – for now the system does not handle this import automatically – this is handled downstream by object-oriented override – by a default diabetic patient for now. Apologies if this misled the reviewer – it was a mistake in the text – I am so happy it was caught before publication.
################
Cool
40.Define eGFR
################
RESPONSE: I used “estimated glomerular filtration rate” instead since acronyms are not reused in the paper.
################
Good
41.Next sentence is completely unclear, do you mean that you have values for two variables which are correlated? Why would you not want to import both to see how exactly they are correlated?
################
RESPONSE:
In this case, I decided to keep things unchanged – it is out of the scope for this paper. This is a long discussion and here are some clarifications. First, the import cannot handle correlations not programmed in advance. So for the sake of simplicity correlations are generally not modeled in this paper – especially not during import. Distributions imported are independent for the sake of this paper. This is what this sentence means.
The Reference Model, however, can handle correlations among biomarkers, even if those are unknown, by running multiple scenarios using High Performance Computing – think about it as sensitivity analysis. However, in this paper such simulations were disabled.
However, the system does handle some cases like calculating cholesterol values from the others when applicable. Yet I really do not want to get into details here – the entire idea is that the system automatically extracts what it can from ClinicalTrials.Gov – everything else is simplified or handled downstream after import.
Yes, so then to be clear just say just import all parameters. Then, “ correlations can be later introduced through…”
################
42.Page 6 last para. How are the import instructions and specifications changed? Does the researcher choose from a list, or has to go into the code? Or is she alerted by an error message in the import report?
################
RESPONSE:
If you are referring to the ACCORD and RECORD study example in the paper, then specifications are actually part of the import instruction – since the data in ClinicalTrials.Gov has some ambiguity with regard to how cohorts are specified in baseline and outcomes, the instruction has to have clear specifications of how to handle the ambiguity. In the future this may improve, yet for now it is necessary for the modeler to intervene. I made no changes in the text – there is no space to describe all the issues dealt with in detail.
So here just say: “For the RECORD trial, outcomes for combined or merged groups had to be specifically matched in the import instructions with the baseline population”. Also remember one thing is different “from” another, not “than” another. “Than” is the wrong preposition.
################
43.Page 7 first para. Dates are converted all the time to durations in epidemiology. See coding in SAS, STATA, SPSS, and Epi Info
################
RESPONSE:
This has to be clarified. Here is the link to Outcomes measure 1 in TREAT: https://clinicaltrials.gov/ct2/show/results/NCT00093015?sect=X70156#outcome1
Here is the text of the relevant field there:
“Time to All-cause Mortality or Cardiovascular (CV) Events Including Hospitalization Due to Acute Myocardial Ischemia, Congestive Heart Failure (CHF), Myocardial Infarction (MI), and Cerebrovascular Accident (CVA) [ Time Frame: Until a primary cardiovascular event (death, myocardial ischemia, congestive heart failure, myocardial infarction or cerebrovascular accident) occurred or 28 March 2009, whichever occurred first ]”
If any of those systems can automatically extract duration from this text alone, I will be surprised. There is insufficient data – in this case a human override was used. There is no room for going into many details like this in the paper – so I made no changes in the text. The important thing to remember is that the system does what it can and then the modeler can add all sorts of specifications on how the import will happen and, in extreme cases, override the system.
My mistake, I thought you were referring to the metadata from each of the participants. In which case of course you have date of entry or event and hopefully date of outcome and then the program just subtracts.
################
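The reviewer's closing remark on comment 43 describes the easy case, where both dates are known and the program simply subtracts. A minimal sketch of that case follows; the function name and the entry date are made-up placeholders, and only the 28 March 2009 cutoff comes from the TREAT time frame quoted above. The registry field itself supplies no entry date, which is why a human override was needed.

```python
from datetime import date


def follow_up_years(entry, end):
    """Duration between two calendar dates in years (365.25-day convention)."""
    return (end - entry).days / 365.25


# The entry date is a hypothetical placeholder; the end date is the
# cutoff quoted in the TREAT outcome time frame.
example_years = follow_up_years(date(2005, 3, 28), date(2009, 3, 28))
```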
44.How much higher were the 451 cells now filled than those previously?
################
RESPONSE: It depends how you count. For this paper, I am trying to show how many cells are automatically generated by the import system using the templates the modeler defined. Think about having a human copy 451 cells by hand – even using copy and paste operation and spreadsheet calculation there will be errors that cannot be traced back – this is why I am writing those numbers – we need a computer to handle this amount of information. So I am not changing the text in the paper.
Yet since you asked, there are 1354 cells in the spreadsheet if we count all non-empty cells and the titles column. Yet only some of those are numbers that need extraction – this is what I mean by it depends how you count. Without the system many errors can be made by a human, such as placing a number in the wrong column/row or miscalculating a number by hand.
################
Ok so here I am trying to let you show the reader how much better the computerized system is than the human one. So you could even say that the highest was x by human and contained 30% errors, compared with 1,354 filled cells, which are 100% accurate although not all useful.
45.Page 8 line 1 please don’t use “figure out” ; try “determine”
################
RESPONSE: Thanks. This is a good suggestion that I followed.
################
46.Page 10 2nd para, change “dominant” to those “equations which performed best...”
################
RESPONSE: Unless there is a better phrasing, I would prefer to leave the text as is – the reason is that the model contains multiple equations that work together – some contribute more than others, so the notion of one equation performing best is somewhat lost – in the past I was trying to figure out which equations worked best, in the last year the system changed so that there is no longer one equation – so dominant is a better short phrasing of a longer idea. Hopefully I was able to pass the idea in this reply.
################
Ok so here what I am trying to say is “dominant” is the wrong word; try “better” or “performs better”. Dominant means “most influential”, or prevailing or exerting control over, which does not mean that it is better. You need an adjective which refers to an equation or set of equations.
47.Page 10, para 3 I assume all of these programs are coded by humans, so perhaps you may want to change this? I am not sure what you mean.
################
RESPONSE: This is a good point. I changed the paragraph a bit. The imported data represents only what could be extracted from ClinicalTrials.Gov using the current programming. More automation may improve capabilities. Yet for now, post import human intervention is still helpful – I hope the paragraph now reflects this.
################
48.Para 4 2nd line. “applied to” not “on”
################
RESPONSE: Thanks, corrected.
################
49.What does legally binding have to do with this?
################
RESPONSE: This has to do with potential of future growth of the database. I changed the text a bit to better explain this. Think about it, in a few years, it will contain much more data – this means bigger modeling potential.
So perhaps you can just say that as more jurisdictions require registration of trials, the database will increase. The way you have it here the centralized repository is legally binding which does not make sense, what you mean I think is that the requirement to register is legally binding or mandatory.
################
50.Page 10 2nd last line, “over time” is two words.
################
RESPONSE: Thanks – it seems I was working overtime while writing this, I corrected it now.
################
Great job Jacob! I don’t need to see this again!
Ann
Good luck!!
Yours truly,
Ann Jolly, PhD
################
RESPONSE: Thanks Ann, your review was more than helpful – it really made me think from the start. And the last comment was very encouraging. I did my best to address all your comments at length – hopefully you will see the benefit in the revised version. I really value the contribution – may all reviews be similar.
################
##################################################################
Reply to reviewer: Robert Smith?
Editorial review:
The bibliography contains 16 references, of which 12 are by the author. Self-citations are sometimes unavoidable, but they should generally not constitute 75% of the literature survey! Of the remaining four, one is to google. This weakens the manuscript's weight quite significantly. The work should be situated in the academic literature, so I suggest switching citations to a more acceptable balance.
################
RESPONSE: You are right. This is uncommon. Yet not unexplainable. The Google reference you saw shows relevant technology that may be able to read pdf data automatically. It is a Google project under beta that Alon Halevy described in SummerSim – unfortunately I could not find the video recorded of that talk online.
As for self references, I was trying to provide as many details as I can for folk to follow the evolution of this project – it is built from layers added in the last 5 years, yet the technology development started a decade ago. However, some revisions in the bibliography were necessary – I aggregated many publications in a SimTk project web site and linked to it to remove some references and create some space. However, I still tried to keep specific links to files that readers may find useful in the future – there are now 7 self references, yet I think those are important and relevant to this paper.
################
Final response from Ann Jolly:
I just reviewed Jakob’s changes and it’s all good to go!! Can you tell him thanks and what a nice guy! Happy to have done this… much more collegial than adversarial!