Coding in Logistic Regression


Sukhmani Singh

Mar 2, 2021, 9:39:45 AM
to DataAnalysis

Dear Neeraj Sir and fellow scholars,

Greetings from Chandigarh!

I have a question about the coding of variables for logistic regression.

The dependent variable is Out of Pocket Expenses, coded as: 1-Low Expenditure, 2-High Expenditure.

The independent variables are as follows:

1. Age Category, coded as:
    1-Under five age group
    2-Child
    3-Adolescents
    4-Adults
    5-Senior Citizen

2. Monthly Consumer Expenditure, coded as:
    1-Lowest monthly group
    2-Lower middle
    3-Upper middle
    4-Highest monthly exp group

3. Nature of Treatment, coded as:
    1-Allopathy
    2-Ayush
    9-Others

4. Pregnancy, coded as:
    1-Pregnant
    2-Not Pregnant

My question is: is this coding correct, especially for the last two independent variables, i.e. Nature of Treatment and Pregnancy?

Earnestly waiting for some response.

Thanks and Regards,

Dr. Sukhmani.
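
To illustrate the coding question above, here is a minimal Python sketch of how a categorical covariate is usually expanded into 0/1 indicator (dummy) columns before a logistic regression is fitted. The function name `dummy_code` and the six sample responses are hypothetical, made up for this example and not taken from the actual data. It suggests that the numeric codes (1, 2, 9) act only as labels: relabelling 9 as 3 gives the same indicator columns, and the reference category is simply the level that gets no column of its own.

```python
def dummy_code(values, reference):
    """Return (levels, rows): one 0/1 indicator per non-reference level."""
    if reference not in values:
        raise ValueError("reference level does not occur in the data")
    # Every distinct code except the reference gets its own indicator column.
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical responses for Nature of Treatment
# (1 = Allopathy, 2 = Ayush, 9 = Others).
treatment = [1, 1, 2, 9, 1, 2]

levels, rows = dummy_code(treatment, reference=1)
print("indicator columns for codes:", levels)
for code, row in zip(treatment, rows):
    print(code, row)

# Relabelling "Others" from 9 to 3 changes nothing in the indicators.
relabelled = [3 if v == 9 else v for v in treatment]
_, rows_relabelled = dummy_code(relabelled, reference=1)
print("same indicators after relabelling:", rows == rows_relabelled)
```

Under this view, the code 1 does not by itself mean "low effect"; what matters is which level is chosen as the reference.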

Neeraj Kaushik

Mar 2, 2021, 8:38:36 PM
to dataanalysistraining
Yes, the coding is correct, but logistic regression will be too complex here.
Cross-tabulation will be a much easier technique to apply here.
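
As a rough illustration of the cross-tabulation approach suggested above, the following Python sketch computes the Pearson chi-square statistic for a two-way table of counts. The counts are hypothetical (they are not the survey data), and the helper names `chi_square` and `p_value_df1` are made up for this example. The p-value helper only covers the 2x2 case (1 degree of freedom), where it can be written with the complementary error function.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: rows = Pregnant, Not Pregnant;
# columns = Low, High out-of-pocket expenditure.
table = [[20, 30],
         [60, 40]]

stat, df = chi_square(table)
p_value = p_value_df1(stat)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.4f}")
```

A small p-value here would indicate an association between the two variables; it would not say how strongly one predicts the other.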


Sukhmani Singh

Mar 3, 2021, 4:52:33 AM
to DataAnalysis

Dear Sir,

Thank you for your feedback and suggestion about cross-tabulation.

Sorry to bother you again, but I still need clarity on the following:

1. In cross-tabulation, we check the association between each IV and the DV, and for this we check the significance of the Chi-Square statistic. Am I right, sir?

2. If we need to check how well the IVs predict the DV, can we still use cross-tabulation, or is regression required?

3. Lastly, you are right: logistic regression is getting complex here. But I still need to know this: normally we give the same code to the “Control/No effect/Low effect” category in every variable, i.e. in this case:

Dependent Variable (Out of Pocket Expenses), coded as: 1-Low Expenditure, 2-High Expenditure. Here 1 is assigned to the “Control/Low effect” category.

Independent variables:

1. Age Category, coded as: 1-Under five age group, 2-Child, 3-Adolescents, 4-Adults, and 5-Senior Citizen. Here 1 is assigned to the “Control/Low effect” category.

2. Monthly Consumer Expenditure, coded as: 1-Lowest monthly group, 2-Lower middle, 3-Upper middle, 4-Highest monthly exp group. Here 1 is assigned to the “Control/Low effect” category.

3. Nature of Treatment, coded as: 1-Allopathy, 2-Ayush, 9-Others. Here 1 is assigned to Allopathy. But in the actual data 85% of respondents use Allopathy and only 5% use Others. Is it correct to assign 1 to Allopathy, which would make it the “Control/No/Low effect” category?

4. Pregnancy, coded as: 1-Pregnant, 2-Not Pregnant. Here also 1 is assigned to Pregnant, which would make it the “Control/Low effect” category. But how can “Pregnant” be considered the “Control/No/Low effect” category, especially in conjunction with the other IVs?

Moreover, in SPSS Logistic Regression specifically, while defining the Categorical Covariates, we select the Reference Category as FIRST when we have given the lowest number to the control category.

Screenshot 2021-03-03 14.23.12.png


Given this SPSS requirement, how can we assign 1 to Allopathy and 1 to Pregnant, thereby making them the control category in the same way as 1-Under five age group and 1-Lowest monthly group, which do convey the “Control/No/Low effect” category?
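
The First/Last question above can be sketched in a few lines of Python. This is only an illustration under two assumptions: that "First" refers to the category with the lowest code and "Last" to the one with the highest code, and that the choice can be made separately for each categorical covariate. The function `reference_code` and the `chosen_rule` settings are hypothetical; the value labels are the ones given in the post.

```python
def reference_code(codes, rule):
    """Return the code that a 'first'/'last' rule would pick as reference.

    Assumption: 'first' means the lowest code value, 'last' the highest.
    """
    ordered = sorted(set(codes))
    if rule == "first":
        return ordered[0]
    if rule == "last":
        return ordered[-1]
    raise ValueError("rule must be 'first' or 'last'")


# Value labels as given in the post.
value_labels = {
    "Nature of Treatment": {1: "Allopathy", 2: "Ayush", 9: "Others"},
    "Pregnancy": {1: "Pregnant", 2: "Not Pregnant"},
}

# Hypothetical per-variable choices of the reference rule.
chosen_rule = {"Nature of Treatment": "first", "Pregnancy": "last"}

reference_labels = {}
for variable, labels in value_labels.items():
    code = reference_code(labels.keys(), chosen_rule[variable])
    reference_labels[variable] = labels[code]
    print(f"{variable}: reference = {code} ({labels[code]})")
```

With these settings, "Not Pregnant" becomes the comparison group for Pregnancy without any recoding, while Allopathy (the most common treatment) stays the comparison group for Nature of Treatment.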

I hope I am clear sir. Please guide on this as I’m very confused on this aspect.

Thanks and Regards,

Dr. Sukhmani.

Neeraj Kaushik

Mar 4, 2021, 1:02:54 AM
to dataanalysistraining
Dear Sukhmani

Here're my inputs:

1. Yes, association is checked by the Chi-square test.

2. The term prediction is used in regression, difference is used in t-test/ANOVA, and association is used in cross-tabulation and the Chi-square test.
If you would like to understand the difference between the 3 terms (Bivariate Analysis) in detail, please watch the 2 videos given on 

For example, suppose I'm working with 2 variables, Gender (Boys and Girls) and Accommodation (Hostelers and Day Scholars). The data are given below:

image.png

The Chi-square test might tell me that most of the boys are day scholars and most of the girls are hostelers, and that there is a significant association.
But can I precisely predict, for a new girl, whether she'll be a day scholar or a hosteler?
That's why there is the concept of the odds ratio here, and we'll say that these are the chances that she'll be a hosteler.

You may like to understand this concept by watching videos of https://www.youtube.com/playlist?list=PLzUJUtTJcj8SNnpk7wYUNNyGg4pDA4Xo-

3. Log-Linear Analysis will be the best to use in this scenario.
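
For point 2 above, the odds and the odds ratio can be worked out directly from a 2x2 table. The following Python sketch uses hypothetical counts (the data shown in the image are not reproduced here), and all variable names are made up for the illustration.

```python
# Hypothetical counts, NOT the data from the image in the post.
girls = {"hosteler": 30, "day_scholar": 10}
boys = {"hosteler": 15, "day_scholar": 45}

# Odds of being a hosteler within each gender.
odds_girls = girls["hosteler"] / girls["day_scholar"]
odds_boys = boys["hosteler"] / boys["day_scholar"]

# Odds ratio (girls vs boys), written as a cross-product of the four cells.
odds_ratio = (girls["hosteler"] * boys["day_scholar"]) / (
    girls["day_scholar"] * boys["hosteler"]
)

# Estimated chance that a new girl is a hosteler.
prob_girl_hosteler = girls["hosteler"] / (girls["hosteler"] + girls["day_scholar"])

print(f"odds (girls) = {odds_girls:.2f}")
print(f"odds (boys)  = {odds_boys:.2f}")
print(f"odds ratio (girls vs boys) = {odds_ratio:.2f}")
print(f"P(hosteler | girl) = {prob_girl_hosteler:.2f}")
```

With these made-up counts, the statement for a new girl would be a probability (or odds) of being a hosteler, rather than a certain yes/no prediction.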

Best wishes

Sukhmani Singh

Mar 4, 2021, 1:32:00 AM
to dataanalys...@googlegroups.com
Dear Sir,
Many many thanks for such a detailed and painstaking explanation. 
I will study what you have explained and get back to you, in case of further doubts.
 Regards.

Pankaj Gupta

Mar 4, 2021, 3:55:44 AM
to dataanalys...@googlegroups.com
Dear Ma'am,
1. For logistic regression, the DV should be categorical, which is fine in your case, but the IDVs should be continuous, i.e. interval or ratio data. In your case the IDVs are also categorical.

2. The test or technique is decided by the level of measurement, i.e. nominal, ordinal, interval or ratio data.

3. When both the DV and the IDVs are categorical, cross-tabulation with the chi-square test is used, as suggested by Dr. Neeraj Kaushik.

Pankaj Gupta
Assistant Professor
Department of Commerce
Ramanujan College
University of Delhi


Sukhmani Singh

Mar 4, 2021, 9:05:46 AM
to DataAnalysis
Dear Pankaj Sir,
Thank you so much for your valuable inputs. Really appreciated!

1. I agree that logistic regression is used where the DV is categorical: binary logistic regression if the DV is binary, and multinomial logistic regression if the DV has more than 2 categories. But I beg to differ slightly on the IVs: they can be categorical or continuous. Here is a screenshot from a book on SPSS by Andy Field (2020):
LR.jpeg

2. I absolutely agree that the statistical technique to be applied depends on the nature of the data. But the research objectives and hypotheses also play a major role.

3. Yes, I will study more about these two techniques as suggested by Dr. Neeraj Kaushik. He is a remarkable expert, a class apart!
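
On point 1 above, a small numerical check may help. The Python sketch below fits a binary logistic regression with a single categorical IV entered as a 0/1 dummy, using a plain Newton-Raphson loop, and compares the slope with the log odds ratio computed from the same 2x2 table. The counts and the function name `fit_logistic_one_dummy` are hypothetical; this is a sketch of the general technique, not the SPSS procedure.

```python
import math


def fit_logistic_one_dummy(groups, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson.

    groups: list of (x, successes, total) with x equal to 0 or 1.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, successes, total in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            residual = successes - total * p      # score contribution
            weight = total * p * (1.0 - p)        # information contribution
            g0 += residual
            g1 += x * residual
            h00 += weight
            h01 += x * weight
            h11 += x * x * weight
        # Solve the 2x2 system (information matrix) * step = score.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical counts of "High expenditure" by a binary IV:
# x = 0 (reference group): 40 high out of 100
# x = 1 (comparison group): 30 high out of 50
groups = [(0, 40, 100), (1, 30, 50)]

b0, b1 = fit_logistic_one_dummy(groups)
log_odds_ratio = math.log((30 / 20) / (40 / 60))

print(f"intercept b0 = {b0:.4f}")
print(f"slope b1     = {b1:.4f}")
print(f"log odds ratio from the table = {log_odds_ratio:.4f}")
print(f"exp(b1) = {math.exp(b1):.4f}")
```

The slope for a dummy-coded categorical IV reproduces the log odds ratio of that category against the reference group, which is one way to see that categorical IVs are usable in logistic regression.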

Thank you and Regards.

Neeraj Kaushik

Mar 5, 2021, 2:13:54 AM
to dataanalysistraining
Dear Sukhmani/Pankaj

I can understand the confusion.

I've prepared a scenario showing the essentials of each technique applied to non-metric variables.

image.png

I hope this will clarify the difference between cross-tabulation (Chi-square test), logistic regression and Log-Linear Analysis.

Best wishes

Neeraj Kaushik

Mar 5, 2021, 2:18:10 AM
to dataanalysistraining
Dear Sukhmani
The correct statistical technique in your case is logistic regression, but the output will be extremely difficult to read and interpret because all your IDVs are non-metric, so there will be a lot of base categories (1 base category for every single IDV).
That's why I recommend the use of Log-Linear Analysis, which is much simpler.
I will soon make videos on it.
Best wishes.
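
To give a rough sense of the output size mentioned in this message, the short Python sketch below counts the dummy coefficients a logistic regression would report for the four categorical IDVs listed in the thread, assuming each variable with k categories contributes k - 1 coefficients compared against its own base category. The dictionary name and layout are made up for the illustration.

```python
# Number of categories per IDV, as listed in the thread.
categories = {
    "Age Category": 5,
    "Monthly Consumer Expenditure": 4,
    "Nature of Treatment": 3,
    "Pregnancy": 2,
}

# Each variable contributes (k - 1) dummy coefficients; one level is the base.
dummy_counts = {name: k - 1 for name, k in categories.items()}
total_dummies = sum(dummy_counts.values())

for name, count in dummy_counts.items():
    print(f"{name}: {count} coefficient(s), 1 base category")
print(f"total dummy coefficients: {total_dummies}")
print(f"total parameters including the intercept: {total_dummies + 1}")
```

Each of those coefficients is read relative to the base category of its own variable, which is where the difficulty of interpretation comes from.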

Sukhmani Singh

Mar 5, 2021, 4:43:13 AM
to DataAnalysis

Dear Neeraj Sir,

I am greatly indebted for your inputs and moreover, your valuable time that you are putting into this discussion.

Thank you for this wonderful explanation (2 tables) explaining the Basis of Technique applied to Non-metric (categorical) variables.

In the first table, in the footnote, you have mentioned - *Results of Non-Metric Variables will be computed from the perspective of the base category.

Now, here is my main confusion, regarding the coding of the base (control) category.

In your second message, you pointed out that logistic regression can be applied in my case but that the output will be complex to interpret. I totally agree.

But in general, for logistic regression, should we or shouldn't we code the base (control) category of all the independent variables with the same number (e.g. 0 or 1)?
SPSS itself codes the base (control) category as 0 while calculating the result, right, sir?


So it becomes very necessary for the researcher to code the categories of the IDVs carefully, especially for logistic regression. Moreover, while entering data in SPSS for LR, we have to specify which of our IDVs are categorical, along with the Reference Category (Last or First) (kindly see the screenshot below). “Last” is chosen when we have used the highest number to code the control (base) category, and “First” is chosen when we have used the lowest number to code it, in all the categorical IDVs. Is that right, sir?

Screenshot 2021-03-05 14.32.28.png

Now, in my case…

The Dependent Variable (Out of Pocket Expenses) is coded as: 1-Low Expenditure, 2-High Expenditure. Here 1 is assigned to the “Control/Low effect” category.

Independent variables:

1. Age Category, coded as: 1-Under five age group, 2-Child, 3-Adolescents, 4-Adults, and 5-Senior Citizen. Here 1 is assigned to the “Control/Low effect” category.

2. Monthly Consumer Expenditure, coded as: 1-Lowest monthly group, 2-Lower middle, 3-Upper middle, 4-Highest monthly exp group. Here 1 is assigned to the “Control/Low effect” category.

3. Nature of Treatment, coded as: 1-Allopathy, 2-Ayush, 9-Others. Here 1 is assigned to Allopathy. But in the actual data 85% of respondents use Allopathy and only 5% use Others. Is it correct to assign 1 to Allopathy, which would make it the “Control/No/Low effect” category?

4. Pregnancy, coded as: 1-Pregnant, 2-Not Pregnant. Here also 1 is assigned to Pregnant, which would make it the “Control/Low effect” category. But how can “Pregnant” be considered the “Control/No/Low effect” category, especially in conjunction with the other IVs?

Sir, I am extremely sorry, but I am still not clear on the coding aspect, i.e. how can “Allopathy” and “Pregnant” be considered the base (control) category in logistic regression (in my opinion they cannot be regarded as having no or less effect), whereas in the other IDVs “Under five age group” and “Lowest monthly group” are regarded as the base (control) category?

Sir, kindly explain this in terms of your statement “*Results of Non-Metric Variables will be computed from the perspective of the base category.” Or, please pardon my asking, how would you code them while calculating LR?

Lastly, maybe coding is not that important in the case of Chi-Square and Log-Linear Analysis, but I think it is an important aspect in the case of LR.
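
One way to look at the base-category question is to compute the same comparison in both directions. The Python sketch below uses hypothetical counts of High and Low expenditure for Pregnant and Not Pregnant respondents (not the survey data), with a made-up helper `odds_ratio`. It suggests that swapping the base category only inverts the odds ratio (and flips the sign of the log odds ratio); it does not change what the data say.

```python
import math


def odds_ratio(target, base):
    """Odds ratio of 'High' for target vs base; each is (high, low) counts."""
    target_odds = target[0] / target[1]
    base_odds = base[0] / base[1]
    return target_odds / base_odds


# Hypothetical (High, Low) expenditure counts.
pregnant = (30, 20)
not_pregnant = (40, 60)

# Base category = Not Pregnant.
or_pregnant_vs_not = odds_ratio(pregnant, not_pregnant)
# Base category = Pregnant.
or_not_vs_pregnant = odds_ratio(not_pregnant, pregnant)

print(f"Pregnant vs Not Pregnant: OR = {or_pregnant_vs_not:.3f}, "
      f"log OR = {math.log(or_pregnant_vs_not):+.3f}")
print(f"Not Pregnant vs Pregnant: OR = {or_not_vs_pregnant:.3f}, "
      f"log OR = {math.log(or_not_vs_pregnant):+.3f}")
print(f"product of the two ORs = {or_pregnant_vs_not * or_not_vs_pregnant:.3f}")
```

Read this way, the base category is just the group the others are compared against; choosing it is about which comparison is easiest to report, not about declaring that group to have "no effect".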

Please guide, sir!

Thank you and Regards,

Neeraj Kaushik

Mar 7, 2021, 8:27:34 PM
to dataanalysistraining
Dear Sukhmani
I request you to watch the first 2 videos, as they answer your query.
Best wishes

Sukhmani Singh

Mar 8, 2021, 2:24:37 AM
to DataAnalysis
Thank you very much sir for your guidance. I will surely look into it.
Regards