Week 1 - Is an email address structured or unstructured data?


Darria Osline

Nov 10, 2020, 12:09:00 PM
to Discussion forum for Statistics for Data Science I
An email address contains text, numbers and special characters. How does that count as structured data? Kindly explain.

Anand Iyer

Nov 11, 2020, 8:31:35 PM
to Discussion forum for Statistics for Data Science I, Darria Osline
The characters used don't define structured and unstructured data. The pattern does. If there's a clearly identifiable and usable pattern in the data, it's structured.

Email addresses are short (a commonly cited limit is 320 characters) and always have a part before the @ symbol (the user part) and a part after it (the domain part). That's a pattern good enough for our purpose.
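To make that pattern concrete, here is a minimal Python sketch. The function name split_email and the sample address are made up for illustration, and the 320-character check simply reuses the figure quoted above; this is not a full email validator.

```python
def split_email(address: str) -> tuple[str, str]:
    """Split an address into (user part, domain part) at the last '@'."""
    # Length limit as quoted in the post above (an assumption, not a validator).
    if len(address) > 320:
        raise ValueError("address too long")
    user, sep, domain = address.rpartition("@")
    # rpartition returns an empty separator when no '@' is present.
    if not sep or not user or not domain:
        raise ValueError("expected something of the form user@domain")
    return user, domain


print(split_email("darria@example.com"))  # ('darria', 'example.com')
```

Because every valid address can be split this way, the field has a usable pattern, which is the point being made here.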

Antony

Nov 12, 2020, 12:57:15 AM
to Discussion forum for Statistics for Data Science I, darria...@gmail.com
Let's say you have dataset 1 in the format below:

CandidateName | Contactnum | Pincode | Email | Address | City
Darria | 1234567890 | 5860027 | Darria @something.com | #hno 1/22, Abcd road | some city 

Dataset 1 is structured, since you can identify every attribute of the observation provided.

Let's say you have dataset 2 in the format below:

Candidate Information
Darria who is from some city , residing at #hno 1/22, Abcd road , can be reached at 1234567890 mobile and email id Darria @something.com

Dataset 2 is unstructured, because you cannot run any analysis directly on this text field. You first have to make it structured, similar to dataset 1.

Whether a dataset is structured or not depends on the analysis you want to perform and whether that analysis can be done on the dataset in the raw form in which it was provided to you.
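A rough Python sketch of the difference, using the two sample datasets. The regular expressions are simplified assumptions for this example only, and the space inside the sample email address is dropped on the assumption that it is a typo.

```python
import re

# Dataset 1: one pipe-delimited row; every attribute sits in its own field.
header = "CandidateName | Contactnum | Pincode | Email | Address | City"
row = "Darria | 1234567890 | 5860027 | Darria@something.com | #hno 1/22, Abcd road | some city"

record = dict(zip(
    (h.strip() for h in header.split("|")),
    (v.strip() for v in row.split("|")),
))
print(record["Email"])  # attribute lookup works straight away

# Dataset 2: the same facts in one sentence; fields must be extracted first.
text = ("Darria who is from some city, residing at #hno 1/22, Abcd road, "
        "can be reached at 1234567890 mobile and email id Darria@something.com")

phone = re.search(r"\b\d{10}\b", text)                     # assumes 10-digit numbers
email = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)   # simplified email pattern
extracted = {
    "Contactnum": phone.group() if phone else None,
    "Email": email.group() if email else None,
}
print(extracted)
```

With dataset 1 the columns are available immediately. With dataset 2 you need extraction rules before any analysis, and attributes such as the address or city would need more than a simple pattern.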

Darria Osline

Nov 12, 2020, 12:59:35 AM
to Antony, Discussion forum for Statistics for Data Science I
Thank you guys.

Anand Iyer

Nov 12, 2020, 1:18:15 AM
to Antony, Discussion forum for Statistics for Data Science I, darria...@gmail.com
Very good explanation, Antony!

On Thu, Nov 12, 2020 at 11:27 AM Antony <royd...@gmail.com> wrote:
--
You received this message because you are subscribed to a topic in the Google Groups "Discussion forum for Statistics for Data Science I" group.
To unsubscribe from this topic, visit https://groups.google.com/a/nptel.iitm.ac.in/d/topic/ma1002-discuss/KpMMg0AcXXk/unsubscribe.
To unsubscribe from this group and all its topics, send an email to ma1002-discus...@nptel.iitm.ac.in.
To view this discussion on the web visit https://groups.google.com/a/nptel.iitm.ac.in/d/msgid/ma1002-discuss/5230ddb8-cfc2-49f0-947e-add3136cdfd4o%40nptel.iitm.ac.in.


--
Cheers,

Antony

Nov 12, 2020, 1:30:56 AM
to Discussion forum for Statistics for Data Science I, darria...@gmail.com
@Anand, it's my daily bread :P

Just to add to my explanation, let's say you have dataset 3:
Candidate | Email Id
Darria | Darria123 at gmail . com
Anand | Anand456 at yahoo . com

The analysis you want to perform requires you to find the email provider (gmail, yahoo, etc.) of each candidate. This is not directly available in the email field, but since you know the general format of an email ID (something @ something . something), you can do a basic string operation on the email field to find the provider. So this makes "email id" a semi-structured data field in this dataset.
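As a sketch of that basic string operation in Python (the helper name email_provider is made up, and it assumes the addresses are written exactly as in dataset 3, with " at " in place of "@" and spaces around the dot):

```python
def email_provider(raw: str) -> str:
    """Return the provider from an address written like 'name at gmail . com'."""
    # Undo the spaced-out format: ' at ' becomes '@', remaining spaces are dropped.
    cleaned = raw.replace(" at ", "@").replace(" ", "")
    _, _, domain = cleaned.partition("@")
    # The provider is the first label of the domain, e.g. 'gmail' in 'gmail.com'.
    return domain.split(".")[0].lower()


dataset3 = {
    "Darria": "Darria123 at gmail . com",
    "Anand": "Anand456 at yahoo . com",
}
providers = {name: email_provider(addr) for name, addr in dataset3.items()}
print(providers)  # {'Darria': 'gmail', 'Anand': 'yahoo'}
```

The field needs a small transformation before it answers the question, but the known format makes that transformation trivial.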

Examples of structured files: files where data is in rows/columns format
Examples of unstructured files: log (text) files, audio/video files
Examples of semi-structured data: XML files, email IDs, etc.

On Tuesday, November 10, 2020 at 10:39:00 PM UTC+5:30, Darria Osline wrote:

Anand Iyer

Nov 12, 2020, 1:37:12 AM
to Antony, Discussion forum for Statistics for Data Science I, darria...@gmail.com
Ah, you've brought in a new variation now, called semi-structured.

Given only the options structured or unstructured, you'd classify an email address as structured, right? Or do you have a different opinion?

BTW, are you working as a data analyst/scientist, @Antony?


Antony

Nov 12, 2020, 1:54:49 AM
to Discussion forum for Statistics for Data Science I, darria...@gmail.com
As per sample dataset 3, if I have to do some analysis on the email provider and semi-structured is not one of the available options, I will choose structured.
Even with semi-structured given as one of the options, I think this is one of those questions that leads to ambiguity depending on how you personally interpret it, because doing a basic string operation is no big deal in the given example. But if the file were in XML format, it would definitely have to be semi-structured.
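For comparison, here is a small Python sketch of why XML usually counts as semi-structured. The XML document below is invented for this example: the tags label each value, but records do not have to share the same set of fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; the second candidate carries an extra <city> field.
xml_text = """
<candidates>
  <candidate>
    <name>Darria</name>
    <email>Darria123@gmail.com</email>
  </candidate>
  <candidate>
    <name>Anand</name>
    <email>Anand456@yahoo.com</email>
    <city>some city</city>
  </candidate>
</candidates>
"""

root = ET.fromstring(xml_text)
# Each record becomes a dict keyed by tag name; the keys can differ per record.
rows = [{child.tag: child.text for child in cand}
        for cand in root.findall("candidate")]
print(rows)
```

To get a rows/columns table out of this, you still have to decide what to do with fields that are missing from some records.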

@Anand, I do work in the data analytics field, but mostly in descriptive analytics. This course is my attempt to get into more advanced analytics. But if the mock test is anything to go by, I will most probably say tata bye-bye in the qualifier, since I need Excel to do math; I can't work with pen and paper anymore, even for single-digit calculations :)). Not to mention aspirations and labiowhat in English.
