Need help ASAP!!!

serikbek...@gmail.com

Jun 1, 2017, 10:04:27 AM

to Common Crawl

Hello everyone,

I need help from experienced people on this forum.

I'm doing an MSc project, but I have not identified the main problem yet. Could someone please help me with the research or give me some advice?

The topic is "The Common Crawl know all: analysing web information leakage through indirect means", and I have to find at least 3 main problems and propose a way to solve each of them.

Thank you!

Ananta Gupta

Jun 1, 2017, 10:16:12 AM

to common...@googlegroups.com

Some problems you could work on are:

- detecting spam pages while crawling

- reducing the time spent on indexing

- detecting duplicate pages
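
To make the last item concrete, here is a minimal sketch of one common way to flag near-duplicate pages: compare sets of word shingles with Jaccard similarity. Nothing in this thread specifies a method; the function names, the shingle size `k=3` and the `0.8` threshold are illustrative assumptions, and a real crawler would more likely use a scalable variant such as MinHash or SimHash.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.8):
    """True if the shingle sets overlap at least `threshold` (an assumed cut-off)."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

Pairwise comparison like this is quadratic in the number of pages, so it only suits small samples; it is meant to show the idea, not to scale to a full crawl.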

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl+unsubscribe@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at https://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.

serikbek...@gmail.com

Jun 1, 2017, 10:26:48 AM

to Common Crawl

Thank you so much. I don't have any experience with this subject and I don't know how to start writing. Can you tell me what kind of tools I can use, and do you know any websites where I can read related information?

On Thursday, June 1, 2017 at 15:04:27 UTC+1, serikbek...@gmail.com wrote:

Ananta Gupta

Jun 1, 2017, 10:50:42 AM

to common...@googlegroups.com

As a beginner, I explored http://nutch.apache.org/ and ran Apache Nutch (a web crawler) by following the tutorial at https://wiki.apache.org/nutch/NutchTutorial. That helped me understand how it works end to end and find gaps where the existing system could be improved.

You can run the binary version of Apache Nutch on Linux; for the source version you need to install Eclipse.

To find solutions, explore the research papers: read how much work has already been done to improve the existing system and what could still be improved. The improvements you think are feasible can be written up as your proposed solutions.

All the best for your project!


serikbek...@gmail.com

Jun 1, 2017, 11:01:54 AM

to Common Crawl

Thank you so much, you helped me a lot! I will try everything you suggested. If I have any problems with it, can I ask you?


Sebastian Nagel

Jun 2, 2017, 4:29:58 AM

to common...@googlegroups.com

Hi,

"analysing web information leakage through indirect means" sounds somewhat vague. I would in any
case ask your advisor to make it more precise. It's a broad topic...

One pointer:
Stephen Merity's "Measuring the impact of Google analytics" [1]
It's about "leakage" of the browsing history of individuals indirectly by tracking the page/site access.

Best,
Sebastian

[1] https://www.slideshare.net/CommonCrawl/measuring-theimpactgoogleanalytics-37370713
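
As a rough illustration of the kind of measurement that pointer is about, the sketch below checks raw HTML for common Google Analytics markers (the `ga.js`, `analytics.js` and `gtag/js` script URLs and `UA-…` tracking IDs) and computes the share of pages carrying them. It is not the method used in [1]; the function names and regular expressions are assumptions made for illustration, and the pages are assumed to be already extracted as HTML strings, for example from Common Crawl WARC records.

```python
import re

# Script URLs used by the legacy (ga.js), Universal (analytics.js) and gtag.js snippets.
GA_SCRIPT_RE = re.compile(
    r"google-analytics\.com/(?:ga|analytics)\.js|googletagmanager\.com/gtag/js",
    re.IGNORECASE,
)
# Universal Analytics property IDs look like UA-<account>-<property>.
GA_ID_RE = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")


def find_ga_markers(html):
    """Return whether a GA script URL appears and which tracking IDs are present."""
    return {
        "has_script": bool(GA_SCRIPT_RE.search(html)),
        "tracking_ids": sorted(set(GA_ID_RE.findall(html))),
    }


def ga_share(pages):
    """Fraction of pages (HTML strings) that contain any GA marker."""
    pages = list(pages)
    if not pages:
        return 0.0
    hits = 0
    for html in pages:
        markers = find_ga_markers(html)
        if markers["has_script"] or markers["tracking_ids"]:
            hits += 1
    return hits / len(pages)
```

String matching like this misses trackers loaded dynamically or through tag managers, so the result is best read as a lower bound.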


Aigerim Serikbekova

Jun 2, 2017, 8:33:50 AM

to common...@googlegroups.com

Thank you, Sebastian. I will look into all the information you gave me.


heng.y...@gmail.com

Apr 19, 2019, 8:40:01 AM

to Common Crawl

Hi,

I'm doing a similar MSc project now. Can I ask you some questions?

Thank you!!

On Thursday, June 1, 2017 at 3:04:27 PM UTC+1, serikbe...@gmail.com wrote:
