CC v0.2 Released???

Lewis John Mcgibbney

Nov 2, 2012, 5:50:16 PM

to crawler...@googlegroups.com

Hi All,

Is v0.2 actually released?
The trunk CHANGES.txt file [0] indicates that it has been released; however, I don't see an announcement anywhere, and it is certainly not uploaded to Maven Central...

Any idea?

Best

Lewis

[0] http://crawler-commons.googlecode.com/svn/trunk/CHANGES.txt

Fuad Efendi

Nov 2, 2012, 6:16:50 PM

to crawler...@googlegroups.com

I have made a few important fixes and improvements (including “redirect with session cookies”, which avoids session IDs in URLs, and true Keep-Alive), but it seems no one is interested :(

https://github.com/FuadEfendi/crawler-commons

Everything (5 issues) was initially tracked at http://code.google.com/p/crawler-commons/issues/list but unfortunately this mailing list doesn’t receive any notifications…

No idea about Maven… you need to download the project, and “mvn clean install” will put it into your local repository for future use…

--

Fuad Efendi

http://www.tokenizer.ca

--
You received this message because you are subscribed to the Google Groups "crawler-commons" group.
Visit this group at http://groups.google.com/group/crawler-commons?hl=en-US.

Ken Krugler

Nov 2, 2012, 6:25:34 PM

to crawler...@googlegroups.com

Hi Fuad,

I've got your emails on the top of my open-source stack.

I had to work through releasing cascading.utils first, then cascading.avro.

So this is next in line.

-- Ken

--------------------------------------------

http://about.me/kkrugler

+1 530-210-6378

Ken Krugler

Nov 2, 2012, 6:31:14 PM

to crawler...@googlegroups.com

Hi Lewis,

On Nov 2, 2012, at 2:50pm, Lewis John Mcgibbney wrote:

> Hi All,
>
> Is v0.2 actually released?
>
> The trunk CHANGES.txt file [0] indicates that it has been released however I don't see an announcement anywhere and it is certainly not uploaded to Maven Central…

I don't believe 0.2 has been released. I think the CHANGES.txt file is a work in progress, tracking what will be in 0.2.

From the changes list at: https://code.google.com/p/crawler-commons/source/list

…it looks like Julien added the CHANGES.txt file on July 25th of last year, after the 0.1 release happened.

And yes, we should do a 0.2 release sometime in the near future. Maybe after I roll in Fuad's changes?

-- Ken

[0] http://crawler-commons.googlecode.com/svn/trunk/CHANGES.txt

Fuad Efendi

Nov 2, 2012, 6:31:00 PM

to crawler...@googlegroups.com

Thanks Ken,

Just a late thought: “follow redirects” is not always good, because thousands of pages can redirect to a logon screen; but without auto-redirect we will have session IDs in URLs. The Nutch approach is “classic”, but sometimes we may need auto-redirect… I want to put this on a wiki somewhere… “follow, but keep track of where you haven’t been yet” :)

-Fuad

Lewis John Mcgibbney

Nov 2, 2012, 6:35:29 PM

to crawler...@googlegroups.com

Hi Fuad,

On Fri, Nov 2, 2012 at 10:16 PM, Fuad Efendi <fuad....@tokenizer.ca> wrote:
> I have made a few important fixes and improvements (including “redirect with
> session cookies” which allows to avoid session IDs in URL, and truly
> Keep-Alive); but it seems no one is interested :(
>
> https://github.com/FuadEfendi/crawler-commons

I think it would be best to open your issues on the issue tracker... I
mean all of them.

I see you have
Issue 5: HttpComponents Upgrade: 4.1.1 -> 4.2.1
Issue 6: FetchedResult doesn't stores HTTP Status Code
Do you have anything else you would like to contribute to the project?
I think you should open issues and attach patches if the answer is yes.
You seem to have a good grasp on the project and it looks like your
patches are very welcome...

>
>
> Everything (5 issues) tracked initially at
> http://code.google.com/p/crawler-commons/issues/list but unfortunately this
> mailing list doesn’t receive any notification…

I get notifications both on my email and also the Google group gets
notifications as well. Maybe you need to re-jig your group settings.

Yeah it seems like releasing maven artifacts has been a bit of a pain
for the CC team so far. Maybe this is also something which should be
improved.

Lewis

Fuad Efendi

Nov 2, 2012, 6:42:00 PM

to crawler...@googlegroups.com

> I think it would be best to open your issues on the issue tracker... I mean
> all of them.

Yes, I have all of them:

http://code.google.com/p/crawler-commons/issues/detail?id=2
http://code.google.com/p/crawler-commons/issues/detail?id=3
http://code.google.com/p/crawler-commons/issues/detail?id=4
http://code.google.com/p/crawler-commons/issues/detail?id=5
http://code.google.com/p/crawler-commons/issues/detail?id=6

And I generated patches (thanks to GitHub)

-Fuad

Lewis John Mcgibbney

Nov 2, 2012, 6:51:09 PM

to crawler...@googlegroups.com

Apologies, Fuad, I did not realize all of these were yours.

--
Lewis

Fuad Efendi

Nov 2, 2012, 6:52:20 PM

to crawler...@googlegroups.com

Hi Lewis,

I forgot the latest (greatest ;-}) issue, already fixed at
https://github.com/FuadEfendi/crawler-commons
- upgrade to HttpComponents 4.2.2

But it is so trivial... 0.01% in comparison with importance of other five...

-Fuad.

P.S.
I am working on a special "Vertical Crawl" or "Deep Crawl" or even
"Distributed Executor" framework, with everything managed by ZooKeeper (for
instance, we can configure 3 "amazon crawlers" in a cluster of 15 nodes,
"multiton" vs. "singleton"). But I have just started... in a week I'll push
an out-of-the-box WAR file with a nice GUI to play with. Inspired by the LILY
framework's Solr Indexer architecture.
https://github.com/FuadEfendi/tokenizer

Julien Nioche

Nov 3, 2012, 5:52:43 AM

to crawler...@googlegroups.com

Hi

The comments in CHANGES.txt indicate what will be new when the next version is released, not that it has been released.

J.

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Julien Nioche

Nov 3, 2012, 6:09:08 AM

to crawler...@googlegroups.com

Fuad,

Generating patches from your own copy of the repo on Git is not really helpful. The code for crawler-commons is under SVN at Google Code. Sending patches in the right format is probably the best thing to do if you expect people to review your contributions and integrate them.

Similarly, attaching N patches to a single issue (http://code.google.com/p/crawler-commons/issues/detail?id=5) definitely doesn't help, especially if you are drifting from the original point.

It's great that you have your clone of the project on GitHub, and your suggestions are very welcome, but in the future please send patches that can be applied with the command 'patch -p0' and split your issues into smaller issues (or bigger batch patches).

Thanks

Julien

Fuad Efendi

Nov 3, 2012, 1:08:07 PM

to crawler...@googlegroups.com

Hi Julien,

GitHub supports SVN too; you can simply check the "commit history" in my repo and you will see all the previous SVN history by kkrugle_lists, digitalpeble, etc. I can even commit to SVN directly from my GitHub repo, using the "Tower" Mac OS client.

How do I regenerate separate patches for separate issues? I don't know. Any suggestions? Thanks,

--

Fuad

Fuad Efendi

Nov 3, 2012, 1:22:43 PM

to crawler...@googlegroups.com

Perhaps I can check out from SVN, then overwrite with my version, and generate a standard SVN patch covering a few issues: a single patch. I don't know any better option… Is that OK?

-Fuad

--

Fuad Efendi

416-993-2060

Tokenizer Inc., Canada

http://www.tokenizer.ca

Julien Nioche

Nov 4, 2012, 9:46:30 AM

to crawler...@googlegroups.com

A single patch for all the issues would not be very helpful, and someone else would have to sort it out. One way of doing it would be for you to edit the patch file manually and separate the various bits. Alternatively, if the modifications for a given issue are not big, it would be easier to re-modify the code afresh and regenerate a patch. Up to you really.
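When the changes for different issues touch different files, a Git clone can also produce one patch per issue by limiting the diff to specific paths. This is a third option beyond the two described above, shown as a minimal sketch that assumes `git` is installed; the throwaway repository, file names, contents, and commit messages are placeholders, not taken from crawler-commons:

```shell
#!/bin/sh
set -e
# Throwaway repository; file names and contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "dev@example.com"
git config user.name "Example Dev"
mkdir -p src
echo "deps v1" > pom.xml
echo "parser v1" > src/Parser.txt
git add pom.xml src/Parser.txt
git commit -q -m "baseline"
base=$(git rev-parse HEAD)

# Two unrelated changes end up in the same history.
echo "deps v2" > pom.xml
echo "parser v2" > src/Parser.txt
git commit -q -a -m "mixed changes"

# A pathspec after "--" limits the diff to the named files,
# which yields one patch per issue.
git diff --no-prefix "$base" HEAD -- pom.xml > issue-deps.patch
git diff --no-prefix "$base" HEAD -- src/Parser.txt > issue-parser.patch

# Each patch should contain exactly one "+++" file header.
grep -c '^+++ ' issue-deps.patch issue-parser.patch
```

This only helps when issues map cleanly onto separate files; changes for two issues inside the same file would still need manual splitting.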

Julien

Fuad Efendi

Nov 4, 2012, 10:49:25 AM

to crawler...@googlegroups.com

Hi Julien,

I am just learning proper "usage patterns", and I found it interesting that the huge Liferay community uses SVN together with GitHub to manage patched versions… I also know that many Lucene/Solr committers use GitHub to manage their contributions, but I am new to this.

And I think the proper "pattern" would be to have a separate branch for each bug/improvement I am working on; but it is a little late.

What I did is indeed not a big deal (for a single patch):

- A few minor changes such as pom.xml, and a small (very important) fix for the Sitemaps Parser (I reported this bug a year ago!!! Nobody encountered it, or nobody shared the fix?! Unbelievable…)
- Big changes with the new version of HttpComponents and SimpleHttpFetcher

Fortunately I avoided any "reformatting", so the single patch will be "readable"; I'll do it in a few days…

-Fuad

Julien Nioche

Nov 5, 2012, 4:18:06 AM

to crawler...@googlegroups.com

> I am just learning proper "usage patterns" and I found interesting that huge Liferay community uses SVN together with GitHub to manage patched versions… and I also know that many Lucene/Solr committers use GitHub to manage their contributions, but I am new to this.

Lucene committers probably have their own copy of the repo on GitHub to test their modifications, but the actual contributions are done just as we do, i.e. an SVN patch attached to the issue, just like with any Apache project.

Looking at http://wiki.apache.org/hadoop/GitAndHadoop there should be a way of generating patches with Git (git diff --no-prefix) that could be used with the standard 'patch -p0' command. Using this would allow you to go ahead with your own GitHub repo and experiment with all sorts of bugfixes and improvements while contributing back to CC and generate patch files that can easily be used against our SVN repo.
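A minimal sketch of that round trip, assuming `git` and `patch` are installed; the workspace layout, file names, and contents are placeholders rather than the real crawler-commons tree, and a plain directory stands in for the SVN working copy:

```shell
#!/bin/sh
set -e
# Placeholder workspace: "clone" stands in for a Git copy of the project,
# "checkout" for a pristine SVN working copy at the same revision.
work=$(mktemp -d)
mkdir -p "$work/clone/src" "$work/checkout/src"
echo "version 1" > "$work/clone/src/Example.txt"
echo "version 1" > "$work/checkout/src/Example.txt"

cd "$work/clone"
git init -q .
git config user.email "dev@example.com"
git config user.name "Example Dev"
git add src/Example.txt
git commit -q -m "baseline"
base=$(git rev-parse HEAD)

echo "version 2" > src/Example.txt
git commit -q -a -m "fix for one issue"

# --no-prefix omits Git's default a/ and b/ path prefixes,
# so the paths in the patch are relative to the project root.
git diff --no-prefix "$base" HEAD > "$work/issue.patch"

# -p0 tells patch to use those paths exactly as written.
cd "$work/checkout"
patch -p0 < "$work/issue.patch"
cat src/Example.txt
```

Without `--no-prefix`, the paths in the patch would start with `a/` and `b/`, and the maintainer would need `patch -p1` instead.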

J.

Fuad Efendi

Nov 5, 2012, 8:48:48 AM

to crawler...@googlegroups.com

Hi Julien,

Thanks for the link, very interesting… crawler-commons is not there yet:

https://github.com/apache

We need to create an (automated) mirror too, also because GitHub has much higher visibility than Google Code, and we can track "forks" and "pull requests".

I'll submit a patch shortly…

--

Fuad Efendi

416-993-2060

Tokenizer Inc., Canada

http://www.tokenizer.ca

Fuad Efendi

Nov 5, 2012, 9:02:03 AM

to crawler...@googlegroups.com

Please find the patch attached, and let me know if it doesn't work in your environment (Windows vs Mac vs Linux, encoding, CR/LF, etc.)

I generated it using

git diff --no-prefix 1e15be8d632f59e5f62a468bcb7a0a227c510d4e > ../crawler-commons.patch

(my local environment);

And it is on top of r34; r35 (Eclipse Formatter) is not in my cloned version…

Let me know if you have any error messages. Thanks!

--

Fuad Efendi

416-993-2060

Tokenizer Inc., Canada

http://www.tokenizer.ca

crawler-commons.patch

Ken Krugler

Nov 5, 2012, 10:04:33 AM

to crawler...@googlegroups.com

On Nov 5, 2012, at 5:48am, Fuad Efendi wrote:

> Hi Julien,
>
> Thanks for the link, very interesting… crawler-commons is not there yet:
> https://github.com/apache

Not sure what you mean - crawler-commons isn't an Apache Software Foundation project, so it shouldn't be in the list of repos under https://github.com/apache

-- Ken

Fuad Efendi

Nov 5, 2012, 9:13:31 AM

to crawler...@googlegroups.com

>>…crawler-commons isn't an Apache Software Foundation project, so it shouldn't be in the list of repos under https://github.com/apache

Guys, it's not kindergarten here (sorry); of course I know that, but it would be nice to have an automated mirror at GitHub. Next time I'll create the "mirror" myself, and I'll fork and branch it per issue as per http://wiki.apache.org/hadoop/GitAndHadoop

-Fuad

Julien Nioche

Nov 5, 2012, 10:28:41 AM

to crawler...@googlegroups.com

Fuad,

>>…crawler-commons isn't an Apache Software Foundation project, so it shouldn't be in the list of repos under https://github.com/apache

> Guys it's not kindergarten here (sorry) of course I know that

No comment.

> but would be nice to have an automated mirror at GitHub; next time I'll create "mirror" myself and I'll fork and I'll branch it per-issue as per http://wiki.apache.org/hadoop/GitAndHadoop

You are free to do whatever you want on GitHub and have as many mirrors and forks as you like, but as far as contributions to this project are concerned, it's Google Code hosting + one issue per patch and patch -p0 format.

Best,

Julien

Ken Krugler

Nov 6, 2012, 1:45:29 PM

to crawler...@googlegroups.com

On Nov 2, 2012, at 3:31pm, Fuad Efendi wrote:

> Thanks Ken,
>
> Just a late thought: “follow redirects” is not always good because thousands pages can redirect to logon screen; but without auto-redirect we will have session IDs in URL; Nutch approach is “classic” but sometimes we may need auto-redirect… I want to put it as WIKI somewhere… “follow, but keep track of where you haven’t been yet” :)

We've run into similar issues in the past, e.g. when crawling used car sites - the page for any car that no longer is for sale redirects to a general search page.

But tracking the result of a redirect seems like it should happen outside of the fetch process itself, as (depending on how you spread out fetching) you can have multiple tasks all dealing with redirects to the same location.

One approach would be to not follow the redirect, but emit a fetch result that identifies it as such. Then have a grouping operator in the workflow that deduplicates, and marks dups as skipped-due-to-common-redirect. Then feed these into another fetch pipe.
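A minimal sketch of that grouping step, with `awk` standing in for the workflow's grouping operator. It assumes fetch results are available as tab-separated lines of source URL, HTTP status, and redirect target; the record layout, the status labels, and the URLs are all hypothetical and not part of crawler-commons:

```shell
#!/bin/sh
# Hypothetical fetch results: source URL <TAB> HTTP status <TAB> redirect target.
results=$(printf '%s\t%s\t%s\n' \
  'http://example.com/car/1' 302 'http://example.com/search' \
  'http://example.com/car/2' 302 'http://example.com/search' \
  'http://example.com/car/3' 200 '-' \
  'http://example.com/car/4' 301 'http://example.com/login')

# Group redirects by target: the first source seen for a target is queued
# for the second fetch pass, later ones are marked as skipped duplicates.
dedup_redirects() {
  awk -F '\t' '
    $2 ~ /^3/ {
      if (seen[$3]++) print $1 "\tSKIPPED_COMMON_REDIRECT\t" $3
      else            print $1 "\tFETCH_TARGET\t" $3
      next
    }
    { print $1 "\tFETCHED\t" $3 }
  '
}

marked=$(printf '%s\n' "$results" | dedup_redirects)
printf '%s\n' "$marked"
```

Only the `FETCH_TARGET` rows would go on to the second fetch pipe, so each shared redirect target is requested once no matter how many pages point at it.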

-- Ken

--------------------------

Ken Krugler

+1 530-210-6378

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Mahout & Solr
