Hi Derek, hi Richard,
there is an error in the description [1]: instead of
  WG_CP="java -cp target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar"
the classpath element must be
  WG_CP="target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar"
I'll fix the description as soon as possible. However, I do not know
whether this is the cause of the problem: that depends on whether the
java executable accepts multiple "-cp" arguments.
Other points (see the "bash.png" screenshot):
- you'd need to remove "target/" if you're already in the directory "target"
- make sure that the jython jar is found at
../jython-standalone-2.7.2.jar; if in doubt, use an absolute path
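
The points above can be checked with a small shell sketch before launching anything. It only verifies that the two jar files exist at the paths mentioned in this mail and joins them into one classpath. The helper name check_jar and the variables JYTHON_JAR and CP are made up for illustration, and the commented launch line is an assumption, not the command from the description [1].

```shell
#!/usr/bin/env bash
# Sketch only: paths are the ones mentioned in this mail, adjust as needed.
# Drop the "target/" prefix if you are already inside the "target" directory.
WG_CP="target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar"
JYTHON_JAR="../jython-standalone-2.7.2.jar"

# check_jar is a hypothetical helper: reports whether a jar file exists.
check_jar() {
  if [ -f "$1" ]; then
    echo "found:   $1"
  else
    echo "missing: $1 (if in doubt, use an absolute path)" >&2
    return 1
  fi
}

# Report on both jars, but keep going so the final classpath is still shown.
check_jar "$WG_CP" || true
check_jar "$JYTHON_JAR" || true

# A single classpath value, passed to java with exactly one -cp option.
CP="$WG_CP:$JYTHON_JAR"
echo "classpath: $CP"

# Assumed launch command (not taken from the description):
# java -cp "$CP" org.python.util.jython
```

Any "missing" line points to a path that needs fixing before the Jython session is started.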
> You are also using a snapshot version of cc-webgraph according to
> the bash.png, couldn't you be using a release version?
There has never been a release of the cc-webgraph project, not even a
change of the 0.1 version. But good idea, I'll do a release at the latest
together with the next webgraph release. Thanks for the suggestion, Richard!
Indeed, in the beginning it was shell only, just a wrapper to call the
webgraph tools. The Java classes and the bundling were added later.
Sorry, again. I'll review the description (also the notebook) over the
weekend. And yes, setting up Jython isn't easy at all. I still hope
to use the Python bindings of JGraphT [2], which added support for the
webgraph formats, as a replacement. See [3].
Best,
Sebastian
[1] https://github.com/commoncrawl/cc-notebooks/blob/main/cc-webgraph-statistics/interactive_webgraph.md
[2] https://jgrapht.org/
[3] https://github.com/commoncrawl/cc-notebooks/issues/3
> I did some work getting around that but when the application
> launches it immediately runs into issues with "Immutable Graphs"
> is not defined when I input the graph command.
> > vaccinechoicecanada.com, nvic.org etc.
> >
> > One approach could be to look into the webgraphs to find sites
> > either referencing or being referenced from these domains.
> >
> >
> > Let me know if you need help for any of the described steps.
> >
> >
> > Best,
> > Sebastian
> >
> > On 5/28/22 00:59, D.C Hilliard wrote:
> > > Hello,
> > >
> > > I am a historian doing work on anti-vaccination in Canada since the
> > > 1980s. I am hoping to use distant reading (LDA topic modelling and
> > > Network Analysis) to talk about the more recent history and to try
> > > and push some analysis into COVID.
> > >
> > > I have the topic modelling and Network analysis tools figured out
> > > with sample corpora. But I am having some problems creating my own
> > > corpus.
> > >
> > > I have a few related questions.
> > >
> > > 1) How do I extract a domain (eg. vaccinechoicecanada.com) from a
> > > crawl? It is feasible to store and manipulate the data locally? Is
> > > using the .CDX to find the file and then extracting the relevant the
> > > best way. Should I only extract .WAT and .WET files for the textual
> > > and hyperlink analysis ... would I lose valuable pertinent
> > > information not using the WARC.
> > >
> > > 2) Do you have any suggestions for curating sites to include beyond
> > > starting with current sites (vaccinechoicecanada.com, nvic.org etc.
> > > etc.) and keyword searching for current sites that pertain to the
> > > subject in the now ... Should I use hyperlinks to slowly build out my
> > > own corpora (and automate this process using code)?