subset of CC data


Peter Cawdron

May 12, 2015, 5:34:15 PM
to common...@googlegroups.com
I'm interested in extracting just a subset of CC data: only the www.amazon domains (.com, .co.uk, etc.) from the latest month, so that I can conduct statistical analysis on Amazon reviews (particularly for books).
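
As a rough illustration of the domain filter described above, here is a minimal sketch of a host check that could be applied to each crawled URL. The function name `is_amazon_host` and the suffix list are assumptions made for the example (the list is not exhaustive), and the sketch says nothing about how the URLs are read out of the crawl data.

```python
from urllib.parse import urlsplit

# Hypothetical, non-exhaustive list of Amazon storefront suffixes.
AMAZON_SUFFIXES = ("com", "co.uk", "de", "fr", "co.jp", "ca")


def is_amazon_host(url: str) -> bool:
    """Return True if the URL's host is exactly www.amazon.<suffix>."""
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        # Malformed URLs are simply skipped.
        return False
    # Exact comparison avoids matching look-alikes such as
    # www.amazon.com.example.org or other subdomains like aws.amazon.com.
    return any(host == f"www.amazon.{suffix}" for suffix in AMAZON_SUFFIXES)
```

Comparing the full host name rather than searching for the substring "amazon" keeps unrelated hosts out of the subset.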

My preferred database is eXist, as its XML structure will treat HTML pages as documents and should (hopefully) allow for easy data manipulation, if it can manage the size. I'm looking to reverse engineer relationships between products and reviewers, that is, to transform unstructured web pages into structured data. I hope to expose fraudulent reviews, demonstrate statistical relationships between books, uncover trends in how books are perceived, etc.

Any thoughts or tips would be greatly appreciated.

Cheers,
Peter