liblyric -- part-1.

Dhruv Matani

Nov 23, 2006, 11:30:22 AM

to Apurva Mehta, Nippun Goel, Ramachandra K, kun...@gmail.com, rao.r...@gmail.com, Aniket Arondekar, Siddharth Shah, Shrikant Shetty, vipin ravindran, The Distibuted DataBase

This module takes two input files, file1.txt and file2.txt, and
returns the best-match approximate intersection of their contents.
It can be used to extract song lyrics from a web page once the HTML
tags have been stripped. The two files supplied are HTML tag-stripped
versions of the kind that will eventually be fed to it once the
different modules start co-operating with each other. Oh, and it's
linear in N (N being the number of words in the input text)!
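To give a feel for how a word-level intersection can stay linear, here
is a minimal sketch. It is not the code from intersection.cpp; the names
split_words and mark_common_words are made up for illustration, and it
assumes plain whitespace-separated words and exact word matches.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Split text on whitespace into words.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// For each word of `a`, record whether the same word occurs anywhere in `b`.
// One pass over each input plus hash lookups: expected O(|a| + |b|).
std::vector<bool> mark_common_words(const std::string& a, const std::string& b) {
    const std::vector<std::string> wb = split_words(b);
    const std::unordered_set<std::string> seen(wb.begin(), wb.end());
    std::vector<bool> hits;
    for (const std::string& w : split_words(a))
        hits.push_back(seen.count(w) != 0);
    return hits;
}
```

Each word is hashed once and looked up once, so the expected cost grows
with the total number of words rather than with the product of the two
input sizes.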

I still need to iron out a few corner cases, but on the whole it just works.

region_struct is the main data structure. It encodes information on a
per-region basis. The main function is intersect_streams, which does
the approximate stream intersection and returns a list of regions
from which the caller may choose by applying its own heuristics.
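As a rough picture of what "a list of regions" could look like, here is
a sketch under stated assumptions. Region is a hypothetical stand-in for
region_struct (the real fields may differ), find_regions is an invented
helper rather than intersect_streams itself, and the input is assumed to
be one matched/unmatched flag per word of the first stream.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for region_struct: a run of words in the first input.
struct Region {
    std::size_t first;    // index of the first matched word
    std::size_t last;     // index of the last matched word (inclusive)
    std::size_t matched;  // number of matched words inside [first, last]
};

// Group matched word positions into regions, tolerating up to `max_gap`
// consecutive unmatched words inside a region. Single pass: O(hits.size()).
std::vector<Region> find_regions(const std::vector<bool>& hits,
                                 std::size_t max_gap) {
    std::vector<Region> regions;
    bool open = false;
    Region cur = {0, 0, 0};
    for (std::size_t i = 0; i < hits.size(); ++i) {
        if (!hits[i]) continue;
        if (open && i - cur.last - 1 <= max_gap) {
            // Close enough to the previous match: extend the current region.
            cur.last = i;
            ++cur.matched;
        } else {
            // Too far away (or nothing open yet): start a new region.
            if (open) regions.push_back(cur);
            cur = Region{i, i, 1};
            open = true;
        }
    }
    if (open) regions.push_back(cur);
    return regions;
}
```

The gap tolerance is what makes the match approximate: a few stray
words inside the lyrics do not split a region in two.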

TODO: Make it display correctly formatted lyrics. How should we go
about that? If you look at struct word_t, you will see that
provision for that has already been made. :-) So we just need to get
the bounding offsets and dump the data between those offsets from the
input string to the output stream.
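The offset idea can be sketched like this. Word is a hypothetical
stand-in for word_t (I am assuming it keeps begin/end character offsets;
the real layout may differ), and tokenize and extract_span are invented
names for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for word_t: a word plus where it sits in the input.
struct Word {
    std::string text;
    std::size_t begin;  // offset of the first character
    std::size_t end;    // offset one past the last character
};

// Split on whitespace, remembering each word's offsets in `input`.
std::vector<Word> tokenize(const std::string& input) {
    std::vector<Word> words;
    std::size_t i = 0;
    while (i < input.size()) {
        while (i < input.size() &&
               std::isspace(static_cast<unsigned char>(input[i]))) ++i;
        const std::size_t start = i;
        while (i < input.size() &&
               !std::isspace(static_cast<unsigned char>(input[i]))) ++i;
        if (i > start)
            words.push_back(Word{input.substr(start, i - start), start, i});
    }
    return words;
}

// Copy the original text covering words [first, last] (inclusive),
// so line breaks and spacing between them are preserved.
std::string extract_span(const std::string& input,
                         const std::vector<Word>& words,
                         std::size_t first, std::size_t last) {
    if (words.empty() || first > last || last >= words.size())
        return std::string();
    return input.substr(words[first].begin,
                        words[last].end - words[first].begin);
}
```

Because the span is cut from the original string rather than rebuilt
from the words, the line breaks of the lyrics come out as they went in.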

What do you need to compile it?
g++ 3.x or 4.x, etc.
g++ intersection.cpp [should be just dandy].

How to run:
./a.out 2> /dev/null

[Redirecting stderr stops it from displaying a lot of debug junk.]

What is this "Size: 195" that gets displayed? It's the number of
matches that the intersection algorithm found. Whoa! Hence, we use one
more filter, which chooses the longest match.
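The "longest match" filter could be as small as the sketch below. The
Region struct and the names region_length and pick_longest are
assumptions for illustration, not what intersection.cpp actually
contains.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical region: inclusive word indices into the first input.
struct Region {
    std::size_t first;
    std::size_t last;
};

std::size_t region_length(const Region& r) { return r.last - r.first + 1; }

// Return the index of the longest region, or regions.size() if there are
// none. On ties the earliest region wins (std::max_element semantics).
std::size_t pick_longest(const std::vector<Region>& regions) {
    auto it = std::max_element(
        regions.begin(), regions.end(),
        [](const Region& a, const Region& b) {
            return region_length(a) < region_length(b);
        });
    return static_cast<std::size_t>(it - regions.begin());
}
```

Swapping std::max_element for std::min_element would give a
shortest-match filter instead, if that heuristic turns out to work
better.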

ps. There shouldn't be any obvious bugs, because it has been thought
through. This can also be used to get, say, poems from the web, etc.
The one problem I know of is commented at line 285, which is why we
_sometimes_ get trailing junk characters after the song lyrics are
done. IMHO, this can be taken care of later when we choose the best
match: instead of choosing the longest match, choose the shortest one.
The place I'm talking about is where we generate the fully connected
graph for each of the search results and choose the one with the most
overlap. Another possible solution looks like using a two-way
intersection, since our algorithm is asymmetric: (A INTERSECT B) is
not necessarily the same as (B INTERSECT A). [Hello... it's an
approximate algorithm. You should have come to expect it!]
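The asymmetry is easy to see even in a toy version. The sketch below is
not the algorithm from intersection.cpp; intersect_words is an invented
name, and it simply keeps the words of the first argument that also
occur in the second, in the first argument's order.

```cpp
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Keep the words of `a` that also occur somewhere in `b`, in the order
// (and with the repetitions) they have in `a`.
std::vector<std::string> intersect_words(const std::string& a,
                                         const std::string& b) {
    std::unordered_set<std::string> in_b;
    std::istringstream sb(b);
    std::string w;
    while (sb >> w) in_b.insert(w);

    std::vector<std::string> out;
    std::istringstream sa(a);
    while (sa >> w)
        if (in_b.count(w) != 0) out.push_back(w);
    return out;
}
```

With a = "la la love" and b = "love la", intersect_words(a, b) yields
la, la, love, while intersect_words(b, a) yields love, la: the result
follows the order and repetitions of whichever input comes first.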

A question I keep asking myself: why the f*** am I using
hash_multimap? I don't know.
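One plausible use for a multimap here, offered as a guess rather than
as a description of intersection.cpp: the same word shows up at many
positions in a text, so a word -> position index needs several values
per key. The sketch uses std::unordered_multimap, the standard
counterpart of the hash_multimap extension; WordIndex, build_index and
positions_of are invented names.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::unordered_multimap<std::string, std::size_t> WordIndex;

// Map every word to each position (word index) at which it occurs.
WordIndex build_index(const std::string& text) {
    WordIndex index;
    std::istringstream in(text);
    std::string w;
    for (std::size_t pos = 0; in >> w; ++pos)
        index.insert(std::make_pair(w, pos));
    return index;
}

// All positions of `word`, in ascending order.
std::vector<std::size_t> positions_of(const WordIndex& index,
                                      const std::string& word) {
    std::vector<std::size_t> out;
    auto range = index.equal_range(word);
    for (auto it = range.first; it != range.second; ++it)
        out.push_back(it->second);
    std::sort(out.begin(), out.end());  // iteration order is unspecified
    return out;
}
```

If only membership matters (is this word present at all?), a plain hash
set would do and the multimap is overkill.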

--
-Dhruv Matani.
http://www.geocities.com/dhruvbird/

"Be sure brain is in gear before engaging mouth"
-- Anonymous

intersection.cpp

file1.txt

file2.txt

file3.txt
