(using namespace std;)
typedef vector<string> TokenVector; // e.g [the] [cat] [sat] [on]
[the] [mat]
typedef map<TokenVector, long > NGramMap; // a list of unique n-grams
(as above), and a freq count.
My goal is to search for n-grams that match a pattern with wildcards,
for instance:
[the] [cat] [sat] [?] [the] [mat]
and return a list of all matches with all the possibilities for the
wildcard [?], along with the freq count
Assuming the map is sorted alphabetically, is there some way to search
them - like a regular expression - that return pattern matches?
The only way I can think of at the moment is to search for "[the]
[cat] [sat]", then sift through those matches.
This gets messy, though, when the pattern starts with a wildcard, or
contains 2 or 3 wildcards.
This seems like it might be a common task, so I thought I'd ask here
before re-inventing the wheel.
Does STL have ready-made solutions for this kind of search?
Does it make the problem any easier if it's constrained to _always_
being 6-grams, and that there are never more than 3 wildcards?
Thanks for any clues, even if it's just what topics I need to study in
order to crack it.
Steve
[the] [cat] [sat] [on] [the] [mat]
matches
[the] [cat] [sat] [?] [the] [mat]
where each of the above n-grams are represented by a collection of
strings and symbol ? is not a question mark but a wildcard?
Yes, succinctly put.
I have a feeling (yes, nothing more reliable than that as I don't deal
text processing) that working with strings might be easier than
breaking them into tokens that are held by a STL container. And do so
using C++ I would look into Boost.Regex. But probably you already
thought of this and have a good reason to use tokens.
Yes, but having still found no shortcuts, I'm beginning to think that
keeping composite string copies for searching is the way to go.
(I really should investigate Boost, but a quick search reveals people
having nightmares building it for OSX due to the transition to 64-bit,
so I'll have to hold off for a while.)
Thanks
Steve
FYI I was able to build a couple of the boost libraries (the bulk of
boost does not need installing) on Snow Leopard (64-bit). Some of the
searches you found probably pertain to advanced features that you
probably don't need.
Did you mean this:
> it was reported that in Snow Leopard, Apple has removed support for
> ppc64, and folks trying to use that receive fairly obscure errors.
> Some of Boost users were already confused by this. Now, this is
> of course more like Apple's problem, but here's a patch, against
> current trunk, to do a nice diagnostic of this in Boost.Build.
> The intention is that if you try to build just ppc64, or a two-way
> fat binary including ppc64 you get an error message, and if you
> try tobuild 4-way fat binary, ppc64 is skipped and you get 3-way
> binary.
Jeff
Yes, not that exact message, but the gist of it. Well, it looks like
it's worth persevering then.
Does ppc64 concern you? Not me, so I did not need a patch.
No it doesn't, but I thought specifying *just* that omission was
tricky for a non-unix man like myself. I'm just being chicken ;) I'm
sure we've all suffered the scenario when, in the middle of an idea,
you get sidetracked installing something, and then 24 hours disappear,
along with clumps of hair.
I think I read somewhere that aspects of boost will be appearing C+
+0x, so it's a no-brainer that I need to start making good use of it.I
may be gone some time ;)
PS Great to be talking to someone here using OSX, I almost thought it
wasn't worth mentioning.