AntConc Suggestion


Kevin Helmsberg

May 23, 2024, 7:05:28 AM
to AntConc-Discussion
Dear Mr. Anthony,

I wanted to thank you for AntConc, which recently solved a big search problem for me in no time. Impressive.

I would also like to make a few suggestions, if you don't mind.

1) Would it be possible to add support for .xhtml files, which are very common in ebooks?

2) In the N-Gram tab, would it be possible to predefine maximum N-Gram size and maximum frequency by adding the relevant dropdown menus or customizable ranges? It would be very helpful in reducing the number of hits in large sources.

Thank you for considering.

Best wishes,
Kevin

Laurence Anthony

Jun 5, 2024, 4:27:47 AM
to ant...@googlegroups.com
Hi Kevin,

Thanks for the feedback.

As for point 1), I expect that AntConc can already deal with xhtml. The only problem might be the file extension. When I next update AntConc (for release 4.3.0), I'll add an explicit filter for this filetype.

As for point 2), users usually reduce the number of hits by *raising* the minimum frequency or range to leave only the salient n-grams. Can you explain why you want to filter out the salient hits and only look at the less salient ones? Note that because of Zipf's Law, even if you set the maximum frequency to 1, you would still end up with a huge number of hits.

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################



Kevin Helmsberg

Jun 6, 2024, 9:07:14 AM
to AntConc-Discussion
Hi Laurence,

Thank you for #1. This may be a stretch, but let me ask anyway: would it also be possible to add support for .ePub files, which are ZIP-compressed collections of .html or .xhtml files (and some other search-irrelevant files)? This way one wouldn't have to unpack/decompress ebooks in order to perform a search. And .ePub is an open format, so no problems with proprietary code.
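
For illustration, here is a minimal sketch of reading an ePub without unpacking it first, using only Python's standard library. The names `epub_texts` and `_TextExtractor` are hypothetical and have nothing to do with AntConc's own code. The sketch assumes UTF-8 content, ignores the ePub manifest and reading order, and keeps all text nodes (including titles and any inline styles).

```python
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the markup around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def epub_texts(source):
    """Yield (member_name, plain_text) for each HTML/XHTML file in an ePub.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        for name in sorted(archive.namelist()):
            # Skip stylesheets, images, metadata and other non-text members.
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            # Assumption: the content is UTF-8, which is typical for ePub.
            markup = archive.read(name).decode("utf-8", errors="replace")
            parser = _TextExtractor()
            parser.feed(markup)
            parser.close()
            # Join text nodes with spaces and collapse runs of whitespace.
            yield name, " ".join(" ".join(parser.chunks).split())
```

Joining text nodes with a space keeps adjacent paragraphs apart, at the cost of splitting a word whose letters are spread over inline tags.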

As for #2, I'd like to be able to look for e.g. groups of 3 to 6 words which are repeated between 2 and 5 times in the text. If this can be defined before the search, the unnecessary hits can automatically be filtered out. As far as I know, this particular kind of search cannot be done at the moment, hence my suggestion.
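
To make the request concrete, here is a rough sketch of such a search in Python. The function names and the simple `\w+` tokenization are assumptions made for illustration only; AntConc's own token definition and algorithm may well differ.

```python
import re
from collections import Counter


def ngram_counts(text, min_n, max_n):
    """Count every word n-gram with min_n <= n <= max_n (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def filter_by_frequency(counts, min_freq, max_freq):
    """Keep only the n-grams whose frequency lies in [min_freq, max_freq]."""
    return {gram: c for gram, c in counts.items() if min_freq <= c <= max_freq}


text = "The cat sat on the mat. The cat sat on the rug."
hits = filter_by_frequency(ngram_counts(text, 3, 6), 2, 5)
```

In this sketch every n-gram is counted first and the frequency window is applied afterwards, so the upper bound trims the result list but does not reduce the counting work.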

Best wishes,
Kevin

__________

Laurence Anthony

Jun 6, 2024, 8:56:26 PM
to ant...@googlegroups.com
Hi Kevin,

As for 1), the idea of adding support for .ePub files is very good. Can you send me an example file that I can use for testing? Something small would be preferred.

As for 2), I still don't really understand why you are trying to ignore the more salient n-grams. Why would you stop at a frequency of 5? N-grams that appear with a frequency of 6 are more salient than those at a frequency of 5. What is special about 5?

Laurence.



Kevin Helmsberg

Jun 7, 2024, 9:41:49 AM
to AntConc-Discussion
Hi Laurence,

As for #1, I'll send a sample file off-list. Please check the email in your signature.

As for #2, I'm thinking in terms of precision, not salience. The proposed addition would make AntConc's search more granular and faster. If one were to use it for pattern analysis, granularity could be essential. So for me the question is not "Why?" but "Why not?".

Best wishes,
Kevin

__________

Laurence Anthony

Jun 7, 2024, 9:52:59 AM
to ant...@googlegroups.com
Hi Kevin,

I got the file off list. Thanks!

>As for #2, I'm thinking in terms of precision, not salience. The proposed addition would make AntConc's search more granular and faster. If one were to use it for pattern analysis, granularity could be essential. So for me the question is not "Why?" but "Why not?".

There are many reasons for not adding a new feature. Here are a few:
1) Most people won't use it.
2) It makes the interface more complicated.
3) People could change the max freq setting, mistaking it for the min freq setting, resulting in confusion.
4) The new feature would not speed up the process (due to the algorithm used).
5) The setting makes little sense from a linguistic perspective.

So, it really comes down to a cost/benefit analysis. Let's see if anybody else thinks this is a good feature to add. I'm still skeptical and think the same result could be achieved fairly easily by exporting the results to Excel and filtering them there.

As I mentioned above, there seems to be no linguistically meaningful reason to limit the maximum frequency (at least to me). Also, Zipf's Law indicates that the n-grams above your arbitrary threshold of 5 are very few in number, so you're basically just excluding a few very salient (and important) n-grams from your search. I come back to the question: why would you want to do this?
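
One way to check this claim against a particular text is to look at the frequency spectrum of its n-grams, i.e. how many distinct n-grams occur exactly once, twice, and so on. The sketch below is illustrative only; the function names are hypothetical and the input is assumed to be an already tokenized list of words.

```python
from collections import Counter


def frequency_spectrum(tokens, n):
    """Map each frequency f to the number of distinct n-grams occurring f times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams.values())


def types_above(spectrum, threshold):
    """Number of distinct n-grams whose frequency exceeds the threshold."""
    return sum(k for freq, k in spectrum.items() if freq > threshold)


tokens = "a b a b c".split()
spectrum = frequency_spectrum(tokens, 2)
```

Comparing `types_above(spectrum, 5)` with `sum(spectrum.values())` on a real corpus shows how small a share of the distinct n-grams a maximum frequency of 5 would actually remove.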

Laurence.


Kevin Helmsberg

Jun 8, 2024, 1:52:32 AM
to AntConc-Discussion
Hi Laurence,

Let me address some of your points:

1) and 5) I was thinking in terms of improving broader search capabilities, not necessarily in the realm of linguistics, as I think AntConc can also be used in non-linguistic research/analysis by more people than you expect.

2) I think it doesn't have to. Both N-Gram Size and Min. Freq could lose the counter and switch to customizable ranges, so that e.g. "2" means "2 only" (not "2 and more"), "2-" would mean "2 and more", "2,4" would mean "2 and 4" (excluding all other values), while "2-4" would mean "between 2 and 4" (again, excluding all other values). In that case, you could even declutter a bit, as Min. Freq would become just Freq. If this were explained in the new version notes (or wherever you think best), I don't think people would make the mistakes outlined under 3).
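
A rough sketch of how such a range syntax could be interpreted is shown below. The function name `parse_range_spec` is hypothetical, and treating "2-4" as inclusive at both ends is an assumption; nothing here reflects how AntConc actually handles its settings.

```python
def parse_range_spec(spec):
    """Turn a spec such as "2", "2-", "2,4" or "2-4" into a predicate on integers."""
    intervals = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty item in range spec: {spec!r}")
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low = int(low_text)
            # An open upper end ("2-") means "2 and more".
            high = int(high_text) if high_text.strip() else None
        else:
            # A bare number means exactly that value.
            low = high = int(part)
        if high is not None and high < low:
            raise ValueError(f"upper bound below lower bound in {part!r}")
        intervals.append((low, high))

    def matches(value):
        return any(
            low <= value and (high is None or value <= high)
            for low, high in intervals
        )

    return matches


only_two = parse_range_spec("2")
two_and_more = parse_range_spec("2-")
two_and_four = parse_range_spec("2,4")
two_to_four = parse_range_spec("2-4")
```

Returning a predicate keeps the filtering code independent of the syntax, so the same check can be applied to N-Gram Size and to Freq.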

"Exporting the results to Excel and filtering the results there": Sure, but why not avoid the extra work if possible?

Of course, all of the above is irrelevant if you disagree.

Best wishes,
Kevin

__________

Laurence Anthony

Jun 8, 2024, 4:16:17 AM
to ant...@googlegroups.com
Hi Kevin,

>1) and 5) I was thinking in terms of improving broader search capabilities, not necessarily in the realm of linguistics, as I think AntConc can also be used in non-linguistic research/analysis by more people than you expect.

This is a good point. AntConc is used beyond linguistics research, which I often forget.

>2) I think it doesn't have to...

I had a look at the interface and, as you say, it would be possible to have custom ranges. The worry is that users would have to select "1-" to get the current default setting, which I think might confuse some people. I suppose I could add a 'secret' setting, so that users could type 2-5 to set the frequency to a range, and just document this outlier case somewhere. I think I prefer that to changing the wording of all labels and adding new syntax for default settings in order to cater for a fairly rare use case.

Let's see if others have any opinion on this.

Laurence.

