Handling false positives in regex-based extension scanning of AsciiDoc sources


Yash Kaushik

May 18, 2026, 6:26:55 AM
to RISC-V ISA Dev
Hi everyone,
While working on the RISC-V ISA Explorer challenge, I scanned the AsciiDoc source files from the ISA manual to extract extension names using regex patterns. One issue I ran into was false positives: author surnames like Zhang or Zabrocki and prose words like Scalar or Scatter match the same patterns as extension names such as the Z-extensions. I handled this with a curated stopword list, but it feels fragile as the manual evolves. Is there a more robust approach that the community uses when extracting structured data from AsciiDoc sources, such as parsing only section headers or extension definition blocks?
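
To make the problem concrete, here is a minimal sketch of the failure mode and of the header-only idea. The pattern, the function names, and the sample text are all made up for illustration and are not taken from the manual or from any existing tool. The only AsciiDoc fact relied on is that a section title line starts with one to six `=` characters followed by a space.

```python
import re

# Hypothetical naive pattern: a capital Z or S followed by lowercase letters/digits.
NAIVE_EXT = re.compile(r"\b[ZS][a-z][a-z0-9]*\b")

# AsciiDoc section title: one to six '=' characters, whitespace, then the title text.
# A bare "====" block delimiter does not match because nothing follows the '=' run.
SECTION_TITLE = re.compile(r"^={1,6}\s+(.*)$")


def scan_all(text):
    """Apply the naive pattern to every line; prone to false positives."""
    return sorted(set(NAIVE_EXT.findall(text)))


def scan_titles(text):
    """Apply the same pattern, but only to section title lines."""
    found = set()
    for line in text.splitlines():
        m = SECTION_TITLE.match(line)
        if m:
            found.update(NAIVE_EXT.findall(m.group(1)))
    return sorted(found)


# Made-up sample text, not quoted from the ISA manual.
SAMPLE = """\
== "Zicond" Extension for Integer Conditional Operations
Contributors include Zhang and Zabrocki.
Scalar code may use Zicond.
"""

if __name__ == "__main__":
    print(scan_all(SAMPLE))     # includes Zhang, Zabrocki and Scalar
    print(scan_titles(SAMPLE))  # only the name found in the title line
```

Restricting the scan to titles narrows the input but does not remove the ambiguity: a capitalized word starting with S or Z in a title would still match, and extensions mentioned only in body text would be missed.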

Andrew Waterman

May 18, 2026, 5:06:51 PM
to Yash Kaushik, RISC-V ISA Dev
The manual is gradually being updated to use a macro to represent extension names (e.g. ext:zicsr[] to represent Zicsr).  The purpose of these macros is to separate formatting from content, but they also happen to help out with what you're trying to do.  I imagine the work will be complete in a few months.
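
As a rough sketch of how such a macro could simplify extraction: the snippet below assumes the macro always has the shape `ext:<name>[...]`, inferred only from the `ext:zicsr[]` example above. The exact grammar of the macro, the pattern, and the function name are assumptions, and the sample sentence is made up.

```python
import re

# Assumed macro shape, based only on the "ext:zicsr[]" example:
# the literal "ext:", a name, then a bracketed (possibly empty) attribute list.
EXT_MACRO = re.compile(r"\bext:([A-Za-z][A-Za-z0-9_]*)\[[^\]]*\]")


def extract_extensions(text):
    """Return the sorted, lowercased set of names used as ext: macro targets."""
    return sorted({m.group(1).lower() for m in EXT_MACRO.finditer(text)})


# Made-up sample text, not quoted from the ISA manual.
SAMPLE = (
    "The ext:zicsr[] extension is required by ext:zifencei[]. "
    "Zhang and Scalar are ordinary prose here."
)

if __name__ == "__main__":
    print(extract_extensions(SAMPLE))
```

Because the match is keyed on the macro rather than on capitalization, surnames and prose words are never candidates, so no stopword list is needed for text that has already been converted. Sources that have not yet been converted would still need the older heuristics.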


Ajit Dingankar

May 19, 2026, 12:31:10 PM
to Andrew Waterman, Yash Kaushik, RISC-V ISA Dev

@Andrew Waterman Do I infer correctly from the plural that such macros will be used systematically in future updates of the manual? I think that will enable the automatic extraction of as much semantics as we care to define macros for! (That is, beyond this specific case, where the task was “to extract extension names” and they “happen to help out…”.)

 

Thanks,

Ajit


Yash Kaushik

May 22, 2026, 6:58:47 AM
to RISC-V ISA Dev, Andrew Waterman, RISC-V ISA Dev, Yash Kaushik
Good to know the manual is moving in that direction; that would make programmatic extraction significantly more reliable. I'll keep an eye on it as the work progresses.

Andrew Waterman

May 22, 2026, 6:58:50 AM
to Ajit Dingankar, Yash Kaushik, RISC-V ISA Dev
Yeah, the intent is to use them systematically, but reaching that state will be a gradual process.