The smallwiki format seems a bit messy


zinc

Jul 24, 2014, 4:01:02 AM
to cs40...@googlegroups.com
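The [[ and ]] markers are not matched one-to-one; there are even nested forms like [[xxxxxx[[xxxxx]]xxxxx]]. How should nested forms be counted, and how can they be handled with regular expressions?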

PengBo

Jul 24, 2014, 4:44:29 AM
to cs402pku
Nested formats do exist.

For example, content like this:
<title>Extinct birds</title> ...
[[Image:ExtinctDodoBird.jpeg|right|frame|[[Dodo]], based on [[Roelant Savery]]'s [[1626]] painting of a stuffed specimen - note that it has two left feet.]]

This corresponds to the image caption on the right side of the wiki page, roughly this page:
http://en.wikipedia.org/wiki/List_of_extinct_birds

Best,

pb.





Han Jiang

Jul 24, 2014, 4:59:49 AM
to cs402pku
I grepped for lines matching this pattern:

"\[\[[^]]*\["

Most of the hits look like the example the professor gave: an Image with a caption that contains links, or one Category nested inside another.
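
For anyone who wants to repeat that check without grep, here is a rough Python sketch of the same idea. The function name and the sample lines are made up for illustration, and the closing bracket inside the character class is escaped so that the pattern reads the same way under Python's re module; treat it as one possible way to do the scan, not the exact command used above.

```python
import re

# Same idea as the grep pattern: "[[" followed by any run of characters
# other than "]", and then another "[" before the link has been closed.
NESTED_OPEN = re.compile(r"\[\[[^\]]*\[")

def lines_with_nesting(lines):
    """Return the lines that contain a [[ ... [ sequence (hypothetical helper)."""
    return [line for line in lines if NESTED_OPEN.search(line)]

# Made-up sample lines, not taken from the actual smallwiki data.
sample_lines = [
    "[[Image:Dodo.jpeg|frame|[[Dodo]] caption]]",
    "See [[Dodo]] and [[1626]].",
    "[[Category:Birds]]",
]

for line in lines_with_nesting(sample_lines):
    print(line)
```

Only the first sample line should be reported, since it is the only one where a second [ appears before the first ]].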

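Since only the innermost level of such a nested structure matters, you can still use a regex for the extraction, right? For example, a regex like this (remove the spaces in the middle):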

\[ \[  [^\[\]]*  \]  \]

Just make sure that no [ or ] characters appear between the [[ and the ]].
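
A minimal Python sketch of that regex with the spaces removed, applied to the caption example from earlier in the thread. The capture group and the function name are additions for illustration only; the assignment may expect a different language or output format.

```python
import re

# "[[", then a body containing no "[" or "]" characters, then "]]".
# Outer levels of a nested structure can never match, because their
# body contains brackets; only the innermost links are captured.
INNERMOST_LINK = re.compile(r"\[\[([^\[\]]*)\]\]")

def innermost_links(wikitext):
    """Return the bodies of all innermost [[...]] spans (hypothetical helper)."""
    return INNERMOST_LINK.findall(wikitext)

sample = (
    "[[Image:ExtinctDodoBird.jpeg|right|frame|[[Dodo]], based on "
    "[[Roelant Savery]]'s [[1626]] painting of a stuffed specimen - "
    "note that it has two left feet.]]"
)

print(innermost_links(sample))  # ['Dodo', 'Roelant Savery', '1626']
```

The outer [[Image:...]] wrapper is skipped entirely, which is the intended effect here: the three links inside the caption are still extracted even though the brackets around them are nested.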
--
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China