A bit of grunt work I'd like everyone's help with [never mind if you're not on linux or cgywin and don't know base64]

崔莺莺

Mar 22, 2010, 3:13:45 AM

to scholarz...@googlegroups.com

http://autoproxy-gfwlist.googlecode.com/svn/trunk/gfwlist.txt
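The above is autoproxy-gfwlist's GFW keyword list. It has to be decoded with base64 -d.
Some of the keywords contain asterisks. I'd like to restore those URLs to their original form, and I'm asking everyone to help with this grunt work. (Chinese text should be restored to Chinese as well.) Entries where the asterisk carries no meaning, such as http://*blogger.com, can be ignored.

You can run
curl http://autoproxy-gfwlist.googlecode.com/svn/trunk/gfwlist.txt | base64 -d | grep "*" > starlist
to save the starred entries to starlist.
I have pasted starlist into the group's pages (https://groups.google.com/group/scholarzhang-dev/web/starlist). Anyone can edit it, so please make your changes there directly. Thanks, everyone.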
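
For anyone without base64 and grep at hand, the decode-and-filter step can be sketched in Python. This is only an illustration: the function name `starred_rules` and the small sample (one made-up entry plus two patterns quoted in this thread) are assumptions, and downloading the real file is left out.

```python
import base64


def starred_rules(encoded_text: str) -> list[str]:
    """Decode a base64-encoded rule list and keep the lines that contain '*'."""
    decoded = base64.b64decode(encoded_text).decode("utf-8", errors="replace")
    return [line for line in decoded.splitlines() if "*" in line]


# Small sample standing in for the downloaded gfwlist.txt (not the real list).
sample_rules = "packages.debian.org*zh-cn*lenny*gpass\n.example.com\n|http:*falun\n"
sample_encoded = base64.b64encode(sample_rules.encode("utf-8")).decode("ascii")

for rule in starred_rules(sample_encoded):
    print(rule)
```

To use it on the real list, pass the downloaded file's text to `starred_rules` in place of `sample_encoded`.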

崔莺莺

Mar 22, 2010, 3:19:44 AM

to scholarz...@googlegroups.com

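That wasn't very specific, so here are a few examples:
docs.google.com*View*id*dg5mtmj9_3188x48zcn
This one clearly corresponds to a Google Docs document. Google Docs links have a fixed format, so restore it to that format:
docs.google.com/View?docID=dg5mtmj9_3188x48zcn

packages.debian.org*zh-cn*lenny*gpass
This kind should likewise be restored to http://packages.debian.org/zh-cn/lenny/gpass

|http:*.google.com*%E5%89%8D%E4%B8%96%E4%BB%8A%E7%94%9F
This one can simply be written as .google.com*%E5%89%8D%E4%B8%96%E4%BB%8A%E7%94%9F
That is enough, because the meaning is obvious and unambiguous. Writing it as .google.com/search?q=...... is fine too, if you prefer.

|http:*google.com*search*q*%E5%A4%A9%E5%AE%89%E9%97%A8
This one must be restored to .google.com/search?q=%E5%A4%A9%E5%AE%89%E9%97%A8, because we don't know whether the keyword is really .google.com && xxx, search && xxx, or q=xxxx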
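
One way to sanity-check a restored URL is to confirm that it still matches the starred template it came from. The sketch below is an assumption-laden illustration: it treats `*` as "any run of characters", ignores the leading `|` anchor, and matches case-insensitively (as Adblock Plus style filters are understood to do by default; without that, the `docID` example would not match the lowercase `id` in the template). The helper names are hypothetical.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Turn a '*' wildcard template into a regex; '*' matches any run of characters."""
    literal_pieces = [re.escape(piece) for piece in template.split("*")]
    return re.compile(".*".join(literal_pieces), re.IGNORECASE)


def still_matches(template: str, restored_url: str) -> bool:
    """True if some substring of the restored URL matches the template."""
    return template_to_regex(template).search(restored_url) is not None


print(still_matches("packages.debian.org*zh-cn*lenny*gpass",
                    "http://packages.debian.org/zh-cn/lenny/gpass"))
print(still_matches("docs.google.com*View*id*dg5mtmj9_3188x48zcn",
                    "docs.google.com/View?docID=dg5mtmj9_3188x48zcn"))
```

A restored URL that fails this check cannot be what the template was written for, so it is a cheap filter before any manual review.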

Darasion!

Mar 22, 2010, 7:15:02 AM

to scholarz...@googlegroups.com

Do we have to test each one individually?

--
You received this message because you are subscribed to "scholarzhang-dev".
To post to this group, send email to scholarz...@googlegroups.com
To unsubscribe from this group, send email to
scholarzhang-d...@googlegroups.com

To unsubscribe from this group, send email to scholarzhang-dev+unsubscribegooglegroups.com or reply to this email with the words "REMOVE ME" as the subject.

WindyWinter

Mar 22, 2010, 7:19:27 AM

to scholarz...@googlegroups.com

For the zh.wikipedia.org entries, the * stands for /wiki/, /zh-cn/, /zh-sg/, /zh-tw/, or /zh-hk/. Do you want all of these written out, or is picking any one of them enough?

Soli Deo gloria,
yours WindyWinter
and http://www.briefdream.com

崔莺莺

Mar 22, 2010, 8:49:17 AM

to scholarz...@googlegroups.com

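On March 22, 2010 at 7:19 PM, WindyWinter <wi...@briefdream.com> wrote:
> For the zh.wikipedia.org entries, the * stands for /wiki/, /zh-cn/, /zh-sg/, /zh-tw/, or /zh-hk/. Do you want all of these written out, or is picking any one of them enough?

Right, thanks! I suppose these will have to be tested one by one. If it really is a wildcard, then /asdf/ would match as well, wouldn't it?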

崔莺莺

Mar 22, 2010, 8:52:50 AM

to scholarz...@googlegroups.com

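I still wasn't specific enough. There's no need to test them one by one: once you have established that /asdf/ is not a keyword, just write down every one of the correspondences listed above.

The overall goal is to turn a matching template, which is not necessarily a real keyword, into strings that really *could be* keywords.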

崔莺莺

Mar 22, 2010, 9:01:31 AM

to scholarz...@googlegroups.com

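On March 22, 2010 at 7:15 PM, Darasion! <dara...@gmail.com> wrote:
> Do we have to test each one individually?
There's no need to verify that each one is a keyword. Just guess, as completely as you can, the URLs that these odd matching templates are meant to correspond to.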

Darasion!

Mar 22, 2010, 9:11:40 AM

to scholarz...@googlegroups.com

I don't quite follow. I did the wiki part; have a look and see whether this is what you meant.

It's in the attachment (dwikipedia).

dwikipedia

WindyWinter

Mar 22, 2010, 10:21:05 AM

to scholarz...@googlegroups.com

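That's probably not right. When I said "the * stands for /wiki/, /zh-cn/, /zh-sg/, /zh-tw/, /zh-hk/", I meant that
zh.wikipedia.org*Anti-CNN matches these five URLs and only these five:
http://zh.wikipedia.org/wiki/Anti-CNN
http://zh.wikipedia.org/zh-cn/Anti-CNN
http://zh.wikipedia.org/zh-sg/Anti-CNN
http://zh.wikipedia.org/zh-tw/Anti-CNN
http://zh.wikipedia.org/zh-hk/Anti-CNN
Every other form returns 404, and the content of these five URLs is identical; they differ only in the character variant displayed.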
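
The expansion described above is mechanical enough to script. A minimal sketch, assuming the template has the shape `host*title` and that the five path prefixes named in this message are the ones wanted; `expand_zhwiki` and `VARIANT_PREFIXES` are illustrative names, not part of any existing tool.

```python
# The five path prefixes listed in the message above.
VARIANT_PREFIXES = ["wiki", "zh-cn", "zh-sg", "zh-tw", "zh-hk"]


def expand_zhwiki(template: str) -> list[str]:
    """Expand a 'host*title' template into one URL per variant prefix."""
    host, star, title = template.partition("*")
    if not star:
        raise ValueError("template contains no '*'")
    return [f"http://{host}/{prefix}/{title}" for prefix in VARIANT_PREFIXES]


for url in expand_zhwiki("zh.wikipedia.org*Anti-CNN"):
    print(url)
```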

Soli Deo gloria,
yours WindyWinter
and http://www.briefdream.com

WindyWinter

Mar 22, 2010, 10:27:08 AM

to scholarz...@googlegroups.com

Oh, it looks like a few more forms were missing:
http://zh.wikipedia.org/wiki/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
http://zh.wikipedia.org/zh-cn/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
http://zh.wikipedia.org/zh-tw/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
http://zh.wikipedia.org/zh-sg/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
http://zh.wikipedia.org/zh-hk/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
%E6%90%9C%E7%B4%A2 is the encoding of the two characters "搜索" (search). There should also be a Traditional Chinese counterpart of "搜索", but I don't know how those two characters are encoded in Traditional Chinese.
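
The encoding in question is UTF-8 percent-encoding, which Python's standard library can produce and reverse. In the sketch below, "搜尋" is only an assumed Traditional Chinese spelling of "search"; whether zh.wikipedia.org actually uses it as a special-page name is not confirmed in this thread.

```python
from urllib.parse import quote, unquote

simplified = "搜索"
traditional = "搜尋"  # assumed Traditional Chinese form, not confirmed in the thread

print(quote(simplified))   # UTF-8 percent-encoding of the Simplified form
print(quote(traditional))  # the same for the assumed Traditional form

# Decoding works the same way, e.g. for an entry from the list:
print(unquote("%E5%89%8D%E4%B8%96%E4%BB%8A%E7%94%9F"))
```

The same `unquote` call covers the request to restore Chinese text in list entries to readable Chinese.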

Soli Deo gloria,
yours WindyWinter
and http://www.briefdream.com

崔莺莺

Mar 22, 2010, 10:37:31 AM

to scholarz...@googlegroups.com

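On March 22, 2010 at 10:27 PM, WindyWinter <wi...@briefdream.com> wrote:
> Oh, it looks like a few more forms were missing:
> http://zh.wikipedia.org/wiki/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
> http://zh.wikipedia.org/zh-cn/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
> http://zh.wikipedia.org/zh-tw/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
> http://zh.wikipedia.org/zh-sg/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
> http://zh.wikipedia.org/zh-hk/Special:%E6%90%9C%E7%B4%A2/Anti-CNN
> %E6%90%9C%E7%B4%A2 is the encoding of the two characters "搜索" (search). There should also be a Traditional Chinese counterpart of "搜索", but I don't know how those two characters are encoded in Traditional Chinese.

Thanks. A reasonable level of thoroughness is actually enough. We only need to do part of this for autoproxy-gfwlist; in the end the keyword list still has to be maintained by them. An experimental version of url_keywords should be released for testing in a few days, but for the safety of the anonymous group members I hope everyone will test from networks outside the country as far as possible. Also, since the behavior of the experimental version is very uncertain, I'd like early testing to involve only users who are willing to read the code. All I can do is carry out sufficient static debugging before the release.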

崔莺莺

Mar 22, 2010, 2:51:13 PM

to scholarz...@googlegroups.com

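Is nobody willing to do even this?
(Thanks to darasion all the same, even though the result wasn't right.)
It's only 276 lines; if everyone did a little, it would be finished quickly.
If I have to do even this kind of thing myself, the front line is stretched far too thin.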

Chao Zhang

Mar 22, 2010, 2:57:05 PM

to scholarz...@googlegroups.com

When does this need to be finished?

--
_____________________________
Chao Zhang

崔莺莺

Mar 22, 2010, 3:11:21 PM

to scholarz...@googlegroups.com

On March 23, 2010 at 2:57 AM, Chao Zhang <chaoz...@gmail.com> wrote:
> When does this need to be finished?
There's no real rush; maybe two or three days, maybe four or five.

Chao Zhang

Mar 22, 2010, 11:14:58 PM

to scholarz...@googlegroups.com

Some are still missing:

zh.wikipedia.org/zh-cn/Talk:Anti-CNN
zh.wikipedia.org/zh-tw/Talk:Anti-CNN
zh.wikipedia.org/zh-hans/Talk:Anti-CNN
.....

For the more specific links, only a few URLs match the pattern:

.2000fun.com*bbs -> www.2000fun.com/bbs/
bbc.co.uk*chinese -> www.bbc.co.uk/chinese
bbc.co.uk*zhongwen -> bbc.co.uk/zhongwen
news.bbc.co.uk/onthisday*newsid_2496000/2496277 -> http://news.bbc.co.uk/onthisday/hi/dates/stories/june/4/newsid_2496000/2496277.stm

For terms with a broad meaning, there are many such combinations. For example:

|http://*blogger.com -> www.blogger.com
   -> draft.blogger.com
   -> buzz.blogger.com
   -> status.blogger.com
   -> code.blogger.com
   -> www.alexa.com/siteinfo/blogger.com
   -> play.blogger.com
   -> www.homeschoolblogger.com
   -> privateequityblogger.com/
   -> www.ufo-blogger.com
   -> www.freewayblogger.com
   -> www.becomeablogger.com
   -> www.theurbanblogger.com
   ....

All of these match. Doing this by hand is rather inefficient. One could write a program to submit these queries to Google, pick up the links from the results, and then check each link.

--
_____________________________
Chao Zhang

Jimmy Xu

Mar 23, 2010, 12:06:41 AM

to scholarz...@googlegroups.com

zhwiki URLs come in the following forms:

# http://zh.wikipedia.org/wiki/Foobar
# http://zh.wikipedia.org/zh-cn/Foobar
# http://zh.wikipedia.org/zh-tw/Foobar
# http://zh.wikipedia.org/zh-hk/Foobar
# http://zh.wikipedia.org/zh-sg/Foobar
# http://zh.wikipedia.org/zh-hans/Foobar (deprecated)
# http://zh.wikipedia.org/zh-hant/Foobar (deprecated)
# http://zh.wikipedia.org/index.php?title=Foobar
# http://zh.wikipedia.org/index.php?title=Foobar&variant=zh-*

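The main reason for writing * in the list is to cope with deep packet inspection. URL keywords, by contrast, rarely cover things this completely; usually they only knock out one or two of the first five address forms.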

FYI.

--
Jimmy Xu

WindyWinter

Mar 23, 2010, 12:02:40 AM

to scholarz...@googlegroups.com

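Er, it's not that nobody is willing to do it. While I was working on it I realized the problem isn't that simple, so I took just the wiki entries as a simple example in order to find out exactly what the result should look like. Now I more or less understand, but I can't do it, because some "*" patterns really do match infinitely many URLs, for example these: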

|http:*falun
|http:*freenet
|http:*q=freedom
|http:*search*safeweb

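Their intent is presumably to block any search for these keywords, and obviously we all know that these are keywords for the GFW's deep packet inspection...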

Soli Deo gloria,
yours WindyWinter
and http://www.briefdream.com

Darasion!

Mar 23, 2010, 12:16:13 AM

to scholarz...@googlegroups.com

Actually, there is a very simple way around exact URL matching.

For example:

This one gets reset: http://www.python.org/download/

This one works normally: http://www.python.org////////download/

Jimmy Xu

Mar 23, 2010, 12:17:41 AM

to scholarz...@googlegroups.com

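2010/3/23 Darasion! <dara...@gmail.com>:

> Actually, there is a very simple way around exact URL matching.
> For example:
> This one gets reset: http://www.python.org/download/
> This one works normally: http://www.python.org////////download/
>

A side note: many of the * in the list are really just legacy leftovers. The example above and the Google Docs entries, for instance, simply haven't been updated yet.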

--
Jimmy Xu

dylanklc

Mar 23, 2010, 2:13:11 AM

to scholarzhang-dev

The topic should say cygwin, shouldn't it? ......

ZhangJieJing

Mar 23, 2010, 2:28:39 AM

to scholarz...@googlegroups.com

My test result is that

http://www.python.org///////////download/

cannot be accessed either.

Tested on Shanghai Telecom.
---
Best regards,
Zhang Jiejing

Chunlin Zhang

Mar 23, 2010, 2:35:58 AM

to scholarz...@googlegroups.com

May I ask what the purpose of doing this is?

dylanklc

Mar 23, 2010, 3:12:14 AM

to scholarzhang-dev

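I took a rough look. To make collaboration easier, I split the keyword list into three parts:
1. num (num.list num_without_star.list num_with_star.list)
2. a_z (a_z.list a_z_without_star.list a_z_with_star.list)
3. other (other.list other_without_star.list other_with_star.list)
Most of the work is in parts 2 and 3. Please each claim a part and check it; I hope this makes everyone more efficient.
The archive can be downloaded from: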
https://groups.google.com/group/scholarzhang-dev/web/gfw_list.tar.gz
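
A split like this can be reproduced with a few lines of Python. The sketch below is a guess at the rule, not the method actually used for the archive: it assumes each entry is bucketed by its first alphanumeric character (ASCII digit, ASCII letter, or anything else) and by whether it contains `*`. The function name `bucket_file` is hypothetical; the file names follow the ones listed above.

```python
from collections import defaultdict


def bucket_file(rule: str) -> str:
    """Pick a target file name from the first alphanumeric character and the presence of '*'."""
    first = next((ch for ch in rule if ch.isalnum()), "")
    if first.isascii() and first.isdigit():
        group = "num"
    elif first.isascii() and first.isalpha():
        group = "a_z"
    else:
        group = "other"
    star = "with_star" if "*" in rule else "without_star"
    return f"{group}_{star}.list"


buckets: dict[str, list[str]] = defaultdict(list)
for rule in [".2000fun.com*bbs", "bbc.co.uk*chinese", "|http:*falun", "阅后即焚"]:
    buckets[bucket_file(rule)].append(rule)

for name, rules in sorted(buckets.items()):
    print(name, rules)
```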

崔莺莺

Mar 23, 2010, 9:35:38 AM

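to scholarz...@googlegroups.com

> For terms with a broad meaning, there are many such combinations. For example:
> |http://*blogger.com -> www.blogger.com
> -> draft.blogger.com
> -> buzz.blogger.com

You're overcomplicating it. .blogger.com is itself the keyword, so there's no need to expand it into the forms above.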

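On March 23, 2010 at 12:02 PM, WindyWinter <wi...@briefdream.com> wrote:
> |http:*falun
> |http:*freenet
> |http:*q=freedom
> |http:*search*safeweb
> Their intent is presumably to block any search for these keywords, and obviously we all know that these are keywords for the GFW's deep packet inspection...

You're taking too much for granted. falun is a URL keyword,
and so is q=freedom.
I don't know about the other two.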

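On March 23, 2010 at 2:28 PM, ZhangJieJing <kzj...@gmail.com> wrote:
> My test result is that
> http://www.python.org///////////download/
> cannot be accessed either.
But that string is not a keyword. The reset was triggered because the request got redirected.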

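On March 23, 2010 at 2:35 PM, Chunlin Zhang <zhangc...@gmail.com> wrote:
> May I ask what the purpose of doing this is?
The purpose is this: some of the keyword-matching templates in autoproxy match certain keywords, but not every string that matches such a template is necessarily a keyword.

To take an extreme example: q=triangle is a keyword, but q*triangle is not necessarily one. autoproxy may well contain q*triangle, and what we need to do is use what we know to restore q*triangle to q=triangle.

For instance, if we already know that ".google.com && 阅后即焚" is a keyword and autoproxy writes it as ".google.com*阅后即焚", there's no need to restore it to www.google.com/search?q=阅后即焚&hl=zh-CN; it can be left alone.

Likewise, if autoproxy contains ".google.com*罢课" and "罢课" by itself is a keyword, it can also be left alone, because every string matched by ".google.com*罢课" triggers the GFW.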
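
The "can be left alone" rule above can be written down as a small check. This is a sketch under stated assumptions: a known keyword is modelled as a tuple of substrings that must all be present (so ".google.com && 阅后即焚" becomes a two-element tuple), and a template is safe to skip when every part of some known keyword sits inside one of the template's literal pieces. The function name and the `known` list are illustrative only.

```python
def can_leave_alone(template: str, known_keywords: list[tuple[str, ...]]) -> bool:
    """True if every string matching the template must contain some known keyword."""
    pieces = [piece for piece in template.split("*") if piece]
    return any(
        all(any(part in piece for piece in pieces) for part in keyword)
        for keyword in known_keywords
    )


# Hypothetical knowledge base built from the examples in the message above.
known = [(".google.com", "阅后即焚"), ("罢课",), ("q=triangle",)]

for template in [".google.com*阅后即焚", ".google.com*罢课", "q*triangle"]:
    verdict = "leave alone" if can_leave_alone(template, known) else "needs restoring"
    print(template, "->", verdict)
```

Templates that fail the check, such as q*triangle, are the ones that need a human to restore them.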

WindyWinter

Mar 23, 2010, 9:39:26 AM

to scholarz...@googlegroups.com

So you mean: from the set of URLs that an autoproxy template can match, remove the ones that don't trigger the GFW and keep the ones that do?

Soli Deo gloria,
yours WindyWinter
and http://www.briefdream.com

崔莺莺

Mar 23, 2010, 9:53:36 AM

to scholarz...@googlegroups.com

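On March 23, 2010 at 9:39 PM, WindyWinter <wi...@briefdream.com> wrote:
> So you mean: from the set of URLs that an autoproxy template can match, remove the ones that don't trigger the GFW and keep the ones that do?
Basically yes, but there's no need to list as many as in your wikipedia example. List the few commonly used ones and leave the rest to autoproxy.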
