Donghua World: December 2007

Tuesday, December 18, 2007

HTML special characters

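Name	HTML	Result
Copyright	&copy;	©
Trademark (registered)	&reg;	®
Cent	&cent;	¢
Degree sign	&deg;	°
Double less-than (left angle quote)	&laquo;	«
Micron (micro sign)	&micro;	µ
Midline dot	&middot;	·
Negation, continuation line	&not;	¬
Paragraph	&para;	¶
Plus/Minus	&plusmn;	±
British Pound	&pound;	£
Double greater-than (right angle quote)	&raquo;	»
Section	&sect;	§
Yen	&yen;	¥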
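
As a quick check of the table, the following Python 3 sketch asks the standard library to resolve each named entity. It assumes the HTML column holds the usual named references (&copy;, &reg; and so on); the render helper is a name made up for this illustration.

```python
import html

# Named entity references as they would be typed in HTML source (assumed spellings).
ENTITIES = ["&copy;", "&reg;", "&cent;", "&deg;", "&laquo;", "&micro;", "&middot;",
            "&not;", "&para;", "&plusmn;", "&pound;", "&raquo;", "&sect;", "&yen;"]

def render(entities):
    """Map each entity reference to the character a browser would display."""
    return {entity: html.unescape(entity) for entity in entities}

for entity, char in render(ENTITIES).items():
    print(entity, "->", char)
```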

Monday, December 10, 2007

My blog was removed from Google search results

Google penalized me for a post that, without my realizing it, broke Google's guidelines.

Here are the guidelines:

Avoid hidden text or hidden links.
Don't use cloaking or sneaky redirects.
Don't send automated queries to Google.
Don't load pages with irrelevant keywords.
Don't create multiple pages, subdomains, or domains with substantially duplicate content.
Don't create pages that install viruses, trojans, or other badware.
Avoid "doorway" pages created just for search engines, or other "cookie cutter" approaches such as affiliate programs with little or no original content.
If your site participates in an affiliate program, make sure that your site adds value. Provide unique and relevant content that gives users a reason to visit your site first.

I hope Google reindexes my site.
This is a terrible thing to happen.

Changing the icon of a Google Page Creator site

In the head section, add the following code:

<link rel="icon" href="http://www.yoursite.com/favicon.ico" type="image/x-icon" />
<link rel="shortcut icon" href="http://www.yoursite.com/favicon.ico" type="image/x-icon" />

Sunday, December 09, 2007

MFA sites and filtering AdSense to get higher-paying ads

MFA = Made for AdSense sites.
They bid extremely low on AdWords, and the ads of theirs that appear on your site pay very little (like your $0.05).
Filtering out some MFA sites can get you better results.
Use the Google AdSense preview tool to preview the ads that appear in the US and UK;
it has a "Show Ad URL" option.
Check each URL to decide whether it is an MFA site, and if so filter it in your AdSense account.
It helped me a lot. I now have ads paying $1-4 per click on my page.

Keywords CTR and Conversion

That is a loaded question with no fixed answer.

There are so many other variables to take into account. My own initial target is a 5% CTR; I go from there and revise it depending on circumstances. However, having managed many campaigns for others, I find there are occasionally things you can't explain.

Example 1: a client of mine selling aquarium products is in first or second position for most keywords. But the CTR is low, in the 2-3% range and sometimes lower for some categories. Why, when he is near or often at the top? I would expect the CTR to be near or easily breaking the 10% barrier in this case. Sometimes it just is low, and I can't explain why people are not clicking on the ad. The first thing to look at is the keywords, the next the ads (clickability). Obviously, there is something we have not yet figured out: what people are searching for is not what the ad is offering. Negative keywords will help in weeding those searches out. Right now, the target is more like 2.5%. We are running tests that reduce the CPC to lower the position and see how the CTR is affected. Sometimes a lower position gives the best results.

Example 2: another client decided to make his own ad (something I just hate when they do). I judged the client's ad very poor (not clickable). However, it gets twice the CTR of any of my ads, about 8% in sixth position. I can't figure out why. Again, one of those things. I try to learn from it, and I am testing new ads to try to beat his CTR.

As for the conversion rate, I want a rate that will still yield a profit at the average cost per click the client is paying. This means testing not only different ads but different page copy. I normally stop a campaign if it can't bring at least a 1% conversion. Factors to take into consideration are the product and the competition. I have my own scale created based on Adwords' own Quality Score terms. It is not set in stone for every product but used as a general reference.

A conversion rate is considered Poor if it is under 1.25%. It is OK between 1.25 and 2%, Good between two and three percent and Great over three percent. I sometimes go even further by calling it Super if it is over five percent.

But a conversion rate of 1% can be considered Great in a niche with lots of competitors. With 500 daily visitors from clicks, an actual CPC of 10 cents, and a product selling for $100 with an after-cost profit of $75, the five resulting sales bring a pure daily profit of $325 after advertising costs.
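
The arithmetic in that example can be checked with a few lines of Python. This is only a sketch: the function name daily_profit and its parameter names are invented here, and the figures are the ones quoted above.

```python
def daily_profit(visitors, conversion_rate, cpc, profit_per_sale):
    """Profit per day after paying for every click (all names are illustrative)."""
    sales = visitors * conversion_rate   # 500 visitors at 1% is 5 sales
    ad_cost = visitors * cpc             # 500 clicks at $0.10 is $50
    return sales * profit_per_sale - ad_cost

# Figures from the example: 500 visitors, 1% conversion, $0.10 CPC, $75 profit per sale.
profit = daily_profit(visitors=500, conversion_rate=0.01, cpc=0.10, profit_per_sale=75.0)
print("daily profit: $%.2f" % profit)
```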

A lower CTR ad can be better than a higher one. It could have a much higher conversion rate than a higher CTR ad with a lower conversion. Which ad is best to use? Simply multiply the CTR and conversion rate. The ad with the higher number is the one giving the best returns, given everything else such as position and CPC are the same.
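
To make the comparison rule concrete, here is a small Python sketch. The two ads and their rates are hypothetical numbers chosen for illustration, and it assumes, as stated above, that position and CPC are the same for both ads.

```python
def returns_score(ctr, conversion_rate):
    """Sales per impression: the product of CTR and conversion rate."""
    return ctr * conversion_rate

# Hypothetical ads as (CTR, conversion rate) pairs.
ads = {
    "ad_a": (0.08, 0.010),   # high CTR, low conversion
    "ad_b": (0.03, 0.035),   # low CTR, high conversion
}

best = max(ads, key=lambda name: returns_score(*ads[name]))
for name, (ctr, conv) in ads.items():
    print(name, "%.5f" % returns_score(ctr, conv))
print("best:", best)
```

With these made-up figures the lower-CTR ad wins, because its conversion rate more than makes up for the clicks it does not get.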

But it all means nothing if you can't make a profit.

Spam prevention robots

I created a blog and posted some content from a group on it.

A week later, I received an email saying the blog is spam.

It seems there are spam-prevention robots doing the detection work.

I really want to know how they judge whether a blog is spam.

Here is some of the information I received:

Blogger's spam-prevention robots have detected that your blog has characteristics of a spam blog. (What's a spam blog?) Since you're an actual person reading this, your blog is probably not a spam blog. Automated spam detection is inherently fuzzy, and we sincerely apologize for this false positive.

We received your unlock request on December 9, 2007. On behalf of the robots, we apologize for locking your non-spam blog. Please be patient while we take a look at your blog and verify that it is not spam.

Find out more about how Blogger is fighting spam blogs.

Thursday, December 06, 2007

Parsing POP mail with Python

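Overall, handling mail in Python is fairly convenient, and the library provides many tools. Below I write up what I have learned, as a starting point for beginners; experts are welcome to suggest better approaches.
First, receiving mail with the poplib module.
1. server=poplib.POP3('your POP mail server address') connects to the server.
2. server.user('user name') and server.pass_('password') log in.
3. server.stat() returns a tuple: (number of messages, total size)
   mailCount,size=server.stat()
   Now mailCount is the number of messages and size is the total size of all messages.

4. server.retr(message number) returns a tuple: (response, message lines, message size)
  hdr,message,octet=server.retr(1) reads the first message.
   hdr is the server response, which includes the message size, e.g. '+OK 12498 octets'
   message is a list containing all the lines of the message.
   octet is the size of this message in bytes.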

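The message we get is the raw mail, that is, not yet decoded; its body and subject are mostly base64-encoded. Here is how to process the raw mail.
Python's email library provides many ways to process mail. We first convert the raw mail into an email object, so that the library's methods can be used on it.
email.message_from_string() converts a mail held in a string into an email.Message instance.
For example, with the message from above, call it like this:
mail=email.message_from_string(string.join(message,'\n'))
This gives us an email.Message instance.

Now let's extract the subject and body. The mail object supports dictionary-style access, for example:
mail['subject'] ,mail.get('subject')
mail['To'],mail.get('to')
mail.keys() ,mail.items() and so on.

The subject and body of Chinese mail are usually base64-encoded. To decode them, use the decode_header() function in email.Header.
For example, mail['subject'] holds the encoded text, not yet processed: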
'=?GB2312?B?UmU6IFtweXRob24tY2hpbmVzZV0g?=\n\t=?GB2312?B?y63E3LDvztLV0tbQzsS1xFBZVEhPTrP10afRp8+wtcTXysHP?='

email.Header.decode_header(mail['subject']) gives the decoded information:
[('Re: [python-chinese] \xcb\xad\xc4\xdc\xb0\xef\xce\xd2\xd5\xd2\xd6\xd0\xce\xc4\xb5\xc4PYTHON\xb3\xf5\xd1\xa7\xd1\xa7\xcf\xb0\xb5\xc4\xd7\xca\xc1\xcf', 'gb2312')]
It returns a list whose items are tuples of (decoded string, character encoding).

To display the decoded subject, do this:
print email.Header.decode_header(mail['subject'])[0][0]
Re: [python-chinese] 谁能帮我找中文的PYTHON初学学习的资料

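The subject above is encoded in 'gb2312', which my WinXP machine can display directly. If the encoding is something else, such as 'utf-8', the output is garbled. So we need the unicode() function: unicode('the string', 'the encoding, e.g. UTF-8'). For example:
subject=email.Header.decode_header(mail['subject'])[0][0]
subcode=email.Header.decode_header(mail['subject'])[0][1]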

print unicode(subject,subcode)
Re: [python-chinese] 谁能帮我找中文的PYTHON初学学习的资料
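
For readers on Python 3, where unicode() and the email.Header module no longer exist under those names, roughly the same decoding can be sketched as follows. The helper name decode_subject and the sample subject are assumptions made for this illustration; they do not come from the mail shown above.

```python
from email.header import decode_header

def decode_subject(raw_subject):
    """Decode an RFC 2047 encoded header into text (illustrative helper)."""
    parts = []
    for value, charset in decode_header(raw_subject):
        if isinstance(value, bytes):
            # Encoded chunks come back as bytes plus the charset they declare.
            parts.append(value.decode(charset or "ascii", errors="replace"))
        else:
            # A header without any encoded words is returned as plain str.
            parts.append(value)
    return "".join(parts)

# Made-up sample: a base64-encoded UTF-8 subject.
sample = "=?UTF-8?B?5L2g5aW9?="
print(decode_subject(sample))
```

Passing errors="replace" is a design choice for display purposes: a mislabeled charset then shows replacement characters instead of raising UnicodeDecodeError.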

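Now let's see how to handle the mail body.
The mail object has many methods; once you know them, processing mail is easy.
get_payload() returns the body of the mail. Its first optional argument, i, selects one part of a multipart message; its second, decode, is a flag: with decode=True the body is decoded according to its Content-Transfer-Encoding, which is usually base64.
is_multipart() returns a boolean: True if the message consists of multiple parts.
print mail.is_multipart()
True  , which means this mail contains multiple parts. The function below handles this and displays the entire content of the mail.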


def showmessage(mail):
    # Recurse into each part of a multipart message.
    if mail.is_multipart():
        for part in mail.get_payload():
            showmessage(part)
    else:
        charset=mail.get_content_charset()
        if charset is None:
            print mail.get_payload()
        else:
            try:
                # decode=True undoes the transfer encoding (e.g. base64).
                print unicode(mail.get_payload(decode=True),charset)
            except UnicodeDecodeError:
                print mail

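Finally, one thing to note: if the Chinese text in a mail cannot be displayed properly with either email.Header.decode_header() or unicode(), then it cannot be recovered, and what you see is garbage. For example, as you can see below, the result is still garbled even after all the processing.

>>> mail.get('subject')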

'Re: [python-chinese] =?UTF-8?B?wrnDmMOTw5p4bWzCscOgw4LDq8K1w4TDjg==?=\n\t=?UTF-8?B?w4rDjMOi?='

>>> decode_header( mail.get('subject'))

[('Re: [python-chinese]', None), ('\xc2\xb9\xc3\x98\xc3\x93\xc3\x9axml\xc2\xb1\xc3\xa0\xc3\x82\xc3\xab\xc2\xb5\xc3\x84\xc3\x8e\xc3\x8a\xc3\x8c\xc3\xa2', 'utf-8')]

>>> print decode_header( mail.get('subject'))[1][0]

鹿脴脫脷xml卤脿脗毛碌脛脦脢脤芒

>>> print unicode(decode_header( mail.get('subject'))[1][0],'utf-8')

1?óúxml±à??μ??êìa

Python POP example

import poplib
import sys
import traceback

class GetMail:
    """Fetch messages from a POP3 mailbox and keep their raw text in self.data."""

    def __init__(self,server,username,passwd):
        self.servername = server
        self.username = username
        self.passwd = passwd
        self.data = []

    def connect(self):
        # Log in; return True on success, False after printing the traceback.
        try:
            self.pop = poplib.POP3(self.servername)
            self.pop.set_debuglevel(1)
            self.pop.user(self.username)
            self.pop.pass_(self.passwd)
            return True
        except Exception:
            print "--"*20
            traceback.print_exc(file=sys.stdout)
            print "--"*20
            return False

    def get_mail(self):
        num,total_size = self.pop.stat()
        for index in range(1, num+1):
            # list(index) answers "+OK <index> <size>"; the last field is the size in bytes.
            status = self.pop.list(index)
            length = int(status.split()[-1])
            print "length %d" % length
            # Only download messages smaller than 50000 bytes.
            if length < 50000:
                hdr,messages,octet = self.pop.retr(index)
                message = ""
                for line in messages:
                    if line and line[-1] == "=":
                        # A trailing "=" is treated as a quoted-printable soft
                        # line break, so the line is joined with the next one.
                        message += line[0:-1]
                    else:
                        message += line + "\n"
                print message
                self.data.append(message)

if __name__ == "__main__":
    gm = GetMail("pop3.126.com","*","*")
    if not gm.connect():
        print "connect error"
        sys.exit(1)
    gm.get_mail()
    gm.pop.quit()
    # Write all messages to one file, separated by a "-*-" ruler.
    fileout = open("dafads","w")
    for message in gm.data:
        fileout.write(message)
        fileout.write("-*-"*20)
        fileout.write("\n\n")
    fileout.close()

Wednesday, December 05, 2007

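Regular expressions: lookahead, which matches no characters; lookbehind, which matches no characters

In the previous sections I covered several special symbols with an abstract meaning: "^", "$" and "\b". They have one thing in common: they match no characters themselves, and only attach a condition to "the two ends of the string" or "the gap between characters". With that concept understood, this section introduces another, more flexible way of attaching conditions to an "end" or a "gap".

    Lookahead: "(?=xxxxx)", "(?!xxxxx)"

    Format: "(?=xxxxx)". In the string being matched, the condition it attaches to the "gap" or "end" where it sits is: the text to the right of the gap must match the xxxxx part of the expression. Because it only acts as a condition on this gap, it does not affect how the rest of the expression actually matches the characters after the gap. This is similar to "\b", which matches no characters itself. "\b" only looks at the characters before and after the gap to make a decision, and does not affect the actual matching done by the rest of the expression.

    Example 1: when the expression "Windows (?=NT|XP)" is matched against "Windows 98, Windows NT, Windows 2000", it matches only the "Windows " in "Windows NT"; the other occurrences of "Windows " are not matched.

    Example 2: when the expression "(\w)((?=\1\1\1)(\1))+" is matched against the string "aaa ffffff 999999999", it matches the first 4 of the 6 "f" characters and the first 7 of the 9 "9" characters. The expression can be read as: for a letter or digit repeated 4 or more times, match everything before its last 2 occurrences. Of course, the expression does not have to be written this way; it is here for demonstration.

    Format: "(?!xxxxx)". The text to the right of the gap must not match the xxxxx part of the expression.

    Example 3: when the expression "((?!\bstop\b).)+" is matched against "fdjka ljfdl stop fjdsla fdj", it matches from the start up to the position just before "stop". If the string does not contain "stop", it matches the whole string.

    Example 4: when the expression "do(?!\w)" is matched against the string "done, do, dog", it can only match the standalone "do". In this example, putting "(?!\w)" after "do" has the same effect as putting "\b" there.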
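
    The lookahead examples above can be tried in Python, whose re module supports both forms. This is a sketch for checking the examples, written for Python 3; the variable names are arbitrary.

```python
import re

# Example 1: only the "Windows " that is followed by NT or XP matches.
m1 = re.search(r"Windows (?=NT|XP)", "Windows 98, Windows NT, Windows 2000")
print(m1.start(), repr(m1.group()))

# Example 2: a character repeated 4+ times, minus its last two occurrences.
runs = [m.group() for m in
        re.finditer(r"(\w)((?=\1\1\1)(\1))+", "aaa ffffff 999999999")]
print(runs)

# Example 3: consume characters until the whole word "stop" is next.
m3 = re.match(r"((?!\bstop\b).)+", "fdjka ljfdl stop fjdsla fdj")
print(repr(m3.group()))

# Example 4: "do" not followed by a word character.
hits = [m.span() for m in re.finditer(r"do(?!\w)", "done, do, dog")]
print(hits)
```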

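    Lookbehind: "(?<=xxxxx)", "(?<!xxxxx)"

    The concept behind these two formats is similar to lookahead. The condition a lookbehind imposes is on the "left side" of the gap: the two formats require, respectively, that the text there must match or must not match the given expression, instead of checking the right side. As with lookahead, both are conditions attached to the gap and match no characters themselves.

    Example 5: when the expression "(?<=\d{4})\d+(?=\d{4})" is matched against "1234567890123456", it matches the 8 digits in the middle, leaving out the first 4 and the last 4 digits. Because JScript.RegExp does not support lookbehind, this example cannot be demonstrated here. Many other engines do support lookbehind, for example the java.util.regex package in Java 1.4 and later, the System.Text.RegularExpressions namespace in .NET, and the DEELX regex engine, which this site recommends as the simplest and easiest to use.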
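
    Python's re module is another engine that supports lookbehind, as long as the lookbehind pattern has a fixed width. Below is a small Python 3 sketch of example 5, plus one negative-lookbehind case whose pattern and sample text are made up for illustration.

```python
import re

# Example 5: digits with at least 4 digits before and 4 digits after them.
m5 = re.search(r"(?<=\d{4})\d+(?=\d{4})", "1234567890123456")
print(m5.group())

# Hypothetical negative lookbehind: whole numbers not preceded by "$".
unpriced = re.findall(r"(?<!\$)\b\d+", "cost $30 for 12 items")
print(unpriced)
```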

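Playing cards and the calendar (repost)

Kind of interesting.

Many people use playing cards for all sorts of games and entertainment, but probably few know the lore behind them. A deck of cards is the calendar in miniature.

Of the fifty-four cards, fifty-two are regular cards, representing the fifty-two weeks in a year. The other two are the jokers: the big joker represents the sun and the small joker the moon. The four seasons, spring, summer, autumn and winter, are represented by spades, hearts, clubs and diamonds; of these, hearts and diamonds represent day, while spades and clubs represent night.

Each season has thirteen weeks, and each suit has exactly thirteen cards. Each season has ninety-one days, and the points of the thirteen cards add up to exactly ninety-one. The points of the four suits together, plus one point for the small joker, make three hundred sixty-five. Adding one more point for the big joker gives exactly the number of days in a leap year.

The J, Q and K cards, twelve in all, represent both the twelve months of the year and the twelve constellations the sun passes through in a year.

The four suits also carry different meanings: the spade symbolizes an olive leaf, meaning peace; the heart is heart-shaped, meaning wisdom; the club is a black trefoil, derived from clover; the diamond stands for a gemstone, meaning wealth. The four suits are good wishes for people throughout the year.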

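The efficiency of non-greedy matching (repost)

Quite a few people have probably had the same experience as me: when we want to match text like "<td>content</td>" or "[b]bold[/b]", we use lookahead to write expressions such as "<td>([^<]|<(?!/td>))*</td>" or "<td>((?!</td>).)*</td>".

Then we discover non-greedy matching and suddenly realize that an expression with the same function can be written as simply as "<td>.*?</td>". It feels like finding a treasure, and from then on we use the concise non-greedy ".*?" wherever we match by boundaries. For complex expressions in particular, using the non-greedy ".*?" really does make them much shorter.

However, when an expression contains several non-greedy quantifiers, or several sub-expressions with an unknown number of repetitions, it may hide an efficiency trap. Sometimes matching is inexplicably slow, to the point that one starts to doubt whether regular expressions are practical at all.

How the efficiency trap arises:

The description of non-greedy matching in this site's introductory article says: "If matching less would cause the whole expression to fail, then, similar to greedy mode, non-greedy mode matches a minimal amount more so that the whole expression can succeed."

The matching process in detail is as follows:

The "non-greedy part" first matches the minimum number of times, then the engine tries to match the "expression to its right".
If the expression to the right matches, the whole match is finished. If it fails, the "non-greedy part" matches one more time, and the "expression to its right" is tried again.
If the expression to the right fails again, the "non-greedy part" matches one more time again, and the "expression to its right" is tried once more.
And so on. The end result is that the "non-greedy part" matches as few times as possible while letting the whole expression succeed, or the match ultimately fails.

When an expression contains several non-greedy quantifiers, take "d(\w+?)d(\w+?)z" as an example: for the "\w+?" in the first parentheses, the "d(\w+?)z" on its right is its "expression to the right"; for the "\w+?" in the second parentheses, the "z" on its right is its "expression to the right".

When "z" fails to match, the second "\w+?" "matches one more time" and "z" is tried again. If "z" still cannot match no matter how far the second "\w+?" is extended, all the way to the end of the text, then "d(\w+?)z" has failed, which means the "right side" of the first "\w+?" has failed. The first "\w+?" then matches one more time, and "d(\w+?)z" is tried again. The process described above repeats until the first "\w+?" cannot make the following "d(\w+?)z" match however far it is extended; only then is the whole expression declared a failure.

In fact, greedy matching also "gives back" characters it has already matched when that is needed for the whole expression to succeed, so greedy matching has a similar problem. When an expression contains many sub-expressions with an unknown number of repetitions, each greedy or non-greedy part has to try decreasing or increasing its repetition count to let the whole expression succeed. This easily forms one big loop of attempts and leads to very long matching times. This article calls it a "trap" because this kind of efficiency problem is often hard to notice.

Example: when "d(\w+?)d(\w+?)d(\w+?)z" is matched against "ddddddddddd...", it takes a fairly long time to determine that the match fails.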
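
The growth in failed attempts can be observed with a short Python 3 sketch. It uses Python's re module instead of the engines discussed in the article, anchors the attempt at the start of the text with match() to keep the run short, and the helper name time_failed_match is invented here. Actual timings depend on the machine, so none are quoted.

```python
import re
import time

pattern = re.compile(r"d(\w+?)d(\w+?)d(\w+?)z")

def time_failed_match(n):
    """Try the pattern on n 'd' characters; return (match result, seconds)."""
    text = "d" * n
    start = time.perf_counter()
    result = pattern.match(text)   # can never succeed: the text has no "z"
    return result, time.perf_counter() - start

# Each doubling of the text length multiplies the work of the three nested
# "extend and retry" loops, so the time grows much faster than the length.
for n in (40, 80, 160):
    result, elapsed = time_failed_match(n)
    print(n, result, "%.4f s" % elapsed)

# For comparison, the same pattern succeeds immediately on a matching text.
print(pattern.match("dxdydwz").groups())
```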

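Avoiding the efficiency trap:

The principle for avoiding the trap is: avoid "nested loops" of "match attempts". This does not mean non-greedy matching is bad; it only means that when you use it, you need to watch out for too many "looping attempts".

Case 1: an expression with only one non-greedy or greedy quantifier has no efficiency trap. That is, to match text like "<td> content </td>", the expressions "<td>([^<]|<(?!/td>))*</td>", "<td>((?!</td>).)*</td>" and "<td>.*?</td>" are equivalent in efficiency.

Case 2: if an expression contains several sub-expressions with an unknown number of repetitions, unnecessary match attempts should be prevented.

For example, take the expression "<script language='(.*?)'>(.*?)</script>". If its first part matches successfully at "<script language='vbscript'>" but the following "(.*?)</script>" fails, the first ".*?" will be extended and tried again. Given what the expression is really for, letting the first ".*?" grow to match "vbscript'>" is wrong, so such attempts are unnecessary.

Therefore, for expressions that rely on boundaries, do not let a part with an unknown number of repetitions cross its boundary. In the expression above, the first ".*?" should be rewritten as "[^']*". The second ".*?" has no further sub-expression with an unknown number of repetitions to its right, so that non-greedy match has no efficiency trap. The expression for matching a script block is therefore better written as: "<script language='([^']*)'>(.*?)</script>".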
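
As a sanity check, the following Python 3 sketch compares the two script expressions. The sample strings are made up for illustration. It only shows that both expressions capture the same groups and both fail on an unterminated block; it does not measure how many extra attempts the first form makes before failing.

```python
import re

loose = re.compile(r"<script language='(.*?)'>(.*?)</script>")
strict = re.compile(r"<script language='([^']*)'>(.*?)</script>")

ok = "<script language='vbscript'>MsgBox 1</script>"
broken = "<script language='vbscript'>MsgBox 1"   # closing tag missing

for name, rx in (("loose", loose), ("strict", strict)):
    # Same captures on well-formed input, None on the unterminated one.
    print(name, rx.search(ok).groups(), rx.search(broken))
```

The difference is in the failure case: "[^']*" cannot grow past the closing quote, so the engine has nothing to retry there, while ".*?" keeps extending across the boundary before giving up.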

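Greedy and non-greedy repetition (repost)

When using the special symbols that specify repetition counts, several notations let the same expression match a varying number of times, for example "{m,n}", "{m,}", "?", "*" and "+". The actual number of repetitions depends on the string being matched. During matching, such variable-count expressions always match as much as possible. For example, against the text "dxxxdxxxd":

Expression	Match result
(d)(\w+)	"\w+" matches all characters after the first "d": "xxxdxxxd"
(d)(\w+)(d)	"\w+" matches all characters between the first "d" and the last "d": "xxxdxxx". Although "\w+" could also match the last "d", it "gives up" that last "d" so that the whole expression can match

As you can see, "\w+" always matches as many characters that fit its rule as possible. In the second example it did not match the last "d", but only so that the whole expression could succeed. Likewise, expressions with "*" and "{m,n}" match as much as possible, and an expression with "?" will, when it can either match or not, prefer "to match". This matching principle is called "greedy" mode.

Non-greedy mode:

Adding a "?" after a repetition symbol makes a variable-count expression match as little as possible, and makes an optional expression prefer "not to match". This matching principle is called "non-greedy" mode, also known as "reluctant" mode. If matching less would cause the whole expression to fail, then, similar to greedy mode, non-greedy mode matches a minimal amount more so that the whole expression can succeed. Examples against the text "dxxxdxxxd":

Expression	Match result
(d)(\w+?)	"\w+?" matches as few characters after the first "d" as possible; the result is that "\w+?" matches only one "x"
(d)(\w+?)(d)	For the whole expression to succeed, "\w+?" has to match "xxx" so that the following "d" can match. The result is that "\w+?" matches "xxx"

    More cases:

    Example 1: when the expression "<td>(.*)</td>" is matched against the string "<td><p>aa</p></td> <td><p>bb</p></td>", the match succeeds and the matched content is the whole string "<td><p>aa</p></td> <td><p>bb</p></td>"; the "</td>" in the expression matches the last "</td>" in the string.

    Example 2: by contrast, when the expression "<td>(.*?)</td>" is matched against the same string as in example 1, it yields only "<td><p>aa</p></td>"; matching again gives the second one, "<td><p>bb</p></td>".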
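
    The results in the tables and examples above can be reproduced in Python. This is a Python 3 sketch using the re module; the variable names are arbitrary.

```python
import re

text = "dxxxdxxxd"

# Greedy: \w+ takes as much as it can, giving back only what is needed.
greedy_open = re.match(r"(d)(\w+)", text).group(2)
greedy_closed = re.match(r"(d)(\w+)(d)", text).group(2)

# Non-greedy: \w+? takes as little as it can, growing only when forced to.
lazy_open = re.match(r"(d)(\w+?)", text).group(2)
lazy_closed = re.match(r"(d)(\w+?)(d)", text).group(2)

print(greedy_open, greedy_closed, lazy_open, lazy_closed)

cells = "<td><p>aa</p></td> <td><p>bb</p></td>"
whole = re.search(r"<td>(.*)</td>", cells).group()                   # example 1
each = [m.group() for m in re.finditer(r"<td>(.*?)</td>", cells)]    # example 2
print(whole)
print(each)
```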

Tuesday, December 04, 2007

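Well, I'll be damned

Yesterday it suddenly shot up to 13 yuan,
this morning I took a look,
and it was already at 19,
this is happening way too fast,
well, I'll be damned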
