Python OCR CAPTCHA Recognition (pytesseract)

Installation

  1. Install the Python Imaging Library (PIL)
  2. Install Google's tesseract-ocr
  3. Install pytesseract with pip
sudo apt-get install python-imaging
sudo apt-get install tesseract-ocr
sudo pip install pytesseract

Minimal example

>>> from PIL import Image
>>> import pytesseract

>>> ocr = Image.open('./Pictures/rand.jpg')
>>> pytesseract.image_to_string(ocr)
'240323'

Wrapping the CAPTCHA recognition in ocr.py

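import httplib2
import pytesseract
from io import BytesIO
from PIL import Image


def process_image(url):
    # Download the image at url and return the text Tesseract recognizes in it.
    image = _get_image(url)
    return pytesseract.image_to_string(image)


def _get_image(url):
    # Print the HTTP exchange to stdout, which helps when debugging.
    httplib2.debuglevel = 1
    # Cache responses in the local '.cache' directory.
    h = httplib2.Http('.cache')
    response, content = h.request(url)
    # content is the raw response body (bytes); wrap it so PIL can read it.
    return Image.open(BytesIO(content))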

Calling the process_image function from ocr

Using the EMS CAPTCHA as an example:

>>> import ocr
>>> url = 'http://www.ems.com.cn/ems/rand'
>>> ocr.process_image(url)
connect: (www.ems.com.cn, 80) ************
send: 'GET /ems/rand HTTP/1.1\r\nHost: www.ems.com.cn\r\naccept-encoding: gzip, deflate\r\nuser-agent: Python-httplib2/0.8 (gzip)\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control: no-cache
header: Connection: close
header: Date: Wed, 11 Nov 2015 17:19:47 GMT
header: Content-Type: image/jpeg
header: Set-Cookie: JSESSIONID=Yfr9WD4TJbvpyPhqPdkndlJng6bH5P7LJKCqfygjZn4PvF9Kdlk8!-968903311; path=/; HttpOnly
header: X-Powered-By: Servlet/2.5 JSP/2.1
header: Set-Cookie: BIGipServerweb_pool=67830538.37151.0000; path=/
'443929'

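When WeChat verifies that a message is authentic, the signing and verification flow is as follows:

  1. Sort the three parameters token, timestamp, and nonce in lexicographic order
  2. Concatenate the three parameter strings into a single string and compute its SHA-1 hash
  3. The developer compares the resulting hash with signature; a match indicates that the request came from WeChat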
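
The three steps above can be sketched as a single function. This is a minimal sketch, not WeChat's reference implementation: the name check_signature is hypothetical, and it assumes all four values arrive as text strings, for example from the query string of the incoming request.

```python
import hashlib
import hmac


def check_signature(token, timestamp, nonce, signature):
    # check_signature is a hypothetical helper name, not part of any library.
    # Step 1: sort the three values lexicographically.
    parts = sorted([token, timestamp, nonce])
    # Step 2: concatenate them and hash the result with SHA-1.
    digest = hashlib.sha1(''.join(parts).encode('utf-8')).hexdigest()
    # Step 3: compare with the signature sent in the request.
    # compare_digest runs in constant time; both sides are passed as bytes.
    return hmac.compare_digest(digest.encode('ascii'), signature.encode('utf-8'))
```

A request would be treated as coming from WeChat only when this function returns True.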

The web server is Flask, so these steps have to be implemented in Python. Below are some issues that came up along the way:

Sorting and concatenating the strings

The join method of the str type concatenates the sorted list of strings into a single string.

join(…)

S.join(iterable) -> string

Return a string which is the concatenation of the strings in the iterable. The separator between elements is S.

>>> lst = ['token value', 'timestamp value', 'nonce value']
>>> lst.sort()
>>> # '' means that no separator is used
>>> tmp_str = ''.join(lst)

SHA-1 hashing

Use the hashlib module from the Python standard library

openssl_sha1(…)

Returns a sha1 hash object; optionally initialized with a string

>>> import hashlib
>>> data = b'SHA1 encryption'
>>> sha1 = hashlib.sha1(data)
>>> sha1.hexdigest()
'd8e6ade66fb7741271995f29a4f9e93090b2a00b'
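
On Python 3, hashlib.sha1 accepts only bytes, so text has to be encoded before hashing. The helper below is a small sketch of that step; the name sha1_hex and the default of UTF-8 are assumptions made for illustration.

```python
import hashlib


def sha1_hex(text, encoding='utf-8'):
    # sha1_hex is a hypothetical helper name.
    # Encode text to bytes first; bytes input is passed through unchanged.
    data = text.encode(encoding) if isinstance(text, str) else text
    # hexdigest() returns the 160-bit hash as 40 hexadecimal characters.
    return hashlib.sha1(data).hexdigest()
```

Note that SHA-1 is a one-way hash rather than encryption: the digest cannot be turned back into the original string.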

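Conditional (ternary) expression

Writing the ternary expression the C way fails with a syntax error: X ? V1 : V2

Only then did I remember that Python spells it differently: V1 if X else V2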

>>> 'Success' if True else 'Failure'
'Success'
>>> 'Success' if False else 'Failure'
'Failure'
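
One detail worth seeing in code is that only the selected branch is evaluated. The two functions below are a short sketch with hypothetical names (describe, safe_div) chosen for illustration.

```python
def describe(signature_ok):
    # V1 if X else V2: the condition sits in the middle.
    return 'Success' if signature_ok else 'Failure'


def safe_div(a, b):
    # Only the chosen branch runs, so a / b is never evaluated when b == 0.
    return a / b if b != 0 else None
```

Because the unused branch is skipped, safe_div(1, 0) returns None instead of raising ZeroDivisionError.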

You’ll find this post in your _posts directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run jekyll serve, which launches a web server and auto-regenerates your site when a file is updated.

To add new posts, simply add a file in the _posts directory that follows the convention YYYY-MM-DD-name-of-post.ext and includes the necessary front matter. Take a look at the source for this post to get an idea about how it works.

Jekyll also offers powerful support for code snippets:

def print_hi(name)
  puts "Hi, #{name}"
end
print_hi('Tom')
#=> prints 'Hi, Tom' to STDOUT.

Check out the Jekyll docs for more info on how to get the most out of Jekyll. File all bugs/feature requests at Jekyll’s GitHub repo. If you have questions, you can ask them on Jekyll’s dedicated Help repository.