encoding - Need to find the requests equivalent of urlopen() from urllib2
I am trying to modify a script to use the requests library instead of the urllib2 library. I haven't used it before, and I am looking for the equivalent of urlopen("http://www.example.org").read(), so I tried requests.get("http://www.example.org").text.
This works fine for normal everyday HTML, but when I fetch this URL (https://gtfsrt.api.translink.com.au/feed/seq) it doesn't seem to work.
So I wrote the code below to print out the responses from the same URL using both the requests and urllib2 libraries.
    import urllib2
    import requests

    # urllib2 request
    request = urllib2.Request("https://gtfsrt.api.translink.com.au/feed/seq")
    result = urllib2.urlopen(request)

    # requests request
    result2 = requests.get("https://gtfsrt.api.translink.com.au/feed/seq")
    print result2.encoding

    # urllib2: write the body to output.txt
    open("output.txt", 'w').close()
    text_file = open("output.txt", "w")
    text_file.write(result.read())
    text_file.close()

    # requests: write the body to output2.txt
    open("output2.txt", 'w').close()
    text_file = open("output2.txt", "w")
    text_file.write(result2.text)
    text_file.close()
The urlopen().read() call works fine, but requests.get().text doesn't work for the given URL. I suspect it has something to do with encoding, but I don't know what. Any thoughts?
Note: the supplied URL is a feed in Google Protocol Buffer format; once I receive the message, I give the feed to a Google library that interprets it.
Your issue is that you're making the requests module interpret binary content in a response as text.
A response from the requests library has two main ways to access the body of the response:
- response.content - returns the response body as a bytestring
- response.text - decodes the response body as text and returns unicode
Since protocol buffers are a binary format, you should use result2.content in your code instead of result2.text.
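As a minimal sketch of that change, written for Python 3 rather than the Python 2 of the question: the body is taken from .content and the file is opened in binary mode, so nothing is decoded along the way. The save_body helper, the stand-in response object and the file name are assumptions made for illustration; no network request is made here.

```python
import os
import tempfile
from types import SimpleNamespace


def save_body(response, path):
    """Write the raw response body to `path` without decoding it."""
    # "wb" keeps the bytes untouched; "w" would expect text instead.
    with open(path, "wb") as f:
        f.write(response.content)


# Stand-in for a real response: any object with a bytes `.content` attribute.
# These three bytes are a tiny protobuf-style payload, not text.
fake_response = SimpleNamespace(content=b"\x08\x96\x01")

out_path = os.path.join(tempfile.mkdtemp(), "output2.bin")
save_body(fake_response, out_path)

with open(out_path, "rb") as f:
    saved = f.read()
```

With the real library, the same helper would be called as save_body(requests.get(url), "output2.txt"), and the saved bytes could then be handed to the protocol buffer parser unchanged.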
response.content returns the body of the response as-is, in bytes. For binary content, this is exactly what you want. For text content that contains non-ASCII characters, it means the content must have been encoded by the server into a bytestring using a particular encoding, which is indicated by either an HTTP header or a <meta charset="..." /> tag. In order to make sense of the bytes, they therefore need to be decoded after receiving, using that charset.
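To illustrate why the charset matters, here is a small Python 3 sketch that needs no network access. The sample string is an assumption chosen only because it contains one non-ASCII character.

```python
# The same text produces different bytes under different encodings.
text = "caf\u00e9"  # "café"

utf8_bytes = text.encode("utf-8")      # é becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # é becomes one byte: 0xE9

# Decoding with the charset the sender used recovers the text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong charset succeeds but yields garbage (mojibake).
wrong_guess = utf8_bytes.decode("latin-1")
```

Here round_trip equals the original string, while wrong_guess contains two unrelated characters in place of the é, which is why the receiver has to know which encoding the server used.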
response.text is a convenience method that does exactly this for you. It assumes the response body is text, looks at the response headers to find the encoding, and decodes the body for you, returning unicode.
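The idea of finding the encoding in the headers can be sketched with the standard library alone. This is an illustration of the concept in Python 3, not how requests implements it internally; the helper names and the utf-8 fallback are assumptions.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body, content_type, fallback="utf-8"):
    """Decode `body` using the declared charset (hypothetical helper)."""
    charset = charset_from_content_type(content_type) or fallback
    return body.decode(charset)


html_charset = charset_from_content_type("text/html; charset=ISO-8859-1")
binary_charset = charset_from_content_type("application/octet-stream")
decoded = decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1")
```

For the HTML header a charset is found and the body decodes cleanly; for a binary content type there is no charset to find, which matches the situation described in the question.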
But if your response doesn't contain text, this is the wrong method to use. Binary content doesn't contain characters, because it's not text, so the whole concept of character encoding does not make sense for binary content - it's only applicable to text composed of characters. (That's why you're seeing response.encoding == None - it's just bytes, and there is no character encoding involved.)
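A short Python 3 sketch of what goes wrong when binary data is pushed through a text decode. The payload is a made-up three-byte example, and the lenient errors="replace" decode is an assumption standing in for a forgiving text conversion.

```python
# Three bytes that are valid binary data but not valid UTF-8 text.
payload = b"\x08\x96\x01"


def is_valid_utf8(data):
    """Return True if `data` decodes strictly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


strict_ok = is_valid_utf8(payload)

# A lenient decode does not fail, but it swaps the undecodable byte
# for U+FFFD, so the original bytes can no longer be recovered.
as_text = payload.decode("utf-8", errors="replace")
back_to_bytes = as_text.encode("utf-8")
corrupted = back_to_bytes != payload
```

Once the payload has been through the lenient decode, back_to_bytes differs from payload, so a protocol buffer parser given the result would be reading corrupted data.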
See Response Content and Binary Response Content in the requests documentation for more details.