Scraping UTF-8 web pages with some wrong (incomplete) characters
I'm trying to scrape the pages of an Italian blog (all pages are formatted in the same way and have a meta charset="utf-8" tag). The result is OK for some pages and KO for others.
Looking at the page source in a browser, I noticed that there are two meta tags containing the first n characters of the post (the tags meta name="description" content="..." and meta property="og:description" content="...", showing respectively the first 200 and 600 characters of the post). If at least one of these two strings ends by truncating the last bytes of its final character (which happens when that character is a smart quote, an accented vowel, or any other character encoded with 2, 3, or 4 bytes in UTF-8), the page is not processed correctly.
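As a minimal illustration of the problem (my own sketch, not taken from the blog pages), a string whose final multi-byte character is cut in half stops being valid UTF-8, and nchar then fails on it in a UTF-8 locale:

# "perché" ends with 'é', which is the two-byte sequence C3 A9 in UTF-8
good <- "perch\u00e9"
bad  <- rawToChar(charToRaw(good)[1:6])   # keep only the first byte (C3) of 'é'

validUTF8(c(good, bad))      # TRUE FALSE : the truncated string is no longer valid UTF-8
nchar(bad, allowNA = TRUE)   # NA in a UTF-8 locale: the same symptom as the failing pages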
Here's the code:
library(httr)

url_ok <- "http://odifreddi.blogautore.repubblica.it/2015/07/06/e-ora-un-referendum-in-europa/"
html0 <- GET(url_ok)
content0 <- content(html0, as = "text")
nchar(content0)   # -> 85767 : that's okay

url_ko <- "http://odifreddi.blogautore.repubblica.it/2013/12/06/lacrime-di-coccodrillo-per-mandela/"
html1 <- GET(url_ko)
content1 <- content(html1, as = "text")
nchar(content1)   # -> 2 : content1 is NA!

I also tried RCurl::httpGET, but particular characters (such as accented vowels) are lost:
library(RCurl)
library(XML)

html1 <- httpGET(url_ko)
foo  <- xmlInternalTreeParse(html1, asText = TRUE, isHTML = TRUE, encoding = "UTF-8")
foo1 <- getNodeSet(foo, "//div[@class='article-maincolblog']")   # retrieve the text of the post
foo1

I'd like to know if there is a way to obtain the correct content, discarding the incorrect (incomplete) UTF-8 characters. Thank you!
After a lot of trials (with a lot of different functions), I found a workaround using tau::readChars:
resu <- download.file(url_ko, "temp.htm")   # save the web page to disk
stopifnot(resu == 0)
html1 <- tau::readChars(file("temp.htm"), encoding = "UTF-8")

This works.
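A possible alternative (my own sketch, not part of the original workaround) is to let httr return the body as raw bytes and then use iconv with sub = "" to drop any byte sequences that are not valid UTF-8:

library(httr)

resp <- GET(url_ko)

# Take the body as raw bytes, so no encoding validation happens at this point
raw_body <- content(resp, as = "raw")

# Interpret the bytes as UTF-8 and remove every invalid (truncated) sequence;
# sub = "" replaces non-convertible bytes with nothing instead of returning NA
txt <- iconv(rawToChar(raw_body), from = "UTF-8", to = "UTF-8", sub = "")

nchar(txt)   # returns the real length of the page instead of NA

The cleaned text can then be parsed as before, e.g. with xmlInternalTreeParse(txt, asText = TRUE, isHTML = TRUE, encoding = "UTF-8").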