xml - Scraping UTF-8 web pages with some wrong (incomplete) characters -


I am trying to scrape the web pages of an Italian blog (all pages are formatted in the same way and have a 'meta charset="utf-8"' tag). The result is OK for some pages, but KO for others.

Looking at the page source in the browser, I noticed there are two meta tags holding the first N characters of the post (the tags meta name="description" content="..." and meta property="og:description" content="...", showing respectively the first 200 and 600 characters of the post). If at least one of these two strings ends by truncating the last character after N bytes (this happens when the last character is a smart quote, an accented vowel, or any other character that uses 2, 3, or 4 bytes in UTF-8), the page is not processed correctly.
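To illustrate the failure mode (a small sketch of my own, not taken from the page): an accented vowel such as "è" takes two bytes in UTF-8, so cutting the text right after the first of those bytes leaves an invalid byte sequence:

    # "è" is the two-byte UTF-8 sequence 0xC3 0xA8
    ok  <- rawToChar(as.raw(c(0xC3, 0xA8)))   # complete character
    bad <- rawToChar(as.raw(0xC3))            # truncated after the first byte
    Encoding(ok) <- Encoding(bad) <- "UTF-8"
    validUTF8(c(ok, bad))                     # TRUE FALSE (validUTF8() needs R >= 3.3.0)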

Here's the code:

    library(httr)

    url_ok <- "http://odifreddi.blogautore.repubblica.it/2015/07/06/e-ora-un-referendum-in-europa/"
    html0 <- GET(url_ok)
    content0 <- content(html0, as = "text")
    nchar(content0)   # -> 85767 : that's okay

    url_ko <- "http://odifreddi.blogautore.repubblica.it/2013/12/06/lacrime-di-coccodrillo-per-mandela/"
    html1 <- GET(url_ko)
    content1 <- content(html1, as = "text")
    nchar(content1)   # -> 2 : content1 is NA!
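One way to see what happens on the broken page (a sketch, assuming content(..., as = "text") returns NA because the body cannot be decoded as the declared UTF-8) is to pull the raw bytes, which are always retrievable, and check them:

    raw1 <- content(html1, as = "raw")   # raw bytes instead of decoded text
    txt1 <- rawToChar(raw1)
    Encoding(txt1) <- "UTF-8"
    validUTF8(txt1)                      # FALSE on the broken page (R >= 3.3.0)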

I also tried RCurl::httpGET, but particular characters (such as accented vowels) are missed:

    library(XML)

    html1 <- RCurl::httpGET(url_ko)
    foo <- xmlInternalTreeParse(html1, asText = TRUE, isHTML = TRUE, encoding = "UTF-8")
    foo1 <- getNodeSet(foo, "//div[@class='article-maincolblog']")   # retrieve the text of the post
    foo1

I'd like to know if there is a way to obtain the correct content, having discarded the incorrect (incomplete) UTF-8 characters. Thank you!
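A possible way to do that (a sketch of an alternative, using only httr and base R's iconv()) would be to take the raw bytes and drop whatever is not valid UTF-8 via sub = "":

    raw1  <- content(html1, as = "raw")
    txt1  <- rawToChar(raw1)
    Encoding(txt1) <- "UTF-8"
    clean <- iconv(txt1, from = "UTF-8", to = "UTF-8", sub = "")   # non-convertible bytes are dropped
    nchar(clean)   # a real character count now, no longer NA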

After a lot of trials (with a lot of different functions), I found a workaround using tau::readChars:

    library(tau)

    resu <- download.file(url_ko, "temp.htm")   # save the web page to disk
    stopifnot(resu == 0)
    html1 <- readChars(file("temp.htm"), encoding = "UTF-8")

This works.
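For completeness, the string returned by tau::readChars can then be parsed exactly as before (same XPath as above; a sketch, assuming the div class has not changed):

    library(XML)   # as above

    foo  <- xmlInternalTreeParse(html1, asText = TRUE, isHTML = TRUE, encoding = "UTF-8")
    foo1 <- getNodeSet(foo, "//div[@class='article-maincolblog']")   # text of the post
    foo1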

