C#: Best way to iterate a tree structure of links


I'm trying to gather a list of website links, starting from a root directory that can branch down into many sub-directory links. The graphic linked below is a simplified illustration of the structure. I'm only concerned with getting the links in green; the yellow links always lead to other links. So my output array would contain A, B, D, F, G, H, I. I'm trying to code this in C#.

[Figure: simplified diagram of the link tree. Green links are the ones to collect; yellow links lead to other links.]

In generic terms, you can do something like this:

    private static IEnumerable<T> Leaves<T>(T root, Func<T, IEnumerable<T>> childSource)
    {
        var children = childSource(root).ToList();
        if (!children.Any()) {
            yield return root;
            yield break;
        }
        foreach (var descendant in children.SelectMany(child => Leaves(child, childSource)))
        {
            yield return descendant;
        }
    }

Here, childSource is assumed to be a function that can take an element and return that element's children. In your case, you'll want to make a function that uses something like HtmlAgilityPack to take a given URL, download it, and return the links from that.

    // Needs: using System; using System.Collections.Generic; using System.IO;
    //        using System.Linq; using System.Net; plus the HtmlAgilityPack package.
    private static string Get(int msBetweenRequests, string url)
    {
        try
        {
            var webRequest = WebRequest.CreateHttp(url);
            using (var webResponse = webRequest.GetResponse())
            using (var responseStream = webResponse.GetResponseStream())
            using (var responseStreamReader = new StreamReader(responseStream, System.Text.Encoding.UTF8))
            {
                var result = responseStreamReader.ReadToEnd();
                return result;
            }
        }
        catch
        {
            return null; // nothing sensible to do here
        }
        finally
        {
            // let's be nice to the server we're crawling
            System.Threading.Thread.Sleep(msBetweenRequests);
        }
    }

    private static IEnumerable<string> ScrapeForLinks(string url)
    {
        var noResults = Enumerable.Empty<string>();

        var html = Get(1000, url);
        if (string.IsNullOrWhiteSpace(html)) return noResults;

        var d = new HtmlAgilityPack.HtmlDocument();
        d.LoadHtml(html);
        var links = d.DocumentNode.SelectNodes("//a[@href]");
        return links == null ? noResults :
            links.Select(
                link =>
                    link
                    .Attributes
                    .Where(a => a.Name.ToLower() == "href")
                    .Select(a => a.Value)
                    .First()
             )
             // turn relative hrefs into absolute URLs based on the page they came from
             .Select(linkUrl => FixRelativePaths(url, linkUrl));
    }

    private static string FixRelativePaths(string baseUrl, string relativeUrl)
    {
        var combined = new Uri(new Uri(baseUrl), relativeUrl);
        return combined.ToString();
    }

Note that, in this naive approach, you'll run into an infinite loop if there are any cycles in the links between these pages. To alleviate this, you'll want to avoid expanding the children of any URL you've visited before.

    private static Func<string, IEnumerable<string>> DontVisitMoreThanOnce(Func<string, IEnumerable<string>> naiveChildSource)
    {
        var alreadyVisited = new HashSet<string>();
        return s =>
        {
            var children = naiveChildSource(s).Select(RemoveTrailingSlash).ToList();
            var filteredChildren = children.Where(c => !alreadyVisited.Contains(c)).ToList();
            alreadyVisited.UnionWith(children);
            return filteredChildren;
        };
    }

    private static string RemoveTrailingSlash(string url)
    {
        return url.TrimEnd(new[] {'/'});
    }
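To see the cycle handling without touching the network, here is a minimal sketch that runs the same wrapper over a small in-memory graph. The `Links` dictionary, the page names, and `NaiveChildren` are made up for illustration; they stand in for real pages and for `ScrapeForLinks`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CycleDemo
{
    // Hypothetical site with a cycle: a -> b -> a (the back link is written "a/").
    private static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        { "root", new[] { "a" } },
        { "a",    new[] { "b", "c" } },
        { "b",    new[] { "a/" } },
    };

    // Stand-in for ScrapeForLinks: an in-memory lookup instead of a download.
    public static IEnumerable<string> NaiveChildren(string page)
    {
        string[] children;
        return Links.TryGetValue(page, out children) ? children : Enumerable.Empty<string>();
    }

    // Same wrapper as in the answer: children seen before are not expanded again.
    public static Func<string, IEnumerable<string>> DontVisitMoreThanOnce(
        Func<string, IEnumerable<string>> naiveChildSource)
    {
        var alreadyVisited = new HashSet<string>();
        return s =>
        {
            var children = naiveChildSource(s).Select(u => u.TrimEnd('/')).ToList();
            var filteredChildren = children.Where(c => !alreadyVisited.Contains(c)).ToList();
            alreadyVisited.UnionWith(children);
            return filteredChildren;
        };
    }

    // Same shape as the Leaves method in the answer.
    public static IEnumerable<T> Leaves<T>(T root, Func<T, IEnumerable<T>> childSource)
    {
        var children = childSource(root).ToList();
        if (!children.Any())
        {
            yield return root;
            yield break;
        }
        foreach (var descendant in children.SelectMany(child => Leaves(child, childSource)))
        {
            yield return descendant;
        }
    }

    public static void Main()
    {
        // Without the wrapper this call would recurse forever (a -> b -> a -> ...).
        var leaves = Leaves("root", DontVisitMoreThanOnce(NaiveChildren)).ToList();
        Console.WriteLine(string.Join(",", leaves)); // prints b,c
    }
}
```

Notice that `b` comes out as a leaf here, because its only link points back to a page that was already seen. Whether you want such pages in the result depends on your site.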

In case you'd like to prevent your crawler from escaping onto the internet and spending time on YouTube, you'll want

    private static Func<string, IEnumerable<string>> DontLeaveTheDomain(
        string domain,
        Func<string, IEnumerable<string>> wanderer)
    {
        return u => wanderer(u).Where(l => l.StartsWith(domain));
    }
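One thing to keep in mind is that `StartsWith` is a plain prefix check, so a domain string like `http://example.com` would also let `http://example.com.evil.org/...` through. If that matters to you, a possible variation is to compare the parsed host instead. The sketch below is only one way to do it; `DontLeaveTheHost`, the host name, and the sample links are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HostFilterDemo
{
    // Variation on DontLeaveTheDomain: keep only http/https links whose host
    // equals the given host (case-insensitive). Unparseable links are dropped.
    public static Func<string, IEnumerable<string>> DontLeaveTheHost(
        string host,
        Func<string, IEnumerable<string>> wanderer)
    {
        return u => wanderer(u).Where(l =>
        {
            Uri parsed;
            return Uri.TryCreate(l, UriKind.Absolute, out parsed)
                && (parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps)
                && string.Equals(parsed.Host, host, StringComparison.OrdinalIgnoreCase);
        });
    }

    // Stand-in child source returning a fixed, made-up set of links.
    public static IEnumerable<string> FakeLinks(string url)
    {
        return new[]
        {
            "http://example.com/about",
            "https://EXAMPLE.com/blog",
            "http://example.com.evil.org/x",
            "https://www.youtube.com/watch",
        };
    }

    public static void Main()
    {
        var stayAtHome = DontLeaveTheHost("example.com", FakeLinks);
        foreach (var link in stayAtHome("http://example.com/"))
        {
            Console.WriteLine(link); // the two example.com links only
        }
    }
}
```

This is stricter than the prefix check: sub-domains such as `www.example.com` count as a different host, so relax the comparison if you want to follow those too.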

Once you've defined these things, what you want is just

    var results = Leaves(
        myUrl,
        DontLeaveTheDomain(
            myDomain,
            DontVisitMoreThanOnce(ScrapeForLinks)))
        .Distinct()
        .ToList();
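If you'd like to check the traversal itself before pointing it at a real site, you can feed `Leaves` a child source backed by a dictionary. This is a minimal sketch; the tree shape below is an assumption chosen to match the A, B, D, F, G, H, I output from the question, not the actual structure in the graphic, and `ChildrenOf` is a made-up stand-in for `ScrapeForLinks`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LeavesDemo
{
    // Hypothetical link tree: "c" and "e" play the role of the yellow links
    // that only lead to other links; everything else is a green leaf.
    private static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        { "root", new[] { "a", "b", "c" } },
        { "c",    new[] { "d", "e" } },
        { "e",    new[] { "f", "g", "h", "i" } },
    };

    // Stand-in for ScrapeForLinks: an in-memory lookup instead of a download.
    public static IEnumerable<string> ChildrenOf(string node)
    {
        string[] children;
        return Links.TryGetValue(node, out children) ? children : Enumerable.Empty<string>();
    }

    // Same shape as the Leaves method in the answer.
    public static IEnumerable<T> Leaves<T>(T root, Func<T, IEnumerable<T>> childSource)
    {
        var children = childSource(root).ToList();
        if (!children.Any())
        {
            yield return root;
            yield break;
        }
        foreach (var descendant in children.SelectMany(child => Leaves(child, childSource)))
        {
            yield return descendant;
        }
    }

    public static void Main()
    {
        var leaves = Leaves("root", ChildrenOf).ToList();
        Console.WriteLine(string.Join(",", leaves)); // prints a,b,d,f,g,h,i
    }
}
```

Because `Leaves` only depends on the `childSource` function, swapping the dictionary lookup for the real scraper needs no change to the traversal.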
