php - Strip all elements inside a specific element when descendant elements have same name as ancestor
I'm using PHP to strip out the tags inside a specific tag and keep only the plain text. The issue I'm stuck on is that there are child tags that have the same name as their parent tags:

<corpo>
  <num>1.</num>
  <mod id="mod167"> string 1
    <commas id="mod167-vir1" type="word">string 2</commas>
    <com id="mod166-vir1-20090024-art13-com16.1"><num>&lt;&lt;16.</num></com>
    <rif xlink:href="urn" xlink:type="simple">string 3</rif><h:p>something here</h:p>
    <corpo>string 4</corpo>
  </mod>
</corpo>

Here, for example, corpo has a child tag with the same name (<corpo>string 4</corpo>), and the num tag is used two times (<num>1.</num> and <num>&lt;&lt;16.</num>) inside the parent tag corpo.
Starting from the highest corpo tag, I want to strip out every child tag and keep the plain text. The result should be:

<corpo> string 1 string 2 &lt;&lt;16. string 3 something here string 4 </corpo>

Up to now I have tried SimpleXML and PHP's strip_tags, adding the tags I want to keep, but of course that does not give the result I expect.
$result = strip_tags($xml, "<corpo></corpo>");
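To illustrate why this call falls short, here is a minimal sketch on a simplified fragment (the namespaced elements are left out, and the variable name $fragment is only illustrative). strip_tags matches tag names and has no notion of nesting, so a whitelisted corpo is kept at every depth:

```php
<?php
// Simplified fragment modelled on the sample above (assumption: no namespaced elements).
$fragment = '<corpo><num>1.</num> <mod id="mod167">string 1 <corpo>string 4</corpo></mod></corpo>';

// Whitelisting <corpo> keeps every corpo tag, nested ones included.
$result = strip_tags($fragment, '<corpo>');

echo $result, "\n";
// <corpo>1. string 1 <corpo>string 4</corpo></corpo>
```

The nested corpo element survives next to the outer one, which is exactly the part that needs a structure-aware approach.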
This is pretty much related to what @thw wrote, but more focused on SimpleXML. It shows a different angle on using XPath to select the corpo element(s).
Given a document with the same or more ancestors as in your question, held in the string $buffer, here is the example code:
$xml = simplexml_load_string($buffer);
foreach ($xml->xpath('//corpo[not(ancestor::corpo)]') as $corpo) {
    $corpo[0] = dom_import_simplexml($corpo)->textContent;
}
$xml->asXML('php://output');

An exemplary output of this is:
<a xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:h="ns:h">
  <b>
    <corpo> 1. string 1 string 2 &lt;&lt;16. string 3 something here string 4 </corpo>
  </b>
</a>

It works as follows:
Get each corpo element that has no ancestor of the same name. This is done with XPath:

//corpo[not(ancestor::corpo)]

Then you have the SimpleXMLElement and want its text content, which is accessible through the DOMElement node associated with $corpo:

dom_import_simplexml($corpo)->textContent;

The remaining expression

$corpo[0] = ...

just tells SimpleXML to update the content of that SimpleXMLElement (a so-called self-reference).
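The steps above can be checked in isolation. The following is a small sketch on a hypothetical miniature document (not the one from the question): it compares what a plain //corpo selects with what the ancestor predicate selects, then applies the self-reference assignment.

```php
<?php
// Hypothetical miniature document; element names follow the question.
$xml = simplexml_load_string(
    '<a><corpo><num>1.</num> <corpo>string 4</corpo></corpo></a>'
);

$all       = $xml->xpath('//corpo');                       // outer and nested corpo
$outermost = $xml->xpath('//corpo[not(ancestor::corpo)]'); // outer corpo only

printf("%d corpo element(s), %d outermost\n", count($all), count($outermost));

foreach ($outermost as $corpo) {
    // Self-reference: assigning to [0] replaces the element's children with plain text.
    $corpo[0] = dom_import_simplexml($corpo)->textContent;
}

echo $xml->corpo->asXML(), "\n";
```

Because the nested corpo is excluded by the predicate, it is never visited on its own; its text is simply carried along in the text content of the outer element.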
By the way, I could have used strip_tags($corpo->asXML()) here instead of dom_import_simplexml($corpo)->textContent, but I won't suggest it because I don't know how stable strip_tags is. It is at least not conformant with the XML standard.
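One observable difference between the two, sketched here on a made-up fragment (this is an illustration, not necessarily the instability the remark above has in mind): strip_tags works on the serialized markup and leaves entity references encoded, while textContent returns the decoded text.

```php
<?php
// Hypothetical fragment containing escaped characters.
$doc   = simplexml_load_string('<a><corpo><num>&lt;&lt;16.</num> <rif>R&amp;D</rif></corpo></a>');
$corpo = $doc->corpo;

// String-based: operates on the serialized XML, so entities stay encoded.
$viaStripTags = strip_tags($corpo->asXML());

// Tree-based: the parser has already decoded the entities.
$viaDom = dom_import_simplexml($corpo)->textContent;

echo $viaStripTags, "\n"; // &lt;&lt;16. R&amp;D
echo $viaDom, "\n";       // <<16. R&D
```

With the strip_tags route, the still-encoded string would then be escaped a second time when it is assigned back into the element.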
Now you might want to apply whitespace normalization as well. preg_replace comes in handy here with the UTF-8 flag (u), because UTF-8 is the string encoding used by SimpleXMLElement and DOMElement:
foreach ($xml->xpath('//corpo[not(ancestor::corpo)]') as $corpo) {
    $text = dom_import_simplexml($corpo)->textContent;
    $corpo[0] = preg_replace('~\s+~u', ' ', $text);
}

This variant gives you:
<?xml version="1.0"?>
<a xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:h="ns:h">
  <b>
    <corpo> 1. string 1 string 2 &lt;&lt;16. string 3 something here string 4 </corpo>
  </b>
</a>

The full example at a glance (demo):
<?php
$buffer = <<<XML
<a xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:h="ns:h">
  <b>
    <corpo>
      <num>1.</num>
      <mod id="mod167"> string 1
        <commas id="mod167-vir1" type="word">string 2</commas>
        <com id="mod166-vir1-20090024-art13-com16.1">
          <num>&lt;&lt;16.</num>
        </com>
        <rif xlink:href="urn" xlink:type="simple">string 3</rif>
        <h:p>something here</h:p>
        <corpo>string 4</corpo>
      </mod>
    </corpo>
  </b>
</a>
XML;

$xml = simplexml_load_string($buffer);

// Visit only the outermost corpo elements.
foreach ($xml->xpath('//corpo[not(ancestor::corpo)]') as $corpo) {
    $text = dom_import_simplexml($corpo)->textContent;
    // Collapse whitespace runs, then replace the element's children with the text.
    $corpo[0] = preg_replace('~\s+~u', ' ', $text);
}

$xml->asXML('php://output');
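For comparison, the same idea can probably be expressed with the DOM extension alone. The sketch below is an assumption-laden variant, not part of the answer above: the function name flattenOutermost is hypothetical, and it assumes the target element is not in a default namespace (otherwise the XPath name test would not match).

```php
<?php
/**
 * Hypothetical helper: reduces every outermost element named $name
 * to a single whitespace-normalized text node.
 */
function flattenOutermost(string $xmlString, string $name): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xmlString);
    $xpath = new DOMXPath($doc);

    // Same predicate as in the SimpleXML version: skip elements nested in a same-named ancestor.
    foreach ($xpath->query("//{$name}[not(ancestor::{$name})]") as $element) {
        $text = preg_replace('~\s+~u', ' ', $element->textContent);

        // Remove all existing children, then add the text back as one text node.
        while ($element->firstChild !== null) {
            $element->removeChild($element->firstChild);
        }
        $element->appendChild($doc->createTextNode($text));
    }

    // Serialize the root element only (no XML declaration).
    return $doc->saveXML($doc->documentElement);
}

echo flattenOutermost(
    '<a><corpo><num>1.</num>  <mod>string 1 <corpo>string 4</corpo></mod></corpo></a>',
    'corpo'
), "\n";
```

Using createTextNode leaves the escaping of characters such as < and & to the serializer, so the text does not need to be encoded by hand.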