Reading XML files
John Rickman (71) 645 posts |
Is there an app that will display or extract text from an XML file? I have a docx file which EasiWriter fails to open. This opens in StrongED and I can pick out and copy/paste the text, but there must be a better way? |
Rick Murray (539) 13747 posts |
When I’ve needed to do stuff like that with HTML, I open the file in Edit, then search/replace looking for the tags. StrongED will be able to do this as well, but I don’t know the search syntax (god help you if it’s regex ;) ). |
David J. Ruck (33) 1585 posts |
Python has various xml modules, I’ve not tried them under RISC OS though. But as Rick says if you are after text you should just be able to search and replace the tags you aren’t interested in, in an editor. Zap will do regex. |
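The search-and-replace approach described above can also be sketched in a few lines of Python. This is only a rough illustration, not a proper XML parser: the pattern and the `strip_tags` name are made up for the example, and it assumes no `>` characters appear inside attribute values, comments or CDATA sections. Each tag is replaced with a space, so a word that Word has split across two runs will come out with a gap in it.

```python
import html
import re

# Anything between "<" and the next ">" is treated as a tag (an assumption
# that holds for simple files, but not for every XML document).
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(markup: str) -> str:
    """Remove anything that looks like a tag and tidy up the whitespace."""
    text = TAG_RE.sub(" ", markup)   # replace each tag with a space
    text = html.unescape(text)       # turn &amp; and friends back into characters
    return " ".join(text.split())    # collapse runs of whitespace

sample = "<w:p><w:r><w:t>Hello &amp; welcome</w:t></w:r></w:p>"
print(strip_tags(sample))
```

For anything more structured than a quick text grab, a real XML parser is the safer choice.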
Steve Pampling (1551) 8125 posts |
I don’t see a problem with regex, it’s actually relatively simple stuff once you get into the swing of things. |
Jean-Michel BRUCK (3009) 333 posts |
I tested TechWriter with a docx demo file and it opened it, but this may not always be the case! If this is helpful to you, I have a utility that displays XML files in tree form. Note 1: you must drag the file to the XmlEdit icon on the iconbar to use it. Note 2: it is better to work on a copy of your file… I noticed that the elements containing text are of the form: =>w:t |
John Rickman (71) 645 posts |
I had a look at using StrongED. It doesn’t use regex AFAIK. It uses what it calls Advanced Search Syntax, something invented by Guttorm Vik I imagine. Using the editor was getting a bit messy so I had a look for a Python solution and came across a module on GitHub to convert html2text. The native XML support in Python I’ll explore another time. |
John Rickman (71) 645 posts |
Thanks – I have downloaded it and will try it out. |
Charles Ferguson (8243) 427 posts |
At the command line, which is often more useful for managing most of these types of files, you can use xmllint and xsltproc to parse the contents out of an XML file. For example, if you wanted to just extract the text data from an XML file you could use: xmllint --xpath "//text()" <filename> Obviously there’s a lot more you can do, but for simple data extraction from XML files, XMLLint and XSLTProc have always been the tools I’ve gone to. Of course you have to learn XPath, but once you have, there’s a lot of things that become much easier. |
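For comparison, roughly the same "give me every text node" idea as the `//text()` XPath can be sketched with Python's standard ElementTree module. The `all_text` helper and the sample document are invented for the illustration, and the output formatting will not necessarily match what xmllint prints.

```python
import xml.etree.ElementTree as ET

def all_text(xml_source: str) -> str:
    """Concatenate every text node in the document, in document order."""
    root = ET.fromstring(xml_source)
    # itertext() walks the tree and yields each element's text and tail text
    return "".join(root.itertext())

sample = "<doc><title>Report</title><body>First <b>bold</b> line</body></doc>"
print(all_text(sample))
```

Note that nothing is inserted between elements, so text from adjacent elements runs together unless the file itself contains whitespace between them.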
Matthew Phillips (473) 709 posts |
ElementTree works fine under RISC OS and is included in the standard Python 3 package on PackMan. If you find yourself needing to do this often, especially if you want to retain any styling (e.g. by converting to RTF or HTML), a Python script might be the way to go. |
Jean-Michel BRUCK (3009) 333 posts |
XML tools under RiscOS: |
David Gee (1833) 268 posts |
I’d just better warn you that the XML structure of .docx files is very complex. Far more complex than (say) .odt files. If getting at the text is important, and you have access to a word processor on another platform that can open .docx files (e.g. LibreOffice) it would be simpler just to use that — if it isn’t installed, you could easily install it on Linux on a Pi. It will be a lot quicker than processing it on RISC OS. |