Friday, October 19, 2018

2018-10-19: Some tricks to parse XML files

Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML would be a straightforward job, given that Python has long had mature packages such as BeautifulSoup and lxml. But I still ran into some problems, and it took me quite a bit of time to figure out how to handle all of them. Here I share some tricks I learned. This is not meant to be a complete list, but the solutions are general, so they can serve as starting points for future XML parsing jobs.


  • CDATA. CDATA is seen in the values of many XML fields. CDATA means Character Data. Strings inside a CDATA section are not parsed. In other words, they are kept as they are, including markup. One example is
    <script>
    <![CDATA[ <message> Welcome to TutorialsPoint </message> ]]>
    </script>
  • Encoding. Encoding is a pain in text processing. The problem is that there is no way to know the encoding of a text file before opening and reading it (at least in Python). So we must sniff it by trying to open and read the file with a candidate encoding. If the encoding is wrong, the program will usually raise a UnicodeDecodeError, in which case we try the next candidate. The "file" command in Linux reports encoding information, which is how I know there are 2 encodings in the ACM DL XML files: ASCII and ISO-8859.
  • HTML entities, such as &auml;. The only 5 built-in entities in XML are quot, amp, apos, lt, and gt. Any other entity must be defined in a DTD file that says what it means. For example, the DBLP.xml file comes with a DTD file. The ACM DL XML should have associated DTD files, proceedings.dtd and periodicals.dtd, but they are not in my dataset.
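To make the CDATA and entity behavior described in the list concrete, here is a minimal sketch using the standard-library xml.etree.ElementTree parser (not the parser used in this post). The sample strings are made up for illustration and are not taken from the ACM DL data.

```python
import xml.etree.ElementTree as ET

# CDATA: the markup inside the section comes back as plain text, not as a child element.
doc = "<script><![CDATA[<message> Welcome </message>]]></script>"
root = ET.fromstring(doc)
cdata_text = root.text      # the literal string, angle brackets included
child_count = len(root)     # no <message> element was created

# The five built-in entities are decoded by any conforming XML parser.
builtin_text = ET.fromstring(
    "<t>&lt;a&gt; &amp; &quot;b&quot; &apos;c&apos;</t>"
).text

# An HTML entity such as &auml; is undefined without a DTD, so a strict parser rejects it.
try:
    ET.fromstring("<name>J&auml;rvi</name>")
    undefined_entity_ok = True
except ET.ParseError:
    undefined_entity_ok = False
```

With a strict XML parser, the last case is exactly the failure you hit when the DTD files are missing.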
The following snippet of Python code solves all three problems above and gives me the correct parsing results.

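import codecs
import logging

from bs4 import BeautifulSoup

# confc is my configuration dictionary; confc['xmlfile'] is the path to the XML file.
# Try the strictest encoding first: ISO-8859-1 can decode any byte sequence,
# so it never raises an error and must come last.
encodings = ['ascii', 'ISO-8859-1']
for e in encodings:
    try:
        with codecs.open(confc['xmlfile'], 'r', encoding=e) as fh:
            fh.read()  # decoding errors surface on read, not on open
    except UnicodeDecodeError:
        logging.debug('got unicode error with %s, trying a different encoding', e)
    else:
        logging.debug('opening the file with encoding: %s', e)
        break

# Open the file again with the encoding that worked and parse the whole content.
with codecs.open(confc['xmlfile'], 'r', encoding=e) as f:
    soup = BeautifulSoup(f.read(), 'html.parser')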


Note that we use codecs.open() instead of the Python built-in open(). We open the file twice: the first time only to check the encoding, and the second time to pass the whole file to a handle before it is parsed by BeautifulSoup. I found BeautifulSoup better suited to this job than using lxml directly, not just because it is easier to use but also because it lets you pick the parser. Note that I chose html.parser instead of the lxml parser. This is because the lxml parser was not able to parse all entries (for some unknown reason), a problem also reported by other users on Stack Overflow.
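The encoding check can also be wrapped in a small reusable function. The sketch below works on raw bytes instead of a file handle; the name sniff_encoding and its default candidate list are my own choices for illustration, and the list assumes only the two encodings found in the ACM DL files.

```python
def sniff_encoding(raw, candidates=("ascii", "ISO-8859-1")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
        return enc
    raise ValueError("none of the candidate encodings matched")


# A made-up sample: the byte 0xE4 is not valid ASCII but is valid ISO-8859-1.
sample = "J\u00e4rvi".encode("ISO-8859-1")
detected = sniff_encoding(sample)
text = sample.decode(detected)
```

In practice you would read the file in binary mode, for example open(path, 'rb').read(), and pass the bytes in. Because ISO-8859-1 accepts every byte, any further candidates would need to go before it in the list.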

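If you would rather keep a strict XML parser, one possible workaround for the missing DTD (not what I did above, just a sketch) is to rewrite named HTML entities as numeric character references before parsing. The helper name html_entities_to_numeric is hypothetical; the entity table comes from the standard-library html.entities module.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already understands.
XML_BUILTINS = {"quot", "amp", "apos", "lt", "gt"}


def html_entities_to_numeric(text):
    """Rewrite named HTML entities (e.g. &auml;) as numeric references (&#228;)."""
    def repl(match):
        name = match.group(1)
        if name in XML_BUILTINS or name not in name2codepoint:
            return match.group(0)  # leave built-ins and unknown names untouched
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


# A made-up record, not taken from the ACM DL data.
fixed = html_entities_to_numeric("<name>J&auml;rvi &amp; Co</name>")
name = ET.fromstring(fixed).text
```

One caveat: this regex also rewrites entity-like strings inside CDATA sections, where they are supposed to stay literal, so it only suits data where that does not matter.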
Jian Wu
