2018-10-19: Some tricks to parse XML files
Recently I was parsing the ACM DL metadata, which is stored in XML files. I assumed that parsing XML would be straightforward, given that Python has long had mature packages such as BeautifulSoup and lxml. I still ran into several problems, and it took me quite a bit of time to figure out how to handle all of them. Here I share some of the tricks I learned. This is not meant to be a complete list, but the solutions are general enough to serve as starting points for future XML parsing jobs.

CDATA. CDATA appears in the values of many XML fields. CDATA stands for character data. Text inside a CDATA section is not parsed as markup. In other words, it is kept exactly as written, including any markup it contains. One example is

    <script>
    <![CDATA[ <message> Welcome to TutorialsPoint </message> ]]>
    </script>

Encoding. Encoding is a pain in text processing. The problem is that there is no way to know what encoding the text is in before opening it and re