Showing posts from October, 2018

2018-10-19: Some tricks to parse XML files

Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML would be a straightforward job, given that Python has long had sophisticated packages such as BeautifulSoup and lxml. But I still encountered some problems, and it took me quite a bit of time to figure out how to handle all of them. Here, I share some tricks I learned. They are not meant to be a complete list, but the solutions are general, so they can be used as starting points for future XML parsing jobs.

CDATA. CDATA is seen in the values of many XML fields. CDATA means Character Data. Strings inside a CDATA section are not parsed as markup. In other words, they are kept as they are, including markup. One example is <script> <![CDATA[ <message> Welcome to TutorialsPoint </message> ]]> </script>

Encoding. Encoding is a pain in text processing. The problem is that there is no way to know what encoding the text uses before opening it and re
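To illustrate the CDATA behavior described above, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree` rather than BeautifulSoup or lxml, chosen only so the example has no dependencies. The `SAMPLE` string and the `element_text` helper are hypothetical names introduced for illustration; they are modeled on the example in the post, not taken from the ACM DL data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the example in the post (not ACM DL data).
SAMPLE = "<script><![CDATA[<message> Welcome to TutorialsPoint </message>]]></script>"


def element_text(xml_string):
    """Parse an XML string and return (text of the root, number of child elements)."""
    root = ET.fromstring(xml_string)
    return root.text, len(root)


text, child_count = element_text(SAMPLE)

# The CDATA content comes back as plain text, angle brackets included.
print(repr(text))
# The <message> tag was not parsed, so the root has no child elements.
print(child_count)
```

Under these assumptions, the parser hands back the CDATA content as the element's text and does not build a `<message>` child element, which matches the "not parsed" behavior described above.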

2018-10-11: iPRES 2018 Trip Report

September 24th marked the beginning of iPRES 2018 in Boston, MA, for which both Shawn Jones and I traveled from New Mexico to present our accepted papers: Measuring News Similarity Across Ten U.S. News Sites, The Off-Topic Memento Toolkit, and The Many Shapes of Archive-It. iPRES ran paper and workshop sessions in parallel, so I will focus on the sessions I was able to attend. However, this year the organizers created and shared collaborative notes for all sessions with all attendees, to help those who couldn't attend every session. All the presentation materials and associated papers were also made available via Google Drive. Day 1 (September 24, 2018): Workshops & Tutorials. On the first day of iPRES, attendees gathered at the Joseph B. Martin Conference Center at Harvard Medical School to get their registration lanyards and iPRES swag. Registration desk for #ipres2018 is almost open! We’ll have a total of almost 400 participants! pic

2018-10-10: Americans More Open Than Asians to Sharing Personal Information on Twitter: A Paper Review

Mat Kelly reviews "A Personal Privacy Preserving Framework..." by Song et al. at SIGIR 2018. "Americans are more open to share personal aspects on the Web than Asians." (Song et al. 2018) I recently read a paper published at SIGIR 2018 by Song et al. titled "A Personal Privacy Preserving Framework: I Let You Know Who Can See What" (PDF). The title alone captivated my interest, as did the above claim deep within the text. The authors' goal was to reduce users' privacy risks on social networks by determining who could see what sort of information they posted. They did so by es