Posts

Showing posts with the label html

2020-06-08: Who is that person in the picture? Or, how Python, and Haar can add value to an image.

Image
(Sung to the tune of "How Much is that Doggie in the Window") Who is that person in the picture? The one with the light brown hair. Who is that person in the picture? I do hope that someone would share. Introduction Often times when a group gets together, for whatever reason, there will be a group picture at the end to commemorate the good times had by all. If this "sea of faces" gets published, in hard or soft copy, there may be a one or two line caption giving the name of the group and perhaps where and when the image was created. Six months or a year later, the image has only marginal value to the people who were there, and almost no value to those who were not there, because it is just a sea of faces. We are interested in finding a low cost (very little human time) method of providing a way to add value to the soft copy of the image, so the image will have greater value later. We have developed a Python script that uses a Haar facial detection cascade to create

2019-07-11: Raintale -- A Storytelling Tool For Web Archives

Image
My work builds upon AlNoamany's efforts to use social media storytelling to summarize web archive collections. AlNoamany employed Storify as a visualization platform. Storify is now gone . I explored alternatives to Storify in 2017 and found many of them to be insufficient for our purposes. In 2018, I developed MementoEmbed to produce surrogates for mementos and we used it in a recent research study . Surrogates summarize individual mementos. They are the building blocks of social media storytelling. Using MementoEmbed, Raintale takes surrogates to the next level, providing social media storytelling for web archives. My goal is to help web archives not only summarize their collections but promote their holdings in new ways. Raintale is the latest entry in the Dark and Stormy Archives project . Our goal is to provide research studies and tools for combining web archives and social media storytelling. Raintale provides the storytelling capability. It has been designed to vi

2018-10-19: Some tricks to parse XML files

Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML is a very straightforward job provided that Python has been there for a long time with sophisticated packages such as BeautifulSoup and lxml. But I still encountered some problems and it took me quite a bit of time to figure how to handle all of them. Here, I share some tricks I learned. They are not meant to be a complete list, but the solutions are general so they can be used as the starting points to handle future XML parsing jobs. CDATA. CDATA is seen in the values of many XML fields. CDATA means Character Data. Strings inside the CDATA section are not parsed. In other words, they are kept as what they are, including marksups. One example is <script>  <![CDATA[  <message> Welcome to TutorialsPoint </message>  ]] > </script > Encoding. Encoding is a pain in text processing. The problem is that there is no way to know what the encoding the text is before opening it and re