Posts

2020-05-19: OCR Tools Experiment on Scanned Electronic Theses and Dissertations (ETDs)

A thesis or dissertation is a scholarly work showing that a student pursuing higher education has successfully met part of the requirements for a degree. An electronic thesis or dissertation can be found in either a university's electronic theses and dissertations (ETDs) digital library or ProQuest (a third-party ETD repository). ETDs contain rich metadata that can be used to search the repository. However, not all ETD metadata are available, so it is often necessary to extract metadata from the ETDs themselves. Extracting metadata can be challenging, especially when the ETDs are scanned. Although many open-source tools exhibit satisfying performance on certain types of documents, experiments indicate that they tend to produce unacceptable errors or fail on scanned ETDs. In this blog post, I introduce one of the widely used optical character recognition (OCR) tools called tesseract-OCR and show how tesseract-OCR performs on scann

2020-05-06: PTSD Assessments in COVID-19 Health Care Workers

Figure 1: Both military and medical personnel are at risk for psychological trauma [BBC.com]. Health care workers are working in unfamiliar territory in recent times. Hospitals in major cities are overwhelmed by the number of patients they are handling as a result of the coronavirus disease 2019 (COVID-19) pandemic. There are accounts of people dying in hospital hallways before help can arrive because there is not enough space, equipment, and staff to handle the influx of patients. Hospital morgues are overflowing. To make matters worse, doctors and nurses have to worry about exposure to COVID-19 and possibly exposing their families, largely due to a lack of personal protective equipment (PPE). The current environment is putting health care workers at greater risk of developing Post-Traumatic Stress Disorder (PTSD). As a matter of fact, hospital personnel have started to report symptoms consistent with those of people suffering from PTSD, from sleep disturbances to const

2020-05-06: Teaching a Flipped Hybrid (In-Class/Online) Course

I’ve been meaning to write this for a couple of years. Now seems an especially appropriate time for it. In particular, a hybrid course may be an option if staggered in-class attendance is implemented in the Fall. My first hybrid class began as an in-class "flipped" model. So first, I'll talk about how I implemented the flipped mode, and then I'll discuss how I handled the hybrid (in-class and online) aspects the following year. My definition of a "flipped" class (see https://en.wikipedia.org/wiki/Flipped_classroom , http://flippedclass.com/whyteachersmattermoreinflippedclassroom/ , http://facultyinnovate.utexas.edu/teaching/flipping-a-class ) is one in which students actually do the reading before the class meeting, and the class meeting time is spent discussing the material with students (not lecturing) and doing in-class activities. There can be several benefits to this, including that class time is changed from content delivery to ac

2020-04-30: Archives Unleashed: New York Datathon Report (From Home Edition)

The Archives Unleashed Datathon is a two-day event hosted by the Archives Unleashed team, where participants from different research backgrounds collaborate to explore web archive collections. The fourth Archives Unleashed datathon, held in partnership with Columbia University Libraries, was supposed to take place in New York City. However, as COVID-19 cases began to spread, the organizers had to make the tough decision to cancel the New York datathon. "Due to the rapidly-evolving COVID-19 situation, we have canceled the datathon which was to be held at Columbia University, March 26-27, 2020. This decision was not taken lightly and was made with the best interests of our attendees. We have been in touch with all attendees." (The Archives Unleashed Project, @unleasharchives, March 3, 2020) In the same email that brought the news of event cancellation, Ian Milligan also mentioned the possibility of organizing the event online through Zoom and Slac

2020-04-26: Large Scale Networking (LSN) Workshop on Huge Data

Between April 13 and 14, 2020, I attended the Large Scale Networking (LSN) workshop on Huge Data. This workshop was supported by NSF and organized by Clemson University (Dr. Kuang-Ching Wang), the University of Virginia (Dr. Ronald Hutchins), and the University of Kentucky (Dr. James Griffioen and Dr. Zongming Fei). It was supposed to be held in Chicago, IL, but due to the coronavirus pandemic, the whole workshop was moved online. The workshop consists of 4 topic sessions: Data generation (6 presentations); Data storage (7 presentations); Data movement (14 presentations); and Data processing and security (14 presentations). Each speaker is given only 5 minutes to do a flash presentation to highlight their work. The workshop also has 4 breakout sessions: New Areas of Research Beyond Big Data; New Types of Data & Ways to Get Them; Collaboration across Disciplines; and Critical Research Infrastructure Needed Beyond Big Data. Dr. C. Lee Giles and I contributed a white paper tit

2020-04-25: Effect of Reading Patterns of Novice Researchers using Eye Tracking

Figure 1: A participant reading the research paper wearing the PupilLabs Core eye tracker. Scientific literature gives novel research ideas as well as solutions to various problems. When it comes to scientific literature, reading patterns vary from one person to another. Common reading patterns may exist among researchers having similar expertise in a particular area, while novice researchers may have different reading patterns compared to more experienced researchers. We can expect a difference in reading patterns in terms of scan paths and pupillary activity. The ability to seek information from different sections of research papers determines the reading process of a researcher. Some researchers read research papers from beginning to end, whereas others read them in a different order than presented. One way to read a research paper is the three-pass approach. Researchers also tend to change their reading patterns over time as they f

2020-04-16: Visual Data Analysis with Streaming-hub

Streaming-hub [ Link ] In my previous post, I elaborated on how dataset metadata could be standardized in a manner that enables researchers to efficiently discover and reuse data already collected for past studies. Adopting such a standard brings a host of benefits to research communities, such as simplified data sharing, massively collaborative research, and automated data pre-processing. However, formulating and adopting such a standard would take years, if not decades, unless 1) the public realizes its practical benefits over the initial hassle of transition, and 2) tools and libraries are built that would ease workflows after transition. My previous post tries to address the first concern by introducing DFS and DDU. In this post, I describe our work towards addressing the second concern.

2020-04-09: After Using Eclipse for 10 Years, I Switched to IDEA [Translated]

Original post: https://www.cnblogs.com/ouyida3/p/9901312.html published in December 2018. Preamble: The original text was in Chinese. I first got the "raw translation" from Google Translate. Here is my impression of Google Translate: 75% or more of the text made sense, but only about 25% read naturally. As a result, I had to manually edit A LOT to make the post readable. The original post was about 50% longer than what I posted here. After using Eclipse for 10 years, I finally switched to IDEA. When I became a Java programmer, I did not start with Eclipse but with a tool called jBuilder. When I started using this tool, I already found it very easy to use, because previously I had just used a simple text editor. It didn't take long for me to find a tool called Eclipse, which had a growing number of users. At the end of the "test drive", I found it to be very user friendly. Its functions seemed tailored for programmers. One exciting feature