2022-09-15: Querying the Politwoops Search Engine for Deleted Tweets

By Caleb Bradford - September 15, 2022

As part of the ODU Research Experience for Undergraduates (REU) site in disinformation detection and analytics, we began querying certain sites for evidence that a tweet is correctly attributed, as part of the “Did They Really Say That?” project. One site that proved particularly useful was Politwoops, a project created by ProPublica that archives deleted tweets made by accounts belonging to political officials. Politwoops only tracks the accounts of candidates and current elected officials. Since this project focuses on verifying attribution for social media posts, Politwoops is a particularly fruitful source of evidence for us. For one, the existence of a tweet on Politwoops is direct evidence that the tweet was actually made. Even though Politwoops is limited to tracking political officials, it still serves a valuable purpose for us: these public figures are more likely to have their tweets shared as screenshots, since their words are subject to more scrutiny than the average person's.

This blog post details our experimentation with the Politwoops built-in search engine. We ultimately found that Politwoops uses a form of direct substring matching, where a query succeeds only if it matches part of a tweet exactly, character for character. While there is no clear character limit for these queries, any query that spans more than one sentence fails. A query can be a substring from any part of a tweet, so long as it meets the previous conditions.

For our purposes here, we will define a successful query (a “hit”) as a query that finds the tweet being searched for, regardless of whether or not other tweets are found as well. We made this distinction because it is possible to automate the process of checking the tweet we are searching for against every individual search result, making the presence of multiple search results trivial to deal with.
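The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the function name `is_hit` and the sample strings are hypothetical, and it assumes the text of each search result has already been extracted into a list of strings.

```python
def is_hit(target_text, result_texts):
    """Return True if the tweet we are looking for is among the search results.

    Other results (for example, retweets) are simply ignored.
    """
    target = target_text.strip()
    return any(result.strip() == target for result in result_texts)


# Hypothetical search results, for illustration only.
results = [
    "RT @example: An example deleted tweet.",
    "An example deleted tweet.",
]
found = is_hit("An example deleted tweet.", results)
```

Because every result is compared against the target, a query that returns several tweets still counts as a hit as long as one of them is the tweet being searched for.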

Here is a deleted tweet made by Twitter account @RepMondaire that is archived on Politwoops.

A query to the Politwoops built-in search engine utilizing the entire body of the tweet (196 characters) does not find the tweet.

Truncating the query to only the first sentence of the tweet (126 characters) does bring up the tweet in question.

Wanting to go further, I made a search with slightly more of the tweet, up to “America” (151 characters). This did not find the tweet.

Limiting the query to 113 characters (right before “devastating”) actually brings up multiple results. The other results are his retweets of the tweet in question. We found the exact same results with queries of 98 characters and 51 characters.

Live page (113 characters)

Archived page (113 characters)

Live page (98 characters)

Archived page (98 characters)

Live page (51 characters)

Archived page (51 characters)

We have established that Politwoops can find a specific tweet with as little as the first 51 characters. Next, we examined another tweet, this one by @JunaidForUs (the first one), because it is a little longer (273 characters) and contains more special characters, such as punctuation and numbers, and it is worth verifying whether those affect a search.

For our first search, we tried the whole tweet body. This was not successful, but that was to be expected given the longer length of this tweet. A search excluding the last sentence (224 characters) was also unsuccessful.

Live page (whole tweet)

Archived page (whole tweet)

Live page (224 characters)

Archived page (224 characters)

Cutting out one more sentence from the end leaves 171 characters. This one was successful.

Further experimentation showed that it is possible to find a tweet with a search that has a word cut off, shown in this 163 character search, in which the word “rate” is cut down the middle (i.e., "ra").

Further supporting the idea that queries cutting off words in the middle can be successful, shortening the query to only the first 80 characters (up to the "U" in "U.S.") also finds the tweet.

Our success here suggests that queries to Politwoops constructed using the beginning of a tweet are successful. Even if a query conducted in this manner brings up multiple results, comparing all of the results with the text of the tweet being searched for would be trivial. To further test this methodology, we will focus on finding more tweets using it, starting with this tweet from @RepGuthrie.

Conducting a query with the first 80 characters of that tweet is successful.

Testing the limits even further, a search for the first 50 characters of the same tweet still finds it.

Here’s another tweet from @HouseAgGOP.

A 50 character query for this tweet is also successful.

Next, it’s worth trying out a few edge cases, starting with a tweet that contains an @, which indicates a mention of another user. This tweet from @JayeForMI is a good example.

A 50 character query for this tweet is successful, even though the mention in question actually gets cut off in the middle because of the character limit.

Next I tried it on this tweet from @RepBobbyRush. It’s a shorter one that contains only some text and a link.

The 50 character query works here, even though the link gets cut off in the middle.

The below tweet made by @AdamKinzinger presents another interesting edge case, that being a quote-tweet (represented in Twitter by "QT" in the text followed by the tweet being quote-tweeted) where the only text of the actual tweet is a link to an image that is no longer available on Twitter.

A 50 character query fails here.

Other attempts to query for this tweet proved ineffective. The reason for this failure is difficult to ascertain, but it could be that the broken media link is the only part of the actual tweet body. Attempts to find other tweets with links seem to be successful, even when those links point to broken media. This is shown by a query for this tweet by @WinterForMT, which also includes a broken link to an image.

This is not particularly discouraging, as it’s a niche situation that is unlikely to come up in this research anyway. The input we will ultimately extract text from to construct our queries is a screenshot of a supposed tweet. A screenshot of the tweet that could not be found would simply show a QT with only embedded media and no text. We would not be able to extract the link as it appears on Politwoops from an image of that embedded media, so we could not construct a query for this tweet from a screenshot alone. Other kinds of quote-tweets, however, are a case we may need to cover. Here’s a quote tweet from the account @RepYvetteClarke.

A 50 character query is successful at finding the tweet.

The work we have done up to this point indicates that the Politwoops built-in search engine utilizes some form of direct substring matching that becomes less effective with larger queries. Building queries with up to the first 50 characters of a tweet on Politwoops proved to be effective at finding the tweet in question. It should be noted that the text of a query needs to exactly match the tweet being searched for. This is shown in the query below, which matches the query above exactly but without the first space. It fails to find the tweet.

For this reason, anyone working with Politwoops to find specific tweets will need to ensure that encoding for a query is kept consistent with the encoding used for Politwoops, as any difference will result in a failed query, even though the tweet is present on Politwoops.
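One way to keep a query intact on its way to the search engine is to percent-encode it without trimming or normalizing anything first. The sketch below assumes the query is sent as a URL parameter, which this post does not confirm; `encode_query` is a hypothetical helper name, and the sample string is made up. It uses only `urllib.parse.quote` and `unquote` from the Python standard library.

```python
from urllib.parse import quote, unquote


def encode_query(query):
    """Percent-encode a query exactly as given, without stripping whitespace."""
    # safe="" means every reserved character is encoded, including "/" and "&".
    return quote(query, safe="")


# Hypothetical query text; note the two spaces and the "&".
original = "QT @example  this & that"
encoded = encode_query(original)

# Decoding gives back the identical string, spaces included.
round_trip = unquote(encoded)
```

The point of the round trip is that nothing is silently altered: a helper that called `strip()` or collapsed repeated spaces would produce a different string, and under exact matching a different string means a failed query.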

We also want to consider whether it’s possible to find tweets with a query that doesn’t start from the beginning of the tweet. Here’s another query meant to find the above tweet in which the query begins at the second word (“should”). Interestingly, this query does find the tweet.

This suggests that the substring matching does look for an exact match, but that match can occur anywhere in the tweet. A query using another part (“yet our broken…”) of the tweet is also successful.

Something else worth considering is whether or not there is a consistent upper limit for the number of characters for a successful query. To test this, we started with this tweet by @JunaidForUs that we used previously.

So far, our longest successful query was a 171 character query to find this tweet. A 180 character query fails to find it.

This 172 character query also fails, confirming that 171 characters is the limit for finding this tweet.

It’s worth noting that the first 171 characters of this tweet is the entirety of the first sentence. We also reexamined this tweet by @RepMondaire.

We saw previously that this tweet could be found with a 126 character query. 127 characters works as well.

However, a 128 character query fails.

Once again, this character count corresponds to the ending of the first sentence of the tweet. We also did further examination on this tweet from @JayeForMI.

Querying for the first sentence (138 characters) is successful.

This query uses 140 characters, which includes the first character of the second sentence. It does not find the tweet.

These queries show that while there does not seem to be a consistent character limit on queries, our queries are generally unsuccessful when they include more than one sentence. A query constructed using the second sentence of the @JayeForMI tweet is successful.

Our success here indicates that a successful query can include parts of a sentence, or a whole sentence, but not more than one sentence. These sentences are delimited by the string ". ".
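The behavior observed so far can be written down as a small predictive model. This is only a sketch of what our experiments suggest, not a description of how Politwoops is implemented: the function names are hypothetical, and keeping the ". " delimiter attached to the sentence it ends is an assumption based on the 127 character query above still succeeding.

```python
def split_sentences(tweet_text, delimiter=". "):
    """Split a tweet on ". ", keeping the delimiter with the sentence it ends."""
    parts = tweet_text.split(delimiter)
    pieces = [part + delimiter for part in parts[:-1]] + [parts[-1]]
    # Drop the empty piece produced when the tweet itself ends with ". ".
    return [piece for piece in pieces if piece]


def predicted_to_match(tweet_text, query):
    """Predict a hit only if the query lies entirely inside one sentence."""
    if not query:
        return False
    return any(query in piece for piece in split_sentences(tweet_text))


# Hypothetical tweet text, for illustration only.
tweet = "First sentence here. Second one follows."
within_one = predicted_to_match(tweet, "sentence here.")
across_two = predicted_to_match(tweet, "here. S")
```

Under this model, a query taken from inside either sentence is predicted to succeed, while a query that crosses the ". " boundary is predicted to fail, matching the results above.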

Our experimentation with the Politwoops search engine highlights many of its strengths and weaknesses. The Politwoops search engine works best when queries are limited to either parts of a sentence or a whole sentence. These queries can include any part of a tweet. However, queries are not effective when they include more than one sentence. This behavior could motivate the use of sentence tokenization on tweets being searched for when automatically constructing queries as opposed to constructing queries utilizing a certain number of characters from the beginning of a tweet. Ensuring that we are only searching for one sentence or a limited string from one sentence will be the best way to find tweets on Politwoops.
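A sentence-based query builder along these lines might look like the following. It is a rough sketch under stated assumptions: `build_queries` and its parameters are hypothetical names, it splits on the ". " delimiter rather than using a full sentence tokenizer, and ordering the candidates longest first is a design choice of this example rather than something we tested.

```python
def build_queries(tweet_text, max_chars=None, delimiter=". "):
    """Return one candidate query per sentence, longest first.

    Each candidate is an exact substring of the tweet that stays
    within a single sentence.
    """
    queries = []
    for piece in tweet_text.split(delimiter):
        # Trimming the ends still leaves an exact substring of the tweet.
        piece = piece.strip()
        if max_chars is not None:
            piece = piece[:max_chars]
        if piece:
            queries.append(piece)
    return sorted(queries, key=len, reverse=True)


# Hypothetical tweet text, for illustration only.
tweet = "Short one. A much longer second sentence. End"
candidates = build_queries(tweet)
capped = build_queries(tweet, max_chars=10)
```

Trying the longest candidate first should return fewer unrelated results, and the remaining candidates are available as fallbacks if the first query fails.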

We also need to keep in mind when constructing queries that Politwoops uses exact character matching. If there is any deviation between our query and the tweet we are searching for, the search will fail. This limitation could present problems for our purposes, since ultimately we want to construct queries from strings extracted using optical character recognition, which will not always read its input perfectly.

So far, we have developed tools that can construct queries with the first 50 characters of a tweet and verify whether that tweet is found with the query.
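The general shape of such a tool is sketched below. This is not our actual code: `prefix_query`, `tweet_found`, and `fake_search` are hypothetical names, and `fake_search` is an in-memory stand-in for the Politwoops search so that the example runs without network access. A real version would replace it with a function that submits the query to Politwoops and returns the text of each result.

```python
def prefix_query(tweet_text, length=50):
    """Build a query from the first `length` characters of a tweet."""
    return tweet_text[:length]


def tweet_found(tweet_text, search, length=50):
    """Run a prefix query through `search` and check that the tweet comes back."""
    results = search(prefix_query(tweet_text, length))
    return tweet_text in results


# Stand-in archive and search function, for illustration only.
archive = [
    "An example archived tweet that is comfortably longer than fifty characters in total.",
    "Another unrelated tweet.",
]


def fake_search(query):
    """Return every archived tweet that contains the query as an exact substring."""
    return [tweet for tweet in archive if query in tweet]


found = tweet_found(archive[0], fake_search)
missing = tweet_found("A tweet that was never archived.", fake_search)
```

Passing the search function in as a parameter keeps the query construction and the verification step testable on their own.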

-Caleb
