Data Culture Group @ Northeastern University

A New Tool To Help Understand Partisan News in the US

Tue, 30 May 2023 05:09:00 +0000

The latest Zelda game is being covered a lot more in Democrat-serving online news sites. That’s one of the first random tidbits I noticed in “On Our Own Terms”, a new tool built by Claire Pan and me as part of the Media Cloud project.

Top terms from the week starting May 15th, 2023. I’ve highlighted the Zelda-related terms, which only show up on left-leaning sources.

There are more impactful details to be discovered in there, such as the fact that Tucker Carlson’s dismissal from Fox News was covered in Republican-serving online news for longer than in Democrat-serving sites. However, this disparate coverage of Zelda is an intriguing thread to pull on in the future.

Top terms from the week starting May 1st, 2023. I’ve highlighted Tucker Carlson’s name which only shows up on right-leaning sources. This is over a week after he was fired on April 24th.

How We Break Down Media by Partisanship

We created “On Our Own Terms” to let researchers and news-junkies perform a high-level comparison of terms used in news headlines across a political breakdown of media. This breakdown is a bit complicated to explain, but stick with me as I run you through it. We’ve built on a Twitter user panel created by my colleague David Lazer’s lab, where they associated half a million Twitter users with their voter registration records. Mining that network for links shared in tweets, they’ve produced ratios indicating the comparative sharing between registered Democrats and Republicans by domain name. We’ve narrowed these domains down to US-published news sites, and put them into three buckets: domains shared more often by registered Democrats on Twitter, domains shared roughly equally by registered Republicans and registered Democrats on Twitter, and domains shared more often by registered Republicans on Twitter. Hopefully I didn’t lose you in the middle of that, because the end result is three buckets of news sources that we can talk about as left-leaning, right-leaning, and central. That’s a bit hand-wavy, because we’re not actually rating the sources or their content themselves, but it sure is a lot easier to explain than the rest of the paragraph above.
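To make the bucketing step concrete, here is a minimal sketch in Python. It is not the project's actual code: the function name, the 0.2 margin that defines "roughly equally", and the share counts are all assumptions made up for illustration.

```python
def bucket_domain(dem_shares: int, rep_shares: int, margin: float = 0.2) -> str:
    """Assign a domain to a bucket from how often each group shared it.

    `margin` is a hypothetical cutoff: a domain counts as "center" when the
    Democratic fraction of shares falls within 0.5 +/- margin.
    """
    total = dem_shares + rep_shares
    if total == 0:
        raise ValueError("domain has no shares to compare")
    # Fraction of all partisan shares that came from registered Democrats.
    dem_fraction = dem_shares / total
    if dem_fraction >= 0.5 + margin:
        return "left"
    if dem_fraction <= 0.5 - margin:
        return "right"
    return "center"


# Invented share counts (Democrat shares, Republican shares) per domain.
example_shares = {
    "example-left.com": (900, 100),
    "example-center.com": (520, 480),
    "example-right.com": (150, 850),
}
buckets = {
    domain: bucket_domain(dem, rep)
    for domain, (dem, rep) in example_shares.items()
}
```

The point of the sketch is that the buckets describe who shares a source, not a rating of the source's content.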

The “On Our Own Terms” dashboard uses this political breakdown of US online news sources to show you top words in headlines on a week-by-week basis. Why? We have more than a decade of strong evidence about the trend towards polarization in news sources in the US (particularly on the right). In addition, language analysis shows how politicians are increasingly using completely different language and terms in Congress. Combine these and you can see how as a country we’re beginning to have very different parallel conversations, creating serious threats to shared consensus on reality and overall civic cohesion.
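A rough sketch of the weekly counting idea, assuming headlines arrive as (bucket, publish date, title) tuples. The stopword list, the tokenizer, and the sample headlines are invented for illustration; the real dashboard's pipeline may work quite differently.

```python
import re
from collections import Counter
from datetime import date, timedelta

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "is", "from", "with"}


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def top_terms(headlines, n=3):
    """Count headline words per (bucket, week) and keep the n most common.

    `headlines` is an iterable of (bucket, publish_date, title) tuples.
    Returns {(bucket, week_start): [(term, count), ...]}.
    """
    counts = {}
    for bucket, published, title in headlines:
        key = (bucket, week_start(published))
        words = re.findall(r"[a-z']+", title.lower())
        counts.setdefault(key, Counter()).update(
            w for w in words if w not in STOPWORDS
        )
    return {key: counter.most_common(n) for key, counter in counts.items()}


# Invented headlines, not real stories.
sample = [
    ("left", date(2023, 5, 16), "Zelda sequel breaks sales records"),
    ("left", date(2023, 5, 17), "Why the new Zelda is a triumph"),
    ("right", date(2023, 5, 16), "Debt ceiling talks stall"),
]
result = top_terms(sample, n=1)
```

Comparing the lists for the same week across buckets is what surfaces terms that appear on one side only.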

Why Show Top Terms?

Almost 15 years ago the Media Cloud project began with one of the primary goals to map the broader media landscape. A word cloud of top terms was one of the first dashboards the team created (that predated my work on the project). We’re in the process of rebuilding right now, in collaboration with the Internet Archive’s Wayback Machine, so I wanted to honor and return to that history by building an updated version of that dashboard as a first step.

The mediacloud.org homepage circa 2011, courtesy of a snapshot on the Internet Archive’s Wayback Machine.

Looking at top terms in headlines can give informed news followers a quick summary of the events and people that are driving news, and sometimes even clues about the narratives. Of course, out-of-context terms don’t really explain what different stories are, so we’ve set it up so that clicking on a term opens up a search in our online news archive for stories that have that term in their headline.

An example of searching for the number of stories mentioning “tucker” in the week starting May 1st, 2023 in the Media Cloud search tool.

An Evolving Tool

Over the summer we will be expanding on this dashboard, giving you an alternate way to explore the news. As I mentioned, we’re still reconstructing this searchable online news archive, so there will be some hiccups and bugs. That said, give it a spin if you’re intrigued; we think it will reveal some interesting threads to pull on, perhaps even more relevant for our lives and civic structures than who is reporting on the latest Zelda game. Just bookmark On Our Own Terms and it will update automatically every week. Want to help? You can check out the source code and historical data in our open-source code repository.

Teaching Physical Computing with Mini-Mini Golf 🤖⛳️🎉

Wed, 03 May 2023 05:08:31 +0000

My family knows one thing about vacationing with me: if we’re anywhere near a mini-golf course we’ll have to stop and play it. Waterfalls, windmills, pirates, animals — I love it all. This semester I was teaching my Physical Computing course again, and I wondered… could I combine my love of this quirky American leisure activity with Arduinos, sensors, motors, and more? So I decided to have students design interactive tabletop mini-mini golf holes for a public showcase.

Me trying out the Phineas and Ferb hole. (Photo by Matthew Modoono/Northeastern University)

What is Physical Computing?

Physical Computing is a phrase we use to talk about the craft of building systems that sense and react to the world and us. It is a wonderful way to introduce learners to working with sensors, motors, LEDs, and more because it exists in the real world and not just on-screen. There is a certain whimsy, delight, and magic to working with products like the micro:bit, Arduino, and Raspberry Pi. This year is the second time I’ve taught the Physical Computing course here in Art + Design at Northeastern University, so I decided to try some new things.

We used the SparkFun RedBoard from their Inventor’s Kit. (Photo by Charles Gauthier)

Typical introductions follow the mold of Tom Igoe’s decades of work at NYU’s ITP program, which offers invaluable resources, pedagogy, and examples to learn from. With a lack of access to a group soldering space due to construction, I decided to go all-digital in my course. This meant I was focused on microcontrollers and modules that could connect to them easily over things like I2C and other buses to extend core capabilities.

Sadly, another space-driven constraint was that we couldn’t design full-sized mini-golf holes. So taking a cue from my wife’s work with the Beautiful Stuff Project, I decided to work on square 2-foot wooden boards. Students created tabletop holes playable with marbles and 3D-printed palm-sized putters.

Conceptualizing Interactive Mini-Mini Golf (with AI)

Mini-golf holds a nostalgic charm in the US, associated with family trips and simple fun. The history is connected to this, from the emergence of putt-putt as a pastime associated with the expansion of driving and the highway network. Some of my students had fond memories of mini-golf, while others hadn’t ever played. To help get into the mindset I took the whole class on a field trip to a recently opened Puttshack indoor mini-golf place. It included a bit more gamification around the playing experience, and lots more LEDs, but the experience was fun and also helped set the mood and tone for the project.

Mini golf takes a variety of forms across the US.

Brainstorming afterwards, we narrowed in on a course theme of “cartoons” — they all had some animated TV show that they liked and thought would translate well into recognizable physical form. As a test of teaching how to brainstorm with new “AI” tools, I introduced using GPT and other large language models. My first prompt? “What obstacles would you build in an interactive mini golf hole based on the TV show Tom and Jerry?” The suggestions were good, but not that creative; they were the kind of things we immediately thought of.

OpenAI’s Chat GPT-3 has on-target, but rather obvious suggestions.

Continuing, I introduced using Dall-E and Stable Diffusion for generating image-based suggestions on how holes might look with those themes in mind. These were a bit more off-the-mark.

OpenAI’s DALL·E 2 had some rather hallucinogenic suggestions for mini-golf holes with a SpongeBob theme.

Overall the novelty of brainstorming with AI tools was compelling, and engaged students, but the actual utility of the suggestions wasn’t that great. However, like the field trip it did seem to help get them into the right playful and nostalgic mindset, especially for folks that hadn’t ever played before.

Soon we were off and running with five holes, each with a cartoon theme:

Tom and Jerry
Phineas and Ferb
Cowboy Bebop
SpongeBob SquarePants
Kim Possible

A Pop-Up Putt-Putt Event

Geared-down motors, shake sensors, mp3 players, servos, neopixel strips — students pulled together all sorts of wacky themed obstacles to create their holes. A laser-cut Tom the cat smashed a hammer down to block your path to the hole. Spinning record turntables decorated with Cowboy Bebop iconography redirected your ball away from your target. Eugene Krabs used a claw to try and deflect your ball from heading up a hill. The Kim Possible jingle played to alert you when an obstacle would move. A cage-o-matic created by Dr. Doofenshmirtz raised and lowered over the hole.

After two quick weeks of building and playing, students were ready (enough) to show these interactive obstacles at a 2-hour pop-up playing event. Working with the Center for Design, I set up each hole on a table to create a five-hole course. We made score cards, 3D-printed hand-sized golf putters, and invited in the Northeastern community to play.

Prof. Pedro Cruz joined us and played the students’ SpongeBob themed hole. (Photo by Matthew Modoono/Northeastern University)

The Tom and Jerry hole, amongst others, included some laser-cut character obstacles. (Photo by Charles Gauthier)

Students chatting around the Kim Possible mini-mini golf hole. (Photo by Charles Gauthier)

The event was a delight. Students’ friends came in to play, and faculty chatted with me. The holes worked and failed as you might expect; building reliable and repeatable physical computing devices is hard, especially with humans involved. That was one of the main goals of the project, to experience using motors and sensors in a short setting where things had to interact with real people. Overall we had a blast, and I’m excited to teach this again and expand it into a longer unit. Maybe with enough studio space we can create full sized holes.

Data journalism? You can do it.

Tue, 28 Feb 2023 05:08:31 +0000

Data is still hot, but the new skills, math, and technologies can feel overwhelming. In my experience journalism students and professionals approach learning data journalism with both excitement and trepidation. However, over a decade of teaching data literacy to many types of learners I’ve found that journalists are some of the best positioned to dive into data storytelling. If you’re on the fence about trying data journalism, consider this piece a pep talk. I’m here to convince you that fear is the main barrier. If you overcome it you’ll find that you are well prepared to start working with data, and that there is a robust and supportive community waiting for you.

Generated artwork from DALL-E depicting a data journalist.

Increasingly data is central to telling the most important stories of our time, and accordingly the tools and community have matured significantly. Now is a great time to try your hand at integrating data into your stories, creating new ways for readers to understand the issues you cover while simultaneously taking advantage of the popularity of data visualization to bring in new readers.

Your Skills Translate

For an organization with global reach, the UN World Food Programme headquarters are in a rather unassuming office park outside of Rome. About 5 years ago I found myself in a lively discussion there about the meaning of the Spanish word “contar” (with a Colombian attendee of a data literacy training I was helping at). The language of “counting” is a common part of data production; it is quantitative data production. I was describing this with the Spanish verb “contar”, which I intended to mean “to count”. My new Colombian friend reminded me that it also means “to tell”. As this dual meaning of “contar” suggests, from the moment of production data is just us telling ourselves stories. Journalists are storytellers; who could be better equipped to do data work than them? In English the closest analogue is perhaps “account”, where “to account” for something could be the act of counting, while “an account” is often a story of something that happened. This powerful duality in the terms connects the act of recording to the act of storytelling – literally the same word is used for both.

Many core journalism skills directly help in data work.

Many journalism skills easily translate to data work. Consider the first step of working with data - creating it. What would you call collecting a set of interviews? I’d call that producing a dataset. Journalists regularly produce and merge multiple datasets as part of their everyday reporting. What about validating a dataset to verify it is accurate? Sounds a lot like fact-checking to me. Newsrooms, and the profession at large, have robust theory and practice for ensuring the accuracy of the data they collect. How about being critical of datasets and what biases they bring? Journalists are primed to be skeptical of sources, to look for multiple viewpoints; yet another skill that translates directly to working with data. Assembling many datasets into a consistent and compelling narrative? Again, we see journalists doing this regularly by pulling in descriptions, quotes, perspectives and more.

Your journalism skills and training translate directly to telling data stories.

You Can Math

Years ago I visited the classroom of someone that helped me on my journey into data science - Professor Allen Downey. His books on programming and statistics are must reads for anyone learning programming and data science, and were particularly valuable for me as I tried to fill in gaps in my own self-taught skills. Allen generously invited me into his Olin College classroom to introduce his students to a different approach to storytelling with data. As I contrasted storytelling with statistical analysis, he jumped in with a way of talking about math that has stuck with me since then. He pointed out that math, at its simplest, is just counting. At its most complex, it can include rigorous statistical methods with lots of Greek letters. But there is, he argued, a LOT of space in-between. The in-between includes many techniques that offer analytic power to non-statisticians. Normalization is just smart dividing. Binning is just basic rounding. Here was someone that wrote a widely read book on programming and statistics arguing that you didn’t have to be a statistician to produce useful and reliable data analysis. I’ve hung onto this idea ever since, introducing it over and over again in workshops.
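To see how little math those two techniques need, here is a small sketch in Python. The function names, town names, and numbers are all invented for illustration.

```python
def per_capita(count: int, population: int, per: int = 100_000) -> float:
    """Normalize a raw count by population: the 'smart dividing' idea."""
    return count * per / population


def bin_down(value: float, width: int = 10) -> int:
    """Bin a value by rounding it down to the nearest multiple of `width`."""
    return int(value // width) * width


# Hypothetical (incident count, population) pairs for two made-up towns.
towns = {"Smallville": (30, 20_000), "Metropolis": (300, 1_000_000)}
rates = {name: per_capita(count, pop) for name, (count, pop) in towns.items()}

# Hypothetical ages grouped into decade-wide bins.
ages = [23, 27, 31, 38, 45]
age_bins = [bin_down(age) for age in ages]
```

In this made-up data Metropolis has ten times as many incidents as Smallville, but once you divide by population its rate is one fifth of Smallville's. The arithmetic is one division; the judgment is in deciding that population is the right thing to divide by.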

Knowing when to normalize, when to bin, when to remove an outlier - these are the challenging concepts. Math is not the hard part; most of it is just counting and even my middle school age kids are pretty good at that (although my son did struggle with forgetting that the number 4 existed for a while).

Tools are Getting Better

I wrote a whole academic paper on the challenges of teaching data journalism in a world of “tool overload”, so I truly feel the pain of navigating the overwhelming set of options for working with data online. There isn’t one tool to rule them all, but there are some compelling examples demonstrating how computational innovation is making data tools easier to learn and more widespread in use.

It used to be that you either learned Excel or programming, but now there are robust major tools like Tableau that bridge the gap. While they can’t help you figure out the right standard chart to support your story, chart choosers help you move from narrative form to a chart that can tell that type of story. Check out the DataVizCatalogue for more summaries of chart types and their intended uses. Worried about production value? DataWrapper is a great example of a simple-to-use tool that produces great looking charts online. Want to make more advanced chart types? RAWGraphs lets you just paste in your final data and try out some more novel chart types you might feel intimidated by.

Some leading tech for creating interactive data pieces emerged from news organizations.

In this industry this isn’t just a story of newsrooms adopting tech startups’ latest products; impactful innovation has come from newsrooms themselves, or been intentionally built for them. The Flourish platform for online data visualization was conceived and developed for newspapers. They’ve had so much success that the online design app giant Canva recently acquired them. The key author of d3.js, the go-to technology for creating custom interactive data visualizations, is Michael Bostock. After creating d3.js at Stanford University he spent years creating award-winning interactive pieces at the New York Times. More recently, the Svelte framework has taken the JavaScript developer world by storm, praised by developers for its ease of use. The primary developer, Rich Harris, was a journalism student who learned to code (short documentary). He created Svelte while working at the Guardian.

Still overwhelmed by online charting or analysis tools? Don’t be afraid to return to your paper and pencil like celebrated data visualization designer Giorgia Lupi so often does. I often argue that there’s power in more informal approaches to data visualization.

The Community is Still Growing

You’re not alone on this journey. The community is growing and moving from its technological roots to embrace more diverse sets of data storytelling learners. Reddit’s DataIsBeautiful forum is a great place to get inspired about what to make (or what not to). The Data Visualization Society is a relatively new organization hosting welcoming events for learners of all types. Books like Schwabish’s Better Data Visualizations and Feigenbaum & Alamalhodaei’s Data Storytelling Workbook offer concrete and practical advice and guidance. Here where I teach at Northeastern University our Storybench blog regularly shares tutorials and behind-the-scenes stories from the field to help you see data storytelling in action.

As a journalist, you’re better equipped than most to enter the field of data storytelling. You should try it, because it will make your reporting — and the community itself — stronger.

Thanks to my colleague Meg Heckman for advice on an early draft of this piece.

Originally published on the StoryBench blog.

Digital Storytelling to Support Connective Journalism

Wed, 08 Feb 2023 05:09:00 +0000

Journalism serves many roles in society - informative, investigative, normative, and more. As the tools and practices of interactive digital storytelling continue to grow, how can they help the connective role journalism plays in society? Read on for some background and a recent experiment I did in creating a digital story focused on building community connections.

The Roles of Journalism

Communication scholars often talk about the differing roles communication plays in society. Media critic and journalist James Carey is credited with a popular distinction, which teases out the idea of communication as transmission and ritual. The latter thinks about how communication plays a role in social interaction and construction of community. Journalism certainly plays the informative role, but also serves what Carey would call ritualistic functions.

Journalism Professor Jeff Jarvis argues that community in fact means connecting people intimately over time. His definition of journalism centers on “convening communities” to build shared understanding and world views. Journalism functions as a form of civic participation - increasing people’s feeling of community and willingness to participate in that community. The decline of local news in the US has demonstrated this; people’s sense of community, involvement, and pride decreases as local news sources disappear. We need more journalism focused on this community-building role.

From a lens of community connectivity, the digital transformations of the last 30 years have created gains and losses. Take classified sections as an example. Most discussions of the migration of classified ads from digital news sites to Craigslist focus on the massive loss of revenue, but with this ritual role in mind it can also be thought about as a loss of community connection. Browsing the classifieds, and posting a listing, instilled a sense of belonging; one associated with the individual and the community but also with the newspaper that hosted it. The news organization often was central to the relationship between the self and the place. Losing classified ads was a blow to the connective role news organizations played. Taken more broadly though, digital storytelling created new ways to discover your place and community as a reader. Consider browsable maps or databases of local businesses as an example. In addition (the early days of) comments and citizen journalism created new digitally-mediated ways to build that sense of community via a news provider’s website.

Scholars Dr. Regina Marchi and Dr. Lynn Schofield Clark offered the idea of “connective journalism” to get at the new ways that role plays out in the diverse media landscape of online news and social media. Sharing links, liking them, reposting them - these are ways people engage with one another that build connective tissue in journalistic ways. That mediates a sense of self and relation to a larger community, especially for marginalized groups like youth. Digital news interactives can be crafted to specifically try and build community connectivity in ways inspired by that work.

Building An Example

So what about a broader application of this idea of building for community connectedness? How can we more intentionally use digital storytelling to support the connective role of journalism? I decided to explore this by piggybacking on the attention to the men’s FIFA World Cup late last year.

The tournament, held in Qatar, was widely criticized for labor issues, gender rights issues, and overall corruption. However, it is still the largest sporting event on our planet, commanding massive amounts of media and popular attention. Rather than further reinforcing this flow of attention and money into FIFA’s pocket, I decided to try something different. I built a digital story to connect readers with local immigrant fans and their cultures.

Screenshot from the Our Cup interactive.

Our Cup uses the reader’s rough location and census data to discover the largest three local immigrant communities from countries competing in the World Cup. I was able to precompute this data for every county in the US via the Census ACS “foreign born population” data. I manually curated a set of information for each country that I thought might be “culturally relevant” – things that might connect a reader to that community and their culture. Specifically the country links to:

Local restaurants serving the country’s cuisine, sourced from Yelp
Recipes from the country you can make at home, sourced from Yummly
Background about the country, via Wikipedia
A few playlists on Spotify, via Every Noise at Once
A team guide, which The Guardian sourced from local journalists
A heat map of where in the US immigrants from that country live

The goal here is to give readers who might not know about local immigrant populations a chance to learn about them via the media event that is the World Cup. The links push from the sport story to a connective goal, from awareness of the local immigrant communities to direct or indirect engagement with them.
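The county lookup described above can be sketched roughly as follows. This is an illustrative guess at the shape of that precomputation, not the Our Cup source code: the function name, the data layout, and the population figures are all hypothetical.

```python
from __future__ import annotations


def top_communities(
    foreign_born: dict[str, int], competing: set[str], n: int = 3
) -> list[str]:
    """Return the n largest foreign-born groups whose country is competing.

    `foreign_born` maps country of birth to a resident count for one county.
    """
    eligible = {
        country: count
        for country, count in foreign_born.items()
        if country in competing and count > 0
    }
    # Largest populations first.
    return sorted(eligible, key=lambda c: eligible[c], reverse=True)[:n]


# Invented counts for a single made-up county.
county_counts = {
    "Brazil": 12_000,
    "India": 40_000,
    "Mexico": 25_000,
    "Ghana": 3_000,
    "Japan": 5_000,
}
# A subset of the countries that played in the 2022 tournament.
competing_teams = {"Brazil", "Mexico", "Ghana", "Japan", "Qatar"}
top_three = top_communities(county_counts, competing_teams)
```

Note that the largest group in this made-up county is filtered out because its country was not in the tournament, which is exactly the kind of editorial choice the piece bakes in by using the World Cup as its hook.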

More Experiments Needed

This piece served as a “thing to think with” for me, a chance to explore what a digital story that is built with connective journalistic goals in mind could look like. Is it a success? In one way it is, because it helped me think this question through. In another way I can’t really tell, because I didn’t do the qualitative work to talk to readers about their perceptions before and after. In yet another way it isn’t a success, because it is a bit of a toy. That is all right with me, because I learn best by playing with toys.

I’ll keep looking for other people playing in this space, and the toys they are building. Hopefully we can continue to innovate how and why we build digital stories. There’s more to explore, and we need to think harder, in order to flesh out how approaches to digital storytelling can support the important connective role journalism plays.

Thanks to my colleague Fernando Bermejo for advice on an early draft of this piece.

Originally published on the StoryBench blog.

New Paper: Taking Data Feminism to School

Tue, 05 Jul 2022 05:09:00 +0000

Excited to share a new paper out in the British Journal of Educational Technology. I worked with collaborators to assess what data feminism looks like in K-12 data science education. We retrospectively review 42 youth data programs and projects, assessing each against the key principles of data feminism.

Taking data feminism to school: A synthesis and review of pre-collegiate data science education projects

Victor R. Lee, Daniel R. Pimentel, Rahul Bhargava, Catherine D’Ignazio

Paper Abstract

As the field of K-12 data science education continues to take form, humanistic approaches to teaching and learning about data are needed. Data feminism is an approach that draws on feminist scholarship and action to humanize data and contend with the relationships between data and power. In this review paper, we draw on principles from data feminism to review 42 different educational research and design approaches that engage youth with data, many of which are educational technology intensive and bear on future data-intensive educational technology research and design projects. We describe how the projects engage students with examining power, challenging power, elevating emotion and lived experience, rethinking binaries and hierarchies, embracing pluralism, considering context, and making labour visible. In doing so, we articulate ways that current data education initiatives involve youth in thinking about issues of justice and inclusion. These projects may offer examples of varying complexity for future work to contend with and, ideally, extend in order to further realize data feminism in K-12 data science education.

Download the paper from the BJET website

Upcoming ACM FAccT Talk: Towards Intersectional Feminist and Participatory ML

Mon, 20 Jun 2022 05:09:00 +0000

We’ll be presenting our collaborative work on the Data Against Feminicide project at the 2022 ACM Conference on Fairness, Accountability and Transparency. We’re very excited to put forward this work as a case study in intersectional feminist and participatory approaches to machine learning.

Towards Intersectional Feminist and Participatory ML: A Case Study in Supporting Feminicide Counterdata Collection

Harini Suresh, Rajiv Movva, Amelia Lee Dogan, Rahul Bhargava, Isadora Cruxên, Ángeles Martinez Cuba, Giulia Taurino, Wonyoung So, Catherine D’Ignazio

Paper Abstract

Data ethics and fairness have emerged as important areas of research in recent years. However, much work in this area focuses on retroactively auditing and “mitigating bias” in existing, potentially flawed systems, without interrogating the deeper structural inequalities underlying them. There are not yet examples of how to apply feminist and participatory methodologies from the start, to conceptualize and design machine learning-based tools that center and aim to challenge power inequalities. Our work targets this more prospective goal. Guided by the framework of data feminism, we co-design datasets and machine learning models to support the efforts of activists who collect and monitor data about feminicide — gender-based killings of women and girls. We describe how intersectional feminist goals and participatory processes shaped each stage of our approach, from problem conceptualization to data collection to model evaluation. We highlight several methodological contributions, including 1) an iterative data collection and annotation process that targets model weaknesses and interrogates framing concepts (such as who is included/excluded in “feminicide”), 2) models that explicitly focus on intersectional identities rather than statistical majorities, and 3) a multi-step evaluation process — with quantitative, qualitative and participatory steps — focused on context-specific relevance. We also distill insights and tensions that arise from bridging intersectional feminist goals with ML. These include reflections on how ML may challenge power, embrace pluralism, rethink binaries and consider context, as well as the inherent limitations of any technology-based solution to address durable structural inequalities.

Download the paper from the ACM digital library or the FAccT website

Upcoming talk at PaCSS'22: Partisan Media Coverage and Intersectionality

Thu, 16 Jun 2022 05:09:00 +0000

I’ll be speaking at the 2022 Politics and Computational Social Science conference today with my colleague Meg Heckman. We’ll be presenting work, with Emily Boardman Ndulue, that took an intersectional lens to analyzing how online news media covered the election of current US Vice President Kamala Harris.

Partisan Media Coverage and Intersectionality: A Case Study of Vice President Kamala Harris

Rahul Bhargava, Meg Heckman, Emily Boardman Ndulue

Talk abstract

Partisan news coverage of politicians is well documented and studied, as are the many ways sexist and racist tropes have historically played out about those who identify as women, people of color, or both. Our current U.S. online media environment, meanwhile, has been shown to have become increasingly polarized. Kamala Harris’s election to the vice presidency presents a unique opportunity to study how these trends play out with regards to her various identities (woman, Black, Southeast Asian, mixed-culture household). To analyze coverage of Harris, we built on prior work to create collections of online news media grouped by a partisan score based on link sharing in a panel of Twitter users associated with voter registration records. We gathered and analyzed a corpus of news stories about VP Harris from these sources from August 2020 through April 2021 (n=17,165). Employing computational and qualitative methods, we found strong evidence of coverage of Harris playing out differently across an asymmetrically polarized media landscape. We found evidence of varying levels of sexist and racist tropes in all sources—but on the far right we discovered a particular cluster of (often factually incorrect) narratives that did not appear elsewhere on the partisan spectrum. Our work contributes to our growing understanding of how online media portray female politicians with multiple identities, suggesting that new intersectional biases may play out as this population grows.

Download the slides

Read the associated piece for Nieman Lab: It’s O.K. to write about women, fashion, and politics — but here’s how to do it better

Helping Computers Find Food in Text

Wed, 15 Jun 2022 05:09:00 +0000

Computers are good at processing large amounts of information, but bad at intuiting what that information actually is. For an ongoing research project looking at mentions of food in online media, we’re trying to help computers get better at recognizing entities in unstructured text. Given the text of an arbitrary webpage or social media post, how successfully can we get a computer to pull out all the words that are food?

How do we recognize foods in texts? source: Bon Appetit

The term “named entity recognition” describes a set of algorithms designed to adapt computational language models to understanding unique classes of information. As a subset of natural language processing (NLP), the goal of named entity recognition is to assign unique values to words and attribute some meaning to them. This contrasts with most other parts of the NLP pipeline, which assess generalized, syntax-related structures like part of speech and dependencies. The focus of this blog will be on how named entity recognition and adjacent strategies can form structured categories for masses of unstructured data sourced from food-related content. Within computational linguistics, a variety of approaches have formed to optimize the extraction of this sort of information.

Technical Approaches to Named Entity Recognition

Some approaches to named entity recognition do not engage much with machine learning environments. Researchers (generally speaking) have three different types of named entity recognition at their disposal:

The first, referred to as the “dictionary-based approach,” catalogs as many known named entities as possible within categories. This approach makes intuitive sense since human recall of “named entities” might engage the same sub-categorization process. However, from a computational perspective, dictionary-based recognition can incur a large memory overhead. Additionally, this approach allocates that memory inefficiently; infrequently used words or archaic terms take up space without offering much value. More importantly, the dictionary approach cannot incorporate new words into its lexicon, as it has no means of sorting previously unseen words into its existing categories. Here, intuition breaks apart since the human brain can do this by processing new information about unfamiliar entities. Even if this hurdle were overcome, the dictionary might struggle to categorize words that belong to several different “subcategories.” Thus, the dictionary-based approach does not offer suitable performance for our task at hand.
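
To make the limitation concrete, here is a minimal sketch of a dictionary-based tagger in Python. The lexicon and the function names are hypothetical and invented for illustration; a real lexicon would hold many thousands of entries, which is where the memory cost comes from.

```python
import re

# Hypothetical toy lexicon, invented for this illustration.
FOOD_LEXICON = {"pasta", "egg", "fruit salad", "olive oil"}
# Longest entry, measured in words, so we know how far ahead to look.
MAX_WORDS = max(len(entry.split()) for entry in FOOD_LEXICON)


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def dictionary_tag(text):
    """Return lexicon entries found in the text, preferring the longest match."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shrink it.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in FOOD_LEXICON:
                found.append(candidate)
                i += n
                break
        else:
            # No entry starts at this token; move on.
            i += 1
    return found
```

Calling dictionary_tag("Toss the fruit salad with olive oil") finds both two-word entries, but dictionary_tag("Whisk two eggs with gochujang") finds nothing: “eggs” is not the exact string “egg”, and “gochujang” was never cataloged.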

The second, known as the “rule-based approach,” employs linguistic patterns and context to determine the classification of words. Consider the sentence “put the pot of pasta on the stove.” If we are looking for the food term, we can use linguistic features to get there. We can see the entity we are searching for, “pasta,” and within English syntax we can locate this as a descriptor of the direct object of the sentence, “pot.” Within a computational context, rule-based named entity recognition starts by tagging the dependencies of individual words within a sentence; it uses the dependency structure, as opposed to semantic proximity, to determine the likelihood of a word appearing as a named entity. This approach still requires the use of machine learning, since neural networks must model sentences into a list of dependencies, and a training set that describes the named entity and its location within the dependency structure must be created for this approach to succeed. However, the reliance on linguistic rules can lessen the complexity of the algorithm, as well as improve accuracy on irregularly named entities, such as those with two or three words. This approach sees benefits within the domain we are working in. Research on food recognition has employed rule-based approaches with high rates of precision and recall. Rule-based approaches show some promise for our task, and we’ll return to them later in this discussion.
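
As a rough sketch of what one such rule could look like, the Python below applies a single hand-written rule to the example sentence. The parse is typed in by hand here (in practice a dependency parser would produce it), and the container list, the label names, and the rule itself are assumptions made for illustration, not rules taken from any published system.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    pos: str   # part of speech
    dep: str   # dependency label
    head: int  # index of the token this one attaches to


# Hand-annotated parse of "put the pot of pasta on the stove".
PARSE = [
    Token("put", "VERB", "ROOT", 0),
    Token("the", "DET", "det", 2),
    Token("pot", "NOUN", "dobj", 0),
    Token("of", "ADP", "prep", 2),
    Token("pasta", "NOUN", "pobj", 3),
    Token("on", "ADP", "prep", 0),
    Token("the", "DET", "det", 7),
    Token("stove", "NOUN", "pobj", 5),
]

# Hypothetical list of container nouns used by the rule.
CONTAINERS = {"pot", "bowl", "cup", "plate"}


def food_candidates(tokens):
    """Rule: a noun that is the object of "of" attached to a container noun."""
    hits = []
    for tok in tokens:
        if tok.pos != "NOUN" or tok.dep != "pobj":
            continue
        preposition = tokens[tok.head]
        if preposition.text != "of":
            continue
        container = tokens[preposition.head]
        if container.text in CONTAINERS:
            hits.append(tok.text)
    return hits
```

Here food_candidates(PARSE) returns ["pasta"] and skips “stove”, even though both are objects of a preposition, because only “pasta” hangs off a container through “of”.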

Finally, the “machine-learning-based approach” relies purely on the proximity of words to one another to predict named entities. Using a labeled training set of data, the machine attempts to classify entities, and then checks its work. Over time, the system should become capable of determining food entities, independent of anything other than the string of text fed to it. This is computationally heavy, but a powerful means of inculcating independent machine recognition. The benefits of this independence are immense; it means that over time, irregular patterns and unfamiliar terminology can be incorporated with greater success and a closer reflection of reality. As such, most research around named entity recognition has coalesced around this approach, taking insights where needed from the prior two.
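
The “classify, then check its work” loop can be sketched with a tiny perceptron in Python. This is a deliberately small stand-in for the neural models used in practice, and the four training sentences, the feature names, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Toy labeled data: 1 marks a food token, 0 marks anything else.
TRAIN = [
    ("eat the pasta", [0, 0, 1]),
    ("eat the bread", [0, 0, 1]),
    ("open the door", [0, 0, 0]),
    ("close the window", [0, 0, 0]),
]


def features(tokens, i):
    """Describe a token by itself and the two words before it."""
    prev = tokens[i - 1] if i >= 1 else "<s>"
    prev2 = tokens[i - 2] if i >= 2 else "<s>"
    return [f"word={tokens[i]}", f"prev={prev}", f"prev2={prev2}"]


def train(data, epochs=5):
    """Perceptron training: guess a label, compare to gold, adjust on mistakes."""
    weights = defaultdict(int)
    for _ in range(epochs):
        for sentence, labels in data:
            tokens = sentence.split()
            for i, gold in enumerate(labels):
                feats = features(tokens, i)
                guess = int(sum(weights[f] for f in feats) > 0)
                if guess != gold:
                    for f in feats:
                        weights[f] += gold - guess
    return weights


def predict(weights, sentence):
    """Return the tokens the trained weights score as food."""
    tokens = sentence.split()
    return [
        tok
        for i, tok in enumerate(tokens)
        if sum(weights.get(f, 0) for f in features(tokens, i)) > 0
    ]
```

After training, predict(train(TRAIN), "eat the gochujang") flags “gochujang” even though the word never appeared in the training data, because the model has learned that whatever follows “eat the” tends to be food. The same word after “open the” is not flagged.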

One approach to entity classification via machine learning begins with word2vec, which produces high-dimensional vectors representing word meaning and use. In our context, it is helpful to understand word2vec as a prerequisite to named entity recognition techniques (as opposed to its own, freestanding method). Broadly speaking, word2vec utilizes the machine learning approach described above to categorize words based on their proximity, and then mutates auxiliary data structures to reflect this. Some implementations, like skip-gram and continuous bag-of-words, do this very directly through enumeration of words within a given proximity. However, I wish to focus on a word embedding implementation. Word embedding quantifies the utility of a given word by representing the word as an n-dimensional vector. At first, these numbers are random and have little predictive value; however, through continual training, a model tweaks the values of the word based on its contextual placement. In time, the vector represents the proximity of the word to a cluster of adjacent vocabulary (Altosaar). Word embedding is the bedrock of applied named entity recognition. For a machine to understand the use value of a word, we must first make the linguistic assumption that words with similar applications (even if not synonymous) co-occur more frequently than otherwise. Word embedding condenses this process into numerical representations, similar in nature to hashing in cryptography. Embeddings are also, then, predictive in nature. By assigning clusters of relative proximity, we can utilize word embeddings to guess consecutive words in a string. Learn more in this helpful write-up from Carnegie Mellon.
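
The assumption that words used in similar ways keep similar company can be illustrated without training a neural model. The Python sketch below builds simple co-occurrence count vectors and compares them with cosine similarity. To be clear, this is a count-based stand-in and not word2vec itself, which learns dense vectors by gradient descent; the four-sentence corpus is invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented for this illustration.
CORPUS = [
    "boil the pasta in water",
    "boil the rice in water",
    "drive the car on roads",
    "drive the truck on roads",
]


def context_vectors(sentences, window=2):
    """Count, for each word, the words that appear within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = context_vectors(CORPUS)
```

In this corpus “pasta” and “rice” share every neighbor, so cosine(vectors["pasta"], vectors["rice"]) is 1.0, while “pasta” and “car” share only “the” and score 0.25. Trained embeddings capture the same kind of signal in far fewer dimensions.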

Recognizing Food Entities

With these approaches defined, how can we incorporate lessons from their use to create an effective named entity recognition model for food? To answer this, we can look at some strategies employed by relevant projects in the existing academic literature.

The first of these, produced by scholars of the Complex Systems Lab based in New Delhi, combines the rule-based and machine-learning-based models to create a graph which reflects the recognition and tagging of ingredients, the processes employed to mutate the ingredients, and the equipment used to facilitate the processes. First, the model iterates over a set of labeled ingredient data. This data incorporates a variety of attributes, such as quantity, temperature and state of ingredient (e.g. chopped, freeze-dried). With this information, parts of speech and dependencies are then flagged to create relationships to processes and utensils. Finally, an NER tagger is trained to deploy flags on word tokens used to describe processes and utensils, like “mix” and “whisk.” The result of this collection is a repository of inter-connected entities, bound together by relationships inferred from their parts of speech in shared sentences and dependencies on one another.

This approach works at the scale of a single recipe, as opposed to a project that might analyze individual entities over a swath of recipes. This might be the greatest limitation of this research. An approach based on modeling the relationships within a single recipe can be difficult to use on larger datasets, as processes and utensils can vary greatly based on application, and do not provide much valuable information about which foods are used and how they are described. Additionally, the emphasis on the food-making process and its components obscures our focus on the food created. Because of this, the object of this research might not seem immediately pertinent to our interests.

However, there are many insights to garner from the methodologies employed. First, the use of some rules-based strategies within the implementation of the machine learning model can be a great asset to our understanding of food entities. Regardless of the desired outcome of our project, using dependency structures to inform how we process food entities can be of use. With our present work, more support is needed to facilitate our model’s recognition of multi-word entities. The use of dependencies can help tag these words as joint entities. Further, the use of multi-faceted ingredient data, such as state and temperature, can add nuance to the sort of final meals described.

The second project takes a rule-based approach in order to flag entities as food. Developed by Macedonian and Slovenian computer scientists, “FoodIE” utilizes a basic NER model at first but iterates over the training data to develop predictive models based on parts of speech and dependency. Through part-of-speech and dependency analytics, the semantic classifications assigned to named entities are more nuanced than in other models. Advanced semantic tagging describes each item in the sentence, along with possible classes that the item relates to (for instance, “beef” is classified under the subfamily of “bovines”, and “soup” has additional classifications of “wave” and “cloudy”). FoodIE is a highly successful model, with 97.6% precision and 94.3% recall. The model can also recognize dishes with more than one word, even in the case of compound nouns like “fruit salad.” This suggests that the more a model can incorporate insights from the context of the text, the more granular its observations will be.
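
For readers less familiar with these two measures: precision is the share of flagged entities that really are food, and recall is the share of real food entities that were flagged. A minimal Python sketch of how they are computed follows; the function names and the example sets are hypothetical, and the F1 helper is our own addition for combining the two figures.

```python
def precision_recall(predicted, gold):
    """Compare predicted entities to hand-labeled gold entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Made-up example: the tagger finds two real foods and one false alarm,
# and misses two foods a human annotator marked.
predicted = {"fruit salad", "beef", "stove"}
gold = {"fruit salad", "beef", "soup", "olive oil"}
precision, recall = precision_recall(predicted, gold)
```

In this made-up example precision is 2/3 and recall is 1/2. Plugging the precision and recall figures quoted above into f1(0.976, 0.943) gives roughly 0.96.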

From these case studies, as well as the survey of literature on named entity recognition, two pertinent insights should be incorporated into our research. First, our model needs to better identify multi-word entities. Thus far, the model atomizes each word and does not consider context when flagging words. One tool to adopt from the second approach is dependency checks. If our model discovers a classifier belonging to food (which it does with high recall), we must build support for it to consider incorporating surrounding words into the final entity. The model is already capable of this for other tags developed in the broader NER model but does not have the power to do so for our tags. Second, we ought to consider lemmatizing our entities. I have noticed that our early iterations of the model struggle with words that carry trailing characters such as plural endings (for instance, our model currently tags egg, but not eggs). This is a simple fix: spaCy describes its pipeline as running lemmatization after named entity recognition, but this order is not fixed, so the order of processing can be inverted. We might want to use this inversion more broadly by scanning for entities first, then running a set of algorithms to determine whether adjacent words belong to the entity based on dependency or part of speech. Lemmatization, however, must precede NER.
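
Both ideas can be sketched together in a few lines of Python. This is an illustration only and not our model: the parse is annotated by hand, the suffix-stripping function is a crude stand-in for a real lemmatizer, and the lexicon, labels, and function names are hypothetical.

```python
# Hypothetical lexicon of single-word food heads.
FOOD_HEADS = {"egg", "salad"}
# Dependency labels we treat as "part of the same entity".
MODIFIER_DEPS = {"compound", "amod"}

# Hand-annotated parse of "serve the eggs with fruit salad":
# (text, dependency label, index of head token)
PARSE = [
    ("serve", "ROOT", 0),
    ("the", "det", 2),
    ("eggs", "dobj", 0),
    ("with", "prep", 0),
    ("fruit", "compound", 5),
    ("salad", "pobj", 3),
]


def crude_lemma(word):
    """Very rough plural stripping; a real lemmatizer handles far more cases."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def food_entities(tokens):
    """Lemmatize first, then grow each food head leftward over its modifiers."""
    entities = []
    for i, (text, _dep, _head) in enumerate(tokens):
        lemma = crude_lemma(text)
        if lemma not in FOOD_HEADS:
            continue
        start = i
        # Absorb directly preceding modifiers that attach to this head.
        while (
            start > 0
            and tokens[start - 1][1] in MODIFIER_DEPS
            and tokens[start - 1][2] == i
        ):
            start -= 1
        words = [t[0].lower() for t in tokens[start:i]] + [lemma]
        entities.append(" ".join(words))
    return entities
```

With this in place food_entities(PARSE) returns ["egg", "fruit salad"]: the plural is matched because the lemma is checked against the lexicon, and “fruit” is folded into the entity because it is a compound modifier of “salad”.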

Conclusion

Generally, incorporating named entity recognition techniques will prove essential in our own quest to map cultural signifiers attached to food. The choices we make in how we recognize food-based entities vary greatly depending on our use case. A survey of relevant literature grounds our research by providing leads to solve problems along the development process as well as sparking new ideas for manipulating existing datasets. No approach that we choose will yield perfect results, so it is important to know how to manage our margin of error to fit the task at hand. Ultimately, we hope to make decisions that harmonize the flow of information at each technical step of our model’s pipeline. Once it is performing well, we look forward to sharing this model.

Understanding the 2020 “Racial Reckoning” In the Media

Thu, 09 Jun 2022 05:09:00 +0000

Just over two years ago, on May 25th, 2020, George Floyd was murdered by Minneapolis police officer Derek Chauvin. The shocking video of the incident quickly circulated across the internet through social media. Mr. Floyd’s murder sparked countless protests across the country and globe in the months that followed, capturing the public attention and initiating a national resurgence of the civil rights movement, often described as a “racial reckoning.”

Individuals, organizations, industries, and institutions across the country were again forced to acknowledge the realities of systemic racism in modern America. Journalism and news media were no exception. Media organizations in particular have a responsibility to model anti-racist practices not only in their internal policies and actions, but in their public-facing role of crafting narratives and framing national discussion as well. Researchers are finding impacts of this already – our own news analysis after the 2014 death of Michael Brown showed increased media coverage and discussion of the deaths of people of color at the hands of police. Another analysis of digital news following Mr. Floyd’s murder showed similar trends of increased coverage which mostly portrayed protestors positively. Mr. Floyd’s death and the ensuing Black Lives Matter protests dominated public discussion and demanded cultural changes - how has the media landscape shifted since then?

Acknowledging a Problematic History

Newsrooms have historically been, and continue to be, predominantly white. Majority white newsrooms were criticized following their coverage of the Civil Rights Movement in the Kerner Commission’s report of 1968, and in 1978, the American Society of News Editors (ASNE) made a commitment to greater newsroom diversity and declared a goal that newsrooms would be reflective of their community demographics by 2000. In 1998, only 11.5% of daily newspaper reporters were journalists of color, still far below the stated 26% goal. The ASNE then postponed its goal date to 2025, but a 2017 analysis still found actual statistics to be far from the stated goals.

Photo of a Black Lives Matter Protest in 2020 by Jared Wickerham for the Pittsburgh City Paper. (Source)

Even as newsrooms slowly become more diverse, professional journalistic structures and practices are still major obstacles to fully addressing the profession’s preservation of whiteness and exclusivity. Just one example of this occurred at the Pittsburgh Post-Gazette in June of 2020, when editors at the paper prevented two of their top reporters, both of whom are Black, from reporting on the Black Lives Matter protests in Pittsburgh. This occurred in part because Alexis Johnson, one of the excluded reporters, shared a tweet of her own personal commentary on the protests that she felt was funny and thought-provoking. Despite no staff social media policy, the upper-level management rejected her multiple story pitches and eventually outright excluded her from reporting on the protests because they felt that further coverage could be seen as biased and cause “the credibility of the newsroom [to] be questioned.”

Objectivity and neutrality, which have long been core tenets of journalistic practice, have contributed to maintaining white perspectives in professional journalism as the only form of “objective” and “neutral” reporting. Ultimately, this means that reporting that is not racially informed is considered properly neutral and objective, further promoting the neoliberal ideal of color-blindness. In contrast, racially informed reporting, reporting from journalists of color, and stories for audiences of color are viewed as biased storytelling or relegated to niche media.

The dedication to “objectivity” in reporting is just one example of centering white perspectives in journalism; multiple publications have admitted to producing overtly racist content and enacting discriminatory business practices in both the past and present.

Screenshot from Editor-in-Chief Goldberg’s 2018 letter about National Geographic’s history. (Source)

National Geographic began their own racial reckoning in 2018, when Susan Goldberg, the magazine’s first female Editor-in-Chief, acknowledged the publication’s problematic past of disregarding American people of color and pushing stereotypes of exotic and primitive native people around the world. In the years following, the magazine hired more women, publicly covered LGBTQ+ issues, and added two women of color to their executive team.

National Geographic recommitted to this work during the summer of 2020, and staffers say there have been improvements but more are needed, structurally and socially. Lower-level staffers felt that changes in coverage, content, and workplace culture were lacking and that their pushback was met with resistance from upper-level leaders.

Across many media organizations, lower-level staffers have felt the added pressure to advocate for these changes and were sometimes met with dismissal, leading to intense stress and a complicated relationship with the brand and their work. Unfortunately, vowing to value and prioritize people of color and then continuing to ignore or tokenize their ideas and careers is not uncommon, as an investigation by the Nieman Lab found in newsrooms and Diversity, Equity, and Inclusion (DEI) positions at publications across the country. These reporters shared anecdotes which included feeling overlooked, overworked and emotionally drained, in addition to being stifled by the editing process and limited in their ability to adequately call out instances of racism and sexism.

Varying Paths within One Publisher

An interesting contrast exists between the responses from two prominent Condé Nast publications: Bon Appétit and Vanity Fair. In late May of 2020, Adam Rapoport, then-Editor-in-Chief of Bon Appétit, wrote a column addressing the murder of George Floyd and affirming the magazine’s editorial mission and values of justice; just a few weeks later he resigned when a picture surfaced of him in brownface. Following his resignation, staffers opened up about the toxic work environment at Bon Appétit, where BIPOC staffers were taken advantage of, underpaid, and mistreated. Assistant Food Editor Sohla El-Waylly shared that she had been paraded as a token of diversity in Bon Appétit’s popular Test Kitchen videos, but her appearances were not compensated, unlike those of her white coworkers. In addition, Bon Appétit’s content contributed to the overall food-industry trend of white-washing. This is exemplified by instances of white cooks and creators taking credit for and profiting off of diverse recipes that are painted as “trendy,” when they are not outright ignored or discredited.

Screenshot of the web edition of Vanity Fair’s September 2020 edition. (source)

Though institutional issues at Condé Nast remind us workplace and content discrimination was most certainly pervasive throughout the company, some individual publications fared better than others. Vanity Fair has made strides in inclusive coverage since Radhika Jones took over as Editor-in-Chief in 2017, making her one of the only top editors of color in Condé Nast’s history. Immediately after taking control, Jones championed increased content featuring people of color, especially as cover stars. Following the 2020 summer of Black Lives Matter protests sparked by instances of police violence against unarmed Black people including George Floyd and Breonna Taylor, Ms. Jones assisted in the curation of an issue that highlighted Black stories and voices. The September edition, which is annually one of the most important issues, was guest edited by best-selling and highly-revered author Ta-Nehisi Coates and featured a portrait of Breonna Taylor on the cover. The issue highlighted content from multiple Black writers and creators, sharing stories of everything from the Black Lives Matter and abolition movements to the realities of unpaid college football. This edition of the magazine, titled “The Great Fire,” exemplifies moral solidarity in journalism. Moral solidarity in the news portrays marginalized people as “subjects of justice” with personal and political agency, and requires comparatively more privileged groups of people to amplify the experiences, desires, and solutions put forth by the affected communities.

As part of Condé Nast’s company-wide reckoning, they pledged to incorporate more inclusive hiring practices to diversify their staff at all levels, including editorial. As part of their accountability goals, Condé Nast committed to releasing an annual Diversity & Inclusion Report, which includes statistics on new-hire, editorial, and overall staff diversity. In 2021, 32% of all U.S. staffers identified as people of color (compared to approximately 28% in 2020) and 41% of new hires were people of color (compared to approximately 38% in 2020). However, as of 2021, 78% of Senior Leadership positions were held by white people, up from 77% in 2020. These numbers unfortunately are in accordance with the anecdotal struggles from reporters in newsrooms across the country who feel much of the work towards inclusivity rests on the shoulders of staffers of color tasked with standing up to upper management.

Major Challenges Remain for Media Companies

While these staffing numbers are just a piece of the overall picture, they give an indication of the significant obstacles still in place that may prevent a truly inclusive workplace and perspective shift at Condé Nast and newsrooms around the country. As we move towards a new era of journalism, the role of apology letters and declarations of inclusivity is still unclear. Idealistic narratives of American progress promote the idea of a society that naturally trends towards justice and equity. However, this narrative perpetuates an optimistic view of slow but consistent progress that can minimize the severity and urgency of current racial inequities.

As Condé Nast’s annual Diversity & Inclusion Report gives us a statistical window to the hiring practices and future diversification within the organization, the parallel issue of content diversity is just as important to study. A recent piece by The Hollywood Reporter highlighted the new wave of POC editors at various publications across the country, discussing the challenges, responses, and ways that they have banded together to promote long term change in their newsrooms. Across the country, journalists at all staff levels have shown a dedication to racial equity in media, inspiring hope about the direction of the profession despite the significant progress left to be made. The racial reckoning of 2020 catalyzed a long-overdue reflection and restructuring process in professional journalism which will set industry standards for years to come. As we move forward, true progress towards racial equity in the media will require pay and benefits transparency, comprehensive shifts in internal policy, a new definition of journalistic objectivity, and diligence at all staff levels.

Upcoming Talk at C+J'22 Conference: News as Data for Activists

Wed, 08 Jun 2022 05:09:00 +0000

I’ll be speaking at the 2022 Computation + Journalism conference, hosted at Columbia University from June 9-11. I’ll be presenting a paper on the software architecture supporting the Data Against Feminicide project.

News as Data for Activists: a case study in feminicide counterdata production

Rahul Bhargava, Harini Suresh, Amelia Lee Doğan, Wonyoung So, Helena Suárez Val, Silvana Fumega, and Catherine D’Ignazio

Paper abstract

News articles are an important source of data for recording and aggregating a range of social phenomena. In this paper, we ask if and how technology can support civil society activists who challenge asymmetrical power relations by producing counterdata—datasets missing from mainstream counting institutions. We consider a case study centered on activists who monitor feminicide, or the lethal outcome of gender-related violence, often using news as a main source to identify and compile databases of incidents. We describe a system that we collaboratively built with activists, aimed at relieving some of the emotional and time-intensive labor this work entails. The system discovers relevant news stories on multiple systems, classifies them based on machine learning models, clusters them into groups of stories about the same incident, and delivers regular email alerts to users. Currently, 26 groups across different geographical regions are using the system, and groups who broadly monitor feminicide report that they are regularly discovering new cases. We also reflect on the short-comings of the pilot system for groups with more specific, intersectional monitoring focuses, and the implications of biased narratives or under-reporting on the system’s design. This case study contributes a grounded example of computational journalism built in collaboration with, and in service of, activists working on critical human rights issues.

Download the conference paper

Slides from our talk