Data is crucial to our society. It delivers benefits but, as with any human tool, it can also do harm. Our host Dr Craig Poku (he/they) asks: can the way that we use data cause further harm to marginalised groups such as the LGBTQ+ community?
The panel consists of:
Dr Kevin Guyan (he/him) – author and researcher on LGBTQ data
Erika Gravina (she/her) – Senior Data Scientist and D&I lead at Datasparq
Dr Paul Sharkey (he/him) – Senior Data Scientist at Datasparq
This podcast was recorded live at Queer Britain and it is a collaboration between Pride in STEM and Datasparq.
Find our podcast on all podcast apps here.
TRANSCRIPT
Intro – Craig Hi. I’m Dr Craig Poku and you are listening to the Pride in STEM podcast. A series that explores different barriers faced by the LGBTQ+ community. In this special one-off episode for Pride Month 2024, we hosted a live panel discussion at Queer Britain titled: Is Data Bias Harming The LGBTQ+ Community? A collaborative project between Pride in STEM and Datasparq.
Data is something that’s all around us and dictates how we operate as a society. However, despite most people seeing data as “just a number”, can the way that we use data cause further harm to marginalised groups such as the LGBTQ+ community? For this discussion we spoke to a panel of experts – Dr Kevin Guyan, Erika Gravina and Dr Paul Sharkey – to provide insights into this topic. Our conversation covered the panellists’ career journeys, the barriers faced by LGBTQ+ people who engage with data and what steps are critical to ensure that data does not unintentionally harm the community.
Here is that conversation.
00:00:00 Craig Welcome to the Pride in STEM podcast which is hosted by myself, Dr Craig Poku. And today we are doing our first live discussion. [chuckle] Which is exciting and terrifying, and oh my gosh people actually want to pay me to talk! But with that in mind we are going to be talking about data biases and one of the things we talk about with data, data is something that is around us and it dictates how we can operate as a society. For some people it is just a number but the reality is data isn’t just a number: it dictates how we can actually inform policy, it can dictate how we basically operate as people. And the question is: we are aware that data is seen as objective, but is there a way for it to actually cause harm to marginalised people, and more specifically the LGBTQ+ community? So to celebrate Pride Month, we’re actually going to talk about it! With that in mind I have some wonderful panellists here, so we have Dr Kevin Guyan, who is an author and researcher in LGBTQ+ data. We have Erika Gravina, who is a senior data scientist and the D&I lead at Datasparq, and Dr Paul Sharkey, who is a senior data scientist at Datasparq. So firstly, hi!
00:01:15 Erika Hello.
00:01:16 Paul Hi Craig! [applause]
00:01:24 Craig We may as well cut straight to the chase. My question is: tell me a little bit about yourselves.
00:01:29 Erika As previously mentioned, I am Erika Gravina, I am a senior data scientist at Datasparq, I started about three years ago. Before that, I did an undergraduate degree in Mathematics and Statistics and it was quite a smooth transition into being a data scientist. I had the experience of being exposed to it through some hackathons, which are just coding challenges. I enjoyed the process so much; it was problem solving on steroids and I loved it. I then went straight into getting to know some people and got to know Datasparq from that, so it was quite nice. Ever since joining I have been exposed to all sorts of different projects, so I’ve worked on some recommender systems, some pricing models, some passive segmentation models, so quite a large variety. And, yeah, it’s been really nice since.
00:02:20 Craig Can I also just say that I am slightly envious of how smooth your transition into data science was, compared to my career. How about yourself, Paul? Tell us a bit about yourself.
00:02:27 Paul Yes, so hi everyone. I am Paul Sharkey and I am also a senior data scientist at Datasparq and Craig I see quite a lot of you day to day.
00:02:45 Craig For context, Paul is my boss. [laughter]
00:02:49 Paul Boss is such an ogre-ish term. [laughter]
00:02:52 Craig Okay, Paul is my line manager. Better?
00:02:53 Paul That’s better, thank you. So—I guess where do I start? I did my PhD in statistics, it was an application in the environmental sciences, so I worked a bit with industry during my PhD, with the Met Office and EDF Energy. And then from there I have worked in several roles, a variety of different organisations, different applications. I guess I first experienced data science—proper data science I guess—when I went to the BBC and I was exposed to vast amounts of data. Larger data sets than I was able to fit on my laptop, and I was quite curious about the methods and approaches people take to handle that scale. Then I found my way to Datasparq. So I guess in terms of where I see my career now as opposed to when I was just starting out: starting out I was very much into research, I loved getting into the nitty gritty of a problem, even if it was the most niche problem in the world. But since then, I have really valued actually trying to get my work into action where possible. I am prioritising simple solutions to actually sort of make an impact.
00:04:12 Craig We love simple solutions.
00:04:13 Paul Yeah. [laughter]
00:04:15 Craig Even when a simple solution can sometimes be, to us it is obvious but to others it is one of those things where they kind of go, “Oh wait a second actually, that feels quite novel.” So with that in mind, Kevin tell us a little bit about yourself.
00:04:30 Kevin Hello everyone. I am not a data scientist, I work as an academic at The University of Edinburgh, in the business school, doing work around data, particularly gender and sexuality data. My background is a hot mix of disciplines across different areas of work. So my PhD was actually in history, in kind of architecture and planning history—
00:04:59 Craig Oh! I’ve learned something new today.
00:04:49 Kevin —post war, if anyone wants to chat about post-war housing in London, that’s my speciality subject. I took a pivot after that and worked in diversity and inclusion work for five years. Again, very much practically focused, very action orientated. Then, the census happened in the UK and that re-sparked my interest in how something like a census collects data about communities in very problematic ways. I pivoted to change my work to look into more quantitative data and how these data collection tools count, categorise and record queer lives. I think that kind of cross-disciplinary, interdisciplinary focus has been a strength, even if I do get terrified when people ask overly data-specific questions.
00:05:40 Craig Don’t worry, I know lots of data scientists, and they still get terrified by data-specific questions. So before I carry on with the discussion—to the audience, you may have noticed that we have a screen over here, and on it you will see a QR code. If you have any questions that you want to submit to the panellists, feel free to scan the QR code and add them in, and then our wonderful volunteer Ruby will be passing all the questions on to me and I will be putting them to the panellists. We are going to start with the idea of how the data biases that we are basically talking about actually impact the community directly. We mention bias quite a lot within this kind of, like, conversation, but what do we actually mean by bias? So Erika, can you start us off with that please?
00:06:25 Erika Yes of course. So whenever we talk about data more broadly, I think it’s probably worth taking a bit of a step back and, kind of, really thinking about what we are discussing. And I think data is really just a collection of information, and with that information we want to be representing something. Something meaningful. More often than not, that thing that we want to represent is a population or a process, and we don’t really have the luxury as practitioners most of the time to have a full picture of everything or everyone in a population. So what we want to be able to do is collect samples of data to be able to represent them. And then the question is: how good is that sample? How representative is the data that we’ve collected of our actual population? So sometimes where biases can come in is in that selection process. So is the sample that we are using to represent the population truly a good representation of that population? And when we’re putting statistics out, are we confident that we are representing that population appropriately? 00:07:22 So this is where potentially marginalised groups might not be well represented and where the statistics that we put out might be potentially incorrect. And then I suppose there is the second element to it, which is not so much in the selection process—so let’s just say hypothetically you have the best sample possible, and your population is truly getting the best representation possible. But then, as practitioners we might handle it inappropriately. So we might do something with that data that’s actually leading to some incorrect conclusion. And this is where maybe different types of biases can come in, such as maybe confirmation bias. So if I really believe that something is meaningful I am going to—I could potentially try to make my analysis with that data tell me the conclusion that I want to reach anyway. And that is potentially a form of bias. Or you can have omitted variable bias, you know, you could have situations where you are just not including certain variables that might be relevant. You know, there are a lot of protected characteristics that we can’t really collect data about, most of the time. And if you don’t include those in certain types of analysis it might mean that certain populations, or subgroups of a population, get washed out. So many different areas where biases can come in. But I think truly it is either in the collection process or in the way that we manipulate it, and we have to be very careful as practitioners in both.
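To ground the sampling point above, here is a minimal Python sketch with entirely invented numbers (the 8% population share and the “reached half as often” assumption are hypothetical, not figures from the discussion). It shows how a skewed collection process undercounts a marginalised group before any analysis has even happened.

```python
# A minimal, hypothetical sketch of selection bias: the numbers are invented
# purely for illustration, not drawn from any real survey.
import random

random.seed(0)

# Imaginary population: 8% of people belong to a marginalised group.
population = ["marginalised"] * 8_000 + ["majority"] * 92_000

def estimate_share(sample):
    """Share of the sample that belongs to the marginalised group."""
    return sum(1 for person in sample if person == "marginalised") / len(sample)

# A truly random sample roughly recovers the real 8% figure.
random_sample = random.sample(population, 1_000)

# A biased collection process that reaches marginalised people only half as
# often (e.g. the survey channel tends to exclude them) undercounts the group.
biased_sample = [
    person for person in random.sample(population, 2_000)
    if person == "majority" or random.random() < 0.5
][:1_000]

print(f"True share:          {estimate_share(population):.1%}")
print(f"Random sample share: {estimate_share(random_sample):.1%}")
print(f"Biased sample share: {estimate_share(biased_sample):.1%}")
```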
00:08:43 Craig And that is a really nice way of putting it, and it actually quite nicely leads onto the next part of the discussion because—so Kevin is the author of Queer Data, and one of the themes that you speak about in this book is the idea of big data, and more specifically how bigger organisations are utilising it, and how that can produce the sort of washout effect that Erika was mentioning. But, more specifically, what I actually want to ask you is: why is it that these biases should be of concern to the LGBTQ+ community?
00:09:21 Kevin So I think the problem isn’t unique to LGBTQ people, I think any minoritised or marginalised community is going to rub up against some of the challenges with data. So in my work I try and argue that when we think of—particularly quantitative data, it does have this veneer of being objective, being apolitical, being ahistorical. Particularly something like a census, which is often seen as the gold standard of data collection practices. With my work I hope to, kind of, argue that it’s none of that. All of this data is extremely political, it’s extremely historical and, as we see particularly with LGBTQ communities, can be used in ways that help, but also harm, us. So we’ve seen the publication of census data for England and Wales in the past year or so, and I kind of predicted that, particularly, the size of the trans population in England and Wales would be both too big and too small for anti-trans campaign groups. So thinking through how kind of the politics of numbers can do a few different things, and I kind of try and push back against this argument that we just need to collect more and more data about LGBTQ communities, as if that will somehow fix problems of homophobia, transphobia, biphobia. 00:10:29 Actually, I think as we’re seeing, you need to kind of follow up those data collection practices with some meaningful action. So I think bias comes into it in that when we’re working with numbers, I think we need to be mindful that whether it’s qualitative data, which is often seen as more subjective, or quantitative data, it has the exact same challenges, and just to shine a kind of light on the fact that numbers can be used, misused, weaponised and, as we’re seeing in the UK, particularly against LGBTQ people.
00:11:04 Craig Yeah. And it’s interesting if you talk about this idea of manipulation, this concept of good data, which is something you bring up—now, what’s quite interesting is the fact that—so, for those who are listening, I identify as a Black, queer man. However, more specifically, I identify as Ghanaian and Jamaican in terms of descent. Now, when you look at PhD applications, you note that Black Caribbean students make up the smallest proportion of PhD applications across the entire country, and a lot of the time, when they’re doing this whole ‘good data’ aspect, what they will just do is they’ll just aggregate all Black people into one—or more specifically, they’ll just say ‘BAME’. So, like—[laughter] all non-white people basically look the same, kind of thing. But the reason why that’s also cause for concern is because, especially around Black Lives Matter, there was this whole big push of trying to say, “Look, the problem isn’t as bad as it is.” And I’m like, “That’s great, but what are you going to use this data to actually do in terms of actionable choices?” And the reality is that this could actually have an impact on the LGBTQ+ community, but also more specifically, it can have an impact on the way that we kind of action these reports. So therefore this kind of leads me to you, actually. So one of the things that I’ve also been interested to know is that—so we’ve got this data that we’ve now collected that we’ve claimed is now unbiased—and I’m saying ‘unbiased’ in air quotes—but where can that actually have a knock-on effect to, say, model development?
00:12:36 Paul Yeah, so, as a data scientist, I work a lot with machine learning models. These are essentially algorithms that, you know, are used to find patterns in the data, in these data sets, right? And usually with the goal of optimising a process for decision making, or, like, trying to gain insight about a particular process. Yeah, so the algorithms learn these patterns, what they look like, with the view that when unseen data is presented to the algorithm, it will translate that into an outcome. The problem with that is that if a data set is biased, then that bias will be reinforced by the algorithm that we put it through, and any bias will be perpetuated as a result. So if, for example, within an organisation, a source data set is put into an algorithm, it creates a new data set from that, like, set of predictions. 00:13:42 Those predictions will have all those biases inherent in the input dataset. So it essentially creates another source of bias that could then be fed into other algorithms, and that then creates this cycle. So yeah, this is problematic, and there’s a duty on us as data scientists to be able to identify whether the data we are working with is suitable for use, whether it’s suitable for answering the questions that we’re answering, and whether we’re asking the right questions in the first place. Yeah, so I think there’s a couple of examples where this could have an impact on the LGBTQ community. So one sort of bias that might arise is, as you said, if certain groups are underrepresented in samples of data, then the consequence of that is that the dominant class or the dominant behaviour within that dataset will dominate whatever prediction you get out of that. 00:14:49 So there’s an example proposed that, you know, we’re living in an age of AI, obviously, and medicine is one of the sort of top talked-about applications of AI to help improve diagnosis, for example. But what happens, say, if we let an algorithm, you know, diagnose people, like, across society? If that algorithm’s been trained on, you know, a set of data that’s, you know, standard body types, whatever that looks like, but then you present it to somebody who, you know, maybe they’re going through hormonal therapy or something like that, and they don’t respond in the same way as other people who aren’t going through hormonal therapy. So there’s questions like that to think of. And another way—another example of bias—is when the data set, the content within the data, is biased itself. So I don’t know—maybe you might have heard of ChatGPT if you—[chuckles] want to cheat on your university assignments. [laughter]
00:15:58 SX Yeah, yeah, yeah. [laughs]
00:15:59 Craig Actually, Paul, what is ChatGPT?
00:16:01 Erika What is it, please?
00:16:01 SX [chuckles] Yeah.
00:16:04 Paul Funny you should ask. So yeah, ChatGPT is an example of a large language model. So it’s a model that is trained on vast amounts of data—like, across the internet—with the purpose of providing a service in terms of answering questions, summarisation, all sorts of text transformation tasks that you can think of. But obviously if it’s trained on the entire internet, the internet is a cesspool of [audience groans, chuckles] information and biases, and a lot of it can be inflammatory and toxic and—yeah. And if such a system is sort of trained on that sort of information, that presents a question of, you know, what kind of content can we expect, actually, to come out of that? And there’s a problem here, because the inputs into these datasets are not very visible. We as data scientists don’t have huge amounts of oversight on what data’s being used to train these models. So yeah, it—so yeah, it begs the question, “What do we do about this?”
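As a concrete illustration of the point Paul makes above about under-represented groups being washed out of a model’s predictions, here is a small, hypothetical Python sketch. The data is synthetic and the classifier is a generic off-the-shelf one, so this is only a toy demonstration of the mechanism, not a description of how any of the panellists actually build models.

```python
# A minimal, hypothetical sketch: if a group is barely present in the training
# data, a model that simply optimises overall accuracy can largely ignore it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Imaginary training set: 95% of rows come from the majority group (label 0),
# 5% from a minority group (label 1) whose feature pattern differs.
n_major, n_minor = 950, 50
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_major, 2)),
    rng.normal(loc=1.5, scale=1.0, size=(n_minor, 2)),
])
y = np.array([0] * n_major + [1] * n_minor)

model = LogisticRegression().fit(X, y)

# New minority-group cases: a large share are still predicted as the majority
# class, so the imbalance in the input data carries straight through.
new_minority_cases = rng.normal(loc=1.5, scale=1.0, size=(100, 2))
preds = model.predict(new_minority_cases)
print(f"Minority cases labelled as majority: {(preds == 0).mean():.0%}")
```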
00:17:28 Craig Yeah. And I guess this is more of an open-ended question to any of the panellists, so feel free to kind of, like, follow up, but one thing that I’ve definitely noted in all three of your answers and within this discussion so far is this idea of society at large, and the way in which these ingrained biases have essentially fed into the way in which we are doing queer data. Obviously, Kevin, you mentioned the fact that we were talking about how data collection will perpetually harm the LGBTQ+ community, because of course, the idea is, why would you want to partake in a data collection process when you’re not going to be able to see action? And likewise [inaudible 00:18:05] both Erika and Paul, you’re obviously following that, about the idea of, again, aggregation, de-aggregation, so on and so forth. So I guess one thing that would be quite interesting would be, would we be able to even address these biases, or what do we—if you could think of the most radical way to address this bias, what would it be?
00:18:32 Kevin Like, break a machine? Abolish the system? [audience laughter] That’s the most radical way, I think. Like, there is one thread running through Queer Data, in the final chapter, and my kind of work at the moment, which is: are we just going down a path where there is no solution and we’re basically wasting our time, our energy, our labour, trying to fix data systems, trying to de-bias, trying to kind of roll out diversity interventions, but ultimately trying to fix something that’s beyond repair? And I think that’s a conversation that needs to be part of the broader conversation. I’m not saying we need to jump to kind of ripping up the system and burning it down quite yet, but I think we need to have that card—
00:19:17 Craig Why not? I love the idea of ripping down the system and [laughs] burning it down.
00:19:20 Kevin —But I think anybody working with data has to realise that it might not be possible to fix this or to create data systems that work for everyone, so the idea of, I guess, ‘queer’ and ‘data’ might seem like kind of oxymorons, in that data is about categories, classifications, ones and zeroes; queerness is something that’s not that. And it might be impossible to actually bridge the two, so I think part of, I guess, all of our work is figuring out what good can we do to try and push things in the right direction. But I think ultimately we also need to have in mind that it might be beyond fixing and learn from people doing work in kind of abolitionist spaces and apply some of that to data and think, “Actually, maybe we do need to abolish some data systems—”
00:19:59 Erika? [stage whispering] Yes. [chuckles]
00:19:59 Kevin —If they can’t be fixed.
00:20:03 Craig Any other thoughts?
00:20:06 Erika I suppose—I completely agree. I think—I suppose there’s an element of, some things will be very unfeasible to abolish, and I think trying our best as practitioners to embed that ethics review and, like, the open discussion as early as possible during the design process is something that could hopefully help in minimising the effect of those biases seeping through into the final output. But I completely take your point that there are a lot of AI solutions that have already been designed and built, and now we’re in a bit of a situation of, “The innovation has happened, and now we have to fix it.” And it’s—you know, it’s inconvenient to have to do it that way around, but, you know, that’s the work that me and Paul have been doing, and I assume lots of other data scientists around the UK, more broadly. But I think these conversations are starting to become—I want to believe—a bit more of the common norm, and having users or people who are impacted by these statistics and these models be part of the conversation is going to be so important.
00:21:13 Craig I’m all about burning the system down.
00:21:14 Erika [chuckles] Yeah. Same, same.
00:21:16 Craig I am, you know, let me not say anything that’s going to incriminate myself. [audience laughter]
00:21:24 Paul Yeah, I—just to add quickly to that, I think I’d like to see data practitioners come together more to—like, we have a good ethos at Datasparq, that—you know, with these sorts of situations, but I think it needs a more concerted effort by the data community as a whole to, you know, come up with a set of good practices that can be codified. Almost like a Hippocratic Oath for data scientists. [audience laughter]
00:21:57 Craig Oh, without a doubt. And I think it’s quite interesting you say that because I think, sometimes, especially with sort of, like, elite establishments, it’s sort of the idea of, “We are the knowledge-bearers. We know how everything works, and then we’re just going to produce a product for said person,” without actually acknowledging that communities have been doing the work. However, we’re not allowing them access into our spaces. And I think the idea of bringing people into these spaces is quite interesting and really important. So we spoke about biases that have faced the community directly, but now I’m interested to know about the biases that are faced by the community who want to engage in tech or data science specifically, because I feel like when I think about this idea of—one of my two favourite TV shows is Ugly Betty. [audience laughter] And—yeah. And within that, Wilhelmina Slater, played by Vanessa Williams, is an excellent character. But one of the things that you notice throughout this entire show is this idea of how she has to try and emulate the mindset of a straight, white, cis man. But yet Daniel and his nepo-baby arse comes in and basically takes the job that she’s been working towards all of this time. And it’s interesting because, like, queer people or marginalised people are always told to assimilate. So I guess I’m interested to explore that a little bit more, so.
00:23:16 Kevin Big question.
00:23:18 Paul Yeah.
00:23:18 Craig I know.
00:23:20 Kevin I guess there’s a few layers to it, a few strands to it. So when you were speaking, I thought, I guess one expectation, particularly around data, is to be kind of—to feature in these data sets, you need to be out, in some shape or form. You need to tick a box, you need to disclose, you need to share some information about your gender, sex, and sexuality, and that already is splitting queer communities. It’s already creating a two-tier arrangement where outness is being valued beyond not being out, and I think that dimension to data, that requirement to be out, to be counted, is a real problematic element, whether you’re working with data or being counted by the data sets. I guess another element that came to mind when you were speaking was around, I guess, maybe the flip side of what you were saying with the Ugly Betty example. There’s also a labour of being queer in the workplace. There’s actually work and energy and kind of labour involved in performing that identity—and I don’t mean ‘performing’ in a way that you’re putting it on. 00:24:28 But actually being the kind of gay employee that your colleagues might expect you to be, or being the kind of queer boss, does require some kind of strategising. Sometimes you might go in the closet, out the closet, disclose to some people, not disclose to some people. These are all expectations not expected of cisgender, straight colleagues, and I think that’s—I mean, people like Sara Ahmed, who’s a kind of scholar, have written on that energy that it requires, how privilege in those contexts is a kind of energy-saving device. So I think that’s something that has differential harms across queer communities as well. I know maybe being a kind of cisgender, gay white man will not have the same labour or same energy required as other parts of the queer community. But I think that kind of performance of the self in the workplace is something that I don’t think we speak a ton about. I think there’s just this assumption that once you’re out, it’s all fine, everything’s merry—
00:25:32 Craig It gets worse.
00:25:31 Kevin —actually, you might go out, you might come back in again. You’re out to some people. All of that’s kind of the politics of disclosure, I think, and a really interesting area.
00:25:38 Craig Yeah, I think, to me—I’m gonna probably follow this up, because as somebody who’s got intersecting identities, who for my entire life has had to go, “Am I Black enough? Am I queer enough? Am I Black and queer enough?” Am I in those spaces where I have to go—people don’t see me as a, “Hey, sis!” They see me as a Black man who basically walks in, and therefore they laugh, and they go, “Am I able to take up these spaces?” And it can be quite exhausting, because you then get to a point where you’re like—you’ve got two applicants who, for example—so you’ve got one who’s a cisgender, white, straight man, and you’ve then got somebody who’s of a more marginalised intersection of identity, and the hoops that the second person may have gone through mean that by the time they get to that final interview, they’re already emotionally exhausted. So—that really hit my soul. Oh, god. I didn’t know [inaudible 00:26:20]. [audience chuckles]
00:26:21 Craig [chuckles] And I guess—and actually, Paul, I’m going to bring this onto you. Can organisations do anything to help with that? Like, are there any things that they can do to address these barriers? Is it one of those things that if it’s all doom and gloom, I have to basically go, “Oh, I have to find an organisation like Pride in STEM.” Basically, they go through there [chuckles] whatsoever.
00:26:53 Paul Yeah, I think—yeah, in my own personal experience, I’ve been lucky in the sense that I don’t think I’ve faced any significant barriers in entering the workplace, but obviously that can’t be said for everybody in the queer community, and there’s evidence to suggest that—yeah, as Kevin said, it is a bit of a labour to actually be in the workplace, and other people have it much worse. Like, I think LGBTQ people are more likely to get into conflicts at work, they’re more likely to leave their jobs sooner than others. And yeah, I think—like, [scoffs] what people want is to just be their authentic selves at work, right? And if that’s not possible, as is the case in general, then I think the workplace is broken. I think it’s completely broken. And yeah, so I think workplaces can help in terms of the, you know, the policies that they put in to support—uh, yeah, to support queer people. Yeah, I don’t think any of this rainbow-washing stuff quite cuts through.
00:28:07 Craig Yeah. Happy Pride Month. [audience laughs] [inaudible 00:28:10].
00:28:13 Paul Yeah, but I think—I don’t know. I think there are green shoots. I don’t want to be all doom and gloom. So I think the workplace is almost predicated, in a sense, on the individual. Like, everybody, every single person is trying to get ahead, right, in the workplace. I think there’s signs within the data science field that we are shifting to thinking more in a sort of collectivist fashion in terms of how we set up our teams, and I think there is more of a sense that we’re all in it together in some way to achieve, you know. The teams that we’re in at Datasparq are, you know, they’re quite mission orientated, but, you know, we have these goals that we want to set together, and I feel when you’re in a culture like that, it does inspire people to be more open about who they are on a personal level as well. So I think just a general change in culture is needed, but obviously communities like Pride in STEM and others are, like, really important for people to have. 00:29:14 But yeah, being a—[chuckles] yeah, I think there are reasons why being a data scientist is an attractive role for LGBTQ people. It’s very remote-friendly, so people can live in, you know, states that have more supportive environments and policies and work elsewhere if they want to. And I think there’s quite a diversity in how you can get into the role in the first place, right? We know that LGBTQ people are underrepresented in higher education, but, you know, there are multiple routes to becoming a data scientist, so I think there’s—yeah, there’s opportunities there, too.
00:30:05 Craig That sounds really positive. I—okay, now I think your optimism is definitely rubbing off on me. I really like the idea of how organisations are using that concept of good data—and I’m using ‘good data’ in air quotes—and actually trying to make actionable changes. Because I’ve definitely been in a lot of organisations where it’s all very lip service, but I’ve definitely found that with, like, much smaller organisations such as Datasparq, it’s not perfect, but they’re actually trying to ensure that it’s a more inclusive workspace. So Erika, because I know that you’ve done a lot of stuff around diversity and inclusion in your time at Datasparq, I guess what I want to know is, what steps have Datasparq taken in order to ensure that we have a more inclusive workspace? And more specifically, what lessons do you think the field and the tech community can actually learn from Datasparq?
00:31:05 Erika It’s quite hard to really pinpoint where it all stemmed from. Like, when the workplace really did feel inclusive. Like, when I started—this is a bit of a backstory, but when I started, I was, I think, the fourteenth employee at Datasparq? Like, very much, like, a startup vibe, if that word’s allowed—it’s a bit young. [audience laughter] I’m sorry. And yeah, it was quite early on, and I was the only person who identified as a woman and as a queer person within the technical team. So amongst product, data engineering, data science, I was the only person that looked or [chuckles] sounded like myself. And that was a really important thing. Like, I think at the time, it was definitely a bit, like, “Oh, how should I act?” You know, I’ve got no frame of reference for how people who look like me should act, so I tried to be like the boys, and I was like, “I’ll try to sound smart like everybody else does.” And I soon realised that that was just quite silly. And I thought to myself, “I shall open up to someone about how I feel about this,” and I found kind of a confidante within Datasparq who’s a very dear friend of mine, and I explained to them, like, how I felt, and they suggested to me, like, “Just be yourself. Like, it’s fine.” And I did, and it was great. So the first day, I think I showed up in a white shirt and really nice trousers, and then the day after I was, like, in a t-shirt and everything was fine. Like, you know what I mean? Like, the vibe—
00:32:38 Craig Love this for you.
00:32:39 Erika Yes. It really did change—I’m back to the shirts now, but more casual. [audience chuckles] And I think there was a moment in that realisation of, “I don’t want anyone else coming through the door tomorrow to feel like I did when I first came in.” Like, to feel like, “I don’t know how to act, and I need to be this carbon copy version of everybody else around me.” And I didn’t want that. So I really made a point to be, you know, for lack of a better term, like, out and proud, and to really, then, drive and champion that idea of diversity and inclusion, and the first step from there was really, like, okay, we need to hire a more diverse [chuckles] group of people. Because I can’t champion myself and just be the only person who looks like me [audience laughter]. It’s really not going to lead us anywhere. And so I think that was quite a nice mentality shift and, yeah, a diversity and inclusion group started, and we put a lot of effort into our recruitment to try and target a more diverse range of people, not just relying on our networks, looking beyond that, so that we would avoid this kind of echo chamber feeling coming from that. 00:33:40 And then as soon as we started hiring more people that were in fact more diverse, then that’s where the inclusivity came in. So how do we ensure that everybody feels like they can just be themselves, as you guys were saying before? Like, I completely agree, it’s so important to feel like we can push for a culture where everybody feels enabled to speak their mind and really feel like they can be, like, out and proud, or be quiet about it, and, you know, they can be everything in between, and they don’t have to be any particular way. And, you know, it may be a bit cliché, but I think really leading by example in that was a really good way of ensuring that people felt enabled to do that. So when you say, kind of, “What practices are there?” Like, there are lots of practices that you can put in place, like ‘user guides to me’ and, like, you know, gender-neutral bathrooms, and, like, lots of other practical things that we put in place. And, you know, we were talking about being able to work from home. Like, you know, if you don’t feel comfortable coming into the office for whatever reason, you’re more than able to just work from home and still get the chance to speak to lots of people on the phone. So all of those things are in place, but I do think that the key thing is just, yeah, letting people see so many different varieties as to what a data scientist or engineer could look like, and normalising that.
00:34:57 Craig Yeah. I think that last bit I want to hone in on, because—so for transparency, we all identify as cisgender. Three of the panellists are white. I mention these points not from a kind of, “Oh, this is a danger point,” but more from the perspective of going, “We’ve been able to at least make some progress in terms of diversity,” and I think that the steps that all three of you have been taking over the last five to ten years of your careers will then mean that if there were to be a copy of this panel hosted by another version of Craig Poku—which honestly terrifies me, [audience chuckles] because there can only be one of us—this panel could actually be even more diverse, and very reflective of where we want society to go. And before I finish up this discussion and then go to audience questions, I just want to say I’ve really enjoyed this chat. Like, it’s been really wholesome. Like, I’m here for it. So we’re going to kind of, I guess, do final remarks before we take some audience questions, and this is to everyone here. If there is one takeaway that you can give the audience based on this discussion, what would it be? And I’m going to start with you, Kevin.
00:36:19 Kevin I think it goes back to that kind of point I was making earlier around, what do we want—which kind of path are we on? Are we on this path around fixing things or breaking things apart? And I think part of that discussion is the ambition of the work we’re doing. Are we aiming for diversity, for equality, for inclusion, for social justice, for liberation, for all of these big concepts? And I think in my work, I’m becoming increasingly critical, maybe, of some of the discussions around diversity and inclusion in particular, and I’ll say a bit more on that in terms of data systems. I think this idea that as a kind of—as queer communities, do we want to be included in these systems? I think that’s a question we need to be asking ourselves. I think this pursuit of being enrolled, co-opted, brought in to—whether it’s the census, whether it’s a workplace diversity monitoring form—I think it is meaningful for some people, and it does definitely do some good, but some of these systems and institutions are really problematic. 00:37:21 They cause a lot of harm, they—from the design, are designed to be exploitative and to bake in inequalities, so I think my take-home would be just that critical thinking about, do we want to be part of these systems? And if not, do we build something else? Do we build our own side project, our own data system? And I think—I mean, my work is around the census. That’s kind of the idea going through my head, whether all of that work the LGBTQ community’s put into designing, providing input into design, providing evidence, getting people enthused kind of around the census, was worth it, when a lot of the coverage, the media coverage since the census was published, has been really negative. It’s been used to weaponise and target, particularly, trans communities in the UK. So I think just, yeah, thinking about whether we want to be part of the system would be my take-home.
00:38:21 Craig Yeah. Paul?
00:38:23 Paul Yeah, so I think, still thinking from a data scientist perspective, we’re living in an age where data and models now are becoming more and more available for people to just pick up and use. And again, there’s little visibility, sometimes, on what goes into that data, what goes into those models. I think as data scientists, we need to be more cognisant of the efforts that go into data collection. Sometimes I think we’re almost too eager to just lift data sets and try and apply some fancy models to them and get some outputs as quickly as possible. So I think we need to be more cognisant of that. I think as queer people, we’re in a good place, through our lived experience, to be able to identify some impacts, some unintended impacts, of some applications of models, and I think we should stand up and voice these concerns where they come up.
00:39:17 Erika I think you guys covered quite good ones, to be honest. [audience laughs] But I suppose something that’s maybe left unsaid is around participation, and just having individuals like yourselves being part of the conversation, feeding back into the systems—like, again, a lot of AI and machine learning is built through kind of this feedback loop. You know, we want to know what the users think. We want to understand what the impact is. We want to try and feed it back as often as we can, to try not to negatively impact anyone, wherever possible. So having this conversation is so important, and—yeah, I hope we can keep having it. [chuckles] [audience laughter]
00:40:08 Craig We could edit the background stuff.
00:40:11 Erika No, we can’t. [audience laughter]
00:40:16 SX It’s colourful. It’s fine. [audience laughter]
00:40:20 Craig Okay, so I guess my final remarks before we go to—
00:40:25 SX That’s live showbiz, baby. [audience laughter] Can we keep that?
00:40:29 Craig So I guess my final remark before we open it to audience questions is the idea of conversation. So one of my big things is that queer community conversation and discussions like this are really important. And I say this because I think that when we think about knowledge, there’s this idea that we have to be a specialist, but the reality is that actually participating in the field of data, that in itself puts you in this state of knowledge, because obviously you’re talking about your lived experience. And I just want to say, again, a massive thank you to you as panellists for actually being able to open up in these discussions, because I think that, again, we are continuing the conversation. So with that in mind, we have some audience questions, and I’m going to start off with an open question to the entire panel, going, “So who has oversight over the data inputs, and how receptive are the people in this position to diversifying it?”
00:41:34 Erika So at least in the—in our case, as consultants, the data is often provided to us, and this is usually commercial data. But we do have the ability, when we design a solution, to encourage leveraging additional data sources that might be publicly available, if we see the need for that. But it really does depend on the usage of the data, and more often than not, you might actually not want to discriminate between different individuals or different populations. So I think there is a situation where we’ll kind of do a review of the data that we have available and whether it’s suitable in its form as provided or whether we need to augment it further, and that’s usually an internal discussion and the first part of an ethics review of the models that we try to build.
00:42:23 Craig Mm. Any other thoughts?
00:42:25 Kevin I guess—I mean, it depends on the data set. I can use the example of kind of census data, which I think is really exciting, because it’s self-identified data, so people report data about themselves, which I think provides some really exciting opportunities that aren’t there with all other data tools. So lots of tech and technology will use, like, behavioural data, where it’s scraping information about your clicks, your likes, what you view, and then maybe discerning, “These are the viewing habits of a gay man in his mid-twenties,” so again taking away some of the agency about how the data is collected. So it depends on the data set, but for the census, or things where it’s self-identified data, a workplace diversity monitoring form, there’s some exciting potential there.
00:43:12 Craig So we actually have another question. This is specific to Paul, however, some of the other panellists can answer. “How can we, if at all, use LLMs for good?”
00:43:27 Paul [chuckles] [audience chuckles] That’s a really, really good question, and one I think a few of us from Datasparq in the room are very much thinking about at the minute. I think I might preface the question a bit by saying, I think what we’re trying to do first is ensuring that they don’t do bad. And—so I think there’s a lot of stuff that we can do, a lot of research going into this, to be able to ensure that the outputs of these models are stable, sort of well behaved, and sort of benchmarked against, you know, sort of data sets that we know can sort of detect biases where, you know, where they exist. Yeah, that’s kind of all I have to say.
00:44:16 Craig Yeah. And also, like, there are some examples of LLMs that I know are able to do good. So in December of last year, there actually appeared a ChatGPT bot for environmental justice. So the idea being that if you are within the activist space, you can ask this tool to help you go, “What are the best ways to be able to deal with protests? How am I able to ensure that this source or this funding source is actually not going against my personal morals?” and stuff. And I think that there are definitely groups within that space that are using things such as ChatGPT for good. So again, it always goes back to that idea of, “Why are we building the model in the first place?” Because whilst, of course, there is this concept of people going, “Oh, you know, we do data for good,” there are also some people who do data for very dangerous things. And so it’s really important that whilst we have that form of agency, we are, of course, quite critical of it—
00:45:20 Paul Yeah, I think—just—yeah, just one more thing to add to that. I think one of the things we’ve experienced working with LLMs is this idea of hallucination: you ask the LLM a question, and even if it doesn’t know the answer, it will pretend that it does. So it’s very eager to, essentially, lie to you. So if you’re depending on it for, you know, a reason that might be, you know, quite good and altruistic in nature, you still fall, you know, fall foul of the risk that, you know, it’s just not telling you the truth. So we’re working on—we’re actually working on a project within Datasparq now that’s trying to come up with ways to almost force the LLM to be as truthful as possible, but it’s still very much a risky venture for anyone trying to do good. Sorry, I’m not being—I’m bringing back that doom and gloom again. [audience laughter]
00:46:17 Craig I was just going to just say something along the lines of, I don’t know enough about LLMs. But speaking of LLMs, Paul, what does LLM stand for, and, like, what, like, what are LLMs, please, because….
00:46:29 Paul Yeah, so LLMs are Large Language Models. This is what I think I was speaking about earlier, that ChatGPT is based on. And these models are essentially—they’re used for, like, almost question-and-answering type tasks like you would see on the ChatGPT interface. They’re also used for sort of text summarisation, and they’re used a lot, like, in the legal space to sort of accelerate sort of quite, you know, laborious tasks like content review, for example, in contracts. So you see these—the outputs of ChatGPT, and you ask it a question, and it gives you a response. And for each word in that sequence of words that it responds to you with, it’s essentially computing what the probability of the next word is, right? And that model, that probability model, is trained on vast amounts of the internet, so it’s using all that experience, all that text information mined from the internet, just to answer the question: what’s the word that has the maximum probability of being the next word? 00:47:32 And then, you know, once you chain that together, you can have full-on conversations, and it looks like it’s reasoning with you, and it’s, you know—yeah. You know, you’re having a great time together. You can be, you know—you’re best friends. [audience chuckles] I think we’ve worked with some applications in Datasparq where it appears clever when you have an ad hoc conversation within the interface, but I think we’ve seen that maybe it’s not quite as clever as it looks. And I’m not too worried about society falling—yeah, to AI anytime soon.
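To make Paul’s “most probable next word” description concrete, here is a toy, hand-written Python sketch. The tiny vocabulary and the probabilities are invented for illustration only; a real large language model learns these distributions from enormous text corpora and conditions on the whole preceding context, not just the last word.

```python
# A toy next-word model: at each step, pick the word with the highest
# probability given the previous word. All values here are made up.
next_word_probs = {
    "data":    {"is": 0.6, "science": 0.3, "sets": 0.1},
    "is":      {"biased": 0.5, "useful": 0.4, "data": 0.1},
    "biased":  {"<end>": 0.7, "data": 0.3},
    "useful":  {"<end>": 0.8, "data": 0.2},
    "science": {"is": 0.9, "<end>": 0.1},
    "sets":    {"<end>": 1.0},
}

def generate(start_word: str, max_words: int = 10) -> str:
    """Greedy decoding: repeatedly append the most probable next word."""
    words = [start_word]
    while len(words) < max_words:
        options = next_word_probs.get(words[-1], {"<end>": 1.0})
        next_word = max(options, key=options.get)
        if next_word == "<end>":
            break
        words.append(next_word)
    return " ".join(words)

print(generate("data"))  # -> "data is biased"
```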
00:48:13 Craig Even though I’m a data scientist, I don’t work with large language models, and I now—I can now understand what’s going on. So when all of you are talking about it at work, I’m just like, “I have no idea what you’re on about,” [audience chuckles] but now I can say. So we probably have time for one more question, and this is something that is, again, open to the panel. So it’s, “We hope that we can see right from wrong, but bias is measured against principle, so how do we educate people to better perceive fairness so that they identify bias?”
00:48:39 Erika I can speak to it briefly. I think I cannot take any credit, but I know that within Datasparq, we have some initiatives around kind of data ethics book clubs. One of the organisers is here in the room. [audience chuckles] And, you know, one of the, I think, best ways to ensure fairness is education, and for all of us to be aware, irrespective of our own personal circumstances—you know, to be aware of marginalised communities more broadly and how we can impact them. And obviously I’m queer, but I’m also not many other things, so there is definitely more learning that we can all do to ensure that we think about the broader picture, and a lot of the books that we read—or that Datasparq employees read—tend to cover a broad range of different applications, so as to ensure that whenever we do review processes of our models, we can take those into account. So yeah, I think education is really the answer, but it takes time. It takes time.
00:49:48 Craig And on that topic—so Kevin has, of course, got a book called Queer Data, and the idea is that every person who has come to this discussion will be getting their own copy. So—[audience gasps, cheers, applause]
00:50:02 Craig So following on from the topic of education and the idea of ensuring that you are in the know, there will be a copy that every person here will have. Anything else for this discussion?
00:50:09 Kevin I guess I was going to say, there’s a section of the book called ‘Queer Data Competence’, which kind of educates you on this idea that you don’t have to have lived experience of the topic under investigation. You don’t have to be LGBTQ to work with LGBTQ data, but what you do have to have is an enthusiasm, an interest, a willingness to make mistakes, to learn—I think, that broader competency in the issues. So yeah, I don’t know the page number, but there’s a part of the book on queer data competence.
00:50:42 Craig That is the end of our discussion, so firstly, I just want to say a few thank-yous. So I would like to thank Pride in STEM and Datasparq for being able to sponsor this event and being able to allow us to actually have this really important discussion. I would also like to thank Queer Britain for allowing us to have the space to be able to have this discussion. I think venues such as Queer Britain are so important in order to be able to capture that history of queer Britain. [audience laughter] Again, I would like to thank our panelists, Kevin, Paul, and Erika, for your wonderful discussion and being able to, like, you know, contribute so much life to this really important topic. And I am your host, Dr Craig Poku. This is the Pride in STEM podcast, and, again, thank you ever so much for attending. I will see you [laughs] on the other side. [applause]
Outro – Craig Thank you for listening to this episode. I would like to thank Pride in STEM, Queer Britain, and Datasparq for providing us with the space to have this conversation, as well as our lovely production team, Shivani Dave and Alfredo Carpineti, who have been brilliant at putting this podcast together and making it come to life. If you can, please rate our podcast wherever you get your podcasts. If you’d like to learn more about Pride in STEM or Queer Britain, you can follow both organisations via Instagram and Threads. You can also find out about Datasparq and what we get up to via LinkedIn. I’m your host, Dr Craig Poku, and I hope to see you soon.