The importance of language in Data Science
In this episode, Paul Kelly (D/A) and Dr James Piecowye (Zayed University), with Reem Elmahdi, a Data Scientist at D/A, dive into the topic of Data Science practices and how they enable us to analyse data beyond libraries of individual words or ‘keywords and phrases’ and contextualise conversations at scale using machine learning.
Listen to the full recording below or on your favorite podcast platform.
Dr James Piecowye: Hello, my name is James Piecowye.
Paul Kelly: I am Paul Kelly.
Reem Elmahdi: And this is Reem Elmahdi.
James: And welcome to Know Your Audience, the podcast. Alright, Paul, we’ve brought yet another data scientist into the mix. This is becoming a habit of this podcast. But I think this is really going to open up a lot of doors in our thought process. And I say that because in our last podcast, we were talking about natural language processing, how it works, and what that even means. And I think Diego summed it up beautifully when he described how our phones and our computers, when we’re engaged in emails or messages, finish our sentences. And essentially, that is natural language processing on the part of the AI system behind all of the software we’re using.
Getting a sense of what we’re going to say and where we’re going with something. And I think that really sets us up nicely to move into where we’re going in this podcast, which is getting a sense of how this is actually really complicated by language. And I gotta say, before we sat down, and before we started having this talk, I wasn’t putting together how one language versus another language versus another language complicates natural language processing, and ultimately, the research process in general. Have I got that right, Paul?
Paul: Yeah. And I think, just to add to that, in this case, it would be useful to look at stories, just thinking about data science and what it is, and how we can think about that. So yeah, we get Reem in because Reem is a data scientist, and it’d be nice to hear things from her side of the story and her perspective, and yeah, so maybe we could start off?
James: Well, you know, I think there’s a question right off the bat that we’re gonna throw right to Reem: what is a data scientist? Reem, you want to give us the, you know, the elevator pitch on what a data scientist is?
Reem: Yeah. So basically, when you define something as a science, you’re referring to understanding a phenomenon or something. And then you want to get a deep understanding, explain the behaviors, and get to the core of everything that you want to study. And saying that we want to study data science is basically referring to understanding, finding explanations, and answering questions that are related to data.
And as you mentioned previously, nowadays we’re working with mobile phones, we’re using browsers, and everything is easily Googled. And you can get a massive amount of information on the web.
From that we’ve accumulated so much data and huge profiles for every individual on Earth. That could be on different platforms, including social media and all other platforms of communication. And as I said, it’s basically understanding and trying to get the meaning out of the data that we have. So that is data science, and a data scientist is the person who does that by analyzing and trying to understand the data, getting insights out of this data, and trying to find answers to business questions.
James: So there’s a logical connection to what we’re talking about with respect to audience research from social media and understanding the sentiment of that audience. It’s not just how many likes, how many clicks, etc.; it’s really being able to add meaning and understanding to the data, so ultimately the marketing folks can use it for greater insight and to create better programs with respect to getting their ideas out. So the question that I have is, what made you get into data science as a career?
Reem: Well, honestly, it was a coincidence. But it was a nice one. So I am from Sudan. I studied there my whole life. And after I finished university, I started looking for, like, something more exciting than work. Work for me at that time wasn’t exciting enough. So I thought maybe obtaining a further degree, a higher degree, would be a good answer to all the questions that I had, and maybe it would put everything together, and everything that I studied at university would make much more sense than it did at that time.
So I started applying for different master’s programs and I was planning to go somewhere else. I didn’t want to continue studying in my country; I wanted to explore other countries, not only to learn the science part and develop my computational background or computer science skills, but also to get new exposure, learn a new culture and just learn about the world. So I was lucky enough to join the African Institute for Mathematical Sciences, where we mainly studied mathematics-related courses. There was only one course that was more related to computer science, which I used to rely on a lot. Because if you come from a computer science background, you might not have as much exposure to mathematics as your fellow applied mathematicians or pure mathematicians do, and it was very interesting for me to learn that I can actually merge some statistics with mathematics as well as computational skills to find something very interesting out of something that we already have, which is data.
So I got very interested in that course and after that, I decided to move further and do another research master’s on that specific topic, which is data science, and I was lucky enough to find a lot of support from different communities, not only in South Africa but also across Africa, and in other institutes like the Shindi community, where I currently volunteer as an ambassador. There I help grow the machine learning and data science community in general in Africa by introducing more and more people to the data science world, and allowing them to read and study data and put their skills into action by using all the concepts that they learned in their courses back at university.
And people are coming from different backgrounds. So I’m not saying that you have to come from a computer science, mathematics, or statistics background to work in data science. I find that many, many people are joining nowadays with a very different background, for example in forestry, and some people are joining from a medical background. And those are the people who can actually have more impact. Why? Because they have domain knowledge in their specific domain, and when they acquire these types of skills, they can solve their problems in a better way than if you bring in a data scientist who would have to spend more time understanding the market, the business, the idea, and the questions that he’s trying to answer before he is able to apply those skills. So yeah, I think I have touched on many different topics by answering this question. But yeah, I have so many things in my mind.
James: Paul, I got a question for you. You’ve got two data scientists on board. As Reem has just mentioned there, there is a lot of range in what data scientists will be doing and their backgrounds. What did you look for in your data scientists for D/A?
Paul: Good question. Good personalities? I think you sort of nailed it there, I guess, with that word range. I think range and a variety of backgrounds are critical to anything because, you know, at all costs, we have to try and avoid groupthink. That’s a really dangerous concept, a bias where you have your various assumptions and, you know, people who have the same background as you, basically what Reem was just saying, same background, same training perhaps, and all of that, and we just confirm each other’s viewpoints and move on.
What’s important, I guess, in any kind of recruiting is to ensure that there’s that range, that people have a variety of experiences, a variety of backgrounds. Not just in recruiting; I think in life, that’s just super important. And in your own life, to try a number of different things. You know, my training’s in economics and urban planning. I’m not a marketing guy or anything like that, but my knowledge and training, you know, have come in the intervening years through experience and understanding how that all can connect up. And when somebody can do that, then it adds a lot more value to the proposition because problems will get looked at differently, which is what Reem sort of said: you bring some domain expertise to a different set of problems, or you look at life experience in different ways, and then you understand, for instance, challenges that different communities might have. All those sorts of things can stack up. And that’s why it’s very important to get back to that word again, range, and not do the Gladwell 10,000 hours thing.
James: One of the things that I find interesting, Reem, in your conversation, in explaining why you got into data science and your background, and then how this all links in with D/A, is something that eluded me. And it really wasn’t until we started talking with Diego about natural language processing that I saw the importance of language in the whole coding process, in this whole technical background side of looking at audiences and going through content to get a sense of their sentiment. Can you talk to us a little bit about that, Reem? How important and how integral is understanding language when you’re working with code?
Reem: Yes, definitely. Of course, you have to work with different built-in libraries to solve a problem like, let’s say, sentiment analysis. Say we have different posts that are collected from one of the social media platforms, and now you want to know exactly what this person is talking about. Or maybe you’re not even interested in classifying the topic, but you’re interested in finding the sentiment, and you want to know whether that specific post is negative, or positive, or just a neutral post that has no clear classification, no clear division.
In that sense, you would be interested in knowing the relationship between each and every word and the other words in that specific sentence. And of course, when we’re talking about social media, we also need to think about emojis and their effect on the whole sentence. So for example, if I say, what a good morning, of course, if I say it in Arabic, what a good morning and I put a smiley face, that could refer to something nice. So it’s more of a positive sentiment.
But then if I say what a day or what a morning, and then I put a grumpy face or an angry face emoji, in that sense it would be more of a negative post. So when it comes to languages, and specifically Arabic, since it has a different structure than English, which is the most commonly used language worldwide, you would find difficulties, especially if you’re not very familiar with the context of that specific post.
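Editor's sketch: the emoji effect Reem describes here can be illustrated with a toy lexicon-based scorer in Python. This is a minimal sketch, not how D/A does it: the word and emoji scores are invented for illustration, the posts are in English rather than Arabic, and the function name `sentiment` is hypothetical. A real system would rely on a proper lexicon or a trained model.

```python
# Toy polarity scores. Every value here is an illustrative assumption,
# not taken from any real sentiment lexicon.
WORD_SCORES = {"good": 1.0, "nice": 1.0, "bad": -1.0, "terrible": -1.0}
EMOJI_SCORES = {"😊": 1.0, "🙂": 0.5, "😠": -1.0, "😒": -0.5}


def sentiment(post: str) -> str:
    """Label a post positive, negative, or neutral from word and emoji scores."""
    score = 0.0
    for token in post.lower().split():
        # Strip simple punctuation so "good!" still matches "good".
        word = token.strip(".,!?")
        score += WORD_SCORES.get(word, 0.0)
        # Emojis are looked up as whole whitespace-separated tokens.
        score += EMOJI_SCORES.get(token, 0.0)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("what a morning 😊"))  # positive
print(sentiment("what a morning 😠"))  # negative
print(sentiment("what a morning"))     # neutral
```

The same words flip from positive to negative purely because of the emoji, which is why emojis cannot simply be stripped out during preprocessing.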
So take one word, for example: Al Ain. It could refer to the eye. And it could refer to Al Ain, which is the town in Abu Dhabi, or it could also refer to a brand name. So understanding the structure of the post will allow you to classify that specific post into one of those different categories. Now you will know what this person is talking about. Is he referring to the brand? Or is he referring to the place? Or is he referring to a part of the body? So that adds more complexity. So understanding the structure of every sentence would be beneficial, rather than just applying the built-in libraries and calling functions to find the sentiment or classification for each and every post.
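Editor's sketch: one simple way to picture the "Al Ain" problem is to pick a sense by looking at the words around it. The Python below is a rough illustration under stated assumptions: the cue words, the sense labels, and the function name `disambiguate` are all hypothetical, the posts are in English, and real Arabic disambiguation would use trained models rather than hand-written word lists.

```python
# Hypothetical context cues for each sense of "Al Ain".
# These word lists are invented for illustration only.
SENSE_CUES = {
    "body part": {"doctor", "vision", "hurts", "glasses"},
    "place": {"visit", "city", "drive", "trip", "road"},
    "brand": {"bottle", "buy", "price", "store", "order"},
}


def disambiguate(post: str) -> str:
    """Return the sense whose cue words overlap most with the post."""
    cleaned = post.lower().replace(",", " ").replace(".", " ")
    tokens = set(cleaned.split())
    # Count how many cue words of each sense appear in the post.
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue words at all, the context is not enough to decide.
    return best if scores[best] > 0 else "unknown"


print(disambiguate("We will drive to Al Ain for a trip"))  # place
print(disambiguate("My Al Ain hurts, I need a doctor"))    # body part
print(disambiguate("Al Ain"))                              # unknown
```

The last call shows the point made above: without surrounding context, the word on its own cannot be classified.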
James: How do you factor in, then, the added complexity of dialects on top of what’s already very complex? When you start changing that, not only within the UAE, but within Saudi, and you’ve got Kuwaiti dialects, that must just magnify and intensify the challenge that you’re facing in this process.
Reem: Yes, as you said, it makes the problem more complicated. But consider throwing everything away and starting from scratch, building your own model. In that case, you will need huge computing power to run your models, and you will need a very big amount of data to rely on. And for that you would maybe need people to help you do those types of classification manually, if you want to classify the topic for each and every sentence, or if you want to classify the sentences into different sentiments. Now, if you compare these two options, and we have a business that we want to run, we will make use of the existing technologies, of what has already been built. And then we can just use that as a baseline. And then we can build upon that.
So I think, to solve this problem, not only for the Arabic language but also for any other language, you’d make use of the existing built-in models, and then move on. Try to build more complexity on top of the built-in models.
James: As a data scientist, how far along do you think we are in this process?
Reem: I think it’s rapidly moving. Because if you have a look at natural language processing functions, or models, or even processes that used to be done in the early 20s, it’s not like what we have now. And of course, it won’t be the same as what we will see in 10 years, or maybe even one year from now. The number of data scientists is on the increase. As I mentioned earlier, people are now joining from various disciplines. That’s one thing. Secondly, there is more interest from the companies to make use of their data, which also encourages more people to join. And third, I think worldwide, there is this tendency to move towards technology, which has got huge support from different governments around the globe. So in that sense, having more data encourages more people to join, which helps in solving those different problems.
Now talking about languages, as far as I know, there was only one group that previously used to work on solving Arabic-related dialect and sentiment problems, all these types of problems that are related specifically to Arabic, but now it’s increasing in numbers and we have various groups who are working on the Arabic language. Specifically, we have a big group in the UAE, at New York University in Abu Dhabi; their work on the Arabic language is very intense and it keeps growing. They have very up-to-date and strong models that everyone can use. Now, if you look at other languages, I’m aware that there is a big group of data scientists from Africa that are focusing on developing corpora for minor languages like Swahili. So I think the focus on data science in general is increasing, and when we talk about natural language processing specifically, it’s definitely increasing much more than before, and it will keep increasing due to the fact that the data is increasing. I don’t know how to make myself clear. But yeah, I mean, it’s a cycle: everything affects something else, which affects something else, which feeds the cycle. That is how I see it, honestly.
Paul: Just to add to that, James, talking at a bigger level: I guess the amount of data in the world increases almost exponentially, but the useful part of that is only a tiny fraction in whatever pursuit you’re doing. I’m not saying generally, I’m saying as it applies to developing a new technology, etc. And making sense of that is just obviously becoming more and more important, because there’s a lot of noise. It all sort of comes back to the audio idea of getting the signal from the noise. And understanding exactly what you need to do, what can be useful, and, almost more importantly, what to discard is going to become more and more important as a general field. And I’m not just talking here about us or what we do; I think just generally in the world, unless we’re able to process and make sense of the data and have different information points come to light, it’s going to be more and more difficult for people to continue doing what they’re doing today.
I think, as we evolve, and as time moves on, you know, cities, for example, need to be run. There are so many information points about how a city runs that potentially don’t get used. On an individual basis they do, say, with an electricity authority or a transport authority, but stitching all that together is going to be the work of data scientists in the future. And as we’ve talked about a lot, almost every episode, I think, you can’t really conflate data with insight; they’re two different things.
Understanding information is a really, really important thing. Typically, AI could help us process the data and put things together. But understanding the implications of data is really a job for humans. And I think that’s the really interesting next phase for, I guess, really young children today: the world that they’ll come to work in. I think all this hyperbole about, you know, machines taking our jobs and things like that is false, just as it’s always been. I mean, people were saying that about the steam engine, right? And all it did was fundamentally change the role that society has for people coming into it. And as we become smarter, we need to leverage it. You just start to think about all the possibilities, right? It’s not even by any measure around the world, but speaking generally about life expectancy, for instance, in developed countries, it’s not that long ago that that was, you know, significantly less than what it is today, and that’s because our ability to process information is getting better and better and better.
So on our own understanding, as Reem said, in medical science we’ve got people working on how to predictively diagnose various ailments, or to help prescribe what’s wrong, or to help doctors make decisions, to have those information points to make faster decisions, so instead of waiting days for a prognosis, you know, it can happen quicker. The same thing is happening with getting efficiencies out of electricity plants and things like that. You might not need to build an extra coal-fired power facility, because you can understand how to get efficiencies out of an existing one by just looking at data, you know what I mean? It’s not an engineer, a civil engineer, for instance, going in, looking at a structure and going, maybe we should just build a new plant. It’s about understanding that and then working downstream on demand management. All those things are coming together in building blocks and changing at a rapid pace. And it’s something that people need to understand.
I think we’ve touched on it in many episodes, that this isn’t just a big business thing. Big business leverages it because, potentially, in a lot of use cases, it requires a significant investment, but it’s something that’s touching us on a daily basis. Getting right back to the start of this episode, where we were talking about phones, and autocorrect, and predictive text and all that sort of thing: every day, our life is being made a little bit better. In some cases it’s a bit clunky. If we think about the development of voice assistants, for example, Siri and Cortana and all those things, 10 years ago, or maybe not 10, 5, or whenever they came out, compared to today, the accuracy just improves every time, and that’s because in the background we have data scientists pulling together information points to make better decisions that then help the coders write better code and understand how this all comes together. And yeah, I think it’s a really big topic that, you know, we don’t often think about: the opportunity that lies ahead for future generations isn’t one to be scared about. It’s one to embrace and to think differently about, and our life is demonstrably better through understanding the information in the world around us.
James: I get the sense that we know this is all happening and we’ve seen it happen around us. But there’s still this apprehension. And you highlighted it, Paul: people say this AI, this data, is going to take away our freedom. This data is going to give people too much information that they can manipulate in all sorts of ways. What can we do to get people to realize that what we’re talking about is really a helpful thing?
Paul: I think in any pursuit, there are nefarious actors, right? There are people that will always try and leverage whatever it is. So whether you’re talking about data, or you’re talking about anything in life, there are always people who try to get gains from that, whatever it is, right? Everything from gambling to everything like that. I think that’s something that, yeah, gets attention, so you start to hear about things because it’s newsworthy. I think we’ve talked about and demonstrated before how short the news cycle is and how it influences public opinion.
But only for a very short amount of time; people forget very quickly. And that’s an unfortunate thing, I think, about today’s news cycles and the need for clicks and revenue, and it’s a whole different story. But what I’m getting at is that news cycles drive this information and disinformation. And when you’re talking about people being worried about, say, how their data is used, then they should be. You need to really understand the terms and conditions of what you sign up for, and that’s the same as buying a house. You know, people wouldn’t just sign away their life on a house mortgage, right? You understand that you’ll be paying for this thing for maybe 20 years. But some people don’t take that same care when they sign up to take photos of their kids on Instagram or something. You know, that’s an education thing, and I think future generations will be more and more aware of that, because privacy will be controlled by the people who use these things, and they’ll vote with their feet, so to speak.
So putting that to one side, I think the opportunities are immense. Because we can get to a position where people don’t need to be, for instance, sitting in an office or a factory or something from 6am till 6pm, but instead work a shorter amount of time, for example 10 to four or something, and they’ve got a better quality of life, because there’s a better information flow, they’re understanding how things can happen, they’re making better decisions faster. I guess, using the medical example from earlier, lives are being saved, and decisions are being made about who to treat, and when to treat, and how, because somebody who perhaps doesn’t need immediate treatment can be unnecessarily using resources.
All those sorts of things can make our life a lot better. And I think it’s our condition to sort of take in the bad news and the scary news. In our realm of the world, people get nervous when they talk about something and then they see an ad for it, but, like, eight or nine times out of 10, that’s just purely coincidental. It’s the rest of your life that is giving the machine signals that you’re interested in that product. It’s not just because you’re talking about it; you’re also googling stuff. Every time you Google something, unless you explicitly state otherwise, Google’s keeping that data on you and tracing it to you. You know, you’re using Google Maps, it’s using your transit data, and all that type of stuff comes together.
And then you’ll get an ad for something. And it just so happens that you’re talking about it with somebody, because you’ve been googling, you’ve gone to different websites, you’re on social media visiting profiles, so the machine can build that profile of you. And I think people get unnecessarily nervous about this stuff. Because, I mean, a lot of products that have worked for me, for instance, in the last year, I would never have known about unless I had that information targeted at me. I’m happy to accept this stuff because it makes my life easier. I don’t need to go to 16 different shops in my time off to find, I don’t know, like we were talking about earlier, ski gloves or, you know, an insulated beanie or something. I don’t need to go everywhere, because I’m getting ads for places where I can go and choose to buy them online, and I’m not wasting my time.
These cumulative effects build up over time so that we have more time for ourselves. And I think in the UAE that’s sort of coming to light post-pandemic, because we’ve changed work habits and, you know, flexible working and that type of stuff is coming more to the fore. It’s a lot harder for government and private enterprise, of course, but as we move on, this stuff is changing. It’s making our life better incrementally, but the bad news takes precedence, you know, and that’s something that is probably going to happen so long as advertising supports media. That’s just how it’s gonna work, unfortunately. Seeing beyond that, I think when people start to have this more demonstrable ease of life, things become clearer to us, like, hey, actually, you know what, this is better. I’m better overall for this.
James: I want to give the last word to Reem, because you’ve really touched on so many things, Paul. And Reem, you’re in the fray of data science and natural language processing. What excites you that’s coming on the horizon in this field of study and experimentation and data gathering? What excites you?
Reem: Yeah. So I think, in regard to that, when you talk about natural language processing, you’re thinking about different topics that you can touch upon. Like, for example, the differences between translation and transliteration. For specific topics, you might need transliteration more than translation. And with the sentiment analysis we mentioned previously, you might not be interested in just classifying any sentence or any post as neutral, positive or negative. Maybe now you might think, for a positive post, how positive is it: 100% positive, 70% positive? That might broaden the horizon in that topic specifically. So now, instead of only looking at three categories of sentiment, you want to expand your thinking, and maybe upgrade it to 10. Or maybe you can think of it as a ranking system, where you do not want to look at the sentiment only as these three very rigid boxes; you want to allow some kind of overlap between these boxes. And also, as I mentioned earlier, topic classification is very interesting.
Because imagine you’re a journalist, and you have a huge pile of old articles that you’ve written, and you want to classify them. You could either choose to do it manually, which will take a lot of time, or you could actually just apply a machine learning model and use that to classify all these articles into different categories: political, sport-related, religion-related, and whatever topics you’re interested in. So those are some of the topics that I feel are more interesting, to me at least, at this point. And I would love to learn more about how I can make improvements on the available models to get to higher accuracy, and to reach those very nice state-of-the-art models that will allow you to do all of this.
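Editor's sketch: two of the ideas in this answer, grading sentiment on a finer scale and classifying articles by topic, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `grade` supposes some upstream model already produced a polarity score between -1 and 1, the four training sentences are invented, and the classifier is a small hand-written multinomial Naive Bayes with add-one smoothing, standing in for whatever model a team would actually use.

```python
import math
from collections import Counter, defaultdict


def grade(score: float, levels: int = 10) -> int:
    """Map a polarity score in [-1, 1] to a rating from 1 to `levels`."""
    clipped = max(-1.0, min(1.0, score))
    fraction = (clipped + 1.0) / 2.0  # 0.0 is most negative, 1.0 most positive
    return min(levels, int(fraction * levels) + 1)


# Hypothetical labelled articles; a journalist's real archive would be far larger.
ARTICLES = [
    ("the minister announced a new election policy", "politics"),
    ("parliament voted on the budget", "politics"),
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
]


def train(labelled_docs):
    """Count words per topic and documents per topic."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, topic in labelled_docs:
        doc_counts[topic] += 1
        word_counts[topic].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the topic with the highest Naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for topic, n_docs in doc_counts.items():
        logp = math.log(n_docs / total_docs)  # topic prior
        total_words = sum(word_counts[topic].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[topic][word] + 1  # add-one smoothing
                logp += math.log(count / (total_words + len(vocab)))
        scores[topic] = logp
    return max(scores, key=scores.get)


model = train(ARTICLES)
print(grade(-1.0), grade(0.0), grade(1.0))                  # 1 6 10
print(classify("a late goal decided the match", model))     # sport
print(classify("parliament voted on a new policy", model))  # politics
```

Moving from three boxes to ten levels only changes how the score is reported; the harder part, producing a reliable score in the first place, is what the available models are for.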
James: Awesome. That’s it. You know what, this whole conversation has made me really excited about data science, and I’m kind of going, I need to go back to school. It’s too late, I need to go back to school, because this just seems, A, really interesting, B, really applicable, and C, really needed. So, you know, it’s all three. It’s fantastic.
Reem: Definitely, it’s very easy nowadays to go online and just surf the internet to find the most useful resources for you. You can find them in all different formats: articles, videos, audiobooks, and even podcasts. There are many, many resources that you can rely upon to increase your knowledge in that domain. And nowadays, everything has become very easy to code; everything is built in. You can start by building an application tonight in 10 lines of code that are understandable to you, in a very high-level language, like the language that we use now to speak: you use very small phrases to call specific functions, and it does what you want.
James: Reem, thank you very much. This has been a lot of fun, really enjoyable. And I look forward to having more conversations with you about this whole process and how it becomes important.
Paul, we’ve done it again. Another interesting conversation. We’ve opened more doors, and I like to think that the process of having these conversations is just bringing this whole subject matter, this whole topic, a little bit closer to everyone so that they can make better decisions when it comes to how they’re dealing with data, and ultimately, how they’re going to get insight from it.
Paul: Yeah, and that’s the aim! That’s exactly it. And the more that people realize that this isn’t just for big business, it’s for everybody, then, like we’ve discussed many times before, the better. The better the adoption, the more we can bring everything together and the more people get helped, so it’s great.
James: On that note, I’m James Piecowye.
Paul: I’m Paul Kelly.
Reem: I’m Reem Elmahdi, thank you for having me.
James: And this is ‘Know your audience!’