Video Transcript
Good evening. Thank you all for attending tonight's event identifying cyber threats.
My name is Olivia Bellas and I'm a PhD candidate in the School of Biomedicine here at the University of Adelaide.
Today we'll be hearing about cyber data collection and in particular web scrapers, what they are, how they work and how your information may be targeted.
On one hand, information gathered by web scrapers can be used for legitimate purposes, and easy access to high-quality data can benefit a range of industries. But regardless of the intent of the data collection, it raises profound ethical questions and legal risks.
Thank you very much. My name is Russell Brewer, and I'm an associate professor of criminology. I lead the Adelaide Cyber Crime Lab, an interdisciplinary group of researchers who've come together to combat various cyber threats.
We collect and analyse data that is found on the internet for public good and we use tools to identify child sexual abuse material and work with police to enhance police practice and save young people from harm.
As we began doing this work, the scale of data collection being undertaken became immediately apparent.
It became apparent to us that there was immense potential for legal and ethical strife in this area, and as we began to dig into this, we realised that these legal and ethical dimensions were often misunderstood and also applied quite inconsistently.
Today we're going to take you through some of the principal findings of this work and provide insights into how this has implications for you, for your researchers and also what you in the audience can do about it.
Let’s talk about the types of information that is available on the internet and also talk about why it is attractive to researchers like us.
There's data about all of us available on the internet. This could include information that we broadcast for public consumption, text-based posts on Facebook, pictures or videos that we post, things that we've put up for sale on Gumtree or Facebook Marketplace.
We've got lots of information that we put online that we think is restricted to only a select few. We've got messages that we might post in closed groups on various platforms. We might have files that we send to each other within those groups.
Then we've also got a heap of other types of information that we might not even realise is on the internet and available.
Stuff like what we have bought online, financial information, data contained within the various Internet connected apps and services we use.
For those of us in the room that use Tinder, who we like and who we don't like, who we find attractive and don't find attractive.
All of this data, when you put it together, can tell us a lot: who we hang out with, who our close and extended friendship groups are, what we do for a living and who we work with, what our faces look like, what our voices sound like.
Biometrics, the places that we have been, the places that we're thinking about going, the kind of food that we like to eat, you'll see more about that later, what our interests are, what our hobbies are, what we're currently interested in and learning about, what our political views are or views on certain issues.
All of this information is online and this data is really attractive to researchers because it can tell us a lot.
There is a large volume of really detailed information available about each and every one of us, often linked together and of particularly high quality.
It's also extremely easy and inexpensive to collect. That's very important in a research context. It's easy to find, it's easy to pull that data and it's very easy to analyse that data because we use these automated tools that structure it all for us.
For a fun little example, let's look at my own digital footprint. This is just a snapshot into the wildness that is my life.
We can glean a heap of information about me from Twitter, what my name is, my workplace, a list of all my research outputs, links to the university website that has heaps of other information about my qualifications, where I got my degrees.
We've got my location information, where I'm based. We know if we look at all my followers, who my colleagues are who I'm messaging and interacting with.
We can go over to Marketplace we can see what I was doing when I was on long service leave, packing up my parents' house and selling all of the crap that we had stored in the attic and the garage. We can see what I'm selling, but if you click through you can also see what I'm buying on Facebook Marketplace.
You can figure out whether I'm a good reliable seller or not. If we go over to YouTube, we can actually extract biometric information from this research video that I posted a few years ago.
We have many frames of high-quality faces to extract. We've got my voice droning on for 40 minutes, very high-quality voice segments that we can extract as well.
If we go down to a snippet of a Discord server that I am a member of, we can identify hobbies that I'm interested in or pretend to be interested in, and the conversations that I'm having with other people within that server about those interests.
We know who I'm interacting with in different contexts. This is just obviously a tiny little snapshot of my overall digital footprint, but you can appreciate how much data is actually there.
That's way too much for a single person to go out and find, collect, extract and analyse at scale. This is where these automated data collection technologies come in: they help researchers, or others, collect all of this kind of information, extract it and analyse it to address social problems, or to do other things that may not be so noble.
Let's talk a bit about these technologies and how they actually work.
I'm only going to give you a very brief snippet. I'm going to gloss over the distinction between web crawlers and web scrapers.
Given the scope of this presentation, we're going to put them together but it's important to articulate that they are a little bit different.
We've got web crawlers. These are things that most of us are familiar with.
This is how Google works. Web crawlers are automated scripts that you assign a set of criteria: we want to find these things. The crawler will systematically visit websites looking for certain information, and once it finds that information, it will index it so we know where everything is located.
What is interesting about web crawlers is that they actually start to click through links. If you've got links on a website, a crawler will mimic a human insofar as it will follow a link, index everything on that page, click on all the links on that page, go to the next suite of pages, and just keep going until the programmer tells it to stop.
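To make that crawl-and-index loop concrete, here is a minimal sketch in Python. It is an illustration only, not part of the talk's demo: a tiny simulated "site" (a dictionary of pages) stands in for real HTTP requests, and the standard-library HTML parser extracts the links and text that a real crawler would index.

```python
from html.parser import HTMLParser
from collections import deque

# A tiny simulated "website": page path -> HTML content.
# A real crawler would fetch these pages over HTTP instead.
SITE = {
    "/": '<p>Welcome</p><a href="/about">About</a><a href="/listings">Listings</a>',
    "/about": '<p>About our lab</p><a href="/">Home</a>',
    "/listings": '<p>Bike for sale</p><a href="/">Home</a>',
}

class LinkAndTextParser(HTMLParser):
    """Collects the hyperlinks and visible text from one page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def crawl(start="/"):
    """Breadth-first crawl: visit a page, index its text, queue its links."""
    index = {}              # page -> text snippets found there
    seen = {start}
    queue = deque([start])
    while queue:            # keep going until nothing is left to visit
        page = queue.popleft()
        parser = LinkAndTextParser()
        parser.feed(SITE[page])
        index[page] = parser.text
        for link in parser.links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl()
```

Starting from the home page, the loop discovers and indexes every reachable page, exactly the follow-link, index-everything behaviour described above, and it stops only when there are no unvisited links left.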
Now this is a little different from web scrapers, which again are automated scripts that work in the same way, but they are programmed to extract specific types of data from very specific sources: from Twitter, from Facebook groups, from YouTube. They collect that information and then store it in a structured format, so it's easier for the researcher to access and analyse that data later.
It's able to do this automatically and at scale, so these two different types of technologies can be used independently, because they do have different purposes, but you can also pair them together.
You can have a web crawler that's crawling the internet looking for things and then when it finds it will automatically download that information and structure that data for the researcher to revisit later, all happening in the background as we sleep.
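The "extract specific fields and store them in a structured format" step can also be sketched briefly. This is a hypothetical example, not any real platform's markup: it parses sample marketplace HTML and turns each listing into a structured record ready for analysis.

```python
from html.parser import HTMLParser

# Sample marketplace HTML (hypothetical markup; real sites differ).
PAGE = """
<div class="listing"><span class="title">Bookshelf</span><span class="price">$40</span></div>
<div class="listing"><span class="title">Desk lamp</span><span class="price">$15</span></div>
"""

class ListingScraper(HTMLParser):
    """Pulls each listing's title and price into a structured record."""
    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None   # which field the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls == "listing":
            self.records.append({})      # start a new record
        elif cls in ("title", "price"):
            self._field = cls            # remember where the text goes

    def handle_data(self, data):
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()
            self._field = None

scraper = ListingScraper()
scraper.feed(PAGE)
# scraper.records is now a list of dicts: structured, analysis-ready data
```

This is the essential difference from a crawler: rather than indexing everything it sees, the scraper knows in advance which fields it wants and emits them in a uniform structure.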
As they work in tandem, there are virtually unlimited applications for this kind of work: they can operate across a myriad of different websites, and across the different services that we use online, for a lot of different reasons.
I'm going to turn it over to Katie right now, who is going to talk us through some of the different reasons why researchers and others use web scrapers and crawlers.
You can use crawlers and scrapers for many different applications. I'm going to talk about a couple of ways in which they are being used in different forms of research and within different contexts.
One context in which the collection of online data is incredibly useful, and is quite a powerful tool is in market research and targeted advertising.
This is where companies will scrape your online data, get an understanding of the sort of products that you might be interested in, perhaps reviews that you leave on different pages.
We have an example here of a review that was not so great. We were in Spain for a conference and it was not a great restaurant.
They scrape reviews for different products and get a real sense of how people are finding their products, to understand how to improve things and how to target certain people as well.
They might click on a name and get a sense of the demographic data that's linked to a profile, so they know what sorts of people are purchasing their products or are not so happy with their products. It can be really powerful. They can also do the same thing with competitors' products.
This online scraped data can be used by researchers and law enforcement to understand potential security risks and to examine criminal behaviour as well.
That information is then generally used to try and inform responses to crime.
In our own research, we will scrape different online spaces to try and understand what's driving crime in certain areas, and then try to build up ways to prevent it, once we have that knowledge.
Criminological research that has scraped these online data sources includes things like social media posts or forum posts, where researchers can look at how people are interacting.
An example of that might be hacker forums that exist. Scraping data and looking at how these hackers are interacting with each other, how they're learning techniques to engage in those behaviours and what might be some factors related to their profile that we can understand to then try and combat that type of crime.
Researchers also scrape online marketplaces. These can be marketplaces on the surface web, things like Gumtree, eBay and Facebook Marketplace, and they can also include the dark web, looking at where people are selling illegal services or products and getting a sense of who is selling those products, and who is buying them.
We've got some examples that are taken from a study that scraped the dark web, and it was looking at the sale of illicit firearms. You can really get a good sense of what's driving some of these crimes, by looking directly at the crimes taking place, and also looking at the characteristics that are underlying the people who are engaging in those crimes.
A third area where scraping data is useful is gaining social insights.
This can be used in a number of different contexts. It could be to look at social trends, public sentiment and political opinions. As an example, during elections or political campaigns political analysts will scrape data from social media posts or forum posts and get a sense of what people are thinking about their particular party.
That can help inform better campaign strategies to get voters on their side, and perhaps even inform policy strategies: people are really keen on this, so we're going to go in that direction.
It can be used by social psychologists. We use it in criminology to understand perceptions of crimes online and social insights through looking at public discourse through things like Twitter.
Here is a study that was carried out in 2017, when Richmond footballer Nathan Broad was named as having engaged in image-based sexual abuse. The researchers went and pulled all of the Twitter posts that related to his name, the crime, or the club at the time.
They did a discourse analysis to look at what people were thinking about this particular issue. Were they on his side? Were they against him? Were they outraged at the club response?
They found a few different viewpoints surrounding that, and were able to get an insight into the Australian Public's perceptions of violence against women more broadly as well.
You can see that it can be used powerfully in a number of different ways.
You might think: that's all well and good, but my data is not being used for any of this stuff. I've got privacy settings in place.
But I'm going to hand back to Russell who's going to demonstrate that perhaps you might be caught up in some of this yourself.
It's not unreasonable for any of us to think that we should be protected. Most of us in this room probably have some sort of social media account.
Here's a snippet of my Facebook account.
I have something in the order of 500 friends on Facebook. It might not necessarily be as private as I'd like to think.
Even with these settings, there's still a ton of data out there that is accessible with some creativity in how a web scraper is set up to collect it.
Every single “terms of service” for any site that you use will typically expressly prohibit the use of automated tools such as web crawlers or web scrapers.
You would think that a lot of effort would go into trying to combat web scrapers and crawlers on these platforms, but it's actually really hard to do, because these websites are designed to draw people in and to deal with lots and lots of traffic, and these tools are designed to mimic humans and how they might interact with a site.
It's actually hard to block these things without also blocking legitimate human beings. One of the ways these sites and services get around this is by using application programming interfaces, or APIs, which allow researchers or these tools to interface directly with the site or service and download material in accordance with the policies set out by that particular vendor.
But that process can limit the amount of data that automated technologies might be able to pull, so these APIs are used but it's also possible to collect data by circumventing these APIs too.
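The way an API limits how much data an automated tool can pull is easy to sketch. The example below is hypothetical: a mock `api_fetch` endpoint stands in for a real platform API that caps each request at 100 items, and the collection loop pages through politely, pausing between calls to respect rate limits.

```python
import time

# Hypothetical dataset behind the API: 250 posts.
DATASET = [f"post-{i}" for i in range(250)]

def api_fetch(offset, limit=100):
    """Mock API endpoint: never returns more than 100 items per request,
    the way real platform APIs enforce per-call caps server-side."""
    return DATASET[offset:offset + min(limit, 100)]

def collect_all(delay=0.0, limit=100):
    """Page through the API, waiting `delay` seconds between calls."""
    items, offset = [], 0
    while True:
        page = api_fetch(offset, limit)
        if not page:              # an empty page means we've seen everything
            break
        items.extend(page)
        offset += len(page)
        time.sleep(delay)         # be polite: respect the service's rate limits
    return items

posts = collect_all()
```

The point is that the vendor, not the collector, controls the pace and volume: even a fully automated loop gets the data only in capped, metered chunks, which is exactly the constraint that tempts some collectors to circumvent the API.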
When you consider all this together and the amount of data that's out there, how easy it is to actually access this data, it's kind of like bringing a knife to a gunfight.
I'm going to give you a very short little demonstration of just how easy it is to stand up a web scraper on virtually any type of site or service.
To do this demonstration I moved to my favourite new AI platform and asked it what to do.
I've asked it to write me a web scraper that will download all text-based data and linked media files from an online Marketplace.
It gives me some helpful hints right off.
Creating a web scraper to download text-based and linked media files from an online marketplace requires careful consideration of the legal and ethical implications. Make sure you have permission to scrape this data from the website. And with that said, here's how you do it.
Here is the script and some explanatory notes at the end with steps.
It canvasses some of those key legal and ethical dimensions, but still gives us that information. Just because we can do something doesn't necessarily mean that we should.
Even ChatGPT is telling us this, right?
We need to be mindful of these legal and ethical dimensions as researchers that are employing tools such as this.
Colette and Katie are going to walk you through some of those legal and ethical dimensions, particularly as they pertain to an Australian context.
I'm the lawyer in the crew and I'm going to give you a little bit of information about some of the potential legal risks that could arise when we're talking about collecting data via automated data collection technologies.
I think it's important to know that there is no specific law which provides a one-size-fits-all remedy for the misuse of data collected via these automated data collection technologies.
There is no one particular law that we can turn our minds to. There is, however, a really big suite of laws around data misuse, which provide for civil and/or criminal penalties if a breach can be established.
For example, there are legal risks connected to infringement of copyright, risks around breaching privacy laws when using personal and sensitive information, and risks around possessing content which is illegal in nature.
The legal guardrails are in place, and they give us the left and right limits of what is legally acceptable in certain contexts. It is, however, incredibly hard for the law to keep up and be sufficiently agile to respond at the rate at which technology is progressing.
And at the moment here in Australia, there are very few legal cases which have specifically considered the lawfulness of collection, use and dissemination of data, collected via automated data collection technologies such as web scrapers and web crawlers.
It's very much an evolving and untested legal space which doesn't give us much confidence in assessing the degree of legal risk to individuals and to those who are collecting the data.
Let's have a look at the patchwork of laws which could potentially apply, depending on the circumstances.
The highlighted section refers to the federal legislation which is in place, and all of the references to the other laws, relate to the state and territory pieces of legislation which are in place.
I'll briefly turn to the four specific sections that you can see up there on the slide.
The very first relevant law exists under the Data Availability and Transparency Act, a very new framework which has recently been enacted.
It creates a process and best practices surrounding the sharing of Australian government data between accredited bodies to deliver government services, and the use and sharing of data under it aligns with our federal Privacy Act.
We have a national regulator, in charge of this data scheme called the National Data Commissioner, responsible for regulation and enforcement of the scheme.
It is important for us to note as a university that Australian universities can become accredited users.
What this will do is increase opportunities for lawful data sharing, particularly in the pursuit of research and development. The key privacy protections in place in this scheme include things like prohibiting the reidentification of data that has been deidentified. It prohibits the storage and access of personal information outside Australia, and there is a requirement for express consent in every instance in relation to biometric data.
There is no particular reference to automated data collection within the scheme itself and it has limited use in that regard, remembering it only applies to specific bodies, so it is limited in its application.
The second federal law that you can see relates to the Copyright Act.
Now the interesting aspect here to consider is that compilations of information may attract copyright protection, where they can be considered of having sufficient originality.
There must be some independent intellectual effort in creating that compilation. The High Court of Australia has held that human involvement is required for material consisting of a compilation of information to be protected under copyright law.
What does that mean? It means it is quite unlikely that unorganised data collected via a web scraper, in a mechanical manner, would infringe copyright law.
Once again, we can see here it is of limited application for most of you.
It might also be interesting to know that not everything we post online, like on social media, is going to be the subject of copyright.
It may be. And this is forever the wonderful answer that lawyers give: it might be, it depends. That's the case here.
In regard to social media posts where content can be construed as a literary work, or photos being an artistic work, you would need to be able to prove that the content is of sufficient originality. The work must have a degree of independent intellectual effort for copyright to subsist in it.
Takeaway here, Copyright Act is of limited use, but individuals and those who are scraping the data certainly need to be aware of some protections which exist.
Let’s turn to the third listed Federal piece of legislation, the Privacy Act.
This really is the keystone or the cornerstone legislation which governs the use of personal and sensitive information of individuals.
However, whether this act applies, depends on who is using or sharing the data.
It's really interesting to note that many of the large social media platforms located outside of Australia fall within the reach of the Privacy Act.
Why? By virtue of something that we call the Australian link. Where an organisation has collected or held personal information in Australia, the Act will apply.
The Act may or may not apply to researchers, depending on the status of their institution or the status of the institution with whom they are collaborating.
Where Federal legislation doesn't apply, there are state-based privacy laws which you can see listed that could also be relevant to the use, collection and sharing of personal and sensitive information.
There is significant variation between the state and territory laws, and this really does make understanding what the legal limits of automated data collection use and sharing are, quite difficult.
I wanted to highlight that there are an array of federal and state criminal laws which may have application.
There are two areas which have direct relevance to web scraping.
The first being unauthorised access to a computer system, otherwise known as hacking and that's criminalised at both federal and state level.
What this would require is that the web scraper actually breaches the security of a website to execute the crawl and harvest of the data.
It can also be a criminal offence to collect, possess and distribute certain kinds of digital content. For example, those scraping data would need to be very mindful that collection and possession of child exploitation material is prohibited.
There are a host of exemptions which apply, and these vary from one jurisdiction to the next, but generally speaking there are exemptions for the collection, possession and use of otherwise illegal content where these activities are conducted for law enforcement purposes, formal classification, child protection, or the purpose of advancing educational, scientific and medical knowledge.
There is no specific law which protects our personal and sensitive information from collection via automated collection technologies.
A range of legislation exists to protect individuals from misuse of personal and sensitive information, but it is very context-specific as to which law may or may not apply.
It is expected that a suite of reforms will be made to the Privacy Act, and the anticipated time frame for that is the second half of 2024.
One of the potential changes is that it would enable individuals to seek direct legal remedies for privacy breaches. Currently there is limited scope for that to occur.
I'd like to bring this Clearview case to your attention. It was a very high-profile case.
Clearview AI is a US company which used a web crawler to harvest data from public websites and social media platforms to identify and collect facial images.
We know that facial imagery is a form of biometric data, which is sensitive information, and they used it to create biometric templates for identification purposes.
Australian data was collected for this purpose, without consent from the individuals whose sensitive information was collected.
Now, a customer of Clearview could undertake a search on Clearview's website and upload an image into the system; that image would then be compared to all the images in the data system, and similar images were provided to the customer.
As you can imagine this sort of practice clearly poses some risks of harm to individuals, especially vulnerable groups such as children.
The Australian Information Commissioner, Australia's privacy regulator, conducted an investigation into Clearview for potential breaches of our Privacy Act, and considered whether there was a sufficient link between the collection of data from Australian accounts and websites, and storage and disclosure via US-based servers.
Importantly, it was deemed that the Privacy Act applied: Clearview was carrying on business in Australia, and it breached several provisions of Australia's Privacy Act.
Clearview did seek a review of that decision by the Information Commissioner, and the forum for that was the Administrative Appeals Tribunal.
In 2023, the Administrative Appeals Tribunal handed down a decision which confirmed that the Privacy Act applied in this context, that biometric information is sensitive information, and that it was being collected without consent, breaching several provisions of Australia's privacy legislation.
It's a significant decision, and what it ultimately means is that foreign corporations need only have an Australian link for their practices to be bound under Australian law.
One other really important framework to mention, relates to service terms.
Earlier you heard Russell talk about this, when he pulled up an extract of the YouTube service terms.
These Agreements are actually really important to understand, so that you know what you are signing up to in terms of how your data is going to be used or can be used.
Now, we have very little case law which provides judicial guidance on the violation of a website's terms of service or terms of use through automated data collection technologies. To date, not a single Australian case has specifically dealt with this issue.
A case which does provide us with some insight, though, is the hiQ Labs and LinkedIn case, which you can see on the screen above. It was a really long, drawn-out matter involving over six years of litigation.
hiQ Labs was an analytics business that harvested public data from LinkedIn profiles and provided that service to businesses.
Businesses might have found that useful in terms of predicting employee turnover. The decision was really important for two key reasons.
First, where terms of use prohibit scraping via automated technology, and fake profiles are created to scrape logged-in data, those conducting the scraping are breaching the terms of service and potentially opening themselves up to contract-based liability.
The second point is that data scraping of public websites, so truly open-source data, is not unlawful. This tells us that courts may be less inclined to criminalise the scraping of those truly open sources.
The takeaway here is that it's highly advisable for those who intend to scrape platforms not to proceed where there is an express provision prohibiting automated data collection, not only for ethical reasons, but because we now have this key decision which Australian courts can take into account should a case be litigated here. It's not binding in Australia, but it certainly has persuasive value.
I'm going to turn over to my colleague Katie now, to talk about what researchers need to be mindful of, some of those ethical implications that you've heard reference throughout the presentation.
Russell showed how easy it is to use these tools. You can have ChatGPT make you one straight away, and when we try to look for guidance on responsible use, it's really hard to find anything.
In our own research we've set out to pull together and define what responsible use might look like, and to provide other researchers who are going to be using these tools with some guidance on how they can do it responsibly and ethically.
Research indicates that there can be some variation in the types of data that are scraped with these automated tools for research purposes, but in most cases web scrapers are pulling in data that a human has created.
Whether that be photos that you've taken of yourself or other people, or posts that you're putting online, that's human-generated data. This means that in our research we have to be mindful of ethical considerations that relate to human subjects, which includes issues around consent.
Voluntary and informed consent is the foundation upon which ethical research involving human subjects can take place, and so we need to be mindful that just because open-source data is out there for the taking, that doesn't mean that the humans who created that data and put it out there are necessarily consenting to us using it.
In a lot of cases, we do see that usernames and aliases are used and so how do you find that person to ask them for their consent to use their data?
In some situations, it is ethically permissible to seek a waiver of consent but with that it's also up to the researcher to then make sure that they're justifying why the research should still take place and that they're taking the appropriate precautions to make sure that it is ethical and legal.
I mentioned earlier that researchers are using the data to examine some serious crime problems and gain insights to try and develop strategies to prevent those crimes. That's probably a pretty beneficial reason and could perhaps justify a waiver of consent in these cases, but it can also be quite subjective.
Some people might think some of that research is really beneficial, others might not be so on board with it and they might not think that the waiver of consent is really justified.
One of the main ways that researchers can reduce those harms is by ensuring that they de-identify or anonymise the data that they scrape, and what that's doing is protecting user privacy and reducing any harm that might come about from not seeking consent.
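One simple de-identification technique is pseudonymisation: replacing each username with a keyed hash, so that posts by the same user can still be linked within the dataset, but the original identity cannot be recovered without the secret key. The sketch below is illustrative only (the key and records are made up), and note that direct identifiers inside the post text itself would still need separate redaction.

```python
import hashlib
import hmac

# A secret key held only by the research team (hypothetical placeholder value).
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymise(username):
    """Replace a username with a keyed hash: consistent within the dataset,
    but not reversible without the secret key."""
    digest = hmac.new(SECRET_KEY, username.encode(), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]

# Hypothetical scraped records.
records = [
    {"user": "olivia_b", "post": "Selling a bookshelf"},
    {"user": "olivia_b", "post": "Also a desk lamp"},
]
deidentified = [
    {"user": pseudonymise(r["user"]), "post": r["post"]} for r in records
]
```

Using a keyed hash (HMAC) rather than a plain hash matters here: without the key, an attacker cannot simply hash a list of known usernames and match them against the pseudonyms.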
Researchers should also have processes in place for the safe storage of data, making sure they have strong security measures, so that they are not getting hacked.
When engaging in scraping, it's possible that you might accidentally scrape harmful or illegal material. It's particularly high risk if you're collecting data from the dark web, an unregulated space where there might be child sexual abuse material, discussions of serious crimes, or images of extreme violence that you might accidentally scrape.
I’d just like to finish now by thanking tonight's inspiring speakers.