Scrape a Website on a Schedule with Script Kit

Instructor: John Lindquist
Published 2 years ago
Updated 2 years ago

When you want to collect news, airline ticket prices, or event listings from sites that don't offer APIs, you can use a scraper to grab elements off the page.

Script Kit includes a scrapeSelector() helper that takes the URL you want to scrape and the selector you want from the page. Using the // Schedule metadata, you can also have this script run in the background on a cron schedule and collect the data for you.

// Name: Scrape Tech News
// Schedule: 0 11 * * *

import "@johnlindquist/kit"

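// Scrape the text of every <h3> on the Google News technology page
let h3s = await scrapeSelector(
  "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB?hl=en-US&gl=US&ceid=US%3Aen",
  "h3"
)

// Make sure tech.md exists in the home directory before appending to it
let filePath = home("tech.md")
await ensureFile(filePath)

// Start each run with a dated "##" header, then one "###" line per headline
let contents =
  `

## ${new Date()}

` + h3s.map(h3 => `### ${h3}`).join("\n")
await appendFile(filePath, contents)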

Instructor: [0:00] To make a scraper for the Google News tech page, let's grab the URL. Then we'll make a script called scrape-tech-news and hit Enter. We can use the scrapeSelector helper and pass in the URL, then say we want all of the h3s from the page.

[0:21] This will return all of the h3s. Let's go ahead and write these to a file. We'll set our file path to home("tech.md"). We want to ensure that the file exists, so we'll pass filePath to ensureFile. Then we want to append to that file each time the script is run.

[0:44] We'll call appendFile with filePath, and the contents that we want to append can be the h3s mapped. We'll take each h3 and map it to a level-three heading in markdown, then join them back together using a newline and pass the contents in there.

[1:07] Once that's done, let's go ahead and call edit with filePath just to see the results. We'll run scrape-tech-news. That takes just a second. You can see we have all the headings from the Google News tech section.

[1:26] Rather than running this manually, let's remove edit and place the script on a schedule instead. I can say run it every minute. This is just cron syntax telling it to run every minute. The script will then run in the background.

[1:45] Every minute it would scrape this and append to the file. If you're unfamiliar with cron syntax, come over to crontab.guru and play around with it there. There are plenty of good examples that show a lot of the scenarios it can handle.

[2:02] Right now, this ran again and it just scraped the same headlines, so you don't want this to run every single minute. You probably want it to run every day at 6:00 AM, noon, and 6:00 PM, or something like that. Since this is on a schedule, you probably also want to add a date in here.
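
As a rough illustration of how the five cron fields (minute, hour, day of month, month, day of week) are read, here is a minimal sketch in plain JavaScript. The helpers fieldMatches and cronMatches are hypothetical names written for this example and are not part of Script Kit. The sketch only understands "*", single numbers, and comma lists, and it assumes local time. Under those assumptions, "every day at 6:00 AM, noon, and 6:00 PM" can be written as 0 6,12,18 * * *.

```javascript
// Hypothetical helper: does one cron field (e.g. "6,12,18" or "*") accept this value?
// Simplification: only "*", single numbers, and comma lists are supported.
function fieldMatches(field, value) {
  return field
    .split(",")
    .some(part => part === "*" || Number(part) === value)
}

// Hypothetical helper: would this five-field cron expression fire at the given date?
// Field order: minute, hour, day of month, month (1-12), day of week (0 = Sunday).
// Simplification: every field must match, and local time is assumed.
function cronMatches(expression, date) {
  const [minute, hour, dayOfMonth, month, dayOfWeek] = expression
    .trim()
    .split(/\s+/)

  return (
    fieldMatches(minute, date.getMinutes()) &&
    fieldMatches(hour, date.getHours()) &&
    fieldMatches(dayOfMonth, date.getDate()) &&
    fieldMatches(month, date.getMonth() + 1) &&
    fieldMatches(dayOfWeek, date.getDay())
  )
}

// "Every minute" versus "6:00 AM, noon, and 6:00 PM every day"
const everyMinute = "* * * * *"
const threeTimesADay = "0 6,12,18 * * *"

const sixPm = new Date(2024, 0, 15, 18, 0)
console.log(cronMatches(everyMinute, sixPm))
console.log(cronMatches(threeTimesADay, sixPm))
```

In a Script Kit script, the expression itself would go in the // Schedule: comment at the top of the file, the same place the lesson's 0 11 * * * lives.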

[2:18] We'll say the contents is a heading 2 with a new date, and then just add some new lines. That way the date will separate the contents a little bit. We'll just wait right here for the content to pop in. You can see now we have this content separated by this heading 2.
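
The formatting step described above can also be pulled into a small pure function, which makes it easy to check without running the scraper. This is only a sketch: buildSection is a hypothetical name, and it uses toISOString() for the date header (an assumption made here to get a stable, sortable timestamp), whereas the lesson's script interpolates new Date() directly.

```javascript
// Hypothetical helper: build one dated markdown section from scraped headings.
// Assumption: an ISO timestamp is used for the "##" header instead of the
// default Date string that the lesson's script produces.
function buildSection(headings, date = new Date()) {
  const header = `## ${date.toISOString()}`
  const body = headings.map(heading => `### ${heading}`).join("\n")

  // Two leading newlines keep each appended run visually separate
  // from whatever is already at the end of the file.
  return `\n\n${header}\n\n${body}`
}

// Example with made-up headlines
const section = buildSection(
  ["First headline", "Second headline"],
  new Date(Date.UTC(2024, 0, 15, 11, 0, 0))
)
console.log(section)
```

In the scheduled script, the returned string would take the place of contents in the appendFile call.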

[2:49] When you add a schedule to a script, if you launch Script Kit and search for scrape tech news, you can see the next time it will run. It'll run in 16 seconds from now.

[2:59] If I were to change this to every day at 11:00 AM and launch the app, you'll see under scrape tech news that it'll run again in 23 hours. It's around noon today and it'll run around 11:00 AM tomorrow.

[3:14] Under the kit tab, you can also look at the schedule of every script you have scheduled. You can see I have two running. One will run in the next 22 hours, and one will run in the next 23 hours. This will allow you to manage them from here.

Christine Wilks
~ 2 years ago

Strange, I couldn't get this to work with google tech news: "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSFFpZ0FQAQ?hl=en-GB&gl=GB&ceid=GB%3Aen" But it worked fine with "reddit.com"

John Lindquist (instructor)
~ 2 years ago

I imagine it's timing out. You can try increasing the timeout to 10 seconds (it defaults to 5):

let h3s = await scrapeSelector(
  "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB?hl=en-US&gl=US&ceid=US%3Aen",
  "h3",
  el => el.innerText,
  {
    timeout: 10000,
  }
)