A Very Basic Scraper/Aggregator Site in Next.js with Go Cloud Functions and Supabase

Wouldn’t it be neat to have aggregated data (for a website, daily email, push alert, etc) of kids events in our surrounding area so we know about them right away?

— My wife, possibly salty we missed out on Bluey Live tickets in Portland

That was proper nerd sniping on a Saturday morning a few weeks back when I didn’t have too much else to do.

In the past, I’ve been guilty of waving my hands at ideas that “just” aggregate data from other websites. That’s just a scraper site, I’d think. Hit some websites, scrape what you need, chuck it in a database, and voila. This Saturday: OK — big boy — but you still gotta do it — you think you got what it takes? 👀

Wanna hear this talked about with mouthblogging? Dave and I chatted about it on ShopTalk.

First — what’s going to do the scraping?

I had happened to run across Colly the other day.

This choice tickled me in three ways:

  1. I get to use a scraper technology I’ve never seen or used before.
  2. It’s in Go, and I like opportunities to learn more Go, a language I’ve been investing time in for work.
  3. I can try to make it work as a Netlify function because they support Go, which is yet another learning opportunity.

So the very first thing I did was write the bare minimum Go code to verify the idea was feasible at all.

The Scrapable Website

Fortunately, this website, my first test, is easily scrapable.

Screenshot of https://www.portland5.com/ on the Kids Event page. DevTools is open, highlighting one of the events with an HTML class of "views-row"
What would make a website not easily scrapable? 1) No useful attributes to select against, or, worse, 2) content that is client-side rendered.

Here’s the code to visit the website and pluck out the useful information:

package main

import (
  "fmt"

  "github.com/gocolly/colly"
)

func main() {
  c := colly.NewCollector(
    colly.AllowedDomains("www.portland5.com"),
  )

  c.OnHTML(".views-row", func(e *colly.HTMLElement) {
    titleText := e.ChildText(".views-field-title .field-content a")

    url := e.ChildAttr(".views-field-title .field-content a", "href")

    date := e.ChildAttr(".date-display-single", "content")

    venue := e.ChildText(".views-field-field-event-venue")

    fmt.Println(titleText, url, date, venue)
  })

  c.Visit("https://www.portland5.com/event-types/kids")
}


The above code was working for me, which was encouraging!

Do website owners care if we do this? I’m sure some do, particularly where making you visit their website for the information is a competitive advantage for them. Bandwidth costs and such probably aren’t much of a concern these days, unless you are really hammering them with requests. But in the case of this event scraper, I’d think they would be happy to spread the word about events.

The child selectors in that code are a real showcase of its fragility. The first one is effectively document.querySelectorAll(".views-row .views-field-title .field-content a"). Any little change to the DOM on the parent site and this code is toast.

Printing to standard out isn’t a particularly useful format, and this code generally isn’t very usable yet. It’s not structured as a cloud function, and it’s not giving us the data in a format we can use.

An Event Type

This is the kind of data format that would be useful to have in Go:

type KidsEvent struct {
  ID      string `json:"id"`      // UUID string, generated when inserting
  Title   string `json:"title"`   // e.g. "Storytime at the Zoo"
  URL     string `json:"url"`     // Careful to make fully qualified!
  Date    string `json:"date"`
  Venue   string `json:"venue"`
  Display bool   `json:"display"` // Show in UI or not
}

Once we have a struct like that, we can make a []KidsEvent slice of them and json.Marshal() that into JSON. JSON is great to work with on the web, natch.
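Roughly, doTheScraping() is the same Colly code from above, just filling up a slice instead of printing. Something along these lines (a sketch, not necessarily verbatim what’s in the repo; note the AbsoluteURL call to make the href fully qualified, per the comment on the struct):

package main

import "github.com/gocolly/colly"

// Sketch: adapt the earlier scraper so it returns data instead of printing it.
func doTheScraping() ([]KidsEvent, error) {
  kidsEvents := []KidsEvent{}

  c := colly.NewCollector(
    colly.AllowedDomains("www.portland5.com"),
  )

  c.OnHTML(".views-row", func(e *colly.HTMLElement) {
    kidsEvents = append(kidsEvents, KidsEvent{
      Title: e.ChildText(".views-field-title .field-content a"),
      // AbsoluteURL turns a relative href into a fully qualified URL.
      URL:     e.Request.AbsoluteURL(e.ChildAttr(".views-field-title .field-content a", "href")),
      Date:    e.ChildAttr(".date-display-single", "content"),
      Venue:   e.ChildText(".views-field-field-event-venue"),
      Display: true, // assume it should show unless flagged later
    })
  })

  // Visit is synchronous by default, so the slice is full when it returns.
  if err := c.Visit("https://www.portland5.com/event-types/kids"); err != nil {
    return nil, err
  }

  return kidsEvents, nil
}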

My next step was to have the cloud function do this scraping and return JSON data immediately.

Returning JSON from a Go Cloud Function

That is structured like this. It will run on Netlify, which uses AWS Lambda under the hood:

package main

import (
  "context"
  "encoding/json"
  "log"

  "github.com/aws/aws-lambda-go/events"
  "github.com/aws/aws-lambda-go/lambda"
)

func handler(ctx context.Context, request events.APIGatewayProxyRequest) (*events.APIGatewayProxyResponse, error) {
  // This returns a []KidsEvent, so instead of just printing the scraped info, we're filling up this slice with data.
  kidsEvents, err := doTheScraping()
  if err != nil {
    log.Fatal(err)
  }

  b, err := json.Marshal(kidsEvents)
  if err != nil {
    log.Fatal(err)
  }

  return &events.APIGatewayProxyResponse{
    StatusCode:      200,
    Headers:         map[string]string{"Content-Type": "application/json"},
    Body:            string(b),
    IsBase64Encoded: false,
  }, nil
}

func main() {
  lambda.Start(handler)
}

Once this was working I was quite encouraged!

Getting it working was a bit tricky (because I’m a newb). You might want to see my public repo if you want to do the same. It’s a little journey of making sure you’ve got all the right files in place, like go.mod and go.sum, which are produced when you run the right terminal commands and do the correct go get * incantations. This is all so it builds correctly, and Netlify can do whatever it needs to do to make sure they work in production.

Fortunately, you can test this stuff locally. I would spin them up using the Netlify CLI like:

netlify functions:serve

Then hit the URL in the browser to invoke it like:

localhost:9999/.netlify/functions/scrape
screenshot of VS code open running a terminal doing `netlify functions:serve` and a browser at the URL localhost:9999/.netlify/functions/scrape
Notice the output is saying it’s scheduled. We’ll get to that later.

I don’t actually want to scrape and return on every single page view, that’s wackadoo.

It’s not very efficient and really not necessary. What I want to do is:

  1. Scrape once in a while. Say, hourly or daily.
  2. Save or cache the data somehow.
  3. When the website is loaded (we’ll get to building that soon), it loads the saved or cached data rather than doing a fresh scrape every time.

One thing that would be perfect for this is Netlify On Demand Builders. They run once, cache what they return, and only ever return that cache until they are specifically de-cached. That’s wonderful, and we could use that here… except that Netlify doesn’t support On Demand builders with Go.

This is a moment where I might have gone: screw it, let’s make the cloud function in TypeScript or JavaScript. That would open the door to using On Demand builders, but also open the door to doing scraping with something like Puppeteer, which could scrape the real DOM of a website, not just the first HTML response. But in my case, I didn’t really care, I was playing and I wanted to play with Go.

So since we can’t use On Demand builders, let’s do it another way!

Let’s keep the data in Postgres

We use Postgres at CodePen, so this is another nice opportunity to practice with tech used on a far more important project. Where should I keep my little Postgres, though? Netlify doesn’t have any official integrations with a Postgres provider, I don’t think, but they are at least friendly with Supabase. Supabase has long appealed to me, so there’s yet another opportunity here! Pretty much the whole point of Supabase is providing a Postgres database with nice DX around it. Sold.

Screenshot of Supabase Database page. 

"Every Supabase project is a dedicated Postgres database.

100% portable. Bring your existing Postgres database, or migrate away at any time."
Like Netlify, Supabase has a free tier, so this little weekend project has a net cost of $0 so far, which is, I think, how it should be for playing with new tech.

I was assuming I was going to have to write at least a little SQL. But no — setting up and manipulating the database can be done entirely with UI controls.

screenshot of the Supabase UI where the database is set up with a table layout of columns and what type those columns are. Like the `title` column is `text` with a default value of an empty string.

But surely writing and reading from the DB requires some SQL? No again — they have a variety of what they call “Client Libraries” (I think you’d call them ORMs, generally) that help you connect and deal with the data through APIs that feel much more approachable, at least to someone like me. So we’ll write code more like this than SQL:

const { error } = await supabase
  .from('countries')
  .update({ name: 'Australia' })
  .eq('id', 1)

Unfortunately, Go, yet again, isn’t a first-class citizen here. They have client libraries for JavaScript and Flutter and point you towards a userland Python library, but nothing for Go. Fortunately, a quick Google search turned up supabase-go. So we’ll be using it more like this:

row := Country{
  ID:      5,
  Name:    "Germany",
  Capital: "Berlin",
}

var results []Country
err := supabase.DB.From("countries").Insert(row).Execute(&results)
if err != nil {
  panic(err)
}

The goal is to have it mimic the official JavaScript client, so that’s good. It would feel better if it were official, but whattayagonnado.

Saving (and actually, mostly Updating) to Supabase

I can’t just scrape the data and immediately append it to the database. What if a scraped event has already been saved there? In that case, we might as well just update the record that is already there. That takes care of the situation of event details changing (like the date or something) but otherwise being the same event. So the plan is:

  1. Scrape the data
  2. Loop over all events
  3. Check the DB if they are already there
  4. If they are, update that record
  5. If they aren’t, insert a new record

package main

import (
  "fmt"
  "os"

  "github.com/google/uuid"
  supa "github.com/nedpals/supabase-go"
)

func saveToDB(events []KidsEvent) error {
  supabaseUrl := os.Getenv("SUPABASE_URL")
  supabaseKey := os.Getenv("SUPABASE_API_KEY")
  supabase := supa.CreateClient(supabaseUrl, supabaseKey)

  for _, event := range events {
    result := KidsEvent{
      Title: "No Match",
    }

    err := supabase.DB.From("events").Select("id, date, title, url, venue, display").Single().Eq("title", event.Title).Execute(&result)
    if err != nil {
      fmt.Println("Error!", err)
    }

    if result.Title == "No Match" {
      var saveResults []KidsEvent
      event.ID = uuid.New().String()
      err := supabase.DB.From("events").Insert(event).Execute(&saveResults)
      if err != nil {
        return err
      }
    } else {
      var updateResults []KidsEvent
      err := supabase.DB.From("events").Update(event).Eq("title", event.Title).Execute(&updateResults)
      if err != nil {
        return err
      }
    }
  }

  return nil
}

My newb-ness is fully displayed here, but at least this is functional. Notes:

  1. I’m plucking those ENV variables from Netlify. I added them via the Netlify dashboard.
  2. The ORM puts the data from the query into a variable you give it. So the way I’m checking if the query actually found anything is to make the struct variable have that “No Match” title and check against that value after the query. Feels janky.
  3. I’m checking for the uniqueness of the event by querying for the url which seems, ya know, unique. But the .Eq() I was doing would never find a matching event, and I couldn’t figure it out. Title worked.
  4. The save-or-update logic otherwise works fine, but I’m sure there is a more logical and succinct way to perform that kind of action (one possibility is sketched after the screenshot below).
  5. I’m making the ID of the event a UUID. I was so stumped at how you have to provide an ID for a record while inserting it. Shouldn’t it accept a null or whatever and auto-increment that? 🤷‍♀️
screenshot of the data in the Supabase Postgres DB.
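One possibility for tightening up that save-or-update dance: grab the existing titles in one query, throw them in a map, and branch on that instead of doing a lookup query per event. A sketch (the saveToDBWithMap name is made up, I’m assuming the client type is *supa.Client, and it leans on the same supabase-go calls as above):

package main

import (
  "github.com/google/uuid"
  supa "github.com/nedpals/supabase-go"
)

// Sketch: one Select for existing titles, then a map lookup per event.
func saveToDBWithMap(supabase *supa.Client, events []KidsEvent) error {
  var existing []KidsEvent
  if err := supabase.DB.From("events").Select("title").Execute(&existing); err != nil {
    return err
  }

  seen := map[string]bool{}
  for _, e := range existing {
    seen[e.Title] = true
  }

  for _, event := range events {
    if seen[event.Title] {
      var updateResults []KidsEvent
      if err := supabase.DB.From("events").Update(event).Eq("title", event.Title).Execute(&updateResults); err != nil {
        return err
      }
    } else {
      var saveResults []KidsEvent
      event.ID = uuid.New().String()
      if err := supabase.DB.From("events").Insert(event).Execute(&saveResults); err != nil {
        return err
      }
    }
  }

  return nil
}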

The point of this experiment is to scrape from multiple websites. That is happening in the “final” product; I just didn’t write that up in this blog post. I made functions with the unique scraping code for each site and called them all in order, appending to the overall array. Now that I think about it, I wonder if goroutines would speed that up?
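Probably, yes. Here’s a rough sketch of what that could look like, assuming each site’s scraping code is wrapped in its own function with the same ([]KidsEvent, error) shape (scrapePortland5 and scrapeOtherSite are stand-in names):

package main

import "sync"

// Sketch: run each per-site scraper in its own goroutine and merge the results.
func scrapeAllSites() ([]KidsEvent, error) {
  scrapers := []func() ([]KidsEvent, error){
    scrapePortland5,
    scrapeOtherSite, // stand-ins for the real per-site functions
  }

  var (
    wg        sync.WaitGroup
    mu        sync.Mutex
    allEvents []KidsEvent
  )

  for _, scrape := range scrapers {
    wg.Add(1)
    go func(scrape func() ([]KidsEvent, error)) {
      defer wg.Done()
      events, err := scrape()
      if err != nil {
        return // a real version would collect errors instead of dropping them
      }
      mu.Lock()
      allEvents = append(allEvents, events...)
      mu.Unlock()
    }(scrape)
  }

  wg.Wait()
  return allEvents, nil
}

Colly also has an async mode of its own (colly.Async(true) plus c.Wait()), which might be the more idiomatic way to parallelize the requests if all the sites live in one collector.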

Scheduling

Netlify makes running cloud functions on a schedule stupid easy. Here’s the relevant bit of a netlify.toml file:

[functions]
directory = "functions/"

[functions."scrape"]
schedule = "@hourly"

JavaScript cloud functions have the benefit of an in-code way of declaring this information, which I prefer, so it’s another little jab at Go functions, but oh well, at least it’s possible.

Pulling the data from the database

This is the easy part for now. I query for every single event:

var results []KidsEvent
err := supabase.DB.From("events").Select("*").Execute(&results)

Then turn that into JSON and return it, just like I was doing above when it returned JSON immediately after the scrape.

This could and should be a bit more complicated. For example, I should filter out past events. I should probably filter out the events with Display set to false, too. That was for cases where some scraped data was bogus, and rather than write really bespoke rules for the scraping, I’d flip the Boolean so I had a way to avoid displaying it. I did that on the front end, but it should be done in the back.
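As a sketch of what that backend filtering could look like, the get-events function could trim the slice before marshaling it. (This assumes the scraped date strings parse as RFC 3339, which is a guess on my part; the filterEvents name is made up.)

package main

import "time"

// Sketch: drop hidden and past events on the backend instead of in the UI.
func filterEvents(all []KidsEvent) []KidsEvent {
  now := time.Now()
  filtered := []KidsEvent{}

  for _, event := range all {
    if !event.Display {
      continue
    }
    // Assumes Date looks like "2023-04-01T19:00:00-07:00".
    if t, err := time.Parse(time.RFC3339, event.Date); err == nil && t.Before(now) {
      continue // already happened
    }
    filtered = append(filtered, event)
  }

  return filtered
}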

A Website

screenshot of the Next.js homepage "The React Framework for the Web"

I figured I’d go with Next.js. It’s yet another thing we’re using at CodePen that I could use more experience with, particularly the very latest version. I could do this the smart way by using getServerSideProps, so a Node server would hit the cloud function, get the JSON data, and render the HTML server side.

I spun it up in TypeScript too. Ya know, for the practice! This is pretty much the entire thing:

export async function getServerSideProps() {
  const res = await fetch(
    `${process.env.SITE_URL}/.netlify/functions/get-events`
  );
  const data = await res.json();

  return { props: { data } };
}

type kidEvent = {
  id: string;
  title: string;
  date: string;
  display: boolean;
  url: string;
  venue: string;
};

export default function Home(props: { data: kidEvent[] }) {
  return (
      <main>
        <h1>Kids Events</h1>

        <div>
          {props.data.map((event) => {
            const date = new Date(event.date).toDateString();

            if (event.display === false) return null;

            return (
              <div key={event.id}>
                <dl>
                  <dt>Title</dt>
                  <dd>
                    <h3>
                      <a href={event.url}>{event.title}</a>
                    </h3>
                  </dd>

                  <dt>Date</dt>
                  <dd>
                    <time>{date}</time>
                  </dd>

                  <dt>Venue</dt>
                  <dd>{event.venue}</dd>
                </dl>
              </div>
            );
          })}
        </div>
      </main>
  );
}
Look ma! Real HTML!

Styling

I kinda love styling stuff in Next.js. It supports CSS modules out of the box, and Sass is as easy as npm install sass. So I can make files like Home.module.scss for individual components, and use the scoped styles. It’s a smidge heavy on CSS tooling, but I admit I find this particular alchemy pleasing.

import styles from "../styles/Home.module.scss";

export default function Home() {
  return (
    <main className={styles.main}>
    </main>
  );
}

I also took the opportunity to use Open Props for the values of as much as I possibly could. That leads to this kind of thing:

.card {
  padding: var(--size-5);
  background: var(--gray-1);
  color: var(--gray-9);
  box-shadow: var(--shadow-3);
  border-radius: var(--radius-3);
}

I found Open Props very nice to use, but I kinda wish it were somehow “typed” in that it would help me discover and use the correct variables and show me what the values are right within VS Code.

screenshot of the Kids Event homepage. A Grid of events on a purple background.

Petered Out

The very basics of this thing are working and I published it. I’m not sure I’d even call this a minimum viable prototype since it has so many rough edges. It doesn’t deal with any date complexities, like events that run multiple days, or even expire past events. It scrapes from a few sites, but only one of them is particularly interesting, so I haven’t even proven that there are enough places to make this an interesting scraping job. It doesn’t send alerts or do any of the other helpful things my wife originally envisioned.

But it did teach me some things and I had fun doing it. Maybe another full day, and it could be in decent enough shape to use, but my energy for it is gone for now. As quickly as a nerd sniping can come on, it comes off.

🤘

