How the events are collected
The events are collected in two ways (and a third way that doesn't work, listed below).
1. .ics / iCalendar import
This is the better way, but it's harder to start with.
This method is precise and uses very little electricity. The downside is that it's more work and a bit complicated: either a developer needs to set up an automatic export, or someone needs to add the events manually in Google Calendar or Nextcloud Calendar. That makes it better suited for medium-sized and larger venues that have good, database-backed websites.
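For those curious what the import looks like in practice, here is a minimal sketch of reading a venue's public .ics feed with the Python icalendar library. The URL is a made-up placeholder, and a real import would also handle updates, cancellations and time zones, which this sketch skips.

```python
# Minimal sketch: read a venue's public .ics feed and list its events.
# The feed URL is hypothetical and the field handling is simplified.
from urllib.request import urlopen
from icalendar import Calendar

FEED_URL = "https://example-venue.org/events.ics"  # placeholder feed URL

with urlopen(FEED_URL) as response:
    cal = Calendar.from_ical(response.read())

for event in cal.walk("VEVENT"):
    start = event.get("DTSTART").dt           # date or datetime of the event
    title = str(event.get("SUMMARY", ""))     # event title
    location = str(event.get("LOCATION", ""))
    print(start, title, location)
```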
2. Using an LLM (AI) to extract event data from email newsletters
This is what I'll be trying now, in the hope that it could be a way to get started. If it becomes somewhat established, I'd like to usher the participating venues towards using .ics import instead.
Most venues and art spaces send email newsletters. By running those newsletters through a locally hosted LLM (large language model), I am able to extract the dates and use them here. The advantage of this method is that I can collect events from the smaller spaces that might not have a reliable website, since almost everyone does send email newsletters.
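Roughly, the extraction step looks something like the sketch below. It assumes an Ollama-style local HTTP endpoint; the model name, prompt and output fields are simplified placeholders rather than the exact setup used here.

```python
# Rough sketch of the extraction step: send one newsletter to a locally
# hosted model and ask for the events back as JSON. Endpoint, model name
# and field names are placeholders.
import json
from urllib.request import Request, urlopen

OLLAMA_URL = "http://localhost:11434/api/generate"  # local model server
MODEL = "some-20b-model"                            # placeholder model name

PROMPT_TEMPLATE = (
    "Extract every event from the newsletter below. "
    "Reply with JSON only: a list of objects with the fields "
    '"title", "date" (YYYY-MM-DD), "city" and "keywords".\n\n{newsletter}'
)

def extract_events(newsletter_text: str) -> list[dict]:
    payload = {
        "model": MODEL,
        "prompt": PROMPT_TEMPLATE.format(newsletter=newsletter_text),
        "format": "json",   # ask the server to constrain output to valid JSON
        "stream": False,
    }
    request = Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                      headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        answer = json.loads(response.read())["response"]
    return json.loads(answer)
```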
AI comes with a host of problems, but it solves the largest one: all the event data is already out there in the newsletters, so there's no 'critical mass' of users needed, no network effect to build up, and so on.
Here are some drawbacks and how I try to address them:
AI is imprecise
In my testing, most models (>20b parameters) almost always get the dates right. It's good enough to be usable. Sometimes the wrong image is used, since it's hard to establish whether the image above or below an event is the right one. Sometimes it also gets the city wrong if another city is mentioned. To compensate for this, the actual email newsletter is shown once you click on an event. That's an extra safeguard against mistakes, but it also lets you see the true context of the event.
AI has bias and is trained on god-knows-what
This is small scale and I'm keeping an eye on the results, which is also why it probably can't scale up much further. In this case, AI is more of a help to avoid common copy-paste mistakes than true automation. The most subjective task the LLM has is picking keywords from the lead text; so far that task is straightforward enough that it hasn't left me worried about bias, but I can see how this could become problematic at a bigger scale.
Many existing models have dubious licensing
The model is easily interchangeable. Asking a model to extract data as JSON is a standard task. More open models (both in terms of licensing and in terms of openness about the training data) are coming; if a suitable one doesn't exist already, it soon will.
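To illustrate how model-agnostic this is: the contract is essentially "return JSON with these fields", so swapping models doesn't change the surrounding code. The field names below are simplified placeholders, not an exact schema.

```python
# Sketch: the extraction contract stays the same no matter which model runs.
# Field names are illustrative placeholders.
import json

EXPECTED_FIELDS = {"title", "date", "city", "keywords"}

def parse_model_output(raw_json: str) -> list[dict]:
    """Check that a model's reply is a JSON list of events with the
    expected fields; any model that can follow the prompt will do."""
    events = json.loads(raw_json)
    if not isinstance(events, list):
        raise ValueError("expected a JSON list of events")
    for event in events:
        missing = EXPECTED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"event is missing fields: {missing}")
    return events
```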
Newsletters are meant for a closed audience and there are privacy concerns with sharing this data
For this reason I run the LLM locally in a "black box". The only data entering this black box is the newsletter, and the only thing coming out is the event data. The creator of the LLM doesn't know I'm running it.
AI uses more energy
This is what currently bothers me the most. On average I can parse around 500 newsletters per kWh, and covering the places I'm interested in means roughly one newsletter per day on average, which works out to about 2 Wh per day, or under one kWh per year. I'm currently using a general-purpose LLM, which is probably a massive waste of computational power; it feels like using a car as a hammer. A specialised model for this purpose might be orders of magnitude smaller and work equally well, but that's just me guessing.
3. Scraping
I should mention scraping, since it's a common method of getting access to data you're not entirely entitled to. I tried it, and surprisingly, it doesn't work very well. Most of the places I'm interested in don't have a reliable website, and if they do, it's often used as an archive rather than to announce upcoming events. I might use scraping as a last resort in some special cases, but so far it has proven more or less useless.