Jeremy:
If you had to summarize your overall system architecture for this app – we've touched on Symfony, on your ORM – I already forgot the name of it again...
Alena:
Doctrine.
Jeremy:
Doctrine, right.
Alena:
For overall architecture, my site is a LAMP application.
It is running with Ubuntu, Apache, PHP 8.1 and PostgreSQL.
In your typical LAMP stack, the M usually stands for MySQL, but in this case, I'm running with PostgreSQL.
And there are other components that are not web server components.
There's an asynchronous queue system at play as well, which processes all of the background work.
Jeremy:
What's that? [referring to the queue system]
Alena:
That is another Symfony component.
It's the Symfony Messenger component.
Jeremy:
Messenger.
Alena:
Messenger –
You could think of it as a system where there's a database table of queued jobs to do, and a worker process just sits there listening, saying, “Check the database. Do I have any work to do? Do I have any work to do?”
And when any new rows end up in the jobs table, the queue worker says, “Oh, there's work to do.”
It grabs the row out of the queue, processes it, runs the job, and continues checking out work until there's no work left to do.
This enables an asynchronous style of programming that currently isn't a native language feature of PHP. In PHP, you can't really create a bunch of promises and await them all like you would in JavaScript.
So, having a queue worker system allows you to build asynchronous features with a slightly different approach.
In this case, it would be more of a system where you dispatch work to the queue, and your queue worker, who's sitting there listening in the background, will pick it up and work on it.
When writing code for a web server, you usually want web requests to finish as quickly as possible because you have a user who's sitting there who's either clicked a button, or who's refreshed the page, and they want to see something come back as soon as possible.
You want to send a response back to the user within a few hundred milliseconds to give confirmation: “Hey, the thing has been queued up to be worked on” or “We're working on it.”
That gives the user a sense of confidence – a sense that your app is speedy, that things are working.
Whereas if your page sat there for five, ten seconds while something really long is running –
as far as user experience goes, that's not something you want the user to be stuck waiting for.
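[Editor's note: for readers who haven't used Symfony Messenger, here's a minimal sketch of the dispatch-and-consume pattern being described. The class names and the `async` transport name are made up for illustration; this is not the app's actual code.]

```php
<?php

use Symfony\Component\Messenger\Attribute\AsMessageHandler;

// A message is just a plain PHP class describing work to do later.
class ProcessNewClinic
{
    public function __construct(public readonly int $clinicId) {}
}

// The handler runs in a separate worker process, typically started
// with `php bin/console messenger:consume async`.
#[AsMessageHandler]
class ProcessNewClinicHandler
{
    public function __invoke(ProcessNewClinic $message): void
    {
        // ...do the slow work here, outside the web request...
    }
}

// In a controller, with a MessageBusInterface injected as $bus:
// dispatch() queues the message and returns immediately, so the
// HTTP response isn't blocked by the slow work.
$bus->dispatch(new ProcessNewClinic(42));
```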
Jeremy:
Of course, got to reassure the users that progress is happening, things are moving along.
I like the way you work around the lack of native async functionality.
It's interesting to think about.
The whole concept of workers is really cool; [it's] something that goes underdiscussed.
When I learned programming [the field] was all [about] object-oriented this and that.
It's wild to think how far things have come.
I know there's that one language that has workers as its fundamental construct for doing anything.
I forget what that one's called. [Editor's note: It's Erlang/OTP.]
Alena:
You'll have to mention that one to me. I'm curious.
Jeremy:
Oh, yes, I will. It's not OCaml. It's another one of those weird, niche, functional languages.
It's not Clojure either. I forget; I'll get back to you on that.
I know that on the web these days you have web workers, which you can use to run stuff outside of the main thread.
I'm pretty sure that's how those work. I don't quite remember... but it's cool that [there are] parallel approaches to the same problem.
Alena:
You bet.
Jeremy:
Just because you're lacking a feature like async/await doesn't mean you don't have a robust system in place.
Alena:
Correct.
You can still do the same kind of asynchronous/multiple things at a time approach with a queue worker system.
Because... a single worker can only do one thing at a time, true, but you can also spin up multiple workers, so that you can process multiple pieces of work at a time.
Theoretically, an app with a worker approach, depending on how many workers you've spun up, could be much faster than a single-process application that has asynchronous components.
But which approach is best for which thing – that's all something that could be heavily, heavily debated.
Jeremy:
Right, right. But you always have to factor [in] scaling if you want your app to be used by anyone.
These are good conversations to have early.
Alena:
Exactly. There's no one best tool out there.
There are so many different tools, and they're each well suited to different things.
Jeremy:
That sounds just about right. Well, I know that there's a worst tool for everything and it's the official KDE tool 🤡 No, I'm just kidding!
Alena:
There's a lot of KDE bashing going on. 🤨
Jeremy:
I'm a huge KDE aficionado; I love KDE, and so... as a hardcore KDE user, I reserve the right to trash KDE for free.
Alena:
Ya trash them because ya love them. 🥰
Jeremy:
Exactly.
Alena:
You want them to be great.
Jeremy:
I would feel bad if it was any other project, but I don't feel bad for KDE.
I've put way too much of my life into banging my head against the wall, trying to get KDE software to do what it's supposed to do.
Anyway, 😅 was there anything that got you banging your head against the wall working on this project?
What were some challenges you ran into? Was there anything that really took some time to get over?
This is a pretty big app. There's a lot of code to it. It certainly took a while to make.
Alena:
Yeah, the piece that took quite a long time to get just right is... number one, pulling data in.
Pulling data into the app – the clinics – as well as processing them.
This was a problem where it took me a while to find the right approach.
I knew that I wanted to pull in clinics from Erin Reed's data source, and I did not want any duplicates of them.
I wanted to keep them up to date over time.
The Erin Reed data source is a user Google Map, which can be exported in KML format, which is kind of like an XML maps format.
I was able to write a scraper that would pull down that XML, parse through it, and create a bunch of database objects for each of those clinics.
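[Editor's note: a rough sketch of what pulling down and parsing a KML export can look like in PHP with SimpleXML. The URL and the clinic-record step are placeholders, not the app's actual code.]

```php
<?php

// KML is XML, so SimpleXML can walk it. Each Placemark element holds
// a name, a description, and a "longitude,latitude" coordinate pair.
$xml = simplexml_load_string(file_get_contents('https://example.com/map.kml'));
$xml->registerXPathNamespace('k', 'http://www.opengis.net/kml/2.2');

foreach ($xml->xpath('//k:Placemark') as $placemark) {
    $name = trim((string) $placemark->name);
    // Note: KML orders coordinates longitude-first.
    [$lon, $lat] = array_map(
        'floatval',
        explode(',', trim((string) $placemark->Point->coordinates))
    );
    // ...create or update a clinic record from $name, $lat, $lon...
}
```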
But the hard part is, over time, how do I make sure that I'm not continually re-importing the same ones?
I ran into another problem: these clinic or location records did not have any kind of unique identifier attached to them.
Jeremy:
Oh no.
Alena:
Which is... problematic, because there's not really a great way to check whether I already have a record, other than taking some of the properties of the clinic and comparing them against existing records – “do these couple of fields match?” – with database searches like that.
I landed on creating a hash of a couple of the clinic properties and saving the hashes. I can compare the hashes of the clinics I already have against the hash of each new clinic, so that I can quickly know whether or not I already pulled something in.
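[Editor's note: the idea, roughly. The exact fields that go into the hash are a guess here:]

```php
<?php

// A stable fingerprint built from a few identifying fields. If an
// incoming clinic hashes to a value that's already stored, it has
// been imported before and can be skipped.
function clinicHash(string $name, string $address, string $state): string
{
    $normalized = strtolower(trim($name)) . '|'
                . strtolower(trim($address)) . '|'
                . strtolower(trim($state));

    return hash('sha256', $normalized);
}
```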
But what if I pull an updated clinic that has a piece of text slightly changed?
Well, the hash is going to differ, of course.
So the system is going to pull in that clinic, and now I have essentially two of the same thing, one which is slightly different from the other.
Ideally, I don't want three or more copies of the same clinic, as things change over time.
So my thought was, well, how can I figure out how to combine all of these duplicates, or flag all of these that are theoretically the same ones?
If I'm going to pull in other datasets, like Southern Equality's dataset, I also want to be able to flag which of Southern Equality's clinics and which of Erin Reed's clinics are the same ones.
I wanted a process to be able to flag those, so I spent a lot of time – a couple of weeks – coming up with the right way to check for duplicate clinics.
So whenever a new clinic comes in, there's a first piece that determines whether or not it's already in the system, and the clinic gets saved if a hash of it hasn't been saved already.
Once that happens, a job is sent to the queue worker to say, “Hey, we need to check for duplicates for this specific clinic.”
The queue worker will pick up that job, then search the database for nearby clinics using lat-long coordinates as well as checking for clinics with similar names.
I spent a bunch of time figuring out how to do geo-coordinate stuff with PostgreSQL and with Doctrine, the ORM, because I needed to do database searches on lat-long coordinates.
I found a PostgreSQL extension, PostGIS, that allows you to store lat-long coordinates, and the extension provides the types and indexing needed to perform distance checks and more advanced geographical searches.
Probably most lat-long operations you could think of are implemented in this PostgreSQL extension.
In this case, I was mostly interested in just the distance between two points. I was able to efficiently search my data to find what clinics were within a half mile or a mile of the new clinic that was imported.
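[Editor's note: a sketch of the kind of radius query PostGIS enables, run here through Doctrine's DBAL connection. The table and column names are assumptions, not the app's actual schema.]

```php
<?php

// Assumes a `coordinates` column of the PostGIS `geography` type.
// With geography operands, ST_DWithin() measures distance in meters
// and can use a spatial index instead of scanning every row.
$sql = '
    SELECT id, name
    FROM clinic
    WHERE ST_DWithin(
        coordinates,
        CAST(ST_MakePoint(:lon, :lat) AS geography),
        :meters
    )
';

$rows = $connection->fetchAllAssociative($sql, [
    'lon'    => $lon,
    'lat'    => $lat,
    'meters' => 1609, // roughly one mile
]);
```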
And the next piece was, once I find clinics that are nearby to each other, how do I get as close as I can be, to know if they're duplicates or not?
So I landed on a solution with help from my brother.
He was very, very helpful in picking out a strategy that I could use to compare two different strings – a comparison technique called Levenshtein distance.
Levenshtein distance is a count of the insertions, substitutions, and removals you have to make to one string to make it look like another.
Kind of like the distance between two strings: the changes you have to make – do you add a character, do you remove a character, do you change a character – to make the two strings the same.
I used a PHP implementation of Levenshtein to do those string comparisons for the final judgment of “How likely are these two clinics to be the same?”
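[Editor's note: PHP ships a levenshtein() function in its standard library. A hypothetical similarity check might normalize the names and scale the raw distance by length, something like:]

```php
<?php

// Edit distance scaled by string length, so a one-word suffix on a
// long name still scores as similar, while short names need to be
// nearly identical. The 0.25 threshold is an illustrative guess.
function namesLookAlike(string $a, string $b, float $threshold = 0.25): bool
{
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));

    $distance = levenshtein($a, $b); // insertions + deletions + substitutions
    $maxLen   = max(strlen($a), strlen($b), 1);

    return ($distance / $maxLen) <= $threshold;
}

var_dump(namesLookAlike('Family Care Clinic', 'Family Care Clinic LLC')); // bool(true)
```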
Jeremy:
Your last filter.
Alena:
I've had very good success.
I have had very few false positives and have made the system automatically flag duplicates for me to review and then deal with.
I added some UI where you could review the duplicates that the system flagged and be able to resolve them by picking one of them to save.
So I just have to log in every so often to the system to see if any duplicates were found. There are a couple of resolution steps to do, rather than having to go through all the data either by hand or with a script. [Editor's note: I wonder if you could do this with makefiles and cron jobs.]
Jeremy:
Right.
That makes sense. Is that what's in the — where exactly do you do that? Is that what's in the admin page? I know [it] exists, but I can't see [it].
Alena:
So yeah, on the admin page, there's a view of how many clinics have been collected in total, how many new clinics have been recently pulled in, and how many unpublished clinics there are.
Because when new clinics are pulled in, they don't get automatically enabled in the search. I have to manually review them and say yes to them, as well as the duplicates the system flagged.
There's a quick link to just jump to the list of different duplicates.
Jeremy:
Makes sense. Your review queue of sorts.
Alena:
Yeah.
Jeremy:
Certainly quite a struggle. I didn't even think about all the data preprocessing that you have to do to keep your dataset up to date.
Alena:
Oh, there's a lot of data preprocessing – not even just pulling in the clinics themselves, but also: how do you pull in the data used to search the clinics?
Like if a user wants to search via a zip code or they want to search via a city, how is my system supposed to know what's a city in the world? Or how should a city be converted to lat long coordinates? Or how do you get from a zip code to lat long coordinates?
Jeremy:
😰
Alena:
So I had to look for open data sets of cities in the United States and zip codes.
I landed on an open data project called GeoNames that publishes a couple of fairly large datasets. They take about a half hour to fully import.
I had to write some console commands to pull down the zip files for them, process all the rows, and actually link everything up.
And that's about a couple gigs' worth of data, so it takes about 20, 25 minutes or so.
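[Editor's note: GeoNames publishes its dumps as tab-separated text files. A minimal sketch of streaming one, using the column layout from GeoNames' readme for the postal code dump; treat the file name and column positions as assumptions:]

```php
<?php

// Stream the file row by row so a very large dump never has to fit
// in memory at once. US.txt is the US postal code file from
// download.geonames.org.
$handle = fopen('US.txt', 'r');

while (($row = fgetcsv($handle, 0, "\t")) !== false) {
    $zip   = $row[1];          // postal code
    $place = $row[2];          // place name
    $lat   = (float) $row[9];  // latitude
    $lon   = (float) $row[10]; // longitude
    // ...batch-insert, flushing every few thousand rows so the
    //    import stays within memory bounds...
}

fclose($handle);
```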
Jeremy:
Sure. Oh, wow.
Alena:
So it's not just pulling in the clinics, but also pulling in those datasets, to actually have what's needed to do the text search or zip code search.
Jeremy:
Right. Oh, my. 😳
You have to be able to speak the same language, you know? You need one metric by which you can judge things. Getting everything nice and unified like that must be an insane amount of work to do, [especially] invisibly, under the hood.
Alena:
Yeah, when you type in your location, a zip code or a city, your search isn't going to anyone else.
My server has all of the locations saved in a couple database tables that I can query in MySQL.
I keep trying to say MySQL. I work in MySQL too much. So PostgreSQL queries just run to compare the search to any of the stored locations.
Jeremy:
Hmm, right. I noticed that code. The code where the location the user puts in gets run through a distance check – that little query you did in there was really interesting.
Alena:
It's not exactly perfect. The actual city-name search needs some work. It doesn't always translate a search perfectly, because it's using a likeness check in PostgreSQL to do fuzzy string matches, to get as close to a city as it can.
It works a lot of the time, but try St. Paul, for example. The database is storing it as S-A-I-N-T Paul. But... a lot of times someone who might be searching for St. Paul might just type in S-T dot Paul.
That will not pull up the autocomplete on the site. It won't autocomplete to St. Paul, Minnesota right away. So that is one area for improvement: abbreviations of city names.
Or maybe there even needs to be a component that can auto-convert those abbreviations into full names. You could get very magical with how that user input actually gets converted. I may have just given myself an idea for how to improve that.
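[Editor's note: one common way to do fuzzy matching in PostgreSQL is the pg_trgm extension; I don't know whether that's exactly what the app uses, but the shape of such a query looks like this. Table and column names are assumptions:]

```php
<?php

// Requires `CREATE EXTENSION pg_trgm;` once in the database.
// similarity() scores two strings from 0 to 1 by shared trigrams,
// and the % operator filters by the configured similarity threshold.
$sql = '
    SELECT name, admin_code
    FROM cities
    WHERE name % :search
    ORDER BY similarity(name, :search) DESC
    LIMIT 10
';

$rows = $connection->fetchAllAssociative($sql, ['search' => $userInput]);
```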
Jeremy:
Oh boy! Looks like we need more datasets!! 😆
Alena:
I might need some more collaboration here 😅 This conversation has been helpful 🙂
Jeremy:
That is pretty cool. You know, the search part of it is interesting. If I had to make a suggestion, I think a ranking system of sorts would be a super [useful feature]. I have no clue how you'd implement it, but it would be nice, you know? Because when I type in Minneapolis, the first thing that comes up is Minneapolis, Kansas -
And I'm like, what? Who cares about Minneapolis, Kansas? 🙄 There should be some kind of metric to [say] “I know what the most important Minneapolis is”. Maybe by population of the city.
Alena:
😂 That's true, that's true. I could be wrong, but I'm pretty sure there are population values in the GeoNames dataset. That might be something that can be pulled in.
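[Editor's note: the main GeoNames dump does carry a population column. If it were imported alongside the city names, breaking ties by size would be a one-line change to a query like the one sketched above – hypothetical schema again:]

```php
<?php

// Rank by name similarity first, then by population, so a search for
// "Minneapolis" surfaces Minneapolis, MN before Minneapolis, KS.
$sql = '
    SELECT name, admin_code
    FROM cities
    WHERE name % :search
    ORDER BY similarity(name, :search) DESC, population DESC
    LIMIT 10
';

$rows = $connection->fetchAllAssociative($sql, ['search' => $userInput]);
```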
Jeremy:
Well, because of this one single tiny nitpicky flaw, your app is terrible. 😤 Absolutely, you know, worst app ever. Not worth using. And you used PostgreSQL instead of the obviously based, glorious, much better MySQL.
Alena:
Daddy Oracle.
Jeremy:
How dare you? 😠
Alena:
Big Daddy Oracle. 😏
Jeremy:
🤣
[Only] praise [allowed]! I will accept no Oracular slander in this house. Could you tell me more? I know so little about the database side of things. Why PostgreSQL over, say, MariaDB?
Alena:
Honestly, I knew that PostgreSQL is quite big in the open source space. And I kind of wanted a chance to try it, give it a spin.
I kind of figured, “Well, if things are not working super well within the first few weeks or so, it wouldn't be too hard, using an ORM, to switch from PostgreSQL to MySQL, or to, say, MariaDB.”
I haven't had any personal experience with MariaDB. I've been told that it's kind of a drop-in replacement for MySQL, which would be great – having that same kind of syntax and functionality that I expect.
Just minus, you know, the Big Daddy Oracle. 😁
Jeremy:
Right. 😆
Alena:
I kinda wanted to take PostgreSQL for a spin.
Jeremy:
Give it a shot. Yeah, why not?
Alena:
So far, I haven't run into any performance issues or any specific functionality that PostgreSQL lacks compared to other relational database servers. I'm quite happy with PostgreSQL. On another note, on what you said earlier – we should totally make an app called OpenSpores, a mushroom identification app with an open dataset, where you can input different mushroom features and narrow down which shroom you're looking at.
😂
Jeremy:
You know, maybe some computational photography stuff, although that gets kind of icky and proprietary.
Alena:
😔
Jeremy:
Who knows. Just food for thought and thought about food.
Alena:
Maybe we'll have to make our own machine learning model for identifying mushrooms – our own Friday night hack project of a mushroom identifier.
Jeremy:
Perhaps, perhaps. It's certainly an option. I think there's certainly some intersectionality.
Anyone interested in networking should be interested in mushrooms, I think.
Alena:
😆
Jeremy:
Perhaps that's just me. You know, we need more alternatives to these hot technologies.
They're so, so proprietary. I especially like how this app... it feels professional.
Alena:
Oh, thank you. ☺️
Jeremy:
You wouldn't think it was made by a single person.
Alena:
Oh, thank you. 😊
Jeremy:
[You'd think it was made by a] team subdivision of, I don't know, some sort of activist group.
Alena:
Someone told me my app looks quite corporate. 🤣
Jeremy:
Oh! 😵 Oh no.
Alena:
And I'm like, oh, no! 😅
Jeremy:
Oh, no, no. I'll make you a style sheet to make it look like a dive bar.
I'll add some Web 1.0 flair.
Alena:
Excellent.
Jeremy:
But, you know, it's cool to see this sort of... grassroots free software activism in the place of more traditional, official, state or corporate offerings.
And it's kind of a radical model you have here.
Do you have any other areas of social malaise that you think would benefit from similar projects like this – from active open source [development]?
What do you think needs to be made less proprietary, if you could choose?
Alena:
Things that could be made less proprietary... 🤔
My focus has been tools for the LGBTQ space.
I'm sure there could be quite a lot more search tools created for all sorts of different specialized medical care.
But for any particular applications – what would I like to see more open source of? Hmm...
Jeremy:
I mean, for my money, I talked – like a year ago – to my boyfriend about making an open source gay dating app.
But Lord knows [certain reactionary] heterosexual people would riot if they found out that was a thing. 😒
Alena:
😞
I'm sure there's a plethora of great ideas that could be used for that.
Just even glancing at my phone – I'm using a proprietary medication tracker app.
I'd love it if there was something I could use that had a good hook into a federal medication database, to allow me to quickly add meds without all of the proprietary, nasty tracking features that are probably going on under the hood.
Jeremy:
Mozilla released their report last year – Privacy Not Included – that talked about all these different apps in the Google Play store, and how the permissions the [Play store page] said the app would need had no correlation with the actual privacy policies on the apps' websites.
I saw an email a few months ago [from them] – they've done updates on this since, and the worst performing section probably shouldn't be any surprise:
Mental health apps.
Alena:
Geez.
Jeremy:
People who need care?
Tell me all your deepest secrets so that I can monetize them.
Doesn't that sound fun? 🤑
Alena:
That's... kind of disheartening to hear. 😞
Well, if anyone has any ideas to make good mental health apps, that sounds like something that could benefit from a new contender in this space, a new free contender.
Jeremy:
For sure. Definitely.
Alena:
Or paid.
Jeremy:
Or paid!
Alena:
I'm not against paid open source applications.
Jeremy:
Yeah. Open core or other models.
NOOO!! You must starve for your work. 😈
Absolutely.
You know, I'd love to do something similar [to] this – make some kind of contribution to the queer community in my... tech bro way.
You know, convince them I'm on their side.
Alena:
Yeah.
Jeremy:
It would be nice to [perhaps] fork this and do something with that, maybe with a different data set.
I like the idea of a locator app of things that are mappable.
It makes it very tangible, you know.
Alena:
This application is GPL 3.0, sooo have at it! 🥳
Jeremy:
Hey, there we go!
Well, it seems like all the work was in the dataset preparation anyway.
Alena:
Reskin it.
Go for it.
Jeremy:
All right.
Perhaps I will.
That could be kind of fun. 🙂