I need to keep track of which teachers move from one school to another, so I wrote a web scraper that scrapes all of the teacher websites. I am fairly familiar with scrapers, but I wanted the program to run on a weekly basis, so I decided to take this opportunity to learn a little about the cloud. I looked into Lambda functions and S3 buckets so I could scrape the websites, read my previous list, and then send myself an update with any changes.
I originally wrote the program to read and write local files: it would scrape the teacher names and write them to a local file that it could compare against on the next run to see if any staff had changed. I had to change this for the cloud version because reading and writing S3 objects is a little more involved, and I don't think you have the option to append, so you need to gather all of your changes before writing them to the file.
S3 Buckets
Before anything else, I set up an S3 bucket. The S3 bucket is a cloud storage location. Here is the documentation on how to create an S3 bucket. https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html
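If you would rather create the bucket from code instead of the console, a minimal boto3 sketch looks something like this (the bucket name here is just a placeholder):

import boto3

s3 = boto3.client("s3")
# Placeholder name; S3 bucket names must be globally unique
s3.create_bucket(Bucket="teacher-tracker-bucket")
# Outside us-east-1 you also need to pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}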
Roles
In order for your Lambda function to read and write to an S3 bucket, you need to set up a role with permissions to access the S3 bucket.
I followed the directions and made changes to the JSON in the role's permissions policy. You can see how to make changes to permissions here. https://docs.aws.amazon.com/lambda/latest/dg/python-package.html
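I won't reproduce my exact policy here, but the statement you add looks roughly like the sketch below. In the console it's JSON; I'm writing it as a Python dict with a placeholder bucket ARN:

# Rough shape of an S3 read/write statement for the role's policy.
# The bucket ARN is a placeholder for your own bucket.
s3_statement = {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::teacher-tracker-bucket/*",
}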
Lambda Function
When you create a new Lambda function, you need a file called lambda_function.py with a function called lambda_handler, which acts as the entry point for your program. From there you can import other functions like normal, but you need to zip everything together before you can deploy it as a Lambda function. You can read about how to zip everything together in the packaging documentation linked below.
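Here is a minimal sketch of what that entry point has to look like; my real program kicks off the scraping and comparison from inside the handler:

import json

def lambda_handler(event, context):
    # Lambda calls this with an event dict and a context object.
    # For a scheduled scraper the event payload isn't really needed.
    print("Triggered with event:", json.dumps(event))
    # ... scrape, compare against S3, send the update ...
    return {"statusCode": 200}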
In order to avoid putting any secrets in the code, you can set up environment variables just like you would locally. Here are the instructions on adding environment variables. https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
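Inside the Lambda they show up in os.environ, the same as local environment variables. The variable names below are just examples, not my exact setup:

import os

# Example names only; set the real values under Configuration > Environment variables
BUCKET_NAME = os.environ.get("BUCKET_NAME", "")
NOTIFY_EMAIL = os.environ.get("NOTIFY_EMAIL", "")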
In order to use any pip installs, you need to zip them into your package. For example, I needed Beautiful Soup, so I made a virtual environment and installed all of the requirements into a folder called package. Below you can see how to zip all of your requirements into a single deployment package along with lambda_function.py. https://docs.aws.amazon.com/lambda/latest/dg/python-package.html
pip freeze > requirements.txt
mkdir package
pip install -r requirements.txt --target ./package
cd package
zip -r ../deployment_package.zip .
cd ..
zip deployment_package.zip lambda_function.py
After everything was ready and deployed, I added a scheduler to my Lambda using a cron expression in EventBridge. Here is the documentation on how to write the cron expression on AWS. https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-cron-expressions.html
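EventBridge cron expressions have six fields (minutes, hours, day of month, month, day of week, year). As an example, an expression like the one below fires every Friday at 18:00 UTC; yours will depend on when you want the scrape to run:

cron(0 18 ? * FRI *)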
Your Lambda function needs to write to and read from S3 buckets, and for that to work you do need to make sure the Lambda is assigned a role with permissions to your S3 bucket. Below you can see the functions I used to read from and write to S3, with the imports and client setup included so they run on their own (the bucket name shown here comes from an environment variable).
import json
import os

import boto3

s3_client = boto3.client("s3")
# Assumption: the bucket name is supplied through an environment variable (see above)
bucket_name = os.environ.get("BUCKET_NAME", "")


def read_file_from_s3(file_name, bucket_name=bucket_name) -> list[str]:
    try:
        obj = s3_client.get_object(Bucket=bucket_name, Key=file_name)
        data: str = obj['Body'].read().decode("UTF-8")
        returned_data = json.loads(data)
    except Exception as e:
        # On the first run the file won't exist yet, so fall back to an empty list
        returned_data = []
        print(e)
    print(returned_data)
    return returned_data


def write_to_s3(content_to_write, file_name, bucket_name=bucket_name):
    try:
        s3_client.put_object(
            Key=f"teachers/{file_name}",
            Bucket=bucket_name,
            Body=json.dumps(content_to_write).encode('UTF-8')
        )
    except Exception as e:
        print(e)
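For reference, the comparison itself is simple once both lists are in memory. This is a simplified sketch rather than my exact code:

def find_changes(old_names: list[str], new_names: list[str]) -> dict[str, list[str]]:
    # Returns who appeared and who disappeared between two scrapes
    return {
        "added": sorted(set(new_names) - set(old_names)),
        "removed": sorted(set(old_names) - set(new_names)),
    }

# e.g. previous = read_file_from_s3("teachers/teachers.json")
#      changes = find_changes(previous, scraped_names)
#      write_to_s3(scraped_names, "teachers.json")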
Docker
I messed around with Docker for my original program because it seemed like a good way to deploy to a Lambda, but I had to make major changes to the program so it would work with the S3 buckets, and I ended up deploying without Docker (although it might be a good idea to redeploy with Docker). Docker was pretty straightforward: you basically just write all the commands you need to get your environment set up.
Conclusion
In a tale as old as time, I spent 10 hours automating a task that would take me one hour to do by hand, but I think it was worth it. I learned a lot about lambda, and I successfully deployed to the cloud! I will have to wait until Friday afternoon to see if the cronjob is working the way it should, but I have faith that it is working. This will really help me keep on top of my record keeping. It took so long to check all of the websites and compare to previous records that I only had time to do it once a year. Now I can quickly make updates every week.