My First: Web Scrape, Neural Net, Hackathon, Group Project, and Planning Capstone.

David Lee
5 min read · Sep 18, 2020


I’m now in the middle of my 9th week out of a total of 12. With only a month to go, I’ll be attempting to work with some of the latest in computer vision for my final project (capstone), something that seemed impossible a month ago.

Me testing YOLOv5 from my webcam

Month 2 recap of my General Assembly Data Science boot camp.

As I think about my personal journey, I can see the light shining at the end. The constant flow of information is getting easier to digest, and I’m finding time to learn about the aspects of Data Science that interest me. None of this would be possible without the building blocks covered in the program. The pace of the boot camp is fast, but by touching on several aspects of the field, I am starting to understand where I can see myself after I graduate in October.

Week 5–6: Supervised Learning and Project 3

If I were to sum up my first month in a word, it would be foundation. Building upon that, my cohort was fortunate to learn about web scraping, APIs, and advanced supervised learning models.

It was an explosive two weeks in which we dove into tools and services like BeautifulSoup and AWS. We learned more about machine learning algorithms like Random Forests and Support Vector Machines, along with concepts including Gradient Descent.

A fellow member of my cohort and I, inspired by the lessons, studied further and even implemented an LSTM RNN (long short-term memory recurrent neural network) model in both of our third major projects.

My project 3 was about creating a tool to differentiate posts from two related subreddits. I chose AMD, a major computer parts manufacturer, and BuildaPC, a computer building community.

Using Python and Pushshift’s API, I was able to scrape nearly 220,000 Reddit posts and run an analysis using Natural Language Processing (NLP) models. At its core, this consists of pulling the text from each post, cleaning it by removing unwanted characters, and vectorizing the words into numbers so the computer can learn the relationships across the entire word corpus.
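The clean-and-vectorize step described above can be sketched in pure Python. This is a toy illustration, not my actual pipeline: the example titles are hypothetical, and a real project would use a library such as scikit-learn for vectorization.

```python
import re

def clean(text):
    # Lowercase and strip anything that isn't a letter, digit, or space
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

def vectorize(posts):
    # Build a shared vocabulary, then count word occurrences per post
    cleaned = [clean(p).split() for p in posts]
    vocab = sorted({word for words in cleaned for word in words})
    return vocab, [[words.count(w) for w in vocab] for words in cleaned]

# Hypothetical titles standing in for scraped subreddit posts
posts = ["New Ryzen CPU released!", "Help me pick a CPU for my build"]
vocab, vectors = vectorize(posts)
```

Each post becomes a row of counts over the shared vocabulary — a bag-of-words representation that a classifier can learn from.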

This was a lot of fun and hard work, and it was extremely satisfying to see our final model correctly distinguish new posts using only the text of their titles.

My LSTM RNN model correctly predicted which subreddit a post belonged to from the title text alone, on new, unseen data.


Project 4: Hackathon. Day 1 of Week 7

The Monday after my third major project greeted us with an all-day hackathon, where the class was split into groups of 3–4 people to solve problems under restrictions. We were given approximately 7 hours to complete the entire data science process on data we hadn’t seen before.

Working from a feature-rich data frame, we were trying to determine whether certain features have an impact on whether someone makes over $50k per year.

My group’s restriction was to answer the problem using only a Random Forest model.

This process entailed understanding our data, cleaning it (including removing outliers), and analyzing it to define our success metric. In my team’s case, that metric was precision: we wanted the model’s predictions of who earns over $50k to be correct more often than those of other variations of our model.
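Precision answers a specific question: of everyone the model predicts to earn over $50k, what fraction actually does? A minimal sketch with hypothetical labels (1 = over $50k):

```python
def precision(y_true, y_pred):
    # Of all positive predictions, how many were actually positive?
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    pred_pos = sum(y_pred)
    return true_pos / pred_pos if pred_pos else 0.0

# Hypothetical labels: 1 = earns over $50k, 0 = does not
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
print(precision(y_true, y_pred))  # 2 of 3 positive predictions correct -> 2/3
```

In practice you would use scikit-learn's `precision_score`, but the arithmetic is exactly this ratio.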

After identifying an optimal Random Forest model through a grid search over tuning parameters, we finished our slide presentation with only minutes to spare before presenting our findings.
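A grid search is conceptually simple: try every combination of tuning parameters and keep the one that scores best on validation data. The sketch below uses hypothetical parameter values mirroring common Random Forest knobs, and a hard-coded stand-in scorer where a real search (e.g. scikit-learn's `GridSearchCV`) would fit and evaluate a model:

```python
from itertools import product

# Hypothetical parameter grid mirroring common Random Forest settings
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [5, 10, None],
}

def evaluate(params):
    # Stand-in scorer; a real search would fit a model here and
    # measure precision on held-out data. These scores are made up.
    scores = {(100, 5): 0.71, (100, 10): 0.74, (100, None): 0.70,
              (200, 5): 0.72, (200, 10): 0.76, (200, None): 0.73}
    return scores[(params["n_estimators"], params["max_depth"])]

best_params, best_score = None, -1.0
for combo in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), combo))
    score = evaluate(params)
    if score > best_score:
        best_params, best_score = params, score
```

The winning combination is simply whichever scored highest — here, the made-up scores make it 200 trees at depth 10.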

Although my group (group 7) didn’t get the highest accuracy, our model achieved the highest precision score on new data, which is what my team was going for.

Week 7–9: Big Data, Unsupervised Learning, Time Series, and Group Project 5

A little reflection:

Ok, so these are a lot of major topics in Data Science, but my class and I are, figuratively, taking our training wheels off. The introduction to many of these topics is, in my opinion, meant to give my class the many available tools to choose from as we develop our own specialties. Each of these topics is still being built upon, which is pretty amazing to be a part of. I’m also really glad that I spent the time building my own Linux computer before the course, as it’s helping me understand more of the DevOps side of Data Science.

Project 5: Mapping Power Outages through Social Media — Presenting Tomorrow to FEMA and New Light Technologies

I’m extremely excited to have my group present our solutions to real problems that companies have brought to General Assembly students.

I’m teamed up with three amazing individuals to answer this problem, and we will be presenting our most complete project to date with scientists from FEMA and a government consulting agency in attendance.

I plan to write more about this in a future post.

Planning Capstone:

It’s difficult for me to grasp just how quickly the program is going. I would definitely like to go into more depth on the many small steps it took for me to get to where I am now, but that may need to wait until after my final few weeks.

As shown at the beginning of this post, I’ll be tackling yet another technology I haven’t tried for my capstone: using machine learning to contribute to the American Sign Language (ASL) community.

I’ve already collected some data, and I’ve opened a Dropbox request link until Sep. 27 for anyone who wants to send in their attempts at the letters, or learn more about my project: https://docs.google.com/document/d/1ChZPPr1dsHtgNqQ55a0FMngJj8PJbGgArm8xsiNYlRQ/edit?usp=sharing

Final Remarks:

This post is really condensed, and I’d like to go into more depth on my experience as a Data Science student at General Assembly. Feel free to leave me a note if you would like to learn more. Many thanks!

Note — The Reddit posts used for testing my model on my third project can be found here:

AMD Post:

Build a PC Post:
