Esprii Chapman – Original Research Master Post – 01/14/21

Preface

Back in December I wrote that “I would like to create an algorithm that mimics COMPAS’s risk assessment, but with a lighter topic. Creating something along the lines of an algorithm that can predict your favorite superhero based on what people of similar demographics like. It obviously wouldn’t be perfect but that’s what I want to prove, that it isn’t perfect, and it’s not individualizing someone.”

This turned out to be far from what my final product would be. Algorithms, as it turns out, take time to produce and to get working properly. Instead of some elaborate math, I kept it simple: with the aid of a video, I produced and refined an algorithm that learns a simple rule. In the end, it came out better than I could have hoped, having spent around 55 hours on the whole process.

Logs/Thought Process

Sometime in November 2020 – 

Create an algorithm that mimics what is known about the COMPAS algorithm with a lighter topic.

This was what I wrote before the break, nothing else. I was trying to apply a methodology that I thought I understood to my algorithm, but as it turns out, there’s not much to go off of when it comes to actually writing an algorithm. What I have learned so far, as of December 26th, 2020, is that you cannot develop complex algorithms such as the COMPAS algorithm in a month. Even a simple boolean-output algorithm like the one I was writing can take months to complete.


Sometime in December 2020 – 

I didn’t write logs for it, but according to my OR, I need a question for my algorithm to try to predict the answer to. I compiled a few screenshots of my search history, with dates, to show my attempts to overcome this step of the process.

The search for a suitable question and dataset (with screenshots taken after the fact as proof)

  • I searched around on Google, GitHub, and Kaggle for half-decent datasets that I could use but found nothing.
  • This part of my research has taken the longest, which I was not expecting.

This is the search across several different data sites to find a dataset

Potential questions that didn’t happen to work out

  • What superhero is your favorite?
    • Based on demographic data instead of direct answers
    • Couldn’t find a suitable dataset
  • Chance of heart failure
    • Too personal
    • Couldn’t be given a definitive answer because no one in the school has gone through heart failure

After Dec 1st –

Meeting with Mr. Haske (Video uploaded to drive)


December 25th, 2020 – 

I’ve spent several undocumented hours looking for datasets that I’m likely never to use. In actually starting to work on the algorithm, I’ve discovered that completing a full-fledged algorithm is going to take far longer than a month. Writing it is one thing, but training is another matter entirely. The absolute minimum I’ve found people training boolean (true or false output) algorithms on is over 20,000 iterations of the same idea. I should have foreseen this after talking to David Luebke, who said it takes a few weeks to train a single GAN even while running on 8 of the fastest processors available, cards that cost hundreds of thousands of dollars each. Making an algorithm like the one I’m trying to build would take months if not years on what I have right now.

I’m switching my focus from trying to make an entirely original algorithm to replicating an existing one, to gain a fuller understanding of what I’m talking about when I discuss how the COMPAS algorithm works. I can’t find out every nook and cranny of the algorithm itself, as it is proprietary: the weights are hidden, and I don’t know the exact data it was trained on.


December 26th, 2020 –

I wrote a REALLY simple algorithm that has potential; I might not need excess datasets in the end. I had a very long conversation with my mom of all people, explaining my thought process of how this ties into my original research and generally about the research I’ve been doing.

  • This is the algorithm; I followed this guide to make it. It explained how, over anywhere between 20,000 and 1,000,000 iterations, it taught an algorithm to tell whether, based on 3 questions, the output should be true (1) or false (0). Despite the video being around 14 minutes long, it took just over 5 hours (between 2 pm and 7:30 pm) to start to wrap my head around some of the math and understand some of the functions it was using. This turned out to be probably one of the most helpful videos I could have ever found to guide me through making an algorithm learn to produce boolean values based on informational data.
    • I plan to apply some of my datasets to this basic algorithm and see if it can accurately predict something from them. I’ve figured out that I don’t need human input: I can train the algorithm on 90% of the data and use the remaining 10% as tests to see if it can accurately predict the original outcomes (a rough sketch of that split is below).
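Roughly, the split I have in mind looks like the sketch below. This isn’t the project code itself, just a minimal illustration of the 90/10 idea, with made-up variable names and a made-up rule (the output equals the first input).

```python
# Minimal sketch of the 90/10 train/test split idea (illustrative names,
# not the actual project code).
import numpy as np

np.random.seed(0)

# Hypothetical dataset: 1,000 rows of 3 zero/one inputs; assume the "rule"
# is that the correct output equals the first column.
inputs = np.random.randint(0, 2, size=(1000, 3))
outputs = inputs[:, 0]

# Train on the first 90% of rows, hold out the last 10% for testing.
split = int(len(inputs) * 0.9)
train_inputs, test_inputs = inputs[:split], inputs[split:]
train_outputs, test_outputs = outputs[:split], outputs[split:]

print(train_inputs.shape, test_inputs.shape)  # (900, 3) (100, 3)
```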

Conversation with mother:

  • The biased data that algorithms like COMPAS have been based on, even when race isn’t explicitly included in the algorithm
  • Bias in favor of the majority
  • GauGAN & criminal justice
  • Discriminators
  • Checks and balances

December 28th, 2020 –

I’ve spent many hours relearning some of the more complex functions and uses of for loops in Python. I had the idea of doing everything in Lua, but I felt like that might invalidate the book I read on Python, so I decided to learn how to do it in Python instead. I still haven’t (as of 6:06 PM) completely figured it out, and right now I’m just playing around with them until I understand how to do it. For loops are important for generating a sample dataset that I can use immediately while I continue to search for other potential datasets compatible with the sigmoid-function method I’m trying to follow. I plan on looking back in Mathematics Behind Machine Learning to see how the sigmoid function is applied and used effectively, so I can make sure I’m using it correctly in my programming.
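For reference, this is how the sigmoid function and its derivative are typically written with NumPy; a small sketch of what I expect to end up using rather than a finished piece of the project.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    # Expects the output of sigmoid(x) and returns the slope at that point,
    # which is what the weight adjustments get scaled by.
    return s * (1 - s)
```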

When I started earlier today I thought I could use two Python module packages called xlrd and xlwt to read and write spreadsheet files; however, after conversing with a few friends who know Python a lot better than I do, I plan on using a module package I learned about from them called Pandas, in conjunction with NumPy, which Pandas was originally built on top of. It’s much more robust from what I can tell, saving dictionary datasets as DataFrames and then reading and writing them to a CSV file.
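This is the rough idea of the Pandas/NumPy workflow my friends described: build a DataFrame from a dictionary of arrays, then write it out and read it back as CSV. The column and file names here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Dictionary of NumPy arrays -> DataFrame (Pandas' table structure).
data = {
    "input_1": np.random.randint(0, 2, size=10),
    "input_2": np.random.randint(0, 2, size=10),
    "input_3": np.random.randint(0, 2, size=10),
}
df = pd.DataFrame(data)

df.to_csv("sample_dataset.csv", index=False)  # write to a spreadsheet-style file
df_back = pd.read_csv("sample_dataset.csv")   # read it back in
print(df_back.head())
```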


December 29th, 2020 –

I’m writing this quite a bit after the 29th, so forgive me if my memory is a little foggy. I didn’t work a lot that day, but I did work for a bit. I never fully figured out the Pandas module set, but I started researching the functions necessary to create a dataset I could use (a list of all of the sources is at the bottom of the logs here). After researching for a bit, it turned out that Pandas wasn’t necessarily needed to create a functioning dataset, as I could generate one each time I ran the program with relative ease.


January 7th, 2021

I’ve spent about 6 hours today trying to figure out how to create and edit around a 2D array that contains 10000 lines of 3 numbers to replace the puny 4 lines long array I was using previously. It’s quite amazing how long I spent on this considering how simple the output code eventually was. I had to go through many sources (all listed below) surrounding individual modules and functions within NumPy to find out how to accomplish this task in the most efficient way possible. I did this to not only create a better dataset that can be used to better train this simple algorithm I started last year (haha funny new years joke) but also to practice and understand the array system NumPy has created more. Now that I’ve gotten a method (shown in the screenshot below) that works, I tried porting it into my original project to use it as the generated dataset, however, it’s using quite a lot of memory and returning an error that I hope to solve in the morning. After that, it’s just a matter of allowing the user input values and letting the algorithm predict whether the output will be a 1 or a 0.

I’m hoping to reach out and gather more datasets over the weekend that I might potentially be able to import, but as is, I believe I have accomplished the goal of my original research and have spent more than enough time on it.


January 9th, 2021- 

I’ve spent about 4 hours trying quite a few techniques to fix this memory issue, to no avail. Even allocating more memory to the program has not helped. I’ve switched over to a different editor called Atom, something I had from back when I was programming Discord bots, and I’m looking into trying to run the program from there instead, but every kernel I try to install throws an error with no error code. I’m sure I’ll find a way eventually, but it’s getting annoying and has turned from something I thought I could fix in a few minutes into something I’m struggling to fix after hours.


January 10th, 2021 –

After spending most of the day on the issue, trying different fixes, and getting frustrated with the program, I solved it. I eventually gave in and called my friend Luke, who knows a LOT more about PyCharm, the IDE I was using, than I do. In the end, the issue was that I was using a 32-bit build of Python 3.8 instead of the normal 64-bit version of 3.9 that I should have been using. The memory cap for all 32-bit software is 4 GB, so no matter how much extra memory I allocated, it physically couldn’t use more. With that figured out, I revised my program and ironed out the issues with different arrays being cast. I’m planning on taking down results tomorrow and then concluding everything, potentially by trying to implement a pre-existing dataset. I also uploaded the entire script to a GitHub repository so others can pull from and push to it, meaning teachers at Renaissance, with a little bit of work, could grab and run the program on their systems with minimal effort. I’m quite happy with how everything is turning out!
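For anyone hitting the same wall, here is a quick way to check whether the Python interpreter you’re running is a 32-bit or 64-bit build, which is what was behind the 4 GB ceiling described above.

```python
import struct
import sys

print(sys.version)                       # interpreter version string
print(struct.calcsize("P") * 8, "bit")   # pointer size: 32-bit or 64-bit build
```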

GitHub Repository (all the code can be viewed here): https://github.com/Shroomsher/STOriginalResearch


January 13th, 2021

Well, it’s a little after “tomorrow” as I noted in my last log, but here we are. I’m revising these logs and finalizing everything tonight. It feels great getting everything finished and analyzed; a massive weight is finally off my shoulders.

I was too tired to write it last night, but I made a ton of changes to the file handling, and the program can now automatically save results.


January 14th, 2021

Finally done and finalized, presenting the results, and writing up conclusions. I’m super happy to finally have this done.

Reflection

I’m terrified of confrontation, and that seriously slowed me from reaching out to make a plan. The alarms had been ringing and I was the one who tried to ignore them; it wasn’t a healthy approach to a bad situation. Thankfully, I pulled through and created something even better than my original plan. It highlights the potential inaccuracies of algorithms well. Even math can turn out to hurt you in the end.

I’m honestly glad that I started this late, as it gave me a chance to do a truly deep dive without much distraction or being spread thin by other classwork. It gave me a sense of what algorithms can and cannot do, and of the complexity of some of them.

Analysis

The math behind this algorithm is quite simple:

  1. It takes the input layer and the randomly generated synaptic weights and calculates the dot product of the two lists.
  2. It calculates the sigmoid of this dot-product list.
  3. It then calculates the error by subtracting its calculated outputs from the actual outputs; in an ideal world it wants to see all 0s.
  4. From there it calculates the adjustments it needs to make to its synaptic weights by multiplying the error by the derivative of the sigmoid of the outputs.
  5. Finally, it adjusts the synaptic weights by adding the dot product of the transposed input layer and the adjustments to the synaptic weights.
  6. It repeats this process, adjusting the weights and thus pushing its outputs closer and closer to 1 or 0 depending on the feedback it receives in relation to the specified rule. This goes on a certain number of times, specified either by a random number generator or by user input (a minimal sketch of this loop follows below).
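Below is a minimal sketch of that loop, written in the style of the tutorial I followed; the tiny dataset, variable names, and 20,000-iteration count are illustrative rather than the exact code in my repository.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    return s * (1 - s)

# Tiny example dataset: 3 inputs per row; the rule is that the output
# follows the first column.
inputs = np.array([[0, 0, 1],
                   [1, 1, 1],
                   [1, 0, 1],
                   [0, 1, 1]])
outputs = np.array([[0, 1, 1, 0]]).T

np.random.seed(1)
synaptic_weights = 2 * np.random.random((3, 1)) - 1            # random weights in (-1, 1)

for _ in range(20000):                                         # step 6: repeat many times
    layer_output = sigmoid(np.dot(inputs, synaptic_weights))   # steps 1-2
    error = outputs - layer_output                             # step 3
    adjustments = error * sigmoid_derivative(layer_output)     # step 4
    synaptic_weights += np.dot(inputs.T, adjustments)          # step 5

print(synaptic_weights)                                        # trained weights
print(sigmoid(np.dot(np.array([1, 0, 0]), synaptic_weights)))  # should be near 1
```

With this rule, the trained weights typically come out with the first weight large and positive and the other two below 0, which matches the “all but one below 0” pattern described in the results below.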

For actual results and exhibits, I had the program generate fifty random run-throughs of the algorithm I created to highlight both accuracy and inaccuracy. It returned accurate datasets and weights about 78% of the time and returned datasets and weights with significant potential for error 22% of the time. To judge this, you need to look at the synaptic weights, since they are what gets trained to produce an accurate output list.

About 22% of the time, the training left multiple weights above 0, whereas the more accurate sets of weights always have all but one weight below 0.

Here are a few datasets with samples of accurate weights (can be found under column E):

Dataset 1

Dataset 2

Here are a few datasets with samples of inaccurate weights (can be found under column E):

Dataset 1

Dataset 2

One should notice that even in the cases where multiple weights are over 0, the algorithm still seems to predict fairly accurately; however, it is more subject to error because the weights are more similar in value. This matters because the closer the weights get to each other, the more significance the algorithm assigns to the numbers that don’t matter, and the less significance it assigns to the number that actually determines the output. It’s not confirmed that these outputs can’t be trusted; it just means the algorithm has the potential to think that insignificant numbers would change the output to a 1.

Exhibits

The main exhibit is the GitHub README file, which can be found through this link.

All of the datasets and their associated variables can be viewed here.

Simple Conclusions

This project took a simpler and narrower approach than I had expected coming into the new year. I think it highlights how powerful algorithms can be and how good they can be at simple things; however, even something this simple has a high likelihood of producing false positives, much like the COMPAS algorithm. Even though the COMPAS algorithm is proprietary, I can say with 100% confidence that it’s more complex. If this is how badly a simple algorithm that outputs 1s and 0s can mess up, I think that proves my “hypothesis” correct. Combining that with knowledge from my traditional research: algorithms are good at certain things, but moral decisions like this can be beyond their capabilities, and they can still lead to serious issues when unintentionally built-in feedback loops make something that isn’t easily testable appear to be working perfectly fine.

I think that this was much better than my original idea and happened to prove my hypothesis and thesis unintentionally. Everything came together very well.

All the sources I’ve used:

https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html

https://realpython.com/pandas-read-write-files/

https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/?sh=7f2ef962b54d

https://www.w3schools.com/python/numpy_random.asp

https://www.w3schools.com/python/python_for_loops.asp

https://www.thepythoncode.com/article/generate-random-data-in-python

https://pynative.com/python-mysql-insert-data-into-database-table/

https://stackoverflow.com/questions/60980944/creating-multiple-random-numpy-arrays-with-different-range-of-values

https://numpy.org/doc/stable/reference/generated/numpy.empty.html

https://stackoverflow.com/questions/24108417/simple-way-of-creating-a-2d-array-with-random-numbers-python

https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html

https://towardsdatascience.com/numpy-array-cookbook-generating-and-manipulating-arrays-in-python-2195c3988b09

https://stackoverflow.com/questions/13572448/replace-values-of-a-numpy-index-array-with-values-of-a-list

https://numpy.org/doc/stable/reference/generated/numpy.around.html

https://numpy.org/devdocs/user/basics.indexing.html

https://xspdf.com/resolution/58686879.html

https://stackoverflow.com/questions/31909722/how-to-write-python-array-into-excel-spread-sheet

https://numpy.org/doc/stable/reference/generated/numpy.exp.html

https://forum.qiime2.org/t/the-solution-about-the-memoryerror-when-training-classifier-with-silva-132/13668

https://stackoverflow.com/questions/47493559/valueerror-non-broadcastable-output-operand-with-shape-3-1-doesnt-match-the/47493938

https://github.com/atom-community/atom-script

https://nteract.io/kernels/python

https://nteract.gitbooks.io/hydrogen/content/docs/Installation.html#kernels

https://nteract.io/kernels

http://ipython.org/

https://jupyter.readthedocs.io/en/latest/install.html
