What I learned from trying to predict survival on the Titanic
Machine learning is used to solve a variety of problems, ranging from low to high complexity. It is the foundation for smart computing, which eventually leads to fully developed artificial intelligence solutions. Advanced tools and technologies have made machine learning easier to try. Options such as cloud computing, edge computing, and federated learning shape the type of solution that fits a given problem.
Kaggle is a good place to start learning and solving machine learning problems. Like many others, I tried my hand at understanding the basics of ML by picking the Titanic survival prediction problem. As big as the problem looks, Kaggle provides a clean data set and test data for beginners to get started.
As I worked through the solution, I realized that there is more to consider before jumping to solutions. We will not always be given the information needed to solve the problem.
So, what do we need to start solving an ML problem? Below are some of the things I learned.
- What data is needed? It is a myth that more parameters in a data set give more accurate predictions than fewer parameters. What matters is determining the right parameters. Understanding the domain and the problem to solve is important for arriving at the critical ones.
- Do we have the means to measure all the data? Sometimes it may not be possible to collect data on a needed variable. How do we collect the data? Which devices or sensors do we need to collect it? A feasibility check on those is crucial. What are the alternatives for deriving the parameters? How critical is it if a parameter is missing? Answering these questions helps in collecting the right data to solve the problem.
- Local data storage vs cloud storage: Federated learning is an approach within ML in which the raw data does not leave its source. Instead, the models are trained right at the source where the data is generated, and only the models are then transferred to centralized cloud systems. Whether to process data at the source or in the cloud can be decided by knowing whether all the needed parameters are generated by a single source or by multiple sources. Federated or on-device learning works well when a single source generates the data, whereas centralized cloud processing fits when data must be collected from various sources.
- Network availability for cloud storage: Not all data sources can transmit data continuously. In some cases, the network may not be available for continuous shipping of data. Other choices, such as the size of the data and the frequency of transmission, depend on network availability and on the protocol used to transmit the data from the source to the cloud.
- Energy for the additional computational needs: Because we are looking at the event in hindsight, we call it Titanic survival prediction. If the ship were launched today, we would probably not be doing survival prediction but rather safety prediction and precautions. In that case, the ship has a finite amount of energy that it can deploy (even when equipped with solar panels), while ML on the device (here, on the ship) requires additional power to perform the computations. This additional need, including the power to support data generation, should be accounted for.
- Additional storage: This depends on whether the solution runs on the device or in the cloud. For on-device learning involving huge data sets and complex problems, the storage space needed should be estimated beforehand.
- Availability of old data vs usability of current data: The availability of previous data strengthens prediction accuracy. The further back the data goes, the more reliable the prediction tends to be, provided the older data is still usable for the current problem.
- Wait time on the data: How long does it take for the data to arrive before we can start processing it?
- Secure data processing: Data in transit and at rest should always be encrypted. Purge data that is no longer needed. Has the life cycle of the data been determined?
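To make the federated learning point in the list above a little more concrete, here is a minimal Python sketch of the idea that models are trained at each source and only the models travel to a central system. Everything in it is an assumption for illustration: the one-parameter model `y = weight * x`, the function names `train_local`, `federated_average`, and `run_rounds`, and the two made-up data sources. It is a toy version of federated averaging, not a production setup.

```python
# Toy sketch of federated averaging. All names and data here are hypothetical.

def train_local(samples, weight, lr=0.05, steps=5):
    """Run a few gradient-descent steps for the model y = weight * x.

    This runs at the data source; the raw samples never leave it.
    """
    n = len(samples)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to weight.
        grad = 2 * sum((weight * x - y) * x for x, y in samples) / n
        weight -= lr * grad
    return weight, n


def federated_average(updates):
    """Server side: average local weights, weighted by local sample count."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def run_rounds(sources, rounds=10):
    """Alternate local training and server-side averaging."""
    global_weight = 0.0
    for _ in range(rounds):
        # Each source starts from the current global model and trains locally.
        updates = [train_local(samples, global_weight) for samples in sources]
        # Only (weight, sample count) pairs reach the server.
        global_weight = federated_average(updates)
    return global_weight


# Two made-up sources whose readings both follow y = 2 * x.
source_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
source_b = [(1.0, 2.0), (2.0, 4.0)]

learned = run_rounds([source_a, source_b])
print(f"learned weight: {learned:.4f}")
```

In this sketch the server only ever sees a weight and a sample count from each source, yet the learned weight approaches 2, the slope both sources share. Real deployments add concerns the sketch ignores, such as unreliable networks, sources with differing data, and secure transfer of the model updates.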
Share your thoughts on the ML journey.
Happy learning!
Originally published at http://shankarkumarasamy.blog on October 23, 2021.