Machine Learning - By Dr. Vayrol Kayhan
Ways to Handle Overfitting:
Text Mining:
-> Model is agnostic to language. This means, the model would even understand and learn the Chinese language if you ask to learn it. It just relates the associated text to the target variable.
-> The more numbers of columns (features) you have -> the models may perform worse. Lesser the features the easier it is for the model to learn.
-> It could be that you might be providing useless columns for the model to learn. The lesser number of features is better for algorithms.
-> Sometimes large number of SVDs can improve the accuracy.
RNN: Recurrent Neural Network is. Neural network is resemblance of human neurons. Many AI models these days are powered by Neural Network. It mimics the human brain. Each biological neuron is connected to other biological neuron. Input Layers and Output layers. Input layers count is dependent on the number of features (columns) in the model. We randomly initialize the arrows, we call them weights and biases.
We compare the predicted sum with actual (while learning). Then go back and fine tune it. If the target variable is binary class: then only one neuron in the output variable is required. However, if the target variable is multi-class, the number of the target variable matches the number of class in the model.
Parameters: Learning rate, alpha, etc.
Training using Tensor Flow or Scikit Learn. Deep learning is error prone process. If you make one mistake, you would be skewing your results erroneously. Deep nural networks are usually used for image learning processes. Making things deep does not help much with tabular data sets. Neural networks are like backbox.
- Train-Test Split: Ensure that your data is properly split into training and test sets. This helps in evaluating the model on unseen data, which is crucial for detecting overfitting. In the tutorial, a 70-30 split ratio is used (see the combined scikit-learn sketch after this list).
- Feature Selection: Overfitting can occur when the model is too complex with too many features. Consider reducing the number of features by removing irrelevant or less important ones. In the tutorial, irrelevant columns are dropped, and the focus is on relevant features.
- Regularization: Apply regularization techniques to penalize complex models. The tutorial discusses using L1 (Lasso), L2 (Ridge), and Elastic Net regularization methods. These techniques add a penalty to the loss function, which helps in reducing overfitting by keeping the weights small.
- Cross-Validation: Implement cross-validation to assess how the results of your statistical analysis will generalize to an independent dataset. Cross-validation involves dividing the dataset into multiple parts, training the model on some parts, and validating on others.
- Simplifying the Model: Sometimes, simpler models are less prone to overfitting. If you're using a very complex model, consider using a less complex one.
- Early Stopping: In iterative models, stop training as soon as the performance on a validation set starts to degrade. This technique is particularly useful in neural networks or gradient boosting models.
- Pruning Decision Trees: If using decision trees, pruning can help reduce overfitting by cutting back the branches of the tree.
- Increasing Training Data: More data can help the model generalize better. If possible, try to increase the size of your training dataset.
- Ensemble Methods: Use ensemble methods like bagging or boosting, which combine the predictions from multiple models to improve generalization.
- Hyperparameter Tuning: Carefully tune the hyperparameters of your model. This can involve techniques like grid search or random search to find the optimal settings.
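A minimal scikit-learn sketch of several of the techniques above (the 70-30 split, L1/L2 regularization, cross-validation, early stopping in gradient boosting, and grid search). The synthetic dataset, parameter values, and model choices are illustrative assumptions, not the tutorial's exact code.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a tabular dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 70-30 train-test split, as in the tutorial
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# L1 (Lasso-style) and L2 (Ridge-style) regularization: smaller C = stronger penalty = smaller weights
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l2_model = LogisticRegression(penalty="l2", C=0.1)

# 5-fold cross-validation on the training data only
print("CV accuracy:", cross_val_score(l2_model, X_train, y_train, cv=5).mean())

# Early stopping in gradient boosting: stop adding trees once the
# held-out validation score has not improved for 10 rounds
gb = GradientBoostingClassifier(n_estimators=500, validation_fraction=0.2, n_iter_no_change=10)
gb.fit(X_train, y_train)

# Grid search over hyperparameters (the grid values are just examples)
grid = GridSearchCV(LogisticRegression(solver="liblinear"),
                    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Held-out test accuracy:", grid.score(X_test, y_test))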
Text Mining:
-> The model is agnostic to language. It would learn from Chinese text just as readily as English if you asked it to, because it simply relates the associated text to the target variable.
-> The more columns (features) you have, the worse the models may perform. The fewer the features, the easier it is for the model to learn.
-> You may be providing useless columns for the model to learn from; a smaller number of features is generally better for the algorithms.
-> Sometimes a larger number of SVD components can improve accuracy (see the sketch below).
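A minimal sketch of this workflow, assuming the text lives in a hypothetical "review_text" column with a binary "label" target. TF-IDF turns the raw text into numeric columns and TruncatedSVD reduces them to a chosen number of components (the "SVDs" above); the column names and values are illustrative, not from the course materials.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "review_text": ["great product", "terrible quality", "loved it", "broke fast"],
    "label": [1, 0, 1, 0],
})

# The pipeline never "reads" the language; it only relates token columns to the
# target, so the same code works for English, Chinese, or any other language.
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),   # number of SVD components: tune it, more is not always better
    LogisticRegression(),
)
print(cross_val_score(pipe, df["review_text"], df["label"], cv=2).mean())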
RNN: Recurrent Neural Network. A neural network is modeled on human neurons: many AI models today are powered by neural networks, which loosely mimic the brain, where each biological neuron is connected to other neurons. There are input layers and output layers; the number of input nodes depends on the number of features (columns) in the model. The connecting arrows are initialized randomly; we call them weights and biases.
While learning, we compare the predicted output with the actual value, then go back and fine-tune the weights. If the target variable is binary, only one neuron is required in the output layer; if the target variable is multi-class, the number of output neurons matches the number of classes.
Parameters: learning rate, alpha, etc.
Training is done using TensorFlow or scikit-learn. Deep learning is an error-prone process: one mistake can badly skew your results. Deep neural networks are usually used for image-learning tasks; making networks deeper does not help much with tabular datasets. Neural networks are like a black box.
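A minimal TensorFlow/Keras sketch of the points above: one input node per feature column, randomly initialized weights and biases, and an output layer with a single sigmoid neuron for a binary target (or one softmax neuron per class for a multi-class target). The layer sizes, learning rate, and random data are illustrative assumptions.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 10
X = np.random.rand(500, n_features)          # stand-in for a tabular feature matrix
y = np.random.randint(0, 2, size=500)        # binary target

model = keras.Sequential([
    keras.Input(shape=(n_features,)),        # input layer: one node per feature column
    layers.Dense(16, activation="relu"),     # hidden layer; weights and biases start random
    layers.Dense(1, activation="sigmoid"),   # binary target -> a single output neuron
    # for a k-class target, use layers.Dense(k, activation="softmax") instead
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),  # learning rate is a key hyperparameter
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)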
AUTOENCODERS: DON'T NEED A TARGET VARIABLE: they are unsupervised learning. An autoencoder is a neural network with the same number of nodes in the input layer and the output layer. It is good for fraud detection at credit card companies, where conventional models may not work well: 99.9% of the data is not fraud, so other models can show 99.9% accuracy while missing the tiny fraction of records that are fraud, which is exactly what we want to detect. The autoencoder learns from the normal dataset and reconstructs it at the output layer. Once the network has learned the dataset, the layers, weights, and biases are frozen. When we then pass a fraudulent record through, the reconstruction at the output layer has a lot of error, which indicates the data is potentially fraudulent. The output layer is used to reconstruct the input layer, and at that point the reconstruction error is calculated: it is small for legitimate data. Each output node's value is compared with the corresponding input node's value. We use a bottleneck to prevent the autoencoder from cheating (simply copying the input).
If we have very few labels (e.g., for a multi-class target variable), autoencoders can also be used to support supervised learning. There are convolutional autoencoders for images: the encoder increases the depth of the feature maps and reduces the height and width using pooling layers, and the decoder upscales the image back to its original size while reducing the feature depth. There are RNN autoencoders for time-series data. There is also a way to avoid a bottleneck: add noise to the input, have the network produce the output from the noisy input, and compare it back to the clean input to see whether the noise has been removed while learning the data.
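A minimal Keras sketch of the fraud-detection idea above: train an autoencoder with a bottleneck on normal data only, freeze it, and flag rows whose reconstruction error is unusually large. The layer sizes, the 95th-percentile threshold, and the synthetic data are illustrative assumptions.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 20
normal = np.random.normal(0, 1, size=(2000, n_features))   # stand-in for legitimate transactions

autoencoder = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(8, activation="relu"),   # bottleneck: fewer nodes than the input, so it cannot simply copy ("cheat")
    layers.Dense(n_features),             # output layer: same number of nodes as the input layer
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(normal, normal, epochs=10, batch_size=64, verbose=0)  # note: target = input

def reconstruction_error(X):
    # Mean squared difference between each input row and its reconstruction
    recon = autoencoder.predict(X, verbose=0)
    return np.mean((X - recon) ** 2, axis=1)

threshold = np.percentile(reconstruction_error(normal), 95)
suspicious = np.random.normal(5, 1, size=(10, n_features))  # shifted rows standing in for fraud
print(reconstruction_error(suspicious) > threshold)          # large errors -> flagged as potential fraud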
Generative Adversarial Network (GAN): There is a generator and a discriminator. The generator generates something, but the discriminator knows how the actual output should look and sends feedback to the generator to fix it. The generator fixes it, and this loop continues until the discriminator can no longer provide useful feedback. Note that the generator does not see the information the discriminator has while generating. The generator produces data that looks like the training data: generated data is completely fake and labeled 0, while samples taken from the real data are labeled 1 (this is how the discriminator is trained). Over time the fake data starts looking like real data. This is how today's image generators, such as DALL-E, work (a minimal training-loop sketch appears after the list below).
Difficulties in training:
1) It may be a zero-sum game: the parameters become unstable as each side races to outsmart the other (generator vs. discriminator).
2) Nash equilibrium: neither side changes its strategy unless the other changes theirs (generator vs. discriminator).
3) Mode collapse: in multi-class tasks, the generator may forget a previously learned category after learning a new one, and training can collapse.
This way new content is generated.
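A minimal TensorFlow training-loop sketch of the generator/discriminator game described above, using a toy 2-D "real" distribution instead of images. All sizes, learning rates, and the toy data are illustrative assumptions.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim, batch = 8, 64
real_data = np.random.normal(3.0, 0.5, size=(1000, 2)).astype("float32")  # toy stand-in for training data

# Generator: random noise -> fake sample; Discriminator: sample -> probability it is real
generator = keras.Sequential([keras.Input(shape=(latent_dim,)),
                              layers.Dense(16, activation="relu"),
                              layers.Dense(2)])
discriminator = keras.Sequential([keras.Input(shape=(2,)),
                                  layers.Dense(16, activation="relu"),
                                  layers.Dense(1, activation="sigmoid")])
bce = keras.losses.BinaryCrossentropy()
g_opt, d_opt = keras.optimizers.Adam(1e-3), keras.optimizers.Adam(1e-3)

for step in range(2000):
    noise = tf.random.normal((batch, latent_dim))
    real = real_data[np.random.randint(0, len(real_data), batch)]

    # 1) Discriminator step: real samples are labeled 1, generated samples 0
    with tf.GradientTape() as tape:
        fake = generator(noise, training=False)
        d_loss = (bce(tf.ones((batch, 1)), discriminator(real, training=True)) +
                  bce(tf.zeros((batch, 1)), discriminator(fake, training=True)))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))

    # 2) Generator step: try to make the discriminator output "real" (1) for fakes;
    #    the generator never sees the real data directly, only this feedback
    with tf.GradientTape() as tape:
        fake = generator(noise, training=True)
        g_loss = bce(tf.ones((batch, 1)), discriminator(fake, training=False))
    g_opt.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))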
Feature Engineering: this is a course in itself. Before including each column, always ask whether it is important. The idea is that you may be able to come up with an interesting new column, for example the ratio of the mother's and father's education levels for each kid, or a column that differentiates drinkers from non-drinkers based on your learning and guesses, such as coding people under age 18 as 0 and people 18 and over as 1, which may indicate who is more likely to drink and drive (see the sketch below).
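A small pandas sketch of the two examples above. The column names (mother_edu, father_edu, age) and the age-18 cutoff are hypothetical, not from any specific dataset.

import pandas as pd

df = pd.DataFrame({
    "mother_edu": [12, 16, 10],   # years of education (hypothetical)
    "father_edu": [16, 12, 10],
    "age": [17, 25, 40],
})

# New column: ratio of the parents' education levels
df["parent_edu_ratio"] = df["mother_edu"] / df["father_edu"]

# New column: 0 if under 18, 1 otherwise, as a rough guess at who might drink
df["age_18_plus"] = (df["age"] >= 18).astype(int)
print(df)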
GPUs are much faster than CPUs at training machine learning models, especially deep neural networks.
Early stopping is a technique that halts training once performance on a validation set stops improving (see the overfitting list above).
GRU cells: much simpler than LSTM cells, and they work really well (see the sketch below).
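A minimal Keras sketch showing a GRU layer on sequence data (100 toy sequences of 20 time steps with 4 features each; all shapes and sizes are illustrative).

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(100, 20, 4)            # (sequences, time steps, features)
y = np.random.randint(0, 2, size=100)

model = keras.Sequential([
    keras.Input(shape=(20, 4)),
    layers.GRU(32),                        # GRU cell: fewer gates than an LSTM, often just as accurate
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)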
CONVOLUTIONAL NEURAL NETWORK (CNN): used for image and video processing. A black-and-white image has only one channel; a colored image has 3 channels (RGB). The other important thing about images is pixels: a size like 480x480x3 means 480 dots column-wise, 480 dots row-wise, and 3 channels (RGB). Each pixel has a value between 0 and 255. In a 28x28 = 784-pixel image of the digit '8', most locations are 0, and the values grow larger (denser) close to the strokes of the 8. One megapixel means roughly 1000 (row dots) x 1000 (column dots) = 1,000,000 pixels. We could send an image through a deep neural network by flattening it, putting all the rows side by side, so a megapixel image becomes on the order of a million columns. That means about a million input-layer nodes, and with many hidden layers this is computationally expensive for a deep neural network. Therefore, we use a CNN (convolutional neural network).
CNN: the visual cortex is made up of millions of neurons. Each neuron has a specific receptive field, i.e., a little area that that neuron actually sees; it does not see the whole picture. There are multiple layers of neurons: each lower layer picks up simple patterns and sends them to a higher layer, and this continues until we perceive the object. A convolutional layer slides a filter across the image (left to right, top to bottom): a vertical-line filter picks up all the vertical lines, and separately a horizontal-line filter picks up the horizontal lines. The outputs of these filters are called feature maps; they summarize where the horizontal lines, vertical lines, etc. appear in the image. We can have as many feature maps as we want: there can be feature maps for diagonals, edges, curved lines, and so on, all extracting features from a single image. We can use zero padding so that the corners of the image are captured as well. GoogLeNet is an example of this architecture: the last few layers are a deep (dense) neural network while the earlier layers are convolutional. ResNet does an even better job. Other examples: VGGNet, Inception-v4 (a combination of ResNet and GoogLeNet ideas).
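A minimal Keras CNN sketch for 28x28 single-channel digit images, following the description above: convolutional filters whose outputs are feature maps, "same" padding so the corners are kept, pooling to shrink the image, and a dense head for the final classification. The random data and layer sizes are illustrative assumptions (this is not GoogLeNet or ResNet).

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(100, 28, 28, 1)        # 100 fake 28x28 grayscale images (1 channel)
y = np.random.randint(0, 10, size=100)    # digit labels 0-9

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, kernel_size=3, padding="same", activation="relu"),  # 16 feature maps
    layers.MaxPooling2D(pool_size=2),                                     # 28x28 -> 14x14
    layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),                                     # 14x14 -> 7x7
    layers.Flatten(),                                                     # hand off to the dense layers
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),                               # one output neuron per digit class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)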