Baseline Category in R
In categorical variables, the baseline category serves as the reference group against which the effects of other categories are measured. In R, when performing regression analysis with categorical variables, the first level of the factor is chosen as the baseline by default. However, this can be changed based on analytical needs or interpretability by releveling the factor so that another category serves as the reference.
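A minimal sketch of changing the baseline in R, assuming a made-up data frame with a factor `region` and outcome `y` (simulated here purely for illustration):

```r
set.seed(1)
df <- data.frame(
  region = factor(sample(c("East", "South", "West"), 100, replace = TRUE)),
  y      = rnorm(100)
)
fit1 <- lm(y ~ region, data = df)              # baseline is the first level ("East", alphabetical order)
df$region <- relevel(df$region, ref = "West")  # make "West" the reference category instead
fit2 <- lm(y ~ region, data = df)              # same model; coefficients are now measured against "West"
summary(fit2)
```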
Check for Linearity:
-> normality
-> multicollinearity
Check for Non-Linearity:
GLM assumptions:
-> Y is independent
-> Distribution is from the exponential family.
-> Linear relationship not required b/w Y and X.
-> Common variance is not required.
Maximum Likelihood:
-> coefficients are estimated by maximizing the likelihood function
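A hedged illustration of these two points: a Poisson GLM (one member of the exponential family) fit by maximum likelihood in R, using simulated data and made-up variable names:

```r
set.seed(2)
x <- runif(200)
y <- rpois(200, lambda = exp(0.5 + 1.2 * x))           # count response: clearly not normal
fit_glm <- glm(y ~ x, family = poisson(link = "log"))  # exponential-family model, fit by maximum likelihood
summary(fit_glm)
logLik(fit_glm)                                        # the maximized log-likelihood
```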
## Aside: generally, the accepted way to get rich is to use researched knowledge and experienced knowledge to generate a product that is useful, replicable, and reaches a large-scale population.
Survival Models:
Alternate Names: duration analysis, failure analysis
Event: patient died, person got a job, loan was repaid (always binary)
Time Scale: years, months, weeks, days
Origin of the event:
Why can't we use linear regression?
a) Dependent variable and residuals are not normally distributed:
the data are not normal; time of entry into the sample may follow a Poisson distribution; linear regression assumes Y has a continuous probability distribution.
b) Y may be censored (incomplete, time series, binary):
the patient hasn't died at the current time and we don't know whether they will die at a future time; subjects may have dropped out (or we stopped tracking them) before the study ended.
Example: How will you find out the probability of you not dying given that you have not died until time < t (given max life span is 100 years)?
Answer: P(die at time t | still alive at time t - epsilon) = P(die at time t) / P(survived up to time t)
Censored Data:
1) Patient died
2) Patient survived
3) Patient dropped out
4) Patient entered the study later
5) Patient died
6) Patient survived
Hazard Rate: the probability that the event happens at time t, given that it has not happened at any time < t.
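A short survival-analysis sketch, assuming the `survival` package is installed; it uses that package's built-in `lung` data set (follow-up time plus an event/censoring indicator), which is an assumption of convenience, not data from the class:

```r
library(survival)
# Surv() pairs each follow-up time with its event/censoring indicator
km <- survfit(Surv(time, status) ~ 1, data = lung)  # Kaplan-Meier estimate of the survival curve
summary(km, times = c(180, 365))                    # estimated survival probability at ~6 and 12 months
plot(km, xlab = "Days", ylab = "Estimated probability of survival")
```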
Popular Distributions:
1) Normal -> variance = sigma^2 | PDF (probability density function) = [1/(sigma*sqrt(2*pi))] * e^(-(x-mu)^2/(2*sigma^2))
2) Bernoulli -> single trial with only two outcomes (binary) -> yes/no -> PMF -> P(X=1) = p, P(X=0) = 1-p | variance = p(1-p), which by symmetry is largest (1/4) at p = 1/2
3) Binomial -> probability of observing a set of Bernoulli trials -> Pr(X=k) = (n choose k) p^k (1-p)^(n-k), where (n choose k) = n!/(k!(n-k)!) | parameters: n = total number of trials, p = probability of success; k = number of successes observed
4) Poisson: failures are independent and they don't happen at the same time (example: how many smoke detectors fail at my home at the same time?), assuming there are no other factors causing the failures, i.e. the failures are independent.
5) Uniform Distribution: any number between a and b is equally likely (the density is a flat line) | PMF = 1/n for a discrete uniform over n values; PDF = 1/(b-a) if a < x < b for the continuous case | E(X) = (a+b)/2 and Var = (b-a)^2/12 | the standard uniform is on (0, 1) | the density does not increase or decrease across the interval
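For reference, a quick sketch of how these densities/mass functions are evaluated in base R (the particular argument values are arbitrary examples):

```r
dnorm(0, mean = 0, sd = 1)         # normal density at x = 0 (about 0.399)
dbinom(1, size = 1, prob = 0.3)    # Bernoulli: P(X = 1) when p = 0.3
dbinom(3, size = 10, prob = 0.5)   # binomial: P(X = 3) with n = 10, p = 0.5
dpois(2, lambda = 1.5)             # Poisson: P(X = 2) with rate 1.5
dunif(0.25, min = 0, max = 1)      # uniform density on (0, 1), equal to 1 everywhere inside
```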
Can a Von Neumann type of computer generate truly random numbers? The answer is no: when we set a random seed (e.g. seed = 42), it always generates the same set of random numbers, because a pseudo-random generator uses a deterministic equation (built on the uniform distribution) and produces numbers based on the seed we assign.
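A two-line demonstration of that point in R:

```r
set.seed(42); runif(3)   # three pseudo-random uniform draws
set.seed(42); runif(3)   # resetting the seed reproduces exactly the same three numbers
```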
RGTI -> quantum computing / blockchain company (aside).
Talking about vacation next week -> we want to
# missed the first 5 minutes of the class
Original Notes:
The baseline category in R: when producing the linear model (regression), R chooses the first factor level as the baseline, but we can change the baseline per our needs.
Transformations: Log and exponential models
Baseline interpretations: each category's coefficient is read as the related increase/decrease in magnitude relative to the baseline category.
We apply the log function to the independent variable to linearize the data (for QQ plot).
QQ Plot: Normality Check: heavy tails on the left and right indicate outliers. If the values of the QQ plot lie on the same straight line, then the data is normal.
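A sketch of this check in R with simulated data (variable names are made up); the log transform from the note above is applied to the response here rather than the predictor, purely for illustration:

```r
set.seed(3)
x <- runif(200)
y <- exp(1 + 0.8 * x + rnorm(200, sd = 0.3))            # skewed, non-normal response
fit_raw <- lm(y ~ x)
qqnorm(residuals(fit_raw)); qqline(residuals(fit_raw))  # heavy tails: points bend away from the line
fit_log <- lm(log(y) ~ x)
qqnorm(residuals(fit_log)); qqline(residuals(fit_log))  # after the log transform, points hug the line
```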
Model Comparison:
R^2: a weaker comparison tool: only R^2 values of the same kind of model on the same data should be compared. Never compare R^2 of two different models.
QQ Plot: also a weaker comparison tool.
Universal model comparison is done by AIC and BIC.
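A minimal AIC/BIC comparison in R, on simulated data with two hypothetical candidate models:

```r
set.seed(4)
x <- runif(150)
y <- 2 + 3 * x + rnorm(150)
fit_a <- lm(y ~ x)
fit_b <- lm(y ~ poly(x, 2))
AIC(fit_a, fit_b)   # lower AIC is preferred
BIC(fit_a, fit_b)   # BIC penalizes extra parameters more strongly
```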
Interpretation of LOG and Exponential Model:
-> Both models are used to make the data look more linear.
-> Log model -> the increase or decrease is interpreted in terms of a 1% change
-> Exponential model -> interpreted per 1-unit change
log(y) = a + b*log(x)
Interpretation: if I increase x by 1%, y increases by roughly b%. Example: demand/supply elasticity curve.
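A small log-log (elasticity) sketch in R, using simulated data where the true elasticity is set to about -1.3:

```r
set.seed(5)
x <- runif(300, min = 1, max = 10)
y <- exp(0.5 - 1.3 * log(x) + rnorm(300, sd = 0.1))  # true elasticity of roughly -1.3
fit_ll <- lm(log(y) ~ log(x))
coef(fit_ll)   # slope near -1.3: a 1% rise in x changes y by about -1.3%
```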
Feature Engineering: required when log and exponential alone do not work, i.e. do not make the data linear on the QQ plot.
-> make new predictors (feature engineering)
-> Machine learning models deal with nonlinearity more naturally.
Note: we are trying to linearize the data (through transformations) otherwise the assumptions will fail.
Heteroscedasticity Assumption: if the homoscedasticity assumption fails, the standard errors would be biased. Do we know omega in practice? No.
So, we assume an omega and see how the model works. We try to infer omega based on how the data responds.
Robust Standard Errors: can only fix the variance (standard errors); they don't fix the bias.
OLS (Ordinary Least Squares) and HC (Heteroscedasticity-Consistent) regression are both statistical methods used in econometrics and statistics for estimating the parameters of a linear regression model. Let's break down what each of these terms means and how they are applied:
OLS (Ordinary Least Squares) Regression
OLS regression is the most common method used for linear regression analysis. It aims to find the line (or hyperplane in higher dimensions) that best fits a set of data points. The "best fit" is determined by minimizing the sum of the squared differences between the observed values and the values predicted by the linear model. This method assumes that the errors (residuals) between the observed values and the model's predicted values are homoscedastic, meaning they have the same variance across all levels of the independent variables.
It is not uncommon to see better errors (R^2) from OLS than from HC models with robust standard errors.
STEPS: Estimate the basic OLS.
Then obtain the fitted values from the least squares fit and compute the weights.
Pass the weights to a new lm() call.
Weighted Least Squares (WLS): the procedure above, with weights built from the fitted squared residuals.
Generalized Least Squares (GLS): is more general; when omega is the identity (constant variance), it becomes OLS.
Flexible GLS: reduces the steps of OLS (a + bx). If you nail the right omega function, this method works very well.
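A hedged sketch of that two-step recipe in R (one common feasible-GLS variant; the variance model for the squared residuals and all variable names are assumptions made for illustration):

```r
set.seed(6)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 + 2 * x)  # error variance grows with x (heteroscedastic)
ols <- lm(y ~ x)                               # step 1: basic OLS
aux <- lm(log(residuals(ols)^2) ~ x)           # step 2: model the squared residuals to estimate the variance function
w   <- 1 / exp(fitted(aux))                    # weights = inverse of the estimated variance
wls <- lm(y ~ x, weights = w)                  # step 3: pass the weights to a new lm() call
summary(wls)
```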
Transformations: Log and Exponential Models
Transformations such as log and exponential are used to linearize relationships between variables, making linear regression models more applicable when the original relationship is non-linear.
- Log Transformation: Applying the log function to one or more variables can help in stabilizing variance and making the relationship between variables more linear. For example, a log transformation of the independent variable (log(x)) is useful when dealing with multiplicative effects.
- Exponential Transformation: An exponential transformation might be applied to the dependent variable to model exponential growth or decay processes. The exponential model can describe how changes in the independent variable have multiplicative effects on the dependent variable.
QQ Plot: Normality Check
A QQ (Quantile-Quantile) plot is a graphical tool to assess if a dataset follows a particular distribution, such as the normal distribution. If the points in a QQ plot lie roughly along a straight line, the data is considered to follow that distribution. Heavy tails suggest the presence of outliers or deviations from the assumed distribution.
Model Comparison
- R² (R-squared): A measure of the proportion of variance in the dependent variable that is predictable from the independent variables. It is used for comparing the goodness of fit for different models on the same data. Comparing R² across models with different dependent variables or datasets is not appropriate.
- AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): Both are used for model selection among a finite set of models. They take into account the goodness of fit of the model and the complexity of the model, helping to balance between overfitting and underfitting.
- Log Model: The coefficient in a log-transformed regression can be interpreted as the percentage change in the dependent variable for a one percent change in the independent variable, holding other variables constant.
- Exponential Model: In an exponential model, the coefficient can be interpreted in terms of multiplicative effects on the dependent variable for a one-unit change in the independent variable.
- Heteroscedasticity Assumption Failure: Heteroscedasticity occurs when the variance of the error terms varies across levels of an independent variable, violating one of the key OLS assumptions. This can lead to biased estimates of standard errors, affecting confidence intervals and hypothesis tests.
- Robust Standard Errors: These are adjusted standard errors that account for heteroscedasticity, providing more reliable hypothesis testing. However, they correct only the standard errors for inconsistency; they do not correct bias in the coefficient estimates themselves.
- Omega (Ω) in Practice: In the context of heteroscedasticity, Ω represents the true variance-covariance matrix of the error terms, which is rarely known in practice. Various techniques, including robust standard errors and heteroscedasticity-consistent (HC) estimators, are used to approximate Ω without explicitly knowing it.
Key Characteristics of OLS:
- Assumes homoscedasticity (constant variance of errors).
- The estimates are obtained by minimizing the sum of squared residuals.
- Under the Gauss-Markov theorem, OLS estimators are the Best Linear Unbiased Estimators (BLUE) if the assumptions hold, including linearity, independence, and homoscedasticity of errors.
Key Characteristics of HC:
- Does not assume constant variance of errors across observations.
- Adjusts the standard errors of the OLS estimates to be consistent in the presence of heteroscedasticity.
- There are several versions of HC standard errors (e.g., HC0, HC1, HC2, HC3), with different adjustments for small sample sizes or other concerns.
- The regression model can be estimated using OLS to obtain parameter estimates.
- Then, HC standard errors are calculated to correct for heteroscedasticity, allowing for more reliable hypothesis testing and confidence intervals.
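A sketch of HC (robust) standard errors in R, assuming the `sandwich` and `lmtest` packages are available (simulated data, made-up names):

```r
library(sandwich)   # provides vcovHC()
library(lmtest)     # provides coeftest()
set.seed(7)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 + 2 * x)    # heteroscedastic errors
fit <- lm(y ~ x)                                  # coefficients come from plain OLS
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # same coefficients, HC1-corrected standard errors
```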