Baseline Category in R
In categorical variables, the baseline category serves as the reference group against which the effects of other categories are measured. In R, when performing regression analysis with categorical variables, the first level of the factor is chosen as the baseline by default. However, this can be changed based on analytical needs or interpretability by releveling the factor so that another category serves as the reference.
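A minimal sketch of changing the baseline in R, assuming a made-up data frame with a factor `region` and outcome `y` (simulated here purely for illustration):

```r
set.seed(1)
df <- data.frame(
  region = factor(sample(c("East", "South", "West"), 100, replace = TRUE)),
  y      = rnorm(100)
)
fit1 <- lm(y ~ region, data = df)              # baseline is the first level ("East", alphabetical order)
df$region <- relevel(df$region, ref = "West")  # make "West" the reference category instead
fit2 <- lm(y ~ region, data = df)              # same model; coefficients are now measured against "West"
summary(fit2)
```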
Check for Linearity:
-> normality
-> multicollinearity
Check for Non-Linearity:
GLM assumptions:
-> Y is independent
-> Distribution is from the exponential family.
-> Linear relationship not required b/w Y and X.
-> Common variance is not required.
Maximum Likelihood:
-> coefficients are estimated by maximizing the likelihood function
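A hedged illustration of these two points: a Poisson GLM (one member of the exponential family) fit by maximum likelihood in R, using simulated data and made-up variable names:

```r
set.seed(2)
x <- runif(200)
y <- rpois(200, lambda = exp(0.5 + 1.2 * x))           # count response: clearly not normal
fit_glm <- glm(y ~ x, family = poisson(link = "log"))  # exponential-family model, fit by maximum likelihood
summary(fit_glm)
logLik(fit_glm)                                        # the maximized log-likelihood
```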
## Aside: generally, the accepted way to get rich is to use researched knowledge and experienced knowledge to generate a product that is useful, replicable, and reaches a large-scale population.
Survival Models:
Alternate Names: duration analysis, failure analysis
Event: patient died, person got a job, loan was repaid (always binary)
Time Scale: years, months, weeks, days
Origin of the event:
Why can't we use linear regression?
a) Dependent variable and residuals are not normally distributed:
the data are not normal; time of entry into the sample may follow a Poisson distribution; linear regression assumes Y has a continuous probability distribution.
b) Y may be censored (incomplete, time series, binary):
the patient hasn't died at the current time and we don't know whether they will die at a future time; subjects may have dropped out (or we stopped tracking them) before the study ended.
Example: How will you find out the probability of you not dying given that you have not died until time < t (given max life span is 100 years)?
Answer: P(die at time t | still alive at time t - epsilon) = P(die at time t) / P(survived up to time t)
Censored Data:
1) Patient died
2) Patient survived
3) Patient dropped out
4) Patient entered the study later
5) Patient died
6) Patient survived
Hazard Rate: the probability that the event happens at time t, given that it has not happened at any time < t.
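A short survival-analysis sketch, assuming the `survival` package is installed; it uses that package's built-in `lung` data set (follow-up time plus an event/censoring indicator), which is an assumption of convenience, not data from the class:

```r
library(survival)
# Surv() pairs each follow-up time with its event/censoring indicator
km <- survfit(Surv(time, status) ~ 1, data = lung)  # Kaplan-Meier estimate of the survival curve
summary(km, times = c(180, 365))                    # estimated survival probability at ~6 and 12 months
plot(km, xlab = "Days", ylab = "Estimated probability of survival")
```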
Popular Distributions:
1) Normal -> variance = sigma^2 | PDF (probability density function) = [1/(sigma*sqrt(2*pi))] * e^(-(x-mu)^2/(2*sigma^2))
2) Bernoulli -> single trial with only two outcomes (binary) -> yes/no -> PMF -> P(X=1) = p, P(X=0) = 1-p | variance = p(1-p), which by symmetry is largest (1/4) at p = 1/2
3) Binomial -> probability of observing a set of Bernoulli trials -> Pr(X=k) = (n choose k) p^k (1-p)^(n-k), where (n choose k) = n!/(k!(n-k)!) | parameters: n = total number of trials, p = probability of success; k = number of successes observed
4) Poisson: failures are independent and they don't happen at the same time (example: how many smoke detectors fail at my home at the same time?), assuming there are no other factors causing the failures, i.e. the failures are independent.
5) Uniform Distribution: any number between a and b is equally likely (the density is a flat line) | PMF = 1/n for a discrete uniform over n values; PDF = 1/(b-a) if a < x < b for the continuous case | E(X) = (a+b)/2 and Var = (b-a)^2/12 | the standard uniform is on (0, 1) | the density does not increase or decrease across the interval
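For reference, a quick sketch of how these densities/mass functions are evaluated in base R (the particular argument values are arbitrary examples):

```r
dnorm(0, mean = 0, sd = 1)         # normal density at x = 0 (about 0.399)
dbinom(1, size = 1, prob = 0.3)    # Bernoulli: P(X = 1) when p = 0.3
dbinom(3, size = 10, prob = 0.5)   # binomial: P(X = 3) with n = 10, p = 0.5
dpois(2, lambda = 1.5)             # Poisson: P(X = 2) with rate 1.5
dunif(0.25, min = 0, max = 1)      # uniform density on (0, 1), equal to 1 everywhere inside
```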
Can a Von Neumann type of computer generate truly random numbers? The answer is no: when we set a random seed (e.g. seed = 42), it always generates the same set of random numbers, because a pseudo-random generator uses a deterministic equation (built on the uniform distribution) and produces numbers based on the seed we assign.
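A two-line demonstration of that point in R:

```r
set.seed(42); runif(3)   # three pseudo-random uniform draws
set.seed(42); runif(3)   # resetting the seed reproduces exactly the same three numbers
```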
RGTI -> quantum computing / blockchain company (aside).
Talking about vacation next week -> we want to
# missed the first 5 minutes of the class
Original Notes:
The baseline category in R: when producing the linear model (regression), R chooses the first factor level as the baseline, but we can change the baseline per our needs.
Transformations: Log and exponential models
Baseline interpretations: each category's coefficient is read as the related increase/decrease in magnitude relative to the baseline category.
We apply the log function to the independent variable to linearize the data (for QQ plot).
QQ Plot: Normality Check: heavy tails on the left and right indicate outliers. If the values of the QQ plot lie on the same straight line, then the data is normal.
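A sketch of this check in R with simulated data (variable names are made up); the log transform from the note above is applied to the response here rather than the predictor, purely for illustration:

```r
set.seed(3)
x <- runif(200)
y <- exp(1 + 0.8 * x + rnorm(200, sd = 0.3))            # skewed, non-normal response
fit_raw <- lm(y ~ x)
qqnorm(residuals(fit_raw)); qqline(residuals(fit_raw))  # heavy tails: points bend away from the line
fit_log <- lm(log(y) ~ x)
qqnorm(residuals(fit_log)); qqline(residuals(fit_log))  # after the log transform, points hug the line
```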
Model Comparison:
R^2: a weaker comparison tool: only R^2 values of the same kind of model on the same data should be compared. Never compare R^2 of two different models.
QQ Plot: also a weaker comparison tool.
Universal model comparison is done by AIC and BIC.
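A minimal AIC/BIC comparison in R, on simulated data with two hypothetical candidate models:

```r
set.seed(4)
x <- runif(150)
y <- 2 + 3 * x + rnorm(150)
fit_a <- lm(y ~ x)
fit_b <- lm(y ~ poly(x, 2))
AIC(fit_a, fit_b)   # lower AIC is preferred
BIC(fit_a, fit_b)   # BIC penalizes extra parameters more strongly
```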
Interpretation of LOG and Exponential Model:
-> Both models are used to make the data look more linear.
-> Log model -> the increase or decrease is interpreted in terms of a 1% change
-> Exponential model -> interpreted per 1-unit change
log(y) = a + b*log(x)
Interpretation: if I increase x by 1%, y increases by roughly b%. Example: demand/supply elasticity curve.
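A small log-log (elasticity) sketch in R, using simulated data where the true elasticity is set to about -1.3:

```r
set.seed(5)
x <- runif(300, min = 1, max = 10)
y <- exp(0.5 - 1.3 * log(x) + rnorm(300, sd = 0.1))  # true elasticity of roughly -1.3
fit_ll <- lm(log(y) ~ log(x))
coef(fit_ll)   # slope near -1.3: a 1% rise in x changes y by about -1.3%
```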
Feature Engineering: required when log and exponential alone do not work, i.e. do not make the data linear on the QQ plot.
-> make new predictors (feature engineering)
-> Machine learning models deal with nonlinearity more naturally.
Note: we are trying to linearize the data (through transformations) otherwise the assumptions will fail.
Heteroscedasticity Assumption: if the homoscedasticity assumption fails, the standard errors would be biased. Do we know omega in practice? No.
So, we assume an omega and see how the model works. We try to infer omega based on how the data responds.
Robust Standard Errors: can only fix the variance (standard errors); they don't fix the bias.
OLS (Ordinary Least Squares) and HC (Heteroscedasticity-Consistent) regression are both statistical methods used in econometrics and statistics for estimating the parameters of a linear regression model. Let's break down what each of these terms means and how they are applied:
OLS (Ordinary Least Squares) Regression
OLS regression is the most common method used for linear regression analysis. It aims to find the line (or hyperplane in higher dimensions) that best fits a set of data points. The "best fit" is determined by minimizing the sum of the squared differences between the observed values and the values predicted by the linear model. This method assumes that the errors (residuals) between the observed values and the model's predicted values are homoscedastic, meaning they have the same variance across all levels of the independent variables.
It is not uncommon to see better errors (R^2) from OLS than from HC models with robust standard errors.
STEPS: Estimate the basic OLS.
Then obtain the fitted values from the least squares fit and compute the weights.
Pass the weights to a new lm() call.
Weighted Least Squares (WLS): the procedure above, with weights built from the fitted squared residuals.
Generalized Least Squares (GLS): is more general; when omega is the identity (constant variance), it becomes OLS.
Flexible GLS: reduces the steps of OLS (a + bx). If you nail the right omega function, this method works very well.
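A hedged sketch of that two-step recipe in R (one common feasible-GLS variant; the variance model for the squared residuals and all variable names are assumptions made for illustration):

```r
set.seed(6)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 + 2 * x)  # error variance grows with x (heteroscedastic)
ols <- lm(y ~ x)                               # step 1: basic OLS
aux <- lm(log(residuals(ols)^2) ~ x)           # step 2: model the squared residuals to estimate the variance function
w   <- 1 / exp(fitted(aux))                    # weights = inverse of the estimated variance
wls <- lm(y ~ x, weights = w)                  # step 3: pass the weights to a new lm() call
summary(wls)
```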
Transformations: Log and Exponential Models
Transformations such as log and exponential are used to linearize relationships between variables, making linear regression models more applicable when the original relationship is non-linear.
- Log Transformation: Applying the log function to one or more variables can help in stabilizing variance and making the relationship between variables more linear. For example, a log transformation of the independent variable (log(x)) is useful when dealing with multiplicative effects.
- Exponential Transformation: An exponential transformation might be applied to the dependent variable to model exponential growth or decay processes. The exponential model can describe how changes in the independent variable have multiplicative effects on the dependent variable.
QQ Plot: Normality Check
A QQ (Quantile-Quantile) plot is a graphical tool to assess if a dataset follows a particular distribution, such as the normal distribution. If the points in a QQ plot lie roughly along a straight line, the data is considered to follow that distribution. Heavy tails suggest the presence of outliers or deviations from the assumed distribution.
Model Comparison
- R² (R-squared): A measure of the proportion of variance in the dependent variable that is predictable from the independent variables. It is used for comparing the goodness of fit for different models on the same data. Comparing R² across models with different dependent variables or datasets is not appropriate.
- AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): Both are used for model selection among a finite set of models. They take into account the goodness of fit of the model and the complexity of the model, helping to balance between overfitting and underfitting.
- Log Model: The coefficient in a log-transformed regression can be interpreted as the percentage change in the dependent variable for a one percent change in the independent variable, holding other variables constant.
- Exponential Model: In an exponential model, the coefficient can be interpreted in terms of multiplicative effects on the dependent variable for a one-unit change in the independent variable.
- Heteroscedasticity Assumption Failure: Heteroscedasticity occurs when the variance of the error terms varies across levels of an independent variable, violating one of the key OLS assumptions. This can lead to biased estimates of standard errors, affecting confidence intervals and hypothesis tests.
- Robust Standard Errors: These are adjusted standard errors that account for heteroscedasticity, providing more reliable hypothesis testing. However, they correct only the standard errors for inconsistency; they do not correct bias in the coefficient estimates themselves.
- Omega (Ω) in Practice: In the context of heteroscedasticity, Ω represents the true variance-covariance matrix of the error terms, which is rarely known in practice. Various techniques, including robust standard errors and heteroscedasticity-consistent (HC) estimators, are used to approximate Ω without explicitly knowing it.
Key Characteristics of OLS:
- Assumes homoscedasticity (constant variance of errors).
- The estimates are obtained by minimizing the sum of squared residuals.
- Under the Gauss-Markov theorem, OLS estimators are the Best Linear Unbiased Estimators (BLUE) if the assumptions hold, including linearity, independence, and homoscedasticity of errors.
Key Characteristics of HC:
- Does not assume constant variance of errors across observations.
- Adjusts the standard errors of the OLS estimates to be consistent in the presence of heteroscedasticity.
- There are several versions of HC standard errors (e.g., HC0, HC1, HC2, HC3), with different adjustments for small sample sizes or other concerns.
- The regression model can be estimated using OLS to obtain parameter estimates.
- Then, HC standard errors are calculated to correct for heteroscedasticity, allowing for more reliable hypothesis testing and confidence intervals.
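A sketch of HC (robust) standard errors in R, assuming the `sandwich` and `lmtest` packages are available (simulated data, made-up names):

```r
library(sandwich)   # provides vcovHC()
library(lmtest)     # provides coeftest()
set.seed(7)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 + 2 * x)    # heteroscedastic errors
fit <- lm(y ~ x)                                  # coefficients come from plain OLS
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # same coefficients, HC1-corrected standard errors
```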