
A collection of DS blog posts, papers, etc that I want to keep track of. None of these are related to DS or data team leadership


ronikobrosly/misc-articles


misc-articles

A collection of blog posts, papers, videos, links, etc that I want to keep track of. None of these are related to data team leadership (see this repo for those links)

Also, here are fantastic awesome lists, aggregations, etc:

Topics

Career

Title Summary Year
8 Hard Truths I learned when I got laid off from my SWE job A set of fantastic "hard truths" related to being laid off. Includes topics like: "Getting laid off is a profoundly lonely experience", "It’s gonna take longer than you think", "Interview invites are a poor proxy for your desirability", "Honesty can only hurt you", and more. 2022
My questions for prospective employers (Director/VP roles) Covers questions that Director or VP-level candidates should ask of potential employers. These include, for example: What does success in the role look like? What’s my boss’s (or board’s) expectations for me? What’s the degree of managerial discretion in the role? 2019
When is short tenure a red flag Covers when short tenure is a red flag, how short is too short, how often one can change jobs, and under what conditions you should get a new job. This applies to less senior roles; there is a separate article on this topic for staff-level, director, or VP-level roles. 2022

Data Engineering

Title Summary Year
An Engineer's Guide to Data Contracts - Part 1 Goes over a technical implementation of a data contract. Specifically, a CDC-based (Change Data Capture) implementation of Entity-based Data Contracts, covering contract definition, schema enforcement, and fulfillment. 2022
Supercharge your data processing with DuckDB: Efficient & blazing fast SQL analytics in Pandas with DuckDB I had wondered for a while what the point of DuckDB is given the modern cloud stack, but I had misunderstood it. It's great for querying large local files that pandas would typically choke on. The article gives plenty of example code. 2022
The Beginner's Guide to Databases Covers the various types of DBs, and why you'd use each depending on the use case. 2023
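
To give a feel for what "contract definition" and "schema enforcement" mean in the data contracts entry above, here is a minimal standard-library sketch. It is not the CDC-based implementation from the linked article; the `ContractField` class, the `ORDER_CONTRACT` entity, and the `violations` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractField:
    """One field in a (hypothetical) entity contract."""
    name: str
    type_: type
    required: bool = True


# Hypothetical contract for an "order" entity; not taken from the article.
ORDER_CONTRACT = (
    ContractField("order_id", int),
    ContractField("customer_email", str),
    ContractField("coupon_code", str, required=False),
)


def violations(record, contract):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field in contract:
        value = record.get(field.name)
        if value is None:
            # Absent or null: only a problem when the field is required.
            if field.required:
                problems.append(f"missing required field: {field.name}")
            continue
        if not isinstance(value, field.type_):
            problems.append(
                f"{field.name}: expected {field.type_.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems
```

A producer could run a check like this before publishing a record and reject or quarantine anything that returns a non-empty list.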

Deep Learning

Title Summary Year
Do Large Language Models learn world models or just surface statistics? Explores whether large language models (LLMs) are able to learn the deeper meaning of language or just surface statistics. Describes an interesting analogy to a chess game with a crow that watches. Through a series of probe experiments, the authors suggest these LLMs do in fact learn the deeper meaning of language, AKA an inner world model. 2023
On the Opportunities and Risks of Foundation Models A large report by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence. Covers the current (as of 2021) capabilities of foundation models, their potential, their social implications, and their drawbacks. 2021
Building LLM applications for production Chip Huyen's great overview of the tech challenges of productionizing LLMs 2023

Experiments and Causality

Title Summary Year
A Simpler Alternative to X-Learner for Uplift Modeling In this post, Rob Donnelly describes an approach he calls simplified X-learner (Xs-learner) that is easier to understand, faster to implement, and, in his experience, often works as well as or better in practice than other meta-learners (S-learner, X-learner, etc). S-learner can be problematic because it biases effects towards zero. He provides Python code for this new approach. 2023
An introduction to g methods The use of "g methods" by epidemiologists has been hampered by limitations in understanding both conceptual and technical details. The authors present a simple worked example that illustrates basic concepts, while minimizing technical complications. Written by epidemiologists Ashley I Naimi, Stephen R Cole, and Edward H Kennedy. 2017
Getting to decisions faster in A/B tests – part 1: literature review Discusses 1) what teams do to get to decisions faster (i.e. dealing with the "peeking problem") and 2) alternatives that may be more intuitive than null-hypothesis testing (NHT). In future posts, the author will dive into the most interesting methods to understand what they promise and what they deliver. The author posits that methods for addressing early peeking may reduce the ability to detect small changes (i.e. power), but says this trade-off depends on your context: if you are at a maturity stage where A/B testing is about tiny incremental changes and you run hundreds of experiments simultaneously in a self-service manner, then avoiding false positives may matter more than missing one of those changes that mattered but went undetected. This post has so many excellent links. 2023
How Etsy Handles Peeking in A/B Testing Excellent write-up with links on Etsy's methods for dealing with the A/B testing peeking problem. Essentially they create a p-value threshold curve for any given experiment, and if the current p-value falls below that threshold, they can stop the experiment early. 2018
confseq: A python package for confidence sequences and uniform boundaries Documentation around "always-valid p-values". That is, no matter how many times you peek at the p-value, the results account for the inflated false-positive risk and the p-values remain valid. 2021
How to Double A/B Testing Speed with CUPED: Microsoft’s variance reduction that’s becoming industry standard. Very gentle introduction to the CUPED approach to A/B testing. Basically, you leverage pre-experiment data to reduce the variance of your outcome metric, so you need a smaller sample size. 2021
Causal Inference for the Brave and True An e-book that discusses the many techniques around causal inference 2022
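
To make the CUPED entry above concrete, here is a minimal sketch of the usual adjustment, `y_adj = y - theta * (x - mean(x))` with `theta = cov(x, y) / var(x)`, where `x` is a pre-experiment measurement of the same metric. The data is synthetic and the function name `cuped_adjust` is my own; this is not code from the linked article, and in a real test theta would typically be estimated on data pooled across both arms.

```python
import random
import statistics


def cuped_adjust(outcomes, pre_values):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x))."""
    mean_x = statistics.fmean(pre_values)
    mean_y = statistics.fmean(outcomes)
    # theta = cov(x, y) / var(x); the shared 1/(n-1) factors cancel out.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pre_values, outcomes))
    var_x = sum((x - mean_x) ** 2 for x in pre_values)
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(pre_values, outcomes)]


# Synthetic data (an assumption for illustration): the in-experiment metric
# is strongly correlated with the pre-experiment metric.
rng = random.Random(0)
pre = [rng.gauss(100, 20) for _ in range(2000)]
post = [0.8 * x + rng.gauss(0, 10) for x in pre]

adjusted = cuped_adjust(post, pre)
raw_variance = statistics.variance(post)
adjusted_variance = statistics.variance(adjusted)
print(f"raw variance: {raw_variance:.1f}, CUPED variance: {adjusted_variance:.1f}")
```

The adjustment leaves the mean of the metric unchanged and only removes the part of the variance that the pre-experiment data explains, which is why the required sample size drops.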

General Management

Title Summary Year
The Feedback Equation A framework to help one successfully structure specific and actionable feedback. The equation is: Observation of a behavior + Impact of the behavior + Question or Request = Actionable, specific feedback that has a chance of landing. 2018

Job Search

Title Summary Year

MLOps

Title Summary Year

Organizations

Title Summary Year
IT Assets, Organizational Capabilities, and Firm Performance: How Resource Allocations and Organizational Differences Explain Performance Variation Some organizations prioritize innovation and IT assets but this doesn't translate to business value. What explains this difference? 2007

Software Engineering and App Development

Title Summary Year
The Twelve-Factor App The twelve-factor methodology can be applied to apps written in any programming language, and which use any combination of backing services (database, queue, memory cache, etc). Factors include: Codebase; Dependencies; Config; Backing services; Build, release, run; Processes; Port binding; Concurrency; Disposability; Dev/prod parity; Logs; and Admin processes. 2017
Evidence-based Software Engineering Entire book describing software engineering and ML principles with tons of analysis of public code. Lots of super interesting plots. 2020
Real-world Engineering Challenges #8: Breaking up a Monolith A deep dive into how Khan Academy took a 1 million-line Python monolith and split it into ~40 Go services in a project that lasted more than 3 years. Incredible story about how to structure and carry out a huge migration. 2023
Keep the monolith, but split the workloads Discusses pros and cons of monoliths and microservices, and a nice pattern for running a monolith better 2023
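
As a small illustration of the Config factor from the Twelve-Factor App entry above (store config in the environment rather than in code), here is a hedged sketch. The variable names `DATABASE_URL`, `CACHE_TTL_SECONDS`, and `DEBUG`, along with the `Settings` class and `load_settings` function, are hypothetical and not taken from the linked site.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    cache_ttl_seconds: int
    debug: bool


def load_settings(env=None):
    """Build Settings from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    return Settings(
        # Required: raises KeyError early if the variable is absent.
        database_url=env["DATABASE_URL"],
        # Optional, with defaults that suit local development.
        cache_ttl_seconds=int(env.get("CACHE_TTL_SECONDS", "300")),
        debug=env.get("DEBUG", "false").lower() in {"1", "true", "yes"},
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({"DATABASE_URL": "postgres://localhost/dev", "DEBUG": "1"})
```

The same code then runs unchanged in dev, staging, and prod; only the environment differs.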

Statistics

Title Summary Year
FOUNDED UPON AN ERROR Describes how the argument that Bayesian statistics didn't take off earlier in history due to computational limitations is bunk. Rather, it was because the leading statisticians of the time threw their weight behind frequentist stats. 2023

Tools

Title Summary Year
Manipulate big data with Arrow & DuckDB Gives a primer on DuckDB and Apache Arrow and how they can be used together to super quickly analyze big data using a personal machine. 2022
Setting up a new machine for data science A bunch of useful Python, Docker, Git, and terminal settings for doing ML Engineering work 2023

Visualization and plotting code

Title Summary Year
SciencePlots A collection of Matplotlib styles for plotting 2022
Aquarel A lightweight templating engine and wrapper around Matplotlib's rcParams to make styling plots simple. 2022
Randy Chase custom config A nice custom Matplotlib config with code 2023
TUEplots A package for figure sizes, font sizes, fonts, and more configurations at minimal overhead. 2023
matplotx More styles and useful extensions for Matplotlib 2022

TODO

Customer growth modeling:
