commit a813c1521a3e8fdee0b5ae1b72b6ae2621628fd7 Author: Maximilian Keßler Date: Wed Nov 22 14:21:09 2023 +0100 Initial commit: Draft of database schema diff --git a/install/database_schema.sql b/install/database_schema.sql new file mode 100644 index 0000000..78313ac --- /dev/null +++ b/install/database_schema.sql @@ -0,0 +1,312 @@ +/** +* Some general thoughts first regarding how I think we should design the database: +* +* There should be a semantic of what it means for our data to be 'consistent': +* If it is consistent, our program interacting with the database should be well-behaved, i.e. never crash or lead to weird behaviour. +* From any consistent state, our program should be able to bring the database up to the most recent state automatically, i.e. +* - Update games +* - Calculate ELO +* - Calculate statistics +* without us having to fix anything in the database. +* +* Conversely, our program should be designed in such a way that every database transaction that we do will leave the database in a consistent state. +* This ensures that (as long as there is no programming error), we can start/stop/run the program at any time, always leaving the database in a +* consistent state. +* In particular, in order to be able to have reasonably small transactions, the following all should be valid transactions: +* - Adding just the metadata (who, score, when, but not detailed actions or deck) of some game, without processing any of its statistics or elo changes +* - Adding detailed data of a game (actions, deck) +* - Calculating any subset of statistics for a game (BDR, moves, loss reasons (like: 'Discard crit', 'Bomb crit', 'Strikeout' or similar)) +* - Adding any set of players to the database +* - Processing a single game for ELO change +* +* Conversely, the database should be designed in such a way that all of the above updates lead to consistent states. In particular, I'd like it to be ok if +* - Games are stored in the database whose statistics or ELO have not been processed yet +* - Game metadata is stored in the database, but their actions/deck is not +* - New players are part of the database, but we did not check for games involving them +* +* Also, I think that there should be *no* state-change in any other files except the database when running the program: +* That is, for a full replica of the server state, the source code and the current database state should be enough. +* This means no json/csv files or whatever that we both write and read to. +* (Input config files as part of the repository are ouf course fine, as well as output files like log) +* +* One thing I felt like was a bit difficult to handle during season zero is people registering after they played some games. +* At least I think this caused you quite some effort, especially since it's then not so easy to get their games into the database. +* I recall that you just stored a single game id which represented the latest one that your program already processed. +* To account for late sign-ups, I suggest the following: +* - For each registered league user, store the latest game id that we processed (=downloaded, not necessarily calculated ELO or anything) separately. +* This ensures that for new users, we will fetch all their games on the first database run, whereas for already-registered ones, we do not. +* - When checking for new games, we can now send these ids to the hanab.live site (there is an API parameter to request only games with id >= x, I think you did not use it last time) +* +* Now there's a slight problem, since calculating ELO depends on the order of games. In case we have late sign-ups, I think it's not so nice if they retroactively +* affect ELO. In particular, this could lead to ELO changes in players X that did not even play a game with new player N, +* if N and Y played some game and after that, Y and X did, meaning that Y's elo when playing with X changes retroactively, thus also affecting X. +* +* Therefore, I suggest that additionally, each game that we download from hanab.live gets a 'league id', which is just a serial number and represents +* the ordering of games in which we process them. +* So in case of late sign-ups we would just process these games a bit later. In particular, in case of no late sign-ups, league ids and game ids from hanab.live +* should have the same ordering, but this might change when we have late sign-ups. +* I'd like to store this, because in case of bugs or anything, we would then be able to re-calculate all ELO deterministically with this data. +* Also, it's nice to be transparent about this, when we do have some game logs on the website or anything, we should order by this league id, since this is the +* order that we used for calculating the ELO. +* Let me know if you think this is too complicated, but to me personally, I'd quite like that, since it just means +* (together with the above mentioned solution of storing latest download ids on a per-user basis) that late sign-ups are handled well automatically. +* +*/ + + +/** +* To store the league participants, I suggest the following: +* - Represent each participant by internal user id +* - Store 'display name' (what shows up in the tables etc) for each participant. +* This will only be relevant for output, not any of the algorithms, so we could easily change this at any time. +* - Store the account names (from hanab.live) separately for each account, thereby allowing for multiple ones having the same user_id +* I'm not sure how much we need this, but I feel like it's not too hard to have this extra flexibility in our data structure, so just in case +* I would support this. +* To be honest, I'd also like to apply all of the code to some non-league related games where I'd need this :D. +*/ + +DROP TABLE IF EXISTS users; +CREATE TABLE users ( + id SERIAL PRIMARY KEY, + name TEXT NOT NULL +); + + +DROP TABLE IF EXISTS user_accounts; +CREATE TABLE user_accounts ( + /* The username from hanab.live. This will be used a) for associating games to one of our user_ids and b) show up in the player profile. */ + username TEXT NOT NULL, + /** + * Hanab.live stores a normalized username for each account to ensure that usernames are case-insensitive and don't have weird unicodes that resemble each other. + * It uses go-unidecode to generate these from the actual usernames, and I suggest we do the same. + * The corresponding python package is 'unidecode', so we should just use that here. + * This will then ensure that no two sign-ups correspond to the same hanab.live account. + */ + normalized_username TEXT NOT NULL UNIQUE + user_id INTEGER NOT NULL, + FOREIGN KEY (user_id) REFERENCES users (id) ON DELETE CASCADE +); + +/* +* The latest game id that we already downloaded for a specific user. Note that this can never decrease, since on hanab.live, game ids are only incrementing. +* In my opinion, it should be a consistent state to *not* have an entry in this table for a user. +* In that case, when querying, we would of course assume nothing and fetch all the data from hanab.live (from the start of the league) +* and place an entry here after that. +* Also I suggest that the latest_game_id actually corresponds to a game that we have stored in the database, +* this helps with further sanity checks or if we want to know if statistics are up to date. +*/ +DROP TABLE IF EXISTS downloads; +CREATE TABLE downloads ( + /** Notice this has to be a hanab.live username, not a user_id */ + username INTEGER NOT NULL, + latest_game_id INTEGER NOT NULL, + FOREIGN KEY (user_id) REFERENCES users (id) +); + +/** +* For completeness, I suggest we store the hanab.live variants here together with their names. +* So this will just be a static table that we populate once. +*/ +DROP TABLE IF EXISTS variants; +CREATE TABLE variants ( + id INTEGER PRIMARY KEY, + name TEXT NOT NULL + num_suits INTEGER NOT NULL, + clue_starved BOOLEAN NOT NULL, + max_score SMALLINT NOT NULL +); + +/** +* We should store data on all games locally for two reasons: +* - Only download detailed data for each game once (The actions taken and the actual deck) +* - We can evaluate statistics (also complicated ones) whenever we want, independent of the server +*/ +DROP TABLE IF EXISTS games; +CREATE TABLE games ( + /** Here, we should use the same ids as the game ids from hanab.live */ + id INTEGER PRIMARY KEY, + num_players SMALLINT NOT NULL, + /** This should be the same variant id as on hanab.live, even if we don't use all of them. */ + variant_id SMALLINT NOT NULL, + /** Same here: seed from hanab.live */ + seed TEXT NOT NULL, + score SMALLINT NOT NULL, + /** + * This is the league id mentioned above that will represent the ordering of games regarding being processed by ELO. + * Note that this means when fetching new data from hanab.live, we have to fetch *all* of it and insert the games sorted + * by their ids. It's not technically needed for the ELO system to function, but it's definitely desirable to process them in order, of course. + */ + league_id SERIAL NOT NULL UNIQUE, + FOREIGN KEY (variant_id) REFERENCES variants (id) +); + +CREATE TABLE game_participants ( + game_id INTEGER NOT NULL, + /** + * This is our internal user id for league accounts. In particular, if a league account is associated to multiple hanab.live accounts, + * we will lose the distinction between them when storing participants in this way (which I think is totally ok). + * Also, this makes insertion slightly tricky: You always have to look up these ids first before being able to insert a game. + * There could be one argument that we should store the hanab.live username here: This would enable us to list games (if we were to do such a stats page) + * with the hanab.live account names instead of the 'display names' that people used for league sign-up. + * Let me know what you think. + */ + user_id INTEGER NOT NULL, + /** + * Hanab.live stores this as well, I don't think that we need it currently, but there's no harm done in storing this, since it would enable us to know + * which player is which if we were to do per-player statistics also inside an individual game. + * For example, I could imagine stuff like: 'Clues given' or 'Cards played', which might actually vary across players. + */ + seat SMALLINT NOT NULL, + FOREIGN KEY (game_id) REFERENCES games (id), + FOREIGN KEY (user_id) REFERENCES users (id), + CONSTRAINT game_participants_unique UNIQUE (game_id, user_id) +); + +/** +* This is just exactly the same table as on hanab.live, since we might want to do statistics like BDR, we have to replicate this. +* Note that for now, we can just ignore this and code up the rest, but we will need this in the long run if we want to do BDR. +*/ +DROP TABLE IF EXISTS game_actions CASCADE; +CREATE TABLE game_actions ( + game_id INTEGER NOT NULL, + turn SMALLINT NOT NULL, + + /** + * - 0 - play + * - 1 - discard + * - 2 - color clue + * - 3 - rank clue + * - 4 - game over + */ + type SMALLINT NOT NULL, + + /** + * - If a play or a discard, corresponds to the order of the the card that was played/discarded. + * - If a clue, corresponds to the index of the player that received the clue. + * - If a game over, corresponds to the index of the player that caused the game to end or -1 if + * the game was terminated by the server. + */ + target SMALLINT NOT NULL, + + /** + * - If a play or discard, then 0 (as NULL). It uses less database space and reduces code + * complexity to use a value of 0 for NULL than to use a SQL NULL: + * https://dev.mysql.com/doc/refman/8.0/en/data-size.html + * - If a color clue, then 0 if red, 1 if yellow, etc. + * - If a rank clue, then 1 if 1, 2 if 2, etc. + * - If a game over, then the value corresponds to the "endCondition" values in "constants.go". + */ + value SMALLINT NOT NULL, + + FOREIGN KEY (game_id) REFERENCES games (id) ON DELETE CASCADE, + PRIMARY KEY (game_id, turn) +); + + +/** +* Hanab.live uses the go random number generator to deterministically compute a deck shuffle out of a seed, +* where the seed is just used to seed the RNG. +* Since we can't replicate this, we need to store the seeds explicitly, but of course it's enough to do this on a per-seed basis +* rather than a per-game basis. +* Note that for now, we can just not populate this table, since this would only be relevant for statistics anyway. +* TODO: I think it should be an invariant that whenever we have stored the actions of a game, we ensure that we also have stored the +* corresponding seed here (but not vice versa). +* Unfortunately, to my knowledge, it is not possible to encode this well in PostgreSQL itself, so we will need to make sure our program +* has this invariant. +*/ +DROP TABLE IF EXISTS seeds CASCADE; +CREATE TABLE seeds ( + seed TEXT NOT NULL, + /** The index of the card in the deck of the seed, so 0,1,..., 49 for NoVar */ + card_index SMALLINT NOT NULL, + suit_index SMALLINT NOT NULL, + rank SMALLINT NOT NULL, + CONSTRAINT cards_unique UNIQUE (seed, card_index) +) + + +/** +* I think the tables above now store everything that we need to know from hanab.live +* So the following should collect our statistics and ELO now. +* +* To store ELO, I suggest we use an append-only data structure, i.e. we just insert a new row whenever any ELO changes +* and keep the old ones. This as good for roll-back, easy to maintain, and also useful if we at some point want to model progression somwhere. +* Also, we need to separately store the initial ratings for the variants and users, in my opinion for two reasons: +* - Do not rely on some config file in the repository to be available to interpret the data here +* - Allow for separate base ratings on a per-user basis (this might be useful at some point). +*/ + +/* TABLES RELATED TO ELO */ + +/** +* Records the initial ratings of variants +*/ +CREATE TABLE variant_base_ratings ( + variant_id SMALLINT PRIMARY KEY, + rating REAL NOT NULL, +); + +/** +* Store a separate row for each elo update to one of the variants +*/ +CREATE TABLE variant_ratings ( + /** This should reference the game that triggered the elo update */ + league_id INTEGER PRIMARY KEY REFERENCES games, + + variant_id SMALLINT NOT NULL, + change REAL NOT NULL, + value_after REAL NOT NULL, + + FOREIGN KEY (variant_id) REFERENCES variants (id), +); + +CREATE TABLE user_base_ratings ( + user_id SMALLINT, + /** + * Since a user has different ratings now, this should represent which of the ratings we mean. + * For now, I suggest 0 = NoVar, 1 = ClueStarved. + * I'd use an integer here to be open for more elos per player in future seasons + */ + type SMALLINT NOT NULL, + + rating REAL NOT NULL, + + PRIMARY KEY (user_id, type), + FOREIGN KEY (user_id) REFERENCES users (id) +); + +CREATE TABLE user_ratings ( + /** This should reference the game that triggered the elo update */ + league_id INTEGER PRIMARY KEY REFERENCES games, + + user_id SMALLINT NOT NULL, + type SMALLINT NOT NULL, + + change REAL NOT NULL, + value_after REAL NOT NULL, + + FOREIGN KEY (user_id) REFERENCES users (id) +); + + +/* TABLES RELATED TO STATISTICS */ + +/** This is a rough outline, not really happy with it right now. */ +CREATE TABLE statistics ( + game_id INTEGER PRIMARY KEY, + /** I'd say all of the following can just be null in case we have not evaluated them yet. */ + bdr SMALLINT, + moves SMALLINT, + strikeout BOOLEAN +); + +/** +* TODO: I'm really unsure right now how to store user-related statistics, stuff like +* - Current streak +* - Average bdrs/moves/whatever +* - Cumulative stuff like 'number of wins/losses', 'number of strikeouts' etc. +* Computationally it would just be fine to re-evaluate them whenever needed, since it's only looking stuff up in the database + some linear time calculation. +* On the other hand, it would be sort of nice to have them here as well. +* In the latter case, I'd probably suggest having just one row for each user that we automatically update whenever we process a game of that user, +* together with an indication on how recent that entry is (so that on query, we always know if it's up to date). +*/