Add Video Question Answering model JSFusion #61
base: master
Conversation
(frames,), filename = tensors
resnet_output = self.pool(frames)
resnet_output = resnet_output.view(resnet_output.shape[0], resnet_output.shape[1])
# TODO handle the case when resnet_output.shape[0] < num_frames (fill zeros)
This comment points to the branch for computing the mask in https://github.com/jsjason/jsfusion-pytorch/blob/master/infer.py#L177, which I think can be simplified.
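For the TODO in the diff above, a minimal zero-fill sketch could look like the following (the helper name `pad_to_num_frames` is hypothetical; it only assumes `resnet_output` is the 2-D tensor produced by the `view` call above):

```python
import torch

def pad_to_num_frames(resnet_output, num_frames):
    # Hypothetical helper for the TODO: when the clip yields fewer rows
    # than num_frames, append zero rows so downstream code always sees
    # a (num_frames, feature_dim) tensor.
    actual = resnet_output.shape[0]
    if actual < num_frames:
        pad = torch.zeros(num_frames - actual, resnet_output.shape[1],
                          dtype=resnet_output.dtype,
                          device=resnet_output.device)
        resnet_output = torch.cat([resnet_output, pad], dim=0)
    return resnet_output
```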
models/jsfusion/model.py (outdated)
# frames: (40, 3, 224, 224)

filename = os.path.basename(file_path)
out = (frames, filename)
I think this should be `((frames,), filename)`. The first item needs to be a tuple of tensors, not a single tensor.
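A one-line sketch of the suggested fix, using the same variables as the diff above:

```python
out = ((frames,), filename)  # first element is a tuple of tensors, per the comment above
```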
conv3 = self.conv3(bn2)
relu3 = self.relu3(conv3)
bn3 = self.bn3(relu3)
I've seen a lot of open PyTorch code that does something like the following:

x = self.conv3(x)
x = self.relu3(x)
x = self.bn3(x)

I'm not sure what the real intention of this is, but I'm assuming it's for reducing the memory footprint? If our implementation takes up a lot of memory, then I guess we could change our code to follow this pattern, too. I'm not sure how this affects inference time.
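If the motivation is indeed memory, one common way to express the same three ops is an `nn.Sequential` block (a sketch only; the channel sizes below are placeholder assumptions, not the PR's actual layer definitions):

```python
import torch
import torch.nn as nn

# Sketch: conv3 -> relu3 -> bn3 as one block, so intermediates are not
# kept under separate named locals. The 256 -> 512 channel sizes are
# placeholders for illustration.
block3 = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(512),
)

x = torch.randn(1, 256, 28, 28)   # dummy input for a quick shape check
with torch.no_grad():             # inference: intermediates can be freed eagerly
    y = block3(x)
print(y.shape)  # torch.Size([1, 512, 28, 28])
```

Rebinding a single `x`, as in the snippet above, lets each intermediate be freed as soon as the next op returns; this mainly matters under `torch.no_grad()`, since during training autograd keeps the intermediates alive for the backward pass anyway.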
cut_mask_indices = [i for i in range(cut_mask.shape[1]) if i % 2 == 1 and i < cut_mask.shape[1] - 1]
cut_mask_indices = torch.tensor(cut_mask_indices)
cut_mask_indices = cut_mask_indices.to(device=self.device, non_blocking=True)
If this part takes long, we could perhaps pre-allocate a tensor for every possible mask length and simply fetch the correct `cut_mask_indices` tensor at runtime. The default max length of 40 isn't THAT long, so this shouldn't take up much memory.
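A sketch of that pre-allocation idea, assuming a max mask length of 40 as mentioned above (the class name and `get` method are made up for illustration):

```python
import torch

class MaskIndexCache:
    """Hypothetical cache: one cut_mask_indices tensor per possible mask length."""

    def __init__(self, device, max_len=40):
        self.cache = []
        for length in range(max_len + 1):
            idx = [i for i in range(length) if i % 2 == 1 and i < length - 1]
            self.cache.append(torch.tensor(idx, dtype=torch.long, device=device))

    def get(self, length):
        # Fetch the precomputed indices instead of rebuilding the list and
        # copying it to the device on every call.
        return self.cache[length]
```

At runtime the model would then call something like `cache.get(cut_mask.shape[1])` instead of reconstructing the index list and calling `.to(device)` each time.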
# sentences: np.ndarray shape=(Bx5xL) dtype=int32
# sentence_masks: np.ndarray shape=(Bx5xL) dtype=float32
sentences, sentence_masks = self.parse_sentences(self.word2idx, mc_path, self.num_frames)
If this segment turns out to be a bottleneck, I guess we could construct a very complex pipeline in which two steps (e.g., `ResNetFeatureExtractor` and `SentenceParser`) run concurrently and their results are aggregated in the final step, `MCModel`... maybe next year?
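For the record, a rough sketch of that two-branch pipeline (the `ThreadPoolExecutor` wiring is my assumption, not a design in this PR; the three step names come from the comment above):

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(frames, mc_path, feature_extractor, sentence_parser, mc_model):
    # Run the two independent steps concurrently, then aggregate their
    # results in the final step (MCModel in the comment above).
    with ThreadPoolExecutor(max_workers=2) as pool:
        feat_future = pool.submit(feature_extractor, frames)
        sent_future = pool.submit(sentence_parser, mc_path)
        features = feat_future.result()
        sentences, sentence_masks = sent_future.result()
    return mc_model(features, sentences, sentence_masks)
```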
This PR adds a video QA model named JSFusion (Yu et al. ECCV 2018). Most of the code has been taken from @jsjason's implementation (https://github.com/jsjason/jsfusion-pytorch).
Changes specifically made for our system:
- `torch.nn.Module.register_buffer` is used if unavailable.

How to run
Although the code still has a lot of room for optimization, I'd like to parallelize the review process by asking questions from my side in the meantime.